Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add regex replace functionality to transformation filter extractors #301

Merged
merged 15 commits into from
Feb 1, 2024

Conversation

ben-taussig-solo
Copy link
Contributor

@ben-taussig-solo ben-taussig-solo commented Jan 25, 2024

Description

Overview

  • The changes in this PR extend the transformation filter extractor functionality to support regex replace/replace all operations on extracted values.
  • As discussed in Transform to replace a portion of a header value via regex gloo#8706 , this functionality is desirable for use cases where users want to avoid defining multiple resources to handle similarly formatted inputs.
  • Previously, I had posted a design plan for this feature. The implementation details of that plan have shifted since then, but the timeline/context/overview of the feature is still accurate.

API Changes

  • Two new fields have been added to the Extraction message in the transformation filter proto (see here):
    • replacement_text (google.protobuf.StringValue): If set, the portion of the input string that matches the regex/subgroup will be replaced with this text.
      - replace_all (bool): If set to true, all matches against the regex will be replaced with the replacement text. 
      - Note: When this field is set, the reqiurement that the regex field must match the entire input string is relaxed. This means that the regex can match a substring of the input string, and the replacement will be applied to all matches.
      - Note: This field is mutually exclusive with the existing subgroup field. If both are set, the configuration will be rejected by envoy-gloo.
    • mode (enum): Used to explicitly specify the operation mode (EXTRACT, SINGLE_REPLACE,REPLACE_ALL). This field replaces the replace_all boolean flag to provide clearer semantics and allow for future extensibility.

Internal Changes

  • The Extractor class in the transformation filter has been given 5 new fields:
    • has_replacement_text_ (const bool)
      • Set at construction time based on whether the replacement_text field is set in the Extraction message.
      • Note that an empty string is considered a valid value for replacement_text, so this field is used to differentiate between the case where the user has explicitly set replacement_text to an empty string and the case where the user has not set replacement_text at all.
    • replacement_text_ (const std::string)
      • Set at construction time based on the value of the replacement_text field in the Extraction message.
      • If the replacement_text field is not set, this field will be set to an empty string.
        - replace_all_ (const bool)
        - Set at construction time based on the value of the replace_all field in the Extraction message.
        - If the replace_all field is not set, this field will be set to false.
    • mode (const enum)
      • Set to one of three possible modes (EXTRACT, SINGLE_REPLACE,REPLACE_ALL) (defaults to EXTRACT)
      • Based on the value of the mode, changes the method (replaceValue, replaceIndividualValue, replaceAllValues) that is invoked at request time
      • Validation checks have been added to ensure that the configuration is valid for the selected mode (e.g., subgroup must not be set in REPLACE_ALL mode).
    • extraction_func_ (function pointer)
      • Set at construction time based on the value of the has_replacement_text_ field.
        - If has_replacement_text_ is true, this field will be set to extractor::replaceValue. If has_replacement_text_ is false, this field will be set to extractor::extractValue, the previous default behavior.
      • Set to the appropriate method to call (extractValue, replaceIndividualValue, replaceAllValues) based on the mode_
    • replaced_value_ (mutable std::string)
      • Potentially set during replaceValue if a match is found.
      • This field stores the result of the replacement operation, which will be used as the extracted value.
  • The Extractor class in the transformation filter has been given 3 new class methods:
    • replaceValue
      • Depending on the value of the replace_all_ field, this method will either:
        • Replace a subgroup matched by the specified regex with the provided replacement text, by calling replaceIndividualValue, or
        • Replace all matches of the regex with the replacement text, by calling replaceAllValues.
      • In each of replaceIndividualValue and replaceAllValues, the following assumptions are made:
        • If there is no match, the method will return an empty string.
        • If there is a match, the method will return the result of the replacement operation.
        • If there is a match, the replaced_value_ field in the Extractor will be set to the result of the replacement operation.
    • replaceIndividualValue
      • This method will replace the subgroup matched by the specified regex with the provided replacement text.
    • replaceAllValues
      • This method will replace all matches of the regex with the provided replacement text.

Testing

  • This PR introduces a number of new unit tests to ensure that the new functionality works as expected and does not differ too far from the existing extractor functionality.
  • I am going to summarize some of the new test cases below, but I encourage you to look at the tests themselves for more details.
Summary of new test cases
  1. ExtractAndReplaceValueFromBodySubgroup

    • Extracts and replaces a subgroup within the body. Essentially a happy path test.
    • Example extractor configuration:
      extractor:
        body: {}
        regex: ".*(body)"
        subgroup: 1
        replacement_text: "BAZ"
    • Input: "not json body"
    • Expected output: "not json BAZ"
  2. ExtractAndReplaceValueFromFullBody

    • Replaces the entire body content with the replacement text.
    • Contrast this with the edge case in test 3.
    • Example extractor configuration:
      extractor:
        body: {}
        regex: ".*"
        subgroup: 0
        replacement_text: "BAZ"
    • Input: "not json body"
    • Expected output: "BAZ"
  3. ExtractAndReplaceAllFromFullBody

  4. AttemptReplaceFromPartialMatch

    • Attempts to replace using a regex which only partially matches the input, verifying no replacement occurs without replace_all.
    • Example extractor configuration:
      extractor:
        body: {}
        regex: "body"
        subgroup: 0
        replacement_text: "BAZ"
    • Input: "not json body"
    • Expected output: "" (no replacement due to partial match)
  5. AttemptReplaceFromPartialMatchNonNilSubgroup

    • Similar to the previous test, but with a defined subgroup. No replacement occurs due to partial match.
    • Example extractor configuration:
      extractor:
        body: {}
        regex: "(body)"
        subgroup: 1
        replacement_text: "BAZ"
    • Input: "not json body"
    • Expected output: ""
  6. ReplaceFromFullLiteralMatch

    • Replaces the entire body content when it fully matches the regex.
    • Example extractor configuration:
      extractor:
        body: {}
        regex: "not json body"
        subgroup: 0
        replacement_text: "BAZ"
    • Input: "not json body"
    • Expected output: "BAZ"
  7. AttemptToReplaceFromInvalidSubgroup

    • Attempts to replace using an invalid subgroup, expecting an error.
    • Example extractor configuration:
      extractor:
        body: {}
        regex: ".*"
        subgroup: 1
        replacement_text: "BAZ"
    • Input: "not json body"
    • Expected output: Exception thrown with message about invalid subgroup
  8. NestedSubgroups

    • Extracts and replaces a value from a nested subgroup within the body.
    • Example extractor configuration:
      extractor:
        body: {}
        regex: ".*(not (json) body)"
        subgroup: 2
        replacement_text: "BAZ"
    • Input: "not json body"
    • Expected output: "not BAZ body"
  9. SubgroupUnset

    • Replaces the entire matching content when subgroup is unset.
    • Example extractor configuration:
      extractor:
        body: {}
        regex: ".*(not (json) body)"
        replacement_text: "BAZ"
    • Input: "not json body"
    • Expected output: "BAZ"
  10. NoMatch

    • Tests behavior when the regex is valid but does not match any part of the input.
    • Example extractor configuration:
      extractor:
        body: {}
        regex: "this will not match the input string"
        subgroup: 0
        replacement_text: "BAZ"
    • Input: "not json body"
    • Expected output: "" (no replacement due to no match)
  11. NilReplace

    • Removes the matched content by replacing it with an empty string.
    • Example extractor configuration:
      extractor:
        body: {}
        regex: ".*(body)"
        subgroup: 1
        replacement_text: ""
    • Input: "not json body"
    • Expected output: "not json "
  12. NilReplaceWithSubgroupUnset

    • Attempts to replace entire match with an empty string, with subgroup unset
    • Example extractor configuration:
      extractor:
        body: {}
        regex: ".*(body)"
        replacement_text: ""
    • Input: "not json body"
    • Expected output: ""
  13. HeaderHappyPath

    • Replaces a value in the headers.
    • Example extractor configuration:
      extractor:
        header: "foo"
        regex: "bar"
        subgroup: 0
        replacement_text: "BAZ"
    • Input: Header foo with value "bar"
    • Expected output: "BAZ"
  14. ReplaceAllWithReplacementTextUnset

    • Attempts to replace all without specifying replacement text, expecting an exception.
    • Example extractor configuration:
      extractor:
        body: {}
        regex: "bar"
        subgroup: 0
        replace_all: true
    • Input: "bar bar bar"
    • Expected output: Exception thrown due to unset replacement text
  15. ReplaceAllWithSubgroupSet

    • Attempts replace_all with a subgroup set, expecting an exception due to configuration conflict.
    • Example extractor configuration:
      extractor:
        body: {}
        regex: ".*(bar).*"
        subgroup: 1
        replace_all: true
        replacement_text: "BAZ"
    • Input: "bar bar bar"
    • Expected output: Exception thrown due to conflicting configuration
  16. ReplaceAllHappyPath

    • Replaces all occurrences of a pattern in the body.
    • Example extractor configuration:
      extractor:
        body: {}
        regex: "bar"
        subgroup: 0
        replace_all: true
        replacement_text: "BAZ"
    • Input: "bar bar bar"
    • Expected output: "BAZ BAZ BAZ"
  17. IndividualReplaceIdentity

    • Tests replacement with the same text that is matched, ensuring no changes occur.
    • Example extractor configuration:
      extractor:
        body: {}
        regex: ".*(bar).*"
        subgroup: 1
        replacement_text: "bar"
    • Input: "bar bar bar"
    • Expected output: "bar bar bar"
  18. ReplaceAllIdentity

    • Similar to the previous test, but with replace_all enabled.
    • Example extractor configuration:
      extractor:
        body: {}
        regex: "bar"
        subgroup: 0
        replace_all: true
        replacement_text: "bar"
    • Input: "bar bar bar"
    • Expected output: "bar bar bar"
  19. ReplaceAllNoMatch

    • Tests replace_all with a regex that does not match the input.
    • Example extractor configuration:
      extractor:
        body: {}
        regex: "this will not match the input string"
        subgroup: 0
        replace_all: true
        replacement_text: "BAZ"
    • Input: "not json body"
    • Expected output: "" (no replacement due to no match)

Context

Work Remaining (in subsequent PRs)

  • Update Gloo Edge's API to allow users to configure the new functionality.
  • Update Gloo Edge translation to support the new functionality.
    • Ensure that configuration issues are handled gracefully.
  • Introduce e2e testing in Gloo Edge to ensure that the new functionality works as expected.
  • Update Gloo Edge documentation to include the new functionality.
    • Update existing documentation around extractors to appropriately describe the new functionality.

// confusing edge case in std::regex_replace when the regex is .*
// apparently, this regex matches the whole input string __AND__ the
// line ending, so the replacement is applied twice
EXPECT_EQ("BAZBAZ", res);
Copy link
Contributor Author

@ben-taussig-solo ben-taussig-solo Jan 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to reviewers: this edge case frustrates me, trying to figure out how to handle it at the moment. Any input would be appreciated

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does changing line 114 to extractor.set_regex(".*$") change anything? or maybe "^.*$"?

@ben-taussig-solo ben-taussig-solo changed the title [WIP] Add regex replace functionality to Transformation Filter Extractors Add regex replace functionality to transformation filter extractors Jan 25, 2024
@solo-changelog-bot
Copy link

Issues linked to changelog:
solo-io/gloo#8706

update transformation_filter proto
}

// If a match was found, replace all instances of the regex in the input value with the replacement_text_ value
replaced = std::regex_replace(input, extract_regex_, replacement_text_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar performance concerns here since we have had issues with std::regex in the past

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some notes/questions:

  • Extractor regexes are compiled at config time, which is good (or really, the alternative is unacceptable)
  • The replaceAll path (i.e. the one we're discussing here) is not maximally efficient -- it does an efficient check to identify whether there are any matches, and then does the call you've highlighted here, which will replace all matches.
    • I considered extending the efficient check to identify matches to identify all matches, after which we could manually replace matches with the replacement_text_, but I chose not to as it seemed like it would be difficult to implement/read/maintain, and would potentially be prone to bugs and edge cases. But I could go that route if the performance of regex_replace is problematically bad. That being said I think I would prefer to get some performance metrics before doing so
  • Can you link any issues or PRs you can recall where there's discussion about std::regex performance issues?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you link any issues or PRs you can recall where there's discussion about std::regex performance issues?

Here's an issue in DLP. Similar situation - regexes are compiled at the time of configuration, but are still extremely slow. There are a few benchmarks linked in the issue that show how... hopeless std::regex can be.

Happy to leave this as is for now, but if this issue does bubble up, there's a bit of additional and (hopefully) helpful information on that thread

@ben-taussig-solo
Copy link
Contributor Author

Note to reviewers: I've put up a draft PR in Gloo OSS with changes to support this functionality: solo-io/gloo#9124

Copy link
Contributor

@jbohanon jbohanon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly nits; looking really good

Copy link
Contributor

@ashishb-solo ashishb-solo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good - only one (potentially) major comment that i found (replaced_value_ seems unnecessary), but looks good to me

// confusing edge case in std::regex_replace when the regex is .*
// apparently, this regex matches the whole input string __AND__ the
// line ending, so the replacement is applied twice
EXPECT_EQ("BAZBAZ", res);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does changing line 114 to extractor.set_regex(".*$") change anything? or maybe "^.*$"?

@ben-taussig-solo
Copy link
Contributor Author

ben-taussig-solo commented Jan 31, 2024

does changing line 114 to extractor.set_regex(".$") change anything? or maybe "^.$"?

Yes -- either of those regexes results in normal/expected. The issue only exists when the regex is set to .*.

Note that this is caused by std::regex_replace, not by some quirk of our implementation. For a while I was thinking about introducing a check against this edge case to prevent this from happening, but recently I've been leaning more towards the side of leaving this in place, as I don't think it makes much sense (from a user's perspective) to match an entire input when attempting to replace multiple values in the first place

extraction_func_ = std::bind(&Extractor::replaceAllValues, this, _1, _2);
break;
default:
throw EnvoyException("Unknown mode");
Copy link
Contributor

@ashishb-solo ashishb-solo Jan 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is probably fine, but you could also consider using PANIC_DUE_TO_CORRUPT_ENUM here disregard, i think this macro is intended for other uses

Copy link
Contributor

@jbohanon jbohanon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@soloio-bulldozer soloio-bulldozer bot merged commit bfdcd3e into main Feb 1, 2024
3 checks passed
@soloio-bulldozer soloio-bulldozer bot deleted the extractor_regex_replace branch February 1, 2024 14:51
ben-taussig-solo added a commit that referenced this pull request Feb 2, 2024
soloio-bulldozer bot pushed a commit that referenced this pull request Feb 2, 2024
)

* Revert "Add regex replace functionality to transformation filter extractors (#301)"

This reverts commit bfdcd3e.

* add changelog entry

* restore old changelog entry
ben-taussig-solo added a commit that referenced this pull request Feb 26, 2024
…301)

* initial extractor implementation of replace functionality

* minor changes, testing against mergeExtractorsToBody and extraction callback

* add changelog entry

* update API to use new mode selector

update transformation_filter proto

* minor updates to comments in transformation_filter.proto

* remove existing references to no-longer-existing replace_all setting

* update replacement_text_ to a std::optional<std::string>

* remove duplicate mode enum

* update comment indicating that subgroup should never exceed regex_result size

* add AttemptReplaceFromNoMatchNonNilSubgroup test

* prevent string reallocation

* remove unnecessary if block + variable in replaceAllValues

* clean up new tests

* inline replacement_text in inja_transformer_test.cc

* more test cleanup
soloio-bulldozer bot pushed a commit that referenced this pull request Mar 8, 2024
…Revised] (#309)

* Add regex replace functionality to transformation filter extractors (#301)

* initial extractor implementation of replace functionality

* minor changes, testing against mergeExtractorsToBody and extraction callback

* add changelog entry

* update API to use new mode selector

update transformation_filter proto

* minor updates to comments in transformation_filter.proto

* remove existing references to no-longer-existing replace_all setting

* update replacement_text_ to a std::optional<std::string>

* remove duplicate mode enum

* update comment indicating that subgroup should never exceed regex_result size

* add AttemptReplaceFromNoMatchNonNilSubgroup test

* prevent string reallocation

* remove unnecessary if block + variable in replaceAllValues

* clean up new tests

* inline replacement_text in inja_transformer_test.cc

* more test cleanup

* update function signatures, remove replaced_value_

* support dynamic metadata as extractor input

* update changelog location

* add API changes to go with 3175ca9

* revert support for dynamic metadata as an extractor input 3175ca9 and e2668be

* refactor calls to extract/replace

* rename replace to extractDestructive, add breaks to switch statement

* update data types to match updated function signatures in inja_transformer_test.cc

* respond to review comments

* update changelog location

* update changelog location

* separate destructive extractors and non-destructive extractors

* fix match_not_null edge case

* update inline documentation for new proto field

* add test demonstrating use of format specifiers

* update REPLACE_ALL mode to return input on no match

* return input on no match in single replace case
ben-taussig-solo added a commit that referenced this pull request Mar 11, 2024
…Revised] (#309)

* Add regex replace functionality to transformation filter extractors (#301)

* initial extractor implementation of replace functionality

* minor changes, testing against mergeExtractorsToBody and extraction callback

* add changelog entry

* update API to use new mode selector

update transformation_filter proto

* minor updates to comments in transformation_filter.proto

* remove existing references to no-longer-existing replace_all setting

* update replacement_text_ to a std::optional<std::string>

* remove duplicate mode enum

* update comment indicating that subgroup should never exceed regex_result size

* add AttemptReplaceFromNoMatchNonNilSubgroup test

* prevent string reallocation

* remove unnecessary if block + variable in replaceAllValues

* clean up new tests

* inline replacement_text in inja_transformer_test.cc

* more test cleanup

* update function signatures, remove replaced_value_

* support dynamic metadata as extractor input

* update changelog location

* add API changes to go with 3175ca9

* revert support for dynamic metadata as an extractor input 3175ca9 and e2668be

* refactor calls to extract/replace

* rename replace to extractDestructive, add breaks to switch statement

* update data types to match updated function signatures in inja_transformer_test.cc

* respond to review comments

* update changelog location

* update changelog location

* separate destructive extractors and non-destructive extractors

* fix match_not_null edge case

* update inline documentation for new proto field

* add test demonstrating use of format specifiers

* update REPLACE_ALL mode to return input on no match

* return input on no match in single replace case
ben-taussig-solo added a commit that referenced this pull request Mar 11, 2024
…Revised] (#309)

* Add regex replace functionality to transformation filter extractors (#301)

* initial extractor implementation of replace functionality

* minor changes, testing against mergeExtractorsToBody and extraction callback

* add changelog entry

* update API to use new mode selector

update transformation_filter proto

* minor updates to comments in transformation_filter.proto

* remove existing references to no-longer-existing replace_all setting

* update replacement_text_ to a std::optional<std::string>

* remove duplicate mode enum

* update comment indicating that subgroup should never exceed regex_result size

* add AttemptReplaceFromNoMatchNonNilSubgroup test

* prevent string reallocation

* remove unnecessary if block + variable in replaceAllValues

* clean up new tests

* inline replacement_text in inja_transformer_test.cc

* more test cleanup

* update function signatures, remove replaced_value_

* support dynamic metadata as extractor input

* update changelog location

* add API changes to go with 3175ca9

* revert support for dynamic metadata as an extractor input 3175ca9 and e2668be

* refactor calls to extract/replace

* rename replace to extractDestructive, add breaks to switch statement

* update data types to match updated function signatures in inja_transformer_test.cc

* respond to review comments

* update changelog location

* update changelog location

* separate destructive extractors and non-destructive extractors

* fix match_not_null edge case

* update inline documentation for new proto field

* add test demonstrating use of format specifiers

* update REPLACE_ALL mode to return input on no match

* return input on no match in single replace case
soloio-bulldozer bot pushed a commit that referenced this pull request Mar 13, 2024
* Add regex replace functionality to transformation filter extractors [Revised] (#309)

* Add regex replace functionality to transformation filter extractors (#301)

* initial extractor implementation of replace functionality

* minor changes, testing against mergeExtractorsToBody and extraction callback

* add changelog entry

* update API to use new mode selector

update transformation_filter proto

* minor updates to comments in transformation_filter.proto

* remove existing references to no-longer-existing replace_all setting

* update replacement_text_ to a std::optional<std::string>

* remove duplicate mode enum

* update comment indicating that subgroup should never exceed regex_result size

* add AttemptReplaceFromNoMatchNonNilSubgroup test

* prevent string reallocation

* remove unnecessary if block + variable in replaceAllValues

* clean up new tests

* inline replacement_text in inja_transformer_test.cc

* more test cleanup

* update function signatures, remove replaced_value_

* support dynamic metadata as extractor input

* update changelog location

* add API changes to go with 3175ca9

* revert support for dynamic metadata as an extractor input 3175ca9 and e2668be

* refactor calls to extract/replace

* rename replace to extractDestructive, add breaks to switch statement

* update data types to match updated function signatures in inja_transformer_test.cc

* respond to review comments

* update changelog location

* update changelog location

* separate destructive extractors and non-destructive extractors

* fix match_not_null edge case

* update inline documentation for new proto field

* add test demonstrating use of format specifiers

* update REPLACE_ALL mode to return input on no match

* return input on no match in single replace case

* update changelog location
soloio-bulldozer bot pushed a commit that referenced this pull request Mar 13, 2024
* Add regex replace functionality to transformation filter extractors [Revised] (#309)

* Add regex replace functionality to transformation filter extractors (#301)

* initial extractor implementation of replace functionality

* minor changes, testing against mergeExtractorsToBody and extraction callback

* add changelog entry

* update API to use new mode selector

update transformation_filter proto

* minor updates to comments in transformation_filter.proto

* remove existing references to no-longer-existing replace_all setting

* update replacement_text_ to a std::optional<std::string>

* remove duplicate mode enum

* update comment indicating that subgroup should never exceed regex_result size

* add AttemptReplaceFromNoMatchNonNilSubgroup test

* prevent string reallocation

* remove unnecessary if block + variable in replaceAllValues

* clean up new tests

* inline replacement_text in inja_transformer_test.cc

* more test cleanup

* update function signatures, remove replaced_value_

* support dynamic metadata as extractor input

* update changelog location

* add API changes to go with 3175ca9

* revert support for dynamic metadata as an extractor input 3175ca9 and e2668be

* refactor calls to extract/replace

* rename replace to extractDestructive, add breaks to switch statement

* update data types to match updated function signatures in inja_transformer_test.cc

* respond to review comments

* update changelog location

* update changelog location

* separate destructive extractors and non-destructive extractors

* fix match_not_null edge case

* update inline documentation for new proto field

* add test demonstrating use of format specifiers

* update REPLACE_ALL mode to return input on no match

* return input on no match in single replace case

* update changelog

* empty commit to kick changelog bot after switching target branch
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants