Simplify Transpilation of $ with Extended Line Separator Support in cuDF Regex #11663

Draft: wants to merge 4 commits into base: branch-24.12

SurajAralihalli (Collaborator) commented Oct 25, 2024

Resolves #11554

In cuDF, regex support for line terminators was expanded from NEW_LINE (\n) alone to the following set:

  • NEXT_LINE (\u0085)
  • LINE_SEPARATOR (\u2028)
  • PARAGRAPH_SEPARATOR (\u2029)
  • CARRIAGE_RETURN (\r)
  • NEW_LINE (\n)
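
For reference, these are the same single characters that java.util.regex treats as line terminators by default, which is presumably the CPU behavior the transpiled pattern has to match. A minimal sketch, using only the JDK and no cuDF calls, that checks Java's non-MULTILINE `$` against each of them (the class and method names are illustrative and not part of this PR):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class LineTerminatorCheck {
    // The five characters listed above, keyed by the names used in this description.
    static final Map<String, String> TERMINATORS = new LinkedHashMap<>();
    static {
        TERMINATORS.put("NEXT_LINE", "\u0085");
        TERMINATORS.put("LINE_SEPARATOR", "\u2028");
        TERMINATORS.put("PARAGRAPH_SEPARATOR", "\u2029");
        TERMINATORS.put("CARRIAGE_RETURN", "\r");
        TERMINATORS.put("NEW_LINE", "\n");
    }

    // True if "a$" (default flags, no MULTILINE) finds a match in "a" + suffix,
    // i.e. Java lets `$` match just before the given suffix at end of input.
    static boolean dollarMatchesBefore(String suffix) {
        return Pattern.compile("a$").matcher("a" + suffix).find();
    }

    public static void main(String[] args) {
        TERMINATORS.forEach((name, ch) ->
            System.out.println(name + " -> " + dollarMatchesBefore(ch)));
        // The two-character sequence CR LF also counts as one final terminator.
        System.out.println("CRLF -> " + dollarMatchesBefore("\r\n"));
        // Two separate terminators do not: `$` only skips a single final one.
        System.out.println("LF LF -> " + dollarMatchesBefore("\n\n"));
    }
}
```

The CR LF case is the one two-character terminator, which is why it still needs explicit handling in the transpiled pattern.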

PR #17139 introduced this change to cuDF JNI as RegexFlag::EXT_LINE. This PR simplifies the transpilation of $ by replacing the pattern (?:\r|\u0085|\u2028|\u2029|\r\n)?$ with the simpler (?:\r\n)?$, and updates every function where this transpilation occurs to use RegexFlag::EXT_LINE.
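
To make the before/after concrete, here is a string-level sketch of the rewrite. It is a deliberate simplification: the helper `transpileTrailingDollar` is hypothetical, it only handles a single unescaped `$` at the very end of the pattern, and it ignores cases such as `\Q...\E` quoting. It is not the transpiler code changed in this PR.

```java
class DollarTranspileSketch {
    // Replacement quoted in this description for the pre-EXT_LINE behavior.
    static final String OLD_SUFFIX = "(?:\\r|\\u0085|\\u2028|\\u2029|\\r\\n)?$";
    // Replacement used once the regex is compiled with RegexFlag::EXT_LINE.
    static final String NEW_SUFFIX = "(?:\\r\\n)?$";

    // Hypothetical helper: rewrites one trailing, unescaped `$`.
    // `extLine` stands in for "the caller will pass RegexFlag::EXT_LINE".
    static String transpileTrailingDollar(String pattern, boolean extLine) {
        if (!endsWithUnescapedDollar(pattern)) {
            return pattern; // nothing to rewrite
        }
        String prefix = pattern.substring(0, pattern.length() - 1);
        return prefix + (extLine ? NEW_SUFFIX : OLD_SUFFIX);
    }

    // A trailing `$` is an anchor only if an even number of backslashes precedes it.
    static boolean endsWithUnescapedDollar(String pattern) {
        if (!pattern.endsWith("$")) {
            return false;
        }
        int backslashes = 0;
        for (int i = pattern.length() - 2; i >= 0 && pattern.charAt(i) == '\\'; i--) {
            backslashes++;
        }
        return backslashes % 2 == 0;
    }

    public static void main(String[] args) {
        System.out.println(transpileTrailingDollar("a$", false));
        System.out.println(transpileTrailingDollar("a$", true));
    }
}
```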

This PR also drops support for $\z because cuDF does not support \z. One alternative would be to transpile $\z to $(?![\r\n\u0085\u2028\u2029]), but cuDF does not support negative lookahead either.
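
As a sanity check on that alternative, the sketch below compares the two forms on the CPU with plain java.util.regex. It only shows that they agree on a handful of sample inputs in Java, and says nothing about cuDF behavior. The class name and the sample inputs are illustrative.

```java
import java.util.List;
import java.util.regex.Pattern;

class DollarBackslashZCheck {
    // `$` followed by `\z`: only the absolute end of input is accepted.
    static final Pattern WITH_Z = Pattern.compile("a$\\z");
    // The lookahead form mentioned above: `$` not followed by any line terminator.
    static final Pattern WITH_LOOKAHEAD =
        Pattern.compile("a$(?![\\r\\n\\u0085\\u2028\\u2029])");

    static final List<String> SAMPLES =
        List.of("a", "ba", "a\n", "a\r", "a\r\n", "a\u0085", "a\u2028", "a\u2029", "a\n\n");

    // True if both patterns give the same find() result for the input.
    static boolean agree(String input) {
        return WITH_Z.matcher(input).find() == WITH_LOOKAHEAD.matcher(input).find();
    }

    public static void main(String[] args) {
        for (String s : SAMPLES) {
            System.out.println(s.length() + " chars: \\z form = "
                + WITH_Z.matcher(s).find() + ", agree = " + agree(s));
        }
    }
}
```

In Java, both forms reject an input that ends in a line terminator, which a bare `$` would accept.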

Signed-off-by: Suraj Aralihalli <[email protected]>

Successfully merging this pull request may close these issues.

[FEA] Reimplement $ transpilation using cuDF new line terminator support