[BUG] Parse regular expressions using JDK to make error behavior more consistent between CPU and GPU #11651

Open
NVnavkumar opened this issue Oct 23, 2024 · 0 comments
Labels
? - Needs Triage Need team to review and classify bug Something isn't working

Describe the bug
Some regular expression patterns are invalid in Java: they throw an exception on the CPU but run without any exception on the GPU. The cause is inconsistency between the regexp parsers on the two sides (in many cases, Java is excessively strict).

Steps/Code to reproduce bug

PySpark reproduction:

df = spark.createDataFrame(spark.sparkContext.parallelize([["aaaa"]]), "a string")
df.selectExpr("regexp_replace(a, 'a{', 'bb') as result").show()

On the CPU, this throws a java.util.regex.PatternSyntaxException. On the GPU, it runs without an exception.

Suggested fix

I think we should run Pattern.compile(...) on every regular expression before it reaches the transpiler (and before it reaches the optimized versions as well) so that CPU and GPU behave consistently. That way the same exception is thrown when the SQL is evaluated.
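A rough sketch of what such a pre-check could look like, using only the JDK. The class and method names (RegexPrecheck, validate, isValidOnCpu) are hypothetical and are not existing plugin code; where exactly this would be called from is left open.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper (not existing plugin code): checks a pattern with the
// JDK regex parser before any GPU-specific handling would see it.
final class RegexPrecheck {
    private RegexPrecheck() {}

    // Compiles the pattern only for its side effect: the JDK throws
    // PatternSyntaxException for anything it considers invalid, which is
    // the same exception type the CPU path raises.
    static void validate(String pattern) {
        Pattern.compile(pattern);
    }

    // Convenience wrapper: true if the JDK parser accepts the pattern.
    static boolean isValidOnCpu(String pattern) {
        try {
            validate(pattern);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }
}
```

With the pattern from the reproduction, RegexPrecheck.validate("a{") throws PatternSyntaxException, while a well-formed pattern such as "a{2}" passes. Letting the exception propagate, rather than catching it, is what would give the GPU path the same failure as the CPU path.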
