Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Apply sample limit to UDFailedRowsExpressionQuery #1986

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

denisdallinga
Copy link

@denisdallinga denisdallinga commented Jan 11, 2024

Github Issue: #1985

This changes the way samples are collected from UserDefinedFailedRowsExpressionQueries. In order to apply the samples limit given in the check configuration, or apply the default samples limit, the failed rows expression needs to fire off an additional SampleQuery. The SampleQuery executes a copy of the query, appending a LIMIT clause.

This changes the way samples are collected from
UserDefinedFailedRowsExpressionQueries. In order to apply the samples
limit given in the check configuration, or apply the default samples
limit, the failed rows expression needs to fire off an additional
SampleQuery. The SampleQuery executes a copy of the query, appending a
LIMIT clause.

[sodadata#1985](sodadata#1985)
@CLAassistant
Copy link

CLAassistant commented Jan 11, 2024

CLA assistant check
All committers have signed the CLA.

Copy link

sonarcloud bot commented Jan 11, 2024

Quality Gate Passed Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

@NathanBick
Copy link

@denisdallinga I'm working on a PR to address this issue. Maybe we can team up to get more momentum on this issue?

Copy link
Contributor

@m1n0 m1n0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this contribution! It looks like this works, but I would suggest a few changes to make it consistent with how other check work:

  • use .fetchone() method of the query class to get the result
  • set the check metric value
  • if there are results and check has samples limit >0 then create a new sample query object with limit

You can check duplicate check query that does exactly that for inspiration, it should be pretty much 100% copy-paste except for the sql itself. https://github.com/sodadata/soda-core/blob/main/soda/core/soda/execution/query/duplicates_query.py#L95-L107

One note - please keep in mind that different data sources use different syntax for limit - TOP/FETCH FIRST, and it's even different syntax (you dont just append it to the end of the query) - so the way to go here is to:

  • refactor get_failed_rows_sql to take the sql string itself from the data_source class
  • override that sql generation in each datasource that does not use LIMIT - the ones that need that override have the LIMIT_KEYWORD attribute overridden

Lastly - this should be covered by tests as its error-prone with all the variances in how limit is done across data sources, there already is a test for how the limit keyword is applied here, it just needs to be adjusted to include user defined check with expression.

I hope this is not too much!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants