Duplicate protection for Timeseries #488

Open
browjm4 opened this issue Oct 17, 2024 · 0 comments

Reason

Running SELECT * queries, or running the same query repeatedly, can bloat object storage with redundant copies of result files.

Design

There are two separate things to tackle here; each could probably be its own PR. Be sure to sanitize the queries (e.g. trim whitespace and normalize case) before programmatically inspecting their contents. If a query is determined to be unique, send the original, non-sanitized version to DataFusion.
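A minimal sketch of the sanitizing step, assuming Python purely for illustration. The function name `normalize_query` is hypothetical, and dropping trailing semicolons and collapsing internal whitespace are assumptions beyond the trimming and case normalization mentioned above. Note that lowercasing the whole string also lowercases string literals, so the result should only be used as a comparison key, never as the query that gets executed.

```python
import re


def normalize_query(raw: str) -> str:
    """Return a comparison key for a query; the raw text itself is not modified."""
    # Trim surrounding whitespace and drop trailing semicolons (assumption).
    text = raw.strip().rstrip(";").strip()
    # Collapse internal whitespace runs (assumption) and lowercase everything.
    return re.sub(r"\s+", " ", text).lower()
```

Two queries that differ only in spacing, case, or a trailing semicolon then map to the same key, while the caller keeps the original string to pass along if the query turns out to be unique.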

  1. Handling SELECT * queries: if the query is exactly SELECT * FROM table_x (and nothing more), AND the original file is in CSV format, return the original file as the query result instead of going through DataFusion.
  2. Handling duplicated queries: if the same query has already been run, use SQL to look up the previous query ID, then the corresponding report ID, then the result file ID, and return that to the user. If all previous runs of the query failed, send it to DataFusion.
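The two checks above could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: the `queries` and `reports` tables, their columns, the `'completed'` status value, and all function names are hypothetical, and SQLite stands in for whatever database the repository layer actually uses.

```python
import re
import sqlite3
from typing import Optional, Tuple, Union

# Matches only a bare "select * from <table>" after normalization.
SELECT_STAR = re.compile(r"select \* from ([a-z_][a-z0-9_]*)")


def normalize(raw: str) -> str:
    """Comparison key only: trimmed, whitespace-collapsed, lowercased."""
    return re.sub(r"\s+", " ", raw.strip().rstrip(";").strip()).lower()


def bare_select_star_table(raw: str) -> Optional[str]:
    """Return the table name if the query is exactly SELECT * FROM <table>."""
    match = SELECT_STAR.fullmatch(normalize(raw))
    return match.group(1) if match else None


def find_previous_result(conn: sqlite3.Connection, raw: str) -> Optional[int]:
    """Find the newest successful result file for an identical query.

    Hypothetical schema: queries(id, normalized_text),
    reports(id, query_id, status, result_file_id).
    """
    row = conn.execute(
        """
        SELECT r.result_file_id
        FROM queries q
        JOIN reports r ON r.query_id = q.id
        WHERE q.normalized_text = ? AND r.status = 'completed'
        ORDER BY r.id DESC
        LIMIT 1
        """,
        (normalize(raw),),
    ).fetchone()
    return row[0] if row else None


def resolve(
    conn: sqlite3.Connection,
    raw: str,
    original_is_csv: bool,
    original_file_id: int,
) -> Tuple[str, Union[int, str]]:
    """Decide where the result for a query should come from."""
    # Case 1: bare SELECT * on a CSV source returns the original file.
    if original_is_csv and bare_select_star_table(raw) is not None:
        return ("original_file", original_file_id)
    # Case 2: reuse the result file of an earlier successful run.
    previous = find_previous_result(conn, raw)
    if previous is not None:
        return ("previous_result", previous)
    # Otherwise the raw, non-sanitized query goes to DataFusion.
    return ("datafusion", raw)
```

Filtering on a successful status in the lookup covers the case where every earlier run failed: no row comes back, so the query falls through to DataFusion.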

Impact

Adds logic to the query report repository layer to reduce storage use.
