Duplicate protection for Timeseries #488

browjm4 · 2024-10-17T14:25:43Z

Reason

Repeated SELECT * queries, or repeated submissions of the same query, can bloat object storage with unnecessary copies of result files.

Design

There are two separate things to tackle here; each could probably be its own PR. Be sure to sanitize the queries (e.g. trim whitespace and make them case-insensitive) before programmatically trying to determine their contents. If a query is determined to be unique, send the non-sanitized version to DataFusion.
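A minimal sketch of the sanitization step, in Python purely for illustration. The function name `normalize_query` is hypothetical, and collapsing internal whitespace is an assumption beyond the trimming and case folding described above. The result is only a comparison key; the caller keeps the original string to send to DataFusion.

```python
import re


def normalize_query(raw_query: str) -> str:
    """Build a comparison key for duplicate detection (hypothetical helper).

    The key is never executed. The original, unmodified query text is what
    should be sent on to DataFusion if the query turns out to be unique.
    """
    # Trim leading and trailing whitespace.
    trimmed = raw_query.strip()
    # Assumption: runs of internal whitespace are not significant, so
    # "SELECT  *\nFROM t" and "SELECT * FROM t" compare as equal.
    collapsed = re.sub(r"\s+", " ", trimmed)
    # Case-insensitive comparison.
    return collapsed.lower()
```

One caveat with this approach: lowercasing the whole string also lowercases string literals, so two queries that differ only in the case of a quoted value (for example `WHERE name = 'ABC'` and `WHERE name = 'abc'`) would produce the same key. Whether that matters depends on the data; a stricter version would need to leave quoted literals untouched.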

Handling SELECT * queries: if a query of the form SELECT * FROM table_x (and nothing more) is detected, AND the original file is in CSV format, return the original file as the query result instead of going through DataFusion.
Handling duplicated queries: if a query already exists, use a SQL lookup to find the previous query ID, then the corresponding report ID, then the result file ID, and return that to the user. If all instances of the previous query failed, send the query to DataFusion.
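The two checks above could be sketched roughly as follows, again in Python for illustration only. Everything about the schema is an assumption: the `queries` and `reports` tables, their columns, and the `'completed'` status value are hypothetical stand-ins for whatever the query report repository layer actually stores. Both functions expect a query that has already been trimmed and lowercased.

```python
import re
import sqlite3
from typing import Optional

# Matches only "select * from <table>" with nothing after the table name.
# Assumes simple, unquoted table identifiers.
BARE_SELECT_STAR = re.compile(r"select \* from ([a-z_][a-z0-9_]*)")


def bare_select_star_table(normalized_query: str) -> Optional[str]:
    """Return the table name if the query is a bare SELECT *, else None."""
    match = BARE_SELECT_STAR.fullmatch(normalized_query)
    return match.group(1) if match else None


def find_previous_result_file(
    conn: sqlite3.Connection, normalized_query: str
) -> Optional[int]:
    """Walk query -> report -> result file for an earlier identical query.

    Returns None when the query is new or every earlier attempt failed,
    in which case the caller should send the query to DataFusion.
    """
    row = conn.execute(
        """
        SELECT r.result_file_id
        FROM queries q
        JOIN reports r ON r.query_id = q.id
        WHERE q.normalized_query = ?
          AND r.status = 'completed'        -- hypothetical status value
          AND r.result_file_id IS NOT NULL
        ORDER BY r.id DESC                  -- prefer the most recent report
        LIMIT 1
        """,
        (normalized_query,),
    ).fetchone()
    return row[0] if row else None
```

A caller would try `bare_select_star_table` first (and confirm separately that the original file is CSV), then `find_previous_result_file`, and fall through to DataFusion only when both return None.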

Impact

Adds logic to the query report repository layer to save on storage.
