✨ Track queries per run #2052

Open
falexwolf opened this issue Oct 15, 2024 · 2 comments

falexwolf commented Oct 15, 2024

We're tracking artifacts & collections as run inputs whenever either of them is accessed in one of these ways: cached, loaded, opened, or iterated over.

What we don't track as inputs are queries of any entity: .get() and .filter() statements.

We'll have these in an audit log but the question is how easy that's going to be to decipher if there is no run record.

Shall we also track this information? It'd add another large number of link tables because, e.g., there'd be a cell_type__queried_in_runs table (via a many-to-many CellType._queried_in_runs).
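
To make the cost concrete, here is a rough sketch of what one such link table could look like, using plain sqlite3 rather than the actual ORM. Only the table name cell_type__queried_in_runs comes from the proposal above; the run and cell_type tables, all column names, and the sample rows are assumptions for illustration. Every queryable entity would need its own table of this shape.

```python
import sqlite3

# In-memory database so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE run (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cell_type (id INTEGER PRIMARY KEY, name TEXT);
    -- Hypothetical link table: one row per (record, run) pair that was queried.
    CREATE TABLE cell_type__queried_in_runs (
        id INTEGER PRIMARY KEY,
        cell_type_id INTEGER NOT NULL REFERENCES cell_type(id),
        run_id INTEGER NOT NULL REFERENCES run(id),
        UNIQUE (cell_type_id, run_id)
    );
    """
)

# Made-up sample data.
conn.execute("INSERT INTO run (id, name) VALUES (1, 'analysis-run')")
conn.executemany(
    "INSERT INTO cell_type (id, name) VALUES (?, ?)",
    [(1, "T cell"), (2, "B cell")],
)
# Record that the 'T cell' record was queried during run 1.
conn.execute(
    "INSERT INTO cell_type__queried_in_runs (cell_type_id, run_id) VALUES (1, 1)"
)
conn.commit()

# Which cell types were queried in run 1?
queried = conn.execute(
    """
    SELECT cell_type.name
    FROM cell_type
    JOIN cell_type__queried_in_runs AS link ON link.cell_type_id = cell_type.id
    WHERE link.run_id = ?
    """,
    (1,),
).fetchall()
print(queried)
```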

Opinions: @sunnyosun @Zethson @chaichontat

Zethson commented Oct 15, 2024

My immediate intuition says: "Not important at the moment". It'd be nice to have for getting a feel for how datasets are used, and it's a cool feature, but I don't see many use cases.

I'm voting "no" for now unless you have a few great use cases in mind?

falexwolf commented

I agree that these are very advanced use cases.

But one thing I've heard again and again from the data architects & engineers is that "everything should be tracked".

The thing is: it's impossible to make unmeasured data appear. So, most platforms instrument as much as they can even if it's almost never used.

Knowing how often, when, by whom, through which code, etc. a CellType record was queried is incredibly fine-grained, I know. Still, it could be useful in some instances.

The main downside of a naive implementation is that it slows down every query, but one option is to keep a cached log of all these operations and then commit them in a single transaction upon .finish().
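
A minimal sketch of that buffering idea, assuming a plain sqlite3 backend instead of the real ORM. The BufferedQueryLog class, its record() method, and the query_log table are hypothetical names made up for illustration; only the idea of flushing everything in one transaction on finish comes from the paragraph above.

```python
import sqlite3
from datetime import datetime, timezone


class BufferedQueryLog:
    """Hypothetical helper: buffer query events in memory, write them once."""

    def __init__(self, conn, run_id):
        self.conn = conn
        self.run_id = run_id
        self._pending = []

    def record(self, entity, operation):
        # Only an in-memory append, so the query itself is not slowed
        # down by an extra database write.
        timestamp = datetime.now(timezone.utc).isoformat()
        self._pending.append((self.run_id, entity, operation, timestamp))

    def finish(self):
        # The connection context manager commits on success and rolls
        # back on error, so all buffered rows land in one transaction.
        with self.conn:
            self.conn.executemany(
                "INSERT INTO query_log (run_id, entity, operation, queried_at) "
                "VALUES (?, ?, ?, ?)",
                self._pending,
            )
        written = len(self._pending)
        self._pending.clear()
        return written


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE query_log "
    "(run_id INTEGER, entity TEXT, operation TEXT, queried_at TEXT)"
)

log = BufferedQueryLog(conn, run_id=1)
log.record("CellType", "get")
log.record("CellType", "filter")

# Nothing has reached the database yet.
rows_before = conn.execute("SELECT COUNT(*) FROM query_log").fetchone()[0]
written = log.finish()
rows_after = conn.execute("SELECT COUNT(*) FROM query_log").fetchone()[0]
print(rows_before, written, rows_after)
```

The trade-off of this design is that buffered events are lost if the process dies before finish() is called.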

Hm. Let's keep it in the backlog.
