Replies: 5 comments
-
Quite a few different points here, so I'll add a comment with my thoughts on each one. Starting with the general work of moving duplicated logic into the upstream "base" macros: this makes perfect sense to me and should be part of what this achieves, and it should be doable for Redshift/Postgres as well, assuming we allow them to bring in contexts upfront. I don't think it makes sense to produce the table and then make each package/user responsible for de-duping it.
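To make that concrete, here is a minimal sketch of what a shared de-duplication macro in utils could look like, assuming we dedupe on `event_id` and keep the earliest `collector_tstamp`. The macro name, column names and dedupe rule are illustrative, not the actual implementation:

```sql
{# Hypothetical sketch only: a shared de-duplication macro that each package
   could call instead of re-implementing the logic itself.
   Macro name, column names and the dedupe rule are illustrative. #}
{% macro base_deduplicate_events(events_relation) %}
    select *
    from (
        select
            e.*,
            row_number() over (
                partition by e.event_id
                order by e.collector_tstamp
            ) as event_id_dedupe_index
        from {{ events_relation }} as e
    ) as deduped
    where event_id_dedupe_index = 1
{% endmacro %}
```

Packages (or user models) would then just call this over their events relation, which also ties into the point about not handing an un-deduped table to each package.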
-
For … and …: if we work under the assumption that we're still having a manifest table (whether that be a centralised one, or one per package) then both of these should follow the existing approach. I suppose the complexity comes when we use the same base table and potentially the same manifest. I think these might be two different things that cause different impacts, so it would be worth understanding what part of the "which models do I run this time" logic is built from the dependency graph vs the manifest, and what having a single one of each would do.
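As a rough illustration of the manifest side of that question, this is the kind of query that would derive the run window from a single shared manifest; the relation and column names are assumptions for the sake of the sketch, not the real implementation:

```sql
-- Hypothetical sketch: deriving the next processing window from a shared
-- incremental manifest. Relation and column names are assumptions.
with enabled_models as (
    select model, last_success
    from {{ ref('snowplow_incremental_manifest') }}
    where model in ('snowplow_web_page_views', 'my_custom_model')  -- models enabled this run
)

select
    min(last_success) as lower_limit,  -- a new or out-of-sync model drags this back
    max(last_success) as upper_limit
from enabled_models
```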
-
I think it makes sense; currently we do this in the normalize package already, we just use the absolute time of things (with the usual buffer for late-arriving events). If we move to …, I think we may need to better define the term …. These are two different approaches (as in the relative-date case you don't need a sessionisation table), and we may want to offer less flexibility here to start with (i.e. always relative to the current date or from a fixed date, rather than letting users add whatever SQL they want), but I think being clear about what we mean by a "session" and how that translates to the data will help.
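For clarity on the two approaches being contrasted, here is a rough sketch; the variable, relation and column names are illustrative, and the date function is Snowflake/Redshift-style:

```sql
-- (a) relative-date approach: no sessionisation table, just a rolling window
--     plus a buffer for late-arriving events
select *
from {{ source('atomic', 'events') }}
where collector_tstamp >= dateadd(day, -{{ var('snowplow__days_to_process', 7) }}, current_date)

-- (b) session-based approach: only events whose session is in scope this run
-- select e.*
-- from {{ source('atomic', 'events') }} as e
-- inner join {{ ref('base_sessions_this_run') }} as s
--     on e.domain_sessionid = s.session_identifier
```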
-
I think what you've proposed here only works if those identifiers are ONLY populated for their respective package's events. It may be the case that I want to use …. I think this lends itself to the case for having each package produce its own set of core tables, using macros from utils, rather than all packages trying to read from the same core tables. I'll discuss this further below.
-
There's definitely cases both ways here; I think the way I see it is the case for actually building the tables in …
I think my worry is that once you start trying to do something slightly different, you run into some complexities/issues that make it more confusing.
In terms of upgrading packages/changing models across all packages, I think we get the same benefit either way (just a macro, or a macro and models in utils). I also think we get a lot out of the planned unified web+mobile package, which covers the main use case where I think people run two packages at the same time; that will unify them into one package running from the same base table anyway.

Finally, I think we should offer a demo project (one we don't even publish to the hub, we just link to it in our docs) that is basically a bare-bones project set up to use these macros, with a dummy model built on top. That way anyone (including us when working on new packages) can start from it if they want to build just their own models, but it will use the functionality provided in utils rather than having to re-invent it. I might be missing some real pros of having the models in utils, but I am just worried the risks outweigh the benefits when people try to take advantage of the flexibility we are trying to offer.
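For the demo project idea, the dummy model could be as small as something like the following; the relation name is an assumption about what the utils-built base table would be called:

```sql
-- models/dummy_sessions_summary.sql in the hypothetical demo project:
-- a bare-bones model built on top of the base table produced by the utils
-- macros, just to show where custom modelling would plug in.
select
    session_identifier,
    min(derived_tstamp) as first_event_tstamp,
    count(*) as event_count
from {{ ref('snowplow_base_events_this_run') }}
group by 1
```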
-
Summary
The idea of removing duplicated logic from `dbt-snowplow-*` packages and consolidating that logic into the `dbt-snowplow-utils` package has several benefits. By centralizing commonly used tables and macros in a separate (base) package, which is already a dependency for all `dbt-snowplow-*` packages, we can simplify the development process for Snowplow users who leverage the dbt packages. Developers can utilize the shared logic across multiple packages, avoiding the need to re-implement the same logic in different packages. Additionally, this work is necessary for the merging of `-web` and `-mobile` data modeling that will be done down the road. This can save time and effort, reduce the risk of errors, and make maintenance and upgrades more straightforward.

Moreover, by generating the models in `dbt-snowplow-utils`, the users of `dbt-snowplow-*` packages can benefit from the standardization of models and the ability to use them as a foundation to build additional models. At the same time, we need to improve the level of customization available to users to allow them to unpack/include whatever custom contexts they want, both for general purposes and for user/session identification specifically. This work could improve the overall quality and consistency of the data modeling experience, and enable faster and more reliable model development, for both Snowplow developers and our users.

Development work has already progressed on this, but it has not been without its challenges. Below we list some of the questions we faced internally, as well as our preliminary thoughts/decisions on each. If anyone has additional questions or comments related to this discussion, or isn't in agreement with the answers we've laid out below, then please feel free to contribute to the discussion in any way, shape or form, so we can develop a better experience for all users.
Questions considered
What should happen when a user introduces a new package to the mix?
If tracking for the package started a couple of months earlier, but the data modeling is only starting now, the manifest table will recognise that new models are being added. As a result it will, by default, rerun all models for all packages from that start date onwards. Users can disable this behaviour by temporarily disabling the other packages in their `dbt_project.yml`. We can create a clear overview of how to do this in the docs and link to those docs when a `New model` message appears, in case users weren't aware.
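For reference, temporarily disabling another package is a small config change in `dbt_project.yml`; the package name below is just an example:

```yml
# dbt_project.yml -- hypothetical example: keep the existing snowplow_web
# models disabled while a newly added package backfills, then re-enable them.
models:
  snowplow_web:
    +enabled: false
```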
How should we deal with models that are out of sync?
If models are continuously out of sync, this will cause all of the models to be rerun continuously. This is likely not behaviour that is desired by end-users, so we can again link to the same docs as mentioned before when the `out of sync` message appears.

Do we want to allow for cases where sessionization is not needed?
Probably not: if sessionization is not needed, the user can just use the standard incremental framework to achieve a proper incremental implementation.
With server-side event modelling, for example, we could have a case where there are no sessions and we just want to load and model data incrementally using only the `timestamp` value, which is trivial to implement in dbt. On the other hand, there could be a certain number of events collected from server-side tracking that are all part of the same session, e.g. through a common `process_id`. We would want to allow users to specify this identifier and have the data model respond appropriately to this input.
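For the no-sessions case, this really is just a vanilla dbt incremental model; a minimal sketch, assuming a `derived_tstamp` column and `event_id` as the unique key (the source and column names are illustrative):

```sql
-- Hypothetical plain incremental model for server-side events with no
-- session concept: new events are selected purely on a timestamp.
{{ config(materialized='incremental', unique_key='event_id') }}

select
    event_id,
    derived_tstamp,
    event_name,
    app_id
from {{ source('atomic', 'events') }}

{% if is_incremental() %}
where derived_tstamp > (select max(derived_tstamp) from {{ this }})
{% endif %}
```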
What if the session identifier is nested in a context (custom or otherwise)?
There could be cases where the session identifier (or other identifiers) are nested within contexts. We should allow the user to specify which contexts (if any) to unpack before we generate these tables. The easiest way to do this might be to offer a macro that the user can overwrite for unpacking values. These contexts would need to be extracted during the process of creating the `base_sessions_this_run` table.
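One way to offer that overwritable hook is dbt's dispatch mechanism: the package calls a macro while building `base_sessions_this_run`, and users provide their own implementation in their project. A sketch, with all names and the context-access syntax being assumptions:

```sql
{# Hypothetical sketch of an overridable unpacking hook (all names assumed).
   The package calls custom_session_fields() while building
   base_sessions_this_run; the default adds nothing. #}
{% macro custom_session_fields() %}
    {{ return(adapter.dispatch('custom_session_fields', 'snowplow_utils')()) }}
{% endmacro %}

{% macro default__custom_session_fields() %}
    {# Default: no extra identifier columns. A user's override could return
       something like:
       , contexts_com_acme_session_1[0]:session_id::varchar as custom_session_id
       (Snowflake-style semi-structured access, purely illustrative). #}
{% endmacro %}
```

A user would then define their own `default__custom_session_fields()` and add their project to the `dispatch` search order for the package in `dbt_project.yml`, which is the standard dbt mechanism for overriding a package macro.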
What happens when the user has multiple packages running with different session identifiers?
The user should be able to enter multiple session identifiers (and multiple session contexts to unpack), and we should coalesce these identifiers together to ensure that we have all the identifiers necessary. The coalesce will happen in the order that the user specified the identifiers, so the user can decide the precedence of identifiers in case both are present in an event. We could also include the `platform` value in the table to make joining quicker.
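A sketch of what that could look like, assuming the identifiers arrive as a list variable and the first non-null value in the user's order wins (the variable, relation and column names are illustrative):

```sql
{%- set identifiers = var('snowplow__session_identifiers',
                          ['custom_session_id', 'domain_sessionid']) -%}

select
    -- first non-null identifier wins, in the order the user specified
    coalesce(
        {%- for identifier in identifiers %}
        {{ identifier }}{{ ',' if not loop.last }}
        {%- endfor %}
    ) as session_identifier,
    e.*   -- platform, among other columns, stays available for cheaper joins later
from {{ ref('snowplow_base_events_this_run') }} as e
```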
Should we generate the tables within `dbt-snowplow-utils` or should we just produce a macro that generates the table in each package?
If the tables we are generating are relatively simple and straightforward, and the logic is consistent across all packages, it would make sense to generate the tables within `dbt-snowplow-utils`. This would provide a centralized location for managing the logic and allow all packages to use the same tables and logic without duplicating effort.

On the other hand, if the logic for generating the tables is complex and varies significantly across packages, it may be more appropriate to create a macro that can then be used within each package. This would allow each package to tailor the logic to its specific needs while still providing a consistent interface for generating the tables. That being said, we would lose the cohesion of one base table (if required) for multiple packages, off of which we would build the data models for each package.
To get the best of both worlds, the approach we will take is to create macros that generate the tables, and then leverage those macros to create common tables inside the `dbt-snowplow-utils` package which can be used by the other packages. If users want to create their own base tables and manifests, they can leverage the macros we've created in `dbt-snowplow-utils` instead of starting from scratch; otherwise they can just use the tables already created by `dbt-snowplow-utils`.
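Concretely, a downstream package (or an end user) would then build its own base table with something like the following; the macro name and arguments here are assumptions for the sake of illustration, not the final interface:

```sql
-- Hypothetical: a package/user model that materialises its own base table by
-- calling a shared macro from dbt-snowplow-utils rather than re-implementing
-- the logic. Macro name and arguments are illustrative only.
{{ config(materialized='table') }}

{{ snowplow_utils.base_create_events_this_run(
    session_identifiers=var('snowplow__session_identifiers', ['domain_sessionid']),
    start_date=var('snowplow__start_date', '2023-01-01')
) }}
```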