Replies: 5 comments
-
Quite a few different points here, so I'll add a comment with my thoughts on each one. Starting with the general work of moving duplicated logic into the upstream "base" macros: this makes perfect sense to me and should be part of what this achieves, and it should be doable for Redshift/Postgres as well, assuming we allow them to bring in contexts upfront. I don't think it makes sense to produce the table and then make each package/user responsible for de-duping it.
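To make that concrete, here is a minimal sketch of what a shared de-duplication macro in utils could look like, assuming we dedupe on `event_id` and keep the earliest `collector_tstamp`. The macro name, column names and dedupe rule are illustrative, not the actual implementation:

```sql
{# Hypothetical sketch only: a shared de-duplication macro that each package
   could call instead of re-implementing the logic itself.
   Macro name, column names and the dedupe rule are illustrative. #}
{% macro base_deduplicate_events(events_relation) %}
    select *
    from (
        select
            e.*,
            row_number() over (
                partition by e.event_id
                order by e.collector_tstamp
            ) as event_id_dedupe_index
        from {{ events_relation }} as e
    ) as deduped
    where event_id_dedupe_index = 1
{% endmacro %}
```

Packages (or user models) would then just call this over their events relation, which also ties into the point about not handing an un-deduped table to each package.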
-
For … and …: if we work under the assumption that we're still having a manifest table (whether that be a centralised one, or one per package) then both of these should follow the existing approach. I suppose the complexity comes when we use the same base table and potentially the same manifest. I think these might be two different things that cause different impacts, so it would be worth understanding what part of the "which models do I run this time" logic is built from the dependency graph vs the manifest, and what having a single one of each would do.
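As a rough illustration of the manifest side of that question, this is the kind of query that would derive the run window from a single shared manifest; the relation and column names are assumptions for the sake of the sketch, not the real implementation:

```sql
-- Hypothetical sketch: deriving the next processing window from a shared
-- incremental manifest. Relation and column names are assumptions.
with enabled_models as (
    select model, last_success
    from {{ ref('snowplow_incremental_manifest') }}
    where model in ('snowplow_web_page_views', 'my_custom_model')  -- models enabled this run
)

select
    min(last_success) as lower_limit,  -- a new or out-of-sync model drags this back
    max(last_success) as upper_limit
from enabled_models
```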
-
I think it makes sense; currently we do this in the normalize package already, we just use the absolute time of things (with the usual buffer for late-arriving events). If we move to …, I think we may need to better define the term …. These are two different approaches (as in the relative-date case you don't need a sessionisation table), and we may want to offer less flexibility here to start with (i.e. always relative to the current date or from a fixed date, rather than letting users add whatever SQL they want), but I think being clear about what we mean by a "session" and how that translates to the data will help.
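For clarity on the two approaches being contrasted, here is a rough sketch; the variable, relation and column names are illustrative, and the date function is Snowflake/Redshift-style:

```sql
-- (a) relative-date approach: no sessionisation table, just a rolling window
--     plus a buffer for late-arriving events
select *
from {{ source('atomic', 'events') }}
where collector_tstamp >= dateadd(day, -{{ var('snowplow__days_to_process', 7) }}, current_date)

-- (b) session-based approach: only events whose session is in scope this run
-- select e.*
-- from {{ source('atomic', 'events') }} as e
-- inner join {{ ref('base_sessions_this_run') }} as s
--     on e.domain_sessionid = s.session_identifier
```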
-
I think what you've proposed here only works if those identifiers are ONLY populated for their respective package's events. It may be the case that I want to use …. I think this lends itself to the case for having each package produce its own set of core tables, using macros from utils, rather than all packages trying to read from the same core tables. I'll discuss this further below.
-
There's definitely cases both ways here; I think the way I see it is the case for actually building the tables in …
I think my worry is that once you start trying to do something slightly different, you run into some complexities/issues that make it more confusing.
In terms of upgrading packages/changing models across all packages, I think we get the same benefit either way (just a macro, or a macro and models in utils). I also think we get a lot out of the planned unified web+mobile package, which covers the main use case where I think people run two packages at the same time; that will unify them into one package running from the same base table anyway.

Finally, I think we should offer a demo project (one we don't even publish to the hub, we just link to it in our docs) that is basically a bare-bones project set up to use these macros, with a dummy model built on top. That way anyone (including us when working on new packages) can start from it if they want to build just their own models, but it will use the functionality provided in utils rather than having to re-invent it. I might be missing some real pros of having the models in utils, but I am just worried the risks outweigh the benefits when people try to take advantage of the flexibility we are trying to offer.
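For the demo project idea, the dummy model could be as small as something like the following; the relation name is an assumption about what the utils-built base table would be called:

```sql
-- models/dummy_sessions_summary.sql in the hypothetical demo project:
-- a bare-bones model built on top of the base table produced by the utils
-- macros, just to show where custom modelling would plug in.
select
    session_identifier,
    min(derived_tstamp) as first_event_tstamp,
    count(*) as event_count
from {{ ref('snowplow_base_events_this_run') }}
group by 1
```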
-
Summary
The idea of removing duplicated logic from `dbt-snowplow-*` packages and consolidating that logic into the `dbt-snowplow-utils` package has several benefits. By centralizing commonly used tables and macros in a separate (base) package, which is already a dependency for all `dbt-snowplow-*` packages, we can simplify the development process for Snowplow users who leverage the dbt packages. Developers can utilize the shared logic across multiple packages, avoiding the need to re-implement the same logic in different packages. Additionally, this work is necessary for the merging of `-web` and `-mobile` data modeling that will be done down the road. This can save time and effort, reduce the risk of errors, and make maintenance and upgrades more straightforward.

Moreover, by generating the models in `dbt-snowplow-utils`, the users of `dbt-snowplow-*` packages can benefit from the standardization of models and the ability to use them as a foundation to build additional models. At the same time, we need to improve the level of customization available to users to allow them to unpack/include whatever custom contexts they want, both for general purposes and for user/session identification specifically. This work could improve the overall quality and consistency of the data modeling experience, and enable faster and more reliable model development, for both Snowplow developers and our users.

Development work has already progressed on this, but it has not been without its challenges. Below we list some of the questions we faced internally, as well as our preliminary thoughts/decisions on each. If anyone has additional questions or comments related to this discussion, or isn't in agreement with the answers we've laid out below, then please feel free to contribute to the discussion in any way, shape or form, so we can develop a better experience for all users.
Questions considered
What should happen when a user introduces a new package to the mix?
If tracking for the package started a couple of months earlier, but the data modeling is only starting now, the manifest table will recognise that new models are being added. As a result it will, by default, rerun all models for all packages from that start date onwards. Users can disable this behaviour by temporarily disabling the other packages in their `dbt_project.yml`. We can create a clear overview of how to do this in the docs and link to those docs when a `New model` message appears, in case users weren't aware.
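For reference, temporarily disabling another package is a small config change in `dbt_project.yml`; the package name below is just an example:

```yml
# dbt_project.yml -- hypothetical example: keep the existing snowplow_web
# models disabled while a newly added package backfills, then re-enable them.
models:
  snowplow_web:
    +enabled: false
```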
How should we deal with models that are out of sync?
If models are continuously out of sync, this will cause all of the models to be rerun continuously. This is likely not behaviour that is desired by end-users, so we can again link to the same docs as mentioned before when the `out of sync` message appears.

Do we want to allow for cases where sessionization is not needed?
Probably not: if sessionization is not needed, the user can just use the standard incremental framework to achieve a proper incremental implementation.
With server-side event modelling, for example, we could have a case where there are no sessions and we just want to load and model data incrementally using only the `timestamp` value, which is trivial to implement in dbt. On the other hand, there could be a certain number of events collected from server-side tracking that are all part of the same session, e.g. through a common `process_id`. We would want to allow users to specify this identifier and have the data model respond appropriately to this input.
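For the no-sessions case, this really is just a vanilla dbt incremental model; a minimal sketch, assuming a `derived_tstamp` column and `event_id` as the unique key (the source and column names are illustrative):

```sql
-- Hypothetical plain incremental model for server-side events with no
-- session concept: new events are selected purely on a timestamp.
{{ config(materialized='incremental', unique_key='event_id') }}

select
    event_id,
    derived_tstamp,
    event_name,
    app_id
from {{ source('atomic', 'events') }}

{% if is_incremental() %}
where derived_tstamp > (select max(derived_tstamp) from {{ this }})
{% endif %}
```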
What if the session identifier is nested in a context (custom or otherwise)?
There could be cases where the session identifier (or other identifiers) are nested within contexts. We should allow the user to specify which contexts (if any) to unpack before we generate these tables. The easiest way to do this might be to offer a macro that the user can overwrite for unpacking values. These contexts would need to be extracted during the process of creating the `base_sessions_this_run` table.
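One way to offer that overwritable hook is dbt's dispatch mechanism: the package calls a macro while building `base_sessions_this_run`, and users provide their own implementation in their project. A sketch, with all names and the context-access syntax being assumptions:

```sql
{# Hypothetical sketch of an overridable unpacking hook (all names assumed).
   The package calls custom_session_fields() while building
   base_sessions_this_run; the default adds nothing. #}
{% macro custom_session_fields() %}
    {{ return(adapter.dispatch('custom_session_fields', 'snowplow_utils')()) }}
{% endmacro %}

{% macro default__custom_session_fields() %}
    {# Default: no extra identifier columns. A user's override could return
       something like:
       , contexts_com_acme_session_1[0]:session_id::varchar as custom_session_id
       (Snowflake-style semi-structured access, purely illustrative). #}
{% endmacro %}
```

A user would then define their own `default__custom_session_fields()` and add their project to the `dispatch` search order for the package in `dbt_project.yml`, which is the standard dbt mechanism for overriding a package macro.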
What happens when the user has multiple packages running with different session identifiers?
The user should be able to enter multiple session identifiers (and multiple session contexts to unpack), and we should coalesce these identifiers together to ensure that we have all the identifiers necessary. The coalesce will happen in the order that the user specified the identifiers, so the user can decide the precedence of identifiers in case both are present in an event. We could also include the `platform` value in the table to make joining quicker.
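A sketch of what that could look like, assuming the identifiers arrive as a list variable and the first non-null value in the user's order wins (the variable, relation and column names are illustrative):

```sql
{%- set identifiers = var('snowplow__session_identifiers',
                          ['custom_session_id', 'domain_sessionid']) -%}

select
    -- first non-null identifier wins, in the order the user specified
    coalesce(
        {%- for identifier in identifiers %}
        {{ identifier }}{{ ',' if not loop.last }}
        {%- endfor %}
    ) as session_identifier,
    e.*   -- platform, among other columns, stays available for cheaper joins later
from {{ ref('snowplow_base_events_this_run') }} as e
```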
Should we generate the tables within `dbt-snowplow-utils` or should we just produce a macro that generates the table in each package?
If the tables we are generating are relatively simple and straightforward, and the logic is consistent across all packages, it would make sense to generate the tables within `dbt-snowplow-utils`. This would provide a centralized location for managing the logic and allow all packages to use the same tables and logic without duplicating effort.

On the other hand, if the logic for generating the tables is complex and varies significantly across packages, it may be more appropriate to create a macro that can then be used within each package. This would allow each package to tailor the logic to its specific needs while still providing a consistent interface for generating the tables. That being said, we would lose the cohesion of one base table (if required) for multiple packages, off of which we would build the data models for each package.
To get the best of both worlds, the approach we will take is to create macros that generate the tables, and then leverage those macros to create common tables inside the `dbt-snowplow-utils` package which can be used by the other packages. If users want to create their own base tables and manifests, they can leverage the macros we've created in `dbt-snowplow-utils` instead of starting from scratch; otherwise they can just use the tables already created by `dbt-snowplow-utils`.
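Concretely, a downstream package (or an end user) would then build its own base table with something like the following; the macro name and arguments here are assumptions for the sake of illustration, not the final interface:

```sql
-- Hypothetical: a package/user model that materialises its own base table by
-- calling a shared macro from dbt-snowplow-utils rather than re-implementing
-- the logic. Macro name and arguments are illustrative only.
{{ config(materialized='table') }}

{{ snowplow_utils.base_create_events_this_run(
    session_identifiers=var('snowplow__session_identifiers', ['domain_sessionid']),
    start_date=var('snowplow__start_date', '2023-01-01')
) }}
```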