docs: update documentation to reflect new backfill counter functionality

psFried committed Jan 18, 2024
1 parent 322af29 commit 7681c66
Showing 2 changed files with 21 additions and 19 deletions.
14 changes: 7 additions & 7 deletions site/docs/concepts/advanced/evolutions.md
@@ -66,17 +66,17 @@ Alternatively, you could manually update all the specs to agree to your edit, bu…

Evolutions can prevent errors resulting from mismatched specs in two ways:

* **Materialize data to a new resource in the endpoint system**: The evolution updates all materialization bindings that read from the collection to write to a new resource (a database table, for example) in the endpoint system. This is done by updating the materialization's *binding specification*. For example, if the collection was previously materialized into a database table called `my_table`, the evolution would update it to instead materialize into `my_table_v2`. The Flow collection itself remains unchanged.
* **Materialize data to a new resource in the endpoint system**: The evolution updates the affected materialization bindings to increment their `backfill` counter, which causes the materialization to re-create the resource (a database table, for example) and backfill it from the beginning (see the sketch below).

This is a simpler change, and it's how evolutions work in most cases.

* **Re-create the Flow collection with a new name**: The evolution creates a completely new collection with a numerical suffix, such as `_v2`. This collection starts out empty and backfills from the source. The evolution also updates all captures and materializations that reference the old collection to instead reference the new collection. This also updates any materializations to materialize the new collection into a new resource.
* **Re-create the Flow collection with a new name**: The evolution creates a completely new collection with a numerical suffix, such as `_v2`. This collection starts out empty and backfills from the source. The evolution also updates all captures and materializations that reference the old collection to instead reference the new collection, and increments their `backfill` counters.

This is a more complicated change, and evolutions only work this way when necessary: when the collection key or logical partition changes, or when a schema change would cause the endpoint system to reject the materialized data.
This is a more complicated change, and evolutions only work this way when necessary: when the collection key or logical partitioning changes.

:::info
Evolutions will soon support the re-creation of materialization resources, such as tables, while keeping the same names.
:::
In either case, the names of the destination resources will remain the same. For example, a materialization to Postgres would drop and re-create the affected tables with the same names they had previously.

Also in either case, only the specific bindings that had incompatible changes will be affected. Other bindings will remain untouched, and will not re-backfill.
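
As a concrete illustration of the first case, a minimal sketch of a materialization binding after an evolution might look like the following. The materialization name, connector image, and resource names here are assumptions for illustration, not taken from the docs:

```yaml
materializations:
  acmeCo/materialize-postgres:
    endpoint:
      connector:
        image: ghcr.io/estuary/materialize-postgres:dev
        config: encrypted-config.sops.yaml
    bindings:
      - source: acmeCo/my_collection
        # Added (or incremented, if already present) by the evolution.
        # This causes `my_table` to be dropped, re-created, and
        # backfilled from the beginning of the collection.
        backfill: 1
        resource:
          table: my_table
```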

## What causes breaking schema changes?

@@ -98,4 +98,4 @@ key: [/id]
If you materialized that collection into a relational database table, the table would look something like `my_table (id integer primary key, foo timestamptz)`.

Now, say you edit the collection spec to remove `format: date-time` from `foo`. You'd expect the materialized database table to then look like `(id integer primary key, foo text)`. But since the column type of `foo` has changed, this will fail. An easy solution in this case would be to change the name of the table that the collection is materialized into. Evolutions do this by appending a suffix to the original table name. In this case, you'd end up with `my_table_v2 (id integer primary key, foo text)`.
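
For reference, a sketch of the collection spec behind this example, assuming a hypothetical collection name and the fields shown above:

```yaml
collections:
  acmeCo/my_collection:
    schema:
      type: object
      properties:
        id: { type: integer }
        # With `format: date-time`, `foo` materializes as timestamptz.
        # Removing the format leaves a plain string, which would
        # materialize as text: an incompatible column type change.
        foo: { type: string, format: date-time }
      required: [id]
    key: [/id]
```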
26 changes: 14 additions & 12 deletions site/docs/guides/schema-evolution.md
@@ -104,6 +104,7 @@ captures:
namespace: public
stream: anvils
mode: Normal
backfill: 1
target: acmeCo/inventory/anvils_v2
collections:
@@ -125,20 +126,15 @@ materializations:
config: encrypted-snowflake-config.sops.yaml
bindings:
- source: acmeCo/inventory/anvils_v2
backfill: 1
resource:
table: anvils_v2
table: anvils
schema: inventory
```

The existing `acmeCo/inventory/anvils` collection will not be modified and will remain in place, but won't update because no captures are writing to it.

Note that the collection is now being materialized into a new Snowflake table, `anvils_v2`. This is because the primary key of the `anvils` table doesn't match the new collection key. New data going forward will be added to `anvils_v2` in the data warehouse.

:::warning
Currently, changing the `target` collection in the capture spec will _not_ cause the capture to perform another backfill. This means that the `anvils_v2` table will get all of the _new_ data going forward, but will not contain the existing data from `anvils`.

We will soon release updates that make it much easier to keep your destination tables fully in sync without needing to change the names. In the meantime, feel free to [reach out on Slack](https://join.slack.com/t/gazette-dev/shared_invite/enQtNjQxMzgyNTEzNzk1LTU0ZjZlZmY5ODdkOTEzZDQzZWU5OTk3ZTgyNjY1ZDE1M2U1ZTViMWQxMThiMjU1N2MwOTlhMmVjYjEzMjEwMGQ) for help.
:::
Also note the addition of the `backfill` property. If the `backfill` property already exists, just increment its value. For the materialization, this will ensure that the destination table in Snowflake gets dropped and re-created, and that the materialization will backfill it from the beginning. In the capture, it similarly causes it to start over from the beginning, writing the captured data into the new collection.

**Auto-Discovers:**

@@ -159,6 +155,10 @@ To manually add a field:
* **In the Flow web app,** [edit the materialization](./edit-data-flows.md#edit-a-materialization), find the affected binding, and click **Show Fields**.
* **Using flowctl,** add the field to `fields.include` in the materialization specification as shown [here](../concepts/materialization.md#projected-fields) and sketched below.

:::info
Newly added fields will not be set for rows that have already been materialized. If you want to ensure that all rows have the new field, just increment the `backfill` counter in the affected binding to have it restart from the beginning.
:::
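
As a sketch, manually including a field in a materialization binding might look like the following, with a hypothetical field name (the linked materialization docs are the authoritative reference for this syntax):

```yaml
materializations:
  acmeCo/materialize-snowflake:
    bindings:
      - source: acmeCo/inventory/anvils
        fields:
          recommended: true
          include:
            # Hypothetical field to include beyond the recommended set.
            my_new_field: {}
        resource:
          table: anvils
          schema: inventory
```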

### A field's data type has changed

*Scenario: this is one way in which the schema can change.*
@@ -177,7 +177,7 @@ The best way to find out whether a change is acceptable to a given connector is

**Web app workflow**

If you're working in the Flow web app, and attempt to publish a change that's unacceptable to the connector, you'll see an error message and an option to materialize to a new table, or, in rare cases, to re-create the collection.
If you're working in the Flow web app, and attempt to publish a change that's unacceptable to the connector, you'll see an error message and an offer to increment the necessary `backfill` counters, or, in rare cases, to re-create the collection.

Click **Apply** to accept this solution and continue to publish.

@@ -207,6 +207,7 @@ materializations:
config: encrypted-snowflake-config.sops.yaml
bindings:
- source: acmeCo/inventory/anvils
backfill: 3
resource:
table: anvils
schema: inventory
@@ -234,12 +235,13 @@ materializations:
config: encrypted-snowflake-config.sops.yaml
bindings:
- source: acmeCo/inventory/anvils
backfill: 4
resource:
table: anvils_v2
table: anvils
schema: inventory
```

Note that the collection name is the same. Only the materialization `resource` is updated to write to a new table, which will backfill from the existing collection data.
Note that the only change was to increment the `backfill` counter. If the previous binding spec did not specify `backfill`, then just add `backfill: 1`.

This works because the type is broadened, so existing values will still validate against the new schema. If this were not the case, then you'd likely need to [re-create the whole collection](#re-creating-a-collection).
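
For instance, with a hypothetical `quantity` field, broadening the type means every previously valid value still validates:

```yaml
# Before: only integers validate.
properties:
  quantity: { type: integer }
---
# After: integers still validate, and nulls are now accepted too.
properties:
  quantity: { type: [integer, "null"] }
```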

@@ -251,4 +253,4 @@ If you enabled the option to [**Automatically keep schemas up to date** (`autoDi…

*Scenario: this is one way in which the schema can change.*

Removing fields is generally allowed by all connectors, and does not require new tables or collections. Note that for database materializations, the existing column will _not_ be dropped, and will just be ignored by the materialization going forward. A `NOT NULL` constraint would be removed from that column, but it will otherwise be left in place.
