
indexer-alt: sequential pipeline #20053

Open

wants to merge 2 commits into amnn/idx-obo
Conversation

amnn
Copy link
Member

@amnn amnn commented Oct 28, 2024

Description

Introduce a new kind of pipeline for indexing that needs commit data in checkpoint order. This will be used for indexing data that would previously have gone into objects or objects_snapshot, where rows are modified in place, and so can't be committed out-of-order.

Sequential pipelines are split into two parts:

  • A processor which is shared with the existing concurrent pipeline, and is responsible for turning checkpoint data into values to be sent to the database.
  • A committer which is responsible for batching up prefixes of updates and sending them to the DB when they are complete (no gaps between the last write and what has been buffered).
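The committer's prefix-batching can be pictured as follows. This is a minimal, hypothetical sketch (the names `Committer`, `buffer`, and `take_ready` are illustrative, not the PR's actual API): results may arrive out of checkpoint order, but only a gap-free prefix starting at the watermark is eligible to be written to the DB.

```rust
use std::collections::BTreeMap;

/// Illustrative sketch of a sequential committer's buffer.
struct Committer {
    /// Next checkpoint the DB expects (the watermark).
    watermark: u64,
    /// Processed-but-uncommitted results, keyed by checkpoint sequence number.
    pending: BTreeMap<u64, Vec<String>>,
}

impl Committer {
    fn new(watermark: u64) -> Self {
        Self { watermark, pending: BTreeMap::new() }
    }

    /// Buffer the result of processing one checkpoint (arrival order is arbitrary).
    fn buffer(&mut self, checkpoint: u64, rows: Vec<String>) {
        self.pending.insert(checkpoint, rows);
    }

    /// Drain the longest gap-free prefix of buffered checkpoints, advancing the
    /// watermark. Returns the rows to commit to the DB in one batch; returns an
    /// empty batch if the checkpoint at the watermark has not arrived yet.
    fn take_ready(&mut self) -> Vec<String> {
        let mut batch = Vec::new();
        while let Some(rows) = self.pending.remove(&self.watermark) {
            batch.extend(rows);
            self.watermark += 1;
        }
        batch
    }
}
```

Note how a checkpoint buffered ahead of the watermark stays pending until the gap before it is filled, which is what makes the writes safe to apply in place.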

The key design constraints of the sequential pipeline are as follows:

  • Although the committer must write out rows in order, it can buffer the results of checkpoints processed out-of-order.
  • It uses the ingestion service's regulator for back-pressure: The ingestion service is only allowed to run ahead of all sequential pipelines by its buffer size, which bounds the memory that each pipeline must use to buffer pending writes.
  • Sequential pipelines have different tuning parameters compared to concurrent pipelines:
    • MIN_BATCH_ROWS: The threshold for eagerly writing to the DB.
    • MAX_BATCH_CHECKPOINTS: The maximum number of checkpoints that will be batched together in a single transaction.
  • They guarantee atomicity using DB transactions: All the writes for a single checkpoint, and the corresponding watermark update are put into the same DB transaction.
  • They support simplifying/merging writes to the DB: If the same object is modified multiple times across multiple checkpoints, only the latest write will make it to the DB.
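The last point, merging writes, can be sketched like so. This is an assumption-laden illustration (the function `merge_writes` and the `(object id, version)` row shape are hypothetical, chosen only to show the idea): applying buffered checkpoints in order means a later write to the same object replaces an earlier one, so only the latest version reaches the DB.

```rust
use std::collections::HashMap;

/// Illustrative sketch of write merging across buffered checkpoints.
/// Each inner slice holds one checkpoint's writes as (object id, version).
fn merge_writes(checkpoints: &[Vec<(&str, u64)>]) -> HashMap<String, u64> {
    let mut merged = HashMap::new();
    // Checkpoints are applied in checkpoint order, so inserting into the map
    // lets a later write to the same object id overwrite the earlier one.
    for writes in checkpoints {
        for (object_id, version) in writes {
            merged.insert(object_id.to_string(), *version);
        }
    }
    merged
}
```

Merging is only sound because the committer writes gap-free prefixes: within a batch, the latest buffered state of each object is exactly what the DB would hold after applying every checkpoint individually.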

Test plan

This change is primarily tested by the sum_obj_types pipeline introduced in the next change.

Stack


Release notes

Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.

For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.

  • Protocol:
  • Nodes (Validators and Full nodes):
  • Indexer:
  • JSON-RPC:
  • GraphQL:
  • CLI:
  • Rust SDK:
  • REST API:

@amnn amnn self-assigned this Oct 28, 2024
