indexer-alt: sequential pipeline #20053

Open: wants to merge 2 commits into base: amnn/idx-obo

Commits on Oct 28, 2024

  1. indexer-alt: sequential pipeline

    ## Description
    
    Introduce a new kind of pipeline for indexing that needs to commit data in
    checkpoint order. This will be used for indexing data that would
    previously have gone into `objects` or `objects_snapshot`, where rows
    are modified in place, and so can't be committed out-of-order.
    
    Sequential pipelines are split into two parts:
    
    - A `processor` which is shared with the existing concurrent pipeline,
      and is responsible for turning checkpoint data into values to be sent
      to the database.
    - A `committer` which is responsible for batching up prefixes of updates
      and sending them to the DB when they are complete (no gaps between the
      last write and what has been buffered).
    
    Additional properties:
    
    - Although the committer must write out rows in order, it can buffer
      the results of checkpoints processed out-of-order.
    - It uses the ingestion service's regulator for back-pressure: The
      ingestion service is only allowed to run ahead of all sequential
      pipelines by its buffer size, which bounds the memory that each
      pipeline must use to buffer pending writes.
    - Sequential pipelines have different tuning parameters:
      - `MIN_BATCH_ROWS`: The threshold for eagerly writing to the DB.
      - `MAX_BATCH_CHECKPOINTS`: The maximum number of checkpoints that will
        be batched together in a single transaction.
    - They guarantee atomicity using DB transactions: All the writes for a
      single checkpoint, and the corresponding watermark update, are put into
      the same DB transaction.
    - They support simplifying/merging writes to the DB: If the same object
      is modified multiple times across multiple checkpoints, only the
      latest write will make it to the DB.
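
The committer behaviour described above can be sketched in a few lines. This is a minimal in-memory illustration, not the code in this PR: the `Committer` type, its `buffer` and `next_batch` methods, the `Row` shape, the `force` flag (standing in for a periodic flush), and the concrete values of the two tuning parameters are all assumptions made for the example. It shows buffering out-of-order checkpoints, releasing only a gap-free prefix, capping a batch by checkpoint count, and keeping only the latest write per object.

```rust
use std::collections::BTreeMap;

// Hypothetical values: the description names these parameters but not their settings.
const MIN_BATCH_ROWS: usize = 3;
const MAX_BATCH_CHECKPOINTS: u64 = 2;

/// One pending write: (object id, new value). The shape is an assumption.
type Row = (u32, &'static str);

/// Buffers processed checkpoints and releases them only as a gap-free prefix.
struct Committer {
    /// Next checkpoint that must be written (one past the last write).
    next: u64,
    /// Rows waiting to be written, keyed by checkpoint sequence number.
    pending: BTreeMap<u64, Vec<Row>>,
}

impl Committer {
    fn new(first_checkpoint: u64) -> Self {
        Committer { next: first_checkpoint, pending: BTreeMap::new() }
    }

    /// Accept the processed rows for a checkpoint, in any arrival order.
    fn buffer(&mut self, checkpoint: u64, rows: Vec<Row>) {
        self.pending.insert(checkpoint, rows);
    }

    /// Return the merged rows and the new watermark for the next batch, if any.
    /// Without `force`, a batch is released only once it holds MIN_BATCH_ROWS rows.
    fn next_batch(&mut self, force: bool) -> Option<(Vec<Row>, u64)> {
        // Walk the contiguous prefix, stopping at the first gap or at the cap.
        let mut end = self.next;
        let mut row_count = 0;
        while end - self.next < MAX_BATCH_CHECKPOINTS {
            match self.pending.get(&end) {
                Some(rows) => {
                    row_count += rows.len();
                    end += 1;
                }
                None => break,
            }
        }
        if end == self.next || (!force && row_count < MIN_BATCH_ROWS) {
            return None;
        }
        // Apply checkpoints in order so a later write to an object replaces an earlier one.
        let mut merged: BTreeMap<u32, &'static str> = BTreeMap::new();
        for checkpoint in self.next..end {
            for (object_id, value) in self.pending.remove(&checkpoint).unwrap() {
                merged.insert(object_id, value);
            }
        }
        self.next = end;
        Some((merged.into_iter().collect(), end - 1))
    }
}
```

In a real pipeline the rows and the watermark returned by `next_batch` would be written in one DB transaction; the sketch stops short of any database access. Buffering checkpoint 1 before checkpoint 0 yields no batch until 0 arrives, at which point both are released together with a watermark of 1.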
    
    ## Test plan
    
    This change is primarily tested by the `sum_obj_types` pipeline
    introduced in the next change.
    amnn committed Oct 28, 2024
    8118b59
  2. fixup: correct doc comment

    amnn committed Oct 28, 2024
    93476f9