[Access] Add support for pebbleDB to execution data tracker/pruner #6277

Open
wants to merge 51 commits into
base: master

Contributor

@UlyanaAndrukhiv UlyanaAndrukhiv commented Jul 29, 2024

Closes: #6260

Context

In this pull request:

  • Split out ExecutionDataTracker DB code into common ExecutionDataTracker interface.
  • Refactored badger implementation.
  • Added the pebble version of the storage object.
  • Added functional and integration tests for pebble version of execution data pruning.


codecov-commenter commented Jul 29, 2024

Codecov Report

Attention: Patch coverage is 39.39394% with 340 lines in your changes missing coverage. Please review.

Project coverage is 41.40%. Comparing base (9653906) to head (3362d0b).

Files with missing lines | Patch % | Lines
storage/badger/execution_data_tracker.go | 65.43% | 35 Missing and 21 partials ⚠️
storage/pebble/execution_data_tracker.go | 63.30% | 31 Missing and 20 partials ⚠️
storage/pebble/operation/execution_data_tracker.go | 0.00% | 30 Missing ⚠️
storage/badger/operation/execution_data_tracker.go | 0.00% | 27 Missing ⚠️
storage/pebble/operation/common.go | 36.58% | 25 Missing and 1 partial ⚠️
storage/mock/track_blobs_fn.go | 0.00% | 21 Missing ⚠️
cmd/access/node_builder/access_node_builder.go | 0.00% | 16 Missing ⚠️
cmd/observer/node_builder/observer_builder.go | 0.00% | 16 Missing ⚠️
storage/execution_data_tracker.go | 37.50% | 13 Missing and 2 partials ⚠️
storage/mock/prune_callback.go | 0.00% | 15 Missing ⚠️
... and 11 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6277      +/-   ##
==========================================
- Coverage   41.42%   41.40%   -0.03%     
==========================================
  Files        2024     2032       +8     
  Lines      144439   144742     +303     
==========================================
+ Hits        59839    59928      +89     
- Misses      78403    78624     +221     
+ Partials     6197     6190       -7     
Flag | Coverage Δ
unittests | 41.40% <39.39%> (-0.03%) ⬇️


@UlyanaAndrukhiv UlyanaAndrukhiv marked this pull request as ready for review August 6, 2024 08:38
Collaborator

@Guitarheroua Guitarheroua left a comment

Looks good!

@peterargue
Contributor

@zhangchiqing since you've been working closely with pebble/badger lately, can you help review this one? It refactors the execution data pruner to support both badger and pebble.

@zhangchiqing
Member

Great work! @UlyanaAndrukhiv

Since we're adding both Pebble and Badger implementations for the execution data tracker, this PR has become quite extensive. I suggest we start with the Badger solution first, focusing on the implementation for the tracker and the pruning logic.

This PR also references the Pebble implementation I created in this pull request. However, that PR is still under review, and there's a high likelihood that the Pebble implementation will need refactoring. The patterns we're referring to here might become outdated, so it's better to wait and postpone the Pebble implementation for now.

We could initially implement the pruner in Badger, refactoring it to use Badger batch updates instead of transactions. Since Pebble doesn't support transactions, using Badger batch updates could make it easier for us to switch to the Pebble implementation later.

Design

We might need to revisit the design of the Tracker and Pruner. The original tracker and pruner were implemented a long time ago, and they face several challenges if we switch to Badger batch updates. We should take a step back and reconsider the design first.

For example:
For each height, we index a list of CIDs as execution data. CIDs are nested: a CID can have multiple children, and a child CID can have multiple different parent CIDs. When we prune a height, we can't simply remove all the CIDs and child CIDs indexed by that height, because some of them might still be referenced by other CIDs at a higher height. This is challenging because it requires extra information to determine whether a CID or a child CID is prunable.

We tried to address this by keeping track of the highest indexed height for each CID (RetrieveTrackerLatestHeight / UpsertTrackerLatestHeight), but this introduces complexity, and I'm unsure if it's concurrency-safe without database transactions. Is it possible to eliminate the extra index to simplify things? If we keep the LatestHeight index for each CID, we need to be cautious about dirty writes that might corrupt data. For instance, while we're pruning a CID, we might also be concurrently indexing a new height with a certain CID referring to the deleted CID, which could corrupt the newly indexed data. We probably don't want to solve this problem by blocking indexing with a lock during pruning, because pruning might take a long time.

Therefore, I think we need to address these challenges before proceeding with the implementation.
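
To make the latest-height index discussed above concrete, here is a minimal single-threaded sketch in Go. It is not the code from this PR: the `tracker` type, its in-memory maps, and the use of plain strings as CIDs are assumptions made for illustration. It only shows why a CID shared between heights survives pruning of the lower height, and it deliberately ignores the concurrency concern raised above.

```go
package main

import (
	"fmt"
	"sort"
)

// tracker is a hypothetical in-memory stand-in for the tracker storage.
type tracker struct {
	byHeight     map[uint64][]string // height -> CIDs indexed at that height
	latestHeight map[string]uint64   // CID -> highest height that references it
}

func newTracker() *tracker {
	return &tracker{
		byHeight:     map[uint64][]string{},
		latestHeight: map[string]uint64{},
	}
}

// track records that the given CIDs are referenced at the given height and
// raises the latest-height entry of each CID if needed.
func (t *tracker) track(height uint64, cids ...string) {
	t.byHeight[height] = append(t.byHeight[height], cids...)
	for _, c := range cids {
		if cur, ok := t.latestHeight[c]; !ok || height > cur {
			t.latestHeight[c] = height
		}
	}
}

// prune drops all height records at or below the given height and returns,
// in sorted order, the CIDs that are no longer referenced by a higher height.
func (t *tracker) prune(height uint64) []string {
	var prunable []string
	for h, cids := range t.byHeight {
		if h > height {
			continue
		}
		for _, c := range cids {
			// A CID is prunable only if no higher height refers to it.
			if latest, ok := t.latestHeight[c]; ok && latest <= height {
				delete(t.latestHeight, c)
				prunable = append(prunable, c)
			}
		}
		delete(t.byHeight, h)
	}
	sort.Strings(prunable)
	return prunable
}

func main() {
	t := newTracker()
	t.track(1, "a", "b")
	t.track(2, "b", "c") // "b" is shared between heights 1 and 2

	fmt.Println(t.prune(1)) // [a]
	fmt.Println(t.prune(2)) // [b c]
}
```

Without database transactions, the read of `latestHeight` in `prune` and the write in `track` can interleave, which is exactly the dirty-write risk described in the comment.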

Member

@zhangchiqing zhangchiqing left a comment

Just realized I forgot to submit my review, and my comments were still pending.

return builder.ExecutionDataBlobstore.DeleteBlob(context.TODO(), c)
}),
)
if executionDataDBMode == execution_data.ExecutionDataDBModeBadger {
Member

Can we implement a CheckExistingExecutionDataDBMode(executionDataDBMode, trackerDir) function or something similar to check if the folder has consistent data with the DB mode?

This could prevent us from accidentally opening existing badger DB data as pebble, which might corrupt the database.
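
One possible shape for such a check, sketched in Go under stated assumptions: the mode is passed as a plain string, and the check relies on a marker file (`DB_MODE`, a hypothetical name that neither badger nor pebble writes) stored in the tracker directory on first start. Inspecting the database files themselves would be an alternative, but this sketch does not attempt that.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// markerFile is a hypothetical file name used only by this sketch.
const markerFile = "DB_MODE"

// CheckExistingExecutionDataDBMode returns an error if trackerDir was
// previously initialized with a different DB mode. On first use it records
// the requested mode in the marker file.
func CheckExistingExecutionDataDBMode(mode, trackerDir string) error {
	if err := os.MkdirAll(trackerDir, 0o700); err != nil {
		return fmt.Errorf("could not create tracker dir: %w", err)
	}

	path := filepath.Join(trackerDir, markerFile)
	data, err := os.ReadFile(path)
	if errors.Is(err, os.ErrNotExist) {
		// No marker yet: treat the directory as new and record the mode.
		return os.WriteFile(path, []byte(mode), 0o600)
	}
	if err != nil {
		return fmt.Errorf("could not read DB mode marker: %w", err)
	}

	if existing := strings.TrimSpace(string(data)); existing != mode {
		return fmt.Errorf("tracker dir %s was initialized in %q mode, but %q was requested",
			trackerDir, existing, mode)
	}
	return nil
}

func main() {
	dir, err := os.MkdirTemp("", "tracker")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	fmt.Println(CheckExistingExecutionDataDBMode("badger", dir)) // first start, marker written
	fmt.Println(CheckExistingExecutionDataDBMode("badger", dir)) // same mode, no error
	fmt.Println(CheckExistingExecutionDataDBMode("pebble", dir)) // mode mismatch, error
}
```

A limitation of this approach is that a directory populated before the marker existed looks like a fresh one, so a real implementation would need a migration step or a fallback that inspects the existing files.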

//
// No errors are expected during normal operation.
func (s *ExecutionDataTracker) trackBlobs(blockHeight uint64, cids ...cid.Cid) error {
cidsPerBatch := s.batchItemLimit(storage.CidsPerBatch, 2, storage.BlobRecordKeyLength+storage.LatestHeightKeyLength+8)
Member

It seems this never changes, so could it be calculated once during initialization?

break
}

dInfo := &storage.DeleteInfo{
Member

Better not to create this until it's actually needed.

return err
}

if err := s.db.View(func(txn *badger.Txn) error {
Member

I think View is a read-only op, but pruning is a write op

// - c: The CID of the blob to be tracked.
//
// No errors are expected during normal operation.
func (s *ExecutionDataTracker) trackBlob(tx *badger.Txn, blockHeight uint64, c cid.Cid) error {
Member

Can we implement this without a badger transaction, using batch updates instead? That would make it easy to switch to the pebble implementation, which also uses batch updates.
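
A rough sketch of what tracking against a batch could look like, in Go. Everything here is an assumption for illustration: the `Batch` interface, the in-memory `memDB`, the prefix byte values, and the key layout are not taken from this PR or from the badger or pebble APIs. The point is only that `trackBlobs` depends on a small write-only interface that either backend could satisfy.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Batch is a hypothetical write-only interface. Writes become visible only
// after Commit.
type Batch interface {
	Set(key, value []byte)
	Delete(key []byte)
	Commit() error
}

// memDB is an in-memory stand-in for a key-value store.
type memDB struct{ data map[string][]byte }

type op struct {
	key   string
	value []byte
	del   bool
}

type memBatch struct {
	db  *memDB
	ops []op
}

func (db *memDB) NewBatch() Batch { return &memBatch{db: db} }

func (b *memBatch) Set(key, value []byte) {
	// Copy the value so later changes by the caller do not affect the batch.
	b.ops = append(b.ops, op{key: string(key), value: append([]byte(nil), value...)})
}

func (b *memBatch) Delete(key []byte) {
	b.ops = append(b.ops, op{key: string(key), del: true})
}

// Commit applies all buffered operations in order.
func (b *memBatch) Commit() error {
	for _, o := range b.ops {
		if o.del {
			delete(b.db.data, o.key)
		} else {
			b.db.data[o.key] = o.value
		}
	}
	b.ops = nil
	return nil
}

// Prefix values are assumptions made for this sketch.
const (
	prefixBlobRecord   byte = 1
	prefixLatestHeight byte = 2
)

// blobRecordKey builds prefix + big-endian height + CID.
func blobRecordKey(height uint64, cid string) []byte {
	key := make([]byte, 9, 9+len(cid))
	key[0] = prefixBlobRecord
	binary.BigEndian.PutUint64(key[1:], height)
	return append(key, cid...)
}

// latestHeightKey builds prefix + CID.
func latestHeightKey(cid string) []byte {
	return append([]byte{prefixLatestHeight}, cid...)
}

// trackBlobs buffers a blob record and a latest-height record per CID.
// It assumes heights are tracked in increasing order, so the latest height
// is overwritten without reading the previous value.
func trackBlobs(b Batch, height uint64, cids ...string) {
	value := make([]byte, 8)
	binary.BigEndian.PutUint64(value, height)
	for _, c := range cids {
		b.Set(blobRecordKey(height, c), nil)
		b.Set(latestHeightKey(c), value)
	}
}

func main() {
	db := &memDB{data: map[string][]byte{}}
	b := db.NewBatch()
	trackBlobs(b, 10, "cidA", "cidB")
	fmt.Println(len(db.data)) // 0: nothing is visible before Commit
	if err := b.Commit(); err != nil {
		panic(err)
	}
	fmt.Println(len(db.data)) // 4: two records per CID
}
```

Because a batch cannot read its own pending writes, any upsert that must compare against the stored latest height needs a separate read, which is where the concurrency questions from the design discussion come back in.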


// iterate over blob records, calling pruneCallback for any CIDs that should be pruned
// and cleaning up the corresponding tracker records
for it.Seek(blobRecordPrefix); it.ValidForPrefix(blobRecordPrefix); it.Next() {
Member

Can we reuse badger's traverse function to implement this?

It's better to abstract the low-level database operations; that would make it easier to switch to pebble.
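
As an illustration of that kind of abstraction, the following Go sketch puts prefix iteration behind a small interface so the pruning logic never touches a database iterator directly. The `Traverser` interface, the `memStore` implementation, and the key layout (prefix, then 8-byte big-endian height, then CID) are hypothetical; this is not the signature of the existing traverse helper.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"sort"
)

// Traverser is a hypothetical abstraction over prefix iteration. fn is called
// for each key with the prefix in ascending order; returning false stops early.
type Traverser interface {
	Traverse(prefix []byte, fn func(key []byte) (bool, error)) error
}

// memStore keeps its keys sorted, like an ordered key-value store would.
type memStore struct{ keys [][]byte }

// Put inserts a copy of key at its sorted position, ignoring duplicates.
func (s *memStore) Put(key []byte) {
	i := sort.Search(len(s.keys), func(i int) bool { return bytes.Compare(s.keys[i], key) >= 0 })
	if i < len(s.keys) && bytes.Equal(s.keys[i], key) {
		return
	}
	s.keys = append(s.keys, nil)
	copy(s.keys[i+1:], s.keys[i:])
	s.keys[i] = append([]byte(nil), key...)
}

func (s *memStore) Traverse(prefix []byte, fn func(key []byte) (bool, error)) error {
	// Seek to the first key >= prefix, then walk while the prefix matches.
	start := sort.Search(len(s.keys), func(i int) bool { return bytes.Compare(s.keys[i], prefix) >= 0 })
	for _, k := range s.keys[start:] {
		if !bytes.HasPrefix(k, prefix) {
			break
		}
		cont, err := fn(k)
		if err != nil {
			return err
		}
		if !cont {
			break
		}
	}
	return nil
}

// recordKey builds prefix + big-endian height + CID, so that keys sort by height.
func recordKey(prefix []byte, height uint64, cid string) []byte {
	key := make([]byte, len(prefix)+8, len(prefix)+8+len(cid))
	copy(key, prefix)
	binary.BigEndian.PutUint64(key[len(prefix):], height)
	return append(key, cid...)
}

// collectPrunable returns the CIDs of all records at or below pruneHeight.
func collectPrunable(s Traverser, prefix []byte, pruneHeight uint64) ([]string, error) {
	var cids []string
	err := s.Traverse(prefix, func(key []byte) (bool, error) {
		rest := key[len(prefix):]
		if len(rest) < 8 {
			return false, fmt.Errorf("malformed key %x", key)
		}
		if binary.BigEndian.Uint64(rest[:8]) > pruneHeight {
			return false, nil // keys are height-ordered, so we can stop here
		}
		cids = append(cids, string(rest[8:]))
		return true, nil
	})
	return cids, err
}

func main() {
	s := &memStore{}
	prefix := []byte{1}
	s.Put(recordKey(prefix, 9, "z"))
	s.Put(recordKey(prefix, 5, "x"))
	s.Put(recordKey(prefix, 7, "y"))

	cids, err := collectPrunable(s, prefix, 7)
	fmt.Println(cids, err) // [x y] <nil>
}
```

With this split, a badger-backed and a pebble-backed `Traverser` could share `collectPrunable` unchanged.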

return nil
}

err = operation.UpsertTrackerLatestHeight(c, blockHeight)(tx)
Member

Why do we need to keep track of the latest height of each cid?

Development

Successfully merging this pull request may close these issues.

[Access] Add support for pebbleDB to execution data tracker/pruner
5 participants