
Initiate S3 Multi-part upload on receiving first event #318

Open · wants to merge 2 commits into base: main

Conversation

aindriu-aiven

This update initiates the multipart upload as soon as the first record is received, and closes the file on flush.

This PR does

  • Initiate a multipart upload on retrieval of the first event, allowing the sink to start writing to S3 sooner (see the sketch after this list).

This PR does not

  • Clear files already written to the multipart upload from the recordGrouper until flush is called.
    • This will be part of a separate PR, as it requires an update to the common OutputWriter interface and to the multiple implementations of that interface.
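For illustration, a minimal sketch of the initiate-on-first-event pattern described above, written against the AWS SDK for Java v1 multipart API. The class and member names are hypothetical, not the connector's actual S3OutputStream internals, and a production version would also need to buffer each part up to S3's 5 MB minimum part size (final part excepted):

// Hypothetical sketch only -- illustrative names, not the connector's code.
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.CompleteMultipartUploadRequest;
import com.amazonaws.services.s3.model.InitiateMultipartUploadRequest;
import com.amazonaws.services.s3.model.PartETag;
import com.amazonaws.services.s3.model.UploadPartRequest;

import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;

class LazyMultipartWriter {
    private final AmazonS3 s3;
    private final String bucket;
    private final String key;
    private final List<PartETag> partETags = new ArrayList<>();
    private String uploadId; // null until the first event arrives
    private int partNumber = 1;

    LazyMultipartWriter(final AmazonS3 s3, final String bucket, final String key) {
        this.s3 = s3;
        this.bucket = bucket;
        this.key = key;
    }

    void write(final byte[] part) {
        if (uploadId == null) {
            // Initiate the multipart upload on the first event, not on flush.
            uploadId = s3.initiateMultipartUpload(
                    new InitiateMultipartUploadRequest(bucket, key)).getUploadId();
        }
        // Offload the part immediately instead of buffering it as records.
        // (Real code must accumulate to S3's 5 MB minimum per non-final part.)
        partETags.add(s3.uploadPart(new UploadPartRequest()
                .withBucketName(bucket)
                .withKey(key)
                .withUploadId(uploadId)
                .withPartNumber(partNumber++)
                .withInputStream(new ByteArrayInputStream(part))
                .withPartSize(part.length)).getPartETag());
    }

    void flush() {
        if (uploadId != null) {
            // Closing the file on flush completes the multipart upload.
            s3.completeMultipartUpload(new CompleteMultipartUploadRequest(
                    bucket, key, uploadId, partETags));
            uploadId = null;
            partETags.clear();
            partNumber = 1;
        }
    }
}

The point of the lazy uploadId is that nothing is created in S3 until the first event arrives, but from then on parts can be offloaded as they fill rather than being held until flush.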

…, and closes the file on flush.

Signed-off-by: Aindriu Lavelle <[email protected]>
…allowing changelog records to initiate multipart upload.

Signed-off-by: Aindriu Lavelle <[email protected]>
@aindriu-aiven aindriu-aiven requested review from a team as code owners October 24, 2024 12:37
@aindriu-aiven aindriu-aiven force-pushed the aindriu-aiven/initiate-multi-part-upload branch from b6cebcc to 4776d6d on October 24, 2024 13:16

assertThat(expectedBlobs).allMatch(testBucketAccessor::doesObjectExist);

assertThat(testBucketAccessor.readLines("prefix-topic0-0-00000000000000000012", compression))
aindriu-aiven (Author) commented:

As an FYI, the S3MockApi does not create the file names correctly for key and value.

@gharris1727 (Contributor) left a comment

Wow, the S3OutputStream has had multipart upload for a long time: Aiven-Open/s3-connector-for-apache-kafka#73

But we were still buffering data as records, rather than offloading them early? Crazy. Thanks for the improvement.

Comment on lines +99 to +100
* This determines whether the file is key-based, where a single file may be changed multiple times per flush, or a
* roll-over file, which is reset at each flush.
Can you explain more about this? What is key based grouping, and why does it mutate the file?
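As a reading aid, a hedged sketch of the distinction the javadoc describes, using hypothetical names rather than the connector's actual API; the offset-padded format mirrors the prefix-topic0-0-00000000000000000012 blob name in the test above:

// Hypothetical illustration of the two grouping modes -- not the connector's API.
enum FileGrouping { KEY_BASED, ROLL_OVER }

static String fileNameFor(final FileGrouping grouping, final String topic,
        final int partition, final String recordKey, final long startOffset) {
    switch (grouping) {
    case KEY_BASED:
        // Key-based: the same key always maps to the same file name, so a
        // single file can be changed multiple times within one flush.
        return topic + "-" + recordKey;
    case ROLL_OVER:
    default:
        // Roll-over: one file per flush window, reset at each flush and
        // named by its starting offset (20-digit zero-padded).
        return String.format("%s-%d-%020d", topic, partition, startOffset);
    }
}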
