
Multipart optimizations #3049

Open
1 of 2 tasks
codyohl opened this issue Nov 8, 2023 · 1 comment
Labels
feature-request This issue requests a feature. p2 This is a standard priority issue s3transfer

Comments

@codyohl

codyohl commented Nov 8, 2023

  1. Currently S3Transfer only offers a multithreaded option for multipart upload/download (when use_threads=True is passed). I've found that a multiprocessing approach (where a new session/client is created in each forked process, and a shared memory buffer is referenced across the process boundary) can be 2x faster.

  2. If there isn't already a way, there should be one to pass down the HTTPConnection write buffer size (see blocksize here), so Python threads don't spend a lot of time repeatedly writing small amounts of data and then waiting on the GIL. This yields another 2x speed increase for me, even after applying optimization 1. You can do it outside of S3Transfer in a really hacky way like this:

from http.client import HTTPConnection

# Replace the default blocksize (8192 bytes) in HTTPConnection.__init__'s
# defaults tuple with 1 MiB. This process-wide patch took the S3Transfer
# library from 50MB/s to 100MB/s in a toy environment.
HTTPConnection.__init__.__defaults__ = tuple(
    x if x != 8192 else 1024 * 1024 for x in HTTPConnection.__init__.__defaults__
)
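A minimal, self-contained sketch of the patch in action. The host name is a placeholder and no socket is opened, since HTTPConnection only stores the target until a request is issued; any boto3 client created after the patch would pick up the larger buffer, because botocore's HTTP stack ultimately sits on http.client:

```python
from http.client import HTTPConnection

# Apply the same process-wide patch: swap the 8192-byte default blocksize
# in HTTPConnection.__init__'s defaults tuple for 1 MiB.
HTTPConnection.__init__.__defaults__ = tuple(
    x if x != 8192 else 1024 * 1024 for x in HTTPConnection.__init__.__defaults__
)

# Constructing a connection does not connect; "example.com" is a placeholder.
conn = HTTPConnection("example.com")
print(conn.blocksize)  # 1048576 once the patch is applied
```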

Use Case

S3 multipart upload/download is slower than it needs to be for all users.

Proposed Solution

  1. add a multiprocess implementation of the same interface (it only needs an option for how to obtain a session/client in each new process)
    -- use multiprocessing.Queue and start processes just-in-time
    -- pass SharedMemory across processes to avoid copying each chunk
    -- instantiate a new session and client in each subprocess, and upload parts concurrently

  2. provide some way for users to set the underlying HTTPConnection's buffer size
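A rough sketch of what item 1 could look like. The names, the 1 MiB part size, and the simulated upload are hypothetical; a real implementation would create a boto3 client inside each worker and call upload_part there, which is omitted so the sketch stays self-contained. It also starts all workers at once rather than just-in-time:

```python
import multiprocessing as mp
from multiprocessing import shared_memory

PART_SIZE = 1024 * 1024  # hypothetical 1 MiB part size for the sketch

def _upload_part(shm_name, offset, length, part_number, results):
    # A real worker would build its own session/client here (clients are
    # not safely shareable across forks), e.g.:
    #   client = boto3.session.Session().client("s3")
    #   client.upload_part(..., PartNumber=part_number, Body=part_bytes)
    shm = shared_memory.SharedMemory(name=shm_name)
    try:
        part_bytes = bytes(shm.buf[offset:offset + length])
        results.put((part_number, len(part_bytes)))  # stand-in for the upload
    finally:
        shm.close()

def multipart_upload_sketch(data):
    # fork start method so workers see this module's functions directly
    ctx = mp.get_context("fork")
    shm = shared_memory.SharedMemory(create=True, size=len(data))
    try:
        shm.buf[:len(data)] = data  # one copy in; workers read it in place
        results = ctx.Queue()
        procs = []
        for i, offset in enumerate(range(0, len(data), PART_SIZE), start=1):
            length = min(PART_SIZE, len(data) - offset)
            p = ctx.Process(target=_upload_part,
                            args=(shm.name, offset, length, i, results))
            p.start()
            procs.append(p)
        # drain results before joining so workers never block on a full queue
        parts = sorted(results.get() for _ in procs)
        for p in procs:
            p.join()
        return parts
    finally:
        shm.close()
        shm.unlink()
```

The key design point mirrors the proposal: the payload crosses the process boundary by name (SharedMemory) rather than by value, so each part is read from the shared buffer instead of being pickled through a pipe.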

Other Information

No response

Acknowledgements

  • I may be able to implement this feature request
  • This feature might incur a breaking change

SDK version used

boto3 1.28.52, s3transfer 0.6.0

Environment details (OS name and version, etc.)

linux 5.19.0-0

@codyohl codyohl added feature-request This issue requests a feature. needs-triage This issue or PR still needs to be triaged. labels Nov 8, 2023
@tim-finnigan
Contributor

Thank you for the feature request and suggestions. We can keep this open for tracking, further review and discussion.

@tim-finnigan tim-finnigan added s3transfer p2 This is a standard priority issue and removed needs-triage This issue or PR still needs to be triaged. labels Nov 15, 2023