zstdMaxBufferedEncoders value is too low which caused frequent zstd encoder object creation #2965
Comments
A sample stack trace of a goroutine being put into GC Assist Wait when its memory allocation credit is negative:
The fix is actually quite simple: allow zstdMaxBufferedEncoders to be specified through a Config param.
Yes, this was pinned to 1 as part of #2375, which aimed to control memory usage because the prior default behaviour had been to allocate one encoder per goroutine. The aim was to move to a shared pool, but only keep a single idle spare. It does sound like it would be useful to trade off static memory usage against GC pressure by keeping a larger pool of re-usable encoders ready for use.
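To make the tradeoff concrete, here is a minimal sketch of the kind of channel-backed encoder pool being discussed (a simplified illustration, not Sarama's actual implementation; the `poolSize` constant stands in for the hard-coded zstdMaxBufferedEncoders):

```go
package encpool

import "github.com/klauspost/compress/zstd"

// poolSize is the number of idle encoders kept around between uses.
// Sarama currently hard-codes the equivalent constant to 1.
const poolSize = 1

var idleEncoders = make(chan *zstd.Encoder, poolSize)

func getEncoder() *zstd.Encoder {
	select {
	case enc := <-idleEncoders:
		return enc // reuse an idle encoder
	default:
		// No idle encoder available: allocate a new one (each is large,
		// on the order of hundreds of KB).
		enc, _ := zstd.NewWriter(nil, zstd.WithEncoderLevel(zstd.SpeedDefault))
		return enc
	}
}

func putEncoder(enc *zstd.Encoder) {
	select {
	case idleEncoders <- enc:
		// kept as an idle spare for the next caller
	default:
		// Buffer full: the encoder is dropped and must be garbage collected.
	}
}
```

With the buffer capped at 1, any burst of concurrent compressions falls into the `default` branch of `getEncoder` and allocates a fresh encoder that is later dropped in `putEncoder`, which is exactly the churn described in this issue; a larger buffer keeps more memory pinned but avoids those allocations.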
@dnwe I think we should allow the user to specify the pool size. By the way, I measured only about 300KB for each zstd encoder object (when using CompressionLevel: SpeedDefault), based on the github.com/klauspost/compress v1.17.4 code.
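A rough way to reproduce that per-encoder measurement (a sketch; the exact figure will vary with the compress version, encoder options, and GC timing):

```go
package main

import (
	"fmt"
	"runtime"

	"github.com/klauspost/compress/zstd"
)

func main() {
	var before, after runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&before)

	// Create a single encoder at the default speed level.
	enc, err := zstd.NewWriter(nil, zstd.WithEncoderLevel(zstd.SpeedDefault))
	if err != nil {
		panic(err)
	}

	runtime.ReadMemStats(&after)
	fmt.Printf("approx heap per encoder: %d KB\n", (after.HeapAlloc-before.HeapAlloc)/1024)
	_ = enc.Close()
}
```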
…pressionEncoders a configurable parameter

This PR is addressing issue: IBM#2965

For Zstd compression, the pool size of encoders (controlled by the MaxBufferedEncoders param) is currently hard-coded to 1, which causes an excessive number of large encoder objects to be created under high concurrency and leads to GC performance issues. Introduce a new config parameter, MaxBufferedCompressionEncoders (default value 1), and modify the calling paths to pass this parameter through message/record_batch and down to compressor/encoder.

Signed-off-by: Henry Cai <[email protected]>
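Under the change proposed in that PR, usage would look roughly like the sketch below. The MaxBufferedCompressionEncoders name comes from the PR description; its placement under `cfg.Producer` is an assumption, and the field is not part of any released Sarama API, so this will not compile against current releases.

```go
package main

import (
	"log"

	"github.com/IBM/sarama"
)

func newProducer(brokers []string) (sarama.SyncProducer, error) {
	cfg := sarama.NewConfig()
	cfg.Producer.Compression = sarama.CompressionZSTD
	cfg.Producer.Return.Successes = true // required by SyncProducer
	// Proposed knob from the PR (default 1): keep more idle zstd encoders
	// around to avoid re-allocating ~300KB objects under high concurrency.
	// Field name and placement follow the PR description; not released API.
	cfg.Producer.MaxBufferedCompressionEncoders = 16
	return sarama.NewSyncProducer(brokers, cfg)
}

func main() {
	if _, err := newProducer([]string{"localhost:9092"}); err != nil {
		log.Fatal(err)
	}
}
```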
I got nerd sniped by this a bit... I have an experimental zstd encoder pool (read: don't use it without fully understanding the code). An initial (probably flawed) pseudo-benchmark showed roughly a 20% CPU improvement under high goroutine counts, even with encoder reuse (removing the limit on max buffered encoders means 100% reuse, to be verified), in addition to better memory efficiency. Most systems should see far less of an effect, but that's normal for benchmarks that exercise only a subset of the system. I want to turn this into a proper benchmark though. I don't want to suggest this as an alternative right now (it's highly experimental and has quite some complexity), but I would really like to know the settings of the original application, as there is a good chance that the absolute maximum number of encoders has a higher impact than the absolute minimum, especially with a high goroutines/GOMAXPROCS ratio, which seems to be symptomatic of the original issue. There might be an easier way to just cap the maximum outstanding (handed out) encoders; I haven't investigated this yet though.
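One low-complexity way to cap the outstanding (handed-out) encoders would be a semaphore in front of the pool; a sketch under the assumption that GOMAXPROCS is the desired bound (independent of, and much simpler than, the experimental pool linked above):

```go
package encsem

import (
	"runtime"

	"github.com/klauspost/compress/zstd"
)

var (
	// tokens bounds how many encoders can be in use at once.
	tokens = make(chan struct{}, runtime.GOMAXPROCS(0))
	// idle holds encoders waiting to be reused.
	idle = make(chan *zstd.Encoder, runtime.GOMAXPROCS(0))
)

func acquireEncoder() *zstd.Encoder {
	tokens <- struct{}{} // blocks once GOMAXPROCS encoders are handed out
	select {
	case enc := <-idle:
		return enc
	default:
		enc, _ := zstd.NewWriter(nil, zstd.WithEncoderLevel(zstd.SpeedDefault))
		return enc
	}
}

func releaseEncoder(enc *zstd.Encoder) {
	select {
	case idle <- enc:
	default:
		// Should not happen when the cap and the idle buffer are equal,
		// but dropping is a safe fallback.
	}
	<-tokens
}
```

Capping the in-use count bounds both the peak memory held by encoders and the allocation rate, since at most GOMAXPROCS encoders can ever exist at once.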
Ok, I managed to get a benchmark going to test different variants of pool configurations; hopefully it is somewhat informative.
Each test launches 1000 goroutines competing for encoders. Each benchmark is executed with and without alloc measurement to avoid accounting issues. The first iteration is regarded as a "pool warmup" and always discarded. Results:
The GOMAXPROCS=1 results turn out to be the same for all variants. The GOMAXPROCS=4 cases turn out to be interesting.
At high goroutine to GOMAXPROCS ratios it seems like limiting the in-use encoders to GOMAXPROCS beats raising the idle encoder count. On top of that it should conserve memory and avoid issues for high-compression use cases; there is a high memory tax for keeping many idle or in-use encoders. So I am a bit skeptical about just raising the idle encoders max, as I think limiting the maximum number of encoders in use is highly beneficial, and potentially more beneficial for more use cases. The benchmark indicates that it can beat this synthetic case by another 68%.

I would really like to see the goroutine to GOMAXPROCS ratio and how this plays out. There is a chance that either of the two directions is better for the specific setup.

That said: the benchmark shows a potential for a 10x improvement for a rather common (anti)pattern: get a batch, split it into entries, launch a goroutine per entry. This is pretty huge. And even the min encoders change alone is a 5x improvement. If I could only find a low-complexity solution that offers these benefits.....
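For reference, a rough re-creation of that kind of harness (not the linked benchmark itself): many goroutines compete for encoders and each compresses a small payload. It assumes the `getEncoder`/`putEncoder` helpers from the pool sketch earlier in this thread; the 2KB payload size and 1000-goroutine count mirror the numbers mentioned above.

```go
package encpool

import (
	"sync"
	"testing"
)

// BenchmarkEncoderContention launches many goroutines that each grab an
// encoder, compress a 2KB payload, and return the encoder to the pool.
func BenchmarkEncoderContention(b *testing.B) {
	payload := make([]byte, 2*1024)
	const goroutines = 1000

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		var wg sync.WaitGroup
		wg.Add(goroutines)
		for g := 0; g < goroutines; g++ {
			go func() {
				defer wg.Done()
				enc := getEncoder()
				_ = enc.EncodeAll(payload, nil)
				putEncoder(enc)
			}()
		}
		wg.Wait()
	}
}
```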
@rtreffer Thanks for so quickly creating the encoder pool code and the benchmark test. I think the reason that maxIdleEncoders=4's throughput is better than maxIdleEncoders=100 is that maxIdleEncoders=4 spends much less time creating encoder objects (the max is capped at 4). Instead, maxIdleEncoders=4 spends more time waiting for encoders to become available from the channel: https://github.com/rtreffer/sarama/blob/zstd-pool/zstd_pool.go#L113. In your benchmark test, the goroutine only does zstd compression on 2KB of data, which is very fast, so it quickly turns around and returns the encoder to the pool/channel. I don't remember all the details in the sarama code; that goroutine might have other work to do (e.g. sending data through a tcp connection downstream). So depending on the ratio of time the goroutine spends on its real computation work versus on pool-related activities, the pool reuse might not be that good. Also, this new pool code (especially the line waiting for an encoder from the channel: https://github.com/rtreffer/sarama/blob/zstd-pool/zstd_pool.go#L113) changes the object creation dynamics from the old static MaxBufferedEncoders=1 code, where there was no upper limit on how many encoders could be created. In the old code, I could see hundreds of those encoder objects being created all the time.
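The difference in acquisition behaviour being described, continuing the earlier pool sketch (an illustration only, not the linked code; the bounded variant assumes the channel was pre-filled with a fixed number of encoders at startup):

```go
package encpool

import "github.com/klauspost/compress/zstd"

// Old behaviour: never block. When no idle encoder is available, allocate a
// new one, so the number of live encoders is effectively unbounded.
func getUnbounded() *zstd.Encoder {
	select {
	case enc := <-idleEncoders:
		return enc
	default:
		enc, _ := zstd.NewWriter(nil, zstd.WithEncoderLevel(zstd.SpeedDefault))
		return enc
	}
}

// Bounded behaviour: block until another goroutine returns an encoder, so
// throughput now depends on how quickly callers give encoders back.
func getBounded() *zstd.Encoder {
	return <-idleEncoders
}
```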
I think I finally found an easy way to express this all in code: rtreffer@23168b5
I can try to get this into a PR form soon, ideally with a benchmark. I wouldn't want to propose this without cleaning up the benchmark.
I’m not sure you have to worry about using an |
@rtreffer Are you still moving forward with your PR (rtreffer@23168b5)? I left a comment in your PR.
I've been pretty busy over the last few days; I expect to have some time for this over the weekend.
Description
The zstdMaxBufferedEncoders value is too low, which causes many temporary zstd encoder objects to be created; each encoder object is quite big (200KB) and this adds to the woes of frequent GCs.
Versions
| Sarama | Kafka | Go |
|--------|-------|--------|
| 1.4.0  | 3.8.0 | 1.21.5 |
Configuration
Logs
Additional Context
This is related to #2964. After I did more CPU and memory profiling, I realized the frequent-GC problem is caused by excessive creation of zstd encoder objects.
This is a snippet of the top memory allocations of the program. In the pprof listing below, the top 2 allocation sites are from creating zstd encoder objects (300KB each). The next 3 allocation sites are from encoding messages (2-4KB per message); the top 2 allocation sites dominate the memory usage. With this high rate of memory allocation, goroutines were frequently put into "GC assist wait" mode once their memory allocation credit went negative.
The root cause of the frequent zstd encoder object creation is that zstdMaxBufferedEncoders is hard-coded to 1 in https://github.com/IBM/sarama/blob/main/zstd.go#L11. In a real production system where you have hundreds of brokers and producer objects, you will have hundreds of goroutines going through the compression path, and this causes frequent encoder object creation and destruction.
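For anyone wanting to reproduce that kind of listing, a generic sketch for capturing a heap profile from a Go process (the file name and the pprof invocation in the comment are just examples, not the exact commands used for the profile above):

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	f, err := os.Create("heap.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// ... run the producer workload here, then snapshot the heap ...

	// Writes heap/allocation data; inspect allocation sites with, e.g.,
	//   go tool pprof -sample_index=alloc_space -top heap.pprof
	if err := pprof.WriteHeapProfile(f); err != nil {
		log.Fatal(err)
	}
}
```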