-
Notifications
You must be signed in to change notification settings - Fork 148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[#2060] fix(server): Fix memory leaks when reaching memory limit #2058
Conversation
The error stack:
|
I think we need to first avoid the occurrence of OOM. The occurrence of OOM is either due to unreasonable memory configurations or bugs in the code. As for how the code should handle after OOM, I don't think it's very important, because the server has already malfunctioned at this point. Even if there is a memory leak, it's actually not important anymore. |
So can this PR be accepted ? Or should I just cancel this PR? @rickyma |
I'm OK with this PR. But it's meaningless. When an OOM error occurs, this PR will not help much. |
This PR is not to prevent the OOM exception, but to ensure that the pre-allocated ByteBuf can be released normally. |
You shouldn't catch OOM exception. If it throws OOM, more errors may throw. You can't recover it by just catching it. |
I don't understand what you mean. |
If it will OOM, the java process should exit. |
I need to clarify that it is not OOM but OutOfDirectMemoryError. From the stack trace, we can see that the server did not exit. |
Yeah, it is a Netty's internal OOM error. It's meaningless to catch this exception. On the other hand, this PR is harmless. So I choose to remain neutral. |
I have also encountered such a problem and fixed it with a similar method. Although I also think that we can solve this problem by increasing MaxDirectMemorySize or reducing rss.server.buffer.capacity. But it is difficult for many people to find this reasonable value. If at a certain peak usage moment, direct memory happens to exceed MaxDirectMemorySize, leaks will occur. I think it is necessary to merge this pr. |
This OOM is not java OOM. The OOM is that the direct memory used exceeds the direct memory limit of Netty. If Netty's direct memory limit is too low, it will easily trigger io.netty.util.internal.OutOfDirectMemoryError without causing java process exit. |
Is |
|
I read the code
|
Maybe we should add some comments to explain this. |
Sorry, I write it wrongly. Below is the netty source code. If we set DIRECT_MEMORY_LIMIT to not a big number, will throw OutOfDirectMemoryError. But the jvm process will not exit, but the ByteBuf will be leaked.
|
Maybe it's ok to add comments to merge this pull request. |
} catch (Throwable t) { | ||
if (!shuffleBlockInfoList.isEmpty()) { | ||
shuffleBlockInfoList.forEach(sbi -> sbi.getData().release()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need check empty?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check condition is redundant. I will remove the check.
If the users follow this guidance, this error should not happen, or else there exists a bug. |
Ping @jerqi @rickyma @zhengchenyu |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is harmless.
I'm fine with this. Someone else can take a look.
Could you add some comments for this code? |
Thanks for your review. I added some comments to the code. Please take the time to review again when it is convenient for you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
} catch (Throwable t) { | ||
// An OutOfDirectMemoryError will be thrown, when the direct memory reaches the limit. | ||
// OutOfDirectMemoryError does not cause the jvm exit and cause the direct memory leak. | ||
// Warning: Please set DIRECT_MEMORY_LIMIT to a reasonable value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DIRECT_MEMORY_LIMIT
does not exist? Maybe DIRECT_MEMORY_LIMIT
-> MAX_DIRECT_MEMORY_SIZE
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can also put the link of the guidance in the comment.
shuffleBlockInfoList.add(Decoders.decodeShuffleBlockInfo(byteBuf)); | ||
} catch (Throwable t) { | ||
// An OutOfDirectMemoryError will be thrown, when the direct memory reaches the limit. | ||
// OutOfDirectMemoryError does not cause the jvm exit and cause the direct memory leak. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OutOfDirectMemoryError does not cause the JVM to exit and it may lead to direct memory leaks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
@rickyma
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #2058 +/- ##
============================================
+ Coverage 52.77% 52.86% +0.09%
- Complexity 2498 3422 +924
============================================
Files 398 518 +120
Lines 18135 28359 +10224
Branches 1660 2650 +990
============================================
+ Hits 9570 14991 +5421
- Misses 7981 12400 +4419
- Partials 584 968 +384 ☔ View full report in Codecov by Sentry. |
// An OutOfDirectMemoryError will be thrown, when the direct memory reaches the limit. | ||
// OutOfDirectMemoryError will not cause the JVM to exit, but may lead to direct memory | ||
// leaks. | ||
// Warning: You can refer to docs/server_guide.md to set MAX_DIRECT_MEMORY_SIZE to a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not a warning.
Maybe: Warning: -> Note:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed it as you suggested.
What changes were proposed in this pull request?
Fix shuffle server memory leaks when reaching memory limit.
Why are the changes needed?
When enabling Netty, on the shuffle server side, Netty will allocate memory for SEND_SHUFFLE_DATA_REQUEST request. However, when the memory limit is reached, an
OutOfDirectMemoryError
will be thrown and the decoding process for this message will fail. This will cause the bytebuf allocated successfully in the previous batch in this message to not be released, resulting in memory leaks.Fix: #2060
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing UTs