Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#2060] fix(server): Fix memory leaks when reaching memory limit #2058

Merged
merged 1 commit into from
Oct 21, 2024

Conversation

lwllvyb
Copy link
Contributor

@lwllvyb lwllvyb commented Aug 19, 2024

What changes were proposed in this pull request?

Fix shuffle server memory leaks when reaching memory limit.

Why are the changes needed?

When enabling Netty, on the shuffle server side, Netty will allocate memory for SEND_SHUFFLE_DATA_REQUEST request. However, when the memory limit is reached, an OutOfDirectMemoryError will be thrown and the decoding process for this message will fail. This will cause the bytebuf allocated successfully in the previous batch in this message to not be released, resulting in memory leaks.

Fix: #2060

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing UTs

@lwllvyb
Copy link
Contributor Author

lwllvyb commented Aug 19, 2024

The error stack:

[2024-08-08 20:49:31.754] [epollEventLoopGroup-3-2] [ERROR] ShuffleServerNettyHandler - exception caught /*:38381
io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 65536 byte(s) of direct memory (used: 17179815063, max: 17179869184)
at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:843)
at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:772)
at io.netty.buffer.UnpooledUnsafeNoCleanerDirectByteBuf.allocateDirect(UnpooledUnsafeNoCleanerDirectByteBuf.java:30)
at io.netty.buffer.UnpooledByteBufAllocator$InstrumentedUnpooledUnsafeNoCleanerDirectByteBuf.allocateDirect(UnpooledByteBufAllocator.java:186)
at io.netty.buffer.UnpooledDirectByteBuf.(UnpooledDirectByteBuf.java:64) at io.netty.buffer.UnpooledUnsafeDirectByteBuf.(UnpooledUnsafeDirectByteBuf.java:41)
at io.netty.buffer.UnpooledUnsafeNoCleanerDirectByteBuf.(UnpooledUnsafeNoCleanerDirectByteBuf.java:25)
at io.netty.buffer.UnpooledByteBufAllocator$InstrumentedUnpooledUnsafeNoCleanerDirectByteBuf.(UnpooledByteBufAllocator.java:181)
at io.netty.buffer.UnpooledByteBufAllocator.newDirectBuffer(UnpooledByteBufAllocator.java:91)
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:188)
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:179)
at io.netty.channel.unix.PreferredDirectByteBufAllocator.ioBuffer(PreferredDirectByteBufAllocator.java:53)
at io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:120)
at io.netty.channel.epoll.EpollRecvByteAllocatorHandle.allocate(EpollRecvByteAllocatorHandle.java:75)
at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:786)
at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:501)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:399)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:750)

Copy link

github-actions bot commented Aug 19, 2024

Test Results

 2 926 files  +31   2 926 suites  +31   5h 57m 50s ⏱️ + 10m 45s
 1 043 tests ± 0   1 041 ✅ + 1   2 💤 ±0  0 ❌  - 1 
13 033 runs  +60  13 003 ✅ +61  30 💤 ±0  0 ❌  - 1 

Results for commit 6903a21. ± Comparison against base commit f9b4c0e.

♻️ This comment has been updated with latest results.

@lwllvyb lwllvyb changed the title fix(server): Fix memory leak when reach memory limit [#2060]fix(server): Fix memory leak when reach memory limit Aug 19, 2024
@jerqi jerqi requested a review from rickyma August 19, 2024 10:47
@rickyma
Copy link
Contributor

rickyma commented Aug 19, 2024

I think we need to first avoid the occurrence of OOM. The occurrence of OOM is either due to unreasonable memory configurations or bugs in the code. As for how the code should handle after OOM, I don't think it's very important, because the server has already malfunctioned at this point. Even if there is a memory leak, it's actually not important anymore.

@lwllvyb
Copy link
Contributor Author

lwllvyb commented Aug 28, 2024

I think we need to first avoid the occurrence of OOM. The occurrence of OOM is either due to unreasonable memory configurations or bugs in the code. As for how the code should handle after OOM, I don't think it's very important, because the server has already malfunctioned at this point. Even if there is a memory leak, it's actually not important anymore.

So can this PR be accepted ? Or should I just cancel this PR? @rickyma

@rickyma
Copy link
Contributor

rickyma commented Aug 28, 2024

I'm OK with this PR. But it's meaningless. When an OOM error occurs, this PR will not help much.

@lwllvyb
Copy link
Contributor Author

lwllvyb commented Aug 28, 2024

I'm OK with this PR. But it's meaningless. When an OOM error occurs, this PR will not help much.

This PR is not to prevent the OOM exception, but to ensure that the pre-allocated ByteBuf can be released normally.

@jerqi
Copy link
Contributor

jerqi commented Aug 28, 2024

I'm OK with this PR. But it's meaningless. When an OOM error occurs, this PR will not help much.

This PR is not to prevent the OOM exception, but to ensure that the pre-allocated ByteBuf can be released normally.

You shouldn't catch OOM exception. If it throws OOM, more errors may throw. You can't recover it by just catching it.

@lwllvyb
Copy link
Contributor Author

lwllvyb commented Aug 28, 2024

I'm OK with this PR. But it's meaningless. When an OOM error occurs, this PR will not help much.

This PR is not to prevent the OOM exception, but to ensure that the pre-allocated ByteBuf can be released normally.

You shouldn't catch OOM exception. If it throws OOM, more errors may throw. You can't recover it by just catching it.

I don't understand what you mean.
If the OOM exception happen, how we deal with the pre-allocated ByteBuf ? Reconfigure and restart the server , or other ways?

@jerqi
Copy link
Contributor

jerqi commented Aug 28, 2024

I'm OK with this PR. But it's meaningless. When an OOM error occurs, this PR will not help much.

This PR is not to prevent the OOM exception, but to ensure that the pre-allocated ByteBuf can be released normally.

You shouldn't catch OOM exception. If it throws OOM, more errors may throw. You can't recover it by just catching it.

I don't understand what you mean. If the OOM exception happen, how we deal with the pre-allocated ByteBuf ? Reconfigure and restart the server , or other ways?

If it will OOM, the java process should exit.

@lwllvyb
Copy link
Contributor Author

lwllvyb commented Aug 30, 2024

I'm OK with this PR. But it's meaningless. When an OOM error occurs, this PR will not help much.

This PR is not to prevent the OOM exception, but to ensure that the pre-allocated ByteBuf can be released normally.

You shouldn't catch OOM exception. If it throws OOM, more errors may throw. You can't recover it by just catching it.

I don't understand what you mean. If the OOM exception happen, how we deal with the pre-allocated ByteBuf ? Reconfigure and restart the server , or other ways?

If it will OOM, the java process should exit.

I need to clarify that it is not OOM but OutOfDirectMemoryError. From the stack trace, we can see that the server did not exit.

@rickyma
Copy link
Contributor

rickyma commented Aug 30, 2024

Yeah, it is a Netty's internal OOM error. It's meaningless to catch this exception. On the other hand, this PR is harmless.

So I choose to remain neutral.

@zhengchenyu
Copy link
Collaborator

I have also encountered such a problem and fixed it with a similar method. Although I also think that we can solve this problem by increasing MaxDirectMemorySize or reducing rss.server.buffer.capacity. But it is difficult for many people to find this reasonable value. If at a certain peak usage moment, direct memory happens to exceed MaxDirectMemorySize, leaks will occur. I think it is necessary to merge this pr.

@zhengchenyu
Copy link
Collaborator

I'm OK with this PR. But it's meaningless. When an OOM error occurs, this PR will not help much.

This PR is not to prevent the OOM exception, but to ensure that the pre-allocated ByteBuf can be released normally.

You shouldn't catch OOM exception. If it throws OOM, more errors may throw. You can't recover it by just catching it.

I don't understand what you mean. If the OOM exception happen, how we deal with the pre-allocated ByteBuf ? Reconfigure and restart the server , or other ways?

If it will OOM, the java process should exit.

This OOM is not java OOM. The OOM is that the direct memory used exceeds the direct memory limit of Netty. If Netty's direct memory limit is too low, it will easily trigger io.netty.util.internal.OutOfDirectMemoryError without causing java process exit.

@jerqi
Copy link
Contributor

jerqi commented Sep 24, 2024

I'm OK with this PR. But it's meaningless. When an OOM error occurs, this PR will not help much.

This PR is not to prevent the OOM exception, but to ensure that the pre-allocated ByteBuf can be released normally.

You shouldn't catch OOM exception. If it throws OOM, more errors may throw. You can't recover it by just catching it.

I don't understand what you mean. If the OOM exception happen, how we deal with the pre-allocated ByteBuf ? Reconfigure and restart the server , or other ways?

If it will OOM, the java process should exit.

This OOM is not java OOM. The OOM is that the direct memory used exceeds the direct memory limit of Netty. If Netty's direct memory limit is too low, it will easily trigger io.netty.util.internal.OutOfDirectMemoryError without causing java process exit.

Is io.netty.util.internal.OutOfDirectMemoryError a subclass of OutOfMemoryError? It's a little weird to catch an error. Because we can't recover an error usually.

@zhengchenyu
Copy link
Collaborator

I'm OK with this PR. But it's meaningless. When an OOM error occurs, this PR will not help much.

This PR is not to prevent the OOM exception, but to ensure that the pre-allocated ByteBuf can be released normally.

You shouldn't catch OOM exception. If it throws OOM, more errors may throw. You can't recover it by just catching it.

I don't understand what you mean. If the OOM exception happen, how we deal with the pre-allocated ByteBuf ? Reconfigure and restart the server , or other ways?

If it will OOM, the java process should exit.

This OOM is not java OOM. The OOM is that the direct memory used exceeds the direct memory limit of Netty. If Netty's direct memory limit is too low, it will easily trigger io.netty.util.internal.OutOfDirectMemoryError without causing java process exit.

Is io.netty.util.internal.OutOfDirectMemoryError a subclass of OutOfMemoryError? It's a little weird to catch an error. Because we can't recover an error usually.

io.netty.util.internal.OutOfDirectMemoryError is not a subclass of OutOfMemoryError. Even though it's a little weird to catch an error, but this error can not make the process exit.

@jerqi
Copy link
Contributor

jerqi commented Sep 24, 2024

o.netty.util.internal.OutOfDirectMemoryError

I read the code

package io.netty.util.internal;

import java.nio.ByteBuffer;

/**
 * {@link OutOfMemoryError} that is throws if {@link PlatformDependent#allocateDirectNoCleaner(int)} can not allocate
 * a new {@link ByteBuffer} due memory restrictions.
 */
public final class OutOfDirectMemoryError extends OutOfMemoryError {
    private static final long serialVersionUID = 4228264016184011555L;

    OutOfDirectMemoryError(String s) {
        super(s);
    }
}

@jerqi
Copy link
Contributor

jerqi commented Sep 24, 2024

Maybe we should add some comments to explain this.

@zhengchenyu
Copy link
Collaborator

Sorry, I write it wrongly. Below is the netty source code. If we set DIRECT_MEMORY_LIMIT to not a big number, will throw OutOfDirectMemoryError. But the jvm process will not exit, but the ByteBuf will be leaked.

    private static void incrementMemoryCounter(int capacity) {
        if (DIRECT_MEMORY_COUNTER != null) {
            long newUsedMemory = DIRECT_MEMORY_COUNTER.addAndGet(capacity);
            if (newUsedMemory > DIRECT_MEMORY_LIMIT) {
                DIRECT_MEMORY_COUNTER.addAndGet(-capacity);
                throw new OutOfDirectMemoryError("failed to allocate " + capacity
                        + " byte(s) of direct memory (used: " + (newUsedMemory - capacity)
                        + ", max: " + DIRECT_MEMORY_LIMIT + ')');
            }
        }
    }

@jerqi
Copy link
Contributor

jerqi commented Sep 24, 2024

Sorry, I write it wrongly. Below is the netty source code. If we set DIRECT_MEMORY_LIMIT to not a big number, will throw OutOfDirectMemoryError. But the jvm process will not exit, but the ByteBuf will be leaked.

    private static void incrementMemoryCounter(int capacity) {
        if (DIRECT_MEMORY_COUNTER != null) {
            long newUsedMemory = DIRECT_MEMORY_COUNTER.addAndGet(capacity);
            if (newUsedMemory > DIRECT_MEMORY_LIMIT) {
                DIRECT_MEMORY_COUNTER.addAndGet(-capacity);
                throw new OutOfDirectMemoryError("failed to allocate " + capacity
                        + " byte(s) of direct memory (used: " + (newUsedMemory - capacity)
                        + ", max: " + DIRECT_MEMORY_LIMIT + ')');
            }
        }
    }

Maybe it's ok to add comments to merge this pull request.

} catch (Throwable t) {
if (!shuffleBlockInfoList.isEmpty()) {
shuffleBlockInfoList.forEach(sbi -> sbi.getData().release());
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need check empty?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check condition is redundant. I will remove the check.

@rickyma
Copy link
Contributor

rickyma commented Sep 24, 2024

I have also encountered such a problem and fixed it with a similar method. Although I also think that we can solve this problem by increasing MaxDirectMemorySize or reducing rss.server.buffer.capacity. But it is difficult for many people to find this reasonable value. If at a certain peak usage moment, direct memory happens to exceed MaxDirectMemorySize, leaks will occur. I think it is necessary to merge this pr.

If the users follow this guidance, this error should not happen, or else there exists a bug.

@lwllvyb
Copy link
Contributor Author

lwllvyb commented Oct 15, 2024

Ping @jerqi @rickyma @zhengchenyu
Could you please take some time to review this PR? If it doesn't need to be merged, please let me know. This PR has been verified internally.

rickyma
rickyma previously approved these changes Oct 15, 2024
Copy link
Contributor

@rickyma rickyma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is harmless.

I'm fine with this. Someone else can take a look.

@jerqi
Copy link
Contributor

jerqi commented Oct 15, 2024

Could you add some comments for this code?

@lwllvyb
Copy link
Contributor Author

lwllvyb commented Oct 16, 2024

Could you add some comments for this code?

Thanks for your review. I added some comments to the code. Please take the time to review again when it is convenient for you.
Thanks again. @jerqi

jerqi
jerqi previously approved these changes Oct 16, 2024
Copy link
Contributor

@jerqi jerqi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

} catch (Throwable t) {
// An OutOfDirectMemoryError will be thrown, when the direct memory reaches the limit.
// OutOfDirectMemoryError does not cause the jvm exit and cause the direct memory leak.
// Warning: Please set DIRECT_MEMORY_LIMIT to a reasonable value.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DIRECT_MEMORY_LIMIT does not exist? Maybe DIRECT_MEMORY_LIMIT -> MAX_DIRECT_MEMORY_SIZE?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also put the link of the guidance in the comment.

shuffleBlockInfoList.add(Decoders.decodeShuffleBlockInfo(byteBuf));
} catch (Throwable t) {
// An OutOfDirectMemoryError will be thrown, when the direct memory reaches the limit.
// OutOfDirectMemoryError does not cause the jvm exit and cause the direct memory leak.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OutOfDirectMemoryError does not cause the JVM to exit and it may lead to direct memory leaks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.
@rickyma

@rickyma rickyma changed the title [#2060]fix(server): Fix memory leak when reach memory limit [#2060] fix(server): Fix memory leaks when reaching memory limit Oct 16, 2024
@lwllvyb lwllvyb closed this Oct 19, 2024
@lwllvyb lwllvyb reopened this Oct 19, 2024
@lwllvyb lwllvyb closed this Oct 19, 2024
@lwllvyb lwllvyb reopened this Oct 19, 2024
@codecov-commenter
Copy link

codecov-commenter commented Oct 19, 2024

Codecov Report

Attention: Patch coverage is 25.00000% with 6 lines in your changes missing coverage. Please review.

Project coverage is 52.86%. Comparing base (5ddcc28) to head (6903a21).
Report is 125 commits behind head on master.

Files with missing lines Patch % Lines
.../common/netty/protocol/SendShuffleDataRequest.java 25.00% 6 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #2058      +/-   ##
============================================
+ Coverage     52.77%   52.86%   +0.09%     
- Complexity     2498     3422     +924     
============================================
  Files           398      518     +120     
  Lines         18135    28359   +10224     
  Branches       1660     2650     +990     
============================================
+ Hits           9570    14991    +5421     
- Misses         7981    12400    +4419     
- Partials        584      968     +384     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

// An OutOfDirectMemoryError will be thrown, when the direct memory reaches the limit.
// OutOfDirectMemoryError will not cause the JVM to exit, but may lead to direct memory
// leaks.
// Warning: You can refer to docs/server_guide.md to set MAX_DIRECT_MEMORY_SIZE to a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a warning.
Maybe: Warning: -> Note:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed it as you suggested.

@rickyma rickyma merged commit c3dd749 into apache:master Oct 21, 2024
41 of 42 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] memory leak when reach memory limit
5 participants