
AWS Fargate Memory slowly increasing/leak over time #6006

Open
NevinDry opened this issue Oct 15, 2024 · 4 comments
Labels
defect Suspected defect such as a bug or regression

Comments

@NevinDry

NevinDry commented Oct 15, 2024

Observed behavior

We observe that our AWS Fargate containers serving clustered NATS have their memory usage increasing over time:

[screenshot: ecs.fargate.mem.usage metric]

What is strange is that we observe this behavior on our STG1 environment but not on our DEV environment.
These two environments have no significant activity and are deployed the same way through IaC. NATS is deployed with clustering on AWS Fargate.

The difference in memory usage between the two environments is very significant:
dev:
[screenshot: dev container memory usage]

stg1:
[screenshot: stg1 container memory usage]

If we look at the NATS memory metrics, both environments are stable:
dev:
[screenshot: dev NATS memory metrics]

stg1:
[screenshot: stg1 NATS memory metrics]

There must be a leak somewhere, but we are unable to identify it.

Expected behavior

The Fargate containers' memory shouldn't increase over time; it should track the NATS memory metrics and stay stable.

Server and client version

Server version: 2.10.18
Go: go1.22.5

Host environment

NATS clustering inside AWS Fargate, without JetStream.

Operating system/Architecture: Linux/X86_64
CPU | Memory: 2 vCPU | 4 GB
Platform version: 1.4.0
Launch type: FARGATE

Log router and Datadog run as sidecar containers.

Steps to reproduce

No response

@NevinDry NevinDry added the defect Suspected defect such as a bug or regression label Oct 15, 2024
@NevinDry NevinDry changed the title AWS Fargate Memory slowly increasing over time AWS Fargate Memory slowly increasing/leak over time Oct 15, 2024
@neilalexander
Member

Can you please report the output of free -m within the containers when the memory usage is high?

@NevinDry
Author

NevinDry commented Oct 16, 2024

Thanks for your answer @neilalexander. Our stg1 containers restarted yesterday, so the memory has not increased much yet. I will keep you updated when the memory is high.
Here is the free command output on stg1 at the moment (the environment where memory increases over time):

STG1
CPU | Memory: 1 vCPU | 2 GB
NODE
/ # free -h
total used free shared buff/cache available
Mem: 3.6G 565.7M 292.2M 540.0K 2.8G 2.8G
Swap: 0 0 0

SEED
CPU | Memory: 1 vCPU | 2 GB
/ # free -h
total used free shared buff/cache available
Mem: 3.8G 580.9M 275.3M 540.0K 2.9G 2.9G
Swap: 0 0 0

On dev, where the memory is stable, here is the free command output (note that the CPU/memory provisioning is not the same; could this have an impact?):
DEV

NODE
CPU | Memory: .25 vCPU | .5 GB
/ # free -h
total used free shared buff/cache available
Mem: 927.8M 547.5M 67.0M 540.0K 313.3M 240.1M
Swap: 0 0 0

SEED
CPU | Memory: .25 vCPU | .5 GB
/ # free -h
total used free shared buff/cache available
Mem: 927.8M 559.9M 80.8M 544.0K 287.1M 229.2M
Swap: 0 0 0

(Note that the total/used memory shown in both environments is higher than the memory we provisioned on our containers.)
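As an aside, free reads /proc/meminfo, which seems to reflect the underlying Fargate microVM rather than the task's provisioned memory, which would explain the totals above. Here is a rough sketch of reading the container-level figures from the cgroup instead (this assumes the cgroup v1 paths; on cgroup v2 the equivalents are /sys/fs/cgroup/memory.max and memory.current):

```go
// Rough sketch: read container-level memory figures from the cgroup instead
// of /proc/meminfo. Paths below assume cgroup v1.
package main

import (
	"fmt"
	"os"
	"strings"
)

func read(path string) string {
	b, err := os.ReadFile(path)
	if err != nil {
		return "unavailable: " + err.Error()
	}
	return strings.TrimSpace(string(b))
}

func main() {
	fmt.Println("limit:", read("/sys/fs/cgroup/memory/memory.limit_in_bytes"))
	fmt.Println("usage:", read("/sys/fs/cgroup/memory/memory.usage_in_bytes"))
	// memory.stat breaks the usage down further (rss, cache, ...), which
	// helps separate process memory from page cache.
	fmt.Println(read("/sys/fs/cgroup/memory/memory.stat"))
}
```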

Thank you for your help.

@neilalexander
Member

What stands out to me is the buff/cache utilisation, which makes me think you're falling victim to kubernetes/kubernetes#43916. In short, Kubernetes is considering the kernel page cache when deciding whether a pod is under memory pressure. I suspect if you look at the RSS size (as is reported by nats server ls for example) that you'd see the process utilisation itself is stable.
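If it helps, here is a minimal sketch of reading that process-level figure programmatically from the /varz monitoring endpoint (this assumes the HTTP monitoring port is enabled, here on the default 8222):

```go
// Minimal sketch: poll the NATS monitoring endpoint and print the process
// resident memory ("mem", reported in bytes). Assumes the server runs with
// the HTTP monitoring port enabled (e.g. http_port: 8222); adjust the URL
// for your environment.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

type varz struct {
	Mem int64 `json:"mem"` // resident memory of the nats-server process, in bytes
}

func main() {
	resp, err := http.Get("http://localhost:8222/varz")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var v varz
	if err := json.NewDecoder(resp.Body).Decode(&v); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("nats-server RSS: %.1f MiB\n", float64(v.Mem)/(1024*1024))
}
```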

Do you set both a memory request and a memory limit, or just one or the other?

@NevinDry
Author

Hi @neilalexander, we did not have memory soft/hard limits set on our Fargate containers. We are going to configure them and see what happens; I will keep you updated.
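For reference, a rough sketch of what we have in mind (using aws-sdk-go-v2, with hypothetical family, image, and size values) for setting both a soft limit (memoryReservation) and a hard limit (memory) on the container definition:

```go
// Rough sketch (hypothetical family/image names and sizes): register a
// Fargate task definition with both a soft limit (MemoryReservation) and a
// hard limit (Memory) on the NATS container.
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ecs"
	"github.com/aws/aws-sdk-go-v2/service/ecs/types"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := ecs.NewFromConfig(cfg)

	_, err = client.RegisterTaskDefinition(ctx, &ecs.RegisterTaskDefinitionInput{
		Family:                  aws.String("nats-cluster"), // hypothetical name
		RequiresCompatibilities: []types.Compatibility{types.CompatibilityFargate},
		NetworkMode:             types.NetworkModeAwsvpc,
		Cpu:                     aws.String("1024"), // task level: 1 vCPU
		Memory:                  aws.String("2048"), // task level: 2 GB
		ContainerDefinitions: []types.ContainerDefinition{{
			Name:              aws.String("nats"),
			Image:             aws.String("nats:2.10.18"),
			MemoryReservation: aws.Int32(512),  // soft limit (MiB)
			Memory:            aws.Int32(1024), // hard limit (MiB)
		}},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```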
On another note, we observed that only containers with more than the default memory value (512 MB) have their memory increasing over time.
Thanks for your guidance!
