netdata state container: runaway FD use #372
Hey, @withinboredom.
@withinboredom I am very sorry you had this bad experience with Netdata. Please help us find the issue and fix it. In the current nightly version of netdata we added two more monitoring functions, based on your suggestions at netdata/netdata#15411.
But even before these changes, as @ilyam8 says, we were monitoring the file descriptors per application with apps.plugin. Keep in mind that apps.plugin monitors the file descriptors of every running application.

So, although I understand you have removed Netdata from your systems, could you please help us trace the issue?

I have also read your blog post. You state that somehow a process managed to exhaust all file descriptors of the system and this led to system-wide corruption. To my understanding this is not technically possible: even if an application is leaking file descriptors, the limits of the process are far below the limits of the system, so the process will start misbehaving, but it cannot kill or corrupt the entire system. The ability of a single process to corrupt the entire system would be a big issue for Linux.

Anyway, please help us verify that Netdata is leaking file descriptors. If it does, we need to find where. You mention that `lsof` shows the WAL/db being opened in a loop. Can you help us?
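To make the limits point concrete, here is a minimal sketch (plain Python on Linux, not part of Netdata) comparing a process's own descriptor limit with the kernel-wide maximum:

```python
import resource

# Per-process ceiling on open file descriptors (soft, hard).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# System-wide ceiling on file handles (Linux-specific sysctl).
with open("/proc/sys/fs/file-max") as f:
    file_max = int(f.read().split()[0])

print(f"process soft limit  : {soft}")
print(f"process hard limit  : {hard}")
print(f"system-wide file-max: {file_max}")
```

On typical defaults the per-process limit is orders of magnitude below fs.file-max, so a single leaking process should start failing with EMFILE long before the node as a whole runs out of handles.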
I ended up installing netdata back, but directly on the nodes instead of using the helm chart. I really like netdata and couldn't find anything nearly as awesome. I lose some internal monitoring of the cluster, but that's ok with me for now.
It was essentially as I posted, hundreds and hundreds of thousands of lines of it (~500k). I probably should have gotten you the entire output, but didn't think of it at the time. As you can see from the output, it is entirely the same WAL/db files being opened over and over. At first, I thought it ran out of space on the volume, but post-mortem, the volume only had about 20 MB in use (out of 1 GB). I have no idea why netdata did this.
You may have to scroll to the right to read the end of the lines (GitHub doesn't wrap code blocks, apparently).
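If it helps, a rough way to show the same thing without pasting the whole `lsof` dump is to tally where the descriptors of a single process point, via /proc/&lt;pid&gt;/fd (a quick Python sketch, Linux-only; the PID argument is whatever process you want to inspect):

```python
import os
import sys
from collections import Counter

def open_file_counts(pid):
    """Tally how many descriptors of a process resolve to the same path,
    by reading the symlinks under /proc/<pid>/fd (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    counts = Counter()
    for fd in os.listdir(fd_dir):
        try:
            counts[os.readlink(os.path.join(fd_dir, fd))] += 1
        except OSError:
            pass  # the descriptor may have closed while we were iterating
    return counts

if __name__ == "__main__":
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    for path, n in open_file_counts(pid).most_common(10):
        print(f"{n:8d}  {path}")
```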
Yeah, absolutely. With this new monitoring, I'd feel much safer installing the helm chart again. I dug through the node's logs. Here are some things I saw before running out of file descriptors:
There are many logs like that before the system fails due to having no more FDs. Does a segfault leave open files behind if it happens in a container? I know PID 1 is normally responsible for reaping processes, but I don't know the structure of these containers (if there isn't a PID 1 in the container, that could be it ... 🤔). Eventually, it looks like etcd loses access to FDs first (well, "first" might just mean it is the most active process on the node), followed by netdata repeatedly asking k8s for a configmap, and then k8s itself. In the end, there are enough processes stuck with too many FDs that the whole system is overwhelmed. At the time, there were only a handful of containers/processes on this node. I'll see if I can get the full logs here.
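As far as I understand, the kernel closes a dead process's descriptors even if nobody reaps it, so zombies alone shouldn't pin files open; still, a missing init is worth ruling out. For reference, this is roughly the wait() loop a PID 1 is expected to run (a generic Python sketch, not the actual init of these containers):

```python
import os
import time

# Generic sketch of a PID 1 reaping loop; the child below is a
# stand-in workload, not anything from the netdata containers.
pid = os.fork()
if pid == 0:
    time.sleep(1)   # pretend to do some work, then exit
    os._exit(0)

# PID 1's job: wait() on children until none are left, so exited
# (or crashed) processes don't linger as zombies in the process table.
while True:
    try:
        child, status = os.wait()
        print(f"reaped pid {child}, status {status}")
    except ChildProcessError:
        break
```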
Netdata needed to be removed due to consuming ALL available file descriptors (mildly entertaining that this isn't a monitored metric in netdata, that I could find).
From `lsof`, it appears that it is just opening the WAL/db in a loop, approximately hundreds of thousands of times, until the node eventually becomes unavailable due to an inability to open any more file descriptors.
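For anyone who wants to alert on this in the meantime, the kernel does expose the system-wide handle count; here is a minimal sketch of reading it (plain Python on Linux, not an existing netdata collector):

```python
# /proc/sys/fs/file-nr reports: allocated, unused, maximum (Linux only).
with open("/proc/sys/fs/file-nr") as f:
    allocated, unused, maximum = (int(x) for x in f.read().split())

in_use = allocated - unused
print(f"file handles in use: {in_use} / {maximum} "
      f"({100 * in_use / maximum:.1f}%)")
```

When the "in use" number approaches the maximum, new opens start failing system-wide with ENFILE, which matches the behaviour described above.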