netdata state container: runaway FD use #372
Hey, @withinboredom.
@withinboredom I am very sorry you had this bad experience with Netdata. Please help us find the issue and fix it. In the current nightly version of netdata we added two more monitoring functions, based on your suggestions at netdata/netdata#15411.
But even before these changes, as @ilyam8 says, we were monitoring the file descriptors per application with apps.plugin. Keep in mind that apps.plugin monitors the file descriptors of every running application.

So, although I understand you have removed Netdata from your systems, could you please help us trace the issue?

I have also read your blog post. You state that somehow a process managed to exhaust all file descriptors of the system and this led to system-wide corruption. To my understanding this is not technically possible: even if an application is leaking file descriptors, the limits of the process are far below the limits of the system, so the process will start misbehaving, but it cannot kill or corrupt the entire system. The ability of a single process to corrupt the entire system would be a big issue for Linux.

Anyway, please help us verify that Netdata is leaking file descriptors. If it does, we need to find where. You mention that `lsof` shows the WAL/db being opened in a loop. Can you help us?
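To make the limits point concrete, here is a minimal sketch (plain Python on Linux, not part of Netdata) comparing a process's own descriptor limit with the kernel-wide maximum:

```python
import resource

# Per-process ceiling on open file descriptors (soft, hard).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# System-wide ceiling on file handles (Linux-specific sysctl).
with open("/proc/sys/fs/file-max") as f:
    file_max = int(f.read().split()[0])

print(f"process soft limit  : {soft}")
print(f"process hard limit  : {hard}")
print(f"system-wide file-max: {file_max}")
```

On typical defaults the per-process limit is orders of magnitude below fs.file-max, so a single leaking process should start failing with EMFILE long before the node as a whole runs out of handles.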
I ended up installing netdata back, but directly on the nodes instead of using the helm chart. I really like netdata and couldn't find anything nearly as awesome. I lose some internal monitoring of the cluster, but that's ok with me for now.
It was essentially as I posted, hundreds and hundreds of thousands of lines of it (~500k). I probably should have gotten you the entire output, but didn't think of it at the time. As you can see from the output, it is entirely the same WAL/db files being opened over and over. At first, I thought it ran out of space on the volume, but post-mortem, the volume only had about 20 MB in use (out of 1 GB). I have no idea why netdata did this.
You may have to scroll to the right to read the end of the lines (GitHub doesn't wrap code blocks, apparently).
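If it helps, a rough way to show the same thing without pasting the whole `lsof` dump is to tally where the descriptors of a single process point, via /proc/&lt;pid&gt;/fd (a quick Python sketch, Linux-only; the PID argument is whatever process you want to inspect):

```python
import os
import sys
from collections import Counter

def open_file_counts(pid):
    """Tally how many descriptors of a process resolve to the same path,
    by reading the symlinks under /proc/<pid>/fd (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    counts = Counter()
    for fd in os.listdir(fd_dir):
        try:
            counts[os.readlink(os.path.join(fd_dir, fd))] += 1
        except OSError:
            pass  # the descriptor may have closed while we were iterating
    return counts

if __name__ == "__main__":
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    for path, n in open_file_counts(pid).most_common(10):
        print(f"{n:8d}  {path}")
```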
Yeah, absolutely. With this new monitoring, I'd feel much safer installing the helm chart again. I dug through the node's logs. Here are some things I saw before running out of file descriptors:
There are many logs like that before the system fails due to having no more FDs. Does a segfault leave open files behind if it happens in a container? I know PID 1 is normally responsible for reaping processes, but I don't know the structure of these containers (if there isn't a PID 1 in the container, that could be it ... 🤔). Eventually, it looks like etcd loses access to FDs first (well, "first" might just mean it is the most active process on the node), followed by netdata repeatedly asking k8s for a configmap, and then k8s itself. In the end, there are enough processes stuck with too many FDs that the whole system is overwhelmed. At the time, there were only a handful of containers/processes on this node. I'll see if I can get the full logs here.
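As far as I understand, the kernel closes a dead process's descriptors even if nobody reaps it, so zombies alone shouldn't pin files open; still, a missing init is worth ruling out. For reference, this is roughly the wait() loop a PID 1 is expected to run (a generic Python sketch, not the actual init of these containers):

```python
import os
import time

# Generic sketch of a PID 1 reaping loop; the child below is a
# stand-in workload, not anything from the netdata containers.
pid = os.fork()
if pid == 0:
    time.sleep(1)   # pretend to do some work, then exit
    os._exit(0)

# PID 1's job: wait() on children until none are left, so exited
# (or crashed) processes don't linger as zombies in the process table.
while True:
    try:
        child, status = os.wait()
        print(f"reaped pid {child}, status {status}")
    except ChildProcessError:
        break
```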
Netdata needed to be removed due to consuming ALL available file descriptors (mildly entertaining that this isn't a monitored metric in netdata, that I could find).
From `lsof`, it appears that it is just opening the WAL/db in a loop, approximately hundreds of thousands of times, until the node eventually becomes unavailable due to an inability to open any more file descriptors.
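For anyone who wants to alert on this in the meantime, the kernel does expose the system-wide handle count; here is a minimal sketch of reading it (plain Python on Linux, not an existing netdata collector):

```python
# /proc/sys/fs/file-nr reports: allocated, unused, maximum (Linux only).
with open("/proc/sys/fs/file-nr") as f:
    allocated, unused, maximum = (int(x) for x in f.read().split())

in_use = allocated - unused
print(f"file handles in use: {in_use} / {maximum} "
      f"({100 * in_use / maximum:.1f}%)")
```

When the "in use" number approaches the maximum, new opens start failing system-wide with ENFILE, which matches the behaviour described above.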