Service stuck without crash #120

Open
YentlFrickx opened this issue Jan 19, 2023 · 2 comments

Comments

@YentlFrickx

We are running the nri-bundle on our OpenShift clusters and notice that, after a while, the newrelic-logging pods get 'stuck' and stop forwarding logs to New Relic. When we delete the pod, the new pod works as expected and resumes tailing the log files. I tried debugging the pod but could not open a shell in it. I could, however, run /fluent-bit/bin/fluent-bit --version in the container, which correctly returned the version. My assumption is that the fluent-bit process hangs somewhere without crashing, so the container is never restarted.

@jsubirat
Contributor

Hi @YentlFrickx,

Thank you for reporting this. Could you please capture the newrelic-logging pod logs of one of these "blocked" pods with `kubectl logs`?

Also, please note that you can use the debug images to diagnose this kind of issue. The debug images include the busybox binary, which you can use to run commands in the container, including sh. To use this image, you'll need this configuration in values.yml:

newrelic-logging:
  image:
    tag: 1.14.0-debug

Version 1.14.0 is the one currently used by our Helm chart. However, versions 1.14.1 and 1.14.2 are also available if you override newrelic-logging.image.tag. In particular, version 1.14.1 includes a fix for a Fluent Bit bug that might be related to what you are experiencing.

So, in summary, could you please:

  1. Send us the newrelic-logging logs.
  2. Override the image tag to 1.14.0-debug and use busybox to get shell access (busybox sh) and diagnose the issue. List of Busybox commands here.
  3. Override the image tag to 1.14.1 and check if that fixes your issue.
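
The three steps above could be scripted roughly as follows. This is only a sketch under assumptions: the namespace `newrelic`, the release name `newrelic-bundle`, and the `newrelic` Helm repo alias in `newrelic/nri-bundle` are placeholders, and the function names are made up for illustration. `KUBECTL` can be set to `oc` on OpenShift.

```shell
#!/bin/sh
# Placeholders (assumptions): adjust to match your own installation.
KUBECTL="${KUBECTL:-kubectl}"          # set KUBECTL=oc on OpenShift if preferred
HELM="${HELM:-helm}"
NAMESPACE="${NAMESPACE:-newrelic}"     # hypothetical namespace
RELEASE="${RELEASE:-newrelic-bundle}"  # hypothetical Helm release name

# Step 1: print the logs of a stuck newrelic-logging pod ($1 = pod name).
capture_logs() {
  "$KUBECTL" logs "$1" -n "$NAMESPACE"
}

# Steps 2 and 3: override newrelic-logging.image.tag ($1 = tag,
# for example 1.14.0-debug or 1.14.1), keeping all other values.
set_image_tag() {
  "$HELM" upgrade "$RELEASE" newrelic/nri-bundle -n "$NAMESPACE" \
    --reuse-values --set "newrelic-logging.image.tag=$1"
}

# Step 2: open a busybox shell in a pod running the debug image ($1 = pod name).
open_debug_shell() {
  "$KUBECTL" exec -it "$1" -n "$NAMESPACE" -- busybox sh
}
```

For example, `capture_logs <pod-name> > stuck-pod.log` saves the logs to a file that can be attached to this issue, and `set_image_tag 1.14.0-debug` followed by `open_debug_shell <pod-name>` gives shell access once the pod has been recreated.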

Best regards,

Josep

@adri

adri commented Jul 31, 2024

We also noticed stuck containers (still running but not reporting to New Relic). We saw these errors in the logs before the container stopped reporting:

[2024/07/19 18:34:13] [error] [tls] syscall error: error:00000005:lib(0):func(0):DH lib
[2024/07/19 18:34:13] [error] [http_client] broken connection to kubernetes.default.svc.cluster.local:443 ?

Would it be an option to upgrade newrelic-logging to Fluent Bit 3.1? It currently uses 3.0.4, and other versions contain fixes related to TLS.
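
For anyone who wants to check whether their stuck pods show the same errors, a minimal sketch follows. It assumes the pod logs were first saved to a file; the function name is hypothetical, and the pattern only matches the `[error] [tls]` and `[error] [http_client]` prefixes seen in the lines quoted above.

```shell
#!/bin/sh
# Print Fluent Bit error lines emitted by the tls or http_client components.
# $1: path to a saved log file (for example, the output of `kubectl logs`).
find_connection_errors() {
  grep -E '\[error\] \[(tls|http_client)\]' "$1"
}
```

Running `find_connection_errors stuck-pod.log` prints any matching lines; no output means the saved logs do not contain these particular errors.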

