Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Telegraf stops producing data after a network disconnect and reconnect. #16035

Open
mmachenry opened this issue Oct 16, 2024 · 2 comments
Open
Assignees
Labels
bug unexpected problem or unintended behavior

Comments

@mmachenry
Copy link

Relevant telegraf.conf

[agent]
  omit_hostname = true
  debug = true

[[outputs.influxdb_v2]]
  urls = ["http://influxdb:8086"]
  token = "$KTS_TELEMETRY_INFLUXDB_TOKEN"
  organization = "$KTS_TELEMETRY_INFLUXDB_ORG"
  bucket = "$KTS_TELEMETRY_INFLUXDB_BUCKET"


[[inputs.mqtt_consumer]]
  data_format = "json_v2"
  client_trace = true
  servers = ["tcp://10.0.200.10:1883"]
  topics = [
    "/sample/#", "/request/#", "/event/#"
  ]
  [[inputs.mqtt_consumer.json_v2]]
    [[inputs.mqtt_consumer.json_v2.object]]
      path = "@this"
      excluded_keys = []

Logs from Telegraf

telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [client]  internalConnLost called
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [client]  stopCommsWorkers called
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [net]     incoming complete
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [net]     startIncomingComms: ibound complete
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [net]     startIncomingComms goroutine complete
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [router]  matchAndDispatch exiting
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [net]     outgoing waiting for an outbound message
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [net]     outgoing waiting for an outbound message
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [net]     outgoing waiting for an outbound message
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [net]     outgoing comms stopping
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [net]     startComms closing outError
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [client]  incoming comms goroutine done
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [client]  internalConnLost waiting on workers
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [client]  startCommsWorkers output redirector finished
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [client]  stopCommsWorkers waiting for workers
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [client]  stopCommsWorkers waiting for comms
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [client]  stopCommsWorkers done
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [client]  internalConnLost workers stopped
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [client]  BUG BUG BUG reconnection function is nil<nil>
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [msgids]  cleaned up
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [client]  internalConnLost complete
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [client]  status is already disconnected
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z E! [inputs.mqtt_consumer] Error in plugin: connection lost: pingresp not received, disconnecting
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] Disconnected [tcp://10.0.200.10:1883]
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:36Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:40Z D! [inputs.mqtt_consumer] Connecting [tcp://10.0.200.10:1883]
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:40Z D! [inputs.mqtt_consumer] [client]  Connect()
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:40Z D! [inputs.mqtt_consumer] [store]   memorystore initialized
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:40Z D! [inputs.mqtt_consumer] [client]  about to write new connect msg
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:46Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:50Z D! [inputs.mqtt_consumer] Previous collection has not completed; scheduled collection skipped
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:50Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 10s
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:56Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
telemetry-mqtt_telegraf-1  | 2024-10-16T08:12:00Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 10s
telemetry-mqtt_telegraf-1  | 2024-10-16T08:12:00Z D! [inputs.mqtt_consumer] Previous collection has not completed; scheduled collection skipped
telemetry-mqtt_telegraf-1  | 2024-10-16T08:12:06Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
telemetry-mqtt_telegraf-1  | 2024-10-16T08:12:10Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 10s
telemetry-mqtt_telegraf-1  | 2024-10-16T08:12:10Z D! [inputs.mqtt_consumer] Previous collection has not completed; scheduled collection skipped
telemetry-mqtt_telegraf-1  | 2024-10-16T08:12:10Z D! [inputs.mqtt_consumer] [client]  dial tcp 10.0.200.10:1883: i/o timeout
telemetry-mqtt_telegraf-1  | 2024-10-16T08:12:10Z D! [inputs.mqtt_consumer] [client]  failed to connect to broker, trying next
telemetry-mqtt_telegraf-1  | 2024-10-16T08:12:10Z D! [inputs.mqtt_consumer] [client]  Failed to connect to a broker
telemetry-mqtt_telegraf-1  | 2024-10-16T08:12:10Z D! [inputs.mqtt_consumer] [store]   memorystore closed
telemetry-mqtt_telegraf-1  | 2024-10-16T08:12:10Z E! [inputs.mqtt_consumer] Error in plugin: network Error : dial tcp 10.0.200.10:1883: i/o timeout

System info

Telegraf-1.32.1 Ubuntu-20.04.3 LTS

Docker

version: "3"
services:
influxdb:
image: influxdb:2.7.10 # influxdb:latest
ports:
- 8086:8086
volumes:
- influxdb-storage:/var/lib/influxdb2
ulimits:
nproc: 65535
nofile:
soft: 20000
hard: 40000
restart: always
environment:
- DOCKER_INFLUXDB_INIT_MODE=setup
- DOCKER_INFLUXDB_INIT_USERNAME=$KTS_TELEMETRY_INFLUXDB_USERNAME
- DOCKER_INFLUXDB_INIT_PASSWORD=$KTS_TELEMETRY_INFLUXDB_PASSWORD
- DOCKER_INFLUXDB_INIT_ORG=$KTS_TELEMETRY_INFLUXDB_ORG
- DOCKER_INFLUXDB_INIT_BUCKET=$KTS_TELEMETRY_INFLUXDB_BUCKET
- DOCKER_INFLUXDB_INIT_RETENTION=$KTS_TELEMETRY_INFLUXDB_RETENTION
- DOCKER_INFLUXDB_INIT_ADMIN_TOKEN=$KTS_TELEMETRY_INFLUXDB_TOKEN

grafana:
image: grafana/grafana:11.2.0 # grafana/grafana:latest
environment:
- GF_SERVER_ROOT_URL=http://192.168.1.97:3000/
- GF_PLUGINS_ENABLE_ALPHA=true
ports:
- 3000:3000
volumes:
- grafana-storage:/var/lib/grafana
- ./grafana.ini:/etc/grafana/grafana.ini:ro
restart: always
depends_on:
- influxdb

mqtt_telegraf:
image: telegraf:1.32.1
environment:
- KTS_TELEMETRY_INFLUXDB_TOKEN=$KTS_TELEMETRY_INFLUXDB_TOKEN
- KTS_TELEMETRY_INFLUXDB_ORG=$KTS_TELEMETRY_INFLUXDB_ORG
- KTS_TELEMETRY_INFLUXDB_BUCKET=$KTS_TELEMETRY_INFLUXDB_BUCKET
- KTS_INFLUXDB_HOST=influxdb:8086
volumes:
- ./telegraf.conf:/etc/telegraf/telegraf.conf:ro
restart: always
depends_on:
- influxdb

volumes:
influxdb-storage:
grafana-storage:

Steps to reproduce

  1. Run the given docker-compose.
  2. Run an MQTT server on another device on the network
  3. Publish to /event/#, any message
  4. Disconnect the network.
  5. Reboot the network connection

Expected behavior

The telegraf instance should reconnect to the MQTT broker upon having network connectivity again and then begin producing data to insert into InfluxDB as normal.

Actual behavior

Upon reconnection, a bunch of cached data is dumped to InfluxDB and shortly there after, no new data is ingested. If telegraf is rebooted, it reconnects normally and produces data. A reboot to telegraf should not be required. It does not notice that it's broken so a manual effort is require to realize data is not being ingested to InfluxDB and a manual forced restart needs to be triggered.

Additional info

A more full version of the logs which did not fit here is located on gitst.
https://gist.github.com/mmachenry/59139541bbb628abff0c648d37eb41fb

@mmachenry mmachenry added the bug unexpected problem or unintended behavior label Oct 16, 2024
@srebhan srebhan self-assigned this Oct 16, 2024
@srebhan
Copy link
Member

srebhan commented Oct 18, 2024

@evilmumi let's continue in one issue rather than splitting up.

@mmachenry or @evilmumi can you please check if this behavior was introduced in a certain version? A first debugging session on slack revealed that the underlying MQTT library itself seems to cause the issue...

@evilmumi
Copy link

I have Auto Update configurated for my telegraf Container. So maybe it came in the last 2-3 releases. But at the moment not clear how to test it? Can i just use an old version?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug unexpected problem or unintended behavior
Projects
None yet
Development

No branches or pull requests

3 participants