Telegraf stops producing data after a network disconnect and reconnect. #16035

mmachenry · 2024-10-16T19:14:13Z

Relevant telegraf.conf

[agent]
  omit_hostname = true
  debug = true

[[outputs.influxdb_v2]]
  urls = ["http://influxdb:8086"]
  token = "$KTS_TELEMETRY_INFLUXDB_TOKEN"
  organization = "$KTS_TELEMETRY_INFLUXDB_ORG"
  bucket = "$KTS_TELEMETRY_INFLUXDB_BUCKET"


[[inputs.mqtt_consumer]]
  data_format = "json_v2"
  client_trace = true
  servers = ["tcp://10.0.200.10:1883"]
  topics = [
    "/sample/#", "/request/#", "/event/#"
  ]
  [[inputs.mqtt_consumer.json_v2]]
    [[inputs.mqtt_consumer.json_v2.object]]
      path = "@this"
      excluded_keys = []

Logs from Telegraf

telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [client]  internalConnLost called
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [client]  stopCommsWorkers called
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [net]     incoming complete
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [net]     startIncomingComms: ibound complete
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [net]     startIncomingComms goroutine complete
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [router]  matchAndDispatch exiting
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [net]     outgoing waiting for an outbound message
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [net]     outgoing waiting for an outbound message
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [net]     outgoing waiting for an outbound message
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [net]     outgoing comms stopping
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [net]     startComms closing outError
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [client]  incoming comms goroutine done
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [client]  internalConnLost waiting on workers
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [client]  startCommsWorkers output redirector finished
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [client]  stopCommsWorkers waiting for workers
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [client]  stopCommsWorkers waiting for comms
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [client]  stopCommsWorkers done
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [client]  internalConnLost workers stopped
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [client]  BUG BUG BUG reconnection function is nil<nil>
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [msgids]  cleaned up
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [client]  internalConnLost complete
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] [client]  status is already disconnected
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z E! [inputs.mqtt_consumer] Error in plugin: connection lost: pingresp not received, disconnecting
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:30Z D! [inputs.mqtt_consumer] Disconnected [tcp://10.0.200.10:1883]
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:36Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:40Z D! [inputs.mqtt_consumer] Connecting [tcp://10.0.200.10:1883]
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:40Z D! [inputs.mqtt_consumer] [client]  Connect()
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:40Z D! [inputs.mqtt_consumer] [store]   memorystore initialized
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:40Z D! [inputs.mqtt_consumer] [client]  about to write new connect msg
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:46Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:50Z D! [inputs.mqtt_consumer] Previous collection has not completed; scheduled collection skipped
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:50Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 10s
telemetry-mqtt_telegraf-1  | 2024-10-16T08:11:56Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
telemetry-mqtt_telegraf-1  | 2024-10-16T08:12:00Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 10s
telemetry-mqtt_telegraf-1  | 2024-10-16T08:12:00Z D! [inputs.mqtt_consumer] Previous collection has not completed; scheduled collection skipped
telemetry-mqtt_telegraf-1  | 2024-10-16T08:12:06Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
telemetry-mqtt_telegraf-1  | 2024-10-16T08:12:10Z W! [inputs.mqtt_consumer] Collection took longer than expected; not complete after interval of 10s
telemetry-mqtt_telegraf-1  | 2024-10-16T08:12:10Z D! [inputs.mqtt_consumer] Previous collection has not completed; scheduled collection skipped
telemetry-mqtt_telegraf-1  | 2024-10-16T08:12:10Z D! [inputs.mqtt_consumer] [client]  dial tcp 10.0.200.10:1883: i/o timeout
telemetry-mqtt_telegraf-1  | 2024-10-16T08:12:10Z D! [inputs.mqtt_consumer] [client]  failed to connect to broker, trying next
telemetry-mqtt_telegraf-1  | 2024-10-16T08:12:10Z D! [inputs.mqtt_consumer] [client]  Failed to connect to a broker
telemetry-mqtt_telegraf-1  | 2024-10-16T08:12:10Z D! [inputs.mqtt_consumer] [store]   memorystore closed
telemetry-mqtt_telegraf-1  | 2024-10-16T08:12:10Z E! [inputs.mqtt_consumer] Error in plugin: network Error : dial tcp 10.0.200.10:1883: i/o timeout

System info

Telegraf-1.32.1 Ubuntu-20.04.3 LTS

Docker

version: "3"
services:
influxdb:
image: influxdb:2.7.10 # influxdb:latest
ports:
- 8086:8086
volumes:
- influxdb-storage:/var/lib/influxdb2
ulimits:
nproc: 65535
nofile:
soft: 20000
hard: 40000
restart: always
environment:
- DOCKER_INFLUXDB_INIT_MODE=setup
- DOCKER_INFLUXDB_INIT_USERNAME=$KTS_TELEMETRY_INFLUXDB_USERNAME
- DOCKER_INFLUXDB_INIT_PASSWORD=$KTS_TELEMETRY_INFLUXDB_PASSWORD
- DOCKER_INFLUXDB_INIT_ORG=$KTS_TELEMETRY_INFLUXDB_ORG
- DOCKER_INFLUXDB_INIT_BUCKET=$KTS_TELEMETRY_INFLUXDB_BUCKET
- DOCKER_INFLUXDB_INIT_RETENTION=$KTS_TELEMETRY_INFLUXDB_RETENTION
- DOCKER_INFLUXDB_INIT_ADMIN_TOKEN=$KTS_TELEMETRY_INFLUXDB_TOKEN

grafana:
image: grafana/grafana:11.2.0 # grafana/grafana:latest
environment:
- GF_SERVER_ROOT_URL=http://192.168.1.97:3000/
- GF_PLUGINS_ENABLE_ALPHA=true
ports:
- 3000:3000
volumes:
- grafana-storage:/var/lib/grafana
- ./grafana.ini:/etc/grafana/grafana.ini:ro
restart: always
depends_on:
- influxdb

mqtt_telegraf:
image: telegraf:1.32.1
environment:
- KTS_TELEMETRY_INFLUXDB_TOKEN=$KTS_TELEMETRY_INFLUXDB_TOKEN
- KTS_TELEMETRY_INFLUXDB_ORG=$KTS_TELEMETRY_INFLUXDB_ORG
- KTS_TELEMETRY_INFLUXDB_BUCKET=$KTS_TELEMETRY_INFLUXDB_BUCKET
- KTS_INFLUXDB_HOST=influxdb:8086
volumes:
- ./telegraf.conf:/etc/telegraf/telegraf.conf:ro
restart: always
depends_on:
- influxdb

volumes:
influxdb-storage:
grafana-storage:

Steps to reproduce

Run the given docker-compose.
Run an MQTT server on another device on the network
Publish to /event/#, any message
Disconnect the network.
Reboot the network connection

Expected behavior

The telegraf instance should reconnect to the MQTT broker upon having network connectivity again and then begin producing data to insert into InfluxDB as normal.

Actual behavior

Upon reconnection, a bunch of cached data is dumped to InfluxDB and shortly there after, no new data is ingested. If telegraf is rebooted, it reconnects normally and produces data. A reboot to telegraf should not be required. It does not notice that it's broken so a manual effort is require to realize data is not being ingested to InfluxDB and a manual forced restart needs to be triggered.

Additional info

A more full version of the logs which did not fit here is located on gitst.
https://gist.github.com/mmachenry/59139541bbb628abff0c648d37eb41fb

The text was updated successfully, but these errors were encountered:

srebhan · 2024-10-18T10:37:31Z

@evilmumi let's continue in one issue rather than splitting up.

@mmachenry or @evilmumi can you please check if this behavior was introduced in a certain version? A first debugging session on slack revealed that the underlying MQTT library itself seems to cause the issue...

evilmumi · 2024-10-19T09:14:56Z

I have Auto Update configurated for my telegraf Container. So maybe it came in the last 2-3 releases. But at the moment not clear how to test it? Can i just use an old version?

mmachenry added the bug unexpected problem or unintended behavior label Oct 16, 2024

srebhan self-assigned this Oct 16, 2024

srebhan mentioned this issue Oct 18, 2024

Telegraf did not grab data after a reconnect to a restarted mqtt broker #16038

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Telegraf stops producing data after a network disconnect and reconnect. #16035

Telegraf stops producing data after a network disconnect and reconnect. #16035

mmachenry commented Oct 16, 2024

srebhan commented Oct 18, 2024

evilmumi commented Oct 19, 2024

Telegraf stops producing data after a network disconnect and reconnect. #16035

Telegraf stops producing data after a network disconnect and reconnect. #16035

Comments

mmachenry commented Oct 16, 2024

Relevant telegraf.conf

Logs from Telegraf

System info

Docker

Steps to reproduce

Expected behavior

Actual behavior

Additional info

srebhan commented Oct 18, 2024

evilmumi commented Oct 19, 2024