Troubleshooting

Sergio Bengoechea Guerrero edited this page Jan 17, 2022 · 8 revisions

Logs

SNMPCollector writes a complete set of log files that let you review what happens while it gathers SNMP data from your infrastructure. All logs are located in the same directory, LOG_DIR.

The default LOG_DIR is /var/log/snmpcollector when snmpcollector has been installed from the Debian or Red Hat based packages, and /opt/snmpcollector/log in Docker. It can always be overridden with the -log option passed to the snmpcollector binary.

If installed from the Debian/Red Hat packages, you can also tune these parameters in the following files:

  • rpm /etc/sysconfig/snmpcollector
  • deb /etc/default/snmpcollector

Main agent logs

$LOG_DIR/snmpcollector.log

Shows the basic initialisation process and the result of runtime administration actions performed from the web UI.

  • Default level: set in the main config.toml file, under the general section
  • Supported levels: panic, fatal, error, warn, info, debug
  • Can be changed online?: no

HTTP access logs

$LOG_DIR/http_access.log

Shows every HTTP request, its result, and its response time. This log has no level support.

Device specific logs

$LOG_DIR/<device_id>.log

This is the main log to check when you have problems with only a set of devices, or only under certain conditions.

  • Default level: set in the device configuration section of the configuration database
  • Supported levels: panic, fatal, error, warn, info, debug
  • Can be changed online?: yes, in the runtime web UI

SNMP debug logs

$LOG_DIR/snmpdebug_<device_id>_<measurement_id>.log

This log is disabled by default and can be enabled online from the web UI. When establishing SNMP links with remote devices, snmpcollector opens one link per measurement. When the snmpdebug log is enabled, each measurement on the device creates a new file with SNMP protocol debug output. This output helps to review connection and/or SNMP protocol problems.

SQL debug log

$LOG_DIR/sql.log

This log is disabled by default.

  • Default level: set in the main config.toml file, under the general section
  • Supported levels: on/off (debug = true / debug = false)
  • Can be changed online?: no
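As a quick way to know which files to look for, the naming patterns above can be expanded for a given device. This is only a sketch: the helper name `log_files`, the device id `router01`, and the measurement id `interfaces` are hypothetical, and the paths are built purely from the patterns listed in this section.

```python
from pathlib import PurePosixPath


def log_files(log_dir, device_id, measurement_ids=()):
    """Expand the log file patterns described above for one device.

    log_dir is the LOG_DIR in use; device_id and measurement_ids are the
    ids configured in snmpcollector (hypothetical values in the example).
    """
    base = PurePosixPath(log_dir)
    return {
        "agent": base / "snmpcollector.log",
        "http_access": base / "http_access.log",
        "sql": base / "sql.log",
        "device": base / f"{device_id}.log",
        # One snmpdebug file per measurement, only present when enabled.
        "snmpdebug": [
            base / f"snmpdebug_{device_id}_{meas}.log" for meas in measurement_ids
        ],
    }


if __name__ == "__main__":
    files = log_files("/var/log/snmpcollector", "router01", ["interfaces"])
    for name, path in files.items():
        print(name, path)
```

The sql.log and snmpdebug files only exist once the corresponding debug option has been enabled.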

Self Monitoring

When self-monitoring is active, snmpcollector can send data about itself to the "default" backend. For this to work, selfmon must be enabled and one Influx backend must be configured with id = "default".

You can activate it in the [selfmon] section of the main config file, config.toml.

[selfmon]
 # enable/disable self monitoring (true/false)
 enabled = true
 # frequency for sending data
 freq = 60
 # prefix for measurement naming
 prefix = ""
 # inherit device tags (only applies to the selfmon_device_stats measurements)
 inheritdevicetags = true
 # extra tags added to the measurements, set as tag=value entries (tag1=value1,tag2=value2,...,tagN=valN)
 extratags = [ "instance=snmpcollector01" ]
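To illustrate the tag=value format of extratags, the sketch below turns such a list into tag names and values. It is not snmpcollector's own parser; the function name `parse_extratags` and the strict error handling are assumptions made for the example.

```python
def parse_extratags(entries):
    """Split "tag=value" strings into a {tag: value} dict.

    Illustrative only: rejecting malformed entries is an assumption,
    not necessarily what snmpcollector does.
    """
    tags = {}
    for entry in entries:
        key, sep, value = entry.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected tag=value, got {entry!r}")
        tags[key.strip()] = value.strip()
    return tags


if __name__ == "__main__":
    print(parse_extratags(["instance=snmpcollector01"]))
```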

When active, it sends the measurements listed under Defined Measurements.

Dashboards

The following dashboards show the internal statistics of SNMPCollector, so the user can check the status of the platform.

| Dashboard | Description | Required version |
|---|---|---|
| snmpcollector_platform_instance | Overview metrics to know the SNMPCollector instance status | SNMPCollector: 0.12.0+, Grafana: 7.5.5+ |
| snmpcollector_platform_device | Detailed device view to show the device stats on a SNMPCollector instance | SNMPCollector: 0.12.0+, Grafana: 7.5.5+ |
| snmpcollector_platform_measurement | Detailed measurement view to show the device stats on a SNMPCollector instance | SNMPCollector: 0.12.0+, Grafana: 7.5.5+ |

Defined Measurements

These are the defined measurements. A prefix can be added to their names in config.toml if needed.

| Measurement | Description |
|---|---|
| selfmon_gvm | Statistics about the Go Virtual Machine |
| selfmon_device_stats | Statistics from each gathered device |
| selfmon_outdb_stats | Statistics for each output db |
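The effect of the prefix setting on these names can be pictured as follows. This sketch assumes the prefix is prepended directly to the measurement name with no separator, which is not stated on this page; `prefixed_names` and the `lab_` prefix are hypothetical.

```python
SELFMON_MEASUREMENTS = (
    "selfmon_gvm",
    "selfmon_device_stats",
    "selfmon_outdb_stats",
)


def prefixed_names(prefix=""):
    """Return the measurement names as they would be queried in the backend.

    Assumption: the prefix from [selfmon] is simply prepended to the name.
    """
    return [f"{prefix}{name}" for name in SELFMON_MEASUREMENTS]


if __name__ == "__main__":
    print(prefixed_names("lab_"))
```

With the default empty prefix, the names are unchanged.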

selfmon_gvm

| FieldName | Source | Unit | Description |
|---|---|---|---|
| runtime_goroutines | runtime.NumGoroutine() | number | Number of currently running goroutines |
| mem.alloc | runtime.ReadMemStats.Alloc | bytes | Total bytes allocated |
| mem.mallocs | runtime.ReadMemStats.Mallocs | mallocs per second | Number of mallocs issued to the system |
| mem.frees | runtime.ReadMemStats.Frees | frees per second | Number of frees issued to the system |
| mem.heapAlloc | runtime.ReadMemStats.HeapAlloc | bytes | Bytes of allocated heap objects |
| mem.stackInuse | runtime.ReadMemStats.StackInuse | bytes | Bytes in stack spans. In-use stack spans have at least one stack in them. These spans can only be used for other stacks of the same size. There is no StackIdle because unused stack spans are returned to the heap (and hence counted toward HeapIdle). |
| gc.total_pause_ns | memStats.PauseTotalNs | ms | Accumulated GC pause time, in ms |
| gc.pause_per_interval | memStats.PauseTotalNs | ms/interval | Accumulated GC pause time, in ms, since the last gathered statistic |
| gc.pause_per_second | memStats.PauseTotalNs | ms/second | Accumulated GC pause time, in ms per second (normalized) |
| gc.gc_per_interval | memStats.NumGC | #gc/interval | Number of GCs since the last gathered statistic |
| gc.gc_per_second | memStats.NumGC | #gc/second | Number of GCs per second (normalized) |
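The per-interval and per-second GC fields can be understood as differences between two consecutive samples of the cumulative counters. The sketch below works through that arithmetic. It assumes a plain nanosecond-to-millisecond conversion and division by the sampling interval in seconds; the function name `gc_rates` and the sample numbers are hypothetical, and the real computation in snmpcollector may differ in detail.

```python
def gc_rates(prev_pause_ns, cur_pause_ns, prev_num_gc, cur_num_gc, interval_s):
    """Derive the gc.* interval and per-second fields from two samples.

    prev_*/cur_* are consecutive readings of the cumulative counters
    PauseTotalNs and NumGC; interval_s is the time between them in seconds.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    pause_ms = (cur_pause_ns - prev_pause_ns) / 1e6  # ns -> ms
    gcs = cur_num_gc - prev_num_gc
    return {
        "gc.pause_per_interval": pause_ms,
        "gc.pause_per_second": pause_ms / interval_s,
        "gc.gc_per_interval": gcs,
        "gc.gc_per_second": gcs / interval_s,
    }


if __name__ == "__main__":
    # 6 ms of pause and 6 collections over a 60 s interval.
    print(gc_rates(1_000_000, 7_000_000, 10, 16, 60))
```

Normalizing per second makes the values comparable between instances that use a different self-monitoring frequency.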

selfmon_device_stats [>= 0.12.0]

From 0.12.0 on, statistics are taken from each measurement. A field can apply only to the measurement (M), or also to the device, with a field-specific consolidation over the device period (M/D).

| FieldName | Type | Applies to (M/D) | Description in measurement context | Description in device context |
|---|---|---|---|---|
| active_value | integer | M/D | 0 not active / 1 active, where activation depends on the device config | active for configuration |
| connected_value | integer | M/D | 0 not connected / 1 connected | 0 (not connected) if all measurements in the last gather period appear as not connected |
| snmp_oid_get_all | integer | M/D | All gathered SNMP metrics (sum of snmpget OIDs and all OIDs received in snmpwalk queries) in this measurement | Sum of gathered SNMP metrics for all measurements in the device period |
| snmp_oid_get_processed | integer | M/D | Gathered and processed SNMP metrics after filters are applied (not always sent to the backend; it depends on the report flag) in this measurement | Sum of processed metrics for all measurements in the device period |
| snmp_oid_get_errors | integer | M/D | Number of OIDs with errors for this measurement | Sum of OIDs with errors for all measurements in the device period |
| cycle_gather_start_time | integer | M/D | Last gathered time, as a Unix timestamp | Minimum timestamp for all gathered measurements in the device period |
| cycle_gather_duration | float | M/D | Elapsed time taken to get this measurement's data (in seconds) | Maximum elapsed time for all measurements finished in the device period |
| filter_start_time | integer | M/D | Last applied filter time, as a Unix timestamp | Minimum timestamp for all filtered measurements done in the device period |
| filter_duration | float | M/D | Elapsed time taken to compute all applicable filters on the measurement (in seconds) | Sum of elapsed time for all filters applied on all measurements in the device period |
| backend_sent_start_time | integer | M/D | Last sent time to the internal output buffer, as a Unix timestamp | Minimum timestamp for all sends done in the device period |
| backend_sent_duration | float | M/D | Elapsed time taken to send data to the internal output buffer backend (in seconds) | Sum of elapsed send time for all measurements in the device period |
| metric_sent | integer | M/D | Number of metrics sent (taken as fields) for the measurement | Sum of metrics for each measurement in the device period |
| metric_sent_errors | integer | M/D | Number of metrics (taken as fields) with errors for this measurement | Sum of metrics with errors for each measurement in the device period |
| measurement_sent | integer | M/D | Number of series built to send as a single request to the backend | Sum of series for all measurements in the device period |
| measurement_sent_errors | integer | M/D | Number of series with errors for this measurement | Sum of series with errors in the device period |

| TagName | Description |
|---|---|
| active | "true" (1 in the active_value field) or "false" (0 in the active_value field), as a tag for fast filtering |
| connected | "true" (1 in the connected_value field) or "false" (0 in the connected_value field), as a tag for fast filtering |
| device | Device name to which this statistic applies |
| meas_name | Name of the measurement to which this statistic applies (measurement type only) |
| type | Either measurement (applied in the measurement context) or device (applied in the device context, as a consolidation of the measurement stats) |
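The device-context descriptions amount to one consolidation rule per field: sum, minimum, maximum, or "any connected". The sketch below applies those rules to a list of per-measurement statistics. It is an illustration derived from the descriptions in the table, not snmpcollector's code; the names `RULES` and `consolidate_device` are hypothetical, and active_value is left out because its device-context rule is not spelled out here.

```python
def _any_connected(values):
    # Device is 0 (not connected) only if every measurement is not connected.
    return int(any(values))


# Consolidation rule per field, following the device-context descriptions.
RULES = {
    "connected_value": _any_connected,
    "snmp_oid_get_all": sum,
    "snmp_oid_get_processed": sum,
    "snmp_oid_get_errors": sum,
    "cycle_gather_start_time": min,
    "cycle_gather_duration": max,
    "filter_start_time": min,
    "filter_duration": sum,
    "backend_sent_start_time": min,
    "backend_sent_duration": sum,
    "metric_sent": sum,
    "metric_sent_errors": sum,
    "measurement_sent": sum,
    "measurement_sent_errors": sum,
}


def consolidate_device(measurements):
    """Build device-level stats from a list of per-measurement stat dicts."""
    if not measurements:
        raise ValueError("at least one measurement is required")
    return {
        field: rule([m[field] for m in measurements])
        for field, rule in RULES.items()
    }
```

A useful consequence is that a device-level cycle_gather_duration close to the polling frequency points at the slowest measurement, not at the sum of all of them.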

NOTE: each statistic can carry additional tags from the extratags setting in the config file and, if inheritdevicetags = true, from the device.
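One way to picture the resulting tag set is as a merge of three sources. The sketch below is an assumption-heavy illustration: the function name `stat_tags`, the sample tag values, and in particular the precedence (built-in tags override device tags, which override extratags) are not documented on this page.

```python
def stat_tags(builtin, extratags, device_tags, inheritdevicetags):
    """Merge the tag sources for one selfmon_device_stats point.

    Assumed precedence (not confirmed by the docs): extratags are applied
    first, then device tags when inherited, then the built-in tags.
    """
    tags = dict(extratags)
    if inheritdevicetags:
        tags.update(device_tags)
    tags.update(builtin)
    return tags


if __name__ == "__main__":
    print(
        stat_tags(
            {"device": "router01", "type": "device"},
            {"instance": "snmpcollector01"},
            {"site": "dc1"},
            inheritdevicetags=True,
        )
    )
```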

selfmon_device_stats [< 0.12.0]

| FieldName | Description |
|---|---|
| snmp_oid_get_all | All gathered SNMP metrics (sum of snmpget OIDs and all OIDs received in snmpwalk queries) |
| snmp_oid_get_processed | Gathered and processed SNMP metrics after filters are applied (not always sent to the backend; it depends on the report flag) |
| snmp_oid_get_errors | Number of OIDs with errors for all measurements |
| cycle_gather_start_time | Last gathered time, as a Unix timestamp |
| cycle_gather_duration | Elapsed time taken to get all measurement info, in seconds |
| filter_start_time | Last applied filter time, as a Unix timestamp |
| filter_duration | Elapsed time taken to compute all applicable filters on the device, in seconds |
| backend_sent_start_time | Last sent time to the internal output buffer |
| backend_sent_Duration | Elapsed time taken to send data to the internal output buffer backend |
| metric_sent | Number of metrics sent (taken as fields) for all measurements |
| metric_sent_errors | Number of metrics (taken as fields) with errors for all measurements |
| measurement_sent | Number of series built to send as a single request to the backend |
| measurement_sent_errors | Number of series built to send as a single request, with errors, for all measurements |

selfmon_outdb_stats

| Field | Description |
|---|---|
| write_sent | Number of HTTP writes sent to the DB (each write sends a BatchPoint) in the last period |
| write_error | Number of HTTP write errors in the period |
| points_sent | Number of points sent on each write (on each BatchPoint) in the last period |
| points_sent_max | Maximum number of points sent over all writes in the last period |
| points_sent_avg | (only if write_sent > 0) Average points sent over all writes in the last period |
| write_time | Sum of all HTTP response times over all writes in the last period |
| write_time_max | Maximum HTTP response time over all writes in the last period |
| write_time_avg | (only if write_sent > 0) Average response time over all writes in the last period |
| fields_sent | Number of fields sent to the DB in the last period |
| fields_sent_max | Maximum number of fields sent to the DB in the last period |
| buffer_percent_used | Percentage of the total buffer used for each. |
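The relationship between the sum, max, and avg fields can be sketched as a small aggregation over the writes of one period. The code is illustrative only: `outdb_stats` and its input shape (a list of `(points, seconds, ok)` tuples) are hypothetical, and counting failed writes in write_sent is an assumption. The avg fields are omitted when write_sent is 0, as the table states.

```python
def outdb_stats(writes):
    """Aggregate one period of writes into selfmon_outdb_stats-like fields.

    writes: list of (points, response_seconds, ok) tuples, one per HTTP write.
    """
    points = [p for p, _, _ in writes]
    times = [t for _, t, _ in writes]
    stats = {
        "write_sent": len(writes),
        "write_error": sum(1 for _, _, ok in writes if not ok),
        "points_sent": sum(points),
        "points_sent_max": max(points, default=0),
        "write_time": sum(times),
        "write_time_max": max(times, default=0.0),
    }
    # Averages are only defined when at least one write happened.
    if stats["write_sent"] > 0:
        stats["points_sent_avg"] = stats["points_sent"] / stats["write_sent"]
        stats["write_time_avg"] = stats["write_time"] / stats["write_sent"]
    return stats


if __name__ == "__main__":
    print(outdb_stats([(100, 0.2, True), (300, 0.4, True)]))
```

When graphing the avg fields, expect gaps for periods in which no write was sent.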