Provide metrics for monitoring tools #2904
I have already bothered @hjoliver and @dwsutherland with questions about monitoring Cylc. During the Melbourne workshop there were some talks that involved monitoring/reporting, but only in a push-based fashion.

This is a placeholder issue for discussion around monitoring in Cylc. I built something in Melbourne, but it is not ready to be reviewed by others yet, so I will use this issue to explain my line of thinking, and then later ask for review to see whether it would be useful for others or not.
### Push based monitoring with Graphite

In the past I used Graphite (plus statsd/collectd/jolokia/JMX/etc.) with pushed metrics. This was normally done by adding a cron job or a daemon somewhere, or by using application extension points such as event handlers, listeners, or plugins. The daemon or cron job would collect the metrics and push them to the monitoring system, which normally uses some kind of round-robin time series database. Alternatively, the application may send the metrics through something like an event handler that writes them to syslog, a database table, a JMS queue, or another messaging system (Kafka/RabbitMQ/etc.). But this can be a bit hard to scale: a Java cluster with 4 machines and a few hundred users is enough to put some stress on the monitoring end. There were even set-up examples with statsd acting as a network buffer, accumulating several metrics, summarizing them, and only then passing them along the network, to reduce the final load on the messaging systems or monitoring server.
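As a minimal sketch of that push model, assuming a Graphite/Carbon plaintext listener on the conventional port 2003 (the host and metric path below are made up):

```python
# Minimal sketch of the push model: send one metric to a Graphite/Carbon
# plaintext listener. Host, port, and metric path are assumptions here.
import socket
import time


def push_metric(path, value, host='localhost', port=2003):
    """Push a single metric using Graphite's plaintext protocol."""
    message = f'{path} {value} {int(time.time())}\n'
    with socket.create_connection((host, port)) as sock:
        sock.sendall(message.encode('ascii'))


# Typically called from a cron job, a daemon, or an application event handler:
push_metric('cluster.node1.active_users', 42)
```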
### Pull based monitoring with Prometheus

A different approach, which requires less tooling, is a pull-based one, where you have a very cheap web service endpoint (which in Prometheus defaults to `/metrics`). JupyterHub comes with the Prometheus client, and exposes several metrics. There are built-in metric types in Prometheus such as Histogram, Gauge, Counter, and Summary. Here's an example of how to measure a method's execution time in Python with Prometheus, adapted from their README:

```python
from prometheus_client import Summary
import time
# Create a metric to track time spent and requests made.
REQUEST_TIME = Summary('request_processing_seconds', 'Time spent processing request')
# Decorate function with metric.
@REQUEST_TIME.time()
def process_request(t):
    """A dummy function that takes some time."""
    time.sleep(t)
```

The decorator makes sure to track a summary of the request times (or you could have used another metric type). Then you have a normal endpoint, in Tornado for instance, to expose all the metrics collected. Here's what JupyterHub does:

```python
from prometheus_client import REGISTRY, CONTENT_TYPE_LATEST, generate_latest
from tornado import gen
from .base import BaseHandler
from ..utils import metrics_authentication
class MetricsHandler(BaseHandler):
    """
    Handler to serve Prometheus metrics
    """
    @metrics_authentication
    async def get(self):
        self.set_header('Content-Type', CONTENT_TYPE_LATEST)
        self.write(generate_latest(REGISTRY))


default_handlers = [
    (r'/metrics$', MetricsHandler)
]
```

Here's what the metrics in JupyterHub look like, from my local notebook running JupyterHub.
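To give a flavour of the text format such an endpoint returns, here is a minimal sketch using the built-in metric types mentioned above (the metric names and values are made up, not JupyterHub's actual metrics):

```python
# Hedged sketch: the metric names and values are illustrative only.
from prometheus_client import Counter, Gauge, Histogram, REGISTRY, generate_latest

TOTAL_REQUESTS = Counter('requests', 'Total requests served')
ACTIVE_USERS = Gauge('active_users', 'Users currently logged in')
REQUEST_LATENCY = Histogram('request_latency_seconds', 'Request latency')

TOTAL_REQUESTS.inc()
ACTIVE_USERS.set(3)
REQUEST_LATENCY.observe(0.42)

# generate_latest renders the text exposition format, with lines such as:
#   # HELP active_users Users currently logged in
#   # TYPE active_users gauge
#   active_users 3.0
print(generate_latest(REGISTRY).decode('utf-8'))
```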
### Push vs. Pull

As with most technologies, there is no silver bullet. There are even solutions that have a push-based daemon sending metrics to Prometheus (the Pushgateway), creating a hybrid set-up. But metrics endpoints like the one above are very cheap for open source tools: they let you expose several metrics without having to modify a lot of your code, and users can simply plug in whatever tool they prefer to retrieve the metrics.
Prometheus also ships with an alerting system (Alertmanager) that can be used to send notifications when a metric crosses a threshold, but other systems can be plugged in too. The time series DB and the metrics can then be used for alerting, plotting, and further analysis.
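For example, a hedged sketch of a Prometheus alerting rules file (the metric name and threshold are made up; Alertmanager still needs to be configured separately to route the notifications):

```yaml
# Sketch of a Prometheus alerting rules file; the metric name and
# threshold are assumptions for illustration.
groups:
  - name: example
    rules:
      - alert: TooManyActiveUsers
        expr: active_users > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "active_users has been above 100 for 5 minutes"
```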
An example of plotting the metrics with Grafana (the default visualisation tool for Prometheus): https://prometheus.io/docs/visualization/grafana/. There are also plugins capable of displaying metrics as diagrams, such as Gantt charts (e.g. https://grafana.com/plugins/jdbranham-diagram-panel). And here is an example of users reporting issues with screenshots of the metrics/monitoring (useful for maintainers familiar with the tool?): jupyterhub/mybinder.org-deploy#350.
Good idea(s) @kinow 👍 - definitely worth pursuing. The metrics of most interest to Cylc users (or site admins) are job execution time, job queue time, job failures, number of jobs per day, etc. I suppose the "pull" variant you describe above would require a back-end service that scrapes that information from individual suite DBs, in order to extract the data for these metrics.
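A hedged sketch of what such a back-end service could look like, assuming the suite DBs are SQLite files; the table and column names are illustrative, not the actual suite DB schema:

```python
# Sketch of a collector that scrapes per-suite SQLite DBs and exposes
# aggregate metrics via the Prometheus client.
import sqlite3
from prometheus_client import Gauge

JOB_FAILURES = Gauge('suite_job_failures', 'Number of failed jobs', ['suite'])


def collect_suite_metrics(suite_name, db_path):
    """Read one suite DB and update the exported metrics."""
    conn = sqlite3.connect(db_path)
    try:
        # Hypothetical query; adapt to the real suite DB schema.
        (failures,) = conn.execute(
            "SELECT COUNT(*) FROM task_jobs WHERE run_status != 0"
        ).fetchone()
        JOB_FAILURES.labels(suite=suite_name).set(failures)
    finally:
        conn.close()
```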
(those Grafana examples look very nice)
That's correct. You basically avoid promising to deliver a package to some server, and instead tell users the metrics will be here, so just come and fetch them. If you download Prometheus, editing the prometheus.yml configuration file to add the endpoint as a scrape target is all that is needed.
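A minimal sketch of such a configuration, assuming a hub running on localhost:8000 (the job name, path, and target are assumptions):

```yaml
# Minimal prometheus.yml sketch; job name, path, and target are assumptions.
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'jupyterhub'
    # Prometheus requests /metrics on each target by default; JupyterHub
    # serves its handlers under the hub prefix, so the path may need adjusting.
    metrics_path: '/hub/metrics'
    static_configs:
      - targets: ['localhost:8000']
```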
Graphite's stack was all Python, with Twisted at its core (a precursor of asyncio, I believe). Prometheus was written in Go, which allows the collection loop to be better optimized (related talk).
A nice tool for integration with CI: it helps identify when performance characteristics change, and track down regressions that lie within the "jitter" by scanning back through the graphs.
(Probably move to UIS)
Going to close this cylc-flow issue and continue to track in cylc/cylc-admin#132.