Provide metrics for monitoring tools #2904

Closed
kinow opened this issue Dec 12, 2018 · 11 comments
Labels
speculative blue-skies ideas

Comments

@kinow
Member

kinow commented Dec 12, 2018

I have already bothered @hjoliver and @dwsutherland with questions regarding monitoring Cylc. During the Melbourne workshop there were some talks that involved monitoring/reporting, but only in a push-based fashion.

This is a placeholder issue for discussion around monitoring in Cylc. I built something in Melbourne, but it is not ready to be reviewed by others yet, so I will use it to explain my line of thinking, and then later ask for review to see whether it would be useful for others or not.

@kinow
Member Author

kinow commented Dec 12, 2018

Push-based monitoring with Graphite

In the past I used Graphite (plus statsd/collectd/jolokia/jmx/etc) with metrics being pushed to it. This was normally done by adding a cron job or a daemon somewhere, or by using application extension points such as event handlers, listeners, or plugins.

The daemon or cron job would then collect the metrics and push them to the monitoring system, which normally uses a type of round-robin time series database. Alternatively, the application may send the metrics itself, through something like an event handler that writes each metric to syslog, a database table, a JMS queue, or another messaging system (like Kafka/RabbitMQ/etc).
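As a rough sketch of the push model described above, the snippet below formats one data point in Graphite's plaintext protocol (`<path> <value> <timestamp>`, one per line) and sends it over TCP. The host name and the metric path are made up for illustration; port 2003 is the usual default for carbon's plaintext listener, but your installation may differ.

```python
import socket
import time


def format_graphite_line(path, value, timestamp=None):
    """Build one line of Graphite's plaintext protocol, newline-terminated."""
    if timestamp is None:
        timestamp = time.time()
    return f"{path} {value} {int(timestamp)}\n"


def push_metric(path, value, host="graphite.example.org", port=2003):
    """Open a TCP connection to carbon and push a single data point.

    The host is a placeholder; nothing in this thread names a real server.
    """
    line = format_graphite_line(path, value)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


# What a cron job or daemon would build on every tick (hypothetical metric name).
example_line = format_graphite_line("cylc.suite.tasks_running", 12, 1544652775)
print(example_line, end="")
```

Every data point costs the sender a network write, which is where the scaling concerns with push-based setups come from.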

But this can be a bit hard to scale. A Java cluster with 4 machines and a few hundred users is enough to put some stress on the monitoring end.

There were even some set-up examples with StatsD as a network buffer, accumulating several metrics, summarizing them, and then passing the result along the network, to reduce the final load on the messaging systems or monitoring server.
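The buffering idea can be sketched in a few lines. This is not StatsD itself, just a toy in-process accumulator (the `CounterBuffer` class and metric names are hypothetical) that sums counter increments locally and emits one StatsD-style counter line (`name:value|c`) per metric on flush.

```python
from collections import defaultdict


class CounterBuffer:
    """Accumulate counter increments locally and emit one line per metric on flush."""

    def __init__(self):
        self._counts = defaultdict(int)

    def incr(self, name, amount=1):
        self._counts[name] += amount

    def flush(self):
        """Return StatsD-style counter lines ('name:value|c') and reset the buffer."""
        lines = [f"{name}:{value}|c" for name, value in sorted(self._counts.items())]
        self._counts.clear()
        return lines


buf = CounterBuffer()
for _ in range(1000):
    buf.incr("requests")   # 1000 events observed locally...
buf.incr("errors", 3)
payload = buf.flush()      # ...but only two lines would cross the network
print(payload)
```

The trade-off is resolution: whatever happens between two flushes is collapsed into a single number per metric.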

@kinow
Member Author

kinow commented Dec 12, 2018

Pull-based monitoring with Prometheus

A different approach, which requires less tooling, is pull-based: the application exposes a very cheap web service endpoint, which in Prometheus defaults to /metrics, and the monitoring server scrapes it periodically.

JupyterHub is instrumented with the Prometheus Python client and exposes several metrics. The client has built-in metric types such as Counter, Gauge, Summary, and Histogram.
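Of these types, the histogram is the least obvious, because its buckets are cumulative: each `le` ("less than or equal") bucket counts every observation at or below its upper bound. The toy class below is a standard-library sketch of that bookkeeping, not the prometheus_client implementation; the class name and bucket bounds are invented for illustration.

```python
import bisect


class SketchHistogram:
    """Toy histogram with cumulative 'le' (less-than-or-equal) buckets."""

    def __init__(self, bounds):
        # A final +Inf bucket catches everything above the largest bound.
        self.bounds = sorted(bounds) + [float("inf")]
        self.bucket_counts = [0] * len(self.bounds)
        self.count = 0
        self.total = 0.0

    def observe(self, value):
        # Index of the first bucket whose upper bound is >= value.
        index = bisect.bisect_left(self.bounds, value)
        self.bucket_counts[index] += 1
        self.count += 1
        self.total += value

    def cumulative(self):
        """Return (upper_bound, cumulative_count) pairs, as a scraper would see them."""
        running, result = 0, []
        for bound, n in zip(self.bounds, self.bucket_counts):
            running += n
            result.append((bound, running))
        return result


hist = SketchHistogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.3, 2.0):
    hist.observe(seconds)
print(hist.cumulative())
```

This is why, in a histogram's exposed output, the `+Inf` bucket always equals the `_count` series for the same labels.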

Here's an example of how to measure the time spent in a function in Python with Prometheus, adapted from the client's README:

from prometheus_client import Summary
import time

# Create a metric to track time spent and requests made.
REQUEST_TIME = Summary('request_processing_seconds', 'Time spent processing request')

# Decorate function with metric.
@REQUEST_TIME.time()
def process_request(t):
    """A dummy function that takes some time."""
    time.sleep(t)

The decorator records the duration of each call in the Summary, which tracks the number of observations and their total time (you could have used another metric type, such as a Histogram, instead).
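To make it clearer what such a decorator does, here is a standard-library sketch of the same pattern. `SketchSummary` is a made-up stand-in, not the prometheus_client `Summary` class; it only keeps a count and a running sum of durations.

```python
import functools
from time import perf_counter, sleep


class SketchSummary:
    """Toy stand-in for a summary metric: tracks observation count and sum."""

    def __init__(self, name):
        self.name = name
        self.count = 0
        self.total = 0.0

    def observe(self, seconds):
        self.count += 1
        self.total += seconds

    def time(self):
        """Return a decorator that records the wrapped function's wall-clock duration."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                start = perf_counter()
                try:
                    return func(*args, **kwargs)
                finally:
                    # Record the duration even if the function raises.
                    self.observe(perf_counter() - start)
            return wrapper
        return decorator


REQUEST_TIME = SketchSummary("request_processing_seconds")


@REQUEST_TIME.time()
def process_request(t):
    """A dummy function that takes some time."""
    sleep(t)


process_request(0.01)
process_request(0.01)
print(REQUEST_TIME.count, REQUEST_TIME.total)
```

Dividing the sum by the count gives the average duration, which is roughly what a monitoring server derives from the exposed `_sum` and `_count` series.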

Then you have a normal endpoint in Tornado, for instance, to expose all the metrics collected. Here's what JupyterHub does:

from prometheus_client import REGISTRY, CONTENT_TYPE_LATEST, generate_latest
from tornado import gen

from .base import BaseHandler
from ..utils import metrics_authentication

class MetricsHandler(BaseHandler):
    """
    Handler to serve Prometheus metrics
    """
    @metrics_authentication
    async def get(self):
        self.set_header('Content-Type', CONTENT_TYPE_LATEST)
        self.write(generate_latest(REGISTRY))

default_handlers = [
    (r'/metrics$', MetricsHandler)
]

metrics_authentication is a JupyterHub decorator that protects the endpoint when the corresponding configuration option is enabled. Other than that, this is all it takes to expose the metrics.
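For comparison, here is a minimal sketch of a pull endpoint that uses only the Python standard library, with no Tornado and no prometheus_client. The gauge names and values are hypothetical, and only the simplest part of the Prometheus text format is rendered (HELP and TYPE comments followed by a sample line).

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory gauges; a real application would update these as it runs.
GAUGES = {
    "tasks_running": ("Number of tasks currently running.", 4),
    "tasks_failed": ("Number of tasks in the failed state.", 1),
}


def render_metrics(gauges):
    """Render gauges as HELP/TYPE comments plus one sample line each."""
    lines = []
    for name, (help_text, value) in sorted(gauges.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {float(value)}")
    return "\n".join(lines) + "\n"


class SketchMetricsHandler(BaseHTTPRequestHandler):
    """Serve the rendered metrics on GET /metrics and 404 on anything else."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(GAUGES).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever serving metrics; the port is an arbitrary choice."""
    HTTPServer(("127.0.0.1", port), SketchMetricsHandler).serve_forever()


print(render_metrics(GAUGES), end="")
```

Calling `serve()` and then fetching `http://127.0.0.1:8000/metrics` should return the same text that the final `print` shows. The application does no pushing at all; it only answers when the monitoring server asks.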

Here's what the JupyterHub metrics look like on my local machine, running jupyterhub from the command line with default settings and starting one notebook:

# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 274182144.0
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 60334080.0
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1544652775.28
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 1.98
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 13.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1024.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="7",patchlevel="0",version="3.7.0"} 1.0
# HELP request_duration_seconds request duration for all HTTP requests
# TYPE request_duration_seconds histogram
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.PrefixRedirectHandler",le="0.005",method="GET"} 3.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.PrefixRedirectHandler",le="0.01",method="GET"} 3.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.PrefixRedirectHandler",le="0.025",method="GET"} 3.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.PrefixRedirectHandler",le="0.05",method="GET"} 3.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.PrefixRedirectHandler",le="0.075",method="GET"} 3.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.PrefixRedirectHandler",le="0.1",method="GET"} 3.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.PrefixRedirectHandler",le="0.25",method="GET"} 3.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.PrefixRedirectHandler",le="0.5",method="GET"} 3.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.PrefixRedirectHandler",le="0.75",method="GET"} 3.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.PrefixRedirectHandler",le="1.0",method="GET"} 3.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.PrefixRedirectHandler",le="2.5",method="GET"} 3.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.PrefixRedirectHandler",le="5.0",method="GET"} 3.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.PrefixRedirectHandler",le="7.5",method="GET"} 3.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.PrefixRedirectHandler",le="10.0",method="GET"} 3.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.PrefixRedirectHandler",le="+Inf",method="GET"} 3.0
request_duration_seconds_count{code="302",handler="jupyterhub.handlers.base.PrefixRedirectHandler",method="GET"} 3.0
request_duration_seconds_sum{code="302",handler="jupyterhub.handlers.base.PrefixRedirectHandler",method="GET"} 0.006404399871826172
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.AddSlashHandler",le="0.005",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.AddSlashHandler",le="0.01",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.AddSlashHandler",le="0.025",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.AddSlashHandler",le="0.05",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.AddSlashHandler",le="0.075",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.AddSlashHandler",le="0.1",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.AddSlashHandler",le="0.25",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.AddSlashHandler",le="0.5",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.AddSlashHandler",le="0.75",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.AddSlashHandler",le="1.0",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.AddSlashHandler",le="2.5",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.AddSlashHandler",le="5.0",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.AddSlashHandler",le="7.5",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.AddSlashHandler",le="10.0",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.AddSlashHandler",le="+Inf",method="GET"} 1.0
request_duration_seconds_count{code="302",handler="jupyterhub.handlers.base.AddSlashHandler",method="GET"} 1.0
request_duration_seconds_sum{code="302",handler="jupyterhub.handlers.base.AddSlashHandler",method="GET"} 0.0016713142395019531
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.pages.RootHandler",le="0.005",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.pages.RootHandler",le="0.01",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.pages.RootHandler",le="0.025",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.pages.RootHandler",le="0.05",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.pages.RootHandler",le="0.075",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.pages.RootHandler",le="0.1",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.pages.RootHandler",le="0.25",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.pages.RootHandler",le="0.5",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.pages.RootHandler",le="0.75",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.pages.RootHandler",le="1.0",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.pages.RootHandler",le="2.5",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.pages.RootHandler",le="5.0",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.pages.RootHandler",le="7.5",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.pages.RootHandler",le="10.0",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.pages.RootHandler",le="+Inf",method="GET"} 1.0
request_duration_seconds_count{code="302",handler="jupyterhub.handlers.pages.RootHandler",method="GET"} 1.0
request_duration_seconds_sum{code="302",handler="jupyterhub.handlers.pages.RootHandler",method="GET"} 0.0017647743225097656
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.login.LoginHandler",le="0.005",method="GET"} 0.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.login.LoginHandler",le="0.01",method="GET"} 0.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.login.LoginHandler",le="0.025",method="GET"} 0.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.login.LoginHandler",le="0.05",method="GET"} 0.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.login.LoginHandler",le="0.075",method="GET"} 0.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.login.LoginHandler",le="0.1",method="GET"} 0.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.login.LoginHandler",le="0.25",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.login.LoginHandler",le="0.5",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.login.LoginHandler",le="0.75",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.login.LoginHandler",le="1.0",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.login.LoginHandler",le="2.5",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.login.LoginHandler",le="5.0",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.login.LoginHandler",le="7.5",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.login.LoginHandler",le="10.0",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.login.LoginHandler",le="+Inf",method="GET"} 1.0
request_duration_seconds_count{code="200",handler="jupyterhub.handlers.login.LoginHandler",method="GET"} 1.0
request_duration_seconds_sum{code="200",handler="jupyterhub.handlers.login.LoginHandler",method="GET"} 0.2296288013458252
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.static.CacheControlStaticFilesHandler",le="0.005",method="GET"} 3.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.static.CacheControlStaticFilesHandler",le="0.01",method="GET"} 5.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.static.CacheControlStaticFilesHandler",le="0.025",method="GET"} 5.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.static.CacheControlStaticFilesHandler",le="0.05",method="GET"} 5.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.static.CacheControlStaticFilesHandler",le="0.075",method="GET"} 5.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.static.CacheControlStaticFilesHandler",le="0.1",method="GET"} 5.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.static.CacheControlStaticFilesHandler",le="0.25",method="GET"} 5.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.static.CacheControlStaticFilesHandler",le="0.5",method="GET"} 5.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.static.CacheControlStaticFilesHandler",le="0.75",method="GET"} 5.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.static.CacheControlStaticFilesHandler",le="1.0",method="GET"} 5.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.static.CacheControlStaticFilesHandler",le="2.5",method="GET"} 5.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.static.CacheControlStaticFilesHandler",le="5.0",method="GET"} 5.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.static.CacheControlStaticFilesHandler",le="7.5",method="GET"} 5.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.static.CacheControlStaticFilesHandler",le="10.0",method="GET"} 5.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.static.CacheControlStaticFilesHandler",le="+Inf",method="GET"} 5.0
request_duration_seconds_count{code="200",handler="jupyterhub.handlers.static.CacheControlStaticFilesHandler",method="GET"} 5.0
request_duration_seconds_sum{code="200",handler="jupyterhub.handlers.static.CacheControlStaticFilesHandler",method="GET"} 0.018769264221191406
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.static.LogoHandler",le="0.005",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.static.LogoHandler",le="0.01",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.static.LogoHandler",le="0.025",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.static.LogoHandler",le="0.05",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.static.LogoHandler",le="0.075",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.static.LogoHandler",le="0.1",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.static.LogoHandler",le="0.25",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.static.LogoHandler",le="0.5",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.static.LogoHandler",le="0.75",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.static.LogoHandler",le="1.0",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.static.LogoHandler",le="2.5",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.static.LogoHandler",le="5.0",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.static.LogoHandler",le="7.5",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.static.LogoHandler",le="10.0",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.static.LogoHandler",le="+Inf",method="GET"} 1.0
request_duration_seconds_count{code="200",handler="jupyterhub.handlers.static.LogoHandler",method="GET"} 1.0
request_duration_seconds_sum{code="200",handler="jupyterhub.handlers.static.LogoHandler",method="GET"} 0.002638101577758789
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.login.LoginHandler",le="0.005",method="POST"} 0.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.login.LoginHandler",le="0.01",method="POST"} 0.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.login.LoginHandler",le="0.025",method="POST"} 0.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.login.LoginHandler",le="0.05",method="POST"} 0.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.login.LoginHandler",le="0.075",method="POST"} 0.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.login.LoginHandler",le="0.1",method="POST"} 0.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.login.LoginHandler",le="0.25",method="POST"} 0.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.login.LoginHandler",le="0.5",method="POST"} 0.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.login.LoginHandler",le="0.75",method="POST"} 0.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.login.LoginHandler",le="1.0",method="POST"} 0.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.login.LoginHandler",le="2.5",method="POST"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.login.LoginHandler",le="5.0",method="POST"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.login.LoginHandler",le="7.5",method="POST"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.login.LoginHandler",le="10.0",method="POST"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.login.LoginHandler",le="+Inf",method="POST"} 1.0
request_duration_seconds_count{code="302",handler="jupyterhub.handlers.login.LoginHandler",method="POST"} 1.0
request_duration_seconds_sum{code="302",handler="jupyterhub.handlers.login.LoginHandler",method="POST"} 2.2172911167144775
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.hub.RootAPIHandler",le="0.005",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.hub.RootAPIHandler",le="0.01",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.hub.RootAPIHandler",le="0.025",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.hub.RootAPIHandler",le="0.05",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.hub.RootAPIHandler",le="0.075",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.hub.RootAPIHandler",le="0.1",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.hub.RootAPIHandler",le="0.25",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.hub.RootAPIHandler",le="0.5",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.hub.RootAPIHandler",le="0.75",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.hub.RootAPIHandler",le="1.0",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.hub.RootAPIHandler",le="2.5",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.hub.RootAPIHandler",le="5.0",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.hub.RootAPIHandler",le="7.5",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.hub.RootAPIHandler",le="10.0",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.hub.RootAPIHandler",le="+Inf",method="GET"} 1.0
request_duration_seconds_count{code="200",handler="jupyterhub.apihandlers.hub.RootAPIHandler",method="GET"} 1.0
request_duration_seconds_sum{code="200",handler="jupyterhub.apihandlers.hub.RootAPIHandler",method="GET"} 0.00177001953125
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.UserSpawnHandler",le="0.005",method="GET"} 0.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.UserSpawnHandler",le="0.01",method="GET"} 0.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.UserSpawnHandler",le="0.025",method="GET"} 0.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.UserSpawnHandler",le="0.05",method="GET"} 0.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.UserSpawnHandler",le="0.075",method="GET"} 0.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.UserSpawnHandler",le="0.1",method="GET"} 0.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.UserSpawnHandler",le="0.25",method="GET"} 0.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.UserSpawnHandler",le="0.5",method="GET"} 0.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.UserSpawnHandler",le="0.75",method="GET"} 0.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.UserSpawnHandler",le="1.0",method="GET"} 0.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.UserSpawnHandler",le="2.5",method="GET"} 0.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.UserSpawnHandler",le="5.0",method="GET"} 0.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.UserSpawnHandler",le="7.5",method="GET"} 0.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.UserSpawnHandler",le="10.0",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.UserSpawnHandler",le="+Inf",method="GET"} 1.0
request_duration_seconds_count{code="302",handler="jupyterhub.handlers.base.UserSpawnHandler",method="GET"} 1.0
request_duration_seconds_sum{code="302",handler="jupyterhub.handlers.base.UserSpawnHandler",method="GET"} 8.127685308456421
request_duration_seconds_bucket{code="302",handler="jupyterhub.apihandlers.auth.OAuthAuthorizeHandler",le="0.005",method="GET"} 0.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.apihandlers.auth.OAuthAuthorizeHandler",le="0.01",method="GET"} 0.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.apihandlers.auth.OAuthAuthorizeHandler",le="0.025",method="GET"} 0.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.apihandlers.auth.OAuthAuthorizeHandler",le="0.05",method="GET"} 0.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.apihandlers.auth.OAuthAuthorizeHandler",le="0.075",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.apihandlers.auth.OAuthAuthorizeHandler",le="0.1",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.apihandlers.auth.OAuthAuthorizeHandler",le="0.25",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.apihandlers.auth.OAuthAuthorizeHandler",le="0.5",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.apihandlers.auth.OAuthAuthorizeHandler",le="0.75",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.apihandlers.auth.OAuthAuthorizeHandler",le="1.0",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.apihandlers.auth.OAuthAuthorizeHandler",le="2.5",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.apihandlers.auth.OAuthAuthorizeHandler",le="5.0",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.apihandlers.auth.OAuthAuthorizeHandler",le="7.5",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.apihandlers.auth.OAuthAuthorizeHandler",le="10.0",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.apihandlers.auth.OAuthAuthorizeHandler",le="+Inf",method="GET"} 1.0
request_duration_seconds_count{code="302",handler="jupyterhub.apihandlers.auth.OAuthAuthorizeHandler",method="GET"} 1.0
request_duration_seconds_sum{code="302",handler="jupyterhub.apihandlers.auth.OAuthAuthorizeHandler",method="GET"} 0.06373739242553711
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.auth.OAuthTokenHandler",le="0.005",method="POST"} 0.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.auth.OAuthTokenHandler",le="0.01",method="POST"} 0.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.auth.OAuthTokenHandler",le="0.025",method="POST"} 0.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.auth.OAuthTokenHandler",le="0.05",method="POST"} 0.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.auth.OAuthTokenHandler",le="0.075",method="POST"} 0.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.auth.OAuthTokenHandler",le="0.1",method="POST"} 0.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.auth.OAuthTokenHandler",le="0.25",method="POST"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.auth.OAuthTokenHandler",le="0.5",method="POST"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.auth.OAuthTokenHandler",le="0.75",method="POST"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.auth.OAuthTokenHandler",le="1.0",method="POST"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.auth.OAuthTokenHandler",le="2.5",method="POST"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.auth.OAuthTokenHandler",le="5.0",method="POST"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.auth.OAuthTokenHandler",le="7.5",method="POST"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.auth.OAuthTokenHandler",le="10.0",method="POST"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.auth.OAuthTokenHandler",le="+Inf",method="POST"} 1.0
request_duration_seconds_count{code="200",handler="jupyterhub.apihandlers.auth.OAuthTokenHandler",method="POST"} 1.0
request_duration_seconds_sum{code="200",handler="jupyterhub.apihandlers.auth.OAuthTokenHandler",method="POST"} 0.1146388053894043
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.auth.TokenAPIHandler",le="0.005",method="GET"} 0.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.auth.TokenAPIHandler",le="0.01",method="GET"} 0.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.auth.TokenAPIHandler",le="0.025",method="GET"} 0.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.auth.TokenAPIHandler",le="0.05",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.auth.TokenAPIHandler",le="0.075",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.auth.TokenAPIHandler",le="0.1",method="GET"} 2.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.auth.TokenAPIHandler",le="0.25",method="GET"} 2.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.auth.TokenAPIHandler",le="0.5",method="GET"} 2.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.auth.TokenAPIHandler",le="0.75",method="GET"} 2.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.auth.TokenAPIHandler",le="1.0",method="GET"} 2.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.auth.TokenAPIHandler",le="2.5",method="GET"} 2.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.auth.TokenAPIHandler",le="5.0",method="GET"} 2.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.auth.TokenAPIHandler",le="7.5",method="GET"} 2.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.auth.TokenAPIHandler",le="10.0",method="GET"} 2.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.auth.TokenAPIHandler",le="+Inf",method="GET"} 2.0
request_duration_seconds_count{code="200",handler="jupyterhub.apihandlers.auth.TokenAPIHandler",method="GET"} 2.0
request_duration_seconds_sum{code="200",handler="jupyterhub.apihandlers.auth.TokenAPIHandler",method="GET"} 0.10728096961975098
# HELP server_spawn_duration_seconds time taken for server spawning operation
# TYPE server_spawn_duration_seconds histogram
server_spawn_duration_seconds_bucket{le="0.5",status="success"} 0.0
server_spawn_duration_seconds_bucket{le="1.0",status="success"} 0.0
server_spawn_duration_seconds_bucket{le="2.5",status="success"} 0.0
server_spawn_duration_seconds_bucket{le="5.0",status="success"} 0.0
server_spawn_duration_seconds_bucket{le="10.0",status="success"} 1.0
server_spawn_duration_seconds_bucket{le="15.0",status="success"} 1.0
server_spawn_duration_seconds_bucket{le="30.0",status="success"} 1.0
server_spawn_duration_seconds_bucket{le="60.0",status="success"} 1.0
server_spawn_duration_seconds_bucket{le="120.0",status="success"} 1.0
server_spawn_duration_seconds_bucket{le="+Inf",status="success"} 1.0
server_spawn_duration_seconds_count{status="success"} 1.0
server_spawn_duration_seconds_sum{status="success"} 8.083470355020836
server_spawn_duration_seconds_bucket{le="0.5",status="failure"} 0.0
server_spawn_duration_seconds_bucket{le="1.0",status="failure"} 0.0
server_spawn_duration_seconds_bucket{le="2.5",status="failure"} 0.0
server_spawn_duration_seconds_bucket{le="5.0",status="failure"} 0.0
server_spawn_duration_seconds_bucket{le="10.0",status="failure"} 0.0
server_spawn_duration_seconds_bucket{le="15.0",status="failure"} 0.0
server_spawn_duration_seconds_bucket{le="30.0",status="failure"} 0.0
server_spawn_duration_seconds_bucket{le="60.0",status="failure"} 0.0
server_spawn_duration_seconds_bucket{le="120.0",status="failure"} 0.0
server_spawn_duration_seconds_bucket{le="+Inf",status="failure"} 0.0
server_spawn_duration_seconds_count{status="failure"} 0.0
server_spawn_duration_seconds_sum{status="failure"} 0.0
server_spawn_duration_seconds_bucket{le="0.5",status="already-pending"} 0.0
server_spawn_duration_seconds_bucket{le="1.0",status="already-pending"} 0.0
server_spawn_duration_seconds_bucket{le="2.5",status="already-pending"} 0.0
server_spawn_duration_seconds_bucket{le="5.0",status="already-pending"} 0.0
server_spawn_duration_seconds_bucket{le="10.0",status="already-pending"} 0.0
server_spawn_duration_seconds_bucket{le="15.0",status="already-pending"} 0.0
server_spawn_duration_seconds_bucket{le="30.0",status="already-pending"} 0.0
server_spawn_duration_seconds_bucket{le="60.0",status="already-pending"} 0.0
server_spawn_duration_seconds_bucket{le="120.0",status="already-pending"} 0.0
server_spawn_duration_seconds_bucket{le="+Inf",status="already-pending"} 0.0
server_spawn_duration_seconds_count{status="already-pending"} 0.0
server_spawn_duration_seconds_sum{status="already-pending"} 0.0
server_spawn_duration_seconds_bucket{le="0.5",status="throttled"} 0.0
server_spawn_duration_seconds_bucket{le="1.0",status="throttled"} 0.0
server_spawn_duration_seconds_bucket{le="2.5",status="throttled"} 0.0
server_spawn_duration_seconds_bucket{le="5.0",status="throttled"} 0.0
server_spawn_duration_seconds_bucket{le="10.0",status="throttled"} 0.0
server_spawn_duration_seconds_bucket{le="15.0",status="throttled"} 0.0
server_spawn_duration_seconds_bucket{le="30.0",status="throttled"} 0.0
server_spawn_duration_seconds_bucket{le="60.0",status="throttled"} 0.0
server_spawn_duration_seconds_bucket{le="120.0",status="throttled"} 0.0
server_spawn_duration_seconds_bucket{le="+Inf",status="throttled"} 0.0
server_spawn_duration_seconds_count{status="throttled"} 0.0
server_spawn_duration_seconds_sum{status="throttled"} 0.0
server_spawn_duration_seconds_bucket{le="0.5",status="too-many-users"} 0.0
server_spawn_duration_seconds_bucket{le="1.0",status="too-many-users"} 0.0
server_spawn_duration_seconds_bucket{le="2.5",status="too-many-users"} 0.0
server_spawn_duration_seconds_bucket{le="5.0",status="too-many-users"} 0.0
server_spawn_duration_seconds_bucket{le="10.0",status="too-many-users"} 0.0
server_spawn_duration_seconds_bucket{le="15.0",status="too-many-users"} 0.0
server_spawn_duration_seconds_bucket{le="30.0",status="too-many-users"} 0.0
server_spawn_duration_seconds_bucket{le="60.0",status="too-many-users"} 0.0
server_spawn_duration_seconds_bucket{le="120.0",status="too-many-users"} 0.0
server_spawn_duration_seconds_bucket{le="+Inf",status="too-many-users"} 0.0
server_spawn_duration_seconds_count{status="too-many-users"} 0.0
server_spawn_duration_seconds_sum{status="too-many-users"} 0.0
# HELP running_servers the number of user servers currently running
# TYPE running_servers gauge
running_servers 1.0
# HELP total_users toal number of users
# TYPE total_users gauge
total_users 0.0
# HELP check_routes_duration_seconds Time taken to validate all routes in proxy
# TYPE check_routes_duration_seconds histogram
check_routes_duration_seconds_bucket{le="0.005"} 3.0
check_routes_duration_seconds_bucket{le="0.01"} 3.0
check_routes_duration_seconds_bucket{le="0.025"} 3.0
check_routes_duration_seconds_bucket{le="0.05"} 4.0
check_routes_duration_seconds_bucket{le="0.075"} 4.0
check_routes_duration_seconds_bucket{le="0.1"} 4.0
check_routes_duration_seconds_bucket{le="0.25"} 4.0
check_routes_duration_seconds_bucket{le="0.5"} 4.0
check_routes_duration_seconds_bucket{le="0.75"} 4.0
check_routes_duration_seconds_bucket{le="1.0"} 4.0
check_routes_duration_seconds_bucket{le="2.5"} 4.0
check_routes_duration_seconds_bucket{le="5.0"} 4.0
check_routes_duration_seconds_bucket{le="7.5"} 4.0
check_routes_duration_seconds_bucket{le="10.0"} 4.0
check_routes_duration_seconds_bucket{le="+Inf"} 4.0
check_routes_duration_seconds_count 4.0
check_routes_duration_seconds_sum 0.03857685689581558
# HELP proxy_add_duration_seconds duration for adding user routes to proxy
# TYPE proxy_add_duration_seconds histogram
proxy_add_duration_seconds_bucket{le="0.005",status="success"} 1.0
proxy_add_duration_seconds_bucket{le="0.01",status="success"} 1.0
proxy_add_duration_seconds_bucket{le="0.025",status="success"} 1.0
proxy_add_duration_seconds_bucket{le="0.05",status="success"} 1.0
proxy_add_duration_seconds_bucket{le="0.075",status="success"} 1.0
proxy_add_duration_seconds_bucket{le="0.1",status="success"} 1.0
proxy_add_duration_seconds_bucket{le="0.25",status="success"} 1.0
proxy_add_duration_seconds_bucket{le="0.5",status="success"} 1.0
proxy_add_duration_seconds_bucket{le="0.75",status="success"} 1.0
proxy_add_duration_seconds_bucket{le="1.0",status="success"} 1.0
proxy_add_duration_seconds_bucket{le="2.5",status="success"} 1.0
proxy_add_duration_seconds_bucket{le="5.0",status="success"} 1.0
proxy_add_duration_seconds_bucket{le="7.5",status="success"} 1.0
proxy_add_duration_seconds_bucket{le="10.0",status="success"} 1.0
proxy_add_duration_seconds_bucket{le="+Inf",status="success"} 1.0
proxy_add_duration_seconds_count{status="success"} 1.0
proxy_add_duration_seconds_sum{status="success"} 0.0027783819823525846
proxy_add_duration_seconds_bucket{le="0.005",status="failure"} 0.0
proxy_add_duration_seconds_bucket{le="0.01",status="failure"} 0.0
proxy_add_duration_seconds_bucket{le="0.025",status="failure"} 0.0
proxy_add_duration_seconds_bucket{le="0.05",status="failure"} 0.0
proxy_add_duration_seconds_bucket{le="0.075",status="failure"} 0.0
proxy_add_duration_seconds_bucket{le="0.1",status="failure"} 0.0
proxy_add_duration_seconds_bucket{le="0.25",status="failure"} 0.0
proxy_add_duration_seconds_bucket{le="0.5",status="failure"} 0.0
proxy_add_duration_seconds_bucket{le="0.75",status="failure"} 0.0
proxy_add_duration_seconds_bucket{le="1.0",status="failure"} 0.0
proxy_add_duration_seconds_bucket{le="2.5",status="failure"} 0.0
proxy_add_duration_seconds_bucket{le="5.0",status="failure"} 0.0
proxy_add_duration_seconds_bucket{le="7.5",status="failure"} 0.0
proxy_add_duration_seconds_bucket{le="10.0",status="failure"} 0.0
proxy_add_duration_seconds_bucket{le="+Inf",status="failure"} 0.0
proxy_add_duration_seconds_count{status="failure"} 0.0
proxy_add_duration_seconds_sum{status="failure"} 0.0
# HELP server_poll_duration_seconds time taken to poll if server is running
# TYPE server_poll_duration_seconds histogram
server_poll_duration_seconds_bucket{le="0.005",status="ServerPollStatus.running"} 0.0
server_poll_duration_seconds_bucket{le="0.01",status="ServerPollStatus.running"} 0.0
server_poll_duration_seconds_bucket{le="0.025",status="ServerPollStatus.running"} 0.0
server_poll_duration_seconds_bucket{le="0.05",status="ServerPollStatus.running"} 0.0
server_poll_duration_seconds_bucket{le="0.075",status="ServerPollStatus.running"} 0.0
server_poll_duration_seconds_bucket{le="0.1",status="ServerPollStatus.running"} 0.0
server_poll_duration_seconds_bucket{le="0.25",status="ServerPollStatus.running"} 0.0
server_poll_duration_seconds_bucket{le="0.5",status="ServerPollStatus.running"} 0.0
server_poll_duration_seconds_bucket{le="0.75",status="ServerPollStatus.running"} 0.0
server_poll_duration_seconds_bucket{le="1.0",status="ServerPollStatus.running"} 0.0
server_poll_duration_seconds_bucket{le="2.5",status="ServerPollStatus.running"} 0.0
server_poll_duration_seconds_bucket{le="5.0",status="ServerPollStatus.running"} 0.0
server_poll_duration_seconds_bucket{le="7.5",status="ServerPollStatus.running"} 0.0
server_poll_duration_seconds_bucket{le="10.0",status="ServerPollStatus.running"} 0.0
server_poll_duration_seconds_bucket{le="+Inf",status="ServerPollStatus.running"} 0.0
server_poll_duration_seconds_count{status="ServerPollStatus.running"} 0.0
server_poll_duration_seconds_sum{status="ServerPollStatus.running"} 0.0
server_poll_duration_seconds_bucket{le="0.005",status="ServerPollStatus.stopped"} 0.0
server_poll_duration_seconds_bucket{le="0.01",status="ServerPollStatus.stopped"} 0.0
server_poll_duration_seconds_bucket{le="0.025",status="ServerPollStatus.stopped"} 0.0
server_poll_duration_seconds_bucket{le="0.05",status="ServerPollStatus.stopped"} 0.0
server_poll_duration_seconds_bucket{le="0.075",status="ServerPollStatus.stopped"} 0.0
server_poll_duration_seconds_bucket{le="0.1",status="ServerPollStatus.stopped"} 0.0
server_poll_duration_seconds_bucket{le="0.25",status="ServerPollStatus.stopped"} 0.0
server_poll_duration_seconds_bucket{le="0.5",status="ServerPollStatus.stopped"} 0.0
server_poll_duration_seconds_bucket{le="0.75",status="ServerPollStatus.stopped"} 0.0
server_poll_duration_seconds_bucket{le="1.0",status="ServerPollStatus.stopped"} 0.0
server_poll_duration_seconds_bucket{le="2.5",status="ServerPollStatus.stopped"} 0.0
server_poll_duration_seconds_bucket{le="5.0",status="ServerPollStatus.stopped"} 0.0
server_poll_duration_seconds_bucket{le="7.5",status="ServerPollStatus.stopped"} 0.0
server_poll_duration_seconds_bucket{le="10.0",status="ServerPollStatus.stopped"} 0.0
server_poll_duration_seconds_bucket{le="+Inf",status="ServerPollStatus.stopped"} 0.0
server_poll_duration_seconds_count{status="ServerPollStatus.stopped"} 0.0
server_poll_duration_seconds_sum{status="ServerPollStatus.stopped"} 0.0
# HELP server_stop_seconds time taken for server stopping operation
# TYPE server_stop_seconds histogram
server_stop_seconds_bucket{le="0.005",status="success"} 0.0
server_stop_seconds_bucket{le="0.01",status="success"} 0.0
server_stop_seconds_bucket{le="0.025",status="success"} 0.0
server_stop_seconds_bucket{le="0.05",status="success"} 0.0
server_stop_seconds_bucket{le="0.075",status="success"} 0.0
server_stop_seconds_bucket{le="0.1",status="success"} 0.0
server_stop_seconds_bucket{le="0.25",status="success"} 0.0
server_stop_seconds_bucket{le="0.5",status="success"} 0.0
server_stop_seconds_bucket{le="0.75",status="success"} 0.0
server_stop_seconds_bucket{le="1.0",status="success"} 0.0
server_stop_seconds_bucket{le="2.5",status="success"} 0.0
server_stop_seconds_bucket{le="5.0",status="success"} 0.0
server_stop_seconds_bucket{le="7.5",status="success"} 0.0
server_stop_seconds_bucket{le="10.0",status="success"} 0.0
server_stop_seconds_bucket{le="+Inf",status="success"} 0.0
server_stop_seconds_count{status="success"} 0.0
server_stop_seconds_sum{status="success"} 0.0
server_stop_seconds_bucket{le="0.005",status="failure"} 0.0
server_stop_seconds_bucket{le="0.01",status="failure"} 0.0
server_stop_seconds_bucket{le="0.025",status="failure"} 0.0
server_stop_seconds_bucket{le="0.05",status="failure"} 0.0
server_stop_seconds_bucket{le="0.075",status="failure"} 0.0
server_stop_seconds_bucket{le="0.1",status="failure"} 0.0
server_stop_seconds_bucket{le="0.25",status="failure"} 0.0
server_stop_seconds_bucket{le="0.5",status="failure"} 0.0
server_stop_seconds_bucket{le="0.75",status="failure"} 0.0
server_stop_seconds_bucket{le="1.0",status="failure"} 0.0
server_stop_seconds_bucket{le="2.5",status="failure"} 0.0
server_stop_seconds_bucket{le="5.0",status="failure"} 0.0
server_stop_seconds_bucket{le="7.5",status="failure"} 0.0
server_stop_seconds_bucket{le="10.0",status="failure"} 0.0
server_stop_seconds_bucket{le="+Inf",status="failure"} 0.0
server_stop_seconds_count{status="failure"} 0.0
server_stop_seconds_sum{status="failure"} 0.0

@kinow kinow self-assigned this Dec 12, 2018
@kinow kinow added speculative blue-skies ideas WIP labels Dec 12, 2018
@kinow kinow added this to the some-day milestone Dec 12, 2018
@kinow
Member Author

kinow commented Dec 12, 2018

Push vs. Pull

There's no silver bullet, as with most technologies. There are even ways to have a push-based daemon send metrics to Prometheus, giving a hybrid solution. But metrics endpoints like the one above are very cheap for Open Source tools to provide.

An endpoint lets you expose several metrics without having to modify much of your code, and users simply plug in whatever tool they prefer to retrieve them.
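As a rough illustration of how little code such an endpoint needs, here is a minimal sketch using only the Python standard library. The metric names (`cylc_tasks_running`, `cylc_tasks_failed_total`), their values, and the `serve` helper are made up for the example; nothing like this exists in Cylc today, and a real implementation would more likely use an existing Prometheus client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical metrics: name -> (type, help text, current value).
# A real scheduler would update these values as it runs.
METRICS = {
    "cylc_tasks_running": ("gauge", "number of tasks currently running", 3.0),
    "cylc_tasks_failed_total": ("counter", "number of failed tasks", 1.0),
}


def render_metrics(metrics):
    """Render metrics in the Prometheus text exposition format."""
    lines = []
    for name, (kind, help_text, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {kind}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the rendered metrics at /metrics and 404 everything else."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port):
    """Block forever, answering scrapes on the given port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve(8000)` (the port is arbitrary) and then fetching `http://localhost:8000/metrics` should return text in the same format as the JupyterHub output above.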

@kinow
Member Author

kinow commented Dec 12, 2018

I am working on isodatetime Python 3 support, then Cylc Python 3 support, and the new Web UI with Vue.js. But I hope to have time to prepare a presentation showing how to monitor multiple metrics in Cylc with Prometheus, storing their values in the Prometheus time series DB.

Prometheus also includes an alerting system that can send notifications when a metric reaches a threshold. Other alerting systems can be plugged in too.

It is possible to use the time series DB and the metrics to:

@kinow
Member Author

kinow commented Dec 12, 2018

An example of plotting the metrics with Grafana (default visualisation tool for Prometheus): https://prometheus.io/docs/visualization/grafana/. There are also plugins capable of displaying metrics with diagrams, such as Gantt charts (e.g. https://grafana.com/plugins/jdbranham-diagram-panel)

Another example: users reporting issues and attaching screenshots of their metrics/monitoring (useful for maintainers familiar with the tool?): jupyterhub/mybinder.org-deploy#350.

@hjoliver
Member

Good idea(s) @kinow 👍 - definitely worth pursuing.

The metrics of most interest to Cylc users (or site admins) are job execution time, job queue time, job failures, number of jobs per day, etc. I suppose the "pull" variant you describe above would require a back-end service that scrapes that information from individual suite DBs, in order to extract the data for these metrics.
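To make the idea concrete, here is a minimal sketch of the kind of query such a back-end service might run. The `jobs` table and its columns are hypothetical and simplified for illustration (times stored as epoch seconds, `run_status` of 0 meaning success); the real suite DB schema differs and would need its own query.

```python
import sqlite3


def job_metrics(conn):
    """Summarise job count, failures, and mean queue/run time in seconds."""
    row = conn.execute(
        """
        SELECT COUNT(*),
               COALESCE(SUM(run_status != 0), 0),
               AVG(time_run - time_submit),
               AVG(time_run_exit - time_run)
        FROM jobs
        """
    ).fetchone()
    return {
        "jobs_total": row[0],
        "jobs_failed": row[1],
        "mean_queue_seconds": row[2],
        "mean_run_seconds": row[3],
    }


# Hypothetical, simplified stand-in for a suite database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (name TEXT, time_submit REAL, time_run REAL, "
    "time_run_exit REAL, run_status INTEGER)"
)
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?, ?, ?)",
    [
        ("foo", 0.0, 10.0, 70.0, 0),  # queued 10 s, ran 60 s, succeeded
        ("bar", 5.0, 25.0, 55.0, 1),  # queued 20 s, ran 30 s, failed
    ],
)
summary = job_metrics(conn)
print(summary)
```

The resulting dictionary could then be rendered by a metrics endpoint, so the monitoring tool only ever sees the aggregated numbers and never touches the suite DBs directly.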

@hjoliver
Member

(those grafana examples look very nice)

@kinow
Member Author

kinow commented Dec 13, 2018

I suppose the "pull" variant you describe above would require a back-end service that scrapes that information from individual suite DBs, in order to extract the data for these metrics.

That's correct. You basically avoid promising to deliver a package to some server, and instead tell users metrics will be here, so just come and fetch them.

If you download Prometheus, you can point it at the endpoint by editing its YAML configuration file to add an entry like the following under scrape_configs:

scrape_configs:
  - job_name: 'cylc'
    # here users can change the scrape interval to shorter periods to investigate issues, for example
    scrape_interval: 5s
    static_configs:
      # by default will look for cylc-hub-node:9090/metrics, but that can be configured
      - targets: ['cylc-hub-node:9090']
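On each scrape the monitoring tool fetches the text from the target and turns every non-comment line into a sample. The sketch below shows roughly what that parsing step amounts to, using a few lines from the JupyterHub output above. `parse_samples` is a hypothetical helper, not Prometheus code, and it assumes lines carry no trailing timestamp.

```python
def parse_samples(text):
    """Map each series (name plus labels) to its float value.

    Assumes the simple "series value" form with no trailing timestamp;
    "# HELP" and "# TYPE" comment lines are skipped.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, value = line.rsplit(None, 1)
        samples[series] = float(value)
    return samples


sample = """\
# HELP running_servers the number of user servers currently running
# TYPE running_servers gauge
running_servers 1.0
server_spawn_duration_seconds_count{status="success"} 1.0
server_spawn_duration_seconds_sum{status="success"} 8.083470355020836
"""

metrics = parse_samples(sample)
# A histogram's mean is its _sum divided by its _count.
mean_spawn = (
    metrics['server_spawn_duration_seconds_sum{status="success"}']
    / metrics['server_spawn_duration_seconds_count{status="success"}']
)
print(metrics["running_servers"], mean_spawn)
```

Prometheus itself does this parsing and stores each sample with the scrape time, so the application only has to keep the endpoint up to date.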

Graphite was all Python, with Twisted at its core (a precursor of asyncio, I believe). Prometheus was written in Go so that the scrape loop could be optimized (related talk).

@oliver-sanders
Member

A nice tool for integration with CI: it helps identify when performance characteristics change, and helps track down negative changes that lie within the "jitter" by scanning back through the graphs.

https://asv.readthedocs.io/en/stable/

@hjoliver hjoliver modified the milestones: some-day, cylc-8.x Aug 4, 2021
@hjoliver
Member

hjoliver commented Aug 4, 2021

(Probably move to UIS)

@oliver-sanders
Member

Going to close this cylc-flow issue and continue to track in cylc/cylc-admin#132.

@oliver-sanders oliver-sanders removed this from the 8.x milestone Oct 31, 2024