Skip to content

lunarfs/kured

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

25 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Introduction

Kured (KUbernetes REboot Daemon) is a Kubernetes daemonset that performs safe automatic node reboots when the need to do so is indicated by the package management system of the underlying OS.

  • Watches for the presence of a reboot sentinel e.g. /var/run/reboot-required
  • Utilises a lock in the API server to ensure only one node reboots at a time
  • Optionally defers reboots in the presence of active Prometheus alerts
  • Cordons & drains worker nodes before reboot, uncordoning them after

Kubernetes & OS Compatibility

The daemon image contains a 1.7.x k8s.io/client-go and kubectl binary for the purposes of maintaining the lock and draining worker nodes. Whilst it has only been tested on a 1.7.x cluster, Kubernetes typically has good forwards/backwards compatibility so there is a reasonable chance it will work on adjacent versions; please file an issue if this is not the case.

Additionally, the image contains a systemctl binary from Ubuntu 16.04 in order to command reboots. Again, although this has not been tested against other systemd distributions there is a good chance that it will work.

Installation

To obtain a default installation without Prometheus alerting interlock or Slack notifications:

kubectl apply -f https://github.com/weaveworks/kured/releases/download/1.0.0/kured-ds.yaml

If you want to customise the installation, download the manifest and edit it in accordance with the following section before application.

Configuration

The following arguments can be passed to kured via the daemonset pod template:

Flags:
      --alert-filter-regexp value   alert names to ignore when checking for active alerts
      --ds-name string              namespace containing daemonset on which to place lock (default "kube-system")
      --ds-namespace string         name of daemonset on which to place lock (default "kured")
      --lock-annotation string      annotation in which to record locking node (default "weave.works/kured-node-lock")
      --period duration             reboot check period (default 1h0m0s)
      --prometheus-url string       Prometheus instance to probe for active alerts
      --reboot-sentinel string      path to file whose existence signals need to reboot (default "/var/run/reboot-required")
      --slack-hook-url string       slack hook URL for reboot notfications
      --slack-username string       slack username for reboot notfications (default "kured")

Reboot Sentinel File & Period

By default kured checks for the existence of /var/run/reboot-required every sixty minutes; you can override these values with --reboot-sentinel and --period. Each replica of the daemon uses a random offset derived from the period on startup so that nodes don't all contend for the lock simultaneously.

Blocking Reboots via Alerts

You may find it desirable to block automatic node reboots when there are active alerts - you can do so by providing the URL of your Prometheus server:

--prometheus-url=http://prometheus.monitoring.svc.cluster.local

By default the presence of any active (pending or firing) alerts will block reboots, however you can ignore specific alerts:

--alert-filter-regexp=^(RebootRequired|AnotherBenignAlert|...$

An important application of this filter will become apparent in the next section.

Prometheus Metrics

Each kured pod exposes a single gauge metric (:8080/metrics) that indicates the presence of the sentinel file:

# HELP kured_reboot_required OS requires reboot due to software updates.
# TYPE kured_reboot_required gauge
kured_reboot_required{node="ip-xxx-xxx-xxx-xxx.ec2.internal"} 0

The purpose of this metric is to power an alert which will summon an operator if the cluster cannot reboot itself automatically for a prolonged period:

# Alert if a reboot is required for any machines. Acts as a failsafe for the
# reboot daemon, which will not reboot nodes if there are pending alerts save
# this one.
ALERT RebootRequired
  IF          max(kured_reboot_required) != 0
  FOR         24h
  LABELS      { severity="warning" }
  ANNOTATIONS {
    summary = "Machine(s) require being rebooted, and the reboot daemon has failed to do so for 24 hours",
    impact = "Cluster nodes more vulnerable to security exploits. Eventually, no disk space left.",
    description = "Machine(s) require being rebooted, probably due to kernel update.",
  }

If you choose to employ such an alert and have configured kured to probe for active alerts before rebooting, be sure to specify --alert-filter-regexp=^RebootRequired$ to avoid deadlock!

Slack Notifications

If you specify a Slack hook via --slack-hook-url, kured will notify you immediately prior to rebooting a node:

We recommend setting --slack-username to be the name of the environment, e.g. dev or prod.

Overriding Lock Configuration

The --ds-name and --ds-namespace arguments should match the name and namespace of the daemonset used to deploy the reboot daemon - the locking is implemented by means of an annotation on this resource. The defaults match the daemonset YAML provided in the repository.

Similarly --lock-annotation can be used to change the name of the annotation kured will use to store the lock, but the default is almost certainly safe.

Operation

The example commands in this section assume that you have not overriden the default lock annotation, daemonset name or namespace; if you have, you will have to adjust the commands accordingly.

Testing

You can test your configuration by provoking a reboot on a node:

sudo touch /var/run/reboot-required

Disabling Reboots

If you need to temporarily stop kured from rebooting any nodes, you can take the lock manually:

kubectl -n kube-system annotate ds kured weave.works/kured-node-lock='{"nodeID":"manual"}'

Don't forget to release it afterwards!

Manual Unlock

In exceptional circumstances, such as a node experiencing a permanent failure whilst rebooting, manual intervention may be required to remove the cluster lock:

kubectl -n kube-system annotate ds kured weave.works/kured-node-lock-

NB the - at the end of the command is important - it instructs kubectl to remove that annotation entirely.

Building

dep ensure && make

Getting Help

If you have any questions about, feedback for or problems with kured:

Your feedback is always welcome!

About

Kubernetes Reboot Daemon

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Go 93.1%
  • Makefile 6.9%