The goal of this document is to keep a consistent overview of how the components of our autoscaling setup fit together. Protocol details are written here as well.
We also briefly touch on the implementation of NeonVM, because it's relevant to the inter-node communication that goes on.
This document should be up-to-date. If it isn't, that's a mistake (open an issue!).
Table of contents:
- See also
- High-level overview
- Network connections between components
- Repository structure
- Agent-Scheduler protocol details
- High-level consequences of the Agent-Scheduler protocol
- Agent-Monitor protocol details
- Footguns
## See also

This isn't the only architecture document. You may also want to look at:

- `pkg/agent/ARCHITECTURE.md` — detail on the implementation of the `autoscaler-agent`
- `pkg/plugin/ARCHITECTURE.md` — detail on the implementation of the scheduler plugin
- `neondatabase/neon/.../vm_monitor` (a different repo) — where the vm-monitor, an autoscaling component running inside each VM, lives
## High-level overview

At a high level, this repository provides five components with a non-trivial amount of code:

- A Kubernetes custom resource definition (CRD) and controller (`neonvm-controller`) for managing resizeable VMs — NeonVM.
  - The underlying NeonVM pods run `neonvm-runner`
  - NeonVM virtual machine images are built with `vm-builder`
- A modified Kubernetes scheduler (using the plugin interface) — known as "the (scheduler) plugin", `AutoscaleEnforcer`, or `autoscale-scheduler`
- A daemonset responsible for making VM scaling decisions & checking with interested parties — known as `autoscaler-agent` or simply `agent`

One last component, a binary running inside the VM to (a) handle being upscaled, (b) validate that downscaling is ok, and (c) request immediate upscaling due to sharp changes in demand — known as "the (VM) monitor", lives in `neondatabase/neon/.../vm-monitor`.
For information on NeonVM, see `README-NeonVM.md`.
The scheduler plugin is responsible for handling resource requests from the `autoscaler-agent`, capping increases so that node resources aren't overcommitted.

The `autoscaler-agent` periodically reads from a metrics source in the VM (currently vector's `node_exporter`-like functionality) and makes scaling decisions about the desired resource allocation. It then requests these resources from the scheduler plugin, and submits a patch request for its NeonVM object to update the resources.

The VM monitor is responsible for handling all of the resource management functionality inside the VM that the `autoscaler-agent` cannot. This includes handling upscales (e.g. increasing the Postgres file cache size), approving attempts to downscale resource usage (or rejecting them, if those resources are still in use), and requesting upscale when memory usage increases too rapidly for metrics to catch.
NeonVM is able to live-scale the resources given to a VM (i.e. CPU and memory slots) by handling patches to the Kubernetes VM object, which requires connecting to QEMU running in the outer container.
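For illustration, such a patch might be submitted like this. This is only a sketch using client-go's dynamic client; the group/version/resource and field paths here are assumptions, so consult the generated client under `neonvm/` for the real API:

```go
package scaling

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
)

// Assumed GVR and field paths, for illustration only; the generated client
// under neonvm/ is the authoritative way to interact with the CRD.
var neonvmGVR = schema.GroupVersionResource{
	Group:    "vm.neon.tech",
	Version:  "v1",
	Resource: "virtualmachines",
}

// scaleVM submits a merge patch updating the VM's currently-used resources.
func scaleVM(ctx context.Context, dyn dynamic.Interface, namespace, name string) error {
	patch := []byte(`{"spec":{"guest":{"cpus":{"use":2},"memorySlots":{"use":8}}}}`)
	_, err := dyn.Resource(neonvmGVR).Namespace(namespace).Patch(
		ctx, name, types.MergePatchType, patch, metav1.PatchOptions{},
	)
	return err
}
```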
Scaling decisions are made in terms of compute units, which keeps the relationship between allocated CPU and memory roughly linear. In practice, it's possible for this to become slightly off. This is discussed more in the high-level consequences section below.
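As a toy illustration, rounding a desired allocation up to whole compute units might look like the following. The unit size below is made up; the real compute unit is reported by the scheduler plugin:

```go
package main

import "fmt"

// Hypothetical compute-unit size, for illustration only; the real value is
// provided by the scheduler plugin in its response.
type ComputeUnit struct {
	VCPU     float64 // vCPUs per compute unit
	MemSlots uint    // memory slots per compute unit
}

// unitsFor returns the smallest number of compute units covering the desired
// CPU and memory, keeping the CPU:memory ratio fixed.
func unitsFor(cu ComputeUnit, wantVCPU float64, wantMemSlots uint) uint {
	units := uint(1)
	for float64(units)*cu.VCPU < wantVCPU || units*cu.MemSlots < wantMemSlots {
		units++
	}
	return units
}

func main() {
	cu := ComputeUnit{VCPU: 0.25, MemSlots: 1}
	u := unitsFor(cu, 0.8, 2) // needs 4 units for CPU, 2 for memory → 4
	fmt.Printf("allocate %d units = %.2f vCPU, %d memory slots\n",
		u, float64(u)*cu.VCPU, u*cu.MemSlots)
}
```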
## Repository structure

At a high level, each component gets its own directory and resulting YAML for its deployment, where applicable. These are:

- `autoscale-scheduler` — the scheduler (with our plugin)
- `autoscaler-agent`
- `cluster-autoscaler` — patch for building a NeonVM-compatible cluster-autoscaler
- `neonvm` — CRDs and other related YAMLs for NeonVM, alongside Go definitions and a generated client. Note that the generated YAML includes a dependency on `neonvm-controller` via a webhook for create/update/delete operations on the CRDs.
- `neonvm-controller` — controller for the NeonVM CRDs
- `neonvm-kernel` — files relating to the virtual machine kernel we use in NeonVM
- `neonvm-runner` — per-VM management process, created by `neonvm-controller`
- `neonvm-vxlan-controller`
- `vm-builder` — binary for building VM images for use by NeonVM

Each component directory contains:

- `cmd/` — the entrypoint of the component, containing `main.go`
- `Dockerfile` — for building the component, if applicable
- `kustomize.yaml` — if the component is separately deployed, instructions for how Kustomize should generate the YAML
- `*.yaml` — if the component is separately deployed, there will be other YAML files as the resources for Kustomize to include, e.g. `daemonset.yaml` or `config_map.yaml`

In addition, the repository contains:

- `k3d/` and `kind/` — configuration for local test clusters
- `pkg/` — the bulk of the Go codebase. For more complex components, `cmd/main.go` often just calls the relevant entrypoint function in its `pkg/` directory. `pkg/` also includes the common packages shared by multiple components.
- `scripts` — a collection of scripts for common tasks
- `tests` — end-to-end tests
  - `tests/e2e` — the `kuttl` test scenarios themselves
- `vm-examples/` — collection of VMs:
  - `pg16-disk-test/` — VM with Postgres 16 and ssh access. Refer to `vm-examples/pg16-disk-test/README.md` for more information.
## Agent-Scheduler protocol details

Broadly speaking, the `autoscaler-agent` unilaterally notifies the scheduler when it decreases resource usage, but requests approval before any increase. This means that the scheduler always stores the upper bound on resource usage, so we don't need to worry about overcommitting due to racy behavior.

At present, the protocol consists entirely of HTTP POST requests by the `autoscaler-agent` to the scheduler plugin, which serves these requests on port `10299`. Each request sent by the `autoscaler-agent` is an `api.AgentRequest` and the scheduler plugin responds with an `api.PluginResponse` (see: `pkg/api/types.go`).

In general, a `PluginResponse` primarily provides a `Permit`, which grants permission for the `autoscaler-agent` to assign the VM some amount of resources. By tracking total resource allocation on each node, the scheduler can reject a scale-up request to avoid undesired over-commit.
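As a rough sketch of this request/response cycle, with simplified stand-in types rather than the real `api.AgentRequest`/`api.PluginResponse` definitions:

```go
package agent

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Simplified stand-ins for api.AgentRequest and api.PluginResponse; see
// pkg/api/types.go for the real definitions.
type Resources struct {
	VCPU     float64 `json:"vCPU"`
	MemSlots uint    `json:"memSlots"`
}

type AgentRequest struct {
	Pod       string    `json:"pod"`       // which VM pod this is about
	Resources Resources `json:"resources"` // desired allocation
	Metrics   any       `json:"metrics"`   // latest metrics from the VM
}

type PluginResponse struct {
	Permit  Resources `json:"permit"`  // allocation the agent may use
	Migrate bool      `json:"migrate"` // whether the VM will be migrated
}

// requestResources POSTs an AgentRequest to the scheduler plugin, which
// serves the protocol on port 10299, and decodes its PluginResponse.
func requestResources(schedulerAddr string, req AgentRequest) (*PluginResponse, error) {
	body, err := json.Marshal(req)
	if err != nil {
		return nil, err
	}
	resp, err := http.Post(
		fmt.Sprintf("http://%s:10299/", schedulerAddr),
		"application/json",
		bytes.NewReader(body),
	)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("scheduler returned status %d", resp.StatusCode)
	}
	var pr PluginResponse
	if err := json.NewDecoder(resp.Body).Decode(&pr); err != nil {
		return nil, err
	}
	return &pr, nil
}
```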
1. On startup (for a particular VM), the `autoscaler-agent` connects to the VM monitor and fetches some initial metrics.
2. After successfully receiving a response, the autoscaler-agent sends an `AgentRequest` with the metrics and current resource allocation (i.e. it does not request any scaling).
3. The scheduler's `PluginResponse` informs the autoscaler-agent of the "compute unit" CPU and memory slots. The permit will exactly equal the resources in the `AgentRequest`.
4. The `autoscaler-agent` repeatedly fetches metrics from the VM. For each reading:
   1. Calculate the new desired resource allocation, as a multiple of compute units.
   2. If reaching the desired resource allocation requires scaling any resources down, submit a VM patch request that scales those resources down, but keeps the rest the same.
      - Note: this requires submitting a Kubernetes patch request. This goes through the NeonVM controller, which then connects back to QEMU running outside the VM to make the change. The full flow is: `autoscaler-agent` → API server → NeonVM controller → VM's QEMU.
   3. Send an `AgentRequest` to the plugin with the desired resource allocation (even if it's the same!), the last permit the agent received, and metrics. Sending the current resources as desired is useful for relieving pressure from denied increases. (Refer to the "Node pressure and watermarks" section below for more.)
   4. The plugin's `PluginResponse` guarantees that its `Permit` will satisfy, for each resource (sketched in the example after this list):
      - If the resource has scaled down, the `Permit` is equal to the new value.
      - If the resource wants to scale up, the `Permit` is between the old and requested new value, inclusive.
   5. If the `PluginResponse` signals that the VM will be migrating, exit without further patch requests.
   6. If the `PluginResponse`'s permit approved any amount of increase, submit a VM patch request that scales those resources up.
      - This has the same connection flow as the earlier patch request.
### Node pressure and watermarks

In order to trigger migrations before we run out of resources on a node, the scheduler maintains a few logical concepts: "watermarks" and "pressure".

Each node has a "watermark" for each resource — the level above which the scheduler should preemptively start migrating VMs to make room. If everything is operating smoothly, we should never run out (which would cause us to deny requests for additional resources).

To make sure we don't over-correct and migrate away too many VMs, we also track the "pressure" that a particular resource on a node is under. This is, roughly speaking, the amount of allocation above the watermark that has been requested, including requests that were denied because we ran out of room.

When a VM migration is started, the scheduler marks its current resource allocation as "pressure accounted for", along with any pressure from denied requests for increases in the VM's resource allocation. Pressure from requests that couldn't be satisfied is referred to as *capacity pressure*, whereas pressure from usage above the watermark is referred to as *logical pressure*.

The scheduler migrates VMs whenever the combination of logical and capacity pressure is greater than the amount of pressure already accounted for.
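Putting that condition in code form (names here are illustrative, not the scheduler's actual fields):

```go
package plugin

// Illustrative only: the scheduler's real pressure tracking has more state.
type resourcePressure struct {
	Logical      uint // allocation above the watermark
	Capacity     uint // from requests denied because the node was full
	AccountedFor uint // covered by migrations already started
}

// shouldStartMigration reports whether another VM migration should be
// triggered for this resource on this node.
func (p resourcePressure) shouldStartMigration() bool {
	return p.Logical+p.Capacity > p.AccountedFor
}
```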
## High-level consequences of the Agent-Scheduler protocol

- If a VM is continuously migrated, it will never have a chance to scale up.
- Because resources are handled separately, any discrepancy in external resource availability can cause the scheduler to return `Permit`s that aren't a clean multiple of a compute unit (e.g., nodes have mismatched memory vs CPU, or external pods / system reserved are mismatched).
## Agent-Monitor protocol details

Agent-Monitor communication is carried out through a relatively simple versioned protocol over websocket. One party sends a message, the other responds. There are various message types that each party sends, and all messages are annotated with an ID. This allows a sender to recognize responses to its previous messages: if the outgoing message has ID X, then the response will also have ID X.

As with the other protocols, the relevant types are located in `pkg/api/types.go`.
1. On startup, the VM monitor listens for websocket connections on `127.0.0.1:10369`.
2. On startup, the agent connects to the monitor via websocket on `127.0.0.1:10369/monitor`.
3. The agent then sends a `VersionRange[MonitorProtocolVersion]` with the range of protocols it supports.
4. The monitor responds with the highest common version between the two. If there is no compatible protocol, it returns an error.
5. From this point on, either party may initiate a transaction by sending a message. The other party responds with the appropriate message, with the same ID attached so that the sender can match the response to its request.
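For illustration, the envelope and version negotiation might look roughly like the following (simplified stand-ins; the real types are in `pkg/api/types.go`):

```go
package protocol

import (
	"encoding/json"
	"errors"
)

// Simplified stand-ins for the websocket message envelope and version range;
// the real types are in pkg/api/types.go.
type Message struct {
	ID   uint64          `json:"id"`   // echoed back in the response
	Type string          `json:"type"` // e.g. "TryDownscale", "DownscaleResult"
	Body json.RawMessage `json:"body"`
}

type VersionRange struct {
	Min uint32 `json:"min"`
	Max uint32 `json:"max"`
}

// highestCommon picks the version the monitor would respond with, or an
// error if the two ranges don't overlap.
func highestCommon(agent, monitor VersionRange) (uint32, error) {
	hi := agent.Max
	if monitor.Max < hi {
		hi = monitor.Max
	}
	if hi < agent.Min || hi < monitor.Min {
		return 0, errors.New("no compatible protocol version")
	}
	return hi, nil
}
```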
Currently, the following interactions are supported:

| Request                        | Response                              |
|--------------------------------|---------------------------------------|
| Monitor sends `UpscaleRequest` | Agent returns `NotifyUpscale`         |
| Agent sends `TryDownscale`     | Monitor returns `DownscaleResult`     |
| Agent sends `NotifyUpscale`    | Monitor returns `UpscaleConfirmation` |
| Agent sends `HealthCheck`      | Monitor returns `HealthCheck`         |
Healthchecks: the agent initiates a health check every 5 seconds. The monitor simply responds with an ack.
There are two additional message types that either party may send:

- `InvalidMessage`: sent when either party fails to deserialize a message it received
- `InternalError`: used to indicate that an error occurred while processing a request, for example, if the monitor errors while trying to downscale
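Continuing the sketch above, a sender can use the shared ID to match responses to in-flight requests, for example with a map of channels keyed by ID (illustrative only, not the actual implementation):

```go
package protocol

import (
	"encoding/json"
	"sync"
)

// transactionTable matches responses to in-flight requests by message ID.
type transactionTable struct {
	mu      sync.Mutex
	nextID  uint64
	pending map[uint64]chan Message // ID → waiter for the response
}

// send assigns a fresh ID, records a waiter for it, and emits the message.
// The returned channel receives the response carrying the same ID.
func (t *transactionTable) send(out chan<- Message, typ string, body json.RawMessage) <-chan Message {
	t.mu.Lock()
	t.nextID++
	id := t.nextID
	ch := make(chan Message, 1)
	t.pending[id] = ch
	t.mu.Unlock()

	out <- Message{ID: id, Type: typ, Body: body}
	return ch
}

// dispatch routes an incoming message to whichever call is waiting on its ID.
func (t *transactionTable) dispatch(msg Message) {
	t.mu.Lock()
	ch, ok := t.pending[msg.ID]
	delete(t.pending, msg.ID)
	t.mu.Unlock()
	if ok {
		ch <- msg
	}
}
```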
## Footguns

An alternate name for this section: Things to watch out for.

- Creating a pod with `PodSpec.NodeName` set will bypass the scheduler (even if `SchedulerName` is set), so we cannot prevent it from overcommitting node resources. Do not do this except for "system" functions.
- Non-VM pods are accounted for only by their `resources.requests` — this is for compatibility with tools like cluster-autoscaler, but allows potentially overcommitting memory. To avoid overcommitting in this way, set `resources.requests` equal to `resources.limits`. For more, see `pkg/plugin/ARCHITECTURE.md`.