feat: Single process optimization
Python interpreter initialization and module imports account for
a significant portion of cloud-init's total runtime when the
default configuration is used, and in all cases they add
noticeable wall clock time to cloud-init's runtime.

This commit reduces cloud-init's time to completion by
eliminating redundant interpreter starts and module loads.
Because multiple cloud-init processes sit in the critical chain
of the boot order, this also significantly reduces cloud-init's
time to ssh.

Cloud-init has four stages. Each stage starts its own Python
interpreter and loads the same libraries. To eliminate the
redundant work of starting an interpreter and loading libraries,
this changes cloud-init to run as a single process. Systemd
service ordering is retained by using the existing cloud-init
services as shims which use a synchronization protocol to start
each cloud-init stage and to communicate that each stage is
complete to the init system.

Currently only systemd is supported, but the synchronization
protocol should be capable of supporting other init systems
as well with minor changes.

Note: This makes possible many additional improvements that
eliminate redundant work. However, these potential improvements
are deferred for now. This commit has been structured to
minimize the changes required to capture the majority of the
performance savings while preserving correctness and leaving
room to maintain backwards compatibility. Many additional
improvements will be possible once this merges.

Synchronization protocol
========================
- create one Unix socket for each systemd service stage
- send sd_notify(READY=1) once the sockets exist
- for each of the four stages (local, network, config, final):
   - when the init system sends "start" to the stage's Unix
     socket, start the stage
   - when the stage is complete, send "done" back over the Unix
     socket
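
The steps above can be sketched in a few lines of Python. This is a
minimal illustration under stated assumptions, not the code in this
commit: the names bind_stage_sockets, run_stage, fake_init_system and
demo are hypothetical, and a temporary directory stands in for the
cloud-init run directory.

```python
import os
import socket
import tempfile
import threading

STAGES = ("local", "network", "config", "final")


def bind_stage_sockets(directory):
    """Bind one Unix datagram socket per stage before any stage runs."""
    sockets = {}
    for name in STAGES:
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        sock.bind(os.path.join(directory, f"{name}.sock"))
        sockets[name] = sock
    return sockets


def run_stage(sock, work):
    """Block until "start" arrives, run the stage, then reply "done"."""
    message, reply_to = sock.recvfrom(5)
    if message != b"start":
        raise ValueError(f"unexpected message: {message!r}")
    try:
        work()
    finally:
        # Reply to the address the sender bound before sending "start".
        sock.sendto(b"done", reply_to)
        sock.close()


def fake_init_system(directory, name):
    """Hypothetical stand-in for the init-system side of one stage."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as client:
        # Bind a return address so the stage runner knows where to reply.
        client.bind(os.path.join(directory, f"{name}-return.sock"))
        client.sendto(b"start", os.path.join(directory, f"{name}.sock"))
        return client.recv(4)


def demo():
    """Drive all four stages in order and report what happened."""
    completed = []
    with tempfile.TemporaryDirectory() as directory:
        sockets = bind_stage_sockets(directory)
        # A real implementation would send sd_notify(READY=1) here,
        # after the sockets exist and before waiting on any stage.

        def run_all_stages():
            for name in STAGES:
                run_stage(sockets[name], lambda: completed.append(name))

        runner = threading.Thread(target=run_all_stages)
        runner.start()
        replies = [fake_init_system(directory, name) for name in STAGES]
        runner.join()
    return completed, replies
```

Calling demo() returns the stage names in completion order together
with the reply each simulated service received. Binding every socket up
front is what lets a "start" message queue safely even if the stage
runner is still busy with an earlier stage.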

socket.py (new)
---------------

- define a systemd-notify helper function
- define a context manager which implements a multi-socket
  synchronization protocol

cloud-init-single.service (new)
-------------------------------

 - set service type to 'notify'
 - invoke cloud-init in single process mode
 - adopt systemd ordering requirements from cloud-init-local.service
 - adopt KillMode from cloud-final.service

main.py
-------

 - Add command line flag to indicate single process mode
 - In this mode run each stage followed by an IPC
   synchronization protocol step

cloud-{init-local,init,config,final}.service
--------------------------------------------

- change ExecStart to use netcat to connect to Unix socket and:
  - send a start message
  - wait for completion response
- note: a pure Python equivalent is possible for any downstreams
  which do not package openbsd's netcat
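
For downstreams without OpenBSD's netcat, the shim could be written in
Python along the following lines. This is a sketch, not part of this
commit: the function name notify_stage and its run_dir parameter are
hypothetical, and it assumes the socket naming used by the unit files
(<stage>.sock and <stage>-return.sock).

```python
import os
import socket
from contextlib import suppress


def notify_stage(stage, run_dir="/run/cloud-init/share"):
    """Send "start" for a stage and block until its reply arrives."""
    stage_path = os.path.join(run_dir, f"{stage}.sock")
    return_path = os.path.join(run_dir, f"{stage}-return.sock")
    # Remove a stale return socket left over from a previous run, if any.
    with suppress(FileNotFoundError):
        os.remove(return_path)
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        # Counterpart of netcat's -s: bind the address replies go to.
        sock.bind(return_path)
        sock.sendto(b"start", stage_path)
        # Counterpart of netcat's -W1: return after one reply datagram.
        reply = sock.recv(4)
    return reply
```

A unit's ExecStart could then call a small script that runs
notify_stage("config") and exits non-zero unless the reply is b"done";
that wrapper script is likewise an assumption.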

cloud-final.service
-------------------

- drop KillMode

cloud-init-local.service
------------------------

- drop dependencies made redundant by ordering after
  cloud-init-single.service

Performance Results
===================

An integration test [1] on a Noble lxd container comparing the POC to the
current release showed significant improvement. In the POC,
cloud-config.service didn't register in the critical-chain (~340ms
improvement), cloud-init.service added ~378ms to total boot time (~400ms
improvement), and cloud-init-local.service had a marginal improvement
(~90ms) which was likely within the noise threshold. In total, this
(early stage) test showed a 0.83s improvement, with 0.66s of boot time
still attributable to cloud-init. This 0.83s improvement roughly
corresponds to the change in total boot time: systemd-analyze
critical-chain reported 2.267s to reach graphical.target, which is a
0.8s improvement over the current release.
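
As a quick check, the approximate per-service figures quoted above add
up to the reported total. The snippet only restates numbers from this
message; the variable names are illustrative.

```python
# Approximate per-service improvements quoted above, in milliseconds.
improvements_ms = {
    "cloud-config.service": 340,
    "cloud-init.service": 400,
    "cloud-init-local.service": 90,
}

total_s = sum(improvements_ms.values()) / 1000
print(f"total improvement: ~{total_s:.2f}s")  # prints "total improvement: ~0.83s"
```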

Note: The numbers quoted above were gathered from only one series
(Noble), one platform (lxc), and one host machine (Ryzen 7840U), and the
run environment was not controlled. I ran the test multiple times to
ensure that the results were repeatable, but not enough times for them
to be considered statistically significant. I verified that cloud-init
worked as expected, but given the limited scope of this integration
test, this is still very much a proof of concept.

[1] test_logging.py

BREAKING_CHANGE: Run all four cloud-init services as a single systemd service.
holmanb committed Jul 5, 2024
1 parent ee1b25b commit bdc76a4
Showing 9 changed files with 369 additions and 11 deletions.
54 changes: 52 additions & 2 deletions cloudinit/cmd/main.py
@@ -24,6 +24,7 @@
 from cloudinit import netinfo
 from cloudinit import signal_handler
 from cloudinit import sources
+from cloudinit import socket
 from cloudinit import stages
 from cloudinit import url_helper
 from cloudinit import util
@@ -37,7 +38,12 @@
 from cloudinit.config.schema import validate_cloudconfig_schema
 from cloudinit import log
 from cloudinit.reporting import events
-from cloudinit.settings import PER_INSTANCE, PER_ALWAYS, PER_ONCE, CLOUD_CONFIG
+from cloudinit.settings import (
+    PER_INSTANCE,
+    PER_ALWAYS,
+    PER_ONCE,
+    CLOUD_CONFIG,
+)
 
 # Welcome message template
 WELCOME_MSG_TPL = (
@@ -932,9 +938,19 @@ def main(sysv_args=None):
         default=False,
     )
 
+    parser.add_argument(
+        "--single-process",
+        dest="single_process",
+        action="store_true",
+        help=(
+            "Run the four stages as a single process as an optimization. "
+            "Requires init system integration."
+        ),
+        default=False,
+    )
+
     parser.set_defaults(reporter=None)
     subparsers = parser.add_subparsers(title="Subcommands", dest="subcommand")
-    subparsers.required = True
 
     # Each action and its sub-options (if any)
     parser_init = subparsers.add_parser(
@@ -1122,8 +1138,42 @@ def main(sysv_args=None):
 
             status_parser(parser_status)
             parser_status.set_defaults(action=("status", handle_status_args))
+    else:
+        parser.error("a subcommand is required")
 
     args = parser.parse_args(args=sysv_args)
+    if not args.single_process:
+        return sub_main(args)
+    LOG.info("Running cloud-init in single process mode.")
+
+    # this _must_ be called before sd_notify is called otherwise netcat may
+    # attempt to send "start" before a socket exists
+    sync = socket.SocketSync("local", "network", "config", "final")
+
+    # notify cloud-init-local.service that this stage has completed
+    socket.sd_notify(b"READY=1")
+
+    # wait for cloud-init-local.service to start
+    with sync("local"):
+        sub_main(parser.parse_args(args=["init", "--local"]))
+
+    # wait for cloud-init.service to start
+    with sync("network"):
+        # init stage
+        sub_main(parser.parse_args(args=["init"]))
+
+    # wait for cloud-config.service to start
+    with sync("config"):
+        # config stage
+        sub_main(parser.parse_args(args=["modules", "--mode=config"]))
+
+    with sync("final"):
+        # final stage
+        sub_main(parser.parse_args(args=["modules", "--mode=final"]))
+    socket.sd_notify(b"STOPPING=1")
+
+
+def sub_main(args):
 
     # Subparsers.required = True and each subparser sets action=(name, functor)
     (name, functor) = args.action
117 changes: 117 additions & 0 deletions cloudinit/socket.py
@@ -0,0 +1,117 @@
+# This file is part of cloud-init. See LICENSE file for license information.
+"""A module for common socket helpers."""
+import logging
+import os
+import socket
+from contextlib import suppress
+
+from cloudinit.settings import DEFAULT_RUN_DIR
+
+LOG = logging.getLogger(__name__)
+
+
+def sd_notify(message: bytes):
+    """Send a sd_notify message."""
+    LOG.info("Sending sd_notify(%s)", str(message))
+    socket_path = os.environ.get("NOTIFY_SOCKET", "")
+
+    # abstract socket: the leading "@" stands for a NUL byte
+    if socket_path.startswith("@"):
+        socket_path = socket_path.replace("@", "\0", 1)
+
+    # unix domain; an unset or empty NOTIFY_SOCKET is also rejected here
+    elif not socket_path.startswith("/"):
+        raise OSError("Unsupported socket type")
+
+    with socket.socket(
+        socket.AF_UNIX, socket.SOCK_DGRAM | socket.SOCK_CLOEXEC
+    ) as sock:
+        sock.connect(socket_path)
+        sock.sendall(message)
+
+
+class SocketSync:
+    """A two way synchronization protocol over Unix domain sockets."""
+
+    def __init__(self, *names: str):
+        """Initialize a synchronization context.
+
+        1) Ensure that the socket directory exists.
+        2) Bind a socket for each stage.
+
+        Binding the sockets on initialization allows receipt of stage
+        "start" notifications prior to the cloud-init stage being ready to
+        start.
+
+        :param names: stage names, used as unique identifiers
+        """
+        self.stage = ""
+        self.remote = ""
+        self.sockets = {
+            name: socket.socket(
+                socket.AF_UNIX, socket.SOCK_DGRAM | socket.SOCK_CLOEXEC
+            )
+            for name in names
+        }
+        # ensure the directory exists
+        os.makedirs(f"{DEFAULT_RUN_DIR}/share", mode=0o700, exist_ok=True)
+        # remove stale sockets and bind
+        for name, sock in self.sockets.items():
+            socket_path = f"{DEFAULT_RUN_DIR}/share/{name}.sock"
+            with suppress(FileNotFoundError):
+                os.remove(socket_path)
+            sock.bind(socket_path)
+
+    def __call__(self, stage: str):
+        """Set the stage before entering context.
+
+        This enables the context manager to be initialized separately from
+        each stage synchronization.
+
+        :param stage: the name of a stage to synchronize
+
+        Example:
+            sync = SocketSync("stage 1", "stage 2")
+            with sync("stage 1"):
+                pass
+            with sync("stage 2"):
+                pass
+        """
+        self.stage = stage
+        return self
+
+    def __enter__(self):
+        """Wait until a message has been received on this stage's socket.
+
+        Once the message has been received, enter the context.
+        """
+        LOG.debug("sync(%s): initial synchronization starting", self.stage)
+        # block until init system sends us data
+        # the first value returned contains a message from the init system
+        # (should be "start")
+        # the second value contains the path to a unix socket on which to
+        # reply, which is expected to be /path/to/{self.stage}-return.sock
+        sock = self.sockets[self.stage]
+        chunk, self.remote = sock.recvfrom(5)
+
+        if b"start" != chunk:
+            # The protocol expects to receive a command "start"
+            self.__exit__(None, None, None)
+            raise ValueError(f"Received invalid message: [{str(chunk)}]")
+        elif f"{DEFAULT_RUN_DIR}/share/{self.stage}-return.sock" != str(
+            self.remote
+        ):
+            # assert that the return path is in a directory with appropriate
+            # permissions
+            self.__exit__(None, None, None)
+            raise ValueError(f"Unexpected path to unix socket: {self.remote}")
+
+        LOG.debug("sync(%s): initial synchronization complete", self.stage)
+        return self
+
+    def __exit__(self, exc_type, exc_val, exc_tb):
+        """Notify the socket that this stage is complete."""
+        sock = self.sockets[self.stage]
+        sock.connect(self.remote)
+        sock.sendall(b"done")
+        sock.close()
3 changes: 2 additions & 1 deletion systemd/cloud-config.service.tmpl
@@ -10,12 +10,13 @@ ConditionEnvironment=!KERNEL_CMDLINE=cloud-init=disabled
 
 [Service]
 Type=oneshot
-ExecStart=/usr/bin/cloud-init modules --mode=config
+ExecStart=nc.openbsd -Uu -W1 /run/cloud-init/share/config.sock -s /run/cloud-init/share/config-return.sock
 RemainAfterExit=yes
 TimeoutSec=0
 
 # Output needs to appear in instance console output
 StandardOutput=journal+console
+StandardInputText=start
 
 [Install]
 WantedBy=cloud-init.target
4 changes: 2 additions & 2 deletions systemd/cloud-final.service.tmpl
@@ -15,10 +15,9 @@ ConditionEnvironment=!KERNEL_CMDLINE=cloud-init=disabled
 
 [Service]
 Type=oneshot
-ExecStart=/usr/bin/cloud-init modules --mode=final
+ExecStart=nc.openbsd -Uu -W1 /run/cloud-init/share/final.sock -s /run/cloud-init/share/final-return.sock
 RemainAfterExit=yes
 TimeoutSec=0
-KillMode=process
 {% if variant in ["almalinux", "cloudlinux", "rhel"] %}
 # Restart NetworkManager if it is present and running.
 ExecStartPost=/bin/sh -c 'u=NetworkManager.service; \
@@ -32,6 +31,7 @@ TasksMax=infinity
 
 # Output needs to appear in instance console output
 StandardOutput=journal+console
+StandardInputText=start
 
 [Install]
 WantedBy=cloud-init.target
4 changes: 2 additions & 2 deletions systemd/cloud-init-local.service.tmpl
@@ -7,7 +7,6 @@ DefaultDependencies=no
 {% endif %}
 Wants=network-pre.target
 After=hv_kvp_daemon.service
-After=systemd-remount-fs.service
 {% if variant in ["almalinux", "cloudlinux", "rhel"] %}
 Requires=dbus.socket
 After=dbus.socket
@@ -38,12 +37,13 @@ ExecStartPre=/bin/mkdir -p /run/cloud-init
 ExecStartPre=/sbin/restorecon /run/cloud-init
 ExecStartPre=/usr/bin/touch /run/cloud-init/enabled
 {% endif %}
-ExecStart=/usr/bin/cloud-init init --local
+ExecStart=nc.openbsd -Uu -W1 /run/cloud-init/share/local.sock -s /run/cloud-init/share/local-return.sock
 RemainAfterExit=yes
 TimeoutSec=0
 
 # Output needs to appear in instance console output
 StandardOutput=journal+console
+StandardInputText=start
 
 [Install]
 WantedBy=cloud-init.target
27 changes: 27 additions & 0 deletions systemd/cloud-init-single.service
@@ -0,0 +1,27 @@
+[Unit]
+Description=Cloud-init: Single Process
+DefaultDependencies=no
+Wants=network-pre.target
+After=systemd-remount-fs.service
+Before=NetworkManager.service
+Before=network-pre.target
+Before=shutdown.target
+Before=sysinit.target
+Before=cloud-init-local.service
+Conflicts=shutdown.target
+RequiresMountsFor=/var/lib/cloud
+ConditionPathExists=!/etc/cloud/cloud-init.disabled
+ConditionKernelCommandLine=!cloud-init=disabled
+ConditionEnvironment=!KERNEL_CMDLINE=cloud-init=disabled
+
+[Service]
+Type=notify
+ExecStart=/usr/bin/cloud-init --single-process
+KillMode=process
+TimeoutStartSec=infinity
+
+# Output needs to appear in instance console output
+StandardOutput=journal+console
+
+[Install]
+WantedBy=cloud-init.target
3 changes: 2 additions & 1 deletion systemd/cloud-init.service.tmpl
@@ -46,12 +46,13 @@ ConditionEnvironment=!KERNEL_CMDLINE=cloud-init=disabled
 
 [Service]
 Type=oneshot
-ExecStart=/usr/bin/cloud-init init
+ExecStart=nc.openbsd -Uu -W1 /run/cloud-init/share/network.sock -s /run/cloud-init/share/network-return.sock
 RemainAfterExit=yes
 TimeoutSec=0
 
 # Output needs to appear in instance console output
 StandardOutput=journal+console
+StandardInputText=start
 
 [Install]
 WantedBy=cloud-init.target
4 changes: 1 addition & 3 deletions tests/unittests/test_cli.py
@@ -160,9 +160,7 @@ def test_no_arguments_shows_usage(self, capsys):
 
     def test_no_arguments_shows_error_message(self, capsys):
         exit_code = self._call_main()
-        missing_subcommand_message = (
-            "the following arguments are required: subcommand"
-        )
+        missing_subcommand_message = "a subcommand is required"
         _out, err = capsys.readouterr()
         assert (
             missing_subcommand_message in err