feat: single process optimization #5489
Conversation
Force-pushed from 5920138 to bdc76a4.
🤣
Great work here! I have some comments but they're all very minor. We should take some extra time to think about any additional testing that may need to happen, but overall +1 to the approach.
[RFC] Names
--single-process is fine, but I'm -1 to cloud-init-single.service simply because we have the cloud-init single command that could potentially be confusing. What about cloud-init boot and cloud-init-boot.service?
Long term, I like the idea of making this cloud-init.service, but I'm not sure it's worth the churn of doing it at the same time as this PR. Ideally, this PR won't introduce any backwards incompatibility, but given the scope of the change, I'm sure there will be some effects we need to deal with, so I think I'd rather keep intentional breakage separate.
Why openbsd netcat?
I don't know the history or difference between implementations, but is nc.openbsd installed everywhere? If not, is the difference meaningful such that we couldn't just use nc?
Thanks!
Agreed. I have an idea for an integration test that would support verifying the behavior more easily.
This definitely breaks backwards compatibility: it will require a patch, and downstream maintainers will need to update their vendored service files if they want to benefit from the performance improvements here. If we do this change separately from a rename, that means two breaking changes rather than one. Since we already require downstream changes as a result of this PR, I'd prefer to do the rename now - the justification being that a cleanup alongside a performance improvement is less painful than two breaking changes. This will require a community communication email / post to communicate changes required of downstreams, so this can be included in the "what this means for you" section for maintainers. Thoughts?
I can't promise that it is used everywhere, but I do know that it is pretty widely used, and is already a dependency of ubuntu-minimal:
Like I mentioned before, openbsd's netcat supports Unix sockets. I don't see any other implementations that do. If we'd prefer to avoid more dependencies, a C / Rust / Go implementation of the behavior that we need would be trivial to write using minimal dependencies, but that comes with the complexity of cloud-init packages becoming arch-dependent - which is not a huge burden, but it is another packaging burden (though it's probably worth considering since this would likely improve performance further).
Force-pushed from 5692c50 to c6e9e4b.
Nice work here, thanks for the effort!
> I don't know the history or difference between implementations, but is nc.openbsd installed everywhere? If not, is the difference meaningful such that we couldn't just use nc?

@TheRealFalcon: AFAIK the standard netcat does not implement Unix sockets.
@@ -38,12 +37,13 @@
 ExecStartPre=/bin/mkdir -p /run/cloud-init
 ExecStartPre=/sbin/restorecon /run/cloud-init
 ExecStartPre=/usr/bin/touch /run/cloud-init/enabled
 {% endif %}
-ExecStart=/usr/bin/cloud-init init --local
+ExecStart=nc.openbsd -Uu -W1 /run/cloud-init/share/local.sock -s /run/cloud-init/share/local-return.sock
This requires an explicit downstream Ubuntu runtime dependency on netcat-openbsd.
> This requires an explicit downstream Ubuntu runtime dependency on netcat-openbsd.
Yes, it would be best to add this dependency to the packaging. However, since ubuntu-minimal already includes it, it won't functionally change anything.
I plan to add that to the Ubuntu packaging in a followup PR.
Sweet! Even if it's a noop, I think explicitly expressing the dependency is the way to go, as cloud-init may be installed in environments where ubuntu-minimal is not, or ubuntu-minimal might drop the dependency on netcat-openbsd in the future.
@aciba90 thanks for the review! I'll ping you when I've addressed your points.
@TheRealFalcon also worth noting that on Ubuntu, I'm pretty sure that
Makes sense. +1. We're just changing filenames though, right? Not also the CLI as you had initially specced out? You and Alberto also worked through my
Correct, I'm not going to touch pre-existing CLI names. For now I'm going to rename it to
Force-pushed from ac9e590 to 49bebbf.
Requesting re-review @TheRealFalcon. I added another unit test. I think that the testing coverage does a pretty good job of exercising the ordering requirements and error conditions of the protocol in just the unit test suite, and it uses only the standard library, so I'm not sure what integration tests would add above what is already covered, since every integration test in the test suite will already exercise the netcat / systemd service portion of this. If dedicated integration tests are desired I'm happy to add them (with expectations laid out for what is desired). Note that the unit tests added in this PR make use of the same operating system primitives (Unix sockets) as any integration test would - with only the sd_notify mocked, plus a couple of calls that were added which allow running

I'm currently working on a community notice to advertise the impacts to users and packagers.
This sounds like an awesome performance improvement.
I'm concerned about reusing a systemd unit name for a completely different thing. I'd rather no names are re-used: someone somewhere either has fingers that know the names or documentation that refers to the names or (maybe not, for cloud-init?) service files and configurations on disk, or terraform plans, or ansible recipes, etc, that refer to the old units with the old meanings.
So: I suggest not using an old name for a new purpose.
One common problem with "multiple-stage" software is when something, or someone, causes the stages to execute out of order. This might not be quite as vulnerable as a TLS implementation in a webserver, but if the different phases have strict ordering requirements, we should have some simple checks for that ordering here.
Thanks
# notify systemd that this stage has completed
socket.sd_notify("READY=1")
# wait for cloud-init-local.service to start
with sync("local"):
Are there ordering requirements inherent to cloud-init's different stages? If local must be run before network, for example, this would be an ideal place to encode these requirements.
> Are there ordering requirements inherent to cloud-init's different stages?

Yes, these ordering requirements are inherent. This order is already encoded in the init system. Currently cloud-init-local.service is ordered before cloud-init.service, which is before cloud-config.service, which is before cloud-final.service.
> local must be run before network, for example, this would be an ideal place to encode these requirements.

@setharnold Maybe you had something specific in mind that I'm missing?
If these have to happen in lockstep order, maybe a variable to show what stage it's on:

stage = 0
with sync("local"):
    if stage != 0:
        error_and_exit("unexpected stage error")
    stage += 1
    # stuff
with sync("network"):
    if stage != 1:
        error_and_exit("network must follow local")
    stage += 1
    # stuff
with sync("config"):
    if stage != 2:
        error_and_exit("config must follow network")
    stage += 1
    # stuff
with sync("final"):  # over and over again
    ...
If the ordering is more vague, like "config" requires "local" and doesn't care about "network", then local_done and network_done and config_done and so on boolean variables could mark these completed in turn; then each one could check the status of the phases that it depends upon.
> If these have to happen in lockstep order, maybe a variable to show what stage it's on: [...]
We could add code like this, but it wouldn't be possible for error_and_exit() to ever be called, regardless of the order that the messages are received on the Unix sockets. This is single-threaded code and it only has a single entry point. So in order for the network stage to run, the local stage has to have run.
I think I've just got less faith than you in the systemd-enforced ordering, that's all. If you're sure I'm over-thinking it, feel free to leave it out.
> I think I've just got less faith than you in the systemd-enforced ordering, that's all.
Systemd is not the only thing enforcing ordering - it is already enforced in the Python code as well. If systemd triggers stages out of order, earlier stages will not get skipped - if "network" is triggered before "local", this Python code will wait until systemd triggers "local".
The reason is that the ordering of these stages isn't enforced in the definition of the context manager. It is enforced by the order in which the context manager is invoked - since with sync("local") is called prior to with sync("network"), the network stage cannot start until the local stage has completed. No parallelism / async / threading / coroutines / etc. are in use; the code simply runs one stage after the next, but not until systemd has signaled that it is time for the stage to run.
> If you're sure I'm over-thinking it, feel free to leave it out.
I'm pretty sure that's what is happening, but I want to make sure you're comfortable with / understand how it works too. Hopefully the following annotated code helps.
with sync("local"):
    # "start" must have been received on the local socket to enter the context manager
    # (triggered by the one-liner in cloud-init-local.service)
    ...
# once the code gets to this point, the "local" stage has completed (and a response was
# sent in the context manager to the shim process in cloud-init-local.service)
with sync("network"):
    # "start" must have been received on the network socket to enter the context manager
    # (triggered by the one-liner in cloud-init-network.service) - in addition to
    # all of the code above being completed
    ...
# once the code gets to this point, the "network" stage has completed (and a response
# was sent in the context manager to the shim process in cloud-init-network.service)
with sync("config"):
    # "start" must have been received on the config socket to enter the context manager
    # (triggered by the one-liner in cloud-config.service) - in addition to
    # the code above being completed
    ...
# once the code gets to this point, the "config" stage has completed (and a response
# was sent in the context manager to the shim process in cloud-config.service)
with sync("final"):
    # "start" must have been received on the final socket to enter the context manager
    # (triggered by the one-liner in cloud-final.service) - in addition to
    # the code above being completed
    ...
So even if systemd triggers events out of order, cloud-init won't run them out of order. It will just wait until the next stage in the order has completed.
Please let me know if you have any questions.
Oh! now I can see why you're confident :) thanks!
@@ -10,7 +10,7 @@
 ConditionEnvironment=!KERNEL_CMDLINE=cloud-init=disabled

 [Service]
 Type=oneshot
-ExecStart=/usr/bin/cloud-init modules --mode=config
+ExecStart=sh -c 'echo "start" | nc.openbsd -Uu -W1 /run/cloud-init/share/config.sock -s /run/cloud-init/share/config-return.sock | sh'
This is a fairly complex invocation for an ExecStart with no surrounding commentary or documentation to aid in explaining the intent of the socket setup and pipe chain. It also makes typical triage entrypoints harder to read when folks jump into systemctl status cloud-*.service, as it pushes visibility of exit codes off screen because our ExecStart is so long.
It may aid readability/maintainability if cloud-init delivered a simple wrapper script in /usr/lib/cloudinit/ that would take a single command line param for the stage name (local|network|config|final) and create these separate named sockets. We can then document in that script the intent of opening the socket and the act of passing through socket payloads to sh.
> This is a fairly complex invocation for an ExecStart with no surrounding commentary or documentation to aid in explaining the intent of the socket setup and pipe chain.
Agreed, inline-documentation would help a lot. I'm happy to add that.
> It also makes typical triage entrypoints harder to read when folks may jump into systemctl status cloud-*.service as it pushes visibility of exit codes off screen because our ExecStart is so long.
You're referring to systemd's pager truncating the message on the Process: line, right?
$ systemctl status cloud-init-local.service
● cloud-init-local.service - Cloud-init: Local Stage (pre-network)
Loaded: loaded (/usr/lib/systemd/system/cloud-init-local.service; enabled; preset: enabled)
Active: active (exited) since Wed 2024-07-24 19:53:46 UTC; 2min 25s ago
Process: 254 ExecStart=sh -c echo "start" | nc.openbsd -Uu -W1 /run/cloud-init/share/local.sock -s /run/cloud-init/share/local-return.sock | sh (code=exited, status=0/S>
Main PID: 254 (code=exited, status=0/SUCCESS)
CPU: 5ms
I don't think that this is really a concern. This off-screen information is duplicated on the following line, and even if it wasn't, one can always use --no-pager or pipe to cat / less / etc.
> It may aid in readability/maintainability if cloud-init delivered a simple wrapper script in /usr/lib/cloudinit/ that would take a single command line param for the stage name (local|network|config|final) and create these separate named sockets. We can then document in that script the intent of opening the socket and the act of passing through socket payloads to sh.
I'd really rather not hide the implementation in a script. The original implementation was dead simple - just a single nc invocation - but at the request of getting error codes into the shim services I made this compromise to the previous simplicity. I'm very averse to making this any more complex than it already is. I don't want another ds-identify. Frankly I'd sooner drop the error codes in the shim services or rewrite this one-liner in C/Rust/Go (but in preference of shipping code that we already have packaging for, I've proposed it like this for now with the intent of potentially rewriting in a different language eventually).
Also, putting this in a file would add a file load and read to the critical chain per stage.
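To illustrate the trailing `| sh` compromise being discussed here (a hedged sketch: a plain printf stands in for the nc.openbsd leg of the pipe, and the exact reply format is an assumption, not taken from the PR), the idea is that the daemon's reply is itself a tiny shell script, so the shim unit's exit status can reflect the stage result:

```shell
# The reply travels through the pipe and is executed by sh, so a failing
# stage can surface a nonzero exit status in the shim service.
reply='echo "stage complete"; exit 0'   # stand-in for the socket reply
printf '%s\n' "$reply" | sh
echo "shim exit status: $?"
```

This is why the one-liner is more involved than a bare nc invocation: the final sh exists purely to propagate the stage's exit code.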
> Agreed, inline-documentation would help a lot. I'm happy to add that.
This meets the majority of my concern, and I accept your concerns about a wrapper layer adding complexity or obscurity where it may not be needed.
If we have comments within this unit that describe the intent of this interaction, then I can back down on the "please make systemctl status cloud-init.service more readable/simple" point.
See the latest commit, which I think addresses the remaining concerns in this PR.
Yes looking great.
# the second value contains the path to a unix socket on which to
# reply, which is expected to be /path/to/{self.stage}-return.sock
sock = self.sockets[self.stage]
chunk, self.remote = sock.recvfrom(5)
Historically, systemd units have been kicked out of systemd boot goals due to misconfiguration ordering cycles (unit dependency chain issues) with other systemd units. Do we want to implement some sort of long-timeout exception handling here so we don't block indefinitely, and hard error with "cloud-init didn't start expected X stage in X mins" to unblock the rest of boot?
> Historically systemd units have been kicked out of systemd boot goals due to misconfiguration ordering cycles (unit dependency chain issues) with other systemd units. Do we want to implement some sort of long timeout exception handling here to not block indefinitely and hard error that "cloud-init didn't start expected X stage in X mins" to unblock the rest of boot?
This is a condition which should be discovered before the image ever gets shipped, not something that should bite end users.
I really don't think that we want that. What you are referring to is a symptom seen at runtime of a problem that in almost all cases is introduced by a combination of packages that have been installed on the system (build time). Services in systemd can take an unknowable amount of time to complete and a timeout could break a real user of cloud-init that has to deal with long boot times. Such a timeout isn't a correct solution for this problem. The correct solution is to fix the ordering cycle. Adding such a timeout could break real users that have long boot times just to introduce a broken workaround for packagers.
Yes cloud-init will behave differently from before if booted out due to an ordering cycle. However I don't think that this is something that an end user should ever see unless they themselves are messing with the boot order or their OS vendor made a mistake - and even in those cases it still doesn't seem like a problem for cloud-init to attempt to solve at the expense of real users with slow boot times.
> This is a condition which should be discovered before the image ever gets shipped, not something that should bite end users.
This does happen sometimes to (image creator) end-users who install supplemental systemd services in their golden images, which has resulted in bogus bugs/questions about cloud-init not running right (because it was ejected from the boot goals due to systemd unit dependency conflicts). I was wondering if there are ways we could make that condition a bit more conspicuous, but the symptoms below are probably "good enough" for those image creators to triage and determine what is causing the block in boot.
Here are the symptoms a user would encounter if we don't instrument a timeout of some sort. In images where cloud-network.service were to be ejected from boot goals due to cyclic dependency ordering, the system will be in the following state:
- systemctl status will sit indefinitely at "starting" state.
- systemctl list-jobs will not show the missing/disabled cloud-network.service, but will instead show cloud-final.service and cloud-init.target in waiting state and cloud-config.service sitting indefinitely in "running" state with no logs.
- systemctl status cloud-init.service will show "Waiting on external services to complete (network stage)".
- cloud-init status --format=yaml will show extended_status: running and no start times for the init stage in the resulting YAML output.

These are probably reasonable enough breadcrumbs to look into on systems which exhibit this configuration problem and may not warrant further arbitrary timeouts which could adversely affect slow system startup.
> This is a condition which should be discovered before the image ever gets shipped, not something that should bite end users.

> This does happen sometimes to (image creator) end-users who install supplemental systemd services in their golden images [...]
This can easily be checked with systemd-analyze verify default.target and checking the output, or by checking the journal for messages like "Breaking ordering cycle by deleting job cloud-init.service".
I can add that to the docs in the appropriate place for anyone that runs into this. I'll make it a separate PR. I've added it to the "future work" list in the PR description.
Thanks for the review @setharnold!
You make good points. I was hopeful that this would be a good name for it, and I still think it is, however you're right that this may make the transition unnecessarily painful. I've changed the name of the primary service to
With these changes, cloud-init can only run these stages in the required order. Cloud-init uses systemd ordering to guarantee that a stage is not started before it is ready, and furthermore will only run each stage once the last one has run and once it has received the
Force-pushed from 40992ea to ae7a4da.
I just rebased on tip of main to fix a unit test that failed due to the service rename.
- remove logs duplicated across stages
- send the single line traceback to systemd
- fix a minor string format in user output
Force-pushed from 0de6016 to cb7bb25.
The cc_mounts PR that we just landed introduced a merge conflict with this PR, so I just pushed a change which resolves that.
Python interpreter initialization and module import time contributes a significant amount of wall clock time to cloud-init's runtime (and therefore to total boot time). Cloud-init has four stages. Each stage starts its own Python interpreter and loads the same libraries. To eliminate the redundant work of starting an interpreter and loading libraries, this changes cloud-init to run as a single process. Systemd service ordering is retained by using the existing cloud-init services as shims which use a synchronization protocol to start each cloud-init stage and to communicate that each stage is complete to the init system. Since multiple cloud-init processes sit in the critical chain of starting the system, this reduces boot time (including time to ssh login and time to cloud-init completion). Currently only systemd is supported, but the synchronization protocol should be capable of supporting other init systems as well with minor changes.

Note: This enables many additional follow-on improvements that eliminate redundant work. However, these potential improvements are temporarily ignored. This commit has been structured to minimize the changes required to capture the majority of primary performance savings while preserving correctness and the ability to preserve backwards compatibility.

Since this changes the semantics of the existing cloud-init unit files, this change takes the opportunity to rename one of its systemd units which causes frequent user confusion. The unit named cloud-init.service is often mistaken by users for being the only cloud-init service, when it is simply one of four stages. This stage is documented as the "network" stage, so this service will be renamed to "cloud-init-network.service". A new notify service is added as part of this implementation which contains the cloud-init process. This unit is named "cloud-init-main.service".
Synchronization protocol
========================
- create one Unix socket for each systemd service stage
- send sd_notify()
- For each of the four stages (local, network, config, final):
  - when the init system sends "start" to the Unix socket, start the stage
  - when the running stage is complete, send "done" to the Unix socket

File changes
============

socket.py (new)
---------------
- define a systemd-notify helper function
- define a context manager which implements a multi-socket synchronization protocol

cloud-init.service -> cloud-init-network.service (renamed)
----------------------------------------------------------
- renamed to cloud-init-network.service

cloud-{init-local,init-network,config,final}.services
-----------------------------------------------------
- change ExecStart to use netcat to connect to the Unix socket and:
  - send a start message
  - wait for a completion response
- note: a pure Python equivalent is possible for any downstreams which do not package openbsd's netcat

cloud-init-main.service (new)
-----------------------------
- set service type to 'notify'
- invoke cloud-init in single process mode
- adopt systemd ordering requirements from cloud-init-local.service
- adopt KillMode from cloud-final.service

main.py
-------
- add a command line flag to indicate "all stages" mode
- in this mode, run each stage followed by an IPC synchronization protocol step

cloud-final.service
-------------------
- drop KillMode

cloud-init-local.service
------------------------
- drop dependencies made redundant by ordering after cloud-init-main.service

Performance Impact
==================
On Ubuntu 24.04, Python's wall clock start up time as measured with
`time python3 -c 'import cloudinit.cmd.main'` on a few cloud types:

- lxc container: 0.256s
- QEMU machine: 0.300s
- gce instance: 0.367s
- ec2 instance: 0.491s

This change eliminates 1x this start up time from time to ssh.
This change eliminates 3x this start up time from cloud-init's total completion.
Total benefit varies based on the platform that the instance is hosted by, but all platforms will measurably benefit from this change. BREAKING_CHANGE: Run all four cloud-init services as a single systemd service.
Python interpreter initialization and module import time contributes a significant amount of wall clock time to cloud-init's runtime (and therefore to total boot time). Cloud-init has four stages. Each stage starts its own Python interpreter and loads the same libraries. To eliminate the redundant work of starting an interpreter and loading libraries, this changes cloud-init to run as a single process. Systemd service ordering is retained by using the existing cloud-init services as shims which use a synchronization protocol to start each cloud-init stage and to communicate that each stage is complete to the init system. Since multiple cloud-init processes sit in the critical chain of starting the system, this reduces boot time (including time to ssh login and time to cloud-init completion). Currently only systemd is supported, but the synchronization protocol should be capable of supporting other init systems as well with minor changes. Note: This enables many additional follow-on improvements that eliminate redundant work. However, these potential improvements are temporarily ignored. This commit has been structured to minimize the changes required to capture the majority of primary performance savings while preserving correctness and the ability to preserve backwards compatibility. Since this changes the semantics of the existing cloud-init unit files, this change takes the opportunity to rename one of its systemd units which causes frequent user confusion. The unit named cloud-init.service is often mistaken by users for being the only cloud-init service, when it is simply one of four stages. This stage is documented as the "network" stage, so this service will be renamed to "cloud-init-network.service". A new notify service is added as part of this implementation which contains the cloud-init process. This unit is named "cloud-init-main.service". 
Synchronization protocol ======================== - create one Unix socket for each systemd service stage - send sd_notify() - For each of the four stages (local, network, config, final): - when init system sends "start" to the Unix socket, start the stage - when running stage is complete, send "done" to Unix socket File changes ============ socket.py (new) --------------- - define a systemd-notify helper function - define a context manager which implements a multi-socket synchronization protocol cloud-init.service -> cloud-init-network.service (renamed) ---------------------------------------------------------- - renamed to cloud-network.service cloud-{init-local,init-network,config,final}.services ------------------------------------------- - change ExecStart to use netcat to connect to Unix socket and: - send a start message - wait for completion response - note: a pure Python equivalent is possible for any downstreams which do not package openbsd's netcat cloud-init-main.service (new) ----------------------------- - use service type to 'notify' - invoke cloud-init in single process mode - adopt systemd ordering requirements from cloud-init-local.service - adopt KillMode from cloud-final.service main.py ------- - Add command line flag to indicate "all stages" mode - In this mode run each stage followed by an IPC synchronization protocol step cloud-final.services -------------------- - drop KillMode cloud-init-local.services ------------------------- - drop dependencies made redundant by ordering after cloud-init-main.service Performance Impact ================== On Ubuntu 24.04, Python's wall clock start up time as measured with `time python3 -c 'import cloudinit.cmd.main' on a few cloud types: lxc container: 0.256s QEMU machine: 0.300s gce instance: 0.367s ec2 instance: 0.491s This change eliminates x1 this start up time from time to ssh. This change eliminates x3 this start up time from cloud-init's total completion. 
The total benefit varies by hosting platform, but all platforms will measurably benefit from this change.

BREAKING_CHANGE: Run all four cloud-init services as a single systemd service.
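The stage handshake described in the commit message can be sketched in plain Python. This is an illustrative model, not the PR's code: paths are temporary, the systemd shims are stand-in threads, and the real implementation lives in `cloudinit/socket.py`:

```python
import os
import socket
import tempfile
import threading

rundir = tempfile.mkdtemp()  # stands in for the real run directory

def stage_socket(stage):
    # one Unix datagram socket per stage, bound by the long-lived process
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    sock.bind(os.path.join(rundir, stage))
    return sock

def shim(stage, results):
    # stands in for each systemd unit's ExecStart (netcat in the PR)
    client = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    client.bind(os.path.join(rundir, stage + "-return"))  # reply address
    client.sendto(b"start", os.path.join(rundir, stage))
    msg, _ = client.recvfrom(1024)  # blocks until the stage reports done
    results.append((stage, msg.decode()))

results = []
for stage in ("local", "network", "config", "final"):
    sock = stage_socket(stage)
    t = threading.Thread(target=shim, args=(stage, results))
    t.start()
    msg, addr = sock.recvfrom(1024)  # wait for the init system's "start"
    assert msg == b"start"
    # ... the real process would run the cloud-init stage here ...
    sock.sendto(b"done", addr)       # signal completion to the init system
    t.join()

print(results)
```

Because each stage has its own socket, the init system's ordering of the four shim units is what sequences the stages; the long-lived process just waits on whichever socket fires next.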
Anyone who prefers to avoid the openbsd netcat dependency might consider the following patch.
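The patch itself is not reproduced here, but a pure-Python stand-in for the netcat shim might look like the sketch below. The function name, paths, and message strings are assumptions for illustration; a fake "main process" thread is included only so the demo is self-contained:

```python
import os
import socket
import tempfile
import threading

def notify_and_wait(stage_path, return_path):
    """Send 'start' to the main process and block until it replies 'done'."""
    client = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    client.bind(return_path)            # address the main process replies to
    client.sendto(b"start", stage_path)
    msg, _ = client.recvfrom(4096)      # blocks until the stage is complete
    client.close()
    os.unlink(return_path)
    return msg.decode()

# Demo: fake "main process" that answers the shim.
d = tempfile.mkdtemp()
stage = os.path.join(d, "local")
server = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
server.bind(stage)

def fake_main():
    msg, addr = server.recvfrom(4096)
    if msg == b"start":
        server.sendto(b"done", addr)

threading.Thread(target=fake_main).start()
result = notify_and_wait(stage, os.path.join(d, "local-return"))
print(result)
```

As the PR notes, a pure-Python shim starts noticeably slower than netcat because it pays the interpreter start-up cost the change is trying to avoid, which is the trade-off downstreams would be accepting.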
Adapt enabled services in both baseline and releng profiles to cope with changes in cloud-init 24.3. Related-to: https://gitlab.archlinux.org/archlinux/packaging/packages/cloud-init/-/issues/3 Related-to: canonical/cloud-init#5489 Signed-off-by: David Runge <[email protected]>
@holmanb If this is possible, I don't understand why cloud-init would not just rely on Python fully in the first place, instead of on something as undermaintained as openbsd-netcat? 🤔 Relatedly, this change has not been mentioned in the release announcements or the changelog (it can only be found in the last footnote of a separate announcement on an Ubuntu discourse post). I'm raising this because I am packaging this project for Arch Linux, and due to the opaque handling these breaking changes were not initially clear to me (which broke first cloud-init generally, and then our integration with our installation medium, which I am still fixing).
Adapt enabled services in both baseline and releng profiles to cope with changes in cloud-init 24.3. Related-to: https://gitlab.archlinux.org/archlinux/packaging/packages/cloud-init/-/issues/3 Related-to: canonical/cloud-init#5489 Fixes: https://gitlab.archlinux.org/archlinux/archiso/-/issues/236 Signed-off-by: David Runge <[email protected]>
Openbsd-netcat is significantly faster to start, based on our testing. A compiled language would likely perform even better with fewer dependencies (outside of a compiler) - we may implement that in the future. Since cloud-init sits in the critical chain during boot, a slower pure-Python shim would reduce the benefits of this change while keeping its other complexity.
How is it undermaintained?
This was actually mentioned under the "Breaking Changes" section in the release announcements. We broadcasted it everywhere, but the release notes admittedly didn't describe the change in detail.
I see the frustration, sorry :(
If you were to replace gnu-netcat with openbsd-netcat, like some other distros have, this wouldn't be an issue. Would that be possible?
Hey David, I just saw the downstream commit[1] and was confused about why the units are being manually enabled. Ubuntu and other distros don't do that. The generator should discover when it is running on a cloud and enable cloud-init accordingly. If a user wants to force cloud-init to use a specific datasource when building the image, they should set that explicitly in configuration.

P.S. Sorry to comment here rather than on the Gitlab issue. It seems account creation is disabled there and I'm waiting on account support to give me one.

[1] archlinux/archiso@a0af54f
@dvzrv See my last comment. Also, we're currently considering a compiled netcat replacement, since this should be trivially simple and will eliminate the external dependencies (it would just add a build dep for the compiler). I'll ping you if / when this becomes available since that will obviously affect downstreams.
I'm not sure. Someone would need to investigate whether some of the dependents use gnu-netcat-only features.
The use of cloud-init in the archiso project is a bit different from a "package", as the files you are referring to are used in the context of a
We also do not enable the units in the cloud-init package, but only in the archiso profiles basically.
Ah, interesting! Please feel free to keep me in the loop! Many thanks 🙏
Proposed Commit Message
(if you made it this far, don't worry: the implementation code is only 2x longer than this commit message!)
Design
Why use Unix sockets?
The implementation requires a fast, event-driven, lossless, bi-directional, cross platform IPC mechanism that is easily limited to root-only send/receive. Alternatives obviously exist, but Unix sockets seem like a natural fit.
Why not just share one socket?
A socket per stage keeps the implementation simple and makes failure modes easy to reason about.
Why openbsd netcat?
It supports Unix sockets and it's 7x faster than a bare bones pure Python implementation. A C / Rust / Go based implementation would be trivial to write, but netcat already exists and does the job sufficiently fast.
Why a new service?
- any external service that is ordered `Before=cloud-init-local.service` can now run in parallel to the interpreter loading
- by decoupling interpreter loading from running cloud-init logic, systemd can potentially start loading cloud-init earlier in boot (note the difference in ordering between cloud-init-single.service and cloud-init-local.service)
- changing all four services to be shims that use an identical synchronization protocol to trigger and wait for the cloud-init stages keeps the design symmetrical - this makes it easier to debug and describe
Security implications
Dependencies
No additional dependencies required - this change implements our own sd_notify.
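A hand-rolled sd_notify needs only a datagram send to the socket systemd advertises in `NOTIFY_SOCKET`; the sketch below shows the general shape (a sketch, not cloud-init's actual helper):

```python
import os
import socket

def sd_notify(message: bytes) -> None:
    """Send a notification (e.g. b"READY=1") to systemd, if supervised.

    systemd passes the notification socket's path in NOTIFY_SOCKET;
    "READY=1" is what flips a Type=notify unit from activating to active.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return  # not running under systemd supervision; silently no-op
    if path.startswith("@"):
        path = "\0" + path[1:]  # abstract-namespace socket address
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.connect(path)
        sock.sendall(message)

sd_notify(b"READY=1")
```

Messages like `STATUS=...` and `ERRNO=...` use the same transport, so the one helper covers both readiness and status reporting.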
File permissions
The directory which contains the unix sockets is limited to root-only (see `cloudinit/socket.py::67`). On Ubuntu, `DEFAULT_RUN_DIR` resolves to `/run/cloud-init/`. Currently (before this PR), writes to this directory could modify cloud-init's configuration; cloud-init depends on POSIX permissions.
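The root-only guarantee amounts to creating the socket directory with owner-only permissions before binding anything in it. A minimal sketch (illustrative mode handling, not cloud-init's exact code):

```python
import os
import stat
import tempfile

base = tempfile.mkdtemp()
sock_dir = os.path.join(base, "share")

# Request owner-only access at creation time...
os.mkdir(sock_dir, mode=0o700)
# ...and enforce it explicitly, since the process umask can clear
# bits from the mode passed to mkdir but chmod applies it verbatim.
os.chmod(sock_dir, 0o700)

mode = stat.S_IMODE(os.stat(sock_dir).st_mode)
print(oct(mode))
```

When the directory is owned by root with mode 0700, non-root processes can neither connect to the sockets inside it nor create look-alike sockets, which is the entire access-control story for the protocol.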
Todo (this PR)
- would benefit development and could better stress test the synchronization code for race conditions
- `tests/integration_tests/test_multi_part_user_data_handling.py::test_cloud_config_archive_boot_hook_logging`
Future work (followup PRs, required pre-24.3 release)
- `tee` is invoked by default (which is completely unnecessary for stream handling - we can probably ditch this dependency), which leaves processes hanging around after cloud-init signals completion. This leaves extraneous log messages in the output of `systemctl status cloud-init-single.service` (see Appendix B).

[RFC] Names (closed)
`cloud-init --single-process` and `cloud-init-single.service` are the working names that I've used, but I think that we can do better.

- `cloud-init` - this naming proposal is the spiritual successor to this spec: we could rename `cloud-init.service` to `cloud-init-network.service` and repurpose the service name `cloud-init.service` to be the service that, you know, "runs cloud-init". This would probably be the best option in the long run, but would come at the cost of short-term pain for distributions. Since we won't backport this change to old releases this might be feasible. I'd be willing to write a blog post and release notes regarding the choice, and the changes that are visible to the user/distributor should be pretty straightforward: any package that has an ordering dependency on `cloud-init.service` could just `sed -i 's/cloud-init.service/cloud-init-network.service/g'`.
- `runner` / `run` - the names `cloud-init --run` and `cloud-init-runner.service` make it clear that a thing is running, but some might still see a `cloud-init.service` and incorrectly assume that this is the name of the service that runs cloud-init (many have made this incorrect assumption).

Rejected names

- `daemon` - this isn't a long-lived process. While this lives longer than the previous four processes, it simply combines their four lifespans into one. This name would be misleading.
- `all` - `cloud-init --all` and `cloud-init-all.service` convey the right meaning, but it requires context to understand - the only reason to call this "all" is because it replaces multiple parts.

Appendix
A. `systemctl status cloud-init-local.service`
B. `systemctl status cloud-init-single.service`
C. Example log from instance booted with latest commit: single-process.log