feat: single process optimization #5489

Merged
merged 22 commits into from
Aug 2, 2024

Commits on Aug 2, 2024

  1. feat: Single process optimization

    Python interpreter initialization and module imports account for a
    large share of cloud-init's total runtime under the default
    configuration, and in every configuration they add noticeable
    wall-clock time.
    
    This commit shortens cloud-init's time to completion by eliminating
    redundant interpreter starts and module loads. Because multiple
    cloud-init processes sit in the critical chain of the boot order,
    this also reduces cloud-init's time to ssh.
    
    Cloud-init has four stages. Each stage starts its own Python
    interpreter and loads the same libraries. To eliminate the
    redundant work of starting an interpreter and loading libraries,
    this changes cloud-init to run as a single process. Systemd
    service ordering is retained by using the existing cloud-init
    services as shims which use a synchronization protocol to start
    each cloud-init stage and to communicate that each stage is
    complete to the init system.
    
    Currently only systemd is supported, but the synchronization
    protocol should be capable of supporting other init systems
    as well with minor changes.
    
    Note: This makes possible many additional improvements that
    eliminate redundant work, but they are deliberately left out of
    this commit. The commit is structured to capture most of the
    performance savings with minimal changes while preserving
    correctness and the ability to preserve backwards compatibility.
    Many additional improvements will be possible once this merges.
    
    Synchronization protocol
    ========================
    - create one Unix socket for each systemd service stage
    - send sd_notify()
    - For each of the four stages (local, network, config, final):
       - when init system sends "start" to the Unix socket, start the
         stage
       - when running stage is complete, send "done" to Unix socket
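
The protocol steps above can be sketched in Python as follows. This is a minimal illustration, not the code added in socket.py: it assumes Unix datagram sockets, and the socket file names, the `bind_stage_sockets`/`serve_stages` helpers, and the `demo` driver are hypothetical names chosen for the example. A thread stands in for the single cloud-init process, and the main thread plays the role of the four systemd shim services.

```python
import os
import socket
import tempfile
import threading

STAGES = ("local", "network", "config", "final")


def bind_stage_sockets(sock_dir):
    """Create one Unix datagram socket per stage (paths are illustrative)."""
    socks = {}
    for stage in STAGES:
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        sock.bind(os.path.join(sock_dir, f"{stage}.sock"))
        socks[stage] = sock
    return socks


def serve_stages(socks, run_stage):
    """Run each stage once its socket receives "start", then reply "done"."""
    # A real service would send its sd_notify() message before this loop.
    for stage in STAGES:
        message, sender = socks[stage].recvfrom(64)
        if message.strip() != b"start":
            raise ValueError(f"unexpected message for {stage}: {message!r}")
        run_stage(stage)
        socks[stage].sendto(b"done", sender)
        socks[stage].close()


def demo():
    """Play the shim services in this thread while a server thread runs stages."""
    ran, replies = [], []
    with tempfile.TemporaryDirectory() as sock_dir:
        socks = bind_stage_sockets(sock_dir)
        server = threading.Thread(target=serve_stages, args=(socks, ran.append))
        server.start()
        for stage in STAGES:
            with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as shim:
                # Bind a return address so the server has somewhere to reply.
                shim.bind(os.path.join(sock_dir, f"{stage}-return.sock"))
                shim.sendto(b"start", os.path.join(sock_dir, f"{stage}.sock"))
                replies.append(shim.recv(64).decode())
        server.join()
    return ran, replies
```

Binding every socket before any stage runs means a "start" message that arrives early is queued rather than lost, so the init system's ordering alone decides when each stage begins.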
    
    socket.py (new)
    ---------------
    
    - define a systemd-notify helper function
    - define a context manager which implements a multi-socket
      synchronization protocol
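
For readers unfamiliar with the notify mechanism, a systemd-notify helper can be written in a few lines of pure Python. The sketch below is an assumption about what such a helper might look like, not the function defined in socket.py; it relies only on the documented behavior that systemd passes a datagram socket address in the `NOTIFY_SOCKET` environment variable, with a leading `@` denoting an abstract-namespace socket.

```python
import os
import socket


def sd_notify(message: str) -> bool:
    """Send a state string such as "READY=1" to the init system.

    Returns False when NOTIFY_SOCKET is unset, e.g. when not run by systemd.
    """
    address = os.environ.get("NOTIFY_SOCKET", "")
    if not address:
        return False
    if address.startswith("@"):
        # A leading "@" marks a Linux abstract-namespace socket.
        address = "\0" + address[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.connect(address)
        sock.sendall(message.encode("utf-8"))
    return True
```

Returning a boolean instead of raising lets the same code path run outside systemd, for example from an interactive shell.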
    
    cloud-init-single.service (new)
    -------------------------------
    
     - set service type to 'notify'
     - invoke cloud-init in single process mode
     - adopt systemd ordering requirements from cloud-init-local.service
     - adopt KillMode from cloud-final.service
    
    main.py
    -------
    
     - Add command line flag to indicate single process mode
     - In this mode run each stage followed by an IPC
       synchronization protocol step
    
    cloud-{local,init,config,final}.services
    ----------------------------------------
    
    - change ExecStart to use netcat to connect to Unix socket and:
      - send a start message
      - wait for completion response
    - note: a pure Python equivalent is possible for any downstreams
      which do not package openbsd's netcat
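
One possible shape for the pure Python equivalent mentioned in the note is sketched below. It assumes datagram sockets and a shim that is given two paths: the stage socket to contact and a return socket to receive the reply on. The function names, argument order, and exit codes are hypothetical and are not taken from the commit.

```python
import os
import socket
import sys


def start_and_wait(stage_socket: str, return_socket: str) -> str:
    """Send "start" to a stage socket and block until a reply arrives."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        # Bind first so the receiver has an address to reply to.
        sock.bind(return_socket)
        try:
            sock.sendto(b"start", stage_socket)
            return sock.recv(1024).decode("utf-8", errors="replace")
        finally:
            # Remove the socket file so a later run can bind again.
            os.unlink(return_socket)


def main(argv) -> int:
    """Return 0 when the stage reports "done", 1 otherwise, 2 on misuse."""
    if len(argv) != 3:
        print("usage: shim STAGE_SOCKET RETURN_SOCKET", file=sys.stderr)
        return 2
    reply = start_and_wait(argv[1], argv[2])
    return 0 if reply.strip() == "done" else 1
```

A unit file could call this through `sys.exit(main(sys.argv))`, so that the shim service's exit status reflects whether the stage completed.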
    
    cloud-final.service
    -------------------
    - drop KillMode
    
    cloud-init-local.service
    ------------------------
    - drop dependencies made redundant by ordering after
      cloud-init-single.service
    
    Performance Results
    ===================
    
    An integration test [1] on a Noble lxd container comparing the POC to the
    current release showed significant improvement. In the POC,
    cloud-config.service did not register in the critical chain (~340ms
    improvement), cloud-init.service added ~378ms to total boot time (~400ms
    improvement), and cloud-init-local.service showed a marginal improvement
    (~90ms) that was likely within the noise threshold. Together these
    amount to a 0.83s improvement in this (early stage) test, with 0.66s of
    boot time still attributable to cloud-init. The 0.83s figure roughly
    matches the change in total boot time: systemd-analyze critical-chain
    reported 2.267s to reach graphical.target, a 0.8s improvement over the
    current release.
    
    Note: The numbers quoted above were gathered from only one series (Noble),
    one platform (lxc), and one host machine (Ryzen 7840U), and the run
    environment was not controlled. I ran the test multiple times to ensure
    that the results were repeatable, but not enough times for them to be
    considered statistically significant. I verified that cloud-init worked
    as expected, but given the limited scope of this integration test, this
    is still very much a proof of concept.
    
    [1] test_logging.py
    
    BREAKING_CHANGE: Run all four cloud-init services as a single systemd service.
    holmanb committed Aug 2, 2024
    f7ccda9
  2. comments

    holmanb committed Aug 2, 2024
    79e191f
  3. 7a62897
  4. c127fca
  5. Rename cloud-init services to be more intuitive.

    Make cloud-network.service map to the cloud-init network stage.
    Make cloud-init.service map to all of cloud-init.
    
    BREAKING CHANGE: Changes the semantics of the cloud-init.service files
    holmanb committed Aug 2, 2024
    3247c11
  6. d87be7c
  7. Improve intra-stage error handling

    - make it such that if one stage fails, the next stage isn't blocked
      indefinitely
    - notify the init system of per-stage exit codes and failure messages
    - make parent process (cloud-init.service) exit with representative exit code
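
The three points above can be illustrated with a short sketch. It is not the commit's implementation: the `run_all_stages` function, the `report` callback, and the choice of 0/1 exit codes are all assumptions made for the example. The idea shown is that a failing stage is reported rather than allowed to abort the loop, so whoever is waiting on the next stage still gets an answer.

```python
from typing import Callable, Dict


def run_all_stages(
    stages: Dict[str, Callable[[], None]],
    report: Callable[[str, int, str], None],
) -> int:
    """Run every stage in order and return an overall exit code.

    report(name, exit_code, message) stands in for notifying the init
    system about one stage; it is a hypothetical callback.
    """
    overall = 0
    for name, stage in stages.items():
        try:
            stage()
        except Exception as error:
            # Report the failure and keep going so later stages are not
            # left waiting on a process that has already died.
            report(name, 1, f"{type(error).__name__}: {error}")
            overall = 1
        else:
            report(name, 0, "done")
    return overall
```

The returned value is non-zero if any stage failed, which is one simple way for the parent process to exit with a representative code.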
    holmanb committed Aug 2, 2024
    18aa5b3
  8. Do not set up logger multiple times

    Add a new flag attribute to the argparse Namespace which is used to
    skip logger setup when it has already been done.
    
    This isn't elegant, but fixing logging is going to be a large refactor,
    so this gets logging "working" for now while minimizing the number of
    LOC changed.
    holmanb committed Aug 2, 2024
    5c05690
  9. fix commandline (for debugger use)

    skips sync protocol when stdin is a tty
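
A check of this kind is typically a one-line test on the standard input stream. The helper below is a hypothetical sketch (the name `should_sync` and its optional parameter are not from the commit) showing how the decision could be isolated so it is easy to exercise without a real terminal.

```python
import sys


def should_sync(stream=None) -> bool:
    """Return True when the synchronization protocol should be used.

    An interactive terminal on stdin suggests a person (or a debugger)
    launched the process, so there is no init system to synchronize with.
    """
    stream = sys.stdin if stream is None else stream
    return not stream.isatty()
```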
    holmanb committed Aug 2, 2024
    14ca37f
  10. c2079ea
  11. drop unused stdin

    holmanb committed Aug 2, 2024
    f0944d0
  12. comments

    holmanb committed Aug 2, 2024
    417d550
  13. fix new tests

    holmanb committed Aug 2, 2024
    cef0f5e
  14. 7d13021
  15. 0848734
  16. format

    holmanb committed Aug 2, 2024
    212f841
  17. 5089e59
  18. Clean up UI

    - remove logs duplicated across stages
    - send the single line traceback to systemd
    - fix a minor string format in user output
    holmanb committed Aug 2, 2024
    a053c19
  19. format

    holmanb committed Aug 2, 2024
    9ca7a97
  20. comments

    holmanb committed Aug 2, 2024
    7df7d83
  21. fix merge conflicts

    holmanb committed Aug 2, 2024
    cb7bb25
  22. update unit tests

    holmanb committed Aug 2, 2024
    4b7dbbb