
feat: Return error in cloud-init status if stage not run #5579

Closed
wants to merge 3 commits

Conversation

@TheRealFalcon (Member) commented Aug 2, 2024

This is in draft state as I'm looking for feedback; it still needs test updates.

Proposed Commit Message

feat: Return error in `cloud-init status` if stage not run

Fixes GH-5304

Additional Context

The original issue in GH-5304 was not reproducible on current versions of cloud-init. Since the version reported there, we've updated our status checking so that cloud-init is only considered running if status.json exists but result.json doesn't, and checking systemd status can now only change cloud-init status from RUNNING to ERROR, whereas the previous code could change status back to RUNNING.
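In other words, the running check now reduces to roughly the following (a minimal sketch; the file paths match the runtime files mentioned in this thread, but the function itself is illustrative, not the actual implementation):

import os

def is_running() -> bool:
    # cloud-init is considered running only while status.json exists
    # and result.json has not yet been written.
    return os.path.exists("/run/cloud-init/status.json") and not os.path.exists(
        "/run/cloud-init/result.json"
    )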

So instead I addressed this comment:
Long term cloud-init status should return a fatal error code and indicate this issue

Thinking about it more though, I'm not actually sure we want this behavior, and if we do it probably shouldn't be applied to existing releases. While the supported cloud-init path is to run all 4 services, we have seen numerous use cases where people do things we wouldn't endorse. Amazon Linux completely removing cloud-init-local.service comes to mind. I'm not proposing we officially support those use cases, but I'm also not sure it makes sense to declare there was an error when the lack of a service running may have been intended behavior and technically nothing "bad" happened.

This PR makes it an error for one of the stages not to have run. After masking cloud-init-local.service and cloud-init.service, I get these results:

root@me:~# cloud-init status --long
status: error
extended_status: error - done
boot_status_code: enabled-by-generator
last_update: Thu, 01 Jan 1970 00:00:02 +0000
detail: Cloud-init enabled by systemd cloud-init-generator
errors:
	- Failed to start stage: init
	- Failed to start stage: init-local
recoverable_errors:
WARNING:
	- Failed to write boot finished file /var/lib/cloud/instance/boot-finished
	- Used fallback datasource
root@me:~# cloud-init status --format json
{
  "boot_status_code": "enabled-by-generator",
  "datasource": null,
  "detail": "Cloud-init enabled by systemd cloud-init-generator",
  "errors": [
    "Failed to start stage: init",
    "Failed to start stage: init-local"
  ],
  "extended_status": "error - done",
  "init": {
    "errors": [],
    "finished": null,
    "recoverable_errors": {},
    "start": null
  },
  "init-local": {
    "errors": [],
    "finished": null,
    "recoverable_errors": {},
    "start": null
  },
  "last_update": "Thu, 01 Jan 1970 00:00:02 +0000",
  "modules-config": {
    "errors": [],
    "finished": 2.23,
    "recoverable_errors": {},
    "start": 2.12
  },
  "modules-final": {
    "errors": [],
    "finished": 2.5,
    "recoverable_errors": {
      "WARNING": [
        "Failed to write boot finished file /var/lib/cloud/instance/boot-finished",
        "Used fallback datasource"
      ]
    },
    "start": 2.44
  },
  "recoverable_errors": {
    "WARNING": [
      "Failed to write boot finished file /var/lib/cloud/instance/boot-finished",
      "Used fallback datasource"
    ]
  },
  "stage": null,
  "status": "error"
}

(The timestamps are an artifact of my environment, not an issue.)

Test Steps

Merge type

  • Squash merge using "Proposed Commit Message"
  • Rebase and merge unique commits. Requires commit messages per-commit each referencing the pull request number (#<PR_NUM>)

@blackboxsw self-assigned this Aug 2, 2024
Comment on lines 436 to 440
return [
    f"Failed to start stage: {stage_name}"
    for stage_name, stage_info in status_v1.items()
    if stage_name in STAGE_NAME.keys() and stage_info.get("start") is None
]
@blackboxsw (Collaborator) commented Aug 2, 2024

  • Couldn't this logic instead iterate over the known STAGE_NAME keys rather than looping over status_v1.items() and filtering?

  • Also, this has the effect of reporting errors while cloud-init is still running, because early calls to status may come before all cloud-init stages have had a chance to run. A call to cloud-init status before cloud-init-local even starts would report errors too. Do we want to also check is_running() before trying to return these errors?

  • Another concern: if cloud-final.service is ever not part of the systemd (or other) boot targets, we would never create the /run/cloud-init/result.json file, which would leave us in a state where we probably also want to peek at whether the system's desired run level was reached and whether the init system is 'up/active'.

Suggested change

return [
    f"Failed to start stage: {stage_name}"
    for stage_name, stage_info in status_v1.items()
    if stage_name in STAGE_NAME.keys() and stage_info.get("start") is None
]

return [
    f"Failed to start stage: {stage_name}"
    for stage_name in STAGE_NAME.keys()
    if status_v1.get(stage_name, {}).get("start") is None
]
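The practical difference shows up when a stage never wrote anything to status.json at all. A minimal illustration with hypothetical data:

# Hypothetical status_v1 in which init-local never recorded a start:
status_v1 = {"init": {"start": 1.0}}

# Original comprehension: iterates status_v1.items(), so the absent
# "init-local" key is never visited and no error is produced for it.
# Suggested comprehension: iterates STAGE_NAME, so the missing entry
# yields "Failed to start stage: init-local".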

@TheRealFalcon (Member Author)

Thanks for the comment and patch. Also, my apologies, I meant to mark this PR as draft as I knew it wasn't quite ready for review yet. I've applied your patch and added an additional unit test.

Another concern: if cloud-final.service is ever not part of the systemd (or other) boot targets, we would never create the /run/cloud-init/result.json file, which would leave us in a state where we probably also want to peek at whether the system's desired run level was reached and whether the init system is 'up/active'.

I'm open to seeing an implementation of this, but my previous attempt at doing this wasn't it. 😉

Since that case is outside of what we support, I see it as a future nice to have rather than a need to have.
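For what it's worth, a unit test for the missing-stage case might look roughly like this (an illustrative sketch, not the test added in the PR; get_not_started_errors is the function shown in the diff later in this thread):

from cloudinit.cmd.status import get_not_started_errors

def test_unstarted_stage_is_reported():
    # init ran, but init-local never recorded a start time.
    status_v1 = {"init": {"start": 1.0, "finished": 2.0}}
    errors = get_not_started_errors(status_v1)
    assert "Failed to start stage: init-local" in errors
    assert "Failed to start stage: init" not in errors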

@blackboxsw (Collaborator) commented Aug 7, 2024

Yep, that concern just notes a gap in behavior if cloud-final is kicked out of systemd boot goals due to an ordering cycle, but in those events systemctl list-units --failed, systemctl status cloud-init-main, or systemctl status cloud-final will all show the error condition as well. So I'm not concerned with this PR having to address this gap.

@blackboxsw (Collaborator) left a comment

Potential alternative that at least takes the DONE status into account. We also don't want to error on RunningStatus.DISABLED, as nothing should run at that point:

--- a/cloudinit/cmd/status.py
+++ b/cloudinit/cmd/status.py
@@ -435,12 +435,12 @@ def get_not_started_errors(status_v1) -> List[str]:
     """Return a list of errors from status_v1 that are not started."""
     return [
         f"Failed to start stage: {stage_name}"
-        for stage_name, stage_info in status_v1.items()
-        if stage_name in STAGE_NAME.keys() and stage_info.get("start") is None
+        for stage_name in STAGE_NAME
+        if status_v1.get(stage_name, {}).get("start") is None
     ]
 
 
-def get_errors(status_v1) -> Tuple[List, Dict]:
+def get_errors(status_v1, running_status: RunningStatus) -> Tuple[List, Dict]:
     """Return a list of errors and recoverable_errors from status_v1."""
     errors = []
     recoverable_errors = {}
@@ -461,7 +461,8 @@ def get_errors(status_v1) -> Tuple[List, Dict]:
                     recoverable_errors[err_type].extend(
                         current_recoverable_errors[err_type]
                     )
-    errors.extend(get_not_started_errors(status_v1)) 
+    if running_status == RunningStatus.DONE:
+        errors.extend(get_not_started_errors(status_v1))
     return errors, recoverable_errors
 
 
@@ -496,16 +497,16 @@ def get_status_details(
         else ""
     )
 
-    errors, recoverable_errors = get_errors(status_v1)
+    running_status = get_running_status(
+        status_file, result_file, boot_status_code, latest_event
+    )
+
+    errors, recoverable_errors = get_errors(status_v1, running_status)
     if errors:
         condition_status = ConditionStatus.ERROR
     elif recoverable_errors:
         condition_status = ConditionStatus.DEGRADED
 
-    running_status = get_running_status(
-        status_file, result_file, boot_status_code, latest_event
-    )
-
     if (
         running_status == RunningStatus.RUNNING
         and uses_systemd()

@holmanb (Member) commented Aug 2, 2024

Thinking about it more though, I'm not actually sure we want this behavior, and if we do it probably shouldn't be applied to existing releases. While the supported cloud-init path is to run all 4 services, we have seen numerous use cases where people do things we wouldn't endorse. Amazon Linux completely removing cloud-init-local.service comes to mind.

I'm not proposing we officially support those use cases, but I'm also not sure it makes sense to declare that there was an error when the lack of a service running may have been intended behavior and technically nothing "bad" happened.

Downstreams can do whatever they want, but if they choose to do really weird things like skipping entire stages, that doesn't mean it's upstream's responsibility to maintain their code. The issue you have addressed here should be fixed upstream. This doesn't prevent their downstream from working; they will just have to patch this behavior to work for their modified downstream version. Downstreams doing weird things doesn't make it upstream's maintenance problem.

If there were a way to fix this with the same amount of work that didn't require breaking a downstream, I would probably advocate for that (taking into account good design, maintainability, etc.). However, it isn't feasible to only allow changes that break no downstreams; mutually exclusive changes could exist, which would paralyze an upstream that operates this way. I think we should do what is right by fixing this issue, and if we're feeling generous, a ping in their downstream bug tracker about this change would be a friendly thing to do.

@TheRealFalcon (Member Author)

@holmanb, I said I'm not proposing we officially support those use cases, and then it felt like your response was about how we shouldn't support those use cases. I'm not disagreeing on that point. I'm more wondering whether this really is a problem or error case in the first place.

@holmanb (Member) commented Aug 2, 2024

@holmanb, I said I'm not proposing we officially support those use cases, and then it felt like your response was about how we shouldn't support those use cases. I'm not disagreeing on that point. I'm more wondering whether this really is a problem or error case in the first place.

I guess I don't understand what you're trying to say. Under which officially supported use case would this not be a problem or error condition?

@TheRealFalcon (Member Author)

I guess I don't understand what you're trying to say. Under which officially supported use case would this not be a problem or error condition?

Just to record what @holmanb and I discussed out of band: my main concern was the space between "supported use case" and "break at will". I didn't really see the need for adding this while knowing it could change behavior for some unsupported use cases. He made a good point that dependency ordering can create a system where cloud-init is unintentionally removed from boot, and it's right for cloud-init to report this.

@TheRealFalcon (Member Author)

If #5586 lands, it will eliminate the mypy issue.

@blackboxsw (Collaborator)

If there were a way to fix this with the same amount of work that didn't require breaking a downstream, I would probably advocate for that (taking into account good design, maintainability, etc.). However, it isn't feasible to only allow changes that break no downstreams; mutually exclusive changes could exist, which would paralyze an upstream that operates this way. I think we should do what is right by fixing this issue, and if we're feeling generous, a ping in their downstream bug tracker about this change would be a friendly thing to do.

Even if we made this a "simpler" breaking change, such as a config flag like "required_boot_stages", downstreams that don't want to run init-local would still have to carry a patch that modifies that config switch. In this case, any downstream that wants to avoid calling our intended boot stages can just remove that stage's key from cmd.main.STAGE_NAME, as sketched below.
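For illustration, such a downstream patch could be as small as deleting one key. The stage keys match the status output earlier in this thread; the display values here are illustrative, not copied from cloud-init:

# Hypothetical downstream patch to cmd.main.STAGE_NAME: removing a
# stage's key means status no longer expects that stage to have run.
STAGE_NAME = {
    # "init-local": "Local Stage",  # dropped by a downstream that
    # does not ship cloud-init-local.service
    "init": "Network Stage",
    "modules-config": "Config Stage",
    "modules-final": "Final Stage",
}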

@blackboxsw (Collaborator)

Now that the single-process work landed in 143bc9e, cloud-init will not actually be able to reach the point where result.json is written if a boot stage is missing, because cloud-init-main will block on a sync point for each expected stage. As such, I'm not sure we need this anymore for typical use cases.
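Conceptually, the blocking behavior described here is something like the sketch below; the real synchronization mechanism in cloud-init-main differs, and the helper and marker-file paths are hypothetical:

import os
import time

def wait_for_stage(stage: str) -> None:
    # Hypothetical marker file standing in for the real sync point; if
    # the stage's service never runs, this loop never exits and
    # result.json is never written.
    while not os.path.exists(f"/run/cloud-init/{stage}.done"):
        time.sleep(0.1)

for stage in ("init-local", "init", "modules-config", "modules-final"):
    wait_for_stage(stage)
# Only reached once every expected stage has signaled completion.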

@TheRealFalcon (Member Author)

Now that the single-process work landed in 143bc9e, cloud-init will not actually be able to reach the point where result.json is written if a boot stage is missing, because cloud-init-main will block on a sync point for each expected stage. As such, I'm not sure we need this anymore for typical use cases.

Good point. Since I was planning on patching this out of existing releases, and single process is the way forward, I don't really see this providing any value. Closing.
