
feat: Return error in cloud-init status if stage not run #5579

Closed
wants to merge 3 commits

Conversation

@TheRealFalcon (Member) commented Aug 2, 2024

This is in draft state as I'm looking for feedback; it still needs test updates.

Proposed Commit Message

feat: Return error in `cloud-init status` if stage not run

Fixes GH-5304

Additional Context

The original issue in GH-5304 was not reproducible on current versions of cloud-init. Since the version reported there, we've updated our status checking so that cloud-init is only considered running if status.json exists but result.json doesn't, and checking systemd status can now only change cloud-init status from RUNNING to ERROR, whereas the previous code could change status back to RUNNING.
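In other words, the running check now reduces to roughly the following (a minimal sketch; the file paths match the runtime files mentioned in this thread, but the function itself is illustrative, not the actual implementation):

import os

def is_running() -> bool:
    # cloud-init is considered running only while status.json exists
    # and result.json has not yet been written.
    return os.path.exists("/run/cloud-init/status.json") and not os.path.exists(
        "/run/cloud-init/result.json"
    )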

So instead I addressed this comment:
Long term cloud-init status should return a fatal error code and indicate this issue

Thinking about it more though, I'm not actually sure we want this behavior, and if we do it probably shouldn't be applied to existing releases. While the supported cloud-init path is to run all 4 services, we have seen numerous use cases where people do things we wouldn't endorse. Amazon Linux completely removing cloud-init-local.service comes to mind. I'm not proposing we officially support those use cases, but I'm also not sure it makes sense to declare there was an error when the lack of a service running may have been intended behavior and technically nothing "bad" happened.

This PR makes it an error for one of the stages not to have run. After masking cloud-init-local.service and cloud-init.service, I get these results:

root@me:~# cloud-init status --long
status: error
extended_status: error - done
boot_status_code: enabled-by-generator
last_update: Thu, 01 Jan 1970 00:00:02 +0000
detail: Cloud-init enabled by systemd cloud-init-generator
errors:
	- Failed to start stage: init
	- Failed to start stage: init-local
recoverable_errors:
WARNING:
	- Failed to write boot finished file /var/lib/cloud/instance/boot-finished
	- Used fallback datasource
root@me:~# cloud-init status --format json
{
  "boot_status_code": "enabled-by-generator",
  "datasource": null,
  "detail": "Cloud-init enabled by systemd cloud-init-generator",
  "errors": [
    "Failed to start stage: init",
    "Failed to start stage: init-local"
  ],
  "extended_status": "error - done",
  "init": {
    "errors": [],
    "finished": null,
    "recoverable_errors": {},
    "start": null
  },
  "init-local": {
    "errors": [],
    "finished": null,
    "recoverable_errors": {},
    "start": null
  },
  "last_update": "Thu, 01 Jan 1970 00:00:02 +0000",
  "modules-config": {
    "errors": [],
    "finished": 2.23,
    "recoverable_errors": {},
    "start": 2.12
  },
  "modules-final": {
    "errors": [],
    "finished": 2.5,
    "recoverable_errors": {
      "WARNING": [
        "Failed to write boot finished file /var/lib/cloud/instance/boot-finished",
        "Used fallback datasource"
      ]
    },
    "start": 2.44
  },
  "recoverable_errors": {
    "WARNING": [
      "Failed to write boot finished file /var/lib/cloud/instance/boot-finished",
      "Used fallback datasource"
    ]
  },
  "stage": null,
  "status": "error"
}

(The timestamps are an artifact of my environment, not an issue.)

Test Steps

Merge type

  • Squash merge using "Proposed Commit Message"
  • Rebase and merge unique commits. Requires commit messages per-commit each referencing the pull request number (#<PR_NUM>)

@blackboxsw self-assigned this Aug 2, 2024
Comment on lines 436 to 440
return [
    f"Failed to start stage: {stage_name}"
    for stage_name, stage_info in status_v1.items()
    if stage_name in STAGE_NAME.keys() and stage_info.get("start") is None
]
@blackboxsw (Collaborator) commented Aug 2, 2024

  • Couldn't this logic instead iterate over the known STAGE_NAME keys rather than looping over status_v1.items() and filtering?

  • Also, this has the effect of reporting errors while cloud-init is still running, because early calls to status may come before all cloud-init stages have had a chance to run. A call to cloud-init status before cloud-init-local even starts would report errors too. Do we want to also check is_running() before trying to return these errors?

  • Another concern: if cloud-final.service is ever not part of the systemd (or other) boot targets, we would never create the /run/cloud-init/result.json file, which would leave us in a state where we probably also want to peek at whether the system's desired run level was reached and whether the init system is 'up/active'.

Suggested change

return [
    f"Failed to start stage: {stage_name}"
    for stage_name, stage_info in status_v1.items()
    if stage_name in STAGE_NAME.keys() and stage_info.get("start") is None
]

return [
    f"Failed to start stage: {stage_name}"
    for stage_name in STAGE_NAME.keys()
    if status_v1.get(stage_name, {}).get("start") is None
]
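The practical difference shows up when a stage never wrote anything to status.json at all. A minimal illustration with hypothetical data:

# Hypothetical status_v1 in which init-local never recorded a start:
status_v1 = {"init": {"start": 1.0}}

# Original comprehension: iterates status_v1.items(), so the absent
# "init-local" key is never visited and no error is produced for it.
# Suggested comprehension: iterates STAGE_NAME, so the missing entry
# yields "Failed to start stage: init-local".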

@TheRealFalcon (Member Author)

Thanks for the comment and patch. Also, my apologies, I meant to mark this PR as draft as I knew it wasn't quite ready for review yet. I've applied your patch and added an additional unit test.

Another concern: if cloud-final.service is ever not part of the systemd (or other) boot targets, we would never create the /run/cloud-init/result.json file, which would leave us in a state where we probably also want to peek at whether the system's desired run level was reached and whether the init system is 'up/active'.

I'm open to seeing an implementation of this, but my previous attempt at doing this wasn't it. 😉

Since that case is outside of what we support, I see it as a future nice to have rather than a need to have.
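For what it's worth, a unit test for the missing-stage case might look roughly like this (an illustrative sketch, not the test added in the PR; get_not_started_errors is the function shown in the diff later in this thread):

from cloudinit.cmd.status import get_not_started_errors

def test_unstarted_stage_is_reported():
    # init ran, but init-local never recorded a start time.
    status_v1 = {"init": {"start": 1.0, "finished": 2.0}}
    errors = get_not_started_errors(status_v1)
    assert "Failed to start stage: init-local" in errors
    assert "Failed to start stage: init" not in errors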

@blackboxsw (Collaborator) commented Aug 7, 2024

Yep, that concern just notes a gap in behavior if cloud-final is kicked out of systemd boot goals due to an ordering cycle, but in those events systemctl list-units --failed, systemctl status cloud-init-main, or systemctl status cloud-final will all show the error condition as well. So I'm not concerned with this PR having to address this gap.

@blackboxsw (Collaborator) left a comment

Potential alternative that at least takes the DONE status into account. We also don't want to error on RunningStatus.DISABLED, as nothing should run at that point:

--- a/cloudinit/cmd/status.py
+++ b/cloudinit/cmd/status.py
@@ -435,12 +435,12 @@ def get_not_started_errors(status_v1) -> List[str]:
     """Return a list of errors from status_v1 that are not started."""
     return [
         f"Failed to start stage: {stage_name}"
-        for stage_name, stage_info in status_v1.items()
-        if stage_name in STAGE_NAME.keys() and stage_info.get("start") is None
+        for stage_name in STAGE_NAME
+        if status_v1.get(stage_name, {}).get("start") is None
     ]
 
 
-def get_errors(status_v1) -> Tuple[List, Dict]:
+def get_errors(status_v1, running_status: RunningStatus) -> Tuple[List, Dict]:
     """Return a list of errors and recoverable_errors from status_v1."""
     errors = []
     recoverable_errors = {}
@@ -461,7 +461,8 @@ def get_errors(status_v1) -> Tuple[List, Dict]:
                     recoverable_errors[err_type].extend(
                         current_recoverable_errors[err_type]
                     )
-    errors.extend(get_not_started_errors(status_v1)) 
+    if running_status == RunningStatus.DONE:
+        errors.extend(get_not_started_errors(status_v1))
     return errors, recoverable_errors
 
 
@@ -496,16 +497,16 @@ def get_status_details(
         else ""
     )
 
-    errors, recoverable_errors = get_errors(status_v1)
+    running_status = get_running_status(
+        status_file, result_file, boot_status_code, latest_event
+    )
+
+    errors, recoverable_errors = get_errors(status_v1, running_status)
     if errors:
         condition_status = ConditionStatus.ERROR
     elif recoverable_errors:
         condition_status = ConditionStatus.DEGRADED
 
-    running_status = get_running_status(
-        status_file, result_file, boot_status_code, latest_event
-    )
-
     if (
         running_status == RunningStatus.RUNNING
         and uses_systemd()

@holmanb (Member) commented Aug 2, 2024

Thinking about it more though, I'm not actually sure we want this behavior, and if we do it probably shouldn't be applied to existing releases. While the supported cloud-init path is to run all 4 services, we have seen numerous use cases where people do things we wouldn't endorse. Amazon Linux completely removing cloud-init-local.service comes to mind.

I'm not proposing we officially support those use cases, but I'm also not sure it makes sense to declare that there was an error when the lack of a service running may have been intended behavior and technically nothing "bad" happened.

Downstreams can do whatever they want, but if they choose to do really weird things like skipping entire stages, that doesn't mean it's upstream's responsibility to maintain their code. The issue you have addressed here should be fixed upstream. This doesn't prevent their downstream from working; they will just have to patch this behavior to work for their modified downstream version. Downstreams doing weird things doesn't make it upstream's maintenance problem.

If there were a way to fix this with the same amount of work that didn't require breaking a downstream, I would probably advocate for that (taking into account good design, maintainability, etc.). However, it isn't feasible to only allow changes that break no downstreams; mutually exclusive changes could exist, which would paralyze an upstream that operates this way. I think we should do what is right by fixing this issue, and if we're feeling generous, a ping in their downstream bug tracker about this change would be a friendly thing to do.

@TheRealFalcon (Member Author)

@holmanb, I said I'm not proposing we officially support those use cases, and then it felt like your response was about how we shouldn't support those use cases. I'm not disagreeing on that point. I'm more wondering whether this really is a problem or error case in the first place.

@holmanb (Member) commented Aug 2, 2024

@holmanb, I said I'm not proposing we officially support those use cases, and then it felt like your response was about how we shouldn't support those use cases. I'm not disagreeing on that point. I'm more wondering whether this really is a problem or error case in the first place.

I guess I don't understand what you're trying to say. Under which officially supported use case would this not be a problem or error condition?

@TheRealFalcon (Member Author)

I guess I don't understand what you're trying to say. Under which officially supported use case would this not be a problem or error condition?

Just to record what @holmanb and I discussed out of band: my main concern was the space between "supported use case" and "break at will". I didn't really see the need for adding this while knowing it could change behavior for some unsupported use cases. He made a good point that dependency ordering can create a system where cloud-init is unintentionally removed from boot, and it's right for cloud-init to report this.

@TheRealFalcon (Member Author)

If #5586 lands, it will eliminate the mypy issue.

@blackboxsw (Collaborator)

If there were a way to fix this with the same amount of work that didn't require breaking a downstream, I would probably advocate for that (taking into account good design, maintainability, etc.). However, it isn't feasible to only allow changes that break no downstreams; mutually exclusive changes could exist, which would paralyze an upstream that operates this way. I think we should do what is right by fixing this issue, and if we're feeling generous, a ping in their downstream bug tracker about this change would be a friendly thing to do.

Even if we made this a "simpler" breaking change, such as a config flag like "required_boot_stages", downstreams that don't want to run init-local would still have to carry a patch that modifies that config switch. In this case, any downstream that wants to avoid calling our intended boot stages can just remove that stage's key from cmd.main.STAGE_NAME, as sketched below.
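For illustration, such a downstream patch could be as small as deleting one key. The stage keys match the status output earlier in this thread; the display values here are illustrative, not copied from cloud-init:

# Hypothetical downstream patch to cmd.main.STAGE_NAME: removing a
# stage's key means status no longer expects that stage to have run.
STAGE_NAME = {
    # "init-local": "Local Stage",  # dropped by a downstream that
    # does not ship cloud-init-local.service
    "init": "Network Stage",
    "modules-config": "Config Stage",
    "modules-final": "Final Stage",
}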

@blackboxsw (Collaborator)

Now that the single-process work landed in 143bc9e, cloud-init will not actually be able to reach the point where result.json is written if a boot stage is missing, because cloud-init-main will block on a sync point for each expected stage. As such, I'm not sure we need this anymore for typical use cases.
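Conceptually, the blocking behavior described here is something like the sketch below; the real synchronization mechanism in cloud-init-main differs, and the helper and marker-file paths are hypothetical:

import os
import time

def wait_for_stage(stage: str) -> None:
    # Hypothetical marker file standing in for the real sync point; if
    # the stage's service never runs, this loop never exits and
    # result.json is never written.
    while not os.path.exists(f"/run/cloud-init/{stage}.done"):
        time.sleep(0.1)

for stage in ("init-local", "init", "modules-config", "modules-final"):
    wait_for_stage(stage)
# Only reached once every expected stage has signaled completion.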

@TheRealFalcon (Member Author)

Now that the single-process work landed in 143bc9e, cloud-init will not actually be able to reach the point where result.json is written if a boot stage is missing, because cloud-init-main will block on a sync point for each expected stage. As such, I'm not sure we need this anymore for typical use cases.

Good point. Since I was planning on patching this out of existing releases, and single process is the way forward, I don't really see this providing any value. Closing.
