feat: Return error in cloud-init status if stage not run #5579
Conversation
cloudinit/cmd/status.py (Outdated)
return [
    f"Failed to start stage: {stage_name}"
    for stage_name, stage_info in status_v1.items()
    if stage_name in STAGE_NAME.keys() and stage_info.get("start") is None
]
- Couldn't this logic instead iterate over the known STAGE_NAME keys rather than looping over status_v1.items() and filtering them out?
- Also, this has the effect of reporting errors while cloud-init is still running, because early calls to status may come before all cloud-init stages have had a chance to run. A call to `cloud-init status` before cloud-init-local even starts would report errors too. Do we want to also check is_running() before trying to return these errors?
- Another concern: if cloud-final.service is ever not part of the systemd (or other) boot targets, we would never create the /run/cloud-init/result.json file, which would leave us in a state where we probably also want to peek at whether the system's desired run level was reached and the init system is 'up/active' (a rough sketch of such a check follows the suggestion below).
Suggested change:

 return [
     f"Failed to start stage: {stage_name}"
-    for stage_name, stage_info in status_v1.items()
-    if stage_name in STAGE_NAME.keys() and stage_info.get("start") is None
+    for stage_name in STAGE_NAME.keys()
+    if status_v1.get(stage_name, {}).get("start") is None
 ]
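For the "peek at the init system" idea above, a minimal sketch could look like the following (the helper name is made up and this is not part of this PR; it only illustrates the check being described):

import subprocess

def init_system_settled() -> bool:
    # Illustrative only: `systemctl is-system-running` prints "running" when
    # the boot target was reached with no failed units, and "degraded" when it
    # was reached but some units failed. Any other state (e.g. "starting")
    # means boot has not finished yet.
    proc = subprocess.run(
        ["systemctl", "is-system-running"], capture_output=True, text=True
    )
    return proc.stdout.strip() in ("running", "degraded")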
Thanks for the comment and patch. Also, my apologies: I meant to mark this PR as draft, since I knew it wasn't quite ready for review yet. I've applied your patch and added an additional unit test.
> Another concern: if cloud-final.service is ever not part of the systemd (or other) boot targets, we would never create the /run/cloud-init/result.json file, which would leave us in a state where we probably also want to peek at whether the system's desired run level was reached and the init system is 'up/active'.
I'm open to seeing an implementation of this, but my previous attempt at doing this wasn't it. 😉
Since that case is outside of what we support, I see it as a future nice-to-have rather than a need-to-have.
Yep, that concern is just noting a gap in behavior if cloud-final is kicked out of the systemd boot goals due to an ordering cycle, but in those events `systemctl list-units --failed`, `systemctl status cloud-init-main`, or `systemctl status cloud-final` will all show the error condition as well. So I'm not concerned with this PR having to address this gap.
A potential alternative that at least takes the DONE status into account. We also don't want to error when the status is RunningStatus.DISABLED, as nothing should run at that point:
--- a/cloudinit/cmd/status.py
+++ b/cloudinit/cmd/status.py
@@ -435,12 +435,12 @@ def get_not_started_errors(status_v1) -> List[str]:
"""Return a list of errors from status_v1 that are not started."""
return [
f"Failed to start stage: {stage_name}"
- for stage_name, stage_info in status_v1.items()
- if stage_name in STAGE_NAME.keys() and stage_info.get("start") is None
+ for stage_name in STAGE_NAME
+ if status_v1.get(stage_name, {}).get("start") is None
]
-def get_errors(status_v1) -> Tuple[List, Dict]:
+def get_errors(status_v1, running_status: RunningStatus) -> Tuple[List, Dict]:
"""Return a list of errors and recoverable_errors from status_v1."""
errors = []
recoverable_errors = {}
@@ -461,7 +461,8 @@ def get_errors(status_v1) -> Tuple[List, Dict]:
recoverable_errors[err_type].extend(
current_recoverable_errors[err_type]
)
- errors.extend(get_not_started_errors(status_v1))
+ if running_status == RunningStatus.DONE:
+ errors.extend(get_not_started_errors(status_v1))
return errors, recoverable_errors
@@ -496,16 +497,16 @@ def get_status_details(
else ""
)
- errors, recoverable_errors = get_errors(status_v1)
+ running_status = get_running_status(
+ status_file, result_file, boot_status_code, latest_event
+ )
+
+ errors, recoverable_errors = get_errors(status_v1, running_status)
if errors:
condition_status = ConditionStatus.ERROR
elif recoverable_errors:
condition_status = ConditionStatus.DEGRADED
- running_status = get_running_status(
- status_file, result_file, boot_status_code, latest_event
- )
-
if (
running_status == RunningStatus.RUNNING
and uses_systemd()
Downstreams can do whatever they want, but if they choose to do really weird things like skip entire stages, that doesn't mean it's upstream's responsibility to maintain their code. The issue that you have addressed here should be fixed in upstream. This doesn't prevent their downstream from working; they will just have to patch this behavior to work for their modified downstream version. Downstreams doing weird things doesn't make it upstream's maintenance problem.

If there were a way to fix this that was the same amount of work and didn't require breaking a downstream, I would probably advocate for that (taking into account good design, maintainability, etc.). However, it isn't feasible to only allow changes which do not break any downstreams: mutually exclusive changes could exist, which would paralyze an upstream that operates this way.

I think that we should do what is right by fixing this issue, and if we're feeling generous, a ping in their downstream bug tracker about this change would be a friendly thing to do.
@holmanb, I said
I guess I don't understand what you're trying to say. Under which officially supported use case would this not be a problem or error condition?
Just to record what @holmanb and I discussed out of band: my main concern was the space between "supported use case" and "break at will". I didn't really see the need for adding this while knowing that it could cause a behavior change for some unsupported use cases. He made a good point that dependency ordering can create a system where cloud-init is unintentionally removed from boot, and it's right for cloud-init to report this.
If #5586 lands, it will eliminate the mypy issue.
Even if we made this a "simpler" breaking change, such as a config flag like "required_boot_stages", downstreams which don't want to run init-local would still have to carry a patch that modifies that config switch. In this case, any downstream that wants to avoid calling our intended boot stages can just remove that stage key from cmd.main.STAGE_NAME (rough illustration below).
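As a rough illustration only (the exact keys and labels in STAGE_NAME are assumed here, and this is a conceptual sketch of a downstream patch rather than anything this PR adds):

# Approximate shape of the STAGE_NAME mapping referenced above; the real
# definition lives in cloud-init's status command module.
STAGE_NAME = {
    "init-local": "Local Stage",
    "init": "Network Stage",
    "modules-config": "Config Stage",
    "modules-final": "Final Stage",
}

# A downstream that deliberately never runs cloud-init-local.service could
# carry a one-line patch dropping the corresponding key, so that stage is no
# longer treated as required by `cloud-init status`.
STAGE_NAME.pop("init-local")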
Now that the single-process work landed in 143bc9e, cloud-init will not be able to get to the point where result.json is written if we are missing a boot stage, because cloud-init-main will block on a sync point for each expected stage. As such, I'm not sure we need this anymore for typical use cases.
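A toy sketch of why a missing stage now blocks forever (this is not cloud-init's actual protocol or socket layout; the names and paths below are made up purely for illustration):

import os
import socket

def wait_for_stage(stage: str, sock_dir: str = "/tmp/demo-sync") -> None:
    # Toy sync point: block until the given stage's unit connects.
    # If that stage is masked or removed from boot, accept() never returns,
    # so a final "write result.json" step would never be reached.
    os.makedirs(sock_dir, exist_ok=True)
    path = os.path.join(sock_dir, f"{stage}.sock")
    if os.path.exists(path):
        os.unlink(path)
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(path)
    server.listen(1)
    conn, _ = server.accept()
    conn.close()

# Hypothetical driver loop:
# for stage in ("local", "network", "config", "final"):
#     wait_for_stage(stage)
# write_result_json()  # made-up final step; unreachable if a stage is missing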
Good point. Since I was planning on patching this out of existing releases, and single-process is the way forward, I don't really see this providing any value. Closing.
In draft state as I'm looking for feedback; this still needs test updates.
Proposed Commit Message
Additional Context
The original issue in GH-5304 was not reproducible on current versions of cloud-init. Since that version of cloud-init, we've updated our status checking to only report running if status.json exists but result.json doesn't, and checking systemd status can only change cloud-init status from RUNNING to ERROR, rather than the previous code that could change status back to RUNNING (a small sketch of that check follows the quote below). So instead I addressed this comment:

> Long term cloud-init status should return a fatal error code and indicate this issue
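For reference, a minimal sketch of that running-state check as described above (not the exact upstream implementation; the helper name is made up):

import os

STATUS_FILE = "/run/cloud-init/status.json"
RESULT_FILE = "/run/cloud-init/result.json"

def appears_running(status_file: str = STATUS_FILE, result_file: str = RESULT_FILE) -> bool:
    # cloud-init is treated as still running only while status.json exists
    # and result.json has not yet been written.
    return os.path.exists(status_file) and not os.path.exists(result_file)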
Thinking about it more though, I'm not actually sure we want this behavior, and if we do it probably shouldn't be applied to existing releases. While the supported cloud-init path is to run all 4 services, we have seen numerous use cases where people do things we wouldn't endorse. Amazon Linux completely removing cloud-init-local.service comes to mind. I'm not proposing we officially support those use cases, but I'm also not sure it makes sense to declare there was an error when the lack of a service running may have been intended behavior and technically nothing "bad" happened.
This PR makes it an error to have not run one of the stages. After masking cloud-init-local.service and cloud-init.service, I get these results:
(the timestamps are related to my environment and not an issue)
Test Steps
Merge type