Delay update of artifacts until final job save #11832
Conversation
Force-pushed from 45e9329 to 7a753f9
awx/main/tasks/callback.py
Outdated
```python
def update_model(self, pk, _attempt=0, **updates):
    return update_model(self.model, pk, _attempt=0, **updates)

def delay_update(self, **kwargs):
```
By caching artifact updates on the in-memory object like this, won't we be creating more pain for ourselves in the future when we try to move to an async style of processing jobs?
If we want to wait until the end of the job, is it possible to do a final sweep over the job events for that job and accumulate the data that way? Or do we lose this data?
We can save things on the job record and job events in one of two places:
- the control process
- the callback receiver

The problem with the callback receiver is that it can (notoriously) be delayed, because event processing is the long-tail action: a large queue can build up when we don't have the throughput to save events at the rate we produce them. The job status can flip long before all of its events come in, and this may have nothing to do with the output of that job; events may arrive late simply because the previous job produced more events than could be handled immediately, so all new events have to wait in line.

We cannot have artifacts wait in line with other events. The status changes, and dependent jobs in workflows will start with or without the events saved, and those jobs can use artifacts. Likewise, the event_ct is needed to figure out whether all events have been saved in the first place.
I realize that we have other saves in …
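For context, a minimal sketch of the delayed-update pattern this PR introduces could look like the following; the `finalize` method and the exact field handling here are assumptions for illustration, not the PR's literal code:

```python
class RunnerCallback:
    def __init__(self, model, pk):
        self.model = model
        self.pk = pk
        # In-memory cache of pending field updates for this job.
        self.extra_update_fields = {}

    def delay_update(self, **kwargs):
        # Instead of writing to the database immediately (and racing
        # the event-processing queue), stash the updates for later.
        self.extra_update_fields.update(kwargs)

    def finalize(self, status):
        # One qualified save at the end of the job: the final status plus
        # all delayed fields, so a stale in-flight save cannot regress
        # fields like artifacts back to their defaults.
        instance = self.model.objects.get(pk=self.pk)
        for key, value in self.extra_update_fields.items():
            setattr(instance, key, value)
        instance.status = status
        instance.save(update_fields=['status', *self.extra_update_fields.keys()])
```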
Force-pushed from 20227e1 to 56d3033
```python
# If ansible-runner ran, but an error occurred at runtime, the traceback information
# is saved via the status_handler passed in to the processor.
if state_name == 'Succeeded':
    return res
```
There's an incongruence here. Yes, if the status_handler received this information, the receptor output should be this... but it's an imperfect metric. These two things are associated, but other events could cause one to be true but not the other.
So I forced it by doing this:
diff --git a/awx/main/tasks/callback.py b/awx/main/tasks/callback.py
index ab27ce9f36..fd720ac985 100644
--- a/awx/main/tasks/callback.py
+++ b/awx/main/tasks/callback.py
@@ -220,6 +220,8 @@ class RunnerCallback:
             result_traceback = status_data.get('result_traceback', None)
             if result_traceback:
                 self.delay_update(result_traceback=result_traceback)
+        elif status_data['status'] == 'successful':
+            self.delay_update(result_traceback='alan set this!')

 class RunnerCallbackForProjectUpdate(RunnerCallback):
And this works as I expected. The new `if` condition is accurate. I like this also because it avoids running additional receptor commands when they are not necessary. That's something the `if` here was accomplishing before, and I didn't want to lose that.
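A rough sketch of the control flow being described — `gather_receptor_stdout` is an illustrative placeholder, not an AWX function:

```python
def receptor_output_check(state_name, res, callback):
    if state_name == 'Succeeded':
        # Normal path: no additional receptor commands are needed.
        return res
    if callback.extra_update_fields.get('result_traceback'):
        # The status_handler already captured a traceback at runtime,
        # so there is nothing more to fetch from receptor.
        return res
    # Only now pay for the extra receptor work-unit introspection.
    return gather_receptor_stdout(res)  # hypothetical helper
```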
Force-pushed from 3c979e2 to 2649484
Force-pushed from 2649484 to bb5df58
Please ping me for another review of this once it's ready to go.
Sorry, I had neglected to pick up a test fixture change that came from the Django upgrade. That should be fixed with the latest commit; otherwise, I was happy with this, and the integration testing looked good.
Hi all - we've been encountering this issue since upgrading to AWX 20 earlier this week, and it's caused us a fair amount of trouble with several workflows. Are there any workarounds we can implement while waiting for the review to be completed?
Force-pushed from 7198489 to 43cf9ef
Trivial conflicts resolved. No, I don't know of any other easy workaround. It was flaky, but pretty easy to hit the error at least once out of 5 attempts. This patch seemed like the easiest possible fix.
Force-pushed from 43cf9ef to 9a1a601
Force-pushed from 9a1a601 to c9a9378
My only other thought is: could this induce a timing issue? I.e., I launch a job, it's almost done, I click cancel, it cancels, but then this delayed update thread "finishes" and changes the status from canceled to some other status? I haven't really thought through whether that would be possible or not.
awx/main/tasks/callback.py
Outdated
```python
else:
    self.extra_update_fields[key] = value

def get_extra_update_fields(self):
```
What about changing this name to match the delay_update function? i.e. `get_delayed_update_fields`.
awx/main/tasks/jobs.py
Outdated
```python
original_traceback = self.runner_callback.get_extra_update_fields().get('result_traceback', '')
if 'got an unexpected keyword argument' in original_traceback:
    if ANSIBLE_RUNNER_NEEDS_UPDATE_MESSAGE not in original_traceback:
        self.runner_callback.delay_update(result_traceback=ANSIBLE_RUNNER_NEEDS_UPDATE_MESSAGE)
```
One of the things I never really loved was this code being somewhat duplicated here and in receptor.py. Could we just move this into an if statement in delay_update?
Probably. What I think that would look like: if `delay_update` gets `job_explanation` or `result_traceback` and the given string is already in that field, then it does nothing. So we would never repeat the same string twice. Seems reasonable, but I should add a test.
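A sketch of what that de-duplication could look like inside `delay_update`, assuming `extra_update_fields` is the backing dict — this is the suggestion under discussion, not merged code:

```python
def delay_update(self, **kwargs):
    for key, value in kwargs.items():
        if key in ('job_explanation', 'result_traceback'):
            existing = self.extra_update_fields.get(key, '')
            if value in existing:
                # The same string is already queued; never repeat it.
                continue
        self.extra_update_fields[key] = value
```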
We never call delay_update with a status, so the delayed update itself cannot flip a canceled job to some other status. It could be the case that the job is on its way to error, but a cancel is processed before it finalizes. In that case, we would get a result_traceback or job_explanation in conjunction with the canceled status. That sounds like desired behavior; I would want that information if I were the user.
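Under that model, the final save might look roughly like this — a sketch under the assumption that the status is decided separately and delayed fields never include it:

```python
def final_save(job, callback, computed_status):
    # Re-read the current status so a cancel processed during the run
    # is not clobbered by the final update.
    job.refresh_from_db(fields=['status'])
    if job.status != 'canceled':
        job.status = computed_status
    # Delayed fields (result_traceback, job_explanation, artifacts, ...)
    # are applied either way, so a canceled job still gets its traceback.
    for key, value in callback.extra_update_fields.items():
        setattr(job, key, value)
    job.save(update_fields=['status', *callback.extra_update_fields.keys()])
```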
- Save tracebacks from receptor module to callback object
- Move receptor traceback check up to be more logical
- Use new mock_me fixture to avoid DB call with me method
- Update the special runner message to the delay_update pattern
Force-pushed from 583f4b8 to 9bcce9c
@john-westcott-iv In the last commit, I've come up with a better thought-out answer for how to handle your special ansible-runner error message. I can't bear adding more code into the main (monolithic) code path for running jobs. So the special case is moved to …
SUMMARY
Connect #11821
The test I had that was failing with that issue is passing with this patch.
EDIT: Since I raised this PR, it appears that the artifacts issue that was failing tests before was fixed by some other change. I still advocate for this code change for the tech debt aspect of things.
ISSUE TYPE
COMPONENT NAME
ADDITIONAL INFORMATION
I don't know exactly what was happening, but some process was clearly doing a `.save()` on the job which was regressing the `artifacts` field back to `{}`. By delaying until the final save, we are actually left in a safer position, because it's less likely for us to hit one of these unqualified saves after that point, and less likely for one of those saves to have a stale version of the job.

This might need some unit tests added.
I hope it also addresses the concerns that others brought up about the artifacts save in the callback class. It really shouldn't be touching the database.
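To illustrate the unqualified-save race described in the summary — assumed minimal Django usage with a hypothetical `Job` model and `job_id`, not AWX's actual code paths:

```python
# Process A loads a stale copy of the job while artifacts is still {}.
job = Job.objects.get(pk=job_id)

# ... meanwhile, process B saves artifacts = {'greeting': 'hello'} ...

# An unqualified save writes every column from the stale copy,
# silently regressing artifacts back to {}:
job.status = 'successful'
job.save()

# A qualified save only touches the named columns and leaves
# the artifacts column alone:
job.save(update_fields=['status'])
```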