-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consume job_explanation we sometimes set in ansible-runner #12089
Conversation
Regarding the part of this that I wrote, what data would this get us?
These are pretty vague messages. If I got them by themselves, I wouldn't know what to do. I'm not against surfacing them to the user, indeed, I think that's a very good idea. However, this save will overwrite any Because of that, I have cold feet about doing this without stacking it on top of #11832, which fixed tests that were actively failing (we just forget because they get set to skip). If we can get this in a way that stacks on top of any other error details, then that's the ideal, that would be perfect. |
Right now jobs fail with an even more vague explanation: This at least tells us the last thing that went wrong. |
17f0627
to
aad5fb1
Compare
aad5fb1
to
f4e8beb
Compare
# We call this to get the current values from the database, in case update_model was called | ||
# within the threadpools inside of AWXReceptorJob. We use update_model instead of | ||
# refresh_from_db here because it contains retry logic that is resilient to database failures. | ||
self.instance = self.update_model(self.instance.pk) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ping @sarabrajsingh, as he and I discussed error logs that came from post_run_hook
due to not doing this.
Summarizing here...
Each thread has its own connection object. When connections are dropped, that object is stale. If it's not hitting one particular corner case, it will correct itself after 1 failed query. So post_run_hook errors, but then code after it starts to work again, so it's not easy to observe unless you check the logs.
I've modified adjacent code, so here's what I think a rebase should look like: https://github.com/ansible/awx/compare/devel...AlanCoding:capture-runner-error?expand=1 EDIT: to resolve conflicts. |
A replacement PR exists |
Replaces / updates #11376