
job-monitor: handle k8s' FailedScheduling properly #314

Open
mvidalgarcia opened this issue Apr 29, 2021 · 0 comments

Current behavior

If a REANA cluster does not have kubernetes_jobs_max_user_memory_limit set, users can set any memory limit for their workflows via reana.yaml's kubernetes_memory_limit. If the user sets a value so high (relative to the cluster's memory resources) that no node can schedule the job, the job pod will remain in Pending status and generate a warning event with the reason FailedScheduling.

$ kubectl describe pod reana-run-job
...
Events:
  Type     Reason            Age        From  Message
  ----     ------            ----       ----  -------
  Warning  FailedScheduling  <unknown>        0/4 nodes are available: 4 Insufficient memory.
  Warning  FailedScheduling  <unknown>        0/4 nodes are available: 4 Insufficient memory.

However, the workflow remains in running status forever, and no logs are propagated to the user.

Expected behavior

We should catch this case here, set the workflow status to failed, and generate meaningful logs for the user to consult via reana-client logs.
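
A minimal sketch of the kind of check the job-monitor could perform, using the official Kubernetes Python client to look for FailedScheduling warning events on a run-job pod. The namespace, pod name, and the mark_job_failed callback are illustrative placeholders, not the actual reana-job-controller API:

from kubernetes import client, config

config.load_incluster_config()  # job-monitor runs inside the cluster
core_v1 = client.CoreV1Api()

def get_failed_scheduling_message(pod_name, namespace="default"):
    """Return the FailedScheduling message for a pending pod, if any."""
    events = core_v1.list_namespaced_event(
        namespace,
        field_selector=f"involvedObject.name={pod_name},type=Warning",
    )
    for event in events.items:
        if event.reason == "FailedScheduling":
            # e.g. "0/4 nodes are available: 4 Insufficient memory."
            return event.message
    return None

def check_pending_job(job_id, pod_name, namespace="default"):
    message = get_failed_scheduling_message(pod_name, namespace)
    if message:
        # Hypothetical callback: set the job/workflow status to failed and
        # surface the scheduler message in the logs shown by `reana-client logs`.
        mark_job_failed(job_id, logs=f"Job could not be scheduled: {message}")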

In addition, we should investigate other k8s event reasons that can lead to similar situations, so that they are handled properly as well.
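
Purely as an illustration, such handling could be driven by a small table mapping warning event reasons to user-facing explanations; which reasons actually warrant failing the workflow would need investigation first:

# Illustrative only: warning event reasons considered fatal, with user-facing text.
FATAL_EVENT_REASONS = {
    "FailedScheduling": "The job could not be scheduled on any node.",
    # e.g. "FailedMount": "A requested volume could not be mounted.",
}

def explain_warning_event(event):
    """Return a user-facing log line for a fatal warning event, or None."""
    if event.reason in FATAL_EVENT_REASONS:
        return f"{FATAL_EVENT_REASONS[event.reason]} Details: {event.message}"
    return None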
