
job-monitor: handle k8s' FailedScheduling properly #314

Open
mvidalgarcia opened this issue Apr 29, 2021 · 0 comments

Current behavior

If a REANA cluster does not have kubernetes_jobs_max_user_memory_limit set, users can set any memory limit for their workflows via reana.yaml's kubernetes_memory_limit. If the user sets a value so high (relative to the cluster's memory resources) that no node can schedule the job, the job pod will remain in Pending status and generate a warning event with the reason FailedScheduling.

$ kubectl describe pod reana-run-job
...
Events:
  Type     Reason            Age        From  Message
  ----     ------            ----       ----  -------
  Warning  FailedScheduling  <unknown>        0/4 nodes are available: 4 Insufficient memory.
  Warning  FailedScheduling  <unknown>        0/4 nodes are available: 4 Insufficient memory.

However, the workflow remains in running status forever, and no logs are propagated to the user.

Expected behavior

We should catch this case here, set the workflow status to failed, and generate meaningful logs for the user to consult via reana-client logs.
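
A minimal sketch of the kind of check the job-monitor could perform, using the official Kubernetes Python client to look for FailedScheduling warning events on a run-job pod. The namespace, pod name, and the mark_job_failed callback are illustrative placeholders, not the actual reana-job-controller API:

from kubernetes import client, config

config.load_incluster_config()  # job-monitor runs inside the cluster
core_v1 = client.CoreV1Api()

def get_failed_scheduling_message(pod_name, namespace="default"):
    """Return the FailedScheduling message for a pending pod, if any."""
    events = core_v1.list_namespaced_event(
        namespace,
        field_selector=f"involvedObject.name={pod_name},type=Warning",
    )
    for event in events.items:
        if event.reason == "FailedScheduling":
            # e.g. "0/4 nodes are available: 4 Insufficient memory."
            return event.message
    return None

def check_pending_job(job_id, pod_name, namespace="default"):
    message = get_failed_scheduling_message(pod_name, namespace)
    if message:
        # Hypothetical callback: set the job/workflow status to failed and
        # surface the scheduler message in the logs shown by `reana-client logs`.
        mark_job_failed(job_id, logs=f"Job could not be scheduled: {message}")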

In addition, we should investigate other k8s event reasons that can lead to similar situations, so that they are handled properly as well.
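
Purely as an illustration, such handling could be driven by a small table mapping warning event reasons to user-facing explanations; which reasons actually warrant failing the workflow would need investigation first:

# Illustrative only: warning event reasons considered fatal, with user-facing text.
FATAL_EVENT_REASONS = {
    "FailedScheduling": "The job could not be scheduled on any node.",
    # e.g. "FailedMount": "A requested volume could not be mounted.",
}

def explain_warning_event(event):
    """Return a user-facing log line for a fatal warning event, or None."""
    if event.reason in FATAL_EVENT_REASONS:
        return f"{FATAL_EVENT_REASONS[event.reason]} Details: {event.message}"
    return None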
