job controller doesn't respect graceful shutdown sent by k8s #347

VMois commented Nov 24, 2021

When REANA asks k8s to terminate a batch pod, k8s sends the SIGTERM (soft kill) signal to all containers in that pod. At this point, workflow-engine has already exited with status 0, so the only running container that receives SIGTERM is job-controller. k8s waits for 30 seconds (the default graceful shutdown period), but job-controller doesn't exit, so k8s sends SIGKILL and marks the batch pod's phase as Failed.

Additional info: How k8s terminates pods?

Originated in reanahub/reana#593 (in this conversation).

Details

Example output of kubectl get pods -o json command:

"containerStatuses": [
      {
        "containerID": "containerd://400831aac7443eba9a41228abf264d1adfa63d8279b3a4b71f34f819af7df420",
        "image": "docker.io/reanahub/reana-job-controller:latest",
        "imageID": "sha256:268ad55769e69f1c41d448fb7b54a95dc074cba5e2e1cfa6a2642d8484756551",
        "lastState": {},
        "name": "job-controller",
        "ready": false,
        "restartCount": 0,
        "started": false,
        "state": {
          "terminated": {
            "containerID": "containerd://400831aac7443eba9a41228abf264d1adfa63d8279b3a4b71f34f819af7df420",
            "exitCode": 137,
            "finishedAt": "2021-11-24T10:28:32Z",
            "reason": "Error",
            "startedAt": "2021-11-24T10:27:44Z"
          }
        }
      },
      {
        "containerID": "containerd://d5e4a82bfe1acd86411058ebaa0302dd8a6d9632b1a07d6f849789071918b1b3",
        "image": "docker.io/reanahub/reana-workflow-engine-serial:latest",
        "imageID": "sha256:5756537cf4d2673d2725e0824f6f6883062c363229d6c81a500b19a1487b8641",
        "lastState": {},
        "name": "workflow-engine",
        "ready": false,
        "restartCount": 0,
        "started": false,
        "state": {
          "terminated": {
            "containerID": "containerd://d5e4a82bfe1acd86411058ebaa0302dd8a6d9632b1a07d6f849789071918b1b3",
            "exitCode": 0,
            "finishedAt": "2021-11-24T10:28:02Z",
            "reason": "Completed",
            "startedAt": "2021-11-24T10:27:44Z"
          }
        }
      }
    ],

Check the timestamps in containerStatuses. workflow-engine finished at 10:28:02, so REANA asked k8s to delete the pod around that time. job-controller finished at 10:28:32, exactly 30 seconds after workflow-engine. Its exitCode is 137, i.e. 128 + 9, which means the container was killed by SIGKILL.
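The arithmetic behind that reading can be checked with a small sketch. The helper names `parse_k8s_time` and `signal_from_exit_code` are made up for illustration; the timestamps and exit code are copied from the output above. It assumes the usual convention that an exit code above 128 means "terminated by signal (code - 128)".

```python
import signal
from datetime import datetime


def parse_k8s_time(timestamp):
    # Kubernetes reports container times as UTC strings like 2021-11-24T10:28:02Z.
    return datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")


def signal_from_exit_code(exit_code):
    # Convention: exit codes above 128 encode "killed by signal (code - 128)".
    if exit_code > 128:
        return signal.Signals(exit_code - 128)
    return None


engine_finished = parse_k8s_time("2021-11-24T10:28:02Z")
controller_finished = parse_k8s_time("2021-11-24T10:28:32Z")

# Time between workflow-engine finishing and job-controller being killed.
delay = (controller_finished - engine_finished).total_seconds()
print(delay)  # 30.0

# Signal encoded in job-controller's exit code.
print(signal_from_exit_code(137).name)  # SIGKILL
```

The 30-second gap matches the default grace period, and 137 decodes to signal 9.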

VMois commented Nov 24, 2021

My initial idea was that the job monitor thread keeps running even after the Flask app is terminated by SIGTERM. However, job-monitor is a daemon thread, so it should stop together with the main thread. Local testing with a dummy example confirmed that it stops as expected:

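import time
from threading import Thread

from flask import Flask

app = Flask(__name__)


def thread():
    # Loop forever, like the job monitor; only process exit stops it.
    while True:
        print("this is daemon thread")
        time.sleep(3)


@app.route("/")
def spawn():
    # Start the worker as a daemon thread so it cannot keep the process alive.
    t = Thread(target=thread, daemon=True)
    t.start()
    return "Spawned"


if __name__ == "__main__":
    app.run()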

Find the process with `ps | grep python` and send it SIGTERM with `kill -s SIGTERM <pid>`. The app exits as expected.
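The same check can be scripted without Flask. This is only a sketch under a few assumptions: a POSIX system, a child process that installs no SIGTERM handler, and a made-up helper name `terminate_child`. With the default signal disposition, SIGTERM ends the whole process, threads included, so the daemon thread cannot hold it open.

```python
import signal
import subprocess
import sys

# Child program: one daemon thread plus a main thread that never returns.
CHILD_CODE = """
import threading, time

def worker():
    while True:
        time.sleep(1)

threading.Thread(target=worker, daemon=True).start()
print("ready", flush=True)
while True:
    time.sleep(1)
"""


def terminate_child():
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE],
        stdout=subprocess.PIPE,
        text=True,
    )
    # Wait until the child reports that its daemon thread is running.
    proc.stdout.readline()
    proc.send_signal(signal.SIGTERM)
    return_code = proc.wait(timeout=10)
    proc.stdout.close()
    return return_code


if __name__ == "__main__":
    # A negative return code means "terminated by that signal" on POSIX.
    print(terminate_child())
```

A process that exits this way on a laptop but not in the pod suggests that the signal may not be reaching the Python process in the container at all.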

So, maybe, the problem is with how the Flask app exits? I don't have an answer yet but found one interesting comment.

VMois commented Nov 24, 2021

This issue can provide a possible fix.

@VMois VMois self-assigned this Mar 22, 2022
@mdonadoni mdonadoni assigned mdonadoni and unassigned VMois Sep 21, 2023
mdonadoni added a commit to mdonadoni/reana-server that referenced this issue Sep 21, 2023
Handle SIGTERM in `start-scheduler` to gracefully stop consuming
the workflow submission queue.

Closes reanahub/reana-job-controller#347
mdonadoni added a commit to mdonadoni/reana that referenced this issue Sep 21, 2023
Avoid using `/bin/sh -c` to run uwsgi, as that breaks signal
propagation.

Closes reanahub/reana-job-controller#347
mdonadoni added a commit to mdonadoni/reana-workflow-controller that referenced this issue Sep 21, 2023
When using the shell form of `CMD`, the provided command is executed
using `/bin/sh -c`, which breaks signal propagation. Use `exec` to
substitute the `sh` process and fix handling of signals by uwsgi.

Also handle SIGTERM in `consume-job-queue` to gracefully stop consuming
the job status queue.

Closes reanahub/reana-job-controller#347
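For readers unfamiliar with the pattern these commits describe, here is a minimal sketch of a consumer loop that stops on SIGTERM. It is not the code from the referenced commits; `stop_requested`, `handle_sigterm` and `consume` are hypothetical names, and the loop body stands in for real queue handling.

```python
import signal
import threading

# Set by the signal handler; checked by the consumer loop.
stop_requested = threading.Event()


def handle_sigterm(signum, frame):
    # Only record the request; the loop finishes its current item and exits.
    stop_requested.set()


def consume(poll_interval=0.1):
    processed = 0
    while not stop_requested.is_set():
        processed += 1  # placeholder for handling one queue message
        # Sleep until the next poll, waking early if a stop was requested.
        stop_requested.wait(poll_interval)
    return processed


if __name__ == "__main__":
    # Handlers must be installed from the main thread.
    signal.signal(signal.SIGTERM, handle_sigterm)
    print("processed", consume(), "messages before SIGTERM")
```

A handler like this only helps if the signal reaches the Python process, which is why the commits also replace the `/bin/sh -c` wrapper with `exec`.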

mdonadoni commented Sep 26, 2023

Fixed for reana-server and reana-workflow-controller.

Still missing:

  • reana-job-controller (reana-run-job)
  • reana-workflow-engine (reana-run-job)
  • reana-message-broker

@mdonadoni mdonadoni reopened this Sep 27, 2023
mdonadoni added a commit to mdonadoni/reana-message-broker that referenced this issue Dec 19, 2023
Use `exec` to execute RabbitMQ's server, so that the server process can
receive signals such as `SIGTERM`.

Closes reanahub/reana-job-controller#347
mdonadoni added a commit to mdonadoni/reana-message-broker that referenced this issue Jan 12, 2024
Use `exec` to execute RabbitMQ's server, so that the server process can
receive signals such as `SIGTERM`.

Closes reanahub/reana-job-controller#347
@mdonadoni mdonadoni reopened this Jan 23, 2024
mdonadoni added a commit to mdonadoni/reana-workflow-controller that referenced this issue Jan 23, 2024
Use exec to execute job-controller, so that the server can receive
signals such as `SIGTERM`.

Closes reanahub/reana-job-controller#347
@giuseppe-steduto giuseppe-steduto self-assigned this Jan 24, 2024
mdonadoni added a commit to mdonadoni/reana-workflow-controller that referenced this issue Jan 29, 2024
Use exec to execute job-controller, so that the server can receive
signals such as `SIGTERM`.

Closes reanahub/reana-job-controller#347
mdonadoni added a commit to mdonadoni/reana-workflow-controller that referenced this issue Jan 30, 2024
Use exec to execute job-controller, so that the server can receive
signals such as `SIGTERM`.

Closes reanahub/reana-job-controller#347
@mdonadoni mdonadoni reopened this Feb 1, 2024