Fails to run jobs if serving with webserver=True #15733

Open
gigaverse-oz opened this issue Oct 16, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@gigaverse-oz
Contributor

gigaverse-oz commented Oct 16, 2024

Bug Summary

When a flow is served with topic_poll_flow.serve(webserver=True), its flow runs fail to start.

Steps to Reproduce:

  • Serve the flow with webserver=True, then trigger a run of the deployment, either through the runner's webserver API or from a connected Prefect server.

Behavior when webserver=True: the following error occurs:

20:28:33.879 | INFO    | prefect.webserver - Created flow run 'knowing-chamois' from deployment 'topic-poll-flow'
20:28:33.891 | INFO    | prefect.flow_runs.runner - Opening process...
20:28:33.894 | ERROR   | prefect.flow_runs.runner - Failed to start process for flow run '61719a08-c4d9-49bd-92bb-0124efeb2830'.
Traceback (most recent call last):
  File "/workspaces/gigaverse-ai/.venv/lib/python3.11/site-packages/prefect/runner/runner.py", line 1051, in _submit_run_and_capture_errors
    status_code = await self._run_process(
                  ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/gigaverse-ai/.venv/lib/python3.11/site-packages/prefect/runner/runner.py", line 600, in _run_process
    process = await run_process(
              ^^^^^^^^^^^^^^^^^^
  ...
  File "/usr/local/lib/python3.11/asyncio/events.py", line 637, in get_child_watcher
    raise NotImplementedError
NotImplementedError
20:28:33.935 | INFO    | prefect.flow_runs.runner - Reported flow run '61719a08-c4d9-49bd-92bb-0124efeb2830' as crashed: Flow run process could not be started

Behavior when webserver=False: flow runs start successfully, but the pods are terminated because no /health endpoint is available.

Version Information

Prefect versions tested:

  • Prefect 3.0.3, 3.0.7, and 3.0.10

Environment:

Version:             3.0.10
API version:         0.8.4
Python version:      3.11.6
Git commit:          3aa2d893
Built:               Tue, Oct 15, 2024 1:31 PM
OS/Arch:             linux/x86_64
Profile:             ephemeral
Server type:         ephemeral
Pydantic version:    2.9.2
Database:            sqlite (SQLite version: 3.40.1)

Additional Context

The webserver functionality for the runner is currently broken (related issue and PR: PrefectHQ/prefect#15707 and PrefectHQ/prefect#15680).

Workaround:

A temporary workaround involves patching the server log level as shown below:

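# Temporary workaround until the Prefect PR is merged
from unittest.mock import patch

from prefect.settings import PREFECT_RUNNER_SERVER_LOG_LEVEL

# topic_poll_flow is the flow being served (defined elsewhere).
# While the patch is active, PREFECT_RUNNER_SERVER_LOG_LEVEL.value()
# returns the lower-case string "error" instead of the configured value.
with patch.object(PREFECT_RUNNER_SERVER_LOG_LEVEL, "value", return_value="error"):
    topic_poll_flow.serve(webserver=True)
# End of workaround
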
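The workaround relies on `unittest.mock.patch.object` replacing the setting's `value` attribute with a `MagicMock` whose call returns `"error"`, only for the duration of the `with` block. The sketch below shows that mechanism in isolation. `FakeSetting` is a hypothetical stand-in for the real Prefect setting, and it assumes (as the workaround implies) that the setting is read by calling `.value()`:

```python
from unittest.mock import patch


class FakeSetting:
    """Hypothetical stand-in for a Prefect setting that is read via .value()."""

    def value(self) -> str:
        # Upper-case level; the .lower() fix discussed in this thread
        # suggests this form is what the webserver rejected.
        return "ERROR"


setting = FakeSetting()

with patch.object(setting, "value", return_value="error"):
    # While patched, .value is a MagicMock whose call returns "error".
    patched = setting.value()

# Leaving the with-block removes the mock, so the real method is used again.
restored = setting.value()

print(patched, restored)  # error ERROR
```

Because the patch is undone on exit, nothing outside the `with` block sees the modified value.
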
@zzstoatzz
Collaborator

Thank you for the issue, @gigaverse-oz. I've opened the PR linked above to resolve this.

@gigaverse-oz
Contributor Author

@zzstoatzz The issue here still appears after the .lower() fix.
It seems to be a different and more complex failure: the async library fails to start the run in a separate process/thread.
Please refer to the full error log.

@zzstoatzz
Collaborator

Hi @gigaverse-oz, after the .lower() fix I am not able to reproduce your error with the following:

import os

from prefect import flow


@flow(log_prints=True)
def hello():
    print(os.getenv("PREFECT__FLOW_RUN_ID"))


if __name__ == "__main__":
    hello.serve(webserver=True)

Can you explain what's going on in the flow you're serving?

@gigaverse-oz
Contributor Author

gigaverse-oz commented Oct 17, 2024

Hi @zzstoatzz,
It happens with your example too.

My setup:

  1. I run the Prefect server in container A (prefect server start).
  2. In container B, I run the code you supplied (with the fix we just added), with the environment variable PREFECT_API_URL=http://127.0.0.1:4200/api.
  3. In the server UI (http://localhost:4200), I run the deployment using Quick Run.

When webserver=False, the same setup works.

@gigaverse-oz
Contributor Author

gigaverse-oz commented Oct 17, 2024

I debugged it a little, and this is how far I got:

  1. The error is raised in prefect.utilities.processutils, line 202.
  2. These are the values I have for the variables:
command = ['/workspaces/gigaverse-ai/.venv/bin/python', '-m', 'prefect.engine']
kwargs = {'stdout': -1, 'stderr': -1, 'env': {'PREFECT_API_URL': 'http://127.0.0.1:4200/api', 'PREFECT_SERVER_ALLOW_EPHEMERAL_MODE': 'True', 'PREFECT__FLOW_RUN_ID': '987941ad-8b10-47d9-bd61-8cf3f511d30c', 'PREFECT__STORAGE_BASE_PATH': '/tmp/runner_storage/cc3a16e0-dc44-4344-ac70-d976ab21d780', 'PREFECT__ENABLE_CANCELLATION_AND_CRASHED_HOOKS': 'false', 'SHELL': '/bin/bash', 'COLORTERM': 'truecolor', 'PYTHONUNBUFFERED': '1', 'TERM_PROGRAM_VERSION': '1.94.2', 'HOSTNAME': '***', 'PYTHON_VERSION': '3.11.6', '***': '***', '***': '***', '***': '****', 'REMOTE_CONTAINERS_IPC': '/tmp/vscode-remote-containers-ipc-f9cc2b8a-2bcf-4c6c-b814-56db95be357f.sock', '***': '***', '***': '****', 'PWD': '/workspaces/gigaverse-ai', 'PYTHON_SETUPTOOLS_VERSION': '65.5.1', '****': '****', 'VSCODE_GIT_ASKPASS_NODE': '***', '***': '***/', 'HOME': '/root', 'LANG': 'C.UTF-8', 'VIRTUAL_ENV': '/workspaces/gigaverse-ai/.venv', 'REMOTE_CONTAINERS': 'true', 'WAYLAND_DISPLAY': 'vscode-wayland-45a1552d-a287-482a-939f-3859471911af.sock', '***': '***', '***': '***', 'GIT_ASKPASS': '***', '***': '***', 'POETRY_VIRTUALENVS_IN_PROJECT': 'true', '***': '***', 'PIP_DEFAULT_TIMEOUT': '100', 'POETRY_NO_INTERACTION': '1', 'VSCODE_GIT_ASKPASS_EXTRA_ARGS': '', '***': '***', 'TERM': 'xterm-256color', 'VSCODE_ENV_REPLACE': '***', 'GETSTREAM_API_LOCATION': 'us-east-1', 'REMOTE_CONTAINERS_SOCKETS': '["/tmp/.X11-unix/X6"]', 'PIP_DISABLE_PIP_VERSION_CHECK': 'on', 'VSCODE_GIT_IPC_HANDLE': '/tmp/user/0/vscode-git-d578e5dbf9.sock', 'DISPLAY': ':6', 'SHLVL': '2', '_GV_DEPLOY_ENVIRONMENT': 'local_test', '***': '***', '***': '***', '***': '***', 'PYTHON_PIP_VERSION': '23.2.1', 'VIRTUAL_ENV_PROMPT': 'gv-py3.11', '***': '***', '***': '***', 'XDG_RUNTIME_DIR': '/tmp/user/0', 'PYTHON_GET_PIP_SHA256': '9cc01665956d22b3bf057ae8287b035827bfd895da235bcea200ab3b811790b6', '***': '***', '***': '***', '***': '***', 'VSCODE_GIT_ASKPASS_MAIN': '/vscode/vscode-server/bin/linux-x64/384ff7382de624fb94dbaf6da11977bba1ecd427/extensions/git/dist/askpass-main.js', 'PYTHON_GET_PIP_URL': 'https://github.com/pypa/get-pip/raw/4cfa4081d27285bda1220a62a5ebf5b4bd749cdb/public/get-pip.py', 'BROWSER': '/vscode/vscode-server/bin/linux-x64/384ff7382de624fb94dbaf6da11977bba1ecd427/bin/helpers/browser.sh', 'PATH': '/root/.vscode-server/extensions/ms-python.python-2024.16.1-linux-x64/python_files/deactivate/bash:/workspaces/gigaverse-ai/.venv/bin:/vscode/vscode-server/bin/linux-x64/384ff7382de624fb94dbaf6da11977bba1ecd427/bin/remote-cli:/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin', '***': '***', '***': '***', '***': '***', 'PIP_NO_CACHE_DIR': 'off', 'VSCODE_ENV_PREPEND': 'PS1=gv-py3.11:PATH=/root/.vscode-server/extensions/ms-python.python-2024.16.1-linux-x64/python_files/deactivate/bash\\x3a/workspaces/gigaverse-ai/.venv/bin\\x3a', 'VENV_PATH': '/app/.venv', 'REMOTE_CONTAINERS_DISPLAY_SOCK': '/tmp/.X11-unix/X6', 'TERM_PROGRAM': 'vscode', 'VSCODE_IPC_HOOK_CLI': '/tmp/vscode-ipc-62c0e5db-0867-4436-9e70-c83a31cd5097.sock', '_': '/usr/bin/env', 'OLDPWD': '/workspaces/gigaverse-ai', 'DEBUG': 'True', 'PYTHONIOENCODING': 'UTF-8', 'PYDEVD_USE_FRAME_EVAL': 'NO', 'DEBUGPY_RUNNING': 'true'}, 'cwd': None}

I removed my API keys from the env dict in the kwargs; redacted entries appear as ***.
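The `'stdout': -1` and `'stderr': -1` entries in the dump are `subprocess.PIPE`, so the runner is launching `python -m prefect.engine` with piped output, an explicit `env`, and `cwd=None`. A stdlib-only sketch of the same call shape, which could be run inside the serving container to check whether a child process can be spawned from asyncio there, follows. It substitutes a trivial `-c` script for `prefect.engine`, and `spawn_child` is a hypothetical helper, not Prefect code:

```python
import asyncio
import os
import subprocess
import sys


async def spawn_child() -> tuple[int, str]:
    """Start a Python child process with piped output, mirroring the dumped kwargs."""
    env = dict(os.environ)  # the runner also passes an explicit env dict
    process = await asyncio.create_subprocess_exec(
        sys.executable,
        "-c",
        "print('child ok')",
        stdout=subprocess.PIPE,  # == -1, as in the dumped kwargs
        stderr=subprocess.PIPE,
        env=env,
        cwd=None,
    )
    stdout, _stderr = await process.communicate()
    return process.returncode, stdout.decode().strip()


print(asyncio.run(spawn_child()))  # (0, 'child ok')
```

With the default asyncio loop this prints `(0, 'child ok')`. If the equivalent call fails inside the process that runs the webserver, the problem can be reproduced without Prefect's runner code.
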

@gigaverse-oz
Contributor Author

@zzstoatzz

Some more information about the bug, after digging into your code:

It happens when the runner tries to create a new process for the flow run while the uvloop event loop is in use. One of the functions involved, the child watcher (get_child_watcher), is not implemented there.

When running the worker without the webserver, the default asyncio.unix_events loop is used, and that function is implemented.

I don't know your code well enough to understand when and why it uses uvloop rather than the default asyncio loop.
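The thread does not establish why uvloop ends up active. One way to narrow it down is to log the class of the process-wide event loop policy at a few points in the serving process, for example just before calling `serve(webserver=True)` and again from a background thread once the webserver is up. The helpers below are hypothetical, not part of Prefect. That the webserver's ASGI server installs the policy (uvicorn, for instance, can select uvloop when it is installed) is an assumption to verify, not something shown here:

```python
import asyncio


def event_loop_policy_name() -> str:
    """Return the fully qualified class name of the process-wide event loop policy."""
    policy = asyncio.get_event_loop_policy()
    cls = type(policy)
    return f"{cls.__module__}.{cls.__qualname__}"


def uses_uvloop_policy() -> bool:
    # uvloop's policy class is defined in the "uvloop" package;
    # the standard library defaults live in "asyncio.*" modules.
    return event_loop_policy_name().split(".", 1)[0] == "uvloop"


print(event_loop_policy_name(), uses_uvloop_policy())
```

In a plain interpreter this reports a class from an `asyncio.*` module and `False`; a `uvloop.*` class at the point where the runner spawns the flow run process would be consistent with the failure described above.
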

For now, I've stopped using the built-in webserver for the worker and deployed my own.
I would prefer to have it fixed, since it prevents me from working "cleanly" with k8s.
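The replacement webserver is not shown in the thread. As an illustration only, a minimal liveness endpoint could run in a daemon thread alongside `serve(webserver=False)`, giving a Kubernetes probe a `/health` path to hit. This sketch uses the standard library's `http.server`; the port (8080), the handler, and `start_health_server` are assumptions, and the endpoint only proves the process is alive, not that the runner is healthy:

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answer GET /health with 200 and a small JSON body; anything else gets 404."""

    def do_GET(self) -> None:
        if self.path == "/health":
            body = b'{"status": "ok"}'
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format: str, *args) -> None:
        # Silence per-request logging so frequent probes do not flood the logs.
        pass


def start_health_server(host: str = "0.0.0.0", port: int = 8080) -> ThreadingHTTPServer:
    """Start the health server in a daemon thread and return it (hypothetical helper)."""
    server = ThreadingHTTPServer((host, port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Usage would be to call `start_health_server()` and then `topic_poll_flow.serve(webserver=False)` in the same process; because the thread is a daemon, it stops when `serve` returns and the process exits.
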
