Translation on Multiple GPUs with device_index in v.2.15.0+ #786

Closed
ymoslem opened this issue Apr 27, 2022 · 6 comments · Fixed by #788


ymoslem commented Apr 27, 2022

The issue occurs in versions 2.15.0 and 2.15.1 during translation, while it works fine in 2.14.0.

Code:

translator = ctranslate2.Translator(model_path, device="cuda", device_index=[0,1])

Error:

RuntimeError: CUDA failed with error invalid argument
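
For context, here is a minimal sketch of how such a multi-GPU setup is typically exercised (not the exact script from this report; the model path and source tokens are placeholders):

import ctranslate2

# Placeholder path to a converted CTranslate2 model directory.
model_path = "ende_ctranslate2/"

# Load a model replica on GPU 0 and another on GPU 1.
translator = ctranslate2.Translator(model_path, device="cuda", device_index=[0, 1])

# Placeholder tokens; in practice they come from the model's tokenizer.
results = translator.translate_batch([["▁Hello", "▁world", "!"]])
print(results[0])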

guillaumekln commented Apr 28, 2022

I cannot reproduce this error. Can you confirm the error is raised when creating the Translator object?

Can you also describe the model you are using (original framework, model type, quantization, etc.)? Reporting the output with CT2_VERBOSE=1 could be helpful as well.
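
For reference, a minimal sketch of one way to capture that output from Python (assuming the variable is set before ctranslate2 is imported so the native library sees it; exporting CT2_VERBOSE=1 in the shell before launching the script works as well):

import os

# Set the variable before the library is loaded.
os.environ["CT2_VERBOSE"] = "1"

import ctranslate2

# Placeholder model path; the verbose output appears while the model is loaded and used.
translator = ctranslate2.Translator("ende_ctranslate2/", device="cuda", device_index=[0, 1])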

guillaumekln (Collaborator) commented:

Ok, I got the same error when loading a newly converted model. I will check. Thanks for the report!


henyee commented May 5, 2022

It seems the model is now able to use the different GPUs, but under load it tends to hit a waitress "Task queue depth" problem that stops the REST server. Restarting the REST server is needed to restore service. Switching back to using only one GPU (0) doesn't show this problem under the same load.

guillaumekln (Collaborator) commented:

Do you mean the performance is reduced when using multiple GPUs? Can you be more specific? Consider opening a separate issue if you can isolate the problem to CTranslate2.


henyee commented May 5, 2022

The error is actually a problem in waitress, which the OpenNMT-py REST server uses. It happens under very intense load (when API calls are made repeatedly), which is understandable.

However, having the REST server load different CTranslate2 models on different GPUs seems to make the "Task queue depth" error happen more frequently (attempting to fix it on waitress's side by increasing the number of threads from the default of 4 to 32 doesn't help at all), whereas letting the REST server load all the models on one GPU doesn't have this problem. I noticed this when switching back and forth between the two setups (multiple GPUs versus one GPU) with the load being fairly similar.
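
For context, the thread count mentioned above corresponds to waitress's threads argument. A standalone sketch (the app below is a placeholder WSGI application, not the actual OpenNMT-py server, and where OpenNMT-py passes this option may differ):

from waitress import serve

def app(environ, start_response):
    # Placeholder WSGI application standing in for the REST server.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]

# waitress uses 4 worker threads by default; raising it to 32 did not help in this case.
serve(app, host="0.0.0.0", port=5000, threads=32)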

guillaumekln (Collaborator) commented:

As far as I know, the OpenNMT-py server cannot process multiple translations in parallel. So the model running on multiple GPUs will only use 1 GPU at a time. See this issue OpenNMT/OpenNMT-py#2001 (comment).
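
To illustrate the point about parallelism, here is a sketch that exercises both GPUs with CTranslate2 directly rather than through the OpenNMT-py server (model path and tokens are placeholders): the replicas created by device_index=[0, 1] are only used at the same time when translations are submitted concurrently.

import ctranslate2
from concurrent.futures import ThreadPoolExecutor

translator = ctranslate2.Translator("ende_ctranslate2/", device="cuda", device_index=[0, 1])

# Placeholder batches; in practice the tokens come from the model's tokenizer.
batches = [[["▁Hello", "▁world", "!"]] for _ in range(8)]

# translate_batch releases the Python GIL while translating, so concurrent
# calls from multiple threads can be dispatched to the different GPU replicas.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(translator.translate_batch, batches))

print(len(results), "batches translated")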
