Translation on Multiple GPUs with device_index in v.2.15.0+ #786

Closed
ymoslem opened this issue Apr 27, 2022 · 6 comments · Fixed by #788


ymoslem commented Apr 27, 2022

The issue occurs in versions 2.15.0 and 2.15.1 during translation, while it works fine in 2.14.0.

Code:

translator = ctranslate2.Translator(model_path, device="cuda", device_index=[0,1])

Error:

RuntimeError: CUDA failed with error invalid argument
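
For context, here is a minimal sketch of how such a multi-GPU setup is typically exercised (not the exact script from this report; the model path and source tokens are placeholders):

import ctranslate2

# Placeholder path to a converted CTranslate2 model directory.
model_path = "ende_ctranslate2/"

# Load a model replica on GPU 0 and another on GPU 1.
translator = ctranslate2.Translator(model_path, device="cuda", device_index=[0, 1])

# Placeholder tokens; in practice they come from the model's tokenizer.
results = translator.translate_batch([["▁Hello", "▁world", "!"]])
print(results[0])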

guillaumekln commented Apr 28, 2022

I cannot reproduce this error. Can you confirm the error is raised when creating the Translator object?

Can you also describe the model you are using (original framework, model type, quantization, etc.)? Reporting the output with CT2_VERBOSE=1 could be helpful as well.
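
For reference, a minimal sketch of one way to capture that output from Python (assuming the variable is set before ctranslate2 is imported so the native library sees it; exporting CT2_VERBOSE=1 in the shell before launching the script works as well):

import os

# Set the variable before the library is loaded.
os.environ["CT2_VERBOSE"] = "1"

import ctranslate2

# Placeholder model path; the verbose output appears while the model is loaded and used.
translator = ctranslate2.Translator("ende_ctranslate2/", device="cuda", device_index=[0, 1])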

guillaumekln (Collaborator) commented:

Ok, I got the same error when loading a newly converted model. I will check. Thanks for the report!


henyee commented May 5, 2022

It seems the model is now able to use the different GPUs, but under load it tends to hit a waitress "Task queue depth" problem that stops the REST server. Restarting the REST server is needed to restore service. Switching back to using only one GPU (0) doesn't show this problem under the same load.

guillaumekln (Collaborator) commented:

Do you mean the performance is reduced when using multiple GPUs? Can you be more specific? Consider opening a separate issue if you can isolate the problem to CTranslate2.


henyee commented May 5, 2022

The error is actually a problem in waitress, which the OpenNMT-py REST server uses. It happens under very intense load (when API calls are made repeatedly), which is understandable.

However, having the REST server load different CTranslate2 models on different GPUs seems to make the "Task queue depth" error happen more frequently (attempting to fix it on waitress's side by increasing the number of threads from the default of 4 to 32 doesn't help at all), whereas letting the REST server load all the models on one GPU doesn't have this problem. I noticed this when switching back and forth between the two setups (multiple GPUs versus one GPU) with the load being fairly similar.
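
For context, the thread count mentioned above corresponds to waitress's threads argument. A standalone sketch (the app below is a placeholder WSGI application, not the actual OpenNMT-py server, and where OpenNMT-py passes this option may differ):

from waitress import serve

def app(environ, start_response):
    # Placeholder WSGI application standing in for the REST server.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]

# waitress uses 4 worker threads by default; raising it to 32 did not help in this case.
serve(app, host="0.0.0.0", port=5000, threads=32)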

guillaumekln (Collaborator) commented:

As far as I know, the OpenNMT-py server cannot process multiple translations in parallel. So the model running on multiple GPUs will only use 1 GPU at a time. See this issue OpenNMT/OpenNMT-py#2001 (comment).
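
To illustrate the point about parallelism, here is a sketch that exercises both GPUs with CTranslate2 directly rather than through the OpenNMT-py server (model path and tokens are placeholders): the replicas created by device_index=[0, 1] are only used at the same time when translations are submitted concurrently.

import ctranslate2
from concurrent.futures import ThreadPoolExecutor

translator = ctranslate2.Translator("ende_ctranslate2/", device="cuda", device_index=[0, 1])

# Placeholder batches; in practice the tokens come from the model's tokenizer.
batches = [[["▁Hello", "▁world", "!"]] for _ in range(8)]

# translate_batch releases the Python GIL while translating, so concurrent
# calls from multiple threads can be dispatched to the different GPU replicas.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(translator.translate_batch, batches))

print(len(results), "batches translated")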
