fix: Adding ensemble support for vllm container #68
Conversation
Co-authored-by: Ryan McCormick <[email protected]>
```
@@ -163,6 +164,17 @@ def test_exclude_input_in_output_true(self):
            expected_output=expected_output,
        )

    def test_ensemble_model(self):
```
This is going to break `EXPECTED_NUM_TESTS=6`. You could update it to `EXPECTED_NUM_TESTS=7`, but I think it's easier to just append the ensemble to the other test that is testing model loading here:
vllm_backend/ci/L0_backend_vllm/vllm_backend/vllm_backend_test.py, lines 55 to 59 in 78bf0fb:
```python
# Load both vllm and add_sub models
self.triton_client.load_model(self.vllm_load_test)
self.assertTrue(self.triton_client.is_model_ready(self.vllm_load_test))
self.triton_client.load_model(self.python_model_name)
self.assertTrue(self.triton_client.is_model_ready(self.python_model_name))
```
ex:

```python
# Load both vllm and add_sub models
self.triton_client.load_model(self.vllm_load_test)
self.assertTrue(self.triton_client.is_model_ready(self.vllm_load_test))
self.triton_client.load_model(self.python_model_name)
self.assertTrue(self.triton_client.is_model_ready(self.python_model_name))

# Test to ensure that ensemble models are supported in vllm container.
# If ensemble support is not enabled, triton will fail to load the ensemble.
self.triton_client.load_model(self.ensemble_model_name)
self.assertTrue(self.triton_client.is_model_ready(self.ensemble_model_name))
```
Also use the `vllm_load_test` model inside the ensemble instead of `vllm_opt`, for the same reason.
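For illustration, a minimal sketch of what the ensemble's config.pbtxt could look like with that change; the model name, tensor names, and shapes here are assumptions based on the vLLM backend's usual `text_input`/`text_output` interface, not the PR's actual files:

```
# Hypothetical minimal ensemble config; names and shapes are assumptions.
name: "ensemble_model"
platform: "ensemble"
input [
  {
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
output [
  {
    name: "text_output"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "vllm_load_test"  # per the review: not "vllm_opt"
      model_version: -1
      input_map {
        key: "text_input"
        value: "text_input"
      }
      output_map {
        key: "text_output"
        value: "text_output"
      }
    }
  ]
}
```

Since loading or unloading an ensemble can touch its composing models, pointing it at the dedicated `vllm_load_test` avoids disturbing `vllm_opt`, which the next comment explains must stay loaded.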
The tests document some expected behavior around `vllm_opt` staying loaded for the duration of the test:
vllm_backend/ci/L0_backend_vllm/vllm_backend/test.sh, lines 53 to 57 in 78bf0fb:
```bash
# `vllm_opt` model will be loaded on server start and stay loaded throughout
# unittesting. To test vllm model load/unload we use a dedicated
# `vllm_load_test`. To ensure that vllm's memory profiler will not error out
# on `vllm_load_test` load, we reduce "gpu_memory_utilization" for `vllm_opt`,
# so that at least 60% of GPU memory was available for other models.
```
So I think it's best not to mess with it in a new test, and make use of `vllm_load_test` instead.
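For context, the vLLM backend takes its engine arguments from each model's model.json; below is a hypothetical sketch of the kind of setting that comment describes (the path, model name, and 0.4 value are assumptions, not the CI's actual numbers):

```bash
# Hypothetical sketch: pinning down vllm_opt's GPU footprint via its
# model.json engine arguments (path, model, and 0.4 are all assumptions).
cat > models/vllm_opt/1/model.json <<'EOF'
{
    "model": "facebook/opt-125m",
    "disable_log_requests": true,
    "gpu_memory_utilization": 0.4
}
EOF
```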
Co-authored-by: Ryan McCormick <[email protected]>
nice one!
Currently, while building the vLLM container, we are not passing `--backend=ensemble` as one of the flags, which leads to a lack of support when people try running ensemble models in the vLLM container. To address this, adding `--backend=ensemble` to the build instructions should provide support going forward (>=24.10). Additionally, this PR adds a test case within the existing L0_backend_vllm job to catch the lack of support in the CI pipeline.
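For reference, a trimmed sketch of where the flag lands, assuming the container is built with Triton's build.py as described in the backend README (the release tags and surrounding flags are placeholders, not the exact command):

```bash
# Trimmed sketch of the vLLM container build; most flags omitted and release
# tags are placeholders. The final flag is the fix: without --backend=ensemble,
# Triton in the resulting image cannot load ensemble models.
./build.py -v --enable-logging --enable-stats --enable-gpu \
    --endpoint=http --endpoint=grpc \
    --backend=python:r24.10 \
    --backend=vllm:r24.10 \
    --backend=ensemble
```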