fix: Adding ensemble support for vllm container #68
Conversation
Co-authored-by: Ryan McCormick <[email protected]>
```
@@ -163,6 +164,17 @@ def test_exclude_input_in_output_true(self):
            expected_output=expected_output,
        )

    def test_ensemble_model(self):
```
This is going to break `EXPECTED_NUM_TESTS=6`. You could update it to `EXPECTED_NUM_TESTS=7`, but I think it's easier to just append the ensemble to the other test that is testing model loading here:
vllm_backend/ci/L0_backend_vllm/vllm_backend/vllm_backend_test.py, lines 55 to 59 in 78bf0fb:
```python
# Load both vllm and add_sub models
self.triton_client.load_model(self.vllm_load_test)
self.assertTrue(self.triton_client.is_model_ready(self.vllm_load_test))
self.triton_client.load_model(self.python_model_name)
self.assertTrue(self.triton_client.is_model_ready(self.python_model_name))
```
ex:

```python
# Load both vllm and add_sub models
self.triton_client.load_model(self.vllm_load_test)
self.assertTrue(self.triton_client.is_model_ready(self.vllm_load_test))
self.triton_client.load_model(self.python_model_name)
self.assertTrue(self.triton_client.is_model_ready(self.python_model_name))

# Test to ensure that ensemble models are supported in vllm container.
# If ensemble support is not enabled, triton will fail to load the ensemble.
self.triton_client.load_model(self.ensemble_model_name)
self.assertTrue(self.triton_client.is_model_ready(self.ensemble_model_name))
```
Also use the `vllm_load_test` model inside the ensemble instead of `vllm_opt`, for the same reason.
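For illustration, a minimal sketch of what the ensemble's config.pbtxt could look like with that change; the model name, tensor names, and shapes here are assumptions based on the vLLM backend's usual `text_input`/`text_output` interface, not the PR's actual files:

```
# Hypothetical minimal ensemble config; names and shapes are assumptions.
name: "ensemble_model"
platform: "ensemble"
input [
  {
    name: "text_input"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
output [
  {
    name: "text_output"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "vllm_load_test"  # per the review: not "vllm_opt"
      model_version: -1
      input_map {
        key: "text_input"
        value: "text_input"
      }
      output_map {
        key: "text_output"
        value: "text_output"
      }
    }
  ]
}
```

Since loading or unloading an ensemble can touch its composing models, pointing it at the dedicated `vllm_load_test` avoids disturbing `vllm_opt`, which the next comment explains must stay loaded.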
The tests document some expected behavior around `vllm_opt` staying loaded for the duration of the test:
vllm_backend/ci/L0_backend_vllm/vllm_backend/test.sh, lines 53 to 57 in 78bf0fb:
```bash
# `vllm_opt` model will be loaded on server start and stay loaded throughout
# unittesting. To test vllm model load/unload we use a dedicated
# `vllm_load_test`. To ensure that vllm's memory profiler will not error out
# on `vllm_load_test` load, we reduce "gpu_memory_utilization" for `vllm_opt`,
# so that at least 60% of GPU memory was available for other models.
```
So I think it's best not to mess with it in a new test, and make use of `vllm_load_test` instead.
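For context, the vLLM backend takes its engine arguments from each model's model.json; below is a hypothetical sketch of the kind of setting that comment describes (the path, model name, and 0.4 value are assumptions, not the CI's actual numbers):

```bash
# Hypothetical sketch: pinning down vllm_opt's GPU footprint via its
# model.json engine arguments (path, model, and 0.4 are all assumptions).
cat > models/vllm_opt/1/model.json <<'EOF'
{
    "model": "facebook/opt-125m",
    "disable_log_requests": true,
    "gpu_memory_utilization": 0.4
}
EOF
```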
Co-authored-by: Ryan McCormick <[email protected]>
nice one!
Currently, while building the vLLM container, we are not passing `--backend=ensemble` as one of the flags, which leads to a lack of support when people try running ensemble models in the vLLM container. To address this, adding `--backend=ensemble` to the build instructions should provide support going forward (>=24.10). Additionally, this PR adds a test case within the existing L0_backend_vllm job to catch the lack of support in the CI pipeline.
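For reference, a trimmed sketch of where the flag lands, assuming the container is built with Triton's build.py as described in the backend README (the release tags and surrounding flags are placeholders, not the exact command):

```bash
# Trimmed sketch of the vLLM container build; most flags omitted and release
# tags are placeholders. The final flag is the fix: without --backend=ensemble,
# Triton in the resulting image cannot load ensemble models.
./build.py -v --enable-logging --enable-stats --enable-gpu \
    --endpoint=http --endpoint=grpc \
    --backend=python:r24.10 \
    --backend=vllm:r24.10 \
    --backend=ensemble
```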