fix: Adding ensemble support for vllm container #68

Merged: 12 commits merged into main from kprashanth-ensemble-fix on Sep 24, 2024

Conversation

KrishnanPrash (Contributor)

Currently, while building the vLLM container, we are not passing --backend=ensemble as one of the flags, which means ensemble models are not supported when users try to run them in the vLLM container.

To address this, this PR adds --backend=ensemble to the build instructions, which should provide ensemble support going forward (>=24.10). It also adds a test case to the existing L0_vllm_backend job so the CI pipeline catches missing ensemble support.
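
For context, the new check boils down to loading an ensemble model through the Triton client and asserting it becomes ready. A minimal standalone sketch of that idea, assuming a locally running server started with explicit model control; the URL and the "ensemble_model" name are illustrative, not taken from this PR:

    # Minimal sketch: verify that the vLLM container can serve ensemble models.
    # Assumes Triton is running locally with --model-control-mode=explicit and an
    # ensemble named "ensemble_model" in the model repository (illustrative name).
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")

    # If the container was built without --backend=ensemble, this load fails
    # because the ensemble platform is not available in the server.
    client.load_model("ensemble_model")
    assert client.is_model_ready("ensemble_model")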

@rmccorm4 changed the title from "bug: Adding ensemble support for vllm container" to "fix: Adding ensemble support for vllm container" on Sep 23, 2024

@rmccorm4 (Contributor) commented:

Fixes triton-inference-server/server#7632

README.md (outdated review comment, resolved)
@@ -163,6 +164,17 @@ def test_exclude_input_in_output_true(self):
expected_output=expected_output,
)

def test_ensemble_model(self):
@rmccorm4 (Contributor) commented on Sep 23, 2024:

This is going to break EXPECTED_NUM_TESTS=6.

You could update it to EXPECTED_NUM_TESTS=7, but I think it's easier to just append the ensemble to the other test that is testing model loading here:

# Load both vllm and add_sub models
self.triton_client.load_model(self.vllm_load_test)
self.assertTrue(self.triton_client.is_model_ready(self.vllm_load_test))
self.triton_client.load_model(self.python_model_name)
self.assertTrue(self.triton_client.is_model_ready(self.python_model_name))

ex:

        # Load both vllm and add_sub models
        self.triton_client.load_model(self.vllm_load_test)
        self.assertTrue(self.triton_client.is_model_ready(self.vllm_load_test))
        self.triton_client.load_model(self.python_model_name)
        self.assertTrue(self.triton_client.is_model_ready(self.python_model_name))

        # Test to ensure that ensemble models are supported in vllm container.
        # If ensemble support is not enabled, triton will fail to load the ensemble.
        self.triton_client.load_model(self.ensemble_model_name)
        self.assertTrue(self.triton_client.is_model_ready(self.ensemble_model_name))

Also use vllm_load_test model inside the ensemble instead of vllm_opt for the same reason.

@rmccorm4 (Contributor) commented on Sep 23, 2024:

The tests document some expected behavior around vllm_opt staying loaded for the duration of the test:

# `vllm_opt` model will be loaded on server start and stay loaded throughout
# unittesting. To test vllm model load/unload we use a dedicated
# `vllm_load_test`. To ensure that vllm's memory profiler will not error out
# on `vllm_load_test` load, we reduce "gpu_memory_utilization" for `vllm_opt`,
# so that at least 60% of GPU memory was available for other models.

So I think it's best not to mess with it in a new test and make use of vllm_load_test instead.
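
For reference, the vLLM backend reads its engine arguments from a model.json file in the model's version directory, which is where gpu_memory_utilization is lowered for vllm_opt. A hedged sketch of what that could look like; the path, model name, and value below are illustrative assumptions, not copied from this repository:

    # Illustrative only: write engine arguments for vllm_opt so that it leaves
    # GPU memory free for vllm_load_test and the ensemble to load alongside it.
    import json

    engine_args = {
        "model": "facebook/opt-125m",   # assumed HF model backing vllm_opt
        "gpu_memory_utilization": 0.3,  # keep most GPU memory free for other models
    }

    with open("models/vllm_opt/1/model.json", "w") as f:
        json.dump(engine_args, f, indent=4)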

@KrishnanPrash merged commit 0df1013 into main on Sep 24, 2024
3 checks passed
@KrishnanPrash deleted the kprashanth-ensemble-fix branch on September 24, 2024 at 18:01
@oandreeva-nv (Collaborator):

nice one!
