feat: OpenAI Compatible Frontend #7561

Merged · 81 commits · Oct 11, 2024
Changes from 75 commits

Commits
637db32
Initial code migration, start the testing structure
rmccorm4 Jul 31, 2024
e14128b
Restructure to recommended FastAPI project structure, add simple test…
rmccorm4 Aug 2, 2024
a37b0b3
Start a CONTRIBUTING.md
rmccorm4 Aug 2, 2024
7eb1ffc
Add simple /completions endpoint test
rmccorm4 Aug 3, 2024
530c871
Add some plumbing for /v1/models routes, add mock_llm python model to…
rmccorm4 Aug 6, 2024
9eba9c3
Add simple tests for /v1/models and remove chat_completions test unti…
rmccorm4 Aug 6, 2024
fb7ce72
Add some basic chat completions support and testing
rmccorm4 Aug 7, 2024
0cf8fae
WIP: Add OpenAI client test that works when server is already running…
rmccorm4 Aug 7, 2024
3d227dd
Flesh out /completions tests more, refactor to class fixture for runn…
rmccorm4 Aug 8, 2024
4c1ac55
Update chat completions schema to enforce max_tokens >= 0, and lower …
rmccorm4 Aug 8, 2024
5b15877
Add more tests around max_tokens and temperature behavior, as well as…
rmccorm4 Aug 9, 2024
f9f4b07
Remove unused parts from tokenizer.py
rmccorm4 Aug 9, 2024
773aee0
All existing tests passing for both TRT-LLM and vLLM, updated model l…
rmccorm4 Aug 14, 2024
567abf3
Add streaming test placeholders, add test where no tokenizer is defined
rmccorm4 Aug 14, 2024
6e1bfaf
Add OpenAI Python Client tests, add streaming chat completions test, …
rmccorm4 Aug 16, 2024
4e3a441
Add 'echo' parameter test, but skip it for TRT-LLM due to only suppor…
rmccorm4 Aug 16, 2024
523f369
Fix issue with finish_reason for non-streaming completion when using …
rmccorm4 Aug 16, 2024
75f71ce
Move triton response validation into common triton utils
rmccorm4 Aug 16, 2024
118887c
Reduce code copying and global variables, use conftest.py for shared …
rmccorm4 Aug 16, 2024
6cf2e77
Split Dockefile in 2 to capture llama3.1 requirement for vllm
rmccorm4 Aug 16, 2024
66afc48
Split Dockerfile in 2 to capture llama3.1 requirement for vllm
rmccorm4 Aug 16, 2024
0bbd248
Add configurable model parameter to examples
rmccorm4 Aug 16, 2024
6e59f6e
Fix streaming for genai-perf by setting the content-type to text/even…
rmccorm4 Aug 19, 2024
763b3a4
Update examples to default to vllm model for simplicity
rmccorm4 Aug 19, 2024
0328ea6
Start high level README for other developers
rmccorm4 Aug 19, 2024
43dd329
Move openai source code into server/python/openai folder, and flesh o…
rmccorm4 Aug 19, 2024
363b40e
Move openai code to server/python folder
rmccorm4 Aug 19, 2024
d35d336
Add disclaimer for TRT-LLM to README
rmccorm4 Aug 19, 2024
63fc4a7
Fix README typos
rmccorm4 Aug 19, 2024
4a729c0
Fix relative path for OpenAI server helper after moving locations
rmccorm4 Aug 19, 2024
0f459b1
Add placeholder L0_openai test folder back
rmccorm4 Aug 19, 2024
0b3def0
Add transformers upgrade for Llama3.1 in vllm
rmccorm4 Aug 20, 2024
2e897b9
Add requirements.txt files for use in testing
rmccorm4 Aug 20, 2024
f54a4fa
Add placeholder test script
rmccorm4 Aug 20, 2024
c2786b2
Cleanup test script for local file reference
rmccorm4 Aug 21, 2024
021c577
Fix paths and empty function
rmccorm4 Aug 21, 2024
a69bfd1
Install tritonserver python wheel
rmccorm4 Aug 21, 2024
6361bd1
Add TRT-LLM detection and model repo generation
rmccorm4 Aug 21, 2024
c096ba5
Fix trtllm model count comparison to 4, excluding ensemble
rmccorm4 Aug 21, 2024
5631231
Fail on pytest errors
rmccorm4 Aug 21, 2024
e77f85c
Try copying engines out of NFS mount for faster test I/O
rmccorm4 Aug 21, 2024
b41a6f7
Use model var
rmccorm4 Aug 21, 2024
8251923
Time the duration of copying from nfs mount
rmccorm4 Aug 21, 2024
f928a81
Try rsync over cp
rmccorm4 Aug 21, 2024
81ef479
Remove use of NFS mount due to slow I/O for now
rmccorm4 Aug 21, 2024
42676da
Propagate test failure to job failure and log collection
rmccorm4 Aug 21, 2024
cacaf0b
Add xml files to gitignore
rmccorm4 Aug 21, 2024
b6c3f9e
Test /v1/models with multiple models and remove TODOs
rmccorm4 Aug 21, 2024
5cc80fe
Add openai folder copy to gitignore in testing
rmccorm4 Aug 21, 2024
9f70a1d
Add streaming completion test, remove trtllm models from git repo
rmccorm4 Aug 21, 2024
d00d237
Remove unnecessary TODOs
rmccorm4 Aug 22, 2024
ae2fcd6
Add copyrights and replace dupe test model
rmccorm4 Aug 22, 2024
fc4c15a
Add disclaimer around application state and multiprocessing
rmccorm4 Aug 22, 2024
1ca9889
Address CodeQL warnings
rmccorm4 Aug 22, 2024
92a27e5
Add quickstart vllm dockerfile for sharing purposes
rmccorm4 Aug 23, 2024
9c3ee15
Remove workspace mount mention
rmccorm4 Aug 23, 2024
886ee7d
Review feedback: rename package, move tests out of package, remove ne…
rmccorm4 Aug 23, 2024
21c0996
Review feedback: naming nits, more type hints, helper functions
rmccorm4 Aug 24, 2024
f84aec4
Fix CodeQL import warning
rmccorm4 Aug 24, 2024
b230697
refactor: Use thinner API server with an engine interface (#7570)
rmccorm4 Aug 29, 2024
ea23eeb
Update dockerfile branch, fix CodeQL error
rmccorm4 Aug 29, 2024
156535c
Add tests for custom tokenizers by local file path
rmccorm4 Aug 29, 2024
9b7dc59
Expose --backend request format override to main.py, and expose env v…
rmccorm4 Aug 31, 2024
a1484e4
Fix tokenizer test, remove TODO
rmccorm4 Sep 4, 2024
33eee48
perf: Improve chat completions performance at high concurrency (#7653)
rmccorm4 Sep 25, 2024
0882b60
review feedback: use _to_string helper function, add some clarifying …
rmccorm4 Sep 25, 2024
f073fbf
feat: KServe Bindings to start tritonfrontend (#7662)
KrishnanPrash Sep 26, 2024
2d0f7e6
chore: Fix argparse typo, cleanup argparse groups, make kserve fronte…
rmccorm4 Sep 27, 2024
78e571d
fix: Support sampling parameters of type List for vLLM backend (stop …
rmccorm4 Oct 7, 2024
579ad63
Review feedback: remove examples/ and docker/ folders, update README …
rmccorm4 Oct 9, 2024
815eebe
Add a few FIXMEs for follow-up
rmccorm4 Oct 9, 2024
8f92734
Add requirements.txt back in, fix test and docs accordingly
rmccorm4 Oct 9, 2024
5c0b2e6
Fix TRT-LLM model repo test path
rmccorm4 Oct 9, 2024
44b2282
Explicitly return error on unknown fields not defined in schema, excl…
rmccorm4 Oct 9, 2024
dc7bdf4
Merge branch 'main' of github.com:triton-inference-server/server into…
rmccorm4 Oct 10, 2024
49162be
Add missing copyright headers
rmccorm4 Oct 10, 2024
fe45d39
Review feedback: split app and test requirements to 2 requirements files
rmccorm4 Oct 10, 2024
2261d13
Fix whitespace pre-commit, remove auto 'git add' from copyright tool
rmccorm4 Oct 10, 2024
2e2a190
Disable copyright pre-commit hook until fixed on GitHub Actions side
rmccorm4 Oct 10, 2024
cc8657d
Fix attribution for tokenizer util
rmccorm4 Oct 10, 2024
fa9501e
Fix copyright header on copyright tool, remove unused import
rmccorm4 Oct 10, 2024
9 changes: 9 additions & 0 deletions .gitignore
@@ -5,4 +5,13 @@
__pycache__
tmp
*.log
*.xml
test_results.txt
artifacts
cprofile
*.prof

# Test exclusions
qa/L0_openai/openai
tensorrtllm_models
custom_tokenizer
177 changes: 177 additions & 0 deletions python/openai/README.md
@@ -0,0 +1,177 @@
# OpenAI-Compatible Frontend for Triton Inference Server

## Prerequisites

1. Docker + NVIDIA Container Runtime
2. A correctly configured `HF_TOKEN` for access to HuggingFace models.
- The current examples and testing primarily use the
[`meta-llama/Meta-Llama-3.1-8B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)
model, but you can manually bring your own models and adjust accordingly.

## vLLM

1. Launch the container:
- Mounts `~/.cache/huggingface` so downloaded models are re-used across runs, containers, etc.
- Sets the [`HF_TOKEN`](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hftoken) environment variable to
access gated models; make sure this is set in your local environment if needed.

```bash
docker run -it --net=host --gpus all --rm \
-v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
-e HF_TOKEN \
nvcr.io/nvidia/tritonserver:24.08-vllm-python-py3
```

2. Install dependencies inside the container:
```bash
# Install python bindings for tritonserver and tritonfrontend
pip install /opt/tritonserver/python/triton*.whl

# Install application/testing requirements
git clone https://github.com/triton-inference-server/server.git
cd server/python/openai/
pip install -r requirements.txt
```

3. Launch the OpenAI-compatible Triton Inference Server:
```bash
# NOTE: Adjust the --tokenizer based on the model being used
python3 openai_frontend/main.py --model-repository tests/vllm_models --tokenizer meta-llama/Meta-Llama-3.1-8B-Instruct
```

4. Send a `/v1/chat/completions` request:
- Note: the use of `jq` is optional, but it provides nicely formatted output for JSON responses.
```bash
MODEL="llama-3.1-8b-instruct"
curl -s http://localhost:9000/v1/chat/completions -H 'Content-Type: application/json' -d '{
"model": "'${MODEL}'",
"messages": [{"role": "user", "content": "Say this is a test!"}]
}' | jq
```

5. Send a `/v1/completions` request:
- Note: the use of `jq` is optional, but it provides nicely formatted output for JSON responses.
```bash
MODEL="llama-3.1-8b-instruct"
curl -s http://localhost:9000/v1/completions -H 'Content-Type: application/json' -d '{
"model": "'${MODEL}'",
"prompt": "Machine learning is"
}' | jq
```

6. Benchmark with `genai-perf`:
```bash
MODEL="llama-3.1-8b-instruct"
TOKENIZER="meta-llama/Meta-Llama-3.1-8B-Instruct"
genai-perf \
--model ${MODEL} \
--tokenizer ${TOKENIZER} \
--service-kind openai \
--endpoint-type chat \
--synthetic-input-tokens-mean 256 \
--synthetic-input-tokens-stddev 0 \
--output-tokens-mean 256 \
--output-tokens-stddev 0 \
--streaming
```

7. Use the OpenAI python client directly:
```python
from openai import OpenAI

client = OpenAI(
base_url="http://localhost:9000/v1",
api_key="EMPTY",
)

model = "llama-3.1-8b-instruct"
completion = client.chat.completions.create(
model=model,
messages=[
{
"role": "system",
"content": "You are a helpful assistant.",
},
{"role": "user", "content": "What are LLMs?"},
],
max_tokens=256,
)

print(completion.choices[0].message.content)
```
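Streaming responses can be consumed with the same client by passing `stream=True`. A minimal sketch, assuming the server above is running with the same model:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9000/v1", api_key="EMPTY")

# Request a streaming chat completion and print tokens as they arrive.
stream = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "What are LLMs?"}],
    max_tokens=256,
    stream=True,
)

for chunk in stream:
    # Each chunk carries an incremental delta; content may be None.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```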

8. Run tests (NOTE: the server should not already be running; the tests handle starting and stopping the server as needed):
```bash
cd server/python/openai/
pytest -v tests/
```
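To run only a subset of the tests, pytest's standard `-k` expression filter can be used; for example, to run only tests whose names mention chat:
```bash
cd server/python/openai/
pytest -v tests/ -k "chat"
```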

## TensorRT-LLM

0. Prepare your model repository for serving a TensorRT-LLM model:
https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#quick-start

1. Launch the container:
- Mounts `~/.cache/huggingface` so downloaded models are re-used across runs, containers, etc.
- Sets the [`HF_TOKEN`](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hftoken) environment variable to
access gated models; make sure this is set in your local environment if needed.

```bash
docker run -it --net=host --gpus all --rm \
-v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
-e HF_TOKEN \
nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3
```

2. Install dependencies inside the container:
```bash
# Install python bindings for tritonserver and tritonfrontend
pip install /opt/tritonserver/python/triton*.whl

# Install application/testing requirements
git clone https://github.com/triton-inference-server/server.git
cd server/python/openai/
pip install -r requirements.txt
```

3. Launch the OpenAI server:
```bash
# NOTE: Adjust the --tokenizer based on the model being used
python3 openai_frontend/main.py --model-repository tests/tensorrtllm_models --tokenizer meta-llama/Meta-Llama-3.1-8B-Instruct
```

4. Send a `/v1/chat/completions` request:
- Note: the use of `jq` is optional, but it provides nicely formatted output for JSON responses.
```bash
MODEL="tensorrt_llm_bls"
curl -s http://localhost:9000/v1/chat/completions -H 'Content-Type: application/json' -d '{
"model": "'${MODEL}'",
"messages": [{"role": "user", "content": "Say this is a test!"}]
}' | jq
```

The other examples are the same as for vLLM, except that you should set `MODEL="tensorrt_llm_bls"`
wherever applicable, as in the example request above.
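
For example, a `/v1/completions` request against the TensorRT-LLM model mirrors the vLLM example above, with only the model name changed:
```bash
MODEL="tensorrt_llm_bls"
curl -s http://localhost:9000/v1/completions -H 'Content-Type: application/json' -d '{
  "model": "'${MODEL}'",
  "prompt": "Machine learning is"
}' | jq
```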

## KServe Frontends

To serve requests through both the OpenAI-compatible and KServe Predict v2
frontends from the same running Triton Inference Server, the `tritonfrontend`
python bindings are included for optional use in this application as well.

You can opt in to these additional frontends, assuming `tritonfrontend` is
installed, by passing `--enable-kserve-frontends` as shown below:

```bash
python3 openai_frontend/main.py \
--model-repository tests/vllm_models \
--tokenizer meta-llama/Meta-Llama-3.1-8B-Instruct \
--enable-kserve-frontends
```
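
Once enabled, the KServe HTTP frontend can be sanity-checked with a readiness probe; a quick example, assuming the default HTTP port of 8000:
```bash
# Returns 200 when the server and models are ready to serve requests.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready
```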

See `python3 openai_frontend/main.py --help` for more information on the
available arguments and default values.

For more information on the `tritonfrontend` python bindings, see the docs
[here](https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/tritonfrontend.md).
25 changes: 25 additions & 0 deletions python/openai/openai_frontend/__init__.py
@@ -0,0 +1,25 @@
# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
94 changes: 94 additions & 0 deletions python/openai/openai_frontend/engine/engine.py
@@ -0,0 +1,94 @@
# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


from __future__ import annotations

from typing import Iterator, List, Protocol

from schemas.openai import (
CreateChatCompletionRequest,
CreateChatCompletionResponse,
CreateCompletionRequest,
CreateCompletionResponse,
Model,
)


class LLMEngine(Protocol):
"""
Interface for an OpenAI-aware inference engine to be attached to an
OpenAI-compatible frontend.

NOTE: This interface is subject to change, and may land on something more
generic rather than the current 1:1 with OpenAI endpoints over time.
"""

def ready(self) -> bool:
"""
Returns True if the engine is ready to accept inference requests, or False otherwise.
"""
pass

def metrics(self) -> str:
"""
Returns the engine's metrics in a Prometheus-compatible string format.
"""
pass

def models(self) -> List[Model]:
"""
Returns a List of OpenAI Model objects.
"""
pass

def chat(
self, request: CreateChatCompletionRequest
) -> CreateChatCompletionResponse | Iterator[str]:
"""
If request.stream is True, this returns an Iterator (or Generator) that
produces server-sent-event (SSE) strings in the following form:
'data: {CreateChatCompletionStreamResponse}\n\n'
...
'data: [DONE]\n\n'

If request.stream is False, this returns a CreateChatCompletionResponse.
"""
pass

def completion(
self, request: CreateCompletionRequest
) -> CreateCompletionResponse | Iterator[str]:
"""
If request.stream is True, this returns an Iterator (or Generator) that
produces server-sent-event (SSE) strings in the following form:
'data: {CreateCompletionResponse}\n\n'
...
'data: [DONE]\n\n'

If request.stream is False, this returns a CreateCompletionResponse.
"""
pass
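
As a usage note, here is a minimal sketch of how a caller might branch on `request.stream` when invoking an `LLMEngine` implementation. The helper is hypothetical and for illustration only; it assumes it sits alongside the `LLMEngine` definition so that class is in scope.
```python
from typing import Union

from schemas.openai import (
    CreateChatCompletionRequest,
    CreateChatCompletionResponse,
)


def collect_chat(
    engine: LLMEngine, request: CreateChatCompletionRequest
) -> Union[CreateChatCompletionResponse, str]:
    # Hypothetical helper: normalize the two return shapes described in the
    # LLMEngine.chat() docstring above.
    result = engine.chat(request)
    if request.stream:
        # Streaming: result is an iterator of SSE-formatted strings ending
        # with 'data: [DONE]\n\n'; concatenate them into one payload here.
        return "".join(result)
    # Non-streaming: result is already a complete CreateChatCompletionResponse.
    return result
```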