Ibm 20241008 roberta2 #198

Closed

Changes from all commits

Commits (136)
4bb98f2
[Misc] Update config loading for Qwen2-VL and remove Granite (#8837)
ywang96 Sep 26, 2024
f70bcca
[Build/CI] Upgrade to gcc 10 in the base build Docker image (#8814)
tlrmchlsmth Sep 26, 2024
520db4d
[Docs] Add README to the build docker image (#8825)
mgoin Sep 26, 2024
68988d4
[CI/Build] Fix missing ci dependencies (#8834)
fyuan1316 Sep 26, 2024
70de39f
[misc][installation] build from source without compilation (#8818)
youkaichao Sep 26, 2024
d9cfbc8
[ci] Soft fail Entrypoints, Samplers, LoRA, Decoder-only VLM (#8872)
khluu Sep 26, 2024
93d364d
[Bugfix] Include encoder prompts len to non-stream api usage response…
Pernekhan Sep 26, 2024
b28d210
[Misc] Change dummy profiling and BOS fallback warns to log once (#8820)
mgoin Sep 26, 2024
e2f6f26
[Bugfix] Fix print_warning_once's line info (#8867)
tlrmchlsmth Sep 26, 2024
ee2da3e
fix validation: Only set tool_choice `auto` if at least one tool is p…
chiragjn Sep 26, 2024
71d21c7
[Bugfix] Fixup advance_step.cu warning (#8815)
tlrmchlsmth Sep 26, 2024
4b377d6
[BugFix] Fix test breakages from transformers 4.45 upgrade (#8829)
njhill Sep 26, 2024
1b49148
[Installation] Allow lower versions of FastAPI to maintain Ray 2.9 co…
DarkLight1337 Sep 26, 2024
344cd2b
[Feature] Add support for Llama 3.1 and 3.2 tool use (#8343)
maxdebayser Sep 27, 2024
3b00b9c
[Core] rename`PromptInputs` and `inputs` (#8876)
DarkLight1337 Sep 27, 2024
dc4e3df
[misc] fix collect env (#8894)
youkaichao Sep 27, 2024
0e08875
[MISC] Fix invalid escape sequence '\' (#8830)
panpan0000 Sep 27, 2024
6d792d2
[Bugfix][VLM] Fix Fuyu batching inference with `max_num_seqs>1` (#8892)
Isotr0py Sep 27, 2024
8df2dc3
[TPU] Update pallas.py to support trillium (#8871)
bvrockwell Sep 27, 2024
a9b15c6
[torch.compile] use empty tensor instead of None for profiling (#8875)
youkaichao Sep 27, 2024
172d1cd
[Kernel] AQ AZP 4/4: Integrate asymmetric quantization to linear meth…
ProExpertProg Sep 27, 2024
c5d5535
[Bugfix] fix for deepseek w4a16 (#8906)
LucasWilkinson Sep 27, 2024
c2ec430
[Core] Multi-Step + Single Step Prefills via Chunked Prefill code pat…
varun-sundar-rabindranath Sep 27, 2024
18e60d7
[misc][distributed] add VLLM_SKIP_P2P_CHECK flag (#8911)
youkaichao Sep 27, 2024
bd429f2
[Core] Priority-based scheduling in async engine (#8850)
schoennenbeck Sep 27, 2024
d86f6b2
[misc] fix wheel name (#8919)
youkaichao Sep 28, 2024
260024a
[Bugfix][Intel] Fix XPU Dockerfile Build (#7824)
tylertitsworth Sep 28, 2024
b0298aa
[Misc] Remove vLLM patch of `BaichuanTokenizer` (#8921)
DarkLight1337 Sep 28, 2024
39d3f8d
[Bugfix] Fix code for downloading models from modelscope (#8443)
tastelikefeet Sep 28, 2024
19d02ff
[Bugfix] Fix PP for Multi-Step (#8887)
varun-sundar-rabindranath Sep 28, 2024
e1a3f5e
[CI/Build] Update models tests & examples (#8874)
DarkLight1337 Sep 28, 2024
090e945
[Frontend] Make beam search emulator temperature modifiable (#8928)
nFunctor Sep 28, 2024
e585b58
[Bugfix] Support testing prefill throughput with benchmark_serving.py…
heheda12345 Sep 28, 2024
cc27644
[doc] organize installation doc and expose per-commit docker (#8931)
youkaichao Sep 29, 2024
d153703
[Core] Improve choice of Python multiprocessing method (#8823)
russellb Sep 29, 2024
5bf8789
[Bugfix] Block manager v2 with preemption and lookahead slots (#8824)
sroy745 Sep 29, 2024
d081da0
[Bugfix] Fix Marlin MoE act order when is_k_full == False (#8741)
ElizaWszola Sep 29, 2024
26a68d5
[CI/Build] Add test decorator for minimum GPU memory (#8925)
DarkLight1337 Sep 29, 2024
2e7fe7e
[Build/CI] Set FETCHCONTENT_BASE_DIR to one location for better cachi…
tlrmchlsmth Sep 29, 2024
bc2ef1f
[Model] Support Qwen2.5-Math-RM-72B (#8896)
zhuzilin Sep 29, 2024
3d49776
[Model][LoRA]LoRA support added for MiniCPMV2.5 (#7199)
jeejeelee Sep 29, 2024
31f46a0
[BugFix] Fix seeded random sampling with encoder-decoder models (#8870)
njhill Sep 29, 2024
1fb9c1b
[Misc] Fix typo in BlockSpaceManagerV1 (#8944)
juncheoll Sep 29, 2024
6c9ba48
[Frontend] Added support for HF's new `continue_final_message` parame…
danieljannai21 Sep 29, 2024
f13a07b
[Kernel][Model] Varlen prefill + Prefill chunking support for mamba k…
mzusman Sep 29, 2024
e01ab59
[Model] support input embeddings for qwen2vl (#8856)
whyiug Sep 30, 2024
b6d7392
[Misc][CI/Build] Include `cv2` via `mistral_common[opencv]` (#8951)
ywang96 Sep 30, 2024
8e60afa
[Model][LoRA]LoRA support added for MiniCPMV2.6 (#8943)
jeejeelee Sep 30, 2024
2ae25f7
[Model] Expose InternVL2 max_dynamic_patch as a mm_processor_kwarg (#…
Isotr0py Sep 30, 2024
be76e5a
[Core] Make scheduling policy settable via EngineArgs (#8956)
schoennenbeck Sep 30, 2024
1cabfce
[Misc] Adjust max_position_embeddings for LoRA compatibility (#8957)
jeejeelee Sep 30, 2024
1425a1b
[ci] Add CODEOWNERS for test directories (#8795)
khluu Oct 1, 2024
bce3244
[CI][SpecDecode] Fix spec decode tests, use flash attention backend f…
LiuXiaoxuanPKU Oct 1, 2024
062c89e
[Frontend][Core] Move guided decoding params into sampling params (#8…
joerunde Oct 1, 2024
aaccca2
[CI/Build] Fix machete generated kernel files ordering (#8976)
khluu Oct 1, 2024
7da2487
[torch.compile] fix tensor alias (#8982)
youkaichao Oct 1, 2024
82f3937
[Misc] add process_weights_after_loading for DummyLoader (#8969)
divakar-amd Oct 1, 2024
bc4eb65
[Bugfix] Fix Fuyu tensor parallel inference (#8986)
Isotr0py Oct 1, 2024
1fe0a42
[Bugfix] Fix Token IDs Reference for MiniCPM-V When Images are Provid…
alex-jw-brooks Oct 1, 2024
35bd215
[Core] [Frontend] Priority scheduling for embeddings and in the OpenA…
schoennenbeck Oct 1, 2024
4f341bd
[Doc] Update list of supported models (#8987)
DarkLight1337 Oct 1, 2024
22f5851
Update benchmark_serving.py to read and write json-datasets, results …
vlsav Oct 1, 2024
1570203
[Spec Decode] (1/2) Remove batch expansion (#8839)
LiuXiaoxuanPKU Oct 1, 2024
563649a
[Core] Combined support for multi-step scheduling, chunked prefill & …
afeldman-nm Oct 2, 2024
7f60520
[Misc] Update Default Image Mapper Error Log (#8977)
alex-jw-brooks Oct 2, 2024
afb050b
[Core] CUDA Graphs for Multi-Step + Chunked-Prefill (#8645)
varun-sundar-rabindranath Oct 2, 2024
f58d4fc
[OpenVINO] Enable GPU support for OpenVINO vLLM backend (#8192)
sshlyapn Oct 2, 2024
19f0d25
[Model] Adding Granite MoE. (#8206)
shawntan Oct 3, 2024
18c2e30
[Doc] Update Granite model docs (#9025)
njhill Oct 3, 2024
19a4dd0
[Bugfix] example template should not add parallel_tool_prompt if tool…
tjohnson31415 Oct 3, 2024
01843c8
[Misc] log when using default MoE config (#8971)
divakar-amd Oct 3, 2024
83caf35
[BugFix] Enforce Mistral ToolCall id constraint when using the Mistra…
gcalmettes Oct 3, 2024
f5d72b2
[Core] Make BlockSpaceManagerV2 the default BlockManager to use. (#8678)
sroy745 Oct 3, 2024
63e3993
[Frontend] [Neuron] Parse literals out of override-neuron-config (#8959)
xendo Oct 3, 2024
9aaf14c
[misc] add forward context for attention (#9029)
youkaichao Oct 3, 2024
91add85
Fix failing spec decode test (#9054)
sroy745 Oct 3, 2024
2838d6b
[Bugfix] Weight loading fix for OPT model (#9042)
domenVres Oct 3, 2024
3dbb215
[Frontend][Feature] support tool calling for internlm/internlm2_5-7b-…
sydnash Oct 4, 2024
aeb37c2
[CI/Build] Per file CUDA Archs (improve wheel size and dev build time…
LucasWilkinson Oct 4, 2024
303d447
[Misc] Enable multi-step output streaming by default (#9047)
mgoin Oct 4, 2024
0f6d7a9
[Models] Add remaining model PP support (#7168)
andoorve Oct 4, 2024
0e36fd4
[Misc] Move registry to its own file (#9064)
DarkLight1337 Oct 4, 2024
3d826d2
[Bugfix] Reshape the dimensions of the input image embeddings in Qwen…
whyiug Oct 4, 2024
22482e4
[Bugfix] Flash attention arches not getting set properly (#9062)
LucasWilkinson Oct 4, 2024
9ade8bb
[Model] add a bunch of supported lora modules for mixtral (#9008)
prashantgupta24 Oct 4, 2024
36eecfb
Remove AMD Ray Summit Banner (#9075)
simon-mo Oct 4, 2024
e5dc713
[Hardware][PowerPC] Make oneDNN dependency optional for Power (#9039)
varad-ahirwadkar Oct 4, 2024
26aa325
[Core][VLM] Test registration for OOT multimodal models (#8717)
ywang96 Oct 4, 2024
0dcc8cb
Adds truncate_prompt_tokens param for embeddings creation (#8999)
flaviabeo Oct 4, 2024
05d6864
[Kernel] Zero point support in fused MarlinMoE kernel + AWQ Fused MoE…
ElizaWszola Oct 4, 2024
fbb7442
[CI] Update performance benchmark: upgrade trt-llm to r24.07, and add…
KuntaiDu Oct 4, 2024
05c531b
[Misc] Improved prefix cache example (#9077)
Imss27 Oct 4, 2024
0cc566c
[Misc] Add random seed for prefix cache benchmark (#9081)
Imss27 Oct 4, 2024
27302dd
[Misc] Fix CI lint (#9085)
comaniac Oct 4, 2024
cc90419
[Hardware][Neuron] Add on-device sampling support for Neuron (#8746)
chongmni-aws Oct 4, 2024
663874e
[torch.compile] improve allreduce registration (#9061)
youkaichao Oct 4, 2024
a95354a
[Doc] Update README.md with Ray summit slides (#9088)
zhuohan123 Oct 5, 2024
dac914b
[Bugfix] use blockmanagerv1 for encoder-decoder (#9084)
heheda12345 Oct 5, 2024
53b3a33
[Bugfix] Fixes Phi3v & Ultravox Multimodal EmbeddingInputs (#8979)
hhzhang16 Oct 5, 2024
15986f5
[Model] Support Gemma2 embedding model (#9004)
xyang16 Oct 5, 2024
cfadb9c
[Bugfix] Deprecate registration of custom configs to huggingface (#9083)
heheda12345 Oct 5, 2024
5df1834
[Bugfix] Fix order of arguments matters in config.yaml (#8960)
Imss27 Oct 5, 2024
f4dd830
[core] use forward context for flash infer (#9097)
youkaichao Oct 6, 2024
23fea87
[Bugfix] Fix try-catch conditions to import correct Flash Attention B…
tjtanaa Oct 6, 2024
168cab6
[Frontend] API support for beam search (#9087)
LunrEclipse Oct 6, 2024
f22619f
[Misc] Remove user-facing error for removed VLM args (#9104)
DarkLight1337 Oct 6, 2024
b22b798
[Model] PP support for embedding models and update docs (#9090)
DarkLight1337 Oct 6, 2024
fdf59d3
[Bugfix] fix tool_parser error handling when serve a model not suppor…
liuyanyi Oct 6, 2024
cb3b2b9
[Bugfix] Fix incorrect updates to num_computed_tokens in multi-step s…
varun-sundar-rabindranath Oct 6, 2024
487678d
[Bugfix][Hardware][CPU] Fix CPU model input for decode (#9044)
Isotr0py Oct 7, 2024
c8f26bb
[BugFix][Core] Fix BlockManagerV2 when Encoder Input is None (#9103)
sroy745 Oct 7, 2024
18b296f
[core] remove beam search from the core (#9105)
youkaichao Oct 7, 2024
8c6de96
[Model] Explicit interface for vLLM models and support OOT embedding …
DarkLight1337 Oct 7, 2024
4f95ffe
[Hardware][CPU] Cross-attention and Encoder-Decoder models support on…
Isotr0py Oct 7, 2024
f19da64
[Core] Refactor GGUF parameters packing and forwarding (#8859)
Isotr0py Oct 7, 2024
151ef4e
[Model] Support NVLM-D and fix QK Norm in InternViT (#9045)
DarkLight1337 Oct 7, 2024
93cf74a
[Doc]: Add deploying_with_k8s guide (#8451)
haitwang-cloud Oct 7, 2024
e0dbdb0
[CI/Build] Add linting for github actions workflows (#7876)
russellb Oct 7, 2024
c0d9a98
[Doc] Include performance benchmark in README (#9135)
KuntaiDu Oct 7, 2024
fa45513
[misc] fix comment and variable name (#9139)
youkaichao Oct 7, 2024
8eeb857
Add Slack to README (#9137)
simon-mo Oct 8, 2024
04c12f8
[misc] update utils to support comparing multiple settings (#9140)
youkaichao Oct 8, 2024
80b57f0
[Intel GPU] Fix xpu decode input (#9145)
jikunshang Oct 8, 2024
e1faa2a
[misc] improve ux on readme (#9147)
youkaichao Oct 8, 2024
e7db59c
Merge branch 'main' of https://github.com/vllm-project/vllm into ibm-…
fialhocoelho Oct 8, 2024
573d075
Squash 5733
fialhocoelho Oct 8, 2024
61f6474
Squash 6357
fialhocoelho Oct 8, 2024
6d27ed2
Squash 9034
fialhocoelho Oct 8, 2024
22a7fe8
Squash 9049
fialhocoelho Oct 8, 2024
bf8438e
Squash 9027
fialhocoelho Oct 8, 2024
6b500af
install adapter from @main to use adapter/#137
fialhocoelho Oct 8, 2024
baeec70
Hack up de build to use as base-image :rocket:
fialhocoelho Oct 8, 2024
1297cc8
:bug: Reverted logic to fix build; potential GGUF-related issues.
fialhocoelho Oct 8, 2024
bc59bd3
Revert "Hack up de build to use as base-image :rocket:"
fialhocoelho Oct 8, 2024
cf0b2dd
Add fix to identify encoder-only models
maxdebayser Oct 14, 2024
3c5605e
Merge branch 'odh_main' into ibm-20241008-roberta2
maxdebayser Oct 14, 2024
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Asym-Per-Token-Test -b "auto" -l 250 -f 5 -t 1
model_name: "nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Asym-Per-Token-Test"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.764
- name: "exact_match,flexible-extract"
value: 0.764
limit: 250
num_fewshot: 5
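
The config above pairs expected lm-eval scores with a model; the test later in this diff (`test_lm_eval_correctness.py`) consumes configs of this shape. Below is a minimal sketch of that consumption, assuming a hypothetical file name and a hand-written `results` dict in place of a real lm-eval run:

```python
# Sketch only: the file name, tolerance, and hard-coded `results` dict are
# assumptions for illustration, not part of this PR.
import yaml
import numpy

RTOL = 0.05  # illustrative tolerance; the real test module defines its own

with open("eval_config.yaml") as f:  # hypothetical path
    eval_config = yaml.safe_load(f)

# Stand-in for the output of an actual lm-eval run.
results = {"results": {"gsm8k": {"exact_match,strict-match": 0.76,
                                 "exact_match,flexible-extract": 0.77}}}

success = True
for task in eval_config["tasks"]:
    for metric in task["metrics"]:
        measured = results["results"][task["name"]][metric["name"]]
        success = success and numpy.isclose(metric["value"], measured, rtol=RTOL)

assert success  # fail once, after all scores have been compared
```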
1 change: 1 addition & 0 deletions .buildkite/lm-eval-harness/configs/models-small.txt
@@ -1,6 +1,7 @@
Meta-Llama-3-8B-Instruct.yaml
Meta-Llama-3-8B-Instruct-FP8-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-INT8-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-INT8-compressed-tensors-asym.yaml
Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-Channelwise-compressed-tensors.yaml
Minitron-4B-Base-FP8.yaml
2 changes: 1 addition & 1 deletion .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh
@@ -2,7 +2,7 @@
# We can use this script to compute baseline accuracy on GSM for transformers.
#
# Make sure you have lm-eval-harness installed:
# pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@9516087b81a61d0e220b22cc1b75be76de23bc10
# pip install lm-eval==0.4.4

usage() {
echo``
@@ -3,7 +3,7 @@
# We use this for fp8, which HF does not support.
#
# Make sure you have lm-eval-harness installed:
# pip install lm-eval==0.4.3
# pip install lm-eval==0.4.4

usage() {
echo``
7 changes: 6 additions & 1 deletion .buildkite/lm-eval-harness/test_lm_eval_correctness.py
@@ -49,10 +49,15 @@ def test_lm_eval_correctness():
    results = launch_lm_eval(eval_config)

    # Confirm scores match ground truth.
    success = True
    for task in eval_config["tasks"]:
        for metric in task["metrics"]:
            ground_truth = metric["value"]
            measured_value = results["results"][task["name"]][metric["name"]]
            print(f'{task["name"]} | {metric["name"]}: '
                  f'ground_truth={ground_truth} | measured={measured_value}')
            assert numpy.isclose(ground_truth, measured_value, rtol=RTOL)
            success = success and numpy.isclose(
                ground_truth, measured_value, rtol=RTOL)

    # Assert at the end, print all scores even on failure for debugging.
    assert success
28 changes: 28 additions & 0 deletions .buildkite/nightly-benchmarks/nightly-annotation.md
@@ -0,0 +1,28 @@

## Description

This file contains the downloading link for benchmarking results.

- [benchmarking pipeline](artifact://nightly-pipeline.yaml)
- [benchmarking results](artifact://results.zip)
- [benchmarking code](artifact://nightly-benchmarks.zip)

Please download the visualization scripts in the post


## Results reproduction

- Find the docker we use in `benchmarking pipeline`
- Deploy the docker, and inside the docker:
- Download `nightly-benchmarks.zip`.
- In the same folder, run the following code
```
export HF_TOKEN=<your HF token>
apt update
apt install -y git
unzip nightly-benchmarks.zip
VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
```

And the results will be inside `./benchmarks/results`.

78 changes: 36 additions & 42 deletions .buildkite/nightly-benchmarks/nightly-descriptions.md
@@ -1,45 +1,39 @@

# Nightly benchmark

The main goal of this benchmarking is two-fold:
- Performance clarity: Provide clarity on which one (vllm, tensorrt-llm, lmdeploy and tgi) leads in performance in what workload.
- Reproducible: one can run the exact same set of benchmarking commands inside the exact same docker by following reproducing instructions in [reproduce.md]().


## Docker images

We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following docker images:
- vllm/vllm-openai:v0.5.0.post1
- nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
- openmmlab/lmdeploy:v0.5.0
- ghcr.io/huggingface/text-generation-inference:2.1

<!-- Please check <a href="artifact://workspace/build/buildkite/vllm/performance-benchmark/.buildkite/nightly-benchmarks/nightly-pipeline.yaml">nightly-pipeline.yaml</a> artifact for more details on how we deploy the docker images. -->


## Hardware

One AWS node with 8x NVIDIA A100 GPUs.


## Workload description

We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following workload:

- Input length: randomly sample 500 prompts from ShareGPT dataset (with fixed random seed).
- Output length: the corresponding output length of these 500 prompts.
- Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
- Average QPS (query per second): 4 for the small model (llama-3 8B) and 2 for other two models. For each QPS, the arrival time of each query is determined using a random Poisson process (with fixed random seed).
- Evaluation metrics: Throughput (higher the better), TTFT (time to the first token, lower the better), ITL (inter-token latency, lower the better).

<!-- Check <a href="artifact://workspace/build/buildkite/vllm/performance-benchmark/.buildkite/nightly-benchmarks/tests/nightly-tests.json">nightly-tests.json</a> artifact for more details. -->

## Plots

In the following plots, the dot shows the mean and the error bar shows the standard error of the mean. Value 0 means that the corresponding benchmark crashed.

<img src="artifact://nightly_results.png" alt="Benchmarking results" height=250 >

## Results

{nightly_results_benchmarking_table}
This benchmark aims to:
- Provide performance clarity: Provide clarity on which one (vllm, tensorrt-llm, lmdeploy and SGLang) leads in performance in what workload.
- Be reproducible: one can run the exact same set of benchmarking commands inside the exact same docker by following reproducing instructions.

Latest results: [results link](https://blog.vllm.ai/2024/09/05/perf-update.html), scroll to the end.

Latest reproduction guilde: [github issue link](https://github.com/vllm-project/vllm/issues/8176)


## Setup

- Docker images:
- vLLM: `vllm/vllm-openai:v0.6.2`
- SGLang: `lmsysorg/sglang:v0.3.2-cu121`
- LMDeploy: `openmmlab/lmdeploy:v0.6.1-cu12`
- TensorRT-LLM: `nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3`
- *NOTE: we uses r24.07 as the current implementation only works for this version. We are going to bump this up.*
- Check [nightly-pipeline.yaml](nightly-pipeline.yaml) for the concrete docker images, specs and commands we use for the benchmark.
- Hardware
- 8x Nvidia A100 GPUs
- Workload:
- Dataset
- ShareGPT dataset
- Prefill-heavy dataset (in average 462 input tokens, 16 tokens as output)
- Decode-heavy dataset (in average 462 input tokens, 256 output tokens)
- Check [nightly-tests.json](tests/nightly-tests.json) for the concrete configuration of datasets we use.
- Models: llama-3 8B, llama-3 70B.
- We do not use llama 3.1 as it is incompatible with trt-llm r24.07. ([issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105)).
- Average QPS (query per second): 2, 4, 8, 16, 32 and inf.
- Queries are randomly sampled, and arrival patterns are determined via Poisson process, but all with fixed random seed.
- Evaluation metrics: Throughput (higher the better), TTFT (time to the first token, lower the better), ITL (inter-token latency, lower the better).

# Known issues

- TRT-LLM crashes with Llama 3.1 8B [issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105).
- TGI does not support `ignore-eos` flag.
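
The workload section above fixes an average QPS and draws request arrival times from a Poisson process with a fixed random seed. A short sketch of that scheduling idea (the QPS, request count, and seed below are illustrative, not the benchmark's actual settings):

```python
import numpy as np

def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0) -> np.ndarray:
    """Arrival times (seconds) for a Poisson process with rate `qps`."""
    rng = np.random.default_rng(seed)  # fixed seed -> reproducible schedule
    inter_arrivals = rng.exponential(1.0 / qps, size=num_requests)
    return np.cumsum(inter_arrivals)

# e.g. 500 requests at an average of 4 queries per second
print(poisson_arrival_times(500, qps=4.0, seed=42)[:5])
```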
98 changes: 87 additions & 11 deletions .buildkite/nightly-benchmarks/nightly-pipeline.yaml
@@ -13,7 +13,7 @@ common_pod_spec: &common_pod_spec

common_container_settings: &common_container_settings
command:
- bash .buildkite/nightly-benchmarks/run-nightly-suite.sh
- bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
resources:
limits:
nvidia.com/gpu: 8
@@ -37,7 +37,10 @@ common_container_settings: &common_container_settings

steps:
- block: ":rocket: Ready for comparing vllm against alternatives? This will take 4 hours."
- label: "A100 trt benchmark"



- label: "A100 vllm step 10"
priority: 100
agents:
queue: A100
@@ -46,7 +49,21 @@
podSpec:
<<: *common_pod_spec
containers:
- image: nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
- image: vllm/vllm-openai:v0.6.2
<<: *common_container_settings



- label: "A100 sglang benchmark"
priority: 100
agents:
queue: A100
plugins:
- kubernetes:
podSpec:
<<: *common_pod_spec
containers:
- image: lmsysorg/sglang:v0.3.2-cu121
<<: *common_container_settings

- label: "A100 lmdeploy benchmark"
@@ -58,11 +75,13 @@
podSpec:
<<: *common_pod_spec
containers:
- image: openmmlab/lmdeploy:v0.5.0
- image: openmmlab/lmdeploy:v0.6.1-cu12
<<: *common_container_settings


- label: "A100 vllm benchmark"



- label: "A100 trt llama-8B"
priority: 100
agents:
queue: A100
@@ -71,10 +90,25 @@
podSpec:
<<: *common_pod_spec
containers:
- image: vllm/vllm-openai:latest
- image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
<<: *common_container_settings
env:
- name: VLLM_USAGE_SOURCE
value: ci-test
- name: HF_HOME
value: /root/.cache/huggingface
- name: VLLM_SOURCE_CODE_LOC
value: /workspace/build/buildkite/vllm/performance-benchmark
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-token-secret
key: token
- name: TEST_SELECTOR
value: "llama8B"

- label: "A100 tgi benchmark"

- label: "A100 trt llama-70B"
priority: 100
agents:
queue: A100
@@ -83,12 +117,54 @@
podSpec:
<<: *common_pod_spec
containers:
- image: ghcr.io/huggingface/text-generation-inference:2.1
- image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
<<: *common_container_settings
env:
- name: VLLM_USAGE_SOURCE
value: ci-test
- name: HF_HOME
value: /root/.cache/huggingface
- name: VLLM_SOURCE_CODE_LOC
value: /workspace/build/buildkite/vllm/performance-benchmark
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-token-secret
key: token
- name: TEST_SELECTOR
value: "llama70B"


# FIXME(Kuntai): uncomment this after NVIDIA gives us their test docker image
# - label: "A100 trt benchmark"
# priority: 100
# agents:
# queue: A100
# plugins:
# - kubernetes:
# podSpec:
# <<: *common_pod_spec
# containers:
# - image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
# <<: *common_container_settings


# FIXME(Kuntai): uncomment this after TGI supports `--ignore-eos`.
# - label: "A100 tgi benchmark"
# priority: 100
# agents:
# queue: A100
# plugins:
# - kubernetes:
# podSpec:
# <<: *common_pod_spec
# containers:
# - image: ghcr.io/huggingface/text-generation-inference:2.2.0
# <<: *common_container_settings

- wait

- label: "Plot"
- label: "Collect the results"
priority: 100
agents:
queue: A100
@@ -117,4 +193,4 @@
name: hf-token-secret
key: token

- wait
- block: ":rocket: check the results!"
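
The pipeline diff above relies on YAML anchors and merge keys (`&common_pod_spec`, `<<: *common_container_settings`) so every benchmark step shares the same pod spec and container settings. A self-contained sketch of how such a merge expands when loaded; the values here are illustrative, not the pipeline's real ones:

```python
import yaml

doc = """
common_container_settings: &common_container_settings
  command:
    - bash run-benchmarks.sh
  resources:
    limits:
      nvidia.com/gpu: 8

steps:
  - label: "A100 vllm benchmark"
    containers:
      - image: vllm/vllm-openai:v0.6.2
        <<: *common_container_settings
"""

data = yaml.safe_load(doc)
container = data["steps"][0]["containers"][0]
# The merge key copied `command` and `resources` from the anchored mapping.
print(container["resources"]["limits"]["nvidia.com/gpu"])  # -> 8
```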
76 changes: 0 additions & 76 deletions .buildkite/nightly-benchmarks/run-nightly-suite.sh

This file was deleted.
