Catch errors and improve logging in Profiler #23

Merged: 3 commits merged into main from hwoo-profiler-logging on Jan 14, 2024

Conversation

@nv-hwoo (Contributor) commented on Jan 12, 2024

The change includes:

  • catching errors thrown by Perf Analyzer (PA)
  • sanity checking profile results to make sure the expected number of responses is generated
  • using a logger instead of print statements
  • adding a verbose mode

The sanity check prevents potential miscalculations that can arise when the user asks the model to generate a sequence longer than the model's context length.
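
For context, here is a minimal, self-contained sketch of the sanity check and logger usage described above. The field names (export_data["experiments"], response_timestamps) and the error text are taken from the diff hunks and log output later on this page; the function name check_profile_results and the call site are illustrative assumptions, not the merged code.

import logging

logging.basicConfig(format="%(name)s - %(levelname)s - %(message)s")
logger = logging.getLogger("triton")


def check_profile_results(export_data: dict, max_tokens: int) -> None:
    """Sanity check: every request should have produced the expected number of responses."""
    for request in export_data["experiments"][0]["requests"]:
        num_responses = len(request["response_timestamps"])
        if num_responses != max_tokens:
            raise ValueError(
                f"Expecting {max_tokens} tokens but received {num_responses} tokens. "
                "This could be due to a long prompt size. "
                "Please double check the input and output length."
            )


# Illustrative call site: surface failures through the logger instead of a raw traceback.
export_data = {"experiments": [{"requests": [{"response_timestamps": [0]}]}]}
try:
    check_profile_results(export_data, max_tokens=128)
except Exception as e:
    logger.error(f"Unexpected error: {e}")  # matches the "Unexpected error: ..." lines below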

Non-verbose (no error):

root@nv-ubuntu:/triton_cli# triton bench -m gpt2 --input-length 800 --output-length 128
triton - INFO - Clearing all contents from /root/models...
triton - INFO - Known model source found for 'gpt2': 'hf:gpt2'
triton - INFO - Starting a Triton Server locally with model repository: /root/models
triton - INFO - Starting a Triton Server via docker image 'nvcr.io/nvidia/tritonserver:23.12-vllm-python-py3' with model repository: /root/models
triton - INFO - Server is ready for inference.
triton - INFO - Running Perf Analyzer profiler on 'gpt2'...
triton - INFO - Warming up...
triton - INFO - Warmed up, profiling now...
[ PROFILE CONFIGURATIONS ]
 * Model: gpt2
 * Backend: vllm
 * Batch size: 1
 * Input tokens: 800
 * Output tokens: 128

[ BENCHMARK SUMMARY ]
 * Avg first token latency: 27.3099 ms
 * Avg generation throughput: 186.8861 output tokens/s

triton - INFO - Stopping server...

Non-verbose (with PA error):

root@nv-ubuntu:/triton_cli# triton bench -m gpt2 --input-length 800 --output-length 128
triton - INFO - Clearing all contents from /root/models...
triton - INFO - Known model source found for 'gpt2': 'hf:gpt2'
triton - INFO - Starting a Triton Server locally with model repository: /root/models
triton - INFO - Starting a Triton Server via docker image 'nvcr.io/nvidia/tritonserver:23.12-vllm-python-py3' with model repository: /root/models
triton - INFO - Server is ready for inference.
triton - INFO - Running Perf Analyzer profiler on 'gpt2'...
triton - INFO - Warming up...
triton - ERROR - Unexpected error: Encountered the following error while running Perf Analyzer:
Failed to retrieve results from inference request.
Thread [0] had error: Error generating stream: 'ascii' codec can't encode character '\xab' in position 77: ordinal not in range(128)
triton - INFO - Stopping server...

Non-verbose (with sanity check error):

root@nv-ubuntu:/triton_cli# triton bench -m gpt2 --input-length 1024 --output-length 128
triton - INFO - Clearing all contents from /root/models...
triton - INFO - Known model source found for 'gpt2': 'hf:gpt2'
triton - INFO - Starting a Triton Server locally with model repository: /root/models
triton - INFO - Starting a Triton Server via docker image 'nvcr.io/nvidia/tritonserver:23.12-vllm-python-py3' with model repository: /root/models
triton - INFO - Server is ready for inference.
triton - INFO - Running Perf Analyzer profiler on 'gpt2'...
triton - INFO - Warming up...
triton - INFO - Warmed up, profiling now...
[ PROFILE CONFIGURATIONS ]
 * Model: gpt2
 * Backend: vllm
 * Batch size: 1
 * Input tokens: 1024
 * Output tokens: 128

triton - ERROR - Unexpected error: Expecting 128 tokens but received 1 tokens. This could be due to a long prompt size. Please double check the input and output length.
triton - INFO - Stopping server...

Verbose (no error):

root@nv-ubuntu:/triton_cli# triton bench -m gpt2 --input-length 1024 --output-length 128 --verbose
triton - DEBUG - Using existing model repository: /root/models
triton - INFO - Clearing all contents from /root/models...
triton - INFO - Known model source found for 'gpt2': 'hf:gpt2'
triton - DEBUG - HuggingFace prefix detected, parsing HuggingFace ID
triton - DEBUG - Adding new model to repo at: /root/models/gpt2/1
triton - INFO - Current repo at /root/models:
models/
└── gpt2/
    ├── 1/
    │   └── model.json
    └── config.pbtxt
triton - DEBUG - No --mode specified, trying the following modes: ['local', 'docker']
triton - INFO - Starting a Triton Server locally with model repository: /root/models
triton - INFO - Starting a Triton Server via docker image 'nvcr.io/nvidia/tritonserver:23.12-vllm-python-py3' with model repository: /root/models
triton - DEBUG - Failed to start server in 'docker' mode. Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))
triton - INFO - Server is ready for inference.
triton - INFO - Running Perf Analyzer profiler on 'gpt2'...
triton - INFO - Warming up...
triton - INFO - Perf Analyzer output:
 Successfully read data for 1 stream/streams with 1 step/steps.
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "count_windows" mode for stabilization
  Minimum number of samples in each window: 10
  Using asynchronous calls for inference
  Detected decoupled model, using the first response for measuring latency
  Stabilizing using average latency

Request concurrency: 1
  Client:
    Request count: 71
    Throughput: 23.6401 infer/sec
    Response Throughput: 23.6401 infer/sec
    Avg latency: 41843 usec (standard deviation 3879 usec)
    p50 latency: 42018 usec
    p90 latency: 47008 usec
    p95 latency: 48011 usec
    p99 latency: 48873 usec

  Server:
    Inference count: 72
    Execution count: 72
    Successful request count: 72
    Avg request latency: 510 usec (overhead 4 usec + queue 71 usec + compute input 42 usec + compute infer 387 usec + compute output 6 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 23.6401 infer/sec, latency 41843 usec

triton - INFO - Warmed up, profiling now...
[ PROFILE CONFIGURATIONS ]
 * Model: gpt2
 * Backend: vllm
 * Batch size: 1
 * Input tokens: 1024
 * Output tokens: 128

triton - INFO - Perf Analyzer output:
 Successfully read data for 1 stream/streams with 1 step/steps.
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "count_windows" mode for stabilization
  Minimum number of samples in each window: 10
  Using asynchronous calls for inference
  Detected decoupled model, using the first response for measuring latency
  Stabilizing using average latency

Request concurrency: 1
  Client:
    Request count: 96
    Throughput: 31.9703 infer/sec
    Response Throughput: 31.9703 infer/sec
    Avg latency: 31033 usec (standard deviation 11848 usec)
    p50 latency: 35807 usec
    p90 latency: 44262 usec
    p95 latency: 45771 usec
    p99 latency: 48866 usec

  Server:
    Inference count: 97
    Execution count: 97
    Successful request count: 97
    Avg request latency: 473 usec (overhead 3 usec + queue 60 usec + compute input 40 usec + compute infer 364 usec + compute output 5 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 31.9703 infer/sec, latency 31033 usec

triton - ERROR - Unexpected error: Expecting 128 tokens but received 1 tokens. This could be due to a long prompt size. Please double check the input and output length.
triton - INFO - Stopping server...
triton - DEBUG - Stopped Triton Server.
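
In the verbose example, the triton logger additionally emits DEBUG-level lines (and the full Perf Analyzer output is echoed at INFO level). Below is a minimal sketch of how a --verbose flag could be wired to the logging level; the flag name and logger name are taken from the examples above, while the rest is an assumption about the implementation, not the actual CLI code.

import argparse
import logging

parser = argparse.ArgumentParser()
parser.add_argument("--verbose", action="store_true", help="Enable DEBUG-level logging")
args = parser.parse_args()

logging.basicConfig(format="%(name)s - %(levelname)s - %(message)s")
logger = logging.getLogger("triton")
logger.setLevel(logging.DEBUG if args.verbose else logging.INFO)

logger.info("Server is ready for inference.")                   # shown in both modes
logger.debug("Using existing model repository: /root/models")   # shown only with --verbose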

if args.ignore_eos:
    requests = export_data["experiments"][0]["requests"]
    for request in requests:
        if len(request["response_timestamps"]) != args.max_tokens:
@nv-hwoo (Contributor Author) commented:

Must merge after triton-inference-server/vllm_backend#28 in order for this line to be valid.

@rmccorm4 (Collaborator) commented on Jan 12, 2024:

Are we receiving back any output data for the responses from PA? If so, we could check if the bonus response is empty for validation.

Otherwise, to smooth the transition, can we do something like this for now?

if len(request["response_timestamps"]) == args.max_tokens:
    # Assume FINAL flag is returned with final token response
    pass
elif len(request["response_timestamps"]) == args.max_tokens + 1:
    # Assume FINAL flag was returned with an empty response after the final token
    logger.warning('Received an extra response from the backend. This may be due to the backend sending an "empty final response".')
else:
    raise ValueError(...)

@nv-hwoo (Contributor Author) replied:

Unfortunately, the current PA output only contains the timestamps of the request responses. Updated to add an additional check/warning for the empty final response.
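
Putting the thread together, the check converges on a three-way split. A self-contained sketch of that shape is below; the max_tokens thresholds, the comments, and the warning text come from the suggestion above, while the function name validate_response_count is hypothetical and this is not the exact merged code.

import logging

logger = logging.getLogger("triton")


def validate_response_count(num_responses: int, max_tokens: int) -> None:
    if num_responses == max_tokens:
        # Assume FINAL flag is returned with final token response
        pass
    elif num_responses == max_tokens + 1:
        # Assume FINAL flag was returned with an empty response after the final token
        logger.warning(
            "Received an extra response from the backend. This may be due to "
            'the backend sending an "empty final response".'
        )
    else:
        raise ValueError(
            f"Expecting {max_tokens} tokens but received {num_responses} tokens."
        )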

@nv-hwoo requested a review from rmccorm4 on January 12, 2024 at 22:18
@rmccorm4 (Collaborator) left a review comment:

Great changes! Only feedback is on the vllm / final response comment.

@nv-hwoo requested a review from rmccorm4 on January 12, 2024 at 23:37
if len(request["response_timestamps"]) == args.max_tokens:
    pass
elif len(request["response_timestamps"]) == args.max_tokens + 1:
    logger.warning(
@rmccorm4 (Collaborator) commented:

Can you include the comments I had in the other comment in-line here? The # Assume ... ones

@rmccorm4 (Collaborator) added:

This will help me remember the reasoning for the check/split later on

@nv-hwoo (Contributor Author) replied:

Added comments

@rmccorm4 merged commit bd76c71 into main on Jan 14, 2024
3 checks passed
@rmccorm4 deleted the hwoo-profiler-logging branch on January 14, 2024 at 19:13