Catch errors and improve logging in Profiler #23

Merged: 3 commits merged into main from hwoo-profiler-logging on Jan 14, 2024

Conversation

@nv-hwoo (Contributor) commented on Jan 12, 2024

The change includes:

  • catching errors thrown by Perf Analyzer (PA)
  • sanity checking profile results to make sure the expected number of responses is generated
  • using a logger instead of print statements
  • adding a verbose mode

The sanity check prevents potential miscalculations that can arise when the user asks the model to generate a sequence longer than the model's context length.
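
For context, here is a minimal, self-contained sketch of the sanity check and logger usage described above. The field names (export_data["experiments"], response_timestamps) and the error text are taken from the diff hunks and log output later on this page; the function name check_profile_results and the call site are illustrative assumptions, not the merged code.

import logging

logging.basicConfig(format="%(name)s - %(levelname)s - %(message)s")
logger = logging.getLogger("triton")


def check_profile_results(export_data: dict, max_tokens: int) -> None:
    """Sanity check: every request should have produced the expected number of responses."""
    for request in export_data["experiments"][0]["requests"]:
        num_responses = len(request["response_timestamps"])
        if num_responses != max_tokens:
            raise ValueError(
                f"Expecting {max_tokens} tokens but received {num_responses} tokens. "
                "This could be due to a long prompt size. "
                "Please double check the input and output length."
            )


# Illustrative call site: surface failures through the logger instead of a raw traceback.
export_data = {"experiments": [{"requests": [{"response_timestamps": [0]}]}]}
try:
    check_profile_results(export_data, max_tokens=128)
except Exception as e:
    logger.error(f"Unexpected error: {e}")  # matches the "Unexpected error: ..." lines below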

Non-verbose (no error):

root@nv-ubuntu:/triton_cli# triton bench -m gpt2 --input-length 800 --output-length 128
triton - INFO - Clearing all contents from /root/models...
triton - INFO - Known model source found for 'gpt2': 'hf:gpt2'
triton - INFO - Starting a Triton Server locally with model repository: /root/models
triton - INFO - Starting a Triton Server via docker image 'nvcr.io/nvidia/tritonserver:23.12-vllm-python-py3' with model repository: /root/models
triton - INFO - Server is ready for inference.
triton - INFO - Running Perf Analyzer profiler on 'gpt2'...
triton - INFO - Warming up...
triton - INFO - Warmed up, profiling now...
[ PROFILE CONFIGURATIONS ]
 * Model: gpt2
 * Backend: vllm
 * Batch size: 1
 * Input tokens: 800
 * Output tokens: 128

[ BENCHMARK SUMMARY ]
 * Avg first token latency: 27.3099 ms
 * Avg generation throughput: 186.8861 output tokens/s

triton - INFO - Stopping server...

Non-verbose (with PA error):

root@nv-ubuntu:/triton_cli# triton bench -m gpt2 --input-length 800 --output-length 128
triton - INFO - Clearing all contents from /root/models...
triton - INFO - Known model source found for 'gpt2': 'hf:gpt2'
triton - INFO - Starting a Triton Server locally with model repository: /root/models
triton - INFO - Starting a Triton Server via docker image 'nvcr.io/nvidia/tritonserver:23.12-vllm-python-py3' with model repository: /root/models
triton - INFO - Server is ready for inference.
triton - INFO - Running Perf Analyzer profiler on 'gpt2'...
triton - INFO - Warming up...
triton - ERROR - Unexpected error: Encountered the following error while running Perf Analyzer:
Failed to retrieve results from inference request.
Thread [0] had error: Error generating stream: 'ascii' codec can't encode character '\xab' in position 77: ordinal not in range(128)
triton - INFO - Stopping server...

Non-verbose (with sanity check error):

root@nv-ubuntu:/triton_cli# triton bench -m gpt2 --input-length 1024 --output-length 128
triton - INFO - Clearing all contents from /root/models...
triton - INFO - Known model source found for 'gpt2': 'hf:gpt2'
triton - INFO - Starting a Triton Server locally with model repository: /root/models
triton - INFO - Starting a Triton Server via docker image 'nvcr.io/nvidia/tritonserver:23.12-vllm-python-py3' with model repository: /root/models
triton - INFO - Server is ready for inference.
triton - INFO - Running Perf Analyzer profiler on 'gpt2'...
triton - INFO - Warming up...
triton - INFO - Warmed up, profiling now...
[ PROFILE CONFIGURATIONS ]
 * Model: gpt2
 * Backend: vllm
 * Batch size: 1
 * Input tokens: 1024
 * Output tokens: 128

triton - ERROR - Unexpected error: Expecting 128 tokens but received 1 tokens. This could be due to a long prompt size. Please double check the input and output length.
triton - INFO - Stopping server...

Verbose (no error):

root@nv-ubuntu:/triton_cli# triton bench -m gpt2 --input-length 1024 --output-length 128 --verbose
triton - DEBUG - Using existing model repository: /root/models
triton - INFO - Clearing all contents from /root/models...
triton - INFO - Known model source found for 'gpt2': 'hf:gpt2'
triton - DEBUG - HuggingFace prefix detected, parsing HuggingFace ID
triton - DEBUG - Adding new model to repo at: /root/models/gpt2/1
triton - INFO - Current repo at /root/models:
models/
└── gpt2/
    ├── 1/
    │   └── model.json
    └── config.pbtxt
triton - DEBUG - No --mode specified, trying the following modes: ['local', 'docker']
triton - INFO - Starting a Triton Server locally with model repository: /root/models
triton - INFO - Starting a Triton Server via docker image 'nvcr.io/nvidia/tritonserver:23.12-vllm-python-py3' with model repository: /root/models
triton - DEBUG - Failed to start server in 'docker' mode. Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))
triton - INFO - Server is ready for inference.
triton - INFO - Running Perf Analyzer profiler on 'gpt2'...
triton - INFO - Warming up...
triton - INFO - Perf Analyzer output:
 Successfully read data for 1 stream/streams with 1 step/steps.
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "count_windows" mode for stabilization
  Minimum number of samples in each window: 10
  Using asynchronous calls for inference
  Detected decoupled model, using the first response for measuring latency
  Stabilizing using average latency

Request concurrency: 1
  Client:
    Request count: 71
    Throughput: 23.6401 infer/sec
    Response Throughput: 23.6401 infer/sec
    Avg latency: 41843 usec (standard deviation 3879 usec)
    p50 latency: 42018 usec
    p90 latency: 47008 usec
    p95 latency: 48011 usec
    p99 latency: 48873 usec

  Server:
    Inference count: 72
    Execution count: 72
    Successful request count: 72
    Avg request latency: 510 usec (overhead 4 usec + queue 71 usec + compute input 42 usec + compute infer 387 usec + compute output 6 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 23.6401 infer/sec, latency 41843 usec

triton - INFO - Warmed up, profiling now...
[ PROFILE CONFIGURATIONS ]
 * Model: gpt2
 * Backend: vllm
 * Batch size: 1
 * Input tokens: 1024
 * Output tokens: 128

triton - INFO - Perf Analyzer output:
 Successfully read data for 1 stream/streams with 1 step/steps.
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "count_windows" mode for stabilization
  Minimum number of samples in each window: 10
  Using asynchronous calls for inference
  Detected decoupled model, using the first response for measuring latency
  Stabilizing using average latency

Request concurrency: 1
  Client:
    Request count: 96
    Throughput: 31.9703 infer/sec
    Response Throughput: 31.9703 infer/sec
    Avg latency: 31033 usec (standard deviation 11848 usec)
    p50 latency: 35807 usec
    p90 latency: 44262 usec
    p95 latency: 45771 usec
    p99 latency: 48866 usec

  Server:
    Inference count: 97
    Execution count: 97
    Successful request count: 97
    Avg request latency: 473 usec (overhead 3 usec + queue 60 usec + compute input 40 usec + compute infer 364 usec + compute output 5 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 31.9703 infer/sec, latency 31033 usec

triton - ERROR - Unexpected error: Expecting 128 tokens but received 1 tokens. This could be due to a long prompt size. Please double check the input and output length.
triton - INFO - Stopping server...
triton - DEBUG - Stopped Triton Server.
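
In the verbose example, the triton logger additionally emits DEBUG-level lines (and the full Perf Analyzer output is echoed at INFO level). Below is a minimal sketch of how a --verbose flag could be wired to the logging level; the flag name and logger name are taken from the examples above, while the rest is an assumption about the implementation, not the actual CLI code.

import argparse
import logging

parser = argparse.ArgumentParser()
parser.add_argument("--verbose", action="store_true", help="Enable DEBUG-level logging")
args = parser.parse_args()

logging.basicConfig(format="%(name)s - %(levelname)s - %(message)s")
logger = logging.getLogger("triton")
logger.setLevel(logging.DEBUG if args.verbose else logging.INFO)

logger.info("Server is ready for inference.")                   # shown in both modes
logger.debug("Using existing model repository: /root/models")   # shown only with --verbose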

if args.ignore_eos:
    requests = export_data["experiments"][0]["requests"]
    for request in requests:
        if len(request["response_timestamps"]) != args.max_tokens:
@nv-hwoo (Contributor Author) commented:

Must merge after triton-inference-server/vllm_backend#28 in order for this line to be valid.

@rmccorm4 (Collaborator) commented on Jan 12, 2024:

Are we receiving back any output data for the responses from PA? If so, we could check if the bonus response is empty for validation.

Otherwise, to smooth the transition, can we do something like this for now?

if len(request["response_timestamps"]) == args.max_tokens:
    # Assume FINAL flag is returned with final token response
    pass
elif len(request["response_timestamps"]) == args.max_tokens + 1:
    # Assume FINAL flag was returned with an empty response after the final token
    logger.warning('Received an extra response from the backend. This may be due to the backend sending an "empty final response".')
else:
    raise ValueError(...)

@nv-hwoo (Contributor Author) replied:

Unfortunately, the current PA output only contains the timestamps of the request responses. Updated to add an additional check/warning for the empty final response.
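
Putting the thread together, the check converges on a three-way split. A self-contained sketch of that shape is below; the max_tokens thresholds, the comments, and the warning text come from the suggestion above, while the function name validate_response_count is hypothetical and this is not the exact merged code.

import logging

logger = logging.getLogger("triton")


def validate_response_count(num_responses: int, max_tokens: int) -> None:
    if num_responses == max_tokens:
        # Assume FINAL flag is returned with final token response
        pass
    elif num_responses == max_tokens + 1:
        # Assume FINAL flag was returned with an empty response after the final token
        logger.warning(
            "Received an extra response from the backend. This may be due to "
            'the backend sending an "empty final response".'
        )
    else:
        raise ValueError(
            f"Expecting {max_tokens} tokens but received {num_responses} tokens."
        )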

@nv-hwoo requested a review from rmccorm4 on January 12, 2024 at 22:18
@rmccorm4 (Collaborator) left a review comment:

Great changes! Only feedback is on the vllm / final response comment.

@nv-hwoo requested a review from rmccorm4 on January 12, 2024 at 23:37
if len(request["response_timestamps"]) == args.max_tokens:
    pass
elif len(request["response_timestamps"]) == args.max_tokens + 1:
    logger.warning(
@rmccorm4 (Collaborator) commented:

Can you include the comments I had in the other comment in-line here? The # Assume ... ones

@rmccorm4 (Collaborator) added:

This will help me remember the reasoning for the check/split later on

@nv-hwoo (Contributor Author) replied:

Added comments

@rmccorm4 merged commit bd76c71 into main on Jan 14, 2024
3 checks passed
@rmccorm4 deleted the hwoo-profiler-logging branch on January 14, 2024 at 19:13