
feat: Add histogram support and TTFT histogram metric #396

Merged
merged 12 commits into main from DLIS-7383-yinggeh-metrics-standardization-TTFT
Oct 23, 2024

Conversation

yinggeh
Contributor

@yinggeh yinggeh commented Oct 12, 2024

What does the PR do?

  1. Adds histogram support to core metrics.
  2. Adds a new TTFT (time-to-first-token) metric, nv_inference_first_response_histogram_ms, labeled per model and version. The new histogram metric is reported for decoupled models only.
  3. Adds --metrics-config histogram_latencies=<bool> to enable or disable latency histograms. (An illustrative sketch follows this list.)
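
For illustration only, here is a minimal sketch of how such a latency histogram could be registered and observed with prometheus-cpp (the metrics library referenced in the review below). The registry wiring and label values are assumptions, and the bucket boundaries simply mirror the example output; this is not Triton's actual code.

// Illustrative sketch only: wiring, labels, and buckets are assumptions
// based on the example output below, not Triton's implementation.
#include <prometheus/histogram.h>
#include <prometheus/registry.h>

#include <memory>

int main()
{
  auto registry = std::make_shared<prometheus::Registry>();

  // Family for the new TTFT metric.
  auto& family =
      prometheus::BuildHistogram()
          .Name("nv_inference_first_response_histogram_ms")
          .Help("Duration from request to first response in milliseconds")
          .Register(*registry);

  // One histogram per {model, version} label set; buckets are in
  // milliseconds and match the example exposition below.
  auto& ttft = family.Add(
      {{"model", "ensemble"}, {"version", "1"}},
      prometheus::Histogram::BucketBoundaries{100.0, 500.0, 2000.0, 5000.0});

  // Record one first-response latency, e.g. 1159 ms.
  ttft.Observe(1159.0);
  return 0;
}

In the server itself this reporting is toggled by the new --metrics-config histogram_latencies=<bool> option rather than wired up by hand as in this sketch.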

Example: a request sent to the TRT-LLM backend ensemble model.

# HELP nv_inference_first_response_histogram_ms Duration from request to first response in milliseconds
# TYPE nv_inference_first_response_histogram_ms histogram
nv_inference_first_response_histogram_ms_count{model="ensemble",version="1"} 1
nv_inference_first_response_histogram_ms_sum{model="ensemble",version="1"} 1159
nv_inference_first_response_histogram_ms_bucket{model="ensemble",version="1",le="100"} 0
nv_inference_first_response_histogram_ms_bucket{model="ensemble",version="1",le="500"} 0
nv_inference_first_response_histogram_ms_bucket{model="ensemble",version="1",le="2000"} 1
nv_inference_first_response_histogram_ms_bucket{model="ensemble",version="1",le="5000"} 1
nv_inference_first_response_histogram_ms_bucket{model="ensemble",version="1",le="+Inf"} 1
nv_inference_first_response_histogram_ms_count{model="tensorrt_llm",version="1"} 1
nv_inference_first_response_histogram_ms_sum{model="tensorrt_llm",version="1"} 1137
nv_inference_first_response_histogram_ms_bucket{model="tensorrt_llm",version="1",le="100"} 0
nv_inference_first_response_histogram_ms_bucket{model="tensorrt_llm",version="1",le="500"} 0
nv_inference_first_response_histogram_ms_bucket{model="tensorrt_llm",version="1",le="2000"} 1
nv_inference_first_response_histogram_ms_bucket{model="tensorrt_llm",version="1",le="5000"} 1
nv_inference_first_response_histogram_ms_bucket{model="tensorrt_llm",version="1",le="+Inf"} 1
nv_inference_first_response_histogram_ms_count{model="tensorrt_llm_bls",version="1"} 0
nv_inference_first_response_histogram_ms_sum{model="tensorrt_llm_bls",version="1"} 0
nv_inference_first_response_histogram_ms_bucket{model="tensorrt_llm_bls",version="1",le="100"} 0
nv_inference_first_response_histogram_ms_bucket{model="tensorrt_llm_bls",version="1",le="500"} 0
nv_inference_first_response_histogram_ms_bucket{model="tensorrt_llm_bls",version="1",le="2000"} 0
nv_inference_first_response_histogram_ms_bucket{model="tensorrt_llm_bls",version="1",le="5000"} 0
nv_inference_first_response_histogram_ms_bucket{model="tensorrt_llm_bls",version="1",le="+Inf"} 0

Example: a request sent to the TRT-LLM backend tensorrt_llm_bls model.

# HELP nv_inference_first_response_histogram_ms Duration from request to first response in milliseconds
# TYPE nv_inference_first_response_histogram_ms histogram
nv_inference_first_response_histogram_ms_count{model="ensemble",version="1"} 0
nv_inference_first_response_histogram_ms_sum{model="ensemble",version="1"} 0
nv_inference_first_response_histogram_ms_bucket{model="ensemble",version="1",le="100"} 0
nv_inference_first_response_histogram_ms_bucket{model="ensemble",version="1",le="500"} 0
nv_inference_first_response_histogram_ms_bucket{model="ensemble",version="1",le="2000"} 0
nv_inference_first_response_histogram_ms_bucket{model="ensemble",version="1",le="5000"} 0
nv_inference_first_response_histogram_ms_bucket{model="ensemble",version="1",le="+Inf"} 0
nv_inference_first_response_histogram_ms_count{model="tensorrt_llm_bls",version="1"} 1
nv_inference_first_response_histogram_ms_sum{model="tensorrt_llm_bls",version="1"} 1125
nv_inference_first_response_histogram_ms_bucket{model="tensorrt_llm_bls",version="1",le="100"} 0
nv_inference_first_response_histogram_ms_bucket{model="tensorrt_llm_bls",version="1",le="500"} 0
nv_inference_first_response_histogram_ms_bucket{model="tensorrt_llm_bls",version="1",le="2000"} 1
nv_inference_first_response_histogram_ms_bucket{model="tensorrt_llm_bls",version="1",le="5000"} 1
nv_inference_first_response_histogram_ms_bucket{model="tensorrt_llm_bls",version="1",le="+Inf"} 1
nv_inference_first_response_histogram_ms_count{model="tensorrt_llm",version="1"} 1
nv_inference_first_response_histogram_ms_sum{model="tensorrt_llm",version="1"} 1113
nv_inference_first_response_histogram_ms_bucket{model="tensorrt_llm",version="1",le="100"} 0
nv_inference_first_response_histogram_ms_bucket{model="tensorrt_llm",version="1",le="500"} 0
nv_inference_first_response_histogram_ms_bucket{model="tensorrt_llm",version="1",le="2000"} 1
nv_inference_first_response_histogram_ms_bucket{model="tensorrt_llm",version="1",le="5000"} 1
nv_inference_first_response_histogram_ms_bucket{model="tensorrt_llm",version="1",le="+Inf"} 1

Checklist

  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated GitHub labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

  • feat

Related PRs:

triton-inference-server/server#7694

Where should the reviewer start?

n/a

Test plan:

L0_metrics--base
L0_response_cache--base

  • CI Pipeline ID:
    19614087

Caveats:

  1. Neelay suggested adding backend_name as a metric label. During implementation, I found it wasn't easy to add the backend label only to the new metric without changing all the existing ones. @nnshah1
  2. The customer asked for TTFT in milliseconds. However, this is inconsistent with the microseconds used in our other metrics. Please advise.

Background

Standardizing Large Model Server Metrics in Kubernetes

@yinggeh yinggeh added the PR: feat A new feature label Oct 12, 2024
@yinggeh yinggeh self-assigned this Oct 12, 2024
@yinggeh yinggeh force-pushed the DLIS-7383-yinggeh-metrics-standardization-TTFT branch from 0c2893d to ba4495d on October 12, 2024 00:23
@yinggeh yinggeh force-pushed the DLIS-7383-yinggeh-metrics-standardization-TTFT branch from ba4495d to b9231d8 on October 12, 2024 00:26
@yinggeh yinggeh changed the title from "feat: Add histogram support and new histogram metric" to "feat: Add histogram support and TTFT histogram metric" on Oct 12, 2024
src/infer_response.cc (outdated review thread, resolved)
src/model.h (outdated review thread, resolved)
src/infer_response.cc (outdated review thread, resolved)
@yinggeh yinggeh requested a review from rmccorm4 October 14, 2024 21:41
src/infer_response.cc (outdated review thread, resolved)
@yinggeh yinggeh requested a review from kthui October 16, 2024 20:58
kthui
kthui previously approved these changes Oct 16, 2024
Contributor

@kthui kthui left a comment


LGTM! @rmccorm4 should also take a look before merging.

src/model.h (outdated review thread, resolved)
@@ -214,6 +232,10 @@ InferenceResponse::Send(
      TRITONSERVER_TRACE_TENSOR_BACKEND_OUTPUT, "InferenceResponse Send");
#endif  // TRITON_ENABLE_TRACING

#ifdef TRITON_ENABLE_METRICS
  response->UpdateResponseMetrics();
#endif  // TRITON_ENABLE_METRICS

For an initial merge with a default of histograms being disabled - I think this is fine to go ahead with if we need to cherry-pick. However, please take note of the following:

I think this is relatively on the hot path, possibly impacting latency, compared to our other inference metrics (TRITONBACKEND_ReportStatistics) which are generally reported after response sending in backends (impacting throughput but not response latency).

You can find some perf numbers of each prometheus-cpp metric type at the bottom of the README here: https://github.com/jupp0r/prometheus-cpp

One individual observation for a single metric and a small number of buckets may not be impactful for one request, but as we scale up high concurrency, more metrics, more buckets, etc - this could present a noticeable latency impact.

It would be good to do some light validation of overall latency before/after the feature via genai-perf. Especially for high concurrency and streaming many responses/tokens - as there can be some synchronization in interaction with the prometheus registry with many concurrent responses as well.

It would probably be advantageous to do the actual prometheus registry interaction after sending the response if possible, such as by only doing the bare minimum of determining if we should report metrics (check if first response and record latency), then using that information to report the metric after initiating response send.
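
For illustration, a hedged sketch of the ordering suggested above. Every type and name here (TtftHistogram, Response, SendAndReport) is a hypothetical stand-in, not the PR's actual members or Triton's classes:

// Hedged sketch of "report after initiating the send"; all names are
// hypothetical stand-ins.
#include <chrono>
#include <iostream>

struct TtftHistogram {
  // Stand-in for a Prometheus histogram; Observe() is where registry
  // locking and bucket updates would happen, hence it is deferred.
  void Observe(double value_ms)
  {
    std::cout << "observe " << value_ms << " ms\n";
  }
};

struct Response {
  std::chrono::steady_clock::time_point request_start;
  bool first_response = true;
  void Send() { /* hand the response to the transport */ }
};

void SendAndReport(Response& response, TtftHistogram& ttft)
{
  // Hot path: only decide whether to report and capture the latency.
  const bool report = response.first_response;
  const double first_response_ms =
      std::chrono::duration<double, std::milli>(
          std::chrono::steady_clock::now() - response.request_start)
          .count();

  // Initiate the response send first...
  response.Send();

  // ...then touch the Prometheus registry off the latency-critical path.
  if (report) {
    ttft.Observe(first_response_ms);
  }
}

int main()
{
  TtftHistogram ttft;
  Response response{std::chrono::steady_clock::now()};
  SendAndReport(response, ttft);
  return 0;
}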

src/infer_response.h (outdated review thread, resolved)
@yinggeh yinggeh force-pushed the DLIS-7383-yinggeh-metrics-standardization-TTFT branch from 1b4f571 to b6b5af9 on October 17, 2024 18:49
src/infer_response.cc (outdated review thread, resolved)
@yinggeh yinggeh requested a review from rmccorm4 October 17, 2024 23:11
@rmccorm4
Contributor

Deferring to Jacky as I'll be OOO, but let's make sure the test cases give us confidence in the feature and that we're addressing the spirit of the use case that motivates it.

…into DLIS-7383-yinggeh-metrics-standardization-TTFT
…into DLIS-7383-yinggeh-metrics-standardization-TTFT
Contributor

@kthui kthui left a comment


LGTM! The test cases give us confidence in the feature, and the feature addresses the use case.

Please make sure CI passes with all the final changes before merging.

@yinggeh yinggeh merged commit c815a13 into main Oct 23, 2024
1 check passed
Labels
PR: feat A new feature
3 participants