feat: Add histogram support and TTFT histogram metric #396

Merged · 12 commits · Oct 23, 2024
2 changes: 1 addition & 1 deletion src/backend_model_instance.cc
@@ -192,7 +192,7 @@ TritonModelInstance::TritonModelInstance(
model_->Server()->ResponseCacheEnabled();
MetricModelReporter::Create(
model_->ModelId(), model_->Version(), id, response_cache_enabled,
model_->Config().metric_tags(), &reporter_);
model_->IsDecoupled(), model_->Config().metric_tags(), &reporter_);
}
#endif // TRITON_ENABLE_METRICS
}
5 changes: 3 additions & 2 deletions src/ensemble_scheduler/ensemble_scheduler.cc
@@ -1470,12 +1470,13 @@ EnsembleScheduler::EnsembleScheduler(
}
#endif // TRITON_ENABLE_GPU

const bool is_decoupled = config.model_transaction_policy().decoupled();
#ifdef TRITON_ENABLE_METRICS
if (Metrics::Enabled()) {
// Ensemble scheduler doesn't currently support response cache at top level.
MetricModelReporter::Create(
model_id, 1 /* model_version */, METRIC_REPORTER_ID_CPU,
false /* response_cache_enabled */, config.metric_tags(),
false /* response_cache_enabled */, is_decoupled, config.metric_tags(),
&metric_reporter_);
}
#endif // TRITON_ENABLE_METRICS
@@ -1486,7 +1487,7 @@ EnsembleScheduler::EnsembleScheduler(
info_->ensemble_name_ = config.name();

// This config field is filled internally for ensemble models
info_->is_decoupled_ = config.model_transaction_policy().decoupled();
info_->is_decoupled_ = is_decoupled;

// field to check if response cache enabled in the ensemble model config.
info_->is_cache_enabled_ =
50 changes: 46 additions & 4 deletions src/infer_response.cc
@@ -1,4 +1,4 @@
// Copyright 2020-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
// Copyright 2020-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions
@@ -42,7 +42,12 @@ InferenceResponseFactory::CreateResponse(
{
response->reset(new InferenceResponse(
model_, id_, allocator_, alloc_userp_, response_fn_, response_userp_,
response_delegator_));
response_delegator_
#ifdef TRITON_ENABLE_METRICS
,
responses_sent_, infer_start_ns_
#endif // TRITON_ENABLE_METRICS
));
#ifdef TRITON_ENABLE_TRACING
(*response)->SetTrace(trace_);
#endif // TRITON_ENABLE_TRACING
@@ -72,10 +77,21 @@ InferenceResponse::InferenceResponse(
TRITONSERVER_InferenceResponseCompleteFn_t response_fn,
void* response_userp,
const std::function<
void(std::unique_ptr<InferenceResponse>&&, const uint32_t)>& delegator)
void(std::unique_ptr<InferenceResponse>&&, const uint32_t)>& delegator
#ifdef TRITON_ENABLE_METRICS
,
std::shared_ptr<std::atomic<uint64_t>> responses_sent,
uint64_t infer_start_ns
#endif // TRITON_ENABLE_METRICS
)
: model_(model), id_(id), allocator_(allocator), alloc_userp_(alloc_userp),
response_fn_(response_fn), response_userp_(response_userp),
response_delegator_(delegator), null_response_(false)
response_delegator_(delegator),
#ifdef TRITON_ENABLE_METRICS
responses_sent_(std::move(responses_sent)),
infer_start_ns_(infer_start_ns),
#endif // TRITON_ENABLE_METRICS
null_response_(false)
{
// If the allocator has a start_fn then invoke it.
TRITONSERVER_ResponseAllocatorStartFn_t start_fn = allocator_->StartFn();
@@ -93,6 +109,9 @@ InferenceResponse::InferenceResponse(
TRITONSERVER_InferenceResponseCompleteFn_t response_fn,
void* response_userp)
: response_fn_(response_fn), response_userp_(response_userp),
#ifdef TRITON_ENABLE_METRICS
responses_sent_(nullptr), infer_start_ns_(0),
#endif // TRITON_ENABLE_METRICS
null_response_(true)
{
}
@@ -214,6 +233,10 @@ InferenceResponse::Send(
TRITONSERVER_TRACE_TENSOR_BACKEND_OUTPUT, "InferenceResponse Send");
#endif // TRITON_ENABLE_TRACING

#ifdef TRITON_ENABLE_METRICS
response->UpdateResponseMetrics();
#endif // TRITON_ENABLE_METRICS
Contributor review comment (on the UpdateResponseMetrics() call above):
For an initial merge with histograms disabled by default, I think this is fine to go ahead with if we need to cherry-pick. However, please take note of the following:

I think this is relatively on the hot path and could impact latency, compared to our other inference metrics (TRITONBACKEND_ReportStatistics), which are generally reported after response sending in backends (affecting throughput but not response latency).

You can find some perf numbers for each prometheus-cpp metric type at the bottom of the README here: https://github.com/jupp0r/prometheus-cpp

A single observation for one metric with a small number of buckets may not be impactful for one request, but as we scale up to high concurrency, more metrics, more buckets, etc., this could present a noticeable latency impact.

It would be good to do some light validation of overall latency before/after the feature via genai-perf, especially for high concurrency and streaming of many responses/tokens, since there can also be some synchronization in the interaction with the prometheus registry when many responses complete concurrently.

It would probably be advantageous to do the actual prometheus registry interaction after sending the response if possible, e.g. by doing only the bare minimum before the send (check whether this is the first response and record the latency), then using that information to report the metric after initiating the response send.

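A minimal, self-contained sketch of what that deferral could look like, assuming a hypothetical SendResponse() stand-in and a raw prometheus-cpp histogram; the real code path would go through MetricModelReporter::ObserveHistogram instead, and the ns-per-ms constant is an assumption:

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

#include <prometheus/histogram.h>

// Hypothetical stand-in for initiating the actual response send.
void SendResponse() {}

void
SendAndReport(
    std::atomic<std::uint64_t>& responses_sent, std::uint64_t infer_start_ns,
    prometheus::Histogram& first_response_histogram)
{
  // Bare minimum before the send: decide whether this is the first response
  // and capture the latency.
  const bool is_first_response =
      responses_sent.fetch_add(1, std::memory_order_relaxed) == 0;
  const std::uint64_t now_ns =
      std::chrono::duration_cast<std::chrono::nanoseconds>(
          std::chrono::steady_clock::now().time_since_epoch())
          .count();

  // Initiate the send first so the prometheus registry interaction stays off
  // the response-latency path.
  SendResponse();

  if (is_first_response) {
    constexpr double kNanosPerMilli = 1.0e6;  // assumed conversion factor
    first_response_histogram.Observe(
        (now_ns - infer_start_ns) / kNanosPerMilli);
  }
}
```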

if (response->response_delegator_ != nullptr) {
auto ldelegator = std::move(response->response_delegator_);
ldelegator(std::move(response), flags);
@@ -282,6 +305,25 @@ InferenceResponse::TraceOutputTensors(
}
#endif // TRITON_ENABLE_TRACING

#ifdef TRITON_ENABLE_METRICS
void
InferenceResponse::UpdateResponseMetrics() const
{
// Report the duration from inference start to the first response.
if (model_ != nullptr && responses_sent_ != nullptr &&
responses_sent_->fetch_add(1, std::memory_order_relaxed) == 0) {
auto now_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
std::chrono::steady_clock::now().time_since_epoch())
.count();
if (auto reporter = model_->MetricReporter()) {
reporter->ObserveHistogram(
"first_response_histogram",
(now_ns - infer_start_ns_) / NANOS_PER_MILLIS);
}
}
}
#endif // TRITON_ENABLE_METRICS

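For reference, the observed value above is the first-response latency converted from nanoseconds to milliseconds. NANOS_PER_MILLIS is defined elsewhere in the core and is assumed here to equal 1e6; a quick sanity check of the arithmetic:

```cpp
#include <cassert>
#include <cstdint>

int main()
{
  // Assumed definition: 1 ms = 1,000,000 ns.
  constexpr double kNanosPerMillis = 1.0e6;

  constexpr std::uint64_t infer_start_ns = 2'000'000'000;  // request start, t = 2.0 s
  constexpr std::uint64_t now_ns = 3'500'000'000;          // first response, t = 3.5 s

  // 1.5 s from inference start to first response -> 1500 ms observed.
  assert((now_ns - infer_start_ns) / kNanosPerMillis == 1500.0);
  return 0;
}
```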
//
// InferenceResponse::Output
//
39 changes: 37 additions & 2 deletions src/infer_response.h
@@ -61,11 +61,20 @@ class InferenceResponseFactory {
alloc_userp_(alloc_userp), response_fn_(response_fn),
response_userp_(response_userp), response_delegator_(delegator),
is_cancelled_(false)
#ifdef TRITON_ENABLE_METRICS
,
responses_sent_(std::make_shared<std::atomic<uint64_t>>(0))
#endif // TRITON_ENABLE_METRICS
#ifdef TRITON_ENABLE_STATS
,
response_stats_index_(0)
#endif // TRITON_ENABLE_STATS
{
#ifdef TRITON_ENABLE_METRICS
infer_start_ns_ = std::chrono::duration_cast<std::chrono::nanoseconds>(
std::chrono::steady_clock::now().time_since_epoch())
.count();
#endif // TRITON_ENABLE_METRICS
}

void Cancel() { is_cancelled_ = true; }
@@ -134,6 +143,14 @@ class InferenceResponseFactory {

std::atomic<bool> is_cancelled_;

#ifdef TRITON_ENABLE_METRICS
// Total number of responses sent that were created by this response factory.
std::shared_ptr<std::atomic<uint64_t>> responses_sent_;

// The start time of the associated request in ns.
uint64_t infer_start_ns_;
#endif // TRITON_ENABLE_METRICS

#ifdef TRITON_ENABLE_TRACING
// Inference trace associated with this response.
std::shared_ptr<InferenceTraceProxy> trace_;
@@ -246,8 +263,14 @@ class InferenceResponse {
const ResponseAllocator* allocator, void* alloc_userp,
TRITONSERVER_InferenceResponseCompleteFn_t response_fn,
void* response_userp,
const std::function<void(
std::unique_ptr<InferenceResponse>&&, const uint32_t)>& delegator);
const std::function<
void(std::unique_ptr<InferenceResponse>&&, const uint32_t)>& delegator
#ifdef TRITON_ENABLE_METRICS
,
std::shared_ptr<std::atomic<uint64_t>> responses_sent,
uint64_t infer_start_ns
#endif // TRITON_ENABLE_METRICS
);

// "null" InferenceResponse is a special instance of InferenceResponse which
// contains minimal information for calling InferenceResponse::Send,
@@ -324,6 +347,10 @@ class InferenceResponse {
TRITONSERVER_InferenceTraceActivity activity, const std::string& msg);
#endif // TRITON_ENABLE_TRACING

#ifdef TRITON_ENABLE_METRICS
void UpdateResponseMetrics() const;
#endif // TRITON_ENABLE_METRICS

// The model associated with this factory. For normal
// requests/responses this will always be defined and acts to keep
// the model loaded as long as this factory is live. It may be
@@ -358,6 +385,14 @@ class InferenceResponse {
std::function<void(std::unique_ptr<InferenceResponse>&&, const uint32_t)>
response_delegator_;

#ifdef TRITON_ENABLE_METRICS
// Total number of responses sent that were created by the associated response factory.
const std::shared_ptr<std::atomic<uint64_t>> responses_sent_;

// The start time of the associated request in ns.
const uint64_t infer_start_ns_;
#endif // TRITON_ENABLE_METRICS

bool null_response_;

#ifdef TRITON_ENABLE_TRACING
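A design note on the metrics members above: the factory creates a single std::shared_ptr<std::atomic<uint64_t>> and hands a reference to every response it creates, so all responses for one request update the same counter, and the counter stays valid regardless of destruction order. A standalone sketch of that ownership pattern (struct names are illustrative, not from the PR):

```cpp
#include <atomic>
#include <cstdint>
#include <memory>

// Illustrative stand-ins, not the actual Triton classes.
struct Factory {
  std::shared_ptr<std::atomic<std::uint64_t>> responses_sent =
      std::make_shared<std::atomic<std::uint64_t>>(0);
};

struct Response {
  // Each response holds its own reference, so the counter stays alive even
  // if the factory is destroyed first.
  std::shared_ptr<std::atomic<std::uint64_t>> responses_sent;
  void MarkSent() { responses_sent->fetch_add(1, std::memory_order_relaxed); }
};

int main()
{
  Factory factory;
  Response r1{factory.responses_sent};
  Response r2{factory.responses_sent};
  r1.MarkSent();
  r2.MarkSent();
  // Both responses incremented the same shared counter.
  return (factory.responses_sent->load() == 2) ? 0 : 1;
}
```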
65 changes: 60 additions & 5 deletions src/metric_model_reporter.cc
@@ -41,7 +41,8 @@ namespace triton { namespace core {
// MetricReporterConfig
//
void
MetricReporterConfig::ParseConfig(bool response_cache_enabled)
MetricReporterConfig::ParseConfig(
bool response_cache_enabled, bool is_decoupled)
{
// Global config only for now in config map
auto metrics_config_map = Metrics::ConfigMap();
@@ -53,6 +54,10 @@ MetricReporterConfig::ParseConfig(bool response_cache_enabled)
latency_counters_enabled_ = false;
}

if (pair.first == "histogram_latencies" && pair.second == "true") {
latency_histograms_enabled_ = true;
}

if (pair.first == "summary_latencies" && pair.second == "true") {
latency_summaries_enabled_ = true;
}
@@ -68,6 +73,7 @@

// Set flag to signal to stats aggregator if caching is enabled or not
cache_enabled_ = response_cache_enabled;
is_decoupled_ = is_decoupled;
}

prometheus::Summary::Quantiles
@@ -112,7 +118,7 @@ const std::map<FailureReason, std::string>
Status
MetricModelReporter::Create(
const ModelIdentifier& model_id, const int64_t model_version,
const int device, bool response_cache_enabled,
const int device, bool response_cache_enabled, bool is_decoupled,
const triton::common::MetricTagsMap& model_tags,
std::shared_ptr<MetricModelReporter>* metric_model_reporter)
{
@@ -141,25 +147,27 @@ MetricModelReporter::Create(
}

metric_model_reporter->reset(new MetricModelReporter(
model_id, model_version, device, response_cache_enabled, model_tags));
model_id, model_version, device, response_cache_enabled, is_decoupled,
model_tags));
reporter_map.insert({hash_labels, *metric_model_reporter});
return Status::Success;
}

MetricModelReporter::MetricModelReporter(
const ModelIdentifier& model_id, const int64_t model_version,
const int device, bool response_cache_enabled,
const int device, bool response_cache_enabled, bool is_decoupled,
const triton::common::MetricTagsMap& model_tags)
{
std::map<std::string, std::string> labels;
GetMetricLabels(&labels, model_id, model_version, device, model_tags);

// Parse metrics config to control metric setup and behavior
config_.ParseConfig(response_cache_enabled);
config_.ParseConfig(response_cache_enabled, is_decoupled);

// Initialize families and metrics
InitializeCounters(labels);
InitializeGauges(labels);
InitializeHistograms(labels);
InitializeSummaries(labels);
}

@@ -182,6 +190,14 @@ MetricModelReporter::~MetricModelReporter()
}
}

for (auto& iter : histogram_families_) {
const auto& name = iter.first;
auto family_ptr = iter.second;
if (family_ptr) {
family_ptr->Remove(histograms_[name]);
}
}

for (auto& iter : summary_families_) {
const auto& name = iter.first;
auto family_ptr = iter.second;
@@ -262,6 +278,28 @@ MetricModelReporter::InitializeGauges(
}
}

void
MetricModelReporter::InitializeHistograms(
const std::map<std::string, std::string>& labels)
{
// Only create response metrics for decoupled models, to reduce metric output
if (config_.latency_histograms_enabled_) {
if (config_.is_decoupled_) {
histogram_families_["first_response_histogram"] =
&Metrics::FamilyFirstResponseDuration();
}
}

for (auto& iter : histogram_families_) {
const auto& name = iter.first;
auto family_ptr = iter.second;
if (family_ptr) {
histograms_[name] = CreateMetric<prometheus::Histogram>(
*family_ptr, labels, config_.buckets_);
}
}
}

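Metrics::FamilyFirstResponseDuration() is defined outside this excerpt (in metrics.cc). As a rough idea of what registering such a family looks like with prometheus-cpp, here is a minimal sketch; the metric name and help text are assumptions, not the PR's actual strings:

```cpp
#include <prometheus/histogram.h>
#include <prometheus/registry.h>

// Assumed name/help text; the real family is registered in metrics.cc.
prometheus::Family<prometheus::Histogram>&
BuildFirstResponseFamily(prometheus::Registry& registry)
{
  return prometheus::BuildHistogram()
      .Name("nv_inference_first_response_histogram_ms")
      .Help("Duration from request start to first response in milliseconds")
      .Register(registry);
}
```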
void
MetricModelReporter::InitializeSummaries(
const std::map<std::string, std::string>& labels)
@@ -408,6 +446,23 @@ MetricModelReporter::DecrementGauge(const std::string& name, double value)
IncrementGauge(name, -1 * value);
}

void
MetricModelReporter::ObserveHistogram(const std::string& name, double value)
{
auto iter = histograms_.find(name);
if (iter == histograms_.end()) {
// No histogram metric exists with this name
return;
}

auto histogram = iter->second;
if (!histogram) {
// histogram is uninitialized/nullptr
return;
}
histogram->Observe(value);
}

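For anyone unfamiliar with the prometheus-cpp type used here: each Observe() records the value into the bucket counts (exported cumulatively), the running sum, and the sample count. A small standalone illustration with made-up millisecond boundaries (not the PR's default buckets):

```cpp
#include <iostream>

#include <prometheus/histogram.h>

int main()
{
  // Cumulative upper bounds in ms; an implicit +Inf bucket is added.
  prometheus::Histogram ttft_ms{
      prometheus::Histogram::BucketBoundaries{100.0, 500.0, 2000.0}};

  ttft_ms.Observe(42.0);   // included in the cumulative counts for le=100/500/2000/+Inf
  ttft_ms.Observe(750.0);  // included in the cumulative counts for le=2000/+Inf

  const auto metric = ttft_ms.Collect();
  std::cout << "count=" << metric.histogram.sample_count
            << " sum=" << metric.histogram.sample_sum << std::endl;
  return 0;
}
```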
void
MetricModelReporter::ObserveSummary(const std::string& name, double value)
{