
Rearrange metric enablement, so that model metric reporter can proceed properly #321

Open
ClifHouck wants to merge 1 commit into main from clif/fix_enablement_of_metric_labels
Conversation

ClifHouck

Addresses triton-inference-server/server#6815

The fix is to enable GPU metrics (assuming they're enabled at compile time and by the user at run time) prior to calling MetricModelReporter::Create. If GPU metrics are not enabled then MetricModelReporter::GetMetricsLabels will not get/populate relevant GPU labels.

@ClifHouck
Author

To elaborate on why this change fixes the GPU metric labels: enabling GPU metrics before initializing the server (around line 2396, tc::Status status = lserver->Init()) allows the metric labels to be populated with GPU information.
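
For illustration, here is a condensed sketch of the ordering being proposed. The names (EnableGpuMetricsIfRequested, the InferenceServer stub) are simplified stand-ins rather than the actual functions in the tree; the point is only that GPU-metric enablement has to happen before Init() builds the model repository manager and its MetricModelReporter instances.

```cpp
#include <memory>

// Simplified stand-in for tc::InferenceServer.
struct InferenceServer {
  // Init() builds the model repository manager, which creates a
  // MetricModelReporter per model; GetMetricsLabels() runs at that point.
  void Init() {}
};

// Hypothetical helper: enable GPU metrics only when built with them and the
// user asked for them at run time.
void EnableGpuMetricsIfRequested(bool built_with_gpu_metrics, bool user_enabled)
{
  if (built_with_gpu_metrics && user_enabled) {
    // e.g. something along the lines of tc::Metrics::EnableGPUMetrics()
    // in the real code.
  }
}

int main()
{
  auto lserver = std::make_unique<InferenceServer>();

  // The fix: flip the order so GPU metrics are enabled *before* Init();
  // otherwise GetMetricsLabels() sees GPU metrics as disabled and never
  // attaches the gpu_uuid label.
  EnableGpuMetricsIfRequested(/*built_with_gpu_metrics=*/true,
                              /*user_enabled=*/true);
  lserver->Init();  // the lserver->Init() call around line 2396
  return 0;
}
```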

ClifHouck force-pushed the clif/fix_enablement_of_metric_labels branch from f93cf3a to b0a970d on January 26, 2024
@dyastremsky
Contributor

dyastremsky commented Feb 21, 2024

Thank you for this PR!

These look good to me. Adding @rmccorm4 as a reviewer as well, since he is more familiar with these files.

Once Ryan is good with these changes, we'll run them through the CI and merge once everything passes.

dyastremsky self-assigned this on Feb 21, 2024
@rmccorm4
Contributor

rmccorm4 commented Feb 21, 2024

Hi @ClifHouck, thanks for this contribution!

While you have figured out a way to have the existing logic propagate the GPU labels to the generic per-model inference metrics, I wouldn't exactly say this is a bug at the moment.

Our per-model metrics are currently aggregated per-model, even if technically under the hood they are being tracked per-model-instance. By introducing these GPU labels for metrics other than the gpu mem/util metrics, it would start to expose the notion of per-model-instance metrics for the case of KIND_GPU models with multiple model instances.

I think there is some drawback to adding this support as-is, because it will introduce some inconsistency in how our metrics are reported and aggregated. With this change, KIND_GPU models will have per-model-instance metrics, but KIND_CPU/KIND_MODEL models will not. Similarly, I think this will raise the question, for models using multiple GPUs (currently only supported via KIND_MODEL), of why the GPU labels don't show the multiple GPUs being used by those model instances.

We have a ticket in our backlog (DLIS-3959) to increase the breakdown to per-model-instance metrics (generically for all model instances, irrespective of device type), but it hasn't been prioritized over other tasks yet. Not exposing the GPU labels for these inference metrics allows the metrics to be aggregated for consistency across all cases.


Can you elaborate more on your use case and needs, and how your proposed changes or our future changes for per-gpu or per-model-instance inference metrics would directly impact you?

Thanks,
Ryan

@rmccorm4
Contributor

rmccorm4 left a comment

Blocking accidental merge while we discuss the above comments.

@ClifHouck
Author

@rmccorm4 I have to disagree that this is not a bug. Given what you have said, there are at least two here:

  1. It is not possible for certain metric information to be gathered or initialized during server initialization. Clearly MetricModelReporter expected metrics to be decisively enabled or disabled by the time that InferenceServer::Init is called. I think that's a reasonable thing to expect.
  2. If MetricModelReporter shouldn't apply GPU labels to its metrics, then that code should be changed or removed.

I can add a commit to this PR which removes the gathering and applying of GPU UUID information to model metrics. That way we solve both issues outlined above.
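
To make point 2 concrete, here is a rough sketch of the kind of guard/removal I have in mind for the label population. The function shape and the LookupGpuUuid helper are hypothetical, not the actual MetricModelReporter code:

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical sketch of label population, not the real
// MetricModelReporter::GetMetricsLabels implementation.
void GetMetricsLabels(
    std::map<std::string, std::string>* labels, const std::string& model_name,
    int64_t model_version, int device_id, bool gpu_metrics_enabled)
{
  (*labels)["model"] = model_name;
  (*labels)["version"] = std::to_string(model_version);

  // Option A: guard the GPU label so it is only attached when GPU metrics
  // are enabled and the instance actually runs on a GPU.
  if (gpu_metrics_enabled && device_id >= 0) {
    // (*labels)["gpu_uuid"] = LookupGpuUuid(device_id);  // hypothetical helper
  }

  // Option B: drop the gpu_uuid handling entirely so per-model metrics stay
  // aggregated across instances/devices.
}
```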

@rmccorm4
Contributor

rmccorm4 commented Feb 22, 2024

(1) Clearly MetricModelReporter expected metrics to be decisively enabled or disabled by the time that InferenceServer::Init is called. I think that's a reasonable thing to expect.

lserver->Init() initializes most components of the server, several of which are the components that get queried to periodically update metrics. For example, tc::Metrics::StartPollingThreadSingleton() starts a thread to poll metrics from the PinnedMemoryManager, which is initialized along with the server. So swapping these two operations does not currently make sense without greater refactoring.
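
A tiny sketch of the dependency I'm describing (illustrative types only, not the actual core classes): the polling thread reads from components that Init() creates, so the ordering can't simply be swapped around Init() without wider refactoring.

```cpp
#include <atomic>
#include <cstddef>
#include <memory>
#include <thread>

// Illustrative stand-ins, not the real core classes.
struct PinnedMemoryManager {
  size_t TotalBytes() const { return 0; }
};

struct Server {
  std::unique_ptr<PinnedMemoryManager> pinned_memory_manager;
  void Init() {
    // The real Init() creates the pinned memory manager (among many other
    // components) that the metrics polling thread later queries.
    pinned_memory_manager = std::make_unique<PinnedMemoryManager>();
  }
};

int main()
{
  Server server;
  server.Init();  // must run before the poller below has anything to query

  // Stand-in for tc::Metrics::StartPollingThreadSingleton(): periodically
  // samples components owned by the already-initialized server.
  std::atomic<bool> stop{false};
  std::thread poller([&] {
    while (!stop.load()) {
      (void)server.pinned_memory_manager->TotalBytes();
      stop.store(true);  // one iteration is enough for the sketch
    }
  });
  poller.join();
  return 0;
}
```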

I agree that this flow may be a bit unintuitive currently, since the MetricModelReporters are initialized along with the models and the model_repository_manager. In fact, if you were to use --model-control-mode=explicit and dynamically load a model with KIND_GPU after the server has started up, then the GPU labels actually do get populated for these per-model metrics.

I agree this should be resolved one way or the other to be consistent, but I think it's something we should take care to change and we need to balance our current list of priorities. If this behavior is having a significant impact on some workflow or use case, please do let us know. But otherwise I think this is something for us to revisit when we have the bandwidth to do so.

(2) If MetricModelReporter shouldn't apply GPU labels to its metrics, then that code should be changed or removed.

I agree that this code should probably be commented out with a note that it could be re-applied if per-model-instance metrics are exposed for consistency.

@rmccorm4
Contributor

rmccorm4 commented Aug 1, 2024

Hi @ClifHouck, thanks for your patience on this and sorry for the long turnaround time. Upon further reflection, I think application of the GPU labels to the inference request metrics can be useful and provide some insights into request distribution, as well as even indirectly detecting faulty GPUs.

I think there is some drawback to adding this support as-is, because it will introduce some inconsistency in how our metrics are reported and aggregated. With this change, KIND_GPU models will have per-model-instance metrics, but KIND_CPU/KIND_MODEL models will not.

In hindsight, I realize that you can have multiple model instances per GPU, so adding the GPU label wouldn't directly turn this into a per-model-instance metric; it just happens to coincide with one in the case of 1 instance per GPU.

Similarly, I think this will raise the question, for models using multiple GPUs (currently only supported via KIND_MODEL), of why the GPU labels don't show the multiple GPUs being used by those model instances.

I think this is something we will just have to accept for now until we re-work or improve the instance group feature to better account for and track models that use multiple GPUs per instance.


Overall, I want the metric labels to be consistent one way or the other, and not dependent on timing or use of dynamic model loading.

I think we should go ahead with this change, but we will need to do some testing of a few things to make sure nothing breaks. It would be great if you're willing to help add some test cases; I can point you in the right direction on where to start. Otherwise, we can help when we get the cycles as well.

Off the top of my head, I'm interested in seeing the following checked via tests (a rough checker sketch follows the list):

  1. When a model is loaded at startup with KIND_GPU, the gpu_uuid label is present on the expected metrics, for both the inference request metrics and the GPU memory/util metrics
  2. When a model is loaded at startup with KIND_CPU or KIND_MODEL, the gpu_uuid label is not present
  3. The correctly labeled metric is updated per request. For example, on a 2-GPU system with 2 model instances (1 per GPU), send 2 inference requests such that 1 request is scheduled to each GPU, and assert that the request count for each per-GPU metric is 1, rather than 0 or 2.
  4. There is no duplicate "unlabeled" metric in the KIND_GPU case such as:
nv_inference_count{gpu_uuid="GPU-1ccddeda-8caf-7d8e-cfe2-a63cddeaafd2",model="double",version="1"} 0
nv_inference_count{model="double",version="1"} 0
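
For reference, a rough sketch of the kind of assertion helper these checks (1, 2, and 4) could use, operating on the raw Prometheus text scraped from the /metrics endpoint. The helper and the example scrape below are hypothetical, not an existing test utility; our real tests may end up looking quite different.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Count nv_inference_count series for a model, split by whether they carry a
// gpu_uuid label.
void CountSeries(const std::string& metrics_text, const std::string& model,
                 int* labeled, int* unlabeled)
{
  *labeled = 0;
  *unlabeled = 0;
  std::istringstream in(metrics_text);
  std::string line;
  while (std::getline(in, line)) {
    if (line.rfind("nv_inference_count{", 0) != 0) continue;
    if (line.find("model=\"" + model + "\"") == std::string::npos) continue;
    if (line.find("gpu_uuid=\"") != std::string::npos) {
      ++(*labeled);
    } else {
      ++(*unlabeled);
    }
  }
}

int main()
{
  // Example scrape for a KIND_GPU model on a 2-GPU system.
  const std::string scrape =
      "nv_inference_count{gpu_uuid=\"GPU-aaa\",model=\"double\",version=\"1\"} 1\n"
      "nv_inference_count{gpu_uuid=\"GPU-bbb\",model=\"double\",version=\"1\"} 1\n";
  int labeled = 0, unlabeled = 0;
  CountSeries(scrape, "double", &labeled, &unlabeled);
  assert(labeled == 2);    // one labeled series per GPU (check 1)
  assert(unlabeled == 0);  // no duplicate "unlabeled" series (check 4)
  return 0;
}
```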

Lastly, I'm curious how this will impact existing workflows or dashboards that query the metrics today. Given a server today reporting request metrics with no GPU labels such as:

nv_inference_count{model="double",version="1"} 0

and then we add the GPU label after this PR on a 2-GPU system, so the metrics output becomes:

nv_inference_count{gpu_uuid="GPU-1ccddeda-8caf-7d8e-cfe2-a63cddeaafd2",model="double",version="1"} 0
nv_inference_count{gpu_uuid="GPU-<uuid-of-second-gpu>",model="double",version="1"} 0

Will we break any existing promql queries or workflows looking at nv_inference_count?

@rmccorm4
Contributor

rmccorm4 commented Aug 1, 2024

CC @chriscarollo from your issue as this PR pertains to your question.
