Correct Configuration Fails to Provide Expected Custom Metrics in EKS #663

wuyudian1 · 2024-06-04T09:24:25Z

What happened?: A correct configuration fails to provide the expected custom metrics in EKS.
We deployed identical Prometheus and Prometheus-Adapter charts in both an Alibaba Cloud ACK cluster and an AWS EKS cluster. The Prometheus and Prometheus-Adapter configurations are the same in both Kubernetes clusters. The Prometheus scrape configuration is as follows:

job_name: basicai-business-queue-wait
metrics_path: /metrics/prometheus
scheme: http
scrape_interval: 30s
honor_labels: true
kubernetes_sd_configs:
  - role: service
    namespaces:
      names:
        - basicai-backend
        - basicai-stage-backend
relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_component]
    regex: dataset
    action: keep
  - source_labels: [__meta_kubernetes_namespace]
    target_label: 'kubernetes_namespace'
    action: replace
  - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_component]
    target_label: 'kubernetes_deployment'
    action: replace
  - source_labels: [__meta_kubernetes_service_port_number]
    regex: 80
    action: keep
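
To make the intent of the relabeling rules above concrete, here is a minimal sketch that simulates them on one discovered Service target. It is a simplified model of Prometheus relabeling, not Prometheus code: the `relabel` function is hypothetical, it only handles the `labelmap`, `keep`, and `replace` actions with default replacements, and it omits labels such as `job` and `instance`. The sample label values are assumptions chosen to satisfy the `keep` rules.

```python
import re

def relabel(labels, rules):
    """Apply a simplified subset of Prometheus relabeling; return None if the target is dropped."""
    labels = dict(labels)
    for rule in rules:
        action = rule.get("action", "replace")
        regex = re.compile(rule.get("regex", "(.*)"))
        # Source label values are joined with ';' (the Prometheus default separator).
        value = ";".join(labels.get(name, "") for name in rule.get("source_labels", []))
        if action == "labelmap":
            # Copy every label whose name fully matches the regex to the captured name.
            for name, val in list(labels.items()):
                match = regex.fullmatch(name)
                if match:
                    labels[match.group(1)] = val
        elif action == "keep":
            if not regex.fullmatch(value):
                return None
        elif action == "replace":
            match = regex.fullmatch(value)
            if match:
                labels[rule["target_label"]] = match.group(1)
    # Labels starting with '__' are removed once relabeling finishes.
    return {k: v for k, v in labels.items() if not k.startswith("__")}

COMPONENT = "__meta_kubernetes_service_label_app_kubernetes_io_component"
RULES = [
    {"action": "labelmap", "regex": "__meta_kubernetes_service_label_(.+)"},
    {"action": "keep", "source_labels": [COMPONENT], "regex": "dataset"},
    {"action": "replace", "source_labels": ["__meta_kubernetes_namespace"],
     "target_label": "kubernetes_namespace"},
    {"action": "replace", "source_labels": [COMPONENT],
     "target_label": "kubernetes_deployment"},
    {"action": "keep", "source_labels": ["__meta_kubernetes_service_port_number"], "regex": "80"},
]

# Assumed discovery labels for a Service port that should be kept.
discovered = {
    "__meta_kubernetes_namespace": "basicai-backend",
    COMPONENT: "dataset",
    "__meta_kubernetes_service_port_number": "80",
}
result = relabel(discovered, RULES)
dropped = relabel({**discovered, "__meta_kubernetes_service_port_number": "8080"}, RULES)
print(result)
print(dropped)
```

Under these assumptions the kept target carries `kubernetes_namespace` and `kubernetes_deployment`, which are the labels the adapter rule maps to the namespace and deployment resources, while a target on any other port is dropped.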

The values.yaml for the Prometheus-Adapter chart is as follows:

image:
  repository: registry.talos.basic.ai/common/images/prometheus-adapter
  tag: "v0.11.2"
  pullPolicy: IfNotPresent
prometheus:
  url: http://prometheus-server
  port: 80
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 100m
    memory: 1Gi
rules:
  default: false
  custom:
  - seriesQuery: '{__name__=~"basicai_job_replica_scale_percent",container!="POD",kubernetes_namespace!="",type="dataset-upload"}'
    resources:
      template: <<.Resource>>
      overrides:
        kubernetes_namespace: {resource: "namespace"}
        kubernetes_deployment: {resource: "deployment"}
    name:
      matches: "basicai_job_replica_scale_percent"
      as: "upload_job_replica_scale_percent_dataset"
    metricsQuery: last_over_time(basicai_job_replica_scale_percent{<<.LabelMatchers>>,type="dataset-upload"}[5m])
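
For reference, a rough sketch of how the `metricsQuery` template above might be expanded for a single Deployment. This is plain string substitution, not the adapter's actual code: the function names are hypothetical, and the exact matcher order and operators the adapter emits are assumptions. The namespace and deployment values are taken from the scrape configuration.

```python
def build_label_matchers(labels):
    """Render a dict of label -> value as a comma-separated PromQL matcher list."""
    return ",".join(f'{name}="{value}"' for name, value in labels.items())

def expand_metrics_query(template, labels):
    """Substitute the <<.LabelMatchers>> placeholder in a metricsQuery template."""
    return template.replace("<<.LabelMatchers>>", build_label_matchers(labels))

TEMPLATE = (
    "last_over_time(basicai_job_replica_scale_percent"
    '{<<.LabelMatchers>>,type="dataset-upload"}[5m])'
)

# Labels the rule's `overrides` section maps to the namespace and deployment resources.
query = expand_metrics_query(
    TEMPLATE,
    {"kubernetes_namespace": "basicai-backend", "kubernetes_deployment": "dataset"},
)
print(query)
```

Running the expanded query directly against Prometheus in each cluster is one way to check whether the underlying series exists with both labels set.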

In the Alibaba Cloud ACK cluster, the Prometheus-Adapter correctly provides custom metrics:

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "custom.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "deployments.apps/upload_job_replica_scale_percent_dataset",
      "singularName": "",
      "namespaced": true,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    },
    {
      "name": "namespaces/upload_job_replica_scale_percent_dataset",
      "singularName": "",
      "namespaced": false,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    },
    {
      "name": "jobs.batch/upload_job_replica_scale_percent_dataset",
      "singularName": "",
      "namespaced": true,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    }
  ]
}
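
To compare the two clusters without reading the full listing by eye, a small helper along these lines could report which resources expose a given metric. This is only a sketch: `resources_for_metric` is a hypothetical name, and the embedded sample is a hand-trimmed mix of entries shaped like the outputs in this issue, not real cluster output.

```python
import json

def resources_for_metric(api_resource_list, metric):
    """Return the resource prefixes (e.g. 'deployments.apps') that expose `metric`."""
    prefixes = []
    for resource in api_resource_list.get("resources", []):
        # Entries are named '<resource>/<metric>'; split on the last slash.
        prefix, _, name = resource["name"].rpartition("/")
        if name == metric:
            prefixes.append(prefix)
    return prefixes

# Hand-written sample; in practice the JSON would come from
# `kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1`.
SAMPLE = json.loads("""
{
  "kind": "APIResourceList",
  "resources": [
    {"name": "deployments.apps/upload_job_replica_scale_percent_dataset", "namespaced": true},
    {"name": "namespaces/upload_job_replica_scale_percent_dataset", "namespaced": false},
    {"name": "services/authentication_duration_seconds_sum", "namespaced": true}
  ]
}
""")

found = resources_for_metric(SAMPLE, "upload_job_replica_scale_percent_dataset")
missing = resources_for_metric(SAMPLE, "no_such_metric")
print(found)
print(missing)
```

An empty list for the expected metric name would confirm that the adapter is not exposing it at all, rather than exposing it under a different resource.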

However, in the EKS cluster, Prometheus-Adapter exposes a large number of default metrics even though the rules set `default: false`, and the expected 'upload_job_replica_scale_percent_dataset' metric is not among them:

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq | head -n 50
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "custom.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "services/authentication_duration_seconds_sum",
      "singularName": "",
      "namespaced": true,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    },
    .....
    .....
    .....

What did you expect to happen?:
prometheus-adapter should provide the same custom metrics in the AWS EKS cluster as it does in the Alibaba Cloud ACK cluster.

Please provide the prometheus-adapter config:

image:
  repository: registry.talos.basic.ai/common/images/prometheus-adapter
  tag: "v0.11.2"
  pullPolicy: IfNotPresent
prometheus:
  url: http://prometheus-server
  port: 80
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 100m
    memory: 1Gi
rules:
  default: false
  custom:
  - seriesQuery: '{__name__=~"basicai_job_replica_scale_percent",container!="POD",kubernetes_namespace!="",type="dataset-upload"}'
    resources:
      template: <<.Resource>>
      overrides:
        kubernetes_namespace: {resource: "namespace"}
        kubernetes_deployment: {resource: "deployment"}
    name:
      matches: "basicai_job_replica_scale_percent"
      as: "upload_job_replica_scale_percent_dataset"
    metricsQuery: last_over_time(basicai_job_replica_scale_percent{<<.LabelMatchers>>,type="dataset-upload"}[5m])

dgrisonnet · 2024-06-13T16:55:03Z

/triage accepted
/help

k8s-ci-robot · 2024-06-13T16:55:05Z

@dgrisonnet:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

Why are we solving this issue?
To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
Does this issue have zero to low barrier of entry?
How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/triage accepted
/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

wuyudian1 added the kind/bug label (Categorizes issue or PR as related to a bug.) on Jun 4, 2024

k8s-ci-robot added the needs-triage label (Indicates an issue or PR lacks a `triage/foo` label and requires one.) on Jun 4, 2024
