Correct Configuration Fails to Provide Expected Custom Metrics in EKS #663

Open
wuyudian1 opened this issue Jun 4, 2024 · 2 comments
Labels: help wanted, kind/bug, triage/accepted

Comments

@wuyudian1

What happened?: A correct configuration fails to provide the expected custom metrics in EKS.
We deployed the same Prometheus chart and Prometheus-Adapter chart in both an Alibaba Cloud ACK cluster and an AWS EKS cluster. The Prometheus and Prometheus-Adapter configurations are identical in both Kubernetes clusters. The Prometheus scrape configuration is as follows:

job_name: basicai-business-queue-wait
metrics_path: /metrics/prometheus
scheme: http
scrape_interval: 30s
honor_labels: true
kubernetes_sd_configs:
  - role: service
    namespaces:
      names:
        - basicai-backend
        - basicai-stage-backend
relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_component]
    regex: dataset
    action: keep
  - source_labels: [__meta_kubernetes_namespace]
    target_label: 'kubernetes_namespace'
    action: replace
  - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_component]
    target_label: 'kubernetes_deployment'
    action: replace
  - source_labels: [__meta_kubernetes_service_port_number]
    regex: 80
    action: keep
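
To make the effect of the relabeling rules above concrete, here is a minimal Python sketch that mimics them on a dictionary of discovered labels. It is a simplified model written for illustration, not Prometheus's implementation; the function name `relabel` and the sample label values are hypothetical.

```python
import re

COMPONENT = "__meta_kubernetes_service_label_app_kubernetes_io_component"

def relabel(labels):
    """Simplified model of the five rules; returns None when the target is dropped."""
    out = dict(labels)

    # labelmap: copy each service label to a label named after the captured group.
    pattern = re.compile(r"__meta_kubernetes_service_label_(.+)")
    for name, value in labels.items():
        m = pattern.fullmatch(name)
        if m:
            out[m.group(1)] = value

    component = labels.get(COMPONENT, "")

    # keep: only services whose component label is exactly "dataset"
    # (relabel regexes are anchored, hence fullmatch).
    if not re.fullmatch("dataset", component):
        return None

    # replace: these two labels are the ones the adapter rule's overrides refer to.
    out["kubernetes_namespace"] = labels.get("__meta_kubernetes_namespace", "")
    out["kubernetes_deployment"] = component

    # keep: only the service port numbered 80.
    if not re.fullmatch("80", labels.get("__meta_kubernetes_service_port_number", "")):
        return None

    # Labels starting with "__" are discarded once relabeling is finished.
    return {k: v for k, v in out.items() if not k.startswith("__")}

# Hypothetical discovered labels for one service port.
sample = {
    "__meta_kubernetes_namespace": "basicai-backend",
    COMPONENT: "dataset",
    "__meta_kubernetes_service_port_number": "80",
}
result = relabel(sample)
print(result)
```

Because the first keep rule only admits services whose component label is `dataset`, every series from this job ends up with `kubernetes_deployment="dataset"`.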

The values.yaml for Prometheus-Adapter chart is as follows:

image:
  repository: registry.talos.basic.ai/common/images/prometheus-adapter
  tag: "v0.11.2"
  pullPolicy: IfNotPresent
prometheus:
  url: http://prometheus-server
  port: 80
resources:
   requests:
     cpu: 100m
     memory: 128Mi
   limits:
     cpu: 100m
     memory: 1Gi
rules:
  default: false
  custom:
  - seriesQuery: '{__name__=~"basicai_job_replica_scale_percent",container!="POD",kubernetes_namespace!="",type="dataset-upload"}'
    resources:
      template: <<.Resource>>
      overrides:
        kubernetes_namespace: {resource: "namespace"}
        kubernetes_deployment: {resource: "deployment"}
    name:
      matches: "basicai_job_replica_scale_percent"
      as: "upload_job_replica_scale_percent_dataset"
    metricsQuery: last_over_time(basicai_job_replica_scale_percent{<<.LabelMatchers>>,type="dataset-upload"}[5m])
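
As a rough illustration of what the rule above asks the adapter to do, the following Python sketch renames the series and fills in the `<<.LabelMatchers>>` placeholder for a single deployment. It is a simplified model under assumptions, not the adapter's code: `exposed_name` and `render_metrics_query` are hypothetical helpers, the namespace `basicai-backend` and deployment name `dataset` are example values, and the exact matcher order and formatting the adapter emits may differ.

```python
import re

# Query template copied from the rule.
METRICS_QUERY = (
    "last_over_time(basicai_job_replica_scale_percent"
    '{<<.LabelMatchers>>,type="dataset-upload"}[5m])'
)

# Resource-to-label overrides from the rule.
OVERRIDES = {"namespace": "kubernetes_namespace", "deployment": "kubernetes_deployment"}

def exposed_name(series_name,
                 matches="basicai_job_replica_scale_percent",
                 as_name="upload_job_replica_scale_percent_dataset"):
    """Return the custom-metric name for a series, or None if `matches` does not apply."""
    return as_name if re.search(matches, series_name) else None

def render_metrics_query(namespace, deployment):
    """Substitute label matchers built from the overrides into the query template."""
    matchers = ",".join([
        f'{OVERRIDES["namespace"]}="{namespace}"',
        f'{OVERRIDES["deployment"]}="{deployment}"',
    ])
    return METRICS_QUERY.replace("<<.LabelMatchers>>", matchers)

print(exposed_name("basicai_job_replica_scale_percent"))
print(render_metrics_query("basicai-backend", "dataset"))
```

Running the rendered query directly against Prometheus in each cluster is one way to check whether the series the rule depends on actually exists there.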

In the Alibaba Cloud ACK cluster, the Prometheus-Adapter correctly provides custom metrics:

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "custom.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "deployments.apps/upload_job_replica_scale_percent_dataset",
      "singularName": "",
      "namespaced": true,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    },
    {
      "name": "namespaces/upload_job_replica_scale_percent_dataset",
      "singularName": "",
      "namespaced": false,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    },
    {
      "name": "jobs.batch/upload_job_replica_scale_percent_dataset",
      "singularName": "",
      "namespaced": true,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    }
  ]
}
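
Checking a long listing by eye is error-prone, so a small script can help. The sketch below parses an `APIResourceList` like the one above and reports which resources expose a given metric. The helper name `resources_exposing` is hypothetical, and the embedded JSON is an abbreviated copy of the ACK response with only the fields the script uses.

```python
import json

def resources_exposing(listing, metric):
    """Return the resource prefixes (e.g. 'deployments.apps') that expose `metric`."""
    found = []
    for entry in listing.get("resources", []):
        resource, _, name = entry["name"].partition("/")
        if name == metric:
            found.append(resource)
    return found

# Abbreviated copy of the ACK response (only the fields used here).
ack_listing = json.loads("""
{
  "kind": "APIResourceList",
  "groupVersion": "custom.metrics.k8s.io/v1beta1",
  "resources": [
    {"name": "deployments.apps/upload_job_replica_scale_percent_dataset", "namespaced": true},
    {"name": "namespaces/upload_job_replica_scale_percent_dataset", "namespaced": false},
    {"name": "jobs.batch/upload_job_replica_scale_percent_dataset", "namespaced": true}
  ]
}
""")

found = resources_exposing(ack_listing, "upload_job_replica_scale_percent_dataset")
print(found)  # ['deployments.apps', 'namespaces', 'jobs.batch']
```

To use it against a live cluster, the same function can be fed the output of the `kubectl get --raw` command above, for example via `json.load(sys.stdin)`.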

However, in the EKS cluster, the Prometheus-Adapter exposes a large number of default metrics, even though rules.default is set to false, and the expected 'upload_job_replica_scale_percent_dataset' is not among them:

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq | head -n 50
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "custom.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "services/authentication_duration_seconds_sum",
      "singularName": "",
      "namespaced": true,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    },
    .....
    .....
    .....

What did you expect to happen?:
prometheus-adapter should provide the same custom metrics in the AWS EKS cluster as it does in the Alibaba Cloud ACK cluster.
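
The expectation can be stated as a set comparison between the two clusters' listings. The sketch below is one hedged way to express it; `metric_names` and `compare` are hypothetical helpers, and the EKS data contains only the single entry visible in this issue because the rest of that output was truncated.

```python
def metric_names(listing):
    """Collect the metric part of every 'resource/metric' entry in an APIResourceList."""
    return {entry["name"].split("/", 1)[1] for entry in listing["resources"]}

def compare(reference, candidate):
    """Return (missing, extra): names only in `reference`, names only in `candidate`."""
    ref, cand = metric_names(reference), metric_names(candidate)
    return ref - cand, cand - ref

# ACK entries as reported in this issue.
ack = {"resources": [
    {"name": "deployments.apps/upload_job_replica_scale_percent_dataset"},
    {"name": "namespaces/upload_job_replica_scale_percent_dataset"},
    {"name": "jobs.batch/upload_job_replica_scale_percent_dataset"},
]}

# Only the one EKS entry visible in this issue; the remaining entries are elided there.
eks = {"resources": [
    {"name": "services/authentication_duration_seconds_sum"},
]}

missing, extra = compare(ack, eks)
print("missing in EKS:", sorted(missing))
print("only in EKS:", sorted(extra))
```

An empty `missing` set for the EKS cluster would correspond to the expected behaviour described above.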

Please provide the prometheus-adapter config:

Identical to the values.yaml for the Prometheus-Adapter chart shown above.
wuyudian1 added the kind/bug label on Jun 4, 2024
k8s-ci-robot added the needs-triage label on Jun 4, 2024
@dgrisonnet
Member

/triage accepted
/help

@k8s-ci-robot
Contributor

@dgrisonnet:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/triage accepted
/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot added the triage/accepted and help wanted labels and removed the needs-triage label on Jun 13, 2024
3 participants