
Unable to see Node Metrics - Error Metrics Missing CPU for node "XXX", skipping #639

Open
jibinrajck opened this issue Feb 16, 2024 · 3 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@jibinrajck

What happened?: kubectl top nodes does not return a proper response; it fails with "error: metrics not available yet".

What did you expect to happen?: kubectl top nodes should return CPU and memory metrics for each node.

Please provide the prometheus-adapter config:

prometheus-adapter config
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  creationTimestamp: "2024-02-16T02:52:24Z"
  generation: 1
  labels:
    app.kubernetes.io/component: metrics-adapter
    app.kubernetes.io/instance: pipeline-monitoring-b9cb893b
    app.kubernetes.io/name: prometheus-adapter
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 0.11.2
  name: prometheus-adapter
  namespace: monitoring-system
  resourceVersion: "40411"
  uid: fe95f92d-788d-494b-8579-06dda771455c
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/component: metrics-adapter
      app.kubernetes.io/name: prometheus-adapter
      app.kubernetes.io/part-of: kube-prometheus
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      annotations:
        checksum.config/md5: 3b1ebf7df0232d1675896f67b66373db
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: metrics-adapter
        app.kubernetes.io/name: prometheus-adapter
        app.kubernetes.io/part-of: kube-prometheus
        app.kubernetes.io/version: 0.11.2
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  app.kubernetes.io/component: metrics-adapter
                  app.kubernetes.io/name: prometheus-adapter
                  app.kubernetes.io/part-of: kube-prometheus
              namespaces:
              - monitoring-system
              topologyKey: kubernetes.io/hostname
            weight: 100
      automountServiceAccountToken: true
      containers:
      - args:
        - --cert-dir=/var/run/serving-cert
        - --config=/etc/adapter/config.yaml
        - --metrics-relist-interval=1m
        - --prometheus-url=http://prometheus-k8s.monitoring-system.svc:9090/
        - --secure-port=6443
        - --tls-cipher-suites= XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
        image: registry.k8s.io/prometheus-adapter/prometheus-adapter:v0.11.2
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /livez
            port: https
            scheme: HTTPS
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        name: prometheus-adapter
        ports:
        - containerPort: 6443
          name: https
          protocol: TCP
        readinessProbe:
          failureThreshold: 5
          httpGet:
            path: /readyz
            port: https
            scheme: HTTPS
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 250m
            memory: 180Mi
          requests:
            cpu: 102m
            memory: 180Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          seccompProfile:
            type: RuntimeDefault
        startupProbe:
          failureThreshold: 18
          httpGet:
            path: /livez
            port: https
            scheme: HTTPS
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /tmp
          name: tmpfs
        - mountPath: /var/run/serving-cert
          name: volume-serving-cert
        - mountPath: /etc/adapter
          name: config
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/os: linux
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: prometheus-adapter
      serviceAccountName: prometheus-adapter
      terminationGracePeriodSeconds: 30
      volumes:
      - emptyDir: {}
        name: tmpfs
      - emptyDir: {}
        name: volume-serving-cert
      - configMap:
          defaultMode: 420
          name: adapter-config
        name: config
status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: "2024-02-16T02:52:24Z"
    lastUpdateTime: "2024-02-16T02:54:10Z"
    message: ReplicaSet "prometheus-adapter-fc7bc9c4d" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2024-02-16T03:22:06Z"
    lastUpdateTime: "2024-02-16T03:22:06Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 1
  readyReplicas: 2
  replicas: 2
  updatedReplicas: 2

Please provide the HPA resource used for autoscaling:

HPA yaml

Please provide the HPA status:

Please provide the prometheus-adapter logs with -v=6 around the time the issue happened:

prometheus-adapter logs
I0215 04:38:57.171364       1 handler.go:143] prometheus-metrics-adapter: GET "/apis/metrics.k8s.io/v1beta1/nodes" satisfied by gorestful with webservice /apis/metrics.k8s.io/v1beta1
I0215 04:38:57.174856       1 api.go:88] GET http://prometheus-k8s.monitoring-system.svc:9090/api/v1/query?query=sum+by+%28node%29+%28%0A++1+-+irate%28%0A++++node_cpu_seconds_total%7Bmode%3D%22idle%22%7D%5B60s%5D%0A++%29%0A++%2A+on%28namespace%2C+pod%29+group_left%28node%29+%28%0A++++node_namespace_pod%3Akube_pod_info%3A%7Bnode%3D~%22ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%22%7D%0A++%29%0A%29%0Aor+sum+by+%28node%29+%28%0A++1+-+irate%28%0A++++windows_cpu_time_total%7Bmode%3D%22idle%22%2C+job%3D%22windows-exporter%22%2Cnode%3D~%22ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%22%7D%5B4m%5D%0A++%29%0A%29%0A&time=1707971937.171 200 OK
I0215 04:38:57.174974       1 api.go:107] Response Body: {"status":"success","data":{"resultType":"vector","result":[]}}
I0215 04:38:57.178115       1 api.go:88] GET http://prometheus-k8s.monitoring-system.svc:9090/api/v1/query?query=sum+by+%28instance%29+%28%0A++node_memory_MemTotal_bytes%7Bjob%3D%22node-exporter%22%2Cinstance%3D~%22ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%22%7D%0A++-%0A++node_memory_MemAvailable_bytes%7Bjob%3D%22node-exporter%22%2Cinstance%3D~%22ip-10-161-218-141.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%22%7D%0A%29%0Aor+sum+by+%28instance%29+%28%0A++windows_cs_physical_memory_bytes%7Bjob%3D%22windows-exporter%22%2Cinstance%3D~%22ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-10-161-216-168.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%22%7D%0A++-%0A++windows_memory_available_bytes%7Bjob%3D%22windows-exporter%22%2Cinstance%3D~%22ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%22%7D%0A%29%0A&time=1707971937.171 200 OK
I0215 04:38:57.178231       1 api.go:107] Response Body: {"status":"success","data":{"resultType":"vector","result":[{"metric":{"instance":"ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal"},"value":[1707971937.171,"1563303936"]},{"metric":{"instance":"ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal"},"value":[1707971937.171,"3037470720"]},{"metric":{"instance":"ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal"},"value":[1707971937.171,"2406985728"]},{"metric":{"instance":"ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal"},"value":[1707971937.171,"1566240768"]},{"metric":{"instance":"ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal"},"value":[1707971937.171,"3543482368"]},{"metric":{"instance":"ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal"},"value":[1707971937.171,"1575890944"]}]}}
I0215 04:38:57.178655       1 provider.go:291] missing CPU for node "ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal", skipping
I0215 04:38:57.178670       1 provider.go:291] missing CPU for node "ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal", skipping
I0215 04:38:57.178676       1 provider.go:291] missing CPU for node "ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal", skipping
I0215 04:38:57.178682       1 provider.go:291] missing CPU for node "ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal", skipping
I0215 04:38:57.178688       1 provider.go:291] missing CPU for node "ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal", skipping
I0215 04:38:57.178694       1 provider.go:291] missing CPU for node "ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal", skipping
I0215 04:38:57.178918       1 httplog.go:132] "HTTP" verb="LIST" URI="/apis/metrics.k8s.io/v1beta1/nodes" latency="8.096979ms" userAgent="kubectl/v1.26.7 (linux/amd64) kubernetes/89a3d86" audit-ID="4d2e7d1c-5107-4e3a-8caa-67c0c4e2ff6f" srcIP="10.161.221.68:53746" resp=200
I0215 04:38:57.221383       1 handler.go:153] prometheus-metrics-adapter: GET "/livez" satisfied by nonGoRestful
I0215 04:38:57.221410       1 pathrecorder.go:241] prometheus-metrics-adapter: "/livez" satisfied by exact match
I0215 04:38:57.221511       1 handler.go:153] prometheus-metrics-adapter: GET "/readyz" satisfied by nonGoRestful
I0215 04:38:57.221539       1 pathrecorder.go:241] prometheus-metrics-adapter: "/readyz" satisfied by exact match
I0215 04:38:57.221513       1 httplog.go:132] "HTTP" verb="GET" URI="/livez" latency="238.051µs" userAgent="kube-probe/1.26+" audit-ID="ac3d0baa-550b-4ea1-a156-563e84503bf4" srcIP="10.161.218.141:24110" resp=200
I0215 04:38:57.221592       1 shared_informer.go:341] caches populated
I0215 04:38:57.221668       1 httplog.go:132] "HTTP" verb="GET" URI="/readyz" latency="255.747µs" userAgent="kube-probe/1.26+" audit-ID="0813d31c-df53-4066-b23d-0edea4e8db04" srcIP="10.161.218.141:24112" resp=200



Anything else we need to know?:

Environment:

  • prometheus-adapter version: v0.11.2
  • prometheus version: v0.71.2
  • Kubernetes version (use kubectl version): 1.26
  • Cloud provider or hardware configuration: AWS EKS
  • Other info:
@jibinrajck jibinrajck added the kind/bug Categorizes issue or PR as related to a bug. label Feb 16, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Feb 16, 2024
@jibinrajck jibinrajck changed the title Unable to see Node - Error Metrics Missing CPU for node "ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal", skipping Unable to see Node Metrics - Error Metrics Missing CPU for node "XXX", skipping Feb 16, 2024
@dashpole

/assign @dgrisonnet
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 22, 2024
@kariya-mitsuru

Hi, I also encountered a similar problem. (I referred to this, so strictly speaking my setup is a bit different.)

I think you are using this ConfigMap. If so, the node_cpu_usage_seconds_total and node_memory_working_set_bytes metrics used there are exposed by the kubelet's /metrics/resource endpoint, which is relatively new.

Is this endpoint scraped by Prometheus? If not, NodeMetrics cannot be generated.
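For illustration, here is a minimal sketch of what scraping the kubelet's /metrics/resource endpoint could look like with a Prometheus Operator ServiceMonitor, similar to how kube-prometheus scrapes /metrics/cadvisor. Names, namespaces, and the port name are assumptions; adjust them to your setup and check whether your existing kubelet ServiceMonitor already has such an endpoint.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kubelet                     # illustrative name
  namespace: monitoring-system      # assumed monitoring namespace
spec:
  endpoints:
  - port: https-metrics             # assumed kubelet metrics port name
    scheme: https
    path: /metrics/resource         # the endpoint exposing node_cpu_usage_seconds_total
    interval: 30s
    honorLabels: true
    bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    tlsConfig:
      insecureSkipVerify: true
  jobLabel: app.kubernetes.io/name
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      app.kubernetes.io/name: kubelet
```

If this endpoint is missing from the kubelet ServiceMonitor, node_cpu_usage_seconds_total never reaches Prometheus, and the adapter's node query returns the empty vector seen in the logs above.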

On the other hand, PodMetrics appears to be available, so I think /metrics/cadvisor is already being scraped by Prometheus. However, if both /metrics/cadvisor and /metrics/resource are scraped, this ConfigMap will produce incorrect PodMetrics values.

This is because container_cpu_usage_seconds_total and container_memory_working_set_bytes, which are used there, are exposed by both endpoints, resulting in double counting. Therefore, you will need to take measures such as assigning a metrics_path label using relabel_configs and narrowing each query down to a single endpoint.
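One way to make the two kubelet endpoints distinguishable is a relabeling that copies the scrape path into a label; this is a sketch under the assumption that a ServiceMonitor manages the kubelet scrape (with plain Prometheus scrape configs, the equivalent goes under relabel_configs):

```yaml
# Added under each kubelet ServiceMonitor endpoint: record which
# endpoint a series came from, so /metrics/cadvisor and
# /metrics/resource series can be told apart and queries can
# filter on exactly one of them.
relabelings:
- sourceLabels: [__metrics_path__]
  targetLabel: metrics_path
```

The adapter's containerQuery could then include a selector such as metrics_path="/metrics/cadvisor" to avoid the double counting.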

As an alternative solution that avoids using /metrics/resource, you could follow the approach mentioned here (I used it). However, in that case you'll need the Prometheus Node Exporter, and you'll also need to assign node names to the node label using relabel_configs.
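The node-label relabeling mentioned above could look roughly like this for a node-exporter scrape managed by a ServiceMonitor; the source label assumes pod-role service discovery, so it is illustrative only:

```yaml
# Illustrative: copy the Kubernetes node name of each node-exporter
# pod into a "node" label, so the adapter's nodeQuery (which matches
# on the node label) can associate series with NodeMetrics objects.
relabelings:
- sourceLabels: [__meta_kubernetes_pod_node_name]
  targetLabel: node
```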

@ganchkal

Hi, I had a similar issue with CPU metrics for the nodes. This solution helped me. (The 'node' label was not present in the node_cpu_seconds_total metric.)
