
Unable to see Node Metrics - Error Metrics Missing CPU for node "XXX", skipping #639

Open
jibinrajck opened this issue Feb 16, 2024 · 3 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@jibinrajck

What happened?: kubectl top nodes does not return a proper response; it fails with "error: metrics not available yet".

What did you expect to happen?: kubectl top nodes should return CPU and memory metrics for each node.

Please provide the prometheus-adapter config:

prometheus-adapter config
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  creationTimestamp: "2024-02-16T02:52:24Z"
  generation: 1
  labels:
    app.kubernetes.io/component: metrics-adapter
    app.kubernetes.io/instance: pipeline-monitoring-b9cb893b
    app.kubernetes.io/name: prometheus-adapter
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 0.11.2
  name: prometheus-adapter
  namespace: monitoring-system
  resourceVersion: "40411"
  uid: fe95f92d-788d-494b-8579-06dda771455c
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/component: metrics-adapter
      app.kubernetes.io/name: prometheus-adapter
      app.kubernetes.io/part-of: kube-prometheus
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      annotations:
        checksum.config/md5: 3b1ebf7df0232d1675896f67b66373db
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: metrics-adapter
        app.kubernetes.io/name: prometheus-adapter
        app.kubernetes.io/part-of: kube-prometheus
        app.kubernetes.io/version: 0.11.2
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  app.kubernetes.io/component: metrics-adapter
                  app.kubernetes.io/name: prometheus-adapter
                  app.kubernetes.io/part-of: kube-prometheus
              namespaces:
              - monitoring-system
              topologyKey: kubernetes.io/hostname
            weight: 100
      automountServiceAccountToken: true
      containers:
      - args:
        - --cert-dir=/var/run/serving-cert
        - --config=/etc/adapter/config.yaml
        - --metrics-relist-interval=1m
        - --prometheus-url=http://prometheus-k8s.monitoring-system.svc:9090/
        - --secure-port=6443
        - --tls-cipher-suites= XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
        image: registry.k8s.io/prometheus-adapter/prometheus-adapter:v0.11.2
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /livez
            port: https
            scheme: HTTPS
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        name: prometheus-adapter
        ports:
        - containerPort: 6443
          name: https
          protocol: TCP
        readinessProbe:
          failureThreshold: 5
          httpGet:
            path: /readyz
            port: https
            scheme: HTTPS
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 250m
            memory: 180Mi
          requests:
            cpu: 102m
            memory: 180Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          seccompProfile:
            type: RuntimeDefault
        startupProbe:
          failureThreshold: 18
          httpGet:
            path: /livez
            port: https
            scheme: HTTPS
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /tmp
          name: tmpfs
        - mountPath: /var/run/serving-cert
          name: volume-serving-cert
        - mountPath: /etc/adapter
          name: config
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/os: linux
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: prometheus-adapter
      serviceAccountName: prometheus-adapter
      terminationGracePeriodSeconds: 30
      volumes:
      - emptyDir: {}
        name: tmpfs
      - emptyDir: {}
        name: volume-serving-cert
      - configMap:
          defaultMode: 420
          name: adapter-config
        name: config
status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: "2024-02-16T02:52:24Z"
    lastUpdateTime: "2024-02-16T02:54:10Z"
    message: ReplicaSet "prometheus-adapter-fc7bc9c4d" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2024-02-16T03:22:06Z"
    lastUpdateTime: "2024-02-16T03:22:06Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 1
  readyReplicas: 2
  replicas: 2
  updatedReplicas: 2

Please provide the HPA resource used for autoscaling:

HPA yaml

Please provide the HPA status:

Please provide the prometheus-adapter logs with -v=6 around the time the issue happened:

prometheus-adapter logs
I0215 04:38:57.171364       1 handler.go:143] prometheus-metrics-adapter: GET "/apis/metrics.k8s.io/v1beta1/nodes" satisfied by gorestful with webservice /apis/metrics.k8s.io/v1beta1
I0215 04:38:57.174856       1 api.go:88] GET http://prometheus-k8s.monitoring-system.svc:9090/api/v1/query?query=sum+by+%28node%29+%28%0A++1+-+irate%28%0A++++node_cpu_seconds_total%7Bmode%3D%22idle%22%7D%5B60s%5D%0A++%29%0A++%2A+on%28namespace%2C+pod%29+group_left%28node%29+%28%0A++++node_namespace_pod%3Akube_pod_info%3A%7Bnode%3D~%22ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%22%7D%0A++%29%0A%29%0Aor+sum+by+%28node%29+%28%0A++1+-+irate%28%0A++++windows_cpu_time_total%7Bmode%3D%22idle%22%2C+job%3D%22windows-exporter%22%2Cnode%3D~%22ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%22%7D%5B4m%5D%0A++%29%0A%29%0A&time=1707971937.171 200 OK
I0215 04:38:57.174974       1 api.go:107] Response Body: {"status":"success","data":{"resultType":"vector","result":[]}}
I0215 04:38:57.178115       1 api.go:88] GET http://prometheus-k8s.monitoring-system.svc:9090/api/v1/query?query=sum+by+%28instance%29+%28%0A++node_memory_MemTotal_bytes%7Bjob%3D%22node-exporter%22%2Cinstance%3D~%22ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%22%7D%0A++-%0A++node_memory_MemAvailable_bytes%7Bjob%3D%22node-exporter%22%2Cinstance%3D~%22ip-10-161-218-141.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%22%7D%0A%29%0Aor+sum+by+%28instance%29+%28%0A++windows_cs_physical_memory_bytes%7Bjob%3D%22windows-exporter%22%2Cinstance%3D~%22ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-10-161-216-168.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%22%7D%0A++-%0A++windows_memory_available_bytes%7Bjob%3D%22windows-exporter%22%2Cinstance%3D~%22ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%7Cip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal%22%7D%0A%29%0A&time=1707971937.171 200 OK
I0215 04:38:57.178231       1 api.go:107] Response Body: {"status":"success","data":{"resultType":"vector","result":[{"metric":{"instance":"ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal"},"value":[1707971937.171,"1563303936"]},{"metric":{"instance":"ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal"},"value":[1707971937.171,"3037470720"]},{"metric":{"instance":"ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal"},"value":[1707971937.171,"2406985728"]},{"metric":{"instance":"ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal"},"value":[1707971937.171,"1566240768"]},{"metric":{"instance":"ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal"},"value":[1707971937.171,"3543482368"]},{"metric":{"instance":"ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal"},"value":[1707971937.171,"1575890944"]}]}}
I0215 04:38:57.178655       1 provider.go:291] missing CPU for node "ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal", skipping
I0215 04:38:57.178670       1 provider.go:291] missing CPU for node "ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal", skipping
I0215 04:38:57.178676       1 provider.go:291] missing CPU for node "ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal", skipping
I0215 04:38:57.178682       1 provider.go:291] missing CPU for node "ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal", skipping
I0215 04:38:57.178688       1 provider.go:291] missing CPU for node "ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal", skipping
I0215 04:38:57.178694       1 provider.go:291] missing CPU for node "ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal", skipping
I0215 04:38:57.178918       1 httplog.go:132] "HTTP" verb="LIST" URI="/apis/metrics.k8s.io/v1beta1/nodes" latency="8.096979ms" userAgent="kubectl/v1.26.7 (linux/amd64) kubernetes/89a3d86" audit-ID="4d2e7d1c-5107-4e3a-8caa-67c0c4e2ff6f" srcIP="10.161.221.68:53746" resp=200
I0215 04:38:57.221383       1 handler.go:153] prometheus-metrics-adapter: GET "/livez" satisfied by nonGoRestful
I0215 04:38:57.221410       1 pathrecorder.go:241] prometheus-metrics-adapter: "/livez" satisfied by exact match
I0215 04:38:57.221511       1 handler.go:153] prometheus-metrics-adapter: GET "/readyz" satisfied by nonGoRestful
I0215 04:38:57.221539       1 pathrecorder.go:241] prometheus-metrics-adapter: "/readyz" satisfied by exact match
I0215 04:38:57.221513       1 httplog.go:132] "HTTP" verb="GET" URI="/livez" latency="238.051µs" userAgent="kube-probe/1.26+" audit-ID="ac3d0baa-550b-4ea1-a156-563e84503bf4" srcIP="10.161.218.141:24110" resp=200
I0215 04:38:57.221592       1 shared_informer.go:341] caches populated
I0215 04:38:57.221668       1 httplog.go:132] "HTTP" verb="GET" URI="/readyz" latency="255.747µs" userAgent="kube-probe/1.26+" audit-ID="0813d31c-df53-4066-b23d-0edea4e8db04" srcIP="10.161.218.141:24112" resp=200



Anything else we need to know?:

Environment:

  • prometheus-adapter version: v0.11.2
  • prometheus version: v0.71.2
  • Kubernetes version (use kubectl version): 1.26
  • Cloud provider or hardware configuration: AWS EKS
  • Other info:
@jibinrajck jibinrajck added the kind/bug Categorizes issue or PR as related to a bug. label Feb 16, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Feb 16, 2024
@jibinrajck jibinrajck changed the title Unable to see Node - Error Metrics Missing CPU for node "ip-xx-xxx-xxx-xxx.ap-southeast-2.compute.internal", skipping Unable to see Node Metrics - Error Metrics Missing CPU for node "XXX", skipping Feb 16, 2024
@dashpole

/assign @dgrisonnet
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 22, 2024
@kariya-mitsuru

Hi, I also encountered a similar problem. (I referred to this, so strictly speaking my setup is a bit different.)

I think you are using this ConfigMap. If so, the node_cpu_usage_seconds_total and node_memory_working_set_bytes metrics used there are exposed by the kubelet's /metrics/resource endpoint, which is relatively new.

Is this endpoint scraped by Prometheus? If not, NodeMetrics cannot be generated.
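For illustration, here is a minimal sketch of what scraping the kubelet's /metrics/resource endpoint could look like with a Prometheus Operator ServiceMonitor, similar to how kube-prometheus scrapes /metrics/cadvisor. Names, namespaces, and the port name are assumptions; adjust them to your setup and check whether your existing kubelet ServiceMonitor already has such an endpoint.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kubelet                     # illustrative name
  namespace: monitoring-system      # assumed monitoring namespace
spec:
  endpoints:
  - port: https-metrics             # assumed kubelet metrics port name
    scheme: https
    path: /metrics/resource         # the endpoint exposing node_cpu_usage_seconds_total
    interval: 30s
    honorLabels: true
    bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    tlsConfig:
      insecureSkipVerify: true
  jobLabel: app.kubernetes.io/name
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      app.kubernetes.io/name: kubelet
```

If this endpoint is missing from the kubelet ServiceMonitor, node_cpu_usage_seconds_total never reaches Prometheus, and the adapter's node query returns the empty vector seen in the logs above.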

On the other hand, PodMetrics appears to be available, so I think /metrics/cadvisor is already being scraped by Prometheus. However, if both /metrics/cadvisor and /metrics/resource are scraped, this ConfigMap will produce incorrect PodMetrics values.

This is because container_cpu_usage_seconds_total and container_memory_working_set_bytes, which are used there, are exposed by both endpoints, resulting in double counting. Therefore, you will need to take measures such as assigning a metrics_path label using relabel_configs and narrowing each query down to a single endpoint.
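One way to make the two kubelet endpoints distinguishable is a relabeling that copies the scrape path into a label; this is a sketch under the assumption that a ServiceMonitor manages the kubelet scrape (with plain Prometheus scrape configs, the equivalent goes under relabel_configs):

```yaml
# Added under each kubelet ServiceMonitor endpoint: record which
# endpoint a series came from, so /metrics/cadvisor and
# /metrics/resource series can be told apart and queries can
# filter on exactly one of them.
relabelings:
- sourceLabels: [__metrics_path__]
  targetLabel: metrics_path
```

The adapter's containerQuery could then include a selector such as metrics_path="/metrics/cadvisor" to avoid the double counting.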

As an alternative solution that avoids using /metrics/resource, you could follow the approach mentioned here (I used it). However, in that case you'll need the Prometheus Node Exporter, and you'll also need to assign node names to the node label using relabel_configs.
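The node-label relabeling mentioned above could look roughly like this for a node-exporter scrape managed by a ServiceMonitor; the source label assumes pod-role service discovery, so it is illustrative only:

```yaml
# Illustrative: copy the Kubernetes node name of each node-exporter
# pod into a "node" label, so the adapter's nodeQuery (which matches
# on the node label) can associate series with NodeMetrics objects.
relabelings:
- sourceLabels: [__meta_kubernetes_pod_node_name]
  targetLabel: node
```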

@ganchkal

Hi, I had a similar issue with CPU metrics for the nodes. This solution helped me. (The 'node' label was not present in the node_cpu_seconds_total metric.)
