Missing collector for scheduled (success|failure) events #68

krx252525 · 2018-06-12T06:40:32Z

Primary Concern

I'd like some help to understand whether or not I've missed something when following the README and the guides on help.sumologic.com ... kubernetes.

I seem to have most dashboards working with the exception of scheduler related panels like Kubernetes - Overview -> Pods Scheduled By Namespace which is driven by the following query:

_sourceCategory = *kube-scheduler*
| timeslice 1h
| parse "Successfully assigned * to *\"" as name2,node
| parse "reason: '*'" as reason
| parse "type: '*'" as normal
| parse "Name:\\\"*\\\"" as name
| parse "Namespace:\\\"*\\\"" as namespace
| parse "Kind:\\\"*\\\"" as kind
| count by _timeslice, namespace
| transpose row _timeslice column namespace
| fillmissing timeslice(1h)

The problem is that the line this query is driven by is not logged by the scheduler but emitted as an event. The only piece of the documentation I can see that would be able to push this to Sumo is the sumologic-k8s-api script, which is noticeably lacking any calls to /api/v1/events as well as the role for calling it.
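To illustrate the gap, here is a minimal sketch of picking scheduling results out of an events list response. It assumes the payload has the shape returned by listing events from the core v1 API (an `items` array whose entries carry `reason`, `message`, and `involvedObject`) and that the scheduler reports outcomes with the reasons `Scheduled` and `FailedScheduling`. The function name and the sample payload are hypothetical and not part of sumologic-k8s-api.

```python
# Event reasons assumed to be used by the scheduler for its outcomes.
SCHEDULING_REASONS = {"Scheduled", "FailedScheduling"}


def scheduling_events(event_list):
    """Return a flat summary of the scheduling events in an events list payload."""
    summaries = []
    for event in event_list.get("items", []):
        if event.get("reason") not in SCHEDULING_REASONS:
            continue
        involved = event.get("involvedObject", {})
        summaries.append({
            "reason": event["reason"],
            "namespace": involved.get("namespace"),
            "name": involved.get("name"),
            "message": event.get("message", ""),
        })
    return summaries


# Hypothetical sample payload, trimmed to the fields used above.
sample = {
    "items": [
        {"reason": "Scheduled",
         "message": "Successfully assigned web-1 to node-a",
         "involvedObject": {"kind": "Pod", "namespace": "default", "name": "web-1"}},
        {"reason": "Pulled",
         "message": "Container image already present on machine",
         "involvedObject": {"kind": "Pod", "namespace": "default", "name": "web-1"}},
        {"reason": "FailedScheduling",
         "message": "0/3 nodes are available",
         "involvedObject": {"kind": "Pod", "namespace": "batch", "name": "job-7"}},
    ]
}

for summary in scheduling_events(sample):
    print(summary)
```

Only the `Scheduled` and `FailedScheduling` entries survive the filter, which is the information the Pods Scheduled By Namespace panel needs.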

I've tested a fix which would add these log lines and can submit it as a PR against sumologic-k8s-api but I feel like I've missed something obvious.

Secondary concern

I see some of the panels are driven by queries which extract fields which don't fill me with confidence that I've got things configured correctly:
Kubernetes - Controller Manager -> Event Severity Trend using the following query:

_sourceCategory = *kube-controller-manager*
| parse "\"message\":\"*\"" as message
| parse "\"source\":\"*.*:*\"" as resource,resource_action,resource_code
| parse "\"severity\":\"*\"" as severity
| fields - resource_action, resource_code 
| timeslice 1h
| count _timeslice, severity 
| transpose row _timeslice column severity
| fillmissing timeslice(1h)

Which matches this log line:

{
"timestamp": 1528785188171,
"severity": "I",
"pid": "1",
"source": "round_trippers.go:439", 
"message": "Response Status: 200 OK in 2 milliseconds"
}

Where resource_action, resource_code would match go and 439 respectively. Is this correct?
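To make the question concrete, here is a rough sketch of how the three wildcards would split the sample `source` value. It approximates the parse anchor with a non-greedy regular expression, which is an assumption about how the wildcards behave and not Sumo Logic's actual parser. It also assumes the ingested line is compact JSON with no space after the colon, unlike the pretty-printed sample above.

```python
import json
import re

# Approximation of the parse anchor "\"source\":\"*.*:*\"": each * becomes a
# non-greedy capture group. This mimics the wildcards only; it is not Sumo's parser.
SOURCE_PATTERN = re.compile(r'"source":"(.*?)\.(.*?):(.*?)"')


def parse_source(raw_line):
    """Return (resource, resource_action, resource_code), or None if no match."""
    match = SOURCE_PATTERN.search(raw_line)
    return match.groups() if match else None


record = {
    "timestamp": 1528785188171,
    "severity": "I",
    "pid": "1",
    "source": "round_trippers.go:439",
    "message": "Response Status: 200 OK in 2 milliseconds",
}
# Compact separators so the line contains "source":"..." without a space.
raw_line = json.dumps(record, separators=(",", ":"))

resource, resource_action, resource_code = parse_source(raw_line)
print(resource, resource_action, resource_code)  # round_trippers go 439
```

Under these assumptions the fields hold a Go file name, its extension, and a line number, which is why the names `resource_action` and `resource_code` look suspicious.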


frankreno · 2018-06-18T22:47:47Z

@keir-rex can you provide the following information?

1. What version of k8s?
2. Where is it running?
3. Is it a managed service (GKE/EKS), or do you manage the cluster yourself (kops/kubeadm)?
4. Can you share your YAML?

These logs did exist at some point, very possible they have been tweaked in a new release or things have changed in the underlying logging of the scheduler so this will help me figure out what is going on.

krx252525 · 2018-06-25T04:01:13Z

@frankreno

1. v1.9.6 (kubectl version output below)
2. AWS
3. kops
4. Provided below

sumologic-k8s-api

I rebuilt your image to also hit /api/v1/events. The added code is:

        log.info("getting data for events")
        events = requests.get(url="{}/api/v1/events".format(self.k8s_api_url)).json()
        for event in events["items"]:
            log.info("pushing to sumo")
            requests.post(url=self.collector_url,
                          data=json.dumps(event),
                          headers=self.headers)

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: sumologic-k8s-api
  labels:
    app: sumologic-k8s-api
spec:
  schedule: "*/5 * * * *"
  successfulJobsHistoryLimit: 10
  failedJobsHistoryLimit: 10
  concurrencyPolicy: Replace
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccount: sumologic-k8s-api
          restartPolicy: OnFailure
          containers:
          - name:  sumologic-k8s-api
            imagePullPolicy: Always
            image: frankreno/sumologic-k8s-api:events
            env:
            - name: SUMO_HTTP_URL
              value: <INSERT_URL_HERE>
            - name: K8S_API_URL
              value: http://127.0.0.1:8001
            - name: X-Sumo-Category
              value: k8s/api
            - name: X-Sumo-Name
              value: sumologic-k8s-api
          - name:  kubectl
            image: gcr.io/google_containers/kubectl:v1.0.7
            command: ["/kubectl"]
            args: ["proxy", "-p", "8001"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: sumologic-k8s-api
  labels:
    app: sumologic-k8s-api
rules:
- apiGroups: [""]
  resources: ["nodes", "pods", "events"]
  verbs: ["get", "list"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sumologic-k8s-api
  labels:
    app: sumologic-k8s-api
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: sumologic-k8s-api
  labels:
    app: sumologic-k8s-api
subjects:
- kind: ServiceAccount
  name: sumologic-k8s-api
  namespace: default
roleRef:
  kind: ClusterRole
  name: sumologic-k8s-api
  apiGroup: rbac.authorization.k8s.io

fluentd-kubernetes-sumologic

This is basically vanilla:

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluentd

---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: fluentd
rules:
- apiGroups:
  - ""
  resources:
  - namespaces
  - pods
  verbs:
  - get
  - list
  - watch

---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: fluentd
roleRef:
  kind: ClusterRole
  name: fluentd
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: ServiceAccount
  name: fluentd
  # This namespace setting will limit fluentd to watching/listing/getting pods in the default namespace. If you want it to be able to log your kube-system namespace as well, comment the line out.
  namespace: default

--- 
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: fluentd-sumologic
  labels:
    app: fluentd-sumologic
    version: v1
spec:
  template:
    metadata:
      labels:
        name: fluentd-sumologic
    spec:
      serviceAccountName: fluentd
      volumes:
      - name: pos-files
        emptyDir: {}
      - name: host-logs
        hostPath:
          path: /var/log/
      - name: docker-logs
        hostPath:
          path: /var/lib/docker
      containers:
      - image: sumologic/fluentd-kubernetes-sumologic:latest
        name: fluentd
        imagePullPolicy: Always
        volumeMounts:
        - name: host-logs
          mountPath: /mnt/log/
          readOnly: true
        - name: host-logs
          mountPath: /var/log/
          readOnly: true
        - name: docker-logs
          mountPath: /var/lib/docker/
          readOnly: true
        - name: pos-files
          mountPath: /mnt/pos/
        env:
        - name: COLLECTOR_URL
          valueFrom:
            secretKeyRef:
              name: sumologic
              key: collector-url
      tolerations:
          #- operator: "Exists"
          - effect: "NoSchedule"
            key: "node-role.kubernetes.io/master"

kubectl version:

Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.2", GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", BuildDate:"2018-05-12T04:12:12Z", GoVersion:"go1.9.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.6", GitCommit:"9f8ebd171479bec0ada837d7ee641dec2f8c6dd1", GitTreeState:"clean", BuildDate:"2018-03-21T15:13:31Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

frankreno · 2018-06-25T19:45:03Z

@keir-rex thanks for the info. So this appears to be a change in 1.9.x. I have a 1.8 cluster and a 1.9 cluster, and the scheduler is not producing the same logs on both. I will try to track down the source and work on a remediation for this.

krx252525 · 2018-06-27T00:49:38Z

Cheers @frankreno let me know if there's anything I can help with

frankreno · 2018-06-29T20:41:18Z

@keir-rex still no response from the folks on the scheduling team for k8s, so I do not yet have a good answer as to why this changed or how to remedy it. I found the code where the log used to be generated and see no changes that would account for this, which just means the change is not coming from the scheduler but from somewhere else. Will keep you updated. Long term, we are working on a new metrics collection strategy for Kubernetes that does not use heapster, which will allow us to collect from many more data sources and provide insights into this. Let's keep this issue open until we solve it one of those ways...

krx252525 · 2018-07-04T07:41:41Z

Sounds good @frankreno. I'll throw together something which does de-duping of events since we need that anyway.
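A minimal sketch of what such de-duping could look like, assuming each event carries `metadata.uid` and `metadata.resourceVersion` and that a repeated occurrence updates the same object with a new `resourceVersion`. The function name and sample events are hypothetical, and the `seen` set would need to be persisted somewhere between runs if the collector stays a CronJob, since each run starts a fresh process.

```python
def new_events(events, seen):
    """Return events not yet forwarded and record their keys in `seen`."""
    fresh = []
    for event in events:
        meta = event.get("metadata", {})
        # The same uid with a new resourceVersion counts as a new state to forward.
        key = (meta.get("uid"), meta.get("resourceVersion"))
        if key in seen:
            continue
        seen.add(key)
        fresh.append(event)
    return fresh


# Hypothetical results of two consecutive polls of the events endpoint.
seen = set()
first_poll = [
    {"metadata": {"uid": "a1", "resourceVersion": "100"}, "reason": "Scheduled"},
    {"metadata": {"uid": "b2", "resourceVersion": "101"}, "reason": "FailedScheduling"},
]
second_poll = first_poll + [
    {"metadata": {"uid": "b2", "resourceVersion": "105"}, "reason": "FailedScheduling"},
]

first_batch = new_events(first_poll, seen)
second_batch = new_events(second_poll, seen)
print(len(first_batch), len(second_batch))  # 2 1
```

Keying on the pair rather than on `uid` alone means an event that recurs is forwarded again with its updated state, while an unchanged event returned by a later poll is skipped.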

Could you comment on my second query on my initial post?

Cheers

ankitgoelcmu · 2018-07-13T15:50:09Z

@keir-rex that's right. I see [218, 42, 205, 363 and 374] as code, 'event' as a resource, and 'go' as resource_action. However, I have to revisit these to make sure they follow proper naming conventions.

krx252525 mentioned this issue Jun 13, 2018

Missing configuration for scheduled (success #67

Closed
