
Investigate appropriate monitoring/alerts needed for BU classes using OpenShift AI in NERC #2

Open
hpdempsey opened this issue Jan 12, 2024 · 14 comments
Labels: OPE, question (Further information is requested)

@hpdempsey

There are two classes starting for the Spring semester. They will be using Jupyter notebooks through the OpenShift AI software. We would like to make sure that appropriate monitoring is in place for the classes, and also determine whether we need any alerts so we know when a class might be failing or impacted by another failure. Classes start on Jan 18, so it would be good to have this set up in NERC by then.

The two classes are ECE440SPRING2024 and CS210 - Computer System @ Boston University (these are the ColdFront project names; they might differ from the OpenShift namespaces).

There are about 300 students in CS210 and about 40 in ECE440, so clearly we need to avoid anything that will generate hundreds of alerts.

If you find appropriate monitoring/alerting already in place, no further action is needed beyond documenting it. If additional work is needed, either add to this issue or create new ones, but the work should be finished by Jan 18 if possible.

@hpdempsey added the question (Further information is requested) label on Jan 12, 2024
@hpdempsey
Author

@dystewart can provide more info on implementation if needed.

@hpdempsey
Author

The rhods-notebooks namespace is actually used for creating the students' containers in both classes, not the project namespaces.

@schwesig

schwesig commented Jan 15, 2024

Status:

  • we get some general alerts in Slack about the prod cluster
  • nothing specific to the rhods-notebooks namespace yet

Idea:

  • take the basic alerts (e.g. memory usage in %) from here as a starting point
  • filter them to the rhods-notebooks namespace
  • create previews here to review/vote
  • then use a new Slack channel just for class alerts

@dystewart
Collaborator

dystewart commented Jan 15, 2024

@schwesig I think this is a great path forward

We'll want to get alerts for the basic utilization numbers like you said, particularly those resources controlled by quota (OCP-on-NERC/nerc-ocp-config#340), at least at the namespace level; I'm not sure it would be worth reporting alerts per user workload.

We also definitely want to look for pod creation failures and ImagePullBackOff.

And directing these to a Slack channel is definitely ideal; we could call it something like ope-prod-alerts.
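
For the pod/image failures, a minimal PromQL sketch (assuming kube-state-metrics is scraped for the prod cluster, same as the quota metrics) could look something like:

    # pods in rhods-notebooks stuck waiting on an image pull; illustrative expression, not the final alert
    sum by (pod) (
      kube_pod_container_status_waiting_reason{cluster="nerc-ocp-prod", namespace="rhods-notebooks", reason="ImagePullBackOff"}
    ) > 0

Pod creation failures would need a similar expression on a different reason or metric; worth checking what kube-state-metrics exposes for those before settling on one.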

@schwesig

OCP-on-NERC/nerc-ocp-config#341

    limits.cpu: '1000'
    limits.ephemeral-storage: 30Gi
    limits.memory: 3000000Mi
    persistentvolumeclaims: '400'
    requests.storage: 400Gi
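
For reference, these values map onto a ResourceQuota spec roughly like the following (a sketch only; the real object in nerc-ocp-config may use a different name and extra metadata):

    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: rhods-notebooks-quota   # hypothetical name; see nerc-ocp-config for the actual manifest
      namespace: rhods-notebooks
    spec:
      hard:
        limits.cpu: '1000'
        limits.ephemeral-storage: 30Gi
        limits.memory: 3000000Mi
        persistentvolumeclaims: '400'
        requests.storage: 400Gi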

@schwesig

channel name: alerts-prod-rhods-ope
To have one ready (time matters), we'll go with this name for now.
If there are concerns or better/easier/... ideas, let's discuss and, if needed, rename afterwards.

long run: rhods or rhoai? (thanks @joachimweyl)

@schwesig

Slack channel webhook exists and was tested; not yet in Vault (due to some issues)
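
If these end up routed through Alertmanager rather than Grafana contact points, the wiring would look roughly like this once the webhook is in Vault (receiver name and matcher are assumptions, not the actual config):

    # sketch of an Alertmanager receiver/route for the class alerts
    receivers:
      - name: alerts-prod-rhods-ope
        slack_configs:
          - channel: '#alerts-prod-rhods-ope'
            api_url: https://hooks.slack.com/services/...   # the tested webhook, to be pulled from Vault
            send_resolved: true
    route:
      routes:
        - receiver: alerts-prod-rhods-ope
          matchers:
            - namespace = "rhods-notebooks"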

@dystewart
Collaborator

I think in the long run rhoai; might as well get ahead of the new naming scheme

@schwesig

schwesig commented Jan 17, 2024

limits.cpu: '1000'

  • rhods-notebooks
    • Trigger: usage greater than 80%
      • sum(kube_resourcequota{cluster="nerc-ocp-prod",namespace="rhods-notebooks",resource="limits.cpu",type="used"})/sum(kube_resourcequota{cluster="nerc-ocp-prod",namespace="rhods-notebooks",resource="limits.cpu",type="hard"}) >.80
    • Graph: limit vs usage
      • kube_resourcequota{cluster="nerc-ocp-prod",namespace="rhods-notebooks",resource="limits.cpu"}

limits.ephemeral-storage: 30Gi

  • rhods-notebooks
    • Trigger: usage greater than 80%
      • sum(kube_resourcequota{cluster="nerc-ocp-prod",namespace="rhods-notebooks",resource="limits.ephemeral-storage",type="used"})/sum(kube_resourcequota{cluster="nerc-ocp-prod",namespace="rhods-notebooks",resource="limits.ephemeral-storage",type="hard"}) >.80
    • Graph: limit vs usage
      • kube_resourcequota{cluster="nerc-ocp-prod",namespace="rhods-notebooks",resource="limits.ephemeral-storage"}

limits.memory: 3000000Mi

  • rhods-notebooks
    • Trigger: usage greater than 80%
      • sum(kube_resourcequota{cluster="nerc-ocp-prod",namespace="rhods-notebooks",resource="limits.memory",type="used"})/sum(kube_resourcequota{cluster="nerc-ocp-prod",namespace="rhods-notebooks",resource="limits.memory",type="hard"}) >.80
    • Graph: limit vs usage
      • kube_resourcequota{cluster="nerc-ocp-prod",namespace="rhods-notebooks",resource="limits.memory"}

persistentvolumeclaims: '400'

  • rhods-notebooks
    • Trigger: usage greater than 80%
      • sum(kube_resourcequota{cluster="nerc-ocp-prod",namespace="rhods-notebooks",resource="persistentvolumeclaims",type="used"})/sum(kube_resourcequota{cluster="nerc-ocp-prod",namespace="rhods-notebooks",resource="persistentvolumeclaims",type="hard"}) >.80
    • Graph: limit vs usage
      • kube_resourcequota{cluster="nerc-ocp-prod",namespace="rhods-notebooks",resource="persistentvolumeclaims"}

requests.storage: 400Gi

  • rhods-notebooks
    • Trigger: usage greater than 80%
      • sum(kube_resourcequota{cluster="nerc-ocp-prod",namespace="rhods-notebooks",resource="requests.storage",type="used"})/sum(kube_resourcequota{cluster="nerc-ocp-prod",namespace="rhods-notebooks",resource="requests.storage",type="hard"}) >.80
    • Graph: limit vs usage
      • kube_resourcequota{cluster="nerc-ocp-prod",namespace="rhods-notebooks",resource="requests.storage"}
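
One way to package the triggers above, e.g. the limits.cpu one, would be a PrometheusRule roughly like this (a sketch; group/alert names and the 15m hold time are made up here, not settled):

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: rhods-notebooks-quota-alerts   # hypothetical name
      namespace: rhods-notebooks
    spec:
      groups:
        - name: rhods-notebooks-quota
          rules:
            - alert: RhodsNotebooksCpuQuotaHigh
              expr: |
                sum(kube_resourcequota{cluster="nerc-ocp-prod",namespace="rhods-notebooks",resource="limits.cpu",type="used"})
                  /
                sum(kube_resourcequota{cluster="nerc-ocp-prod",namespace="rhods-notebooks",resource="limits.cpu",type="hard"}) > 0.80
              for: 15m
              labels:
                severity: warning
              annotations:
                summary: rhods-notebooks is above 80% of its limits.cpu quota

The same shape would repeat for the other four resources with just the resource label swapped.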

@schwesig

growth max over time

@schwesig

v2

@schwesig

pod growth warning

@DanNiESh added the OPE label on Jan 24, 2024
@schwesig

schwesig commented Feb 7, 2024

Research done:

  • alerting is possible as needed
  • Slack webhook is established
  • Slack channel is already announcing alerts from a test Grafana
  • next step: deploy it in the current version
  • work on possibly better/more alerts and dashboards, following this WIP: https://github.com/OCP-on-NERC/docs/pull/58/files

@schwesig

schwesig commented Feb 7, 2024

/close
