stackrox: set e2e-benchmarking EXTRA_FLAGS to include --metrics-profile acs metrics url #57412

Merged

Contributor

@davdhacs davdhacs commented Oct 2, 2024

Test of kube-burner/kube-burner-ocp#111

New test-run with KUBE_BURNER_VERSION=1.4.0

  • Run with new release that includes the argument.

The e2e-benchmarking log shows that the $cmd used to call kube-burner-ocp includes the EXTRA_FLAGS string with the new kube-burner-ocp argument:

+ /tmp/kube-burner-ocp cluster-density-v2 --log-level=info --qps=20 --burst=20 --gc=true --uuid 8f61f145-274d-4532-910d-c313f1b0a6e1 --metrics-profile https://raw.githubusercontent.com/stackrox/stackrox/refs/heads/master/tests/performance/scale/tests/kube-burner/cluster-density/metrics.yml --gc-metrics=true --profile-type=both --iterations=216 --churn=true --es-server=https://XXXXXXXXX:XXXXXXXXXXXXXXXXXXXXXXXX@XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX --es-index=ripsaw-kube-burner
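
A minimal sketch of how an EXTRA_FLAGS string can end up in a command line like the one logged above. Everything except the EXTRA_FLAGS name, the flags, and the URL is an assumption for illustration: WORKLOAD, METRICS_URL, and build_cmd are hypothetical names, and the real wrapper script is not shown in this thread.

```shell
# Sketch only: append a caller-supplied EXTRA_FLAGS string to a kube-burner-ocp command line.
# WORKLOAD, METRICS_URL and build_cmd are hypothetical names, not taken from the wrapper.

WORKLOAD="cluster-density-v2"
METRICS_URL="https://raw.githubusercontent.com/stackrox/stackrox/refs/heads/master/tests/performance/scale/tests/kube-burner/cluster-density/metrics.yml"

# In CI this value would come from the job's env block; it is set here so the sketch runs on its own.
EXTRA_FLAGS="--metrics-profile ${METRICS_URL}"

build_cmd() {
  # Fixed flags first, then whatever the caller supplied in EXTRA_FLAGS.
  echo "/tmp/kube-burner-ocp ${WORKLOAD} --log-level=info --qps=20 --burst=20 --gc=true ${EXTRA_FLAGS}"
}

cmd="$(build_cmd)"
echo "${cmd}"
```

Keeping the extra flags in one string means a job can change them through its env block without editing the wrapper.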

The kube-burner log then shows that the metrics file was loaded from the URL and that the metrics defined in that file are collected:

time="2024-10-04 18:30:42" level=info msg="🔍 Endpoint: https://prometheus-k8s-openshift-monitoring.apps.ci-op-79p6y2l5-802fc.XXXXXXXXX.rox.systems; profile: config/https:/raw.githubusercontent.com/stackrox/stackrox/refs/heads/master/tests/performance/scale/tests/kube-burner/cluster-density/metrics.yml start: 2024-10-04T17:16:56Z end: 2024-10-04T18:25:48Z; job: cluster-density-v2" file="prometheus.go:69"
...
time="2024-10-04 18:31:12" level=info msg="Indexing [103574] documents from metric sensor_rox_sensor_events_network_policy_store_total" file="prometheus.go:233"
time="2024-10-04 18:31:29" level=info msg="Indexing finished in 16.859s: created=103574" file="prometheus.go:238"

@openshift-ci openshift-ci bot added the do-not-merge/hold label ("Indicates that a PR should not merge because someone has issued a /hold command.") Oct 2, 2024
@davdhacs
Contributor Author

davdhacs commented Oct 2, 2024

/pj-rehearse periodic-ci-openshift-qe-ocp-qe-perfscale-ci-main-aws-4.16-nightly-x86-control-plane-24nodes-acs

@openshift-ci-robot
Contributor

@davdhacs: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@davdhacs
Contributor Author

davdhacs commented Oct 3, 2024

/pj-rehearse periodic-ci-openshift-qe-ocp-qe-perfscale-ci-main-aws-4.16-nightly-x86-control-plane-24nodes-acs

@openshift-ci-robot
Contributor

@davdhacs: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@davdhacs
Contributor Author

davdhacs commented Oct 3, 2024

/pj-rehearse periodic-ci-openshift-qe-ocp-qe-perfscale-ci-main-aws-4.16-nightly-x86-control-plane-24nodes-acs

@openshift-ci-robot
Contributor

@davdhacs: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@davdhacs davdhacs changed the title from "stackrox: Rox 26061 allow e2e repo fork" to "stackrox: set e2e-benchmarking EXTRA_FLAGS to include --metrics-profile acs metrics url" Oct 3, 2024
@@ -55,7 +55,10 @@ tests:
env:
BASE_DOMAIN: perfscale.rox.systems
COMPUTE_NODE_REPLICAS: "24"
E2E_REPOSITORY: https://github.com/davdhacs/e2e-benchmarking
Member

I just cut a new kube-burner-ocp version, v1.4.0, that includes the patch for metrics profiles. You can override the kube-burner version for this step with the env var KUBE_BURNER_VERSION: https://github.com/cloud-bulldozer/e2e-benchmarking/blob/master/workloads/kube-burner-ocp-wrapper/run.sh#L11C23-L11C42
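
A sketch of the usual default-with-override pattern for an env var such as KUBE_BURNER_VERSION. Whether the wrapper uses exactly this form is an assumption; DEFAULT_KUBE_BURNER_VERSION, its placeholder value, and resolve_kube_burner_version are hypothetical names introduced for illustration.

```shell
# Hypothetical default; the version actually pinned by the wrapper is not shown in this thread.
DEFAULT_KUBE_BURNER_VERSION="0.0.0-placeholder"

# Print the caller's KUBE_BURNER_VERSION if it is set and non-empty, otherwise the default.
resolve_kube_burner_version() {
  echo "${KUBE_BURNER_VERSION:-${DEFAULT_KUBE_BURNER_VERSION}}"
}

# Simulate the override a CI job would provide through its env block.
KUBE_BURNER_VERSION="1.4.0"
echo "using kube-burner-ocp $(resolve_kube_burner_version)"
```

With this pattern a job only needs to export the variable; leaving it unset keeps whatever default the script pins.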

Contributor Author

Thank you! I'll switch this PR to use that.

@davdhacs
Contributor Author

davdhacs commented Oct 4, 2024

/pj-rehearse periodic-ci-openshift-qe-ocp-qe-perfscale-ci-main-aws-4.16-nightly-x86-control-plane-24nodes-acs

@openshift-ci-robot
Contributor

@davdhacs: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@davdhacs
Contributor Author

davdhacs commented Oct 9, 2024

The test run used metrics-aggregated.yml together with the ACS metrics profile (metrics-acs.yml):

time="2024-10-09 19:44:48" level=info msg="Indexing finished in 29ms: created=2" file="metadata.go:64"
time="2024-10-09 19:44:48" level=info msg="🔍 Endpoint: https://prometheus-k8s-openshift-monitoring.apps.ci-op-yy48vz5c-802fc.XXXXXXXXX.rox.systems; profile: config/metrics-aggregated.yml start: 2024-10-09T18:30:27Z end: 2024-10-09T19:39:54Z; job: cluster-density-v2" file="prometheus.go:69"
time="2024-10-09 19:44:51" level=info msg="🔍 Endpoint: https://prometheus-k8s-openshift-monitoring.apps.ci-op-yy48vz5c-802fc.XXXXXXXXX.rox.systems; profile: config/https:/raw.githubusercontent.com/stackrox/stackrox/refs/heads/master/tests/performance/scale/config/metrics-acs.yml start: 2024-10-09T18:30:27Z end: 2024-10-09T19:39:54Z; job: cluster-density-v2" file="prometheus.go:69"
time="2024-10-09 19:44:56" level=info msg="🔍 Endpoint: https://prometheus-k8s-openshift-monitoring.apps.ci-op-yy48vz5c-802fc.XXXXXXXXX.rox.systems; profile: config/metrics-report.yml start: 2024-10-09T18:30:27Z end: 2024-10-09T19:39:54Z; job: cluster-density-v2" file="prometheus.go:69"
time="2024-10-09 19:44:59" level=info msg="🔍 Endpoint: https://prometheus-k8s-openshift-monitoring.apps.ci-op-yy48vz5c-802fc.XXXXXXXXX.rox.systems; profile: config/metrics-aggregated.yml start: 2024-10-09T19:39:54Z end: 2024-10-09T19:44:46Z; job: garbage-collection" file="prometheus.go:69"
time="2024-10-09 19:45:00" level=info msg="🔍 Endpoint: https://prometheus-k8s-openshift-monitoring.apps.ci-op-yy48vz5c-802fc.XXXXXXXXX.rox.systems; profile: config/https:/raw.githubusercontent.com/stackrox/stackrox/refs/heads/master/tests/performance/scale/config/metrics-acs.yml start: 2024-10-09T19:39:54Z end: 2024-10-09T19:44:46Z; job: garbage-collection" file="prometheus.go:69"
time="2024-10-09 19:45:03" level=info msg="🔍 Endpoint: https://prometheus-k8s-openshift-monitoring.apps.ci-op-yy48vz5c-802fc.XXXXXXXXX.rox.systems; profile: config/metrics-report.yml start: 2024-10-09T19:39:54Z end: 2024-10-09T19:44:46Z; job: garbage-collection" file="prometheus.go:69"
time="2024-10-09 19:45:05" level=info msg="Indexing [2] documents from metric max-cpu-openshift-controller-manager" file="prometheus.go:233"
time="2024-10-09 19:45:05" level=info msg="Indexing finished in 80ms: created=2" file="prometheus.go:238"
time="2024-10-09 19:45:05" level=info msg="Indexing [151] documents from metric central_tar_file_count_per_layer_sum" file="prometheus.go:233"
time="2024-10-09 19:45:05" level=info msg="Indexing finished in 456ms: created=151" file="prometheus.go:238"
time="2024-10-09 19:45:05" level=info msg="Indexing [2] documents from metric max-cpu-crio" file="prometheus.go:233"

@davdhacs
Contributor Author

davdhacs commented Oct 9, 2024

/pj-rehearse periodic-ci-openshift-qe-ocp-qe-perfscale-ci-main-aws-4.16-nightly-x86-control-plane-24nodes-acs

@openshift-ci-robot
Contributor

@davdhacs: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-ci-robot
Contributor

[REHEARSALNOTIFIER]
@davdhacs: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

| Test name | Repo | Type | Reason |
| --- | --- | --- | --- |
| pull-ci-openshift-cluster-network-operator-master-qe-perfscale-aws-ovn-medium-node-density-cni | openshift/cluster-network-operator | presubmit | Registry content changed |
| pull-ci-openshift-cluster-network-operator-master-qe-perfscale-aws-ovn-small-node-density-cni | openshift/cluster-network-operator | presubmit | Registry content changed |
| pull-ci-openshift-cluster-network-operator-release-4.19-qe-perfscale-aws-ovn-medium-node-density-cni | openshift/cluster-network-operator | presubmit | Registry content changed |
| pull-ci-openshift-cluster-network-operator-release-4.19-qe-perfscale-aws-ovn-small-node-density-cni | openshift/cluster-network-operator | presubmit | Registry content changed |
| pull-ci-openshift-cluster-network-operator-release-4.18-qe-perfscale-aws-ovn-medium-node-density-cni | openshift/cluster-network-operator | presubmit | Registry content changed |
| pull-ci-openshift-cluster-network-operator-release-4.18-qe-perfscale-aws-ovn-small-node-density-cni | openshift/cluster-network-operator | presubmit | Registry content changed |
| pull-ci-openshift-cluster-network-operator-release-4.17-qe-perfscale-aws-ovn-medium-node-density-cni | openshift/cluster-network-operator | presubmit | Registry content changed |
| pull-ci-openshift-cluster-network-operator-release-4.17-qe-perfscale-aws-ovn-small-node-density-cni | openshift/cluster-network-operator | presubmit | Registry content changed |
| pull-ci-openshift-cluster-network-operator-release-4.16-qe-perfscale-aws-ovn-medium-node-density-cni | openshift/cluster-network-operator | presubmit | Registry content changed |
| pull-ci-openshift-cluster-network-operator-release-4.16-qe-perfscale-aws-ovn-small-node-density-cni | openshift/cluster-network-operator | presubmit | Registry content changed |
| pull-ci-openshift-cluster-network-operator-release-4.15-qe-perfscale-aws-ovn-medium-node-density-cni | openshift/cluster-network-operator | presubmit | Registry content changed |
| pull-ci-openshift-cluster-network-operator-release-4.15-qe-perfscale-aws-ovn-small-node-density-cni | openshift/cluster-network-operator | presubmit | Registry content changed |
| pull-ci-openshift-cluster-network-operator-release-4.14-qe-perfscale-aws-ovn-medium-node-density-cni | openshift/cluster-network-operator | presubmit | Registry content changed |
| pull-ci-openshift-cluster-network-operator-release-4.14-qe-perfscale-aws-ovn-small-node-density-cni | openshift/cluster-network-operator | presubmit | Registry content changed |
| pull-ci-openshift-kubernetes-master-perfscale-control-plane-6nodes | openshift/kubernetes | presubmit | Registry content changed |
| pull-ci-openshift-kubernetes-release-4.19-perfscale-control-plane-6nodes | openshift/kubernetes | presubmit | Registry content changed |
| pull-ci-openshift-kubernetes-release-4.18-perfscale-control-plane-6nodes | openshift/kubernetes | presubmit | Registry content changed |
| pull-ci-openshift-ovn-kubernetes-master-qe-perfscale-aws-ovn-medium-node-density-cni | openshift/ovn-kubernetes | presubmit | Registry content changed |
| pull-ci-openshift-ovn-kubernetes-master-qe-perfscale-aws-ovn-small-node-density-cni | openshift/ovn-kubernetes | presubmit | Registry content changed |
| pull-ci-openshift-ovn-kubernetes-release-4.19-qe-perfscale-aws-ovn-medium-node-density-cni | openshift/ovn-kubernetes | presubmit | Registry content changed |
| pull-ci-openshift-ovn-kubernetes-release-4.19-qe-perfscale-aws-ovn-small-node-density-cni | openshift/ovn-kubernetes | presubmit | Registry content changed |
| pull-ci-openshift-ovn-kubernetes-release-4.18-qe-perfscale-aws-ovn-medium-node-density-cni | openshift/ovn-kubernetes | presubmit | Registry content changed |
| pull-ci-openshift-ovn-kubernetes-release-4.18-qe-perfscale-aws-ovn-small-node-density-cni | openshift/ovn-kubernetes | presubmit | Registry content changed |
| pull-ci-openshift-ovn-kubernetes-release-4.17-qe-perfscale-aws-ovn-medium-node-density-cni | openshift/ovn-kubernetes | presubmit | Registry content changed |
| pull-ci-openshift-ovn-kubernetes-release-4.17-qe-perfscale-aws-ovn-small-node-density-cni | openshift/ovn-kubernetes | presubmit | Registry content changed |

A total of 214 jobs have been affected by this change. The above listing is non-exhaustive and limited to 25 jobs.

A full list of affected jobs can be found here

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

@rsevilla87
Member

Is this ready to merge, @davdhacs? If so, please comment /pj-rehearse ack.

@rsevilla87
Member

/cc @jtaleric

@openshift-ci openshift-ci bot requested a review from jtaleric October 10, 2024 07:41
@jtaleric
Contributor

lgtm - only concern I would have is the number of metrics some of these queries return 😨

time="2024-10-09 22:19:36" level=info msg="Indexing [113112] documents from metric central_rox_central_postgres_op_duration_bucket" file="prometheus.go:233"
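
To see which metrics dominate the indexed volume, the per-metric document counts can be pulled out of the kube-burner log. A small sketch, assuming the log lines keep the "Indexing [N] documents from metric NAME" shape shown in this thread; docs_per_metric is a hypothetical helper name, and the sample input reuses two lines quoted in this conversation.

```shell
# Print "<documents> <metric>" for every "Indexing [N] documents from metric NAME" line,
# largest count first. Reads the files given as arguments, or stdin when none are given.
docs_per_metric() {
  sed -n 's/.*msg="Indexing \[\([0-9][0-9]*\)] documents from metric \([^"]*\)".*/\1 \2/p' "$@" | sort -rn
}

# Demo on two log lines quoted in this thread.
docs_per_metric <<'EOF'
time="2024-10-09 19:45:05" level=info msg="Indexing [151] documents from metric central_tar_file_count_per_layer_sum" file="prometheus.go:233"
time="2024-10-09 22:19:36" level=info msg="Indexing [113112] documents from metric central_rox_central_postgres_op_duration_bucket" file="prometheus.go:233"
EOF
```

Run against a full log file, the first few lines of output point at the queries most worth aggregating.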

@davdhacs
Contributor Author

lgtm - only concern I would have is the number of metrics some of these queries return 😨

time="2024-10-09 22:19:36" level=info msg="Indexing [113112] documents from metric central_rox_central_postgres_op_duration_bucket" file="prometheus.go:233"

@mtodor should we change these metrics to reduce the volume before this runs (often)?

@davdhacs
Contributor Author

/pj-rehearsal ack

@davdhacs
Contributor Author

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold label ("Indicates that a PR should not merge because someone has issued a /hold command.") Oct 10, 2024
@davdhacs
Contributor Author

/pj-rehearse ack

@openshift-ci-robot
Contributor

@davdhacs: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-ci-robot openshift-ci-robot added the rehearsals-ack label ("Signifies that rehearsal jobs have been acknowledged") Oct 10, 2024
@rsevilla87
Member

rsevilla87 commented Oct 11, 2024

lgtm - only concern I would have is the number of metrics some of these queries return 😨

time="2024-10-09 22:19:36" level=info msg="Indexing [113112] documents from metric central_rox_central_postgres_op_duration_bucket" file="prometheus.go:233"

@mtodor should we change these metrics to reduce the volume before this runs (often)?

In the metrics file https://raw.githubusercontent.com/stackrox/stackrox/refs/heads/master/tests/performance/scale/config/metrics-acs.yml, you're capturing a lot of raw Prometheus time series. You should consider adding some aggregation expressions (sum, rate, histogram_quantile, etc.) to reduce the number of documents.

Indexing that many documents can lead to performance issues in the Elasticsearch database.
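
A hedged sketch of what such aggregation could look like, written as a shell snippet that generates an illustrative profile file. The `query` and `metricName` keys follow the kube-burner metrics profile layout; the specific queries, the 2m rate window, the metricName values, and the assumption that the second metric is a counter are illustrative guesses, not the contents of metrics-acs.yml.

```shell
# Write an illustrative aggregated metrics profile to a temporary file.
profile="$(mktemp)"

cat > "${profile}" <<'EOF'
# Assumed replacement for a raw histogram bucket query:
# one p99 latency series instead of one series per bucket and label combination.
- query: histogram_quantile(0.99, sum(rate(central_rox_central_postgres_op_duration_bucket[2m])) by (le))
  metricName: P99CentralPostgresOpDuration

# Assumed replacement for a raw counter (treated as a counter because of its _total suffix):
# a single summed rate.
- query: sum(rate(sensor_rox_sensor_events_network_policy_store_total[2m]))
  metricName: sensorNetworkPolicyStoreEventsRate
EOF

echo "wrote ${profile}"
```

Each aggregated query returns one series per sample step instead of one per label combination, which is what cuts the number of documents indexed.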

@davdhacs
Contributor Author

Indexing that many documents can lead to performance issues in the Elasticsearch database.

@mtodor If you're okay with this, please add a /lgtm (I think only you and I are affected right now, since we don't share this Elasticsearch instance with anyone). If you would rather adjust the metrics first, we can hold this.

@mtodor

mtodor commented Oct 16, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm label ("Indicates that a PR is ready to be merged.") Oct 16, 2024
@davdhacs
Contributor Author

@rsevilla87 could you /lgtm this PR for us? (We're not in the OWNERS files for these.)

@davdhacs
Contributor Author

@mtodor and I discussed the volume of these metrics and decided to start with this, then iterate on aggregating and reducing the volume as we start using the data. Since we're on a separate Elasticsearch instance, we will not be a bad neighbor even if this is too much data right now.

@jtaleric
Contributor

@mtodor and I discussed the volume of these metrics and decided to start with this, then iterate on aggregating and reducing the volume as we start using the data. Since we're on a separate Elasticsearch instance, we will not be a bad neighbor even if this is too much data right now.

ack!

Contributor

@jtaleric jtaleric left a comment

/lgtm

Contributor

openshift-ci bot commented Oct 16, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: davdhacs, jtaleric, mtodor

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label ("Indicates a PR has been approved by an approver from all required OWNERS files.") Oct 16, 2024
Contributor

openshift-ci bot commented Oct 16, 2024

@davdhacs: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 1ca86f1 into openshift:master Oct 16, 2024
16 checks passed
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
lgtm: Indicates that a PR is ready to be merged.
rehearsals-ack: Signifies that rehearsal jobs have been acknowledged
5 participants