replace shared-informers' methods with cached client #461
Conversation
@@ -273,8 +273,8 @@ func (s *TestSignupServiceSuite) TestGetSignupFailsWithNotFoundThenOtherError()
 	s.Application.MockInformerService(inf)
 	svc := service.NewSignupService(
 		fake.MemberClusterServiceContext{
-			Client: s,
+			CrtClient: s,
 			Svcs: s.Application,
Why do we need to rename the field to CrtClient?
Because there is a new method called Client, and you cannot have an exported field with the same name. At the end of the day (after this whole refactoring madness), this will all go away.
ah right, I missed that
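For context, a tiny hypothetical Go sketch (simplified, not the actual fake package) of why the rename is forced: Go forbids a type from having a field and a method with the same name, so once a Client() method exists the field has to become CrtClient.

```go
package fake

import runtimeclient "sigs.k8s.io/controller-runtime/pkg/client"

// MemberClusterServiceContext is a simplified stand-in for the real fake context.
type MemberClusterServiceContext struct {
	// A field named Client would not compile here, because the type already
	// declares a Client() method below; hence the rename to CrtClient.
	CrtClient runtimeclient.Client
	Svcs      interface{} // placeholder for the services field
}

// Client is the new accessor method that caused the name clash.
func (c MemberClusterServiceContext) Client() runtimeclient.Client {
	return c.CrtClient
}
```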
Codecov Report
Attention: Patch coverage is …
Additional details and impacted files
@@ Coverage Diff @@
## master #461 +/- ##
==========================================
+ Coverage 78.69% 79.45% +0.76%
==========================================
Files 42 42
Lines 2774 2726 -48
==========================================
- Hits 2183 2166 -17
+ Misses 494 471 -23
+ Partials 97 89 -8
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Quality Gate passed.
Looks good! 👍
Thanks for the simplification 🙏
I have only one query: do we have any data/idea about the performance impact of this switch? I mean, will it be faster/same/slower, or are we unsure?
Great question @mfrancisc.
I was mostly curious - maybe there are some docs/blog posts providing a comparison/pros and cons between the two approaches.
As I see it, the biggest benefits/pros are:
So I have some numbers: I created 16k UserSignups, MURs, Spaces, and SpaceBindings (I mean 16k of each resource kind).
As for the memory consumption of the reg-service pods:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We used controller-runtime client(s) in our first proxy version, as far as I remember, assuming that the built-in client cache would be enough. But the performance was not consistent: every few minutes of constant proxy usage, the performance degraded. It looked like the cache was "reset" every few minutes or so.
This was the main reason why we switched to informers in the first place.
But maybe we did something differently back then? If so, what exactly did we do differently, and why should it work now?
I might be missing or forgetting something though.
We also might need to do additional performance testing (or collaborate with the konflux folks to closely monitor any potential performance issues after deploying it and be ready to revert it if needed).
cc: @rajivnathan
I can test that - I still have the CRC with all the resources.
TBH, I briefly discussed this plan with @rajivnathan, and he didn't mention anything about such issues.
No, I'm not sure! Let me try to find the original code so we can compare.
I don't see any cache reset or performance difference after 15, 40, or 60 minutes:
$ oc get cm --token=XXX --server=https://api-toolchain-host-operator.apps-crc.testing --kubeconfig /tmp/dummy-kubeconfig --cache-dir /tmp/dummy-dir/ -v 6
I0918 08:33:54.361713 2595794 loader.go:373] Config loaded from file: /tmp/dummy-kubeconfig
I0918 08:33:54.434089 2595794 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/api?timeout=32s 200 OK in 71 milliseconds
I0918 08:33:54.520956 2595794 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/apis?timeout=32s 200 OK in 79 milliseconds
I0918 08:33:54.656424 2595794 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/api/v1/namespaces/mjobanek-dev/configmaps?limit=500 200 OK in 97 milliseconds
NAME DATA AGE
kube-root-ca.crt 1 15h
openshift-service-ca.crt 1 15h
# waited for 40 minutes
$ rm -rf /tmp/dummy-dir/
$ oc get cm --token=XXX --server=https://api-toolchain-host-operator.apps-crc.testing --kubeconfig /tmp/dummy-kubeconfig --cache-dir /tmp/dummy-dir/ -v 6
I0918 09:12:53.866156 2597468 loader.go:373] Config loaded from file: /tmp/dummy-kubeconfig
I0918 09:12:53.908376 2597468 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/api?timeout=32s 200 OK in 41 milliseconds
I0918 09:12:53.929227 2597468 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/apis?timeout=32s 200 OK in 19 milliseconds
I0918 09:12:53.968135 2597468 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/api/v1/namespaces/mjobanek-dev/configmaps?limit=500 200 OK in 23 milliseconds
NAME DATA AGE
kube-root-ca.crt 1 16h
openshift-service-ca.crt 1 16h
# waited for 15 minutes
$ rm -rf /tmp/dummy-dir/
$ oc get cm --token=XXX --server=https://api-toolchain-host-operator.apps-crc.testing --kubeconfig /tmp/dummy-kubeconfig --cache-dir /tmp/dummy-dir/ -v 6
I0918 09:18:20.803630 2598547 loader.go:373] Config loaded from file: /tmp/dummy-kubeconfig
I0918 09:18:20.833579 2598547 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/api?timeout=32s 200 OK in 29 milliseconds
I0918 09:18:20.853703 2598547 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/apis?timeout=32s 200 OK in 19 milliseconds
I0918 09:18:20.895189 2598547 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/api/v1/namespaces/mjobanek-dev/configmaps?limit=500 200 OK in 26 milliseconds
NAME DATA AGE
kube-root-ca.crt 1 16h
openshift-service-ca.crt 1 16h
# waited for more than 1 hour
$ rm -rf /tmp/dummy-dir/
$ oc get cm --token=XXX --server=https://api-toolchain-host-operator.apps-crc.testing --kubeconfig /tmp/dummy-kubeconfig --cache-dir /tmp/dummy-dir/ -v 6
I0918 10:21:25.334231 2604585 loader.go:373] Config loaded from file: /tmp/dummy-kubeconfig
I0918 10:21:25.382830 2604585 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/api?timeout=32s 200 OK in 47 milliseconds
I0918 10:21:25.407047 2604585 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/apis?timeout=32s 200 OK in 23 milliseconds
I0918 10:21:25.452966 2604585 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/api/v1/namespaces/mjobanek-dev/configmaps?limit=500 200 OK in 30 milliseconds
NAME DATA AGE
kube-root-ca.crt 1 17h
openshift-service-ca.crt 1 17h
There is also no difference in the memory consumption of the registration-service pods.
I realized one thing now: since I deleted the cache dir before each call, the client also had to repeat the discovery requests (the extra GET /api and /apis calls above), so the timings weren't fully comparable.
OK, here are the numbers. To be sure, I also measured the execution time of the whole oc command:
$ time oc get cm --token=XXX --server=https://api-toolchain-host-operator.apps-crc.testing --kubeconfig /tmp/dummy-kubeconfig --cache-dir /tmp/dummy-dir/ -v 6
I0918 10:37:57.186574 2606282 loader.go:373] Config loaded from file: /tmp/dummy-kubeconfig
I0918 10:37:57.214022 2606282 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/api?timeout=32s 200 OK in 27 milliseconds
I0918 10:37:57.235949 2606282 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/apis?timeout=32s 200 OK in 20 milliseconds
I0918 10:37:57.269956 2606282 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/api/v1/namespaces/mjobanek-dev/configmaps?limit=500 200 OK in 21 milliseconds
NAME DATA AGE
kube-root-ca.crt 1 17h
openshift-service-ca.crt 1 17h
real 0m0.107s
user 0m0.118s
sys 0m0.024s
# waited for 22 minutes
$ time oc get cm --token=XXX --server=https://api-toolchain-host-operator.apps-crc.testing --kubeconfig /tmp/dummy-kubeconfig --cache-dir /tmp/dummy-dir/ -v 6
I0918 10:59:18.052817 2607023 loader.go:373] Config loaded from file: /tmp/dummy-kubeconfig
I0918 10:59:18.094227 2607023 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/api/v1/namespaces/mjobanek-dev/configmaps?limit=500 200 OK in 34 milliseconds
NAME DATA AGE
kube-root-ca.crt 1 18h
openshift-service-ca.crt 1 18h
real 0m0.146s
user 0m0.148s
sys 0m0.033s
# waited for almost 90 minutes and made multiple calls:
$ time oc get cm --token=XXX --server=https://api-toolchain-host-operator.apps-crc.testing --kubeconfig /tmp/dummy-kubeconfig --cache-dir /tmp/dummy-dir/ -v 6
I0918 12:26:00.902698 2610103 loader.go:373] Config loaded from file: /tmp/dummy-kubeconfig
I0918 12:26:00.960992 2610103 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/api/v1/namespaces/mjobanek-dev/configmaps?limit=500 200 OK in 52 milliseconds
NAME DATA AGE
kube-root-ca.crt 1 19h
openshift-service-ca.crt 1 19h
real 0m0.180s
user 0m0.109s
sys 0m0.045s
$ time oc get cm --token=XXX --server=https://api-toolchain-host-operator.apps-crc.testing --kubeconfig /tmp/dummy-kubeconfig --cache-dir /tmp/dummy-dir/ -v 6
I0918 12:26:09.873033 2610208 loader.go:373] Config loaded from file: /tmp/dummy-kubeconfig
I0918 12:26:09.925783 2610208 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/api/v1/namespaces/mjobanek-dev/configmaps?limit=500 200 OK in 40 milliseconds
NAME DATA AGE
kube-root-ca.crt 1 19h
openshift-service-ca.crt 1 19h
real 0m0.191s
user 0m0.187s
sys 0m0.039s
$ time oc get cm --token=XXX --server=https://api-toolchain-host-operator.apps-crc.testing --kubeconfig /tmp/dummy-kubeconfig --cache-dir /tmp/dummy-dir/ -v 6
I0918 12:26:18.126960 2610483 loader.go:373] Config loaded from file: /tmp/dummy-kubeconfig
I0918 12:26:18.156485 2610483 round_trippers.go:553] GET https://api-toolchain-host-operator.apps-crc.testing/api/v1/namespaces/mjobanek-dev/configmaps?limit=500 200 OK in 23 milliseconds
NAME DATA AGE
kube-root-ca.crt 1 19h
openshift-service-ca.crt 1 19h
real 0m0.107s
user 0m0.118s
sys 0m0.024s
So there is no difference, really.
There is actually a metric that should show the processing time of the requests in the proxy, but I have no clue how to visualize that easily in the dev cluster.
OK. In the original code we didn't use the controller-runtime cluster's client. We used a regular client instead. So I guess this is why you couldn't reproduce this problem with "cache reset" every few minutes.
Thank you for the extensive performance testing! I think it's good enough. And if we get complaints about degraded proxy performance, then we can revisit it. But for now I think we are good! 👍
I have just a couple of minor comments.
Great job!
-	objectsToList := []client.ObjectList{&toolchainv1alpha1.ToolchainConfigList{}, &corev1.SecretList{}}
 	for i := range objectsToList {
 		if err := hostCluster.GetClient().List(ctx, objectsToList[i], client.InNamespace(configuration.Namespace())); err != nil {
+	objectsToList := map[string]client.ObjectList{
I wonder how we could make it easier to NOT forget to update this list when we need to access more resource kinds via the client... Maybe add a comment to https://github.com/codeready-toolchain/host-operator/blob/master/deploy/registration-service/registration-service.yaml#L20 ?
We list only those objects we want to cache upfront. As mentioned in the comment above this code, the cache works in a lazy way, so it starts caching a GVK's resources only after that GVK is fetched for the first time.
In other words, the cache is populated only after the first call for the GVK resource is made.
If we didn't do the list here, then the very first call to the reg-service/proxy would take a lot of time because the client would start populating the cache with the resources.
By listing the resources before the reg-service pod gets ready, we ensure that the cache is fully synced with all the resources we need to have cached.
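To make the lazy-cache behaviour concrete, here is a minimal, hypothetical Go sketch (not the PR code; the objectsToList name mirrors the snippet above, the function name and error wrapping are illustrative) of warming up a started, cache-backed controller-runtime client by listing the kinds up front:

```go
package proxycache

import (
	"context"
	"fmt"

	toolchainv1alpha1 "github.com/codeready-toolchain/api/api/v1alpha1"
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// warmUpCache issues one List per kind so that the cache-backed client creates
// and syncs the informer for each GVK before the pod reports ready. Without
// this, the informer would be created lazily on the first user-facing request.
func warmUpCache(ctx context.Context, cl client.Client, namespace string) error {
	objectsToList := map[string]client.ObjectList{
		"ToolchainConfig": &toolchainv1alpha1.ToolchainConfigList{},
		"Secret":          &corev1.SecretList{},
	}
	for kind, list := range objectsToList {
		// the first List for a kind starts its informer and waits for the initial sync
		if err := cl.List(ctx, list, client.InNamespace(namespace)); err != nil {
			return fmt.Errorf("failed to warm up the cache for %s: %w", kind, err)
		}
	}
	return nil
}
```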
I got that. My point is that it's easy to forget to add a new resource to this list. Let's say we are introducing some change to the proxy so it now needs to access a new resource with kind X. If we don't add it to this list, then the proxy performance may degrade for the first request. It will degrade again after the proxy is restarted, for every replica. So, for example, if there are 10 replicas of reg-service/proxy, then up to 10 requests per deployment/version would be impacted.
You can't access a new kind without adding this X resource to the operator role in https://github.com/codeready-toolchain/host-operator/blob/master/deploy/registration-service/registration-service.yaml#L20
So I was wondering if that yaml is a good place to add a reminder to update the cache initialization if desired.
As discussed in the call:
- I'll add a comment to the Role template
- I'll drop the ConfigMap from the Role
- I'll try to make the list for loop a bit more generic so it goes over all toolchain kinds except for some specific ones like TierTemplate which we don't want to cache (there can be many of them and they are huge resources)
As for items 1 and 2, see the PRs:
codeready-toolchain/host-operator#1089
#463
As for the 3rd item, it's easily doable to go over all kinds of the toolchain API, but there is one problem: the scheme contains the APIs of both host & member, and at the scheme level we cannot decide which kind is a host resource and which belongs to the member. We could list the available API groups from the cluster and exclude the kinds that are not available, but this wouldn't work in single-cluster environments. In addition to that, the host-operator SA doesn't have permission to read/list member-operator CRDs, so it would fail on authorization as well.
In other words, there are too many complications in doing it in a generic way, so it's not worth it.
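For illustration only, a hedged sketch of what such a generic warm-up loop from item 3 might have looked like (the group constant, function name, and exclusion set are assumptions, not project code); it also shows the limitation discussed above, since the scheme alone cannot separate host kinds from member kinds:

```go
package proxycache

import (
	"context"
	"strings"

	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// toolchainGroup is assumed here for the sketch; the real group name lives in the toolchain API package.
const toolchainGroup = "toolchain.dev.openshift.com"

// warmUpAllToolchainKinds iterates every list kind registered in the scheme for
// the toolchain group and pre-lists it, skipping kinds we don't want to cache.
func warmUpAllToolchainKinds(ctx context.Context, cl client.Client, scheme *runtime.Scheme, namespace string) error {
	excluded := map[string]bool{
		"TierTemplateList": true, // many instances and huge objects, not worth caching
	}
	for gvk := range scheme.AllKnownTypes() {
		if gvk.Group != toolchainGroup || !strings.HasSuffix(gvk.Kind, "List") || excluded[gvk.Kind] {
			continue
		}
		obj, err := scheme.New(gvk) // instantiate e.g. *UserSignupList
		if err != nil {
			return err
		}
		list, ok := obj.(client.ObjectList)
		if !ok {
			continue
		}
		if err := cl.List(ctx, list, client.InNamespace(namespace)); err != nil {
			return err
		}
	}
	return nil
}
```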
"NSTemplateTier": &toolchainv1alpha1.NSTemplateTierList{}, | ||
"ToolchainConfig": &toolchainv1alpha1.ToolchainConfigList{}, | ||
"BannedUser": &toolchainv1alpha1.BannedUserList{}, | ||
"Secret": &corev1.SecretList{}} |
"Secret": &corev1.SecretList{}} | |
"Secret": &corev1.SecretList{}, | |
"Config": &corev1.ConfigMapList{}} |
While I don't see that we actually access CMs in reg-service, it's listed here:
https://github.com/codeready-toolchain/host-operator/blob/master/deploy/registration-service/registration-service.yaml#L43
So I wonder whether we should either add it to the cache initialization or remove it from the role.
I believe Matous added these resources based on the informers we were creating in informers.go. If my memory serves me correctly, the resources listed there are those needed in the proxy flow, so we pre-populate the cache for them so that the proxy is as fast as possible even on first-time use by a user. I don't remember whether ConfigMaps are part of that flow, but I don't think it hurts to add them.
@rajivnathan is correct, we want to cache only those resources we access in reg-service. Caching anything else doesn't make much sense.
We could theoretically drop from the list some resources like Secrets, NSTemplateTiers, ProxyPlugins, ToolchainStatuses, and ToolchainConfigs, because we don't expect many of these resources to be present in the namespace, so the very first call wouldn't take much more time compared to when they are already cached. But I thought that it's also completely fine to pre-populate the cache with all resources the reg-service and proxy touch.
As for the ConfigMap, if I'm not mistaken, the reg-service doesn't touch any ConfigMap in the toolchain-host-operator namespace. The permission in the Role is most likely a leftover from the time before we introduced ToolchainConfig, when we used a CM to configure the service and operators.
That being said, while it wouldn't theoretically hurt (as we cache resources only in the host-operator namespace), I would rather keep the cache minimal and focused only on those resources that are really accessed by the reg-service.
I would rather avoid the situation when we would keep adding more and more resources to the cache without knowing why we are doing that.
The cache can be crucial for performance, especially for the proxy. And it's easy to miss this (not add specific resources to the initialization step) during development. We don't have any performance tests in our CI.
If neither the reg-service nor the proxy touches CMs, then IMO we should remove them from the role.
And I would argue that it would be safer to keep everything the reg-service & proxy access in the cache initialization.
I checked, and it looks like the CMs are not used by the reg-service and proxy. So let's create a separate PR to remove them from the role.
Great work @MatousJobanek! 🙌 It's very cool to see the memory footprint improvement as well! 🤩
> We used controller-runtime client(s) in our first proxy version, as far as I remember, assuming that the built-in client cache would be enough. But the performance was not consistent: every few minutes of constant proxy usage, the performance degraded. It looked like the cache was "reset" every few minutes or so.
> This was the main reason why we switched to informers in the first place.
> But maybe we did something differently back then? If so, what exactly did we do differently, and why should it work now?
> I might be missing or forgetting something though.
@alexeykazakov TBH, I don't remember the reason why we used informers. Looking back at the JIRA and the code, it sounds like we were using some simple cache that mapped users to cluster access (this part I recall), but that breaks when a user is retargeted to another cluster, and that was an issue we were trying to solve. I'm not sure whether there were other issues that led us to use informers over the runtime client. Maybe we just didn't realize at the time that the runtime client is backed by informers?
@@ -98,26 +99,29 @@ func (c *userSignupClient) listActiveSignupsByLabelForHashedValue(labelKey, valu
// label matching the specified label
func (c *userSignupClient) listActiveSignupsByLabel(labelKey, labelValue string) ([]*crtapi.UserSignup, error) {

	stateRequirement, err := labels.NewRequirement(crtapi.UserSignupStateLabelKey, selection.NotEquals, []string{crtapi.UserSignupStateLabelValueDeactivated})
// TODO add unit tests checking that the label selection works as expected. Right now, it's not possible to do that thanks to the great abstraction and multiple levels of layers, mocks, and services.
😆
Yes, this is one big 🤦♂️
I was looking at options for covering that in the unit tests of the verification function, but unfortunately, it's not easily doable with the current setup.
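That said, the selector construction itself can be exercised in isolation with the plain apimachinery labels package. A rough sketch, assuming testify and the constants shown in the diff above (the test name, package name, and the "approved" value are illustrative):

```go
package service_test

import (
	"testing"

	crtapi "github.com/codeready-toolchain/api/api/v1alpha1"
	"github.com/stretchr/testify/require"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/selection"
)

func TestActiveSignupSelectorExcludesDeactivated(t *testing.T) {
	// build the same "state != deactivated" requirement as listActiveSignupsByLabel
	stateRequirement, err := labels.NewRequirement(
		crtapi.UserSignupStateLabelKey, selection.NotEquals,
		[]string{crtapi.UserSignupStateLabelValueDeactivated})
	require.NoError(t, err)
	selector := labels.NewSelector().Add(*stateRequirement)

	// an active signup matches the selector...
	require.True(t, selector.Matches(labels.Set{crtapi.UserSignupStateLabelKey: "approved"}))
	// ...while a deactivated one is filtered out
	require.False(t, selector.Matches(labels.Set{
		crtapi.UserSignupStateLabelKey: crtapi.UserSignupStateLabelValueDeactivated,
	}))
}
```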
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: alexeykazakov, MatousJobanek, mfrancisc, rajivnathan, xcoulon
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing …
Merged commit 36ce810 into codeready-toolchain:master.
KUBESAW-132
First PR to replace the custom implementation of shared informers in reg-service with the cached controller-runtime client.
This PR changes the informer_service.go's methods so they use the client instead. I still keep the service file with the same methods to minimize the scope of the changes.