Add Velero e2e tests #2269
base: main
Conversation
LGTM
tests/end-to-end/velero/hnc.bats (Outdated)

    harbor.teardown_project
    }

    @test "velero backup and restore hns" {
Very nice test to have 😄
@Pavan-Gunda @OlleLarsson @viktor-f I'd like at least one from the Velero GOTO to also review this before I merge. Does it cover what you want to see tested?
I think this is good. I agree that we don't need to test both restic and kopia here, we can just use whatever is available in the cluster.
tests/end-to-end/velero/hnc.bats (Outdated)
I like that we test backing up and restoring a user app. But I wonder if this should be named something else instead of "hnc.bats", mostly since it does not really feel like we are testing hnc. Maybe name it "user-app" or something like that.
However I do think that we could increase this test so that we also try backing up the subnamespaces and restoring that before we restore the app.
What do you think of those suggestions?
> However I do think that we could increase this test so that we also try backing up the subnamespaces and restoring that before we restore the app.
I tried deleting the subnamespace as part of the test, and then the restore failed with:
Errors:
Velero: error creating namespace velero-test: admission webhook "namespaces.hnc.x-k8s.io" denied the request: namespaces "velero-test" is forbidden: cannot set or modify tree label "production.tree.hnc.x-k8s.io/depth" in namespace "velero-test"; these can only be managed by HNC
Are we sure that restoring a subnamespace is even supported/working today?
> I like that we test backing up and restoring a user app. But I wonder if this should be named something else instead of "hnc.bats", mostly since it does not really feel like we are testing hnc. Maybe name it "user-app" or something like that.
> However I do think that we could increase this test so that we also try backing up the subnamespaces and restoring that before we restore the app.
1bb5d6d (see comment above)
I'm not 100% sure. But I think the issue here is that the test is taking a backup of the velero-test namespace and restoring that. However that does not restore the subnamespace resource since that one is located in the production namespace.
So I think you would have to modify it to something like this:
1. Back up the velero-test namespace
2. Back up the subnamespace from the production namespace
3. Delete everything
4. Restore the subnamespace in production
5. Restore everything in the velero-test namespace
1 and 2 should be possible to do in either order, but the rest needs to be in this order.
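The ordering above could be sketched with the Velero CLI roughly like this (the backup names here are made up; the namespaces and the subnamespaceanchors resource are taken from this thread, and the flags are standard upstream Velero flags):

```shell
# Steps 1 and 2 (either order): back up the app namespace, and the
# SubnamespaceAnchor that lives in the parent "production" namespace.
velero backup create app-backup --include-namespaces velero-test
velero backup create subns-backup --include-namespaces production \
    --include-resources subnamespaceanchors.hnc.x-k8s.io

# Step 3: delete everything (the app namespace and the anchor).

# Step 4: restore the anchor first, so HNC recreates the subnamespace.
velero restore create --from-backup subns-backup --wait

# Step 5: then restore the application into the recreated namespace.
velero restore create --from-backup app-backup --wait
```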
Ah I see the problem now. This test kind of doesn't make sense in that it's doing something that we never do in production (create a backup with --include-namespaces). It probably makes more sense to do backup and restore with the "daily" backup since that is the only "official" way we do things today.
If we implement another type of backup job that can backup and restore only the application developer namespaces then this test could be changed to test that feature instead.
I'm now running a full backup (which should backup the subnamespace resource in the production namespace AFAIK?): 40687c9
But I'm still getting the same error:
Errors:
Velero: error creating namespace velero-test: admission webhook "namespaces.hnc.x-k8s.io" denied the request: namespaces "velero-test" is forbidden: cannot set or modify tree label "production.tree.hnc.x-k8s.io/depth" in namespace "velero-test"; these can only be managed by HNC
It looks to me like this is an actual bug, or am I missing something here?
Yes the backup should include the subnamespace resource. You can check that with velero backup describe.
Then that might be an actual bug. My guess is that velero is not restoring this in the correct order. I.e. it might be trying to restore namespaces before it is restoring subnamespaces, which would fail because hnc does not allow you to create/edit namespaces with hnc labels (which the namespace in the backup would have).
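For context, the hnc labels mentioned here are the tree labels HNC maintains on every namespace in a hierarchy. A backed-up subnamespace manifest would be assumed to look roughly like this (a sketch; label values are illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: velero-test
  labels:
    # Maintained by HNC: depth from each ancestor. Users (and Velero)
    # may not set or modify these, which is what the webhook rejects.
    production.tree.hnc.x-k8s.io/depth: "1"
    velero-test.tree.hnc.x-k8s.io/depth: "0"
```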
If you look at the logs of the restore you should be able to see if it is trying to restore subnamespaces before or after namespaces (if it does namespaces first it might just fail there and stop).
It's unclear. I can see that it restores the subnamespace resource and that is the last resource it tries to restore before the errors are logged and it ends.
time="2024-09-30T13:29:22Z" level=info msg="Getting client for hnc.x-k8s.io/v1alpha2, Kind=SubnamespaceAnchor" logSource="pkg/restore/restore.go:1050" restore=velero/test-restore-1727702776
time="2024-09-30T13:29:22Z" level=info msg="restore status includes excludes: <nil>" logSource="pkg/restore/restore.go:1342" restore=velero/test-restore-1727702776
time="2024-09-30T13:29:22Z" level=info msg="Attempting to restore SubnamespaceAnchor: velero-test" logSource="pkg/restore/restore.go:1513" restore=velero/test-restore-1727702776
time="2024-09-30T13:29:23Z" level=info msg="the managed fields for production/velero-test is patched" logSource="pkg/restore/restore.go:1714" restore=velero/test-restore-1727702776
time="2024-09-30T13:29:23Z" level=info msg="Restored 252 items out of an estimated total of 275 (estimate will change throughout the restore)" logSource="pkg/restore/restore.go:807" name=velero-test namespace=production progress= resource=subnamespaceanchors.hnc.x-k8s.io restore=velero/test-restore-1727702776
time="2024-09-30T13:29:23Z" level=info msg="Waiting for all pod volume restores to complete" logSource="pkg/restore/restore.go:660" restore=velero/test-restore-1727702776
time="2024-09-30T13:29:23Z" level=info msg="Done waiting for all pod volume restores to complete" logSource="pkg/restore/restore.go:676" restore=velero/test-restore-1727702776
time="2024-09-30T13:29:23Z" level=info msg="Waiting for all post-restore-exec hooks to complete" logSource="pkg/restore/restore.go:680" restore=velero/test-restore-1727702776
time="2024-09-30T13:29:23Z" level=info msg="Done waiting for all post-restore exec hooks to complete" logSource="pkg/restore/restore.go:688" restore=velero/test-restore-1727702776
time="2024-09-30T13:29:23Z" level=info msg="hookTracker: map[], hookAttempted: 0, hookFailed: 0" logSource="pkg/restore/restore.go:695" restore=velero/test-restore-1727702776
time="2024-09-30T13:29:23Z" level=error msg="Velero restore error: error creating namespace velero-test: admission webhook \"namespaces.hnc.x-k8s.io\" denied the request: namespaces \"velero-test\" is forbidden: cannot set or modify tree label \"production.tree.hnc.x-k8s.io/depth\" in namespace \"velero-test\"; these can only be managed by HNC" logSource="pkg/controller/restore_controller.go:573" restore=velero/test-restore-1727702776
...
However, I never see it trying to restore an actual namespace resource:
$ velero restore logs test-restore-1727702776 | egrep -i 'Attempting to restore.*namespace'
time="2024-09-30T13:28:54Z" level=info msg="Attempting to restore CustomResourceDefinition: subnamespaceanchors.hnc.x-k8s.io" logSource="pkg/restore/restore.go:1513" restore=velero/test-restore-1727702776
time="2024-09-30T13:29:22Z" level=info msg="Attempting to restore SubnamespaceAnchor: velero-test" logSource="pkg/restore/restore.go:1513" restore=velero/test-restore-1727702776
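To make that ordering check reproducible, here is a self-contained sketch that runs the same kind of filter over the two log lines quoted above (the sample file path is made up):

```shell
# Write the two sample lines from the restore log above to a scratch file.
cat > /tmp/velero-restore-sample.log <<'EOF'
time="2024-09-30T13:28:54Z" level=info msg="Attempting to restore CustomResourceDefinition: subnamespaceanchors.hnc.x-k8s.io"
time="2024-09-30T13:29:22Z" level=info msg="Attempting to restore SubnamespaceAnchor: velero-test"
EOF

# Print the restored kinds in the order Velero attempted them.
grep -o 'Attempting to restore [A-Za-z]*' /tmp/velero-restore-sample.log
# Attempting to restore CustomResourceDefinition
# Attempting to restore SubnamespaceAnchor
```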
I'm not sure if it just skips existing namespaces, but that isn't logged either as far as I can tell. For example, I can see that resources are restored in the staging namespace, but I don't see anything logged regarding the staging namespace itself:
$ velero restore logs test-restore-1727702776 | grep -i 'staging'
time="2024-09-30T13:28:54Z" level=info msg="Resource 'serviceaccounts' will be restored into namespace 'staging'" logSource="pkg/restore/restore.go:2264" restore=velero/test-restore-1727702776
time="2024-09-30T13:28:55Z" level=info msg="Resource 'configmaps' will be restored into namespace 'staging'" logSource="pkg/restore/restore.go:2264" restore=velero/test-restore-1727702776
time="2024-09-30T13:28:55Z" level=info msg="Resource 'hierarchyconfigurations.hnc.x-k8s.io' will be restored into namespace 'staging'" logSource="pkg/restore/restore.go:2264" restore=velero/test-restore-1727702776
time="2024-09-30T13:28:55Z" level=info msg="Resource 'networkpolicies.networking.k8s.io' will be restored into namespace 'staging'" logSource="pkg/restore/restore.go:2264" restore=velero/test-restore-1727702776
time="2024-09-30T13:28:55Z" level=info msg="Resource 'rolebindings.rbac.authorization.k8s.io' will be restored into namespace 'staging'" logSource="pkg/restore/restore.go:2264" restore=velero/test-restore-1727702776
time="2024-09-30T13:28:55Z" level=info msg="Resource 'roles.rbac.authorization.k8s.io' will be restored into namespace 'staging'" logSource="pkg/restore/restore.go:2264" restore=velero/test-restore-1727702776
time="2024-09-30T13:28:59Z" level=info msg="Restored 15 items out of an estimated total of 275 (estimate will change throughout the restore)" logSource="pkg/restore/restore.go:807" name=default namespace=staging progress= resource=serviceaccounts restore=velero/test-restore-1727702776
time="2024-09-30T13:29:00Z" level=info msg="Restored 60 items out of an estimated total of 275 (estimate will change throughout the restore)" logSource="pkg/restore/restore.go:807" name=kube-root-ca.crt namespace=staging progress= resource=configmaps restore=velero/test-restore-1727702776
time="2024-09-30T13:29:20Z" level=info msg="Restored 219 items out of an estimated total of 277 (estimate will change throughout the restore)" logSource="pkg/restore/restore.go:807" name=hierarchy namespace=staging progress= resource=hierarchyconfigurations.hnc.x-k8s.io restore=velero/test-restore-1727702776
time="2024-09-30T13:29:21Z" level=info msg="Restored 234 items out of an estimated total of 277 (estimate will change throughout the restore)" logSource="pkg/restore/restore.go:807" name=allow-cert-manager-resolver namespace=staging progress= resource=networkpolicies.networking.k8s.io restore=velero/test-restore-1727702776
time="2024-09-30T13:29:21Z" level=info msg="Restored 240 items out of an estimated total of 275 (estimate will change throughout the restore)" logSource="pkg/restore/restore.go:807" name=extra-workload-admins namespace=staging progress= resource=rolebindings.rbac.authorization.k8s.io restore=velero/test-restore-1727702776
time="2024-09-30T13:29:21Z" level=info msg="Restored 241 items out of an estimated total of 275 (estimate will change throughout the restore)" logSource="pkg/restore/restore.go:807" name=hnc-controller-user-rolebinding namespace=staging progress= resource=rolebindings.rbac.authorization.k8s.io restore=velero/test-restore-1727702776
time="2024-09-30T13:29:21Z" level=info msg="Restored 242 items out of an estimated total of 275 (estimate will change throughout the restore)" logSource="pkg/restore/restore.go:807" name=workload-admin namespace=staging progress= resource=rolebindings.rbac.authorization.k8s.io restore=velero/test-restore-1727702776
time="2024-09-30T13:29:22Z" level=info msg="Restored 245 items out of an estimated total of 275 (estimate will change throughout the restore)" logSource="pkg/restore/restore.go:807" name=hnc-controller-user-role namespace=staging progress= resource=roles.rbac.authorization.k8s.io restore=velero/test-restore-1727702776
Here is the full restore.log
This currently fails with the following error in the Velero restore: error creating namespace velero-test: admission webhook "namespaces.hnc.x-k8s.io" denied the request: namespaces "velero-test" is forbidden: cannot set or modify tree label "production.tree.hnc.x-k8s.io/depth" in namespace "velero-test"; these can only be managed by HNC
Restic is going to get deprecated soon anyway; we can just test Kopia.
We don't really want end-to-end tests to modify and apply changes, so moving to Kopia is a matter of changing the default config and templates.
Warning
This is a public repository, ensure not to disclose:
What kind of PR is this?
Required: Mark one of the following that is applicable:
Optional: Mark one or more of the following that are applicable:
Important
Breaking changes should be marked kind/admin-change or kind/dev-change depending on type.
Critical security fixes should be marked with kind/security.
What does this PR do / why do we need this PR?
Information to reviewers
Added all QA and Velero GOTOs so that I get feedback from both parties. Do the tests look correct, do the test fixes look correct, am I testing Velero correctly, and does it cover everything we want tested with Velero?
@aarnq and I talked offline and decided that modifying the cluster config during the tests isn't a good idea. So to test both Restic and Kopia, it's up to the tester to reconfigure the cluster and run the tests twice.
Checklist
NetworkPolicy Dashboard