Add Velero e2e tests #2269

Open · wants to merge 8 commits into main

Conversation

@simonklb (Contributor) commented Sep 2, 2024

Warning

This is a public repository; ensure you do not disclose:

  • personal data beyond what is necessary for interacting with this pull request, nor
  • business confidential information, such as customer names.

What kind of PR is this?

Required: Mark one of the following that is applicable:

  • kind/feature
  • kind/improvement
  • kind/deprecation
  • kind/documentation
  • kind/clean-up
  • kind/bug
  • kind/other

Optional: Mark one or more of the following that are applicable:

Important

Breaking changes should be marked kind/admin-change or kind/dev-change depending on type
Critical security fixes should be marked with kind/security

  • kind/admin-change
  • kind/dev-change
  • kind/security
  • kind/adr

What does this PR do / why do we need this PR?

Information to reviewers

Added all QA and Velero GOTOs so that I get feedback from both parties. Do the tests look correct, do the test fixes look correct, am I testing Velero correctly, and do they cover everything we want tested with Velero?

@aarnq and I talked offline and decided that modifying the cluster config during the tests isn't a good idea. So to test both Restic and Kopia, it's up to the tester to reconfigure the cluster and run the tests twice.
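
For illustration, a rough sketch of that two-pass run. The config key here mirrors the upstream Velero chart's configuration.uploaderType, and both the key and the apply/test commands are hypothetical stand-ins for this repo's actual tooling:

```bash
# Hypothetical two-pass run; the config key and the apply/test
# commands are placeholders for whatever the repo actually uses.
for uploader in restic kopia; do
  yq -i ".velero.uploaderType = \"${uploader}\"" config.yaml  # key name hypothetical
  apply-cluster-config   # placeholder: reconfigure the cluster
  run-velero-e2e-tests   # placeholder: run this test suite
done
```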

Checklist

  • Proper commit message prefix on all commits
  • Change checks:
    • The change is transparent
    • The change is disruptive
    • The change requires no migration steps
    • The change requires migration steps
    • The change upgrades CRDs
    • The change updates the config and the schema
  • Metrics checks:
    • The metrics are still exposed and present in Grafana after the change
    • The metrics names didn't change (Grafana dashboards and Prometheus alerts are not affected)
    • The metrics names did change (Grafana dashboards and Prometheus alerts were fixed)
  • Logs checks:
    • The logs do not show any errors after the change
  • Pod Security Policy checks:
    • Any changed pod is covered by Pod Security Admission
    • Any changed pod is covered by Gatekeeper Pod Security Policies
    • The change does not cause any pods to be blocked by Pod Security Admission or Policies
  • Network Policy checks:
    • Any changed pod is covered by Network Policies
    • The change does not cause any dropped packets in the NetworkPolicy Dashboard
  • Audit checks:
    • The change does not cause any unnecessary Kubernetes audit events
    • The change requires changes to Kubernetes audit policy
  • Falco checks:
    • The change does not cause any alerts to be generated by Falco
  • Bug checks:
    • The bug fix is covered by regression tests

@aarnq (Contributor) left a comment:

LGTM

harbor.teardown_project
}

@test "velero backup and restore hns" {

Contributor:

Very nice test to have 😄
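
For orientation only (this is not the PR's actual test), a minimal bats sketch of what a backup-and-restore test along these lines could look like; the velero and kubectl calls are standard CLI usage, while the namespace, backup, and deployment names are hypothetical:

```bash
#!/usr/bin/env bats
# Illustrative sketch, not the test from this PR; names are hypothetical.

@test "velero backup and restore" {
  # Deploy a sample workload into a scratch namespace.
  kubectl create namespace velero-test
  kubectl -n velero-test create deployment nginx --image=nginx
  kubectl -n velero-test rollout status deployment/nginx --timeout=120s

  # Back up the namespace, delete it, then restore from the backup.
  velero backup create test-backup --include-namespaces velero-test --wait
  kubectl delete namespace velero-test --wait
  velero restore create test-restore --from-backup test-backup --wait

  # The workload should come back after the restore.
  kubectl -n velero-test rollout status deployment/nginx --timeout=120s
}
```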

@simonklb (Author) commented:

@Pavan-Gunda @OlleLarsson @viktor-f I'd like at least one from the Velero GOTO to also review this before I merge. Does it cover what you want to see tested?

@viktor-f (Contributor) left a comment:

I think this is good. I agree that we don't need to test both restic and kopia here; we can just use whatever is available in the cluster.

Contributor:

I like that we test backing up and restoring a user app. But I wonder if this should be named something else instead of "hnc.bats", mostly since it does not really feel like we are testing hnc. Maybe name it "user-app" or something like that.

However, I do think that we could extend this test so that we also try backing up the subnamespaces and restoring those before we restore the app.

What do you think of those suggestions?

@simonklb (Author) replied:

> However, I do think that we could extend this test so that we also try backing up the subnamespaces and restoring those before we restore the app.

I tried deleting the subnamespace as part of the test, and the restore then failed with:

Errors:
  Velero:   error creating namespace velero-test: admission webhook "namespaces.hnc.x-k8s.io" denied the request: namespaces "velero-test" is forbidden: cannot set or modify tree label "production.tree.hnc.x-k8s.io/depth" in namespace "velero-test"; these can only be managed by HNC

Are we sure that restoring a subnamespace is even supported/working today?

@simonklb (Author) replied:

> I like that we test backing up and restoring a user app. But I wonder if this should be named something else instead of "hnc.bats", mostly since it does not really feel like we are testing hnc. Maybe name it "user-app" or something like that.

149737c

> However, I do think that we could extend this test so that we also try backing up the subnamespaces and restoring those before we restore the app.

1bb5d6d (see comment above)

Contributor:

I'm not 100% sure, but I think the issue here is that the test is taking a backup of the velero-test namespace and restoring that. However, that does not restore the subnamespace resource, since that one is located in the production namespace.

So I think you would have to modify it to something like this:

  1. Backup velero-test namespace
  2. Backup subnamespace from the production namespace
  3. Delete everything
  4. Restore subnamespace in production
  5. Restore everything in the velero-test namespace

1 and 2 should be possible to do in either order, but the rest needs to be in this order.
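
Sketched as CLI commands (a rough outline only; the backup/restore names are hypothetical, and deleting the anchor may require HNC's allowCascadingDeletion):

```bash
# Rough outline of the proposed order; names are hypothetical.

# 1 & 2 (either order): back up the app namespace and the anchor in production.
velero backup create app-backup --include-namespaces velero-test --wait
velero backup create anchor-backup --include-namespaces production \
  --include-resources subnamespaceanchors.hnc.x-k8s.io --wait

# 3: delete everything (cascading deletion removes velero-test with its anchor).
kubectl -n production delete subnamespaceanchor velero-test

# 4: restore the anchor first, so HNC itself recreates the velero-test namespace.
velero restore create anchor-restore --from-backup anchor-backup --wait

# 5: then restore the contents of the velero-test namespace.
velero restore create app-restore --from-backup app-backup --wait
```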

@simonklb (Author) replied:

Ah, I see the problem now. This test kind of doesn't make sense, in that it's doing something that we never do in production (create a backup with --include-namespaces). It probably makes more sense to do backup and restore with the "daily" backup, since that is the only "official" way we do things today.

If we implement another type of backup job that can backup and restore only the application developer namespaces then this test could be changed to test that feature instead.

@simonklb (Author) replied:

I'm now running a full backup (which should back up the subnamespace resource in the production namespace, AFAIK?): 40687c9
But I'm still getting the same error:

Errors:
  Velero:   error creating namespace velero-test: admission webhook "namespaces.hnc.x-k8s.io" denied the request: namespaces "velero-test" is forbidden: cannot set or modify tree label "production.tree.hnc.x-k8s.io/depth" in namespace "velero-test"; these can only be managed by HNC

Looks to me like this is an actual bug, or am I missing something here?

Contributor:

Yes, the backup should include the subnamespace resource. You can check that with velero backup describe.

Then that might be an actual bug. My guess is that velero is not restoring this in the correct order. I.e. it might be trying to restore namespaces before it is restoring subnamespaces, which would fail because hnc does not allow you to create/edit namespaces with hnc labels (which the namespace in the backup would have).

If you look at the logs of the restore, you should be able to see whether it is trying to restore subnamespaces before or after namespaces (if it does namespaces first, it might just fail there and stop).
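
For reference, the checks being suggested here, using standard velero CLI calls with placeholder names:

```bash
# Confirm the SubnamespaceAnchor made it into the backup.
velero backup describe <backup-name> --details | grep -i subnamespaceanchor

# Check the restore order: are namespaces attempted before subnamespaceanchors?
velero restore logs <restore-name> | grep -iE 'Attempting to restore (Namespace|SubnamespaceAnchor)'
```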

@simonklb (Author) replied:

It's unclear. I can see that it restores the subnamespace resource, and that is the last resource it tries to restore before the errors are logged and it ends.

time="2024-09-30T13:29:22Z" level=info msg="Getting client for hnc.x-k8s.io/v1alpha2, Kind=SubnamespaceAnchor" logSource="pkg/restore/restore.go:1050" restore=velero/test-restore-1727702776
time="2024-09-30T13:29:22Z" level=info msg="restore status includes excludes: <nil>" logSource="pkg/restore/restore.go:1342" restore=velero/test-restore-1727702776
time="2024-09-30T13:29:22Z" level=info msg="Attempting to restore SubnamespaceAnchor: velero-test" logSource="pkg/restore/restore.go:1513" restore=velero/test-restore-1727702776
time="2024-09-30T13:29:23Z" level=info msg="the managed fields for production/velero-test is patched" logSource="pkg/restore/restore.go:1714" restore=velero/test-restore-1727702776
time="2024-09-30T13:29:23Z" level=info msg="Restored 252 items out of an estimated total of 275 (estimate will change throughout the restore)" logSource="pkg/restore/restore.go:807" name=velero-test namespace=production progress= resource=subnamespaceanchors.hnc.x-k8s.io restore=velero/test-restore-1727702776
time="2024-09-30T13:29:23Z" level=info msg="Waiting for all pod volume restores to complete" logSource="pkg/restore/restore.go:660" restore=velero/test-restore-1727702776
time="2024-09-30T13:29:23Z" level=info msg="Done waiting for all pod volume restores to complete" logSource="pkg/restore/restore.go:676" restore=velero/test-restore-1727702776
time="2024-09-30T13:29:23Z" level=info msg="Waiting for all post-restore-exec hooks to complete" logSource="pkg/restore/restore.go:680" restore=velero/test-restore-1727702776
time="2024-09-30T13:29:23Z" level=info msg="Done waiting for all post-restore exec hooks to complete" logSource="pkg/restore/restore.go:688" restore=velero/test-restore-1727702776
time="2024-09-30T13:29:23Z" level=info msg="hookTracker: map[], hookAttempted: 0, hookFailed: 0" logSource="pkg/restore/restore.go:695" restore=velero/test-restore-1727702776
time="2024-09-30T13:29:23Z" level=error msg="Velero restore error: error creating namespace velero-test: admission webhook \"namespaces.hnc.x-k8s.io\" denied the request: namespaces \"velero-test\" is forbidden: cannot set or modify tree label \"production.tree.hnc.x-k8s.io/depth\" in namespace \"velero-test\"; these can only be managed by HNC" logSource="pkg/controller/restore_controller.go:573" restore=velero/test-restore-1727702776
...

However, I never see it trying to restore an actual namespace resource:

$ velero restore logs test-restore-1727702776 | egrep -i 'Attempting to restore.*namespace'
time="2024-09-30T13:28:54Z" level=info msg="Attempting to restore CustomResourceDefinition: subnamespaceanchors.hnc.x-k8s.io" logSource="pkg/restore/restore.go:1513" restore=velero/test-restore-1727702776
time="2024-09-30T13:29:22Z" level=info msg="Attempting to restore SubnamespaceAnchor: velero-test" logSource="pkg/restore/restore.go:1513" restore=velero/test-restore-1727702776

I'm not sure if it just skips existing namespaces, but that isn't logged either, as far as I can tell. For example, I can see that resources are restored in the staging namespace, but I don't see anything logged regarding the staging namespace itself:

$ velero restore logs test-restore-1727702776 | grep -i 'staging'
time="2024-09-30T13:28:54Z" level=info msg="Resource 'serviceaccounts' will be restored into namespace 'staging'" logSource="pkg/restore/restore.go:2264" restore=velero/test-restore-1727702776
time="2024-09-30T13:28:55Z" level=info msg="Resource 'configmaps' will be restored into namespace 'staging'" logSource="pkg/restore/restore.go:2264" restore=velero/test-restore-1727702776
time="2024-09-30T13:28:55Z" level=info msg="Resource 'hierarchyconfigurations.hnc.x-k8s.io' will be restored into namespace 'staging'" logSource="pkg/restore/restore.go:2264" restore=velero/test-restore-1727702776
time="2024-09-30T13:28:55Z" level=info msg="Resource 'networkpolicies.networking.k8s.io' will be restored into namespace 'staging'" logSource="pkg/restore/restore.go:2264" restore=velero/test-restore-1727702776
time="2024-09-30T13:28:55Z" level=info msg="Resource 'rolebindings.rbac.authorization.k8s.io' will be restored into namespace 'staging'" logSource="pkg/restore/restore.go:2264" restore=velero/test-restore-1727702776
time="2024-09-30T13:28:55Z" level=info msg="Resource 'roles.rbac.authorization.k8s.io' will be restored into namespace 'staging'" logSource="pkg/restore/restore.go:2264" restore=velero/test-restore-1727702776
time="2024-09-30T13:28:59Z" level=info msg="Restored 15 items out of an estimated total of 275 (estimate will change throughout the restore)" logSource="pkg/restore/restore.go:807" name=default namespace=staging progress= resource=serviceaccounts restore=velero/test-restore-1727702776
time="2024-09-30T13:29:00Z" level=info msg="Restored 60 items out of an estimated total of 275 (estimate will change throughout the restore)" logSource="pkg/restore/restore.go:807" name=kube-root-ca.crt namespace=staging progress= resource=configmaps restore=velero/test-restore-1727702776
time="2024-09-30T13:29:20Z" level=info msg="Restored 219 items out of an estimated total of 277 (estimate will change throughout the restore)" logSource="pkg/restore/restore.go:807" name=hierarchy namespace=staging progress= resource=hierarchyconfigurations.hnc.x-k8s.io restore=velero/test-restore-1727702776
time="2024-09-30T13:29:21Z" level=info msg="Restored 234 items out of an estimated total of 277 (estimate will change throughout the restore)" logSource="pkg/restore/restore.go:807" name=allow-cert-manager-resolver namespace=staging progress= resource=networkpolicies.networking.k8s.io restore=velero/test-restore-1727702776
time="2024-09-30T13:29:21Z" level=info msg="Restored 240 items out of an estimated total of 275 (estimate will change throughout the restore)" logSource="pkg/restore/restore.go:807" name=extra-workload-admins namespace=staging progress= resource=rolebindings.rbac.authorization.k8s.io restore=velero/test-restore-1727702776
time="2024-09-30T13:29:21Z" level=info msg="Restored 241 items out of an estimated total of 275 (estimate will change throughout the restore)" logSource="pkg/restore/restore.go:807" name=hnc-controller-user-rolebinding namespace=staging progress= resource=rolebindings.rbac.authorization.k8s.io restore=velero/test-restore-1727702776
time="2024-09-30T13:29:21Z" level=info msg="Restored 242 items out of an estimated total of 275 (estimate will change throughout the restore)" logSource="pkg/restore/restore.go:807" name=workload-admin namespace=staging progress= resource=rolebindings.rbac.authorization.k8s.io restore=velero/test-restore-1727702776
time="2024-09-30T13:29:22Z" level=info msg="Restored 245 items out of an estimated total of 275 (estimate will change throughout the restore)" logSource="pkg/restore/restore.go:807" name=hnc-controller-user-role namespace=staging progress= resource=roles.rbac.authorization.k8s.io restore=velero/test-restore-1727702776

@simonklb (Author) replied:

Here is the full restore.log

This currently fails with the following error in the Velero restore:

error creating namespace velero-test: admission webhook "namespaces.hnc.x-k8s.io" denied the request: namespaces "velero-test" is forbidden: cannot set or modify tree label "production.tree.hnc.x-k8s.io/depth" in namespace "velero-test"; these can only be managed by HNC

@Pavan-Gunda (Contributor) commented:

> I think this is good. I agree that we don't need to test both restic and kopia here; we can just use whatever is available in the cluster.

restic is going to get deprecated soon anyway; we can just test kopia.

@aarnq (Contributor) commented Sep 18, 2024:

> I think this is good. I agree that we don't need to test both restic and kopia here; we can just use whatever is available in the cluster.

> restic is going to get deprecated soon anyway; we can just test kopia.

We don't really want end-to-end tests to modify and apply config changes, so moving to Kopia is a matter of changing the default config and templates. The tests don't seem to make any difference regardless, so they don't really need to be adapted.

Successfully merging this pull request may close: [5] Add Velero backup and restore to end-to-end tests