Add Velero e2e tests #2269

Open · wants to merge 8 commits into main

Conversation

@simonklb (Contributor) commented Sep 2, 2024

Warning

This is a public repository; ensure you do not disclose:

  • personal data beyond what is necessary for interacting with this pull request, nor
  • business confidential information, such as customer names.

What kind of PR is this?

Required: Mark one of the following that is applicable:

  • kind/feature
  • kind/improvement
  • kind/deprecation
  • kind/documentation
  • kind/clean-up
  • kind/bug
  • kind/other

Optional: Mark one or more of the following that are applicable:

Important

Breaking changes should be marked kind/admin-change or kind/dev-change depending on type
Critical security fixes should be marked with kind/security

  • kind/admin-change
  • kind/dev-change
  • kind/security
  • kind/adr

What does this PR do / why do we need this PR?

Information to reviewers

Added all QA and Velero GOTOs so that I get feedback from both parties. Do the tests look correct, do the test fixes look correct, am I testing Velero correctly, and do they cover everything we want tested with Velero?

@aarnq and I talked offline and decided that modifying the cluster config during the tests isn't a good idea. So to test both Restic and Kopia, it's up to the tester to reconfigure the cluster and run the tests twice.
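
For illustration, a rough sketch of that two-pass run. The config key here mirrors the upstream Velero chart's configuration.uploaderType, and both the key and the apply/test commands are hypothetical stand-ins for this repo's actual tooling:

```bash
# Hypothetical two-pass run; the config key and the apply/test
# commands are placeholders for whatever the repo actually uses.
for uploader in restic kopia; do
  yq -i ".velero.uploaderType = \"${uploader}\"" config.yaml  # key name hypothetical
  apply-cluster-config   # placeholder: reconfigure the cluster
  run-velero-e2e-tests   # placeholder: run this test suite
done
```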

Checklist

  • Proper commit message prefix on all commits
  • Change checks:
    • The change is transparent
    • The change is disruptive
    • The change requires no migration steps
    • The change requires migration steps
    • The change upgrades CRDs
    • The change updates the config and the schema
  • Metrics checks:
    • The metrics are still exposed and present in Grafana after the change
    • The metrics names didn't change (Grafana dashboards and Prometheus alerts are not affected)
    • The metrics names did change (Grafana dashboards and Prometheus alerts were fixed)
  • Logs checks:
    • The logs do not show any errors after the change
  • Pod Security Policy checks:
    • Any changed pod is covered by Pod Security Admission
    • Any changed pod is covered by Gatekeeper Pod Security Policies
    • The change does not cause any pods to be blocked by Pod Security Admission or Policies
  • Network Policy checks:
    • Any changed pod is covered by Network Policies
    • The change does not cause any dropped packets in the NetworkPolicy Dashboard
  • Audit checks:
    • The change does not cause any unnecessary Kubernetes audit events
    • The change requires changes to Kubernetes audit policy
  • Falco checks:
    • The change does not cause any alerts to be generated by Falco
  • Bug checks:
    • The bug fix is covered by regression tests

@aarnq (Contributor) left a comment:

LGTM

harbor.teardown_project
}

@test "velero backup and restore hns" {

Contributor:

Very nice test to have 😄
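
For orientation only (this is not the PR's actual test), a minimal bats sketch of what a backup-and-restore test along these lines could look like; the velero and kubectl calls are standard CLI usage, while the namespace, backup, and deployment names are hypothetical:

```bash
#!/usr/bin/env bats
# Illustrative sketch, not the test from this PR; names are hypothetical.

@test "velero backup and restore" {
  # Deploy a sample workload into a scratch namespace.
  kubectl create namespace velero-test
  kubectl -n velero-test create deployment nginx --image=nginx
  kubectl -n velero-test rollout status deployment/nginx --timeout=120s

  # Back up the namespace, delete it, then restore from the backup.
  velero backup create test-backup --include-namespaces velero-test --wait
  kubectl delete namespace velero-test --wait
  velero restore create test-restore --from-backup test-backup --wait

  # The workload should come back after the restore.
  kubectl -n velero-test rollout status deployment/nginx --timeout=120s
}
```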

@simonklb (Author) commented:

@Pavan-Gunda @OlleLarsson @viktor-f I'd like at least one from the Velero GOTO to also review this before I merge. Does it cover what you want to see tested?

@viktor-f (Contributor) left a comment:

I think this is good. I agree that we don't need to test both restic and kopia here; we can just use whatever is available in the cluster.

Contributor:

I like that we test backing up and restoring a user app. But I wonder if this should be named something else instead of "hnc.bats", mostly since it does not really feel like we are testing hnc. Maybe name it "user-app" or something like that.

However, I do think that we could extend this test so that we also try backing up the subnamespaces and restoring those before we restore the app.

What do you think of those suggestions?

@simonklb (Author) replied:

> However, I do think that we could extend this test so that we also try backing up the subnamespaces and restoring those before we restore the app.

I tried deleting the subnamespace as part of the test, and the restore then failed with:

Errors:
  Velero:   error creating namespace velero-test: admission webhook "namespaces.hnc.x-k8s.io" denied the request: namespaces "velero-test" is forbidden: cannot set or modify tree label "production.tree.hnc.x-k8s.io/depth" in namespace "velero-test"; these can only be managed by HNC

Are we sure that restoring a subnamespace is even supported/working today?

@simonklb (Author) replied:

> I like that we test backing up and restoring a user app. But I wonder if this should be named something else instead of "hnc.bats", mostly since it does not really feel like we are testing hnc. Maybe name it "user-app" or something like that.

149737c

> However, I do think that we could extend this test so that we also try backing up the subnamespaces and restoring those before we restore the app.

1bb5d6d (see comment above)

Contributor:

I'm not 100% sure, but I think the issue here is that the test is taking a backup of the velero-test namespace and restoring that. However, that does not restore the subnamespace resource, since that one is located in the production namespace.

So I think you would have to modify it to something like this:

  1. Backup velero-test namespace
  2. Backup subnamespace from the production namespace
  3. Delete everything
  4. Restore subnamespace in production
  5. Restore everything in the velero-test namespace

1 and 2 should be possible to do in either order, but the rest needs to be in this order.
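
Sketched as CLI commands (a rough outline only; the backup/restore names are hypothetical, and deleting the anchor may require HNC's allowCascadingDeletion):

```bash
# Rough outline of the proposed order; names are hypothetical.

# 1 & 2 (either order): back up the app namespace and the anchor in production.
velero backup create app-backup --include-namespaces velero-test --wait
velero backup create anchor-backup --include-namespaces production \
  --include-resources subnamespaceanchors.hnc.x-k8s.io --wait

# 3: delete everything (cascading deletion removes velero-test with its anchor).
kubectl -n production delete subnamespaceanchor velero-test

# 4: restore the anchor first, so HNC itself recreates the velero-test namespace.
velero restore create anchor-restore --from-backup anchor-backup --wait

# 5: then restore the contents of the velero-test namespace.
velero restore create app-restore --from-backup app-backup --wait
```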

@simonklb (Author) replied:

Ah, I see the problem now. This test kind of doesn't make sense, in that it's doing something that we never do in production (create a backup with --include-namespaces). It probably makes more sense to do backup and restore with the "daily" backup, since that is the only "official" way we do things today.

If we implement another type of backup job that can backup and restore only the application developer namespaces then this test could be changed to test that feature instead.

@simonklb (Author) replied:

I'm now running a full backup (which should back up the subnamespace resource in the production namespace, AFAIK?): 40687c9
But I'm still getting the same error:

Errors:
  Velero:   error creating namespace velero-test: admission webhook "namespaces.hnc.x-k8s.io" denied the request: namespaces "velero-test" is forbidden: cannot set or modify tree label "production.tree.hnc.x-k8s.io/depth" in namespace "velero-test"; these can only be managed by HNC

Looks to me like this is an actual bug, or am I missing something here?

Contributor:

Yes, the backup should include the subnamespace resource. You can check that with velero backup describe.

Then that might be an actual bug. My guess is that velero is not restoring this in the correct order. I.e. it might be trying to restore namespaces before it is restoring subnamespaces, which would fail because hnc does not allow you to create/edit namespaces with hnc labels (which the namespace in the backup would have).

If you look at the logs of the restore, you should be able to see whether it is trying to restore subnamespaces before or after namespaces (if it does namespaces first, it might just fail there and stop).
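
For reference, the checks being suggested here, using standard velero CLI calls with placeholder names:

```bash
# Confirm the SubnamespaceAnchor made it into the backup.
velero backup describe <backup-name> --details | grep -i subnamespaceanchor

# Check the restore order: are namespaces attempted before subnamespaceanchors?
velero restore logs <restore-name> | grep -iE 'Attempting to restore (Namespace|SubnamespaceAnchor)'
```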

@simonklb (Author) replied:

It's unclear. I can see that it restores the subnamespace resource, and that is the last resource it tries to restore before the errors are logged and it ends.

time="2024-09-30T13:29:22Z" level=info msg="Getting client for hnc.x-k8s.io/v1alpha2, Kind=SubnamespaceAnchor" logSource="pkg/restore/restore.go:1050" restore=velero/test-restore-1727702776
time="2024-09-30T13:29:22Z" level=info msg="restore status includes excludes: <nil>" logSource="pkg/restore/restore.go:1342" restore=velero/test-restore-1727702776
time="2024-09-30T13:29:22Z" level=info msg="Attempting to restore SubnamespaceAnchor: velero-test" logSource="pkg/restore/restore.go:1513" restore=velero/test-restore-1727702776
time="2024-09-30T13:29:23Z" level=info msg="the managed fields for production/velero-test is patched" logSource="pkg/restore/restore.go:1714" restore=velero/test-restore-1727702776
time="2024-09-30T13:29:23Z" level=info msg="Restored 252 items out of an estimated total of 275 (estimate will change throughout the restore)" logSource="pkg/restore/restore.go:807" name=velero-test namespace=production progress= resource=subnamespaceanchors.hnc.x-k8s.io restore=velero/test-restore-1727702776
time="2024-09-30T13:29:23Z" level=info msg="Waiting for all pod volume restores to complete" logSource="pkg/restore/restore.go:660" restore=velero/test-restore-1727702776
time="2024-09-30T13:29:23Z" level=info msg="Done waiting for all pod volume restores to complete" logSource="pkg/restore/restore.go:676" restore=velero/test-restore-1727702776
time="2024-09-30T13:29:23Z" level=info msg="Waiting for all post-restore-exec hooks to complete" logSource="pkg/restore/restore.go:680" restore=velero/test-restore-1727702776
time="2024-09-30T13:29:23Z" level=info msg="Done waiting for all post-restore exec hooks to complete" logSource="pkg/restore/restore.go:688" restore=velero/test-restore-1727702776
time="2024-09-30T13:29:23Z" level=info msg="hookTracker: map[], hookAttempted: 0, hookFailed: 0" logSource="pkg/restore/restore.go:695" restore=velero/test-restore-1727702776
time="2024-09-30T13:29:23Z" level=error msg="Velero restore error: error creating namespace velero-test: admission webhook \"namespaces.hnc.x-k8s.io\" denied the request: namespaces \"velero-test\" is forbidden: cannot set or modify tree label \"production.tree.hnc.x-k8s.io/depth\" in namespace \"velero-test\"; these can only be managed by HNC" logSource="pkg/controller/restore_controller.go:573" restore=velero/test-restore-1727702776
...

However, I never see it trying to restore an actual namespace resource:

$ velero restore logs test-restore-1727702776 | egrep -i 'Attempting to restore.*namespace'
time="2024-09-30T13:28:54Z" level=info msg="Attempting to restore CustomResourceDefinition: subnamespaceanchors.hnc.x-k8s.io" logSource="pkg/restore/restore.go:1513" restore=velero/test-restore-1727702776
time="2024-09-30T13:29:22Z" level=info msg="Attempting to restore SubnamespaceAnchor: velero-test" logSource="pkg/restore/restore.go:1513" restore=velero/test-restore-1727702776

I'm not sure if it just skips existing namespaces, but that isn't logged either, as far as I can tell. For example, I can see that resources are restored in the staging namespace, but I don't see anything logged regarding the staging namespace itself:

$ velero restore logs test-restore-1727702776 | grep -i 'staging'
time="2024-09-30T13:28:54Z" level=info msg="Resource 'serviceaccounts' will be restored into namespace 'staging'" logSource="pkg/restore/restore.go:2264" restore=velero/test-restore-1727702776
time="2024-09-30T13:28:55Z" level=info msg="Resource 'configmaps' will be restored into namespace 'staging'" logSource="pkg/restore/restore.go:2264" restore=velero/test-restore-1727702776
time="2024-09-30T13:28:55Z" level=info msg="Resource 'hierarchyconfigurations.hnc.x-k8s.io' will be restored into namespace 'staging'" logSource="pkg/restore/restore.go:2264" restore=velero/test-restore-1727702776
time="2024-09-30T13:28:55Z" level=info msg="Resource 'networkpolicies.networking.k8s.io' will be restored into namespace 'staging'" logSource="pkg/restore/restore.go:2264" restore=velero/test-restore-1727702776
time="2024-09-30T13:28:55Z" level=info msg="Resource 'rolebindings.rbac.authorization.k8s.io' will be restored into namespace 'staging'" logSource="pkg/restore/restore.go:2264" restore=velero/test-restore-1727702776
time="2024-09-30T13:28:55Z" level=info msg="Resource 'roles.rbac.authorization.k8s.io' will be restored into namespace 'staging'" logSource="pkg/restore/restore.go:2264" restore=velero/test-restore-1727702776
time="2024-09-30T13:28:59Z" level=info msg="Restored 15 items out of an estimated total of 275 (estimate will change throughout the restore)" logSource="pkg/restore/restore.go:807" name=default namespace=staging progress= resource=serviceaccounts restore=velero/test-restore-1727702776
time="2024-09-30T13:29:00Z" level=info msg="Restored 60 items out of an estimated total of 275 (estimate will change throughout the restore)" logSource="pkg/restore/restore.go:807" name=kube-root-ca.crt namespace=staging progress= resource=configmaps restore=velero/test-restore-1727702776
time="2024-09-30T13:29:20Z" level=info msg="Restored 219 items out of an estimated total of 277 (estimate will change throughout the restore)" logSource="pkg/restore/restore.go:807" name=hierarchy namespace=staging progress= resource=hierarchyconfigurations.hnc.x-k8s.io restore=velero/test-restore-1727702776
time="2024-09-30T13:29:21Z" level=info msg="Restored 234 items out of an estimated total of 277 (estimate will change throughout the restore)" logSource="pkg/restore/restore.go:807" name=allow-cert-manager-resolver namespace=staging progress= resource=networkpolicies.networking.k8s.io restore=velero/test-restore-1727702776
time="2024-09-30T13:29:21Z" level=info msg="Restored 240 items out of an estimated total of 275 (estimate will change throughout the restore)" logSource="pkg/restore/restore.go:807" name=extra-workload-admins namespace=staging progress= resource=rolebindings.rbac.authorization.k8s.io restore=velero/test-restore-1727702776
time="2024-09-30T13:29:21Z" level=info msg="Restored 241 items out of an estimated total of 275 (estimate will change throughout the restore)" logSource="pkg/restore/restore.go:807" name=hnc-controller-user-rolebinding namespace=staging progress= resource=rolebindings.rbac.authorization.k8s.io restore=velero/test-restore-1727702776
time="2024-09-30T13:29:21Z" level=info msg="Restored 242 items out of an estimated total of 275 (estimate will change throughout the restore)" logSource="pkg/restore/restore.go:807" name=workload-admin namespace=staging progress= resource=rolebindings.rbac.authorization.k8s.io restore=velero/test-restore-1727702776
time="2024-09-30T13:29:22Z" level=info msg="Restored 245 items out of an estimated total of 275 (estimate will change throughout the restore)" logSource="pkg/restore/restore.go:807" name=hnc-controller-user-role namespace=staging progress= resource=roles.rbac.authorization.k8s.io restore=velero/test-restore-1727702776

@simonklb (Author) replied:

Here is the full restore.log

This currently fails with the following error in the Velero restore:

error creating namespace velero-test: admission webhook "namespaces.hnc.x-k8s.io" denied the request: namespaces "velero-test" is forbidden: cannot set or modify tree label "production.tree.hnc.x-k8s.io/depth" in namespace "velero-test"; these can only be managed by HNC

@Pavan-Gunda (Contributor) commented:

> I think this is good. I agree that we don't need to test both restic and kopia here; we can just use whatever is available in the cluster.

restic is going to get deprecated soon anyway; we can just test kopia.

@aarnq (Contributor) commented Sep 18, 2024:

> I think this is good. I agree that we don't need to test both restic and kopia here; we can just use whatever is available in the cluster.

> restic is going to get deprecated soon anyway; we can just test kopia.

We don't really want end-to-end tests to modify and apply config changes, so moving to Kopia is a matter of changing the default config and templates. The tests don't seem to make any difference regardless, so they don't really need to be adapted.

Successfully merging this pull request may close: [5] Add Velero backup and restore to end-to-end tests