Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CNF-13652: Test integration of SRIOV and NMState operators #1990

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

zeeke
Copy link
Member

@zeeke zeeke commented Aug 8, 2024

This changes implements a simple test to verify the NMState operator does not interfere with the SR-IOV network operator.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Aug 8, 2024
@openshift-ci-robot
Copy link
Collaborator

@zeeke: This pull request references CNF-13652 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.

In response to this:

This changes implements a simple test to verify the NMState operator does not interfere with the SR-IOV network operator.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 8, 2024
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 8, 2024
@zeeke
Copy link
Member Author

zeeke commented Aug 27, 2024

/test ?

Copy link
Contributor

openshift-ci bot commented Aug 27, 2024

@zeeke: The following commands are available to trigger required jobs:

  • /test ci
  • /test e2e-aws-ci-tests
  • /test images
  • /test ztp-ci

The following commands are available to trigger optional jobs:

  • /test e2e-aws-ran-profile
  • /test e2e-telco5g-cnftests
  • /test e2e-telco5g-hcp-cnftests
  • /test e2e-telco5g-sno-cnftests
  • /test security

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-openshift-kni-cnf-features-deploy-master-ci
  • pull-ci-openshift-kni-cnf-features-deploy-master-e2e-aws-ci-tests
  • pull-ci-openshift-kni-cnf-features-deploy-master-e2e-aws-ran-profile
  • pull-ci-openshift-kni-cnf-features-deploy-master-images
  • pull-ci-openshift-kni-cnf-features-deploy-master-security

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@zeeke
Copy link
Member Author

zeeke commented Aug 27, 2024

/test e2e-telco5g-cnftests

@zeeke
Copy link
Member Author

zeeke commented Aug 29, 2024

/test e2e-telco5g-cnftests

@zeeke zeeke force-pushed the CNF-13652 branch 2 times, most recently from 7692f80 to bcb630c Compare August 29, 2024 11:25
@zeeke zeeke changed the title [WIP] CNF-13652: Test integration of SRIOV and NMState operators CNF-13652: Test integration of SRIOV and NMState operators Aug 29, 2024
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 29, 2024
@zeeke
Copy link
Member Author

zeeke commented Aug 29, 2024

/test e2e-telco5g-cnftests

1 similar comment
@zeeke
Copy link
Member Author

zeeke commented Aug 30, 2024

/test e2e-telco5g-cnftests

func waitForNMStatePolicyToBeStable(nmstatePolicy *nmstatev1.NodeNetworkConfigurationPolicy) {
key := runtimeclient.ObjectKey{Name: nmstatePolicy.Name}
Eventually(func(g Gomega) {
err := client.Client.Get(context.Background(), key, nmstatePolicy)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the nmstatePolicy is using the arg var, which creates confusion about the scope of the variables. Maybe have a new one.

Help functions imo are easier to read/be refactored if they are not having many assertions. I always suggest the assertion to be in the main block and have function that return error/true.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the nmstatePolicy is using the arg var, which creates confusion about the scope of the variables. Maybe have a new one.

good point. And since the function only use the name of the policy, I refactor it to pass a string only.

Help functions imo are easier to read/be refactored if they are not having many assertions. I always suggest the assertion to be in the main block and have function that return error/true.

I'm not 100% with it: helper functions must be short (<20 lines, as a rule of thumb) and that's where they get their readability. On the other side, the call site needs to be way more readable, as it contains the complicated test logic. With the current implementation, the call site reads:

waitForNMStatePolicyToBeStable(nodeNetworkConfigPolicy.Name)

which reads straightforward to me.
With functions like these, we might lose reusability; a function that waits for different conditions is almost copy/paste, but that's another story.

testDevice := devices[0]
By("Using device " + testDevice.Name)

By("Configuring IPAM on Physical Function via NMState")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that different than "Configuring and IPv4 address"? just IPAM (Address Management) makes me think whereabouts/or other IPAM is being deployed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IPv4 is definitely more clear! changing

sriovclient = sriovtestclient.New("")
}

var _ = Describe("[nmstate] SR-IOV Network Operator Integration", Ordered, func() {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the ordered is needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To use the BeforeAll section.
https://onsi.github.io/ginkgo/#setup-in-ordered-containers-beforeall-and-afterall

The question might be, why did I used a BeforeAll if there is only one test case? The answer is "I hope to add more test cases in the future". It might be an early optimization, WDYT?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty sure you can use BeforeEach, AfterEach stacked on top of Describe/Context to achieve what you need in the future.

According to the doc

Ginkgo did not include support for Ordered containers for quite some time. As you can see Ordered containers make it possible to circumvent the "Declare in container nodes, initialize in setup nodes" principle; and they make it possible to write dependent specs This comes at a cost, of course - specs in Ordered containers cannot be fully parallelized which can result in slower suite runtimes. Despite these cons, pragmatism prevailed and Ordered containers were introduced in response to real-world needs in the community. Nonetheless, we recommend using Ordered containers only when needed.

So I would say specs(tests) should be not ordered unless they really really need to be.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, let's go for the BeforeEach way

resNum := testedNode.Status.Allocatable[corev1.ResourceName("openshift.io/testnmstateresource")]
capacity, _ := resNum.AsInt64()
return capacity
}, 10*time.Minute, time.Second).Should(Equal(int64(10)))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the end spec is " there should be 10 VRFs"?

I suggest to extend the description that appears if the spec fails, e.g.

Should(Equal(int64(10))," the number of VFs observes is not that same as the one sriov created")

``

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good. changing

cnf-tests/testsuites/e2esuite/nmstate/nmstate_sriov.go Outdated Show resolved Hide resolved
})

func findNonPrimarySriovDevicesAndNode(n *cluster.EnabledNodes) (string, []*sriovv1.InterfaceExt, error) {
errs := []error{}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use this lib https://pkg.go.dev/errors to wrap errors

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm using that package at the end of the function:

if len(retDevices) == 0 {
	return "", nil, fmt.Errorf("can't find any SR-IOV devices in cluster's nodes: %w", errors.Join(errs...))
}

How would you use it with Wrap(...)?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can avoid the []array

func main() {
	var err error
    if false{
	err1 := errors.New("err1")
	err = errors.Join(err, err1)
	err2 := errors.New("err2")
	err = errors.Join(err, err2)
}
	fmt.Println(err)

}

but to be honest, both works

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

going for the errors.Join(...) approach


if len(nonPrimaryDevices) > len(retDevices) {
retNode = node
retDevices = nonPrimaryDevices

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you just return here in the happy path, and return the wrap errors. I might be missing if there is selection between two nodes that have sriov or not

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point: this function is not very clear. The point is to find the node with the most number of SR-IOV devices. Going to reword the variables and comment to make it clearer.

name: openshift-nmstate
spec:
finalizers:
- kubernetes

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] my yamllint is complaining that it need --- and two spaces here (same for all yamls)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make the linter happy 😉

@zeeke zeeke force-pushed the CNF-13652 branch 3 times, most recently from b7f7b62 to cd0875f Compare September 4, 2024 08:31
@zeeke
Copy link
Member Author

zeeke commented Sep 4, 2024

/test e2e-telco5g-cnftests

name: kubernetes-nmstate-operator
source: art-nightly-operator-catalog
sourceNamespace: openshift-marketplace
---
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the --- goes to the top

yamllint  feature-configs/ci/nmstate/kustomization.yaml
feature-configs/ci/nmstate/kustomization.yaml
  1:1       warning  missing document start "---"  (document-start)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just noticed all the yaml under feature-configs does not have --- . Lets make it as all the other and live with that until we decide to put yamllint in the CI. Does it sound good?

@zeeke zeeke force-pushed the CNF-13652 branch 2 times, most recently from 6b21f40 to f81a28a Compare September 4, 2024 16:12
@zeeke
Copy link
Member Author

zeeke commented Sep 4, 2024

/test e2e-telco5g-cnftests

@zeeke
Copy link
Member Author

zeeke commented Sep 4, 2024

/retest
/test e2e-telco5g-cnftests

Makefile Outdated
@@ -1,5 +1,5 @@
#TODO add default features here
export FEATURES?=sctp performance vrf container-mount-namespace metallb tuningcni bondcni
export FEATURES?=sctp performance vrf container-mount-namespace metallb tuningcni bondcni nmstate
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: knmstate is more accurate, here and in the following commits

var sriovInfos *cluster.EnabledNodes

BeforeEach(func() {
if !networks.IsSriovOperatorInstalled() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it make sense to check for knmstate too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good

@@ -0,0 +1,207 @@
package nmstate
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo on the commit message: opertor

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, sohuld

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixing

// findNonPrimarySriovDevicesAndNode returns a node name with its SR-IOV devices. The device list does not include
// the one used for primary networking, if it is an SR-IOV device.
// The node with the highest number of SR-IOV devices is returned.
func findNonPrimarySriovDevicesAndNode(n *cluster.EnabledNodes) (string, []*sriovv1.InterfaceExt, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name of this is not super clear, and I feel this is doing too many things all together. It finds the node with the highest number of interfaces, and it finds the interfaces for that (excluding the primary).

In general, I'd split this in two steps:

  • find the first node with at least two sriov interfaces
  • fetch all the interfaces for that node, filtering out the default interface

Also, is there a need to find the one with the highest number of devices or one with one plus the default interface would be enough?

Copy link
Member Author

@zeeke zeeke Sep 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • find the first node with at least two sriov interfaces

There is no need to have two interfaces. If we have one node with one interface that is not used for primary networking, that node/device pair is good.

Also, is there a need to find the one with the highest number of devices or one with one plus the default interface would be enough?

No special need. The function can exit early as soon as it finds a good node-device pair.
changing

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry I wasn't clear. I meant having two functions, one that finds the node and the other one that gives you the interface for that node. I think it's clearer and each function would have a clearer responsibility

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'm not figuring out how this function division should be:
the first function should find a node, but I need a node with at least an SRIOV device that is not used as a primary NIC. Then, the function should return the node name, and another function takes the node name and returns all the non-Primary NIC.

If this is what you meant, then the first function would be 99% equal to what we have now.

If not, please provide some code skeleton

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the name of the function to findNodeWithNonPrimarySriovNIC

}).
WithPolling(5*time.Second).
WithTimeout(10*time.Minute).
Should(Equal(int64(10)), "the number of the observed VFs is not that same as the one sriov created")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wdyt about going deeper in checking that both operators did their part? I would check the ip on the interface on the host, to make sure (k)nmstate did its part and probably make some tests on the vf.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good! changing

}

buf, err := nodes.ExecCommandOnMachineConfigDaemon(client.Client, nodeObj,
[]string{"/bin/bash", "-c", "chroot /rootfs /usr/bin/ovs-vsctl list-ports br-ex | /usr/bin/grep -v patch"})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably you can get away with checking interfaces which do not have br-ex as master, not sure it would be less fragile

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ip link on my dev worker node says master ovs-system, but it's not unique:

$ oc debug node/cnfdr22.telco5g.eng.rdu2.redhat.com -- chroot /host ip link | grep master
4: eno12399: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP mode DEFAULT group default qlen 1000
10: genev_sys_6081: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65000 qdisc noqueue master ovs-system state UNKNOWN mode DEFAULT group default qlen 1000
14: 39d9e5dfde97b52@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue master ovs-system state UP mode DEFAULT group default 
16: f50c6bcd4922aa0@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue master ovs-system state UP mode DEFAULT group default 
17: a9c644ea1db880d@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue master ovs-system state UP mode DEFAULT group default 
...

while ovs-vsctl appears to be a little bit more accurate:

$ oc debug node/cnfdr22.telco5g.eng.rdu2.redhat.com -- chroot /host /usr/bin/ovs-vsctl list-ports br-ex | /usr/bin/grep -v patch
eno12399

I'd go with the ovs-vsctl way, and I'll consider that if I see flakes, I have plan B. WDYT?

nmstate := nmstatev1.NMState{ObjectMeta: metav1.ObjectMeta{Name: "nmstate"}}
err = client.Client.Create(context.Background(), &nmstate)
Expect(err).ToNot(HaveOccurred())
DeferCleanup(client.Client.Delete, context.Background(), &nmstate)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here we should wait for the resource to be deleted, to avoid state leaking across the tests. Also, please be aware that with knmstate, deleting doesn't always mean "restoring the previous state", ie if the pf was down, it will probably stay up and keep the assigned ip.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here we should wait for the resource to be deleted, to avoid state leaking across the tests

Do you mean waiting for all the related resources (pods, webhooks) to be cleared?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and not only that, as I wrote above anything that you applied via the knmstate operator won't be unconfigured when you remove it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the code to correctly clean the API resources (NMState and NodeNetworkConfigurationPolicy). Other resources created by the operator (handler pod, operator pod) might change over time and it would be out of this test scope.

Regarding the node's state, the best thing we can do is avoid applying changes that might damage other tests. The proposed test case adds a test-only IP address to an unused NIC.
I believe implementing a revert logic could generate more flakes than not managing it.

WDYT? Is it a fair tradeoff?

@zeeke
Copy link
Member Author

zeeke commented Sep 20, 2024

@karampok @fedepaol please, take another look when you have time

@zeeke
Copy link
Member Author

zeeke commented Sep 21, 2024

/test e2e-telco5g-cnftests
/retest

return "", fmt.Errorf("can't get node[%s]: %w", node, err)
}

cmd := []string{"chroot", "/rootfs", "ip", "-brief", "address", "show", "dev", intf}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The chroot is not needed I think, the container lands in the host networking so ip command should work.
Also just fyi in case you did no know, the ip command has json param ip --json in case helps (but here brief and ContainerString is just fine)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fine! let's remove the chroot.

}

buf, err := nodes.ExecCommandOnMachineConfigDaemon(client.Client, nodeObj,
[]string{"/bin/bash", "-c", "chroot /rootfs /usr/bin/ovs-vsctl list-ports br-ex | /usr/bin/grep -v patch"})
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here chroot is needed because ovs-vsctl command does not exist outside the rootfs

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 7, 2024
Copy link
Contributor

openshift-ci bot commented Oct 7, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: karampok, zeeke

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Contributor

openshift-ci bot commented Oct 7, 2024

New changes are detected. LGTM label has been removed.

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Oct 7, 2024
@zeeke
Copy link
Member Author

zeeke commented Oct 7, 2024

/retest
/test e2e-telco5g-cnftests

Expect(err).ToNot(HaveOccurred())
DeferCleanup(sriovclient.Delete, context.Background(), sriovNetworkNodePolicy)

By("Verify Virtual Functions have been correctly configured")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: verifying that (same below)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed!

@zeeke
Copy link
Member Author

zeeke commented Oct 11, 2024

/test e2e-telco5g-cnftests

@zeeke
Copy link
Member Author

zeeke commented Oct 11, 2024

/retest

@zeeke
Copy link
Member Author

zeeke commented Oct 14, 2024

/test e2e-telco5g-cnftests
/retest

@zeeke
Copy link
Member Author

zeeke commented Oct 15, 2024

/test e2e-telco5g-cnftests

@zeeke
Copy link
Member Author

zeeke commented Oct 16, 2024

/retest

@zeeke
Copy link
Member Author

zeeke commented Oct 16, 2024

/test e2e-telco5g-cnftests

@zeeke
Copy link
Member Author

zeeke commented Oct 16, 2024

/retest

Copy link
Contributor

openshift-ci bot commented Oct 16, 2024

@zeeke: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ran-profile a3faba8 link false /test e2e-aws-ran-profile
ci/prow/e2e-aws-ci-tests a3faba8 link true /test e2e-aws-ci-tests
ci/prow/e2e-telco5g-cnftests a3faba8 link false /test e2e-telco5g-cnftests

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@zeeke zeeke force-pushed the CNF-13652 branch 2 times, most recently from d4a960e to 602c277 Compare October 23, 2024 09:01
Add a test that configures the same NIC via Kubernetes-NMState and
SR-IOV network operator. The test ensure that if NMState doesn't
have any SR-IOV configuration, then both operator should work correctly.

Signed-off-by: Andrea Panattoni <[email protected]>
`kubernetes-nmsate` CI does not publish images on
`quay.io/openshift/*`. The best way to deploy a recent
version of the operator is to leverage the
`quay.io/openshift-release-dev/ocp-release-nightly:iib-int-index-art-operators-4.17`
index image, which picks latest nightly builds from Brew.

Signed-off-by: Andrea Panattoni <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants