-
Notifications
You must be signed in to change notification settings - Fork 135
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CNF-13652: Test integration of SRIOV and NMState operators #1990
base: master
Are you sure you want to change the base?
Conversation
@zeeke: This pull request references CNF-13652 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
/test ? |
@zeeke: The following commands are available to trigger required jobs:
The following commands are available to trigger optional jobs:
Use
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/test e2e-telco5g-cnftests |
/test e2e-telco5g-cnftests |
7692f80
to
bcb630c
Compare
/test e2e-telco5g-cnftests |
1 similar comment
/test e2e-telco5g-cnftests |
func waitForNMStatePolicyToBeStable(nmstatePolicy *nmstatev1.NodeNetworkConfigurationPolicy) { | ||
key := runtimeclient.ObjectKey{Name: nmstatePolicy.Name} | ||
Eventually(func(g Gomega) { | ||
err := client.Client.Get(context.Background(), key, nmstatePolicy) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the nmstatePolicy is using the arg var, which creates confusion about the scope of the variables. Maybe have a new one.
Help functions imo are easier to read/be refactored if they are not having many assertions. I always suggest the assertion to be in the main block and have function that return error/true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the nmstatePolicy is using the arg var, which creates confusion about the scope of the variables. Maybe have a new one.
good point. And since the function only use the name of the policy, I refactor it to pass a string only.
Help functions imo are easier to read/be refactored if they are not having many assertions. I always suggest the assertion to be in the main block and have function that return error/true.
I'm not 100% with it: helper functions must be short (<20 lines, as a rule of thumb) and that's where they get their readability. On the other side, the call site needs to be way more readable, as it contains the complicated test logic. With the current implementation, the call site reads:
waitForNMStatePolicyToBeStable(nodeNetworkConfigPolicy.Name)
which reads straightforward to me.
With functions like these, we might lose reusability; a function that waits for different conditions is almost copy/paste, but that's another story.
testDevice := devices[0] | ||
By("Using device " + testDevice.Name) | ||
|
||
By("Configuring IPAM on Physical Function via NMState") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that different than "Configuring and IPv4 address"? just IPAM (Address Management) makes me think whereabouts/or other IPAM is being deployed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IPv4 is definitely more clear! changing
sriovclient = sriovtestclient.New("") | ||
} | ||
|
||
var _ = Describe("[nmstate] SR-IOV Network Operator Integration", Ordered, func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the ordered
is needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To use the BeforeAll
section.
https://onsi.github.io/ginkgo/#setup-in-ordered-containers-beforeall-and-afterall
The question might be, why did I used a BeforeAll if there is only one test case? The answer is "I hope to add more test cases in the future". It might be an early optimization, WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty sure you can use BeforeEach, AfterEach stacked on top of Describe/Context to achieve what you need in the future.
According to the doc
Ginkgo did not include support for Ordered containers for quite some time. As you can see Ordered containers make it possible to circumvent the "Declare in container nodes, initialize in setup nodes" principle; and they make it possible to write dependent specs This comes at a cost, of course - specs in Ordered containers cannot be fully parallelized which can result in slower suite runtimes. Despite these cons, pragmatism prevailed and Ordered containers were introduced in response to real-world needs in the community. Nonetheless, we recommend using Ordered containers only when needed.
So I would say specs(tests) should be not ordered unless they really really need to be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, let's go for the BeforeEach way
resNum := testedNode.Status.Allocatable[corev1.ResourceName("openshift.io/testnmstateresource")] | ||
capacity, _ := resNum.AsInt64() | ||
return capacity | ||
}, 10*time.Minute, time.Second).Should(Equal(int64(10))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the end spec is " there should be 10 VRFs"?
I suggest to extend the description that appears if the spec fails, e.g.
Should(Equal(int64(10))," the number of VFs observes is not that same as the one sriov created")
``
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good. changing
}) | ||
|
||
func findNonPrimarySriovDevicesAndNode(n *cluster.EnabledNodes) (string, []*sriovv1.InterfaceExt, error) { | ||
errs := []error{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could use this lib https://pkg.go.dev/errors to wrap errors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm using that package at the end of the function:
if len(retDevices) == 0 {
return "", nil, fmt.Errorf("can't find any SR-IOV devices in cluster's nodes: %w", errors.Join(errs...))
}
How would you use it with Wrap(...)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can avoid the []array
func main() {
var err error
if false{
err1 := errors.New("err1")
err = errors.Join(err, err1)
err2 := errors.New("err2")
err = errors.Join(err, err2)
}
fmt.Println(err)
}
but to be honest, both works
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
going for the errors.Join(...)
approach
|
||
if len(nonPrimaryDevices) > len(retDevices) { | ||
retNode = node | ||
retDevices = nonPrimaryDevices |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you just return here in the happy path, and return the wrap errors. I might be missing if there is selection between two nodes that have sriov or not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point: this function is not very clear. The point is to find the node with the most number of SR-IOV devices. Going to reword the variables and comment to make it clearer.
name: openshift-nmstate | ||
spec: | ||
finalizers: | ||
- kubernetes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit] my yamllint is complaining that it need ---
and two spaces here (same for all yamls)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make the linter happy 😉
b7f7b62
to
cd0875f
Compare
/test e2e-telco5g-cnftests |
name: kubernetes-nmstate-operator | ||
source: art-nightly-operator-catalog | ||
sourceNamespace: openshift-marketplace | ||
--- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the ---
goes to the top
yamllint feature-configs/ci/nmstate/kustomization.yaml
feature-configs/ci/nmstate/kustomization.yaml
1:1 warning missing document start "---" (document-start)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just noticed all the yaml under feature-configs
does not have ---
. Lets make it as all the other and live with that until we decide to put yamllint in the CI. Does it sound good?
6b21f40
to
f81a28a
Compare
/test e2e-telco5g-cnftests |
/retest |
Makefile
Outdated
@@ -1,5 +1,5 @@ | |||
#TODO add default features here | |||
export FEATURES?=sctp performance vrf container-mount-namespace metallb tuningcni bondcni | |||
export FEATURES?=sctp performance vrf container-mount-namespace metallb tuningcni bondcni nmstate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: knmstate is more accurate, here and in the following commits
var sriovInfos *cluster.EnabledNodes | ||
|
||
BeforeEach(func() { | ||
if !networks.IsSriovOperatorInstalled() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it make sense to check for knmstate too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good
@@ -0,0 +1,207 @@ | |||
package nmstate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo on the commit message: opertor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, sohuld
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixing
// findNonPrimarySriovDevicesAndNode returns a node name with its SR-IOV devices. The device list does not include | ||
// the one used for primary networking, if it is an SR-IOV device. | ||
// The node with the highest number of SR-IOV devices is returned. | ||
func findNonPrimarySriovDevicesAndNode(n *cluster.EnabledNodes) (string, []*sriovv1.InterfaceExt, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name of this is not super clear, and I feel this is doing too many things all together. It finds the node with the highest number of interfaces, and it finds the interfaces for that (excluding the primary).
In general, I'd split this in two steps:
- find the first node with at least two sriov interfaces
- fetch all the interfaces for that node, filtering out the default interface
Also, is there a need to find the one with the highest number of devices or one with one plus the default interface would be enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- find the first node with at least two sriov interfaces
There is no need to have two interfaces. If we have one node with one interface that is not used for primary networking, that node/device pair is good.
Also, is there a need to find the one with the highest number of devices or one with one plus the default interface would be enough?
No special need. The function can exit early as soon as it finds a good node-device pair.
changing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry I wasn't clear. I meant having two functions, one that finds the node and the other one that gives you the interface for that node. I think it's clearer and each function would have a clearer responsibility
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I'm not figuring out how this function division should be:
the first function should find a node, but I need a node with at least an SRIOV device that is not used as a primary NIC. Then, the function should return the node name, and another function takes the node name and returns all the non-Primary NIC.
If this is what you meant, then the first function would be 99% equal to what we have now.
If not, please provide some code skeleton
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the name of the function to findNodeWithNonPrimarySriovNIC
}). | ||
WithPolling(5*time.Second). | ||
WithTimeout(10*time.Minute). | ||
Should(Equal(int64(10)), "the number of the observed VFs is not that same as the one sriov created") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wdyt about going deeper in checking that both operators did their part? I would check the ip on the interface on the host, to make sure (k)nmstate did its part and probably make some tests on the vf.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good! changing
} | ||
|
||
buf, err := nodes.ExecCommandOnMachineConfigDaemon(client.Client, nodeObj, | ||
[]string{"/bin/bash", "-c", "chroot /rootfs /usr/bin/ovs-vsctl list-ports br-ex | /usr/bin/grep -v patch"}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably you can get away with checking interfaces which do not have br-ex as master, not sure it would be less fragile
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ip link
on my dev worker node says master ovs-system
, but it's not unique:
$ oc debug node/cnfdr22.telco5g.eng.rdu2.redhat.com -- chroot /host ip link | grep master
4: eno12399: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP mode DEFAULT group default qlen 1000
10: genev_sys_6081: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65000 qdisc noqueue master ovs-system state UNKNOWN mode DEFAULT group default qlen 1000
14: 39d9e5dfde97b52@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue master ovs-system state UP mode DEFAULT group default
16: f50c6bcd4922aa0@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue master ovs-system state UP mode DEFAULT group default
17: a9c644ea1db880d@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue master ovs-system state UP mode DEFAULT group default
...
while ovs-vsctl
appears to be a little bit more accurate:
$ oc debug node/cnfdr22.telco5g.eng.rdu2.redhat.com -- chroot /host /usr/bin/ovs-vsctl list-ports br-ex | /usr/bin/grep -v patch
eno12399
I'd go with the ovs-vsctl way, and I'll consider that if I see flakes, I have plan B. WDYT?
nmstate := nmstatev1.NMState{ObjectMeta: metav1.ObjectMeta{Name: "nmstate"}} | ||
err = client.Client.Create(context.Background(), &nmstate) | ||
Expect(err).ToNot(HaveOccurred()) | ||
DeferCleanup(client.Client.Delete, context.Background(), &nmstate) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here we should wait for the resource to be deleted, to avoid state leaking across the tests. Also, please be aware that with knmstate, deleting doesn't always mean "restoring the previous state", ie if the pf was down, it will probably stay up and keep the assigned ip.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here we should wait for the resource to be deleted, to avoid state leaking across the tests
Do you mean waiting for all the related resources (pods, webhooks) to be cleared?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and not only that, as I wrote above anything that you applied via the knmstate operator won't be unconfigured when you remove it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the code to correctly clean the API resources (NMState and NodeNetworkConfigurationPolicy). Other resources created by the operator (handler pod, operator pod) might change over time and it would be out of this test scope.
Regarding the node's state, the best thing we can do is avoid applying changes that might damage other tests. The proposed test case adds a test-only IP address to an unused NIC.
I believe implementing a revert logic could generate more flakes than not managing it.
WDYT? Is it a fair tradeoff?
/test e2e-telco5g-cnftests |
return "", fmt.Errorf("can't get node[%s]: %w", node, err) | ||
} | ||
|
||
cmd := []string{"chroot", "/rootfs", "ip", "-brief", "address", "show", "dev", intf} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The chroot is not needed I think, the container lands in the host networking so ip command should work.
Also just fyi in case you did no know, the ip command has json
param ip --json
in case helps (but here brief and ContainerString is just fine)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fine! let's remove the chroot
.
} | ||
|
||
buf, err := nodes.ExecCommandOnMachineConfigDaemon(client.Client, nodeObj, | ||
[]string{"/bin/bash", "-c", "chroot /rootfs /usr/bin/ovs-vsctl list-ports br-ex | /usr/bin/grep -v patch"}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here chroot is needed because ovs-vsctl command does not exist outside the rootfs
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: karampok, zeeke The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
New changes are detected. LGTM label has been removed. |
/retest |
Expect(err).ToNot(HaveOccurred()) | ||
DeferCleanup(sriovclient.Delete, context.Background(), sriovNetworkNodePolicy) | ||
|
||
By("Verify Virtual Functions have been correctly configured") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: verifying that (same below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed!
/test e2e-telco5g-cnftests |
/retest |
/test e2e-telco5g-cnftests |
Signed-off-by: Andrea Panattoni <[email protected]>
Signed-off-by: Andrea Panattoni <[email protected]>
/test e2e-telco5g-cnftests |
/retest |
/test e2e-telco5g-cnftests |
/retest |
@zeeke: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
d4a960e
to
602c277
Compare
Add a test that configures the same NIC via Kubernetes-NMState and SR-IOV network operator. The test ensure that if NMState doesn't have any SR-IOV configuration, then both operator should work correctly. Signed-off-by: Andrea Panattoni <[email protected]>
`kubernetes-nmsate` CI does not publish images on `quay.io/openshift/*`. The best way to deploy a recent version of the operator is to leverage the `quay.io/openshift-release-dev/ocp-release-nightly:iib-int-index-art-operators-4.17` index image, which picks latest nightly builds from Brew. Signed-off-by: Andrea Panattoni <[email protected]>
This changes implements a simple test to verify the NMState operator does not interfere with the SR-IOV network operator.