FlowAggregator bandwidth tests are too flaky #2283

Closed
antoninbas opened this issue Jun 16, 2021 · 12 comments · Fixed by #2308
Labels
area/test/e2e Issues or PRs related to Antrea specific end-to-end testing. kind/bug Categorizes issue or PR as related to a bug.

Comments

@antoninbas
Contributor

Describe the bug
I have been seeing test failures like this one frequently:

=== RUN   TestFlowAggregator/IPv6/LocalServiceAccess
    flowaggregator_test.go:555: Check the bandwidth using octetDeltaCountFromSourceNode in data record.
    flowaggregator_test.go:675: Iperf throughput: 24678.40 Mbits/s, IPFIX record throughput calculated through octetDeltaCountFromSourceNode: 23130.28 Mbits/s
    flowaggregator_test.go:564: Check the average bandwidth using octetTotalCountFromSourceNode 26185840273 in data record.
    flowaggregator_test.go:675: Iperf throughput: 24678.40 Mbits/s, IPFIX record throughput calculated through octetTotalCountFromSourceNode: 17457.23 Mbits/s
    flowaggregator_test.go:676: 
        	Error Trace:	flowaggregator_test.go:676
        	            				flowaggregator_test.go:565
        	            				flowaggregator_test.go:455
        	Error:      	Max difference between 17457.22684866667 and 24678.4 allowed is 3701.76, but difference was -7221.1731513333325
        	Test:       	TestFlowAggregator/IPv6/LocalServiceAccess
        	Messages:   	Difference between Iperf bandwidth and IPFIX record bandwidth calculated through octetTotalCountFromSourceNode should be lower than 15%
=== RUN   TestFlowAggregator/IPv6/RemoteServiceAccess
    flowaggregator_test.go:555: Check the bandwidth using octetDeltaCountFromSourceNode in data record.
    flowaggregator_test.go:675: Iperf throughput: 562.00 Mbits/s, IPFIX record throughput calculated through octetDeltaCountFromSourceNode: 561.73 Mbits/s
    flowaggregator_test.go:564: Check the average bandwidth using octetTotalCountFromSourceNode 842696547 in data record.
    flowaggregator_test.go:675: Iperf throughput: 562.00 Mbits/s, IPFIX record throughput calculated through octetTotalCountFromSourceNode: 561.80 Mbits/s
=== CONT  TestFlowAggregator
    fixtures.go:338: Deleting 'flow-aggregator' K8s Namespace
    fixtures.go:224: Exporting test logs to '/var/lib/jenkins/workspace/antrea-ipv6-only-e2e-for-pull-request/antrea-test-logs/TestFlowAggregator/beforeTeardown.Jun14-15-24-38'
    fixtures.go:328: Error when exporting kubelet logs: error when running journalctl on Node 'antrea-ipv6-2-0', is it available? Error: <nil>
    fixtures.go:349: Deleting 'antrea-test' K8s Namespace
--- FAIL: TestFlowAggregator (247.98s)
    --- FAIL: TestFlowAggregator/IPv6 (101.84s)
        --- PASS: TestFlowAggregator/IPv6/IntraNodeFlows (14.54s)
        --- PASS: TestFlowAggregator/IPv6/IntraNodeDenyConnIngressANP (3.28s)
        --- PASS: TestFlowAggregator/IPv6/IntraNodeDenyConnEgressANP (4.53s)
        --- PASS: TestFlowAggregator/IPv6/IntraNodeDenyConnNP (5.19s)
        --- PASS: TestFlowAggregator/IPv6/InterNodeFlows (14.61s)
        --- PASS: TestFlowAggregator/IPv6/InterNodeDenyConnIngressANP (3.29s)
        --- PASS: TestFlowAggregator/IPv6/InterNodeDenyConnEgressANP (3.39s)
        --- PASS: TestFlowAggregator/IPv6/InterNodeDenyConnNP (9.55s)
        --- PASS: TestFlowAggregator/IPv6/ToExternalFlows (12.29s)
        --- FAIL: TestFlowAggregator/IPv6/LocalServiceAccess (13.18s)
        --- PASS: TestFlowAggregator/IPv6/RemoteServiceAccess (14.90s)

It seems to happen more frequently for IPv6 e2e test jobs than for IPv4 e2e test jobs, but I don't know if there is a real correlation there.

@zyiou
Contributor

zyiou commented Jun 17, 2021

(Moved comment to #2282 as it is more related)

@srikartati
Member

@zyiou This is a good find. It makes sense to cover all the cases through different templates. However, what you found is pertinent to the dual-stack case and doesn't explain the flakiness in the other test setups reported here. Therefore, it is better to move it to issue #2282.

I think we have to check the combination before sending each aggregated flow record. If that is the case, it's better to have this info as metadata and pick the correct template ID when adding the data record.
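
A minimal sketch of that idea, assuming "the combination" refers to the source/destination IP address families, and using hypothetical types and placeholder template IDs rather than the actual Flow Aggregator or go-ipfix code:

```go
package main

import "fmt"

// aggregatedRecord carries the address-family metadata for an aggregated flow
// (hypothetical type, not the actual Flow Aggregator data structure).
type aggregatedRecord struct {
	isIPv4Source bool
	isIPv4Dest   bool
}

// pickTemplateID returns the template ID matching the record's address-family
// combination; the three template IDs are placeholders.
func pickTemplateID(r aggregatedRecord, tmplIPv4, tmplIPv6, tmplDualStack uint16) uint16 {
	switch {
	case r.isIPv4Source && r.isIPv4Dest:
		return tmplIPv4
	case !r.isIPv4Source && !r.isIPv4Dest:
		return tmplIPv6
	default:
		return tmplDualStack
	}
}

func main() {
	rec := aggregatedRecord{isIPv4Source: false, isIPv4Dest: false}
	fmt.Println("template ID:", pickTemplateID(rec, 256, 257, 258)) // 257 (IPv6 template)
}
```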

@srikartati
Member

@dreamtalen Could you comment on this bandwidth issue based on your experience while working on a similar bug earlier: #2211? Are these related?

@dreamtalen
Contributor

> @dreamtalen Could you comment on this bandwidth issue based on your experience while working on a similar bug earlier: #2211? Are these related?

Sure, the error in this screenshot shows that the test failed at the average bandwidth check using octetTotalCountFromSourceNode, and the value 17457.23 Mbits/s is about 25% less than expected. This means that in the last record the flowEndSeconds was updated while octetTotalCountFromSourceNode was not (the last flow record from the source Node, at timestamp 12 s since the traffic began, was missed), so I think it is related to issue #2211.
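
For reference, a minimal Go sketch (not the test code itself) of the average bandwidth calculation that reproduces the failing figure above, assuming the 12 s flow duration mentioned in this comment:

```go
package main

import "fmt"

// avgBandwidthMbps converts a cumulative octet count and flow duration into
// an average throughput in Mbits/s.
func avgBandwidthMbps(octetTotalCount uint64, flowDurationSec float64) float64 {
	return float64(octetTotalCount) * 8 / flowDurationSec / 1e6
}

func main() {
	const durationSec = 12.0 // seconds since traffic started (from the comment above)
	got := avgBandwidthMbps(26185840273, durationSec) // octetTotalCountFromSourceNode from the failing log
	fmt.Printf("average bandwidth from octetTotalCountFromSourceNode: %.2f Mbits/s\n", got)
	// Prints ~17457.23 Mbits/s. Matching the iperf-reported 24678.40 Mbits/s over
	// the same window would require roughly 37e9 octets, i.e. the counter in the
	// last record is missing the bytes from the final export interval.
}
```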

@srikartati
Member

srikartati commented Jun 22, 2021

> Sure, the error in this screenshot shows that the test failed at the average bandwidth check using octetTotalCountFromSourceNode, and the value 17457.23 Mbits/s is about 25% less than expected. This means that in the last record the flowEndSeconds was updated while octetTotalCountFromSourceNode was not (the last flow record from the source Node, at timestamp 12 s since the traffic began, was missed), so I think it is related to issue #2211.

I thought the intention of the workaround placed in the e2e flow aggregator test was to make the test stable. Do you know why it became flaky?

@dreamtalen
Contributor

> I thought the intention of the workaround placed in the e2e flow aggregator test was to make the test stable. Do you know why it became flaky?

I don't think the current workaround makes the test flaky; I just triggered the e2e test with a local IPv6 Vagrant cluster multiple times and no failure happened. Maybe we should look through the records received by the collector in this failed case to find the root cause.

@dreamtalen
Contributor

@antoninbas Hi Antonin, is the failure shown in the screenshot from an IPv6-only Jenkins test job? Could you share how frequently it happens, and how to reproduce it if possible? Thanks!

@antoninbas
Contributor Author

@dreamtalen all the information I have is in the issue. I don't recall which testbed it was, but I do know I observed the same issue on IPv4 testbeds as well. Until recently there was a memory leak in the test binary, which caused a lot of test failures as one of the Nodes (the control-plane Node) would run out of memory. It's possible that it could explain such failures, although I'm not sure how. You can keep monitoring the jobs to see if it happens again.

@antoninbas
Contributor Author

@dreamtalen I just saw this failure on an IPv4 testbed: https://jenkins.antrea-ci.rocks/job/antrea-e2e-for-pull-request/2745/console

=== RUN   TestFlowAggregator/IPv4/LocalServiceAccess
    flowaggregator_test.go:568: Check the bandwidth using octetDeltaCountFromSourceNode in data record.
    flowaggregator_test.go:688: Iperf throughput: 18227.20 Mbits/s, IPFIX record throughput calculated through octetDeltaCountFromSourceNode: 17747.55 Mbits/s
    flowaggregator_test.go:577: Check the average bandwidth using octetTotalCountFromSourceNode 26684556369 in data record.
    flowaggregator_test.go:688: Iperf throughput: 18227.20 Mbits/s, IPFIX record throughput calculated through octetTotalCountFromSourceNode: 17789.70 Mbits/s
=== RUN   TestFlowAggregator/IPv4/RemoteServiceAccess
    flowaggregator_test.go:568: Check the bandwidth using octetDeltaCountFromSourceNode in data record.
    flowaggregator_test.go:688: Iperf throughput: 5416.96 Mbits/s, IPFIX record throughput calculated through octetDeltaCountFromSourceNode: 5303.33 Mbits/s
    flowaggregator_test.go:577: Check the average bandwidth using octetTotalCountFromSourceNode 5754020579 in data record.
    flowaggregator_test.go:688: Iperf throughput: 5416.96 Mbits/s, IPFIX record throughput calculated through octetTotalCountFromSourceNode: 3836.01 Mbits/s
    flowaggregator_test.go:689: 
        	Error Trace:	flowaggregator_test.go:689
        	            				flowaggregator_test.go:578
        	            				flowaggregator_test.go:478
        	Error:      	Max difference between 3836.0137193333335 and 5416.96 allowed is 812.544, but difference was -1580.9462806666666
        	Test:       	TestFlowAggregator/IPv4/RemoteServiceAccess
        	Messages:   	Difference between Iperf bandwidth and IPFIX record bandwidth calculated through octetTotalCountFromSourceNode should be lower than 15%
=== CONT  TestFlowAggregator
    fixtures.go:328: Deleting 'flow-aggregator' K8s Namespace
    fixtures.go:214: Exporting test logs to '/var/lib/jenkins/workspace/antrea-e2e-for-pull-request/antrea-test-logs/TestFlowAggregator/beforeTeardown.Jun23-21-34-50'
    fixtures.go:318: Error when exporting kubelet logs: error when running journalctl on Node 'antrea-e2e-for-pull-request-2745-7rdnp', is it available? Error: <nil>
    fixtures.go:339: Deleting 'antrea-test' K8s Namespace
--- FAIL: TestFlowAggregator (229.49s)
    --- FAIL: TestFlowAggregator/IPv4 (94.66s)
        --- PASS: TestFlowAggregator/IPv4/IntraNodeFlows (14.06s)
        --- PASS: TestFlowAggregator/IPv4/IntraNodeDenyConnIngressANP (3.37s)
        --- PASS: TestFlowAggregator/IPv4/IntraNodeDenyConnEgressANP (3.38s)
        --- PASS: TestFlowAggregator/IPv4/IntraNodeDenyConnNP (4.56s)
        --- PASS: TestFlowAggregator/IPv4/InterNodeFlows (13.53s)
        --- PASS: TestFlowAggregator/IPv4/InterNodeDenyConnIngressANP (3.40s)
        --- PASS: TestFlowAggregator/IPv4/InterNodeDenyConnEgressANP (2.59s)
        --- PASS: TestFlowAggregator/IPv4/InterNodeDenyConnNP (8.88s)
        --- PASS: TestFlowAggregator/IPv4/ToExternalFlows (10.27s)
        --- PASS: TestFlowAggregator/IPv4/LocalServiceAccess (14.05s)
        --- FAIL: TestFlowAggregator/IPv4/RemoteServiceAccess (13.45s)

This PR doesn't have the memory leak patch, but I am not sure the memory leak affected IPv4 testbeds.

@dreamtalen
Contributor

> @dreamtalen I just saw this failure on an IPv4 testbed: https://jenkins.antrea-ci.rocks/job/antrea-e2e-for-pull-request/2745/console
>
> This PR doesn't have the memory leak patch, but I am not sure the memory leak affected IPv4 testbeds.

Thanks Antonin, I'm looking at it.

@dreamtalen
Contributor

dreamtalen commented Jun 24, 2021

Updated: I reproduced this bandwidth check failure successfully. The root cause is that the ipfix-collector only received two data records and missed the last one, so the average bandwidth calculated using octetTotalCountFromSourceNode is wrong.
The reason the collector didn't receive the last data record is that the iperf traffic produces both a control flow record and a data flow record; in the corner case, the collector has just received the last control flow record, but not yet the data flow record, when it stops the wait.PollImmediate() at

if exportTime >= timeStart.Unix()+iperfTimeSec {

To solve this issue, we need to improve the condition in wait.PollImmediate() so that it only considers the data flow record of the iperf traffic. The --cport n option, which specifies the client-side port of the iperf command, may help distinguish the control and data flows because they have different client-side ports.
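
A rough sketch of that suggested fix (not the actual change that went into #2308): stop the poll only once a record whose source port matches the iperf data connection (e.g. pinned with --cport) has an export time past the end of the iperf run. The helper functions passed in (getRecords, srcPortOf, exportTimeOf) are hypothetical stand-ins for the test's record parsing.

```go
package e2esketch

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForIperfDataFlowRecord polls the collector output until a record for the
// iperf data flow (identified by its client-side source port) has an export
// time past the end of the iperf run, ignoring the iperf control flow records.
func waitForIperfDataFlowRecord(timeStart time.Time, iperfTimeSec int64, dataFlowSrcPort int,
	getRecords func() ([]string, error), srcPortOf func(string) int, exportTimeOf func(string) int64) error {
	return wait.PollImmediate(500*time.Millisecond, 30*time.Second, func() (bool, error) {
		records, err := getRecords()
		if err != nil {
			return false, err
		}
		for _, rec := range records {
			if srcPortOf(rec) != dataFlowSrcPort {
				continue // skip the iperf control flow record
			}
			if exportTimeOf(rec) >= timeStart.Unix()+iperfTimeSec {
				return true, nil // a data flow record covering the full run has arrived
			}
		}
		return false, nil
	})
}
```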

@srikartati
Member

> To solve this issue, we need to improve the condition in wait.PollImmediate() so that it only considers the data flow record of the iperf traffic. The --cport n option, which specifies the client-side port of the iperf command, may help distinguish the control and data flows because they have different client-side ports.

Great. The source port could definitely help in resolving the issue. Earlier, we talked about using cport, but it was not implemented when improving the bandwidth tests. We could also get the source port from the iperf command output. I incorporated source-port-based detection in PR #2308; it should resolve the issue. Let me know if you think otherwise.
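
As an illustration of the second option, a minimal sketch (not the code in #2308) that extracts the client-side source port from iperf3 client output; the sample line format is an assumption about iperf3's output:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// connectedLine matches the iperf3 client line reporting the data connection,
// e.g. "[  5] local 10.10.1.2 port 44032 connected to 10.10.1.3 port 5201".
var connectedLine = regexp.MustCompile(`local \S+ port (\d+) connected to`)

// parseIperfSrcPort returns the source port from the first "connected to"
// line in the iperf output, or an error if none is found.
func parseIperfSrcPort(output string) (int, error) {
	m := connectedLine.FindStringSubmatch(output)
	if m == nil {
		return 0, fmt.Errorf("no iperf connection line found in output")
	}
	return strconv.Atoi(m[1])
}

func main() {
	sample := "[  5] local 10.10.1.2 port 44032 connected to 10.10.1.3 port 5201"
	port, err := parseIperfSrcPort(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println("iperf data flow source port:", port) // 44032
}
```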
