Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci: separate out PR tests and run fuzzing and race in nightlies #2673

Closed
wants to merge 4 commits into from

Conversation

cmwaters
Copy link
Contributor

This PR attempts to improve our CI. It breaks up our test suite into 4 parallel test jobs. It doesn't replicate the work by removing test-short and test-coverage, combining them into a single job.

This PR also moves our race tests to the nightlies and adds fuzzing to the nightlies

@cmwaters
Copy link
Contributor Author

In order for this to work we will need to disable the required tests as they no longer exist @evan-forbes (and require the new tests)

@cmwaters cmwaters marked this pull request as ready for review October 12, 2023 15:07
Copy link
Collaborator

@rootulp rootulp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It breaks up our test suite into 4 parallel test jobs.

Why? In my experience, running 4 parallel jobs isn't faster than running 1 sequential job. See #1439. IMO 1 sequential job is easier to monitor than 4 parallel jobs.

It doesn't replicate the work by removing test-short and test-coverage, combining them into a single job.

It looks like make test-short got dropped from this PR and I think we should retain it so that CI can inform us if someone adds a long running test (i.e. 10+ mins) and forgets to add it to the list of tests skipped in short mode.

Ref: #2361

This PR also moves our race tests to the nightlies

Why? make test-race doesn't appear to be a bottleneck. It takes ~8m to run and test takes ~17m to run. Not blocking on it but it seems like we can run make test-race on every PR b/c make test takes longer and they run in parallel.

Ref: https://github.com/celestiaorg/celestia-app/actions/runs/6499346269

and adds fuzzing to the nightlies

👍

@cmwaters
Copy link
Contributor Author

Why? In my experience, running 4 parallel jobs isn't faster than running 1 sequential job. See #1439. IMO 1 sequential job is easier to monitor than 4 parallel jobs.

I think the advantage is more that you only have to rerun the test split that failed instead of rerunning everything (I think you're right that it doesn't allocate more resources and most tests run in parallel anyway). There actually might be a way where we can cache the results so rerunning doesn't rerun the tests that already passed (I actually don't know how this works)

It looks like make test-short got dropped from this PR and I think we should retain it so that CI can inform us if someone adds a long running test (i.e. 10+ mins) and forgets to add it to the list of tests skipped in short mode.

Ref: #2361

I don't think there's a need to run test-short if you're going to run test. The problem with the integration tests that needed to be skipped wasn't that they took long but because of the race detector

Why? make test-race doesn't appear to be a bottleneck. It takes ~8m to run and test takes ~17m to run. Not blocking on it but it seems like we can run make test-race on every PR b/c make test takes longer and they run in parallel.

Ref: https://github.com/celestiaorg/celestia-app/actions/runs/6499346269

Yes I guess the majority of the time of tests is taken is integration tests. Maybe I can just add this back. The reason for removing it is because race detection for a single threaded application is useless. There is only one place in the entire repo that we spin up a go routine and that is the txsim.

At a high level, I'm just trying to reduce the time our c.i. takes. It would be nice to get everything down to about 5mins. Open to other suggestions

@rootulp
Copy link
Collaborator

rootulp commented Oct 13, 2023

Ahh thanks for pointing out the re-run thing for when test fails. I agree we have a problem with needing to re-run the test task frequently. IMO the problem is caused by frequent test flakes. Proposals that come to mind:

  1. Fix test flakes
  2. Modify tests to run faster (i.e. remove sleeps or WaitForNextHeights)
  3. Auto retry tests if flake is observed. Maybe w/ retry-step

Part of my hesitation is that running the tests in parallel on CI is more difficult to monitor and debug when a test fails. Additionally parallel tests in CI are less representative of how tests are run by contributors during development. So I'm concerned of edge cases where make test fails if run sequentially on a contributor's machine but passes on CI b/c they're run in parallel (or vice versa).

I don't think there's a need to run test-short if you're going to run test. The problem with the integration tests that needed to be skipped wasn't that they took long but because of the race detector

Sorry I don't follow what you mean by integration tests that needed to be skipped... I'm repeating myself but I do think there's a reason to run make test-short even if CI runs make test.

Example
CI runs make test and doesn't run make test-short. Developer A adds a test that takes 10+ mins and forgets to add these lines to the test. The PR passes make test in CI and merges. Later on, Developer B runs make test-short locally, they observe that a long running test was added that forgot those lines and needs to make a separate PR to add them. IMO running make test-short on CI with a low timeout (i.e. 1m) catches this early so that the first PR wouldn't have merged.

Yes I guess the majority of the time of tests is taken is integration tests. Maybe I can just add this back. The reason for removing it is because race detection for a single threaded application is useless. There is only one place in the entire repo that we spin up a go routine and that is the txsim.

Makes sense. Down to move make test-race to nightly given it hasn't really caught any issues for our application b/c as you pointed out, it's single threaded.

@evan-forbes
Copy link
Member

evan-forbes commented Oct 13, 2023

In order for this to work we will need to disable the required tests as they no longer exist

cc @MSevey

fwiw the only thing maintainers can do is to require merge/squash/rebase, otherwise its identical to write permissions

@MSevey
Copy link
Member

MSevey commented Oct 16, 2023

In order for this to work we will need to disable the required tests as they no longer exist

cc @MSevey

fwiw the only thing maintainers can do is to require merge/squash/rebase, otherwise its identical to write permissions

@rootulp has admin rights currently from doing the binary signing work. So he can resolve 👍

Copy link
Member

@evan-forbes evan-forbes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong preference here, so I will defer to the author and second reviewer

we should definitely take the time to fix our tests and refactor them so that the tests that actually need to use comet can but in a robust way and the tests that don't need it don't use it

ensuring that all of the tests always pass is clearly a difficult thing to maintain, so having the ability to quickly rerun only portions of the tests is also useful

Copy link
Collaborator

@rootulp rootulp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LMK if you want me to disable required CI checks in order to merge this. I'm personally not a fan of split test files (for reasons mentioned above) but I won't block on it if other maintainers prefer it.

Before we merge this, it may make sense to verify that all the new CI jobs pass on a fork

go-version: ${{ env.GO_VERSION }}

- name: Generate coverage.txt
run: make test-coverage
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[question] if we're removing usage of make test-coverage should we remove the command from the Makefile so that we don't have to maintain it?

@@ -26,3 +27,26 @@ jobs:

- name: Run e2e tests
run: make test-e2e

race-tests:
Copy link
Collaborator

@rootulp rootulp Oct 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit][not blocking] so maintainers can easily link the Makefile command to the CI task. Without consistent naming, it's easy to get lost translating between CI job and Makefile command

Suggested change
race-tests:
test-race:

- name: Run tests in race mode
run: make test-race

fuzz_tests:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[new issue] it may be helpful to add a Makefile command that runs go test -fuzz with similar params to this CI job so that devs can attempt to reproduce an error locally if one is surfaced on CI

@cmwaters
Copy link
Contributor Author

Closing this in favour of #2787 which just focuses on fuzzing

@cmwaters cmwaters closed this Oct 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants