Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Intermittent test failures #588

Closed
matthew-white opened this issue Aug 29, 2022 · 21 comments
Closed

Intermittent test failures #588

matthew-white opened this issue Aug 29, 2022 · 21 comments
Labels
testing Integration tests, unit tests

Comments

@matthew-white
Copy link
Member

Tests often pass locally for me, but there are sometimes one or more failures. The failures aren't consistent: I've seen a few different errors, and even when it's the same error, it's often a different test that fails. It sounds like @ktuite is seeing some of these same errors, so I thought I'd create an issue to track them. Rather than creating an issue for each error, I thought I'd create a single issue to track the different errors we've seen. (We can always spin one off into its own issue if it starts a discussion, but starting things here seems like it might work well.) Feel free to comment if you encounter an error not listed here, to 👍 an error that you've seen, or to add more information to any of the comments below.

@matthew-white matthew-white added the testing Integration tests, unit tests label Aug 29, 2022
@matthew-white

This comment was marked as outdated.

@matthew-white
Copy link
Member Author

matthew-white commented Aug 29, 2022

I see this error maybe 10% of the time that I run tests:

     Error: socket hang up
      at connResetException (internal/errors.js:628:14)
      at Socket.socketOnEnd (_http_client.js:499:23)
      at Socket.emit (domain.js:470:12)
      at endReadableNT (internal/streams/readable.js:1317:12)
      at processTicksAndRejections (internal/process/task_queues.js:82:21)

I've seen the error in these tests:

List
  1) managed encryption
       end-to-end
         should handle mixed [plaintext/encrypted] attachments (not decrypting):

  1) analytics task queries
       submission metrics
         should calculate submissions that have comments:

  1) analytics task queries
       submission metrics
         should calculate submissions:

  1) analytics task queries
       combined analytics
         should be idempotent and not cross-polute project counts:

  1) api: /projects/:id/forms (drafts)
       /:id/draft
         /attachments
           /:name DELETE
             should reject notfound if the draft does not exist:

  1) api: /projects/:id/forms (drafts)
       /:id/draft
         POST
           updating form titles
             purging unneeded drafts
               should purge the old undeeded draft when a new version is uploaded (and no published draft):

  1) api: /users
       POST
         should duplicate the email into the display name if not given:

  1) api: /forms/:id/submissions
       [draft] /:instanceId/attachments/:name DELETE
         should not delete a draft attachment nondraft:

@matthew-white
Copy link
Member Author

matthew-white commented Aug 29, 2022

I see this error maybe 10% of the time that I run tests:

     Error: Parse Error: Expected HTTP/
      at Socket.socketOnData (_http_client.js:515:22)
      at Socket.emit (domain.js:470:12)
      at addChunk (internal/streams/readable.js:290:12)
      at readableAddChunk (internal/streams/readable.js:265:9)
      at Socket.Readable.push (internal/streams/readable.js:204:10)
      at TCP.onStreamRead (internal/stream_base_commons.js:188:23)

I've seen the error in these tests:

List
  1) api: /submission
       POST
         should store the correct formdef and actor ids:

  1) api: /forms/:id.svc
       /draft.svc
         /Submissions.xyz
           should not return results from the published form:

  1) api: /forms/:id/submissions
       [draft] /keys GET
         should return draft-used keys:

  1) analytics task queries
       combined analytics
         should fill in all project.forms queries:

  2) api: /users
       /users/:id DELETE
         should delete any assignments the user had:

I've also seen the similar error:

     Error: Parse Error: Expected HTTP/
      at HTTPParser.execute (<anonymous>)
      at Socket.socketOnData (_http_client.js:515:22)
      at Socket.emit (domain.js:470:12)
      at addChunk (internal/streams/readable.js:290:12)
      at readableAddChunk (internal/streams/readable.js:265:9)
      at Socket.Readable.push (internal/streams/readable.js:204:10)
      at TCP.onStreamRead (internal/stream_base_commons.js:188:23)

I've seen that error in these tests:

List
  1) api: /users
       /reset/initiate POST
         should send a specific email if an account existed but was deleted:

  1) analytics task queries
       submission metrics
         should calculate submissions by review state: approved:

  1) analytics task queries
       combined analytics
         should fill in all project.forms queries:

  1) managed encryption
       end-to-end
         should decrypt to CSV successfully:

@matthew-white
Copy link
Member Author

matthew-white commented Aug 29, 2022

I see this error maybe 40% of the time that I run tests:

     TypeError: Cannot read property 'status' of undefined
      at Test._assertStatus (node_modules/supertest/lib/test.js:263:11)
      at Test._assertFunction (node_modules/supertest/lib/test.js:281:11)
      at Test.assert (node_modules/supertest/lib/test.js:171:18)
      at Server.assert (node_modules/supertest/lib/test.js:131:12)
      at Server.emit (domain.js:470:12)
      at emitCloseNT (net.js:1661:8)
      at processTicksAndRejections (internal/process/task_queues.js:81:21)

I've seen that error in these tests:

List
  1) api: /forms/:id/submissions
       /:instanceId PATCH
         should reject if the user cannot update:

  1) api: /forms/:id/submissions
       /keys GET
         should return a self-managed key if it is used:

  1) api: /forms/:id/submissions
       [draft] .csv.zip
         should return draft submissions:

  1) api: /forms/:id.svc
       /Submissions.xyz.* GET
         should limit and offset subtable results:

  1) /audits
       GET
         should filter out submission and backup events given action=nonverbose:

  2) project viewer role
       should be able to list form submissions:

  1) project viewer role
       should be able to fetch a submission attachment:

  1) api: /users
       /users/:id PATCH
         should update only the allowed fields:

  1) api: /projects
       GET
         should return all projects for an administrator:

  1) api: /projects/:id/forms (drafts)
       /:id/draft
         POST
           should deal with xlsx itemsets:

  1) api: /projects/:id/forms (testing drafts)
       /test/:key/…/:id/draft
         /attachments/:name GET
           should return the attachment:

  1) api: /projects/:id/forms (versions)
       /:id/versions
         /:version
           /manifest GET
             should return a manifest:

  1) api: /assignments
       /:roleId/:actorId
         POST
           should create an assignment by role system id:

  1) api: /submission
       POST
         versioning
           should set the submission review state to edited:

  1) api: /sessions
       /:token DELETE
         should return a 403 if the token does not exist:

@alxndrsn
Copy link
Contributor

@matthew-white does #499 issue look similar to your _assertStatus failures? I think the logs there give a bit more clue as to what could be failing.

I see this error maybe 40% of the time that I run tests:

Does the failure rate decrease if you increase the test mocha default test timeout?

@matthew-white

This comment was marked as outdated.

@matthew-white
Copy link
Member Author

matthew-white commented Aug 30, 2022

does #499 issue look similar to your _assertStatus failures? I think the logs there give a bit more clue as to what could be failing.

I think they're similar insomuch as they're both failing at an .expect() call that's making an assertion about the response status code. But in #499, it's the wrong status code, whereas above, it's more like there isn't even a response to check the status. Also, I didn't see additional logging like #499.

ladjs/supertest#352 has some related discussion. Maybe we could try updating SuperTest? Though it doesn't seem like the version we're on should be having this issue.

Does the failure rate decrease if you increase the test mocha default test timeout?

For some reason, I'm now seeing test failures at a much lower rate. I'm not seeing some of the errors that I did yesterday. I'm still seeing Error: end of central directory record signature not found, and I'm also seeing a new error. I'm going to keep trying over the next few days to see if there are changes.

@matthew-white
Copy link
Member Author

I've seen this test failure just once so far:

  1) api: /forms/:id/submissions
       .csv.zip GET
         should split select multiples and filter given both options:
     Uncaught TypeError: Cannot read property 'call' of undefined
      at Test.assert (node_modules/supertest/lib/test.js:179:6)
      at Server.assert (node_modules/supertest/lib/test.js:131:12)
      at Server.emit (domain.js:470:12)
      at emitCloseNT (net.js:1661:8)
      at processTicksAndRejections (internal/process/task_queues.js:81:21)

It's another SuperTest error, so may be related to TypeError: Cannot read property 'status' of undefined above?

After this test failure, 190 other tests timed out.

@matthew-white
Copy link
Member Author

I've seen this test failure just once so far:

  1) api: /projects/:id/assignments/forms
       GET
         should return all assignments on forms:
     Error: ECONNREFUSED: Connection refused
      at Test.assert (node_modules/supertest/lib/test.js:164:13)
      at assert (node_modules/supertest/lib/test.js:131:12)
      at node_modules/supertest/lib/test.js:128:5
      at Test.Request.callback (node_modules/superagent/lib/node/index.js:728:3)
      at ClientRequest.<anonymous> (node_modules/superagent/lib/node/index.js:647:10)
      at ClientRequest.emit (domain.js:470:12)
      at Socket.socketErrorListener (_http_client.js:475:9)
      at Socket.emit (domain.js:470:12)
      at emitErrorNT (internal/streams/destroy.js:106:8)
      at emitErrorCloseNT (internal/streams/destroy.js:74:3)
      at processTicksAndRejections (internal/process/task_queues.js:82:21)

@matthew-white
Copy link
Member Author

matthew-white commented Aug 31, 2022

I wrote a little Bash script to run make test multiple times, saving just the summary from each test run in testloop.txt:

#!/bin/bash -eo pipefail
testloop() {
	local count=${1:-10}
	touch testloop.txt
	local temp=$(mktemp)
	local i line
	for i in $(seq $count); do
		{ make test || true; } | tee $temp
		line=$(grep -n -m 1 passing $temp | cut -d ':' -f 1)
		tail -n +$line $temp >> testloop.txt
	done
}
testloop 25 # or some other number

@alxndrsn, do you see any of these errors if you run tests say 25 times?

My current theory is that Error: end of central directory record signature not found and the AssertionError for should return worker-processed consolidated client audit log attachments will show up on a variety of machines, whereas for many of the others, there has to be some condition present on the machine related to connections.

@alxndrsn

This comment was marked as outdated.

@matthew-white
Copy link
Member Author

matthew-white commented Sep 2, 2022

I've also seen that "before all" hook: initialize in "{root}" timeout (just once over the past few days). I agree that it seems fine.

I ran my Bash script on CircleCI, running make test 100 times (over two builds). Mostly the only issue was timeouts, but Error: end of central directory record signature not found and the AssertionError for should return worker-processed consolidated client audit log attachments each came up once. Since we're seeing those on a variety of machines, and they don't seem connected to the other errors I've seen, I think I'll go ahead and file individual issues for those. I'll probably stop looking into those errors for now, but anyone is welcome to keep digging. (UPDATE: I ran make test another 100 times, in this build. Error: end of central directory record signature not found came up 5 times.)

I'm continuing to see other errors, but only sometimes. One thing I tried doing is to update SuperTest to the latest version. TypeError: Cannot read property 'status' of undefined went away, so I think updating had an effect. After updating, I think I'm seeing a greater number of Error: socket hang up and Error: Parse Error: Expected HTTP/. My thinking is that those are the actual errors and that the latest version of SuperTest has better error handling that will surface the underlying error instead of failing during an assertion. At some point, I'll push a PR to upgrade SuperTest, since I think that's potentially helpful. I'll also try to look more into why I'm seeing so many socket hang-ups. (Let me know if anyone has ideas there!)

@alxndrsn
Copy link
Contributor

Here's a new one:

1 failing

  1) task: fs
       encrypted archives
         should round-trip successfully:
     AssertionError: expected Array [ 'one' ] to contain Array [ 'one', 'two' ]
      at Assertion.fail (node_modules/should/cjs/should.js:275:17)
      at Assertion.value [as containDeep] (node_modules/should/cjs/should.js:356:19)
      at /home/circleci/repo/test/integration/task/fs.js:33:20
      at async Object.transaction (node_modules/slonik/dist/src/connectionMethods/transaction.js:22:24)
      at async Object.createConnection (node_modules/slonik/dist/src/factories/createConnection.js:97:18)

https://app.circleci.com/pipelines/github/getodk/central-backend/1156/workflows/5b3e5294-0fd2-4f77-a495-233716b88306/jobs/1940

@alxndrsn
Copy link
Contributor

alxndrsn commented Apr 26, 2023

Seen with current master (d8181ac):

  1097 passing (3m)
  14 pending
  3 failing

  1) datbase migrations: removing default project
       should put old forms into project:
     Error: Timeout of 4000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (/odk/odk-central-backend/test/integration/other/migrations.js)
      at listOnTimeout (node:internal/timers:559:17)
      at processTimers (node:internal/timers:502:7)

  2) datbase migrations: intermediate form schema
       should test migration:
     Error: Timeout of 10000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (/odk/odk-central-backend/test/integration/other/migrations.js)
      at listOnTimeout (node:internal/timers:559:17)
      at processTimers (node:internal/timers:502:7)

  3) "after each" hook for "should test migration":
     error: drop owned by current_user - deadlock detected
      at Parser.parseErrorMessage (node_modules/pg-protocol/dist/parser.js:287:98)
      at Parser.handlePacket (node_modules/pg-protocol/dist/parser.js:126:29)
      at Parser.parse (node_modules/pg-protocol/dist/parser.js:39:38)
      at Socket.<anonymous> (node_modules/pg-protocol/dist/index.js:11:42)
      at Socket.emit (node:events:513:28)
      at Socket.emit (node:domain:489:12)
      at addChunk (node:internal/streams/readable:315:12)
      at readableAddChunk (node:internal/streams/readable:289:9)
      at Socket.Readable.push (node:internal/streams/readable:228:10)
      at TCP.onStreamRead (node:internal/stream_base_commons:190:23)



make: *** [Makefile:32: test] Error 3

Adding --timeout=0 to the mocha invocation in the test make target changes the error slightly:

  datbase migrations: intermediate form schema
    1) should test migration
    2) "after each" hook for "should test migration"


  1098 passing (2m)
  14 pending
  2 failing

  1) datbase migrations: intermediate form schema
       should test migration:
     Error: Timeout of 10000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (/odk/odk-central-backend/test/integration/other/migrations.js)
      at listOnTimeout (node:internal/timers:559:17)
      at processTimers (node:internal/timers:502:7)

  2) "after each" hook for "should test migration":
     error: drop owned by current_user - deadlock detected
      at Parser.parseErrorMessage (node_modules/pg-protocol/dist/parser.js:287:98)
      at Parser.handlePacket (node_modules/pg-protocol/dist/parser.js:126:29)
      at Parser.parse (node_modules/pg-protocol/dist/parser.js:39:38)
      at Socket.<anonymous> (node_modules/pg-protocol/dist/index.js:11:42)
      at Socket.emit (node:events:513:28)
      at Socket.emit (node:domain:489:12)
      at addChunk (node:internal/streams/readable:315:12)
      at readableAddChunk (node:internal/streams/readable:289:9)
      at Socket.Readable.push (node:internal/streams/readable:228:10)
      at TCP.onStreamRead (node:internal/stream_base_commons:190:23)



make: *** [Makefile:32: test] Error 2

Modifying all explicit this.timeout(...) calls to this.timeout(0) allows the tests to pass, but is inconvenient as it requires constant change management when working with git.

This has been discussed previously at #532.

Another option that works on my machine is doubling all the explicit timeouts. I've opened a PR for this at #847.

@alxndrsn
Copy link
Contributor

alxndrsn commented May 29, 2023

supertest updated in #899

@alxndrsn

This comment was marked as resolved.

matthew-white added a commit that referenced this issue Sep 22, 2023
This is to prevent intermittent test failures:
#588 (comment)
@alxndrsn
Copy link
Contributor

Seeing intermittent faliures with auth(?) for OIDC tests, e.g.:

oidc-integration-test

2023-09-29T07:28:39.4110341Z ::ffff:127.0.0.1 - - [29/Sep/2023:07:28:39 +0000] "GET /v1/oidc/login HTTP/1.1" 307 331
2023-09-29T07:28:39.4361117Z }attempted to log Sentry exception in development:
2023-09-29T07:28:39.4365708Z OPError: expected 200 OK, got: 408 Request Timeout
2023-09-29T07:28:39.4366590Z     at processResponse (/home/runner/work/central-backend/central-backend/node_modules/openid-client/lib/helpers/process_response.js:41:11)
2023-09-29T07:28:39.4367336Z     at Client.userinfo (/home/runner/work/central-backend/central-backend/node_modules/openid-client/lib/client.js:1237:18)
2023-09-29T07:28:39.4367857Z     at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
2023-09-29T07:28:39.4368425Z     at async /home/runner/work/central-backend/central-backend/lib/resources/oidc.js:99:24 {
2023-09-29T07:28:39.4368890Z   error: 'expected 200 OK, got: 408 Request Timeout'
2023-09-29T07:28:39.4369153Z }
2023-09-29T07:28:39.4383034Z ::ffff:127.0.0.1 - - [29/Sep/2023:07:28:39 +0000] "GET /v1/oidc/callback?code=ItcuKk91nQamjq81DjbXSly4kpezOlYJcjfovU1Wt35&iss=http%3A%2F%2Flocalhost%3A9898 HTTP/1.1" 303 87
2023-09-29T07:28:39.4401946Z         1) should update the label of an entity

@matthew-white

This comment was marked as resolved.

@matthew-white
Copy link
Member Author

@alxndrsn, I feel like we're doing better with intermittent test failures than when I first filed this issue. Glancing over this issue, it's unclear to me which failures are still an issue and which have been resolved (e.g., by the update to SuperTest). What do you think about us closing this issue, then filing an individual issue for each new test failure we see in the future?

@matthew-white
Copy link
Member Author

If there are any test failures listed here that you know are still an issue or that you don't want to miss, I'd be happy to file issues for them.

@alxndrsn
Copy link
Contributor

alxndrsn commented Oct 8, 2024

If there are any test failures listed here that you know are still an issue or that you don't want to miss, I'd be happy to file issues for them.

I think it's inevitable that this issue is stale, so I think we can just open new tickets for any failures we observe in future.

@alxndrsn alxndrsn closed this as completed Oct 8, 2024
alxndrsn pushed a commit to alxndrsn/odk-central-backend that referenced this issue Oct 26, 2024
This should make tests fail faster, and make failures easier to understand:

Previously: `Error: end of central directory record signature not found`
Now:        `Error: expected 200 "OK", got 400 "Bad Request"`

Related:

* getodk#595
* getodk#588
* getodk#1052
alxndrsn added a commit that referenced this issue Oct 28, 2024
This should make tests fail faster, and make failures easier to understand:

Previously: Error: end of central directory record signature not found
Now:        Error: expected 200 "OK", got 400 "Bad Request"

Related:

* #595
* #588
* #1052
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
testing Integration tests, unit tests
Projects
None yet
Development

No branches or pull requests

2 participants