Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make unit tests use redis #6245

Merged
merged 20 commits into from
Sep 19, 2024
Merged

Make unit tests use redis #6245

merged 20 commits into from
Sep 19, 2024

Conversation

cognifloyd
Copy link
Member

@cognifloyd cognifloyd commented Sep 17, 2024

The graceful_shutdown tests have been very flaky using the NoOpDriver. Switching from NoOpDriver to RedisDriver will hopefully make our CI more stable.

This was discussed in a TSC meeting and implemented by @FileMagic (#6223) and @guzzijones (#6236). I cherry-picked the redis-related parts of their excellent work. After stumbling through the cherry-picks (I missed a few things, and had to rebase/cherry-pick a few times), I refactored a few things:

  • GHA: use the redis container defined in services, dropping the tasks that managed it manually.
  • use ST2TESTS_REDIS_* as the format for new env vars instead of ST2_OVERRIDE_COORDINATOR_REDIS_*.
  • fix a regression I added in Clean up import side effects in tests #6241 that only became apparent once we stopped using NoOpDriver. It's surprising and scary how many of our tests rely on import-time side-effects.
  • refactored tests to ensure each test got clean config. Several tests were relying on the config changes previous tests made, making them somewhat brittle. This will hopefully make that better.

👏 Thank you @FileMagic and @guzzijones for getting this figured out! I have high hopes for more stable CI thanks to your excellent work! 🎉

@pull-request-size pull-request-size bot added the size/L PR that changes 100-499 lines. Requires some effort to review. label Sep 17, 2024
@cognifloyd cognifloyd force-pushed the unit-tests-use-redis branch 3 times, most recently from 8c16324 to f772e71 Compare September 18, 2024 21:11
get_members should return a list or tuple, not a string.
I noticed while debugging that get_members output ended up
as ['m', 'e', 'm', 'b', 'e', 'r', '-', '1'] because it
listified the string. So, use a tuple when creating
the mocked NoOpAsyncResult so it is closer to the actual
return values.
@cognifloyd cognifloyd force-pushed the unit-tests-use-redis branch 3 times, most recently from ab46c40 to 78c3248 Compare September 19, 2024 04:11
Comment on lines +40 to +41
# This needs to run before creating FakeConcurrencyApplicator below.
tests_config.parse_args()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests were inadvertently using the NoOpDriver because of the import-time creation of FakeConcurrencyApplicator in the @mock.patch.object decorators. parse_args() wasn't called until setUpClass which happens after import time.

I found this by putting some raise ValueError in codepaths that create the NoOpDriver to make sure nothing was using it.

Comment on lines +78 to +79
tests_config.reset()
tests_config.parse_args()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since these tests need to control the config, changing values in each test, it makes more sense to setup the cfg.CONF per test instead of only during setUpClass. These two lines ensure that the config has been reset to a clean state before proceeding.

In a future refactor, this would be a good function to turn into a pytest fixture, after we've got pytest running everything.

Comment on lines +103 to +104
tests_config.reset()
tests_config.parse_args()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same issues with tests_config apply to this file too. So, reset it per test, not just per class.

Comment on lines -181 to -182
coordination.ToozConnectionError("foobar"),
coordination.ToozConnectionError("foobar"),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this, the side effect happens too much, resulting in this traceback + test error:

======================================================================
1) ERROR: test_process_error_handling (tests.unit.test_workflow_engine.WorkflowExecutionHandlerTest)
----------------------------------------------------------------------
   Traceback (most recent call last):
    virtualenv/lib/python3.8/site-packages/mock/mock.py line 1452 in patched
      return func(*newargs, **newkeywargs)
    st2actions/tests/unit/test_workflow_engine.py line 227 in test_process_error_handling
      workflows.get_engine().process(t1_ac_ex_db)
    st2actions/st2actions/workflows/workflows.py line 103 in process
      self.fail_workflow_execution(message, e)
    st2actions/st2actions/workflows/workflows.py line 191 in fail_workflow_execution
      wf_svc.update_task_state(task_ex_id, ac_const.LIVEACTION_STATUS_FAILED)
    virtualenv/lib/python3.8/site-packages/retrying.py line 56 in wrapped_f
      return Retrying(*dargs, **dkw).call(f, *args, **kw)
    virtualenv/lib/python3.8/site-packages/retrying.py line 257 in call
      return attempt.get(self._wrap_exception)
    virtualenv/lib/python3.8/site-packages/retrying.py line 301 in get
      six.reraise(self.value[0], self.value[1], self.value[2])
    virtualenv/lib/python3.8/site-packages/six.py line 719 in reraise
      raise value
    virtualenv/lib/python3.8/site-packages/retrying.py line 251 in call
      attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
    virtualenv/lib/python3.8/site-packages/retrying.py line 56 in wrapped_f
      return Retrying(*dargs, **dkw).call(f, *args, **kw)
    virtualenv/lib/python3.8/site-packages/retrying.py line 266 in call
      raise attempt.get()
    virtualenv/lib/python3.8/site-packages/retrying.py line 301 in get
      six.reraise(self.value[0], self.value[1], self.value[2])
    virtualenv/lib/python3.8/site-packages/six.py line 719 in reraise
      raise value
    virtualenv/lib/python3.8/site-packages/retrying.py line 251 in call
      attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
    st2common/st2common/services/workflows.py line 1054 in update_task_state
      update_execution_records(
    st2common/st2common/services/workflows.py line 1475 in update_execution_records
      ex_svc.update_execution(wf_lv_ac_db, publish=pub_ac_ex, set_result_size=True)
    st2common/st2common/services/executions.py line 199 in update_execution
      with coordination.get_coordinator().get_lock(str(liveaction_db.id).encode()):
    virtualenv/lib/python3.8/site-packages/mock/mock.py line 1178 in __call__
      return _mock_self._mock_call(*args, **kwargs)
    virtualenv/lib/python3.8/site-packages/mock/mock.py line 1182 in _mock_call
      return _mock_self._execute_mock_call(*args, **kwargs)
    virtualenv/lib/python3.8/site-packages/mock/mock.py line 1243 in _execute_mock_call
      raise result
   ToozConnectionError: foobar

@@ -195,7 +196,6 @@ def test_retries_exhausted_from_coordinator_connection_error(self, mock_get_lock
"update_task_state",
mock.MagicMock(
side_effect=[
mongoengine.connection.ConnectionFailure(),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this, the side effect happens too much, resulting in this traceback + test error on at least python3.9 (and probably newer):

======================================================================
1) ERROR: test_recover_from_database_connection_error (tests.unit.services.test_workflow_service_retries.OrquestaServiceRetryTest)
----------------------------------------------------------------------
   Traceback (most recent call last):
    virtualenv/lib/python3.9/site-packages/mock/mock.py line 1452 in patched
      return func(*newargs, **newkeywargs)
    st2common/tests/unit/services/test_workflow_service_retries.py line 222 in test_recover_from_database_connection_error
      wf_svc.handle_action_execution_completion(tk1_ac_ex_db)
    virtualenv/lib/python3.9/site-packages/retrying.py line 56 in wrapped_f
      return Retrying(*dargs, **dkw).call(f, *args, **kw)
    virtualenv/lib/python3.9/site-packages/retrying.py line 266 in call
      raise attempt.get()
    virtualenv/lib/python3.9/site-packages/retrying.py line 301 in get
      six.reraise(self.value[0], self.value[1], self.value[2])
    virtualenv/lib/python3.9/site-packages/six.py line 719 in reraise
      raise value
    virtualenv/lib/python3.9/site-packages/retrying.py line 251 in call
      attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
    st2common/st2common/services/workflows.py line 969 in handle_action_execution_completion
      update_task_state(
    virtualenv/lib/python3.9/site-packages/mock/mock.py line 1178 in __call__
      return _mock_self._mock_call(*args, **kwargs)
    virtualenv/lib/python3.9/site-packages/mock/mock.py line 1182 in _mock_call
      return _mock_self._execute_mock_call(*args, **kwargs)
    virtualenv/lib/python3.9/site-packages/mock/mock.py line 1243 in _execute_mock_call
      raise result
   ConnectionFailure: 

Comment on lines +143 to +146
redis_host = os.environ.get("ST2TESTS_REDIS_HOST", False)
if redis_host:
redis_port = os.environ.get("ST2TESTS_REDIS_PORT", "6379")
driver = f"redis://{redis_host}:{redis_port}"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ST2TESTS_* is the convention I've been using for vars that configure tests:

db_name = f"st2-test{os.environ.get('ST2TESTS_PARALLEL_SLOT', '')}"
CONF.set_override(name="db_name", override=db_name, group="database")

system_user = os.environ.get("ST2TESTS_SYSTEM_USER", "")
if system_user:
CONF.set_override(name="user", override=system_user, group="system_user")

Comment on lines -166 to -170
- name: Run Redis Service Container
timeout-minutes: 2
run: |
docker run --rm --detach -p 127.0.0.1:6379:6379/tcp --name redis redis:latest
until [ "$(docker inspect -f {{.State.Running}} redis)" == "true" ]; do sleep 0.1; done
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only need one redis container, either started here or in services above. The services approach is a bit nicer because we know that redis is up and responding (thanks to the health check that GHA waits for before continuing), whereas this just makes ssure the container (not redis in the container) is running.

This was also the cause of the conflicts where redis was already in use. Getting rid of this allows us to use the simple redis name for the service container instead of redis-server.

@cognifloyd
Copy link
Member Author

The Test / Test (pants runs: pytest) CI failures are not related to this PR. To validate that, I re-ran the success CI in #6244, and now it fails there too. So, it seem to be some GHA issue.

Luckily, the pants workflows are not required to merge PRs (in branch protection rules), so please ignore the failure.

@guzzijones
Copy link
Contributor

Thanks for going through this. The noop driver random failures were a pain to deal with.

Copy link
Contributor

@nzlosh nzlosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is great work!

@cognifloyd cognifloyd merged commit d985b0d into master Sep 19, 2024
31 of 33 checks passed
@cognifloyd cognifloyd deleted the unit-tests-use-redis branch September 19, 2024 12:55
cognifloyd added a commit that referenced this pull request Sep 19, 2024
This makes the orquesta workflow consistent with ci workflow
changes added in #6245
cognifloyd added a commit that referenced this pull request Sep 19, 2024
This makes the orquesta workflow consistent with ci workflow
changes added in #6245
cognifloyd added a commit that referenced this pull request Sep 26, 2024
This makes the orquesta workflow consistent with ci workflow
changes added in #6245
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
infrastructure: ci/cd service: workflow engine size/L PR that changes 100-499 lines. Requires some effort to review. tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants