Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v24.1.x] rptest: upgrade kgo and fix its usage #23813

Open
wants to merge 8 commits into
base: v24.1.x
Choose a base branch
from

Conversation

vbotbuildovich
Copy link
Collaborator

Backport of PR #23793

Avoid spamming the log if a persistent failure is detected.

(cherry picked from commit 2203d2a)
This prevents incorrect assertions being performed on the status thread.
Stopping a service is a hard stop and does not guarantee that latest
status of the service was ingested. Instead, the user should call wait()
first and optionally stop().

If we call stop() first, then wait() this hides a problem in
DeleteRecordsTest where KGO crashes but the test never notices it
because stop() just killed the service and the status thread.

(cherry picked from commit 29f94d6)
wait() after stop() does not make sense because wait() is then short
circuited. In a previous commit I made this sequence of calls illegal to
prevent mistakes where the caller might have wrong beliefs about what
wait() does.

In this test it seems that it is never intended to wait for producer or
consumer to finish their work as we are producing an infinite (very
large number) amount of messages and the intent is just to generate load
so stop() is likely what was intended.

(cherry picked from commit c476f20)
This prevents incorrect assertions being performed on the status thread.
Stopping a service is a hard stop and does not guarantee that latest
status of the service was ingested. Instead, the user should call wait
first and optionally stop().

This prevents one particular problem where KGO did crash but the test
never noticed it because it did call stop() and wait() just does nothing
afterwards because it is expected that the service is down afterwards.

(cherry picked from commit 477ccb5)
In immediately preceding commits I have fixed the order in which
consumer is shut down in this test which led which did hide a crash in
kgo verifier consumer. I have also fixed the crash in kgo.

This revealed a problem within DeleteRecordsTest and specifically the
assert on out_of_scope_invalid_reads. In this particular test, before
launching kgo producer, we first call produce_until_segments which uses
a kafka tools for producing data. These batches do not conform to what
kgo-verifier does expect. So it is perfectly valid to have
out_of_scope_invalid_reads.

(cherry picked from commit 45bb8f3)
"optimistically" because it is not guaranteed that kgo verifier service
did any work yet. Users should call wait() in most of the cases and wait
on the service to finish consuming all all the data.

(cherry picked from commit 34eac28)
@vbotbuildovich vbotbuildovich requested a review from a team as a code owner October 16, 2024 20:35
@vbotbuildovich vbotbuildovich removed the request for review from a team October 16, 2024 20:35
@vbotbuildovich vbotbuildovich added this to the v24.1.x-next milestone Oct 16, 2024
@vbotbuildovich vbotbuildovich added the kind/backport PRs targeting a stable branch label Oct 16, 2024
@vbotbuildovich
Copy link
Collaborator Author

vbotbuildovich commented Oct 16, 2024

non flaky failures in https://buildkite.com/redpanda/redpanda/builds/56663#01929753-f4e7-4616-87fa-7893837d528b:

"rptest.tests.data_transforms_test.DataTransformsLoggingMetricsTest.test_manager_metrics_values"

non flaky failures in https://buildkite.com/redpanda/redpanda/builds/56663#01929752-0328-48ca-9d9a-c0ddd435fc9e:

"rptest.tests.test_si_cache_space_leak.ShadowIndexingCacheSpaceLeakTest.test_si_cache.message_size=10000.num_messages=100000.concurrency=2"

@vbotbuildovich
Copy link
Collaborator Author

vbotbuildovich commented Oct 16, 2024

Retry command for Build#56663

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/data_transforms_test.py::DataTransformsLoggingMetricsTest.test_manager_metrics_values
tests/rptest/tests/test_si_cache_space_leak.py::ShadowIndexingCacheSpaceLeakTest.test_si_cache@{"concurrency":2,"message_size":10000,"num_messages":100000}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/backport PRs targeting a stable branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants