Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Akka Persistence Cassandra retry after DB exceptions #847

Open
mckeeh3 opened this issue Dec 1, 2020 · 5 comments
Open

Akka Persistence Cassandra retry after DB exceptions #847

mckeeh3 opened this issue Dec 1, 2020 · 5 comments
Assignees
Labels
2 - pick next Used to mark issues which are next up in the queue to be worked on. The tag is non-binding

Comments

@mckeeh3
Copy link

mckeeh3 commented Dec 1, 2020

Running tests with Akka Micorservices in Amazon EKS and Keyspaces for Cassandra. When exceptions are thrown due to DB operation timeouts the below queries are retried every few seconds. With version 1.0.3 under the same conditions when these same exceptions occur circuit breakers opened and the retry rate slowed down. Testing indicated that the 1.0.3 behavior recovered more quickly than 1.0.4.

Versions used

Akka version: akka persistence Cassandra 1.0.4

Expected Behavior

When exceptions like this occur expect some circuit breaker reaction with backoff before retry.

Actual Behavior

The below query is performed with no apparent backoff. Queries are retried every few seconds.

Relevant logs

woe-twin-749df6f759-q9kpn woe-twin [2020-12-01 16:28:40,103] [ERROR] [com.datastax.oss.driver.internal.core.tracker.RequestLogger] [] [s0-io-4] - [s0|1919515490][Node(endPoint=3.234.248.222/3.234.248.222:9142, hostId=fd0e98e7-ad76-310d-9094-065cea862868, hashCode=18dbebab)] Error (2 ms) [4 values]
woe-twin-749df6f759-q9kpn woe-twin SELECT persistence_id, tag_pid_sequence_nr, timestamp
woe-twin-749df6f759-q9kpn woe-twin FROM woe_twin.tag_views WHERE
woe-twin-749df6f759-q9kpn woe-twin tag_name = ? AND
woe-twin-749df6f759-q9kpn woe-twin timebucket = ? AND
woe-twin-749df6f759-q9kpn woe-twin timestamp > ? AND
woe-twin-749df6f759-q9kpn woe-twin timestamp <= ? [tag_name='0', timebucket=1606838400000, timestamp=2008b020-33f0-11eb-8080-808080808080, timestamp=2303a0a0-33f0-11eb-ad7d-478fe0f9131b] MDC: {}
woe-twin-749df6f759-q9kpn woe-twin com.datastax.oss.driver.api.core.servererrors.ReadTimeoutException: Cassandra timeout during read query at consistency LOCAL_QUORUM (2 responses were required but only 0 replica responded)
woe-twin-749df6f759-q9kpn woe-twin [2020-12-01 16:28:40,103] [WARN] [akka.persistence.cassandra.query.EventsByTagStage] [] [woe-twin-akka.actor.default-dispatcher-5] - Backtrack failed, this will retried. com.datastax.oss.driver.api.core.servererrors.ReadTimeoutException: Cassandra timeout during read query at consistency LOCAL_QUORUM (2 responses were required but only 0 replica responded) MDC: {akkaAddress=akka://[email protected]:25520, sourceThread=woe-twin-akka.actor.default-dispatcher-5, akkaSource=EventsByTagStage(akka://woe-twin), sourceActorSystem=woe-twin, akkaTimestamp=16:28:40.103UTC}
woe-twin-749df6f759-77xqb woe-twin [2020-12-01 16:28:40,143] [ERROR] [com.datastax.oss.driver.internal.core.tracker.RequestLogger] [] [s0-io-3] - [s0|1245995343][Node(endPoint=3.234.248.223/3.234.248.223:9142, hostId=8f3b894b-760a-38c2-9db9-bcdc6cf66ef2, hashCode=5192550a)] Error (3 ms) [4 values]
woe-twin-749df6f759-77xqb woe-twin SELECT * FROM woe_twin.tag_views WHERE
woe-twin-749df6f759-77xqb woe-twin tag_name = ? AND
woe-twin-749df6f759-77xqb woe-twin timebucket = ? AND
woe-twin-749df6f759-77xqb woe-twin timestamp > ? AND
woe-twin-749df6f759-77xqb woe-twin timestamp < ?
woe-twin-749df6f759-77xqb woe-twin ORDER BY timestamp ASC
woe-twin-749df6f759-77xqb woe-twin [tag_name='0', timebucket=1606838400000, timestamp=2303a0a0-33f0-11eb-ad7d-478fe0f9131b, timestamp=4194b8df-33f2-11eb-7f7f-7f7f7f7f7f7f] MDC: {}
woe-twin-749df6f759-77xqb woe-twin com.datastax.oss.driver.api.core.servererrors.ReadTimeoutException: Cassandra timeout during read query at consistency LOCAL_QUORUM (2 responses were required but only 0 replica responded)

Reproducible Test Case

The problem occurs when Cassandra is pushed to the point that some operations are timing out.

@mckeeh3
Copy link
Author

mckeeh3 commented Dec 2, 2020

From the Amazon Keyspaces contact, Michael Raney.
"If you made a request that was rejected for a hotkey that’s not counted towards the limit itself. It’s sort of blocked at the front door.

There are two thresholds. 1 for the table and 1 for the individual partitions that make up the partition. If access is too hot on a given partition then it may result in hotkey.

Maybe increase the wait between polling slightly? Few mills."

This appears to be what is happening. The DB is throwing exceptions and the queries retry often, which keeps the DB in an overloaded state.

@patriknw
Copy link
Member

patriknw commented Dec 2, 2020

Thanks for reporting. We'll look into this shortly.

@patriknw patriknw added the 2 - pick next Used to mark issues which are next up in the queue to be worked on. The tag is non-binding label Dec 2, 2020
@chbatey
Copy link
Member

chbatey commented Dec 7, 2020

The reties are configurable: https://github.com/akka/akka-persistence-cassandra/blob/master/core/src/main/resources/reference.conf#L357 but maybe there is something broken there.

Can you provide all the query logs? To verify the queries are not backing off to every 30s? Each time there is a successful result the backoff is reset

@patriknw
Copy link
Member

patriknw commented Dec 7, 2020

The retries in eventsByTag works as expected. Exponential backoff according to the configuration, so I don't think that is causing the db overload. The logging of the failed attempts were missing, though, so added in #848

Above log mentions "Backtrack failed, this will retried". I will look into if we retry those queries without delays.

@chbatey
Copy link
Member

chbatey commented Dec 7, 2020

Shall we leave this open for @mckeeh3 to try? Can you try publishing locally Hugh?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2 - pick next Used to mark issues which are next up in the queue to be worked on. The tag is non-binding
Projects
None yet
Development

No branches or pull requests

3 participants