Some events are skipped by Akka.Persistence.Query when under load. #313
We have also been battling with this ever since going to production, and it has become more than a nuisance now that the system is getting heavier use. TL;DR: the latest experiment we're testing is to use linearizable read concern when querying. My experiments so far suggest this comes down to how writes take place and what data is returned by queries from MongoDB. There is a danger zone around the moment documents have just been sent for writing: after the BsonTimestamp (Ordering field) is assigned to documents to be written, the order in which those documents become available for querying is not necessarily the same as the timestamp order. That means a query paging through events by Ordering can advance its offset past a document that isn't visible yet, skipping that event for good. Here's how we came to this hypothesis:
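To make the suspected race concrete, here is a minimal reader-side sketch; the collection name, page size, and polling interval are illustrative assumptions, not the plugin's internals:

```csharp
// Minimal sketch of the reader-side race described above (illustrative only).
using System;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

var client = new MongoClient("mongodb://localhost:27017");
var journal = client.GetDatabase("akka").GetCollection<BsonDocument>("EventJournal");

var lastOffset = new BsonTimestamp(0);
while (true)
{
    // Under "local" read concern this page can contain a later-timestamped
    // document while an earlier-timestamped one is still in the danger zone
    // (sent for writing but not yet visible). Once lastOffset advances past
    // the invisible document's timestamp, that event is never returned again.
    var page = await journal
        .Find(Builders<BsonDocument>.Filter.Gt("Ordering", lastOffset))
        .Sort(Builders<BsonDocument>.Sort.Ascending("Ordering"))
        .Limit(500)
        .ToListAsync();

    foreach (var doc in page)
        lastOffset = doc["Ordering"].AsBsonTimestamp;

    await Task.Delay(TimeSpan.FromMilliseconds(250));
}
```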
For completeness, here are the performance characteristics from our testing with 4M messages (200 actors, 20k msgs each), on a single-machine dev environment with MongoDB in a Docker single-node replica set:
There's definitely a drop in performance, but I'd say it's better than skipping events. You can change the read concern via the connection string.
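For instance, the standard MongoDB URI option readConcernLevel carries this setting; the host and database names here are placeholders:

```csharp
// Hypothetical connection string - readConcernLevel is standard MongoDB URI
// syntax; host and database are placeholders for your own environment.
const string connectionString =
    "mongodb://localhost:27017/akka?readConcernLevel=linearizable";
```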
Test harness below - I've omitted the test actor and events; they're super simple and do nothing but persist an event per command:

```csharp
public async Task All_events_should_project_correctly()
{
    var actorCount = 200;
    var msgCount = 20_000;

    Log.Info($"TenantId {TestTenantId}");

    // Set up actors
    Func<int, WidgetId> getActorId = actorId =>
        WidgetId.With(Guid.Parse($"00000000-0000-0000-0000-{actorId:000000000000}"));
    Func<WidgetId, string> toDisplay = id =>
        $"widget_{id.GetGuid().ToString("n").TrimStart('0')}";

    var actors = Enumerable.Range(1, actorCount)
        .Select(i => Sys.ActorOf(Props.Create(() => new WidgetActor(getActorId(i), TestTenantId))))
        .ToArray();

    // Concurrently issue commands to all actors to cause concurrent writes
    _ = Source.UnfoldInfinite(0, msgId => (msgId + 1, msgId + 1))
        .Take(msgCount)
        .SelectAsync(actorCount, async msgId =>
        {
            return await Task.WhenAll(actors.Select((actor, i) =>
            {
                var cmd = new WidgetCommand(getActorId(i + 1),
                    BatchTestFailMode.None,
                    null);
                return actor.Ask<WidgetCommandResult>(cmd);
            }));
        })
        .SelectMany(results => results)
        .RunWith(Sink.Ignore<WidgetCommandResult>(), Sys.Materializer());

    // Subscribe using AllEvents; per-actor substreams verify sequence numbers are gap-free
    await ((Source<long, NotUsed>)MongoJournalQuery
        .AllEvents(Offset.NoOffset())
        .GroupBy(int.MaxValue, env => ((ICommittedEvent<WidgetEvent>)env.Event).Data.Id)
        .Scan((-1L, ImmutableArray<long>.Empty), (prev, currEnv) =>
        {
            var curr = (ICommittedEvent<WidgetEvent>)currEnv.Event;
            var (prevSeq, prevList) = prev;
            var currSeq = curr.Data.Sequence;
            if (currSeq != prevSeq + 1)
            {
                throw new Exception(
                    $"Received {currSeq} from {toDisplay(curr.Data.Id)}, " +
                    $"expecting {string.Join("; ", prevList.TakeLast(5))}; " +
                    $"[{prevSeq + 1}] ({currEnv.Offset.ToInt64()})");
            }
            return (prevSeq + 1, prevList.Add(currSeq));
        })
        .Select(t => 0L)
        .MergeSubstreams())
        .Take((msgCount + 1) * actorCount) // Scan produces 1 extra initial message per substream
        .IdleTimeout(TimeSpan.FromSeconds(300))
        .RunWith(Sink.Ignore<long>(), Sys.Materializer());
}
```
Also noted: @jaydeboer reported this once before, and it was suspected this would be fixed by writing in a transaction. I think if that transaction covered all the AtomicWrites in the batch, rather than just a single AtomicWrite, we could use a read concern with fewer guarantees. FYI @Aaronontheweb
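A minimal sketch of that idea, under stated assumptions - client, journal, and batch are ambient names invented here, and this is not the plugin's actual write path:

```csharp
// Hypothetical: one MongoDB transaction covering every AtomicWrite in the
// batch, so readers never observe a partially visible batch.
// Assumes: client (IMongoClient), journal (IMongoCollection<BsonDocument>),
// batch (IEnumerable<IEnumerable<BsonDocument>>) - one inner sequence per AtomicWrite.
using MongoDB.Bson;
using MongoDB.Driver;

using (var session = await client.StartSessionAsync())
{
    session.StartTransaction();
    try
    {
        foreach (var atomicWrite in batch)
            await journal.InsertManyAsync(session, atomicWrite);

        await session.CommitTransactionAsync(); // all-or-nothing visibility
    }
    catch
    {
        await session.AbortTransactionAsync();
        throw;
    }
}
```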
@ptjhuang Thanks for the tip. I have tried out the linearizable read concern.
@jaydeboer @ptjhuang is there something we can do in our defaults for the Akka.Persistence.MongoDb driver to address this internally?
Also, I need to take a look at #318 - my fault; been doing a lot of traveling since late May. Will be done traveling in two weeks.
@Aaronontheweb there isn't anything I have seen so far, but maybe @ptjhuang has better info than I do.
This leads me to suspect it's to do with the way AsyncWriteJournal is implemented, where calls to notify query subscribers are made as part of the write path. What are your thoughts on a sequencer?
That's old code that needs to be done away with - I hope we're not still using that to trigger subscriber updates. That should all be done entirely through Db polling.
We have an issue similar to this on the SQL query plugins as well - "missed reads" usually occur only when write volumes are higher and some transactions that haven't been fully committed are included in the current "page" accessed by the query. A sequencer would probably be a good idea - that's something that either tagged queries or AllEvents could use.
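One hypothetical shape for such a sequencer - a single counter document incremented atomically, so every event gets a dense ordering number and readers can detect holes; the collection and field names are assumptions:

```csharp
// Hypothetical gap-free sequencer sketch (not the plugin's implementation).
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

var client = new MongoClient("mongodb://localhost:27017");
var counters = client.GetDatabase("akka").GetCollection<BsonDocument>("Counters");

async Task<long> NextOrderingAsync()
{
    // $inc on a single document is atomic, so concurrent writers each get a
    // distinct, consecutive value.
    var updated = await counters.FindOneAndUpdateAsync(
        Builders<BsonDocument>.Filter.Eq("_id", "journal-ordering"),
        Builders<BsonDocument>.Update.Inc("Seq", 1L),
        new FindOneAndUpdateOptions<BsonDocument>
        {
            IsUpsert = true,
            ReturnDocument = ReturnDocument.After // return the incremented value
        });
    return updated["Seq"].AsInt64;
}

var ordering = await NextOrderingAsync(); // e.g. 1, 2, 3, ... with no gaps
```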
@ptjhuang @jaydeboer @Aaronontheweb
I'm not sure whether it's the new transaction code that let us capture these new failure modes, but it seems like the missing sequence numbers were caused by refused connections - connections are refused when the waiting queue for connections is saturated, and MongoDB's default behaviour for this waiting queue is to drop the newest connection attempt when the buffer is full. I'm not a MongoDB expert, so I don't know whether this waiting queue is implemented server side or client side, and I'm still unsure how to implement the fix. The correct fix would be a backpressure mechanism to prevent lost messages (see the sketch below); a retry mechanism would actually make the problem worse, because it would hammer the server with even more connection attempts.
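For illustration only - pendingWrites, journal, and the parallelism value are assumptions - bounding in-flight writes with an Akka.Streams stage is one way to get that backpressure: upstream demand stalls instead of the driver's wait queue overflowing.

```csharp
// Hypothetical backpressure sketch: cap concurrent Mongo writes so the
// connection pool's wait queue cannot saturate.
// Assumes: pendingWrites (IEnumerable<BsonDocument>),
//          journal (IMongoCollection<BsonDocument>), Sys (ActorSystem).
using Akka.Streams;
using Akka.Streams.Dsl;
using MongoDB.Bson;
using MongoDB.Driver;

const int maxInFlight = 16;

await Source.From(pendingWrites)
    .SelectAsync(maxInFlight, async doc =>
    {
        await journal.InsertOneAsync(doc); // at most maxInFlight of these run at once
        return doc;
    })
    .RunWith(Sink.Ignore<BsonDocument>(), Sys.Materializer());
```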
Wasn't @jaydeboer's issue missed reads of events that had already been successfully written, not the writes themselves? cc @Arkatufus
I'm not sure - were the messages actually written, @jaydeboer? If they were, then this is a new, separate issue.
@Arkatufus The issue I was having was that the events would be written to mongo, but if the query was running on the same node as the actor that journaled the event, some events would not be "seen" by the query. However, if the query was running on a different node, all events were seen. Does that help at all?
Yes, there's a big probability that this is fixed in the latest release, knock on wood.
I have confidence that this issue will be resolved by the next release, so I'm going to close this issue for now.
Some clarification on this issue and why I thought that the latest release would fix it: the problem we were having with the pub-sub mechanism was that if writes happened fast enough that two writes completed their write operations out of order, and the read operation was done in such a way that it disregarded data consistency ("local" or "available" read concern, maybe even higher than this), then it was possible for the event publisher to read the out-of-order write first, skipping a sequence number in the process.
We removed the pub-sub mechanism in the latest release. The event publisher no longer tries to read database changes immediately after they are written; instead, it relies on an internal timer to poll the database for new events.
That sounds like a fix to me! Thanks for all the help!
I hate to resurrect a closed issue, but the project I was using this on ended up being abandoned. I am now back on a new project and ran into this old friend again. It looks to still be an issue in version 1.5.12.1. I have added the transaction setting.
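If the setting in question is the plugin's write-transaction switch - an assumption; verify the key against your installed version's reference.conf - it would look something like this:

```hocon
akka.persistence.journal.mongodb {
  # Assumed key name; check your version's reference.conf before relying on it.
  use-write-transaction = on
}
```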
So that part blows my mind a bit - using transactions actually made this issue worse?
The transactions didn't make it worse; I got an error message from MongoDB.
Ah, transactions on read is what it was complaining about?
That is the error. It looks like it is on the read side.
Version Information
Version of Akka.NET?
Describe the bug
When there are a fair number of writes and reads happening at the same time, an event may be skipped. I have tested this against MongoDB running as a single-node cluster in Docker. I have reproduced the same behavior with the EventsByTag and AllEvents queries. It may take many thousands of events for one to be skipped.
To Reproduce
I have a sink setup as follows:
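The original snippet was not preserved here; what follows is an equivalent sketch of the kind of sink described, with the tag name and event handling as illustrative assumptions:

```csharp
// Hypothetical reconstruction of an EventsByTag sink (tag name is made up).
using System;
using Akka.Persistence.MongoDb.Query;
using Akka.Persistence.Query;
using Akka.Streams;
using Akka.Streams.Dsl;

var readJournal = PersistenceQuery.Get(Sys)
    .ReadJournalFor<MongoDbReadJournal>(MongoDbReadJournal.Identifier);

// Consume the live tagged-event stream; a skipped event shows up as a gap
// in the per-persistence-id sequence numbers observed here.
readJournal.EventsByTag("my-tag", Offset.NoOffset())
    .RunForeach(
        env => Console.WriteLine($"Seq {env.SequenceNr}: {env.Event}"),
        Sys.Materializer());
```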
Expected behavior
Each of the events would be sent to the subscriber.
Actual behavior
Some go missing.
Environment
.NET 7
Windows 11 and macOS both exhibit the same behavior.