Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

connection resume/recover #167

Open
wants to merge 9 commits into
base: main
Choose a base branch
from
13 changes: 7 additions & 6 deletions textile/features.textile
Original file line number Diff line number Diff line change
Expand Up @@ -505,13 +505,14 @@ h3(#realtime-connection). Connection
** @(RTN15g)@ Connection state is only maintained server-side for a brief period, given by the @connectionStateTtl@ in the @connectionDetails@, see "CD2f":#CD2f. If a client has been disconnected for longer than the @connectionStateTtl@, it should not attempt to resume. Instead, it should clear the local connection state, and any connection attempts should be made as for a fresh connection
*** @(RTN15g1)@ This check should be made before each connection attempt. It is generally not sufficient to merely clear the connection state when moving to @SUSPENDED@ state (though that may be done too), since the device may have been sleeping / suspended, in which case it may have been many hours since it was last actually connected, even though, having been in the @CONNECTED@ state when it was put to sleep, it has only moved out of that state very recently (after waking up and noticing it's no longer connected)
*** @(RTN15g2)@ Another consequence of that is that the measure of whether the client been disconnected for too long (for the purpose of this check) cannot just be whether the client left the @CONNECTED@ state more than @connectionStateTtl@ ago. Instead, it should be whether the difference between the current time and the last activity time is greater than the sum of the @connectionStateTtl@ and the @maxIdleInterval@, where the last activity time is the time of the last known actual sign of activity from Ably per "RTN23a":#RTN23a
*** @(RTN15g3)@ When a connection attempt succeeds after the connection state has been cleared in this way, channels that were previously @ATTACHED@, @ATTACHING@, or @SUSPENDED@ must be automatically reattached, just as if the connection was a resume attempt which failed per "RTN15c7":#RTN15c7
*** @(RTN15g3)@ When a connection attempt succeeds after the connection state has been cleared in this way, channels that were previously @ATTACHED@, @ATTACHING@, or @SUSPENDED@ must be automatically reattached, just as if the connection was a resume attempt which failed per "RTN15c8":#RTN15c8
** @(RTN15b)@ In order for a connection to be resumed and connection state to be recovered, the client must have received a @CONNECTED@ ProtocolMessage which will include a private connection key. To resume that connection, the library reconnects to the "websocket":https://ably.com/topic/websockets endpoint with an additional querystring param:
*** @(RTN15b1)@ @resume@ is the @ProtocolMessage#connectionKey@ from the most recent @CONNECTED@ @ProtocolMessage@ received
*** @(RTN15b2)@ This clause has been deleted. It was valid up to and including specification version @1.2@.
** @(RTN15c)@ The system's response to a resume request will be one of the following:
**** @(RTN15c6)@ A @CONNECTED@ @ProtocolMessage@ with the same @connectionId@ as the current client (and no @error@ property). This indicates that the resume attempt was valid. The client library should move all channels that were in the @ATTACHING@, @ATTACHED@, or @SUSPENDED@ states to the @ATTACHING@ state, and initiate an @RTL4c@ attach sequence for each. The connection should also process any messages queued per @RTL6c2@ (there is no need to wait for the attaches to finish before processing queued messages).
**** @(RTN15c7)@ @CONNECTED@ @ProtocolMessage@ with a new @connectionId@ and an @ErrorInfo@ in the @error@ field. In this case, the resume was invalid, and the error indicates the cause. The @error@ should be set as the @reason@ in the @CONNECTED@ event, and as the @Connection#errorReason@. The internal @msgSerial@ counter should be reset so that the first message published to Ably will contain a @msgSerial@ of @0@. The rest of the process is the same as for @RTN16c6@: The client library should move all channels that were in the @ATTACHING@, @ATTACHED@, or @SUSPENDED@ states to the @ATTACHING@ state, and initiate an @RTL4c@ attach sequence for each. The connection should also process any messages queued per @RTL6c2@.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if there was no resume attempt as a part of suspended retry, are we supposed to treat is as resume failed? That means we need to reset msgSerial in this case too 🤔

Copy link
Collaborator Author

@sacOO7 sacOO7 Sep 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we will get a new connectionId but there won't be an error in the errorField. I think this third case is coming for almost every other spec that mentions just a dual behavior for resume success/failure.

Copy link
Collaborator Author

@sacOO7 sacOO7 Sep 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think resume failure should be new connectionId != prev connectionId. We shouldn't have extra checks for errors as such. So, if there is no resume attempt (null key ), it will be treated as a resume failure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if there was no resume attempt as a part of suspended retry, are we supposed to treat is as resume failed? That means we need to reset msgSerial in this case too 🤔

RTN15g: "If a client has been disconnected for longer than the @connectionStateTtl@, it should not attempt to resume. Instead, it should clear the local connection state, and any connection attempts should be made as for a fresh connection" -- the msgSerial is part of the connection state, so it will have been reset in this case. (Perhaps the spec should explicitly list the properties that should be cleared, for clarity?)

(see also RTN15g1 which explains why RTN15g is strictly stronger than just doing this on going into SUSPENDED)

Also, we will get a new connectionId but there won't be an error in the errorField. I think this third case is coming for almost every other spec that mentions just a dual behavior for resume success/failure.

There's two cases. Either connection state (including connectionKey and msgSerial) is still there, in which case the library will try and do a resume, and the possible cases are listed in RTN15c. Or the connection state has been cleared, including the msgSerial, in which case the library is just doing a clean connection. The sdk never stops trying to resume but keeps its msgSerial.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

** @(RTN15c)@ The system's response to a resume request (with a non-empty connectionKey) will be one of the following:
**** @(RTN15c6)@ A @CONNECTED@ @ProtocolMessage@ with the same @connectionId@ as the current client (and no @error@ property). This indicates that the resume attempt was valid.
**** @(RTN15c7)@ @CONNECTED@ @ProtocolMessage@ with a new @connectionId@ and an @ErrorInfo@ in the @error@ field. In this case, the resume was invalid, and the error indicates the cause. The @error@ should be set as the @reason@ in the @CONNECTED@ event, and as the @Connection#errorReason@. The internal @msgSerial@ counter should be reset so that the first message published to Ably will contain a @msgSerial@ of @0@.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spec doesn't mention about value of channel Flag.resumed after channel attach @SimonWoolf ?

**** @(RTN15c8)@ Irrespective of success/failure for a resume request, the client library should move all channels that were in the @ATTACHING@, @ATTACHED@, or @SUSPENDED@ states to the @ATTACHING@ state, and initiate an @RTL4c@ attach sequence for each. The connection should also process any messages queued per @RTL6c2@ (there is no need to wait for the attaches to finish before processing queued messages).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this item is a child of a spec item saying "The system's response to a resume request (with a non-empty connectionKey) will be one of the following:". Every other child is one of the four things that can happen:

(RTN15c) The system’s response to a resume request will be one of the following:
↳ CONNECTED, same connId
↳ CONNECTED, new connId
↳ token ERROR
↳ other ERROR

Adding an item in the middle that's actually is a shared spec item for the first two of the items doesn't make sense

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, though the first 2 do exhibit a common behavior for attaching channels and processing pending messages.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can remove this if not needed. Though I am not sure if resume failure applies here too as per #167 (comment)

**** @(RTN15c5)@ @ERROR@ @ProtocolMessage@ indicating a failure to authenticate as a result of a token error (see "RTN15h":#RTN15h). The transport will be closed by the server. The spec described in "RTN15h":#RTN15h must be followed for a connection being resumed with a token error
**** @(RTN15c4)@ Any other @ERROR@ @ProtocolMessage@ indicating a fatal error in the connection. The server will close the transport immediately after. The client should transition to the @FAILED@ state triggering all attached channels to transition to the @FAILED@ state as well. Additionally the @Connection#errorReason@ will be set should be set with the error received from Ably
**** @(RTN15c1)@ This clause has been replaced by "@RTN15c6@":#RTN15c6. It was valid up to and including specification version @1.2@.
Expand All @@ -535,7 +536,7 @@ h3(#realtime-connection). Connection
** @(RTN16d)@ The library may wish to test that after a connection has been successfully recovered, the @Connection#id@ should be identical to the @id@ of the connection that was recovered, and @Connection#key@ should have been updated to the @ConnectionDetails#connectionKey@ provided in the @CONNECTED@ @ProtocolMessage@.
** @(RTN16m)@ @Connection#recoveryKey@ is a deprecated property that, when read, contains the same value that would be returned by calling @Connection#createRecoveryKey@. If the implementation languague allows, it should be implemented in a way that does not require that it be recalculated on every message. If this is not possible, it should be recalculated on every incoming message.
*** @(RTN16m1)@ @Connection#recoveryKey@ must be removed on the next major release of the SDK.
** @(RTN16l)@ Recovery failures should be handled identically to resume failures, per "RTN15c7":#RTN15c7, "RTN15c5":#RTN15c5, and "RTN15c4":#RTN15c4.
** @(RTN16l)@ Recovery success/failures should be handled identically to resume, per "RTN15c8":#RTN15c8, "RTN15c5":#RTN15c5, and "RTN15c4":#RTN15c4.
** @(RTN16a)@ This clause has been replaced by "@RTN16i@":#RTN16i. It was valid up to and including specification version @1.2@.
** @(RTN16b)@ This clause has been replaced by "@RTN16g@":#RTN16g and "@RTN16m@":#RTN16m. It was valid up to and including specification version @1.2@.
** @(RTN16c)@ This clause has been replaced by "@RTN16g2@":#RTN16g2. It was valid up to and including specification version @1.2@.
Expand All @@ -554,7 +555,7 @@ h3(#realtime-connection). Connection
* @(RTN19)@ Transport state side effects - when a transport is disconnected for any reason:
** @(RTN19a)@ Any @ProtocolMessage@ that is awaiting an @ACK@/@NACK@ on the old transport will not receive the @ACK@/@NACK@ on the new transport. The client library must therefore resend any @ProtocolMessage@ that is awaiting a @ACK@/@NACK@ to Ably in order to receive the expected @ACK@/@NACK@ for that message (subject to @RTN7c@/@RTN7d@). The Ably service is responsible for keeping track of messages, ignoring duplicates and responding with suitable @ACK@/@NACK@ messages
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imagine, connection was in suspendedState, after several retries, it's connected to the server ( with new transport )
As per RTN8c and RTN9c we are clearing connectionKey and connectionId. This means a new connection attempt is not a resume attempt. So, there's neither resume failure nor resume success as a part of a new connection.

In this case, what should happen to ack/nack messages? Are we supposed to treat it as a resume failure and send ack/nack messages with incremental message-serial ? Same we do it for channel attach in case of RTN15g3

Copy link
Collaborator Author

@sacOO7 sacOO7 Sep 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it feels we are attaching channels (in attaching, attached or suspended state ) for almost every type of reconnection 🤔

Copy link
Collaborator Author

@sacOO7 sacOO7 Sep 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update - I checked the code, seems we clear ack/nack messages on suspended state, so we don't really have to worry about sending them. But can there be a case where a re-connection attempt is not a resume attempt in disconnected state?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, what should happen to ack/nack messages?

they'll have been cleared already per RTN7c

can there be a case where a re-connection attempt is not a resume attempt in disconnected state?

yes: before you're connected for the first time

Copy link
Collaborator Author

@sacOO7 sacOO7 Sep 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Statement says The Ably service is responsible for keeping track of messages, ignoring duplicate. Since, messages published via realtimechannel doesn't have ID ( assigned by server ) when ack/nacks are republished, we receive duplicate messages for the same. This means, as of now, any realtimechannel that doesn't receive ack/nack on same transport will publish duplicate messages on new transport : (

Copy link
Collaborator Author

@sacOO7 sacOO7 Sep 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update -

  1. When resume is successful, resent ack/nack doesn't have duplicates -> https://github.com/ably/ably-dotnet/blob/201feb14f3934b3a1c6778dcddbf14deb9c17027/src/IO.Ably.Tests.Shared/Realtime/ConnectionSandboxTransportSideEffectsSpecs.cs#L199

  2. When resume is not successful, resent ack/nack have duplicates ->
    https://github.com/ably/ably-dotnet/blob/201feb14f3934b3a1c6778dcddbf14deb9c17027/src/IO.Ably.Tests.Shared/Realtime/ConnectionSandboxTransportSideEffectsSpecs.cs#L190C19-L190C19

Not sure how server actually checks for duplicates : (
My only guess is by checking message id's which are not present in both cases. Though for resume success, we have same connection id.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update - for the second case, since we are adding ack/nack to pending messages, is it a desired behavior?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this means, as of now, any realtimechannel that doesn't receive ack/nack on same transport will publish duplicate messages on new transport

This is what RTN19a2 is for, which requires the re-send attempt (for a successful resume) to use the same msgSerial.

The message id for a realtime connection is implicitly connectionId:msgSerial. If your resume is successful, and you resend with the same msgId, you will be correctly deduplicated.

In the case of a resume failure, you're out of luck, so yes you can get duplicates.

*** @(RTN19a1)@ One possible implementation of this requirement would be to add any in-flight messages to the @RTL6c2@ connection-wide queue of messages that will be sent once the connection next becomes @CONNECTED@
*** @(RTN19a2)@ In the case of an @RTN15c6@ successful resume, the @msgSerial@ of the reattempted @ProtocolMessage@s should remain the same as for the original attempt. In the case of an @RTN15c7@ failed resume, the message must be assigned a new @msgSerial@ from the SDK's internal counter.
*** @(RTN19a2)@ In the case of an @RTN15c6@ successful resume, the @msgSerial@ of the reattempted @ProtocolMessage@s should remain the same as for the original attempt. In the case of an @RTN15c7@ failed resume, the message must be assigned a new @msgSerial@ from the SDK's internal counter ( need clarification).
sacOO7 marked this conversation as resolved.
Show resolved Hide resolved
** @(RTN19b)@ If there are any pending channels i.e. in the @ATTACHING@ or @DETACHING@ state, the respective @ATTACH@ or @DETACH@ message should be resent to Ably
* @(RTN23)@ Heartbeats
** @(RTN23a)@ If a transport does not receive any indication of activity on a transport for a period greater than the sum of the @maxIdleInterval@ (which will be sent in the @connectionDetails@ of the most recent @CONNECTED@ message received on that transport) and the @realtimeRequestTimeout@, that transport should be disconnected. Any message (or non-message indicator, see @RTN23b@) received counts as an indication of activity and should reset the timer, not merely heartbeat messages. However, it must be received (that is, sent from the server to the client); client-sent data does not count.
Expand Down
Loading