
dev ARQ (retransmit) #185
Draft · wants to merge 71 commits into main
Conversation

@olliw42 (Owner) commented May 24, 2024

this is a dev/test branch for working on retransmission

this one is simplified in two ways, to keep it from getting too complicated and to make testing easier:

  • retransmission handling only for the receiver->transmitter link direction. That's also the more complicated direction (it entangles with flow control). Once this is debugged, adding the tx->rx direction should be easy.
  • lossless transmission, i.e., a frame is sent for as long as it is not acked. For low LQ this is bad; eventually we should go to a scheme where only one retransmit attempt is done, but this way it's easier to get going. (A rough sketch of this simplified scheme follows below.)
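
roughly, the simplified rx->tx scheme looks like this (just a sketch with placeholder names, not the exact code):

// receiver side, called once per frame slot (sketch only, helpers are placeholders)
uint8_t tx_seq = 0;     // seq no of the payload currently in flight
bool    acked  = true;  // was the last payload acknowledged?

void rx_link_send_slot(void)
{
    if (acked) {
        fetch_new_payload_from_fifo();  // placeholder for the real payload handling
        tx_seq = (tx_seq + 1) & 0x07;   // small wrap-around seq counter
        acked = false;
    }
    // else: keep payload and seq unchanged => retransmit ("lossless")
    send_frame(tx_seq /*, payload */);
}

void rx_link_handle_response(bool frame_was_acked)
{
    if (frame_was_acked) acked = true;  // next slot may carry new data
}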

@brad112358 this might interest you, given you showed interest in this before. I would actually much appreciate you checking it out; your strength in finding the little loopholes/bugs would be useful :)

anyone else is of course also massively welcome !!!!!

@olliw42 (Owner, Author) commented May 24, 2024

a pic I made to help me
[diagram: mlrs-arq]

@brad112358 (Contributor) commented May 26, 2024

Thanks for including the very helpful and, I think, clear diagram. So far I've only looked at that, but I already have a few comments/observations/suggestions.

  1. You seem to be sending back only ack or nack and not the received sequence number. I believe this is not optimal as we observe below.

  2. Sequence number 5 shows sub-optimal behavior which is a direct result of 1). In the 8th response, we send a nack with no indication of which sequence number was lost, even though we have previously received and accepted the frame which was lost.

  3. Instead of responding with just ack or nack, it would be better to send back a received sequence number and to always just acknowledge the last successfully received data. In other words, it doesn't matter much that we haven't received the most recent message; what matters most is which message we did last successfully receive. This change would have eliminated an unnecessary retransmission for sequence number 5.

  4. The diagram doesn't specify the size of the sequence number field, but it seems to be at least 3 bits. I think the sequence numbers don't need to be so large; in fact, I think they can be 1 bit. This is because we have alternating transmission direction with no chance of out-of-order reception. This reduces the overhead for the sequence number and has the added advantage that responding with the last received sequence number rather than ack or nack still only requires 1 bit (see the sketch at the end of this comment).

  5. What I describe above is actually the retransmission method ELRS uses (last time I checked) and I think it is probably optimal in terms of overhead. They can only use it in one direction because they don't send an equal number of frames in both directions, but we can use this method in both directions.

  6. There might be some small utility in using sequence numbers larger than 1 bit, or in also sending the ack/nack flag, in that it would allow each side to more accurately estimate the loss in both directions.

  7. I agree we should limit the number of retransmissions. Even with 1 bit sequence numbers, I think we can limit the number of retransmissions to any value we like as long as both sides agree what the limit is (both sides would need a counter for each direction).

Does this make sense to you? Am I missing something?
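
To make 4) and 7) concrete, here is roughly what I have in mind (just an illustration; names and framing are made up, not existing mLRS code):

// 1-bit alternating seq; the response simply echoes the seq bit of the last good frame
uint8_t my_seq      = 0;  // toggles with every *new* payload we send
uint8_t last_rx_seq = 0;  // seq bit of the last frame we received correctly
uint8_t retry_cnt   = 0;
#define RETRY_MAX 1       // retransmission limit, both sides must agree on it

void send_slot(uint8_t echoed_seq)  // echoed_seq = seq bit the other side last received
{
    if (echoed_seq == my_seq || retry_cnt >= RETRY_MAX) {
        load_next_payload();          // acked (or we give up): move on
        my_seq ^= 1;                  // toggle the single seq bit
        retry_cnt = 0;
    } else {
        retry_cnt++;                  // not acked yet: keep the payload and resend it
    }
    send_frame(my_seq, last_rx_seq);  // our seq bit + echo of what we last received
    // (the other side needs a matching counter so both ends agree when a frame is given up on)
}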

@olliw42 (Owner, Author) commented May 26, 2024

many thx for the comments

concerning all points related to the seq not being 1 bit: YES, 1 bit is fully sufficient for a basic mechanism. Eventually we may want to change it. However, for purely historical reasons the seq number just happens to be 3 bits, and at this point I see no reason to change that. We know we could/will. Not a relevant point IMHO :)

there is btw always a situation which is not optimal, see e.g. seq 3

the protocol is just the standard protocol as described on any web page (there are various names for it, so no name here) (all 1 bit). There are two versions: those which send an ack, and those which send the next desired number. The main challenge is handling the various non-protocol-related states, like disconnects, frames which carry commands rather than serial data, etc. Hence these states, and I figured that the send-ack version makes handling these states easier.

it's possible that with > 1 bit one may avoid a few edge cases

the problem with 5 is probably solved by using the other method, i.e. sending the next desired number. Hm. Maybe I should convert to that.
EDIT: yes, I think that's what we should go to ...
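
for illustration, the difference on the responding side is roughly this (sketch only, names are made up):

typedef struct { uint8_t ack, last_seq, next_seq; } arq_response_t;  // hypothetical

arq_response_t make_response(uint8_t rx_seq, uint8_t expected_seq, uint8_t last_good_seq)
{
    arq_response_t r;
    r.ack      = (rx_seq == expected_seq);    // flavor A: plain ack/nack
    r.last_seq = last_good_seq;               // flavor B: echo the last received seq
    r.next_seq = (last_good_seq + 1) & 0x07;  // flavor C: the next desired seq
    return r;  // a real frame would carry only one of these fields
}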

@brad112358 (Contributor)

Can you elaborate a bit on the cases in mLRS where the ack/nack is better than sending the sequence number of the last received frame? (You are faster at edits than I am at replies.) If I understand your edit correctly, that was my point. Is there any practical difference between sending the next desired number vs the last received number?

@brad112358 (Contributor) commented May 26, 2024

Also, 3 is only sub-optimal if you allow sending ahead of acknowledged frames (whether by sequence number or ack), which requires more buffering and can result in out-of-order delivery.

I think such types of protocols are out of the question

@olliw42 (Owner, Author) commented May 26, 2024

sorry, I edited your post ... grrrr, this damned github, not the first time I've stepped into this trap

@olliw42 (Owner, Author) commented May 26, 2024

Is there any practical difference between sending the next desired number vs the last received number?

the bookkeeping and state handling looked easier to me

@brad112358 (Contributor)

the bookkeeping and state handling looked easier to me

To me, the opposite. Sending back the last received sequence number means you just check whether the acknowledged sequence number matches what you last sent; if so, move on, if not, retransmit. The other way seems to require you to add when you respond and subtract when you compare. But I suppose it's mostly a matter of how you think about it. Either way, there is not much state involved except for deciding when to give up on retransmission of a given frame.

Many of the algorithms found online and in textbooks are intended for more complex systems which don't just ping-pong messages at a constant interval like we do, so a very simple method can be optimal for us if we rule out buffering more than the most recent transmission, as you have.
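
In code, the difference I mean is roughly this (sketch with made-up names):

// variant "echo last received seq": a direct comparison
void handle_response_last_received(uint8_t resp_seq, uint8_t seq_last_sent)
{
    if (resp_seq == seq_last_sent) load_next_payload();   // acked: move on
    else                           retransmit_payload();  // not acked: resend
}

// variant "send next desired seq": the same check, but with the +1 offset
void handle_response_next_desired(uint8_t resp_seq, uint8_t seq_last_sent)
{
    if (resp_seq == ((seq_last_sent + 1) & 0x07)) load_next_payload();
    else                                          retransmit_payload();
}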

@olliw42 (Owner, Author) commented May 26, 2024

I was trying this initially, but then settled on the ack. I will retry; it could be that at the early stages I also had too much of how to abstract the code in mind, and I hadn't sorted out the states yet. Anyway, it has benefits, so that's what it's going to be :)

@olliw42 (Owner, Author) commented May 27, 2024

@brad112358
so, I changed it now to send the last received seq no as the ack, instead of ack/nack-ing reception

seemingly it works, in that it connects etc., but the symptom described by @jlpoltrack also exists, i.e. MP shows lost packets ... so it appears something is still not working as expected ...

here is the time plan for the changed protocol
(note: the ack is only 1 bit, the seq is 3 bits)

[diagram: grafik]
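
just for illustration, the per-frame overhead could be packed into a single status byte along these lines (the exact layout here is arbitrary, only meant to show that 3 bits of seq plus 1 bit of ack fit comfortably):

#define ARQ_SEQ_MASK  0x07  // bits 0..2: 3-bit seq of this frame's payload
#define ARQ_ACK_SHIFT 3     // bit 3: 1-bit ack field

uint8_t arq_pack(uint8_t seq, uint8_t ack) { return (seq & ARQ_SEQ_MASK) | ((ack & 0x01) << ARQ_ACK_SHIFT); }
uint8_t arq_seq(uint8_t status)            { return status & ARQ_SEQ_MASK; }
uint8_t arq_ack(uint8_t status)            { return (status >> ARQ_ACK_SHIFT) & 0x01; }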

@olliw42 (Owner, Author) commented May 28, 2024

with

#define USE_ARQ_TX_SIM_MISS 9 //9
#define USE_ARQ_RX_SIM_MISS 5 //5

I do see continuous packet losses in MP; MP reports a pretty stable 95-96% link quality, so around every 20th packet is lost

the mLRS LQ metric on the OLED display shows something around 75% ... not sure if that means the mechanism is helping

I do my tests btw in 19 Hz mode (with a 2.4 GHz system)
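
(just for illustration, a sim-miss of this kind can be as simple as a counter that drops every Nth frame, e.g.:)

// illustration only: one simple way to drop every Nth received frame before the ARQ logic sees it
bool sim_miss_rx(void)
{
    static uint8_t cnt = 0;
    if (++cnt >= USE_ARQ_RX_SIM_MISS) { cnt = 0; return true; }  // pretend this frame was lost
    return false;
}
// e.g. in the receive path:  if (sim_miss_rx()) return;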

@brad112358 (Contributor)

And you didn't add a retransmission limit yet? So something must be wrong if MP is correct. When I get some time, I'll try to reproduce the problem with QGC.

@brad112358 (Contributor)

Do you use Bluetooth, UDP or TCP WiFi, or a wired serial connection for the GCS?

@olliw42 (Owner, Author) commented May 28, 2024

And you didn't add a retransmission limit yet?

yes, no retransmission limit

So something must be wrong if MP is correct.

yes :)

Do you use Bluetooth or UDP or TCP WiFi or wired serial for the GCS connection?

wired serial connection, from tx serial via usb-ttl to PC

one potential source of the problem which I have not yet ruled out is that the stream flow control isn't good enough, so that AP sends too many messages and some are dropped every once in a while ... I'm using 19 Hz, so there is some restriction.
It's also curious that the ~5% loss is close to 1/19 ... though I can't see how that could be correlated.
Some more tests should clear up some of the speculation.

@brad112358 (Contributor)

Looks like that fixed it. QGC is now reporting 0 lost messages

@olliw42 (Owner, Author) commented May 28, 2024

Looks like that fixed it. QGC is now reporting 0 lost messages

YES :) @jlpoltrack made the relevant comment

not sure if you also follow the discussion on Discord

@brad112358 (Contributor) commented May 28, 2024

Well, it ran well for a while, and then it started dropping a lot of messages, and it seemed to get worse over time. I've power cycled both ends of the link (one at a time) and restarted the GCS, but it hasn't recovered. I'm not sure what happened.

@brad112358 (Contributor)

I had the baud rate too high for the crap R9M inverter with its weak pull-up. It's working fine at 115200 serial speed on the Tx.
