
[BUG] Robus infinite collision condition still exists #483

@nicolas-rabault

Version(s) in which the bug was detected

Luos engine 3.1.0 and all earlier versions

Description of the bug

Because Robus is a multi-master protocol, messages can collide on the network. After a collision, Robus must retry sending the message and take steps to avoid colliding again. However, there still seems to be one condition in which collision avoidance fails.

Context and environment

A few explanations about the basic protocol timeout

On Robus, a timeout is used to avoid transmitting while a reception is in progress, which would cause a collision. The idea is to lock transmission as soon as we receive something and unlock it after a timeout. More information about the timeout is available on the related documentation page.
To manage this, Robus resets a timer to a specific value each time a byte is received, so that after a period of inactivity on the bus, all nodes can send messages again.
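A minimal sketch of this reception lock in C, assuming a free-running tick counter. The names `tx_lock_t`, `on_byte_received()` and `can_transmit()` and the value of `RX_TIMEOUT_TICKS` are hypothetical and are not taken from the Luos engine sources; the real timeout value depends on the bus configuration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical value: the real Robus timeout is not given in this issue. */
#define RX_TIMEOUT_TICKS 10u

typedef struct {
    uint32_t deadline; /* tick at which transmission is unlocked again */
} tx_lock_t;

/* Called on every received byte: re-arm the lock from the current tick. */
static void on_byte_received(tx_lock_t *lock, uint32_t now)
{
    lock->deadline = now + RX_TIMEOUT_TICKS;
}

/* Transmission is allowed only once the bus has been idle for the timeout.
 * The signed difference keeps the comparison correct across tick wraparound. */
static bool can_transmit(const tx_lock_t *lock, uint32_t now)
{
    return (int32_t)(now - lock->deadline) >= 0;
}
```

For example, with this sketch, bytes received at ticks 100 and 105 keep the node silent until tick 115.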

Timeout used for collision avoidance

Sometimes two nodes try to send messages at the same time. In this situation, the timeout cannot help and a collision still occurs on the network. The collision is detected, and Robus retries sending the message after a timeout period that depends on its node ID, to avoid colliding with the same node again:
[Image: retry timing after a collision, with a delay that depends on the node ID]
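As an illustration, the retry delay could be computed as below. This linear formula and the `RX_TIMEOUT_TICKS` and `COLLISION_SLOT_TICKS` values are assumptions made for this sketch; the issue only states that the delay depends on the node ID.

```c
#include <stdint.h>

/* Hypothetical values, chosen only for illustration. */
#define RX_TIMEOUT_TICKS     10u /* normal inactivity timeout */
#define COLLISION_SLOT_TICKS 40u /* extra delay per node ID */

/* After a collision, each node waits a delay that grows with its node ID,
 * so two colliding nodes with different IDs do not retry at the same tick. */
static uint32_t collision_retry_delay(uint16_t node_id)
{
    return RX_TIMEOUT_TICKS + (uint32_t)node_id * COLLISION_SLOT_TICKS;
}
```

With these values, node 1 retries 50 ticks after the collision and node 2 after 90 ticks.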
The problem is that the collision avoidance timer is the same timer used for normal reception, so in practice node 2's collision avoidance timeout is overwritten when it receives node 1's transmission:
[Image: node 2's collision avoidance timeout is reset by the reception of node 1's retry]
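The sketch below reproduces this overwrite with a single shared deadline. The names, tick values and the linear back-off formula are hypothetical. Node 2 arms its node-ID back-off at tick 100 (deadline 190), then receives node 1's short retry, whose last byte arrives at tick 160. The unconditional reset moves node 2's deadline earlier, to tick 170, which is exactly the deadline of any other node that only listened to node 1, such as a fourth node.

```c
#include <stdint.h>

/* Hypothetical values, chosen only for illustration. */
#define RX_TIMEOUT_TICKS     10u
#define COLLISION_SLOT_TICKS 40u

/* One timer shared by normal reception and collision avoidance. */
typedef struct {
    uint32_t deadline; /* tick at which transmission is unlocked again */
} shared_timer_t;

/* Collision detected: arm the node-ID dependent back-off. */
static void arm_collision_backoff(shared_timer_t *t, uint32_t now, uint16_t node_id)
{
    t->deadline = now + RX_TIMEOUT_TICKS + (uint32_t)node_id * COLLISION_SLOT_TICKS;
}

/* Byte received: unconditionally re-arm the normal timeout.
 * This can shorten a pending back-off, which is the problem described above. */
static void on_byte_received_naive(shared_timer_t *t, uint32_t now)
{
    t->deadline = now + RX_TIMEOUT_TICKS;
}
```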
This leads to a case where collision avoidance can fail:
[Image: collision avoidance failure involving four nodes]
Here, three nodes collide, and then a fourth node collides with node 1's retry. This results in a collision loop.

How to reproduce the bug

@houkhouk has only observed it under one specific condition in years, so it is almost impossible to reproduce on purpose.

Possible solution

To avoid this, the timeout timer could give priority to the latest timeout: if a normal timeout would expire before a pending collision avoidance timeout, the timer should not be reset.
In other words, this timer should always keep the latest possible expiration:
[Image: the timer keeps the latest of the pending timeouts]
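A minimal sketch of this proposal, assuming a single shared deadline; the names, tick values and the linear back-off formula are hypothetical and are not from the Luos engine. Every request may only push the deadline later, never earlier, so a pending collision avoidance timeout survives a short reception.

```c
#include <stdint.h>

/* Hypothetical values, chosen only for illustration. */
#define RX_TIMEOUT_TICKS     10u
#define COLLISION_SLOT_TICKS 40u

typedef struct {
    uint32_t deadline; /* tick at which transmission is unlocked again */
} shared_timer_t;

/* Move the deadline only if the new one is later.
 * The signed difference keeps the comparison correct across tick wraparound. */
static void timer_keep_latest(shared_timer_t *t, uint32_t new_deadline)
{
    if ((int32_t)(new_deadline - t->deadline) > 0) {
        t->deadline = new_deadline;
    }
}

/* Collision detected: request the node-ID dependent back-off. */
static void arm_collision_backoff(shared_timer_t *t, uint32_t now, uint16_t node_id)
{
    timer_keep_latest(t, now + RX_TIMEOUT_TICKS + (uint32_t)node_id * COLLISION_SLOT_TICKS);
}

/* Byte received: request the normal timeout without shortening a longer one. */
static void on_byte_received(shared_timer_t *t, uint32_t now)
{
    timer_keep_latest(t, now + RX_TIMEOUT_TICKS);
}
```

With these values, node 2 arms its back-off at tick 100 (deadline 190) and then receives a short message ending at tick 160: its deadline stays at 190, while a node that only received that message unlocks at tick 170. Bytes received later, for example at tick 185, still extend the deadline normally (to 195).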
