
Conversation

@joostjager (Collaborator) commented Nov 21, 2022

Failure attribution is important to properly penalize nodes after a payment failure occurs. The goal of the penalty is to give the next attempt a better chance at succeeding. In the happy failure flow, the sender is able to determine the origin of the failure and penalizes a single node or pair of nodes.

Unfortunately, it is possible for nodes on the route to hide themselves: if they return random data as the failure message, the sender won't know where the failure happened.

This PR proposes a new failure message format that lets each node commit to the failure message. If one of the nodes corrupts the failure message, the sender will be able to identify that node.

For more information, see https://lists.linuxfoundation.org/pipermail/lightning-dev/2022-October/003723.html and https://delvingbitcoin.org/t/latency-and-privacy-in-lightning/1723.

Furthermore, the htlc fail and fulfill flows are extended to convey self-reported htlc hold times to the sender.

Fail flow implementations

LND implementation: lightningnetwork/lnd#9888

LDK implementation: lightningdevkit/rust-lightning#2256

Eclair implementation: ACINQ/eclair#3065

CLN implementation: ElementsProject/lightning#8291

Fulfill flow implementations

LDK implementation: lightningdevkit/rust-lightning#3801

Eclair implementation: ACINQ/eclair#3100

@thomash-acinq

I've started implementing it in eclair; do you have some test vectors so we can check that we are compatible?
The design seems good to me, but as I've said previously, I think keeping hop payloads and hmacs for 8 nodes only (instead of 27) is enough for almost all use cases and would give us huge size savings.
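For a rough sense of scale, a back-of-the-envelope sketch, assuming one 32-byte hmac per (hop, downstream-distance) pair in the triangular layout used in the snippet further down (an assumption; the final format may truncate or pack differently):

```go
// hmacBytes returns the size of the hmac portion of the failure message for
// a given maximum hop count, assuming a triangular layout of 32-byte hmacs:
// the first hop covers maxHops positions, the next maxHops-1, and so on.
func hmacBytes(maxHops int) int {
	return maxHops * (maxHops + 1) / 2 * 32
}
```

Under those assumptions, 27 hops cost 27*28/2*32 = 12096 bytes of hmacs, while 8 hops cost 8*9/2*32 = 1152 bytes.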

@joostjager (Collaborator Author) commented Dec 6, 2022

I don't have test vectors yet, but I can produce them. Will add them to this PR when ready.

Capping the max hops at a lower number is fine by me, but do you have a scenario in mind where this would really make a difference? Or is it more generally that everything above 8 is wasteful?

@joostjager force-pushed the fat-errors branch 2 times, most recently from 4b48481 to 24b10d5 on December 6, 2022
@joostjager (Collaborator Author)

@thomash-acinq added a happy fat error test vector.

09-features.md (outdated)
| 20/21 | `option_anchor_outputs` | Anchor outputs | IN | `option_static_remotekey` | [BOLT #3](03-transactions.md) |
| 22/23 | `option_anchors_zero_fee_htlc_tx` | Anchor commitment type with zero fee HTLC transactions | IN | `option_static_remotekey` | [BOLT #3][bolt03-htlc-tx], [lightning-dev][ml-sighash-single-harmful]|
| 26/27 | `option_shutdown_anysegwit` | Future segwit versions allowed in `shutdown` | IN | | [BOLT #2][bolt02-shutdown] |
| 28/29 | `option_fat_error` | Can generate/relay fat errors in `update_fail_htlc` | IN | | [BOLT #4][bolt04-fat-errors] |
@joostjager (Collaborator Author)

I think this big gap in the bits has emerged because of tentative spec changes that may or may not make it. Not sure why that is necessary; I thought the custom range is supposed to be used for unofficial extensions?

I can see that with unofficial features deployed in the wild, it is easier to keep the same bit when something becomes official. But I'm not sure that is worth creating the gap here. An alternative is to deploy unofficial features in the custom range first, and then later recognize both the official and unofficial bits. Slightly more code, but this feature list remains clean.

@joostjager (Collaborator Author)

Added fat error signaling to the PR.

@thomash-acinq

I've spent a lot of time trying to make the test vector pass, and I've finally found what was wrong. In the spec you write that the hmac covers:

  • failure_len, failuremsg, pad_len and pad.

  • The first y+1 payloads in payloads. For example, hmac_0_2 would cover all three payloads.

  • y downstream hmacs that correspond to downstream node positions relative to x. For example, hmac_0_2 would cover hmac_1_1 and hmac_2_0.

implying that we need to concatenate them in that order. But in your code you follow a different order:

```go
// Include payloads including our own.
_, _ = hash.Write(payloads[:(NumMaxHops-position)*payloadLen])

// Include downstream hmacs.
var hmacsIdx = position + NumMaxHops
for j := 0; j < NumMaxHops-position-1; j++ {
	_, _ = hash.Write(
		hmacs[hmacsIdx*sha256.Size : (hmacsIdx+1)*sha256.Size],
	)

	hmacsIdx += NumMaxHops - j - 1
}

// Include message.
_, _ = hash.Write(message)
```

I think the order message + hop payloads + hmacs is more intuitive as it matches the order of the fields in the packet.
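For comparison, a minimal sketch of the computation in that suggested order (message, then hop payloads, then downstream hmacs). The constants and the triangular indexing follow the snippet above; everything else is illustrative and not taken from any implementation:

```go
import (
	"crypto/hmac"
	"crypto/sha256"
)

// NumMaxHops mirrors the 27-hop maximum discussed earlier in this thread.
const NumMaxHops = 27

// hmacForPosition computes the hmac for the node at the given position in
// the suggested field order: message first, then hop payloads, then
// downstream hmacs.
func hmacForPosition(key, message, payloads, hmacs []byte,
	payloadLen, position int) []byte {

	mac := hmac.New(sha256.New, key)

	// Failure message first, matching the packet layout.
	mac.Write(message)

	// Our own payload plus all downstream payloads.
	mac.Write(payloads[:(NumMaxHops-position)*payloadLen])

	// Downstream hmacs, walking the same triangular layout as above.
	hmacsIdx := position + NumMaxHops
	for j := 0; j < NumMaxHops-position-1; j++ {
		mac.Write(hmacs[hmacsIdx*sha256.Size : (hmacsIdx+1)*sha256.Size])
		hmacsIdx += NumMaxHops - j - 1
	}

	return mac.Sum(nil)
}
```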

@joostjager (Collaborator Author)

Oh great catch! Will produce a new vector.

@joostjager (Collaborator Author)

@thomash-acinq updated vector

@joostjager (Collaborator Author)

Updated LND implementation with sender-picked fat error structure parameters: lightningnetwork/lnd#7139

@brh28 left a comment

I didn't see the start time and end time of the htlc_hold_time defined. I assume these are upon receiving the update_add_htlc and before sending the update_fulfill_htlc/update_fail_htlc, respectively?

Overall, I like the idea of htlc_hold_time but suspect it'll be gamed by routing nodes.

@Roasbeef (Collaborator) commented Jun 6, 2025

Overall, I like the idea of htlc_hold_time but suspect it'll be gamed by routing nodes.

A motivated sender can always pinpoint the actual forwarding delays by bisecting the route, and they always know how long the entire route took. You are correct, though, that it's intended mainly to work on a best-effort basis; the overall assumption is that this can be used to allow a "fast" node to distinguish itself and be rewarded for that.

thomash-acinq added a commit to ACINQ/eclair that referenced this pull request Jun 11, 2025
Attribution data is added to both failed and fulfilled HTLCs
lightning/bolts#1044
t-bast added a commit to t-bast/bolts that referenced this pull request Jun 17, 2025
In lightning#1044, we introduced a 12-block delay before considering a channel closed when we see a spending transaction confirmed on-chain. This ensures that if the channel was spliced instead of closed, the channel participants are able to broadcast a new `channel_announcement` to the rest of the network. If this new `channel_announcement` is received by remote nodes before the 12-block delay expires, the channel can keep its history in path-finding scoring/reputation, which is important.

We then realized that 12 blocks probably wasn't enough to allow this to happen: some implementations default to 8 confirmations before sending `splice_locked`, and allow node operators to increase this value. We thus bump this delay to 72 blocks to allow more time before channels are removed from the local graph.
@joostjager (Collaborator Author)

Overall, I like the idea of htlc_hold_time but suspect it'll be gamed by routing nodes.

As @Roasbeef says, the sender in any case knows how long the entire route took, so the total penalty to apply is known. Without additional information, the sender might apply that penalty equally across all nodes.

The only thing the reported hold time does is shift the penalty to better match reality. If a node reports zero, that doesn't mean it escapes the penalty. It just means that the sender concludes that, if that zero is true, the outgoing side of that node is fast, and all of the to-be-distributed penalty for that node needs to be directed to its incoming side.

There is some gaming potential in there, but I think it is limited, and it is also in a routing node's interest to report realistically.
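As a sketch of that idea (hypothetical helper, not from the spec or any implementation): the sender-measured total is authoritative, and the reports only decide how it is divided.

```go
// distributePenalty splits the sender-measured total delay across hops in
// proportion to their self-reported local hold times. A node reporting zero
// shifts its share of the penalty rather than removing it: the measured
// total still lands somewhere on the route.
func distributePenalty(totalMs uint32, reportedMs []uint32) []uint32 {
	penalties := make([]uint32, len(reportedMs))
	if len(reportedMs) == 0 {
		return penalties
	}

	var reportedSum uint64
	for _, r := range reportedMs {
		reportedSum += uint64(r)
	}

	if reportedSum == 0 {
		// No usable reports: spread the penalty equally.
		for i := range penalties {
			penalties[i] = totalMs / uint32(len(penalties))
		}
		return penalties
	}

	for i, r := range reportedMs {
		penalties[i] = uint32(uint64(totalMs) * uint64(r) / reportedSum)
	}
	return penalties
}
```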

@Roasbeef (Collaborator)

Ok, looks like this is ready for final review now.

@thomash-acinq

ACK

@joostjager joostjager requested a review from carlaKC July 28, 2025 20:07
@carlaKC (Contributor) left a comment

I haven't verified the test vectors.

Otherwise, some small comments, then this is ready to go IMO.

The `update_fulfill_htlc` message contains an optional `attribution_data` field, similar to the one used in the failure
case described above. The only difference is that there is no failure message that is covered by the HMACs. With
attribution data, the sender can obtain hold times as reported and committed to by each hop along the payment path. This
field should not be set when the node is part of a blinded route.
Contributor

For errors, we set this data if path_key is not set in update_add_htlc, which means we'll set it for introduction nodes and for nodes that aren't in a blinded route at all.

To me, "part of a blinded route" could include the introduction node, which would make this inconsistent with the error section. We should express the requirements in a standard way.

@joostjager (Collaborator Author)

It's indeed unclear. Removed this sentence and added it to the BOLT02 requirements. I think that's better.

For the layout of `attribution data`, see [Removing an HTLC: `update_fulfill_htlc`, `update_fail_htlc`, and
`update_fail_malformed_htlc`](02-peer-protocol.md#removing-an-htlc-update_fulfill_htlc-update_fail_htlc-and-update_fail_malformed_htlc).

The `htlc_hold_times` field specifies the htlc hold time for each hop in units of 100 milliseconds. For example, a value
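A minimal sketch of the bucketing, assuming the hold time runs from receiving `update_add_htlc` to sending the settle or fail back (an assumption; the exact timing endpoints are debated below):

```go
import "time"

// encodeHoldTime converts a measured hold duration into the 100 ms units
// used by htlc_hold_times. The coarse buckets are what hide sub-100 ms
// timing detail such as internal batching intervals.
func encodeHoldTime(received, resolved time.Time) uint32 {
	return uint32(resolved.Sub(received).Milliseconds() / 100)
}
```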
Contributor

I think we need to define this properly, otherwise each implementation can have its own interpretation, which can also be a privacy leak? For instance, does it start when the node receives an htlc or sends an htlc? And does it end when the node receives the failure msg or sends back the msg?

@joostjager (Collaborator Author)

I think with the 100 ms granularity, we are good now? I also wouldn't be sure that everyone would really implement it exactly the way we'd specify it here. The implementation complexity can vary quite a bit depending on which timing points are used.

@joostjager (Collaborator Author)

Also I think that once senders start to use timing info during pathfinding, routing nodes will want to optimize the way they report times in their favor, potentially ignoring what we'd write in this document now.

2. types:
    1. type: 1 (`attribution_data`)
    2. data:
        * [`20*u32`:`htlc_hold_times`]
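For concreteness, a hypothetical decoder for this field (BOLT integers are big-endian); the function name and error handling are illustrative only:

```go
import (
	"encoding/binary"
	"errors"
)

// decodeHoldTimes parses the htlc_hold_times field shown above: twenty
// big-endian u32 values, one slot per potential hop.
func decodeHoldTimes(data []byte) ([20]uint32, error) {
	var times [20]uint32
	if len(data) != 20*4 {
		return times, errors.New("attribution_data: bad htlc_hold_times length")
	}
	for i := range times {
		times[i] = binary.BigEndian.Uint32(data[i*4:])
	}
	return times, nil
}
```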
Contributor

I don't understand the goal of this field here.

  • If people are dishonest, which they have incentives to be given there's no penalty for lying, then this value becomes useless.
  • If people are honest, this value creates privacy leaks and side-channel attacks: it can be used to fingerprint software or hardware, or to detect node latency or internal state. And for blinded paths this can be used to probe?


If people are dishonest, which they have incentives to be given there's no penalty for lying, then this value becomes useless.

Keep in mind that this value is guarded by two more reported values. For example, if we have a path A->B->C->D->E over which A pays E, assume A gets the hold times 4 / 3 / 2 / 1 for B, C, D and E respectively. A also keeps track of the real elapsed time.

If C wants to lie, he can report a value other than 3, but he is at the same time guarded by B's and D's values, which are 4 and 2. He can under-report or over-report, but he doesn't know what the actual values for B and D are. If he crosses the neighboring values, the sender can detect the invalid hold time and normalize/penalize accordingly.

If people are honest, this value creates privacy leaks and side-channel attacks: it can be used to fingerprint software or hardware, or to detect node latency or internal state. And for blinded paths this can be used to probe?

That was the whole rationale for the 100 ms buckets discussion. This (partially) helps mitigate fingerprinting (e.g. the LND 50 ms batch ticker now becomes invisible) and also pushes away from over-optimizing pathfinding around low-latency nodes.

For blinded paths (at least on LND) we don't populate the attribution data.
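A sketch of the consistency check described above (hypothetical helper): with nested reports, each value covers its own processing plus everything downstream, so the sequence must be non-increasing and bounded by the sender's own measurement.

```go
// holdTimesConsistent sanity-checks nested self-reported hold times against
// the sender-measured total, all in 100 ms units and ordered from the first
// hop to the last.
func holdTimesConsistent(measuredTotal uint32, reports []uint32) bool {
	prev := measuredTotal
	for _, r := range reports {
		// A report that exceeds the value before it has "crossed" its
		// upstream neighbor and can be flagged by the sender.
		if r > prev {
			return false
		}
		prev = r
	}
	return true
}
```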

Contributor

If he crosses the neighboring values, the sender can detect the invalid hold time and normalize/penalize accordingly.

B can also be dishonest here. It's likely A will see 0.4/0.3/0.2/0.1 while the measured duration is 4 seconds, so this info alone cannot be used to penalize nodes. With a more sophisticated algorithm and historical data, A can start recognizing patterns, assuming there are incentives to stay honest.

That was the whole rationale for the 100 ms buckets discussion.

That helps, but note that there's no perfect hiding as long as you're giving out the info. For blinded paths this enables a side-channel attack: you can get the hold time from the intro node, then perhaps DoS the suspected node in the blinded path to slow down its processing and measure the blinded path again. This is already doable, but the hold time gives a much clearer signal.

@joostjager (Collaborator Author)

B can also be dishonest here. It's likely A will see 0.4/0.3/0.2/0.1 while the measured duration is 4 seconds, so this info alone cannot be used to penalize nodes. With a more sophisticated algorithm and historical data, A can start recognizing patterns, assuming there are incentives to stay honest.

With this info, A knows that there is at least something slow on its connection to B, and A can penalize B. The others got away this time, but in a next round, where B is no longer used, they may also receive a penalty.

@joostjager (Collaborator Author)

@carlaKC, I've pushed a fix up addressing your comments.

@thomash-acinq left a comment

ACK

@joostjager (Collaborator Author)

Squashed and rebased without changes or conflicts
