Conversation

m-Peter
Collaborator

@m-Peter m-Peter commented Jan 21, 2025

Closes: #727 (PoC to allow indexing unsealed finalized execution results)

Description


For contributor use:

  • Targeted PR against master branch
  • Linked to Github issue with discussion and accepted design OR link to spec that describes this work.
  • Code follows the standards mentioned here.
  • Updated relevant documentation
  • Re-reviewed Files changed in the Github PR explorer
  • Added appropriate labels

Summary by CodeRabbit

  • New Features
    • Added experimental soft finality mode to speed up event availability.
    • Added optional sealing verification to detect mismatches between unsealed and sealed data.
    • Event ingestion can follow finalized headers and backfill from earlier periods.
    • New CLI flags: --experimental-soft-finality-enabled (default: false) and --experimental-sealing-verification-enabled (default: true), both with safety warnings.
  • Bug Fixes
    • Clearer error messages when collecting or storing transaction traces.
  • Chores
    • Improved module tidy checks in the build process.

@m-Peter m-Peter self-assigned this Jan 21, 2025
Contributor

coderabbitai bot commented Jan 21, 2025

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (4)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title succinctly communicates the primary intent (adding a subscriber to index finalized block results), which matches the core contribution of the changeset. The diff implements that intent by adding a subscriber that consumes finalized block headers and polls GetEventsForBlockIDs (with supporting CrossSporkClient methods), plus related storage and verifier wiring, so the title maps to the main change. The title's exact type name (RPCBlockHeaderSubscriber) differs from the concrete type added (RPCBlockTrackingSubscriber), a minor naming mismatch that does not make the title misleading.
  • Linked Issues Check: ✅ Passed. The changes fulfill the PoC objectives from issue #727: the PR subscribes to finalized block headers, adds CrossSporkClient methods to fetch events for a given block, and introduces an RPCBlockTrackingSubscriber that emits per-block EVM events for ingestion. It also acknowledges PoC limitations and provides optional sealing verification plus EventsHash storage and processed-sealed-height handling to bridge soft finality and later sealed verification. The PR notes that integration tests cannot run for this endpoint and does not include explicit rollback verification in the diff, so operational testing and rollback validation remain necessary.
  • Out of Scope Changes Check: ✅ Passed. I found no significant out-of-scope changes: the new subscriber, CrossSporkClient methods, SealingVerifier, EventsHash storage, and bootstrap/config/CLI wiring all directly support the PoC. Minor unrelated housekeeping appears (a Makefile tidy command tweak and clearer engine error messages), but these are low-risk and do not alter the PoC functionality. No large unrelated features or risky cross-cutting changes were introduced.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between efd12b0 and a404eab.

📒 Files selected for processing (2)
  • cmd/run/cmd.go (3 hunks)
  • config/config.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • config/config.go
  • cmd/run/cmd.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Test


Contributor

@peterargue peterargue left a comment


added a couple small comments, but otherwise this looks good. let's get it running against one of the live networks.

@m-Peter
Collaborator Author

m-Peter commented Jan 23, 2025

@peterargue Thanks for the review 🙌 I've addressed all the comments.
Did you perhaps have a chance to take a look at the behavior described in #727 (comment)? This is puzzling to me 🤔

@m-Peter m-Peter marked this pull request as ready for review January 23, 2025 15:29

@m-Peter m-Peter force-pushed the mpeter/poc-index-finalized-block-results branch from 314e7cc to bfe6188 Compare January 27, 2025 09:43

…k-port-soft-finality

[Backport] Implement `BatchTxPool` to handle nonce mismatch issues
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (5)
services/requester/tx_pool.go (2)

12-14: Consider making the regex more robust.

The current regex assumes the error message ends with a newline, which might not always be the case.

 const (
-	evmErrorRegex = `evm_error=(.*)\n`
+	evmErrorRegex = `evm_error=([^\n]+)`
 )

This pattern will capture everything after "evm_error=" up to but not including a newline, making it more flexible.
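
A quick standalone check of the two patterns (self-contained; the sample error string is invented for illustration):

package main

import (
	"fmt"
	"regexp"
)

func main() {
	msg := "execution failed: evm_error=nonce too low" // no trailing newline

	strict := regexp.MustCompile(`evm_error=(.*)\n`)
	relaxed := regexp.MustCompile(`evm_error=([^\n]+)`)

	fmt.Println(strict.FindStringSubmatch(msg))  // prints []: no match without a trailing newline
	fmt.Println(relaxed.FindStringSubmatch(msg)) // prints [evm_error=nonce too low nonce too low]
}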


26-34: Consider pre-compiling the regex for better performance.

The regex is compiled on every function call. For better performance, especially under high load, consider compiling it once.

+var evmErrorPattern = regexp.MustCompile(evmErrorRegex)
+
 func parseInvalidError(err error) (error, bool) {
-	r := regexp.MustCompile(evmErrorRegex)
-	matches := r.FindStringSubmatch(err.Error())
+	matches := evmErrorPattern.FindStringSubmatch(err.Error())
 	if len(matches) != 2 {
 		return nil, false
 	}
 
 	return errs.NewFailedTransactionError(matches[1]), true
 }
tests/tx_batching_test.go (2)

111-112: Consider reducing test flakiness from timing dependencies.

The conditional sleep based on index could lead to flaky tests if the batch interval timing changes. Consider using a more deterministic approach or waiting for specific events.

Instead of fixed sleep times, consider waiting for transaction pool events or using a mock time source for more deterministic testing.
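
For example, assuming testify is available (as is common in Go test suites), the sleep could be replaced by polling for the observable effect; getNonce, rpcTester, eoaAddress, and expectedNonce are hypothetical stand-ins for the test's state:

// Poll for the batch's observable effect instead of sleeping a fixed time;
// require.Eventually fails the test if the condition never holds in time.
require.Eventually(t, func() bool {
	nonce, err := rpcTester.getNonce(eoaAddress)
	return err == nil && nonce == expectedNonce
}, 10*time.Second, 100*time.Millisecond)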


322-327: Consider extracting test addresses to constants or helper function.

The hardcoded test addresses and private keys could be generated dynamically or moved to test constants for better maintainability.

Consider creating a helper function to generate test EOA addresses and keys:

func generateTestEOAs(count int) map[common.Address]string {
    // Generate deterministic test addresses and keys
}
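
One possible completion of that helper, generating fresh (rather than hardcoded) keys with go-ethereum's crypto package; error handling is deliberately minimal for a test-only helper:

import (
	"encoding/hex"

	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/crypto"
)

func generateTestEOAs(count int) map[common.Address]string {
	eoas := make(map[common.Address]string, count)
	for i := 0; i < count; i++ {
		key, err := crypto.GenerateKey()
		if err != nil {
			panic(err) // key generation should never fail in tests
		}
		// Map the derived address to its hex-encoded private key.
		eoas[crypto.PubkeyToAddress(key.PublicKey)] = hex.EncodeToString(crypto.FromECDSA(key))
	}
	return eoas
}
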
services/requester/batch_tx_pool.go (1)

109-111: Consider adding batch size limits per sender.

The current implementation can accumulate unlimited transactions per sender, which could lead to memory issues or oversized Flow transactions.

Consider adding a check and handling for maximum batch size:

const maxBatchSizePerSender = 100 // configurable

userTx := pooledEvmTx{txPayload: hexEncodedTx, nonce: tx.Nonce()}
if len(t.pooledTxs[from]) >= maxBatchSizePerSender {
    // Either reject the transaction or trigger immediate batch submission
    return fmt.Errorf("batch size limit reached for sender %s", from.Hex())
}
t.pooledTxs[from] = append(t.pooledTxs[from], userTx)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 092753c and 9364624.

⛔ Files ignored due to path filters (2)
  • go.sum is excluded by !**/*.sum
  • tests/go.sum is excluded by !**/*.sum
📒 Files selected for processing (14)
  • bootstrap/bootstrap.go (6 hunks)
  • cmd/run/cmd.go (3 hunks)
  • config/config.go (1 hunks)
  • go.mod (4 hunks)
  • services/requester/batch_tx_pool.go (1 hunks)
  • services/requester/cadence/batch_run.cdc (1 hunks)
  • services/requester/pool.go (0 hunks)
  • services/requester/requester.go (5 hunks)
  • services/requester/single_tx_pool.go (1 hunks)
  • services/requester/tx_pool.go (1 hunks)
  • tests/go.mod (4 hunks)
  • tests/helpers.go (4 hunks)
  • tests/integration_test.go (1 hunks)
  • tests/tx_batching_test.go (1 hunks)
💤 Files with no reviewable changes (1)
  • services/requester/pool.go
✅ Files skipped from review due to trivial changes (1)
  • tests/go.mod
🚧 Files skipped from review as they are similar to previous changes (4)
  • config/config.go
  • bootstrap/bootstrap.go
  • go.mod
  • cmd/run/cmd.go
🔇 Additional comments (13)
tests/integration_test.go (1)

558-558: LGTM! Excellent resource management improvement.

The addition of explicit cleanup for both boot instances ensures proper resource management and prevents potential leaks. The deferred cleanup with nil-check is well-structured and follows Go best practices.

Also applies to: 565-569
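
The pattern in question looks roughly like this (type, helper, and method names here are illustrative, not quoted from the test code):

var boot *bootstrap.Bootstrap // hypothetical type from the bootstrap package
defer func() {
	// Only stop instances that were actually created, so a failed
	// startup path does not panic on a nil receiver.
	if boot != nil {
		boot.Stop()
	}
}()
boot, err := startBootInstance(t, cfg) // hypothetical helper
require.NoError(t, err)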

tests/helpers.go (3)

95-96: Configuration changes look good.

Enabling EVM setup while disabling VM bridge aligns with the PR's focus on EVM functionality.


246-258: Proper COA initialization with existence checks.

Good defensive programming - checking for existing resources before creating new ones prevents runtime errors and follows Cadence best practices.


468-481: Clean implementation of getBalance RPC method.

The method follows the existing pattern and properly handles the JSON response with appropriate type conversion.
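
A sketch of what such a method might look like; the rpcTest receiver and request transport helper are assumed from context, not quoted from the diff:

func (r *rpcTest) getBalance(addr common.Address) (*big.Int, error) {
	// Query the standard eth_getBalance endpoint for the latest block.
	raw, err := r.request("eth_getBalance", fmt.Sprintf(`["%s","latest"]`, addr.Hex()))
	if err != nil {
		return nil, err
	}
	var hexBalance string
	if err := json.Unmarshal(raw, &hexBalance); err != nil {
		return nil, err
	}
	// Convert the 0x-prefixed hex quantity into a *big.Int.
	return hexutil.DecodeBig(hexBalance)
}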

services/requester/requester.go (2)

38-39: Proper embedding of batch run script.

Good use of go:embed directive for the new batch transaction script.
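
For reference, the go:embed pattern in question looks like this (the variable name is illustrative):

import _ "embed"

//go:embed cadence/batch_run.cdc
var batchRunScript []byte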


237-239: Clean delegation to transaction pool.

The simplified implementation properly delegates transaction submission to the pool while maintaining all necessary validation.

tests/tx_batching_test.go (2)

128-164: Excellent test coverage for nonce validation errors.

The test properly validates both "nonce too low" and "nonce too high" scenarios, and correctly asserts that the batch fails with the appropriate error message.


347-382: Well-structured concurrent testing with proper synchronization.

Excellent use of errgroup for managing concurrent goroutines and proper error propagation. The random delays effectively simulate real-world concurrent transaction submissions.

services/requester/single_tx_pool.go (3)

154-198: Well-implemented transaction building with proper resource management.

The method correctly handles keystore locking, metric updates, and error scenarios. The pattern of taking a key and returning it on error is particularly good.


200-226: Efficient concurrent fetching implementation.

Good use of errgroup for parallel fetching of block and account information, which optimizes the transaction preparation time.
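
The pattern being praised, sketched with golang.org/x/sync/errgroup and the flow-go-sdk access client (client and address names are placeholders):

g, gCtx := errgroup.WithContext(ctx)

var latestBlock *flow.Block
var account *flow.Account

g.Go(func() error {
	var err error
	latestBlock, err = client.GetLatestBlock(gCtx, true) // sealed block
	return err
})
g.Go(func() error {
	var err error
	account, err = client.GetAccount(gCtx, coaAddress)
	return err
})
// Wait blocks until both fetches complete and surfaces the first error.
if err := g.Wait(); err != nil {
	return err
}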


117-121: Fix defer placement to avoid deleting non-existent entries.

The defer t.pool.Delete(tx.Hash()) statement will execute even if an error occurs before the transaction is stored in the pool. This could lead to unnecessary deletion attempts on non-existent entries.

Restructure the code so that the deferred cleanup is registered only after the transaction has been successfully stored:

 if t.config.TxStateValidation == config.TxSealValidation {
-    // add to pool and delete after transaction is sealed or errored out
-    t.pool.Store(tx.Hash(), tx)
-    defer t.pool.Delete(tx.Hash())
+    // add to pool
+    t.pool.Store(tx.Hash(), tx)
+    // ensure cleanup after transaction is sealed or errored out
+    defer func() {
+        t.pool.Delete(tx.Hash())
+    }()

Likely an incorrect or invalid review comment.

services/requester/batch_tx_pool.go (2)

131-138: Excellent concurrent design with atomic map swapping.

The pattern of atomically swapping the pooled transactions map allows new transactions to be added without blocking during batch submission. This is a well-designed approach for high-throughput scenarios.
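
In outline, the swap works like this (field names taken from the diff context in this review; the per-sender submit step is a placeholder):

t.txMux.Lock()
// Take ownership of the current batch and install a fresh map, so
// concurrent submissions never wait on network I/O.
batch := t.pooledTxs
t.pooledTxs = make(map[gethCommon.Address][]pooledEvmTx)
t.txMux.Unlock()

for from, txs := range batch {
	submitBatchForSender(from, txs) // placeholder for the per-sender submission
}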


165-169: Critical nonce sorting ensures transaction order preservation.

Excellent implementation of nonce-based sorting to maintain transaction order within a batch. This is essential for preventing nonce-related errors.
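
The sorting step amounts to an ascending sort on the pooled nonces, e.g. (field name per the pooledEvmTx type shown elsewhere in this review):

// Execute each sender's transactions in nonce order within the batch.
sort.Slice(txs, func(i, j int) bool {
	return txs[i].nonce < txs[j].nonce
})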

Comment on lines 28 to 33
for txResult in txResults {
assert(
txResult.status == EVM.Status.failed || txResult.status == EVM.Status.successful,
message: "evm_error=".concat(txResult.errorMessage).concat("\n")
)
}
Contributor


🛠️ Refactor suggestion

The loop is redundant - it will always fail on the first iteration.

Since this code is only reached when all transactions have invalid status (neither failed nor successful), the assertion txResult.status == EVM.Status.failed || txResult.status == EVM.Status.successful will always be false on the first iteration, causing the transaction to fail immediately. The loop structure is unnecessary.

Consider simplifying to directly assert on the first transaction result:

-        for txResult in txResults {
-            assert(
-                txResult.status == EVM.Status.failed || txResult.status == EVM.Status.successful,
-                message: "evm_error=".concat(txResult.errorMessage).concat("\n")
-            )
-        }
+        assert(
+            false,
+            message: "evm_error=".concat(txResults[0].errorMessage).concat("\n")
+        )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for txResult in txResults {
assert(
txResult.status == EVM.Status.failed || txResult.status == EVM.Status.successful,
message: "evm_error=".concat(txResult.errorMessage).concat("\n")
)
}
assert(
false,
message: "evm_error=".concat(txResults[0].errorMessage).concat("\n")
)
🤖 Prompt for AI Agents
In services/requester/cadence/batch_run.cdc around lines 28 to 33, the for loop
asserting each txResult's status is redundant because the code path is only
reached when all transactions have invalid statuses, causing the assertion to
fail immediately on the first iteration. Replace the loop with a single
assertion on the first txResult's status to simplify the code and avoid
unnecessary iteration.

txMux: sync.Mutex{},
}

go batchPool.processPooledTransactions(ctx)
Contributor


🛠️ Refactor suggestion

Add panic recovery for background goroutine.

The background goroutine processing transactions could panic and terminate batching functionality. Add panic recovery to ensure resilience.

-go batchPool.processPooledTransactions(ctx)
+go func() {
+    defer func() {
+        if r := recover(); r != nil {
+            logger.Error().Interface("panic", r).Msg("panic in batch transaction processor")
+            // Optionally restart the goroutine or implement a circuit breaker
+        }
+    }()
+    batchPool.processPooledTransactions(ctx)
+}()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
go batchPool.processPooledTransactions(ctx)
go func() {
defer func() {
if r := recover(); r != nil {
logger.Error().Interface("panic", r).Msg("panic in batch transaction processor")
// Optionally restart the goroutine or implement a circuit breaker
}
}()
batchPool.processPooledTransactions(ctx)
}()
🤖 Prompt for AI Agents
In services/requester/batch_tx_pool.go at line 76, the background goroutine
started with go batchPool.processPooledTransactions(ctx) lacks panic recovery.
Wrap the call to processPooledTransactions in an anonymous function that defers
a recover function to catch and handle any panics, preventing the goroutine from
crashing and ensuring the batching functionality remains resilient.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
services/ingestion/block_tracking_subscriber.go (2)

177-177: Fix typo in comment.

-					// typically we receive an error in the errChan before the channels are closes
+					// typically we receive an error in the errChan before the channels are closed

359-359: Use the actual event type constant instead of string comparison.

Using string contains for event type comparison is fragile. Consider using the actual constant or a more precise comparison.

-	if strings.Contains(eventType, string(events.EventTypeBlockExecuted)) && len(event.Events) != 1 {
+	if eventType == string(events.EventTypeBlockExecuted) && len(event.Events) != 1 {
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 34f6efd and 88e6d7e.

📒 Files selected for processing (4)
  • bootstrap/bootstrap.go (5 hunks)
  • cmd/run/cmd.go (3 hunks)
  • config/config.go (1 hunks)
  • services/ingestion/block_tracking_subscriber.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • config/config.go
  • bootstrap/bootstrap.go
  • cmd/run/cmd.go
🧰 Additional context used
🧠 Learnings (1)
services/ingestion/block_tracking_subscriber.go (9)

Learnt from: peterargue
PR: #617
File: api/stream.go:62-67
Timestamp: 2024-10-18T19:26:37.579Z
Learning: In the flow-evm-gateway project, within the Go file api/stream.go, the prepareBlockResponse method includes the Bloom filter as the field LogsBloom in the returned Block struct.

Learnt from: m-Peter
PR: #831
File: services/requester/pool.go:202-214
Timestamp: 2025-06-17T10:29:35.941Z
Learning: In Flow EVM Gateway's transaction pool (services/requester/pool.go), signing keys acquired with keystore.Take() are intentionally held until the transaction is sealed, not released immediately after sending. The key release happens asynchronously when sealed transactions are ingested via event_subscriber.go NotifyTransaction() -> keystore unsafeUnlockKey() -> key.Done(). This pattern ensures transaction integrity by preventing key reuse before transaction finalization.

Learnt from: m-Peter
PR: #831
File: services/requester/pool.go:202-214
Timestamp: 2025-06-17T10:29:35.941Z
Learning: In Flow EVM Gateway's transaction pool (services/requester/pool.go), signing keys acquired with keystore.Take() are intentionally held until the transaction is sealed, not released immediately after sending. The key release happens asynchronously when sealed transactions are ingested via event_subscriber.go NotifyTransaction() -> keystore unsafeUnlockKey() -> key.Done(). This pattern ensures transaction integrity by preventing key reuse before transaction finalization.

Learnt from: m-Peter
PR: #831
File: services/requester/batch_tx_pool.go:0-0
Timestamp: 2025-06-19T11:36:25.478Z
Learning: In Go, when copying maps that contain slices (like map[gethCommon.Address][]pooledEvmTx), perform deep copies by iterating over the map and copying each slice individually using make() and copy() to avoid shared references that could lead to race conditions and data corruption. (A minimal sketch of this pattern follows after this list.)

Learnt from: peterargue
PR: #615
File: bootstrap/bootstrap.go:167-197
Timestamp: 2024-10-17T18:04:04.165Z
Learning: In the flow-evm-gateway Go project, the validation ensuring that startHeight is less than or equal to endHeight is performed before the StartTraceDownloader method in bootstrap/bootstrap.go, so additional checks in this method are unnecessary.

Learnt from: peterargue
PR: #772
File: services/requester/keystore/key_store.go:50-62
Timestamp: 2025-03-07T01:35:09.751Z
Learning: In the flow-evm-gateway codebase, panics are acceptable in scenarios where immediate detection of critical bugs is desired during development and testing, particularly for invariant violations that should never occur in a correctly functioning system (e.g., when a key is available but locked in the keystore implementation).

Learnt from: peterargue
PR: #682
File: services/ingestion/event_subscriber.go:85-90
Timestamp: 2025-01-24T20:15:10.908Z
Learning: When calculating blocks to catch up in Flow, initialize the counter to 0 and only set it when the latest chain height is greater than the current height (latestOnChainHeight > currentHeight). There's no need to handle the else case as 0 blocks to catch up is the correct default when we're at the same height or ahead.

Learnt from: peterargue
PR: #615
File: services/traces/engine.go:175-178
Timestamp: 2024-10-17T18:05:27.084Z
Learning: In the trace backfill process, returning upon encountering an error when retrieving a block or its Cadence ID is intentional because future blocks are unlikely to exist if these blocks don't.

Learnt from: peterargue
PR: #615
File: services/traces/engine.go:190-190
Timestamp: 2024-10-17T18:04:41.062Z
Learning: In services/traces/engine.go, the e.indexBlockTraces function limits the number of goroutines to a maximum of 5 and blocks until all are complete, so during backfill, concurrency is already managed to prevent resource exhaustion.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Test
🔇 Additional comments (1)
services/ingestion/block_tracking_subscriber.go (1)

114-114: No change needed for logger.Fatal() here

The use of logger.Fatal() to terminate on a verifier failure is consistent with the rest of the codebase, where critical errors trigger a process exit:

  • storage/pebble/db.go: log.Fatal().Err(err).Msg("failed to close batch")
  • services/ingestion/sealing_verifier.go: v.logger.Fatal().Err(err)...
  • services/ingestion/engine.go: e.log.Fatal().Err(err).Msg("failed to close batch")
  • bootstrap/bootstrap.go: l.Fatal().Err(err).Msg("engine failed to run")

Terminate-on-error for the sealing verifier aligns with the intended failure semantics elsewhere.

Comment on lines +331 to +347
for {
evts, err = r.client.GetEventsForBlockHeader(
ctx,
eventType,
blockHeader,
)
if err != nil {
// retry after a short pause
if status.Code(err) == codes.NotFound || status.Code(err) == codes.ResourceExhausted {
time.Sleep(200 * time.Millisecond)
continue
}

return flow.BlockEvents{}, fmt.Errorf("failed to get events from access node: %w", err)
}
break
}
Contributor


🛠️ Refactor suggestion

Add maximum retry attempts to prevent infinite loops.

The retry loop could run indefinitely if NotFound or ResourceExhausted errors persist. Consider adding a maximum retry count with exponential backoff.

+	const maxRetries = 10
+	retryCount := 0
+	retryDelay := 200 * time.Millisecond
+
 	// retry until we get the block from an execution node that has the events
 	for {
 		evts, err = r.client.GetEventsForBlockHeader(
 			ctx,
 			eventType,
 			blockHeader,
 		)
 		if err != nil {
 			// retry after a short pause
 			if status.Code(err) == codes.NotFound || status.Code(err) == codes.ResourceExhausted {
-				time.Sleep(200 * time.Millisecond)
+				retryCount++
+				if retryCount > maxRetries {
+					return flow.BlockEvents{}, fmt.Errorf("max retries exceeded after %d attempts: %w", maxRetries, err)
+				}
+				time.Sleep(retryDelay)
+				retryDelay = retryDelay * 2 // exponential backoff
+				if retryDelay > 5*time.Second {
+					retryDelay = 5*time.Second // cap at 5 seconds
+				}
 				continue
 			}
 
 			return flow.BlockEvents{}, fmt.Errorf("failed to get events from access node: %w", err)
 		}
 		break
 	}
🤖 Prompt for AI Agents
In services/ingestion/block_tracking_subscriber.go around lines 331 to 347, the
retry loop for fetching events can run indefinitely on NotFound or
ResourceExhausted errors. To fix this, introduce a maximum retry count to limit
the number of attempts and implement exponential backoff by increasing the sleep
duration after each retry. This will prevent infinite loops and reduce load on
the system during persistent errors.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
services/ingestion/block_tracking_subscriber.go (1)

343-359: Add maximum retry attempts to prevent infinite loops.

The retry loop could run indefinitely if NotFound or ResourceExhausted errors persist. This was mentioned in a previous review but wasn't addressed.

Consider implementing the suggested retry limit with exponential backoff:

+	const maxRetries = 10
+	retryCount := 0
+	retryDelay := 200 * time.Millisecond
+
 	// retry until we get the block from an execution node that has the events
 	for {
 		evts, err = r.client.GetEventsForBlockHeader(
 			ctx,
 			eventType,
 			blockHeader,
 		)
 		if err != nil {
 			// retry after a short pause
 			if status.Code(err) == codes.NotFound || status.Code(err) == codes.ResourceExhausted {
-				time.Sleep(200 * time.Millisecond)
+				retryCount++
+				if retryCount > maxRetries {
+					return flow.BlockEvents{}, fmt.Errorf("max retries exceeded after %d attempts: %w", maxRetries, err)
+				}
+				time.Sleep(retryDelay)
+				retryDelay = retryDelay * 2 // exponential backoff
+				if retryDelay > 5*time.Second {
+					retryDelay = 5*time.Second // cap at 5 seconds
+				}
 				continue
 			}
 
 			return flow.BlockEvents{}, fmt.Errorf("failed to get events from access node: %w", err)
 		}
 		break
 	}
🧹 Nitpick comments (1)
services/ingestion/block_tracking_subscriber.go (1)

235-235: Consider making the block header queue size configurable.

The hardcoded value of 5 blocks for the notification delay might need adjustment based on network conditions or chain configuration. Consider making this a configurable parameter.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 88e6d7e and 7575806.

📒 Files selected for processing (1)
  • services/ingestion/block_tracking_subscriber.go (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
services/ingestion/block_tracking_subscriber.go (9)

Learnt from: peterargue
PR: #617
File: api/stream.go:62-67
Timestamp: 2024-10-18T19:26:37.579Z
Learning: In the flow-evm-gateway project, within the Go file api/stream.go, the prepareBlockResponse method includes the Bloom filter as the field LogsBloom in the returned Block struct.

Learnt from: m-Peter
PR: #831
File: services/requester/pool.go:202-214
Timestamp: 2025-06-17T10:29:35.941Z
Learning: In Flow EVM Gateway's transaction pool (services/requester/pool.go), signing keys acquired with keystore.Take() are intentionally held until the transaction is sealed, not released immediately after sending. The key release happens asynchronously when sealed transactions are ingested via event_subscriber.go NotifyTransaction() -> keystore unsafeUnlockKey() -> key.Done(). This pattern ensures transaction integrity by preventing key reuse before transaction finalization.

Learnt from: m-Peter
PR: #831
File: services/requester/pool.go:202-214
Timestamp: 2025-06-17T10:29:35.941Z
Learning: In Flow EVM Gateway's transaction pool (services/requester/pool.go), signing keys acquired with keystore.Take() are intentionally held until the transaction is sealed, not released immediately after sending. The key release happens asynchronously when sealed transactions are ingested via event_subscriber.go NotifyTransaction() -> keystore unsafeUnlockKey() -> key.Done(). This pattern ensures transaction integrity by preventing key reuse before transaction finalization.

Learnt from: m-Peter
PR: #831
File: services/requester/batch_tx_pool.go:0-0
Timestamp: 2025-06-19T11:36:25.478Z
Learning: In Go, when copying maps that contain slices (like map[gethCommon.Address][]pooledEvmTx), perform deep copies by iterating over the map and copying each slice individually using make() and copy() to avoid shared references that could lead to race conditions and data corruption.

Learnt from: peterargue
PR: #615
File: bootstrap/bootstrap.go:167-197
Timestamp: 2024-10-17T18:04:04.165Z
Learning: In the flow-evm-gateway Go project, the validation ensuring that startHeight is less than or equal to endHeight is performed before the StartTraceDownloader method in bootstrap/bootstrap.go, so additional checks in this method are unnecessary.

Learnt from: peterargue
PR: #772
File: services/requester/keystore/key_store.go:50-62
Timestamp: 2025-03-07T01:35:09.751Z
Learning: In the flow-evm-gateway codebase, panics are acceptable in scenarios where immediate detection of critical bugs is desired during development and testing, particularly for invariant violations that should never occur in a correctly functioning system (e.g., when a key is available but locked in the keystore implementation).

Learnt from: peterargue
PR: #682
File: services/ingestion/event_subscriber.go:85-90
Timestamp: 2025-01-24T20:15:10.908Z
Learning: When calculating blocks to catch up in Flow, initialize the counter to 0 and only set it when the latest chain height is greater than the current height (latestOnChainHeight > currentHeight). There's no need to handle the else case as 0 blocks to catch up is the correct default when we're at the same height or ahead.

Learnt from: peterargue
PR: #615
File: services/traces/engine.go:175-178
Timestamp: 2024-10-17T18:05:27.084Z
Learning: In the trace backfill process, returning upon encountering an error when retrieving a block or its Cadence ID is intentional because future blocks are unlikely to exist if these blocks don't.

Learnt from: peterargue
PR: #615
File: services/traces/engine.go:190-190
Timestamp: 2024-10-17T18:04:41.062Z
Learning: In services/traces/engine.go, the e.indexBlockTraces function limits the number of goroutines to a maximum of 5 and blocks until all are complete, so during backfill, concurrency is already managed to prevent resource exhaustion.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Test
🔇 Additional comments (3)
services/ingestion/block_tracking_subscriber.go (3)

28-65: Well-structured implementation with appropriate warnings.

The struct definition and constructor are clean and follow Go best practices. The extensive documentation warning about unsealed data limitations is excellent and helps users understand the risks involved.


67-129: Sound implementation of subscription with proper backfilling logic.

The method correctly handles backfilling from previous sporks before subscribing to live events. The verifier is appropriately started after backfilling since that data is already sealed. Error handling during backfilling is properly implemented.


286-332: Robust event retrieval with proper system transaction failure handling.

The method correctly handles the edge case where system transactions fail but EVM transactions may still exist. The logic for checking TransactionHashRoot and combining events is sound.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
services/ingestion/block_tracking_subscriber.go (1)

343-359: Add maximum retry attempts to prevent infinite loops.

The retry loop could run indefinitely if NotFound or ResourceExhausted errors persist. This was previously flagged but not addressed.

+const maxRetries = 10
+retryCount := 0
+retryDelay := 200 * time.Millisecond
+
 // retry until we get the block from an execution node that has the events
 for {
     evts, err = r.client.GetEventsForBlockHeader(
         ctx,
         eventType,
         blockHeader,
     )
     if err != nil {
         // retry after a short pause
         if status.Code(err) == codes.NotFound || status.Code(err) == codes.ResourceExhausted {
-            time.Sleep(200 * time.Millisecond)
+            retryCount++
+            if retryCount > maxRetries {
+                return flow.BlockEvents{}, fmt.Errorf("max retries exceeded after %d attempts: %w", maxRetries, err)
+            }
+            time.Sleep(retryDelay)
+            retryDelay = retryDelay * 2 // exponential backoff
+            if retryDelay > 5*time.Second {
+                retryDelay = 5*time.Second // cap at 5 seconds
+            }
             continue
         }
 
         return flow.BlockEvents{}, fmt.Errorf("failed to get events from access node: %w", err)
     }
     break
 }
🧹 Nitpick comments (2)
services/ingestion/block_tracking_subscriber.go (2)

110-118: Consider using Error logging instead of Fatal for verifier failures.

Using Fatal logging will terminate the entire application if the verifier encounters an error. This might be too severe for a verification component that's optional and experimental.

 go func() {
     r.verifier.SetStartHeight(r.height)
     if err := r.verifier.Run(ctx); err != nil {
-        r.logger.Fatal().Err(err).Msg("failure running sealing verifier")
+        r.logger.Error().Err(err).Msg("failure running sealing verifier")
         return
     }
 }()

235-240: Consider making the block notification delay configurable.

The hardcoded value of 15 blocks for the notification delay might need adjustment based on network conditions or AN sync performance.

+const blockNotificationDelay = 15
+
 // ...
-if len(blockHeadersQueue) > 15 {
+if len(blockHeadersQueue) > blockNotificationDelay {
     earliestBlockHeader := blockHeadersQueue[0]
     r.keyLock.NotifyBlock(earliestBlockHeader)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7575806 and 6f5b3c9.

📒 Files selected for processing (1)
  • services/ingestion/block_tracking_subscriber.go (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Test
  • GitHub Check: Lint
🔇 Additional comments (1)
services/ingestion/block_tracking_subscriber.go (1)

28-45: Well-documented experimental feature with appropriate warnings.

The extensive documentation clearly outlines the limitations and risks of using unsealed finalized blocks. The warning about manual tracking requirements and Proof of Concept usage is appropriately prominent.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
go.mod (1)

3-3: Invalid go directive: use major.minor (no patch).

The go version in go.mod must be like 1.25, not 1.25.0. This breaks tooling.

Apply:

-go 1.25.0
+go 1.25
tests/go.mod (1)

3-3: Invalid go directive in tests module.

Same as root: use 1.25, not 1.25.0.

Apply:

-go 1.25.0
+go 1.25
🧹 Nitpick comments (4)
tests/go.mod (4)

14-14: Version drift: zerolog differs between root (v1.33.0) and tests (v1.34.0).

Not harmful, but avoid dual versions unless needed. Consider aligning to one version across modules.

Apply (either direction):

-	github.com/rs/zerolog v1.34.0
+	github.com/rs/zerolog v1.33.0

181-184: Prometheus stack minor bumps only in tests: consider aligning with root.

To reduce duplicate versions and metric wiring surprises, align client_golang, common, and procfs with the root module.


233-233: Duplicate YAML modules in graph.

Both go.yaml.in/yaml/v2 (v2.4.2) and gopkg.in/yaml.v2 (v2.4.0 at Line 252) appear. Prefer a single path if possible; if transitive, consider bumping deps to converge, then go mod tidy.


249-250: gRPC/protobuf patch drift vs root.

Tests use grpc v1.75.1 and protobuf v1.36.8; root has v1.75.0 and v1.36.7. Harmless, but aligning reduces duplicate loads and avoids subtle ABI issues.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a8f7437 and efd12b0.

⛔ Files ignored due to path filters (2)
  • go.sum is excluded by !**/*.sum
  • tests/go.sum is excluded by !**/*.sum
📒 Files selected for processing (2)
  • go.mod (2 hunks)
  • tests/go.mod (7 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Lint
  • GitHub Check: Test
🔇 Additional comments (4)
go.mod (2)

14-14: flow-go bump to v0.43.0: confirm API compatibility.

v0.43.0 likely changes Access API/proto shapes. Please confirm RPCBlockHeaderSubscriber/CrossSporkClient compile and run against this version in CI, and that protobuf-gen outputs were refreshed.


148-156: Contract/template/protobuf version sync looks good; verify event types.

Upgrading core-contracts, templates, NFT, and flow protobuf to newer versions is consistent with finalized-header indexing. Please verify emitted EVM event type identifiers and payload fields used by the indexer didn’t change, and run go mod tidy -compat=1.25 to lock a coherent graph.

tests/go.mod (2)

10-10: Emulator pseudo‑version: does it expose finalized‑header subscription?

You noted E2E is blocked because the Emulator lacks this endpoint. This update suggests a very recent pseudo‑version; please confirm whether it now implements finalized‑header subscription. If not, gate/skip tests that depend on it to keep CI green.


12-12: Align flow-go version with root module (OK).

Tests now use v0.43.0, matching the root. Good for API parity.

Run go mod tidy -compat=1.25 in both modules to ensure the graph is minimal and consistent.
