Skip to content

Conversation

@syszery
Copy link
Contributor

@syszery syszery commented Oct 9, 2025

Motivation

This PR addresses #9730 and focuses on the CommitCheckpointVerifiedBlock path.

Solution

Follow-up Work

  • Should CloneError be replaced?
    CloneError was introduced due to commit_finalized_direct returning a BoxError. Replacing it will require additional changes to the error like implementing From<zebra_chain::parallel::tree::NoteCommitmentTreeError>.
  • Should CommitCheckpointVerifiedError wrap QueueAndCommitError?
    There is some overlap between these error types, particularly with regard to the error messages. Adopting QueueAndCommitError would simplify error handling, but this would require modifying some of the existing error messages to match.

Happy to adjust based on feedback.

PR Checklist

  • The PR name is suitable for the release notes.
  • The PR follows the contribution guidelines.
  • The library crate changelogs are up to date.
  • The solution is tested.
  • The documentation is up to date.

@syszery syszery marked this pull request as ready for review October 9, 2025 21:18
arya2
arya2 previously approved these changes Oct 14, 2025
Copy link
Contributor

@arya2 arya2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, one small suggestions to avoid changing the error messages if you want to keep a CloneError variant in the new error enum.

Should CloneError be replaced?

It would be great if it were replaced there, though it looks like the change may be substantial if you want this to merge as-is.

Should CommitCheckpointVerifiedError wrap QueueAndCommitError?

I think that would be nice, CommitCheckpointVerifiedError could essentially be a copy of CommitSemanticallyVerifiedError with an added WriteError(rocksdb::Error) variant.

@syszery
Copy link
Contributor Author

syszery commented Oct 14, 2025

Hi @arya2, thank you for the feedback.

I’ve restructured CommitCheckpointVerifiedError to follow the same model as CommitSemanticallyVerifiedError, as suggested. I also stubbed in WriteError(#[from] rocksdb::Error) for the finalized path (currently unused). All relevant error cases are now represented in QueueAndCommitError.

Some variants between the finalized and non-finalized paths are close in meaning, but not identical — so a full unification would require either small semantic shifts or renaming a few arms. Also, e.g., QueueAndCommitError::Dropped isn't currently used — would you prefer to keep these around for now?

If you already have a naming or structural direction in mind, feel free to push it — I’m happy to clean up any resulting breakages or follow-up changes.

TODOs before merge:

  • Attempt full removal of BoxError.
  • Fix documentation for the previously merged InvalidateError and ReconsiderError in service.rs.
  • (?) Update CHANGELOG.md to highlight breaking changes.

@oxarbitrage oxarbitrage added the do-not-merge Tells Mergify not to merge this PR label Oct 14, 2025
Copy link
Contributor

@arya2 arya2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking great!

I started making some changes based off this branch in #9999, where the write_block() method has been updated to return a CommitCheckpointVerifiedError. I think there may still be some room for improvement in the Duplicate error variant. I'm happy to work on that a bit more or for you to takeover from there, let me know what you would prefer.

Aside from improving the Duplicate error variant, these changes will likely require some updates to ChainSync::should_restart_sync().

Some variants between the finalized and non-finalized paths are close in meaning, but not identical — so a full unification would require either small semantic shifts or renaming a few arms.

I think we should fully unify them and have both request variants return a CommitBlockError, the only substantial difference is that committing checkpoint-verified blocks could return a rocksdb::Error, but only because those are expected not to happen when finalizing non-finalized blocks, so I think it would be fine to expect that they won't happen for checkpoint-verified blocks either. It doesn't seem like Zebra could recover from those errors anyway, see "an error during a write operation (write to WAL, Memtable Flush, background compaction etc) may cause the database instance to go into read-only mode and further user writes may not be accepted."

Also, e.g., QueueAndCommitError::Dropped isn't currently used — would you prefer to keep these around for now?

If they're not currently necessary, it seems easy enough to add them once they're needed.

@syszery
Copy link
Contributor Author

syszery commented Oct 17, 2025

Quoting @arya2 from #9999:

TODO:

  • The Duplicate error variant could likely be improved, I'm happy to try doing that here,
  • Error messages should likely be adjusted to more closely match existing error messages, and
  • ChainSync::should_restart_sync() will likely need some updates to continue working as expected

I tried to beautify the new Duplicate error and deliver a simple version that fits the current style of the module. However, formatting the hash_or_height is a bit tedious due to the Option type — especially since the None case needs to be handled explicitly for clean error messages.

Currently, the None case only appears in service.rs, where there are TODOs hinting that block_height tracking is planned. Would you prefer to leave this as a TODO for follow-up, or should we address it in this PR?


On a related note: I noticed that BoxError and CommitSemanticallyVerifiedError are re-exported in lib.rs, while the other error types are not. I'd like to unify this approach for consistency across existing and upcoming variants.

Do you prefer that I do this cleanup as part of this PR, or would you rather wait and revisit it once the StateService error structure is more finalized?

@syszery syszery force-pushed the refactor/state-CommitCheckpointVerifiedBlock-error branch from 590b1af to dead800 Compare October 18, 2025 05:47
@syszery
Copy link
Contributor Author

syszery commented Oct 18, 2025

Yesterday, after a rebase, I had some issues with the acceptance tests in zebrad, particularly with activate_mempool_mainnet. Now, I just re-ran all tests again, and they all passed on the first try.

We’ve been working on parts of the code that might affect mempool logic, so I’m wondering — is there anything I should take a closer look at, or anything in particular to watch out for?

@syszery
Copy link
Contributor Author

syszery commented Oct 26, 2025

Hi @arya2, I patched the should_restart_sync method you mentioned. The duplicate/commit checks are now consolidated in a single Invalid { error, .. } arm, which handles:

  • error.is_duplicate_request()
  • CommitBlockError::Duplicate
  • fallback string matches (for now)

This commit only fixes a small part of #2908 — the fallback string matching is still a TODO, I think.

@syszery
Copy link
Contributor Author

syszery commented Oct 27, 2025

After re-visiting the method, I made it a bit more defensive: it now checks for CommitBlockError::Duplicate in the _ => fallback branch, while keeping the previous string matches.

Do you want to merge this “safe but messy” fix first and clean it up later (e.g., by implementing is_duplicate() on CommitBlockError), essentially addressing issue #2908? Or as part of this PR?

@arya2
Copy link
Contributor

arya2 commented Oct 28, 2025

Currently, the None case only appears in service.rs, where there are TODOs hinting that block_height tracking is planned. Would you prefer to leave this as a TODO for follow-up, or should we address it in this PR?

I would prefer to address it in this PR if it helps make the Option<HashOrHeight> a HashOrHeight, but it would be fine to do it in a follow-up PR too.

On a related note: I noticed that BoxError and CommitSemanticallyVerifiedError are re-exported in lib.rs, while the other error types are not. I'd like to unify this approach for consistency across existing and upcoming variants.

Do you prefer that I do this cleanup as part of this PR, or would you rather wait and revisit it once the StateService error structure is more finalized?

Whichever you prefer, I see no urgency to do it now nor much benefit to deferring it.

Yesterday, after a rebase, I had some issues with the acceptan

Yesterday, after a rebase, I had some issues with the acceptance tests in zebrad, particularly with activate_mempool_mainnet.
Now, I just re-ran all tests again, and they all passed on the first try.

We’ve been working on parts of the code that might affect mempool logic, so I’m wondering — is there anything I should take a closer look at, or anything in particular to watch out for?

No, these changes should have no effect on the mempool, that test and a handful of other acceptance tests just fail occasionally, it's not an issue unless they fail consistently or are relevant to the changes and fail in a way that could be due to a concurrency bug.

Do you want to merge this “safe but messy” fix first and clean it up later (e.g., by implementing is_duplicate() on CommitBlockError), essentially addressing issue #2908? Or as part of this PR?

I want to address #2908 as part of this PR, we've been wanting to address it for a long time, seems convenient to do it here where it makes the changes easier to review, and it would be nice to merge this just after tagging the 3.0.0 release anyway so that there's plenty of time to test it before it gets tagged in a release. Let me know if you'd like me to make those changes.

Thank you for making all of these changes, they're looking excellent. I'll closely review everything again soon.

@syszery
Copy link
Contributor Author

syszery commented Oct 28, 2025

Thank you @arya2 for all the explanations and guidance! I really appreciate your help. I agree that we shouldn’t be lazy now 😉 — both TODOs seem manageable and in range to get closed here.

@syszery
Copy link
Contributor Author

syszery commented Oct 28, 2025

Here are the TODOs we’ve identified.

  • Address state error path of issue Improve VerifyBlockError::Commit typing, so we don't accidentally break syncer error handling #2908

  • Try to remove the Option from hash_or_height — this requires addressing the following TODOs in the StateService implementation, where the None is currently used:

    • queue_and_commit_to_finalized_state: Track the latest sent height and drop any blocks below it when sending new blocks.
    • queue_and_commit_to_non_finalized_state: It seems harder to blame a single block here, so the None case may not be meaningfully removable.
  • Beautify CommitBlockError::Duplicate — this will be straightforward once the Option is removed.

  • Minor cleanup: unify error re-exports in lib.rs.

@syszery
Copy link
Contributor Author

syszery commented Oct 29, 2025

I have reverted the changes to should_restart_sync and implemented the following.

Update:

  • Implemented is_duplicate_request for CommitBlockError.
  • Updated VerifyBlockError::Commit to properly wrap CommitBlockError and introduced a StateService variant.
  • Modified Service::call to propagate CommitBlockError.
  • Added a new unit test verifying that a BlockDownloadVerifyError::Invalid wrapping a CommitBlockError::Duplicate doesn't trigger a restart.
  • Cleaned up should_restart_sync.

Questions for review:

syszery and others added 12 commits November 10, 2025 13:14
implements review feedback

Co-authored-by: Arya <[email protected]>
- Updates `KnownBlock` request handling to check for block hashes in the state's sent hashes, fixing a bug where it's possible for part of a chain to be duplicated in the non-finalized state if a block is sent to the state via a `CommitSemanticallyVerifiedBlock` request while it's in the block write channel or before the non-finalized state channel has been updated
@syszery syszery force-pushed the refactor/state-CommitCheckpointVerifiedBlock-error branch from 15125c5 to 4889e85 Compare November 10, 2025 12:44
@arya2 arya2 added do-not-merge Tells Mergify not to merge this PR and removed do-not-merge Tells Mergify not to merge this PR labels Nov 17, 2025
@mpguerra mpguerra linked an issue Nov 27, 2025 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

cleanup: Return more concrete errors from the state service

3 participants