WIP - Revive PR 16728, the block producer thread refactor revival #18694

Draft
cjjdespres wants to merge 4 commits into compatible from cjjdespres/revive-bp-refactor


cjjdespres commented Mar 28, 2026

For some background on these changes: PR #16728 revived an older PR that refactored the daemon's block producer thread, as part of a larger block producer simplification. It introduced a strange bug that was discovered in a mainnet beta release and reported in #17595. I developed a simplified manual test that reproduced the bug, and the PR was reverted in #17616 once the revert was seen to fix the manual test. After the revert, the user who originally reported the bug no longer observed it.

This PR is a port of #16728 to the latest compatible. The first few commits are exactly the changes in the original PR, and the final commit f17c8b6 fixes what I think was the bug in the original PR. The problem appears to be that commit 599ebf0 (see footnote 1) introduced a small scoping error that went unnoticed in review, which caused the block producer thread to run its start method twice if the daemon started after the genesis timestamp. The two threads would each take from the same VRF queue, so block production events would end up being scheduled two at a time. Thus, while the bug was active, the "Next block produced.." in the status would always refer to the second block to be produced in this epoch. This also explains the strange duplicated log lines I was seeing in my manual reproduction: there really were two block producer processing tasks active at once. I think this also confirms my original guess that the blocks would actually have been produced at the right times while the bug was still around.

I'm leaving this up so that there is a record of it. I think this can be updated, re-reviewed, and then merged after the hard fork. I would rather not do it now, in case it introduces additional bugs that haven't been caught yet. I'd also like to clean up the commits slightly, and probably add the test case I've been using somewhere.


To expand on the small scope difference: the block producer's run method waits for the chain's genesis time before it starts the block production loop. The new code is here from f34038c in this PR. The old code is here as of current compatible at 6aee3d6. The old code's else branch starts with a let ... in. The in opens a scope that contains both the following log line and the ignore; it extends to the end of the enclosing expression, which is the ) on that ignore line. The entire expression within the else branch can thus be written without parentheses of its own. In contrast, the new code inlines the let-bound expression into the log line. The else branch then contains only the log line, because the sequencing ; after it ends the if expression, and the new upon call runs unconditionally. That's why start ends up being called twice. The difference is visible in the final formatted result: the old code's ignore lines up with the let and is indented more than the else, whereas the new code's upon lines up with the else.
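The scoping difference can be reproduced in a few lines of plain OCaml. This is only a sketch of the shape of the two versions, not the daemon's code: `start`, `schedule_start_at_genesis`, `log_waiting`, and the integer timestamps are hypothetical stand-ins, and the scheduling helper calls `start` immediately so that the number of launches can be counted without Async.

```ocaml
(* Hypothetical stand-ins, not the daemon's real API: [start] only counts
   how many times the producer loop would have been launched, and
   [schedule_start_at_genesis] launches it immediately instead of waiting. *)
let starts = ref 0
let start () = incr starts
let schedule_start_at_genesis () = start ()
let log_waiting (_seconds : int) = ()

(* Old shape: [let ... in] extends to the closing parenthesis, so the log
   call and the scheduling call both belong to the else branch. *)
let run_old ~now ~genesis =
  starts := 0;
  ( if now >= genesis then start ()
    else
      let wait = genesis - now in
      log_waiting wait;
      schedule_start_at_genesis () );
  !starts

(* New shape: with the binding inlined, the else branch ends at the first
   [;], so the scheduling call runs on every path. *)
let run_new ~now ~genesis =
  starts := 0;
  if now >= genesis then start () else log_waiting (genesis - now);
  schedule_start_at_genesis ();
  !starts

let () =
  (* Daemon started after genesis: the old shape starts once, the new twice. *)
  Printf.printf "old, after genesis:  %d\n" (run_old ~now:10 ~genesis:5);
  Printf.printf "new, after genesis:  %d\n" (run_new ~now:10 ~genesis:5);
  (* Daemon started before genesis: both shapes start once. *)
  Printf.printf "old, before genesis: %d\n" (run_old ~now:0 ~genesis:5);
  Printf.printf "new, before genesis: %d\n" (run_new ~now:0 ~genesis:5)
```

In this model only the "started after genesis" case differs between the two shapes, which matches the condition under which the duplicate start was observed.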

Footnotes

  1. This commit corresponds to f34038c here. The relevant line is here. The same bug exists in the older https://github.com/MinaProtocol/mina/pull/16224, but not in the even older 93f6457, in case the full history is of interest, though it isn't particularly relevant. In my opinion, it's pretty difficult to see that the bug was introduced just by looking at the diff.

cjjdespres and others added 4 commits March 28, 2026 12:39
Restructure the run function from recursive check_next_block_timing +
Singleton_supervisor/Singleton_scheduler to Deferred.forever +
iteration_wrapped. The cancel-via-Ivar mechanism from
Singleton_supervisor is inlined into iteration_wrapped. Interruptible
usage is kept for now.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The former block generation loop was structured internally with
Interruptible in a way that allowed async block generation tasks to
cancel previously-spawned async block generation tasks. After the
refactoring of the block creation scheduling, this functionality
became unused, so it and the use of Interruptible itself have now
been removed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The original code accidentally starts a block producer thread twice when
the daemon starts after genesis due to an expression scoping bug; the
call that schedules the block production thread to be started at an
upcoming genesis should be within the scope of the else branch of the if
statement that checks if the current time is before or after genesis.

glyh commented Mar 31, 2026

Maybe we should introduce a linting/formatting rule that disallows brackets for grouping and enforces begin ... end everywhere in the codebase, to reduce the likelihood of such bugs.

I remember from reading OCaml's source code that it uses begin and end extensively.
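A minimal sketch of what the begin ... end style could look like for a branch of this shape. The helpers `start`, `schedule_start_at_genesis`, and `log_waiting` are hypothetical stand-ins rather than the daemon's functions, and the scheduling helper calls `start` immediately so the launches can be counted.

```ocaml
(* Hypothetical stand-ins, not the daemon's real API. *)
let starts = ref 0
let start () = incr starts
let schedule_start_at_genesis () = start ()
let log_waiting (_seconds : int) = ()

(* The else branch is delimited explicitly, so both statements stay inside
   it regardless of whether a [let ... in] is present. *)
let run ~now ~genesis =
  starts := 0;
  if now >= genesis then start ()
  else begin
    log_waiting (genesis - now);
    schedule_start_at_genesis ()
  end;
  !starts

let () =
  (* One start on either side of genesis. *)
  assert (run ~now:10 ~genesis:5 = 1);
  assert (run ~now:0 ~genesis:5 = 1)
```

Note that begin ... end and parentheses are interchangeable to the OCaml parser, so the benefit of such a rule would be a visibly delimited branch rather than different semantics.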
