WIP - Revive PR 16728, the block producer thread refactor revival#18694
Draft
cjjdespres wants to merge 4 commits intocompatiblefrom
Draft
WIP - Revive PR 16728, the block producer thread refactor revival#18694cjjdespres wants to merge 4 commits intocompatiblefrom
cjjdespres wants to merge 4 commits intocompatiblefrom
Conversation
Restructure the run function from recursive check_next_block_timing + Singleton_supervisor/Singleton_scheduler to Deferred.forever + iteration_wrapped. The cancel-via-Ivar mechanism from Singleton_supervisor is inlined into iteration_wrapped. Interruptible usage is kept for now. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The former block generation loop was structured internally with Interruptible in a way that allowed async block generation tasks to cancel previously-spawned async block generation tasks. After the refactoring of the block creation scheduling, this functionality became unused, so it and the use of Interruptible itself have now been removed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The original code accidentally starts a block producer thread twice when the daemon starts after genesis due to an expression scoping bug; the call that schedules the block production thread to be started at an upcoming genesis should be within the scope of the else branch of the if statement that checks if the current time is before or after genesis.
Member
|
Maybe we should introduce linting/formatting rule to disable use of brackets, and enforce use of I remember reading OCaml's source code and they're using |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
For some background on these changes: the PR #16728 revived an older PR that refactored the daemon's block producer thread, as part of a larger block producer simplification. It introduced a strange bug that was discovered in a mainnet beta release and reported in #17595; I developed a simplified manual test that reproduced the bug, and the PR was reverted in #17616 after observing that the reversion fixed the manual test. The bug was no longer observed by the user that originally reported the bug after the revert.
This PR is a port of #16728 to the latest compatible. The first few commits are exactly the changes in the original PR, and the final commit f17c8b6 fixes what I think was the bug in the original PR. The problem appears to be that commit 599ebf0 1 introduced a small scoping error that went unnoticed in review, which caused the block producer thread to run its
startmethod twice if the daemon started after the genesis timestamp. The two threads would each take from the same VRF queue, so block production events would end up being scheduled two at a time. Thus when the bug was active the "Next block produced.." in the status would always refer to the second block to be produced in this epoch. This also explains the strange duplicated log lines I was seeing in my manual reproduction: there really were two block producer processing tasks active at once. I think this also confirms my original guess that the blocks would actually have been produced at the right times while the bug was still around.I'm leaving this up to have a record of this around. I think this can be updated, re-reviewed, and then merged after the hard fork. I would rather not do it now, just in case it introduces any additional bugs that haven't been caught yet. I'd also like to clean up the commits slightly, and probably also add the test case I've been using somewhere.
To expand on the small scope difference here, the block producer's
runmethod is written so it waits for the chain's genesis time before it starts up the block production loop. The new code is here from f34038c in this PR. The old code is here as of current compatible at 6aee3d6. The old code'selsebranch starts with alet ... in. Theincreates a scope that contains both the following log line and theignore- it contains everything up to the end of the enclosing scope, which is the)on thatignoreline. The entire expression within theelsebranch can thus be written without parentheses. In contrast, the new code inlines theletexpression into the log line. Theelsebranch ends up containing just the log line, and the newuponcall gets run unconditionally. That's whystartends up being called twice. The difference can be seen in the final formatted result; the old code'signorelines up with theletand is indented more than theelse, whereas the new code'suponlines up with theelse.Footnotes
This commit corresponds to f34038c here. The relevant line is here. The same bug exists in the older https://github.com/MinaProtocol/mina/pull/16224, but not the even older 93f6457, if you were interested in the full history, not that that's particularly relevant. It's pretty difficult to see that the bug was introduced just by looking at the diff, in my opinion. ↩