fix: hard guards against experiment fabrication cascade #263
Open
r3n3x wants to merge 1 commit into
Four targeted fixes to prevent the pipeline from proceeding when Stage 12 completes without real experiment data (issue aiming-lab#165):
1. Stage 12 validation (_execution.py):
   - Return FAILED when experiment has zero real metrics and status=failed
   - Return FAILED on stdout failure signals with zero metrics
   - Return FAILED on suspiciously fast completion (<30s) with zero metrics
   - These replace the unconditional DONE return that let fabricated papers through the quality gate
2. Stage 20 quality gate enforcement (_review_publish.py):
   - Block pipeline when VerifiedRegistry has zero values AND experiment failed, regardless of LLM-assigned quality score
   - Prevents the exact failure mode: 3.46s experiment -> fabricated paper -> quality gate passes
3. NONCRITICAL_STAGES correction (stages.py):
   - Remove QUALITY_GATE from noncritical stages
   - A skipped quality gate allows fabricated results to export; it must be treated as a gate, not a suggestion
4. Anomaly detection (integrated into Stage 12 guard):
   - Experiments completing in <30s with zero metrics are classified as misclassified crashes rather than successful runs
Tests: 8 new tests covering all guard paths.
Problem
The pipeline can proceed from Stage 12 (EXPERIMENT_RUN) through Stage 20 (QUALITY_GATE) and export a fabricated paper even when the experiment produced zero real metrics. This is the exact failure mode documented in #165.
In the reported case, Stage 12 completed in 3.46 seconds with no real data, but:
- Stage 12 still returned StageStatus.DONE unconditionally
- QUALITY_GATE was in NONCRITICAL_STAGES (failures are warnings, not blocks)
Changes
1. Stage 12: Experiment output validation (_execution.py)
Added three hard guards that return StageStatus.FAILED instead of the unconditional DONE:
- The experiment has status=failed and zero real metrics
- stdout contains failure signals and there are zero metrics
- The run completed suspiciously fast (<30s) with zero metrics
2. Stage 20: Quality gate enforcement (_review_publish.py)
Added a hard block when VerifiedRegistry has zero experiment values AND the experiment summary reports failure. Even if the LLM assigns a passing quality score, a paper with no grounded data must not be exported.
3. NONCRITICAL_STAGES correction (stages.py)
Removed QUALITY_GATE from NONCRITICAL_STAGES. A skipped quality gate allows fabricated results to be exported; it must be treated as a gate, not a suggestion.
4. Anomaly detection (integrated into Stage 12)
The existing P1 warning for <5s completion is preserved. The new <30s guard with zero metrics catches the broader class of misclassified crashes.
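As a rough illustration of the Stage 12 guards, the sketch below shows how the three zero-metric checks could be ordered. It is not the code in _execution.py: ExperimentResult, validate_experiment, FAILURE_SIGNALS, and the "failed"/"completed" status strings are hypothetical names chosen for this example, and the concrete failure markers are assumptions, since the PR description does not list them. Only the 30-second threshold and the FAILED/DONE outcomes come from the description above.

```python
from dataclasses import dataclass, field
from enum import Enum


class StageStatus(Enum):
    DONE = "done"
    FAILED = "failed"


# Assumption: the real stdout failure signals are not listed in the PR.
FAILURE_SIGNALS = ("Traceback (most recent call last)", "Error:")

# Threshold from the PR description: under 30s with zero metrics is suspicious.
MIN_PLAUSIBLE_SECONDS = 30.0


@dataclass
class ExperimentResult:
    """Hypothetical summary of what Stage 12 knows after a run."""
    status: str                       # e.g. "completed" or "failed"
    elapsed_seconds: float
    stdout: str = ""
    metrics: dict[str, float] = field(default_factory=dict)


def validate_experiment(result: ExperimentResult) -> StageStatus:
    """Apply the three hard guards; all of them require zero metrics."""
    if result.metrics:
        # Real metrics exist, so none of the zero-metric guards apply.
        return StageStatus.DONE
    if result.status == "failed":
        return StageStatus.FAILED     # guard 1: failed status, zero metrics
    if any(signal in result.stdout for signal in FAILURE_SIGNALS):
        return StageStatus.FAILED     # guard 2: failure signal in stdout
    if result.elapsed_seconds < MIN_PLAUSIBLE_SECONDS:
        return StageStatus.FAILED     # guard 3: likely a misclassified crash
    return StageStatus.DONE
```

Under these assumptions, the run reported in #165 (3.46 seconds, no metrics) would be rejected by the third guard even if its status were reported as successful, while a run that produces at least one metric passes regardless of duration.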
Testing
8 new tests in tests/test_fabrication_guards.py:
- TestNoncriticalStages (2): Verify QUALITY_GATE is critical, KNOWLEDGE_ARCHIVE still noncritical
- TestStage12HardGuards (4): Failed + zero metrics, suspicious speed, success with metrics, stdout failure signals
- TestStage20HardGuard (1): Quality gate blocks zero verified values
- TestStage12DurationAnomaly (1): Code structure verification
All 754 existing tests continue to pass (1 pre-existing async test failure unrelated to this PR).
Related Issues
Closes #165 (Cascading Pipeline Failure)
Related to #238 (Hallucinated references)
Breaking Changes
None for correct pipelines. Pipelines that previously "succeeded" with fabricated experiment data will now correctly fail at Stage 12 or Stage 20, which is the intended behavior.
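To make the Stage 20 side of this behavior change concrete, here is a minimal sketch of the hard block and of what removing QUALITY_GATE from NONCRITICAL_STAGES implies. The function names, the pass_threshold value, and the numeric quality score are assumptions for illustration only; the real logic lives in _review_publish.py and stages.py and may differ. The set below contains only KNOWLEDGE_ARCHIVE because that is the one noncritical stage the PR names; the actual set may hold more.

```python
from enum import Enum


class Stage(Enum):
    EXPERIMENT_RUN = "experiment_run"        # Stage 12
    QUALITY_GATE = "quality_gate"            # Stage 20
    KNOWLEDGE_ARCHIVE = "knowledge_archive"


# After this PR, QUALITY_GATE is no longer in this set.
NONCRITICAL_STAGES = frozenset({Stage.KNOWLEDGE_ARCHIVE})


def quality_gate_passes(
    verified_value_count: int,
    experiment_failed: bool,
    llm_score: float,
    pass_threshold: float = 7.0,  # hypothetical threshold, not from the PR
) -> bool:
    """Hard block first, then fall back to the LLM-assigned score."""
    if verified_value_count == 0 and experiment_failed:
        # No grounded data: block regardless of the quality score.
        return False
    return llm_score >= pass_threshold


def pipeline_should_halt(stage: Stage, stage_failed: bool) -> bool:
    """A failure halts the pipeline unless the stage is noncritical."""
    return stage_failed and stage not in NONCRITICAL_STAGES
```

With this shape, a high LLM score cannot rescue a paper that has zero verified values and a failed experiment, and a failed quality gate stops the pipeline instead of being logged as a warning.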