fix(paper-extraction): never fabricate DOIs for no-DOI bib entries #147
Open
cailmdaley wants to merge 1 commit into
The DOI resolver fuzzy-searches Crossref/ADS by title for any bib entry
lacking a doi:/eprint: field. Two failure modes let it attach *wrong*
DOIs — a direct traceability violation, since a wrong DOI silently
points evidence verification at the wrong paper:
- "in preparation" companion papers (no DOI exists yet) matched
unrelated journal articles via fuzzy title search.
- An ASCL software record (TreeCorr, "Two-point correlation functions")
matched a 1978 DOE OSTI report subtitled "[Two-point correlation
functions]" through a permissive 0.55 title gate.
Fix (traceability over coverage — a flagged miss beats a fabricated hit):
- classify_unresolvable(): short-circuit entries that have no DOI to
find (in-prep/submitted/to-appear/in-press/forthcoming; ASCL software
records; no publication metadata) to doi=null + a needs_human tag.
Never fuzzy-resolve them.
- Raise the fuzzy title-match gate 0.55 -> 0.80 AND require a
first-author surname match, on BOTH the Crossref and ADS paths (the
ADS path previously had no gate at all).
- Surface unresolved entries via needs_human in the citations: block.
Verified against the UNIONS B-modes bib: the four in-prep companions and
the TreeCorr ASCL entry that previously got phantom DOIs are now flagged;
all published entries still resolve. Adds tests/test_paper_extraction_doi.py.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
❌ Eval Results
Graders: no grader results
| Status | Name | Latest Commit | Preview URL | Updated (UTC) |
|---|---|---|---|---|
| ✅ Deployment successful! | lightcone-cli | e943952 | Commit Preview URL, Branch Preview URL | May 30 2026, 04:24 PM |
Problem
The paper-extraction DOI resolver fuzzy-searches Crossref/ADS by title for any bib entry lacking a `doi:`/`eprint:` field, and accepts the top hit if a `SequenceMatcher` title ratio clears 0.55. Two failure modes let it attach wrong DOIs, a direct traceability violation, since a wrong DOI silently points `astra validate --verify-evidence` at the wrong paper:
- "In preparation" companion papers (`goh.etal26`, `guerrini.etal26`) both got `10.1111/j.1365-2966.2010.17430.x`, an unrelated 2010 MNRAS paper.
- `treecorr15` (TreeCorr: Two-point correlation functions, `archivePrefix=ascl`) got `10.2172/6401879`, a 1978 DOE OSTI report subtitled `[Two-point correlation functions]`, which cleared the 0.55 gate (similarity ≈ 0.63).
The ADS fallback path had no similarity gate at all.
These were surfaced by a citation-audit dry-run: workers were handed the wrong paper to verify against and could not recover. This is not LLM hallucination — it's a deterministic programmatic search with too-loose acceptance.
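The title-only acceptance rule described above can be sketched as follows. This is a minimal illustration, not the resolver's actual code: the function names `title_ratio` and `accept_top_hit` and the lowercase/whitespace normalisation are assumptions. Only the use of `difflib.SequenceMatcher` and the 0.55 threshold come from the description.

```python
from difflib import SequenceMatcher


def title_ratio(bib_title: str, hit_title: str) -> float:
    """Similarity in [0, 1] between two titles (hypothetical normalisation)."""
    a = " ".join(bib_title.lower().split())
    b = " ".join(hit_title.lower().split())
    return SequenceMatcher(None, a, b).ratio()


def accept_top_hit(bib_title: str, hit_title: str, gate: float = 0.55) -> bool:
    """Title-only acceptance: nothing here checks authors, year, or venue."""
    return title_ratio(bib_title, hit_title) >= gate
```

Because the ratio rewards any long shared substring, two unrelated records that happen to share a generic subtitle can score well above a low gate, which is the shape of the TreeCorr mismatch.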
Fix
Principle: traceability over coverage. A flagged miss (`doi: null`, surfaced for human review) is strictly preferable to a fabricated hit.
- `classify_unresolvable()` short-circuits entries with no DOI to find to `doi: null` plus a `needs_human` tag: unpublished work (`in prep`/`submitted`/`to appear`/`in press`/`forthcoming` in `journal`/`note`/`howpublished`), ASCL software records (`archivePrefix=ascl` or `ascl:` eprint), and entries with no publication metadata. These are never fuzzy-resolved.
- The fuzzy title-match gate is raised from 0.55 to 0.80 and a first-author surname match is required, on both the Crossref and ADS paths (the ADS path previously had no gate at all).
- Unresolved entries are surfaced via `needs_human` in the `citations:` block so a downstream consumer or human can act on them.
Verification
Against the UNIONS B-modes bibliography (79 entries):
- The four in-prep companions and the TreeCorr ASCL entry that previously got phantom DOIs are now flagged (`in_preparation`/`software_ascl`); none are fuzzy-resolved.
- All published entries (`doi:`/`eprint:`) still resolve unchanged.
- The TreeCorr title comparison (`TreeCorr…` vs the 1978 DOE report) scores 0.63 < 0.80, so the gate rejects it even independent of the no-DOI guard.
Adds `tests/test_paper_extraction_doi.py` (15 cases: classifier, author matcher, title gate). No new lint debt (E501 count unchanged).
🤖 Generated with Claude Code
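The two guards described under Fix can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code in this PR: the bib entry is modelled as a plain dict with lowercase BibTeX field names, and the signature and return type of `classify_unresolvable`, the tag `no_publication_metadata`, the fields used to detect missing publication metadata, and the helpers `surname` and `passes_gate` are all hypothetical. The marker phrases, the ASCL checks, the 0.80 gate, and the tags `in_preparation` and `software_ascl` come from the description.

```python
import re
from difflib import SequenceMatcher
from typing import Optional

# Phrases that mark an entry as not yet published (list taken from the PR text).
UNPUBLISHED = re.compile(
    r"\b(in[ -]prep(aration)?|submitted|to appear|in press|forthcoming)\b",
    re.IGNORECASE,
)
TITLE_GATE = 0.80  # raised from 0.55


def classify_unresolvable(entry: dict) -> Optional[str]:
    """Return a needs_human reason if there is no DOI to find, else None."""
    if entry.get("doi"):
        return None
    eprint = entry.get("eprint", "").lower()
    if entry.get("archiveprefix", "").lower() == "ascl" or eprint.startswith("ascl:"):
        return "software_ascl"
    if eprint:
        return None  # a non-ASCL eprint is an identifier in its own right
    text = " ".join(entry.get(k, "") for k in ("journal", "note", "howpublished"))
    if UNPUBLISHED.search(text):
        return "in_preparation"
    # Assumed definition of "no publication metadata" (hypothetical field list).
    if not any(entry.get(k) for k in ("journal", "booktitle", "publisher")):
        return "no_publication_metadata"  # hypothetical tag name
    return None


def surname(author_field: str) -> str:
    """First author's surname from a BibTeX-style author string (heuristic)."""
    first = re.split(r"\s+and\s+", author_field.strip())[0]
    if "," in first:
        name = first.split(",")[0]  # "Surname, Given" form
    else:
        parts = first.split()
        name = parts[-1] if parts else ""  # "Given Surname" form
    return name.strip("{} ").lower()


def passes_gate(entry: dict, hit_title: str, hit_first_author: str) -> bool:
    """Accept a fuzzy hit only if the title AND first-author surname match."""
    a = " ".join(entry.get("title", "").lower().split())
    b = " ".join(hit_title.lower().split())
    if SequenceMatcher(None, a, b).ratio() < TITLE_GATE:
        return False
    want = surname(entry.get("author", ""))
    return bool(want) and want == surname(hit_first_author)


# Usage: an ASCL record is flagged before any fuzzy search is attempted.
# The eprint value below is a placeholder, not the real TreeCorr identifier.
ascl_entry = {"title": "TreeCorr: Two-point correlation functions",
              "eprint": "ascl:0000.000"}
print(classify_unresolvable(ascl_entry))  # software_ascl
```

Running the classifier first means the similarity gate only ever sees entries that could plausibly have a DOI, so the two guards fail independently: an ASCL or in-prep entry is flagged even if some unrelated record would have scored above 0.80.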