fix(paper-extraction): never fabricate DOIs for no-DOI bib entries #147

Open
cailmdaley wants to merge 1 commit into main from fix/paper-extraction-doi-resolver

@cailmdaley
Member

Problem

The paper-extraction DOI resolver fuzzy-searches Crossref/ADS by title for any bib entry lacking a doi:/eprint: field, and accepts the top hit if a SequenceMatcher title ratio clears 0.55. Two failure modes let it attach wrong DOIs — a direct traceability violation, since a wrong DOI silently points astra validate --verify-evidence at the wrong paper:

  1. "in preparation" entries (no DOI exists yet) match unrelated journal articles. On the UNIONS B-modes bibliography, Papers III & IV (goh.etal26, guerrini.etal26) both got 10.1111/j.1365-2966.2010.17430.x — an unrelated 2010 MNRAS paper.
  2. ASCL software records match by subtitle collision. treecorr15 (TreeCorr: Two-point correlation functions, archivePrefix=ascl) got 10.2172/6401879 — a 1978 DOE OSTI report subtitled [Two-point correlation functions], which cleared the 0.55 gate (similarity ≈ 0.63).

The ADS fallback path had no similarity gate at all.

These were surfaced by a citation-audit dry-run: workers were handed the wrong paper to verify against and could not recover. This is not LLM hallucination — it's a deterministic programmatic search with too-loose acceptance.

Fix

Principle: traceability over coverage — a flagged miss (doi: null, surfaced for human review) is strictly preferable to a fabricated hit.

  • classify_unresolvable() short-circuits entries with no DOI to find — unpublished (in prep/submitted/to appear/in press/forthcoming in journal/note/howpublished), ASCL software records (archivePrefix=ascl or ascl: eprint), or entries with no publication metadata — to doi: null plus a needs_human tag. These are never fuzzy-resolved.
  • Raise the fuzzy title gate 0.55 → 0.80 AND require a first-author surname match, on both the Crossref and ADS paths (the ADS path was previously ungated).
  • Unresolved entries carry needs_human in the citations: block so a downstream consumer or human can act on them.
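
As a rough illustration of the short-circuit described above, here is a minimal sketch. It is not the PR's actual code: the regex, the BibTeX field names read from the entry dict, the `no_publication_metadata` tag, the `unresolved_record` helper, and the exact shape of the returned record are all assumptions. Only the function name `classify_unresolvable`, the `in_preparation` / `software_ascl` tags, and the `doi: null` plus `needs_human` outcome come from this description.

```python
import re
from typing import Optional

# Assumed marker pattern built from the phrases listed in the PR description.
_UNPUBLISHED = re.compile(
    r"\b(in\s+prep(aration)?|submitted|to\s+appear|in\s+press|forthcoming)\b",
    re.IGNORECASE,
)

# Assumed set of fields that count as "publication metadata".
_PUBLICATION_FIELDS = ("journal", "booktitle", "publisher", "year")


def classify_unresolvable(entry: dict) -> Optional[str]:
    """Return a reason tag when the entry has no DOI to find, else None."""
    # ASCL software records: archivePrefix=ascl or an "ascl:" eprint.
    archive = entry.get("archiveprefix", "").lower()
    eprint = entry.get("eprint", "").lower()
    if archive == "ascl" or eprint.startswith("ascl:"):
        return "software_ascl"

    # Unpublished work flagged in journal / note / howpublished.
    status_text = " ".join(
        entry.get(field, "") for field in ("journal", "note", "howpublished")
    )
    if _UNPUBLISHED.search(status_text):
        return "in_preparation"

    # Nothing to anchor a search on (hypothetical tag name).
    if not any(entry.get(field) for field in _PUBLICATION_FIELDS):
        return "no_publication_metadata"

    return None


def unresolved_record(reason: str) -> dict:
    """Hypothetical record shape: a null DOI plus a needs_human tag."""
    return {"doi": None, "needs_human": reason}
```

A caller would check `classify_unresolvable(entry)` before any Crossref/ADS lookup and emit `unresolved_record(reason)` instead of searching whenever a reason is returned.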

Verification

Against the UNIONS B-modes bibliography (79 entries):

  • The four in-prep companions and the TreeCorr ASCL entry that previously got phantom DOIs are now flagged (in_preparation / software_ascl); none are fuzzy-resolved.
  • All published entries (real doi:/eprint:) still resolve unchanged.
  • The exact phantom collision (TreeCorr… vs the 1978 DOE report) now scores 0.63 < 0.80 — rejected by the gate even independent of the no-DOI guard.
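
For reference, the two-part acceptance test (title ratio of at least 0.80 and a matching first-author surname) might look roughly like the sketch below. The helper names, the title normalisation, and the surname parsing are assumptions for illustration; only the use of `SequenceMatcher`, the 0.80 threshold, and the first-author surname requirement come from this description. Because the normalisation is assumed, ratios computed here need not reproduce the 0.63 quoted above.

```python
import re
from difflib import SequenceMatcher

TITLE_GATE = 0.80  # threshold stated in this PR (previously 0.55)


def _normalise(title: str) -> str:
    # Assumed normalisation: lowercase, drop punctuation, collapse whitespace.
    return " ".join(re.sub(r"[^a-z0-9 ]+", " ", title.lower()).split())


def title_ratio(a: str, b: str) -> float:
    """Similarity of two titles in [0, 1] via difflib.SequenceMatcher."""
    return SequenceMatcher(None, _normalise(a), _normalise(b)).ratio()


def surname(author: str) -> str:
    """Best-effort surname from 'Last, First' or 'First Last' (assumed formats)."""
    author = author.strip()
    if "," in author:
        return author.split(",", 1)[0].strip().lower()
    parts = author.split()
    return parts[-1].lower() if parts else ""


def accept_candidate(
    bib_title: str,
    bib_first_author: str,
    hit_title: str,
    hit_first_author: str,
) -> bool:
    """Accept a search hit only if BOTH the title gate and the surname check pass."""
    if title_ratio(bib_title, hit_title) < TITLE_GATE:
        return False
    expected = surname(bib_first_author)
    # An empty surname can never confirm a match.
    return bool(expected) and expected == surname(hit_first_author)
```

Applying the same `accept_candidate` check to both the Crossref and ADS results is what closes the previously ungated ADS path; a rejected candidate would fall through to `doi: null` with a `needs_human` tag rather than to a looser match.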

Adds tests/test_paper_extraction_doi.py (15 cases: classifier, author matcher, title gate). No new lint debt (E501 count unchanged).

🤖 Generated with Claude Code

The DOI resolver fuzzy-searches Crossref/ADS by title for any bib entry
lacking a doi:/eprint: field. Two failure modes let it attach *wrong*
DOIs — a direct traceability violation, since a wrong DOI silently
points evidence verification at the wrong paper:

  - "in preparation" companion papers (no DOI exists yet) matched
    unrelated journal articles via fuzzy title search.
  - An ASCL software record (TreeCorr, "Two-point correlation functions")
    matched a 1978 DOE OSTI report subtitled "[Two-point correlation
    functions]" through a permissive 0.55 title gate.

Fix (traceability over coverage — a flagged miss beats a fabricated hit):

  - classify_unresolvable(): short-circuit entries that have no DOI to
    find (in-prep/submitted/to-appear/in-press/forthcoming; ASCL software
    records; no publication metadata) to doi=null + a needs_human tag.
    Never fuzzy-resolve them.
  - Raise the fuzzy title-match gate 0.55 -> 0.80 AND require a
    first-author surname match, on BOTH the Crossref and ADS paths (the
    ADS path previously had no gate at all).
  - Surface unresolved entries via needs_human in the citations: block.

Verified against the UNIONS B-modes bib: the four in-prep companions and
the TreeCorr ASCL entry that previously got phantom DOIs are now flagged;
all published entries still resolve. Adds tests/test_paper_extraction_doi.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

❌ Eval Results

| Metric | Value |
| --- | --- |
| Score | 0.00 |
| Build | complete |
| Cost | $0.00 |
| Turns | 0 |
| Duration | 0s |
| lightcone-cli | 0.3.7.dev3+ga5e767ffc (a5e767ff) |
| Results | Download |

Graders

No grader results

Full output
16:24:12 lightcone.eval.build Building lightcone-cli wheel from /home/runner/work/lightcone-cli/lightcone-cli ...
16:24:17 lightcone.eval.build Built lightcone_cli-0.3.7.dev3+ga5e767ffc-py3-none-any.whl (commit a5e767ff)
16:24:20 lightcone.eval.harness Trial build-snae-0 failed: Failed to create sandbox: Invalid credentials
Traceback (most recent call last):
  File "/home/runner/work/lightcone-cli/lightcone-cli/src/lightcone/eval/harness.py", line 112, in run_trial
    sandbox.create()
  File "/home/runner/work/lightcone-cli/lightcone-cli/src/lightcone/eval/sandbox.py", line 149, in create
    self._sandbox = self._daytona.create(
                    ^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/lightcone-cli/lightcone-cli/.venv/lib/python3.12/site-packages/daytona_sdk/_utils/errors.py", line 206, in sync_wrapper
    process_n_raise_exception(e)
  File "/home/runner/work/lightcone-cli/lightcone-cli/.venv/lib/python3.12/site-packages/daytona_sdk/_utils/errors.py", line 139, in process_n_raise_exception
    raise create_daytona_error(
daytona_sdk.common.errors.DaytonaAuthenticationError: Failed to create sandbox: Invalid credentials
  snae trial 0: score=0.00 error: Failed to create sandbox: Invalid credentials

lightcone-cli: 0.3.7.dev3+ga5e767ffc (HEAD a5e767ff)
ASTRA: 0.2.9

  Eval Results: Scores  
┏━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Task ┃     Score     ┃
┡━━━━━━╇━━━━━━━━━━━━━━━┩
│ snae │ 0.00 +/- 0.00 │
│      │  pass@k: 0%   │
│      │   1 errors    │
└──────┴───────────────┘

   Eval Results: Cost &   
         Duration         
┏━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Task ┃ Cost / Duration ┃
┡━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ snae │      $0.00      │
│      │       0s        │
└──────┴─────────────────┘

Total: 1 trials, $0.00, 0s

Results saved to: eval-results/build-a5e767ff/results.json

@cloudflare-workers-and-pages

Deploying with Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

| Status | Name | Latest Commit | Preview URL | Updated (UTC) |
| --- | --- | --- | --- | --- |
| ✅ Deployment successful! (View logs) | lightcone-cli | e943952 | Commit Preview URL, Branch Preview URL | May 30 2026, 04:24 PM |
