Matter-of-Fact: A Benchmark for Verifying the Feasibility of Literature-Supported Claims in Materials Science (EMNLP 2025)
This is the repository for Matter-of-Fact, a benchmark of approximately 8,000 claims extracted or derived from materials science papers on arXiv. The dataset is intended for feasibility assessment tasks, which aim to determine whether a claim is likely to be feasible or infeasible, and which are closely related to claim verification tasks. The repository also includes the baseline models from the paper.
- 1. Paper
- 1.1 Generating your own claims from scientific papers using this method
- 2. Benchmark
- 3. Baseline Models
- 4. Citation
- 5. License
- 6. Contact
The paper Matter-of-Fact: A Benchmark for Verifying the Feasibility of Literature-Supported Claims in Materials Science was accepted to EMNLP 2025 and is currently available on arXiv.
If you would like to use the same methodology to generate claims in your own domain of interest, our claim extraction system is here: ClaimExtractionPrompt.py
The train, validation, and test sets are available in: /benchmark/
| Set | Size | Years Covered | Filename |
|---|---|---|---|
| train | 1376 | 2022 | matteroffact.train.2022.1376.json |
| valid | 2538 | 2023 | matteroffact.validation.2023.2538.json |
| test | 4446 | 2024-2025 | matteroffact.test.20242025.4446.json |
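Loading a split is straightforward. The sketch below assumes each split file is a JSON array of records with the schema shown in the example further down; the `load_split` and `claim_summary` helpers are illustrative, not part of the repository:

```python
import json

def load_split(path):
    """Load one benchmark split, assumed to be a JSON array of claim records."""
    with open(path) as f:
        return json.load(f)

def claim_summary(claim):
    """Return (id, gold label, claim type) for one record."""
    return (claim["claim_id"],
            claim["gold_label"],
            claim["metadata"]["type"])

# Demonstrated on an inline record with the same schema as the example below:
record = json.loads('{"claim_id": "2206.01072v1_9_T", '
                    '"gold_label": true, '
                    '"metadata": {"type": "code/simulation"}}')
print(claim_summary(record))  # ('2206.01072v1_9_T', True, 'code/simulation')
```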
Below is an example of one feasibility problem from the benchmark:
```json
{
  "claim_id": "2206.01072v1_9_T",
  "claim_text": "In Fe chains on Rh(111) surface, nearest neighbor isotropic exchange interactions are ferromagnetic while next nearest neighbor interactions are antiferromagnetic, with the nearest neighbor coupling between edge atoms reaching values of approximately 0.6 mRy.",
  "gold_label": true,
  "metadata": {
    "type": "code/simulation",
    "problem_description": "Isotropic exchange interactions in Fe chains on Rh(111)",
    "supporting_facts_from_paper": [
      "In Table \\ref{tbl:jij-k} we present the nearest neighbor (NN) and next nearest neighbor (NNN) isotropic interactions between the Fe atoms. Apparently, the NN and NNN isotropic interactions are ferromagnetic (FM) and antiferromagnetic (AFM), respectively. (code)",
      "The leading terms in the spin model are the isotropic exchange interactions. In Table \\ref{tbl:jij-k} we present the nearest neighbor (NN) and next nearest neighbor (NNN) isotropic interactions between the Fe atoms. (code)",
      "For the 4-atom chain, the nearest neighbor interaction between sites 1-2 is 0.604 mRy, for the 5-atom chain it is 0.572 mRy, for the 6-atom chain it is 0.686 mRy, and for the 7-atom chain it is 0.605 mRy. (code)"
    ],
    "explanation": "The paper explicitly states that in Fe chains on Rh(111), nearest neighbor isotropic exchange interactions are ferromagnetic while next nearest neighbor interactions are antiferromagnetic. Table values show that the nearest neighbor coupling between edge atoms (sites 1-2) is consistently around 0.6 mRy across chains of different lengths: 0.604 mRy for 4-atom chains, 0.572 mRy for 5-atom chains, 0.686 mRy for 6-atom chains, and 0.605 mRy for 7-atom chains. These quantitative values directly support the claim.",
    "quant_or_qual": "quantitative",
    "paper_id": "2206.01072v1",
    "topics": [
      "magnetic nanoclusters",
      "first principles calculations",
      "spin-spiral states",
      "Dzyaloshinsky-Moriya interaction",
      "magnetic ground states",
      "Fe atomic chains",
      "topological superconductivity",
      "theme_superconductors"
    ],
    "published_date": {
      "year": 2022,
      "month": 6
    },
    "exclude_date": {
      "year": 2022,
      "month": 5
    }
  }
}
```

The field descriptions are below:
- `claim_id`: A unique ID for the claim, which includes the arXiv source paper, an (arbitrary) index for the claim, and whether the claim was generated to be true or false (T/F).
- `claim_text`: The text of the claim. This is (nominally) the only information that a feasibility verification/claim verification model should receive when performing the feasibility/claim verification task.
- `gold_label`: Whether the claim is true/feasible (`true`) or false/infeasible (`false`).
- `metadata.type`: The type of claim, from 4 broad categories: `code/simulation`, `experimental`, `theoretical`, or `integrative`.
- `metadata.problem_description`: A natural language description of the broad scope/domain of the claim.
- `metadata.supporting_facts_from_paper`: A set of supporting facts from the paper. These are automatically generated by the LLM, so they may be summaries rather than exact text spans.
- `metadata.explanation`: An automatically generated explanation for why the gold label is believed to be correct.
- `quant_or_qual`: Whether this claim problem is `quantitative` or `qualitative`.
- `paper_id`: The original source paper (an arXiv ID). The model generated this claim after reading the full text of this article.
- `topics`: A list of automatically labeled topics of the source paper.
- `published_date`: The date that the earliest version of this paper appeared on arXiv.
- `exclude_date`: For the feasibility verification task, exclude any artifacts (e.g. papers, knowledge sources, etc.) whose knowledge comes after this date.
A few notes on scoring:
- The task is framed as a binary classification problem, where a model assigns each claim a binary label (`true`/feasible or `false`/infeasible).
- There are an equal number of `true` and `false` claims, so chance performance is 50%.
- A model's performance is simply its classification accuracy across the test set. (For example, if Model X correctly predicts the labels for `3000` claims and incorrectly predicts the labels for `1446` claims, then its accuracy is `3000 / 4446 = 67.5%`.)
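The scoring above amounts to a one-line accuracy computation. A minimal sketch, assuming predictions are stored as a dict mapping `claim_id` to a boolean (a hypothetical format; any `claim_id`-aligned structure works):

```python
def accuracy(claims, predictions):
    """Fraction of claims whose predicted boolean label matches gold_label.

    claims: list of benchmark records (each with claim_id and gold_label).
    predictions: dict mapping claim_id -> bool (hypothetical format).
    """
    correct = sum(1 for c in claims
                  if predictions[c["claim_id"]] == c["gold_label"])
    return correct / len(claims)

# The worked example from the text: 3000 correct out of 4446 test claims.
claims = [{"claim_id": str(i), "gold_label": True} for i in range(4446)]
preds = {str(i): (i < 3000) for i in range(4446)}
print(f"{accuracy(claims, preds):.1%}")  # 67.5%
```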
Matter-of-Fact frames feasibility detection as a temporally-filtered claim verification task. It is very challenging to generate accurate and realistic estimates of the feasibility of a new claim, since (by definition) the experiment has not been run yet, so the result is unknown. Matter-of-Fact gets around this challenge by extracting genuine claims from recent papers, and then making a negative/infeasible version of each claim. The benefit is that the feasibility data is much more realistic and grounded in actual scientific results. The drawback is added complexity in the model evaluation procedure: when solving the claim verification problem, the model should not have access to information published after the original source paper -- otherwise it could trivially solve the feasibility verification problem by retrieving papers (including the original source paper) that already report the result of the experiment. (That 'temporally unrestricted' baseline is provided in Table 3.)
We've designed the dataset with this temporal filtering in mind, and there are two possible ways to perform this filtering:
- Easy: All the claims in the test set are from papers posted in January 2024 or after. You can simply exclude all knowledge from 2024 onward in your retrieval system.
- Claim-specific: Each claim has a knowledge cutoff date (`exclude_date`), so if you'd like your model to use all knowledge up to just before the source paper was published, that is possible too.
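The claim-specific variant can be sketched as a date comparison against each claim's `exclude_date`: a knowledge source is admissible only if it is not dated after the cutoff. The corpus format below (documents stamped with `year`/`month` dicts) is a hypothetical illustration, not a repository API:

```python
def is_admissible(doc_date, exclude_date):
    """True if a knowledge source dated doc_date (a {'year', 'month'} dict)
    may be used for a claim: nothing dated *after* the claim's exclude_date."""
    return (doc_date["year"], doc_date["month"]) <= (
        exclude_date["year"], exclude_date["month"])

# Hypothetical corpus of (year, month)-stamped documents.
corpus = [{"year": 2021, "month": 12}, {"year": 2022, "month": 5},
          {"year": 2022, "month": 6}, {"year": 2024, "month": 1}]
cutoff = {"year": 2022, "month": 5}  # exclude_date from the example claim
allowed = [d for d in corpus if is_admissible(d, cutoff)]
print(allowed)  # keeps the 2021-12 and 2022-05 documents only
```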


