From d3f8003eec2909cedb12e8f0a0f9939ce1a556c5 Mon Sep 17 00:00:00 2001
From: Lukas Wallrich <lukas.wallrich@gmail.com>
Date: Sun, 28 Jun 2026 17:06:16 +0100
Subject: [PATCH 1/3] Discuss irreducible disagreement about replication
 success (#4)

Extend the Hidden Moderator Account in the Discussion chapter with
Sarisoy's (2025) argument: because experimental control is limited and
possible moderators are vast, experts often cannot tell a genuine
replication failure from a violated ceteris paribus assumption (a form
of the experimenter's regress). Frame such disagreement as legitimate
normative judgement rather than poor practice, and note that
transparency about a replication's intended epistemic function
(reliability / validity / generalisation) makes it more tractable.

Add verified references.bib entry for Sarisoy (2025).
---
 discussion.qmd | 20 ++++++++++++++++++++
 references.bib | 11 +++++++++++
 2 files changed, 31 insertions(+)

diff --git a/discussion.qmd b/discussion.qmd
index 619070d..0650003 100644
--- a/discussion.qmd
+++ b/discussion.qmd
@@ -172,6 +172,26 @@ Whether that generalises to the setting of the original study needs to
 be considered in light of theory, and might be a legitimate matter of
 contention.
 
+These difficulties point to a partly irreducible source of disagreement
+about replication success. Because experimental control is limited and
+the space of possible moderators is vast, experts often cannot decide
+unambiguously whether divergent results reflect a genuine failure to
+replicate or a violation of the *ceteris paribus* assumption between the
+original study and the replication — a version of Collins' experimenter's
+regress that Sarisoy [-@Sarisoy2025] examines in detail. On this account,
+sustained disagreement about whether a replication succeeded can reflect
+legitimate normative judgements that researchers make when the evidence
+underdetermines the conclusion, rather than poor research practice.
+Sarisoy argues that such disagreements become more tractable once
+researchers are transparent about a replication's intended *epistemic
+function* — whether it is designed to test the reliability (stability) of
+an effect, to probe a specific validity threat, or to assess
+generalisation to a new context — because each function carries different
+standards for what would count as success. Declaring this purpose, in
+addition to pre-specifying which effects are of primary interest (see
+@sec-success-criteria), helps to recast debates as disagreements about
+what a replication was meant to show.
+
 ## The Role of Differences for the Interpretation of Findings {#sec-differences-and-interpretation}
 
 Each replication outcome should be evaluated in the light of its
diff --git a/references.bib b/references.bib
index cecd4e5..4265670 100644
--- a/references.bib
+++ b/references.bib
@@ -1359,6 +1359,17 @@ @article{RosenbergFinn2022
   doi = {10.1038/s41593-022-01110-9}
 }
 
+@article{Sarisoy2025,
+  author = {Sarisoy, J.},
+  title = {Why we disagree about the success of replications},
+  journal = {Journal for General Philosophy of Science},
+  volume = {56},
+  number = {3},
+  pages = {307--324},
+  year = {2025},
+  doi = {10.1007/s10838-024-09709-1}
+}
+
 @article{SchauerHedges2021,
   author = {Schauer, J. M. and Hedges, L. V.},
   title = {Reconsidering statistical methods for assessing replication},

From 52be9e16363585a0c132171b55497bab001134ea Mon Sep 17 00:00:00 2001
From: Lukas Wallrich <lukas.wallrich@gmail.com>
Date: Mon, 29 Jun 2026 17:43:23 +0200
Subject: [PATCH 2/3] Address codex review: soften disagreement claim (#4)

---
 discussion.qmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/discussion.qmd b/discussion.qmd
index 0650003..b3f74cc 100644
--- a/discussion.qmd
+++ b/discussion.qmd
@@ -181,7 +181,7 @@ original study and the replication — a version of Collins' experimenter's
 regress that Sarisoy [-@Sarisoy2025] examines in detail. On this account,
 sustained disagreement about whether a replication succeeded can reflect
 legitimate normative judgements that researchers make when the evidence
-underdetermines the conclusion, rather than poor research practice.
+underdetermines the conclusion, rather than necessarily indicating poor research practice.
 Sarisoy argues that such disagreements become more tractable once
 researchers are transparent about a replication's intended *epistemic
 function* — whether it is designed to test the reliability (stability) of

From 10b7bfa852982508defa75437efd708c9217425d Mon Sep 17 00:00:00 2001
From: Lukas Wallrich <lukas.wallrich@gmail.com>
Date: Mon, 29 Jun 2026 17:51:58 +0200
Subject: [PATCH 3/3] Sharpen 'failure to replicate'; commas for dashes (#4)

---
 discussion.qmd | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/discussion.qmd b/discussion.qmd
index b3f74cc..84268fa 100644
--- a/discussion.qmd
+++ b/discussion.qmd
@@ -176,17 +176,18 @@ These difficulties point to a partly irreducible source of disagreement
 about replication success. Because experimental control is limited and
 the space of possible moderators is vast, experts often cannot decide
 unambiguously whether divergent results reflect a genuine failure to
-replicate or a violation of the *ceteris paribus* assumption between the
-original study and the replication — a version of Collins' experimenter's
+replicate the original effect or a violation of the *ceteris paribus*
+assumption between the original study and the replication, a version of
+Collins' experimenter's
 regress that Sarisoy [-@Sarisoy2025] examines in detail. On this account,
 sustained disagreement about whether a replication succeeded can reflect
 legitimate normative judgements that researchers make when the evidence
 underdetermines the conclusion, rather than necessarily indicating poor research practice.
 Sarisoy argues that such disagreements become more tractable once
 researchers are transparent about a replication's intended *epistemic
-function* — whether it is designed to test the reliability (stability) of
+function*, whether it is designed to test the reliability (stability) of
 an effect, to probe a specific validity threat, or to assess
-generalisation to a new context — because each function carries different
+generalisation to a new context, because each function carries different
 standards for what would count as success. Declaring this purpose, in
 addition to pre-specifying which effects are of primary interest (see
 @sec-success-criteria), helps to recast debates as disagreements about