Issue Report: content_flagging benchmark is non-deterministic and internally inconsistent
Summary
The released content_flagging benchmark appears to have a real implementation bug, not intentional benchmark noise.
The main issue is that the current tool implementation does not behave like a deterministic mock-tool benchmark backed by precomputed CSV outputs:
- determineFinalDecision() is not a pure CSV lookup. It recomputes the final label from the scores passed into it.
- Two upstream tool outputs are stochastic:
  - calculate_device_consistency() uses random.random()
  - calculateBotProbabilityIndex() also uses random.random()
- The shipped CSV does not contain the intermediate random outputs needed to reproduce the stored ground truth.
- calculateContentSeverityIndex() also recomputes a value that does not match the shipped CSV.
As a result, identical inputs can produce different intermediate scores and different final decisions across runs, and the released code does not appear to be aligned with the released ground-truth CSV, the repo documentation, or the paper.
Why this matters
SOP-Bench is presented as a reproducible benchmark with precomputed mock tool outputs. The current content_flagging release violates that assumption and can make benchmark scores depend on runtime randomness rather than only on agent behavior.
That affects:
- task-level reproducibility
- comparability across models and agents
- interpretability of failures
- validity of the published content_flagging numbers
Expected behavior
Given the paper and repo docs, the expected behavior is:
- tools should be deterministic for the same inputs
- tool outputs should match precomputed dataset values
- final decisions should be reproducible across runs
- shipped CSV ground truth should be consistent with shipped tool logic
Relevant sources:
- The paper says the dataset contains structured inputs and outputs for each tool/API invocation and that the mocks are intended to remove runtime variability:
- Repo docs say tools should be deterministic mock implementations:
  - docs/ADDING_BENCHMARKS.md
  - README.md
  - ARCHITECTURE.md
Actual behavior
1. determineFinalDecision() recomputes the label from passed scores
The method validates a row using content_id, NumberofPreviousPosts, and CountofFlaggedPosts, but it does not return the CSV label. Instead it recomputes final_score from:
user_trust_score
content_severity_index
bot_probability_index
and maps that score to one of:
user_banned
removed
warning
allowed
Code:
src/amazon_sop_bench/benchmarks/data/content_flagging/tools.py:294-379
This means any mismatch in upstream intermediate scores directly changes the final answer.
2. Two upstream tool outputs are random
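The following functions are stochastic:
- calculate_device_consistency(): consistency_score = random.random()
  src/amazon_sop_bench/benchmarks/data/content_flagging/tools.py:30-44
- calculateBotProbabilityIndex(): bpi = random.random()
  src/amazon_sop_bench/benchmarks/data/content_flagging/tools.py:46-113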
There is no repo-level seeding for this path.
3. calculate_user_trust_score() depends on those random values
calculate_user_trust_score() explicitly uses the passed:
bot_probability_index
device_consistency_score
and does not load them from CSV.
Code:
src/amazon_sop_bench/benchmarks/data/content_flagging/tools.py:207-290
So the trust score varies with runtime randomness.
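As a rough illustration of how far the trust score can move, the sketch below evaluates the behavioral modifier from the reproduction script later in this report, ((1 - bpi) + dcs) / 2, at its two extremes. The base score of 40 is a made-up value, the function names are illustrative, and the geographic modifier is left out for simplicity, so this is not the released implementation.

```python
def behavioral_modifier(bpi, dcs):
    # Same expression as in the reproduction script; both inputs lie in [0, 1].
    return ((1 - bpi) + dcs) / 2


def trust_score(base_score, bpi, dcs):
    # Simplified for illustration: the geographic modifier is omitted.
    return max(0, min(100, int(base_score * behavioral_modifier(bpi, dcs))))


base = 40  # hypothetical base score, not taken from the CSV
lowest = trust_score(base, bpi=1.0, dcs=0.0)
highest = trust_score(base, bpi=0.0, dcs=1.0)
print("trust score range for identical row inputs:", lowest, "to", highest)
```

Under these assumptions the same row can land anywhere between zero and its full base score, depending only on the two random draws.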
4. The shipped CSV does not contain the random intermediate outputs
The shipped content_flagging CSV includes:
user_trust_score
content_severity_index
final_decision
but does not include:
bot_probability_index
device_consistency_score
File:
src/amazon_sop_bench/benchmarks/data/content_flagging/test_set_with_outputs.csv
This means the released toolchain cannot reconstruct the benchmark’s own ground truth from the dataset alone.
5. Likely logic bug in calculate_user_trust_score() (line 269)
calculate_user_trust_score() contains:
flag_penalty = min(CountofFlaggedPosts * -0.5, -25)
This should almost certainly be max(...), not min(...). As written, the expression evaluates to -25 for every count up to 50 (for example, min(-0.5, -25) = -25 for 1 flagged post and min(-2.5, -25) = -25 for 5), so even a single flagged post incurs the full -25 penalty, and counts above 50 push the penalty below -25. This collapses the penalty range for nearly all rows and is further evidence that the tool code was not carefully reviewed before release. It is a separate issue from the randomness/CSV mismatch, but it contributes to the overall picture of release misalignment.
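To make the difference concrete, the sketch below evaluates both forms of the penalty for a few flagged-post counts. The helper names penalty_released and penalty_clamped are illustrative and do not appear in tools.py, and the max(...) variant is only an assumption about the intended behavior.

```python
def penalty_released(count_of_flagged_posts):
    # Expression as quoted from the released code.
    return min(count_of_flagged_posts * -0.5, -25)


def penalty_clamped(count_of_flagged_posts):
    # Assumed intent: penalty grows with the count and is capped at -25.
    return max(count_of_flagged_posts * -0.5, -25)


for n in (0, 1, 5, 50, 80):
    print(n, penalty_released(n), penalty_clamped(n))
```

With min(...), every count up to 50 yields -25 and larger counts go lower still. With max(...), the penalty scales from 0 down to a floor of -25.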
6. calculateContentSeverityIndex() also disagrees with the CSV
calculateContentSeverityIndex() recomputes CSI from violation weights and confidences rather than returning the stored CSV value:
src/amazon_sop_bench/benchmarks/data/content_flagging/tools.py:117-205
Running the released formula against the shipped CSV produced:
- 168/168 CSI mismatches
So the inconsistency is broader than the two random fields.
Quantitative checks
I reproduced the formulas from tools.py in a standalone script using only the shipped CSV and Python stdlib.
Results:
- Shipped content_flagging CSV rows: 168
- Rows whose predicted label flipped across repeated identical runs: 98/168
- Average match rate between repeated simulated outputs and shipped CSV final_decision: 0.3689
- Median match rate: 0.225
- Rows that always matched shipped CSV across 100 seeded repeats: 31
- Rows that never matched shipped CSV across 100 seeded repeats: 74
- Rows where shipped content_severity_index disagreed with released CSI formula: 168/168
- Rows where shipped final_decision was impossible under the released final-decision formula for any valid bot_probability_index in [0, 1], using the CSV’s own user_trust_score and content_severity_index: 125/168
Note: the 125/168 figure is a conservative lower bound. It feeds the CSV’s own user_trust_score and content_severity_index into the formula, but an agent in a live run would not obtain those values either: the content severity index (CSI) mismatches in 168/168 rows, and the user trust score (UTS) depends on the random bot probability index (BPI) and device consistency score (DCS). The actual fraction of unreachable final decisions during a live run is therefore higher.
Concrete examples
Example 1: stored user_banned is impossible under the released formula
Row:
content_id = C123456908
user_trust_score = 14
content_severity_index = 86
final_decision = user_banned
CSV row:
src/amazon_sop_bench/benchmarks/data/content_flagging/test_set_with_outputs.csv:2
Under the released determineFinalDecision() formula, with any valid bot_probability_index in [0, 1], the final score range is:
- min score: 43.47
- max score: 60.67
That can only yield:
- warning
- removed
It can never yield user_banned.
Example 2: stored allowed is impossible under the released formula
Row:
content_id = C123456972
user_trust_score = 87
content_severity_index = 33
final_decision = allowed
CSV row:
src/amazon_sop_bench/benchmarks/data/content_flagging/test_set_with_outputs.csv:3
Under the released formula, the score range is:
- min score: 68.47
- max score: 75.07
That can only yield:
- removed
It can never yield allowed.
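The reachability argument in both examples can be sketched as follows. The score formula and thresholds are copied from the reproduction script later in this report; the function names are illustrative. The historical_factor of 16.68 is back-calculated from the reported minimum score for Example 1, not read from the CSV, so treat it as an assumption.

```python
LABELS = ["allowed", "warning", "removed", "user_banned"]


def label_for(score):
    # Thresholds as in the released final-decision logic.
    if score > 80:
        return "user_banned"
    if score > 60:
        return "removed"
    if score > 40:
        return "warning"
    return "allowed"


def final_score(uts, csi, bpi, historical_factor):
    return uts * 0.35 + csi * 0.40 * (1 + bpi * 0.5) + historical_factor * 0.25


def reachable_labels(uts, csi, historical_factor):
    # The score is non-decreasing in bpi when csi >= 0, so the endpoints
    # bpi = 0 and bpi = 1 bound every score the formula can produce.
    lo = final_score(uts, csi, 0.0, historical_factor)
    hi = final_score(uts, csi, 1.0, historical_factor)
    first = LABELS.index(label_for(lo))
    last = LABELS.index(label_for(hi))
    return lo, hi, LABELS[first:last + 1]


# Example 1 inputs; historical_factor is inferred, not taken from the CSV.
lo, hi, labels = reachable_labels(uts=14, csi=86, historical_factor=16.68)
print(round(lo, 2), round(hi, 2), labels)
```

A stored label that is missing from the returned list cannot be produced by the formula for any bot_probability_index in [0, 1].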
Why this looks unintentional rather than intentional noise
Paper mismatch
The paper says:
- the dataset includes inputs and outputs for tool/API invocations
- mock APIs replace live APIs
- evaluation is intended to be stable and reproducible without runtime variability
Source:
That directly conflicts with runtime random.random() in the released tools.
Repo documentation mismatch
The repo explicitly describes deterministic mock tools:
README.md:69
ARCHITECTURE.md:113-152
docs/ADDING_BENCHMARKS.md:115
docs/ADDING_BENCHMARKS.md:576-585
docs/ADDING_BENCHMARKS.md even uses random outputs as the example of what not to do.
The paper’s own tool-generation prompt conflicts with the implementation
The paper’s tool-code-generation appendix says generated tools should:
- use only existing dataset columns
- load the CSV before accessing required fields
Source:
But the released content_flagging tool code synthesizes values that are not present in the dataset.
The paper discusses “inconsistent function outputs” as a symptom, not as intended design
The discussion section says content flagging suffered from “inconsistent function outputs,” which is consistent with a broken tool implementation. It does not present stochastic tool behavior as a benchmark feature.
Source:
random.shuffle in other domains is cosmetic
The other domains that import random (video_annotation, video_classification) use it only for random.shuffle(toolspec_json), which changes the presentation order of tool specifications. This does not affect tool outputs or evaluation correctness — it is purely cosmetic. The content_flagging domain is the only one where random affects computed tool return values.
Additional release inconsistencies
1. sop.txt vs sop_v2.txt
The loader uses sop.txt:
src/amazon_sop_bench/benchmarks/loader.py:65
But content_flagging/sop_v2.txt is the version that explicitly instructs agents to call the four tools and use the returned BPI/UTS/CSI in sequence.
Files:
src/amazon_sop_bench/benchmarks/data/content_flagging/sop.txt
src/amazon_sop_bench/benchmarks/data/content_flagging/sop_v2.txt
The repo does not appear to reference sop_v2.txt anywhere.
2. Task count mismatch
The latest paper and repo docs report 226 content-flagging tasks:
- paper table
- README.md
- docs/GETTING_STARTED.md
- ARCHITECTURE.md
But the shipped CSVs contain:
168 rows in test_set_with_outputs.csv
57 data rows in test_set_without_outputs.csv (58 lines including header)
This suggests the released artifact bundle is not aligned with the paper/docs.
Minimal reproduction
Inspect the stochastic code paths
nl -ba src/amazon_sop_bench/benchmarks/data/content_flagging/tools.py | sed -n '30,113p'
nl -ba src/amazon_sop_bench/benchmarks/data/content_flagging/tools.py | sed -n '207,379p'
Inspect the CSV schema
nl -ba src/amazon_sop_bench/benchmarks/data/content_flagging/test_set_with_outputs.csv | sed -n '1,3p'
Check task-count mismatch
wc -l \
  src/amazon_sop_bench/benchmarks/data/content_flagging/test_set_with_outputs.csv \
  src/amazon_sop_bench/benchmarks/data/content_flagging/test_set_without_outputs.csv \
  src/amazon_sop_bench/benchmarks/data/content_flagging/data.csv
Standalone reproduction script
This script reproduces the released formulas using only stdlib and the shipped CSV. It demonstrates both non-determinism and mismatch with the stored labels.
import csv, math, random, statistics

random.seed(0)
path = "src/amazon_sop_bench/benchmarks/data/content_flagging/test_set_with_outputs.csv"
with open(path, newline="") as f:
    rows = list(csv.DictReader(f))

VIOLATION_WEIGHTS = {
    "hate_speech": 1.0,
    "spam": 0.6,
    "violence": 0.9,
    "adult_content": 0.7,
    "copyright": 0.5,
    "misinformation": 0.8,
    "bot_activity": 0.7,
    "self_harm": 1.0,
    "discrimination": 0.9,
    "harassment": 0.8,
}

def calc_csi(row):
    # Content severity index recomputed from violation weights and confidences.
    pvt = row["PrimaryViolationType"]
    svt = row["SecondaryViolationType"]
    pvc = float(row["PrimaryViolation_Confidence"])
    svc = float(row["SecondaryViolation_Confidence"])
    primary_score = VIOLATION_WEIGHTS[pvt] * pvc
    secondary_score = 0
    if svt and svc > 0:
        secondary_score = VIOLATION_WEIGHTS[svt] * svc
        if pvt == svt:
            secondary_score *= 1.2
    return min(round(primary_score * 0.7 + secondary_score * 0.3), 100)

def calc_uts(row, bpi, dcs):
    # User trust score; keeps the released min(...) flag penalty as-is.
    npp = int(row["NumberofPreviousPosts"])
    cfp = int(row["CountofFlaggedPosts"])
    lat = float(row["Latitude"])
    lon = float(row["Longitude"])
    base_score = 50 + min(npp * 0.3, 30) + min(cfp * -0.5, -25)
    grq = abs(math.cos(math.radians(lat)) * math.cos(math.radians(lon)))
    geographic_modifier = 1 - (grq * 0.2)
    behavioral_modifier = ((1 - bpi) + dcs) / 2
    return max(0, min(100, int(base_score * geographic_modifier * behavioral_modifier)))

def calc_final(row, uts, csi, bpi):
    # Final decision recomputed from the intermediate scores.
    npp = int(row["NumberofPreviousPosts"])
    cfp = int(row["CountofFlaggedPosts"])
    historical_factor = 100 * (1 - (cfp / max(npp, 1)))
    final_score = (
        (uts * 0.35) +
        (csi * 0.40 * (1 + bpi * 0.5)) +
        (historical_factor * 0.25)
    )
    if final_score > 80:
        return "user_banned"
    elif final_score > 60:
        return "removed"
    elif final_score > 40:
        return "warning"
    return "allowed"

def one_run(row):
    # One simulated pass through the tool chain with fresh random draws.
    bpi = random.random()
    dcs = round(random.random(), 2)
    if dcs < 0.5:
        bpi = min(1.0, bpi + 0.2)
    bpi = round(bpi, 2)
    csi = calc_csi(row)
    uts = calc_uts(row, bpi, dcs)
    return calc_final(row, uts, csi, bpi)

repeats = 100
match_rates = []
flip_rows = 0
for row in rows:
    decisions = [one_run(row) for _ in range(repeats)]
    match_rates.append(sum(d == row["final_decision"] for d in decisions) / repeats)
    if len(set(decisions)) > 1:
        flip_rows += 1

print("rows", len(rows))
print("rows_with_label_flip", flip_rows)
print("avg_match_rate_to_csv", round(sum(match_rates) / len(match_rates), 4))
print("median_match_rate_to_csv", round(statistics.median(match_rates), 4))
print("always_match_rows", sum(x == 1.0 for x in match_rates))
print("never_match_rows", sum(x == 0.0 for x in match_rates))
Likely root cause
This looks like a release misalignment between:
- the generated CSV artifacts
- the generated tool code
- the loaded SOP file
- the benchmark statistics reported in docs/paper
Most likely the benchmark was supposed to use deterministic, CSV-backed intermediate outputs, but the released content_flagging/tools.py either:
- kept a bad generated implementation, or
- was regenerated from a different prompt/version than the CSV
Proposed fix
Recommended fix
Make content_flagging deterministic and align all artifacts to one canonical version.
Concretely:
- Remove runtime randomness from:
  - calculate_device_consistency()
  - calculateBotProbabilityIndex()
- Either:
  - add bot_probability_index and device_consistency_score columns to the dataset and return those values directly, or
  - replace the random logic with a deterministic function of existing inputs and regenerate all downstream ground truth accordingly
- Make calculateContentSeverityIndex() consistent with the stored content_severity_index, or regenerate the CSV to match the released formula
- Make determineFinalDecision() consistent with the stored final_decision, or regenerate the CSV to match the released formula
- Reconcile sop.txt vs sop_v2.txt
- Reconcile the shipped row counts with the paper/docs (168 vs 226)
- Add a regression test that runs the same content-flagging task multiple times and asserts identical tool outputs and identical final decisions
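A minimal sketch of the first option combined with the regression check might look like the following. It assumes the two extra columns have been added to the dataset; the inline CSV, its content IDs and values, and all function names here are hypothetical and are not part of the released repo.

```python
import csv
import io

# Hypothetical CSV with the two proposed columns; the values are made up.
SAMPLE_CSV = """content_id,bot_probability_index,device_consistency_score
C000000001,0.12,0.91
C000000002,0.77,0.34
"""


def load_intermediate_scores(csv_text):
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["content_id"]: row for row in reader}


SCORES = load_intermediate_scores(SAMPLE_CSV)


def lookup_bot_probability_index(content_id):
    # Pure lookup: the same content_id always returns the stored value.
    return float(SCORES[content_id]["bot_probability_index"])


def lookup_device_consistency(content_id):
    return float(SCORES[content_id]["device_consistency_score"])


def assert_deterministic(tool, content_id, repeats=20):
    # Regression check: repeated calls must collapse to one distinct output.
    outputs = {tool(content_id) for _ in range(repeats)}
    assert len(outputs) == 1, f"{tool.__name__} returned {len(outputs)} distinct values"
    return outputs.pop()


for cid in SCORES:
    assert_deterministic(lookup_bot_probability_index, cid)
    assert_deterministic(lookup_device_consistency, cid)
print("all lookups deterministic")
```

The same assert_deterministic helper would fail on a tool that calls random.random() without a fixed seed, which is the behavior the regression test is meant to catch.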
Minimum acceptable fix
If a full artifact regeneration is not immediately possible:
- Mark content_flagging as currently inconsistent/non-reproducible
- Exclude it from aggregate benchmark reporting until fixed
- Document which artifact version produced the published numbers
Bottom line
The released content_flagging subset is currently not a valid deterministic benchmark in the same sense as the rest of SOP-Bench appears to be. The evidence strongly suggests an implementation/data release bug rather than intentional stochastic evaluation noise.