Bug: content_flagging benchmark is non-deterministic and internally inconsistent #2

@vivekhaldar

Summary

The released content_flagging benchmark appears to have a real implementation bug, not intentional benchmark noise.

The main issue is that the current tool implementation does not behave like a deterministic mock-tool benchmark backed by precomputed CSV outputs:

  1. determineFinalDecision() is not a pure CSV lookup. It recomputes the final label from the scores passed into it.
  2. Two upstream tool outputs are stochastic:
    • calculate_device_consistency() uses random.random()
    • calculateBotProbabilityIndex() also uses random.random()
  3. The shipped CSV does not contain the intermediate random outputs needed to reproduce the stored ground truth.
  4. calculateContentSeverityIndex() also recomputes a value that does not match the shipped CSV.

As a result, identical inputs can produce different intermediate scores and different final decisions across runs, and the released code does not appear to be aligned with the released ground-truth CSV, the repo documentation, or the paper.

Why this matters

SOP-Bench is presented as a reproducible benchmark with precomputed mock tool outputs. The current content_flagging release violates that assumption and can make benchmark scores depend on runtime randomness rather than only on agent behavior.

That affects:

  • task-level reproducibility
  • comparability across models and agents
  • interpretability of failures
  • validity of the published content_flagging numbers

Expected behavior

Given the paper and repo docs, the expected behavior is:

  • tools should be deterministic for the same inputs
  • tool outputs should match precomputed dataset values
  • final decisions should be reproducible across runs
  • shipped CSV ground truth should be consistent with shipped tool logic
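A minimal sketch of what a CSV-backed mock tool could look like under these expectations. The names `load_rows` and `lookup_tool_output`, the in-memory sample data, and the presence of a `bot_probability_index` column are all hypothetical (the shipped CSV lacks that column); only the idea of returning a stored value keyed by content_id comes from the paper and docs.

```python
import csv
import io

# Hypothetical two-row dataset; the real CSV has different columns and values.
SAMPLE_CSV = """content_id,bot_probability_index,final_decision
C0001,0.12,allowed
C0002,0.87,removed
"""

def load_rows(text):
    """Index rows by content_id so each lookup is a pure dictionary read."""
    return {row["content_id"]: row for row in csv.DictReader(io.StringIO(text))}

def lookup_tool_output(rows, content_id, column):
    """Return the precomputed value for one tool output: no recomputation, no randomness."""
    if content_id not in rows:
        raise KeyError(f"unknown content_id: {content_id}")
    return rows[content_id][column]

rows = load_rows(SAMPLE_CSV)
first = lookup_tool_output(rows, "C0002", "bot_probability_index")
second = lookup_tool_output(rows, "C0002", "bot_probability_index")
```

With a tool of this shape, repeated calls for the same content_id cannot disagree, which is the property the current release lacks.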

Relevant sources:

  • The paper says the dataset contains structured inputs and outputs for each tool/API invocation and that the mocks are intended to remove runtime variability:
  • Repo docs say tools should be deterministic mock implementations:
    • docs/ADDING_BENCHMARKS.md
    • README.md
    • ARCHITECTURE.md

Actual behavior

1. determineFinalDecision() recomputes the label from passed scores

The method validates a row using content_id, NumberofPreviousPosts, and CountofFlaggedPosts, but it does not return the CSV label. Instead it recomputes final_score from:

  • user_trust_score
  • content_severity_index
  • bot_probability_index

and maps that score to one of:

  • user_banned
  • removed
  • warning
  • allowed

Code:

  • src/amazon_sop_bench/benchmarks/data/content_flagging/tools.py:294-379

This means any mismatch in upstream intermediate scores directly changes the final answer.

2. Two upstream tool outputs are random

The following functions are stochastic:

  • calculate_device_consistency():

    • consistency_score = random.random()
    • src/amazon_sop_bench/benchmarks/data/content_flagging/tools.py:30-44
  • calculateBotProbabilityIndex():

    • bpi = random.random()
    • src/amazon_sop_bench/benchmarks/data/content_flagging/tools.py:46-113

There is no repo-level seeding for this path.

3. calculate_user_trust_score() depends on those random values

calculate_user_trust_score() explicitly uses the passed:

  • bot_probability_index
  • device_consistency_score

and does not load them from CSV.

Code:

  • src/amazon_sop_bench/benchmarks/data/content_flagging/tools.py:207-290

So the trust score varies with runtime randomness.
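To illustrate how wide that variation can be, here is a small sketch using the trust-score formula as transcribed in the standalone reproduction script later in this report. The row values (100 previous posts, 0 flagged posts, latitude 60, longitude 0) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def trust_score(npp, cfp, lat, lon, bpi, dcs):
    """Trust-score formula as transcribed in this report's reproduction script."""
    base_score = 50 + min(npp * 0.3, 30) + min(cfp * -0.5, -25)
    grq = abs(math.cos(math.radians(lat)) * math.cos(math.radians(lon)))
    geographic_modifier = 1 - (grq * 0.2)
    behavioral_modifier = ((1 - bpi) + dcs) / 2
    return max(0, min(100, int(base_score * geographic_modifier * behavioral_modifier)))

# Hypothetical row; only bpi and dcs change between the three calls.
row = dict(npp=100, cfp=0, lat=60.0, lon=0.0)

best = trust_score(**row, bpi=0.0, dcs=1.0)    # behavioral modifier = 1.0
middle = trust_score(**row, bpi=0.5, dcs=0.5)  # behavioral modifier = 0.5
worst = trust_score(**row, bpi=1.0, dcs=0.0)   # behavioral modifier = 0.0
```

Because the behavioral modifier spans the full range from 0 to 1, the same row can land anywhere between a trust score of 0 and its maximum, depending only on the two random draws.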

4. The shipped CSV does not contain the random intermediate outputs

The shipped content_flagging CSV includes:

  • user_trust_score
  • content_severity_index
  • final_decision

but does not include:

  • bot_probability_index
  • device_consistency_score

File:

  • src/amazon_sop_bench/benchmarks/data/content_flagging/test_set_with_outputs.csv

This means the released toolchain cannot reconstruct the benchmark’s own ground truth from the dataset alone.
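A quick way to confirm the gap is to compare the CSV header against the intermediate values the tool chain consumes. The sketch below assumes the four column names used in this report; `missing_intermediates` is a hypothetical helper, and the sample header is a stand-in rather than the full shipped schema.

```python
import csv
import io

# Intermediate values the tool chain consumes, per this report.
REQUIRED_INTERMEDIATES = {
    "bot_probability_index",
    "device_consistency_score",
    "user_trust_score",
    "content_severity_index",
}

def missing_intermediates(csv_file):
    """Return the required intermediate columns absent from the CSV header, sorted."""
    header = next(csv.reader(csv_file))
    return sorted(REQUIRED_INTERMEDIATES - set(header))

# Hypothetical header mirroring the columns listed above.
sample = io.StringIO("content_id,user_trust_score,content_severity_index,final_decision\n")
missing = missing_intermediates(sample)
```

Against the real file, the same helper could be called with `open(path, newline="")` in place of the in-memory sample.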

5. Likely logic bug in calculate_user_trust_score() (line 269)

calculate_user_trust_score() contains:

flag_penalty = min(CountofFlaggedPosts * -0.5, -25)

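This should almost certainly be max(...), not min(...). As written, the expression always returns the more negative operand: one flagged post gives min(-0.5, -25) = -25, and even zero flagged posts gives min(0, -25) = -25. Every row with 50 or fewer flagged posts therefore receives the full -25 penalty, and rows above 50 are penalized without bound. This collapses the penalty range for nearly all rows and is further evidence that the tool code was not carefully reviewed before release. It is a separate issue from the randomness/CSV mismatch, but it contributes to the overall picture of release misalignment.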
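The difference between the two expressions can be seen side by side. `penalty_as_written` copies the released expression; `penalty_capped` is the presumed intent, which is an assumption on my part and not confirmed by the authors.

```python
def penalty_as_written(flagged_posts):
    """Released expression: min() keeps the more negative of the two values."""
    return min(flagged_posts * -0.5, -25)

def penalty_capped(flagged_posts):
    """Presumed intent (assumption): 0.5 per flagged post, capped at -25."""
    return max(flagged_posts * -0.5, -25)

# Compare both versions over a few flagged-post counts.
for n in (0, 1, 5, 50, 100):
    print(n, penalty_as_written(n), penalty_capped(n))
```

Under the released expression, counts from 0 through 50 all produce the same penalty, so the flagged-post history carries no signal in that range.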

6. calculateContentSeverityIndex() also disagrees with the CSV

calculateContentSeverityIndex() recomputes CSI from violation weights and confidences rather than returning the stored CSV value:

  • src/amazon_sop_bench/benchmarks/data/content_flagging/tools.py:117-205

Running the released formula against the shipped CSV produced:

  • 168/168 CSI mismatches

So the inconsistency is broader than the two random fields.

Quantitative checks

I reproduced the formulas from tools.py in a standalone script using only the shipped CSV and Python stdlib.

Results:

  • Shipped content_flagging CSV rows: 168
  • Rows whose predicted label flipped across repeated identical runs: 98/168
  • Average match rate between repeated simulated outputs and shipped CSV final_decision: 0.3689
  • Median match rate: 0.225
  • Rows that always matched shipped CSV across 100 seeded repeats: 31
  • Rows that never matched shipped CSV across 100 seeded repeats: 74
  • Rows where shipped content_severity_index disagreed with released CSI formula: 168/168
  • Rows where shipped final_decision was impossible under the released final-decision formula for any valid bot_probability_index in [0, 1], using the CSV’s own user_trust_score and content_severity_index: 125/168

Note: the 125/168 figure is a conservative lower bound. It uses the CSV’s own user_trust_score and content_severity_index as inputs, but in a live run the agent would not obtain those values either: the released CSI formula disagrees with the CSV on 168/168 rows, and the user trust score (UTS) depends on the random bot probability index (BPI) and device consistency score (DCS). The actual fraction of unreachable final decisions during a live run is therefore likely higher.

Concrete examples

Example 1: stored user_banned is impossible under the released formula

Row:

  • content_id = C123456908
  • user_trust_score = 14
  • content_severity_index = 86
  • final_decision = user_banned

CSV row:

  • src/amazon_sop_bench/benchmarks/data/content_flagging/test_set_with_outputs.csv:2

Under the released determineFinalDecision() formula, with any valid bot_probability_index in [0, 1], the final score range is:

  • min score: 43.47
  • max score: 60.67

That can only yield:

  • warning
  • removed

It can never yield user_banned.

Example 2: stored allowed is impossible under the released formula

Row:

  • content_id = C123456972
  • user_trust_score = 87
  • content_severity_index = 33
  • final_decision = allowed

CSV row:

  • src/amazon_sop_bench/benchmarks/data/content_flagging/test_set_with_outputs.csv:3

Under the released formula, the score range is:

  • min score: 68.47
  • max score: 75.07

That can only yield:

  • removed

It can never yield allowed.
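Both ranges above can be re-derived with a short sketch of the final-score formula and thresholds as transcribed in this report. The examples do not list the historical factor, so the values used here are back-derived from the reported minimum scores (43.47 - 14 × 0.35 - 86 × 0.40 = 4.17 = 0.25 × 16.68, and 68.47 - 87 × 0.35 - 33 × 0.40 = 24.82 = 0.25 × 99.28). They are approximations implied by the rounded figures, not values read from the CSV.

```python
def label_for(score):
    """Threshold mapping from the released final-decision logic, as transcribed here."""
    if score > 80:
        return "user_banned"
    if score > 60:
        return "removed"
    if score > 40:
        return "warning"
    return "allowed"

def final_score(uts, csi, bpi, historical_factor):
    """Weighted sum used by the released formula, per this report."""
    return uts * 0.35 + csi * 0.40 * (1 + bpi * 0.5) + historical_factor * 0.25

def reachable_labels(uts, csi, historical_factor):
    """Labels reachable as bpi sweeps 0.00..1.00 in steps of 0.01."""
    return {label_for(final_score(uts, csi, i / 100, historical_factor)) for i in range(101)}

# historical_factor values are back-derived assumptions, not CSV values.
example_1 = reachable_labels(uts=14, csi=86, historical_factor=16.68)
example_2 = reachable_labels(uts=87, csi=33, historical_factor=99.28)
```

Since the score is linear and increasing in bpi, sweeping bpi over its valid range is enough to enumerate every label a row can receive for fixed trust and severity inputs.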

Why this looks unintentional rather than intentional noise

Paper mismatch

The paper says:

  • the dataset includes inputs and outputs for tool/API invocations
  • mock APIs replace live APIs
  • evaluation is intended to be stable and reproducible without runtime variability

Source:

That directly conflicts with runtime random.random() in the released tools.

Repo documentation mismatch

The repo explicitly describes deterministic mock tools:

  • README.md:69
  • ARCHITECTURE.md:113-152
  • docs/ADDING_BENCHMARKS.md:115
  • docs/ADDING_BENCHMARKS.md:576-585

docs/ADDING_BENCHMARKS.md even uses random outputs as the example of what not to do.

The paper’s own tool-generation prompt conflicts with the implementation

The paper’s tool-code-generation appendix says generated tools should:

  • use only existing dataset columns
  • load the CSV before accessing required fields

Source:

But the released content_flagging tool code synthesizes values that are not present in the dataset.

The paper discusses “inconsistent function outputs” as a symptom, not as intended design

The discussion section says content flagging suffered from “inconsistent function outputs,” which is consistent with a broken tool implementation. It does not present stochastic tool behavior as a benchmark feature.

Source:

random.shuffle in other domains is cosmetic

The other domains that import random (video_annotation, video_classification) use it only for random.shuffle(toolspec_json), which changes the presentation order of tool specifications. This does not affect tool outputs or evaluation correctness — it is purely cosmetic. The content_flagging domain is the only one where random affects computed tool return values.

Additional release inconsistencies

1. sop.txt vs sop_v2.txt

The loader uses sop.txt:

  • src/amazon_sop_bench/benchmarks/loader.py:65

But content_flagging/sop_v2.txt is the version that explicitly instructs agents to call the four tools and use the returned BPI/UTS/CSI in sequence.

Files:

  • src/amazon_sop_bench/benchmarks/data/content_flagging/sop.txt
  • src/amazon_sop_bench/benchmarks/data/content_flagging/sop_v2.txt

The repo does not appear to reference sop_v2.txt anywhere.

2. Task count mismatch

The latest paper and repo docs report 226 content-flagging tasks:

  • paper table
  • README.md
  • docs/GETTING_STARTED.md
  • ARCHITECTURE.md

But the shipped CSVs contain:

  • 168 rows in test_set_with_outputs.csv
  • 57 data rows in test_set_without_outputs.csv (58 lines including header)

This suggests the released artifact bundle is not aligned with the paper/docs. Even if the two files are disjoint, they total 225 rows, not 226.

Minimal reproduction

Inspect the stochastic code paths

nl -ba src/amazon_sop_bench/benchmarks/data/content_flagging/tools.py | sed -n '30,113p'
nl -ba src/amazon_sop_bench/benchmarks/data/content_flagging/tools.py | sed -n '207,379p'

Inspect the CSV schema

nl -ba src/amazon_sop_bench/benchmarks/data/content_flagging/test_set_with_outputs.csv | sed -n '1,3p'

Check task-count mismatch

wc -l \
  src/amazon_sop_bench/benchmarks/data/content_flagging/test_set_with_outputs.csv \
  src/amazon_sop_bench/benchmarks/data/content_flagging/test_set_without_outputs.csv \
  src/amazon_sop_bench/benchmarks/data/content_flagging/data.csv

Standalone reproduction script

This script reproduces the released formulas using only stdlib and the shipped CSV. It demonstrates both non-determinism and mismatch with the stored labels.

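import csv, math, random, statistics

# Fixed seed so that this script's own summary numbers are repeatable.
random.seed(0)

path = "src/amazon_sop_bench/benchmarks/data/content_flagging/test_set_with_outputs.csv"
# Open with newline="" as the csv module recommends; the with-block closes the file.
with open(path, newline="") as f:
    rows = list(csv.DictReader(f))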

VIOLATION_WEIGHTS = {
    "hate_speech": 1.0,
    "spam": 0.6,
    "violence": 0.9,
    "adult_content": 0.7,
    "copyright": 0.5,
    "misinformation": 0.8,
    "bot_activity": 0.7,
    "self_harm": 1.0,
    "discrimination": 0.9,
    "harassment": 0.8,
}

def calc_csi(row):
    pvt = row["PrimaryViolationType"]
    svt = row["SecondaryViolationType"]
    pvc = float(row["PrimaryViolation_Confidence"])
    svc = float(row["SecondaryViolation_Confidence"])
    primary_score = VIOLATION_WEIGHTS[pvt] * pvc
    secondary_score = 0
    if svt and svc > 0:
        secondary_score = VIOLATION_WEIGHTS[svt] * svc
        if pvt == svt:
            secondary_score *= 1.2
    return min(round(primary_score * 0.7 + secondary_score * 0.3), 100)

def calc_uts(row, bpi, dcs):
    npp = int(row["NumberofPreviousPosts"])
    cfp = int(row["CountofFlaggedPosts"])
    lat = float(row["Latitude"])
    lon = float(row["Longitude"])
    base_score = 50 + min(npp * 0.3, 30) + min(cfp * -0.5, -25)
    grq = abs(math.cos(math.radians(lat)) * math.cos(math.radians(lon)))
    geographic_modifier = 1 - (grq * 0.2)
    behavioral_modifier = ((1 - bpi) + dcs) / 2
    return max(0, min(100, int(base_score * geographic_modifier * behavioral_modifier)))

def calc_final(row, uts, csi, bpi):
    npp = int(row["NumberofPreviousPosts"])
    cfp = int(row["CountofFlaggedPosts"])
    historical_factor = 100 * (1 - (cfp / max(npp, 1)))
    final_score = (
        (uts * 0.35) +
        (csi * 0.40 * (1 + bpi * 0.5)) +
        (historical_factor * 0.25)
    )
    if final_score > 80:
        return "user_banned"
    elif final_score > 60:
        return "removed"
    elif final_score > 40:
        return "warning"
    return "allowed"

def one_run(row):
    bpi = random.random()
    dcs = round(random.random(), 2)
    if dcs < 0.5:
        bpi = min(1.0, bpi + 0.2)
    bpi = round(bpi, 2)
    csi = calc_csi(row)
    uts = calc_uts(row, bpi, dcs)
    return calc_final(row, uts, csi, bpi)

repeats = 100
match_rates = []
flip_rows = 0

for row in rows:
    decisions = [one_run(row) for _ in range(repeats)]
    match_rates.append(sum(d == row["final_decision"] for d in decisions) / repeats)
    if len(set(decisions)) > 1:
        flip_rows += 1

print("rows", len(rows))
print("rows_with_label_flip", flip_rows)
print("avg_match_rate_to_csv", round(sum(match_rates) / len(match_rates), 4))
print("median_match_rate_to_csv", round(statistics.median(match_rates), 4))
print("always_match_rows", sum(x == 1.0 for x in match_rates))
print("never_match_rows", sum(x == 0.0 for x in match_rates))
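The script above does not print the CSI mismatch count reported earlier. A minimal sketch of how that check could be added follows; it assumes the stored content_severity_index parses as a number and that exact equality is the right comparison, neither of which is confirmed here. `count_csi_mismatches` is a hypothetical helper, and the demo rows are stand-ins used only to exercise it.

```python
def count_csi_mismatches(rows, calc_csi, column="content_severity_index"):
    """Count rows whose stored CSI differs from the value calc_csi recomputes."""
    mismatches = 0
    for row in rows:
        stored = float(row[column])
        if calc_csi(row) != stored:
            mismatches += 1
    return mismatches

# Hypothetical stand-in rows and formula, only to exercise the helper.
demo_rows = [
    {"content_severity_index": "86", "recomputed": 1},
    {"content_severity_index": "33", "recomputed": 33},
]
demo_mismatches = count_csi_mismatches(demo_rows, lambda row: row["recomputed"])
```

Appended to the script, a call such as `print(count_csi_mismatches(rows, calc_csi))` would report the count for the shipped CSV.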

Likely root cause

This looks like a release misalignment between:

  • the generated CSV artifacts
  • the generated tool code
  • the loaded SOP file
  • the benchmark statistics reported in docs/paper

Most likely the benchmark was supposed to use deterministic, CSV-backed intermediate outputs, but the released content_flagging/tools.py either:

  • kept a bad generated implementation, or
  • was regenerated from a different prompt/version than the CSV

Proposed fix

Recommended fix

Make content_flagging deterministic and align all artifacts to one canonical version.

Concretely:

  1. Remove runtime randomness from:
    • calculate_device_consistency()
    • calculateBotProbabilityIndex()
  2. Either:
    • add bot_probability_index and device_consistency_score columns to the dataset and return those values directly, or
    • replace the random logic with a deterministic function of existing inputs and regenerate all downstream ground truth accordingly
  3. Make calculateContentSeverityIndex() consistent with the stored content_severity_index, or regenerate the CSV to match the released formula
  4. Make determineFinalDecision() consistent with the stored final_decision, or regenerate the CSV to match the released formula
  5. Reconcile sop.txt vs sop_v2.txt
  6. Reconcile the shipped row counts with the paper/docs (168 vs 226)
  7. Add a regression test that runs the same content-flagging task multiple times and asserts identical tool outputs and identical final decisions
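Items 2 and 7 could be sketched roughly as follows. `deterministic_score` is a hypothetical replacement for a random tool output (a hash of the inputs mapped into [0, 1]), and `assert_deterministic` is a hypothetical regression check; neither exists in the repo, and a hash-based value would still require regenerating the downstream ground truth.

```python
import hashlib

def deterministic_score(content_id, salt):
    """Hypothetical stand-in for a random tool output: a stable value in [0, 1]
    derived from a hash of the inputs, rounded to two decimals."""
    digest = hashlib.sha256(f"{salt}:{content_id}".encode("utf-8")).digest()
    return round(int.from_bytes(digest[:8], "big") / (2**64 - 1), 2)

def assert_deterministic(tool, args, repeats=20):
    """Call tool(*args) repeatedly and fail if any call returns a different value."""
    first = tool(*args)
    for _ in range(repeats - 1):
        result = tool(*args)
        assert result == first, f"non-deterministic output: {first!r} vs {result!r}"
    return first

# The content_id comes from Example 1; the salt names the output being replaced.
stable = assert_deterministic(deterministic_score, ("C123456908", "bot_probability_index"))
```

The same check applied to the current random.random()-backed tools would be expected to fail, which is what makes it useful as a guard against regressions.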

Minimum acceptable fix

If a full artifact regeneration is not immediately possible:

  1. Mark content_flagging as currently inconsistent/non-reproducible
  2. Exclude it from aggregate benchmark reporting until fixed
  3. Document which artifact version produced the published numbers

Bottom line

The released content_flagging subset is not currently a valid deterministic benchmark in the way the rest of SOP-Bench appears to be. The evidence strongly suggests an implementation/data release bug rather than intentional stochastic evaluation noise.
