Bug: content_flagging benchmark is non-deterministic and internally inconsistent

# Issue Report: `content_flagging` benchmark is non-deterministic and internally inconsistent

## Summary

The released `content_flagging` benchmark appears to have a real implementation bug, not intentional benchmark noise.

The main issue is that the current tool implementation does **not** behave like a deterministic mock-tool benchmark backed by precomputed CSV outputs:

1. `determineFinalDecision()` is not a pure CSV lookup. It recomputes the final label from the scores passed into it.
2. Two upstream tool outputs are stochastic:
   - `calculate_device_consistency()` uses `random.random()`
   - `calculateBotProbabilityIndex()` also uses `random.random()`
3. The shipped CSV does **not** contain the intermediate random outputs needed to reproduce the stored ground truth.
4. `calculateContentSeverityIndex()` also recomputes a value that does not match the shipped CSV.

As a result, identical inputs can produce different intermediate scores and different final decisions across runs, and the released code does not appear to be aligned with the released ground-truth CSV, the repo documentation, or the paper.

## Why this matters

SOP-Bench is presented as a reproducible benchmark with precomputed mock tool outputs. The current `content_flagging` release violates that assumption and can make benchmark scores depend on runtime randomness rather than only on agent behavior.

That affects:

- task-level reproducibility
- comparability across models and agents
- interpretability of failures
- validity of the published `content_flagging` numbers

## Expected behavior

Given the paper and repo docs, the expected behavior is:

- tools should be deterministic for the same inputs
- tool outputs should match precomputed dataset values
- final decisions should be reproducible across runs
- shipped CSV ground truth should be consistent with shipped tool logic

Relevant sources:

- The paper says the dataset contains structured inputs and outputs for each tool/API invocation and that the mocks are intended to remove runtime variability:
  - https://ar5iv.labs.arxiv.org/html/2506.08119
- Repo docs say tools should be deterministic mock implementations:
  - `docs/ADDING_BENCHMARKS.md`
  - `README.md`
  - `ARCHITECTURE.md`
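
For illustration, a deterministic CSV-backed mock tool in the sense described above might look like the sketch below. This is not the repo's implementation: the helper names `load_outputs` and `lookup_final_decision` are hypothetical, and the two sample rows contain only the columns and values quoted elsewhere in this report.

```python
import csv
import io

# Two-row sample using values quoted in this report; the real CSV has more columns.
SAMPLE_CSV = """content_id,user_trust_score,content_severity_index,final_decision
C123456908,14,86,user_banned
C123456972,87,33,allowed
"""

def load_outputs(handle):
    # Index the precomputed outputs by content_id (hypothetical helper).
    return {row["content_id"]: row for row in csv.DictReader(handle)}

def lookup_final_decision(table, content_id):
    # A pure lookup: the same content_id always returns the same stored label.
    return table[content_id]["final_decision"]

table = load_outputs(io.StringIO(SAMPLE_CSV))
print(lookup_final_decision(table, "C123456908"))  # → user_banned
```

A tool written this way cannot disagree with the stored ground truth, because it never recomputes anything.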

## Actual behavior

### 1. `determineFinalDecision()` recomputes the label from passed scores

The method validates a row using `content_id`, `NumberofPreviousPosts`, and `CountofFlaggedPosts`, but it does **not** return the CSV label. Instead it recomputes `final_score` from:

- `user_trust_score`
- `content_severity_index`
- `bot_probability_index`

and maps that score to one of:

- `user_banned`
- `removed`
- `warning`
- `allowed`

Code:

- `src/amazon_sop_bench/benchmarks/data/content_flagging/tools.py:294-379`

This means any mismatch in upstream intermediate scores directly changes the final answer.

### 2. Two upstream tool outputs are random

The following functions are stochastic:

- `calculate_device_consistency()`:
  - `consistency_score = random.random()`
  - `src/amazon_sop_bench/benchmarks/data/content_flagging/tools.py:30-44`

- `calculateBotProbabilityIndex()`:
  - `bpi = random.random()`
  - `src/amazon_sop_bench/benchmarks/data/content_flagging/tools.py:46-113`

There is no repo-level seeding for this path.

### 3. `calculate_user_trust_score()` depends on those random values

`calculate_user_trust_score()` explicitly uses the passed:

- `bot_probability_index`
- `device_consistency_score`

and does not load them from CSV.

Code:

- `src/amazon_sop_bench/benchmarks/data/content_flagging/tools.py:207-290`

So the trust score varies with runtime randomness.

### 4. The shipped CSV does not contain the random intermediate outputs

The shipped `content_flagging` CSV includes:

- `user_trust_score`
- `content_severity_index`
- `final_decision`

but does **not** include:

- `bot_probability_index`
- `device_consistency_score`

File:

- `src/amazon_sop_bench/benchmarks/data/content_flagging/test_set_with_outputs.csv`

This means the released toolchain cannot reconstruct the benchmark’s own ground truth from the dataset alone.

### 5. Likely logic bug in `calculate_user_trust_score()` (line 269)

`calculate_user_trust_score()` contains:

```python
flag_penalty = min(CountofFlaggedPosts * -0.5, -25)
```

This should almost certainly be `max(...)`, not `min(...)`. As written, one flagged post gives `min(-0.5, -25) = -25`, so every row with 50 or fewer flagged posts (including rows with none) receives the full -25 penalty, and the penalty grows without bound beyond 50 flagged posts. This collapses the penalty range for nearly all rows and suggests that the tool code was not carefully reviewed before release. It is a separate issue from the randomness/CSV mismatch, but it contributes to the overall picture of release misalignment.
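
The effect of `min` versus `max` in this clamp can be checked directly. The sketch below compares both variants for a few flagged-post counts. The function names are hypothetical, and the `max(...)` variant is only the presumed intent, not something the source confirms.

```python
def flag_penalty_released(count_of_flagged_posts):
    # Expression as quoted from tools.py: min() keeps the more negative value.
    return min(count_of_flagged_posts * -0.5, -25)

def flag_penalty_presumed(count_of_flagged_posts):
    # Presumed intent (an assumption): the penalty is capped at -25.
    return max(count_of_flagged_posts * -0.5, -25)

for count in (0, 1, 10, 50, 60):
    print(count, flag_penalty_released(count), flag_penalty_presumed(count))
```

With the released expression, counts from 0 through 50 all print -25, while the presumed variant scales from 0 down to -25.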

### 6. `calculateContentSeverityIndex()` also disagrees with the CSV

`calculateContentSeverityIndex()` recomputes CSI from violation weights and confidences rather than returning the stored CSV value:

- `src/amazon_sop_bench/benchmarks/data/content_flagging/tools.py:117-205`

Running the released formula against the shipped CSV produced:

- `168/168` CSI mismatches

So the inconsistency is broader than the two random fields.

## Quantitative checks

I reproduced the formulas from `tools.py` in a standalone script using only the shipped CSV and Python stdlib.

Results:

- Shipped `content_flagging` CSV rows: `168`
- Rows whose predicted label flipped across repeated identical runs: `98/168`
- Average match rate between repeated simulated outputs and shipped CSV `final_decision`: `0.3689`
- Median match rate: `0.225`
- Rows that always matched shipped CSV across 100 seeded repeats: `31`
- Rows that never matched shipped CSV across 100 seeded repeats: `74`
- Rows where shipped `content_severity_index` disagreed with released CSI formula: `168/168`
- Rows where shipped `final_decision` was impossible under the released final-decision formula for any valid `bot_probability_index` in `[0, 1]`, using the CSV’s own `user_trust_score` and `content_severity_index`: `125/168`

Note: the `125/168` figure is likely a lower bound. It takes the CSV’s own `user_trust_score` and `content_severity_index` as inputs, but an agent would not obtain those values in practice: the recomputed CSI mismatches the CSV on 168/168 rows, and UTS depends on the random BPI and DCS. The fraction of unreachable final decisions in a live run is therefore likely higher.

## Concrete examples

### Example 1: stored `user_banned` is impossible under the released formula

Row:

- `content_id = C123456908`
- `user_trust_score = 14`
- `content_severity_index = 86`
- `final_decision = user_banned`

CSV row:

- `src/amazon_sop_bench/benchmarks/data/content_flagging/test_set_with_outputs.csv:2`

Under the released `determineFinalDecision()` formula, with any valid `bot_probability_index` in `[0, 1]`, the final score range is:

- min score: `43.47`
- max score: `60.67`

That can only yield:

- `warning`
- `removed`

It can never yield `user_banned`.

### Example 2: stored `allowed` is impossible under the released formula

Row:

- `content_id = C123456972`
- `user_trust_score = 87`
- `content_severity_index = 33`
- `final_decision = allowed`

CSV row:

- `src/amazon_sop_bench/benchmarks/data/content_flagging/test_set_with_outputs.csv:3`

Under the released formula, the score range is:

- min score: `68.47`
- max score: `75.07`

That can only yield:

- `removed`

It can never yield `allowed`.
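
The range check behind both examples can be sketched as follows, using the final-score formula from the reproduction script in this report. The `npp` and `cfp` values are hypothetical: they are not taken from the CSV and were chosen only because they reproduce the score ranges quoted above. The sketch also treats `bot_probability_index` as continuous on `[0, 1]`.

```python
LABELS = ["allowed", "warning", "removed", "user_banned"]

def label_for(score):
    # Thresholds as in the released score-to-label mapping.
    if score > 80:
        return "user_banned"
    if score > 60:
        return "removed"
    if score > 40:
        return "warning"
    return "allowed"

def final_score(uts, csi, bpi, npp, cfp):
    historical_factor = 100 * (1 - (cfp / max(npp, 1)))
    return uts * 0.35 + csi * 0.40 * (1 + bpi * 0.5) + historical_factor * 0.25

def reachable_labels(uts, csi, npp, cfp):
    # For csi >= 0 the score never decreases as bpi grows,
    # so the extremes are at bpi = 0 and bpi = 1.
    lo = final_score(uts, csi, 0.0, npp, cfp)
    hi = final_score(uts, csi, 1.0, npp, cfp)
    first, last = LABELS.index(label_for(lo)), LABELS.index(label_for(hi))
    return round(lo, 2), round(hi, 2), LABELS[first:last + 1]

# Example 1 scores with hypothetical npp=6, cfp=5.
print(reachable_labels(14, 86, npp=6, cfp=5))    # → (43.47, 60.67, ['warning', 'removed'])
# Example 2 scores with hypothetical npp=139, cfp=1.
print(reachable_labels(87, 33, npp=139, cfp=1))  # → (68.47, 75.07, ['removed'])
```

A stored label is unreachable for a row whenever it is missing from the returned list.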

## Why this looks unintentional rather than intentional noise

### Paper mismatch

The paper says:

- the dataset includes inputs and outputs for tool/API invocations
- mock APIs replace live APIs
- evaluation is intended to be stable and reproducible without runtime variability

Source:

- https://ar5iv.labs.arxiv.org/html/2506.08119
  - Section 3.1.3 / 3.1.5 / evaluation setup

That directly conflicts with runtime `random.random()` in the released tools.

### Repo documentation mismatch

The repo explicitly describes deterministic mock tools:

- `README.md:69`
- `ARCHITECTURE.md:113-152`
- `docs/ADDING_BENCHMARKS.md:115`
- `docs/ADDING_BENCHMARKS.md:576-585`

`docs/ADDING_BENCHMARKS.md` even uses random outputs as the example of what **not** to do.

### The paper’s own tool-generation prompt conflicts with the implementation

The paper’s tool-code-generation appendix says generated tools should:

- use only existing dataset columns
- load the CSV before accessing required fields

Source:

- https://ar5iv.labs.arxiv.org/html/2506.08119
  - Appendix C.5, especially the requirement to use only existing columns

But the released `content_flagging` tool code synthesizes values that are not present in the dataset.

### The paper discusses “inconsistent function outputs” as a symptom, not as intended design

The discussion section says content flagging suffered from “inconsistent function outputs,” which is consistent with a broken tool implementation. It does not present stochastic tool behavior as a benchmark feature.

Source:

- https://ar5iv.labs.arxiv.org/html/2506.08119
  - Discussion section on Content Flagging

### `random.shuffle` in other domains is cosmetic

The other domains that import `random` (`video_annotation`, `video_classification`) use it only for `random.shuffle(toolspec_json)`, which changes the presentation order of tool specifications. This does not affect tool outputs or evaluation correctness — it is purely cosmetic. The `content_flagging` domain is the only one where `random` affects computed tool return values.

## Additional release inconsistencies

### 1. `sop.txt` vs `sop_v2.txt`

The loader uses `sop.txt`:

- `src/amazon_sop_bench/benchmarks/loader.py:65`

But `content_flagging/sop_v2.txt` is the version that explicitly instructs agents to call the four tools and use the returned BPI/UTS/CSI in sequence.

Files:

- `src/amazon_sop_bench/benchmarks/data/content_flagging/sop.txt`
- `src/amazon_sop_bench/benchmarks/data/content_flagging/sop_v2.txt`

The repo does not appear to reference `sop_v2.txt` anywhere.

### 2. Task count mismatch

The latest paper and repo docs report `226` content-flagging tasks:

- paper table
- `README.md`
- `docs/GETTING_STARTED.md`
- `ARCHITECTURE.md`

But the shipped CSVs contain:

- `168` rows in `test_set_with_outputs.csv`
- `57` data rows in `test_set_without_outputs.csv` (`58` lines including header)

Neither file contains `226` rows, and together they contain `225` data rows, so the released artifact bundle does not appear to be aligned with the paper/docs.

## Minimal reproduction

### Inspect the stochastic code paths

```bash
nl -ba src/amazon_sop_bench/benchmarks/data/content_flagging/tools.py | sed -n '30,113p'
nl -ba src/amazon_sop_bench/benchmarks/data/content_flagging/tools.py | sed -n '207,379p'
```

### Inspect the CSV schema

```bash
nl -ba src/amazon_sop_bench/benchmarks/data/content_flagging/test_set_with_outputs.csv | sed -n '1,3p'
```

### Check task-count mismatch

```bash
wc -l \
  src/amazon_sop_bench/benchmarks/data/content_flagging/test_set_with_outputs.csv \
  src/amazon_sop_bench/benchmarks/data/content_flagging/test_set_without_outputs.csv \
  src/amazon_sop_bench/benchmarks/data/content_flagging/data.csv
```

### Standalone reproduction script

This script reproduces the released formulas using only stdlib and the shipped CSV. It demonstrates both non-determinism and mismatch with the stored labels.

```python
import csv, math, random, statistics

random.seed(0)

path = "src/amazon_sop_bench/benchmarks/data/content_flagging/test_set_with_outputs.csv"
# Read all rows up front and close the file handle promptly.
with open(path, newline="") as f:
    rows = list(csv.DictReader(f))

VIOLATION_WEIGHTS = {
    "hate_speech": 1.0,
    "spam": 0.6,
    "violence": 0.9,
    "adult_content": 0.7,
    "copyright": 0.5,
    "misinformation": 0.8,
    "bot_activity": 0.7,
    "self_harm": 1.0,
    "discrimination": 0.9,
    "harassment": 0.8,
}

def calc_csi(row):
    pvt = row["PrimaryViolationType"]
    svt = row["SecondaryViolationType"]
    pvc = float(row["PrimaryViolation_Confidence"])
    svc = float(row["SecondaryViolation_Confidence"])
    primary_score = VIOLATION_WEIGHTS[pvt] * pvc
    secondary_score = 0
    if svt and svc > 0:
        secondary_score = VIOLATION_WEIGHTS[svt] * svc
        if pvt == svt:
            secondary_score *= 1.2
    return min(round(primary_score * 0.7 + secondary_score * 0.3), 100)

def calc_uts(row, bpi, dcs):
    npp = int(row["NumberofPreviousPosts"])
    cfp = int(row["CountofFlaggedPosts"])
    lat = float(row["Latitude"])
    lon = float(row["Longitude"])
    base_score = 50 + min(npp * 0.3, 30) + min(cfp * -0.5, -25)
    grq = abs(math.cos(math.radians(lat)) * math.cos(math.radians(lon)))
    geographic_modifier = 1 - (grq * 0.2)
    behavioral_modifier = ((1 - bpi) + dcs) / 2
    return max(0, min(100, int(base_score * geographic_modifier * behavioral_modifier)))

def calc_final(row, uts, csi, bpi):
    npp = int(row["NumberofPreviousPosts"])
    cfp = int(row["CountofFlaggedPosts"])
    historical_factor = 100 * (1 - (cfp / max(npp, 1)))
    final_score = (
        (uts * 0.35) +
        (csi * 0.40 * (1 + bpi * 0.5)) +
        (historical_factor * 0.25)
    )
    if final_score > 80:
        return "user_banned"
    elif final_score > 60:
        return "removed"
    elif final_score > 40:
        return "warning"
    return "allowed"

def one_run(row):
    bpi = random.random()
    dcs = round(random.random(), 2)
    if dcs < 0.5:
        bpi = min(1.0, bpi + 0.2)
    bpi = round(bpi, 2)
    csi = calc_csi(row)
    uts = calc_uts(row, bpi, dcs)
    return calc_final(row, uts, csi, bpi)

repeats = 100
match_rates = []
flip_rows = 0

for row in rows:
    decisions = [one_run(row) for _ in range(repeats)]
    match_rates.append(sum(d == row["final_decision"] for d in decisions) / repeats)
    if len(set(decisions)) > 1:
        flip_rows += 1

print("rows", len(rows))
print("rows_with_label_flip", flip_rows)
print("avg_match_rate_to_csv", round(sum(match_rates) / len(match_rates), 4))
print("median_match_rate_to_csv", round(statistics.median(match_rates), 4))
print("always_match_rows", sum(x == 1.0 for x in match_rates))
print("never_match_rows", sum(x == 0.0 for x in match_rates))
```
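
The script above does not print the CSI mismatch count. One way such a count could be computed is sketched below. Exact integer comparison is an assumption about how the mismatch was measured, `count_csi_mismatches` is a hypothetical helper, and the two sample rows (including their confidence scale) are invented purely to exercise the function.

```python
import csv
import io

# Subset of the weight table from the script above.
VIOLATION_WEIGHTS = {"hate_speech": 1.0, "spam": 0.6}

def calc_csi(row):
    # Same formula as calc_csi in the reproduction script.
    pvt = row["PrimaryViolationType"]
    svt = row["SecondaryViolationType"]
    pvc = float(row["PrimaryViolation_Confidence"])
    svc = float(row["SecondaryViolation_Confidence"])
    primary_score = VIOLATION_WEIGHTS[pvt] * pvc
    secondary_score = 0
    if svt and svc > 0:
        secondary_score = VIOLATION_WEIGHTS[svt] * svc
        if pvt == svt:
            secondary_score *= 1.2
    return min(round(primary_score * 0.7 + secondary_score * 0.3), 100)

def count_csi_mismatches(rows):
    # Count rows whose stored CSI differs from the recomputed value.
    return sum(
        calc_csi(row) != round(float(row["content_severity_index"]))
        for row in rows
    )

# Invented sample rows, not taken from the shipped CSV.
SAMPLE = """PrimaryViolationType,SecondaryViolationType,PrimaryViolation_Confidence,SecondaryViolation_Confidence,content_severity_index
hate_speech,spam,90,50,86
spam,,40,0,17
"""
sample_rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(count_csi_mismatches(sample_rows), "of", len(sample_rows))
```

Pointing `count_csi_mismatches` at the rows loaded from `test_set_with_outputs.csv`, with the full weight table, would give the corresponding figure for the shipped data.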

## Likely root cause

This looks like a release misalignment between:

- the generated CSV artifacts
- the generated tool code
- the loaded SOP file
- the benchmark statistics reported in docs/paper

Most likely the benchmark was supposed to use deterministic, CSV-backed intermediate outputs, but the released `content_flagging/tools.py` either:

- kept a bad generated implementation, or
- was regenerated from a different prompt/version than the CSV

## Proposed fix

### Recommended fix

Make `content_flagging` deterministic and align all artifacts to one canonical version.

Concretely:

1. Remove runtime randomness from:
   - `calculate_device_consistency()`
   - `calculateBotProbabilityIndex()`
2. Either:
   - add `bot_probability_index` and `device_consistency_score` columns to the dataset and return those values directly, or
   - replace the random logic with a deterministic function of existing inputs and regenerate all downstream ground truth accordingly
3. Make `calculateContentSeverityIndex()` consistent with the stored `content_severity_index`, or regenerate the CSV to match the released formula
4. Make `determineFinalDecision()` consistent with the stored `final_decision`, or regenerate the CSV to match the released formula
5. Reconcile `sop.txt` vs `sop_v2.txt`
6. Reconcile the shipped row counts with the paper/docs (`168` vs `226`)
7. Add a regression test that runs the same content-flagging task multiple times and asserts identical tool outputs and identical final decisions
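
As a rough sketch of steps 2 and 7, the code below derives a stable score from existing inputs by hashing and checks that repeated calls agree. It is only one possible approach under stated assumptions: `stable_unit_score`, `is_deterministic`, and the salt string are hypothetical names, and adopting a hash-derived score would still require regenerating the downstream ground truth.

```python
import hashlib
import random

def stable_unit_score(content_id, salt):
    # Hypothetical replacement for random.random(): map the inputs to a
    # repeatable value in [0, 1] by hashing them.
    digest = hashlib.sha256(f"{salt}:{content_id}".encode("utf-8")).digest()
    return round(int.from_bytes(digest[:8], "big") / 2**64, 2)

def is_deterministic(tool, kwargs, repeats=20):
    # True when repeated calls with identical inputs return identical outputs.
    first = tool(**kwargs)
    return all(tool(**kwargs) == first for _ in range(repeats - 1))

def random_tool(content_id):
    # Mirrors the stochastic pattern reported in this issue.
    return random.random()

def hashed_tool(content_id):
    # Deterministic stand-in built on the hash-derived score.
    return stable_unit_score(content_id, salt="device_consistency")

print(is_deterministic(hashed_tool, {"content_id": "C123456908"}))
print(is_deterministic(random_tool, {"content_id": "C123456908"}))
```

A regression test could assert that `is_deterministic` holds for every `content_flagging` tool and for the final decision.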

### Minimum acceptable fix

If a full artifact regeneration is not immediately possible:

1. Mark `content_flagging` as currently inconsistent/non-reproducible
2. Exclude it from aggregate benchmark reporting until fixed
3. Document which artifact version produced the published numbers

## Bottom line

The released `content_flagging` subset is currently not a valid deterministic benchmark in the sense that the rest of SOP-Bench appears to be. The evidence strongly suggests an implementation/data release bug rather than intentional stochastic evaluation noise.

