Skip to content

02zerocool/truth-benchmark

Repository files navigation

Truth Benchmark — Claim vs Code Verifier

Does the code actually do what it says?

Software documentation makes claims. Code tells the truth.
This project builds tools to catch the gap — automatically.


The Problem

Every piece of software has two parts:

  • What it says it does — documentation, README, comments, AI-generated descriptions
  • What it actually does — the code

These drift apart constantly. A developer fixes the code, forgets the docs. A security claim stays in the README after the protection was removed. An AI describes what code should do, not what it does.

This causes real harm. Security teams audit documentation, not code. AI models trained on documentation inherit the lies.

This project builds a benchmark to catch it.

Not a developer? See PLAIN_ENGLISH.md — the problem explained without code.


Quick Start

git clone https://github.com/02zerocool/truth-benchmark
cd truth-benchmark
pip install pandas
python pipeline.py

Test a single claim:

python predict.py "deletes files" "os.remove(path)"
# Result: MATCH

python predict.py "encrypts password before storing" "db.save(user, password)"
# Result: LIE

Baseline Scores

Verified on the full 52-example dataset. These are the numbers to beat.

Baseline Accuracy MATCH F1 LIE F1 Notes
Rules (no ML) 63.5% 0.72 0.46 Zero dependencies. The floor.
Semantic (all-MiniLM-L6-v2) 71.2% 0.74 0.68 80MB model, CPU-friendly.
LLM (llama3:8b via Ollama) 75.0% 0.79 0.68 Best accuracy. Misses subtle numeric lies.

The gap between rules and LLM is 11.5 points. The gap between LLM and perfect is 25 points.
The hardest cases: wrong constants, off-by-one errors, wrong sort direction, missing auth checks.


Three Baselines — Plug In Any Model

Baseline 1 — Rules (no install, instant)

python baseline_rules.py

Heuristic pattern matching. No ML. Runs anywhere. This is the floor — beat it.

Baseline 2 — Semantic (80MB model, CPU-friendly)

pip install sentence-transformers
python baseline_semantic.py

Embeds claim and code with all-MiniLM-L6-v2, measures cosine similarity.
Downloads model automatically on first run.

Baseline 3 — LLM (best accuracy)

# With Ollama (local, free, offline):
ollama pull llama3.1:8b
python baseline_llm.py

# With any OpenAI-compatible API:
LLM_API_URL=https://api.openai.com/v1 LLM_API_KEY=your_key LLM_MODEL=gpt-4o-mini python baseline_llm.py

Evaluate Any Verifier

Plug in your own model with two lines:

from evaluate import evaluate, print_report

def my_verifier(claim: str, code: str) -> str:
    # your model here
    return "MATCH" or "LIE"

results = evaluate(my_verifier)
print_report(results, name="My Model")

Output:

==================================================
  My Model
==================================================
  Total examples : 52
  Correct        : 38
  Accuracy       : 73.1%

  [MATCH]  precision=0.71  recall=0.89  f1=0.79
  [LIE]    precision=0.79  recall=0.54  f1=0.64

  Failures:
    truth=LIE predicted=MATCH
      claim : returns unique items preserving order
      code  : return list(set(items))
==================================================

Dataset

dataset.csv — 52 labeled examples across Python, JavaScript, Go, Rust, Java, C#, SQL.

claim code label
deletes files os.remove(path) MATCH
deletes files open(path,'w').write('') LIE
encrypts password before storing db.save(user, bcrypt.hash(password, 12)) MATCH
encrypts password before storing db.save(user, password) LIE
sends data over encrypted connection requests.get('https://' + url) MATCH
sends data over encrypted connection requests.get('http://' + url) LIE

Coverage includes:

  • Wrong operators and constants
  • Missing encryption, validation, authentication
  • Wrong sort direction, wrong protocol
  • Subtle traps: byte length vs character length, set() vs dict.fromkeys() for order-preserving uniqueness, soft-delete vs hard-delete, indexOf > 0 missing index 0

Contribute

Add examples. Any language. Any domain. All pull requests welcome.

What makes a good contribution:

  • Both a MATCH and a LIE for the same claim
  • The LIE is plausible — something a real developer might write
  • Claim is plain English, code is a short snippet

Most wanted — subtle lies:
Cases where the code almost does the right thing but doesn't quite.
These are the hardest to catch and most dangerous in production.

Most wanted — security examples:
Missing encryption, missing validation, wrong protocol, missing auth check.
Documentation that lies about security is a vulnerability.

See CONTRIBUTING.md for format details.


File Structure

truth-benchmark/
├── dataset.csv          # Labeled examples
├── pipeline.py          # Bare minimum: load and inspect the dataset
├── evaluate.py          # Evaluation framework — plug in any verifier
├── baseline_rules.py    # Baseline 1: heuristic rules, no ML
├── baseline_semantic.py # Baseline 2: semantic similarity
├── baseline_llm.py      # Baseline 3: LLM-based (Ollama or API)
├── predict.py           # CLI: verify a single claim/code pair
├── requirements.txt     # Dependencies (pandas required, rest optional)
├── PLAIN_ENGLISH.md     # Explanation for non-developers
└── CONTRIBUTING.md      # How to add examples

Why This Matters Beyond Software

The same gap exists everywhere:

What they say What actually happens
"Your data is deleted on request" Retained in backups for years
"This AI has no bias" Trained on curated data with known gaps
"This system is independently audited" Audited by a subsidiary
"Encrypted end-to-end" Encrypted in transit, plain text at rest

Code is verifiable. Documentation is a claim.
We are building a tool that checks claims against evidence.
That is a universal need.


License

MIT — free to use, fork, extend, deploy. No restrictions.


Acknowledgment

Built with Seven (Claude Sonnet 4.6)

The framework design, dataset construction, all three baselines, and documentation were developed in active collaboration with Seven. The problem statement and direction came from the human. The implementation was built together.

About

Does the code do what it says? A benchmark for catching the gap between documentation and reality.

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages