Truth Benchmark — Claim vs Code Verifier

Does the code actually do what it says?

Software documentation makes claims. Code tells the truth.
This project builds tools to catch the gap — automatically.

The Problem

Every piece of software has two parts:

What it says it does — documentation, README, comments, AI-generated descriptions
What it actually does — the code

These drift apart constantly. A developer fixes the code, forgets the docs. A security claim stays in the README after the protection was removed. An AI describes what code should do, not what it does.

This causes real harm. Security teams audit documentation, not code. AI models trained on documentation inherit the lies.

This project builds a benchmark to catch it.

Not a developer? See PLAIN_ENGLISH.md — the problem explained without code.

Quick Start

git clone https://github.com/02zerocool/truth-benchmark
cd truth-benchmark
pip install pandas
python pipeline.py

Test a single claim:

python predict.py "deletes files" "os.remove(path)"
# Result: MATCH

python predict.py "encrypts password before storing" "db.save(user, password)"
# Result: LIE

Baseline Scores

Verified on the full 52-example dataset. These are the numbers to beat.

Baseline	Accuracy	MATCH F1	LIE F1	Notes
Rules (no ML)	63.5%	0.72	0.46	Zero dependencies. The floor.
Semantic (all-MiniLM-L6-v2)	71.2%	0.74	0.68	80MB model, CPU-friendly.
LLM (llama3:8b via Ollama)	75.0%	0.79	0.68	Best accuracy. Misses subtle numeric lies.

The gap between rules and LLM is 11.5 points. The gap between LLM and perfect is 25 points.
The hardest cases: wrong constants, off-by-one errors, wrong sort direction, missing auth checks.

Three Baselines — Plug In Any Model

Baseline 1 — Rules (no install, instant)

python baseline_rules.py

Heuristic pattern matching. No ML. Runs anywhere. This is the floor — beat it.

Baseline 2 — Semantic (80MB model, CPU-friendly)

pip install sentence-transformers
python baseline_semantic.py

Embeds claim and code with all-MiniLM-L6-v2, measures cosine similarity.
Downloads model automatically on first run.

Baseline 3 — LLM (best accuracy)

# With Ollama (local, free, offline):
ollama pull llama3.1:8b
python baseline_llm.py

# With any OpenAI-compatible API:
LLM_API_URL=https://api.openai.com/v1 LLM_API_KEY=your_key LLM_MODEL=gpt-4o-mini python baseline_llm.py

Evaluate Any Verifier

Plug in your own model with two lines:

from evaluate import evaluate, print_report

def my_verifier(claim: str, code: str) -> str:
    # your model here
    return "MATCH" or "LIE"

results = evaluate(my_verifier)
print_report(results, name="My Model")

Output:

==================================================
  My Model
==================================================
  Total examples : 52
  Correct        : 38
  Accuracy       : 73.1%

  [MATCH]  precision=0.71  recall=0.89  f1=0.79
  [LIE]    precision=0.79  recall=0.54  f1=0.64

  Failures:
    truth=LIE predicted=MATCH
      claim : returns unique items preserving order
      code  : return list(set(items))
==================================================

Dataset

dataset.csv — 52 labeled examples across Python, JavaScript, Go, Rust, Java, C#, SQL.

claim	code	label
deletes files	`os.remove(path)`	MATCH
deletes files	`open(path,'w').write('')`	LIE
encrypts password before storing	`db.save(user, bcrypt.hash(password, 12))`	MATCH
encrypts password before storing	`db.save(user, password)`	LIE
sends data over encrypted connection	`requests.get('https://' + url)`	MATCH
sends data over encrypted connection	`requests.get('http://' + url)`	LIE

Coverage includes:

Wrong operators and constants
Missing encryption, validation, authentication
Wrong sort direction, wrong protocol
Subtle traps: byte length vs character length, set() vs dict.fromkeys() for order-preserving uniqueness, soft-delete vs hard-delete, indexOf > 0 missing index 0

Contribute

Add examples. Any language. Any domain. All pull requests welcome.

What makes a good contribution:

Both a MATCH and a LIE for the same claim
The LIE is plausible — something a real developer might write
Claim is plain English, code is a short snippet

Most wanted — subtle lies:
Cases where the code almost does the right thing but doesn't quite.
These are the hardest to catch and most dangerous in production.

Most wanted — security examples:
Missing encryption, missing validation, wrong protocol, missing auth check.
Documentation that lies about security is a vulnerability.

See CONTRIBUTING.md for format details.

File Structure

truth-benchmark/
├── dataset.csv          # Labeled examples
├── pipeline.py          # Bare minimum: load and inspect the dataset
├── evaluate.py          # Evaluation framework — plug in any verifier
├── baseline_rules.py    # Baseline 1: heuristic rules, no ML
├── baseline_semantic.py # Baseline 2: semantic similarity
├── baseline_llm.py      # Baseline 3: LLM-based (Ollama or API)
├── predict.py           # CLI: verify a single claim/code pair
├── requirements.txt     # Dependencies (pandas required, rest optional)
├── PLAIN_ENGLISH.md     # Explanation for non-developers
└── CONTRIBUTING.md      # How to add examples

Why This Matters Beyond Software

The same gap exists everywhere:

What they say	What actually happens
"Your data is deleted on request"	Retained in backups for years
"This AI has no bias"	Trained on curated data with known gaps
"This system is independently audited"	Audited by a subsidiary
"Encrypted end-to-end"	Encrypted in transit, plain text at rest

Code is verifiable. Documentation is a claim.
We are building a tool that checks claims against evidence.
That is a universal need.

License

MIT — free to use, fork, extend, deploy. No restrictions.

Acknowledgment

Built with Seven (Claude Sonnet 4.6)

The framework design, dataset construction, all three baselines, and documentation were developed in active collaboration with Seven. The problem statement and direction came from the human. The implementation was built together.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Truth Benchmark — Claim vs Code Verifier

The Problem

Quick Start

Baseline Scores

Three Baselines — Plug In Any Model

Baseline 1 — Rules (no install, instant)

Baseline 2 — Semantic (80MB model, CPU-friendly)

Baseline 3 — LLM (best accuracy)

Evaluate Any Verifier

Dataset

Contribute

File Structure

Why This Matters Beyond Software

License

Acknowledgment

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
PLAIN_ENGLISH.md		PLAIN_ENGLISH.md
README.md		README.md
SETUP.md		SETUP.md
baseline_llm.py		baseline_llm.py
baseline_rules.py		baseline_rules.py
baseline_semantic.py		baseline_semantic.py
dataset.csv		dataset.csv
evaluate.py		evaluate.py
pipeline.py		pipeline.py
predict.py		predict.py
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Truth Benchmark — Claim vs Code Verifier

The Problem

Quick Start

Baseline Scores

Three Baselines — Plug In Any Model

Baseline 1 — Rules (no install, instant)

Baseline 2 — Semantic (80MB model, CPU-friendly)

Baseline 3 — LLM (best accuracy)

Evaluate Any Verifier

Dataset

Contribute

File Structure

Why This Matters Beyond Software

License

Acknowledgment

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages