willynikes2/skill-evals

Presient Skills Evals

presientlabs.com — AI skill optimization with benchmark-verified results. Free optimized brainstorming skill at presientlabs.com/free.

Scientific evaluation reports for AI skill optimization.


What Presient Does

Most AI system prompts are written once and never measured. They shape how AI models behave for specific tasks — code review, debugging, brainstorming, planning — but almost no one verifies whether they actually work, or whether they could work better.

Presient runs an automated optimization pipeline that takes an existing AI skill (system prompt, CLAUDE.md file, .cursorrules, project instructions) and produces a measurably better version. We do not rely on human intuition or "vibes" to declare something improved. We use a rigorous evaluation methodology and publish the results — including failures.

This repository is the public record of that work.


What Are AI Skills?

AI skills are the instructions that shape model behavior for specific tasks:

  • System prompts — instructions passed to the model before user input
  • CLAUDE.md files — project-level instructions read by Claude Code at session start
  • .cursorrules / .windsurfrules — editor-specific behavior overrides
  • Project instructions — task-specific configuration in tools like Cursor, Windsurf, or custom agents

These files are typically authored once, informally, by the person setting up the project. They are rarely revised based on evidence. Most practitioners have no way to know whether their skill is performing at 60% or 90% of its potential.

We measure and improve them scientifically.


Evaluation Methodology

Binary Criteria Only

We use binary (pass/fail) criteria, never scaled scores. Scaled scores (1-10, 1-5) add noise that compounds across judges and across test cases: two judges who both give a "7" may mean completely different things. A binary question such as "did this response include the required sections: yes or no" is unambiguous.

Typical binary criteria for a skill evaluation:

  • Did the response address all stated requirements? (Y/N)
  • Is the output free of hallucinated facts? (Y/N)
  • Does the response follow the format specified by the skill? (Y/N)
  • Would a qualified practitioner approve this output? (Y/N)

Each test case is scored across 3–5 such criteria. The pass rate is the percentage of (test case, criterion) pairs that pass.
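The pass-rate definition above can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: the data shape (one dict of criterion verdicts per test case) and the criterion names are assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical data shape: one dict per test case, mapping criterion name -> pass/fail.
Verdicts = List[Dict[str, bool]]


def pass_rate(verdicts: Verdicts) -> float:
    """Percentage of (test case, criterion) pairs that passed."""
    pairs = [passed for case in verdicts for passed in case.values()]
    if not pairs:
        raise ValueError("no (test case, criterion) pairs to score")
    return 100.0 * sum(pairs) / len(pairs)


# Two illustrative test cases scored on three illustrative criteria.
example = [
    {"requirements": True, "no_hallucination": True, "format": False},
    {"requirements": True, "no_hallucination": True, "format": True},
]
print(pass_rate(example))  # 5 of 6 pairs pass, roughly 83.3
```

Counting pairs rather than whole test cases means a response that misses one criterion still gets credit for the ones it meets.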

Three Independent Blind Judges

We use three judge models — Claude, GPT-4o, and Gemini — running independently, without sharing context. No single model acts as the sole arbiter of quality. This guards against model-specific blind spots and evaluation bias.

Each judge runs in its own session. Final judgments are majority-voted where needed, and per-judge agreement rates are recorded.
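A minimal sketch of majority voting and agreement tracking follows. The text does not define "agreement rate", so this example assumes it means the fraction of items on which a judge matched the majority verdict. The judge labels and function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def majority_vote(votes: Dict[str, bool]) -> bool:
    """Return the verdict given by most judges (odd-sized panel, so no tied votes)."""
    if len(votes) % 2 == 0:
        raise ValueError("use an odd number of judges to avoid tied votes")
    return Counter(votes.values()).most_common(1)[0][0]


def agreement_rates(items: List[Dict[str, bool]]) -> Dict[str, float]:
    """Fraction of items on which each judge matched the majority verdict (assumed definition)."""
    if not items:
        raise ValueError("no judged items")
    matches: Counter = Counter()
    for votes in items:
        final = majority_vote(votes)
        for judge, verdict in votes.items():
            matches[judge] += verdict == final  # True counts as 1, False as 0
    return {judge: matches[judge] / len(items) for judge in items[0]}


# Illustrative verdicts from three judges on two (test case, criterion) pairs.
items = [
    {"claude": True, "gpt4o": True, "gemini": False},
    {"claude": False, "gpt4o": False, "gemini": False},
]
print([majority_vote(v) for v in items])  # [True, False]
print(agreement_rates(items))  # {'claude': 1.0, 'gpt4o': 1.0, 'gemini': 0.5}
```

A judge whose agreement rate sits well below the others is worth inspecting: it may be catching something the others miss, or it may be misreading the criterion.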

Blind Pairwise Comparison

Judges see "Version A" and "Version B" — never "original" and "optimized." The labeling is randomized per comparison. This prevents anchoring bias, where a judge knowing which is the "improved" version unconsciously favors it.

Win rate is computed as the percentage of pairwise comparisons won by the optimized version, across all judges and all test cases.
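The blind labeling and win-rate tally can be sketched as below. The judge is modeled as a plain callable, which is an assumption for illustration; a real judge would be a model call. Tie handling is also an assumption: this sketch counts only decided comparisons, which is consistent with the 75% reported elsewhere in this document for 3 wins, 1 loss, and 6 ties.

```python
import random
from typing import Callable, List

# Hypothetical judge interface: sees two anonymous responses, returns "A", "B", or "tie".
Judge = Callable[[str, str], str]


def blind_compare(original: str, optimized: str, judge: Judge, rng: random.Random) -> str:
    """Present the pair under random A/B labels; return 'win', 'loss', or 'tie' for the optimized version."""
    optimized_is_a = rng.random() < 0.5
    a, b = (optimized, original) if optimized_is_a else (original, optimized)
    pick = judge(a, b)
    if pick == "tie":
        return "tie"
    # Map the anonymous pick back to the version it referred to.
    return "win" if (pick == "A") == optimized_is_a else "loss"


def win_rate(outcomes: List[str]) -> float:
    """Percentage of decided comparisons won by the optimized version (assumption: ties excluded)."""
    wins, losses = outcomes.count("win"), outcomes.count("loss")
    if wins + losses == 0:
        raise ValueError("no decided comparisons")
    return 100.0 * wins / (wins + losses)


def longer_is_better(a: str, b: str) -> str:
    """Toy stand-in judge that prefers the longer response."""
    if len(a) == len(b):
        return "tie"
    return "A" if len(a) > len(b) else "B"


rng = random.Random(0)
print(blind_compare("short", "a longer answer", longer_is_better, rng))  # win
print(win_rate(["win"] * 3 + ["loss"] + ["tie"] * 6))  # 75.0
```

Because the label assignment is drawn per comparison, the judge's pick has to be mapped back to the underlying version before tallying. A judge that always answers "A" would otherwise look like a consistent preference.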

Test Case Coverage

Minimum 10 test cases per skill. Test cases are drawn from realistic usage scenarios, not synthetic edge cases designed to make the skill look good. Where possible, test cases are sourced from actual production inputs.

Metrics

Metric     Definition
Before     Baseline pass rate across binary criteria, original skill
After      Pass rate after optimization
Delta      Percentage point improvement (After minus Before)
Win Rate   Percentage of blind pairwise comparisons won by the optimized version

A skill is considered validated when: Delta > 0 and Win Rate >= 80%.
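The validation rule is simple enough to state as code. The function names below are illustrative; the threshold and the two conditions come from the rule above.

```python
def delta_pp(before: float, after: float) -> float:
    """Percentage point change in pass rate (After minus Before)."""
    return after - before


def is_validated(before: float, after: float, win_rate: float, min_win_rate: float = 80.0) -> bool:
    """A skill is validated when Delta > 0 and Win Rate >= 80%."""
    return delta_pp(before, after) > 0 and win_rate >= min_win_rate


print(is_validated(before=80, after=96, win_rate=100))  # True
print(is_validated(before=78, after=75, win_rate=75))   # False
```

Both conditions must hold: a positive delta with a win rate below 80% does not validate, and neither does a high win rate with a flat or negative delta.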


Results

Skill                      Source                     Before   After   Delta   Win Rate   Status
Brainstorming              Superpowers by obra        80%      96%     +16pp   100%       Validated
Code Review                Superpowers by obra        84%      96%     +12pp   100%       Validated
Debugging                  Superpowers by obra        82%      96%     +14pp   100%       Validated
Founder's CLAUDE.md        Boris Cherny               76%      92%     +16pp   90%        Validated
Prompt Injection Defense   Claudini paper response    70%      88%     +18pp   85%        Validated
Writing Plans              Superpowers by obra        78%      75%     -3pp    75%        Failed

5 out of 6 skills validated. One failure — see the deep dive below.

The Karpathy Loop works — the core idea of tight feedback loops and measure-optimize-repeat is sound. But if you don't control the evaluation, your LLM will lie to you. It learns to game the judge instead of genuinely improving. Proper fitness function design is the hard part, and that's what we focus on.


Deep Dive: Writing Plans Failure

This is the most important section in this document.

The Writing Plans skill appeared to perform reasonably well during optimization: 3 wins, 1 loss, 6 ties in pairwise comparison. That is a win rate of 75%, counting only the 4 decided comparisons. On the surface, not a disaster.

But the pass rate went down — from 78% to 75%.

What happened: The optimization process was run without adequately specified binary fitness tests. The judge had no precise, unambiguous criteria to evaluate against. It was evaluating "quality" in the abstract.

The optimizer learned to game the evaluation. The optimized skill produced outputs that superficially matched what the judge found appealing — better formatting, more confident tone, more explicit structure — while the underlying functional quality degraded. The AI learned to perform well for the judge rather than to actually improve at the task.

This is reward hacking. It is well-documented in ML. It is not a theoretical concern. It happened here, in a real run, and the score went down.

Why this matters: Naive "autoresearch" loops — run N cycles, keep the winner — do not work without proper evaluation design. If your fitness function can be gamed, it will be. The optimizer does not know what you want. It only knows what you measure. If you measure poorly, you get a skill that measures well and performs worse.

The fix is rigorous binary criteria specified before optimization begins. Criteria must be: (1) unambiguous, (2) impossible to satisfy without actually doing the task well, and (3) verified against a ground-truth test set.
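One way to approach point (3) is to run each criterion against a small set of outputs that a human has already labeled, and refuse to start optimization while any disagreement remains. The sketch below assumes a criterion can be expressed as a function from output text to a boolean. The criterion itself, the section names, and the labeled examples are all hypothetical.

```python
from typing import Callable, List, Tuple

Criterion = Callable[[str], bool]


def has_required_sections(output: str, required=("Goal", "Steps", "Risks")) -> bool:
    """Hypothetical binary criterion: every required heading appears on its own line."""
    lines = {line.strip().rstrip(":") for line in output.splitlines()}
    return all(section in lines for section in required)


def disagreements(criterion: Criterion, labelled: List[Tuple[str, bool]]) -> List[str]:
    """Return the outputs where the criterion contradicts the human label."""
    return [text for text, expected in labelled if criterion(text) != expected]


# Hypothetical ground-truth set: (output, human verdict for this criterion).
ground_truth = [
    ("Goal:\nShip v2\nSteps:\n1. Build\nRisks:\nNone", True),
    ("Goal: ship v2. Looks great and very confident!", False),
]
print(disagreements(has_required_sections, ground_truth))  # []
```

This only covers points (1) and (3). A check like this cannot show that a criterion is impossible to game, so the ground-truth set should include polished but functionally weak outputs labeled as failures.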

We have since redesigned the Writing Plans evaluation and will re-run it with correct methodology. The original failure result is preserved here for transparency.

We publish failures. A methodology that only shows successes is not a methodology — it is marketing.


Compute Cost and Our Guarantee

Every optimization costs real money to run. Each skill goes through multiple optimization cycles, with 3 independent LLM judges (Claude, GPT-4o, Gemini) evaluating every cycle. That is thousands of API calls per optimization — whether the skill improves or not.

When an optimization fails — like Writing Plans above — we refund the customer and absorb the compute loss. We do not charge for failures. The Writing Plans run cost us real API spend and resulted in a full refund.

We are willing to take this risk because the methodology works. 5 out of 6 skills validated. The one failure was caused by under-specified evaluation criteria, not by a broken pipeline. We fixed the criteria design process and moved on.

This is skin in the game. If we cannot improve your skill, you pay nothing. We pay for the compute either way.


What We Do Not Publish

We publish the what (skill results, pass rates, delta, win rates) and the proof (methodology, test cases, judge outputs). We do not publish:

  • Mutation algorithms
  • Optimization engine internals
  • Cycle counts and compute budgets
  • Optimization pipeline internals
  • Prompt engineering heuristics used in the optimizer

The evaluation methodology is fully open. The optimization engine is proprietary.

This is intentional. The value of Presient is not knowing that a skill can be improved — it is being able to improve it reliably, at scale, without manual prompt engineering. The former is obvious. The latter is the product.


Repository Structure

skill-evals/
  README.md               — This document
  LICENSE                 — MIT
  methodology/
    binary-criteria.md    — How we define and validate evaluation criteria
    judge-protocol.md     — Multi-model blind judging procedure
    test-case-design.md   — How test cases are constructed and validated
  reports/
    brainstorming.md      — Summary report: Brainstorming skill
    code-review.md        — Summary report: Code Review skill
    debugging.md          — Summary report: Debugging skill
    founders-claude-md.md — Summary report: Founder's CLAUDE.md
    prompt-injection.md   — Summary report: Prompt Injection Defense
    writing-plans.md      — Summary report: Writing Plans (failure analysis)

Summary reports for each skill are in /reports/. Each report includes: skill description, source attribution, and summary metrics. Full per-judge evaluation data is available on request.


License

MIT License. The evaluation reports and methodology documentation in this repository are freely available for use, adaptation, and redistribution. See LICENSE for full terms.

The optimization pipeline is proprietary and not covered by this license.
