Acceptance specs and scaffolds for Polarity Keystone evals.
Agents · Specs · Quickstart · Learnings · Docs
Nine agent personas, twelve acceptance specs, one reference implementation, and a field notebook of what actually runs on Keystone. Build evals quickly without rediscovering the platform's gotchas.
The repo holds descriptions and tests, not finished agents. When you want to run a spec, hand the matching scaffold to your AI coding tool and have it generate the agent for you. The reference implementation at agents/stripe-refund-aud/ shows the working pattern.
| # | Slug | Specialty | Status |
|---|---|---|---|
| 1 | general-coder | Generalist coding | ✓ tested |
| 2 | bug-fixer | Diagnose + minimal patches | ✓ tested |
| 3 | db-architect | Postgres schema, seeds, SQL | needs Postgres infra |
| 4 | security-auditor | Vulnerability detection | ✓ tested |
| 5 | web-builder | HTTP servers, REST APIs | ✓ tested |
| 6 | data-pipeline | ETL across services | needs multi-service infra |
| 7 | devops-shell | Dockerfiles, infra-as-code | needs Docker infra |
| 8 | research-summarizer | Read docs, write summaries | ✓ tested |
| 9 | stripe-refund-aud | Refund Stripe charges, AUD only | ✓ implemented |
✓ tested = a throwaway agent built from the scaffold passed the linked spec on Keystone, then was deleted (test artifacts stay outside the repo).
✓ implemented = real code lives in the folder, ready to upload.
| # | Spec | Domain | Agent | Status |
|---|---|---|---|---|
| 0 | hello-world | general | (cli) | ✓ runs |
| 1 | summarize-changelog | general | research-summarizer | ✓ tested |
| 2 | bugfix-linked-list | code-agents | bug-fixer | ✓ tested |
| 3 | refactor-god-class | code-agents | general-coder | ✓ tested |
| 4 | language-matrix-csv | code-agents | general-coder | ✓ tested (Python only) |
| 5 | rest-api-todo | web-agents | web-builder | ✓ tested |
| 6 | webhook-receiver-hmac | web-agents | web-builder | ✓ tested |
| 7 | postgres-ecommerce | data-agents | db-architect | pending |
| 8 | security-review | security-agents | security-auditor | ✓ tested |
| 9 | dockerize-flask-app | devops-agents | devops-shell | pending |
| 10 | enterprise-reconciliation | data-agents | data-pipeline | pending |
| ★ | refund-aud-only | finance-agents | stripe-refund-aud | ✓ implemented |
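The two spec paths used in this README (specs/general/hello-world.yaml and specs/finance-agents/refund-aud-only.yaml) both follow a specs/<domain>/<slug>.yaml layout. Assuming the remaining specs follow the same layout, which this README does not confirm, a small lookup like the sketch below can turn a slug from the table into a path. The names SPEC_DOMAINS and spec_path are hypothetical helpers, not part of the repo or the ks CLI.

```python
from pathlib import PurePosixPath

# Slug -> domain pairs copied from the spec table above.
SPEC_DOMAINS = {
    "hello-world": "general",
    "summarize-changelog": "general",
    "bugfix-linked-list": "code-agents",
    "refactor-god-class": "code-agents",
    "language-matrix-csv": "code-agents",
    "rest-api-todo": "web-agents",
    "webhook-receiver-hmac": "web-agents",
    "postgres-ecommerce": "data-agents",
    "security-review": "security-agents",
    "dockerize-flask-app": "devops-agents",
    "enterprise-reconciliation": "data-agents",
    "refund-aud-only": "finance-agents",
}


def spec_path(slug: str) -> PurePosixPath:
    """Build the assumed specs/<domain>/<slug>.yaml path for a known slug.

    Raises KeyError if the slug is not in the table.
    """
    domain = SPEC_DOMAINS[slug]
    return PurePosixPath("specs") / domain / f"{slug}.yaml"


if __name__ == "__main__":
    for slug in SPEC_DOMAINS:
        print(spec_path(slug))
```

For the two slugs whose paths appear in this README, the helper reproduces them exactly; for the others, check that the file exists before relying on the result.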
You need a Keystone API key. Get one at https://app.paragon.run/app/keystone/settings.
# install ks, wire your key + AI-coder skill files, run the baseline
curl -fsSL https://ks.polarity.so/install.sh | bash
ks setup
ks eval run specs/general/hello-world.yaml
ks setup is the full wizard: it drops AI-coder skill files for Claude Code, Cursor, Gemini CLI, OpenCode, Codex, Windsurf, etc. into the matching .claude/, .cursor/, etc. directories so your tool already knows the Keystone shape. Those directories are gitignored on purpose (regenerated per machine).
To run the one spec with a real agent committed in this repo:
# install the SDK locally so we can upload the snapshot
pip install polarity-keystone
# upload the agent (one-time, per code change)
python - <<'PY'
import polarity_keystone as pk
snap = pk.Keystone().agents.upload(
    name="stripe-refund-aud",
    path="agents/stripe-refund-aud",
    entrypoint=["python3", "/agent/agent.py"],
    runtime="python:3.11",
)
print(snap.id, snap.version)
PY
# run the eval (XAI_API_KEY needed because the agent calls Grok-4)
XAI_API_KEY=xai-... ks eval run specs/finance-agents/refund-aud-only.yaml
Every other spec needs you to build the agent first from its scaffold's instructions. Read LEARNINGS.md before you do; six undocumented Keystone behaviors have already cost us hours.
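The refund-aud-only run above depends on three things: the spec file, the committed agent folder, and XAI_API_KEY. A minimal pre-flight sketch, using only the Python standard library, can check for them before you spend an eval run. The function names preflight and eval_command are hypothetical, and the check for agents/stripe-refund-aud/agent.py assumes the entrypoint /agent/agent.py maps to a file of that name in the agent folder.

```python
import os
from pathlib import Path, PurePosixPath

# Paths taken from the commands shown above.
SPEC = PurePosixPath("specs/finance-agents/refund-aud-only.yaml")
# Assumption: the uploaded entrypoint /agent/agent.py corresponds to this file.
AGENT_ENTRY = PurePosixPath("agents/stripe-refund-aud/agent.py")


def preflight(repo_root, env):
    """Return a list of problems; an empty list means the run can proceed."""
    root = Path(repo_root)
    problems = []
    if not (root / SPEC).is_file():
        problems.append(f"missing spec: {SPEC}")
    if not (root / AGENT_ENTRY).is_file():
        problems.append(f"missing agent entrypoint: {AGENT_ENTRY}")
    if not env.get("XAI_API_KEY"):
        problems.append("XAI_API_KEY is not set")
    return problems


def eval_command():
    """Build the same ks invocation shown above as an argument list."""
    return ["ks", "eval", "run", str(SPEC)]


if __name__ == "__main__":
    issues = preflight(".", os.environ)
    if issues:
        for issue in issues:
            print("preflight:", issue)
    else:
        print("ready:", " ".join(eval_command()))
```

Run it from the repo root. The sketch only reports problems and prints the command; it does not call ks or the Keystone SDK itself.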
Hand a plain-English description of what you want to test to your AI coding tool (Claude Code, Cursor, etc.) and let it draft the spec. After you've run ks setup once, the skill files under .claude/, .cursor/, etc. teach those tools the canonical Keystone spec shape.
Watch the full walkthrough (1 min).
- LEARNINGS.md: what works, what doesn't, the six gotchas. Read this first.
- Spec anatomy: field-by-field walkthrough.
- Agent types: snapshot vs cli vs python vs image vs http.
- Concepts, glossary, best practices.
- Examples mapping: every spec mapped to its upstream Polarity example.
PRs welcome. The full guide is in .github/CONTRIBUTING.md. Quick version: open an issue, copy the matching template (agents/_template.md or specs/_template.yaml), validate locally with bash scripts/validate.sh, open a PR.
Security issues: see SECURITY.md. Conduct: Code of Conduct.
Apache 2.0. See LICENSE.
Copyright © 2026 Polarity, Inc.
