
ai-stability


ai-stability is a CLI-first LLM stability analyzer for developers who want to measure output consistency, detect prompt variance, and inspect unstable model behavior locally.

It runs the same prompt multiple times against the same model, compares the responses, computes a simple stability score, and saves a local JSON artifact for replay and debugging.

TL;DR

  • install: pipx install ai-stability
  • run: ai-stability run prompt.txt --n 5 --provider openai --model gpt-4.1-mini
  • get: repeated outputs, similarity scores, a stability label, inline diffs, and a saved JSON artifact

Who It Is For

  • prompt engineers testing reliability
  • AI application developers debugging flaky model behavior
  • teams comparing model changes before shipping
  • developers who want a local CLI, not a hosted eval platform

What It Looks Like

Example prompt file:

Explain why deterministic prompts can still produce non-deterministic LLM outputs in exactly three sentences.

Example output:

Summary
- Average similarity: 82.64%
- Stability score: 83/100
- Stability label: High stability

Diff
[- hidden system behavior -] [+ internal state shifts +]

Why It Exists

LLM outputs often vary even when the prompt, model, and calling code stay the same. That makes it harder to:

  • evaluate prompt reliability
  • spot regressions during model upgrades
  • understand whether output drift is minor wording variance or meaningful behavior change
  • build confidence in AI-powered developer tooling

ai-stability is intentionally narrow and local-first:

  • one prompt file in
  • repeated model calls
  • simple, explicit similarity scoring
  • readable terminal output
  • JSON artifact saved locally for replay and debugging

Features

  • CLI-first workflow with no database, dashboard, or hosted backend
  • repeated prompt execution against the same model
  • explicit pairwise similarity and aggregate stability scoring
  • run-by-run output review
  • inline reference-vs-run diffing for fast variance inspection
  • local JSON artifact saving for debugging and replay
  • provider abstraction with OpenAI implemented first

Requirements

  • Python 3.11+
  • An OpenAI API key in OPENAI_API_KEY

Install

Recommended for end users

pipx install ai-stability

For development

python -m venv .venv
.venv\Scripts\activate
python -m pip install -e .[dev]

On macOS/Linux, activate the environment with source .venv/bin/activate instead.

Configure

Set your API key in the shell. PowerShell:

$env:OPENAI_API_KEY="your_api_key"

bash/zsh:

export OPENAI_API_KEY="your_api_key"

You can copy .env.example for reference, but the CLI reads the key from the environment.
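
A quick way to confirm the key is visible to Python before running (a generic sanity check, not part of the CLI):

import os

# Generic check, not part of ai-stability: fail fast if the key is missing.
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"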

Quick Start

Run the analyzer:

ai-stability run prompt.txt --n 5 --provider openai --model gpt-4.1-mini

If you want to invoke it through the module instead of the installed script:

python -m ai_stability run prompt.txt --n 5 --provider openai --model gpt-4.1-mini

Example with a custom JSON output path:

ai-stability run prompt.txt --n 5 --provider openai --model gpt-4.1-mini --out results\sample-run.json

CLI Command

ai-stability run PROMPT_FILE --n 5 --provider openai --model MODEL_NAME

Current options:

  • --n: number of repeated runs, minimum 2
  • --provider: currently openai
  • --model: target model name
  • --temperature: sampling temperature, default 1.0
  • --out: optional output file or output directory for the JSON artifact
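
For example, a lower-temperature run that writes the JSON artifact into a directory, using only the flags documented above:

ai-stability run prompt.txt --n 8 --provider openai --model gpt-4.1-mini --temperature 0.2 --out results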

How Scoring Works

The v1 scoring heuristic is intentionally simple and inspectable (a code sketch follows the stability labels below):

  1. normalize whitespace in each output
  2. compute pairwise text similarity with Python's difflib.SequenceMatcher
  3. average all pairwise similarity scores
  4. convert the average to a 0-100 stability score

Stability labels:

  • 80-100: High stability
  • 50-79: Medium stability
  • 0-49: Low stability
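
In code, the heuristic amounts to roughly the following. This is a minimal sketch with illustrative names; the real implementation lives in src/ai_stability/scoring.py and may differ in detail.

from difflib import SequenceMatcher
from itertools import combinations

def normalize(text: str) -> str:
    # Collapse whitespace so formatting noise does not count as drift.
    return " ".join(text.split())

def stability(outputs: list[str]) -> tuple[int, str]:
    # Assumes at least two outputs (the CLI enforces --n >= 2).
    normalized = [normalize(o) for o in outputs]
    # Similarity for every unique pair of runs.
    ratios = [
        SequenceMatcher(None, a, b).ratio()
        for a, b in combinations(normalized, 2)
    ]
    # Rounding to the nearest integer matches the 82.64% -> 83 example above.
    score = round(100 * sum(ratios) / len(ratios))
    if score >= 80:
        return score, "High stability"
    if score >= 50:
        return score, "Medium stability"
    return score, "Low stability"

For example, stability(["same answer"] * 3) returns (100, "High stability").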

What the CLI Prints

  • summary first
  • average and pairwise similarity
  • final stability score and label
  • each run's output
  • a simple reference-vs-run diff for variation review (see the sketch below)
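
The inline diff shown in the example output can be produced from difflib opcodes. A minimal sketch of the [- removed -] / [+ added +] format, with an illustrative function name; the real logic lives in src/ai_stability/diffing.py and may differ:

from difflib import SequenceMatcher

def inline_diff(reference: str, run: str) -> str:
    ref_words, run_words = reference.split(), run.split()
    parts: list[str] = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, ref_words, run_words).get_opcodes():
        if tag == "equal":
            parts.extend(ref_words[i1:i2])
        else:
            # "replace" emits both sides; "delete"/"insert" emit one.
            if i1 < i2:
                parts.append("[- " + " ".join(ref_words[i1:i2]) + " -]")
            if j1 < j2:
                parts.append("[+ " + " ".join(run_words[j1:j2]) + " +]")
    return " ".join(parts)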

JSON Artifact

By default, results are written to results/ai-stability-YYYYMMDD-HHMMSS.json.

The JSON artifact includes:

  • prompt metadata
  • provider and model
  • all collected outputs
  • pairwise similarities
  • stability score and label
  • human-readable diffs
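
Replaying an artifact needs only the standard library. The field names below are assumptions based on the list above; inspect a real artifact for the exact schema:

import json

# Key names here are assumptions; check a real artifact for the exact schema.
with open("results/ai-stability-20250101-120000.json") as f:
    artifact = json.load(f)

print(artifact["model"], artifact["stability_score"], artifact["stability_label"])
for output in artifact["outputs"]:
    print(output[:120])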

Example Workflow

ai-stability run prompt.txt --n 5 --provider openai --model gpt-4.1-mini

Use this when you want to compare how stable a model is for a fixed prompt before shipping a prompt change, swapping models, or debugging flaky output behavior.

Run Tests

python -m pytest

Repository Structure

src/ai_stability/
  cli.py
  runner.py
  scoring.py
  diffing.py
  output.py
  storage.py
  providers/
    base.py
    openai_provider.py
tests/
  test_scoring.py
  test_runner.py


Release Process

ai-stability is published on PyPI. Releases are published from GitHub Actions with PyPI Trusted Publishing.

Typical release flow:

  1. update the version in pyproject.toml and src/ai_stability/__init__.py
  2. commit and push the release commit
  3. create and push a Git tag like vX.Y.Z
  4. let the publish.yml workflow run tests, build distributions, publish to PyPI, and create or update the matching GitHub release automatically

PyPI Trusted Publishing still requires one-time configuration on PyPI for this repository before automated publishing will succeed.

Example:

git tag vX.Y.Z
git push origin vX.Y.Z

Files to Review First

  • src/ai_stability/cli.py
  • src/ai_stability/runner.py
  • src/ai_stability/scoring.py
  • src/ai_stability/providers/openai_provider.py

Roadmap Notes

  • v1 runs requests sequentially on purpose.
  • Only OpenAI is implemented, but the provider boundary is small and ready for Anthropic later (see the sketch below).
  • The scoring heuristic is intentionally simple and inspectable rather than statistically sophisticated.
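
As a rough illustration of that provider boundary, here is a sketch with hypothetical class and method names; see providers/base.py and providers/openai_provider.py for the real interface:

from abc import ABC, abstractmethod

from openai import OpenAI

class Provider(ABC):
    @abstractmethod
    def complete(self, prompt: str, model: str, temperature: float) -> str:
        """Return one completion for the prompt."""

class OpenAIProvider(Provider):
    def __init__(self) -> None:
        # The OpenAI client reads OPENAI_API_KEY from the environment.
        self.client = OpenAI()

    def complete(self, prompt: str, model: str, temperature: float) -> str:
        response = self.client.chat.completions.create(
            model=model,
            temperature=temperature,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content or ""

Under this shape, adding Anthropic later would mean one new subclass implementing complete().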