
ai-stability


ai-stability is a CLI-first LLM stability analyzer for developers who want to measure output consistency, detect prompt variance, and inspect unstable model behavior locally.

It runs the same prompt multiple times against the same model, compares the responses, computes a simple stability score, and saves a local JSON artifact for replay and debugging.

TL;DR

  • install: pipx install ai-stability
  • run: ai-stability run prompt.txt --n 5 --provider openai --model gpt-4.1-mini
  • get: repeated outputs, similarity scores, a stability label, inline diffs, and a saved JSON artifact

Who It Is For

  • prompt engineers testing reliability
  • AI application developers debugging flaky model behavior
  • teams comparing model changes before shipping
  • developers who want a local CLI, not a hosted eval platform

What It Looks Like

Example prompt file:

Explain why deterministic prompts can still produce non-deterministic LLM outputs in exactly three sentences.

Example output:

Summary
- Average similarity: 82.64%
- Stability score: 83/100
- Stability label: High stability

Diff
[- hidden system behavior -] [+ internal state shifts +]

Why It Exists

LLM outputs often vary even when the prompt, model, and calling code stay the same. That makes it harder to:

  • evaluate prompt reliability
  • spot regressions during model upgrades
  • understand whether output drift is minor wording variance or meaningful behavior change
  • build confidence in AI-powered developer tooling

ai-stability is intentionally narrow and local-first:

  • one prompt file in
  • repeated model calls
  • simple, explicit similarity scoring
  • readable terminal output
  • JSON artifact saved locally for replay and debugging

Features

  • CLI-first workflow with no database, dashboard, or hosted backend
  • repeated prompt execution against the same model
  • explicit pairwise similarity and aggregate stability scoring
  • run-by-run output review
  • inline reference-vs-run diffing for fast variance inspection
  • local JSON artifact saving for debugging and replay
  • provider abstraction with OpenAI implemented first

Requirements

  • Python 3.11+
  • An OpenAI API key in OPENAI_API_KEY

Install

Recommended for end users

pipx install ai-stability

For development

python -m venv .venv
.venv\Scripts\activate
python -m pip install -e .[dev]

On macOS/Linux, activate the environment with source .venv/bin/activate instead.

Configure

Set your API key in the shell. PowerShell:

$env:OPENAI_API_KEY="your_api_key"

bash/zsh:

export OPENAI_API_KEY="your_api_key"

You can copy .env.example for reference, but the CLI reads the key from the environment.
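
A quick way to confirm the key is visible to Python before running (a generic sanity check, not part of the CLI):

import os

# Generic check, not part of ai-stability: fail fast if the key is missing.
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"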

Quick Start

Run the analyzer:

ai-stability run prompt.txt --n 5 --provider openai --model gpt-4.1-mini

If you want to invoke it through the module instead of the installed script:

python -m ai_stability run prompt.txt --n 5 --provider openai --model gpt-4.1-mini

Example with a custom JSON output path:

ai-stability run prompt.txt --n 5 --provider openai --model gpt-4.1-mini --out results\sample-run.json

CLI Command

ai-stability run PROMPT_FILE --n 5 --provider openai --model MODEL_NAME

Current options:

  • --n: number of repeated runs, minimum 2
  • --provider: currently openai
  • --model: target model name
  • --temperature: sampling temperature, default 1.0
  • --out: optional output file or output directory for the JSON artifact
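
For example, a lower-temperature run that writes the JSON artifact into a directory, using only the flags documented above:

ai-stability run prompt.txt --n 8 --provider openai --model gpt-4.1-mini --temperature 0.2 --out results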

How Scoring Works

The v1 scoring heuristic is intentionally simple and inspectable (a code sketch follows the stability labels below):

  1. normalize whitespace in each output
  2. compute pairwise text similarity with Python's difflib.SequenceMatcher
  3. average all pairwise similarity scores
  4. convert the average to a 0-100 stability score

Stability labels:

  • 80-100: High stability
  • 50-79: Medium stability
  • 0-49: Low stability
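
In code, the heuristic amounts to roughly the following. This is a minimal sketch with illustrative names; the real implementation lives in src/ai_stability/scoring.py and may differ in detail.

from difflib import SequenceMatcher
from itertools import combinations

def normalize(text: str) -> str:
    # Collapse whitespace so formatting noise does not count as drift.
    return " ".join(text.split())

def stability(outputs: list[str]) -> tuple[int, str]:
    # Assumes at least two outputs (the CLI enforces --n >= 2).
    normalized = [normalize(o) for o in outputs]
    # Similarity for every unique pair of runs.
    ratios = [
        SequenceMatcher(None, a, b).ratio()
        for a, b in combinations(normalized, 2)
    ]
    # Rounding to the nearest integer matches the 82.64% -> 83 example above.
    score = round(100 * sum(ratios) / len(ratios))
    if score >= 80:
        return score, "High stability"
    if score >= 50:
        return score, "Medium stability"
    return score, "Low stability"

For example, stability(["same answer"] * 3) returns (100, "High stability").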

What the CLI Prints

  • summary first
  • average and pairwise similarity
  • final stability score and label
  • each run's output
  • a simple reference-vs-run diff for variation review (see the sketch below)
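
The inline diff shown in the example output can be produced from difflib opcodes. A minimal sketch of the [- removed -] / [+ added +] format, with an illustrative function name; the real logic lives in src/ai_stability/diffing.py and may differ:

from difflib import SequenceMatcher

def inline_diff(reference: str, run: str) -> str:
    ref_words, run_words = reference.split(), run.split()
    parts: list[str] = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, ref_words, run_words).get_opcodes():
        if tag == "equal":
            parts.extend(ref_words[i1:i2])
        else:
            # "replace" emits both sides; "delete"/"insert" emit one.
            if i1 < i2:
                parts.append("[- " + " ".join(ref_words[i1:i2]) + " -]")
            if j1 < j2:
                parts.append("[+ " + " ".join(run_words[j1:j2]) + " +]")
    return " ".join(parts)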

JSON Artifact

By default, results are written to results/ai-stability-YYYYMMDD-HHMMSS.json.

The JSON artifact includes:

  • prompt metadata
  • provider and model
  • all collected outputs
  • pairwise similarities
  • stability score and label
  • human-readable diffs
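
Replaying an artifact needs only the standard library. The field names below are assumptions based on the list above; inspect a real artifact for the exact schema:

import json

# Key names here are assumptions; check a real artifact for the exact schema.
with open("results/ai-stability-20250101-120000.json") as f:
    artifact = json.load(f)

print(artifact["model"], artifact["stability_score"], artifact["stability_label"])
for output in artifact["outputs"]:
    print(output[:120])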

Example Workflow

ai-stability run prompt.txt --n 5 --provider openai --model gpt-4.1-mini

Use this when you want to compare how stable a model is for a fixed prompt before shipping a prompt change, swapping models, or debugging flaky output behavior.

Run Tests

python -m pytest

Repository Structure

src/ai_stability/
  cli.py
  runner.py
  scoring.py
  diffing.py
  output.py
  storage.py
  providers/
    base.py
    openai_provider.py
tests/
  test_scoring.py
  test_runner.py


Release Process

ai-stability is published on PyPI. Releases are published from GitHub Actions with PyPI Trusted Publishing.

Typical release flow:

  1. update the version in pyproject.toml and src/ai_stability/__init__.py
  2. commit and push the release commit
  3. create and push a Git tag like vX.Y.Z
  4. let the publish.yml workflow run tests, build distributions, publish to PyPI, and create or update the matching GitHub release automatically

PyPI Trusted Publishing still requires one-time configuration on PyPI for this repository before automated publishing will succeed.

Example:

git tag vX.Y.Z
git push origin vX.Y.Z

Files to Review First

  • src/ai_stability/cli.py
  • src/ai_stability/runner.py
  • src/ai_stability/scoring.py
  • src/ai_stability/providers/openai_provider.py

Roadmap Notes

  • v1 runs requests sequentially on purpose.
  • Only OpenAI is implemented, but the provider boundary is small and ready for Anthropic later (see the sketch below).
  • The scoring heuristic is intentionally simple and inspectable rather than statistically sophisticated.
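
As a rough illustration of that provider boundary, here is a sketch with hypothetical class and method names; see providers/base.py and providers/openai_provider.py for the real interface:

from abc import ABC, abstractmethod

from openai import OpenAI

class Provider(ABC):
    @abstractmethod
    def complete(self, prompt: str, model: str, temperature: float) -> str:
        """Return one completion for the prompt."""

class OpenAIProvider(Provider):
    def __init__(self) -> None:
        # The OpenAI client reads OPENAI_API_KEY from the environment.
        self.client = OpenAI()

    def complete(self, prompt: str, model: str, temperature: float) -> str:
        response = self.client.chat.completions.create(
            model=model,
            temperature=temperature,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content or ""

Under this shape, adding Anthropic later would mean one new subclass implementing complete().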