LLM Evaluation Workflow

Overview

This project provides a structured workflow for evaluating and comparing LLM outputs across multiple models.

The core problem it solves: manually copy/pasting prompts into multiple models is tedious and error-prone. By scripting prompt execution across APIs, outputs can be collected systematically and evaluated using a consistent framework.

View the full evaluation sheet: Google Sheets

Note: Scoring logic and aggregation formulas are in progress.

Automation

Two scripts send the same prompts to different models programmatically:

Script	Model	Description
`run_prompts_gemini.py`	Gemini 2.5 Flash	Sends prompts to the Gemini API
`run_prompts_claude.py`	Claude Haiku	Sends prompts to the Anthropic API

Both scripts:

- Read inputs from the same `Prompt Set.csv` file
- Send each prompt to the model via API
- Write model outputs to a new CSV file for evaluation

This enables side-by-side model comparison without manual copy/paste.
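
The loop the two scripts share can be sketched roughly as follows. This is a minimal sketch, not the repository's actual code: the function name `run_prompts`, the column names `input`, `prompt`, and `output`, and the `call_model` callable are all assumptions made for illustration. In practice `call_model` would wrap the Gemini or Anthropic client; here a stub stands in so the example runs without network access.

```python
import csv
from typing import Callable


def run_prompts(in_path: str, out_path: str, call_model: Callable[[str], str]) -> int:
    """Read prompts from a CSV, query a model, and write the outputs to a new CSV.

    Assumes (hypothetically) that the input file has `input` and `prompt` columns.
    Returns the number of rows written.
    """
    with open(in_path, newline="", encoding="utf-8") as src:
        rows = list(csv.DictReader(src))

    with open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=["input", "prompt", "output"])
        writer.writeheader()
        for row in rows:
            # Combine the instruction and the user query into one request.
            request = f"{row['prompt']}\n\n{row['input']}"
            writer.writerow(
                {
                    "input": row["input"],
                    "prompt": row["prompt"],
                    "output": call_model(request),
                }
            )
    return len(rows)


def echo_model(request: str) -> str:
    """Stand-in for a real API call; always returns a fixed label."""
    return "cancel_subscription"
```

Passing the model call in as a function keeps the CSV handling identical across models, so only the API wrapper needs to differ between the Gemini and Claude scripts.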

Note: API keys are read from environment variables (loaded from a local `.env` file) and are not included in this repository.
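
One way to load keys from a `.env` file without extra dependencies is sketched below. It is an illustration under assumptions, not the repository's method: the helper names `load_env_file` and `require_key` are hypothetical, and the scripts may well use a package such as python-dotenv instead.

```python
import os


def load_env_file(path: str = ".env") -> None:
    """Copy KEY=VALUE pairs from a .env file into os.environ (existing values win)."""
    if not os.path.exists(path):
        return
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            # Skip blanks, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip().strip("\"'"))


def require_key(name: str) -> str:
    """Return the named environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to .env or export it")
    return value
```

A script would call `load_env_file()` once at startup and then `require_key(...)` with whatever variable name it expects, so a missing key fails early rather than partway through a run.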

Workflow

Input → user query
Prompt → instruction given to the model
Output → model response (collected automatically via script)
Evaluation:
- Assign PASS / FAIL
- Tag issues using predefined categories

Issue Taxonomy

- Too verbose
- Wrong classification
- Format incorrect
- Missing information
- Hallucination
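
If evaluations are ever recorded in code rather than in the sheet, the taxonomy above maps naturally onto an enum. The sketch below is illustrative only: the `Issue` and `Evaluation` names are hypothetical, and the rule that an output passes exactly when no issue is tagged is an assumption, not something the evaluation sheet is known to enforce.

```python
from dataclasses import dataclass, field
from enum import Enum


class Issue(Enum):
    """The predefined issue categories, keyed by their display labels."""

    TOO_VERBOSE = "Too verbose"
    WRONG_CLASSIFICATION = "Wrong classification"
    FORMAT_INCORRECT = "Format incorrect"
    MISSING_INFORMATION = "Missing information"
    HALLUCINATION = "Hallucination"


@dataclass
class Evaluation:
    """One graded output. Assumption: PASS when no issues are tagged, FAIL otherwise."""

    input: str
    prompt: str
    output: str
    issues: list[Issue] = field(default_factory=list)

    @property
    def verdict(self) -> str:
        return "FAIL" if self.issues else "PASS"
```

Looking a tag up by label, for example `Issue("Too verbose")`, raises `ValueError` for anything outside the taxonomy, which catches typos in hand-entered tags.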

Metrics

Outputs are analyzed using simple aggregation:

- Count of each issue type
- Percentage of total failures

Example insight:

50% of outputs were too verbose → indicates a need for tighter prompt constraints
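
The aggregation described above can be sketched in a few lines. Since the scoring logic is still in progress, treat this as one possible reading rather than the project's formula: the function name `issue_breakdown` is hypothetical, and "percentage of total failures" is interpreted here as the share of failed outputs that carry a given tag.

```python
from collections import Counter


def issue_breakdown(issues_per_output: list[list[str]]) -> dict[str, tuple[int, float]]:
    """Map each issue tag to (count, percentage of failed outputs carrying it)."""
    # An output counts as a failure if it has at least one tag.
    failures = [tags for tags in issues_per_output if tags]
    # set(tags) avoids double-counting a tag repeated on the same output.
    counts = Counter(tag for tags in failures for tag in set(tags))
    total = len(failures)
    return {tag: (n, 100.0 * n / total) for tag, n in counts.most_common()}


graded = [
    ["Too verbose"],
    [],  # a passing output
    ["Too verbose", "Format incorrect"],
    ["Hallucination"],
    ["Wrong classification"],
]
breakdown = issue_breakdown(graded)
# breakdown["Too verbose"] is (2, 50.0): two of the four failed outputs
```

Because one output can carry several tags, the percentages may sum to more than 100.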

Why This Matters

Manual evaluation of LLM outputs is inconsistent and doesn't scale. This workflow introduces:

- Automation: prompt execution across multiple models without copy/paste
- Consistency: same prompts, same evaluation criteria across models
- Comparability: structured output enables side-by-side model analysis
- Iteration: failure patterns inform prompt improvements

Example

Input	Prompt	Output	Issue
I want to cancel	Classify intent	"The user intent is..."	Too verbose
