This document covers the full system architecture for collaborators joining the project. It explains how the framework evolved, what each component does, and how all the pieces fit together.
The project started as a single Jupyter notebook (TcGLO_HPC_Local.ipynb, ~38MB) implementing an AI-powered benchmark for neuroscience predictive coding papers. It contained:
- A 36-factor glossary spanning 3 hypotheses (Suppression, Propagation, Ubiquitousness)
- A 4-part prompt system (Role, Logic, Constraints, Task) for consistent AI evaluation
- A multi-model pipeline running papers through Gemini, Claude, local LLMs, and manual review
- An interactive dashboard (ipywidgets) for manual scoring and data management
- 3D scatter plots and model comparison charts (Plotly)
The problem: everything was hardcoded to neuroscience. The glossary, contexts (Local Oddball / Global Oddball), hypothesis groups (H1/H2/H3), study types, prompt text, column names, and visualization labels were all neuroscience-specific.
We extracted all domain knowledge into a DomainConfig JSON schema, leaving the framework itself domain-agnostic.
These components needed little or no change:
| Component | Why Generic |
|---|---|
| `batch_evaluate_papers()` | Iterates PDFs, stores results — no domain logic |
| Factor grouping in sliders | Reads hypothesis/theory_group from glossary dynamically |
| Average calculation | Loops over groups found in glossary |
| Data persistence (JSON/CSV I/O) | Pure serialization |
| PDF text extraction | Domain-agnostic |
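To illustrate why the batch loop stays generic, here is a minimal sketch of its shape. The signature is illustrative, not the notebook's exact API: `extract_text` and `evaluate` are injected callables, so no domain knowledge lives in the loop itself.

```python
from pathlib import Path

def batch_evaluate_papers(pdf_paths, extract_text, evaluate, results=None):
    """Iterate over PDFs, evaluate each, and collect results keyed by filename.

    No domain logic here: text extraction and evaluation are injected,
    so the same loop serves any DomainConfig. Already-evaluated papers
    are skipped, which makes reruns cheap.
    """
    results = {} if results is None else results
    for pdf_path in map(Path, pdf_paths):
        key = pdf_path.name
        if key in results:
            continue  # skip papers evaluated in a previous run
        results[key] = evaluate(extract_text(pdf_path))
    return results
```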
| Element | How Generalized |
|---|---|
| LO/GO contexts (200+ references) | config.contexts[] — supports N contexts |
| H1/H2/H3 hypothesis groups | config.theory_groups[] — supports N groups |
| Study types (Empirical/Theoretical/Computational) | config.study_types[] with colors and symbols |
| Neuroscience prompt text | Template variables filled from config |
| Column naming (`Local_Oddball_{factor}`) | `make_column_name(prefix, factor)` |
| `lo_evaluations`/`go_evaluations` in `exec()` | N `{ctx_id}_evaluations` variables |
| Dashboard class and tab labels | StudyBenchmarkDashboard(config) |
| 3D scatter axis labels | Read from config.theory_groups[].full_label |
A single JSON file captures everything domain-specific:
```
domain          → id, name, description
contexts[]      → id, name, column_prefix, description
theory_groups[] → id, name, full_label, description
study_types[]   → id, label, color, symbol
scoring         → scale_min, scale_max, descriptions
prompts         → role, logic, constraints, task_template (with placeholders)
glossary{}      → factor definitions with id, def, rel, tag, modes, theory_group
```
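A minimal config illustrating the shape of the schema — every field value below is invented for illustration, not taken from a real domain file:

```json
{
  "domain": {"id": "demo", "name": "Demo Domain", "description": "…"},
  "contexts": [
    {"id": "C1", "name": "Context One", "column_prefix": "Context_One", "description": "…"}
  ],
  "theory_groups": [
    {"id": "G1", "name": "Group One", "full_label": "G1 (Group One)", "description": "…"}
  ],
  "study_types": [
    {"id": "emp", "label": "Empirical", "color": "#1f77b4", "symbol": "circle"}
  ],
  "scoring": {"scale_min": 0, "scale_max": 5, "descriptions": {}},
  "prompts": {"role": "…", "logic": "…", "constraints": "…", "task_template": "…"},
  "glossary": {
    "Example Factor": {
      "id": 1, "def": "…", "rel": "…", "tag": "…",
      "modes": ["C1"], "theory_group": "G1"
    }
  }
}
```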
Validation (core/config.py):
- All required keys present
- Glossary `modes` entries reference valid context IDs
- Glossary `theory_group` values reference valid group IDs
- Column prefixes are unique
- At least 1 context, 1 group, 1 factor
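The cross-reference checks above can be sketched as follows. This is a simplified sketch of what core/config.py likely does, not its exact code:

```python
def validate_config(config):
    """Validate a DomainConfig dict; raise ValueError on the first problem."""
    # All required top-level keys must be present
    for key in ("domain", "contexts", "theory_groups", "study_types",
                "scoring", "prompts", "glossary"):
        if key not in config:
            raise ValueError(f"missing required key: {key}")

    context_ids = {c["id"] for c in config["contexts"]}
    group_ids = {g["id"] for g in config["theory_groups"]}
    prefixes = [c["column_prefix"] for c in config["contexts"]]

    # Column prefixes must be unique, and the config must be non-trivial
    if len(prefixes) != len(set(prefixes)):
        raise ValueError("column prefixes must be unique")
    if not (context_ids and group_ids and config["glossary"]):
        raise ValueError("need at least 1 context, 1 group, 1 factor")

    # Every glossary factor must point at known contexts and a known group
    for name, factor in config["glossary"].items():
        unknown = set(factor["modes"]) - context_ids
        if unknown:
            raise ValueError(f"{name}: unknown context IDs {unknown}")
        if factor["theory_group"] not in group_ids:
            raise ValueError(f"{name}: unknown theory group")
```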
The central naming convention:
`{context.column_prefix}_{sanitized_factor_name}`
Where sanitization = spaces → underscores, parentheses removed.
Examples:
- Neuroscience: `Local_Oddball_Subtractive_Inhibition_SST`
- Electronics: `High_Frequency_Impedance_Matching`
Additional computed columns:
- `{prefix}_Count` — number of non-NaN factors
- `{prefix}_Averages_{group_id}_(group_name)` — mean score per theory group
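The naming rule fits in a couple of lines — a sketch of what core/columns.py's `make_column_name` presumably does, given the stated sanitization (spaces → underscores, parentheses removed):

```python
def sanitize_factor(name: str) -> str:
    """Apply the column sanitization rule: spaces -> underscores, parens dropped."""
    return name.replace(" ", "_").replace("(", "").replace(")", "")

def make_column_name(prefix: str, factor: str) -> str:
    """Build '{context.column_prefix}_{sanitized_factor_name}'."""
    return f"{prefix}_{sanitize_factor(factor)}"
```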
The 4-part prompt architecture:
```
┌─────────────┐
│ ROLE        │  Expert persona for the domain
├─────────────┤
│ LOGIC       │  Context definitions, scoring scale, full glossary
├─────────────┤
│ CONSTRAINTS │  Independence rules, no hallucination, format compliance
├─────────────┤
│ TASK        │  Study text + glossary keys + required output format
└─────────────┘
```
Template variables are filled from the config:
- `{domain_name}` → "Predictive Coding in Neuroscience"
- `{context_definitions}` → auto-generated from `config.contexts[]`
- `{scoring_scale}` → auto-generated from `config.scoring`
- `{full_glossary_definitions}` → auto-generated from `config.glossary`
- `{context_eval_blocks}` → generates `lo_evaluations = {...}` / `go_evaluations = {...}` etc.
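The substitution step can be sketched with plain `str.format`. This is a simplified illustration, not core/prompts.py itself — the real module also generates the glossary and per-context evaluation blocks:

```python
def fill_prompt(template: str, config: dict) -> str:
    """Fill {placeholders} in a prompt template from a DomainConfig dict.

    Sketch only: shows how context definitions and the scoring scale
    are derived from config and substituted into the template.
    """
    context_definitions = "\n".join(
        f"- {c['name']}: {c['description']}" for c in config["contexts"]
    )
    scoring = config["scoring"]
    scoring_scale = f"{scoring['scale_min']}-{scoring['scale_max']}"
    return template.format(
        domain_name=config["domain"]["name"],
        context_definitions=context_definitions,
        scoring_scale=scoring_scale,
    )
```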
```
PDF      → get_info_from_pdf()               → text
text     → create_prompt(text, config)       → prompt
prompt   → AI model                          → response
response → parse_ai_response(response, config) → {
    'evaluations': {'LO': {...}, 'GO': {...}},
    'first_author': '...',
    'publication_year': '...',
    'study_type': '...',
    'reasoning_log_text': '...'
}
```
Key design decisions:
- N-context return: returns a `{'evaluations': {ctx_id: {factor: score}}}` dict instead of separate `lo_evaluations`/`go_evaluations`
- `exec()` parsing: the AI response is executable Python code, run in a sandboxed scope with only `np` available
- Chunking: long papers are split into `<document_part>` XML tags to handle token limits
- Model routing: supports cloud APIs (Gemini via `ai.generate_text`), local models (via an OpenAI-compatible client), with prefix-based routing
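The `exec()`-based parsing step can be sketched as below. This is an assumption-laden illustration, not the real `parse_ai_response`: it assumes context IDs like `LO` map to variables named `lo_evaluations`, and it only collects the evaluation dicts (the real parser also extracts author, year, and study type):

```python
import numpy as np

def parse_eval_code(code: str, context_ids):
    """Run AI-returned assignment code in a restricted scope and collect
    the per-context evaluation dicts.

    Only `np` is exposed (plus empty builtins), so a response that
    references anything else fails fast instead of executing silently.
    """
    scope = {"np": np, "__builtins__": {}}
    exec(code, scope)  # response is expected to be pure dict assignments
    return {cid: scope.get(f"{cid.lower()}_evaluations", {}) for cid in context_ids}
```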
StudyBenchmarkDashboard replaces the old NeuroBenchmarkDashboard:
| Old (Hardcoded) | New (Parameterized) |
|---|---|
| `widgets_lo` / `widgets_go` | `widgets_by_context = {ctx_id: {}}` |
| Tab labels "Local Oddball (LO)" | config.contexts[].name |
| Hypothesis names H1/H2/H3 | config.theory_groups[].full_label |
| Study types dropdown | config.study_types[].label |
| Column formula `Local_Oddball_{factor}` | `make_column_name(prefix, factor)` |
Scatter Dispatch — selects chart type by group count:
| Groups | Visualization | Example Domain |
|---|---|---|
| 1 | 1D strip plot | Single-hypothesis evaluation |
| 2 | 2D scatter | Binary classification domains |
| 3 | 3D scatter + projection lines | Neuroscience (H1/H2/H3), Electronics (G1/G2/G3) |
| 4+ | Radar/spider chart | Multi-dimensional domains |
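The dispatch in the table reduces to a small lookup on group count. A minimal sketch — the function and return names are invented here; core/visualization.py presumably dispatches to Plotly figure builders instead of returning strings:

```python
def pick_chart_type(n_groups: int) -> str:
    """Map theory-group count to a chart type, per the dispatch table:
    1 -> 1D strip, 2 -> 2D scatter, 3 -> 3D scatter, 4+ -> radar."""
    if n_groups < 1:
        raise ValueError("need at least one theory group")
    return {1: "strip_1d", 2: "scatter_2d", 3: "scatter_3d"}.get(n_groups, "radar")
```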
Agent Comparison — N subplots (one per theory group), dots per model, filtered by context.
Summary — Bar chart of averages + NxN distance heatmap.
Skills are Claude Code packages that provide domain-specific knowledge:
| Skill | Type | Purpose |
|---|---|---|
| `study-eval` | Background | Framework docs, DomainConfig schema |
| `study-eval-neuro` | User-invocable | Evaluate neuro papers (`/study-eval-neuro paper.pdf`) |
| `study-eval-electronics` | User-invocable | Evaluate electronics papers |
| `study-eval-glossary` | User-invocable | View/search/edit any glossary |
| `study-eval-compare` | User-invocable | Compare model evaluations |
Each domain skill contains:
- `SKILL.md` — instructions with frontmatter (name, description, allowed-tools)
- `glossary.json` — the domain's factor glossary
- Supporting docs (`glossary-reference.md`, `output-format.md`, etc.)
```
┌──────────────┐
│ DomainConfig │
│   (.json)    │
└──────┬───────┘
       │
┌────────────┼────────────┐
│            │            │
▼            ▼            ▼
┌─────────┐  ┌─────────┐  ┌──────────┐
│ Prompts │  │ Columns │  │Dashboard │
│  (.py)  │  │  (.py)  │  │  (.py)   │
└────┬─────┘ └────┬────┘  └────┬─────┘
     │            │            │
     ▼            │            │
┌──────────────┐  │            │
│  Evaluation  │  │            │
│   Pipeline   │──┘            │
│    (.py)     │               │
└──────┬───────┘               │
       │                       │
       ▼                       ▼
┌──────────────┐         ┌──────────────┐
│  Benchmark   │◄────────│ Interactive  │
│  DataFrame   │         │ UI / Widgets │
└──────┬───────┘         └──────────────┘
       │
       ▼
┌──────────────┐
│Visualization │
│ (3D, compare)│
└──────────────┘
```
The original notebook TcGLO_HPC_Local.ipynb is completely unchanged and still runs standalone. The core/ modules and domains/ configs are a parallel extraction — they don't modify or depend on the notebook.
The neuroscience DomainConfig produces identical column names to the original notebook:
- `Local_Oddball_Subtractive_Inhibition_SST` (same as before)
- `Global_Oddball_Averages_H1_(Suppression)` (same as before)
- Copy `domains/electronics_architecture.json` as a starting point
- Replace: domain info, contexts, theory groups, study types
- Write your glossary factors with definitions and relationships
- Customize the prompt templates for your field's terminology
- Run `load_domain_config()` — it will validate all cross-references
- Optionally create a skill in `.claude/skills/study-eval-<domain>/`
The framework automatically handles:
- Column naming for your contexts and factors
- N-context evaluation variables in the AI prompt
- Dashboard tabs and scoring panels
- Visualization dispatch for your group count
| File | Lines | Purpose |
|---|---|---|
| `core/config.py` | ~180 | Config loading, validation, queries |
| `core/columns.py` | ~110 | Column name generation |
| `core/prompts.py` | ~200 | Prompt template filling |
| `core/evaluation.py` | ~220 | PDF extraction, AI evaluation, response parsing |
| `core/dashboard.py` | ~340 | Interactive dashboard with N contexts |
| `core/visualization.py` | ~350 | Scatter dispatch, agent comparison, summary |
| `domains/neuroscience_predictive_coding.json` | ~500 | Full neuro config (36 factors, 4 prompts) |
| `domains/electronics_architecture.json` | ~300 | Electronics starter (18 factors) |