This document covers the full system architecture for collaborators joining the project. It explains how the framework evolved, what each component does, and how all the pieces fit together.
The project started as a single Jupyter notebook (TcGLO_HPC_Local.ipynb, ~38MB) implementing an AI-powered benchmark for neuroscience predictive coding papers. It contained:
- A 36-factor glossary spanning 3 hypotheses (Suppression, Propagation, Ubiquitousness)
- A 4-part prompt system (Role, Logic, Constraints, Task) for consistent AI evaluation
- A multi-model pipeline running papers through Gemini, Claude, local LLMs, and manual review
- An interactive dashboard (ipywidgets) for manual scoring and data management
- 3D scatter plots and model comparison charts (Plotly)
The problem: everything was hardcoded to neuroscience. The glossary, contexts (Local Oddball / Global Oddball), hypothesis groups (H1/H2/H3), study types, prompt text, column names, and visualization labels were all neuroscience-specific.
We extracted all domain knowledge into a DomainConfig JSON schema, leaving the framework itself domain-agnostic.
These components needed little or no change:
| Component | Why Generic |
|---|---|
| `batch_evaluate_papers()` | Iterates PDFs, stores results — no domain logic |
| Factor grouping in sliders | Reads hypothesis/theory_group from glossary dynamically |
| Average calculation | Loops over groups found in glossary |
| Data persistence (JSON/CSV I/O) | Pure serialization |
| PDF text extraction | Domain-agnostic |
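To illustrate why the batch loop stays generic, here is a minimal sketch of its shape. The signature is illustrative, not the notebook's exact API: `extract_text` and `evaluate` are injected callables, so no domain knowledge lives in the loop itself.

```python
from pathlib import Path

def batch_evaluate_papers(pdf_paths, extract_text, evaluate, results=None):
    """Iterate over PDFs, evaluate each, and collect results keyed by filename.

    No domain logic here: text extraction and evaluation are injected,
    so the same loop serves any DomainConfig. Already-evaluated papers
    are skipped, which makes reruns cheap.
    """
    results = {} if results is None else results
    for pdf_path in map(Path, pdf_paths):
        key = pdf_path.name
        if key in results:
            continue  # skip papers evaluated in a previous run
        results[key] = evaluate(extract_text(pdf_path))
    return results
```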
| Element | How Generalized |
|---|---|
| LO/GO contexts (200+ references) | config.contexts[] — supports N contexts |
| H1/H2/H3 hypothesis groups | config.theory_groups[] — supports N groups |
| Study types (Empirical/Theoretical/Computational) | config.study_types[] with colors and symbols |
| Neuroscience prompt text | Template variables filled from config |
| Column naming (`Local_Oddball_{factor}`) | `make_column_name(prefix, factor)` |
| `lo_evaluations`/`go_evaluations` in `exec()` | N `{ctx_id}_evaluations` variables |
| Dashboard class and tab labels | StudyBenchmarkDashboard(config) |
| 3D scatter axis labels | Read from config.theory_groups[].full_label |
A single JSON file captures everything domain-specific:
```
domain          → id, name, description
contexts[]      → id, name, column_prefix, description
theory_groups[] → id, name, full_label, description
study_types[]   → id, label, color, symbol
scoring         → scale_min, scale_max, descriptions
prompts         → role, logic, constraints, task_template (with placeholders)
glossary{}      → factor definitions with id, def, rel, tag, modes, theory_group
```
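A minimal config illustrating the shape of the schema — every field value below is invented for illustration, not taken from a real domain file:

```json
{
  "domain": {"id": "demo", "name": "Demo Domain", "description": "…"},
  "contexts": [
    {"id": "C1", "name": "Context One", "column_prefix": "Context_One", "description": "…"}
  ],
  "theory_groups": [
    {"id": "G1", "name": "Group One", "full_label": "G1 (Group One)", "description": "…"}
  ],
  "study_types": [
    {"id": "emp", "label": "Empirical", "color": "#1f77b4", "symbol": "circle"}
  ],
  "scoring": {"scale_min": 0, "scale_max": 5, "descriptions": {}},
  "prompts": {"role": "…", "logic": "…", "constraints": "…", "task_template": "…"},
  "glossary": {
    "Example Factor": {
      "id": 1, "def": "…", "rel": "…", "tag": "…",
      "modes": ["C1"], "theory_group": "G1"
    }
  }
}
```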
Validation (core/config.py):
- All required keys present
- Glossary `modes` entries reference valid context IDs
- Glossary `theory_group` values reference valid group IDs
- Column prefixes are unique
- At least 1 context, 1 group, 1 factor
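The cross-reference checks above can be sketched as follows. This is a simplified sketch of what core/config.py likely does, not its exact code:

```python
def validate_config(config):
    """Validate a DomainConfig dict; raise ValueError on the first problem."""
    # All required top-level keys must be present
    for key in ("domain", "contexts", "theory_groups", "study_types",
                "scoring", "prompts", "glossary"):
        if key not in config:
            raise ValueError(f"missing required key: {key}")

    context_ids = {c["id"] for c in config["contexts"]}
    group_ids = {g["id"] for g in config["theory_groups"]}
    prefixes = [c["column_prefix"] for c in config["contexts"]]

    # Column prefixes must be unique, and the config must be non-trivial
    if len(prefixes) != len(set(prefixes)):
        raise ValueError("column prefixes must be unique")
    if not (context_ids and group_ids and config["glossary"]):
        raise ValueError("need at least 1 context, 1 group, 1 factor")

    # Every glossary factor must point at known contexts and a known group
    for name, factor in config["glossary"].items():
        unknown = set(factor["modes"]) - context_ids
        if unknown:
            raise ValueError(f"{name}: unknown context IDs {unknown}")
        if factor["theory_group"] not in group_ids:
            raise ValueError(f"{name}: unknown theory group")
```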
The central naming convention:
`{context.column_prefix}_{sanitized_factor_name}`
Where sanitization = spaces → underscores, parentheses removed.
Examples:
- Neuroscience: `Local_Oddball_Subtractive_Inhibition_SST`
- Electronics: `High_Frequency_Impedance_Matching`
Additional computed columns:
- `{prefix}_Count` — number of non-NaN factors
- `{prefix}_Averages_{group_id}_(group_name)` — mean score per theory group
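The naming rule fits in a couple of lines — a sketch of what core/columns.py's `make_column_name` presumably does, given the stated sanitization (spaces → underscores, parentheses removed):

```python
def sanitize_factor(name: str) -> str:
    """Apply the column sanitization rule: spaces -> underscores, parens dropped."""
    return name.replace(" ", "_").replace("(", "").replace(")", "")

def make_column_name(prefix: str, factor: str) -> str:
    """Build '{context.column_prefix}_{sanitized_factor_name}'."""
    return f"{prefix}_{sanitize_factor(factor)}"
```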
The 4-part prompt architecture:
```
┌─────────────┐
│ ROLE        │  Expert persona for the domain
├─────────────┤
│ LOGIC       │  Context definitions, scoring scale, full glossary
├─────────────┤
│ CONSTRAINTS │  Independence rules, no hallucination, format compliance
├─────────────┤
│ TASK        │  Study text + glossary keys + required output format
└─────────────┘
```
Template variables are filled from the config:
- `{domain_name}` → "Predictive Coding in Neuroscience"
- `{context_definitions}` → auto-generated from `config.contexts[]`
- `{scoring_scale}` → auto-generated from `config.scoring`
- `{full_glossary_definitions}` → auto-generated from `config.glossary`
- `{context_eval_blocks}` → generates `lo_evaluations = {...}` / `go_evaluations = {...}` etc.
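The substitution step can be sketched with plain `str.format`. This is a simplified illustration, not core/prompts.py itself — the real module also generates the glossary and per-context evaluation blocks:

```python
def fill_prompt(template: str, config: dict) -> str:
    """Fill {placeholders} in a prompt template from a DomainConfig dict.

    Sketch only: shows how context definitions and the scoring scale
    are derived from config and substituted into the template.
    """
    context_definitions = "\n".join(
        f"- {c['name']}: {c['description']}" for c in config["contexts"]
    )
    scoring = config["scoring"]
    scoring_scale = f"{scoring['scale_min']}-{scoring['scale_max']}"
    return template.format(
        domain_name=config["domain"]["name"],
        context_definitions=context_definitions,
        scoring_scale=scoring_scale,
    )
```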
```
PDF      → get_info_from_pdf()               → text
text     → create_prompt(text, config)       → prompt
prompt   → AI model                          → response
response → parse_ai_response(response, config) → {
    'evaluations': {'LO': {...}, 'GO': {...}},
    'first_author': '...',
    'publication_year': '...',
    'study_type': '...',
    'reasoning_log_text': '...'
}
```
Key design decisions:
- N-context return: returns a `{'evaluations': {ctx_id: {factor: score}}}` dict instead of separate `lo_evaluations`/`go_evaluations`
- `exec()` parsing: the AI response is executable Python code, run in a sandboxed scope with only `np` available
- Chunking: long papers are split into `<document_part>` XML tags to handle token limits
- Model routing: supports cloud APIs (Gemini via `ai.generate_text`), local models (via an OpenAI-compatible client), with prefix-based routing
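The `exec()`-based parsing step can be sketched as below. This is an assumption-laden illustration, not the real `parse_ai_response`: it assumes context IDs like `LO` map to variables named `lo_evaluations`, and it only collects the evaluation dicts (the real parser also extracts author, year, and study type):

```python
import numpy as np

def parse_eval_code(code: str, context_ids):
    """Run AI-returned assignment code in a restricted scope and collect
    the per-context evaluation dicts.

    Only `np` is exposed (plus empty builtins), so a response that
    references anything else fails fast instead of executing silently.
    """
    scope = {"np": np, "__builtins__": {}}
    exec(code, scope)  # response is expected to be pure dict assignments
    return {cid: scope.get(f"{cid.lower()}_evaluations", {}) for cid in context_ids}
```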
StudyBenchmarkDashboard replaces the old NeuroBenchmarkDashboard:
| Old (Hardcoded) | New (Parameterized) |
|---|---|
| `widgets_lo` / `widgets_go` | `widgets_by_context = {ctx_id: {}}` |
| Tab labels "Local Oddball (LO)" | config.contexts[].name |
| Hypothesis names H1/H2/H3 | config.theory_groups[].full_label |
| Study types dropdown | config.study_types[].label |
| Column formula `Local_Oddball_{factor}` | `make_column_name(prefix, factor)` |
Scatter Dispatch — selects chart type by group count:
| Groups | Visualization | Example Domain |
|---|---|---|
| 1 | 1D strip plot | Single-hypothesis evaluation |
| 2 | 2D scatter | Binary classification domains |
| 3 | 3D scatter + projection lines | Neuroscience (H1/H2/H3), Electronics (G1/G2/G3) |
| 4+ | Radar/spider chart | Multi-dimensional domains |
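The dispatch in the table reduces to a small lookup on group count. A minimal sketch — the function and return names are invented here; core/visualization.py presumably dispatches to Plotly figure builders instead of returning strings:

```python
def pick_chart_type(n_groups: int) -> str:
    """Map theory-group count to a chart type, per the dispatch table:
    1 -> 1D strip, 2 -> 2D scatter, 3 -> 3D scatter, 4+ -> radar."""
    if n_groups < 1:
        raise ValueError("need at least one theory group")
    return {1: "strip_1d", 2: "scatter_2d", 3: "scatter_3d"}.get(n_groups, "radar")
```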
Agent Comparison — N subplots (one per theory group), dots per model, filtered by context.
Summary — Bar chart of averages + NxN distance heatmap.
Skills are Claude Code packages that provide domain-specific knowledge:
| Skill | Type | Purpose |
|---|---|---|
| `study-eval` | Background | Framework docs, DomainConfig schema |
| `study-eval-neuro` | User-invocable | Evaluate neuro papers (`/study-eval-neuro paper.pdf`) |
| `study-eval-electronics` | User-invocable | Evaluate electronics papers |
| `study-eval-glossary` | User-invocable | View/search/edit any glossary |
| `study-eval-compare` | User-invocable | Compare model evaluations |
Each domain skill contains:
- `SKILL.md` — instructions with frontmatter (name, description, allowed-tools)
- `glossary.json` — the domain's factor glossary
- Supporting docs (`glossary-reference.md`, `output-format.md`, etc.)
```
┌──────────────┐
│ DomainConfig │
│   (.json)    │
└──────┬───────┘
       │
┌────────────┼────────────┐
│            │            │
▼            ▼            ▼
┌─────────┐  ┌─────────┐  ┌──────────┐
│ Prompts │  │ Columns │  │Dashboard │
│  (.py)  │  │  (.py)  │  │  (.py)   │
└────┬─────┘ └────┬────┘  └────┬─────┘
     │            │            │
     ▼            │            │
┌──────────────┐  │            │
│  Evaluation  │  │            │
│   Pipeline   │──┘            │
│    (.py)     │               │
└──────┬───────┘               │
       │                       │
       ▼                       ▼
┌──────────────┐         ┌──────────────┐
│  Benchmark   │◄────────│ Interactive  │
│  DataFrame   │         │ UI / Widgets │
└──────┬───────┘         └──────────────┘
       │
       ▼
┌──────────────┐
│Visualization │
│ (3D, compare)│
└──────────────┘
```
The original notebook TcGLO_HPC_Local.ipynb is completely unchanged and still runs standalone. The core/ modules and domains/ configs are a parallel extraction — they don't modify or depend on the notebook.
The neuroscience DomainConfig produces identical column names to the original notebook:
- `Local_Oddball_Subtractive_Inhibition_SST` (same as before)
- `Global_Oddball_Averages_H1_(Suppression)` (same as before)
- Copy `domains/electronics_architecture.json` as a starting point
- Replace: domain info, contexts, theory groups, study types
- Write your glossary factors with definitions and relationships
- Customize the prompt templates for your field's terminology
- Run `load_domain_config()` — it will validate all cross-references
- Optionally create a skill in `.claude/skills/study-eval-<domain>/`
The framework automatically handles:
- Column naming for your contexts and factors
- N-context evaluation variables in the AI prompt
- Dashboard tabs and scoring panels
- Visualization dispatch for your group count
| File | Lines | Purpose |
|---|---|---|
| `core/config.py` | ~180 | Config loading, validation, queries |
| `core/columns.py` | ~110 | Column name generation |
| `core/prompts.py` | ~200 | Prompt template filling |
| `core/evaluation.py` | ~220 | PDF extraction, AI evaluation, response parsing |
| `core/dashboard.py` | ~340 | Interactive dashboard with N contexts |
| `core/visualization.py` | ~350 | Scatter dispatch, agent comparison, summary |
| `domains/neuroscience_predictive_coding.json` | ~500 | Full neuro config (36 factors, 4 prompts) |
| `domains/electronics_architecture.json` | ~300 | Electronics starter (18 factors) |