
# Architecture Overview

This document covers the full system architecture for collaborators joining the project. It explains how the framework evolved, what each component does, and how all the pieces fit together.

## Origin Story

The project started as a single Jupyter notebook (TcGLO_HPC_Local.ipynb, ~38MB) implementing an AI-powered benchmark for neuroscience predictive coding papers. It contained:

  • A 36-factor glossary spanning 3 hypotheses (Suppression, Propagation, Ubiquitousness)
  • A 4-part prompt system (Role, Logic, Constraints, Task) for consistent AI evaluation
  • A multi-model pipeline running papers through Gemini, Claude, local LLMs, and manual review
  • An interactive dashboard (ipywidgets) for manual scoring and data management
  • 3D scatter plots and model comparison charts (Plotly)

The problem: everything was hardcoded to neuroscience. The glossary, contexts (Local Oddball / Global Oddball), hypothesis groups (H1/H2/H3), study types, prompt text, column names, and visualization labels were all neuroscience-specific.

## The Generalization

We extracted all domain knowledge into a DomainConfig JSON schema, leaving the framework itself domain-agnostic.

### What Was Already Generic

These components needed little or no change:

| Component | Why Generic |
| --- | --- |
| `batch_evaluate_papers()` | Iterates PDFs, stores results — no domain logic |
| Factor grouping in sliders | Reads `hypothesis`/`theory_group` from glossary dynamically |
| Average calculation | Loops over groups found in glossary |
| Data persistence (JSON/CSV I/O) | Pure serialization |
| PDF text extraction | Domain-agnostic |

### What Was Hardcoded

| Element | How Generalized |
| --- | --- |
| LO/GO contexts (200+ references) | `config.contexts[]` — supports N contexts |
| H1/H2/H3 hypothesis groups | `config.theory_groups[]` — supports N groups |
| Study types (Empirical/Theoretical/Computational) | `config.study_types[]` with colors and symbols |
| Neuroscience prompt text | Template variables filled from config |
| Column naming (`Local_Oddball_{factor}`) | `make_column_name(prefix, factor)` |
| `lo_evaluations`/`go_evaluations` in `exec()` | N `{ctx_id}_evaluations` variables |
| Dashboard class and tab labels | `StudyBenchmarkDashboard(config)` |
| 3D scatter axis labels | Read from `config.theory_groups[].full_label` |

## Component Deep Dive

### 1. DomainConfig (`domains/*.json`)

A single JSON file captures everything domain-specific:

```
domain          → id, name, description
contexts[]      → id, name, column_prefix, description
theory_groups[] → id, name, full_label, description
study_types[]   → id, label, color, symbol
scoring         → scale_min, scale_max, descriptions
prompts         → role, logic, constraints, task_template (with placeholders)
glossary{}      → factor definitions with id, def, rel, tag, modes, theory_group
```

Validation (core/config.py):

  • All required keys present
  • Glossary modes reference valid context IDs
  • Glossary theory_group references valid group IDs
  • Column prefixes are unique
  • At least 1 context, 1 group, 1 factor
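The validation rules above can be sketched as follows. This is a minimal sketch, not the actual `core/config.py` implementation: field names follow the schema listed in this section, and `validate_config` is a hypothetical name.

```python
def validate_config(cfg: dict) -> list:
    """Return a list of validation errors (empty list means valid)."""
    errors = []
    # All required top-level keys present
    for key in ("domain", "contexts", "theory_groups",
                "study_types", "scoring", "prompts", "glossary"):
        if key not in cfg:
            errors.append(f"missing required key: {key}")
    if errors:
        return errors  # cannot check cross-references without the keys
    ctx_ids = {c["id"] for c in cfg["contexts"]}
    group_ids = {g["id"] for g in cfg["theory_groups"]}
    for name, factor in cfg["glossary"].items():
        # Glossary modes must reference valid context IDs
        for mode in factor.get("modes", []):
            if mode not in ctx_ids:
                errors.append(f"{name}: unknown context '{mode}'")
        # theory_group must reference a valid group ID
        if factor.get("theory_group") not in group_ids:
            errors.append(f"{name}: unknown theory_group")
    # Column prefixes must be unique
    prefixes = [c["column_prefix"] for c in cfg["contexts"]]
    if len(prefixes) != len(set(prefixes)):
        errors.append("duplicate column_prefix")
    # At least 1 context, 1 group, 1 factor
    if not (ctx_ids and group_ids and cfg["glossary"]):
        errors.append("need at least one context, group, and factor")
    return errors
```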

### 2. Column Naming (`core/columns.py`)

The central naming convention:

```
{context.column_prefix}_{sanitized_factor_name}
```

where sanitization replaces spaces with underscores and removes parentheses.

Examples:

  • Neuroscience: Local_Oddball_Subtractive_Inhibition_SST
  • Electronics: High_Frequency_Impedance_Matching

Additional computed columns:

  • {prefix}_Count — number of non-NaN factors
  • {prefix}_Averages_{group_id}_(group_name) — mean score per theory group
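The naming convention above can be sketched in a few lines; helper names mirror this document, but the real `core/columns.py` may handle more edge cases.

```python
def sanitize_factor(name: str) -> str:
    """Spaces become underscores; parentheses are removed."""
    return name.replace(" ", "_").replace("(", "").replace(")", "")

def make_column_name(prefix: str, factor: str) -> str:
    """{context.column_prefix}_{sanitized_factor_name}"""
    return f"{prefix}_{sanitize_factor(factor)}"
```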

### 3. Prompt System (`core/prompts.py`)

The 4-part prompt architecture:

```
┌──────────────┐
│     ROLE     │  Expert persona for the domain
├──────────────┤
│    LOGIC     │  Context definitions, scoring scale, full glossary
├──────────────┤
│ CONSTRAINTS  │  Independence rules, no hallucination, format compliance
├──────────────┤
│     TASK     │  Study text + glossary keys + required output format
└──────────────┘
```

Template variables are filled from the config:

  • {domain_name} → "Predictive Coding in Neuroscience"
  • {context_definitions} → auto-generated from config.contexts[]
  • {scoring_scale} → auto-generated from config.scoring
  • {full_glossary_definitions} → auto-generated from config.glossary
  • {context_eval_blocks} → generates lo_evaluations = {...} / go_evaluations = {...} etc.
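The template filling above can be sketched as follows. This is a minimal sketch with assumed helper names and a simplified task template; the real `core/prompts.py` templates are much richer.

```python
def build_context_eval_blocks(contexts: list) -> str:
    """One '{ctx_id}_evaluations = {...}' line per configured context."""
    return "\n".join(
        f"{c['id'].lower()}_evaluations = {{...}}" for c in contexts
    )

def create_prompt(study_text: str, cfg: dict) -> str:
    parts = cfg["prompts"]
    # Fill template variables in the TASK section from the config
    filled_task = parts["task_template"].format(
        domain_name=cfg["domain"]["name"],
        context_eval_blocks=build_context_eval_blocks(cfg["contexts"]),
        study_text=study_text,
    )
    # Assemble ROLE / LOGIC / CONSTRAINTS / TASK, in that order
    return "\n\n".join(
        [parts["role"], parts["logic"], parts["constraints"], filled_task]
    )
```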

### 4. Evaluation Pipeline (`core/evaluation.py`)

```
PDF → get_info_from_pdf() → text
text → create_prompt(text, config) → prompt
prompt → AI model → response
response → parse_ai_response(response, config) → {
    'evaluations': {'LO': {...}, 'GO': {...}},
    'first_author': '...',
    'publication_year': '...',
    'study_type': '...',
    'reasoning_log_text': '...'
}
```

Key design decisions:

  • N-context return: Returns {'evaluations': {ctx_id: {factor: score}}} dict instead of separate lo_evaluations/go_evaluations
  • exec() parsing: AI response is executable Python code, run in a sandboxed scope with only np available
  • Chunking: Long papers are split into <document_part> XML tags to handle token limits
  • Model routing: Supports cloud APIs (Gemini via ai.generate_text), local models (via OpenAI-compatible client), with prefix-based routing
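The exec()-based parsing step can be sketched as below: the model's reply is a block of Python assignments, one `{ctx_id}_evaluations` variable per context, executed in a restricted scope where only `np` is visible. The variable-naming convention comes from this document; the real `parse_ai_response()` also extracts the metadata fields shown in the pipeline diagram.

```python
import numpy as np

def parse_ai_response(response: str, context_ids: list) -> dict:
    # Sandboxed scope: np only, no builtins exposed to the model's code
    scope = {"np": np, "__builtins__": {}}
    exec(response, scope)  # run the model's variable assignments
    return {
        "evaluations": {
            ctx: scope.get(f"{ctx.lower()}_evaluations", {})
            for ctx in context_ids
        }
    }
```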

### 5. Dashboard (`core/dashboard.py`)

`StudyBenchmarkDashboard` replaces the old `NeuroBenchmarkDashboard`:

| Old (Hardcoded) | New (Parameterized) |
| --- | --- |
| `widgets_lo` / `widgets_go` | `widgets_by_context = {ctx_id: {}}` |
| Tab labels "Local Oddball (LO)" | `config.contexts[].name` |
| Hypothesis names H1/H2/H3 | `config.theory_groups[].full_label` |
| Study types dropdown | `config.study_types[].label` |
| Column formula `Local_Oddball_{factor}` | `make_column_name(prefix, factor)` |
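The parameterized widget layout can be sketched as one dict of per-factor slots per context ID. Widgets are stubbed with `None` here, and `build_widget_layout` is a hypothetical name; the real dashboard fills the slots with ipywidgets sliders.

```python
def build_widget_layout(cfg: dict) -> dict:
    widgets_by_context = {}
    for ctx in cfg["contexts"]:
        # One slot per glossary factor that is active in this context
        widgets_by_context[ctx["id"]] = {
            factor: None  # placeholder for the factor's scoring slider
            for factor, meta in cfg["glossary"].items()
            if ctx["id"] in meta.get("modes", [])
        }
    return widgets_by_context
```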

### 6. Visualization (`core/visualization.py`)

Scatter Dispatch — selects chart type by group count:

| Groups | Visualization | Example Domain |
| --- | --- | --- |
| 1 | 1D strip plot | Single-hypothesis evaluation |
| 2 | 2D scatter | Binary classification domains |
| 3 | 3D scatter + projection lines | Neuroscience (H1/H2/H3), Electronics (G1/G2/G3) |
| 4+ | Radar/spider chart | Multi-dimensional domains |
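The dispatch table reduces to a small function; the return values here are illustrative labels, not the actual Plotly calls in `core/visualization.py`.

```python
def select_chart_type(n_groups: int) -> str:
    """Pick a chart family from the number of theory groups."""
    if n_groups == 1:
        return "1D strip plot"
    if n_groups == 2:
        return "2D scatter"
    if n_groups == 3:
        return "3D scatter + projection lines"
    return "radar/spider chart"  # 4 or more theory groups
```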

Agent Comparison — N subplots (one per theory group), dots per model, filtered by context.

Summary — Bar chart of averages + NxN distance heatmap.

### 7. Agent Skills (`.claude/skills/`)

Skills are Claude Code packages that provide domain-specific knowledge:

| Skill | Type | Purpose |
| --- | --- | --- |
| `study-eval` | Background | Framework docs, DomainConfig schema |
| `study-eval-neuro` | User-invocable | Evaluate neuro papers (`/study-eval-neuro paper.pdf`) |
| `study-eval-electronics` | User-invocable | Evaluate electronics papers |
| `study-eval-glossary` | User-invocable | View/search/edit any glossary |
| `study-eval-compare` | User-invocable | Compare model evaluations |

Each domain skill contains:

  • SKILL.md — instructions with frontmatter (name, description, allowed-tools)
  • glossary.json — the domain's factor glossary
  • Supporting docs (glossary-reference.md, output-format.md, etc.)
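A minimal SKILL.md frontmatter might look like this; the field names come from the list above, while the description text and tool list are illustrative:

```markdown
---
name: study-eval-neuro
description: Evaluate neuroscience predictive coding papers against the domain glossary
allowed-tools: Read, Write, Bash
---
```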

## Data Flow

```
                    ┌──────────────┐
                    │ DomainConfig │
                    │   (.json)    │
                    └──────┬───────┘
                           │
              ┌────────────┼────────────┐
              │            │            │
              ▼            ▼            ▼
         ┌──────────┐ ┌──────────┐ ┌──────────┐
         │ Prompts  │ │ Columns  │ │Dashboard │
         │  (.py)   │ │  (.py)   │ │  (.py)   │
         └────┬─────┘ └────┬─────┘ └────┬─────┘
              │            │            │
              ▼            │            │
         ┌──────────┐      │            │
         │Evaluation│      │            │
         │ Pipeline │──────┘            │
         │  (.py)   │                   │
         └────┬─────┘                   │
              │                         │
              ▼                         ▼
        ┌──────────────┐      ┌──────────────┐
        │  Benchmark   │◄─────│ Interactive  │
        │  DataFrame   │      │ UI / Widgets │
        └──────┬───────┘      └──────────────┘
               │
               ▼
        ┌──────────────┐
        │Visualization │
        │(3D, compare) │
        └──────────────┘
```

## Backward Compatibility

The original notebook TcGLO_HPC_Local.ipynb is completely unchanged and still runs standalone. The core/ modules and domains/ configs are a parallel extraction — they don't modify or depend on the notebook.

The neuroscience DomainConfig produces identical column names to the original notebook:

  • Local_Oddball_Subtractive_Inhibition_SST (same as before)
  • Global_Oddball_Averages_H1_(Suppression) (same as before)

## Creating a New Domain

  1. Copy domains/electronics_architecture.json as a starting point
  2. Replace: domain info, contexts, theory groups, study types
  3. Write your glossary factors with definitions and relationships
  4. Customize the prompt templates for your field's terminology
  5. Run load_domain_config() — it will validate all cross-references
  6. Optionally create a skill in .claude/skills/study-eval-<domain>/

The framework automatically handles:

  • Column naming for your contexts and factors
  • N-context evaluation variables in the AI prompt
  • Dashboard tabs and scoring panels
  • Visualization dispatch for your group count

## Key Files Reference

| File | Lines | Purpose |
| --- | --- | --- |
| `core/config.py` | ~180 | Config loading, validation, queries |
| `core/columns.py` | ~110 | Column name generation |
| `core/prompts.py` | ~200 | Prompt template filling |
| `core/evaluation.py` | ~220 | PDF extraction, AI evaluation, response parsing |
| `core/dashboard.py` | ~340 | Interactive dashboard with N contexts |
| `core/visualization.py` | ~350 | Scatter dispatch, agent comparison, summary |
| `domains/neuroscience_predictive_coding.json` | ~500 | Full neuro config (36 factors, 4 prompts) |
| `domains/electronics_architecture.json` | ~300 | Electronics starter (18 factors) |