# SPEC-SCHEMA-IMPL: Schema System Implementation Plan

**Status:** Draft
**Created:** 2025-02-06
**Branch:** `feature/schema-system`
**Depends on:** [SPEC-SCHEMA](SPEC-SCHEMA.md)

## Overview

Implementation plan for the Basic Memory Schema System. The system is entirely programmatic —
no LLM agent runtime or API key required. The LLM already in the user's session (Claude Code,
Claude Desktop, etc.) provides the intelligence layer by reading schema notes via existing
MCP tools.

## Architecture

```
┌─────────────────────────────────────────────────┐
│                  Entry Points                   │
│  CLI (bm schema ...)  │  MCP (schema_validate)  │
└──────────┬────────────┴──────────┬──────────────┘
           │                       │
           ▼                       ▼
┌─────────────────────────────────────────────────┐
│              Schema Service Layer               │
│    resolve_schema · validate · infer · diff     │
└──────────┬────────────────────────┬─────────────┘
           │                        │
           ▼                        ▼
┌──────────────────────┐  ┌────────────────────────┐
│  Picoschema Parser   │  │   Note/Entity Access   │
│  YAML → SchemaModel  │  │ (existing repository)  │
└──────────────────────┘  └────────────────────────┘
```

No new database tables. Schemas are notes with `type: schema` — they're already indexed.
Validation reads observations and relations from existing data.
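
For concreteness, a schema note's frontmatter might look like the following sketch. The `person` entity, its fields, and the `title` key are illustrative assumptions, not prescribed values; the `entity`, `version`, `validation_mode`, and `schema` keys mirror the `SchemaDefinition` fields described below.

```yaml
---
title: Person Schema
type: schema
entity: person
version: 1
validation_mode: warn
schema:
  name: string, full name
  role?: string, job title
---
```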

## Components

### 1. Picoschema Parser

**Location:** `src/basic_memory/schema/parser.py`

Parses Picoschema YAML into an internal representation.

```python
from dataclasses import dataclass


@dataclass
class SchemaField:
    name: str
    type: str                      # string, integer, number, boolean, any, or EntityName
    required: bool                 # True unless field name ends with ?
    is_array: bool                 # True if (array) notation
    is_enum: bool                  # True if (enum) notation
    enum_values: list[str]         # Populated for enums
    description: str | None        # Text after comma
    is_entity_ref: bool            # True if type is capitalized (entity reference)
    children: list["SchemaField"]  # For (object) types; quoted forward reference


@dataclass
class SchemaDefinition:
    entity: str                # The entity type this schema describes
    version: int               # Schema version
    fields: list[SchemaField]  # Parsed fields
    validation_mode: str       # "warn" | "strict" | "off"


def parse_picoschema(yaml_dict: dict) -> list[SchemaField]:
    """Parse a Picoschema YAML dict into a list of SchemaField objects."""


def parse_schema_note(frontmatter: dict) -> SchemaDefinition:
    """Parse a full schema note's frontmatter into a SchemaDefinition."""
```

**Input/Output:**
```yaml
# Input (YAML dict from frontmatter)
schema:
  name: string, full name
  role?: string, job title
  works_at?: Organization, employer
  expertise?(array): string, areas of knowledge
```

```python
# Output
[
    SchemaField(name="name", type="string", required=True, description="full name", ...),
    SchemaField(name="role", type="string", required=False, description="job title", ...),
    SchemaField(name="works_at", type="Organization", required=False, is_entity_ref=True, ...),
    SchemaField(name="expertise", type="string", required=False, is_array=True, ...),
]
```

### 2. Schema Resolver

**Location:** `src/basic_memory/schema/resolver.py`

Finds the applicable schema for a note using the resolution order.

```python
from typing import Callable


async def resolve_schema(
    note_frontmatter: dict,
    search_fn: Callable,  # injected search capability
) -> SchemaDefinition | None:
    """Resolve schema for a note.

    Resolution order:
    1. Inline schema (frontmatter['schema'] is a dict)
    2. Explicit reference (frontmatter['schema'] is a string)
    3. Implicit by type (frontmatter['type'] → schema note with matching entity)
    4. No schema (returns None)
    """
```

### 3. Schema Validator

**Location:** `src/basic_memory/schema/validator.py`

Validates a note's observations and relations against a resolved schema.

```python
from dataclasses import dataclass


@dataclass
class FieldResult:
    field: SchemaField
    status: str          # "present" | "missing" | "type_mismatch"
    values: list[str]    # Matched observation values or relation targets
    message: str | None  # Human-readable detail


@dataclass
class ValidationResult:
    note_identifier: str
    schema_entity: str
    passed: bool                            # True if no errors (warnings are OK)
    field_results: list[FieldResult]
    unmatched_observations: dict[str, int]  # category → count
    unmatched_relations: list[str]          # relation types not in schema
    warnings: list[str]
    errors: list[str]


async def validate_note(
    note: Note,
    schema: SchemaDefinition,
) -> ValidationResult:
    """Validate a note against a schema definition.

    Mapping rules:
    - field: string → observation [field] exists
    - field?(array): type → multiple [field] observations
    - field?: EntityType → relation 'field [[...]]' exists
    - field?(enum): [v] → observation [field] value ∈ enum values
    """
```
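The heart of the mapping rules is that scalar fields look up observation categories while entity-reference fields look up relation types. A minimal sketch of the required-field check, assuming observations and relations have already been grouped by category and relation type (type and enum checking omitted):

```python
from dataclasses import dataclass


@dataclass
class Field:  # pared-down stand-in for SchemaField
    name: str
    required: bool
    is_entity_ref: bool = False


def missing_required(
    fields: list[Field],
    observations: dict[str, list[str]],  # category -> observation values
    relations: dict[str, list[str]],     # relation type -> targets
) -> list[str]:
    """Return names of required fields with no matching observation/relation."""
    missing = []
    for f in fields:
        # Entity references are satisfied by relations; everything else by observations.
        source = relations if f.is_entity_ref else observations
        if f.required and not source.get(f.name):
            missing.append(f.name)
    return missing
```

Optional fields never appear in the result, which matches the "warn" default: their absence is simply not an error.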

### 4. Schema Inference Engine

**Location:** `src/basic_memory/schema/inference.py`

Analyzes notes of a given type and suggests a schema based on usage frequency.

```python
from dataclasses import dataclass


@dataclass
class FieldFrequency:
    name: str
    source: str               # "observation" | "relation"
    count: int                # notes containing this field
    total: int                # total notes analyzed
    percentage: float
    sample_values: list[str]  # representative values
    is_array: bool            # True if typically appears multiple times per note
    target_type: str | None   # For relations, the most common target entity type


@dataclass
class InferenceResult:
    entity_type: str
    notes_analyzed: int
    field_frequencies: list[FieldFrequency]
    suggested_schema: dict        # Ready-to-use Picoschema YAML dict
    suggested_required: list[str]
    suggested_optional: list[str]
    excluded: list[str]           # Below threshold


async def infer_schema(
    entity_type: str,
    notes: list[Note],
    required_threshold: float = 0.95,  # 95%+ = required
    optional_threshold: float = 0.25,  # 25%+ = optional
) -> InferenceResult:
    """Analyze notes and suggest a Picoschema definition."""
```
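The threshold logic reduces to a pure function over per-field note counts. A simplified illustration (array detection, sample values, and relation targets omitted):

```python
from collections import Counter


def classify_fields(
    field_counts: Counter,
    total_notes: int,
    required_threshold: float = 0.95,  # 95%+ of notes → required
    optional_threshold: float = 0.25,  # 25%+ of notes → optional
) -> tuple[list[str], list[str], list[str]]:
    """Bucket fields into (required, optional, excluded) by usage frequency."""
    required, optional, excluded = [], [], []
    for name, count in field_counts.most_common():
        pct = count / total_notes
        if pct >= required_threshold:
            required.append(name)
        elif pct >= optional_threshold:
            optional.append(name)
        else:
            excluded.append(name)
    return required, optional, excluded
```

Iterating in `most_common()` order means each bucket comes back sorted by frequency, which makes the suggested schema stable across runs.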

### 5. Schema Diff

**Location:** `src/basic_memory/schema/diff.py`

Compares current note usage against an existing schema definition.

```python
from dataclasses import dataclass

from basic_memory.schema.inference import FieldFrequency


@dataclass
class SchemaDrift:
    new_fields: list[FieldFrequency]      # Fields not in schema but common in notes
    dropped_fields: list[FieldFrequency]  # Fields in schema but rare in notes
    cardinality_changes: list[str]        # one → many or many → one
    type_mismatches: list[str]            # observation values don't match declared type


async def diff_schema(
    schema: SchemaDefinition,
    notes: list[Note],
) -> SchemaDrift:
    """Compare a schema against actual note usage to detect drift."""
```
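Drift detection can reuse the inference engine's frequency analysis. A minimal sketch, assuming usage has already been summarized as per-field fractions; the two cutoff values are illustrative assumptions, not spec'd numbers:

```python
def detect_drift(
    schema_fields: set[str],
    observed: dict[str, float],       # field name -> fraction of notes using it
    appear_threshold: float = 0.25,   # assumed cutoff for flagging a new field
    drop_threshold: float = 0.05,     # assumed cutoff for flagging a dropped field
) -> tuple[list[str], list[str]]:
    """Classify drift between declared schema fields and observed usage."""
    new = sorted(f for f, pct in observed.items()
                 if f not in schema_fields and pct >= appear_threshold)
    dropped = sorted(f for f in schema_fields
                     if observed.get(f, 0.0) < drop_threshold)
    return new, dropped
```

Cardinality and type-mismatch detection would layer on top of the same frequency data, which is why Phase 4 depends on Phase 3.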

## Entry Points

### CLI Commands

**Location:** `src/basic_memory/cli/schema.py`

```python
import typer

schema_app = typer.Typer(name="schema", help="Schema management commands")

# Typer does not await coroutines, so commands are sync wrappers that call
# the async service layer via asyncio.run().

@schema_app.command()
def validate(
    target: str | None = typer.Argument(None, help="Note path or entity type"),
    strict: bool = typer.Option(False, help="Override to strict mode"),
):
    """Validate notes against their schemas."""

@schema_app.command()
def infer(
    entity_type: str = typer.Argument(..., help="Entity type to analyze"),
    threshold: float = typer.Option(0.25, help="Minimum frequency for optional fields"),
    save: bool = typer.Option(False, help="Save to schema/ directory"),
):
    """Infer schema from existing notes of a type."""

@schema_app.command()
def diff(
    entity_type: str = typer.Argument(..., help="Entity type to diff"),
):
    """Show drift between schema and actual usage."""
```

Registered as a subcommand: `bm schema validate`, `bm schema infer`, `bm schema diff`.

### MCP Tools

**Location:** `src/basic_memory/mcp/tools/schema.py`

```python
@mcp_tool
async def schema_validate(
    entity_type: str | None = None,
    identifier: str | None = None,
    project: str | None = None,
) -> str:
    """Validate notes against their resolved schema."""


@mcp_tool
async def schema_infer(
    entity_type: str,
    threshold: float = 0.25,
    project: str | None = None,
) -> str:
    """Analyze existing notes and suggest a schema definition."""
```

### API Endpoints

**Location:** `src/basic_memory/api/schema_router.py`

```python
from fastapi import APIRouter

router = APIRouter(prefix="/schema", tags=["schema"])


@router.post("/validate")
async def validate_schema(...) -> ValidationResult: ...


@router.post("/infer")
async def infer_schema(...) -> InferenceResult: ...


@router.get("/diff/{entity_type}")
async def diff_schema(...) -> SchemaDrift: ...
```

MCP tools call these endpoints via the typed client pattern (consistent with existing
architecture).

## Implementation Phases

### Phase 1: Parser + Resolver

Build the foundation: parsing Picoschema and finding schemas for notes.

**Deliverables:**
- `schema/parser.py` — Picoschema YAML → `SchemaDefinition`
- `schema/resolver.py` — Resolution order (inline → explicit ref → implicit by type → none)
- Unit tests for all Picoschema syntax variations
- Unit tests for resolution order

**No external dependencies.** Pure-Python parsing of YAML dicts that can be developed and
tested in isolation.
### Phase 2: Validator

Connect schemas to notes and produce validation results.

**Deliverables:**
- `schema/validator.py` — Validate note observations/relations against schema fields
- API endpoint: `POST /schema/validate`
- MCP tool: `schema_validate`
- CLI command: `bm schema validate`
- Integration tests with real notes and schemas

**Depends on:** Phase 1 (parser + resolver)

### Phase 3: Inference

Analyze existing notes to suggest schemas.

**Deliverables:**
- `schema/inference.py` — Frequency analysis across notes of a type
- API endpoint: `POST /schema/infer`
- MCP tool: `schema_infer`
- CLI command: `bm schema infer`
- Option to save inferred schema as a note via `write_note`

**Depends on:** Phase 1 (parser for output format)

### Phase 4: Diff

Compare schemas against current usage.

**Deliverables:**
- `schema/diff.py` — Drift detection between schema and actual notes
- API endpoint: `GET /schema/diff/{entity_type}`
- CLI command: `bm schema diff`

**Depends on:** Phase 1 (parser), Phase 3 (inference, for frequency analysis)

## Testing Strategy

- **Unit tests** (`tests/schema/`): Parser edge cases, resolution logic, validation mapping,
  inference thresholds
- **Integration tests** (`test-int/schema/`): End-to-end with real markdown files, schema notes
  on disk, CLI invocation
- Coverage target: 100% (consistent with project standard)

## What This Does NOT Include

- No new database tables or migrations
- No new markdown syntax (schemas validate existing observations/relations)
- No LLM agent runtime or API key management
- No hook integration (deferred)
- No schema composition/inheritance (deferred)
- No OWL/RDF export (deferred)
- No built-in templates (deferred)