Commit 3c0419f

Merge branch 'main' into fix/tigris-consistent-header-all-transactions
Signed-off-by: Paul Hernandez <60959+phernandez@users.noreply.github.com>
2 parents 7fcf587 + 2b94d9a commit 3c0419f

108 files changed

Lines changed: 7123 additions & 1 deletion

NOTE-FORMAT.md

Lines changed: 494 additions & 0 deletions
Large diffs are not rendered by default.

docs/specs/SPEC-SCHEMA-IMPL.md

Lines changed: 365 additions & 0 deletions
@@ -0,0 +1,365 @@
# SPEC-SCHEMA-IMPL: Schema System Implementation Plan

**Status:** Draft
**Created:** 2025-02-06
**Branch:** `feature/schema-system`
**Depends on:** [SPEC-SCHEMA](SPEC-SCHEMA.md)

## Overview

Implementation plan for the Basic Memory Schema System. The system is entirely programmatic:
no LLM agent runtime or API key is required. The LLM already in the user's session (Claude Code,
Claude Desktop, etc.) provides the intelligence layer by reading schema notes via existing
MCP tools.

## Architecture

```
┌──────────────────────────────────────────────────────┐
│                     Entry Points                     │
│  CLI (bm schema ...)     │  MCP (schema_validate)    │
└────────────┬─────────────┴──────────────┬────────────┘
             │                            │
             ▼                            ▼
┌──────────────────────────────────────────────────────┐
│                 Schema Service Layer                 │
│       resolve_schema · validate · infer · diff       │
└────────────┬────────────────────────────┬────────────┘
             │                            │
             ▼                            ▼
┌─────────────────────────┐  ┌─────────────────────────┐
│ Picoschema Parser       │  │ Note/Entity Access      │
│ YAML → SchemaDefinition │  │ (existing repository)   │
└─────────────────────────┘  └─────────────────────────┘
```

No new database tables. Schemas are notes with `type: schema`, so they are already indexed.
Validation reads observations and relations from existing data.

## Components

### 1. Picoschema Parser

**Location:** `src/basic_memory/schema/parser.py`

Parses Picoschema YAML into an internal representation.

```python
from __future__ import annotations  # lets SchemaField reference itself in `children`

from dataclasses import dataclass


@dataclass
class SchemaField:
    name: str
    type: str                    # string, integer, number, boolean, any, or EntityName
    required: bool               # True unless field name ends with ?
    is_array: bool               # True if (array) notation
    is_enum: bool                # True if (enum) notation
    enum_values: list[str]       # Populated for enums
    description: str | None      # Text after comma
    is_entity_ref: bool          # True if type is capitalized (entity reference)
    children: list[SchemaField]  # For (object) types


@dataclass
class SchemaDefinition:
    entity: str                # The entity type this schema describes
    version: int               # Schema version
    fields: list[SchemaField]  # Parsed fields
    validation_mode: str       # "warn" | "strict" | "off"


def parse_picoschema(yaml_dict: dict) -> list[SchemaField]:
    """Parse a Picoschema YAML dict into a list of SchemaField objects."""


def parse_schema_note(frontmatter: dict) -> SchemaDefinition:
    """Parse a full schema note's frontmatter into a SchemaDefinition."""
```

**Input/Output:**
```yaml
# Input (YAML dict from frontmatter)
schema:
  name: string, full name
  role?: string, job title
  works_at?: Organization, employer
  expertise?(array): string, areas of knowledge
```

```python
# Output
[
    SchemaField(name="name", type="string", required=True, description="full name", ...),
    SchemaField(name="role", type="string", required=False, description="job title", ...),
    SchemaField(name="works_at", type="Organization", required=False, is_entity_ref=True, ...),
    SchemaField(name="expertise", type="string", required=False, is_array=True, ...),
]
```
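
The key and value conventions shown above could be parsed roughly as follows. This is a minimal sketch, not the planned implementation. The regular expression, the `parse_field` helper, the default values on the dataclass, and the assumption that enum values arrive as a YAML list are all illustrative choices that this spec does not define.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field

# Hypothetical key grammar: name, optional "?", optional "(modifier)".
_KEY_RE = re.compile(r"^(?P<name>\w+)(?P<optional>\?)?(?:\((?P<modifier>\w+)\))?$")
_SCALARS = {"string", "integer", "number", "boolean", "any"}


@dataclass
class SchemaField:
    # Defaults are added here only to keep the sketch short.
    name: str
    type: str
    required: bool = True
    is_array: bool = False
    is_enum: bool = False
    enum_values: list[str] = field(default_factory=list)
    description: str | None = None
    is_entity_ref: bool = False
    children: list[SchemaField] = field(default_factory=list)


def parse_field(key: str, value) -> SchemaField:
    """Parse one `key: value` pair of a Picoschema dict (hypothetical helper)."""
    match = _KEY_RE.match(key)
    if match is None:
        raise ValueError(f"Unrecognized Picoschema key: {key!r}")
    modifier = match["modifier"]
    parsed = SchemaField(
        name=match["name"],
        type="any",
        required=match["optional"] is None,  # a trailing "?" marks the field optional
        is_array=modifier == "array",
        is_enum=modifier == "enum",
    )
    if modifier == "object" and isinstance(value, dict):
        parsed.type = "object"
        parsed.children = parse_picoschema(value)
    elif parsed.is_enum and isinstance(value, list):
        # Assumption: enum values are given as a YAML list.
        parsed.type = "string"
        parsed.enum_values = [str(v) for v in value]
    else:
        # "type, description": everything after the first comma is the description.
        type_name, _, description = str(value).partition(",")
        parsed.type = type_name.strip()
        parsed.description = description.strip() or None
        parsed.is_entity_ref = parsed.type not in _SCALARS and parsed.type[:1].isupper()
    return parsed


def parse_picoschema(yaml_dict: dict) -> list[SchemaField]:
    """Parse a Picoschema YAML dict into a list of SchemaField objects."""
    return [parse_field(key, value) for key, value in yaml_dict.items()]
```

Treating every capitalized, non-scalar type name as an entity reference follows the `is_entity_ref` comment in the dataclass above.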

### 2. Schema Resolver

**Location:** `src/basic_memory/schema/resolver.py`

Finds the applicable schema for a note using the resolution order.

```python
from collections.abc import Callable

from basic_memory.schema.parser import SchemaDefinition


async def resolve_schema(
    note_frontmatter: dict,
    search_fn: Callable,  # injected search capability
) -> SchemaDefinition | None:
    """Resolve schema for a note.

    Resolution order:
    1. Inline schema (frontmatter['schema'] is a dict)
    2. Explicit reference (frontmatter['schema'] is a string)
    3. Implicit by type (frontmatter['type'] → schema note with matching entity)
    4. No schema (returns None)
    """
```

### 3. Schema Validator

**Location:** `src/basic_memory/schema/validator.py`

Validates a note's observations and relations against a resolved schema.

```python
from __future__ import annotations

from dataclasses import dataclass

from basic_memory.schema.parser import SchemaDefinition, SchemaField

# `Note` refers to the project's existing note type; this spec does not give its import path.


@dataclass
class FieldResult:
    field: SchemaField
    status: str          # "present" | "missing" | "type_mismatch"
    values: list[str]    # Matched observation values or relation targets
    message: str | None  # Human-readable detail


@dataclass
class ValidationResult:
    note_identifier: str
    schema_entity: str
    passed: bool                            # True if no errors (warnings are OK)
    field_results: list[FieldResult]
    unmatched_observations: dict[str, int]  # category → count
    unmatched_relations: list[str]          # relation types not in schema
    warnings: list[str]
    errors: list[str]


async def validate_note(
    note: Note,
    schema: SchemaDefinition,
) -> ValidationResult:
    """Validate a note against a schema definition.

    Mapping rules:
    - field: string        → observation [field] exists
    - field?(array): type  → multiple [field] observations
    - field?: EntityType   → relation 'field [[...]]' exists
    - field?(enum): [v]    → observation [field] value ∈ enum values
    """
```
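
The mapping rules in the docstring above could look roughly like this in code. This is a sketch under stated assumptions: `FieldSpec` is a hypothetical, trimmed-down stand-in for `SchemaField`, observations are modelled as `(category, value)` pairs and relations as `(relation_type, target)` pairs, and reporting an out-of-enum value as `type_mismatch` is one possible reading of the status list. The way `warn`, `strict`, and `off` route messages is also an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FieldSpec:
    """Trimmed-down, hypothetical stand-in for SchemaField."""
    name: str
    required: bool = True
    is_entity_ref: bool = False
    enum_values: list[str] = field(default_factory=list)


def check_field(
    spec: FieldSpec,
    observations: list[tuple[str, str]],
    relations: list[tuple[str, str]],
) -> tuple[str, list[str]]:
    """Return (status, matched_values) for one schema field."""
    if spec.is_entity_ref:
        # Entity references map to relations whose type equals the field name.
        values = [target for rel_type, target in relations if rel_type == spec.name]
    else:
        # Everything else maps to observations whose category equals the field name.
        values = [value for category, value in observations if category == spec.name]
    if not values:
        return "missing", values
    if spec.enum_values and any(v not in spec.enum_values for v in values):
        return "type_mismatch", values
    return "present", values


def validate(fields, observations, relations, mode: str = "warn"):
    """Collect (errors, warnings); assumed semantics for the three validation modes."""
    errors: list[str] = []
    warnings: list[str] = []
    if mode == "off":
        return errors, warnings
    for spec in fields:
        status, _ = check_field(spec, observations, relations)
        if status == "present" or (status == "missing" and not spec.required):
            continue
        # Assumption: strict mode escalates every problem to an error.
        (errors if mode == "strict" else warnings).append(f"{spec.name}: {status}")
    return errors, warnings
```

For example, a note with observations `[name] Ada` and `[status] retired` passes in `warn` mode even if `status` is restricted to `active` or `inactive`, because the mismatch is only recorded as a warning.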

### 4. Schema Inference Engine

**Location:** `src/basic_memory/schema/inference.py`

Analyzes notes of a given type and suggests a schema based on usage frequency.

```python
from __future__ import annotations

from dataclasses import dataclass

# `Note` refers to the project's existing note type; this spec does not give its import path.


@dataclass
class FieldFrequency:
    name: str
    source: str               # "observation" | "relation"
    count: int                # notes containing this field
    total: int                # total notes analyzed
    percentage: float
    sample_values: list[str]  # representative values
    is_array: bool            # True if typically appears multiple times per note
    target_type: str | None   # For relations, the most common target entity type


@dataclass
class InferenceResult:
    entity_type: str
    notes_analyzed: int
    field_frequencies: list[FieldFrequency]
    suggested_schema: dict  # Ready-to-use Picoschema YAML dict
    suggested_required: list[str]
    suggested_optional: list[str]
    excluded: list[str]     # Below threshold


async def infer_schema(
    entity_type: str,
    notes: list[Note],
    required_threshold: float = 0.95,  # 95%+ = required
    optional_threshold: float = 0.25,  # 25%+ = optional
) -> InferenceResult:
    """Analyze notes and suggest a Picoschema definition."""
```
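
The two thresholds above split fields into required, optional, and excluded groups. A minimal sketch of that split follows, assuming each note has already been reduced to a list of the field names it uses. The `classify_fields` helper and that input shape are illustrative and not part of this spec.

```python
from collections import Counter


def classify_fields(
    notes: list[list[str]],
    required_threshold: float = 0.95,
    optional_threshold: float = 0.25,
) -> tuple[list[str], list[str], list[str]]:
    """Split field names into (required, optional, excluded) by share of notes."""
    total = len(notes)
    # Count each field once per note, no matter how often it repeats in that note.
    counts = Counter(name for note in notes for name in set(note))

    required: list[str] = []
    optional: list[str] = []
    excluded: list[str] = []
    for name, count in sorted(counts.items()):
        share = count / total  # safe: the loop body never runs when there are no notes
        if share >= required_threshold:
            required.append(name)
        elif share >= optional_threshold:
            optional.append(name)
        else:
            excluded.append(name)
    return required, optional, excluded
```

With five notes where `name` appears in all five, `role` in two, and `hobby` in one, the shares are 1.0, 0.4, and 0.2, so `name` is suggested as required, `role` as optional, and `hobby` is excluded.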

### 5. Schema Diff

**Location:** `src/basic_memory/schema/diff.py`

Compares current note usage against an existing schema definition.

```python
from __future__ import annotations

from dataclasses import dataclass

from basic_memory.schema.inference import FieldFrequency
from basic_memory.schema.parser import SchemaDefinition

# `Note` refers to the project's existing note type; this spec does not give its import path.


@dataclass
class SchemaDrift:
    new_fields: list[FieldFrequency]      # Fields not in schema but common in notes
    dropped_fields: list[FieldFrequency]  # Fields in schema but rare in notes
    cardinality_changes: list[str]        # one → many or many → one
    type_mismatches: list[str]            # observation values don't match declared type


async def diff_schema(
    schema: SchemaDefinition,
    notes: list[Note],
) -> SchemaDrift:
    """Compare a schema against actual note usage to detect drift."""
```
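
The first two drift categories can be derived from the same per-field frequencies that inference produces. The sketch below assumes usage is available as a mapping from field name to the share of notes that contain it. The `detect_drift` helper and both cut-off values are hypothetical, since this spec does not define what counts as "common" or "rare".

```python
from __future__ import annotations


def detect_drift(
    schema_fields: set[str],
    usage: dict[str, float],
    new_threshold: float = 0.25,      # hypothetical cut-off for "common in notes"
    dropped_threshold: float = 0.10,  # hypothetical cut-off for "rare in notes"
) -> tuple[list[str], list[str]]:
    """Return (new_fields, dropped_fields) as sorted lists of field names."""
    # New: used often enough in notes, but not declared in the schema.
    new_fields = sorted(
        name
        for name, share in usage.items()
        if name not in schema_fields and share >= new_threshold
    )
    # Dropped: declared in the schema, but rarely or never used in notes.
    dropped_fields = sorted(
        name for name in schema_fields if usage.get(name, 0.0) < dropped_threshold
    )
    return new_fields, dropped_fields
```

Reusing the inference frequencies here is consistent with Phase 4 depending on Phase 3.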

## Entry Points

### CLI Commands

**Location:** `src/basic_memory/cli/schema.py`

```python
import typer

schema_app = typer.Typer(name="schema", help="Schema management commands")


# Typer calls command functions synchronously and does not await coroutines,
# so each command is a plain function that would drive the async service
# layer itself (for example via asyncio.run).
@schema_app.command()
def validate(
    target: str | None = typer.Argument(None, help="Note path or entity type"),
    strict: bool = typer.Option(False, help="Override to strict mode"),
):
    """Validate notes against their schemas."""


@schema_app.command()
def infer(
    entity_type: str = typer.Argument(..., help="Entity type to analyze"),
    threshold: float = typer.Option(0.25, help="Minimum frequency for optional fields"),
    save: bool = typer.Option(False, help="Save to schema/ directory"),
):
    """Infer schema from existing notes of a type."""


@schema_app.command()
def diff(
    entity_type: str = typer.Argument(..., help="Entity type to diff"),
):
    """Show drift between schema and actual usage."""
```

Registered as subcommands of `bm`: `bm schema validate`, `bm schema infer`, `bm schema diff`.

### MCP Tools

**Location:** `src/basic_memory/mcp/tools/schema.py`

```python
@mcp_tool
async def schema_validate(
    entity_type: str | None = None,
    identifier: str | None = None,
    project: str | None = None,
) -> str:
    """Validate notes against their resolved schema."""


@mcp_tool
async def schema_infer(
    entity_type: str,
    threshold: float = 0.25,
    project: str | None = None,
) -> str:
    """Analyze existing notes and suggest a schema definition."""
```

### API Endpoints

**Location:** `src/basic_memory/api/schema_router.py`

```python
router = APIRouter(prefix="/schema", tags=["schema"])


@router.post("/validate")
async def validate_schema(...) -> ValidationReport: ...


@router.post("/infer")
async def infer_schema(...) -> InferenceResult: ...


@router.get("/diff/{entity_type}")
async def diff_schema(...) -> SchemaDrift: ...
```

MCP tools call these endpoints via the typed client pattern (consistent with existing
architecture).

## Implementation Phases

### Phase 1: Parser + Resolver

Build the foundation: parse Picoschema and find the schema that applies to a note.

**Deliverables:**
- `schema/parser.py`: Picoschema YAML → `SchemaDefinition`
- `schema/resolver.py`: Resolution order (inline → explicit ref → implicit by type → none)
- Unit tests for all Picoschema syntax variations
- Unit tests for resolution order

**No external dependencies.** Pure Python parsing of YAML dicts. Can be developed and tested
in isolation.

### Phase 2: Validator

Connect schemas to notes and produce validation results.

**Deliverables:**
- `schema/validator.py`: Validate note observations/relations against schema fields
- API endpoint: `POST /schema/validate`
- MCP tool: `schema_validate`
- CLI command: `bm schema validate`
- Integration tests with real notes and schemas

**Depends on:** Phase 1 (parser + resolver)

### Phase 3: Inference

Analyze existing notes to suggest schemas.

**Deliverables:**
- `schema/inference.py`: Frequency analysis across notes of a type
- API endpoint: `POST /schema/infer`
- MCP tool: `schema_infer`
- CLI command: `bm schema infer`
- Option to save inferred schema as a note via `write_note`

**Depends on:** Phase 1 (parser for output format)

### Phase 4: Diff

Compare schemas against current usage.

**Deliverables:**
- `schema/diff.py`: Drift detection between schema and actual notes
- API endpoint: `GET /schema/diff/{entity_type}`
- CLI command: `bm schema diff`

**Depends on:** Phase 1 (parser), Phase 3 (inference, for frequency analysis)

## Testing Strategy

- **Unit tests** (`tests/schema/`): Parser edge cases, resolution logic, validation mapping,
  inference thresholds
- **Integration tests** (`test-int/schema/`): End-to-end with real markdown files, schema notes
  on disk, CLI invocation
- Coverage target: 100% (consistent with project standard)

## What This Does NOT Include

- No new database tables or migrations
- No new markdown syntax (schemas validate existing observations/relations)
- No LLM agent runtime or API key management
- No hook integration (deferred)
- No schema composition/inheritance (deferred)
- No OWL/RDF export (deferred)
- No built-in templates (deferred)
