Commit d6c03b8

Mara and claude committed
♻️ docs: add encoding and git module documentation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 361d83d commit d6c03b8

3 files changed

Lines changed: 60 additions & 9 deletions

README.md

Lines changed: 3 additions & 1 deletion
@@ -49,6 +49,8 @@ fragmentation.hash_fragment(root)
 | `fragmentation/store` | Content-addressed in-memory storage (Sha -> Fragment) |
 | `fragmentation/walk` | Depth-first traversal, fold, find, depth |
 | `fragmentation/diff` | Structural comparison between trees |
+| `fragmentation/encoding` | Text as content-addressed trees (document/paragraph/sentence/word/char) |
+| `fragmentation/git` | Content-addressed fragment persistence to disk |
 
 ## Documentation
 
@@ -61,7 +63,7 @@ See [`docs/`](docs/INDEX.md) for the full documentation, including:
 ## Development
 
 ```sh
-gleam test # 62 tests
+gleam test # 110 tests
 ```
 
 ## Licence

docs/INDEX.md

Lines changed: 2 additions & 2 deletions
@@ -6,11 +6,11 @@ Content-addressed, arbitrary-depth, circular-reflexive trees. Reality for git.
 
 1. **[What Fragmentation Is](WHAT-FRAGMENTATION-IS.md)** -- the data structure, the self-similar property, content addressing.
 2. **[Witnessed](WITNESSED.md)** -- why the observer is part of the hash.
-3. **[Modules](MODULES.md)** -- how store, walk, and diff compose on the core types.
+3. **[Modules](MODULES.md)** -- how store, walk, diff, encoding, and git compose on the core types.
 4. **[Agent Guide](AGENT-GUIDE.md)** -- what future agents need to know that the code can't say.
 
 ## Why This Order
 
 Start with what the thing is. Then understand the deepest design decision (Witnessed). Then learn how the pieces fit. Then read the guide for using it in practice.
 
-The types are small enough to hold in your head. Four types, four modules. Everything else is consequences.
+The types are small enough to hold in your head. Four types, six modules. Everything else is consequences.

docs/MODULES.md

Lines changed: 55 additions & 6 deletions
@@ -1,12 +1,14 @@
 # Modules
 
-Fragmentation is four modules. They compose in one direction: core types flow outward, operations build on each other.
+Fragmentation is six modules. They compose in one direction: core types flow outward, operations build on each other.
 
 ```
-fragmentation        core types, construction, hashing, queries
-fragmentation/store  content-addressed storage (Sha -> Fragment)
-fragmentation/walk   recursive tree traversal
-fragmentation/diff   structural comparison between trees
+fragmentation           core types, construction, hashing, queries
+fragmentation/store     content-addressed storage (Sha -> Fragment)
+fragmentation/walk      recursive tree traversal
+fragmentation/diff      structural comparison between trees
+fragmentation/encoding  text as content-addressed trees
+fragmentation/git       content-addressed fragment persistence
 ```
 
 ## fragmentation (core)
@@ -93,6 +95,51 @@ Positional comparison means: the first child of old is compared with the first c
 
 **`summary`** reduces a list of changes to four counts: `#(added, removed, modified, unchanged)`.
 
+## fragmentation/encoding
+
+Source: `src/fragmentation/encoding.gleam`
+
+Text as content-addressed trees. Five levels of structure: document, paragraph, sentence, word, character. Each level is a fragment containing the next level down. Characters are terminal shards.
+
+```gleam
+let root = encoding.encode("Hello world.", witness)
+// document Fragment
+//   paragraph Fragment
+//     sentence Fragment
+//       word Fragment ("Hello")
+//         char Shard ("H")
+//         char Shard ("e")
+//         char Shard ("l")
+//         char Shard ("l")
+//         char Shard ("o")
+//       word Fragment ("world.")
+//         ...
+```
+
+**`encode`**: takes text and a witness, returns a document fragment. Splits on double newlines into paragraphs, paragraphs into sentences (on `. `, `! `, `? ` boundaries), sentences into words (on spaces), words into characters (grapheme clusters).
+
+**`encode_paragraph`**, **`encode_sentence`**, **`encode_word`**, **`encode_char`**: individual constructors for each level. You can enter the hierarchy wherever you want.
+
+**`ingest`**: encodes text and collects every node into a `Store`, returning the root fragment and the populated store. Shared subtrees deduplicate automatically -- if two paragraphs contain the same word, that word's character shards exist once in the store.
+
+**`decode`**: extracts the data string from a fragment. Lossless round-trip: `encode` then `decode` returns the original text.
+
+**`DecodeError`**: error type for decode failures. Variant `UnknownLabel(String)`.
+
+Labels prevent cross-level collisions. A character "a" and a one-letter word "a" have different SHAs because their labels differ (`utf8/a` vs `token/a`). The label is hashed alongside the data via `labeled_hash`, which prefixes the data with its level before hashing.
+
+## fragmentation/git
+
+Source: `src/fragmentation/git.gleam`
+
+Content-addressed fragment persistence. Writes a fragment to disk named by its SHA.
+
+**`write`**: takes a fragment and a directory path. Computes the SHA via `hash_fragment`, serializes via `serialize`, writes to `<dir>/<sha>`. Returns `Ok(Nil)` on success, `Error(simplifile.FileError)` on failure.
+
+Idempotent. Writing the same fragment twice produces the same file at the same path with the same content. The file name is the content address. The file content is the canonical serialization.
+
+The store is a directory. Each fragment becomes a file. This is the simplest possible persistence layer for content-addressed data -- the same principle as git's object store, without the pack files.
+
 ## How They Compose
 
 A typical workflow:
@@ -101,5 +148,7 @@ A typical workflow:
 2. **Store** them for deduplication and lookup.
 3. **Walk** trees to traverse, search, or aggregate.
 4. **Diff** trees to understand what changed between two versions.
+5. **Encode** text into fragment trees for content-addressed document storage.
+6. **Persist** fragments to disk with git for durable, content-addressed storage.
 
-These modules don't depend on each other (except that all depend on core types). You can use walk without store. You can use diff without walk. They're independent operations on the same data structure.
+These modules don't depend on each other (except that all depend on core types). You can use walk without store. You can use diff without walk. Encoding uses walk and store internally but doesn't require you to. They're independent operations on the same data structure.
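
The `fragmentation/encoding` text added to docs/MODULES.md in this commit describes two mechanisms: splitting text level by level, and hashing each node's data together with a level label so that `utf8/a` and `token/a` get different SHAs. A minimal sketch of both ideas follows. It is not the library's implementation: `split_levels` and `labeled_hash_sketch` are hypothetical names, SHA-256 from the `gleam_crypto` package and the `/` separator are assumptions, and the sentence level is left out because the docs do not say how the `. `, `! `, `? ` terminators are retained.

```gleam
import gleam/bit_array
import gleam/crypto
import gleam/list
import gleam/string

/// Hypothetical stand-in for `labeled_hash`: prefix the data with its
/// level, then hash. SHA-256 and the "/" separator are assumptions.
pub fn labeled_hash_sketch(level: String, data: String) -> String {
  let labeled = level <> "/" <> data
  labeled
  |> bit_array.from_string
  |> crypto.hash(crypto.Sha256, _)
  |> bit_array.base16_encode
}

/// Split text into paragraphs (on double newlines), paragraphs into
/// words (on spaces), and words into grapheme clusters.
/// The sentence level described in the docs is omitted here.
pub fn split_levels(text: String) -> List(List(List(String))) {
  text
  |> string.split(on: "\n\n")
  |> list.map(fn(paragraph) {
    paragraph
    |> string.split(on: " ")
    |> list.map(string.to_graphemes)
  })
}
```

Under these assumptions, `labeled_hash_sketch("utf8", "a")` and `labeled_hash_sketch("token", "a")` hash different inputs and so return different digests, while `split_levels("Hello world.")` yields one paragraph holding the words `Hello` and `world.` as lists of graphemes.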

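The `fragmentation/git` text added in this commit says `write` names each file by the fragment's SHA and is idempotent. The sketch below illustrates that principle only. `write_sketch` is a hypothetical name, and hashing the serialized string with SHA-256 (via `gleam_crypto`) stands in for the library's `hash_fragment` and `serialize`, whose exact behaviour is not shown in the diff. The target directory is assumed to exist already.

```gleam
import gleam/bit_array
import gleam/crypto
import simplifile

/// Write already-serialized content to `<dir>/<sha>`.
/// Assumption: the address is the SHA-256 of the serialized text.
pub fn write_sketch(
  serialized: String,
  dir: String,
) -> Result(Nil, simplifile.FileError) {
  let sha =
    serialized
    |> bit_array.from_string
    |> crypto.hash(crypto.Sha256, _)
    |> bit_array.base16_encode

  // Same content gives the same path, so a second write replaces the
  // file with identical bytes instead of creating a new one.
  simplifile.write(to: dir <> "/" <> sha, contents: serialized)
}
```

Because the path is derived from the content, calling `write_sketch` twice with the same input leaves a single file in the directory, which is the idempotence property the docs describe.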