1. **[What Fragmentation Is](WHAT-FRAGMENTATION-IS.md)** -- the data structure, the self-similar property, content addressing.
2. **[Witnessed](WITNESSED.md)** -- why the observer is part of the hash.
-3. **[Modules](MODULES.md)** -- how store, walk, and diff compose on the core types.
+3. **[Modules](MODULES.md)** -- how store, walk, diff, encoding, and git compose on the core types.
4. **[Agent Guide](AGENT-GUIDE.md)** -- what future agents need to know that the code can't say.

## Why This Order

Start with what the thing is. Then understand the deepest design decision (Witnessed). Then learn how the pieces fit. Then read the guide for using it in practice.

-The types are small enough to hold in your head. Four types, four modules. Everything else is consequences.
+The types are small enough to hold in your head. Four types, six modules. Everything else is consequences.
```
fragmentation/diff      structural comparison between trees
fragmentation/encoding  text as content-addressed trees
fragmentation/git       content-addressed fragment persistence
```

## fragmentation (core)
@@ -93,6 +95,51 @@ Positional comparison means: the first child of old is compared with the first c

**`summary`** reduces a list of changes to four counts: `#(added, removed, modified, unchanged)`.

## fragmentation/encoding

Source: `src/fragmentation/encoding.gleam`

Text as content-addressed trees. Five levels of structure: document, paragraph, sentence, word, character. Each level is a fragment containing the next level down. Characters are terminal shards.

```gleam
let root = encoding.encode("Hello world.", witness)
// document Fragment
//   paragraph Fragment
//     sentence Fragment
//       word Fragment ("Hello")
//         char Shard ("H")
//         char Shard ("e")
//         char Shard ("l")
//         char Shard ("l")
//         char Shard ("o")
//       word Fragment ("world.")
//         ...
```

**`encode`**: takes text and a witness, returns a document fragment. Splits the text into paragraphs on double newlines, paragraphs into sentences (on `. `, `! `, `? ` boundaries), sentences into words (on spaces), and words into characters (grapheme clusters).
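
The splitting rules can be pictured with plain standard-library calls. This is only a sketch of the boundaries described above, not the module's implementation: the function names `split_paragraphs`, `split_words`, and `split_chars` are hypothetical, no fragments or witnesses are built, and sentence splitting is left out because it has to keep the punctuation.

```gleam
import gleam/io
import gleam/list
import gleam/string

/// Split a document into paragraphs on double newlines.
pub fn split_paragraphs(text: String) -> List(String) {
  string.split(text, on: "\n\n")
}

/// Split a sentence into words on single spaces.
pub fn split_words(sentence: String) -> List(String) {
  string.split(sentence, on: " ")
}

/// Split a word into grapheme clusters, the terminal level.
pub fn split_chars(word: String) -> List(String) {
  string.to_graphemes(word)
}

pub fn main() {
  // Words of one sentence, each broken into its graphemes.
  "Hello world."
  |> split_words
  |> list.map(split_chars)
  |> string.inspect
  |> io.println
}
```

Note that the trailing period stays attached to `"world."`, matching the tree shown in the example above.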

**`encode_paragraph`**, **`encode_sentence`**, **`encode_word`**, **`encode_char`**: individual constructors for each level. You can enter the hierarchy wherever you want.

**`ingest`**: encodes text and collects every node into a `Store`, returning the root fragment and the populated store. Shared subtrees deduplicate automatically -- if two paragraphs contain the same word, that word's character shards exist once in the store.
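
Deduplication falls out of keying by content. The toy below is an assumption-laden illustration, not the real `Store`: it uses a plain `Dict` keyed by a label-prefixed string where the library would key on a SHA, and `ingest_words` is a hypothetical name.

```gleam
import gleam/dict.{type Dict}
import gleam/int
import gleam/io
import gleam/list
import gleam/string

/// A toy store mapping a content key to its data.
/// Assumption: a string key stands in for the SHA the real store would use.
pub fn ingest_words(text: String) -> Dict(String, String) {
  text
  |> string.split(on: " ")
  |> list.fold(dict.new(), fn(store, word) {
    // Inserting an existing key overwrites it with identical data,
    // so repeated words collapse into a single entry.
    dict.insert(store, "token/" <> word, word)
  })
}

pub fn main() {
  let store = ingest_words("the cat saw the dog")
  // Five words go in, four distinct entries come out.
  io.println(int.to_string(dict.size(store)))
}
```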

**`decode`**: extracts the data string from a fragment. Lossless round-trip: `encode` then `decode` returns the original text.

**`DecodeError`**: error type for decode failures. Variant `UnknownLabel(String)`.

Labels prevent cross-level collisions. A character "a" and a one-letter word "a" have different SHAs because their labels differ (`utf8/a` vs `token/a`). The label is hashed alongside the data via `labeled_hash`, which prefixes the data with its level before hashing.
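
A minimal sketch of label-prefixed hashing, to show why the two "a" values cannot collide. Several things here are assumptions rather than facts about the library: the `"/"` separator is inferred from the `utf8/a` and `token/a` examples, SHA-256 is a guess at the digest, `labeled_hash_sketch` is a hypothetical name, and the `gleam_crypto` package is assumed to be available.

```gleam
import gleam/bit_array
import gleam/crypto
import gleam/io

/// Hash data together with its level label.
/// Assumption: label and data are joined with "/" and hashed with SHA-256;
/// the library's real `labeled_hash` may differ on both points.
pub fn labeled_hash_sketch(label: String, data: String) -> String {
  let input = label <> "/" <> data
  input
  |> bit_array.from_string
  |> crypto.hash(crypto.Sha256, _)
  |> bit_array.base16_encode
}

pub fn main() {
  // Same data, different labels, so the two digests differ.
  io.println(labeled_hash_sketch("utf8", "a"))
  io.println(labeled_hash_sketch("token", "a"))
}
```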

## fragmentation/git

Source: `src/fragmentation/git.gleam`

Content-addressed fragment persistence. Writes a fragment to disk named by its SHA.

**`write`**: takes a fragment and a directory path. Computes the SHA via `hash_fragment`, serializes via `serialize`, writes to `<dir>/<sha>`. Returns `Ok(Nil)` on success, `Error(simplifile.FileError)` on failure.

Idempotent. Writing the same fragment twice produces the same file at the same path with the same content. The file name is the content address. The file content is the canonical serialization.

The store is a directory. Each fragment becomes a file. This is the simplest possible persistence layer for content-addressed data -- the same principle as git's object store, without the pack files.
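
The write path can be sketched on plain strings to make the idempotence visible. This is an illustration under stated assumptions, not the module's code: a SHA-256 of the raw string stands in for `hash_fragment` and `serialize`, `write_sketch` and the directory name are hypothetical, and the `gleam_crypto` and `simplifile` packages are assumed to be installed.

```gleam
import gleam/bit_array
import gleam/crypto
import simplifile

/// Write `content` to `<dir>/<sha>`, where the file name is the hex digest
/// of the content.
/// Assumption: SHA-256 over the raw string replaces the library's
/// `hash_fragment` and `serialize` steps.
pub fn write_sketch(
  content: String,
  dir: String,
) -> Result(Nil, simplifile.FileError) {
  let sha =
    content
    |> bit_array.from_string
    |> crypto.hash(crypto.Sha256, _)
    |> bit_array.base16_encode
  // Make sure the store directory exists before writing into it.
  case simplifile.create_directory_all(dir) {
    Ok(Nil) -> simplifile.write(to: dir <> "/" <> sha, contents: content)
    Error(e) -> Error(e)
  }
}

pub fn main() {
  // Writing twice targets the same path with the same bytes.
  let assert Ok(Nil) = write_sketch("hello", "./fragment-store-sketch")
  let assert Ok(Nil) = write_sketch("hello", "./fragment-store-sketch")
  Nil
}
```

Because the path is derived from the content, a second write can only replace the file with identical bytes, which is what makes the operation safe to repeat.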

## How They Compose

A typical workflow:
@@ -101,5 +148,7 @@ A typical workflow:
2. **Store** them for deduplication and lookup.
3. **Walk** trees to traverse, search, or aggregate.
4. **Diff** trees to understand what changed between two versions.
5. **Encode** text into fragment trees for content-addressed document storage.
6. **Persist** fragments to disk with git for durable, content-addressed storage.

-These modules don't depend on each other (except that all depend on core types). You can use walk without store. You can use diff without walk. They're independent operations on the same data structure.
+These modules don't depend on each other (except that all depend on core types). You can use walk without store. You can use diff without walk. Encoding uses walk and store internally but doesn't require you to. They're independent operations on the same data structure.