Section: Core Specification Version: 0.1
Codex documents use content-addressable hashing as a core identity mechanism. The document's hash serves as its canonical identifier, enabling:
- Integrity verification
- Version identification
- Lineage tracking
- Distributed storage
- Deduplication
The hash of a document's content IS its identity. This means:
- Identical content produces identical IDs
- Any content change produces a different ID
- IDs are deterministic (reproducible)
- No central authority needed for ID assignment
The specification supports multiple hash algorithms to accommodate:
- Different security requirements
- Future algorithm advances
- Post-quantum preparedness
Hashes are represented as: algorithm:hexdigest
sha256:3a7bd3e2360a3d29eea436fcfb7e44c735d117c42d1c1835420b6b9942dd4f1b
Components:
- algorithm - Hash algorithm identifier (lowercase)
- `:` - Separator
- hexdigest - Lowercase hexadecimal hash value
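As a sketch of the format rules above (parseHash is an illustrative helper, not part of the specification), an identifier can be split and validated like this:

```javascript
// Illustrative helper (not defined by the spec): split and validate
// an "algorithm:hexdigest" identifier.
function parseHash(id) {
  const sep = id.indexOf(":");
  if (sep < 0) throw new Error("missing ':' separator");
  const algorithm = id.slice(0, sep);
  const hexdigest = id.slice(sep + 1);
  if (!/^[a-z0-9-]+$/.test(algorithm)) throw new Error("algorithm must be lowercase");
  if (!/^[0-9a-f]+$/.test(hexdigest)) throw new Error("hexdigest must be lowercase hex");
  return { algorithm, hexdigest };
}
```

For example, `parseHash("sha256:3a7b...")` yields `{ algorithm: "sha256", hexdigest: "3a7b..." }`, while mixed-case input is rejected.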
| Algorithm | Identifier | Output Size | Status |
|---|---|---|---|
| SHA-256 | sha256 | 256 bits | Required (default) |
| SHA-384 | sha384 | 384 bits | Optional |
| SHA-512 | sha512 | 512 bits | Optional |
| SHA-3-256 | sha3-256 | 256 bits | Optional |
| SHA-3-512 | sha3-512 | 512 bits | Optional |
| BLAKE3 | blake3 | 256 bits | Optional |
Default: SHA-256 (sha256)
Implementations MUST support SHA-256. Support for other algorithms is OPTIONAL.
Documents MAY specify their hash algorithm in the manifest:
{
"codex": "0.1",
"hashAlgorithm": "sha256",
"id": "sha256:..."
}
If hashAlgorithm is omitted, SHA-256 is assumed.
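A minimal sketch of resolving the manifest's algorithm, applying the SHA-256 default and the MUST-support rule (resolveHashAlgorithm and the SUPPORTED set are hypothetical names, not part of the spec):

```javascript
// Hypothetical helper: resolve a manifest's hash algorithm.
// sha256 is REQUIRED; the other identifiers are OPTIONAL.
const SUPPORTED = new Set(["sha256"]); // extend if optional algorithms are implemented

function resolveHashAlgorithm(manifest) {
  const alg = manifest.hashAlgorithm ?? "sha256"; // omitted => SHA-256 is assumed
  if (!SUPPORTED.has(alg)) throw new Error(`unsupported hash algorithm: ${alg}`);
  return alg;
}
```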
The document ID is computed from a canonical representation of the document's semantic content and essential metadata:
Document ID = Hash(CanonicalContent)
The canonical content includes:
- Content blocks (semantic content)
- Essential metadata (Dublin Core)
- Asset hashes (not asset content)
The canonical content EXCLUDES:
- Presentation layers (visual rendering, not part of content identity)
- Precise layouts (rendering fidelity, not part of content identity)
- Timestamps (administrative, change on every edit)
- Security data (signatures reference the hash, not part of it)
- Collaboration data (comments, change tracking)
- Phantom data (off-page annotations)
- Form data (forms/data.json - filled values are mutable even on frozen documents)
Metadata inclusion: The Dublin Core terms included in the hash are title, creator, subject, description, and language. Administrative terms (date, publisher, identifier, rights) are excluded. See Metadata specification, section 6 for details.
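The subset rule above can be sketched as a simple filter (essentialMetadata is an illustrative helper; the term list is taken from this section):

```javascript
// Dublin Core terms that participate in the document hash.
// Administrative terms (date, publisher, identifier, rights) are dropped.
const HASHED_TERMS = ["creator", "description", "language", "subject", "title"];

function essentialMetadata(dublinCore) {
  const out = {};
  for (const term of HASHED_TERMS) {
    if (dublinCore[term] !== undefined) out[term] = dublinCore[term];
  }
  return out;
}
```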
Note: The document ID represents the document's semantic identity — what it says, not how it looks. Multiple visual presentations (letter, A4, responsive) of the same content produce the same document ID. For appearance attestation, see Scoped Signatures in the Security Extension.
The following table summarizes what is included in and excluded from the document content hash:
| Layer | Inside Hash | Notes |
|---|---|---|
| Content blocks | Yes | Core document identity — all text, structure, and semantic markup |
| Dublin Core metadata | Partial | Only title, creator, subject, description, language |
| Asset hashes | Yes | Asset identity via hash mapping (not asset bytes) |
| Asset content | No | Actual asset bytes are hashed separately; only references included |
| Presentation | No | Visual rendering instructions — not part of semantic identity |
| Precise layouts | No | Coordinate-level positioning — rendering fidelity |
| Collaboration | No | Comments, suggestions, change tracking |
| Phantoms | No | Off-page annotations and margin notes |
| Forms data | No | Fillable field values (mutable even on frozen documents) |
| Security | No | Signatures reference the hash — not part of it |
| Timestamps | No | Administrative metadata (created, modified) |
| Provenance | No | Lineage tracking and derivation history |
| CRDT metadata | No | Transient synchronization state from collaboration extension |
This boundary ensures that the document's identity represents its semantic content — what the document says — rather than how it appears or administrative metadata about it.
Note: CRDT metadata added by the collaboration extension (crdt fields on content blocks) is excluded from the content hash. CRDT data represents transient synchronization state and MUST be stripped before computing the document hash.
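A sketch of that stripping step, assuming crdt appears as a plain field on block objects (stripCrdt is an illustrative helper):

```javascript
// Sketch: recursively remove "crdt" fields from content blocks before
// the canonical content is serialized and hashed.
function stripCrdt(node) {
  if (Array.isArray(node)) return node.map(stripCrdt);
  if (node !== null && typeof node === "object") {
    const out = {};
    for (const [key, value] of Object.entries(node)) {
      if (key === "crdt") continue; // transient sync state, excluded from hash
      out[key] = stripCrdt(value);
    }
    return out;
  }
  return node;
}
```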
{
"version": "0.1",
"content": { /* content blocks */ },
"metadata": { /* dublin core subset */ },
"assetHashes": { /* asset ID -> hash mapping */ }
}
To ensure deterministic hashing:
- JSON Canonicalization: Use RFC 8785 (JCS) for JSON serialization:
  - Sort object keys lexicographically
  - No whitespace between tokens
  - Numbers without unnecessary precision
  - Strings with minimal escaping
- Unicode Normalization: All text content in NFC form
- Field Ordering: Within content blocks:
  - type first
  - id second (if present)
  - children or value third
  - Other fields in alphabetical order
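The sorted-keys, no-whitespace rules can be sketched as follows. This is a simplification: full RFC 8785 compliance also requires its number-serialization and string-escaping rules, which JSON.stringify only matches for plain strings and integers.

```javascript
// Minimal canonical-JSON sketch: lexicographically sorted keys, no
// whitespace between tokens. A production implementation should use
// a real RFC 8785 (JCS) library.
function canonicalize(value) {
  if (Array.isArray(value)) {
    return "[" + value.map(canonicalize).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const keys = Object.keys(value).sort(); // UTF-16 code unit order, as JCS requires
    return "{" + keys
      .map(k => JSON.stringify(k) + ":" + canonicalize(value[k]))
      .join(",") + "}";
  }
  return JSON.stringify(value); // strings, numbers, booleans, null
}
```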
1. Extract content blocks from document
2. Extract essential metadata
3. Collect asset hashes (ID -> hash mapping)
4. Build canonical structure
5. Serialize using JCS
6. Hash the serialized bytes
7. Format as "algorithm:hexdigest"
Given content:
{
"version": "0.1",
"blocks": [
{
"type": "heading",
"level": 1,
"children": [{ "type": "text", "value": "Hello" }]
}
]
}
And metadata:
{
"title": "Test Document",
"creator": "Jane Doe"
}
Canonical form (JCS serialized, single line):
{"assetHashes":{},"content":{"blocks":[{"children":[{"type":"text","value":"Hello"}],"level":1,"type":"heading"}],"version":"0.1"},"metadata":{"creator":"Jane Doe","title":"Test Document"},"version":"0.1"}
Hash: sha256:... (computed from the JCS-serialized bytes)
Files within the archive have their own hashes:
{
"content": {
"path": "content/document.json",
"hash": "sha256:abc123..."
}
}
These are computed from the raw file bytes (after decompression).
Assets include hashes in their index:
{
"id": "figure1",
"path": "figure1.avif",
"hash": "sha256:def456..."
}
Asset hashes feed into the document ID computation via the assetHashes mapping.
| Level | Scope | When |
|---|---|---|
| File | Individual file integrity | On file access |
| Asset | Asset integrity | On asset load |
| Document | Full document integrity | On document open, sign, verify |
File-level verification:
- Decompress file from archive
- Compute hash of decompressed bytes
- Compare with hash in manifest
- Reject on mismatch
Document-level verification:
- Verify all file hashes
- Recompute document ID from canonical content
- Compare with ID in manifest
- Reject on mismatch
| Document State | Hash Mismatch Action |
|---|---|
| draft | Warning (content may have been edited externally) |
| review | Warning |
| frozen | Error (document integrity compromised) |
| published | Error (document integrity compromised) |
For frozen/published documents, hash mismatches indicate tampering or corruption.
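The policy table above can be expressed as a lookup (onHashMismatch is an illustrative helper, not a spec-defined API):

```javascript
// Hash-mismatch policy per document state, from the table above:
// drafts warn (external edits are plausible), frozen/published error.
const MISMATCH_ACTION = {
  draft: "warning",
  review: "warning",
  frozen: "error",
  published: "error",
};

function onHashMismatch(state) {
  const action = MISMATCH_ACTION[state];
  if (action === undefined) throw new Error(`unknown document state: ${state}`);
  return action;
}
```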
Draft documents that haven't been finalized MAY use a pending placeholder:
{
"id": "pending",
"state": "draft"
}
This indicates the document is in active editing and the ID hasn't been computed yet.
The document ID SHOULD be computed when:
- Document state transitions from draft to review
- Document is signed
- Document is exported for distribution
- Explicitly requested by user/application
When a document is derived from another, the lineage records the parent:
{
"lineage": {
"parent": "sha256:originaldochash...",
"version": 2
}
}
The parent hash refers to the document ID of the previous version.
Documents form a chain through parent references:
doc-v1 (sha256:aaa...)
│
└── doc-v2 (sha256:bbb..., parent=sha256:aaa...)
│
└── doc-v3 (sha256:ccc..., parent=sha256:bbb...)
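Walking such a chain back to its root can be sketched as follows. The `resolve` callback (mapping a document ID to its manifest) is a hypothetical lookup, e.g. backed by a content-addressed store.

```javascript
// Sketch: follow lineage.parent references from a document back to
// the root, returning the chain of IDs (newest first, root last).
function lineageChain(id, resolve) {
  const chain = [];
  const seen = new Set();
  for (let current = id; current !== undefined; ) {
    if (seen.has(current)) throw new Error("cycle in lineage");
    seen.add(current);
    chain.push(current);
    current = resolve(current)?.lineage?.parent;
  }
  return chain;
}
```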
Multiple documents can share the same parent (branching):
doc-v1 (sha256:aaa...)
├── doc-v2a (sha256:bbb..., parent=sha256:aaa...)
└── doc-v2b (sha256:ccc..., parent=sha256:aaa...)
The lineage.branch field can distinguish branches:
{
"lineage": {
"parent": "sha256:aaa...",
"branch": "legal-review"
}
}
SHA-256 provides strong collision resistance: finding any collision would take on the order of 2^128 operations (the birthday bound), so the probability of accidental collision is negligible.
Preimage resistance: Given a hash, it's computationally infeasible to find content that produces it.
Second-preimage resistance: Given content and its hash, it's computationally infeasible to find different content with the same hash.
If an algorithm is found to be weak:
- Implementations SHOULD support re-hashing with stronger algorithm
- Signatures can bind to new hash
- Old IDs can be listed as aliases
Current hash algorithms are believed to be quantum-resistant (Grover's algorithm provides only quadratic speedup). SHA-256 provides ~128 bits of security against quantum attacks, which is considered adequate.
Hash computation is fast (typically <1ms for small documents). For large documents with many assets:
- Compute asset hashes incrementally as assets are added
- Cache computed hashes
- Use streaming hash computation for large files
Implementations SHOULD cache:
- File hashes (invalidate when file modified)
- Document ID (invalidate when content changes)
- Asset hashes (invalidate when asset added/modified)
For large files, use streaming hash computation:
hasher = new SHA256()
while chunk = file.read(CHUNK_SIZE):
hasher.update(chunk)
hash = hasher.finalize()
Content:
{"blocks":[{"children":[{"type":"text","value":"Hello"}],"type":"paragraph"}],"version":"0.1"}
Canonical form (no metadata, no assets):
{"assetHashes":{},"content":{"blocks":[{"children":[{"type":"text","value":"Hello"}],"type":"paragraph"}],"version":"0.1"},"metadata":{},"version":"0.1"}
A canonical structure with assets and metadata populated:
{
"assetHashes": {
"figure1": "sha256:abc123...",
"logo": "sha256:def456..."
},
"content": { /* ... */ },
"metadata": {
"title": "Annual Report",
"creator": "Finance Team"
},
"version": "0.1"
}
A full document verification sketch:
function verifyDocument(archive) {
// 1. Load manifest
const manifest = parseJSON(archive.read("manifest.json"))
// 2. Verify file hashes
for (const fileRef of getAllFileRefs(manifest)) {
const fileBytes = archive.read(fileRef.path)
const computedHash = sha256(fileBytes)
if (computedHash !== fileRef.hash) {
throw new Error(`File hash mismatch: ${fileRef.path}`)
}
}
// 3. Verify document ID (metadata here is the hashed Dublin Core subset)
const content = parseJSON(archive.read(manifest.content.path))
const metadata = parseJSON(archive.read(manifest.metadata.dublinCore.path))
const assetHashes = collectAssetHashes(archive, manifest)
const canonical = {
version: "0.1",
content: content,
metadata: metadata,
assetHashes: assetHashes
}
const canonicalBytes = JCS.serialize(canonical)
const computedId = "sha256:" + sha256hex(canonicalBytes)
if (manifest.id !== "pending" && computedId !== manifest.id) {
throw new Error(`Document ID mismatch`)
}
return true
}