Pithos File Format Specification

Version: 1.0
Status: Draft
Date: September 2025
Purpose: Next-generation file format for scientific data management, optimized for object storage with built-in deduplication, encryption, and metadata support

1. Introduction

This document specifies the Pithos file format using the key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" as described in RFC 2119.

Pithos is an append-only archive format designed for efficient storage and sharing of scientific data. It combines content-defined deduplication, convergent encryption, and flexible metadata support in a privacy-preserving architecture optimized for object storage systems.

The code examples in this document are illustrative and only describe the architecture. Optimized implementations may lay out the individual structures differently.

2. Core Design Principles

Append-only architecture: New data and metadata MUST be appended, never modifying existing content
Content-addressed storage: All blocks MUST be identified by Blake3 hashes enabling deduplication
Privacy-preserving sharing: Users MUST NOT be able to see who else has access to files
Flexible metadata: Metadata MUST be stored as regular files with special type markers
Progressive enhancement: Implementations MUST support the base format and MAY support optional features
Emergency recovery: Block markers MUST enable reconstruction even with corrupted directories
Hierarchical organization: Files use full paths from archive root; directories MUST be declared before their contents

3. File Structure

A Pithos file MUST have the following structure:

[FileHeader]      // REQUIRED: Format identifier and version
[Block Data...]   // Zero or more data blocks with headers
[Directory]       // REQUIRED: Can repeat (append-only)
[Block Data...]   // Zero or more additional blocks
[Directory]       // REQUIRED: File MUST end with directory

4. Core Data Structures

4.1 File Header

Every Pithos file MUST begin with a FileHeader:

/// File header - appears once at the beginning of every Pithos file
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FileHeader {
    pub magic: [u8; 4],    // MUST be b"PITH"
    pub version: u16,      // Format version (e.g., 0x0100 for 1.0)
}
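
As a non-normative illustration, the header could be serialized as follows. The sketch assumes a fixed 6-byte layout (4 magic bytes followed by the version as a big-endian u16, in line with the byte-order rule in section 5.3). The function names `encode_header` and `decode_header` are hypothetical and not part of the specification.

```rust
/// Magic bytes and version constant taken from the specification.
pub const MAGIC: [u8; 4] = *b"PITH";
pub const VERSION_1_0: u16 = 0x0100;

/// Assumed layout: 4 magic bytes followed by a big-endian u16 version.
pub fn encode_header(version: u16) -> [u8; 6] {
    let mut out = [0u8; 6];
    out[..4].copy_from_slice(&MAGIC);
    out[4..].copy_from_slice(&version.to_be_bytes());
    out
}

/// Returns the version if the buffer starts with a well-formed header.
pub fn decode_header(buf: &[u8]) -> Result<u16, String> {
    if buf.len() < 6 {
        return Err("buffer too short for a file header".to_string());
    }
    if !buf.starts_with(&MAGIC) {
        return Err("bad magic: expected b\"PITH\"".to_string());
    }
    Ok(u16::from_be_bytes([buf[4], buf[5]]))
}
```

A reader would typically call `decode_header` on the first six bytes and reject the file before touching any directory data if the magic or version is not recognized.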

4.2 Block Storage

4.2.1 Block Header

Each block MUST be preceded by a minimal header for emergency scanning:

/// Minimal block header - just for emergency scanning
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct BlockHeader {
    pub marker: [u8; 4],   // MUST be b"BLCK"
}
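
To illustrate how the marker supports emergency scanning, the sketch below collects every offset at which b"BLCK" occurs in a byte buffer. The function name `scan_block_markers` is hypothetical. Because the same four bytes can also appear inside block payloads, the result is only a list of candidates that a recovery tool would still have to verify, for example by hashing the data that follows.

```rust
pub const BLOCK_MARKER: [u8; 4] = *b"BLCK";

/// Returns every byte offset at which the block marker occurs.
/// Matches inside payload data are possible, so these are candidates only.
pub fn scan_block_markers(data: &[u8]) -> Vec<usize> {
    let mut offsets = Vec::new();
    for (offset, window) in data.windows(BLOCK_MARKER.len()).enumerate() {
        if window == &BLOCK_MARKER[..] {
            offsets.push(offset);
        }
    }
    offsets
}
```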

4.2.2 Block Index Entry

Complete block metadata MUST be stored in the directory:

/// Block index entry - single source of truth for block hashes
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct BlockIndexEntry {
    pub index: u64,              // Unique sequential identifier (varint encoded)
    pub hash: [u8; 32],          // Full Blake3 hash of original content
    pub offset: u64,             // Byte offset in file (varint encoded)
    pub stored_size: u64,        // Size as stored (compressed/encrypted) (varint)
    pub original_size: u64,      // Original uncompressed size (varint)
    pub flags: ProcessingFlags,  // Compression, encryption settings
    pub location: BlockLocation, // Where block data resides
}

bitflags::bitflags! {
    /// Processing flags packed into a single byte
    #[derive(Debug, Clone, PartialEq, Eq)]
    pub struct ProcessingFlags: u8 {
        // Bits 0-2: Compression level (0=none, 1-7=implementation-defined)
        const COMPRESSION_LEVEL_1 = 0b0000_0001;
        const COMPRESSION_LEVEL_2 = 0b0000_0010;
        const COMPRESSION_LEVEL_3 = 0b0000_0011;
        const COMPRESSION_LEVEL_4 = 0b0000_0100;
        const COMPRESSION_LEVEL_5 = 0b0000_0101;
        const COMPRESSION_LEVEL_6 = 0b0000_0110;
        const COMPRESSION_LEVEL_7 = 0b0000_0111;
        const COMPRESSION_MASK    = 0b0000_0111;

        // Bit 3: Encryption enabled
        const ENCRYPTION_ENABLED = 0b0000_1000;

        // Bits 4-7: Reserved for future use (MUST be zero)
    }
}
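
Since bits 0-2 form a 3-bit number rather than three independent flags, the byte can also be handled with plain masks. The following sketch, which does not depend on the bitflags crate, packs and unpacks a flag byte and rejects values with reserved bits set. `RESERVED_MASK`, `pack_flags`, and `unpack_flags` are hypothetical names introduced for illustration.

```rust
pub const COMPRESSION_MASK: u8 = 0b0000_0111;
pub const ENCRYPTION_ENABLED: u8 = 0b0000_1000;
pub const RESERVED_MASK: u8 = 0b1111_0000; // bits 4-7 MUST be zero

/// Packs a compression level (0-7) and the encryption bit into one byte.
pub fn pack_flags(level: u8, encrypted: bool) -> Option<u8> {
    if level > COMPRESSION_MASK {
        return None;
    }
    let encryption_bit = if encrypted { ENCRYPTION_ENABLED } else { 0 };
    Some(level | encryption_bit)
}

/// Splits a flag byte into (compression level, encrypted).
/// Returns None if any reserved bit is set.
pub fn unpack_flags(byte: u8) -> Option<(u8, bool)> {
    if (byte & RESERVED_MASK) != 0 {
        return None;
    }
    Some((byte & COMPRESSION_MASK, (byte & ENCRYPTION_ENABLED) != 0))
}
```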

/// Block storage location
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum BlockLocation {
    Local,                      // Block data at specified offset in this file
    External { url: String },   // URL to external storage
}

4.3 Directory Structure

The directory MUST contain all file and block metadata:

/// Directory - lists all files and blocks in this segment
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Directory {
    pub identifier: [u8; 8],                            // MUST be b"PITHOSDR"
    pub parent_directory_offset: Option<(u64, u64)>,    // Previous directory (start, len) (varint, backwards chain)
    pub files: Vec<FileEntry>,                          // Files in this segment
    pub blocks: Vec<BlockIndexEntry>,                   // Blocks in this segment
    pub relations: Vec<(u64, String)>,                  // (relation index, relation name or ID) pairs
    pub encryption: Vec<EncryptionSection>,
    pub dir_len: u64,
    pub crc32: u32,                                     // CRC32 of all preceding fields
}
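
The specification does not name the CRC32 variant. The sketch below assumes the common CRC-32 used by zlib and PNG (reflected polynomial 0xEDB88320) and assumes the checksum is appended as a big-endian u32 after the serialized preceding fields. The helpers `seal` and `verify` are hypothetical.

```rust
/// Bitwise CRC-32 (reflected polynomial 0xEDB88320, as used by zlib and PNG).
/// Assumption: the specification does not state which CRC32 variant is meant.
pub fn crc32(data: &[u8]) -> u32 {
    let mut crc: u32 = 0xFFFF_FFFF;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // All ones if the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Appends the checksum (big-endian) to an already serialized directory body.
pub fn seal(mut body: Vec<u8>) -> Vec<u8> {
    let sum = crc32(&body);
    body.extend_from_slice(&sum.to_be_bytes());
    body
}

/// Recomputes the checksum over everything except the trailing 4 bytes.
pub fn verify(sealed: &[u8]) -> bool {
    if sealed.len() < 4 {
        return false;
    }
    let (body, tail) = sealed.split_at(sealed.len() - 4);
    let expected = crc32(body).to_be_bytes();
    tail == &expected[..]
}
```

A table-driven CRC would be faster; the bitwise form is used here only because it is short and easy to check.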

4.4 File Representation

4.4.1 File Types

Files MUST be classified by type:

/// File types (u8 representation for efficiency)
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FileType {
    Data = 0,        // Regular data file (default)
    Metadata = 1,    // Metadata file (RO-Crate, DataCite, etc.)
    Directory = 2,   // Directory entry
    Symlink = 3,     // Symbolic link
    // Values 4-255 reserved for future use
}

4.4.2 File Entry

Each file MUST be represented by a FileEntry:

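/// Block list of a file - either still encrypted or already decrypted
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum BlockDataState {
    Encrypted(Vec<u8>),             // Nonce + ChaCha20-Poly1305 ciphertext
    Decrypted(Vec<(u64, [u8; 32])>) // (block index, SHAKE256 hash) pairs
}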

/// File entry - describes a single file, directory, or symlink
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FileEntry {
    pub file_id: u64,                    // Sequential unique identifier (varint)
    pub path: String,                    // Full path from archive root (UTF-8)
    pub file_type: FileType,             // Type of entry
    pub block_data: BlockDataState,
    pub created: u64,                    // Unix timestamp (seconds since epoch)
    pub modified: u64,                   // Unix timestamp (seconds since epoch)
    pub file_size: u64,                  // Total size in bytes (varint)
    pub permissions: u32,                // Unix-style permissions
    pub references: Vec<Reference>,      // Metadata->Data references only (see 4.4.3)
    pub symlink_target: Option<String>,  // Target path for symlinks
}

4.4.3 File References

References MUST be one-way from metadata to data files:

/// Simplified reference structure
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Reference {
    pub target_file_id: u64,    // Target file ID (varint)
    pub relationship: u64,      // Relationship type (varint)
}

Standard relationship types:

DESCRIBES = 0: Metadata describing target
ANNOTATES = 1: Additional annotations
DERIVED_FROM = 2: Derived from target
SOURCE_OF = 3: Source of target
PREVIOUS_VERSION = 4: Previous version
NEXT_VERSION = 5: Next version
PART_OF = 6: Part of collection
CONTAINS = 7: Contains target
INPUT_TO = 8: Input to process
OUTPUT_FROM = 9: Output from process
Custom relationships start at 1000
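
One possible in-memory representation of these codes is an enum with lossless conversion to and from the varint value. The type and method names below are hypothetical. The specification does not say how codes 10 to 999 are to be treated, so the sketch keeps them in a separate `Unassigned` variant instead of rejecting them.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Relationship {
    Describes,
    Annotates,
    DerivedFrom,
    SourceOf,
    PreviousVersion,
    NextVersion,
    PartOf,
    Contains,
    InputTo,
    OutputFrom,
    Unassigned(u64), // 10..=999: not defined by the specification (assumption)
    Custom(u64),     // 1000 and above
}

impl Relationship {
    /// Maps a numeric relationship code to its variant.
    pub fn from_code(code: u64) -> Self {
        match code {
            0 => Self::Describes,
            1 => Self::Annotates,
            2 => Self::DerivedFrom,
            3 => Self::SourceOf,
            4 => Self::PreviousVersion,
            5 => Self::NextVersion,
            6 => Self::PartOf,
            7 => Self::Contains,
            8 => Self::InputTo,
            9 => Self::OutputFrom,
            10..=999 => Self::Unassigned(code),
            _ => Self::Custom(code),
        }
    }

    /// Returns the numeric code that is stored in the `relationship` field.
    pub fn code(self) -> u64 {
        match self {
            Self::Describes => 0,
            Self::Annotates => 1,
            Self::DerivedFrom => 2,
            Self::SourceOf => 3,
            Self::PreviousVersion => 4,
            Self::NextVersion => 5,
            Self::PartOf => 6,
            Self::Contains => 7,
            Self::InputTo => 8,
            Self::OutputFrom => 9,
            Self::Unassigned(c) | Self::Custom(c) => c,
        }
    }
}
```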

4.5 Encryption Section

Encryption sections MUST follow each directory:

/// Encryption section - privacy-preserving access control
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct EncryptionSection {
    pub sender_public_key: [u8; 32],       // X25519 public key
    pub recipients: Vec<RecipientSection>, // Per-recipient data
}

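/// Per-recipient payload - either still encrypted or already decrypted
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum RecipientData {
    Encrypted(Vec<u8>),             // Nonce + ChaCha20-Poly1305 ciphertext
    Decrypted(Vec<(u64, [u8; 32])>) // (file index, SHAKE256 hash) pairs
}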

/// Per-recipient encrypted data
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RecipientSection {
    pub recipient_public_key: [u8; 32],  // Recipient's X25519 public key
    pub recipient_data: RecipientData,   // Encrypted FileKeyEntry list
}

4.6 Error Types

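// Note: the #[from] attributes below presuppose an error derive macro
// (for example thiserror::Error); per-variant messages are omitted here.
#[derive(Debug)]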
pub enum PithosError {
    Io(#[from] io::Error),
    Conversion(String),
    SystemTimeError(#[from] SystemTimeError),
    StripPrefix(#[from] std::path::StripPrefixError),
    WalkDir(#[from] walkdir::Error),
    FastCDC(#[from] fastcdc::v2020::Error),
    Serialization(#[from] SerializationError),
    Deserialization(#[from] DeserializationError),
    Crypt(#[from] CryptError),
    Crypt4GH(#[from] Crypt4GHError),
    Cipher(#[from] ChaChaPoly1305Error),
    Compression(#[from] ZstdError),
    InvalidBlockDataState(String),
    BlockHashNotFound([u8; 32]),
    FileNotFound(String),
    DuplicateFileId(String),
    RelationIdOccupied(u64),
    PathOccupied(String),
    InvalidFileType(String),
    NoMatchingRecipient,
    InvalidRecipientDataState(String),
    Other(String),
}

5. Encoding Specifications

5.1 Integer Encoding

All integer fields marked as "varint" MUST use unsigned LEB128 encoding.
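
For reference, unsigned LEB128 stores seven payload bits per byte, least significant group first, and sets the high bit on every byte except the last. A minimal sketch for u64 values follows; the function names are hypothetical.

```rust
/// Appends `value` to `out` as unsigned LEB128.
pub fn encode_varint(mut value: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (value & 0x7F) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte); // last group: continuation bit clear
            return;
        }
        out.push(byte | 0x80); // more groups follow
    }
}

/// Decodes one varint from the start of `input`.
/// Returns (value, bytes consumed), or None if the input is truncated
/// or does not fit into 64 bits.
pub fn decode_varint(input: &[u8]) -> Option<(u64, usize)> {
    let mut value: u64 = 0;
    for (i, &byte) in input.iter().enumerate() {
        if i >= 10 {
            return None; // a u64 needs at most 10 groups
        }
        let payload = u64::from(byte & 0x7F);
        if i == 9 && payload > 1 {
            return None; // the 10th group may only carry bit 63
        }
        value |= payload << (7 * i as u32);
        if (byte & 0x80) == 0 {
            return Some((value, i + 1));
        }
    }
    None
}
```

For example, the value 300 is encoded as the two bytes 0xAC 0x02.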

5.2 String Encoding

All strings MUST be encoded as UTF-8 with a varint length prefix.
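
A sketch of this encoding is shown below. It assumes that the length prefix counts UTF-8 bytes rather than characters, which the specification does not state explicitly. The names `write_string` and `read_string` are hypothetical, and compact LEB128 helpers are included so that the block stands alone.

```rust
/// Compact unsigned LEB128 encoder.
fn write_varint(mut v: u64, out: &mut Vec<u8>) {
    while v >= 0x80 {
        out.push((v as u8 & 0x7F) | 0x80);
        v >>= 7;
    }
    out.push(v as u8);
}

/// Compact decoder; bits beyond 64 are silently dropped in this sketch.
fn read_varint(input: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &b) in input.iter().enumerate().take(10) {
        value |= u64::from(b & 0x7F) << (7 * i as u32);
        if (b & 0x80) == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

/// Writes the byte length as a varint, followed by the UTF-8 bytes.
pub fn write_string(s: &str, out: &mut Vec<u8>) {
    write_varint(s.len() as u64, out);
    out.extend_from_slice(s.as_bytes());
}

/// Reads one string; returns it with the total number of bytes consumed.
pub fn read_string(input: &[u8]) -> Option<(String, usize)> {
    let (len, prefix) = read_varint(input)?;
    if len > input.len() as u64 {
        return None; // declared length exceeds the available data
    }
    let end = prefix.checked_add(len as usize)?;
    let bytes = input.get(prefix..end)?;
    let text = std::str::from_utf8(bytes).ok()?; // rejects invalid UTF-8
    Some((text.to_owned(), end))
}
```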

5.3 Byte Order

All multi-byte values not using varint encoding MUST use big-endian byte order.

5.4 Directory Entry Ordering Requirements

CRITICAL: Directory entries MUST follow strict ordering rules to ensure proper extraction:

Parent Before Child Rule: A directory entry MUST appear before any entries for files or subdirectories within it
Path Format: All paths MUST be relative (no leading /) and use forward slashes as separators
Root Directory: The root directory is implicit and MUST NOT have an entry
Validation: Implementations MUST validate ordering during both writing and reading

Example of valid ordering:

data/                    (directory)
data/raw/                (directory - parent "data/" already exists)
data/raw/file1.csv       (file - parent "data/raw/" already exists)
data/processed/          (directory - parent "data/" already exists)
data/processed/file2.csv (file - parent "data/processed/" already exists)
docs/                    (directory)
docs/README.md           (file - parent "docs/" already exists)
data/raw/file1_v2.csv    (file - parent "data/raw/" already exists) -> Newer version of file1.csv

Example of INVALID ordering:

data/raw/file1.csv       (ERROR: parent "data/raw/" not yet declared)
data/raw/                (too late - file already referenced this directory)
data/                    (too late - subdirectory already referenced this)
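
The parent-before-child rule can be checked in a single pass by remembering which directories have been declared so far. The sketch below is one way to do this; it represents each entry as a hypothetical `(path, is_directory)` pair and accepts directory paths with or without a trailing slash. The function name `validate_order` is not part of the specification.

```rust
use std::collections::HashSet;

/// Returns the parent directory of a path, or None for top-level entries.
fn parent(path: &str) -> Option<&str> {
    path.trim_end_matches('/').rsplit_once('/').map(|(p, _)| p)
}

/// Checks that every entry's parent directory was declared earlier.
/// `entries` holds (path, is_directory) pairs in archive order.
pub fn validate_order(entries: &[(&str, bool)]) -> Result<(), String> {
    let mut declared: HashSet<&str> = HashSet::new();
    for &(path, is_dir) in entries {
        if path.starts_with('/') {
            return Err(format!("absolute path not allowed: {path}"));
        }
        // Top-level entries live in the implicit root and need no parent.
        if let Some(p) = parent(path) {
            if !declared.contains(p) {
                return Err(format!("parent \"{p}/\" not yet declared for {path}"));
            }
        }
        if is_dir {
            declared.insert(path.trim_end_matches('/'));
        }
    }
    Ok(())
}
```

Applied to the two listings above, the first passes and the second fails at its first entry.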

6. Content Processing

6.1 Content-Defined Chunking

Implementations SHOULD use content-defined chunking with recommended parameters:

min_size: 64 KB
avg_size: 128 KB
max_size: 512 KB
window_size: 48 bytes
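
To show how these parameters interact, the following sketch implements a simplified gear-hash chunker. It is not FastCDC and not necessarily the algorithm a Pithos implementation uses: there is no normalized chunking, the 48-byte window is not modeled (a gear hash implicitly covers the last 64 bytes), and the resulting average is only approximately `avg`. The gear table is generated with a splitmix64 sequence chosen for this example, and all names are hypothetical.

```rust
pub const MIN_SIZE: usize = 64 * 1024;
pub const AVG_SIZE: usize = 128 * 1024;
pub const MAX_SIZE: usize = 512 * 1024;

/// Deterministic table of 256 pseudo-random values (splitmix64 sequence).
fn gear_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    for slot in table.iter_mut() {
        state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        *slot = z ^ (z >> 31);
    }
    table
}

/// Splits `data` into consecutive chunk lengths using a gear rolling hash.
pub fn chunk_lengths(data: &[u8], min: usize, avg: usize, max: usize) -> Vec<usize> {
    assert!(0 < min && min < avg && avg <= max, "need 0 < min < avg <= max");
    let gear = gear_table();
    // Once `min` bytes are consumed, cut with probability of about
    // 1 / (avg - min) per byte by testing the top `bits` bits of the hash.
    let bits = (avg - min).next_power_of_two().trailing_zeros().clamp(1, 63);
    let mask: u64 = !0u64 << (64 - bits);

    let mut lengths = Vec::new();
    let mut start = 0;
    while start < data.len() {
        let remaining = &data[start..];
        let limit = remaining.len().min(max);
        let mut hash: u64 = 0;
        let mut cut = limit; // forced cut at `max` or at the end of input
        for (i, &byte) in remaining[..limit].iter().enumerate() {
            hash = (hash << 1).wrapping_add(gear[byte as usize]);
            if i + 1 >= min && (hash & mask) == 0 {
                cut = i + 1;
                break;
            }
        }
        lengths.push(cut);
        start += cut;
    }
    lengths
}
```

Because cut points depend only on nearby content, an insertion early in a file tends to shift only the chunks around it, which is what makes block-level deduplication effective.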

6.2 Block Hashing

All block hashes MUST use Blake3.

6.3 Convergent Encryption

Content keys MUST be derived deterministically using SHAKE256.

6.4 Compression

Implementations SHOULD support:

Level 0: No compression
Levels 1-3: Fast compression (e.g., Zstd levels 1-3)
Levels 4-6: Balanced compression (e.g., Zstd levels 4-9)
Level 7: Maximum compression (e.g., Zstd level 19+)

7. Operations Overview

7.1 Reading Operations

Read and validate file header
Find last directory by scanning from end
Validate directory ordering
Build block index
Extract files by reading referenced blocks

7.2 Writing Operations

Write file header
Process files in correct directory order
Chunk content using content-defined chunking
Deduplicate blocks by hash
Write directory and encryption sections
Validate complete structure
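
The deduplication step can be pictured as a map from content hash to block index: a chunk is written only when its hash has not been seen before. The sketch below takes the 32-byte hash as input (a real writer would compute it with Blake3) and assumes block indices are assigned sequentially from zero. `BlockIndex` and its methods are hypothetical names.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// Maps a block's content hash to its sequential index in the archive.
#[derive(Default)]
pub struct BlockIndex {
    by_hash: HashMap<[u8; 32], u64>,
}

impl BlockIndex {
    /// Returns (block index, is_new). The caller writes the block data
    /// only when `is_new` is true; otherwise it reuses the existing index.
    pub fn intern(&mut self, hash: [u8; 32]) -> (u64, bool) {
        let next = self.by_hash.len() as u64;
        match self.by_hash.entry(hash) {
            Entry::Occupied(slot) => (*slot.get(), false),
            Entry::Vacant(slot) => {
                slot.insert(next);
                (next, true)
            }
        }
    }

    /// Number of distinct blocks recorded so far.
    pub fn unique_blocks(&self) -> usize {
        self.by_hash.len()
    }
}
```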

7.3 Directory Tree Operations

When archiving directory trees:

Process directories before their contents
Maintain relative path structure
Preserve file metadata (permissions, timestamps)
Handle symlinks appropriately per platform

8. Security Considerations

Implementations MUST verify block hashes before decompression/decryption
CRC32 values MUST be validated for directories and encryption sections
Convergent encryption reveals when identical files exist (accepted trade-off)
External block URLs MUST use HTTPS in production environments
Path traversal attacks MUST be prevented through validation
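
As an illustration of the path validation requirement, the sketch below accepts only relative, forward-slash paths whose components cannot leave the extraction root. The function name is hypothetical, and platform-specific hazards such as Windows drive prefixes or reserved device names are deliberately not covered.

```rust
/// Returns true if an archive path is relative, uses '/' separators,
/// and contains no empty, "." or ".." components.
pub fn is_safe_archive_path(path: &str) -> bool {
    if path.is_empty()
        || path.starts_with('/')
        || path.contains('\\')
        || path.contains('\0')
    {
        return false;
    }
    // A single trailing slash marks a directory entry and is allowed.
    let trimmed = path.strip_suffix('/').unwrap_or(path);
    trimmed
        .split('/')
        .all(|component| !component.is_empty() && component != "." && component != "..")
}
```

Extraction code would run this check on every `path` and `symlink_target` before creating anything on disk, and reject the archive on the first failure.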

9. Implementation Requirements

9.1 Mandatory Features

Implementations MUST support:

Reading and writing base format (version 1.0)
Blake3 hashing
Varint encoding/decoding
CRC32 calculation
UTF-8 string handling
All four file types (Data, Metadata, Directory, Symlink)
Directory ordering validation (parents before children)
Path format validation (relative paths only)

9.2 Optional Features

Implementations MAY support:

Compression (levels 1-7)
Encryption (ChaCha20-Poly1305)
External block storage
Content-defined chunking

9.3 Platform-Specific Considerations

Symlinks

Unix systems: Create proper symbolic links
Windows: Handle symlinks according to platform capabilities

Permissions

Unix: Preserve full permission bits
Windows: Map Unix permissions to Windows ACLs where possible

Path Separators

Archives use forward slashes (/) internally
Convert to platform-appropriate separators on extraction

10. Future Extensions

The format reserves space for future extensions:

FileType values 4-255
ProcessingFlags bits 4-7
Custom relationship types starting at 1000

Extensions MUST maintain backwards compatibility for reading.

11. Constants and Identifiers

11.1 Magic Values

File Header: b"PITH"
Block Header: b"BLCK"
Directory: b"PITHOSDR"

11.2 Version Numbers

Version 1.0: 0x0100

11.3 Default Values

Current timestamp: Unix seconds since epoch
Default file permissions: 0o644
Default directory permissions: 0o755
Default symlink permissions: 0o777
