Version: 1.0
Status: Draft
Date: September 2025
Purpose: Next-generation file format for scientific data management, optimized for object storage with built-in deduplication, encryption, and metadata support
This document specifies the Pithos file format using the key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" as described in RFC 2119.
Pithos is an append-only archive format designed for efficient storage and sharing of scientific data. It combines content-defined deduplication, convergent encryption, and flexible metadata support in a privacy-preserving architecture optimized for object storage systems.
The code examples in this document only illustrate the architecture; optimized implementations of the individual structures may differ.
- Append-only architecture: New data and metadata MUST be appended, never modifying existing content
- Content-addressed storage: All blocks MUST be identified by Blake3 hashes enabling deduplication
- Privacy-preserving sharing: Users MUST NOT be able to see who else has access to files
- Flexible metadata: Metadata MUST be stored as regular files with special type markers
- Progressive enhancement: Implementations MUST support the base format and MAY support optional features
- Emergency recovery: Block markers MUST enable reconstruction even with corrupted directories
- Hierarchical organization: Files use full paths from archive root; directories MUST be declared before their contents
A Pithos file MUST have the following structure:
[FileHeader] // REQUIRED: Format identifier and version
[Block Data...] // Zero or more data blocks with headers
[Directory] // REQUIRED: Can repeat (append-only)
[Block Data...] // Zero or more additional blocks
[Directory] // REQUIRED: File MUST end with directory
Every Pithos file MUST begin with a FileHeader:
/// File header - appears once at the beginning of every Pithos file
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FileHeader {
pub magic: [u8; 4], // MUST be b"PITH"
pub version: u16, // Format version (e.g., 0x0100 for 1.0)
}
Each block MUST be preceded by a minimal header for emergency scanning:
/// Minimal block header - just for emergency scanning
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct BlockHeader {
pub marker: [u8; 4], // MUST be b"BLCK"
}
Complete block metadata MUST be stored in the directory:
/// Block index entry - single source of truth for block hashes
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct BlockIndexEntry {
pub index: u64, // Unique sequential identifier (varint encoded)
pub hash: [u8; 32], // Full Blake3 hash of original content
pub offset: u64, // Byte offset in file (varint encoded)
pub stored_size: u64, // Size as stored (compressed/encrypted) (varint)
pub original_size: u64, // Original uncompressed size (varint)
pub flags: ProcessingFlags, // Compression, encryption settings
pub location: BlockLocation, // Where block data resides
}
bitflags::bitflags! {
/// Processing flags packed into single byte
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ProcessingFlags: u8 {
// Bits 0-2: Compression level (0=none, 1-7=implementation-defined)
const COMPRESSION_LEVEL_1 = 0b0000_0001;
const COMPRESSION_LEVEL_2 = 0b0000_0010;
const COMPRESSION_LEVEL_3 = 0b0000_0011;
const COMPRESSION_LEVEL_4 = 0b0000_0100;
const COMPRESSION_LEVEL_5 = 0b0000_0101;
const COMPRESSION_LEVEL_6 = 0b0000_0110;
const COMPRESSION_LEVEL_7 = 0b0000_0111;
const COMPRESSION_MASK = 0b0000_0111;
// Bit 3: Encryption enabled
const ENCRYPTION_ENABLED = 0b0000_1000;
// Bits 4-7: Reserved for future use (MUST be zero)
}
}
/// Block storage location
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum BlockLocation {
Local, // Block data at specified offset in this file
External { url: String }, // URL to external storage
}
The directory MUST contain all file and block metadata:
/// Directory - lists all files and blocks in this segment
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Directory {
pub identifier: [u8; 8], // MUST be b"PITHOSDR"
pub parent_directory_offset: Option<(u64, u64)>, // Previous directory (start, len) (varint, backwards chain)
pub files: Vec<FileEntry>, // Files in this segment
pub blocks: Vec<BlockIndexEntry>, // Blocks in this segment
pub relations: Vec<(u64, String)>, // Relation index, relation name / ID
pub encryption: Vec<EncryptionSection>,
pub dir_len: u64,
pub crc32: u32, // CRC32 of all preceding fields
}
Files MUST be classified by type:
/// File types (u8 representation for efficiency)
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FileType {
Data = 0, // Regular data file (default)
Metadata = 1, // Metadata file (RO-Crate, DataCite, etc.)
Directory = 2, // Directory entry
Symlink = 3, // Symbolic link
// Values 4-255 reserved for future use
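}
Each file MUST be represented by a FileEntry:
/// Block list of a file, either still encrypted or decrypted
#[derive(Debug, Clone, PartialEq, Eq)] // required because FileEntry derives these traits
pub enum BlockDataState {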
Encrypted(Vec<u8>), // Nonce + ChaCha20Poly1305
Decrypted(Vec<(u64, [u8; 32])>) // Index / SHAKE256 hash
}
/// File entry - describes a single file, directory, or symlink
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FileEntry {
pub file_id: u64, // Sequential unique identifier (varint)
pub path: String, // Full path from archive root (UTF-8)
pub file_type: FileType, // Type of entry
pub block_data: BlockDataState,
pub created: u64, // Unix timestamp (seconds since epoch)
pub modified: u64, // Unix timestamp (seconds since epoch)
pub file_size: u64, // Total size in bytes (varint)
pub permissions: u32, // Unix-style permissions
pub references: Vec<Reference>, // Metadata->Data references only (one-way)
pub symlink_target: Option<String>, // Target path for symlinks
}
References MUST be one-way from metadata to data files:
/// Simplified reference structure
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Reference {
pub target_file_id: u64, // Target file ID (varint)
pub relationship: u64, // Relationship type (varint)
}
Standard relationship types:
- DESCRIBES = 0: Metadata describing target
- ANNOTATES = 1: Additional annotations
- DERIVED_FROM = 2: Derived from target
- SOURCE_OF = 3: Source of target
- PREVIOUS_VERSION = 4: Previous version
- NEXT_VERSION = 5: Next version
- PART_OF = 6: Part of collection
- CONTAINS = 7: Contains target
- INPUT_TO = 8: Input to process
- OUTPUT_FROM = 9: Output from process
- Custom relationships start at 1000
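The numeric codes above can be looked up with a small table. The following is a minimal sketch, not part of the format itself: the function and constant names are hypothetical, and treating codes 10 to 999 as unassigned is an assumption, since the list does not say how they are handled.

```rust
use std::convert::TryFrom;

/// First code available for custom relationships, per the list above.
pub const CUSTOM_RELATIONSHIP_START: u64 = 1000;

/// Names of the standard relationship types, indexed by their code (0..=9).
const STANDARD_NAMES: [&str; 10] = [
    "DESCRIBES",
    "ANNOTATES",
    "DERIVED_FROM",
    "SOURCE_OF",
    "PREVIOUS_VERSION",
    "NEXT_VERSION",
    "PART_OF",
    "CONTAINS",
    "INPUT_TO",
    "OUTPUT_FROM",
];

/// Returns the name of a standard relationship code, or `None` otherwise.
/// Assumption: codes 10..=999 are unassigned and therefore yield `None`.
pub fn standard_relationship_name(code: u64) -> Option<&'static str> {
    usize::try_from(code)
        .ok()
        .and_then(|index| STANDARD_NAMES.get(index).copied())
}

/// Returns true if the code lies in the custom range.
pub fn is_custom_relationship(code: u64) -> bool {
    code >= CUSTOM_RELATIONSHIP_START
}
```

A reader would presumably resolve custom codes through the `relations` table of the directory; that lookup is not shown here.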
Encryption sections MUST follow each directory:
/// Encryption section - privacy-preserving access control
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct EncryptionSection {
pub sender_public_key: [u8; 32], // X25519 public key
pub recipients: Vec<RecipientSection>, // Per-recipient data
}
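/// Per-recipient payload, either still encrypted or decrypted
#[derive(Debug, Clone, PartialEq, Eq)] // required because RecipientSection derives these traits
pub enum RecipientData {
Encrypted(Vec<u8>), // Nonce + ChaCha20-Poly1305 ciphertext
Decrypted(Vec<(u64, [u8; 32])>) // File index / SHAKE256 hash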
}
/// Per-recipient encrypted data
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RecipientSection {
pub recipient_public_key: [u8; 32], // Recipient's X25519 public key
pub recipient_data: RecipientData, // Encrypted FileKeyEntry list
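}
// Error type. The #[from] attributes below assume an error-derive macro such as
// thiserror's Error derive, which additionally needs per-variant #[error("...")]
// messages or a manual Display impl (omitted here).
#[derive(Debug)]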
pub enum PithosError {
Io(#[from] io::Error),
Conversion(String),
SystemTimeError(#[from] SystemTimeError),
StripPrefix(#[from] std::path::StripPrefixError),
WalkDir(#[from] walkdir::Error),
FastCDC(#[from] fastcdc::v2020::Error),
Serialization(#[from] SerializationError),
Deserialization(#[from] DeserializationError),
Crypt(#[from] CryptError),
Crypt4GH(#[from] Crypt4GHError),
Cipher(#[from] ChaChaPoly1305Error),
Compression(#[from] ZstdError),
InvalidBlockDataState(String),
BlockHashNotFound([u8; 32]),
FileNotFound(String),
DuplicateFileId(String),
RelationIdOccupied(u64),
PathOccupied(String),
InvalidFileType(String),
NoMatchingRecipient,
InvalidRecipientDataState(String),
Other(String),
}
All integer fields marked as "varint" MUST use unsigned LEB128 encoding.
All strings MUST be encoded as UTF-8 with a varint length prefix.
All multi-byte values not using varint encoding MUST use big-endian byte order.
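The encoding rules above can be illustrated with a short sketch of unsigned LEB128 and length-prefixed strings. The function names are hypothetical, and returning `Option` on malformed input is a simplification chosen for this example; a real implementation would likely report a dedicated error.

```rust
use std::convert::TryFrom;

/// Appends `value` to `out` as unsigned LEB128 (7 data bits per byte, low bits first).
pub fn write_varint(mut value: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (value & 0x7F) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte); // last byte: continuation bit clear
            return;
        }
        out.push(byte | 0x80); // more bytes follow
    }
}

/// Decodes one varint from the start of `input`.
/// Returns the value and the number of bytes consumed, or `None` if the
/// input is truncated or does not fit into a u64.
pub fn read_varint(input: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &byte) in input.iter().enumerate() {
        let shift = 7 * i as u32;
        if shift >= 64 {
            return None; // more than ten bytes
        }
        let chunk = u64::from(byte & 0x7F);
        if shift == 63 && chunk > 1 {
            return None; // tenth byte may carry only one bit
        }
        value |= chunk << shift;
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None // ran out of input before the final byte
}

/// Appends a UTF-8 string with a varint length prefix (length in bytes).
pub fn write_string(s: &str, out: &mut Vec<u8>) {
    write_varint(s.len() as u64, out);
    out.extend_from_slice(s.as_bytes());
}

/// Decodes a length-prefixed UTF-8 string; returns it with the bytes consumed.
pub fn read_string(input: &[u8]) -> Option<(String, usize)> {
    let (len, used) = read_varint(input)?;
    let len = usize::try_from(len).ok()?;
    let end = used.checked_add(len)?;
    let bytes = input.get(used..end)?;
    let s = String::from_utf8(bytes.to_vec()).ok()?;
    Some((s, end))
}
```

Fixed-width fields such as the `u16` version or the `u32` CRC would instead be written with `to_be_bytes()`, matching the big-endian rule.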
CRITICAL: Directory entries MUST follow strict ordering rules to ensure proper extraction:
- Parent Before Child Rule: A directory entry MUST appear before any entries for files or subdirectories within it
- Path Format: All paths MUST be relative (no leading /) and use forward slashes as separators
- Root Directory: The root directory is implicit and MUST NOT have an entry
- Validation: Implementations MUST validate ordering during both writing and reading
Example of valid ordering:
data/ (directory)
data/raw/ (directory - parent "data/" already exists)
data/raw/file1.csv (file - parent "data/raw/" already exists)
data/processed/ (directory - parent "data/" already exists)
data/processed/file2.csv (file - parent "data/processed/" already exists)
docs/ (directory)
docs/README.md (file - parent "docs/" already exists)
data/raw/file1_v2.csv (file - parent "data/raw/" already exists) -> Newer version of file1.csv
Example of INVALID ordering:
data/raw/file1.csv (ERROR: parent "data/raw/" not yet declared)
data/raw/ (too late - file already referenced this directory)
data/ (too late - subdirectory already referenced this)
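The parent-before-child rule can be checked in a single pass. The sketch below is one possible approach, not a prescribed algorithm: entries are reduced to hypothetical `(path, is_directory)` pairs, and only a single list is checked. For an appended directory segment, a real implementation would presumably also seed the set with directories declared in earlier segments.

```rust
use std::collections::HashSet;

/// Checks that every entry's parent directory was declared earlier in the list.
/// Each item is `(path, is_directory)`; directory paths may end with '/'.
/// The root directory is implicit, so top-level entries need no parent.
pub fn validate_ordering<'a, I>(entries: I) -> Result<(), String>
where
    I: IntoIterator<Item = (&'a str, bool)>,
{
    let mut declared: HashSet<String> = HashSet::new();
    for (path, is_dir) in entries {
        let trimmed = path.trim_end_matches('/');
        // Everything before the last '/' is the parent directory.
        if let Some(pos) = trimmed.rfind('/') {
            let parent = &trimmed[..pos];
            if !declared.contains(parent) {
                return Err(format!(
                    "parent \"{}/\" not declared before \"{}\"",
                    parent, path
                ));
            }
        }
        if is_dir {
            declared.insert(trimmed.to_string());
        }
    }
    Ok(())
}
```

Applied to the two listings above, the first returns `Ok(())` and the second fails on its first entry, because `data/raw/` has not been declared yet.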
Implementations SHOULD use content-defined chunking with recommended parameters:
- min_size: 64 KB
- avg_size: 128 KB
- max_size: 512 KB
- window_size: 48 bytes
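To show how the size parameters interact, here is a minimal gear-hash chunker using only the standard library. It is a simplified stand-in and not FastCDC or any other production algorithm: the gear table, the boundary test, and all names are assumptions made for this sketch, `avg_size` is assumed to be a power of two, and `window_size` is not modelled because a gear hash has an implicit 64-byte window.

```rust
/// Hypothetical parameter bundle for this sketch.
#[derive(Debug, Clone, Copy)]
pub struct ChunkParams {
    pub min_size: usize,
    pub avg_size: usize, // assumed to be a power of two here
    pub max_size: usize,
}

/// The sizes recommended above.
pub const RECOMMENDED: ChunkParams = ChunkParams {
    min_size: 64 * 1024,
    avg_size: 128 * 1024,
    max_size: 512 * 1024,
};

/// Builds a deterministic 256-entry gear table with a splitmix64 generator.
fn gear_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    let mut state: u64 = 0;
    for slot in table.iter_mut() {
        state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        *slot = z ^ (z >> 31);
    }
    table
}

/// Splits `data` into content-defined chunks and returns their lengths in order.
pub fn chunk_lengths(data: &[u8], p: ChunkParams) -> Vec<usize> {
    assert!(p.avg_size.is_power_of_two() && p.avg_size >= 2);
    assert!(p.max_size > 0 && p.min_size <= p.max_size);
    let table = gear_table();
    let bits = p.avg_size.trailing_zeros(); // boundary probability is 2^-bits
    let mut lengths = Vec::new();
    let mut start = 0usize;
    let mut hash = 0u64;
    for (i, &byte) in data.iter().enumerate() {
        // Gear hash: older bytes are shifted out after 64 steps.
        hash = (hash << 1).wrapping_add(table[byte as usize]);
        let len = i + 1 - start;
        let at_boundary = len >= p.min_size && (hash >> (64 - bits)) == 0;
        if at_boundary || len >= p.max_size {
            lengths.push(len);
            start = i + 1;
            hash = 0;
        }
    }
    if start < data.len() {
        lengths.push(data.len() - start); // trailing chunk may be shorter than min_size
    }
    lengths
}
```

In this sketch `avg_size` only sets the boundary probability; because boundaries are suppressed below `min_size`, the observed mean chunk size is larger than `avg_size`. Production chunkers normalize for this.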
All block hashes MUST use Blake3.
Content keys MUST be derived deterministically using SHAKE256.
Implementations SHOULD support:
- Level 0: No compression
- Levels 1-3: Fast compression (e.g., Zstd levels 1-3)
- Levels 4-6: Balanced compression (e.g., Zstd levels 4-9)
- Level 7: Maximum compression (e.g., Zstd level 19+)
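The compression level lives in bits 0-2 of the processing flags byte. The sketch below works on the raw `u8` rather than the `ProcessingFlags` type so that it stays self-contained. The concrete Zstd levels chosen for levels 4 to 7 are one possible mapping within the ranges given above, not a normative table, and the function names are hypothetical.

```rust
const COMPRESSION_MASK: u8 = 0b0000_0111;
const ENCRYPTION_ENABLED: u8 = 0b0000_1000;
const RESERVED_MASK: u8 = 0b1111_0000;

/// Extracts the compression level (0-7) from a flags byte.
pub fn compression_level(flags: u8) -> u8 {
    flags & COMPRESSION_MASK
}

/// Returns true if the encryption bit (bit 3) is set.
pub fn is_encrypted(flags: u8) -> bool {
    flags & ENCRYPTION_ENABLED != 0
}

/// Rejects flag bytes that use the reserved bits 4-7, which MUST be zero.
pub fn validate_flags(flags: u8) -> Result<(), String> {
    if flags & RESERVED_MASK != 0 {
        return Err(format!("reserved flag bits set: {:#010b}", flags));
    }
    Ok(())
}

/// Maps the stored level to a Zstd level; `None` means "store uncompressed".
/// Assumption: the values for levels 4-7 are illustrative picks.
pub fn zstd_level(flags: u8) -> Option<i32> {
    match compression_level(flags) {
        0 => None,
        level @ 1..=3 => Some(i32::from(level)), // fast
        4 => Some(5),                            // balanced
        5 => Some(7),
        6 => Some(9),
        7 => Some(19), // maximum
        _ => unreachable!("level is masked to three bits"),
    }
}
```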
Reading an archive:
- Read and validate file header
- Find last directory by scanning from end
- Validate directory ordering
- Build block index
- Extract files by reading referenced blocks
Writing an archive:
- Write file header
- Process files in correct directory order
- Chunk content using content-defined chunking
- Deduplicate blocks by hash
- Write directory and encryption sections
- Validate complete structure
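The header steps in the lists above can be sketched as a round trip. The six-byte layout (magic followed by a big-endian `u16`) is an assumption derived from the struct definition and the byte-order rule, and rejecting every major version other than 1 is a policy chosen for this example only.

```rust
pub const MAGIC: [u8; 4] = *b"PITH";
pub const VERSION_1_0: u16 = 0x0100;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FileHeader {
    pub magic: [u8; 4],
    pub version: u16,
}

impl FileHeader {
    /// Header for a new version 1.0 archive.
    pub fn new() -> Self {
        FileHeader { magic: MAGIC, version: VERSION_1_0 }
    }

    /// Serializes the header: 4 magic bytes, then the version in big-endian order.
    pub fn to_bytes(&self) -> [u8; 6] {
        let mut out = [0u8; 6];
        out[..4].copy_from_slice(&self.magic);
        out[4..].copy_from_slice(&self.version.to_be_bytes());
        out
    }

    /// Parses and validates a header from the start of `bytes`.
    pub fn from_bytes(bytes: &[u8]) -> Result<Self, String> {
        if bytes.len() < 6 {
            return Err("truncated file header".to_string());
        }
        let magic = [bytes[0], bytes[1], bytes[2], bytes[3]];
        if magic != MAGIC {
            return Err(format!("bad magic: {:?}", magic));
        }
        let version = u16::from_be_bytes([bytes[4], bytes[5]]);
        // Assumption: only major version 1 is readable by this sketch.
        if version >> 8 != 1 {
            return Err(format!("unsupported version {:#06x}", version));
        }
        Ok(FileHeader { magic, version })
    }
}
```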
When archiving directory trees:
- Process directories before their contents
- Maintain relative path structure
- Preserve file metadata (permissions, timestamps)
- Handle symlinks appropriately per platform
- Implementations MUST verify block hashes before decompression/decryption
- CRC32 values MUST be validated for directories and encryption sections
- Convergent encryption reveals when identical files exist (accepted trade-off)
- External block URLs MUST use HTTPS in production environments
- Path traversal attacks MUST be prevented through validation
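Path validation against traversal could look like the following sketch. The function name is hypothetical, and two of its checks go beyond what the text states: rejecting backslashes and rejecting `.` components are assumptions made here to keep paths unambiguous across platforms.

```rust
/// Validates an archive path: relative, '/'-separated, no traversal components.
/// A single trailing '/' is accepted so that directory entries such as
/// "data/raw/" pass.
pub fn validate_archive_path(path: &str) -> Result<(), String> {
    if path.is_empty() {
        return Err("empty path".to_string());
    }
    if path.starts_with('/') {
        return Err(format!("absolute path not allowed: {}", path));
    }
    // Assumption: backslashes are rejected rather than treated as separators.
    if path.contains('\\') {
        return Err(format!("backslash not allowed: {}", path));
    }
    let trimmed = path.strip_suffix('/').unwrap_or(path);
    for component in trimmed.split('/') {
        match component {
            "" => return Err(format!("empty path component in: {}", path)),
            "." | ".." => return Err(format!("traversal component in: {}", path)),
            _ => {}
        }
    }
    Ok(())
}
```

Symlink targets would need a comparable check before extraction, since a link pointing outside the extraction root can bypass path validation of the entries themselves.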
Implementations MUST support:
- Reading and writing base format (version 1.0)
- Blake3 hashing
- Varint encoding/decoding
- CRC32 calculation
- UTF-8 string handling
- All four file types (Data, Metadata, Directory, Symlink)
- Directory ordering validation (parents before children)
- Path format validation (relative paths only)
Implementations MAY support:
- Compression (levels 1-7)
- Encryption (ChaCha20-Poly1305)
- External block storage
- Content-defined chunking
- Unix systems: Create proper symbolic links
- Windows: Handle symlinks according to platform capabilities
- Unix: Preserve full permission bits
- Windows: Map Unix permissions to Windows ACLs where possible
- Archives use forward slashes (/) internally
- Convert to platform-appropriate separators on extraction
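Separator conversion on extraction can rely on the standard library, as in this minimal sketch. The function name is hypothetical, and the function performs no validation, so it assumes the path has already been checked for traversal.

```rust
use std::path::PathBuf;

/// Converts a '/'-separated archive path into a platform-native relative path.
/// Empty components (for example from a trailing '/') are skipped.
pub fn to_platform_path(archive_path: &str) -> PathBuf {
    archive_path
        .split('/')
        .filter(|component| !component.is_empty())
        .collect()
}
```

Collecting into a `PathBuf` pushes each component with the native separator, so the same code yields `data\raw\file1.csv` on Windows and `data/raw/file1.csv` on Unix. The result would typically be joined onto the extraction root.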
The format reserves space for future extensions:
- FileType values 4-255
- ProcessingFlags bits 4-7
- Custom relationship types starting at 1000
Extensions MUST maintain backwards compatibility for reading.
- File Header: b"PITH"
- Block Header: b"BLCK"
- Directory: b"PITHOSDR"
- Version 1.0: 0x0100
- Current timestamp: Unix seconds since epoch
- Default file permissions: 0o644
- Default directory permissions: 0o755
- Default symlink permissions: 0o777