Commit 3460143

kixelated and claude authored
Add draft-lcurley-compressed-mp4 (#16)
New IETF draft defining a compression scheme for ISO BMFF that reduces per-fragment overhead from ~96 bytes to ~21 bytes for live streaming. Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 4c05cfd commit 3460143

1 file changed

Lines changed: 334 additions & 0 deletions

File tree

draft-lcurley-compressed-mp4.md

@@ -0,0 +1,334 @@
1+
---
2+
title: "Compressed MP4"
3+
abbrev: "cmp4"
4+
category: info
5+
6+
docname: draft-lcurley-compressed-mp4-latest
7+
submissiontype: IETF # also: "independent", "editorial", "IAB", or "IRTF"
8+
number:
9+
date:
10+
v: 3
11+
area: wit
12+
workgroup: moq
13+
14+
author:
15+
-
16+
fullname: Luke Curley
17+
email: kixelated@gmail.com
18+
19+
normative:
20+
RFC9000:
21+
22+
informative:
23+
ISOBMFF:
24+
title: "Information technology — Coding of audio-visual objects — Part 12: ISO base media file format"
25+
target: https://www.iso.org/standard/83102.html
26+
date: 2022
27+
28+
--- abstract
29+
30+
Fragmented MP4 (fMP4) is widely used for live streaming, but the ISO Base Media File Format (ISOBMFF) box structure imposes significant per-fragment overhead.
31+
Each box header requires 8 bytes (4-byte size + 4-byte type), and payload fields use fixed-width integers (u32/u64).
32+
For low-latency streaming with single-frame fragments, this overhead can exceed the media payload itself.
33+
34+
This document defines a compression scheme for ISO BMFF that replaces box headers with compact varint-encoded identifiers and sizes, and defines compressed variants of commonly used boxes with varint payload fields.
35+
The scheme reduces per-fragment overhead from ~100 bytes to roughly 20-25 bytes while preserving the full box hierarchy.
36+
37+
--- middle
38+
39+
# Conventions and Definitions
40+
{::boilerplate bcp14-tagged}
41+
42+
43+
# Introduction
44+
Fragmented MP4 (fMP4) is the dominant container format for low-latency live streaming.
45+
Each fragment consists of a `moof` (movie fragment) box followed by an `mdat` (media data) box.
46+
The `moof` box contains metadata describing the media samples in `mdat`.
47+
48+
For a typical single-frame video fragment, the overhead looks like this:
49+
50+
| Box | Header | Payload | Total |
51+
|:----|-------:|--------:|------:|
52+
| moof | 8 | 0 | 8 |
53+
| mfhd | 8 | 8 | 16 |
54+
| traf | 8 | 0 | 8 |
55+
| tfhd | 8 | 8 | 16 |
56+
| tfdt | 8 | 12 | 20 |
57+
| trun | 8 | 12 | 20 |
58+
| mdat | 8 | 0 | 8 |
59+
| **Total** | **56** | **40** | **96** |
60+
61+
These 96 bytes of overhead are substantial when a single encoded video frame might only be 100-500 bytes at low bitrates or high frame rates.
62+
Audio frames are even smaller, often 4-20 bytes for Opus at low bitrates, making the container overhead several times larger than the payload.
63+
64+
This document defines two layers of compression:
65+
66+
1. **Header Compression**: A compression table (`cmpd`) maps varint IDs to box type names. All boxes after the `moov` use compressed headers: `[varint ID][varint size]` instead of `[u32 size][4-char type]`.
67+
2. **Payload Compression**: Compressed variants of common boxes (`cmfh`, `cfhd`, `cfdt`, `crun`) replace fixed-width payload fields with varints.
68+
69+
Together, these reduce the per-fragment overhead to roughly 20-25 bytes, depending on field values.
70+
71+
72+
# Variable-Length Integer Encoding
73+
This document uses the variable-length integer encoding from QUIC {{RFC9000}}, Section 16.
74+
The first two bits of the first byte indicate the encoding length:
75+
76+
| 2MSB | Length | Usable Bits | Range |
77+
|:-----|-------:|------------:|:------|
78+
| 00 | 1 | 6 | 0-63 |
79+
| 01 | 2 | 14 | 0-16383 |
80+
| 10 | 4 | 30 | 0-1073741823 |
81+
| 11 | 8 | 62 | 0-4611686018427387903 |
82+
83+
84+
In the message formats below, fields marked with `(i)` use this variable-length integer encoding.
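
The table above can be exercised with a short sketch. The function names `encode_varint` and `decode_varint` are illustrative and not defined by this document; the byte layout follows {{RFC9000}}, Section 16.

```python
def encode_varint(value: int) -> bytes:
    """Encode an unsigned integer using the QUIC varint layout (RFC 9000, Section 16)."""
    if value < 0:
        raise ValueError("varints are unsigned")
    # (length in bytes, 2-bit prefix) for each row of the table above
    for length, prefix in ((1, 0b00), (2, 0b01), (4, 0b10), (8, 0b11)):
        usable_bits = 8 * length - 2
        if value < (1 << usable_bits):
            return (value | (prefix << usable_bits)).to_bytes(length, "big")
    raise ValueError("value needs more than 62 bits")


def decode_varint(buf: bytes, offset: int = 0) -> tuple[int, int]:
    """Decode the varint at buf[offset]; return (value, offset of the next field)."""
    first = buf[offset]
    length = 1 << (first >> 6)  # prefix 00/01/10/11 -> 1/2/4/8 bytes
    if offset + length > len(buf):
        raise ValueError("truncated varint")
    value = first & 0x3F  # drop the two prefix bits
    for byte in buf[offset + 1 : offset + length]:
        value = (value << 8) | byte
    return value, offset + length
```

The encoder always picks the shortest form, but a decoder has to accept longer forms of the same value (for example `40 25` and `25` both decode to 37).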
85+
86+
87+
# Compression Table (cmpd)
88+
The `cmpd` box is a standard ISO BMFF box placed inside the `moov` box.
89+
It defines a mapping from compact varint IDs to 4-character box type names.
90+
91+
~~~
92+
cmpd {
93+
Count (i),
94+
Compressed Box Entry (..) ...,
95+
}
96+
97+
Compressed Box Entry {
98+
ID (i),
99+
Name (32),
100+
}
101+
~~~
102+
103+
**Count**: The number of entries in the compression table.
104+
105+
**ID**: A varint identifier assigned to this box type. IDs SHOULD be assigned starting from 0 to minimize encoding size.
106+
107+
**Name**: The 4-character ISO BMFF box type name (e.g., `moof`, `mdat`, `traf`).
108+
109+
The `cmpd` box itself uses a standard ISO BMFF header since it appears inside the `moov` before compressed encoding takes effect.
110+
111+
A typical compression table for live video streaming:
112+
113+
| ID | Name | Description |
114+
|---:|:-----|:------------|
115+
| 0 | moof | Movie Fragment |
116+
| 1 | mdat | Media Data |
117+
| 2 | mfhd | Movie Fragment Header |
118+
| 3 | traf | Track Fragment |
119+
| 4 | tfhd | Track Fragment Header |
120+
| 5 | tfdt | Track Fragment Decode Time |
121+
| 6 | trun | Track Run |
122+
123+
With 7 entries using IDs 0-6, each ID fits in a single varint byte.
124+
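Header compression alone reduces the 56 bytes of box headers (7 boxes x 8 bytes) to as little as 14 bytes (7 boxes x 2 bytes), saving up to 42 bytes per fragment. The 2-byte figure holds when a box's payload is under 64 bytes; a larger payload, such as a 200-byte `mdat`, needs a 2-byte size varint and therefore a 3-byte header.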
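
As an illustration, the table above could be serialized and parsed as follows. This is a sketch under assumptions: `build_cmpd` and `parse_cmpd` are hypothetical helper names, and `cmpd` is treated as a plain box with no version/flags field, since the format listing shows only Count and the entries.

```python
import struct


def encode_varint(value: int) -> bytes:
    """QUIC varint encoder (RFC 9000, Section 16)."""
    for length, prefix in ((1, 0b00), (2, 0b01), (4, 0b10), (8, 0b11)):
        bits = 8 * length - 2
        if 0 <= value < (1 << bits):
            return (value | (prefix << bits)).to_bytes(length, "big")
    raise ValueError("value out of varint range")


def decode_varint(buf: bytes, offset: int) -> tuple[int, int]:
    """Return (value, next offset) for the varint at buf[offset]."""
    length = 1 << (buf[offset] >> 6)
    raw = int.from_bytes(buf[offset : offset + length], "big")
    return raw & ((1 << (8 * length - 2)) - 1), offset + length


def build_cmpd(table: dict[int, str]) -> bytes:
    """Serialize a cmpd box: standard 8-byte header, then Count and the entries."""
    payload = encode_varint(len(table))
    for box_id, name in table.items():
        fourcc = name.encode("ascii")
        if len(fourcc) != 4:
            raise ValueError(f"box type must be 4 characters: {name!r}")
        payload += encode_varint(box_id) + fourcc
    return struct.pack(">I4s", 8 + len(payload), b"cmpd") + payload


def parse_cmpd(box: bytes) -> dict[int, str]:
    """Parse a complete cmpd box back into an ID-to-name mapping."""
    size, box_type = struct.unpack_from(">I4s", box, 0)
    if box_type != b"cmpd" or size != len(box):
        raise ValueError("not a complete cmpd box")
    count, pos = decode_varint(box, 8)
    table = {}
    for _ in range(count):
        box_id, pos = decode_varint(box, pos)
        table[box_id] = box[pos : pos + 4].decode("ascii")
        pos += 4
    return table


LIVE_VIDEO = {0: "moof", 1: "mdat", 2: "mfhd", 3: "traf", 4: "tfhd", 5: "tfdt", 6: "trun"}
cmpd = build_cmpd(LIVE_VIDEO)
```

The table is sent once inside the `moov`, so its size is a one-time cost rather than per-fragment overhead.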
125+
126+
127+
# Compressed Box Header
128+
The presence of a `cmpd` box in the `moov` signals that every box following the `moov`, both top-level boxes and the boxes nested inside them, uses a compressed box header.
129+
130+
A standard ISO BMFF box header is:
131+
132+
~~~
133+
Standard Box Header {
134+
Size (32),
135+
Type (32),
136+
}
137+
~~~
138+
139+
This is replaced with:
140+
141+
~~~
142+
Compressed Box Header {
143+
ID (i),
144+
Size (i),
145+
}
146+
~~~
147+
148+
**ID**: The varint identifier from the compression table. The receiver looks up the corresponding 4-character box type name in the `cmpd` table.
149+
150+
**Size**: A varint containing the size of the box payload in bytes. Unlike standard ISO BMFF where the size field includes the header itself, the compressed size field contains only the payload length. This avoids the need for extended size fields since varints natively handle large values.
151+
152+
The box hierarchy (nesting) is preserved exactly as in standard ISO BMFF.
153+
Container boxes (e.g., `moof`, `traf`) contain nested boxes whose sizes sum to the parent's payload size.
154+
The receiver MUST be able to reconstruct the original uncompressed ISO BMFF structure by reversing the ID-to-name mapping and adjusting size fields.
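
The round trip described above might look like the following sketch. The names `compress_headers` and `restore_headers` are hypothetical, the set of container boxes is an assumption limited to `moof` and `traf`, and the standard size values 0 (box extends to end of file) and 1 (64-bit largesize) are deliberately not handled.

```python
import struct

CONTAINERS = {b"moof", b"traf"}  # assumption: the only container boxes in a fragment


def encode_varint(value: int) -> bytes:
    """QUIC varint encoder (RFC 9000, Section 16)."""
    for length, prefix in ((1, 0b00), (2, 0b01), (4, 0b10), (8, 0b11)):
        bits = 8 * length - 2
        if 0 <= value < (1 << bits):
            return (value | (prefix << bits)).to_bytes(length, "big")
    raise ValueError("value out of varint range")


def decode_varint(buf: bytes, offset: int) -> tuple[int, int]:
    """Return (value, next offset) for the varint at buf[offset]."""
    length = 1 << (buf[offset] >> 6)
    raw = int.from_bytes(buf[offset : offset + length], "big")
    return raw & ((1 << (8 * length - 2)) - 1), offset + length


def compress_headers(data: bytes, ids: dict[bytes, int]) -> bytes:
    """Rewrite [u32 size][type] headers as [varint ID][varint payload size]."""
    out = bytearray()
    pos = 0
    while pos < len(data):
        size, box_type = struct.unpack_from(">I4s", data, pos)
        if size < 8:
            raise ValueError("size 0 and 64-bit largesize are not handled in this sketch")
        payload = data[pos + 8 : pos + size]
        if box_type in CONTAINERS:
            payload = compress_headers(payload, ids)  # children shrink, so re-measure
        out += encode_varint(ids[box_type]) + encode_varint(len(payload)) + payload
        pos += size
    return bytes(out)


def restore_headers(data: bytes, ids: dict[bytes, int]) -> bytes:
    """Inverse of compress_headers: rebuild standard 8-byte headers."""
    names = {box_id: box_type for box_type, box_id in ids.items()}
    out = bytearray()
    pos = 0
    while pos < len(data):
        box_id, pos = decode_varint(data, pos)
        length, pos = decode_varint(data, pos)
        payload = data[pos : pos + length]
        box_type = names[box_id]
        if box_type in CONTAINERS:
            payload = restore_headers(payload, ids)
        # Standard sizes include the 8-byte header; compressed sizes do not.
        out += struct.pack(">I4s", 8 + len(payload), box_type) + payload
        pos += length
    return bytes(out)
```

Containers are processed depth-first because a parent's compressed size depends on the compressed sizes of its children.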
155+
156+
157+
# Compressed Box Variants
158+
This section defines compressed variants of commonly used fMP4 boxes.
159+
These variants replace fixed-width integer fields with varints, further reducing overhead.
160+
161+
An encoder MAY use either the standard box (with a compressed header) or the compressed variant for any given box.
162+
The compression table determines which box type is used.
163+
164+
## cmfh — Compressed Movie Fragment Header
165+
Replaces `mfhd` (Movie Fragment Header).
166+
167+
~~~
168+
cmfh {
169+
Sequence Number (i),
170+
}
171+
~~~
172+
173+
**Sequence Number**: The fragment sequence number (varint instead of u32).
174+
175+
Standard `mfhd` uses 4 bytes for the sequence number.
176+
With `cmfh`, a sequence number under 64 requires only 1 byte.
177+
178+
## cfhd — Compressed Track Fragment Header
179+
Replaces `tfhd` (Track Fragment Header).
180+
181+
~~~
182+
cfhd {
183+
Track ID (i),
184+
Flags (i),
185+
Base Data Offset (i), ; present if flags & 0x01
186+
Sample Description Index (i), ; present if flags & 0x02
187+
Default Sample Duration (i), ; present if flags & 0x08
188+
Default Sample Size (i), ; present if flags & 0x10
189+
Default Sample Flags (i), ; present if flags & 0x20
190+
}
191+
~~~
192+
193+
**Track ID**: Identifies the track (varint instead of u32).
194+
195+
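**Flags**: A varint encoding the optional field presence flags. The flags keep the standard `tfhd` tf_flags semantics. The values listed below coincide with the standard low-order bits, so any combination of them fits in a one-byte varint. The standard high-order flags (duration-is-empty, 0x010000, and default-base-is-moof, 0x020000) are not covered by this table: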
196+
197+
| Flag | Field |
198+
|-----:|:------|
199+
| 0x01 | base-data-offset-present |
200+
| 0x02 | sample-description-index-present |
201+
| 0x08 | default-sample-duration-present |
202+
| 0x10 | default-sample-size-present |
203+
| 0x20 | default-sample-flags-present |
204+
205+
Standard `tfhd` uses 4 bytes for version/flags and 4 bytes for track ID (minimum 8 bytes).
206+
With `cfhd`, a single-track stream with no optional fields requires as few as 2 bytes.
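
A minimal sketch of the `cfhd` payload, assuming optional fields are written in the order of the listing above. The helper names `encode_cfhd` and `decode_cfhd` and the keyword names are illustrative only.

```python
CFHD_FIELDS = (
    (0x01, "base_data_offset"),
    (0x02, "sample_description_index"),
    (0x08, "default_sample_duration"),
    (0x10, "default_sample_size"),
    (0x20, "default_sample_flags"),
)


def encode_varint(value: int) -> bytes:
    """QUIC varint encoder (RFC 9000, Section 16)."""
    for length, prefix in ((1, 0b00), (2, 0b01), (4, 0b10), (8, 0b11)):
        bits = 8 * length - 2
        if 0 <= value < (1 << bits):
            return (value | (prefix << bits)).to_bytes(length, "big")
    raise ValueError("value out of varint range")


def decode_varint(buf: bytes, offset: int) -> tuple[int, int]:
    """Return (value, next offset) for the varint at buf[offset]."""
    length = 1 << (buf[offset] >> 6)
    raw = int.from_bytes(buf[offset : offset + length], "big")
    return raw & ((1 << (8 * length - 2)) - 1), offset + length


def encode_cfhd(track_id: int, **optional: int) -> bytes:
    """Build a cfhd payload (no box header); Flags is derived from the fields given."""
    unknown = set(optional) - {name for _, name in CFHD_FIELDS}
    if unknown:
        raise ValueError(f"unknown cfhd fields: {sorted(unknown)}")
    flags = 0
    body = b""
    for bit, name in CFHD_FIELDS:  # fields are written in listing order
        if name in optional:
            flags |= bit
            body += encode_varint(optional[name])
    return encode_varint(track_id) + encode_varint(flags) + body


def decode_cfhd(payload: bytes) -> dict[str, int]:
    """Parse a cfhd payload, reading only the fields whose flag bit is set."""
    track_id, pos = decode_varint(payload, 0)
    flags, pos = decode_varint(payload, pos)
    fields = {"track_id": track_id}
    for bit, name in CFHD_FIELDS:
        if flags & bit:
            fields[name], pos = decode_varint(payload, pos)
    return fields
```

With no optional fields, `encode_cfhd(1)` yields the two bytes `01 00`: one for the track ID and one for the empty flags.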
207+
208+
## cfdt — Compressed Track Fragment Decode Time
209+
Replaces `tfdt` (Track Fragment Decode Time).
210+
211+
~~~
212+
cfdt {
213+
Base Decode Time (i),
214+
}
215+
~~~
216+
217+
**Base Decode Time**: The decode timestamp of the first sample in this fragment (varint instead of u32/u64).
218+
219+
Standard `tfdt` uses 4 bytes for version/flags plus 4 or 8 bytes for the timestamp (8-12 bytes total).
220+
With `cfdt`, small timestamps require as few as 1 byte.
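
How long the timestamp stays small depends on the track timescale. The sketch below assumes a 90 kHz timescale (a common choice for video, not something this document mandates) and uses a hypothetical helper `varint_len` to report the encoded size of the decode time after a given stream duration.

```python
def varint_len(value: int) -> int:
    """Number of bytes the QUIC varint encoding (RFC 9000, Section 16) needs."""
    if value < 0:
        raise ValueError("varints are unsigned")
    for length in (1, 2, 4, 8):
        if value < 1 << (8 * length - 2):
            return length
    raise ValueError("value needs more than 62 bits")


TIMESCALE = 90_000  # assumption: ticks per second

for seconds in (0, 1, 3_600, 86_400):
    ticks = seconds * TIMESCALE
    print(f"{seconds:>6} s -> {ticks:>13} ticks -> {varint_len(ticks)} byte(s)")
```

Under that assumption the 1-byte and 2-byte forms only cover the first fraction of a second, and a stream that runs for more than about 3.3 hours (2^30 ticks) needs the 8-byte form, the same width as a version 1 `tfdt` timestamp.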
221+
222+
## crun — Compressed Track Run
223+
Replaces `trun` (Track Run).
224+
225+
~~~
226+
crun {
227+
Sample Count (i),
228+
Flags (i),
229+
Data Offset (i), ; present if flags & 0x01
230+
First Sample Flags (i), ; present if flags & 0x04
231+
Per-Sample Fields (..) ...,
232+
}
233+
234+
Per-Sample Fields {
235+
Sample Duration (i), ; present if flags & 0x100
236+
Sample Size (i), ; present if flags & 0x200
237+
Sample Flags (i), ; present if flags & 0x400
238+
Sample Composition Time Offset (i),; present if flags & 0x800
239+
}
240+
~~~
241+
242+
**Sample Count**: The number of samples in this run (varint instead of u32).
243+
244+
**Flags**: A varint encoding which optional fields are present. The flag values match the standard `trun` tr_flags semantics:
245+
246+
| Flag | Field |
247+
|-----:|:------|
248+
| 0x001 | data-offset-present |
249+
| 0x004 | first-sample-flags-present |
250+
| 0x100 | sample-duration-present |
251+
| 0x200 | sample-size-present |
252+
| 0x400 | sample-flags-present |
253+
| 0x800 | sample-composition-time-offset-present |
254+
255+
Standard `trun` uses 4 bytes for version/flags, 4 bytes for sample count, and 4 bytes per optional field.
256+
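With `crun`, a single-sample run with only sample-size typically requires 4-5 bytes instead of the 12 bytes of the standard `trun` payload (4 bytes each for version/flags, sample count, and sample size).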
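
The layout above can be sketched as follows. The helper name `encode_crun` and the per-sample dictionary keys are illustrative; the sketch assumes every sample in a run carries the same set of fields and that all values are non-negative, since the varint encoding is unsigned.

```python
SAMPLE_FIELDS = (
    (0x100, "duration"),
    (0x200, "size"),
    (0x400, "flags"),
    (0x800, "cts_offset"),
)


def encode_varint(value: int) -> bytes:
    """QUIC varint encoder (RFC 9000, Section 16)."""
    for length, prefix in ((1, 0b00), (2, 0b01), (4, 0b10), (8, 0b11)):
        bits = 8 * length - 2
        if 0 <= value < (1 << bits):
            return (value | (prefix << bits)).to_bytes(length, "big")
    raise ValueError("value out of varint range")


def encode_crun(samples, data_offset=None, first_sample_flags=None) -> bytes:
    """Build a crun payload (no box header) from a list of per-sample dicts."""
    flags = 0
    run_fields = b""
    if data_offset is not None:
        flags |= 0x001
        run_fields += encode_varint(data_offset)
    if first_sample_flags is not None:
        flags |= 0x004
        run_fields += encode_varint(first_sample_flags)
    # The first sample decides which per-sample fields are present for the run.
    present = [(bit, key) for bit, key in SAMPLE_FIELDS if samples and key in samples[0]]
    for bit, _ in present:
        flags |= bit
    per_sample = b""
    for sample in samples:
        for _, key in present:
            per_sample += encode_varint(sample[key])  # KeyError if a sample omits a field
    return encode_varint(len(samples)) + encode_varint(flags) + run_fields + per_sample
```

For a single 200-byte sample, `encode_crun([{"size": 200}])` produces 5 bytes: 1 for the count, 2 for the flags value 0x200, and 2 for the size. A sample under 64 bytes brings that down to 4.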
257+
258+
259+
# Example
260+
This section provides a concrete byte-level comparison for a single-frame video fragment.
261+
262+
## Standard fMP4
263+
A typical single-frame fragment with sequence number 42, track ID 1, decode time 3840, and a 200-byte sample:
264+
265+
~~~
266+
moof (size=88) 8 bytes
267+
mfhd (size=16) 8 bytes
268+
version=0, flags=0 4 bytes
269+
sequence_number=42 4 bytes
270+
traf (size=64) 8 bytes
271+
tfhd (size=16) 8 bytes
272+
version=0, flags=0x020000 4 bytes
273+
track_id=1 4 bytes
274+
tfdt (size=20) 8 bytes
275+
version=1, flags=0 4 bytes
276+
base_decode_time=3840 8 bytes
277+
trun (size=20) 8 bytes
278+
version=0, flags=0x000200 4 bytes
279+
sample_count=1 4 bytes
280+
sample_size=200 4 bytes
281+
mdat (size=208) 8 bytes
282+
<200 bytes of media data>
283+
~~~
284+
285+
**Total overhead: 96 bytes** (excluding media data).
286+
287+
## Compressed fMP4
288+
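The same fragment using compressed encoding, with the compression table from the example in Section 4, except that IDs 2, 4, 5, and 6 map to the compressed variants `cmfh`, `cfhd`, `cfdt`, and `crun` instead of `mfhd`, `tfhd`, `tfdt`, and `trun`: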
289+
290+
~~~
291+
moof (id=0, size=20) 2 bytes
292+
cmfh (id=2, size=1) 2 bytes
293+
sequence_number=42 1 byte
294+
traf (id=3, size=15) 2 bytes
295+
cfhd (id=4, size=2) 2 bytes
296+
track_id=1 1 byte
297+
flags=0 1 byte (no optional fields)
298+
cfdt (id=5, size=2) 2 bytes
299+
base_decode_time=3840 2 bytes
300+
crun (id=6, size=5) 2 bytes
301+
sample_count=1 1 byte
302+
flags=0x200 2 bytes
303+
sample_size=200 2 bytes
304+
mdat (id=1, size=200) 3 bytes
305+
<200 bytes of media data>
306+
~~~
307+
308+
**Total overhead: 25 bytes** (excluding media data).
309+
310+
This represents a **74% reduction** in per-fragment overhead (from 96 bytes to 25 bytes).
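
The byte tally for the compressed fragment can be reproduced mechanically. The sketch below is illustrative, not normative: `cbox` is a hypothetical helper, the `Flags` field of `cfhd` is taken to be always present (it has no presence condition in the listing), and IDs 2, 4, 5, and 6 are assumed to map to `cmfh`, `cfhd`, `cfdt`, and `crun`.

```python
def encode_varint(value: int) -> bytes:
    """QUIC varint encoder (RFC 9000, Section 16)."""
    for length, prefix in ((1, 0b00), (2, 0b01), (4, 0b10), (8, 0b11)):
        bits = 8 * length - 2
        if 0 <= value < (1 << bits):
            return (value | (prefix << bits)).to_bytes(length, "big")
    raise ValueError("value out of varint range")


def cbox(box_id: int, payload: bytes) -> bytes:
    """Compressed box: [varint ID][varint payload size][payload]."""
    return encode_varint(box_id) + encode_varint(len(payload)) + payload


MEDIA = bytes(200)  # stand-in for one 200-byte encoded frame

cmfh = cbox(2, encode_varint(42))                                         # sequence number
cfhd = cbox(4, encode_varint(1) + encode_varint(0))                       # track ID, flags
cfdt = cbox(5, encode_varint(3840))                                       # base decode time
crun = cbox(6, encode_varint(1) + encode_varint(0x200) + encode_varint(200))
traf = cbox(3, cfhd + cfdt + crun)
moof = cbox(0, cmfh + traf)
mdat = cbox(1, MEDIA)

fragment = moof + mdat
overhead = len(fragment) - len(MEDIA)
print(f"moof={len(moof)} bytes, mdat header={len(mdat) - len(MEDIA)} bytes, overhead={overhead} bytes")
```

Changing the inputs shows how sensitive the total is to field values: a sample under 64 bytes shrinks both the `crun` size field and the `mdat` header, while a large decode time grows `cfdt`.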
311+
312+
313+
# Security Considerations
314+
TODO Security
315+
316+
317+
# IANA Considerations
318+
This document registers the following ISO BMFF box types:
319+
320+
| Box Type | Description |
321+
|:---------|:------------|
322+
| cmpd | Compression Table |
323+
| cmfh | Compressed Movie Fragment Header |
324+
| cfhd | Compressed Track Fragment Header |
325+
| cfdt | Compressed Track Fragment Decode Time |
326+
| crun | Compressed Track Run |
327+
328+
329+
--- back
330+
331+
# Acknowledgments
332+
{:numbered="false"}
333+
334+
This draft was generated with the assistance of AI (Claude).
