Commit eb68e49

committed: cdc doc updates
1 parent 332d2f5 commit eb68e49

14 files changed

Lines changed: 489 additions & 46 deletions

README.md

Lines changed: 21 additions & 14 deletions
@@ -8,12 +8,15 @@ Document: https://tisunion.github.io/PrimeBackup/
 
 ## Features
 
-- Only stores files with changes with the hash-based file pool. Supports unlimited number of backup
-- Comprehensive backup operations, including backup/restore, list/delete, import/export, etc
+- Hash-based, compressed file pool deduplication. Only new or changed data is stored, with no hard limit on backup count
+- Optional CDC (content-defined chunking) for large, locally edited files to improve deduplication across backups
+- Safe restore workflow: confirmation + countdown, automatic pre-restore backup, recycle-bin rollback, and data verification
+- Comprehensive backup operations, including backup/restore, list/delete, import/export, comments/tags, etc.
 - Smooth in-game interaction, with most operations achievable through mouse clicks
-- Highly customizable backup pruning strategies, similar to the strategy use by [PBS](https://pbs.proxmox.com/docs/prune-simulator/)
-- Crontab jobs, including automatic backup, automatic pruning, etc.
-- Supports use as a command-line tool. Manage the backups without MCDR
+- Rich database toolkit: overview statistics, integrity validation, orphan cleanup, file deletion, and hash/compression method migration
+- Highly customizable backup pruning strategies, similar to the strategy used by [PBS](https://pbs.proxmox.com/docs/prune-simulator/)
+- Scheduled jobs for automatic backup creation and backup pruning, supporting fixed intervals and crontab expressions
+- Provides a command-line tool if you want to manage backups without MCDR. Also supports mounting as a filesystem via FUSE
 
 ![!!pb command](docs/img/pb_welcome.png)

@@ -29,18 +32,22 @@ See the document: https://tisunion.github.io/PrimeBackup/
 
 ## How it works
 
-Prime Backup maintains a custom file pool to store the backup files. Every file in the pool is identified with the hash value of its content.
-With that, Prime Backup can deduplicate files with same content, and only stores 1 copy of them, greatly reduces the burden on disk usage.
+Prime Backup maintains a custom file pool to store backup data. Every stored object is identified by a hash of its content.
+With that, Prime Backup can deduplicate files with the same content and store only 1 copy of them, greatly reducing disk usage
 
-Besides that, Prime Backup also supports compression on the stored files, which reduces the disk usage further more
+Prime Backup also supports compression on stored data to further reduce disk usage
 
-PrimeBackup is capable of storing various of common file types, including regular files, directories, and symbolic links. For these 3 types:
+For large and locally edited files, Prime Backup can optionally use CDC (Content-Defined Chunking) for better deduplication.
+The file is split into content-defined chunks. Each chunk is hashed and reused across backups when unchanged; only new chunks are stored
 
-- Regular file: Prime Backup calculates its hash values first. If the hash does not exist in the file pool,
-  Prime backup will (compress and) store its content into a new blob in the file pool.
-  The file status, including mode, uid, mtime etc., will be stored in the database
-- Directory: Prime Backup will store its information in the database
-- Symlink: Prime Backup will store the symlink itself, instead of the linked target
+Prime Backup stores common file types, including regular files, directories, and symbolic links. For these 3 types:
+
+- Regular file: Prime Backup calculates its hashes (and size) first.
+  If CDC is enabled, it stores the file as a chunked blob that references chunks; chunks are deduplicated and compressed individually.
+  Otherwise, it stores a direct blob; the whole file is deduplicated and compressed as a single unit.
+  File metadata such as mode, uid, and mtime is stored in the database
+- Directory: Prime Backup stores its information in the database
+- Symlink: Prime Backup stores the symlink itself instead of the linked target
 
 ## Thanks

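The chunk-reuse behavior this README change describes can be sketched in a few lines. This is an illustrative toy, not Prime Backup's implementation: it declares a chunk boundary after every `0xFF` byte instead of using a real rolling-hash CDC algorithm such as FastCDC (the `pyfastcdc` dependency), and the `split_chunks`/`store` helper names are hypothetical.

```python
import hashlib

def split_chunks(data: bytes) -> list[bytes]:
    # Toy content-defined boundary rule: end a chunk after every 0xFF byte.
    # Real CDC (e.g. FastCDC) picks boundaries with a rolling hash, but the
    # key property is the same: boundaries depend on content, not offsets.
    chunks, start = [], 0
    for i, byte in enumerate(data):
        if byte == 0xFF:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def store(pool: dict[str, bytes], data: bytes) -> list[str]:
    # Hash every chunk; only chunks not already in the pool are stored
    refs = []
    for chunk in split_chunks(data):
        digest = hashlib.sha256(chunk).hexdigest()
        pool.setdefault(digest, chunk)  # deduplicated storage
        refs.append(digest)
    return refs

pool: dict[str, bytes] = {}
v1 = bytes(range(256)) * 64                # a 16 KiB "file": 64 chunks
refs_v1 = store(pool, v1)
v2 = v1[:8000] + b"edit!" + v1[8000:]      # local edit in the middle
refs_v2 = store(pool, v2)
# The edit lands inside a single chunk; every other chunk keeps its hash and
# is reused, so the pool grows by only one chunk
print(len(refs_v1), len(refs_v2), len(pool))  # → 64 64 2
```

Because boundaries move with the content rather than fixed offsets, an insertion shifts only the chunk it touches; this is what lets a large, locally edited file cost only a few new chunks per backup.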
README.zh.md

Lines changed: 22 additions & 15 deletions
@@ -2,18 +2,21 @@
 
 [English](README.md) | **中文**
 
-一个强大的 MCDR 备份插件,一套先进的 Minecraft 存档备份解决方案
+一个强大的 MCDR 备份插件,一套先进的 Minecraft 世界备份解决方案
 
 中文文档:https://tisunion.github.io/PrimeBackup/zh/
 
-## Features
+## 功能特性
 
-- 基于哈希的文件池,只储存有变化的文件。支持无限数量的备份
+- 基于哈希的文件池与压缩去重:仅存储新增或变更的数据,备份数量没有上限
+- 可选的 CDC(Content-Defined Chunking,内容定义分块)分块存储:适用于大文件的局部编辑场景,能显著提升跨备份的去重效果
+- 安全的回档流程:包含确认与倒计时、回档前自动创建备份、回收站式的回滚机制以及数据完整性校验
 - 完善的备份操作,包括备份回档、展示删除、导入导出等
 - 流畅的游戏内交互,大部分操作都能点点点
+- 丰富的数据库工具,含概览统计、完整性校验、孤儿数据清理、文件删除、哈希/压缩算法迁移等功能
 - 高可自定义备份清理策略,是 [PBS](https://pbs.proxmox.com/docs/prune-simulator/) 所用策略的同款
-- 定时任务,包括自动备份、自动清理等
-- 支持作为命令行工具使用,无需 MCDR 即可管理备份
+- 定时任务:支持自动创建备份和自动清理备份,计划方式支持固定间隔和 crontab 表达式
+- 支持作为命令行工具使用,无需启动 MCDR 即可管理备份,还可以通过 FUSE 挂载为文件系统进行访问
 
 ![!!pb command](docs/img/pb_welcome.zh.png)

@@ -29,19 +32,23 @@ Python 包要求:见 [requirements.txt](requirements.txt)
 
 ## 工作原理
 
-Prime Backup 维护了一个自定义的文件池来储存备份文件,池中的每个文件都以其内容的哈希值作为其唯一标识符
-借此,Prime Backup 可以对那些内容相同的文件进行去重,并只存储它们的一份副本,从而有效地减少了磁盘占用的负担
+Prime Backup 使用一个自定义的文件池来存储备份数据。池中的每个对象都以其内容的哈希值作为唯一标识
+通过这种方式,Prime Backup 可以对内容完全相同的文件进行去重,并只存储它们的一份副本,从而显著降低磁盘空间占用
 
-除此之外,Prime Backup 还支持对存储的文件进行压缩,从而进一步减少磁盘占用
+此外,Prime Backup 还支持对存储的数据进行压缩,以进一步减少磁盘使用量
 
-Prime Backup 可以存储常见集中的文件类型,包括普通文件、目录和符号链接。对于这三种文件类型:
+对于体积较大且仅被局部修改的文件,Prime Backup 可选择启用 CDC(Content-Defined Chunking,内容定义分块)功能来提升去重效率。
+文件会被切分成由内容定义的数据块(chunk),每个数据块都会计算哈希值。只有新的数据块才会被写入存储。如果数据块的内容没有改变,它就可以在不同的备份中被复用
 
-- 普通文件:Prime Backup 会先计算其哈希值。如果文件池里不存在这个哈希,
-  就在池里新建一个数据对象,(压缩)储存该文件的内容。
-  对于文件的状态信息,包括 mode、uid、mtime 等,将存储在数据库中
-- 文件夹:Prime Backup 将其信息存储在数据库中
-- 符号链接:Prime Backup 将存储符号链接本身,而非其所链接的目标对象
+Prime Backup 支持常见的文件类型,包括普通文件、目录和符号链接。对于这三类文件:
+
+- 普通文件:Prime Backup 会先计算其哈希值(及文件大小)。
+  启用 CDC 时,文件以“chunked blob”形式存储,并引用多个数据块。这些数据块会独立进行去重和压缩;
+  否则,文件会以“direct blob”形式存储,整个文件作为一个单元进行去重和压缩。
+  文件的权限(mode)、用户ID(uid)、修改时间(mtime)等元数据会存储在数据库中
+- 目录:Prime Backup 将其信息存储在数据库中
+- 符号链接:Prime Backup 存储的是符号链接本身,而不是它所指向的目标文件
 
 ## 致谢
 
-基于哈希的文件池这个想法来自 https://github.com/z0z0r4/better_backup
+基于哈希的文件池思路来自 https://github.com/z0z0r4/better_backup

docs/concept/storage_structure.md

Lines changed: 24 additions & 4 deletions
@@ -80,9 +80,25 @@ Blob is the actual storage object for file content
 
 - Uses hash value as its unique identifier, one hash value has exactly one corresponding blob
 - Only stores file content data and its compression method, does not store actual file metadata
-- Stored independently as files, located in the blobs folder under the [storage_root](config.md#storage_root) path
+- Has two storage methods: `direct` and `chunked`
+    - A direct blob is stored independently as a file in the `blobs` folder under [storage_root](config.md#storage_root)
+    - A chunked blob is cut into several chunks, and the chunks are stored as files in `blobs/_chunks`
 - One blob can be referenced by multiple file objects. When the reference count drops to 0, PrimeBackup will delete this blob
 
+## Chunk and Chunk Group
+
+Chunk is the deduplication unit used by CDC chunking for large files
+
+- A chunk stores a piece of file content, its hash, its compression method, and its size information
+- Chunks are content-defined, so inserting or modifying data in the middle of a large file can still keep many neighboring chunks reusable
+- Chunk files are stored independently and deduplicated globally, just like direct blobs
+
+Chunk group is an ordered list of chunks used to reduce metadata fan-out for a chunked blob
+
+- Prime Backup groups consecutive chunks into chunk groups, then binds the chunk groups back to the blob in order
+- Reconstructing a chunked blob means reading its chunk groups in order and then reading the chunks inside each group in order
+- For a chunked blob, the blob `stored_size` is the sum of unique stored chunk sizes instead of the size of one standalone blob file
+
 ## Storage Architecture Diagram
 
 ```mermaid
@@ -94,9 +110,11 @@ graph LR
     DB --> fileset[Fileset Objects]
     DB --> file[File Objects]
     DB --> blob[Blob Objects]
+    DB --> chunk_group[Chunk Group Objects]
+    DB --> chunk[Chunk Objects]
 
-    blob_pool --> blob_storage[Hash Sharding]
-    blob_storage --> blob_file[Blob Files]
+    blob_pool --> blob_storage[Direct Blob Files]
+    blob_pool --> chunk_storage[Chunk Files]
 
     style A fill:#e1f5fe
     style DB fill:#f3e5f5
@@ -105,6 +123,8 @@ graph LR
     style fileset fill:#e8f5e8
     style file fill:#e8f5e8
    style blob fill:#e8f5e8
+    style chunk_group fill:#e8f5e8
+    style chunk fill:#e8f5e8
     style blob_storage fill:#fff3e0
-    style blob_file fill:#fff3e0
+    style chunk_storage fill:#fff3e0
 ```
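The reconstruction order described in the chunk-group section above (groups in order, then chunks within each group in order) can be sketched as follows. The class and field names are illustrative stand-ins, not Prime Backup's actual database schema, and compression is ignored.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    hash: str
    raw: bytes  # chunk content (compression ignored in this sketch)

@dataclass
class ChunkGroup:
    chunks: list[Chunk]  # ordered chunks within the group

@dataclass
class ChunkedBlob:
    groups: list[ChunkGroup]  # ordered chunk groups bound to the blob

    def reconstruct(self) -> bytes:
        # Read chunk groups in order, then the chunks inside each group in order
        return b"".join(c.raw for g in self.groups for c in g.chunks)

    def stored_size(self) -> int:
        # Sum each unique chunk exactly once: repeated chunks cost no extra storage
        seen: dict[str, int] = {}
        for g in self.groups:
            for c in g.chunks:
                seen[c.hash] = len(c.raw)
        return sum(seen.values())

a = Chunk("h1", b"aaaa")
b = Chunk("h2", b"bb")
blob = ChunkedBlob([ChunkGroup([a, b]), ChunkGroup([b])])
print(blob.reconstruct())   # → b'aaaabbbb'
print(blob.stored_size())   # → 6 (chunk h2 is stored once but referenced twice)
```

Note how the logical size (8 bytes) exceeds `stored_size` (6 bytes) because the repeated chunk is stored only once, matching the `stored_size` semantics described above.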

docs/concept/storage_structure.zh.md

Lines changed: 27 additions & 7 deletions
@@ -76,12 +76,28 @@ graph TB
 
 ## 数据对象(Blob)
 
-数据对象(Blob)是实际的文件内容的实际储存对象
+数据对象(Blob)是实际存储文件内容的对象
 
 - 使用哈希值作为其唯一标识符,一个哈希值有且仅有一个对应的数据对象
-- 只储存文件的内容数据及其压缩方式,不储存实际文件的元信息
-- 以文件形式独立存储,位于 [storage_root](config.zh.md#storage_root) 路径下的 blobs 文件夹
-- 一个数据对象可被多个文件对象引用。当引用数下降为 0 时,PrimeBackup 会删除这一数据对象
+- 仅存储文件的内容数据及其压缩方式,不存储实际文件的元信息
+- 具有两种存储方式:`direct` 和 `chunked`
+    - `direct`(直存)数据对象会以独立文件的形式存储在 [storage_root](config.zh.md#storage_root) 下的 `blobs` 目录中
+    - `chunked`(分块)数据对象不直接对应一个独立的 blob 文件,而是由多个数据块组和数据块按顺序重建出来;数据块文件则独立存放在 `blobs/_chunks` 目录中
+- 一个数据对象可被多个文件对象引用。当引用数降为 0 时,PrimeBackup 会删除该数据对象
 
+## 数据块与数据块组(Chunk and Chunk Group)
+
+数据块(Chunk)是 CDC 为大文件引入的去重单位
+
+- 一个数据块保存一段文件内容,以及它的哈希值、压缩方式和大小信息
+- 数据块按内容定义的边界切分,因此即使大文件在中间插入或修改了数据,周围未变化的部分仍有机会落在相同的数据块中被复用
+- 数据块文件会像直存数据对象(direct blob)一样独立存储,并在全局范围内去重
+
+数据块组(Chunk Group)是一组按顺序组织的数据块,用于降低 chunked blob 的元数据展开规模
+
+- Prime Backup 会将连续的数据块组织成数据块组,再按顺序将数据块组绑定回 blob
+- 重建分块 blob 时,会先按顺序读取其数据块组,再按顺序读取每个组内的数据块
+- 对于分块 blob,其 `stored_size` 表示所有唯一数据块存储大小之和,而非某个独立 blob 文件的大小
 
 ## 存储架构图
 
@@ -94,9 +110,11 @@ graph LR
     DB --> fileset[文件集对象]
     DB --> file[文件对象]
     DB --> blob[数据对象]
+    DB --> chunk_group[数据块组对象]
+    DB --> chunk[数据块对象]
 
-    blob_pool --> blob_storage[哈希分片]
-    blob_storage --> blob_file[数据对象文件]
+    blob_pool --> blob_storage[直存 Blob 文件]
+    blob_pool --> chunk_storage[数据块文件]
 
     style A fill:#e1f5fe
     style DB fill:#f3e5f5
@@ -105,6 +123,8 @@ graph LR
     style fileset fill:#e8f5e8
     style file fill:#e8f5e8
     style blob fill:#e8f5e8
+    style chunk_group fill:#e8f5e8
+    style chunk fill:#e8f5e8
     style blob_storage fill:#fff3e0
-    style blob_file fill:#fff3e0
+    style chunk_storage fill:#fff3e0
 ```

docs/config.md

Lines changed: 53 additions & 0 deletions
@@ -224,6 +224,12 @@ Configs on how the backup is made
             "**"
         ],
+        "cdc_enabled": false,
+        "cdc_file_size_threshold": 104857600,
+        "cdc_patterns": [
+            "**/*.db"
+        ],
         "hash_method": "blake3",
         "compress_method": "zstd",
         "compress_threshold": 64,
@@ -413,6 +419,53 @@ The default value is `["**"]`, which matches everything. It's suggested to limit
 
 - Type: `List[str]`
 
+#### cdc_enabled
+
+Whether to enable content-defined chunking (CDC) for large files during backup creation
+
+CDC stands for `Content-Defined Chunking`.
+Unlike fixed-size chunking, CDC determines chunk boundaries from the file content itself,
+so when data is inserted, deleted, or modified locally, many unchanged regions can still be cut into the same chunks and be reused across backups
+
+Changing this option only affects files newly stored in future backups.
+Existing direct blobs or chunked blobs will not be converted automatically
+
+!!! note
+
+    CDC chunking requires the optional `pyfastcdc` dependency.
+    You can install all optional dependencies with `pip3 install -r requirements.optional.txt`,
+    or install `pyfastcdc` manually
+
+- Type: `bool`
+- Default: `false`
+
+#### cdc_file_size_threshold
+
+The minimum file size in bytes for a file to be considered for CDC chunking
+
+Files smaller than this threshold will continue to use the regular direct blob storage flow,
+even if [cdc_enabled](#cdc_enabled) is enabled and the path matches [cdc_patterns](#cdc_patterns)
+
+Changing this option only affects files newly stored in future backups.
+Existing stored data will not be repartitioned automatically
+
+- Type: `int`
+- Default: `104857600` (`100 MiB`)
+
+#### cdc_patterns
+
+A list of [gitignore flavor](http://git-scm.com/docs/gitignore) pattern strings,
+matched against file paths relative to [source_root](#source_root)
+
+CDC chunking will only be applied when the file path matches one of these patterns,
+the file size reaches [cdc_file_size_threshold](#cdc_file_size_threshold),
+and [cdc_enabled](#cdc_enabled) is enabled
+
+The default value is `["**/*.db"]`.
+It is recommended to keep this list narrow and only include large files that are often modified locally and really need to be backed up
+
+- Type: `List[str]`
+
 #### hash_method
 
 The algorithm to hash the files. Available options: `"xxh128"`, `"sha256"`, `"blake3"`
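Taken together, the three new options gate CDC as sketched below. The function name is hypothetical, and `PurePosixPath.match` only approximates the gitignore-flavored matching the config actually uses; the parameter defaults mirror the documented values.

```python
from pathlib import PurePosixPath

def should_use_cdc(rel_path: str, file_size: int, *,
                   cdc_enabled: bool = False,
                   cdc_file_size_threshold: int = 104857600,  # 100 MiB
                   cdc_patterns: tuple[str, ...] = ("**/*.db",)) -> bool:
    # CDC applies only when the feature is enabled, the file is large enough,
    # and its path (relative to source_root) matches one of the patterns
    if not cdc_enabled or file_size < cdc_file_size_threshold:
        return False
    path = PurePosixPath(rel_path)
    return any(path.match(pattern) for pattern in cdc_patterns)

print(should_use_cdc("world/ledger.db", 200 * 2**20, cdc_enabled=True))  # → True
print(should_use_cdc("world/ledger.db", 1024, cdc_enabled=True))         # → False (below threshold)
print(should_use_cdc("world/ledger.db", 200 * 2**20))                    # → False (cdc_enabled is off)
```

As the second call shows, the size threshold short-circuits before any pattern matching, so small files always fall back to the direct blob flow.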

docs/config.zh.md

Lines changed: 51 additions & 0 deletions
@@ -224,6 +224,12 @@ Prime Backup 在创建备份时的操作时序如下:
             "**"
         ],
+        "cdc_enabled": false,
+        "cdc_file_size_threshold": 104857600,
+        "cdc_patterns": [
+            "**/*.db"
+        ],
         "hash_method": "blake3",
         "compress_method": "zstd",
         "compress_threshold": 64,
@@ -413,6 +419,51 @@ Prime Backup 会检查文件的如下这些信息。下述这些信息完全一
 
 - 类型:`List[str]`
 
+#### cdc_enabled
+
+是否在创建备份时,对大文件启用内容定义分块(CDC)
+
+CDC 是 `Content-Defined Chunking` 的缩写,即“按内容划分边界”的切块方式。
+它与固定大小切块不同,数据块边界由文件内容决定,因此当文件仅在局部发生增删改时,许多未变化的内容仍能被切成相同的数据块,从而复用已有数据块
+
+修改此选项只会影响后续备份中新写入的文件。
+已存在的直存数据对象(direct blob)或分块数据对象(chunked blob)不会被自动转换
+
+!!! note
+
+    CDC 分块需要可选依赖 `pyfastcdc`。
+    你可以通过 `pip3 install -r requirements.optional.txt` 安装全部可选依赖,
+    或者单独安装 `pyfastcdc`
+
+- 类型:`bool`
+- 默认值:`false`
+
+#### cdc_file_size_threshold
+
+文件参与 CDC 分块所需达到的最小大小,单位为字节。
+
+小于该阈值的文件,即使 [cdc_enabled](#cdc_enabled) 已启用、路径也匹配了 [cdc_patterns](#cdc_patterns),
+仍会继续使用常规的直存数据对象(direct blob)存储流程
+
+修改此选项只会影响后续备份中新写入的文件。
+已入库的数据不会被自动重新切分
+
+- 类型:`int`
+- 默认值:`104857600`(`100 MiB`)
+
+#### cdc_patterns
+
+一个 [gitignore 风格](http://git-scm.com/docs/gitignore) 的模板串列表,
+匹配对象是相对于 [source_root](#source_root) 的文件路径
+
+只有当文件路径匹配这些模式、文件大小达到 [cdc_file_size_threshold](#cdc_file_size_threshold)、
+且 [cdc_enabled](#cdc_enabled) 已启用时,才会使用 CDC 分块
+
+默认值为 `["**/*.db"]`。
+建议将其控制得尽量精确,只包含那些体积大、经常发生局部修改、且确实需要备份的文件
+
+- 类型:`List[str]`
+
 #### hash_method
 
 对文件进行哈希时所使用的算法。可用选项:`"xxh128"`、`"sha256"`、`"blake3"`
