Commit eb68e49

committed: cdc doc updates
1 parent 332d2f5 commit eb68e49

14 files changed

Lines changed: 489 additions & 46 deletions

README.md

Lines changed: 21 additions & 14 deletions
@@ -8,12 +8,15 @@ Document: https://tisunion.github.io/PrimeBackup/
 
 ## Features
 
-- Only stores files with changes with the hash-based file pool. Supports unlimited number of backup
-- Comprehensive backup operations, including backup/restore, list/delete, import/export, etc
+- Hash-based, compressed file pool deduplication. Only new or changed data is stored, with no hard limit on backup count
+- Optional CDC (content-defined chunking) for large, locally edited files to improve deduplication across backups
+- Safe restore workflow: confirmation + countdown, automatic pre-restore backup, recycle-bin rollback, and data verification
+- Comprehensive backup operations, including backup/restore, list/delete, import/export, comments/tags, etc.
 - Smooth in-game interaction, with most operations achievable through mouse clicks
-- Highly customizable backup pruning strategies, similar to the strategy use by [PBS](https://pbs.proxmox.com/docs/prune-simulator/)
-- Crontab jobs, including automatic backup, automatic pruning, etc.
-- Supports use as a command-line tool. Manage the backups without MCDR
+- Rich database toolkit: overview statistics, integrity validation, orphan cleanup, file deletion, and hash/compression method migration
+- Highly customizable backup pruning strategies, similar to the strategy used by [PBS](https://pbs.proxmox.com/docs/prune-simulator/)
+- Scheduled jobs for automatic backup creation and backup pruning, supporting fixed intervals and crontab expressions
+- Provides a command-line tool if you want to manage backups without MCDR. Also supports mounting as a filesystem via FUSE
 
 ![!!pb command](docs/img/pb_welcome.png)

@@ -29,18 +32,22 @@ See the document: https://tisunion.github.io/PrimeBackup/
 
 ## How it works
 
-Prime Backup maintains a custom file pool to store the backup files. Every file in the pool is identified with the hash value of its content.
-With that, Prime Backup can deduplicate files with same content, and only stores 1 copy of them, greatly reduces the burden on disk usage.
+Prime Backup maintains a custom file pool to store backup data. Every stored object is identified by a hash of its content.
+With that, Prime Backup can deduplicate files with the same content and store only 1 copy of them, greatly reducing disk usage
 
-Besides that, Prime Backup also supports compression on the stored files, which reduces the disk usage further more
+Prime Backup also supports compression on stored data to further reduce disk usage
 
-PrimeBackup is capable of storing various of common file types, including regular files, directories, and symbolic links. For these 3 types:
+For large and locally edited files, Prime Backup can optionally use CDC (Content-Defined Chunking) for better deduplication.
+The file is split into content-defined chunks. Each chunk is hashed and reused across backups when unchanged; only new chunks are stored
 
-- Regular file: Prime Backup calculates its hash values first. If the hash does not exist in the file pool,
-  Prime backup will (compress and) store its content into a new blob in the file pool.
-  The file status, including mode, uid, mtime etc., will be stored in the database
-- Directory: Prime Backup will store its information in the database
-- Symlink: Prime Backup will store the symlink itself, instead of the linked target
+Prime Backup stores common file types, including regular files, directories, and symbolic links. For these 3 types:
+
+- Regular file: Prime Backup calculates its hashes (and size) first.
+  If CDC is enabled, it stores the file as a chunked blob that references chunks; chunks are deduplicated and compressed individually.
+  Otherwise, it stores a direct blob; the whole file is deduplicated and compressed as a single unit.
+  File metadata such as mode, uid, and mtime is stored in the database
+- Directory: Prime Backup stores its information in the database
+- Symlink: Prime Backup stores the symlink itself instead of the linked target
 
 ## Thanks

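The chunk-reuse behavior this README change describes can be sketched in a few lines. This is an illustrative toy, not Prime Backup's implementation: it declares a chunk boundary after every `0xFF` byte instead of using a real rolling-hash CDC algorithm such as FastCDC (the `pyfastcdc` dependency), and the `split_chunks`/`store` helper names are hypothetical.

```python
import hashlib

def split_chunks(data: bytes) -> list[bytes]:
    # Toy content-defined boundary rule: end a chunk after every 0xFF byte.
    # Real CDC (e.g. FastCDC) picks boundaries with a rolling hash, but the
    # key property is the same: boundaries depend on content, not offsets.
    chunks, start = [], 0
    for i, byte in enumerate(data):
        if byte == 0xFF:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def store(pool: dict[str, bytes], data: bytes) -> list[str]:
    # Hash every chunk; only chunks not already in the pool are stored
    refs = []
    for chunk in split_chunks(data):
        digest = hashlib.sha256(chunk).hexdigest()
        pool.setdefault(digest, chunk)  # deduplicated storage
        refs.append(digest)
    return refs

pool: dict[str, bytes] = {}
v1 = bytes(range(256)) * 64                # a 16 KiB "file": 64 chunks
refs_v1 = store(pool, v1)
v2 = v1[:8000] + b"edit!" + v1[8000:]      # local edit in the middle
refs_v2 = store(pool, v2)
# The edit lands inside a single chunk; every other chunk keeps its hash and
# is reused, so the pool grows by only one chunk
print(len(refs_v1), len(refs_v2), len(pool))  # → 64 64 2
```

Because boundaries move with the content rather than fixed offsets, an insertion shifts only the chunk it touches; this is what lets a large, locally edited file cost only a few new chunks per backup.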
README.zh.md

Lines changed: 22 additions & 15 deletions
@@ -2,18 +2,21 @@
 
 [English](README.md) | **中文**
 
-一个强大的 MCDR 备份插件,一套先进的 Minecraft 存档备份解决方案
+一个强大的 MCDR 备份插件,一套先进的 Minecraft 世界备份解决方案
 
 中文文档:https://tisunion.github.io/PrimeBackup/zh/
 
-## Features
+## 功能特性
 
-- 基于哈希的文件池,只储存有变化的文件。支持无限数量的备份
+- 基于哈希的文件池与压缩去重:仅存储新增或变更的数据,备份数量没有上限
+- 可选的 CDC(Content-Defined Chunking,内容定义分块)分块存储:适用于大文件的局部编辑场景,能显著提升跨备份的去重效果
+- 安全的回档流程:包含确认与倒计时、回档前自动创建备份、回收站式的回滚机制以及数据完整性校验
 - 完善的备份操作,包括备份回档、展示删除、导入导出等
 - 流畅的游戏内交互,大部分操作都能点点点
+- 丰富的数据库工具,含概览统计、完整性校验、孤儿数据清理、文件删除、哈希/压缩算法迁移等功能
 - 高可自定义备份清理策略,是 [PBS](https://pbs.proxmox.com/docs/prune-simulator/) 所用策略的同款
-- 定时任务,包括自动备份、自动清理等
-- 支持作为命令行工具使用,无需 MCDR 即可管理备份
+- 定时任务:支持自动创建备份和自动清理备份,计划方式支持固定间隔和 crontab 表达式
+- 支持作为命令行工具使用,无需启动 MCDR 即可管理备份,还可以通过 FUSE 挂载为文件系统进行访问
 
 ![!!pb command](docs/img/pb_welcome.zh.png)

@@ -29,19 +32,23 @@ Python 包要求:见 [requirements.txt](requirements.txt)
 
 ## 工作原理
 
-Prime Backup 维护了一个自定义的文件池来储存备份文件,池中的每个文件都以其内容的哈希值作为其唯一标识符
-借此,Prime Backup 可以对那些内容相同的文件进行去重,并只存储它们的一份副本,从而有效地减少了磁盘占用的负担
+Prime Backup 使用一个自定义的文件池来存储备份数据。池中的每个对象都以其内容的哈希值作为唯一标识
+通过这种方式,Prime Backup 可以对内容完全相同的文件进行去重,并只存储它们的一份副本,从而显著降低磁盘空间占用
 
-除此之外,Prime Backup 还支持对存储的文件进行压缩,从而进一步减少磁盘占用
+此外,Prime Backup 还支持对存储的数据进行压缩,以进一步减少磁盘使用量
 
-Prime Backup 可以存储常见集中的文件类型,包括普通文件、目录和符号链接。对于这三种文件类型:
+对于体积较大且仅被局部修改的文件,Prime Backup 可选择启用 CDC(Content-Defined Chunking,内容定义分块)功能来提升去重效率。
+文件会被切分成由内容定义的数据块(chunk),每个数据块都会计算哈希值。只有新的数据块才会被写入存储。如果数据块的内容没有改变,它就可以在不同的备份中被复用
 
-- 普通文件:Prime Backup 会先计算其哈希值。如果文件池里不存在这个哈希,
-  就在池里新建一个数据对象,(压缩)储存该文件的内容。
-  对于文件的状态信息,包括 mode、uid、mtime 等,将存储在数据库中
-- 文件夹:Prime Backup 将其信息存储在数据库中
-- 符号链接:Prime Backup 将存储符号链接本身,而非其所链接的目标对象
+Prime Backup 支持常见的文件类型,包括普通文件、目录和符号链接。对于这三类文件:
+
+- 普通文件:Prime Backup 会先计算其哈希值(及文件大小)。
+  启用 CDC 时,文件以“chunked blob”形式存储,并引用多个数据块。这些数据块会独立进行去重和压缩;
+  否则,文件会以“direct blob”形式存储,整个文件作为一个单元进行去重和压缩。
+  文件的权限(mode)、用户ID(uid)、修改时间(mtime)等元数据会存储在数据库中
+- 目录:Prime Backup 将其信息存储在数据库中
+- 符号链接:Prime Backup 存储的是符号链接本身,而不是它所指向的目标文件
 
 ## 致谢
 
-基于哈希的文件池这个想法来自 https://github.com/z0z0r4/better_backup
+基于哈希的文件池思路来自 https://github.com/z0z0r4/better_backup

docs/concept/storage_structure.md

Lines changed: 24 additions & 4 deletions
@@ -80,9 +80,25 @@ Blob is the actual storage object for file content
 
 - Uses hash value as its unique identifier, one hash value has exactly one corresponding blob
 - Only stores file content data and its compression method, does not store actual file metadata
-- Stored independently as files, located in the blobs folder under the [storage_root](config.md#storage_root) path
+- Has two storage methods: `direct` and `chunked`
+    - A direct blob is stored independently as a file in the `blobs` folder under [storage_root](config.md#storage_root)
+    - A chunked blob is cut into several chunks, and the chunks are stored as files in `blobs/_chunks`
 - One blob can be referenced by multiple file objects. When the reference count drops to 0, PrimeBackup will delete this blob
 
+## Chunk and Chunk Group
+
+Chunk is the deduplication unit used by CDC chunking for large files
+
+- A chunk stores a piece of file content, its hash, its compression method, and its size information
+- Chunks are content-defined, so inserting or modifying data in the middle of a large file can still keep many neighboring chunks reusable
+- Chunk files are stored independently and deduplicated globally, just like direct blobs
+
+Chunk group is an ordered list of chunks used to reduce metadata fan-out for a chunked blob
+
+- Prime Backup groups consecutive chunks into chunk groups, then binds the chunk groups back to the blob in order
+- Reconstructing a chunked blob means reading its chunk groups in order and then reading the chunks inside each group in order
+- For a chunked blob, the blob `stored_size` is the sum of unique stored chunk sizes instead of the size of one standalone blob file
+
 ## Storage Architecture Diagram
 
 ```mermaid
@@ -94,9 +110,11 @@ graph LR
     DB --> fileset[Fileset Objects]
     DB --> file[File Objects]
     DB --> blob[Blob Objects]
+    DB --> chunk_group[Chunk Group Objects]
+    DB --> chunk[Chunk Objects]
 
-    blob_pool --> blob_storage[Hash Sharding]
-    blob_storage --> blob_file[Blob Files]
+    blob_pool --> blob_storage[Direct Blob Files]
+    blob_pool --> chunk_storage[Chunk Files]
 
     style A fill:#e1f5fe
     style DB fill:#f3e5f5
@@ -105,6 +123,8 @@ graph LR
     style fileset fill:#e8f5e8
     style file fill:#e8f5e8
    style blob fill:#e8f5e8
+    style chunk_group fill:#e8f5e8
+    style chunk fill:#e8f5e8
     style blob_storage fill:#fff3e0
-    style blob_file fill:#fff3e0
+    style chunk_storage fill:#fff3e0
 ```
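The reconstruction order described in the chunk-group section above (groups in order, then chunks within each group in order) can be sketched as follows. The class and field names are illustrative stand-ins, not Prime Backup's actual database schema, and compression is ignored.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    hash: str
    raw: bytes  # chunk content (compression ignored in this sketch)

@dataclass
class ChunkGroup:
    chunks: list[Chunk]  # ordered chunks within the group

@dataclass
class ChunkedBlob:
    groups: list[ChunkGroup]  # ordered chunk groups bound to the blob

    def reconstruct(self) -> bytes:
        # Read chunk groups in order, then the chunks inside each group in order
        return b"".join(c.raw for g in self.groups for c in g.chunks)

    def stored_size(self) -> int:
        # Sum each unique chunk exactly once: repeated chunks cost no extra storage
        seen: dict[str, int] = {}
        for g in self.groups:
            for c in g.chunks:
                seen[c.hash] = len(c.raw)
        return sum(seen.values())

a = Chunk("h1", b"aaaa")
b = Chunk("h2", b"bb")
blob = ChunkedBlob([ChunkGroup([a, b]), ChunkGroup([b])])
print(blob.reconstruct())   # → b'aaaabbbb'
print(blob.stored_size())   # → 6 (chunk h2 is stored once but referenced twice)
```

Note how the logical size (8 bytes) exceeds `stored_size` (6 bytes) because the repeated chunk is stored only once, matching the `stored_size` semantics described above.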

docs/concept/storage_structure.zh.md

Lines changed: 27 additions & 7 deletions
@@ -76,12 +76,28 @@ graph TB
 
 ## 数据对象(Blob)
 
-数据对象(Blob)是实际的文件内容的实际储存对象
+数据对象(Blob)是实际存储文件内容的对象
 
 - 使用哈希值作为其唯一标识符,一个哈希值有且仅有一个对应的数据对象
-- 只储存文件的内容数据及其压缩方式,不储存实际文件的元信息
-- 以文件形式独立存储,位于 [storage_root](config.zh.md#storage_root) 路径下的 blobs 文件夹
-- 一个数据对象可被多个文件对象引用。当引用数下降为 0 时,PrimeBackup 会删除这一数据对象
+- 仅存储文件的内容数据及其压缩方式,不存储实际文件的元信息
+- 具有两种存储方式:`direct` 和 `chunked`
+    - `direct`(直存)数据对象会以独立文件的形式存储在 [storage_root](config.zh.md#storage_root) 下的 `blobs` 目录中
+    - `chunked`(分块)数据对象不直接对应一个独立的 blob 文件,而是由多个数据块组和数据块按顺序重建出来;数据块文件则独立存放在 `blobs/_chunks` 目录中
+- 一个数据对象可被多个文件对象引用。当引用数降为 0 时,PrimeBackup 会删除该数据对象
 
+## 数据块与数据块组(Chunk and Chunk Group)
+
+数据块(Chunk)是 CDC 为大文件引入的去重单位
+
+- 一个数据块保存一段文件内容,以及它的哈希值、压缩方式和大小信息
+- 数据块按内容定义的边界切分,因此即使大文件在中间插入或修改了数据,周围未变化的部分仍有机会落在相同的数据块中被复用
+- 数据块文件会像直存数据对象(direct blob)一样独立存储,并在全局范围内去重
+
+数据块组(Chunk Group)是一组按顺序组织的数据块,用于降低 chunked blob 的元数据展开规模
+
+- Prime Backup 会将连续的数据块组织成数据块组,再按顺序将数据块组绑定回 blob
+- 重建分块 blob 时,会先按顺序读取其数据块组,再按顺序读取每个组内的数据块
+- 对于分块 blob,其 `stored_size` 表示所有唯一数据块存储大小之和,而非某个独立 blob 文件的大小
 
 ## 存储架构图
 
@@ -94,9 +110,11 @@ graph LR
     DB --> fileset[文件集对象]
     DB --> file[文件对象]
     DB --> blob[数据对象]
+    DB --> chunk_group[数据块组对象]
+    DB --> chunk[数据块对象]
 
-    blob_pool --> blob_storage[哈希分片]
-    blob_storage --> blob_file[数据对象文件]
+    blob_pool --> blob_storage[直存 Blob 文件]
+    blob_pool --> chunk_storage[数据块文件]
 
     style A fill:#e1f5fe
     style DB fill:#f3e5f5
@@ -105,6 +123,8 @@ graph LR
     style fileset fill:#e8f5e8
     style file fill:#e8f5e8
     style blob fill:#e8f5e8
+    style chunk_group fill:#e8f5e8
+    style chunk fill:#e8f5e8
     style blob_storage fill:#fff3e0
-    style blob_file fill:#fff3e0
+    style chunk_storage fill:#fff3e0
 ```

docs/config.md

Lines changed: 53 additions & 0 deletions
@@ -224,6 +224,12 @@ Configs on how the backup is made
             "**"
         ],
+        "cdc_enabled": false,
+        "cdc_file_size_threshold": 104857600,
+        "cdc_patterns": [
+            "**/*.db"
+        ],
         "hash_method": "blake3",
         "compress_method": "zstd",
         "compress_threshold": 64,
@@ -413,6 +419,53 @@ The default value is `["**"]`, which matches everything. It's suggested to limit
 
 - Type: `List[str]`
 
+#### cdc_enabled
+
+Whether to enable content-defined chunking (CDC) for large files during backup creation
+
+CDC stands for `Content-Defined Chunking`.
+Unlike fixed-size chunking, CDC determines chunk boundaries from the file content itself,
+so when data is inserted, deleted, or modified locally, many unchanged regions can still be cut into the same chunks and be reused across backups
+
+Changing this option only affects files newly stored in future backups.
+Existing direct blobs or chunked blobs will not be converted automatically
+
+!!! note
+
+    CDC chunking requires the optional `pyfastcdc` dependency.
+    You can install all optional dependencies with `pip3 install -r requirements.optional.txt`,
+    or install `pyfastcdc` manually
+
+- Type: `bool`
+- Default: `false`
+
+#### cdc_file_size_threshold
+
+The minimum file size in bytes for a file to be considered for CDC chunking
+
+Files smaller than this threshold will continue to use the regular direct blob storage flow,
+even if [cdc_enabled](#cdc_enabled) is enabled and the path matches [cdc_patterns](#cdc_patterns)
+
+Changing this option only affects files newly stored in future backups.
+Existing stored data will not be repartitioned automatically
+
+- Type: `int`
+- Default: `104857600` (`100 MiB`)
+
+#### cdc_patterns
+
+A list of [gitignore flavor](http://git-scm.com/docs/gitignore) pattern strings,
+matched against file paths relative to [source_root](#source_root)
+
+CDC chunking will only be applied when the file path matches one of these patterns,
+the file size reaches [cdc_file_size_threshold](#cdc_file_size_threshold),
+and [cdc_enabled](#cdc_enabled) is enabled
+
+The default value is `["**/*.db"]`.
+It is recommended to keep this list narrow and only include large files that are often modified locally and really need to be backed up
+
+- Type: `List[str]`
+
 #### hash_method
 
 The algorithm to hash the files. Available options: `"xxh128"`, `"sha256"`, `"blake3"`
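Taken together, the three new options gate CDC as sketched below. The function name is hypothetical, and `PurePosixPath.match` only approximates the gitignore-flavored matching the config actually uses; the parameter defaults mirror the documented values.

```python
from pathlib import PurePosixPath

def should_use_cdc(rel_path: str, file_size: int, *,
                   cdc_enabled: bool = False,
                   cdc_file_size_threshold: int = 104857600,  # 100 MiB
                   cdc_patterns: tuple[str, ...] = ("**/*.db",)) -> bool:
    # CDC applies only when the feature is enabled, the file is large enough,
    # and its path (relative to source_root) matches one of the patterns
    if not cdc_enabled or file_size < cdc_file_size_threshold:
        return False
    path = PurePosixPath(rel_path)
    return any(path.match(pattern) for pattern in cdc_patterns)

print(should_use_cdc("world/ledger.db", 200 * 2**20, cdc_enabled=True))  # → True
print(should_use_cdc("world/ledger.db", 1024, cdc_enabled=True))         # → False (below threshold)
print(should_use_cdc("world/ledger.db", 200 * 2**20))                    # → False (cdc_enabled is off)
```

As the second call shows, the size threshold short-circuits before any pattern matching, so small files always fall back to the direct blob flow.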

docs/config.zh.md

Lines changed: 51 additions & 0 deletions
@@ -224,6 +224,12 @@ Prime Backup 在创建备份时的操作时序如下:
             "**"
         ],
+        "cdc_enabled": false,
+        "cdc_file_size_threshold": 104857600,
+        "cdc_patterns": [
+            "**/*.db"
+        ],
         "hash_method": "blake3",
         "compress_method": "zstd",
         "compress_threshold": 64,
@@ -413,6 +419,51 @@ Prime Backup 会检查文件的如下这些信息。下述这些信息完全一
 
 - 类型:`List[str]`
 
+#### cdc_enabled
+
+是否在创建备份时,对大文件启用内容定义分块(CDC)
+
+CDC 是 `Content-Defined Chunking` 的缩写,即“按内容划分边界”的切块方式。
+它与固定大小切块不同,数据块边界由文件内容决定,因此当文件仅在局部发生增删改时,许多未变化的内容仍能被切成相同的数据块,从而复用已有数据块
+
+修改此选项只会影响后续备份中新写入的文件。
+已存在的直存数据对象(direct blob)或分块数据对象(chunked blob)不会被自动转换
+
+!!! note
+
+    CDC 分块需要可选依赖 `pyfastcdc`。
+    你可以通过 `pip3 install -r requirements.optional.txt` 安装全部可选依赖,
+    或者单独安装 `pyfastcdc`
+
+- 类型:`bool`
+- 默认值:`false`
+
+#### cdc_file_size_threshold
+
+文件参与 CDC 分块所需达到的最小大小,单位为字节。
+
+小于该阈值的文件,即使 [cdc_enabled](#cdc_enabled) 已启用、路径也匹配了 [cdc_patterns](#cdc_patterns),
+仍会继续使用常规的直存数据对象(direct blob)存储流程
+
+修改此选项只会影响后续备份中新写入的文件。
+已入库的数据不会被自动重新切分
+
+- 类型:`int`
+- 默认值:`104857600`(`100 MiB`)
+
+#### cdc_patterns
+
+一个 [gitignore 风格](http://git-scm.com/docs/gitignore) 的模板串列表,
+匹配对象是相对于 [source_root](#source_root) 的文件路径
+
+只有当文件路径匹配这些模式、文件大小达到 [cdc_file_size_threshold](#cdc_file_size_threshold)、
+且 [cdc_enabled](#cdc_enabled) 已启用时,才会使用 CDC 分块
+
+默认值为 `["**/*.db"]`。
+建议将其控制得尽量精确,只包含那些体积大、经常发生局部修改、且确实需要备份的文件
+
+- 类型:`List[str]`
+
 #### hash_method
 
 对文件进行哈希时所使用的算法。可用选项:`"xxh128"`、`"sha256"`、`"blake3"`
