|
| 1 | +# MFSL Database Architecture v2 Implementation Plan |
| 2 | + |
| 3 | +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. |
| 4 | +
|
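| 5 | +**Goal:** Migrate MFSL from the old architecture (registry.csv + level3_compounds + untagged MSP) to the v2 architecture (8-dimension tags embedded in MSP + compound_metadata + restructured DATABASE_MANUAL). |
| 6 | + |
| 7 | +**Architecture:** MSP files carry the 8-dimension tags as separate field lines (natively compatible with matchms). compound_metadata.csv serves as both the Level 3 data source and the source for metadata enrichment. The MetaboFlow TagFilter model is aligned with the new tag field names. DATABASE_MANUAL is restructured into a complete architecture manual. |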
| 8 | + |
| 9 | +**Tech Stack:** Python 3.12 + RDKit + matchms, R (annotation_ms1.R), FastAPI (Pydantic models) |
| 10 | + |
| 11 | +**Spec:** `docs/superpowers/specs/2026-03-24-mfsl-architecture-v2-design.md` |
| 12 | + |
| 13 | +--- |
| 14 | + |
| 15 | +## File Structure |
| 16 | + |
| 17 | +### Create |
| 18 | +- `~/spectral_libraries/scripts/fill_tags_msp.py`: writes the 8-dimension tags into the MSP files |
| 19 | +- `~/spectral_libraries/scripts/verify_msp_tags.py`: verifies MSP tag completeness |
| 20 | + |
| 21 | +### Modify |
| 22 | +- `~/spectral_libraries/deduplicated/*.msp`: tags written into 50 MSP files |
| 23 | +- `~/spectral_libraries/DATABASE_MANUAL.md`: fully restructured |
| 24 | +- `packages/backend/app/models/analysis.py`: TagFilter model alignment |
| 25 | +- `packages/engines/annot-worker/app/matchms_engine.py`: tag filtering logic |
| 26 | + |
| 27 | +### Delete |
| 28 | +- `~/spectral_libraries/registry.csv`: contents merged into DATABASE_MANUAL |
| 29 | +- `~/spectral_libraries/registry.db`: SQLite version, deleted along with it |
| 30 | +- `~/spectral_libraries/spectral_metadata.csv`: deprecated |
| 31 | + |
| 32 | +--- |
| 33 | + |
| 34 | +## Task 1: Write the 8-Dimension Tags into the MSP Files |
| 35 | + |
| 36 | +**Goal:** Write Chemical_class / Application / Sample / Confidence / Instrument / Polarity / Reg_lists tag field lines for 600,000 spectra. The Source dimension reuses the existing Sources field. |
| 37 | + |
| 38 | +**Files:** |
| 39 | +- Create: `~/spectral_libraries/scripts/fill_tags_msp.py` |
| 40 | +- Modify: `~/spectral_libraries/deduplicated/*.msp` (50 files) |
| 41 | + |
| 42 | +- [ ] **Step 1: Create the tag-writing script** |
| 43 | + |
| 44 | +`~/spectral_libraries/scripts/fill_tags_msp.py`: |
| 45 | + |
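| 46 | +Core logic: |
| 47 | +1. Load compound_metadata.csv and build an InChIKey → tag mapping (keyed on the first 14 characters of the InChIKey) |
| 48 | +2. Define the deterministic library-level tag rules (Layer 1) |
| 49 | +3. Process the MSP files one by one: |
| 50 | +   a. Parse each spectrum |
| 51 | +   b. Determine the tag values (priority: existing MSP fields > compound_metadata lookup > library-level rules) |
| 52 | +   c. Write the tag field lines (before `Num Peaks:`) |
| 53 | +4. Do not overwrite existing original fields (Ion_mode, Instrument_type, etc. are kept) |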
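The priority order and the "insert before `Num Peaks:`" rule above can be sketched as follows. This is a minimal illustration, not the actual `fill_tags_msp.py`: the helper names `resolve_tags` and `insert_tags`, and the convention that each source is a dict keyed by lowercase tag name, are assumptions made for the example.

```python
# Tag field names as written into the MSP record (Source reuses the existing Sources field).
TAG_FIELDS = ["Chemical_class", "Application", "Sample", "Confidence",
              "Instrument", "Polarity", "Reg_lists"]


def resolve_tags(existing, from_metadata, from_file_rules):
    """Pick one value per tag; earlier sources win over later ones."""
    resolved = {}
    for field in TAG_FIELDS:
        key = field.lower()
        # Priority: existing MSP fields > compound_metadata lookup > library-level rules.
        for source in (existing, from_metadata, from_file_rules):
            value = source.get(key, "")
            if value:
                resolved[field] = value
                break
    return resolved


def insert_tags(record_lines, tags):
    """Return the record with new tag lines placed just before 'Num Peaks:'."""
    # Field names already present in the record are never overwritten.
    present = {line.split(":", 1)[0].strip().lower()
               for line in record_lines if ":" in line}
    new_lines = [f"{field}: {value}" for field, value in tags.items()
                 if field.lower() not in present]
    out = []
    for line in record_lines:
        if line.lower().startswith("num peaks:"):
            out.extend(new_lines)
        out.append(line)
    return out
```

Keeping resolution and insertion separate makes it easy to log, per file, which layer supplied each tag.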
| 54 | + |
| 55 | +**Library-level tag rules (Layer 1):** |
| 56 | + |
| 57 | +```python |
| 58 | +FILE_TAGS = { |
| 59 | +    # NORMAN: predicted spectra of environmental pollutants |
| 60 | + "norman_negative.msp": {"application": "environmental_monitoring", "confidence": "predicted"}, |
| 61 | + "norman_positive.msp": {"application": "environmental_monitoring", "confidence": "predicted"}, |
| 62 | +    # ISDB: predicted spectra of natural products |
| 63 | + "isdb_positive.msp": {"chemical_class": "natural_product", "confidence": "predicted", "sample": "plant"}, |
| 64 | + "isdb_negative.msp": {"chemical_class": "natural_product", "confidence": "predicted", "sample": "plant"}, |
| 65 | + # HMDB experimental |
| 66 | + "hmdb_experimental_positive.msp": {"confidence": "experimental"}, |
| 67 | + "hmdb_experimental_negative.msp": {"confidence": "experimental"}, |
| 68 | + # HMDB predicted |
| 69 | + "hmdb_predicted_positive.msp": {"confidence": "predicted"}, |
| 70 | + # MSnLib |
| 71 | + "msnlib_mcedrug_positive.msp": {"application": "pharmaceutical", "confidence": "experimental"}, |
| 72 | + "msnlib_mcedrug_negative.msp": {"application": "pharmaceutical", "confidence": "experimental"}, |
| 73 | + "msnlib_mcebio_positive.msp": {"application": "pharmaceutical", "confidence": "experimental"}, |
| 74 | + "msnlib_mcebio_negative.msp": {"application": "pharmaceutical", "confidence": "experimental"}, |
| 75 | + "msnlib_mcescaf_positive.msp": {"application": "pharmaceutical", "confidence": "experimental"}, |
| 76 | + "msnlib_mcescaf_negative.msp": {"application": "pharmaceutical", "confidence": "experimental"}, |
| 77 | + "msnlib_enamdisc_positive.msp": {"application": "pharmaceutical", "confidence": "experimental"}, |
| 78 | + "msnlib_enamdisc_negative.msp": {"application": "pharmaceutical", "confidence": "experimental"}, |
| 79 | + "msnlib_enammol_positive.msp": {"application": "pharmaceutical", "confidence": "experimental"}, |
| 80 | + "msnlib_enammol_negative.msp": {"application": "pharmaceutical", "confidence": "experimental"}, |
| 81 | + "msnlib_nihnp_positive.msp": {"chemical_class": "natural_product", "confidence": "experimental"}, |
| 82 | + "msnlib_nihnp_negative.msp": {"chemical_class": "natural_product", "confidence": "experimental"}, |
| 83 | + "msnlib_otavapep_positive.msp": {"chemical_class": "amino_acid_peptide", "confidence": "experimental"}, |
| 84 | + "msnlib_otavapep_negative.msp": {"chemical_class": "amino_acid_peptide", "confidence": "experimental"}, |
| 85 | + # FooDB |
| 86 | + "foodb_experimental_positive.msp": {"application": "food_safety", "sample": "food", "confidence": "experimental"}, |
| 87 | + "foodb_experimental_negative.msp": {"application": "food_safety", "sample": "food", "confidence": "experimental"}, |
| 88 | + # ReSpect |
| 89 | + "respect_positive.msp": {"sample": "plant", "confidence": "experimental"}, |
| 90 | + # NIST |
| 91 | + "nist_epa_tandem.msp": {"application": "environmental_monitoring", "confidence": "experimental"}, |
| 92 | + "nist_glycan_msms.msp": {"chemical_class": "glycan", "confidence": "experimental"}, |
| 93 | + "nist_dart_positive.msp": {"confidence": "experimental"}, |
| 94 | + # MassBank — all experimental |
| 95 | + # (massbank_*.msp files all get confidence=experimental, instrument from Instrument_type field) |
| 96 | + # MS-DIAL — mixed |
| 97 | + "msdial_all_positive.msp": {"confidence": "mixed"}, |
| 98 | + "msdial_all_negative.msp": {"confidence": "mixed"}, |
| 99 | + # EMBL-MCF |
| 100 | + "embl_mcf_positive.msp": {"confidence": "experimental"}, |
| 101 | + "embl_mcf_negative.msp": {"confidence": "experimental"}, |
| 102 | + # GNPS |
| 103 | + "gnps_library_mixed.msp": {"confidence": "mixed"}, |
| 104 | + "gnps_hmdb_mixed.msp": {"confidence": "experimental"}, |
| 105 | + "gnps_massbank_mixed.msp": {"confidence": "experimental"}, |
| 106 | + "gnps_mona_mixed.msp": {"confidence": "experimental"}, |
| 107 | + "gnps_nih-clinical1_mixed.msp": {"application": "pharmaceutical", "confidence": "experimental"}, |
| 108 | + "gnps_nih-naturalproducts_mixed.msp": {"chemical_class": "natural_product", "confidence": "experimental"}, |
| 109 | +} |
| 110 | +``` |
| 111 | + |
| 112 | +**Instrument field mapping (extracted from existing MSP fields):** |
| 113 | + |
| 114 | +```python |
| 115 | +def extract_instrument(instrument_type_raw): |
| 116 | +    """Map MSP Instrument_type to normalized instrument tag.""" |
| 117 | +    if not instrument_type_raw: return "" |
| 118 | +    it = instrument_type_raw.lower() |
| 119 | +    if "orbitrap" in it or "itft" in it: return "orbitrap" |
| 120 | +    if "qtof" in it or "q-tof" in it: return "qtof" |
| 121 | +    if "qqq" in it or "triple" in it: return "qqq" |
| 122 | +    if "ion trap" in it or "iontrap" in it: return "ion_trap" |
| 123 | +    if "tof" in it and "q" not in it: return "tof" |
| 124 | +    if "ei" in it: return "ei" |
| 125 | +    if "dart" in it: return "dart" |
| 126 | +    return "" |
| 127 | +``` |
| 128 | + |
| 129 | +**Polarity field mapping:** |
| 130 | + |
| 131 | +```python |
| 132 | +def extract_polarity(ion_mode_raw): |
| 133 | +    if not ion_mode_raw: return "" |
| 134 | +    im = ion_mode_raw.strip().upper()  # also accept the short forms "P" / "N" |
| 135 | +    if "POS" in im or im in ("P", "+"): return "positive" |
| 136 | +    if "NEG" in im or im in ("N", "-"): return "negative" |
| 137 | +    return "" |
| 138 | +``` |
| 139 | + |
| 140 | +**Chemical_class lookup from compound_metadata (Layer 2):** |
| 141 | +Match on the first 14 characters of the InChIKey. If compound_metadata has a value and the spectrum does not, fill it in. |
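A minimal sketch of this Layer 2 lookup, under stated assumptions: the column names `inchikey` and `chemical_class` in compound_metadata.csv are guesses for illustration, and the helper names are hypothetical. The first 14 characters of an InChIKey form its connectivity block, so matching on them ignores the stereochemistry and protonation parts of the key.

```python
import csv
import io


def first_block(inchikey):
    """Return the 14-character connectivity block of an InChIKey, or ''."""
    key = (inchikey or "").strip().upper()
    return key[:14] if len(key) >= 14 else ""


def build_class_map(csv_file):
    """Map InChIKey first block -> chemical_class, keeping the first non-empty value."""
    mapping = {}
    for row in csv.DictReader(csv_file):
        # Column names are assumptions; adjust to the real compound_metadata.csv header.
        block = first_block(row.get("inchikey", ""))
        chem_class = (row.get("chemical_class") or "").strip()
        if block and chem_class and block not in mapping:
            mapping[block] = chem_class
    return mapping


def fill_chemical_class(spectrum_fields, class_map):
    """Fill chemical_class only when the spectrum has no value of its own."""
    if spectrum_fields.get("chemical_class"):
        return spectrum_fields
    block = first_block(spectrum_fields.get("inchikey", ""))
    if block in class_map:
        return {**spectrum_fields, "chemical_class": class_map[block]}
    return spectrum_fields


# Usage with an in-memory CSV standing in for compound_metadata.csv.
demo_csv = io.StringIO("inchikey,chemical_class\nRYYVLZVUVIJVGH-UHFFFAOYSA-N,alkaloid\n")
demo_map = build_class_map(demo_csv)
print(fill_chemical_class({"inchikey": "RYYVLZVUVIJVGH-UHFFFAOYSA-N"}, demo_map))
```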
| 142 | + |
| 143 | +- [ ] **Step 2: Run the tag writer** |
| 144 | + |
| 145 | +```bash |
| 146 | +cd ~/spectral_libraries |
| 147 | +PYTHONUNBUFFERED=1 /Users/jiajun-agent/pony/ponylabASMS/.venv312/bin/python scripts/fill_tags_msp.py |
| 148 | +``` |
| 149 | + |
| 150 | +Expected output: tag fill statistics for each MSP file. |
| 151 | + |
| 152 | +- [ ] **Step 3: Verify** |
| 153 | + |
| 154 | +Create `~/spectral_libraries/scripts/verify_msp_tags.py`: |
| 155 | +- Parse all MSP files |
| 156 | +- Compute the coverage of each tag dimension |
| 157 | +- Load the files with matchms to confirm the tags are parsed correctly |
| 158 | +- Output a report |
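The coverage statistic in the checklist above could be computed along these lines. This is a sketch only: it parses MSP text directly with the standard library, assumes records are separated by blank lines, and uses hypothetical helper names; the matchms load check is a separate step and is not shown here.

```python
from collections import Counter

# Lowercase tag names to look for in each record.
TAG_FIELDS = ("chemical_class", "application", "sample", "confidence",
              "instrument", "polarity", "reg_lists")


def split_records(msp_text):
    """Split MSP text into records, assuming blank lines separate them."""
    records, current = [], []
    for line in msp_text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            records.append(current)
            current = []
    if current:
        records.append(current)
    return records


def tag_coverage(msp_text):
    """Return {tag: fraction of records carrying a non-empty value}."""
    records = split_records(msp_text)
    counts = Counter()
    for record in records:
        seen = set()
        for line in record:
            name, sep, value = line.partition(":")
            if sep and value.strip():
                seen.add(name.strip().lower())
        counts.update(tag for tag in TAG_FIELDS if tag in seen)
    total = len(records)
    return {tag: (counts[tag] / total if total else 0.0) for tag in TAG_FIELDS}
```

A field line with an empty value (for example `Confidence:`) is counted as missing, so the report reflects usable tags rather than mere presence of the field name.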
| 159 | + |
| 160 | +```bash |
| 161 | +/Users/jiajun-agent/pony/ponylabASMS/.venv312/bin/python scripts/verify_msp_tags.py |
| 162 | +``` |
| 163 | + |
| 164 | +- [ ] **Step 4: Commit (do not push; the MSP files are not tracked in git)** |
| 165 | + |
| 166 | +--- |
| 167 | + |
| 168 | +## Task 2: Align the MetaboFlow TagFilter Model |
| 169 | + |
| 170 | +**Goal:** Align the field names of MetaboFlow's TagFilter model with the 8-dimension tags of MFSL v2. |
| 171 | + |
| 172 | +**Files:** |
| 173 | +- Modify: `packages/backend/app/models/analysis.py` |
| 174 | +- Modify: `packages/engines/annot-worker/app/matchms_engine.py` |
| 175 | + |
| 176 | +- [ ] **Step 1: Update the TagFilter model** |
| 177 | + |
| 178 | +In `packages/backend/app/models/analysis.py`, replace: |
| 179 | + |
| 180 | +```python |
| 181 | +class TagFilter(BaseModel): |
| 182 | +    """Multi-label filter for selecting spectral libraries.""" |
| 183 | +    instrument: list[str] = Field(default_factory=list) |
| 184 | +    organism: list[str] = Field(default_factory=list) |
| 185 | +    compound_class: list[str] = Field(default_factory=list) |
| 186 | +    confidence: list[str] = Field(default_factory=lambda: ["high", "medium", "low"]) |
| 187 | +``` |
| 188 | + |
| 189 | +with: |
| 190 | + |
| 191 | +```python |
| 192 | +class TagFilter(BaseModel): |
| 193 | +    """8-dimension tag filter for spectral and compound databases.""" |
| 194 | +    chemical_class: list[str] = Field(default_factory=list) |
| 195 | +    application: list[str] = Field(default_factory=list) |
| 196 | +    sample: list[str] = Field(default_factory=list) |
| 197 | +    confidence: list[str] = Field(default_factory=list) |
| 198 | +    instrument: list[str] = Field(default_factory=list) |
| 199 | +    polarity: list[str] = Field(default_factory=list) |
| 200 | +    reg_lists: list[str] = Field(default_factory=list) |
| 201 | +    # The source dimension is not part of TagFilter; it is controlled by the databases parameter |
| 202 | +``` |
| 203 | + |
| 204 | +- [ ] **Step 2: Update the annot-worker filtering logic** |
| 205 | + |
| 206 | +In `packages/engines/annot-worker/app/matchms_engine.py`, update the spectrum filtering logic to use the new tag field names: |
| 207 | + |
| 208 | +```python |
| 209 | +def _filter_by_tags(spectra, tag_filter): |
| 210 | +    """Filter spectra by 8-dimension tags from MSP metadata.""" |
| 211 | +    filtered = spectra |
| 212 | +    for dim in ["chemical_class", "application", "sample", "confidence", "instrument", "polarity", "reg_lists"]: |
| 213 | +        values = getattr(tag_filter, dim, []) |
| 214 | +        if values: |
| 215 | +            filtered = [s for s in filtered if s.metadata.get(dim, "") in values] |
| 216 | +    return filtered |
| 217 | +``` |
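One case the exact-match test does not cover is a tag that holds several values in one field. The sketch below shows one way to handle that, assuming (this is not stated in the plan) that a field such as Reg_lists may contain several entries joined by `;`. It works on plain metadata dicts so it runs without matchms; `tag_matches` and `filter_by_tags` are hypothetical names.

```python
def tag_matches(raw_value, wanted, sep=";"):
    """True if any separator-delimited part of raw_value is in wanted (case-insensitive)."""
    present = {part.strip().lower()
               for part in str(raw_value or "").split(sep) if part.strip()}
    return bool(present & {w.strip().lower() for w in wanted})


def filter_by_tags(metadata_rows, wanted_by_dim):
    """Keep rows that match every dimension with a non-empty wanted list."""
    kept = list(metadata_rows)
    for dim, wanted in wanted_by_dim.items():
        if wanted:  # an empty list means "no constraint on this dimension"
            kept = [row for row in kept if tag_matches(row.get(dim), wanted)]
    return kept
```

If every tag is guaranteed to be single-valued, the simpler membership test is sufficient and this variant is unnecessary.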
| 218 | + |
| 219 | +- [ ] **Step 3: Update the AnnotationHit model** |
| 220 | + |
| 221 | +Add the following fields to `AnnotationHit` (if it exists) or to the result model: |
| 222 | + |
| 223 | +```python |
| 224 | +smiles: str | None = None |
| 225 | +chemical_class: str | None = None |
| 226 | +application: str | None = None |
| 227 | +``` |
| 228 | + |
| 229 | +- [ ] **Step 4: Rebuild + test** |
| 230 | + |
| 231 | +```bash |
| 232 | +cd ~/pony/MetaboFlow |
| 233 | +docker compose build backend celery-worker annot-worker |
| 234 | +``` |
| 235 | + |
| 236 | +- [ ] **Step 5: Commit** |
| 237 | + |
| 238 | +```bash |
| 239 | +git add packages/backend/app/models/analysis.py packages/engines/annot-worker/ |
| 240 | +git commit -m "feat: align TagFilter with the MFSL v2 8-dimension tag scheme |
| 241 | +
|
| 242 | +- compound_class → chemical_class |
| 243 | +- organism → sample |
| 244 | +- add application, polarity, reg_lists |
| 245 | +- confidence value set changes from high/medium/low to experimental/predicted/mixed |
| 246 | +- update annot-worker filtering logic |
| 247 | +
|
| 248 | +Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>" |
| 249 | +``` |
| 250 | + |
| 251 | +--- |
| 252 | + |
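| 253 | +## Task 3: Restructure DATABASE_MANUAL |
| 254 | + |
| 255 | +**Goal:** Restructure DATABASE_MANUAL.md into a complete architecture manual covering the v2 architecture design, the source inventory (formerly the contents of registry.csv), and every problem and design choice encountered during the build. |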
| 256 | + |
| 257 | +**Files:** |
| 258 | +- Rewrite: `~/spectral_libraries/DATABASE_MANUAL.md` |
| 259 | + |
| 260 | +- [ ] **Step 1: Restructure the document** |
| 261 | + |
| 262 | +New DATABASE_MANUAL.md structure: |
| 263 | + |
| 264 | +```markdown |
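| 265 | +# MFSL Technical Manual v3.0 |
| 266 | + |
| 267 | +## 1. Architecture Overview |
| 268 | +   - 1.1 Dual-library structure (spectral library + compound library) |
| 269 | +   - 1.2 8-dimension tag scheme |
| 270 | +   - 1.3 Metadata enrichment mechanism |
| 271 | +   - 1.4 Annotation matching workflow |
| 272 | + |
| 273 | +## 2. Spectral Library |
| 274 | +   - 2.1 MSP file format (including tag field descriptions) |
| 275 | +   - 2.2 Source inventory (all 50 files formerly in registry.csv, with each file's source, spectrum count, and default tags) |
| 276 | +   - 2.3 Deduplication mechanism |
| 277 | +   - 2.4 Quality scoring scheme |
| 278 | + |
| 279 | +## 3. Compound Library |
| 280 | +   - 3.1 compound_metadata.csv field definitions |
| 281 | +   - 3.2 Data sources (11 compound databases) |
| 282 | +   - 3.3 Coverage statistics |
| 283 | + |
| 284 | +## 4. Tag Filling Method |
| 285 | +   - 4.1 Four-layer filling strategy |
| 286 | +   - 4.2 Tag fields available from each source |
| 287 | +   - 4.3 Automatic classification with ClassyFire/NPClassifier |
| 288 | + |
| 289 | +## 5. Build Script Inventory |
| 290 | + |
| 291 | +## 6. Cross-Product Use (MetaboFlow / PonylabASMS) |
| 292 | + |
| 293 | +## 7. Design Evolution Log |
| 294 | +   - 7.1 Timeline of architecture changes, v1.0 → v2.0 |
| 295 | +   - 7.2 Problems encountered and decisions made |
| 296 | +     - compound_class/organism dimension confusion |
| 297 | +     - 4 incorrect tags |
| 298 | +     - level3_compounds missing SMILES |
| 299 | +     - 201K compounds without an InChIKey |
| 300 | +     - Backfilling missing InChIKeys in MSP |
| 301 | +     - InChIKey matching: 14 vs 27 characters |
| 302 | +     - 35% overlap between spectral_metadata and compound_metadata |
| 303 | +     - Tags embedded in MSP vs a separate CSV |
| 304 | +     - Deprecation of registry.csv |
| 305 | +   - 7.3 Data cleaning log |
| 306 | + |
| 307 | +## 8. Version History |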
| 308 | +``` |
| 309 | + |
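| 310 | +- [ ] **Step 2: Write the registry.csv contents into the §2.2 source inventory** |
| 311 | + |
| 312 | +Consolidate the information for the 50 MSP files (source URL, original format, spectrum count, processing pipeline, default tags) from registry.csv and §4 of the existing DATABASE_MANUAL. |
| 313 | + |
| 314 | +- [ ] **Step 3: Write the §7 design evolution log** |
| 315 | + |
| 316 | +Record every problem encountered in this session, with its root cause and resolution (see spec §8, the design decision log, plus this session's discussion notes). |
| 317 | + |
| 318 | +- [ ] **Step 4: Verify document completeness** |
| 319 | + |
| 320 | +Make sure all 50 MSP files appear in the source inventory and every design decision is recorded. |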
| 321 | + |
| 322 | +--- |
| 323 | + |
| 324 | +## Task 4: Clean Up Deprecated Files + Final Verification |
| 325 | + |
| 326 | +**Goal:** Delete the deprecated files and run a full verification. |
| 327 | + |
| 328 | +**Files:** |
| 329 | +- Delete: `~/spectral_libraries/registry.csv` |
| 330 | +- Delete: `~/spectral_libraries/registry.db` |
| 331 | +- Delete: `~/spectral_libraries/spectral_metadata.csv` |
| 332 | + |
| 333 | +- [ ] **Step 1: Back up and delete** |
| 334 | + |
| 335 | +```bash |
| 336 | +cd ~/spectral_libraries |
| 337 | +mkdir -p _archived_v1 |
| 338 | +mv registry.csv _archived_v1/ |
| 339 | +mv registry.db _archived_v1/ 2>/dev/null |
| 340 | +mv spectral_metadata.csv _archived_v1/ 2>/dev/null |
| 341 | +``` |
| 342 | + |
| 343 | +- [ ] **Step 2: Full verification** |
| 344 | + |
| 345 | +```python |
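| 346 | +# Checks performed by the verification script: |
| 347 | +# 1. Every MSP file has the tag fields |
| 348 | +# 2. matchms loads the files correctly and can filter by tag |
| 349 | +# 3. compound_metadata.csv covers the InChIKeys (first 14 characters) of all MSP files |
| 350 | +# 4. registry.csv is no longer referenced by any code |
| 351 | +# 5. compound_metadata fields are aligned with MetaboFlow AnnotationHit |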
| 352 | +``` |
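Check 4 in the list above (no remaining references to registry.csv) lends itself to a small scan. The following is a sketch under assumptions: the function name `find_references` and the set of file suffixes to scan are illustrative choices, not part of the plan.

```python
from pathlib import Path


def find_references(root, needle="registry.csv", suffixes=(".py", ".R", ".md")):
    """Return (path, line number) pairs for every line under root that mentions needle."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="ignore")
            for lineno, line in enumerate(text.splitlines(), start=1):
                if needle in line:
                    hits.append((str(path), lineno))
    return hits
```

An empty result means the check passes. Documentation that intentionally mentions the retired file, such as the design evolution log, would need to be excluded or reviewed by hand.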
| 353 | + |
| 354 | +- [ ] **Step 3: Commit the MetaboFlow code changes + push** |
| 355 | + |
| 356 | +```bash |
| 357 | +cd ~/pony/MetaboFlow |
| 358 | +git add packages/backend/ packages/engines/annot-worker/ docs/ |
| 359 | +git commit -m "feat: land MFSL v2 architecture (8-dimension tags, TagFilter alignment, DATABASE_MANUAL restructure) |
| 360 | +
|
| 361 | +Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>" |
| 362 | +git push |
| 363 | +``` |
| 364 | + |
| 365 | +--- |
| 366 | + |
| 367 | +## Execution Order and Dependencies |
| 368 | + |
| 369 | +``` |
| 370 | +Task 1 (MSP tag writing) ──────── independent, largest workload |
| 371 | +Task 2 (TagFilter alignment) ──── independent |
| 372 | +Task 3 (DATABASE_MANUAL) ──────── independent |
| 373 | +Task 4 (cleanup + verification) ─ depends on Tasks 1-3 all being complete |
| 374 | +``` |
| 375 | + |
| 376 | +Tasks 1, 2, and 3 can run in parallel. Task 4 comes last. |