Commit f0ce5c6

jiajunagent and claude committed
docs: MFSL v2 implementation plan (4 tasks: MSP tag writing + TagFilter alignment + MANUAL restructuring + cleanup and verification)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent bef1ab4 commit f0ce5c6

1 file changed

Lines changed: 376 additions & 0 deletions

@@ -0,0 +1,376 @@
# MFSL Database Architecture v2 Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Migrate MFSL from the old architecture (registry.csv + level3_compounds + untagged MSP) to the v2 architecture (8-dimension tags embedded in MSP + compound_metadata + a restructured DATABASE_MANUAL).

**Architecture:** The MSP files carry the 8-dimension tags as separate field lines (natively compatible with matchms). compound_metadata.csv serves as the Level 3 data source and as the source for metadata enrichment. The MetaboFlow TagFilter model is aligned with the new tag field names. DATABASE_MANUAL is restructured into a complete architecture manual.

**Tech Stack:** Python 3.12 + RDKit + matchms, R (annotation_ms1.R), FastAPI (Pydantic models)

**Spec:** `docs/superpowers/specs/2026-03-24-mfsl-architecture-v2-design.md`

---

## File Structure

### Create
- `~/spectral_libraries/scripts/fill_tags_msp.py`: writes the 8-dimension tags into the MSP files
- `~/spectral_libraries/scripts/verify_msp_tags.py`: verifies MSP tag completeness

### Modify
- `~/spectral_libraries/deduplicated/*.msp`: tags written into the 50 MSP files
- `~/spectral_libraries/DATABASE_MANUAL.md`: complete restructuring
- `packages/backend/app/models/analysis.py`: TagFilter model alignment
- `packages/engines/annot-worker/app/matchms_engine.py`: tag filtering logic

### Delete
- `~/spectral_libraries/registry.csv`: content merged into DATABASE_MANUAL
- `~/spectral_libraries/registry.db`: the SQLite version, deleted along with it
- `~/spectral_libraries/spectral_metadata.csv`: deprecated

---

## Task 1: Write the 8-Dimension Tags into the MSP Files

**Goal:** Write Chemical_class / Application / Sample / Confidence / Instrument / Polarity / Reg_lists tag field lines for 600,000 spectra. The Source dimension reuses the existing Sources field.

**Files:**
- Create: `~/spectral_libraries/scripts/fill_tags_msp.py`
- Modify: `~/spectral_libraries/deduplicated/*.msp` (50 files)

- [ ] **Step 1: Create the tag-writing script**

`~/spectral_libraries/scripts/fill_tags_msp.py`

Core logic:
1. Load compound_metadata.csv and build an InChIKey → tag mapping (keyed on the first 14 characters of the InChIKey)
2. Define the deterministic library-level tag rules (Layer 1)
3. Process the MSP files one by one:
   a. Parse each spectrum
   b. Determine the tag values (priority: existing MSP field > compound_metadata lookup > library-level rule)
   c. Write the tag field lines (before `Num Peaks:`)
4. Do not overwrite existing original fields (Ion_mode, Instrument_type, etc. are kept)

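The priority order in step 3b and the write rules in steps 3c and 4 can be sketched as below. This is a minimal illustration under stated assumptions, not the actual `fill_tags_msp.py`: the function names `resolve_tag` and `insert_tag_lines`, the dict-based argument shapes, and the example value `lipid` are all hypothetical.

```python
def resolve_tag(dim, spectrum_fields, compound_tags, file_tags):
    """Return the tag value for one dimension.

    Priority: existing MSP field > compound_metadata lookup > library-level rule.
    All three sources are plain dicts keyed by lowercase dimension name
    (a layout assumed for this sketch).
    """
    for source in (spectrum_fields, compound_tags, file_tags):
        value = (source.get(dim) or "").strip()
        if value:
            return value
    return ""


def insert_tag_lines(record_lines, tags):
    """Insert 'Key: value' lines right before 'Num Peaks:'.

    Keys that already exist in the record are skipped, so original
    fields such as Ion_mode are never overwritten.
    """
    existing = {
        line.split(":", 1)[0].strip().lower()
        for line in record_lines
        if ":" in line
    }
    new_lines = [
        f"{key}: {value}"
        for key, value in tags.items()
        if value and key.lower() not in existing
    ]
    out = []
    for line in record_lines:
        if line.lower().startswith("num peaks:"):
            out.extend(new_lines)
        out.append(line)
    return out


# The spectrum has no chemical_class, so the compound_metadata value
# wins over the file-level default.
spectrum = {"confidence": "experimental"}
compound = {"chemical_class": "lipid"}
file_default = {"chemical_class": "natural_product", "confidence": "predicted"}
print(resolve_tag("chemical_class", spectrum, compound, file_default))  # lipid
print(resolve_tag("confidence", spectrum, compound, file_default))      # experimental
```

Keeping resolution and writing as separate functions makes the "never overwrite" rule testable on its own, independent of where the tag values come from.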
**Library-level tag rules (Layer 1):**

```python
FILE_TAGS = {
    # NORMAN: predicted spectra of environmental pollutants
    "norman_negative.msp": {"application": "environmental_monitoring", "confidence": "predicted"},
    "norman_positive.msp": {"application": "environmental_monitoring", "confidence": "predicted"},
    # ISDB: predicted spectra of natural products
    "isdb_positive.msp": {"chemical_class": "natural_product", "confidence": "predicted", "sample": "plant"},
    "isdb_negative.msp": {"chemical_class": "natural_product", "confidence": "predicted", "sample": "plant"},
    # HMDB experimental
    "hmdb_experimental_positive.msp": {"confidence": "experimental"},
    "hmdb_experimental_negative.msp": {"confidence": "experimental"},
    # HMDB predicted
    "hmdb_predicted_positive.msp": {"confidence": "predicted"},
    # MSnLib
    "msnlib_mcedrug_positive.msp": {"application": "pharmaceutical", "confidence": "experimental"},
    "msnlib_mcedrug_negative.msp": {"application": "pharmaceutical", "confidence": "experimental"},
    "msnlib_mcebio_positive.msp": {"application": "pharmaceutical", "confidence": "experimental"},
    "msnlib_mcebio_negative.msp": {"application": "pharmaceutical", "confidence": "experimental"},
    "msnlib_mcescaf_positive.msp": {"application": "pharmaceutical", "confidence": "experimental"},
    "msnlib_mcescaf_negative.msp": {"application": "pharmaceutical", "confidence": "experimental"},
    "msnlib_enamdisc_positive.msp": {"application": "pharmaceutical", "confidence": "experimental"},
    "msnlib_enamdisc_negative.msp": {"application": "pharmaceutical", "confidence": "experimental"},
    "msnlib_enammol_positive.msp": {"application": "pharmaceutical", "confidence": "experimental"},
    "msnlib_enammol_negative.msp": {"application": "pharmaceutical", "confidence": "experimental"},
    "msnlib_nihnp_positive.msp": {"chemical_class": "natural_product", "confidence": "experimental"},
    "msnlib_nihnp_negative.msp": {"chemical_class": "natural_product", "confidence": "experimental"},
    "msnlib_otavapep_positive.msp": {"chemical_class": "amino_acid_peptide", "confidence": "experimental"},
    "msnlib_otavapep_negative.msp": {"chemical_class": "amino_acid_peptide", "confidence": "experimental"},
    # FooDB
    "foodb_experimental_positive.msp": {"application": "food_safety", "sample": "food", "confidence": "experimental"},
    "foodb_experimental_negative.msp": {"application": "food_safety", "sample": "food", "confidence": "experimental"},
    # ReSpect
    "respect_positive.msp": {"sample": "plant", "confidence": "experimental"},
    # NIST
    "nist_epa_tandem.msp": {"application": "environmental_monitoring", "confidence": "experimental"},
    "nist_glycan_msms.msp": {"chemical_class": "glycan", "confidence": "experimental"},
    "nist_dart_positive.msp": {"confidence": "experimental"},
    # MassBank: all experimental
    # (massbank_*.msp files all get confidence=experimental, instrument from Instrument_type field)
    # MS-DIAL: mixed
    "msdial_all_positive.msp": {"confidence": "mixed"},
    "msdial_all_negative.msp": {"confidence": "mixed"},
    # EMBL-MCF
    "embl_mcf_positive.msp": {"confidence": "experimental"},
    "embl_mcf_negative.msp": {"confidence": "experimental"},
    # GNPS
    "gnps_library_mixed.msp": {"confidence": "mixed"},
    "gnps_hmdb_mixed.msp": {"confidence": "experimental"},
    "gnps_massbank_mixed.msp": {"confidence": "experimental"},
    "gnps_mona_mixed.msp": {"confidence": "experimental"},
    "gnps_nih-clinical1_mixed.msp": {"application": "pharmaceutical", "confidence": "experimental"},
    "gnps_nih-naturalproducts_mixed.msp": {"chemical_class": "natural_product", "confidence": "experimental"},
}
```

112+
**Instrument 字段映射(从 MSP 已有字段提取):**
113+
114+
```python
115+
def extract_instrument(instrument_type_raw):
116+
"""Map MSP Instrument_type to normalized instrument tag."""
117+
if not instrument_type_raw: return ""
118+
it = instrument_type_raw.lower()
119+
if "orbitrap" in it or "itft" in it: return "orbitrap"
120+
if "qtof" in it or "q-tof" in it: return "qtof"
121+
if "qqq" in it or "triple" in it: return "qqq"
122+
if "ion trap" in it or "iontrap" in it: return "ion_trap"
123+
if "tof" in it and "q" not in it: return "tof"
124+
if "ei" in it: return "ei"
125+
if "dart" in it: return "dart"
126+
return ""
127+
```
128+
**Polarity field mapping:**

```python
def extract_polarity(ion_mode_raw: str | None) -> str:
    """Map MSP Ion_mode to a normalized polarity tag."""
    if not ion_mode_raw:
        return ""
    im = ion_mode_raw.upper()
    if "POS" in im:
        return "positive"
    if "NEG" in im:
        return "negative"
    return ""
```

**Chemical_class lookup from compound_metadata (Layer 2):**
Match on the first 14 characters of the InChIKey. If compound_metadata has a value and the spectrum does not, fill it in.

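The Layer 2 lookup can be sketched as follows. The plan does not list the columns of compound_metadata.csv, so the column names `inchikey` and `chemical_class`, the helper names, and the tag value `alkaloid` are assumptions made for illustration; an in-memory CSV stands in for the real file.

```python
import csv
import io


def inchikey_block(inchikey):
    """Return the first 14 characters of an InChIKey, uppercased."""
    return (inchikey or "").strip().upper()[:14]


def build_class_map(csv_file):
    """Map InChIKey first block -> chemical_class from a compound_metadata-like CSV."""
    mapping = {}
    for row in csv.DictReader(csv_file):
        key = inchikey_block(row.get("inchikey"))
        value = (row.get("chemical_class") or "").strip()
        if len(key) == 14 and value:
            mapping.setdefault(key, value)  # first row wins on duplicates
    return mapping


def fill_chemical_class(spectrum_fields, class_map):
    """Fill chemical_class only when the spectrum does not already have one."""
    if spectrum_fields.get("chemical_class"):
        return spectrum_fields
    looked_up = class_map.get(inchikey_block(spectrum_fields.get("inchikey")))
    if looked_up:
        return {**spectrum_fields, "chemical_class": looked_up}
    return spectrum_fields


# Tiny in-memory stand-in for compound_metadata.csv (column names are assumed).
sample_csv = io.StringIO(
    "inchikey,chemical_class\n"
    "RYYVLZVUVIJVGH-UHFFFAOYSA-N,alkaloid\n"
)
class_map = build_class_map(sample_csv)

spectrum = {"name": "caffeine", "inchikey": "RYYVLZVUVIJVGH-UHFFFAOYSA-N"}
print(fill_chemical_class(spectrum, class_map)["chemical_class"])  # alkaloid
```

Because only the first 14 characters are compared, entries that share a first block but differ in the remaining blocks of the key resolve to the same metadata row.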
- [ ] **Step 2: Run the tag writing**

```bash
cd ~/spectral_libraries
PYTHONUNBUFFERED=1 /Users/jiajun-agent/pony/ponylabASMS/.venv312/bin/python scripts/fill_tags_msp.py
```

Expected output: tag-filling statistics for each MSP file.

- [ ] **Step 3: Verify**

Create `~/spectral_libraries/scripts/verify_msp_tags.py`, which:
- Parses all MSP files
- Computes the coverage of each tag dimension
- Loads the files with matchms to confirm the tags are parsed correctly
- Prints a report

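The coverage statistic from the checklist above could be computed along these lines. This is a sketch, not the actual `verify_msp_tags.py`: it assumes MSP records are separated by blank lines and that header fields use `Key: value` syntax, it uses a plain-Python parser instead of matchms, and the names `iter_msp_records` and `tag_coverage` are hypothetical.

```python
from collections import Counter

TAG_FIELDS = ["chemical_class", "application", "sample", "confidence",
              "instrument", "polarity", "reg_lists"]


def iter_msp_records(text):
    """Yield one dict of lowercase header fields per blank-line-separated record."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if record:
                yield record
                record = {}
            continue
        if ":" in line:  # peak lines have no colon and are skipped
            key, value = line.split(":", 1)
            record[key.strip().lower()] = value.strip()
    if record:
        yield record


def tag_coverage(text):
    """Return {tag field: fraction of records with a non-empty value}."""
    total = 0
    filled = Counter()
    for record in iter_msp_records(text):
        total += 1
        for field in TAG_FIELDS:
            if record.get(field):
                filled[field] += 1
    return {field: (filled[field] / total if total else 0.0) for field in TAG_FIELDS}


sample = """Name: A
Confidence: experimental
Polarity: positive
Num Peaks: 1
100.0 999

Name: B
Confidence: predicted
Num Peaks: 1
50.0 10
"""
coverage = tag_coverage(sample)
print(coverage["confidence"], coverage["polarity"])  # 1.0 0.5
```

The matchms loading check from the list is not covered here; it would need the real library files and a matchms installation.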
```bash
/Users/jiajun-agent/pony/ponylabASMS/.venv312/bin/python scripts/verify_msp_tags.py
```

- [ ] **Step 4: Commit (do not push; the MSP files are not in git)**

---

## Task 2: MetaboFlow TagFilter Model Alignment

**Goal:** Align the field names of MetaboFlow's TagFilter model with the 8-dimension tags of MFSL v2.

**Files:**
- Modify: `packages/backend/app/models/analysis.py`
- Modify: `packages/engines/annot-worker/app/matchms_engine.py`

- [ ] **Step 1: Update the TagFilter model**

In `packages/backend/app/models/analysis.py`, change:

```python
class TagFilter(BaseModel):
    """Multi-label filter for selecting spectral libraries."""
    instrument: list[str] = Field(default_factory=list)
    organism: list[str] = Field(default_factory=list)
    compound_class: list[str] = Field(default_factory=list)
    confidence: list[str] = Field(default_factory=lambda: ["high", "medium", "low"])
```

to:

```python
class TagFilter(BaseModel):
    """8-dimension tag filter for spectral and compound databases."""
    chemical_class: list[str] = Field(default_factory=list)
    application: list[str] = Field(default_factory=list)
    sample: list[str] = Field(default_factory=list)
    confidence: list[str] = Field(default_factory=list)
    instrument: list[str] = Field(default_factory=list)
    polarity: list[str] = Field(default_factory=list)
    reg_lists: list[str] = Field(default_factory=list)
    # The source dimension is not part of TagFilter; it is controlled by the databases parameter
```

- [ ] **Step 2: Update the annot-worker filtering logic**

In `packages/engines/annot-worker/app/matchms_engine.py`, update the spectrum filtering logic to use the new tag field names:

```python
def _filter_by_tags(spectra, tag_filter):
    """Filter spectra by 8-dimension tags from MSP metadata."""
    filtered = spectra
    for dim in ["chemical_class", "application", "sample", "confidence", "instrument", "polarity", "reg_lists"]:
        values = getattr(tag_filter, dim, [])
        if values:
            filtered = [s for s in filtered if s.metadata.get(dim, "") in values]
    return filtered
```

- [ ] **Step 3: Update the AnnotationHit model**

Add the following fields to `AnnotationHit` (if it exists) or to the result model:

```python
smiles: str | None = None
chemical_class: str | None = None
application: str | None = None
```

- [ ] **Step 4: Rebuild + test**

```bash
cd ~/pony/MetaboFlow
docker compose build backend celery-worker annot-worker
```

- [ ] **Step 5: Commit**

```bash
git add packages/backend/app/models/analysis.py packages/engines/annot-worker/
git commit -m "feat: align TagFilter with the MFSL v2 8-dimension tag system

- compound_class → chemical_class
- organism → sample
- add application, polarity, reg_lists
- confidence values change from high/medium/low to experimental/predicted/mixed
- update annot-worker filtering logic

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>"
```

---

## Task 3: DATABASE_MANUAL Restructuring

**Goal:** Restructure DATABASE_MANUAL.md into a complete architecture manual that covers the v2 architecture design, the source inventory (the former registry.csv content), and all problems and design choices encountered during the build.

**Files:**
- Rewrite: `~/spectral_libraries/DATABASE_MANUAL.md`

- [ ] **Step 1: Restructure the document**

New DATABASE_MANUAL.md structure:

```markdown
# MFSL Technical Manual v3.0

## 1. Architecture Overview
- 1.1 Dual-library structure (spectral library + compound library)
- 1.2 8-dimension tag system
- 1.3 Metadata enrichment mechanism
- 1.4 Annotation matching workflow

## 2. Spectral Library
- 2.1 MSP file format (including tag field descriptions)
- 2.2 Source inventory (all 50 files from the former registry.csv, with each file's source, spectrum count, and default tags)
- 2.3 Deduplication mechanism
- 2.4 Quality scoring system

## 3. Compound Library
- 3.1 compound_metadata.csv field definitions
- 3.2 Data sources (11 compound databases)
- 3.3 Coverage statistics

## 4. Tag Filling Method
- 4.1 Four-layer filling strategy
- 4.2 Tag fields available from each source
- 4.3 Automatic classification with ClassyFire/NPClassifier

## 5. Build Script Inventory

## 6. Cross-Product Use (MetaboFlow / PonylabASMS)

## 7. Design Evolution Record
- 7.1 Timeline of architecture changes, v1.0 → v2.0
- 7.2 Problems encountered and decisions made
  - Confusion between the compound_class and organism dimensions
  - 4 incorrect tags
  - level3_compounds missing SMILES
  - 201K compounds without an InChIKey
  - Filling in missing InChIKeys in MSP
  - InChIKey matching on 14 vs 27 characters
  - 35% overlap between spectral_metadata and compound_metadata
  - Choosing between tags embedded in MSP and a separate CSV
  - Deprecation of registry.csv
- 7.3 Data cleaning record

## 8. Version History
```

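- [ ] **Step 2: Write the registry.csv content into the §2.2 source inventory**

Consolidate the information on the 50 MSP files (source URL, original format, spectrum count, processing pipeline, default tags) from registry.csv and §4 of the existing DATABASE_MANUAL.

- [ ] **Step 3: Write the §7 design evolution record**

Record every problem encountered in this session, with its root cause and solution (see the design decision record in spec §8 and this session's discussion notes).

- [ ] **Step 4: Verify document completeness**

Make sure all 50 MSP files appear in the source inventory and every design decision is recorded.

---
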
## Task 4: Clean Up Deprecated Files + Final Verification

**Goal:** Remove the deprecated files and run a full verification.

**Files:**
- Delete: `~/spectral_libraries/registry.csv`
- Delete: `~/spectral_libraries/registry.db`
- Delete: `~/spectral_libraries/spectral_metadata.csv`

- [ ] **Step 1: Back up and delete**

```bash
cd ~/spectral_libraries
# Move the files into an archive directory rather than deleting them outright
mkdir -p _archived_v1
mv registry.csv _archived_v1/
mv registry.db _archived_v1/ 2>/dev/null
mv spectral_metadata.csv _archived_v1/ 2>/dev/null
```

- [ ] **Step 2: Full verification**

```python
# Items checked by the verification script:
# 1. Every MSP file has the tag fields
# 2. matchms can load the files and filter by tag correctly
# 3. compound_metadata.csv covers the InChIKeys (first 14 characters) of all MSP files
# 4. registry.csv is no longer referenced by any code
# 5. compound_metadata fields are aligned with the MetaboFlow AnnotationHit model
```

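Check item 4 (no remaining references to registry.csv) lends itself to a simple scan. The sketch below is one possible approach, not the plan's verification script: `find_references` and `CODE_SUFFIXES` are hypothetical names, and limiting the scan to `.py` and `.R` files is an assumption, chosen because the restructured DATABASE_MANUAL.md is expected to mention registry.csv by design.

```python
import tempfile
from pathlib import Path

CODE_SUFFIXES = {".py", ".R"}  # assumption: only Python and R sources are scanned


def find_references(root, needle="registry.csv"):
    """Return (relative_path, line_number) for every code line that mentions `needle`."""
    root = Path(root)
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in CODE_SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((path.relative_to(root).as_posix(), lineno))
    return hits


# Demo on a throwaway directory with made-up files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "loader.py").write_text('df = read("registry.csv")\n', encoding="utf-8")
    Path(tmp, "clean.py").write_text("x = 1\n", encoding="utf-8")
    Path(tmp, "notes.md").write_text("registry.csv was archived\n", encoding="utf-8")
    print(find_references(tmp))  # [('loader.py', 1)]
```

An empty result for the real repositories would satisfy check item 4; any hit points to code that still needs migrating before the archive step is final.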
- [ ] **Step 3: Commit the MetaboFlow code changes + push**

```bash
cd ~/pony/MetaboFlow
git add packages/backend/ packages/engines/annot-worker/ docs/
git commit -m "feat: MFSL v2 architecture rollout (8-dimension tags + TagFilter alignment + DATABASE_MANUAL restructuring)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>"
git push
```

---

## Execution Order and Dependencies

```
Task 1 (MSP tag writing) ──── independent, largest workload
Task 2 (TagFilter alignment) ──── independent
Task 3 (DATABASE_MANUAL) ── independent
Task 4 (cleanup + verification) ──── depends on Tasks 1-3 all being complete
```

Tasks 1, 2, and 3 can run in parallel. Task 4 comes last.
