Skip to content

Commit 43ccb4a

Browse files
authored
Merge pull request #66 from bigbio/dev
bug fixing for quantms-utils and .d conversion to parquet information.
2 parents 994602a + fab1912 commit 43ccb4a

30 files changed

Lines changed: 380 additions & 39427 deletions

.DS_Store

8 KB
Binary file not shown.

.github/workflows/python-package.yml

Lines changed: 4 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -16,12 +16,12 @@ jobs:
1616
strategy:
1717
fail-fast: false
1818
matrix:
19-
python-version: ["3.9", "3.10"]
19+
python-version: ["3.10", "3.11"]
2020

2121
steps:
2222
- uses: actions/checkout@v4
2323
- name: Set up Python ${{ matrix.python-version }}
24-
uses: actions/setup-python@v3
24+
uses: actions/setup-python@v5
2525
with:
2626
python-version: ${{ matrix.python-version }}
2727
- name: Install dependencies
@@ -51,8 +51,7 @@ jobs:
5151
- name: Test package online TMT
5252
run: |
5353
quantmsutilsc mzmlstats --ms_path TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_60min_01.mzML --ms2_file --feature_detection
54-
- name: Text package online DIA
54+
- name: Test package online DIA
5555
run: |
5656
quantmsutilsc mzmlstats --ms_path RD139_Narrow_UPS1_0_1fmol_inj1.mzML --ms2_file --feature_detection
57-
58-
57+

.gitignore

Lines changed: 4 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -160,12 +160,11 @@ cython_debug/
160160
.idea/
161161
*.csv
162162
*_df.csv
163-
*.tsv
164163
/tests/test_data/hMICAL1_coiPAnP-N2-200_3Murea-1Mthiourea-200mMtcep_14733.d/
165-
/tests/test_data/diann2mztab/RD139_Narrow_UPS1_0_1fmol_inj1.mzML
166-
/tests/test_data/diann2mztab/RD139_Narrow_UPS1_0_1fmol_inj2.mzML
167-
/tests/test_data/diann2mztab/RD139_Narrow_UPS1_0_25fmol_inj1.mzML
168-
/tests/test_data/diann2mztab/RD139_Narrow_UPS1_0_25fmol_inj2.mzML
164+
/tests/test_data/diann2msstats/RD139_Narrow_UPS1_0_1fmol_inj1.mzML
165+
/tests/test_data/diann2msstats/RD139_Narrow_UPS1_0_1fmol_inj2.mzML
166+
/tests/test_data/diann2msstats/RD139_Narrow_UPS1_0_25fmol_inj1.mzML
167+
/tests/test_data/diann2msstats/RD139_Narrow_UPS1_0_25fmol_inj2.mzML
169168

170169
.qodo
171170
/tests/test_data/RD139_Narrow_UPS1_0_1fmol_inj1.mzML

.markdownlint.json

Lines changed: 0 additions & 5 deletions
This file was deleted.

README.md

Lines changed: 6 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -19,8 +19,8 @@ The following functionalities are available in the package:
1919
### Diann scripts
2020

2121
- `dianncfg` - Create a configuration file for Diann including enzymes, modifications, and other parameters.
22-
- `diann2mztab` - Convert Diann output to mzTab format. In addition, convert DIA-NN output to MSstats, Triqler or mzTab.
23-
The output formats are used for quality control and downstream analysis in quantms.
22+
- `diann2msstats` - Convert DIA-NN output to MSstats format.
23+
The output is used for quality control and downstream analysis in quantms.
2424

2525
### SDRF scripts
2626

@@ -29,8 +29,8 @@ The following functionalities are available in the package:
2929

3030
### Other scripts
3131

32-
- `psmconvert` - The convert_psm function converts peptide spectrum matches (PSMs) from an idXML file to a CSV file, optionally filtering out decoy matches. It extracts and processes data from both the idXML and an associated spectra file, handling multiple search engines and scoring systems.
33-
- `mzmlstats` - The `mzmlstats` processes mass spectrometry data files in either `.mzML` or `Bruker .d` formats to extract and compile statistics about the spectra. It supports generating detailed or ID-only CSV files based on the spectra data.
32+
- `psmconvert` - The convert_psm function converts peptide spectrum matches (PSMs) from an idXML file to a parquet file, optionally filtering out decoy matches. It extracts and processes data from both the idXML and an associated spectra file, handling multiple search engines and scoring systems.
33+
- `mzmlstats` - The `mzmlstats` processes `.mzML` mass spectrometry data files to extract and compile statistics about the spectra. It supports generating detailed parquet files with spectrum metadata and MS2 peak data.
3434

3535
#### mzml statistics
3636

@@ -101,6 +101,8 @@ to extract the features from the MS1 spectra. We use an algorithm based on the o
101101
102102
The tool will generate a gzip compressed parquet file with the extension `{file_name}_ms1_feature_info.parquet`.
103103
104+
</details>
105+
104106
## Contributions and issues
105107
106108
Contributions and issues are welcome. Please, open an issue in the [GitHub repository](https://github.com/bigbio/quantms) or PR in the [GitHub repository](https://github.com/bigbio/quantms-utils).

pyproject.toml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -3,7 +3,7 @@ name = "quantms-utils"
33
description = "Python scripts and helpers for the quantMS workflow"
44
readme = "README.md"
55
license = "MIT"
6-
version = "0.0.24"
6+
version = "0.0.25"
77
authors = [
88
"Yasset Perez-Riverol <ypriverol@gmail.com>",
99
"Dai Chengxin <chengxin2024@126.com>",

quantmsutils/.DS_Store

6 KB
Binary file not shown.

quantmsutils/__init__.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1 +1 @@
1-
__version__ = "0.0.24"
1+
__version__ = "0.0.25"
Lines changed: 159 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,159 @@
1+
"""
2+
Convert DIA-NN output to MSstats format.
3+
License: Apache 2.0
4+
Authors: Hong Wong, Yasset Perez-Riverol
5+
Revisions:
6+
2023-Aug-05: J. Sebastian Paez
7+
"""
8+
9+
import logging
10+
from pathlib import Path
11+
12+
import click
13+
import pandas as pd
14+
from pyopenms import AASequence
15+
16+
CONTEXT_SETTINGS = dict(help_option_names=["-h", "--help"])
17+
REVISION = "0.1.1"
18+
19+
logging.basicConfig(format="%(asctime)s [%(funcName)s] - %(message)s", level=logging.DEBUG)
20+
logger = logging.getLogger(__name__)
21+
22+
23+
@click.command("diann2msstats", short_help="Convert DIA-NN output to MSstats")
24+
@click.option("--report", "-r", "report_path", required=True, type=click.Path(exists=True))
25+
@click.option("--exp_design", "-d", required=True, type=click.Path(exists=True))
26+
@click.option("--qvalue_threshold", "-q", type=float, required=True)
27+
def diann2msstats(
28+
report_path,
29+
exp_design,
30+
qvalue_threshold,
31+
):
32+
"""
33+
Convert DIA-NN output to MSstats format for downstream analysis.
34+
35+
:param report_path: DIA-NN main report file (.tsv or .parquet)
36+
:param exp_design: Experimental design file
37+
:param qvalue_threshold: Q-value filter threshold
38+
"""
39+
logger.debug(f"Revision {REVISION}")
40+
logger.debug("Reading input files...")
41+
report = load_report(report_path, qvalue_threshold)
42+
s_data_frame, f_table = get_exp_design_dfs(exp_design)
43+
44+
msstats_columns_keep = [
45+
"Protein.Names",
46+
"Modified.Sequence",
47+
"Precursor.Charge",
48+
"Precursor.Quantity",
49+
"Run",
50+
]
51+
52+
logger.debug("Converting to MSstats format...")
53+
if "Decoy" in report.columns:
54+
out_msstats = report[report["Decoy"] != 1][msstats_columns_keep].copy()
55+
else:
56+
out_msstats = report[msstats_columns_keep].copy()
57+
58+
out_msstats.columns = [
59+
"ProteinName",
60+
"PeptideSequence",
61+
"PrecursorCharge",
62+
"Intensity",
63+
"Run",
64+
]
65+
out_msstats = out_msstats[out_msstats["Intensity"] != 0]
66+
67+
out_msstats.loc[:, "PeptideSequence"] = out_msstats.apply(
68+
lambda x: (
69+
AASequence.fromString(x["PeptideSequence"]).toString()
70+
if "^" not in x["PeptideSequence"]
71+
else "^" + AASequence.fromString(x["PeptideSequence"].replace("^", "")).toString()
72+
),
73+
axis=1,
74+
)
75+
out_msstats["FragmentIon"] = "NA"
76+
out_msstats["ProductCharge"] = "0"
77+
out_msstats["IsotopeLabelType"] = "L"
78+
79+
logger.debug(f"out_msstats ({out_msstats.shape}) >>>")
80+
logger.debug("Adding Fraction, BioReplicate, Condition columns")
81+
82+
design_lookup = (
83+
s_data_frame[["Sample", "MSstats_Condition", "MSstats_BioReplicate"]]
84+
.merge(f_table[["Fraction", "Sample", "run"]], on="Sample")
85+
.rename(
86+
columns={
87+
"run": "Run",
88+
"MSstats_BioReplicate": "BioReplicate",
89+
"MSstats_Condition": "Condition",
90+
}
91+
)
92+
.drop(columns=["Sample"])
93+
)
94+
out_msstats = out_msstats.merge(design_lookup, on="Run", how="left", validate="many_to_one")
95+
96+
unmatched = out_msstats["BioReplicate"].isna()
97+
if unmatched.any():
98+
bad_runs = out_msstats.loc[unmatched, "Run"].unique().tolist()
99+
logger.warning(
100+
"Run(s) in DIA-NN report have no match in experimental design: %s. "
101+
"These rows will be dropped. Check that Run names (spectra file stems) match Spectra_Filepath in the design.",
102+
bad_runs,
103+
)
104+
out_msstats = out_msstats.dropna(subset=["BioReplicate"])
105+
exp_out_prefix = Path(exp_design).stem
106+
out_msstats.to_csv(exp_out_prefix + "_msstats_in.csv", sep=",", index=False)
107+
logger.info(f"MSstats input file is saved as {exp_out_prefix}_msstats_in.csv")
108+
109+
110+
def _true_stem(x):
111+
"""Return the file name stem (without extension)."""
112+
return Path(x).stem
113+
114+
115+
def get_exp_design_dfs(exp_design_file):
116+
logger.info(f"Reading experimental design file: {exp_design_file}")
117+
with open(exp_design_file, "r") as f:
118+
data = [line.replace("\r\n", "\n").replace("\r", "\n") for line in f.readlines()]
119+
try:
120+
empty_row = data.index("\n")
121+
except ValueError:
122+
raise ValueError(
123+
f"Could not find blank separator row in {exp_design_file}. "
124+
"Ensure the file contains a blank line between the file and sample tables."
125+
)
126+
f_table = [i.replace("\n", "").split("\t") for i in data[1:empty_row]]
127+
f_header = data[0].replace("\n", "").split("\t")
128+
f_table = pd.DataFrame(f_table, columns=f_header)
129+
f_table.loc[:, "run"] = f_table.apply(lambda x: _true_stem(x["Spectra_Filepath"]), axis=1)
130+
131+
s_table = [i.replace("\n", "").split("\t") for i in data[empty_row + 1 :]][1:]
132+
s_header = data[empty_row + 1].replace("\n", "").split("\t")
133+
s_data_frame = pd.DataFrame(s_table, columns=s_header)
134+
135+
return s_data_frame, f_table
136+
137+
138+
def load_report(report_path, qvalue_threshold: float) -> pd.DataFrame:
139+
"""Load DIA-NN report from TSV or Parquet, detecting format from file extension."""
140+
path = Path(report_path)
141+
remain_cols = [
142+
"Run",
143+
"Protein.Names",
144+
"Modified.Sequence",
145+
"Precursor.Charge",
146+
"Precursor.Quantity",
147+
"Q.Value",
148+
]
149+
if path.suffix == ".parquet":
150+
pq_columns = pd.read_parquet(path, columns=[]).columns.tolist()
151+
use_cols = remain_cols + (["Decoy"] if "Decoy" in pq_columns else [])
152+
report = pd.read_parquet(path, columns=use_cols)
153+
else:
154+
tsv_header = pd.read_csv(path, sep="\t", header=0, nrows=0).columns.tolist()
155+
tsv_cols = remain_cols + (["Decoy"] if "Decoy" in tsv_header else [])
156+
report = pd.read_csv(path, sep="\t", header=0, usecols=tsv_cols)
157+
158+
report = report[report["Q.Value"] < qvalue_threshold]
159+
return report

0 commit comments

Comments
 (0)