Merged
147 changes: 80 additions & 67 deletions README.md
@@ -107,15 +107,9 @@ X9,0.7232472324723247,0.7352941176470589,...,0.8066914498141264,0.0
|![combined_white](https://github.com/user-attachments/assets/48b3f6e3-6dd5-4298-a793-23dcd549e90c)|![kpclust](https://github.com/user-attachments/assets/98a4d540-7c43-4802-8f77-277a5637a7a1)|

## Quick Start (Full Pipeline)
To run the full pipeline, use the following command:
```bash
KrakenParser --complete -i data/kreports -o results/
#Having troubles? Run KrakenParser --complete -h
```

For **reproducible** β-diversity (rarefaction is stochastic by default):
```bash
KrakenParser -i data/kreports -o results/ -s 42
KrakenParser -i data/kreports -o results/
```

This will:
@@ -127,147 +121,165 @@ This will:
6. Calculate relative abundance
7. Calculate α & β-diversities

## Installation
> [!TIP]
> When the pipeline finishes, the terminal output reminds you to calibrate the
> rarefaction depth for β-diversity and to re-run relative abundance normalization
> before visualization, and it prints ready-to-paste example commands tailored to your output paths.

### Full help output

```
pip install krakenparser
usage: KrakenParser [-h] [-i INPUT] [-o OUTPUT] [--viruses] [--keep-human]
[-V] [-d DEPTH] [-s SEED] [--overwrite]
[--step {mpa,combine,split,process,csv,relabund,diversity}]

KrakenParser: Convert Kraken2 Reports to CSV.

options:
-h, --help show this help message and exit

Core Arguments:
-i, --input INPUT Directory containing Kraken2 report files
-o, --output OUTPUT Output directory (default: parent of input)
--viruses Extract only VIRUSES domain taxa in the pipeline
--keep-human Do not filter human-related taxa
-V, --version show program's version number and exit

Pipeline Options (Full Run):
-d, --depth DEPTH Rarefaction depth for β-diversity (default: 1000)
-s, --seed SEED Random seed for reproducible rarefaction (default: random)
--overwrite Overwrite the output directory if it already exists

Advanced (Step-by-step control):
--step {mpa,combine,split,process,csv,relabund,diversity}
Run only a specific part of the pipeline.
Type 'krakenparser --step <name> -h' for more.
```

## Before Visualization: Grouping Low-Abundance Taxa

The full pipeline automatically calculates relative abundance. Before passing data to visualization, it is strongly recommended to re-run `--relabund` with the `-O` flag — this collapses all taxa below the chosen threshold into a single **"Other"** group, producing much cleaner and more readable plots.
## Installation

```bash
KrakenParser --relabund -i data/counts/counts_species.csv -o data/rel_abund/ra_species.csv -O 4
```

This groups every taxon with relative abundance **< 4 %** into `Other (<4.0%)`. Adjust the threshold to your data.

> **Note:** The pipeline-generated `rel_abund/ra_*.csv` files (no `-O`) preserve the full unfiltered data — use them for statistical analysis. Use the `-O` variant specifically for visualization.
pip install krakenparser
```

---

<details>
<summary><b>Using Individual Modules (Advanced)</b></summary>
<br>

Each step of the pipeline can also be run individually. This is useful for re-running a single step, debugging, or integrating KrakenParser into a custom workflow.
Each step of the pipeline can be run individually via `--step`. This is useful for re-running a single step, debugging, or integrating KrakenParser into a custom workflow. Run `krakenparser --step <name> -h` to see the full argument list for any step.

### **Step 1: Convert Kraken2 Reports to MPA Format**
```bash
# Batch mode (directory)
KrakenParser --kreport2mpa -i data/kreports -o data/intermediate/mpa
KrakenParser --step mpa -i data/kreports -o data/intermediate/mpa
# Single file
KrakenParser --kreport2mpa -r data/kreports/sample.kreport -o data/intermediate/mpa/sample.MPA.TXT
#Having troubles? Run KrakenParser --kreport2mpa -h
KrakenParser --step mpa -r data/kreports/sample.kreport -o data/intermediate/mpa/sample.MPA.TXT
```
Converts Kraken2 `.kreport` files into **MPA format**.

### **Step 2: Combine MPA Files**
```bash
KrakenParser --combine_mpa -i data/intermediate/mpa/* -o data/intermediate/COMBINED.txt
#Having troubles? Run KrakenParser --combine_mpa -h
KrakenParser --step combine -i data/intermediate/mpa/* -o data/intermediate/COMBINED.txt
```
Merges multiple MPA files into a single combined table.

### **Step 3: Extract Taxonomic Levels**
```bash
KrakenParser --deconstruct -i data/intermediate/COMBINED.txt -o data/intermediate
#Having troubles? Run KrakenParser --deconstruct -h
KrakenParser --step split -i data/intermediate/COMBINED.txt -o data/intermediate
```

By default, human-related taxa (Homo sapiens, Hominidae, Primates, Mammalia, Chordata) are removed. To keep them:
```bash
KrakenParser --deconstruct -i data/intermediate/COMBINED.txt -o data/intermediate --keep-human
KrakenParser --step split -i data/intermediate/COMBINED.txt -o data/intermediate --keep-human
```

To inspect the **Viruses** domain separately:
To inspect the **Viruses** domain only:
```bash
KrakenParser --deconstruct_viruses -i data/intermediate/COMBINED.txt -o data/counts_viruses
#Having troubles? Run KrakenParser --deconstruct_viruses -h
KrakenParser --step split -i data/intermediate/COMBINED.txt -o data/counts_viruses --viruses-only
```

### **Step 4: Process Extracted Taxonomic Data**
```bash
KrakenParser --process -i data/intermediate/COMBINED.txt -o data/intermediate/txt/counts_phylum.txt
#Having troubles? Run KrakenParser --process -h
KrakenParser --step process -i data/intermediate/COMBINED.txt -o data/intermediate/txt/counts_phylum.txt
```

Repeat on other 5 taxonomical levels (class, order, family, genus, species) or wrap up `KrakenParser --process` in a loop.
Repeat for the other five taxonomic levels (class, order, family, genus, species), or wrap `--step process` in a loop.

Cleans up taxonomic names: removes prefixes (`s__`, `g__`, etc.) and replaces underscores with spaces.
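
As a rough illustration of the cleanup described above, the following Python sketch strips a rank prefix and replaces underscores with spaces. It is not KrakenParser's own code: the function name `clean_taxon_name` is hypothetical, and the prefix pattern (one lowercase letter followed by two underscores) is an assumption based on the `s__` and `g__` examples.

```python
import re

# Assumed prefix shape: one lowercase rank letter plus two underscores ("s__", "g__", ...).
_RANK_PREFIX = re.compile(r"^[a-z]__")


def clean_taxon_name(raw: str) -> str:
    """Strip a leading rank prefix and turn underscores into spaces."""
    without_prefix = _RANK_PREFIX.sub("", raw)
    return without_prefix.replace("_", " ")


if __name__ == "__main__":
    print(clean_taxon_name("s__Escherichia_coli"))
```

Names without a recognized prefix pass through with only the underscore replacement applied.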

### **Step 5: Convert TXT to CSV**
```bash
KrakenParser --txt2csv -i data/intermediate/txt/counts_phylum.txt -o data/counts/counts_phylum.csv
#Having troubles? Run KrakenParser --txt2csv -h
KrakenParser --step csv -i data/intermediate/txt/counts_phylum.txt -o data/counts/counts_phylum.csv
```
Repeat for the other five taxonomic levels, or wrap the command in a loop. This step transposes the data so that sample names become rows.
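
To make the transposition concrete, here is a minimal Python sketch of the same idea. It assumes the processed text file is tab-delimited with taxa as rows and samples as columns, which the README does not state explicitly; `transpose_table` is a hypothetical helper, not part of KrakenParser.

```python
import csv
import io


def transpose_table(txt: str) -> str:
    """Transpose a tab-delimited table and return it as CSV text."""
    rows = list(csv.reader(io.StringIO(txt), delimiter="\t"))
    # zip(*rows) swaps rows and columns; it assumes every row has the same length.
    transposed = zip(*rows)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(transposed)
    return out.getvalue()


if __name__ == "__main__":
    example = "taxon\tX1\tX2\nFirmicutes\t10\t3\nProteobacteria\t5\t7\n"
    print(transpose_table(example), end="")
```

After the swap, each sample (`X1`, `X2`) occupies one row and each taxon one column.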

### **Step 6: Calculate Relative Abundance**
```bash
KrakenParser --relabund -i data/counts/counts_phylum.csv -o data/rel_abund/ra_phylum.csv
#Having troubles? Run KrakenParser --relabund -h
KrakenParser --step relabund -i data/counts/counts_phylum.csv -o data/rel_abund/ra_phylum.csv
```
Repeat for the other five taxonomic levels, or wrap the command in a loop.
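
One way to wrap the step in a loop is a small Python driver that builds one command per taxonomic level. This is a sketch under assumptions: the file names follow the `counts_<level>.csv` and `ra_<level>.csv` pattern seen in the examples, `KrakenParser` is on `PATH`, and the helper names `relabund_commands` and `run_all` are hypothetical.

```python
import subprocess
from typing import List

LEVELS = ["phylum", "class", "order", "family", "genus", "species"]


def relabund_commands(counts_dir: str = "data/counts",
                      out_dir: str = "data/rel_abund") -> List[List[str]]:
    """Build one `--step relabund` command per taxonomic level."""
    return [
        ["KrakenParser", "--step", "relabund",
         "-i", f"{counts_dir}/counts_{level}.csv",
         "-o", f"{out_dir}/ra_{level}.csv"]
        for level in LEVELS
    ]


def run_all(dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in relabund_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if the command exits non-zero.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

The default dry run only prints the commands, so the list can be reviewed before anything is executed.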

With "Other" grouping:
```bash
KrakenParser --relabund -i data/counts/counts_phylum.csv -o data/rel_abund/ra_phylum.csv -O 3.5
KrakenParser --step relabund -i data/counts/counts_phylum.csv -o data/rel_abund/ra_phylum.csv -O 3.5
```
Groups all taxa with abundance < 3.5 % into `Other (<3.5%)`.
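
The grouping can be pictured with the short Python sketch below. It is an illustration, not KrakenParser's implementation: it assumes the threshold is applied per sample to percentages of that sample's total, and `relative_abundance` is a hypothetical function name. Only the `Other (<3.5%)` label format is taken from the README.

```python
from typing import Dict, Optional


def relative_abundance(counts: Dict[str, float],
                       other_threshold: Optional[float] = None) -> Dict[str, float]:
    """Convert one sample's counts to percentages, optionally grouping rare taxa."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    percents = {taxon: 100.0 * n / total for taxon, n in counts.items()}
    if other_threshold is None:
        return percents

    # Keep taxa at or above the threshold; sum everything below it into one bucket.
    kept = {t: p for t, p in percents.items() if p >= other_threshold}
    other = sum(p for p in percents.values() if p < other_threshold)
    if other > 0:
        # Label format follows the README: threshold 3.5 -> "Other (<3.5%)".
        kept[f"Other (<{float(other_threshold)}%)"] = other
    return kept


if __name__ == "__main__":
    sample = {"Firmicutes": 60, "Proteobacteria": 37, "Fusobacteria": 2, "Chloroflexi": 1}
    print(relative_abundance(sample, other_threshold=3.5))
```

In the example, the two taxa at 2 % and 1 % fall below the 3.5 % threshold and are reported together as a single 3 % entry.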

### **Step 7: Calculate α & β-Diversities**
```bash
KrakenParser --diversity -i data/counts/counts_species.csv -o data/diversity
#Having troubles? Run KrakenParser --diversity -h
KrakenParser --step diversity -i data/counts/counts_species.csv -o data/diversity
```
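
For orientation, the metrics this step reports (Shannon, Pielou and Chao1 for α-diversity; Bray-Curtis and Jaccard for β-diversity) can be written out in a few lines of plain Python. This is a textbook sketch, not KrakenParser's code. It assumes the natural logarithm for Shannon, the bias-corrected form of Chao1, and presence/absence Jaccard; the tool's own choices may differ.

```python
import math
from typing import Sequence


def shannon(counts: Sequence[int]) -> float:
    """Shannon index H = -sum(p * ln p) over taxa with non-zero counts."""
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)


def pielou(counts: Sequence[int]) -> float:
    """Pielou evenness J = H / ln(S), where S is the observed richness."""
    richness = sum(1 for n in counts if n > 0)
    if richness < 2:
        return 0.0  # evenness is undefined for a single taxon; 0.0 is a convention
    return shannon(counts) / math.log(richness)


def chao1(counts: Sequence[int]) -> float:
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    s_obs = sum(1 for n in counts if n > 0)
    f1 = sum(1 for n in counts if n == 1)  # singletons
    f2 = sum(1 for n in counts if n == 2)  # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


def bray_curtis(a: Sequence[int], b: Sequence[int]) -> float:
    """Bray-Curtis dissimilarity between two samples over the same taxa."""
    denom = sum(a) + sum(b)
    if denom == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / denom


def jaccard(a: Sequence[int], b: Sequence[int]) -> float:
    """Jaccard distance on presence/absence of taxa."""
    present_a = {i for i, x in enumerate(a) if x > 0}
    present_b = {i for i, x in enumerate(b) if x > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1 - len(present_a & present_b) / len(union)


if __name__ == "__main__":
    print(shannon([5, 5]), pielou([5, 5]), chao1([1, 1, 2, 5]))
    print(bray_curtis([10, 0], [0, 10]), jaccard([1, 0, 2], [0, 0, 5]))
```

The β-diversity functions expect both samples to list counts for the same taxa in the same order.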

With a custom rarefaction depth:
```bash
KrakenParser --diversity -i data/counts/counts_species.csv -o data/diversity -d 750
KrakenParser --step diversity -i data/counts/counts_species.csv -o data/diversity -d 750
```

For reproducible results (rarefaction uses random subsampling — fix the seed to get the same matrix every run):
For reproducible results (fix the seed to get the same matrix every run):
```bash
KrakenParser --diversity -i data/counts/counts_species.csv -o data/diversity -s 42
KrakenParser --step diversity -i data/counts/counts_species.csv -o data/diversity -s 42
```
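
The reason a seed matters is that rarefaction draws a random subsample of reads. The Python sketch below shows the idea for a single sample using subsampling without replacement. It is an assumption about the general technique, not a copy of KrakenParser's implementation; `rarefy` is a hypothetical function, and raising an error for samples shallower than the depth is one possible policy among several.

```python
import random
from typing import Dict, Optional


def rarefy(counts: Dict[str, int], depth: int,
           seed: Optional[int] = None) -> Dict[str, int]:
    """Subsample `depth` reads without replacement from one sample."""
    total = sum(counts.values())
    if depth > total:
        raise ValueError(f"depth {depth} exceeds sample total {total}")
    # Expand counts into one label per read; simple, but memory-heavy for deep samples.
    reads = [taxon for taxon, n in counts.items() for _ in range(n)]
    rng = random.Random(seed)  # seed=None seeds from system entropy, so runs differ
    picked = rng.sample(reads, depth)
    rarefied = {taxon: 0 for taxon in counts}
    for taxon in picked:
        rarefied[taxon] += 1
    return rarefied


if __name__ == "__main__":
    sample = {"A": 700, "B": 250, "C": 50}
    first = rarefy(sample, depth=100, seed=42)
    second = rarefy(sample, depth=100, seed=42)
    print(first == second)
```

With the same seed the two calls return identical counts, while omitting the seed generally yields a different subsample on each run.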

---

## Arguments Breakdown

### **--complete** (Full Pipeline)
- Requires `-i`: path to the Kraken2 reports directory (e.g., `data/kreports`).
- Optional `-o`: output directory (default: parent of `-i`).
- Optional `--keep-human`: retain human-related taxa (default: filtered out).
- Optional `-s INT`: random seed for reproducible β-diversity rarefaction (default: random).
### **Full Pipeline** (`-i`)
- `-i / --input`: path to the Kraken2 reports directory (e.g., `data/kreports`). Triggers the full pipeline.
- `-o / --output`: output directory (default: parent of `-i`).
- `--viruses`: extract only Viruses domain taxa throughout the pipeline.
- `--keep-human`: retain human-related taxa (default: filtered out).
- `-d INT / --depth`: rarefaction depth for β-diversity (default: 1000).
- `-s INT / --seed`: random seed for reproducible β-diversity rarefaction (default: random).
- `--overwrite`: overwrite the output directory if it already exists.

### **--kreport2mpa** (Step 1)
### **--step mpa** (Step 1)
- Batch mode: `-i DIR -o DIR` — converts all files in a directory.
- Single-file mode: `-r FILE -o FILE`.

### **--combine_mpa** (Step 2)
### **--step combine** (Step 2)
- `-i FILE [FILE ...]`: one or more MPA files.
- `-o FILE`: output merged table.

### **--deconstruct** & **--deconstruct_viruses** (Step 3)
### **--step split** (Step 3)
- Extracts **phylum, class, order, family, genus, species** into separate text files.
- `--deconstruct` removes human-related reads by default; use `--keep-human` to retain them.
- `--deconstruct_viruses` extracts only the Viruses domain.
- Removes human-related reads by default; use `--keep-human` to retain them.
- Use `--viruses-only` to extract only the Viruses domain.

### **--process** (Step 4)
### **--step process** (Step 4)
- Removes prefixes (`s__`, `g__`, etc.), replaces underscores with spaces.
- `-i`: COMBINED.txt (source for sample-name header); `-o`: target txt file.

### **--txt2csv** (Step 5)
### **--step csv** (Step 5)
- Transposes a processed txt file into a CSV with sample names as rows.

### **--relabund** (Step 6)
### **--step relabund** (Step 6)
- Calculates relative abundance from a total-counts CSV.
- `-O FLOAT`: group taxa below FLOAT % into `Other (<FLOAT%)`.

### **--diversity** (Step 7)
### **--step diversity** (Step 7)
- Shannon, Pielou & Chao1 for α-diversity.
- Bray-Curtis & Jaccard for β-diversity.
- `-d INT`: rarefaction depth for β-diversity (default: 1000).
@@ -293,16 +305,17 @@ results/
│ ├─ alpha_div.csv
│ ├─ beta_div_bray.csv
│ └─ beta_div_jaccard.csv
└─ intermediate/ # Intermediate files
├─ mpa/ # Converted MPA files
│ ├─ {sample}.txt
│ ├─ ...
├─ COMBINED.txt # Merged MPA table
└─ txt/ # Extracted taxonomic levels in TXT
├─ counts_species.txt
├─ counts_genus.txt
├─ ...
└─ counts_phylum.txt
├─ intermediate/ # Intermediate files
│ ├─ mpa/ # Converted MPA files
│ │ ├─ {sample}.txt
│ │ ├─ ...
│ ├─ COMBINED.txt # Merged MPA table
│ └─ txt/ # Extracted taxonomic levels in TXT
│ ├─ counts_species.txt
│ ├─ counts_genus.txt
│ ├─ ...
│ └─ counts_phylum.txt
└─ krakenparser.log # Pipeline execution logs
```

## Conclusion