|
1 | | -[](https://github.com/bede/seqsum/actions/workflows/test.yml) [](https://pypi.org/project/seqsum) |
| 1 | +[](https://github.com/bede/seqsum/actions/workflows/test.yml) |
2 | 2 |
|
3 | 3 | # Seqsum |
4 | 4 |
|
5 | | -Robust checksums for nucleotide sequences. Accepts data from either standard input or `fast[a|q][.gz|.zst|.xz|.bz2]` files. Generates *individual* checksums for each sequence, and an *aggregate* checksum-of-checksums for a collection of sequences. Warnings are shown for duplicate sequences and within-collection checksum collisions at the chosen bit depth. Sequences are uppercased prior to hashing with [xxHash](https://github.com/ifduyue/python-xxhash) (`xxh3_128`) and may be normalised (with `-n`) to use only the characters `ACGTN-`. Read IDs and FASTQ base quality scores do not inform the checksum. Outputs tab delimited text or JSON to stdout. |
6 | | - |
7 | | -A typical use case is determining whether reordered, renamed or otherwise bit-inexact fasta/fastq files have equivalent sequence composition. Another use is generating the shortest possible collision-free identifiers for sequence collections. |
8 | | - |
9 | | -By default, seqsum outputs both individual and aggregate checksums when supplied with more than one sequence. This can be modified with the flags `--individual` (`-i`) or `--aggregate` (`-a`). |
10 | | - |
11 | | -Uses the excellent library [`dnaio`](https://github.com/marcelm/dnaio) efficient sequence parsing. |
| 5 | +Robust checksums for nucleotide sequences. Accepts input from either standard input or `fast[a|q][.gz|.zst|.xz|.bz2]` files. Generates individual checksums for each sequence, plus an aggregate checksum for a collection. Warnings are shown for duplicate sequences and within-collection checksum collisions at the selected bit depth. Sequences are uppercased before hashing with [rapidhash](https://github.com/Nicoshev/rapidhash) (`v3`) and may be normalised (with `-n`) to use only `ACGTN-`. Read IDs and FASTQ base quality scores do not inform the checksum. Output is tab-delimited text to stdout. |
12 | 6 |
|
| 7 | +By default, seqsum outputs individual checksums and, when there is more than one sequence, an aggregate checksum. This can be modified with `--individual` (`-i`) or `--aggregate` (`-a`). |
13 | 8 |
|
| 9 | +Uses [`paraseq`](https://github.com/mbhall88/paraseq) for efficient FASTA/FASTQ parsing. |
14 | 10 |
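
For readers of this diff, the preprocessing described in the README text (uppercasing, then optional normalisation with `-n`) can be sketched as below. This is a minimal illustration, not the code in this change: the function name `prepare` is hypothetical, and the mapping rule (U becomes T, anything other than `ACGT-` becomes N) is taken from the help text of the removed Python version and is assumed to carry over unchanged. Hashing itself is left out.

```rust
/// Hypothetical helper: prepare a sequence for hashing.
/// Always uppercases. If `normalise` is true, additionally maps U to T and
/// every byte other than A, C, G, T or '-' to N (assumed `-n` semantics).
fn prepare(seq: &[u8], normalise: bool) -> Vec<u8> {
    seq.iter()
        .map(|b| b.to_ascii_uppercase())
        .map(|b| match (normalise, b) {
            // Without normalisation the uppercased byte is kept as-is.
            (false, _) => b,
            // RNA uracil is treated as thymine.
            (true, b'U') => b'T',
            // Canonical bases and gap characters pass through.
            (true, b'A' | b'C' | b'G' | b'T' | b'-') => b,
            // Everything else, including IUPAC ambiguity codes, becomes N.
            (true, _) => b'N',
        })
        .collect()
}
```

Under these assumptions, `prepare(b"acgu", true)` yields `ACGT`, so records that differ only in case or in U versus T would hash identically when `-n` is given.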
|
15 | 11 | ## Install |
16 | 12 |
|
17 | | -Installation inside a clean Python 3.10+ virtualenv (or conda environment) is recommended. To open `.zst` archives, you will also need to `pip install zstandard`. |
18 | | - |
19 | 13 | ```bash |
20 | | -pip install seqsum |
| 14 | +cargo install --path . |
21 | 15 | ``` |
22 | 16 |
|
23 | | - |
24 | | - |
25 | 17 | ## Development |
26 | 18 |
|
27 | | -Development uses [uv](https://docs.astral.sh/uv/). |
28 | | - |
29 | 19 | ```bash |
30 | 20 | git clone https://github.com/bede/seqsum.git |
31 | 21 | cd seqsum |
32 | | -uv sync |
33 | | -uv run pytest |
34 | | -uv run pre-commit install |
35 | | -uv run pre-commit run --all-files |
| 22 | +cargo test |
| 23 | +cargo fmt --all --check |
| 24 | +cargo clippy --all-targets -- -D warnings |
36 | 25 | ``` |
37 | 26 |
|
38 | | - |
39 | | - |
40 | 27 | ## Command line usage |
41 | 28 |
|
42 | 29 | ```bash |
43 | 30 | # Fasta with one record |
44 | | -$ seqsum nt MN908947.fasta |
45 | | -MN908947.3 ca5e95436b957f93 |
| 31 | +$ seqsum tests/data/MN908947.fasta |
| 32 | +33ba13564e0a63e3 MN908947.3 |
46 | 33 |
|
47 | 34 | # Fasta with two records |
48 | | -$ seqsum nt MN908947-BA_2_86_1.fasta |
49 | | -MN908947.3 ca5e95436b957f93 |
50 | | -BA.2.86.1 d5f014ee6745cb77 |
51 | | -aggregate 837cfd6836b9a406 |
| 35 | +$ seqsum tests/data/MN908947-BA_2_86_1.fasta |
| 36 | +33ba13564e0a63e3 MN908947.3 |
| 37 | +9fef3b61d54d8902 BA.2.86.1 |
| 38 | +d3a94eb82357ece5 aggregate |
52 | 39 |
|
53 | | -# Fasta with two records, only show the aggregate checksum |
54 | | -$ seqsum nt -a MN908947-BA_2_86_1.fasta |
55 | | -aggregate 837cfd6836b9a406 |
| 40 | +# Fasta with two records, only show aggregate checksum |
| 41 | +$ seqsum tests/data/MN908947-BA_2_86_1.fasta --aggregate |
| 42 | +d3a94eb82357ece5 aggregate |
56 | 43 |
|
57 | 44 | # Fasta via stdin |
58 | | -% cat MN908947.fasta | seqsum nt - |
59 | | -MN908947.3 ca5e95436b957f93 |
| 45 | +$ cat tests/data/MN908947.fasta | seqsum - |
| 46 | +33ba13564e0a63e3 MN908947.3 |
60 | 47 |
|
61 | | -# Fastq (gzipped) with 1m records, redirected to file, with progress bar |
62 | | -$ seqsum nt illumina.r12.fastq.gz --progress > checksums.tsv |
63 | | -Processed 1000000 records (317kit/s) |
64 | | -INFO: Found duplicate sequences |
65 | 48 | ``` |
66 | 49 |
|
67 | 50 | **Built-in help** |
68 | 51 |
|
69 | 52 | ```bash |
70 | | -$ seqsum nt -h |
71 | | -usage: seqsum nt [-h] [-n] [-s] [-b BITS] [-i] [-a] [-j] [-p] input |
72 | | - |
73 | | -Robust individual and aggregate checksums for nucleotide sequences. Accepts input |
74 | | -from either stdin or fast[a|q][.gz|.zst|.xz|.bz2] files. Generates individual |
75 | | -checksums for each sequence, and an aggregate checksum-of-checksums for a |
76 | | -collection of sequences. Warnings are shown for duplicate sequences and |
77 | | -within-collection checksum collisions at the chosen bit depth. Sequences are |
78 | | -uppercased prior to hashing with xxHash and may optionally be normalised to use only |
79 | | -the characters ACGTN-. Read IDs and base quality scores do not inform the checksum |
80 | | - |
81 | | -positional arguments: |
82 | | - input path to fasta/q file (or - for stdin) |
83 | | - |
84 | | -options: |
85 | | - -h, --help show this help message and exit |
86 | | - -n, --normalise replace U with T and characters other than ACGT- with N |
87 | | - (default: False) |
88 | | - -s, --strict raise error for characters other than ABCDGHKMNRSTVWY- |
89 | | - (default: False) |
90 | | - -b BITS, --bits BITS displayed checksum length |
91 | | - (default: 64) |
92 | | - -i, --individual output only individual checksums |
93 | | - (default: False) |
94 | | - -a, --aggregate output only aggregate checksum |
95 | | - (default: False) |
96 | | - -j, --json output JSON |
97 | | - (default: False) |
98 | | - -p, --progress show progress and speed |
99 | | - (default: False) |
100 | | -``` |
101 | | - |
102 | | - |
103 | | - |
104 | | -## Python usage |
105 | | - |
106 | | -```python |
107 | | -from seqsum import lib |
108 | | - |
109 | | -checksums, aggregate_checksum = lib.sum_nt(">read1\nACGT") |
110 | | -print(checksums) |
111 | | -# {'read1': '81db282b97e7dfd1'} |
112 | | -``` |
113 | | - |
114 | | -```python |
115 | | -from pathlib import Path |
116 | | -from seqsum import lib |
117 | | - |
118 | | -fasta_path = Path("tests/data/MN908947-BA_2_86_1.fasta") |
119 | | -checksums, aggregate_checksum = lib.sum_nt(fasta_path) |
120 | | -print(checksums) |
121 | | -print(aggregate_checksum) |
122 | | -# {'MN908947.3': 'ca5e95436b957f93', 'BA.2.86.1': 'd5f014ee6745cb77'} |
123 | | -# 837cfd6836b9a406 |
| 53 | +$ seqsum -h |
124 | 54 | ``` |