Commit 2232fb0

Initial streaming rapidhash rust rewrite, aggregate checksum from wrapping addition
1 parent 7fcd1dd commit 2232fb0
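
The commit title mentions an aggregate checksum built from wrapping addition. The Rust sources are not part of the hunks shown here, so the following is only a minimal sketch of that idea, not the code in this commit. The function names `sequence_checksum` and `aggregate_checksum` are hypothetical, and the standard library's `DefaultHasher` stands in for rapidhash purely so the example is self-contained.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Checksum of one sequence, uppercased first so case does not matter.
/// ASSUMPTION: DefaultHasher stands in for rapidhash, which is not shown here.
fn sequence_checksum(seq: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    hasher.write(&seq.to_ascii_uppercase());
    hasher.finish()
}

/// Fold per-sequence checksums with wrapping (mod 2^64) addition.
/// Addition is commutative, so record order does not change the result.
fn aggregate_checksum<I: IntoIterator<Item = u64>>(checksums: I) -> u64 {
    checksums.into_iter().fold(0u64, |acc, c| acc.wrapping_add(c))
}

fn main() {
    let records: [&[u8]; 2] = [b"ACGT", b"ttga"];
    let forward = aggregate_checksum(records.iter().map(|s| sequence_checksum(s)));
    let reversed = aggregate_checksum(records.iter().rev().map(|s| sequence_checksum(s)));
    // Reordering the records leaves the aggregate unchanged.
    assert_eq!(forward, reversed);
    println!("{forward:016x}");
}
```

Because the accumulator needs only one `u64` of state, this kind of aggregate suits streaming input. Unlike XOR, wrapping addition does not cancel out a sequence that appears twice, which may be one reason to prefer it when duplicates are expected.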

11 files changed

Lines changed: 48 additions & 1529 deletions

.github/workflows/test.yml

Lines changed: 12 additions & 14 deletions
@@ -6,18 +6,16 @@ jobs:
     strategy:
       matrix:
         os: [ubuntu-latest, macos-latest]
-        python-version: ["3.10", "3.12"]
-    name: Python ${{ matrix.python-version }} (${{ matrix.os }})
+    name: Rust stable (${{ matrix.os }})
     steps:
-      - uses: actions/checkout@v3
-      - name: Set up Python ${{ matrix.python-version }}
-        uses: actions/setup-python@v4
-        with:
-          python-version: ${{ matrix.python-version }}
-      - name: Install dependencies
-        run: |
-          python -m pip install pytest
-          python -m pip install .
-      - name: Test with pytest
-        run: |
-          pytest
+      - uses: actions/checkout@v4
+      - name: Set up Rust
+        uses: dtolnay/rust-toolchain@stable
+      - name: Cache cargo registry and target
+        uses: Swatinem/rust-cache@v2
+      - name: Check format
+        run: cargo fmt --all --check
+      - name: Lint
+        run: cargo clippy --all-targets -- -D warnings
+      - name: Test
+        run: cargo test --locked

.gitignore

Lines changed: 4 additions & 135 deletions
@@ -1,137 +1,6 @@
-# Byte-compiled / optimized / DLL files
-__pycache__/
-*.py[cod]
-*$py.class
-
-# C extensions
-*.so
-
-# Distribution / packaging
-.Python
-build/
-develop-eggs/
-dist/
-downloads/
-eggs/
-.eggs/
-lib/
-lib64/
-parts/
-sdist/
-var/
-wheels/
-pip-wheel-metadata/
-share/python-wheels/
-*.egg-info/
-.installed.cfg
-*.egg
-MANIFEST
-
-# PyInstaller
-# Usually these files are written by a python script from a template
-# before PyInstaller builds the exe, so as to inject date/other infos into it.
-*.manifest
-*.spec
-
-# Installer logs
-pip-log.txt
-pip-delete-this-directory.txt
-
-# Unit test / coverage reports
-htmlcov/
-.tox/
-.nox/
-.coverage
-.coverage.*
-.cache
-nosetests.xml
-coverage.xml
-*.cover
-*.py,cover
-.hypothesis/
-.pytest_cache/
-
-# Translations
-*.mo
-*.pot
-
-# Django stuff:
-*.log
-local_settings.py
-db.sqlite3
-db.sqlite3-journal
-
-# Flask stuff:
-instance/
-.webassets-cache
-
-# Scrapy stuff:
-.scrapy
-
-# Sphinx documentation
-docs/_build/
-
-# PyBuilder
 target/
-
-# Jupyter Notebook
-.ipynb_checkpoints
-
-# IPython
-profile_default/
-ipython_config.py
-
-# pyenv
-.python-version
-
-# pipenv
-# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
-# However, in case of collaboration, if having platform-specific dependencies or dependencies
-# having no cross-platform support, pipenv may install dependencies that don't work, or not
-# install all needed dependencies.
-#Pipfile.lock
-
-# PEP 582; used by e.g. github.com/David-OConnor/pyflow
-__pypackages__/
-
-# Celery stuff
-celerybeat-schedule
-celerybeat.pid
-
-# SageMath parsed files
-*.sage.py
-
-# Environments
-.env
-.venv
-env/
-venv/
-ENV/
-env.bak/
-venv.bak/
-
-# Spyder project settings
-.spyderproject
-.spyproject
-
-# Rope project settings
-.ropeproject
-
-# mkdocs documentation
-/site
-
-# mypy
-.mypy_cache/
-.dmypy.json
-dmypy.json
-
-# Pyre type checker
-.pyre/
-
 .DS_Store
-
-.vscode
-
-100k.ERR3239334_1.fastq
-1m.ERR3239334_1.fastq
-1m.ERR3239334_1.fastq.gz
+.vscode/
+tests/data/100k.ERR3239334_1.fastq
+tests/data/1m.ERR3239334_1.fastq
+tests/data/1m.ERR3239334_1.fastq.gz

.pre-commit-config.yaml

Lines changed: 12 additions & 6 deletions
@@ -1,7 +1,13 @@
 repos:
-  - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.1.5
-    hooks:
-      - id: ruff
-        args: [ --fix ]
-      - id: ruff-format
+  - repo: local
+    hooks:
+      - id: cargo-fmt
+        name: cargo fmt
+        entry: cargo fmt --all --check
+        language: system
+        pass_filenames: false
+      - id: cargo-clippy
+        name: cargo clippy
+        entry: cargo clippy --all-targets -- -D warnings
+        language: system
+        pass_filenames: false

README.md

Lines changed: 20 additions & 90 deletions
@@ -1,124 +1,54 @@
-[![Tests](https://github.com/bede/seqsum/actions/workflows/test.yml/badge.svg)](https://github.com/bede/seqsum/actions/workflows/test.yml) [![PyPI version](https://img.shields.io/pypi/v/seqsum)](https://pypi.org/project/seqsum)
+[![Tests](https://github.com/bede/seqsum/actions/workflows/test.yml/badge.svg)](https://github.com/bede/seqsum/actions/workflows/test.yml)
 
 # Seqsum
 
-Robust checksums for nucleotide sequences. Accepts data from either standard input or `fast[a|q][.gz|.zst|.xz|.bz2]` files. Generates *individual* checksums for each sequence, and an *aggregate* checksum-of-checksums for a collection of sequences. Warnings are shown for duplicate sequences and within-collection checksum collisions at the chosen bit depth. Sequences are uppercased prior to hashing with [xxHash](https://github.com/ifduyue/python-xxhash) (`xxh3_128`) and may be normalised (with `-n`) to use only the characters `ACGTN-`. Read IDs and FASTQ base quality scores do not inform the checksum. Outputs tab delimited text or JSON to stdout.
-
-A typical use case is determining whether reordered, renamed or otherwise bit-inexact fasta/fastq files have equivalent sequence composition. Another use is generating the shortest possible collision-free identifiers for sequence collections.
-
-By default, seqsum outputs both individual and aggregate checksums when supplied with more than one sequence. This can be modified with the flags `--individual` (`-i`) or `--aggregate` (`-a`).
-
-Uses the excellent library [`dnaio`](https://github.com/marcelm/dnaio) efficient sequence parsing.
+Robust checksums for nucleotide sequences. Accepts input from either standard input or `fast[a|q][.gz|.zst|.xz|.bz2]` files. Generates individual checksums for each sequence, plus an aggregate checksum for a collection. Warnings are shown for duplicate sequences and within-collection checksum collisions at the selected bit depth. Sequences are uppercased before hashing with [rapidhash](https://github.com/Nicoshev/rapidhash) (`v3`) and may be normalised (with `-n`) to use only `ACGTN-`. Read IDs and FASTQ base quality scores do not inform the checksum. Output is tab-delimited text to stdout.
 
+By default, seqsum outputs individual checksums and, when there is more than one sequence, an aggregate checksum. This can be modified with `--individual` (`-i`) or `--aggregate` (`-a`).
 
+Uses [`paraseq`](https://github.com/mbhall88/paraseq) for efficient FASTA/FASTQ parsing.
 
 ## Install
 
-Installation inside a clean Python 3.10+ virtualenv (or conda environment) is recommended. To open `.zst` archives, you will also need to `pip install zstandard`.
-
 ```bash
-pip install seqsum
+cargo install --path .
 ```
 
-
-
 ## Development
 
-Development uses [uv](https://docs.astral.sh/uv/).
-
 ```bash
 git clone https://github.com/bede/seqsum.git
 cd seqsum
-uv sync
-uv run pytest
-uv run pre-commit install
-uv run pre-commit run --all-files
+cargo test
+cargo fmt --all --check
+cargo clippy --all-targets -- -D warnings
 ```
 
-
-
 ## Command line usage
 
 ```bash
 # Fasta with one record
-$ seqsum nt MN908947.fasta
-MN908947.3 ca5e95436b957f93
+$ seqsum tests/data/MN908947.fasta
+33ba13564e0a63e3 MN908947.3
 
 # Fasta with two records
-$ seqsum nt MN908947-BA_2_86_1.fasta
-MN908947.3 ca5e95436b957f93
-BA.2.86.1 d5f014ee6745cb77
-aggregate 837cfd6836b9a406
+$ seqsum tests/data/MN908947-BA_2_86_1.fasta
+33ba13564e0a63e3 MN908947.3
+9fef3b61d54d8902 BA.2.86.1
+d3a94eb82357ece5 aggregate
 
-# Fasta with two records, only show the aggregate checksum
-$ seqsum nt -a MN908947-BA_2_86_1.fasta
-aggregate 837cfd6836b9a406
+# Fasta with two records, only show aggregate checksum
+$ seqsum tests/data/MN908947-BA_2_86_1.fasta --aggregate
+d3a94eb82357ece5 aggregate
 
 # Fasta via stdin
-% cat MN908947.fasta | seqsum nt -
-MN908947.3 ca5e95436b957f93
+$ cat tests/data/MN908947.fasta | seqsum -
+33ba13564e0a63e3 MN908947.3
 
-# Fastq (gzipped) with 1m records, redirected to file, with progress bar
-$ seqsum nt illumina.r12.fastq.gz --progress > checksums.tsv
-Processed 1000000 records (317kit/s)
-INFO: Found duplicate sequences
 ```
 
 **Built-in help**
 
 ```bash
-$ seqsum nt -h
-usage: seqsum nt [-h] [-n] [-s] [-b BITS] [-i] [-a] [-j] [-p] input
-
-Robust individual and aggregate checksums for nucleotide sequences. Accepts input
-from either stdin or fast[a|q][.gz|.zst|.xz|.bz2] files. Generates individual
-checksums for each sequence, and an aggregate checksum-of-checksums for a
-collection of sequences. Warnings are shown for duplicate sequences and
-within-collection checksum collisions at the chosen bit depth. Sequences are
-uppercased prior to hashing with xxHash and may optionally be normalised to use only
-the characters ACGTN-. Read IDs and base quality scores do not inform the checksum
-
-positional arguments:
-  input                 path to fasta/q file (or - for stdin)
-
-options:
-  -h, --help            show this help message and exit
-  -n, --normalise       replace U with T and characters other than ACGT- with N
-                        (default: False)
-  -s, --strict          raise error for characters other than ABCDGHKMNRSTVWY-
-                        (default: False)
-  -b BITS, --bits BITS  displayed checksum length
-                        (default: 64)
-  -i, --individual      output only individual checksums
-                        (default: False)
-  -a, --aggregate       output only aggregate checksum
-                        (default: False)
-  -j, --json            output JSON
-                        (default: False)
-  -p, --progress        show progress and speed
-                        (default: False)
-```
-
-
-
-## Python usage
-
-```python
-from seqsum import lib
-
-checksums, aggregate_checksum = lib.sum_nt(">read1\nACGT")
-print(checksums)
-# {'read1': '81db282b97e7dfd1'}
-```
-
-```python
-from pathlib import Path
-from seqsum import lib
-
-fasta_path = Path("tests/data/MN908947-BA_2_86_1.fasta")
-checksums, aggregate_checksum = lib.sum_nt(fasta_path)
-print(checksums)
-print(aggregate_checksum)
-# {'MN908947.3': 'ca5e95436b957f93', 'BA.2.86.1': 'd5f014ee6745cb77'}
-# 837cfd6836b9a406
+$ seqsum -h
 ```

pyproject.toml

Lines changed: 0 additions & 36 deletions
This file was deleted.

src/seqsum/__init__.py

Lines changed: 0 additions & 2 deletions
This file was deleted.
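
The README text in this diff says sequences are uppercased and may be normalised with `-n` to use only `ACGTN-`. The removed Python help text described that as replacing U with T and any character other than ACGT- with N. Whether the Rust rewrite keeps the U-to-T step is not visible in these hunks, so the sketch below assumes it does. The function name `normalise` is hypothetical.

```rust
/// Uppercase a sequence and restrict it to the alphabet ACGTN-.
/// ASSUMPTION: follows the rule from the removed Python help text
/// (U becomes T, anything outside ACGT- becomes N); the Rust code in
/// this commit is not shown and may differ.
fn normalise(seq: &[u8]) -> Vec<u8> {
    seq.iter()
        .map(|b| match b.to_ascii_uppercase() {
            b'U' => b'T',
            c @ (b'A' | b'C' | b'G' | b'T' | b'-') => c,
            _ => b'N', // IUPAC ambiguity codes, N itself, and anything else
        })
        .collect()
}

fn main() {
    let normalised = normalise(b"acgu-RYn");
    println!("{}", String::from_utf8_lossy(&normalised));
}
```

Normalising before hashing means RNA and DNA spellings of the same sequence, or inputs that differ only in ambiguity codes, would receive the same checksum under this rule.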
