diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
Lines changed: 12 additions & 14 deletions
diff --git a/.gitignore b/.gitignore
Lines changed: 4 additions & 135 deletions
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
Lines changed: 12 additions & 6 deletions
diff --git a/README.md b/README.md
Lines changed: 20 additions & 90 deletions
diff --git a/pyproject.toml b/pyproject.toml
Lines changed: 0 additions & 36 deletions
diff --git a/src/seqsum/__init__.py b/src/seqsum/__init__.py
Lines changed: 0 additions & 2 deletions
@@ -6,18 +6,16 @@ jobs:
     strategy:
       matrix:
         os: [ubuntu-latest, macos-latest]
-        python-version: ["3.10", "3.12"]
-    name: Python ${{ matrix.python-version }} (${{ matrix.os }})
+    name: Rust stable (${{ matrix.os }})
     steps:
-      - uses: actions/checkout@v3
-      - name: Set up Python ${{ matrix.python-version }}
-        uses: actions/setup-python@v4
-        with:
-          python-version: ${{ matrix.python-version }}
-      - name: Install dependencies
-        run: |
-          python -m pip install pytest
-          python -m pip install .
-      - name: Test with pytest
-        run: |
-          pytest
+      - uses: actions/checkout@v4
+      - name: Set up Rust
+        uses: dtolnay/rust-toolchain@stable
+      - name: Cache cargo registry and target
+        uses: Swatinem/rust-cache@v2
+      - name: Check format
+        run: cargo fmt --all --check
+      - name: Lint
+        run: cargo clippy --all-targets -- -D warnings
+      - name: Test
+        run: cargo test --locked
@@ -1,137 +1,6 @@
-# Byte-compiled / optimized / DLL files
-__pycache__/
-*.py[cod]
-*$py.class
-
-# C extensions
-*.so
-
-# Distribution / packaging
-.Python
-build/
-develop-eggs/
-dist/
-downloads/
-eggs/
-.eggs/
-lib/
-lib64/
-parts/
-sdist/
-var/
-wheels/
-pip-wheel-metadata/
-share/python-wheels/
-*.egg-info/
-.installed.cfg
-*.egg
-MANIFEST
-
-# PyInstaller
-#  Usually these files are written by a python script from a template
-#  before PyInstaller builds the exe, so as to inject date/other infos into it.
-*.manifest
-*.spec
-
-# Installer logs
-pip-log.txt
-pip-delete-this-directory.txt
-
-# Unit test / coverage reports
-htmlcov/
-.tox/
-.nox/
-.coverage
-.coverage.*
-.cache
-nosetests.xml
-coverage.xml
-*.cover
-*.py,cover
-.hypothesis/
-.pytest_cache/
-
-# Translations
-*.mo
-*.pot
-
-# Django stuff:
-*.log
-local_settings.py
-db.sqlite3
-db.sqlite3-journal
-
-# Flask stuff:
-instance/
-.webassets-cache
-
-# Scrapy stuff:
-.scrapy
-
-# Sphinx documentation
-docs/_build/
-
-# PyBuilder
 target/
-
-# Jupyter Notebook
-.ipynb_checkpoints
-
-# IPython
-profile_default/
-ipython_config.py
-
-# pyenv
-.python-version
-
-# pipenv
-#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
-#   However, in case of collaboration, if having platform-specific dependencies or dependencies
-#   having no cross-platform support, pipenv may install dependencies that don't work, or not
-#   install all needed dependencies.
-#Pipfile.lock
-
-# PEP 582; used by e.g. github.com/David-OConnor/pyflow
-__pypackages__/
-
-# Celery stuff
-celerybeat-schedule
-celerybeat.pid
-
-# SageMath parsed files
-*.sage.py
-
-# Environments
-.env
-.venv
-env/
-venv/
-ENV/
-env.bak/
-venv.bak/
-
-# Spyder project settings
-.spyderproject
-.spyproject
-
-# Rope project settings
-.ropeproject
-
-# mkdocs documentation
-/site
-
-# mypy
-.mypy_cache/
-.dmypy.json
-dmypy.json
-
-# Pyre type checker
-.pyre/
-
 .DS_Store
-
-.vscode
-
-100k.ERR3239334_1.fastq
-1m.ERR3239334_1.fastq
-1m.ERR3239334_1.fastq.gz
+.vscode/
+tests/data/100k.ERR3239334_1.fastq
+tests/data/1m.ERR3239334_1.fastq
+tests/data/1m.ERR3239334_1.fastq.gz
@@ -1,7 +1,13 @@
 repos:
-- repo: https://github.com/astral-sh/ruff-pre-commit
-  rev: v0.1.5
-  hooks:
-    - id: ruff
-      args: [ --fix ]
-    - id: ruff-format
+  - repo: local
+    hooks:
+      - id: cargo-fmt
+        name: cargo fmt
+        entry: cargo fmt --all --check
+        language: system
+        pass_filenames: false
+      - id: cargo-clippy
+        name: cargo clippy
+        entry: cargo clippy --all-targets -- -D warnings
+        language: system
+        pass_filenames: false
@@ -1,124 +1,54 @@
-[![Tests](https://github.com/bede/seqsum/actions/workflows/test.yml/badge.svg)](https://github.com/bede/seqsum/actions/workflows/test.yml) [![PyPI version](https://img.shields.io/pypi/v/seqsum)](https://pypi.org/project/seqsum)
+[![Tests](https://github.com/bede/seqsum/actions/workflows/test.yml/badge.svg)](https://github.com/bede/seqsum/actions/workflows/test.yml)
 
 # Seqsum
 
-Robust checksums for nucleotide sequences. Accepts data from either standard input or `fast[a|q][.gz|.zst|.xz|.bz2]` files. Generates *individual* checksums for each sequence, and an *aggregate* checksum-of-checksums for a collection of sequences. Warnings are shown for duplicate sequences and within-collection checksum collisions at the chosen bit depth. Sequences are uppercased prior to hashing with [xxHash](https://github.com/ifduyue/python-xxhash) (`xxh3_128`) and may be normalised (with `-n`) to use only the characters `ACGTN-`. Read IDs and FASTQ base quality scores do not inform the checksum. Outputs tab delimited text or JSON to stdout.
-
-A typical use case is determining whether reordered, renamed or otherwise bit-inexact fasta/fastq files have equivalent sequence composition. Another use is generating the shortest possible collision-free identifiers for sequence collections.
-
-By default, seqsum outputs both individual and aggregate checksums when supplied with more than one sequence. This can be modified with the flags `--individual` (`-i`) or `--aggregate` (`-a`).
-
-Uses the excellent library [`dnaio`](https://github.com/marcelm/dnaio) efficient sequence parsing.
+Robust checksums for nucleotide sequences. Accepts input from either standard input or `fast[a|q][.gz|.zst|.xz|.bz2]` files. Generates individual checksums for each sequence, plus an aggregate checksum for a collection. Warnings are shown for duplicate sequences and within-collection checksum collisions at the selected bit depth. Sequences are uppercased before hashing with [rapidhash](https://github.com/Nicoshev/rapidhash) (`v3`) and may be normalised (with `-n`) to use only `ACGTN-`. Read IDs and FASTQ base quality scores do not inform the checksum. Output is tab-delimited text to stdout.
 
+By default, seqsum outputs individual checksums and, when there is more than one sequence, an aggregate checksum. This can be modified with `--individual` (`-i`) or `--aggregate` (`-a`).
 
+Uses [`paraseq`](https://github.com/mbhall88/paraseq) for efficient FASTA/FASTQ parsing.
 
 ## Install
 
-Installation inside a clean Python 3.10+ virtualenv (or conda environment) is recommended. To open `.zst` archives, you will also need to `pip install zstandard`.
-
 ```bash
-pip install seqsum
+cargo install --path .
 ```
 
-
-
 ## Development
 
-Development uses [uv](https://docs.astral.sh/uv/).
-
 ```bash
 git clone https://github.com/bede/seqsum.git
 cd seqsum
-uv sync
-uv run pytest
-uv run pre-commit install
-uv run pre-commit run --all-files
+cargo test
+cargo fmt --all --check
+cargo clippy --all-targets -- -D warnings
 ```
 
-
-
 ## Command line usage
 
 ```bash
 # Fasta with one record
-$ seqsum nt MN908947.fasta
-MN908947.3	ca5e95436b957f93
+$ seqsum tests/data/MN908947.fasta
+33ba13564e0a63e3	MN908947.3
 
 # Fasta with two records
-$ seqsum nt MN908947-BA_2_86_1.fasta
-MN908947.3	ca5e95436b957f93
-BA.2.86.1	d5f014ee6745cb77
-aggregate	837cfd6836b9a406
+$ seqsum tests/data/MN908947-BA_2_86_1.fasta
+33ba13564e0a63e3	MN908947.3
+9fef3b61d54d8902	BA.2.86.1
+d3a94eb82357ece5	aggregate
 
-# Fasta with two records, only show the aggregate checksum
-$ seqsum nt -a MN908947-BA_2_86_1.fasta
-aggregate	837cfd6836b9a406
+# Fasta with two records, only show aggregate checksum
+$ seqsum tests/data/MN908947-BA_2_86_1.fasta --aggregate
+d3a94eb82357ece5	aggregate
 
 # Fasta via stdin
-% cat MN908947.fasta | seqsum nt -
-MN908947.3	ca5e95436b957f93
+$ cat tests/data/MN908947.fasta | seqsum -
+33ba13564e0a63e3	MN908947.3
 
-# Fastq (gzipped) with 1m records, redirected to file, with progress bar
-$ seqsum nt illumina.r12.fastq.gz --progress > checksums.tsv
-Processed 1000000 records (317kit/s)
-INFO: Found duplicate sequences
 ```
 
 **Built-in help**
 
 ```bash
-$ seqsum nt -h                       
-usage: seqsum nt [-h] [-n] [-s] [-b BITS] [-i] [-a] [-j] [-p] input
-
-Robust individual and aggregate checksums for nucleotide sequences. Accepts input
-from either stdin or fast[a|q][.gz|.zst|.xz|.bz2] files. Generates individual
-checksums for each sequence, and an aggregate checksum-of-checksums for a
-collection of sequences. Warnings are shown for duplicate sequences and
-within-collection checksum collisions at the chosen bit depth. Sequences are
-uppercased prior to hashing with xxHash and may optionally be normalised to use only
-the characters ACGTN-. Read IDs and base quality scores do not inform the checksum
-
-positional arguments:
-  input                 path to fasta/q file (or - for stdin)
-
-options:
-  -h, --help            show this help message and exit
-  -n, --normalise       replace U with T and characters other than ACGT- with N
-                        (default: False)
-  -s, --strict          raise error for characters other than ABCDGHKMNRSTVWY-
-                        (default: False)
-  -b BITS, --bits BITS  displayed checksum length
-                        (default: 64)
-  -i, --individual      output only individual checksums
-                        (default: False)
-  -a, --aggregate       output only aggregate checksum
-                        (default: False)
-  -j, --json            output JSON
-                        (default: False)
-  -p, --progress        show progress and speed
-                        (default: False)
-```
-
-
-
-## Python usage
-
-```python
-from seqsum import lib
-
-checksums, aggregate_checksum = lib.sum_nt(">read1\nACGT")
-print(checksums)
-# {'read1': '81db282b97e7dfd1'}
-```
-
-```python
-from pathlib import Path
-from seqsum import lib
-
-fasta_path = Path("tests/data/MN908947-BA_2_86_1.fasta")
-checksums, aggregate_checksum = lib.sum_nt(fasta_path)
-print(checksums)
-print(aggregate_checksum)
-# {'MN908947.3': 'ca5e95436b957f93', 'BA.2.86.1': 'd5f014ee6745cb77'}
-# 837cfd6836b9a406
+$ seqsum -h
 ```
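
The README text in this diff says sequences are uppercased before hashing, may be normalised with `-n` to `ACGTN-`, and are summarised by an aggregate checksum. A minimal Rust sketch of that pipeline follows, under stated assumptions: the function names `prepare`, `checksum` and `aggregate` are hypothetical, the normalisation rule (U to T, anything outside `ACGT-` to N) is taken from the removed Python help text, `DefaultHasher` from the standard library stands in for rapidhash, and sorting the individual checksums before hashing them is only one plausible way to build an order-independent aggregate. It is not the code in this PR and will not reproduce the checksums shown in the README.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Uppercase a sequence and, if `normalise` is set, map U to T and any
/// byte outside ACGT- to N (rule assumed from the old `-n` help text).
fn prepare(seq: &[u8], normalise: bool) -> Vec<u8> {
    seq.iter()
        .map(|b| {
            let up = b.to_ascii_uppercase();
            if !normalise {
                return up;
            }
            match up {
                b'A' | b'C' | b'G' | b'T' | b'-' => up,
                b'U' => b'T',
                _ => b'N',
            }
        })
        .collect()
}

/// Stand-in 64-bit checksum of the prepared sequence. seqsum itself uses
/// rapidhash; DefaultHasher is used here only to keep the sketch std-only.
fn checksum(seq: &[u8], normalise: bool) -> u64 {
    let mut hasher = DefaultHasher::new();
    hasher.write(&prepare(seq, normalise));
    hasher.finish()
}

/// Hypothetical aggregate: sort the individual checksums, then hash them,
/// so the result does not depend on record order.
fn aggregate(checksums: &[u64]) -> u64 {
    let mut sorted = checksums.to_vec();
    sorted.sort_unstable();
    let mut hasher = DefaultHasher::new();
    for c in &sorted {
        hasher.write(&c.to_le_bytes());
    }
    hasher.finish()
}
```

Because read IDs never reach `checksum`, renaming records leaves the values unchanged, and because `aggregate` sorts its input, reordering records does too. Both properties match the behaviour the README describes, though the real implementation may achieve them differently.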