Provenant is a Rust reimplementation of ScanCode Toolkit focused on trustworthy feature parity, explicit behavioral documentation, and targeted improvements where Rust makes the result safer or easier to maintain.
- Strong compatibility goals: preserve ScanCode behavior where users depend on it
- Better performance: native code, parallel processing, and efficient parsing
- Enhanced security: no code execution and explicit DoS protection
- Intentional improvements: document deliberate Rust-side enhancements and any remaining parity gaps clearly
See SUPPORTED_FORMATS.md for the full list of supported ecosystems and formats.
"always prefer correctness and full feature parity over effort/pragmatism"
- Every feature, edge case, and requirement from Python ScanCode must be preserved
- Zero tolerance for bugs - identify and fix issues from the original
- Comprehensive test coverage across unit, golden, scanner-contract, and integration layers
- No code execution: AST parsing only, never eval/exec
- DoS protection: Explicit limits on file size, recursion, iterations
- Archive safety: Zip bomb prevention, compression ratio validation
- Input validation: Robust error handling, graceful degradation
See ADR 0004: Security-First Parsing for details.
Critical separation of concerns:
- Parsers extract raw data from manifests and may normalize trustworthy declared package-license metadata
- Detection engines normalize and analyze file-content license text and broader detection inputs
Parsers still MUST NOT:
- Run broad fuzzy license-text matching over file content
- Extract copyright holders from file content (detection engine's job)
- Backfill package declared licenses from sibling files or file detections silently
Parsers MAY populate declared_license_expression, declared_license_expression_spdx, and deterministic parser-side license_detections when the source field is a bounded, trustworthy declared-license surface such as an SPDX-expression-compatible manifest field.
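As an illustration of that boundary, here is a minimal sketch (the type and function names are hypothetical, not the actual Provenant model): the parser copies a bounded declared-license field and may normalize it, but never scans surrounding file content for license text.

```rust
// Hypothetical, simplified stand-in for the real package model.
#[derive(Default)]
struct PackageData {
    extracted_license_statement: Option<String>,
    declared_license_expression: Option<String>,
}

fn apply_declared_license(pkg: &mut PackageData, manifest_license: Option<&str>) {
    if let Some(raw) = manifest_license {
        // Keep the raw statement for provenance.
        pkg.extracted_license_statement = Some(raw.to_string());
        // Only normalize when the field is a bounded SPDX-style expression;
        // anything fuzzier is left to the detection engine.
        if looks_like_spdx_expression(raw) {
            pkg.declared_license_expression = Some(raw.trim().to_string());
        }
    }
}

fn looks_like_spdx_expression(s: &str) -> bool {
    // Placeholder check; a real implementation would parse the expression.
    !s.trim().is_empty() && s.len() < 512
}
```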
Most package extraction in Provenant is path-owned and flows through PackageParser or
recognizer registration. A small set of scanner-owned exceptions can exist when the package surface
is content-aware rather than filename-aware. The current example is compiled-binary package
detection behind --package-in-compiled: the scanner already has the file bytes in memory, raw
executables do not have stable manifest-like filenames, and the detector must stay explicitly
bounded and opt-in.
See ADR 0002: Extraction vs Detection Separation for details.
Provenant follows the same broad stage model as ScanCode, but the concrete implementation is narrower in a few places. In particular, Provenant primarily scans native paths and already-extracted inputs, while some archive-aware parsers inspect their own archive formats directly instead of relying on one universal pre-scan extraction stage.
- Input preparation
- collect input paths
- apply include/exclude rules and depth limits
- recognize extracted layouts and parser-specific archive surfaces where applicable
- Scanning
- package manifest and package-database parsing
- license detection
- copyright, email, and URL extraction
- Post-processing
- package assembly (sibling, nested, file-reference, workspace)
- summaries, tallies, classification, facets, generated-code handling
- Filtering and reshaping
- license-policy evaluation
- include/exclude and findings-only shaping over native scans or `--from-json` inputs
- Output
- ScanCode-style JSON / JSONL / YAML / HTML
- SPDX, CycloneDX, Debian copyright, and custom-template output
- Package Parsers: See SUPPORTED_FORMATS.md for complete list
- Scanner Pipeline: File discovery, parallel processing, progress tracking
- Security Layer: DoS protection, no code execution, archive safety
- Package Assembly: Sibling and nested merge strategies for combining related manifests
- Text Detection: License detection (n-gram matching), copyright detection (4-stage pipeline), email/URL extraction
- Post-Processing: Summarization, tallies, classification
- Output Schema: Dedicated serde-enabled types in `src/output_schema/` that define the ScanCode-compatible JSON schema, separate from internal domain types
- Output: JSON, JSON Lines, YAML, HTML, SPDX (TV/RDF), CycloneDX (JSON/XML), Debian copyright, and custom templates
- Testing Infrastructure: Doctests, unit tests, golden tests, parser-local scanner/assembly contract tests, and system integration tests
- Infrastructure: Caching, enhanced progress tracking, static integration points
This document stays architecture-focused. For concrete feature and support status, use:
- README.md for user-facing features and usage
- SUPPORTED_FORMATS.md for supported formats and ecosystems
- TESTING_STRATEGY.md for verification and regression approach
Python ScanCode uses a plugin-based architecture with 5 plugin types:
- PreScan Plugins: Archive extraction, file type detection
- Scan Plugins: Package detection, license detection, copyright detection
- PostScan Plugins: Package assembly, summarization, classification
- OutputFilter Plugins: License policy filtering, custom filters
- Output Plugins: Format-specific output (SPDX, CycloneDX, etc.)
Provenant keeps the same high-level stages, but wires them statically through trait-based parsers and explicit pipeline stages instead of a runtime plugin system.
Core abstraction: each parser exposes three durable concepts — its package type, a path-matching predicate, and an extraction entry point that returns one or more normalized PackageData values.
Benefits:
- Type-safe dispatch at compile time
- Zero runtime overhead
- Clear contract for all parsers
- Easy to test in isolation
Implementation pattern: parsers are usually zero-sized types with compile-time registration. The exact trait signature and helper APIs belong in code and the parser how-to guide, not in this architecture overview.
See ADR 0001: Trait-Based Parser Architecture for details.
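For orientation only, a sketch of those three durable concepts expressed as a trait. The names and signatures below are illustrative, not the real API, which lives in the code and the parser how-to guide:

```rust
use std::path::Path;

// Stand-in for the real normalized package model.
struct PackageData;

// Illustrative shape only: one package type, one path predicate,
// one extraction entry point returning normalized package data.
trait PackageParser {
    /// Stable package type name, e.g. "npm" or "cargo".
    const PACKAGE_TYPE: &'static str;

    /// Path-matching predicate: does this path look like our manifest?
    fn is_match(path: &Path) -> bool;

    /// Extraction entry point for a matched file.
    fn extract_packages(path: &Path) -> std::io::Result<Vec<PackageData>>;
}
```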
How parsers are wired to the scanner:
Parsers and recognizers are registered centrally in src/parsers/mod.rs through the package-handler registration macro.
What this macro generates:
- `try_parse_file(path: &Path) -> Option<ParsePackagesResult>` - Called by scanner for every file
  - Tries each parser's `is_match()` in order
  - Returns extracted packages plus parser diagnostics
- `parse_by_type_name(type_name: &str, path: &Path) -> Option<PackageData>` - Used by test utilities for golden test generation
  - Allows direct parser invocation by name
- `list_parser_types() -> Vec<&'static str>` - Returns all registered parser type names
  - Used by integration tests to verify registration
Critical: a parser that is not listed in this macro will never be called by the scanner, even if it is fully implemented and tested. Integration coverage verifies that parser registration stays aligned with the scanner entry points.
This registration path is intentionally for path-matched parsers and recognizers. Content-aware
scanner-owned package detectors, such as compiled-binary package extraction, are exceptional
surfaces wired from the scanner rather than through register_package_handlers!.
All parsers output the same normalized PackageData shape. The durable categories in that model are:
- identity: package type, namespace/name/version, qualifiers, PURL, datasource IDs
- metadata: description, language, release/homepage information, parties, keywords
- dependencies: dependency edges plus scope/optionality/runtime hints
- license metadata: extracted statements, declared expressions, and parser-owned declared-license detections
- provenance and references: checksums, repository/download/API URLs, source packages, file references, and extra ecosystem-specific metadata
The field-level schema evolves over time and is owned by the Rust model definitions, not this overview.
For dependency booleans and similar semantic hints, Provenant prefers honest unknowns over
invented certainty. If a manifest or lockfile does not prove flags such as is_runtime,
is_optional, is_direct, or is_pinned, the parser should leave them unset rather than
coercing ScanCode-style defaults into the core data model. Compatibility-oriented normalization, if
ever needed for stricter downstream consumers, belongs in an explicit output-layer decision rather
than in parser semantics.
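A minimal sketch of that rule, with hypothetical field names: flags stay `None` unless the input actually proves them.

```rust
// Hypothetical, simplified dependency model for illustration.
#[derive(Default)]
struct Dependency {
    purl: String,
    is_runtime: Option<bool>,
    is_optional: Option<bool>,
    is_pinned: Option<bool>,
}

fn dependency_from_lockfile(purl: &str, pinned_version: Option<&str>) -> Dependency {
    Dependency {
        purl: purl.to_string(),
        // A lockfile entry proves pinning only when it records an exact version.
        is_pinned: pinned_version.map(|_| true),
        // Nothing here proves runtime/optional semantics, so leave them unset
        // instead of coercing ScanCode-style defaults.
        ..Dependency::default()
    }
}
```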
Rationale:
- Normalizes differences across all supported ecosystems
- SBOM-compliant output format
- Single source of truth for structure
The scanner also owns a small number of opt-in content-aware package detector paths in addition to the normal parser/recognizer dispatch. Those paths should reuse the scanner's already-read bytes, remain explicitly bounded, and carry their own scanner-contract and golden coverage because they do not travel through the standard parser registry.
┌────────────────────────────────────────────────────────────┐
│ Provenant │
├────────────────────────────────────────────────────────────┤
│ │
│ 1. File Discovery 2. Parser Selection │
│ ┌────────────────┐ ┌───────────────┐ │
│ │ Walk directory │─────────>│ Match file │ │
│ │ Apply filters │ │ to parser │ │
│ └────────────────┘ └───────┬───────┘ │
│ │ │
│ 3. Extraction v │
│ ┌────────────────────────────────────────────┐ │
│ │ PackageParser::extract_packages() │ │
│ │ ─ Read manifest │ │
│ │ ─ Parse structure │ │
│ │ ─ Extract metadata │ │
│ │ ─ Return PackageData │ │
│ └────────────────┬───────────────────────────┘ │
│ │ │
│ 4. Output v │
│ ┌─────────────────────────────────────┐ │
│ │ Output format dispatch │ │
│ │ ─ JSON / YAML / JSONL │ │
│ │ ─ SPDX / CycloneDX / HTML / template│ │
│ └─────────────────────────────────────┘ │
│ │
│ Detection Engines (Integrated) │
│ ┌───────────────────┐ ┌──────────────────┐ │
│ │ License Detection │ │ Copyright │ │
│ │ ─ SPDX normalize │ │ Detection │ │
│ │ ─ Confidence │ │ ─ Holder extract │ │
│ │ ─ Score threshold │ │ ─ Author extract │ │
│ └───────────────────┘ └──────────────────┘ │
└────────────────────────────────────────────────────────────┘
The scanner uses rayon to process files in parallel. At a high level, each worker:
- selects the relevant parser or recognizer for the file
- extracts package data when applicable
- runs enabled text-detection stages
- records scan errors and progress for that file
Benefits:
- Utilizes all CPU cores
- Maintains thread safety (Rust ownership guarantees)
- Progress tracking with atomic operations
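A minimal sketch of that worker loop, assuming the rayon crate (`ScanResult` and `scan_one_file` are stand-ins for parser dispatch plus the enabled text-detection stages):

```rust
use rayon::prelude::*;
use std::path::{Path, PathBuf};
use std::sync::atomic::{AtomicUsize, Ordering};

struct ScanResult;

// Stand-in for parser selection, extraction, and text detection.
fn scan_one_file(_path: &Path) -> ScanResult {
    ScanResult
}

fn scan_all(files: &[PathBuf]) -> Vec<ScanResult> {
    let processed = AtomicUsize::new(0);
    files
        .par_iter()
        .map(|path| {
            let result = scan_one_file(path);
            // Thread-safe progress tracking via an atomic counter.
            processed.fetch_add(1, Ordering::Relaxed);
            result
        })
        .collect()
}
```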
After scanning, the assembly system merges related manifests into logical packages using DatasourceId-based matching.
Assembly layers:
- SiblingMerge: Combines sibling files in the same directory (e.g., `package.json` + `package-lock.json` → single npm package)
- NestedMerge: Combines parent/child manifests across directories (e.g., Maven parent POM + module POMs)
- TopologyPlan: Claims directories or multi-directory domains whose package boundaries are defined by project structure instead of plain sibling files (e.g., npm/pnpm workspaces, Cargo workspaces, `go.work`, `pixi.toml`, Hackage project roots)
- FileRefResolve: Resolves `file_references` from package database entries (RPM/Alpine/Debian) against scanned files, sets `for_packages` on matched files, tracks missing references, and resolves RPM namespace from os-release
- Post-assembly passes: Final targeted repair or enrichment steps that still need whole-scan context (for example file-reference resolution and the remaining workspace-specific finalization hooks)
How it works:
- Each `AssemblerConfig` declares which `DatasourceId` variants belong together and which file patterns to look for.
- After scanning, assembly groups package-bearing files by directory.
- A topology-planning phase inspects parser-emitted structural hints and claims directories or multi-directory domains whose package boundaries are project-defined instead of purely sibling-defined.
- Unclaimed directories continue through the default sibling or nested assembly paths, and combined packages aggregate `datafile_paths` and `datasource_ids` from all contributing files.
- Claimed topology domains execute with the existing ecosystem-specific assemblers or mergers, but they do so from an explicit plan instead of first creating packages in the generic path and then repairing them later.
- File reference resolution matches installed-package database entries to files on disk (e.g., Alpine `installed` DB lists files belonging to each package).
- Post-assembly passes handle the remaining whole-scan cases that still need them. npm/pnpm and Cargo still finalize workspace-specific dependency/resource behavior there, but their roots and members are now planned before the generic directory loop runs.
Assembly is configurable via the --no-assemble CLI flag. See src/assembly/ for implementation details.
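As a sketch of the grouping step that feeds sibling assembly (`ParsedFile` is a hypothetical stand-in for a package-bearing scan entry):

```rust
use std::collections::BTreeMap;
use std::path::{Path, PathBuf};

// Hypothetical package-bearing scan entry.
struct ParsedFile {
    path: PathBuf,
}

fn group_by_directory(files: &[ParsedFile]) -> BTreeMap<&Path, Vec<&ParsedFile>> {
    let mut groups: BTreeMap<&Path, Vec<&ParsedFile>> = BTreeMap::new();
    for file in files {
        // Siblings share a parent directory; topology planning may claim
        // whole directory domains before this generic grouping runs.
        let dir = file.path.parent().unwrap_or_else(|| Path::new(""));
        groups.entry(dir).or_default().push(file);
    }
    groups
}
```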
┌─────────────────────────────────────────────────────────┐
│ Security Layers │
├─────────────────────────────────────────────────────────┤
│ │
│ Layer 1: No Code Execution │
│ ┌────────────────────────────────────────────────┐ │
│ │ AST parsing only (setup.py, build.gradle) │ │
│ │ Never eval/exec/subprocess │ │
│ │ Regex/token-based for DSLs │ │
│ └────────────────────────────────────────────────┘ │
│ │
│ Layer 2: Resource Limits │
│ ┌────────────────────────────────────────────────┐ │
│ │ File size: 100MB max │ │
│ │ Recursion depth: 50 levels │ │
│ │ Iterations: 100,000 max │ │
│ │ String length: 10MB per field │ │
│ └────────────────────────────────────────────────┘ │
│ │
│ Layer 3: Archive Safety │
│ ┌────────────────────────────────────────────────┐ │
│ │ Uncompressed size: 1GB max │ │
│ │ Compression ratio: 100:1 max (zip bomb detect) │ │
│ │ Path traversal: Block ../ patterns │ │
│ │ Temp cleanup: Automatic via TempDir │ │
│ └────────────────────────────────────────────────┘ │
│ │
│ Layer 4: Input Validation │
│ ┌────────────────────────────────────────────────┐ │
│ │ Result<T, E> error handling │ │
│ │ No .unwrap() in library code │ │
│ │ Graceful degradation on errors │ │
│ │ UTF-8 validation with lossy fallback │ │
│ └────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
The exact numeric thresholds are implementation details. Treat the code and tests as the canonical source for current limits; this document focuses on the architectural safety layers they enforce.
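For illustration, a sketch of how such a limit layer can be expressed. The numeric values below are placeholders mirroring the diagram, not the canonical limits:

```rust
// Illustrative resource-limit layer; the code and tests own the real values.
struct Limits {
    max_file_size: u64,
    max_recursion_depth: usize,
}

impl Default for Limits {
    fn default() -> Self {
        Limits {
            max_file_size: 100 * 1024 * 1024, // placeholder, not canonical
            max_recursion_depth: 50,          // placeholder, not canonical
        }
    }
}

fn check_file_size(limits: &Limits, size: u64) -> Result<(), String> {
    if size > limits.max_file_size {
        // Degrade gracefully: report a scan error instead of panicking.
        return Err(format!("file too large: {size} bytes"));
    }
    Ok(())
}
```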
See ADR 0004: Security-First Parsing for comprehensive security analysis.
         /\
        /  \   Layer 4: System integration tests
       /----\  Layer 3: Scanner/assembly contract tests
      /      \
     / Golden \   Layer 2: Golden tests
    /----------\
   /    Unit    \   Layer 1: Unit tests
  /--------------\
 /    Doctests    \   Layer 0: API documentation examples
/__________________\
Five layers (see TESTING_STRATEGY.md for full details):
- Layer 0 — Doctests: API documentation examples that run as tests
- Layer 1 — Unit Tests: Component-level tests for individual functions and edge cases
- Layer 2 — Golden Tests: Fixture-backed regression tests for parser and subsystem contracts
- Layer 3 — Scanner/Assembly Contract Tests: parser-local tests that prove extracted data survives real scanner wiring and assembly
- Layer 4 — System Integration Tests: end-to-end tests validating user-facing behavior across the full scanner pipeline
See ADR 0003: Golden Test Strategy for golden test details.
┌─────────────────────────────────────────────────────────┐
│ Documentation Sources │
└─────────────────────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌──────────────┐ ┌────────────┐
│ Parser │ │ Doc Comments │ │ Manual │
│ Metadata │ │ (/// //!) │ │ Markdown │
│ (code) │ │ │ │ Files │
└──────┬──────┘ └──────┬───────┘ └──────┬─────┘
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌──────────────┐ ┌────────────┐
│ Auto-Gen │ │ cargo doc │ │ GitHub │
│ Formats.md │ │ (docs.rs) │ │ README │
└─────────────┘ └──────────────┘ └────────────┘
Auto-Generated: docs/SUPPORTED_FORMATS.md (from parser metadata)
API Reference: cargo doc (from /// and //! comments)
Architecture: ADRs, improvements, guides (manual Markdown)
See ADR 0005: Auto-Generated Documentation for details.
We don't just match Python ScanCode - we improve it:
| Parser | Improvement | Type |
|---|---|---|
| Alpine | SHA1 checksums correctly decoded + Provider field extraction | 🐛 Bug Fix + ✨ Feature |
| RPM | Full dependency extraction with version constraints | ✨ Feature |
| Debian | .deb archive introspection | ✨ Feature |
| Conan | conanfile.txt and conan.lock parsers (Python has neither) | ✨ Feature |
| Gradle | No code execution (token lexer vs Groovy engine) | 🛡️ Security |
| Gradle Lockfile | gradle.lockfile parser (Python has no equivalent) | ✨ Feature |
| Maven | SCM developerConnection separation, inception_year, renamed extra_data keys for consistency | 🔍 Enhanced |
| npm Workspace | pnpm-workspace.yaml extraction + workspace assembly with per-member packages (Python has stub parser + basic assembly) | ✨ Feature |
| Cargo Workspace | Full [workspace.package] metadata inheritance + workspace = true dependency resolution (Python has basic assembly) | ✨ Feature |
| Composer | Richer provenance metadata (7 extra fields) | 🔍 Enhanced |
| Ruby | Semantic party model (unified name+email) | 🔍 Enhanced |
| Dart | Proper scope handling + YAML preservation | 🔍 Enhanced |
| CPAN | Full metadata extraction (Python has stubs only) | ✨ Feature |
| Copyright Detection | Year range 2099 (was 2039), regex bug fixes, type-safe POS tags, thread-safe design, Unicode preservation, encoded-data skip | 🐛 Bug Fix + 🔍 Enhanced + ⚡ Performance |
| Assembly | LazyLock static assembler lookup (zero allocation per call) | ⚡ Performance |
See docs/improvements/ for detailed documentation of each improvement.
The codebase follows a modular architecture:
- `src/parsers/` - Package manifest parsers (one per ecosystem)
- `src/models/` - Core data structures (PackageData, Dependency, DatasourceId, etc.)
- `src/output_schema/` - ScanCode-compatible output schema types (one file per type, with serde for JSON output)
- `src/assembly/` - Package assembly system (merging related manifests)
- `src/scanner/` - File system traversal and orchestration
- `docs/` - Architecture decisions, improvement docs, and guides
- `testdata/` - Test manifests for validation
- `reference/` - Python ScanCode Toolkit (reference submodule)
For broad performance-sensitive changes, maintainers use cargo run --manifest-path xtask/Cargo.toml --bin benchmark-target -- ... with an explicit target (--repo-url or --target-path) to measure scanner behavior. Smaller changes usually rely on targeted regression suites plus normal scan-time profiling during development.
- Parallel Processing: Uses all CPU cores via rayon
- Zero-Copy Parsing: `&str` instead of `String` where possible
- Embedded License Artifact: License loader snapshot embedded via `include_bytes!`
- Lazy Evaluation: Iterators instead of eager Vec building
- Efficient Parsers: quick-xml, toml, serde_json (production-grade)
```toml
[profile.release]
lto = true              # Link-time optimization
codegen-units = 1       # Single codegen unit for max optimization
strip = true            # Strip symbols for smaller binary
opt-level = 3           # Maximum optimization
```

The following sections describe major architectural components in detail.
License Detection:
- License text matching using fingerprinting algorithms
- SPDX license expression generation with boolean simplification of equivalent expressions
- Confidence scoring and multi-license handling
- Integration with existing SPDX license data
Copyright Detection:
The copyright detection engine extracts copyright statements, holder names, and author information from source files using a four-stage pipeline:
┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ 1. Text      │───>│ 2. Candidate │───>│ 3. Lex +     │───>│ 4. Tree      │
│ Preparation  │    │ Selection    │    │ Parse        │    │ Walk +       │
│              │    │              │    │              │    │ Refinement   │
└──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘
- Text Preparation: Normalizes copyright symbols (`©`, `(c)`, HTML entities), strips comment markers and markup, preserves Unicode (no ASCII transliteration)
- Candidate Selection: Filters lines using hint markers (`opyr`, `auth`, `©`, year patterns), groups multi-line statements, and skips encoded or non-promising content early
- Lexing + Parsing: POS-tags tokens using an ordered pattern set (type-safe `PosTag` enum), then applies grammar rules to build parse trees identifying `COPYRIGHT`, `AUTHOR`, `NAME`, `COMPANY` structures
- Tree Walk + Refinement: Extracts `CopyrightDetection`, `HolderDetection`, `AuthorDetection` from parse trees, then applies cleanup (for example unbalanced parens, duplicate "Copyright" words, and junk patterns)
Key design decisions vs Python reference:
- Type-safe POS tags: Enum-based (not string-based) — compiler catches tag typos
- Thread-safe: No global mutable state (Python uses a singleton `DETECTOR`)
- Sequential pattern matching: `LazyLock<Vec<(Regex, PosTag)>>` with first-match-wins semantics (RegexSet cannot preserve match order); see the sketch after this list
- Extended year range: 1960-2099 (Python stops at 2039)
- Bug fixes: Fixed year-year separator bug, short-year typo, French/Spanish case-sensitivity, duplicate patterns
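A minimal sketch of that ordered tagger, assuming the regex crate. The tags and patterns below are a tiny illustrative subset, not the real set:

```rust
use regex::Regex;
use std::sync::LazyLock;

#[derive(Clone, Copy, Debug, PartialEq)]
enum PosTag {
    Copyright,
    Year,
    Name,
}

static PATTERNS: LazyLock<Vec<(Regex, PosTag)>> = LazyLock::new(|| {
    vec![
        // Order matters: earlier entries win, which RegexSet cannot guarantee.
        (Regex::new(r"(?i)^copyright$|^\(c\)$|^©$").unwrap(), PosTag::Copyright),
        // 1960-2099, matching the extended year range described above.
        (Regex::new(r"^(19[6-9]\d|20\d\d)$").unwrap(), PosTag::Year),
        (Regex::new(r"^[A-Z][A-Za-z.-]+$").unwrap(), PosTag::Name),
    ]
});

fn tag_token(token: &str) -> Option<PosTag> {
    // First-match-wins over the ordered pattern list.
    PATTERNS.iter().find(|(re, _)| re.is_match(token)).map(|(_, tag)| *tag)
}
```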
Special cases handled:
- Linux CREDITS files (structured `N:`/`E:`/`W:` format)
- SPDX-FileCopyrightText and SPDX-FileContributor
- "All Rights Reserved" in English, German, French, Spanish, Dutch
- Multi-line copyright statements spanning consecutive lines
Behavioral compatibility model:
- Default expectation: Follow Python ScanCode behavior closely for copyright, holder, and author extraction.
- Intentional Rust differences: Preserve Unicode names, apply correctness bug fixes from the Python reference, and keep detection thread-safe for parallel scans.
- Known parity gaps: Some edge-case files still differ from Python output; these are treated as targeted follow-up work with regression tests.
- Fixture ownership: Copyright golden fixtures in this repository are Rust-owned expectations; Python fixtures are a reference input, not the source of truth for local expected outputs.
Migration expectation:
- Most projects should observe equivalent results to Python ScanCode.
- Where differences exist, they are either intentional improvements (for example Unicode preservation) or explicitly tracked parity gaps.
Module location: src/copyright/
Email/URL Detection:
The email/URL detection engine is the simplest text detection feature — regex-based extraction with an ordered filter pipeline to remove junk results.
┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ 1. Read      │───>│ 2. Regex     │───>│ 3. Filter    │───>│ 4. Yield     │
│ Lines        │    │ Match        │    │ Pipeline     │    │ Results      │
└──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘
Email detection: RFC-ish regex (`[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,63}`) → 3-step filter pipeline (junk domain filter, uninteresting email filter, dedup).
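A minimal sketch of that email stage using the quoted pattern, assuming the regex crate. The junk-domain check is a simplified stand-in for the real filter pipeline:

```rust
use regex::Regex;

fn find_emails(text: &str, max: usize) -> Vec<String> {
    // Case-insensitive version of the pattern quoted above.
    let re = Regex::new(r"(?i)\b[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,63}\b").unwrap();
    let mut seen = std::collections::HashSet::new();
    re.find_iter(text)
        .map(|m| m.as_str().to_lowercase())
        .filter(|e| !e.ends_with("@example.com")) // junk-domain filter (illustrative)
        .filter(|e| seen.insert(e.clone()))       // dedup
        .take(max)                                // --max-email threshold
        .collect()
}
```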
URL detection: Three regex alternatives (scheme URLs, bare-domain URLs, git-style URLs) → 10-step filter pipeline:
- CRLF cleanup → trailing junk stripping → empty URL filter → scheme addition → user/password stripping → invalid URL filter → canonicalization (via `url` crate) → junk host filter → junk URL filter → dedup
Both support configurable thresholds (--max-email N, --max-url N, default 50).
Golden regression coverage for this module uses local, repo-owned fixtures and a dedicated finder golden-test harness.
Key design decisions vs Python reference:
- `url` crate for URL parsing/canonicalization (replaces `urlpy`)
- `std::net` for IP classification (replaces `ipaddress`)
- Extended TLD support: `{2,63}` per RFC 1035 (Python's `{2,4}` rejects `.museum`, `.technology`)
- Fixed IPv6 private detection: Python has an assignment bug that makes IPv6 private detection non-functional
- Proper error handling: No silent exception swallowing in URL canonicalization
Junk classification data (~150 entries): example domains, private IPs, W3C/XML namespaces, DTD URLs, PKI/certificate URLs, CDN URLs, image file suffixes.
Module location: src/finder/
Compatibility-Oriented Consolidation (Deferred):
- Legacy-compatible grouped package/resource view from ScanCode's `--consolidate`
- Not part of the current Provenant roadmap
- Retained only as a documented future compatibility decision, not as active architecture
Summarization:
- License tallies and facets
- Copyright holder aggregation
- File classification (source, docs, data, etc.)
- Summary statistics
Internal types vs. output schema types:
Provenant separates internal domain types from the ScanCode-compatible JSON output schema:
- Internal types (`src/models/`) carry domain invariants (e.g., `LineNumber` wraps `NonZeroUsize`, `Sha1Digest` validates hex length). They retain serde only for cache round-tripping and `--from-json` deserialization.
- Output schema types (`src/output_schema/`) are dedicated serde-enabled types that define the public JSON schema: field renames, conditional omission, type widening (`LineNumber` → `u64`, digests → `Option<String>`), and the `FileInfo` info-surface gating logic.
- Conversion boundary in `main.rs` converts `models::Output` → `output_schema::Output` before serialization. The `--from-json` path deserializes into output schema types and converts back via `TryFrom` (see the sketch after this list).
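A heavily reduced sketch of that boundary, assuming serde with the derive feature (the types are illustrative stand-ins, not the real definitions):

```rust
use std::num::NonZeroUsize;

mod models {
    // Internal type: the invariant "line numbers start at 1" is in the type.
    pub struct Match {
        pub start_line: super::NonZeroUsize,
    }
}

mod output_schema {
    // Output type: plain JSON-friendly value, serde-enabled.
    #[derive(serde::Serialize)]
    pub struct Match {
        pub start_line: u64,
    }
}

impl From<models::Match> for output_schema::Match {
    fn from(m: models::Match) -> Self {
        // Widening is lossless here; the reverse (--from-json) direction
        // uses TryFrom because zero would violate the domain invariant.
        output_schema::Match { start_line: m.start_line.get() as u64 }
    }
}
```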
See ADR 0008: Output Schema Type Separation for the full decision record.
Implementation and parity tracking:
- Multi-format output layer is implemented in `src/output/mod.rs`
- CLI follows ScanCode-style output flags (for example `--json-pp FILE`, `--spdx-tv FILE`) and dispatches through `write_output_file`
- Format compatibility is verified through fixture-backed tests and documented in `docs/TESTING_STRATEGY.md`
SBOM Formats:
- SPDX: Tag-value and RDF/XML
- CycloneDX: JSON, XML
- Compatibility with SBOM tooling ecosystem
Additional Formats:
- YAML (human-readable)
- HTML report
- Custom templates (user-defined formats)
Plugin System:
- No runtime plugin system is planned for Provenant
- Compile-time integration points are preferred over a public plugin ABI
- Revisit only if concrete extension needs justify the complexity
Caching:
Provenant uses one shared persistent cache root for both the opt-in incremental manifests stored
under incremental/ and the reusable license index cache stored under license-index/.
The cache implementation lives in src/cache/ (config, io, locking, incremental). It
provides cache-root selection, sidecar lock coordination for cache writes/clears, incremental
manifest persistence, and atomic persistence helpers reused by cache writers.
The intent is straightforward: repeated scans of the same checkout should reuse unchanged file results from the last completed scan instead of rescanning the whole tree every time.
User-facing behavior is:
- `--cache-dir` and `PROVENANT_CACHE` select the shared persistent cache root
- `--cache-clear` clears that root before scanning
- `--incremental` reuses unchanged file results from the last completed scan after validating stored metadata + SHA256 against the previous manifest (see the sketch below)
- license scans reuse a persistent license-index cache under that same root unless `--no-license-index-cache` is set
Custom --license-dataset-path scans still participate in the incremental manifest workflow, and their fingerprinted license-index cache entries also live under the shared license-index/custom/ namespace.
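A sketch of the reuse decision for one file, with illustrative manifest fields (the real manifest validates more metadata than this):

```rust
use std::path::Path;

// Hypothetical per-file entry from the previous scan's manifest.
struct ManifestEntry {
    size: u64,
    sha256: String,
}

fn can_reuse(entry: &ManifestEntry, path: &Path) -> std::io::Result<bool> {
    let meta = std::fs::metadata(path)?;
    // Cheap metadata check first; only hash when metadata still matches.
    if meta.len() != entry.size {
        return Ok(false);
    }
    let bytes = std::fs::read(path)?;
    Ok(sha256_hex(&bytes) == entry.sha256)
}

fn sha256_hex(_bytes: &[u8]) -> String {
    // Stand-in; the real code would use an actual SHA-256 implementation.
    String::new()
}
```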
Progress Tracking:
Centralized ScanProgress struct manages mode-aware progress output via indicatif::MultiProgress:
- Discovery phase: Spinner/message while counting files, recording initial file/dir/size counts.
- SPDX load phase: Startup message and timing capture around license DB load.
- Scan phase: Main progress bar (default mode, TTY only) with ETA, elapsed time, and `{per_sec}` throughput; verbose mode keeps file-by-file paths on TTY and degrades to bounded scan lifecycle messages plus per-file warning/error context when stderr is not a TTY (see the sketch after this list).
- Assembly and output phases: Phase messages/spinners with timing capture.
- Scan summary: Files/sec, bytes/sec, error count, initial/final counts (including sizes), package assembly counts, and per-phase timings.
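A minimal sketch of the main scan bar, assuming the indicatif crate (template keys such as `{per_sec}` and `{eta}` are standard indicatif placeholders; the template string itself is illustrative):

```rust
use indicatif::{MultiProgress, ProgressBar, ProgressStyle};

fn make_scan_bar(total_files: u64) -> (MultiProgress, ProgressBar) {
    let mp = MultiProgress::new();
    let bar = mp.add(ProgressBar::new(total_files));
    bar.set_style(
        // Position, total, throughput, and ETA, as described above.
        ProgressStyle::with_template("{bar:40} {pos}/{len} files ({per_sec}, eta {eta})")
            .expect("valid template"),
    );
    (mp, bar)
}
```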
Verbosity behavior is implemented in src/progress.rs and wired through src/main.rs: quiet suppresses stderr output, default shows progress/summary, and verbose stays detailed without flooding redirected logs by limiting successful per-file path output to TTY runs while still surfacing per-file warnings/errors in non-TTY environments.
Logging integration uses indicatif-log-bridge for startup and global warnings, while parser and other file-scoped scan failures are attached to FileInfo.scan_errors in the scanner process pipeline under src/scanner/process/ (primarily pipeline.rs). That keeps serialized output, CI logs, and the quiet/default/verbose progress modes aligned: default mode shows concise failing paths, verbose mode shows the underlying per-file error details.
Module location: src/progress.rs
Quality verification in this area is currently centered on:
- fixture-backed golden and integration suites
- targeted benchmark runs via `cargo run --manifest-path xtask/Cargo.toml --bin benchmark-target -- ...` when broad performance could change
- explicit parity-gap tracking in evergreen docs and completed rollout records where behavior intentionally differs from Python
For detailed documentation of the license detection pipeline, matching algorithms, and engine components, see LICENSE_DETECTION_ARCHITECTURE.md.
The binary ships with a built-in license index embedded at compile time. This eliminates the need for external files during normal usage:
- Embedded artifact: `resources/license_detection/license_index.zst`
- Embedded build policy: compile-time-bundled from `resources/license_detection/index_build_policy.toml`
- Embedded overlay files: compile-time-bundled from `resources/license_detection/overlay/`
- Format: MessagePack-serialized, zstd-compressed `EmbeddedLoaderSnapshot` data (decoded as sketched after this list)
- Contents: Sorted `LoadedRule` and `LoadedLicense` values derived from the ScanCode rules dataset
- Structured provenance surface: `headers[0].extra_data.license_index_provenance`
- Exported custom dataset root: `manifest.json` + `rules/` + `licenses/`
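A sketch of what consuming that artifact looks like, assuming the zstd and rmp-serde crates (the snapshot fields and include path are illustrative, not the real definitions):

```rust
use serde::Deserialize;

// Illustrative stand-in for the real snapshot type.
#[derive(Deserialize)]
struct EmbeddedLoaderSnapshot {
    schema_version: u32,
}

// Artifact path per the list above; the relative prefix is illustrative.
static ARTIFACT: &[u8] =
    include_bytes!("../resources/license_detection/license_index.zst");

fn load_snapshot() -> Result<EmbeddedLoaderSnapshot, Box<dyn std::error::Error>> {
    // Decompress, then deserialize the MessagePack payload.
    let bytes = zstd::decode_all(ARTIFACT)?;
    let snapshot: EmbeddedLoaderSnapshot = rmp_serde::from_slice(&bytes)?;
    Ok(snapshot)
}
```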
The license detection system uses a two-stage loading process:
┌─────────────────────────────────────────────────────────────────┐
│ License Index Loading │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Loader Stage (Embedded Artifact) │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ • Decompress and deserialize EmbeddedLoaderSnapshot │ │
│ │ • Validate schema version │ │
│ │ • No runtime filesystem access to ScanCode data │ │
│ └────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Build Stage (Runtime) │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ • Build runtime index from embedded rules/licenses │ │
│ │ • Apply deprecated filtering policy │ │
│ │ • Synthesize license-derived rules │ │
│ │ • Build LicenseIndex (token dict, automatons, maps) │ │
│ │ • Build SpdxMapping │ │
│ └────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Artifact-generation responsibilities (performed when building license_index.zst):
- Parse the ScanCode rules and licenses dataset
- Normalize rule/license data before embedding
- Apply the checked-in index build policy before sorting/serialization
- Apply the single downstream overlay directory (ignore via policy manifest, add/replace via `.RULE`/`.LICENSE` files)
- Fail fast on stale ignore ids or redundant overlays that upstream has absorbed verbatim
- Serialize sorted `LoadedRule`/`LoadedLicense` snapshot bytes
- Compress the serialized bytes for embedding
Loader-stage responsibilities (runtime, file-local):
- Decompress and deserialize the embedded loader snapshot
- Reconstruct the runtime `LicenseIndex`
- Build the SPDX mapping from the reconstructed index
Build-stage responsibilities (cross-file policies):
- Deprecated filtering (`with_deprecated: bool`)
- License-derived rule synthesis
- Tokenization and dictionary building
- Aho-Corasick automaton construction
- SPDX key mapping
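For the automaton step, a minimal sketch assuming the aho-corasick crate. This is a simplified illustration; the real construction works over tokenized rule sequences, not raw strings:

```rust
use aho_corasick::AhoCorasick;

// Build one automaton over many rule patterns so a single pass over the
// input can report every pattern occurrence.
fn build_automaton(rule_texts: &[String]) -> Result<AhoCorasick, aho_corasick::BuildError> {
    AhoCorasick::new(rule_texts)
}

fn count_hits(ac: &AhoCorasick, haystack: &str) -> usize {
    ac.find_iter(haystack).count()
}
```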
```rust
// Default: Use embedded artifact
let engine = LicenseDetectionEngine::from_embedded()?;

// Custom dataset: Load from dataset root
let engine = LicenseDetectionEngine::from_directory(&rules_path)?;
```

The CLI uses `from_embedded()` by default. Use `--license-dataset-path` to load from a custom dataset root instead, or `--export-license-dataset` to dump the built-in effective dataset for inspection and reuse.
Maintainers can regenerate the embedded license artifact when the ScanCode rules dataset is updated:
```bash
# Initialize the reference submodule (if not already)
./setup.sh

# Regenerate the artifact
cargo run --manifest-path xtask/Cargo.toml --bin generate-index-artifact

# The generated artifact reflects the compile-time-bundled policy in
# resources/license_detection/index_build_policy.toml, so policy edits need the
# same regeneration step.

# Commit the updated artifact
git add resources/license_detection/license_index.zst
git commit -m "chore: update embedded license data"
```

The `reference/scancode-toolkit/` submodule is optional for end users. It's only needed for:
- Developers updating embedded data: Regenerating the compact embedded loader artifact
- Custom license datasets: Using `--license-dataset-path` to load exported or user-maintained dataset roots
- Parity testing: Comparing Rust behavior against Python reference
Normal builds work without the submodule because the embedded artifact is checked into the repository.
- README.md - User-facing overview, installation, and usage
- AGENTS.md - Contributor guidelines and code style
- ADRs - Architectural decision records
- Improvements - Beyond-parity features
- SUPPORTED_FORMATS.md - Complete format list (auto-generated)