Edge anomaly detection for video streams with multi-backend inference benchmarking.
A production-quality ML pipeline for PatchCore anomaly detection on the MVTec AD dataset, with ONNX and TensorRT export, multi-backend benchmarking, and a real-time Gradio demo.
Tested on RTX 5070 / Ryzen 7 7800X3D / 32GB DDR5 / Windows 11

Category: leather | Backbone: EfficientNet-B0 | 200 inference runs after 20 warmup passes
| Backend | Precision | Mean Latency (ms) | P95 Latency (ms) | FPS | Image AUROC |
|---|---|---|---|---|---|
| PyTorch | FP32 | 14.70 | 16.54 | 68 | 0.8590 |
| ONNX | FP32 | 13.20 | 14.83 | 76 | 0.8580 |
| TensorRT | FP32 | 13.03 | 15.58 | 77 | 0.8584 |
| TensorRT | FP16 | 12.49 | 14.44 | 80 | 0.8594 |
| TensorRT | INT8 | 13.19 | 14.91 | 76 | 0.9497† |
Run benchmark/run_benchmark.py to generate results for your hardware.
†INT8 AUROC anomaly: see Results Discussion below.
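The latency columns in the table can be understood through a small timing harness. The sketch below is not the project's `benchmark/run_benchmark.py`; the function name `benchmark` and its return keys are hypothetical, and it assumes the timed callable blocks until the work is finished (for GPU backends that usually means synchronizing the device inside the callable).

```python
import math
import statistics
import time


def benchmark(fn, n_warmup=20, n_runs=200):
    """Time `fn` and summarize latency in milliseconds (hypothetical helper)."""
    # Warmup passes are executed but not recorded.
    for _ in range(n_warmup):
        fn()

    latencies_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    latencies_ms.sort()
    mean_ms = statistics.fmean(latencies_ms)
    # Nearest-rank P95: smallest sample with at least 95% of samples at or below it.
    p95_index = max(0, math.ceil(0.95 * n_runs) - 1)
    return {
        "mean_ms": mean_ms,
        "p95_ms": latencies_ms[p95_index],
        # FPS is derived from the mean latency, e.g. 14.70 ms -> about 68 FPS.
        "fps": 1000.0 / mean_ms if mean_ms > 0 else float("inf"),
    }


# Stand-in workload; a real run would wrap one inference call instead.
stats = benchmark(lambda: sum(range(10_000)))
print(stats)
```

Deriving FPS from the mean latency matches the table above, where 1000 / 14.70 ms rounds to 68 FPS for the PyTorch row.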
- PatchCore anomaly detection with coreset subsampling and FAISS-accelerated kNN
- Multi-backend inference: PyTorch, ONNX Runtime (CUDA EP), TensorRT
- INT8 quantization with calibration for maximum throughput on edge devices
- Real-time streaming from webcam or RTSP with heatmap overlay
- Gradio demo with backend selection and threshold tuning
- Comprehensive benchmarking with latency percentiles, FPS, and AUROC metrics
- MVTec AD evaluation on leather, bottle, and cable categories
- Autoencoder baseline for reconstruction-based comparison
                      MVTec AD Dataset
                              |
                    +---------+---------+
                    |                   |
             Train (normal)   Test (normal+anomaly)
                    |                   |
           Feature Extraction           |
            (EfficientNet-B0)           |
                    |                   |
           Coreset Subsampling          |
              (greedy, 10%)             |
                    |                   |
           Memory Bank (FAISS)          |
                    |                   |
        +-----------+-----------+       |
        |           |           |       |
     PyTorch      ONNX      TensorRT    |
     Engine      Engine      Engine     |
     (FP32)      (FP32)    (FP32/16/8)  |
        |           |           |       |
        +-----------+-----------+       |
                    |                   |
          kNN Scoring + Heatmap <-------+
                    |
        +-----------+-----------+
        |           |           |
    Benchmark    Gradio      Stream
    (CSV/PNG)     Demo      (OpenCV)
git clone https://github.com/traelynbrasseaux/vigil_cv.git
cd vigil_cv

Option A: venv (recommended)
python -m venv .venv
.venv\Scripts\Activate.ps1
# (Optional) Install PyTorch with GPU support — skip this line to use CPU only
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
pip install -e .

Note: `pip install torch` installs the CPU-only build by default. The optional line above installs the CUDA 12.4 build for GPU acceleration. Training and inference work on CPU without it, just slower.
Option B: conda
conda env create -f environment.yml
conda activate vigil-cv
pip install -e .

Linux/macOS users: For venv activation use `source .venv/bin/activate`. Substitute forward slashes and use standard shell syntax elsewhere.
python -m data.download_mvtec --categories leather bottle cable

# Auto-detects GPU, falls back to CPU
python -m training.train --category leather --model patchcore --backbone efficientnet
# Explicitly use CPU or GPU
python -m training.train --category leather --model patchcore --backbone efficientnet --device cpu
python -m training.train --category leather --model patchcore --backbone efficientnet --device cuda

python -m export.export_onnx --backbone efficientnet

python -m export.export_tensorrt --onnx exports\efficientnet.onnx --precision fp16
python -m export.export_tensorrt --onnx exports\efficientnet.onnx --precision int8 --calibration-data data\mvtec_anomaly_detection --calibration-category leather

python -m benchmark.run_benchmark --category leather --memory-bank checkpoints\leather_efficientnet_TIMESTAMP.npz

python -m demo.app --memory-bank checkpoints\leather_efficientnet_TIMESTAMP.npz

The `--device` flag controls hardware: `auto` (default) detects GPU and falls back to CPU, `cuda` forces GPU, `cpu` forces CPU.
# PatchCore (single forward pass, no gradients) — auto-detect device
python -m training.train --category leather --model patchcore --backbone efficientnet --coreset-ratio 0.1
# PatchCore on CPU only
python -m training.train --category leather --model patchcore --backbone efficientnet --device cpu
# Autoencoder (gradient-based training)
python -m training.train --category leather --model autoencoder --epochs 100 --lr 1e-3 --patience 10
# All flags
python -m training.train --help

# ONNX export with verification
python -m export.export_onnx --backbone efficientnet --output exports\efficientnet.onnx --opset 17
# TensorRT export (FP32, FP16, INT8)
python -m export.export_tensorrt --onnx exports\efficientnet.onnx --precision fp32 --workspace-gb 4
python -m export.export_tensorrt --onnx exports\efficientnet.onnx --precision fp16
python -m export.export_tensorrt --onnx exports\efficientnet.onnx --precision int8 --calibration-data data\mvtec_anomaly_detection --calibration-category leather

# PowerShell line continuation uses a backtick; in bash use a trailing backslash instead
python -m benchmark.run_benchmark `
    --category leather `
    --backbone efficientnet `
    --memory-bank checkpoints\leather_efficientnet_TIMESTAMP.npz `
    --data-root data\mvtec_anomaly_detection `
    --model-dir exports `
    --n-warmup 20 `
    --n-runs 200

# Webcam (PyTorch backend)
python -m inference.stream --source 0 --engine pytorch --model-path checkpoints\leather_efficientnet_TIMESTAMP.npz
# RTSP source (ONNX backend)
python -m inference.stream --source rtsp://camera.local/stream --engine onnx --model-path exports\efficientnet.onnx --memory-bank checkpoints\leather_efficientnet_TIMESTAMP.npz
# Save anomaly frames
python -m inference.stream --source 0 --engine pytorch --model-path checkpoints\leather_efficientnet_TIMESTAMP.npz --save-dir captured_frames

python -m demo.app --memory-bank checkpoints\leather_efficientnet_TIMESTAMP.npz --port 7860 --share

PatchCore (Roth et al., 2022) is a memory-bank-based anomaly detection method that achieves state-of-the-art results on MVTec AD without requiring anomalous training samples. Key advantages:
- No anomaly training data needed - learns only from normal samples
- Single forward pass - no gradient-based training, just feature extraction
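- High accuracy - the original paper reports 99%+ image AUROC on most MVTec categories with a WideResNet-50 backbone; the EfficientNet-B0 configuration benchmarked above reaches about 0.86 on leather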
- Interpretable - anomaly heatmaps show exactly where defects are detected
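The scoring step behind these points can be sketched with brute-force NumPy in place of FAISS. This is an illustrative assumption, not the project's implementation: the function names `patch_distances` and `score_image`, the 14x14 patch grid, and the synthetic features are all hypothetical, and the image-score reweighting from the original paper is omitted.

```python
import numpy as np


def patch_distances(patch_features, memory_bank):
    """Distance from every test patch to its nearest memory-bank entry."""
    # ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2, evaluated for all pairs at once.
    sq = (
        (patch_features ** 2).sum(axis=1, keepdims=True)
        - 2.0 * patch_features @ memory_bank.T
        + (memory_bank ** 2).sum(axis=1)
    )
    # Clamp tiny negative values caused by floating-point cancellation.
    return np.sqrt(np.maximum(sq, 0.0).min(axis=1))


def score_image(patch_features, memory_bank, grid_hw):
    """Return (image score, heatmap); the image score is the largest patch distance."""
    distances = patch_distances(patch_features, memory_bank)
    return float(distances.max()), distances.reshape(grid_hw)


# Synthetic stand-ins: a 500-entry bank of 120-dim features and a 14x14 patch grid.
rng = np.random.default_rng(0)
bank = rng.normal(size=(500, 120))
normal_patches = bank[:196] + 0.01 * rng.normal(size=(196, 120))
defect_patches = normal_patches.copy()
defect_patches[42] += 5.0  # push one patch far away from the normal features

normal_score, _ = score_image(normal_patches, bank, (14, 14))
defect_score, defect_map = score_image(defect_patches, bank, (14, 14))
print(normal_score, defect_score, int(defect_map.argmax()))
```

Only normal features populate the bank, so a defect shows up as a patch with no close neighbor, and the per-patch distances double as the heatmap.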
The full memory bank from all training patches can be very large. Greedy coreset subsampling selects a representative subset (default 10%) that preserves the distribution while dramatically reducing memory and inference time:
- Full bank: ~500K patches per category -> ~240MB memory
- Coreset (10%): ~50K patches -> ~24MB memory
- AUROC impact: typically < 0.5% degradation
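A minimal sketch of greedy (k-center) coreset selection follows, together with the memory arithmetic behind the figures above. It assumes float32 features of dimension 120, which reproduces the quoted 240MB and 24MB; the function names are hypothetical, and the random projection the original paper uses to speed up selection is left out.

```python
import numpy as np


def greedy_coreset(features, ratio=0.1, seed=0):
    """Pick indices so that each new point is the one farthest from the current selection."""
    n = features.shape[0]
    n_select = max(1, int(round(n * ratio)))
    rng = np.random.default_rng(seed)

    selected = [int(rng.integers(n))]
    # min_dist[i] = distance from point i to its closest selected point so far.
    min_dist = np.linalg.norm(features - features[selected[0]], axis=1)
    for _ in range(n_select - 1):
        idx = int(np.argmax(min_dist))
        selected.append(idx)
        new_dist = np.linalg.norm(features - features[idx], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return np.asarray(selected)


def bank_megabytes(n_patches, feature_dim, bytes_per_value=4):
    """Memory footprint of a dense bank, assuming float32 values by default."""
    return n_patches * feature_dim * bytes_per_value / 1e6


features = np.random.default_rng(1).normal(size=(200, 8))
coreset_idx = greedy_coreset(features, ratio=0.1)
print(len(coreset_idx))             # 20 of 200 patches kept
print(bank_megabytes(500_000, 120))  # 240.0
print(bank_megabytes(50_000, 120))   # 24.0
```

Farthest-point selection favors coverage of the feature space over density, which is why a 10% subset can stand in for the full bank during nearest-neighbor search.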
| Backbone | Params | Feature Dim | Relative Speed | Typical AUROC |
|---|---|---|---|---|
| EfficientNet-B0 | 5.3M | 120 | 1.0x | 99.5% |
| MobileNetV3-S | 2.5M | 64 | 1.8x | 97.8% |
The backbone feature extractor is exported to ONNX format (opset 17) with:
- Dynamic batch axis for flexible deployment
- Shape inference and graph simplification via `onnxsim`
- Automated verification against PyTorch output (tolerance: 1e-4)
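The verification step can be pictured as an element-wise comparison of the two backends' outputs. The sketch below assumes the 1e-4 tolerance is an absolute tolerance on the maximum element-wise error, which the text does not specify; `verify_outputs` is a hypothetical helper, and the arrays are synthetic stand-ins for PyTorch and ONNX Runtime outputs.

```python
import numpy as np


def verify_outputs(reference, candidate, atol=1e-4):
    """Return (passed, max_abs_err) for two output arrays of the same shape."""
    reference = np.asarray(reference, dtype=np.float32)
    candidate = np.asarray(candidate, dtype=np.float32)
    if reference.shape != candidate.shape:
        raise ValueError(f"shape mismatch: {reference.shape} vs {candidate.shape}")
    max_abs_err = float(np.max(np.abs(reference - candidate)))
    return max_abs_err <= atol, max_abs_err


rng = np.random.default_rng(0)
torch_like = rng.normal(size=(2, 120, 28, 28)).astype(np.float32)

ok_close, err_close = verify_outputs(torch_like, torch_like + 5e-5)
ok_far, err_far = verify_outputs(torch_like, torch_like + 1e-3)
print(ok_close, ok_far)
```

Reporting the maximum error alongside the pass/fail flag makes it easier to tell a borderline numerical difference from a genuinely broken export.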
TensorRT builds an optimized inference engine from the ONNX model:
- FP32: Baseline precision, no accuracy loss
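- FP16: up to 2x speedup on backbone-bound workloads with negligible accuracy impact (< 0.1% AUROC); in the RTX 5070 benchmark above, mean latency dropped by only about 4% relative to FP32 (12.49 ms vs 13.03 ms)
- INT8: nominally 4-5x speedup with calibration-based quantization; no speedup over FP16 was measured on the RTX 5070 (13.19 ms vs 12.49 ms)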
INT8 quantization requires a calibration dataset to determine optimal quantization ranges. We use 100 normal training images fed through the MinMax calibrator. This process:
- Runs representative inputs through the network
- Records activation distributions at each layer
- Computes optimal scale factors for INT8 representation
- Produces a calibration cache for reproducible builds
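The idea behind MinMax calibration can be illustrated for a single tensor. This is a simplified sketch assuming symmetric per-tensor quantization to the range [-127, 127]; it is not TensorRT's calibrator, and the function names and synthetic activations are hypothetical.

```python
import numpy as np


def minmax_scale(calibration_batches):
    """Scale factor from the largest absolute activation seen during calibration."""
    amax = max(float(np.abs(batch).max()) for batch in calibration_batches)
    return amax / 127.0 if amax > 0 else 1.0


def quantize(x, scale):
    """Map float activations to INT8, saturating anything outside the calibrated range."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)


def dequantize(q, scale):
    return q.astype(np.float32) * scale


# 100 synthetic "activation" batches stand in for 100 calibration images.
rng = np.random.default_rng(0)
batches = [rng.normal(size=(4, 16)).astype(np.float32) for _ in range(100)]

scale = minmax_scale(batches)
q = quantize(batches[0], scale)
round_trip_err = float(np.abs(batches[0] - dequantize(q, scale)).max())
print(scale, round_trip_err)
```

For values inside the calibrated range the round-trip error is bounded by half a quantization step, so a single outlier activation that inflates the scale costs resolution everywhere else.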
TensorRT for Windows requires manual installation from the NVIDIA zip package. It cannot be installed via pip directly.
- Download TensorRT 10.x GA from NVIDIA Developer
  - Select the zip package matching your CUDA version (12.x)
  - Choose the Windows build
- Extract the zip to a permanent location:
  # Example: C:\TensorRT-10.7.0.23
- Install the Python wheel:
  pip install C:\TensorRT-10.7.0.23\python\tensorrt-10.7.0.23-cp311-none-win_amd64.whl
- Add the lib directory to PATH:
  # Temporary (current session only)
  $env:PATH = "C:\TensorRT-10.7.0.23\lib;" + $env:PATH
  # Permanent: Add via System Properties > Environment Variables > PATH
- Verify installation:
  python -c "import tensorrt; print(tensorrt.__version__)"

- If you get `DLL not found` errors, ensure `<TRT_ROOT>\lib` is in your PATH
- TensorRT requires a matching CUDA toolkit version (check compatibility matrix)
- INT8 calibration additionally requires PyCUDA:
  pip install pycuda
This project was developed and tested on:
| Component | Specification |
|---|---|
| GPU | NVIDIA RTX 5070 |
| CPU | AMD Ryzen 7 7800X3D |
| RAM | 32GB DDR5 |
| OS | Windows 11 Home |
| Python | 3.11 |
| CUDA | 12.x |
All backends land within about 2.2 ms of each other (12.49 to 14.70 ms mean latency) because EfficientNet-B0 (5.3M params) is compute-bound only on very constrained hardware. On an RTX 5070, with its Blackwell architecture and high FP16 throughput, the kNN search against the FAISS memory bank dominates inference time, not the backbone forward pass. TensorRT's advantages become more pronounced with larger backbones or on edge devices where memory bandwidth is the bottleneck.
TensorRT INT8 was expected to be the fastest backend but measured slightly slower than FP16 (13.19 ms vs 12.49 ms) and on par with ONNX FP32 (13.20 ms). Two reasons:
- RTX 5070 FP16 throughput: Blackwell GPUs have substantial FP16/BF16 tensor core throughput, so INT8 gains are marginal for small models.
- Implicit quantization fallback: TensorRT 10.x emitted `Dequantize [SCALE] has invalid precision Int8, ignored` for most layers, meaning they ran at higher precision. The `IInt8MinMaxCalibrator` API is deprecated in TRT 10.1 in favor of explicit quantization. The engine is effectively mixed-precision rather than true INT8.
The INT8 AUROC of 0.9497 (vs 0.858x for all other backends) is a numerical artifact of this fallback. The layers that did get quantized had their activations shifted by MinMax calibration, which incidentally changed the feature distribution in a way that improved kNN separation on this particular test split. This is not a reliable improvement across datasets or categories — it reflects the non-determinism of implicit quantization rather than a genuine accuracy gain.
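AUROC depends only on how anomalous and normal scores rank against each other, which is why a shift in the feature distribution can move it without any real gain in detection quality. A minimal pairwise sketch, with a hypothetical function name and toy scores rather than project data:

```python
import numpy as np


def image_auroc(labels, scores):
    """Probability that a random anomalous image scores above a random normal one."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=np.float64)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one normal and one anomalous sample")
    # Compare every anomalous score with every normal score; ties count as half.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))


# Toy example: labels 1 = anomalous, 0 = normal.
toy_auroc = image_auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auroc)  # 0.75: three of the four anomalous/normal pairs are ranked correctly
```

Because only the ordering matters, a monotonic rescaling of all scores leaves AUROC unchanged, while a change that reorders a few borderline images can shift it noticeably on a small test split.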
INT8 tends to hold up for:
- Structural defects (holes, tears, large scratches): strong feature responses are robust to quantization noise
- High-contrast anomalies: color changes and large deformations tolerate reduced precision
- Edge devices (Jetson Orin, etc.): memory bandwidth is the bottleneck; INT8 gives 2-4x gains where FP16 headroom is limited

INT8 tends to degrade for:
- Fine-grained textures (subtle surface roughness, micro-scratches): reduced precision loses subtle feature variations
- Near-threshold anomalies: borderline scores are more susceptible to misclassification after quantization
- High intra-class variance categories: INT8 may collapse the margin between normal variation and genuine anomalies
Use FP16 as the default for deployment — it provides consistent speedup over FP32 with negligible accuracy impact (< 0.01 AUROC on leather). For true INT8 deployment, migrate to TensorRT's explicit quantization API (PTQ with IQuantizeLayer) rather than the deprecated implicit calibration path used here.
- ONNX Runtime execution providers (CUDA EP, TensorRT EP) as alternative to native TensorRT
- RTSP multi-camera support with per-stream engine instances
- ONNX model packaging for edge devices (NVIDIA Jetson Orin)
- Web dashboard with historical anomaly tracking
- Additional MVTec AD categories and cross-category transfer learning
- Model distillation for sub-1ms inference on mobile GPUs
MIT License. See LICENSE for details.
