This project implements a lightweight version of DETR (DEtection TRansformer) for pedestrian detection on the Penn-Fudan pedestrian dataset (PennFudanPed).
The code includes the full pipeline:
- Dataset loading and preprocessing
- Model (backbone + transformer)
- Training with Hungarian matching
- Evaluation using mAP@0.5
- Experiment comparisons
- Dataset: PennFudanPed
- Converts segmentation masks to bounding boxes
- Bounding boxes are stored as (cx, cy, w, h), normalized by image width and height
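
As an illustration of the mask-to-box conversion, here is a minimal sketch. It assumes a PennFudanPed-style instance mask where 0 is background and each pedestrian has its own positive integer id. The function name `mask_to_boxes` is hypothetical, and treating the maximum pixel index as inclusive (hence the `+ 1`) is a choice made for this example, not necessarily what the project does.

```python
import numpy as np

def mask_to_boxes(mask):
    """Convert an instance mask of shape (H, W) to normalized (cx, cy, w, h) boxes."""
    height, width = mask.shape
    boxes = []
    for instance_id in np.unique(mask):
        if instance_id == 0:  # 0 is assumed to be background
            continue
        ys, xs = np.nonzero(mask == instance_id)
        # Pixel-inclusive extents: the box covers the last mask pixel entirely.
        x_min, x_max = xs.min(), xs.max() + 1
        y_min, y_max = ys.min(), ys.max() + 1
        boxes.append([
            (x_min + x_max) / 2 / width,   # cx
            (y_min + y_max) / 2 / height,  # cy
            (x_max - x_min) / width,       # w
            (y_max - y_min) / height,      # h
        ])
    # reshape keeps a consistent (N, 4) shape even when no instance is present
    return np.array(boxes, dtype=np.float32).reshape(-1, 4)
```
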
- Horizontal flip
- Scale jitter
- Optional color jitter
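
Geometric augmentations have to transform the boxes together with the image. The sketch below shows this for the horizontal flip only, assuming an image array of shape (H, W, C) and normalized (cx, cy, w, h) boxes. The name `horizontal_flip` is hypothetical.

```python
import numpy as np

def horizontal_flip(image, boxes):
    """Flip an (H, W, C) image left-right and mirror normalized (cx, cy, w, h) boxes."""
    flipped_image = image[:, ::-1].copy()
    flipped_boxes = boxes.copy()
    # Only the x center changes; width, height and the y center are unaffected.
    flipped_boxes[:, 0] = 1.0 - flipped_boxes[:, 0]
    return flipped_image, flipped_boxes
```

With normalized center-format boxes the flip reduces to `cx -> 1 - cx`, which is one practical advantage of this box representation.
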
- ResNet18 or MobileNetV2
- Extracts feature maps from images
- Encoder: global feature understanding
- Decoder: predicts objects using learned queries
- Fixed number of queries (e.g., 10)
- Each query predicts one object
- Classification (object / no-object)
- Bounding box regression (cx, cy, w, h)
- Hungarian algorithm for one-to-one assignment
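
A minimal sketch of the one-to-one assignment, using `scipy.optimize.linear_sum_assignment`. The cost here combines only the predicted object probability and the L1 box distance; the cost weights, the function name `hungarian_match`, and the omission of a GIoU cost term are simplifying assumptions for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(object_probs, pred_boxes, gt_boxes, class_weight=1.0, l1_weight=5.0):
    """Match queries to ground-truth boxes with minimum total cost.

    object_probs: (Q,) predicted probability of the "object" class per query
    pred_boxes:   (Q, 4) predicted boxes, normalized (cx, cy, w, h)
    gt_boxes:     (G, 4) ground-truth boxes in the same format
    Returns (query_indices, gt_indices), each of length min(Q, G).
    """
    # A confident "object" prediction lowers the cost of using that query.
    class_cost = -object_probs[:, None]  # (Q, 1), broadcast over ground truths
    # Pairwise L1 distance between every prediction and every ground truth.
    l1_cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(axis=-1)  # (Q, G)
    cost = class_weight * class_cost + l1_weight * l1_cost
    return linear_sum_assignment(cost)
```

Queries that are left unmatched are trained toward the no-object class, which is why no NMS step is needed afterwards.
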
- Cross-Entropy (classification)
- L1 Loss (bounding boxes)
- GIoU Loss
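
Of the three loss terms, GIoU is the least standard, so a small sketch may help. It computes `1 - GIoU` for already-matched pairs of normalized (cx, cy, w, h) boxes. The function names are hypothetical, and how the three terms are weighted against each other is not covered here.

```python
import numpy as np

def cxcywh_to_xyxy(boxes):
    """Convert (..., 4) boxes from (cx, cy, w, h) to (x1, y1, x2, y2)."""
    cx, cy, w, h = boxes[..., 0], boxes[..., 1], boxes[..., 2], boxes[..., 3]
    return np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=-1)

def giou_loss(pred_boxes, target_boxes, eps=1e-7):
    """Return 1 - GIoU for each matched (prediction, target) pair of shape (N, 4)."""
    p = cxcywh_to_xyxy(pred_boxes)
    t = cxcywh_to_xyxy(target_boxes)
    # Intersection rectangle, clipped to zero when the boxes do not overlap.
    inter_w = np.clip(np.minimum(p[:, 2], t[:, 2]) - np.maximum(p[:, 0], t[:, 0]), 0, None)
    inter_h = np.clip(np.minimum(p[:, 3], t[:, 3]) - np.maximum(p[:, 1], t[:, 1]), 0, None)
    intersection = inter_w * inter_h
    area_p = (p[:, 2] - p[:, 0]) * (p[:, 3] - p[:, 1])
    area_t = (t[:, 2] - t[:, 0]) * (t[:, 3] - t[:, 1])
    union = area_p + area_t - intersection
    iou = intersection / (union + eps)
    # Smallest axis-aligned box that encloses both boxes.
    enclose_w = np.maximum(p[:, 2], t[:, 2]) - np.minimum(p[:, 0], t[:, 0])
    enclose_h = np.maximum(p[:, 3], t[:, 3]) - np.minimum(p[:, 1], t[:, 1])
    enclose = enclose_w * enclose_h
    giou = iou - (enclose - union) / (enclose + eps)
    return 1.0 - giou
```

Unlike plain IoU, the enclosing-box term still gives a useful gradient signal when the predicted and target boxes do not overlap at all.
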
- Early training: frozen backbone, weak augmentation
- Later training: unfrozen backbone, stronger augmentation
- Optimizer: AdamW
- Scheduler: OneCycleLR
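
The two training stages could be set up along these lines. This is a sketch under several assumptions: the model exposes a `backbone` attribute, the optimizer and scheduler are rebuilt when the backbone is unfrozen, and the learning rates and step counts are placeholder values. `TinyDetector` is a stand-in, not the project's model.

```python
import torch
from torch import nn

class TinyDetector(nn.Module):
    """Stand-in model with a `backbone` attribute (hypothetical)."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.head = nn.Linear(8, 2)

def set_backbone_trainable(model, trainable):
    """Freeze or unfreeze every backbone parameter."""
    for parameter in model.backbone.parameters():
        parameter.requires_grad = trainable

def build_optimizer(model, max_lr, total_steps):
    """Create AdamW over the currently trainable parameters plus a OneCycleLR schedule."""
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=max_lr, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=max_lr, total_steps=total_steps
    )
    return optimizer, scheduler

model = TinyDetector()

# Stage 1: frozen backbone, only the remaining layers are optimized.
set_backbone_trainable(model, False)
stage1_optimizer, stage1_scheduler = build_optimizer(model, max_lr=1e-3, total_steps=100)

# Stage 2: unfrozen backbone, typically with a smaller learning rate.
set_backbone_trainable(model, True)
stage2_optimizer, stage2_scheduler = build_optimizer(model, max_lr=1e-4, total_steps=100)
```

In a training loop, `scheduler.step()` is called once per batch after `optimizer.step()`, and no more than `total_steps` times per schedule.
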
- Metric: mAP@0.5
- Implemented using TorchMetrics
- Evaluated on validation and test sets
The following experiments are included:
- Backbone comparison (ResNet18 vs MobileNetV2)
- Query count (5, 10, 20)
- Augmentation (on vs off)
Each experiment:
- Trains a model
- Saves best checkpoint
- Reports validation and test performance
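
One way to enumerate these runs is to vary one factor at a time around a baseline configuration. The sketch below is illustrative only: `ExperimentConfig`, its field names, and the chosen baseline (ResNet18, 10 queries, augmentation on) are assumptions, not the project's actual configuration.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ExperimentConfig:
    """Hypothetical experiment settings; frozen so instances are hashable."""
    backbone: str = "resnet18"
    num_queries: int = 10
    augmentation: bool = True

def build_experiments(baseline=ExperimentConfig()):
    """Vary one factor at a time around the baseline and drop duplicate configs."""
    configs = [baseline]
    configs += [replace(baseline, backbone=b) for b in ("resnet18", "mobilenet_v2")]
    configs += [replace(baseline, num_queries=q) for q in (5, 10, 20)]
    configs += [replace(baseline, augmentation=a) for a in (True, False)]
    # dict.fromkeys removes duplicates while preserving insertion order
    return list(dict.fromkeys(configs))
```
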
- MobileNetV2 achieved the best performance (~0.507 mAP@0.5)
- Optimal query count is around 10
- Increasing queries too much degrades performance
- Removing augmentation causes significant performance drop
- Ground truth visualization
- Prediction visualization (top-k predictions vs GT)
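
Before drawing, the top-k predictions have to be selected by score and converted from normalized (cx, cy, w, h) to pixel corner coordinates. A minimal sketch of that step follows; the function name `select_top_k` is hypothetical, and the plotting itself (for example with matplotlib rectangles) is left out.

```python
import numpy as np

def select_top_k(scores, boxes_cxcywh, k, image_width, image_height):
    """Return the k highest-scoring boxes as pixel (x1, y1, x2, y2) coordinates."""
    order = np.argsort(-scores)[:k]  # indices sorted by descending score
    cx, cy, w, h = boxes_cxcywh[order].T
    boxes_xyxy = np.stack([
        (cx - w / 2) * image_width,
        (cy - h / 2) * image_height,
        (cx + w / 2) * image_width,
        (cy + h / 2) * image_height,
    ], axis=1)
    return scores[order], boxes_xyxy
```
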
pip install torch torchvision torchmetrics scipy matplotlib

- No anchor boxes or NMS required
- Hungarian matching is central to training
- Query count is an important hyperparameter
- Data augmentation is critical for generalization