In this final milestone of the "Basics of Deep Learning" course, we tackled the complex task of fine-grained image classification and retrieval using the Stanford Cars Dataset. This dataset includes 16,185 images labeled into 196 subtle car categories, with visual distinctions often limited to minor differences in shape, color, or model details. The challenge was to build models that can accurately classify and retrieve images of cars from these categories.
To address this challenge, we implemented three distinct deep learning configurations:
- Transfer Learning with a pre-trained ResNet-50 - Link to Notebook 1
- Image Retrieval using feature embeddings and nearest neighbors - Link to Notebook 2
- End-to-End Convolutional Neural Network trained from scratch - Link to Notebook 3
Each configuration was explored through three experiments, allowing us to evaluate the impact of architectural depth, regularization, and data augmentation strategies. All implementations were done in PyTorch, with classification and retrieval tasks evaluated using standard metrics.
Dataset Source - The dataset we used is sourced from: Stanford Cars Dataset on Kaggle.
View the full report - MS3_Report.pdf
We fine-tuned pre-trained models on ImageNet to classify the 196 car categories. This configuration tested the power of transfer learning in adapting general-purpose features to a fine-grained classification task. We explored both shallow (frozen) and deep (fine-tuned/progressive) training strategies.
| Backbone | Input Size | Optimizer | Learning Rate | Loss Function | Batch Size | Scheduler | Epochs |
|---|---|---|---|---|---|---|---|
| ResNet-50 / DenseNet121 | 224×224 | Adam | 0.001 | CrossEntropyLoss | 32 | StepLR (γ=0.1 every 7 epochs) | 10 |
-
Experiment 1 - ResNet-50 (Frozen) – Only the classification head was trained. The backbone was frozen, and no augmentation or regularization was used.
-
Experiment 2 - ResNet-50 (Fine-Tuned + Aug + Dropout) – All layers were unfrozen. Data augmentation (RandomCrop, Flip, Rotation) and Dropout (p=0.5) were used.
-
Experiment 3 - DenseNet121 (Progressive Unfreezing) – Used DenseNet121 with gradual unfreezing and same augmentations. Aimed to explore a more compact yet powerful architecture.
| Experiment | Model Description | Loss | Accuracy | F1 Score | Precision | Recall |
|---|---|---|---|---|---|---|
| Exp. 1 | ResNet-50 – Frozen | 2.3724 | 42.37% | 41.96% | 42.91% | 42.37% |
| Exp. 2 | ResNet-50 – Fine-Tuned + Aug + Dropout | 0.9012 | 73.78% | 73.63% | 78.18% | 73.78% |
| Exp. 3 | DenseNet121 – Progressive Unfreezing | 1.0800 | 71.69% | 71.70% | 75.43% | 71.69% |
This configuration confirmed the strength of transfer learning for fine-grained tasks:
- The frozen backbone alone produced weak results.
- Fine-tuning with regularization significantly improved performance.
- DenseNet121 performed well, though slightly behind the fine-tuned ResNet-50.
The best model in this configuration was Experiment 2 – ResNet-50 Fine-Tuned, which reached the highest accuracy and F1 Score. It was later used as the baseline for deployment.
This configuration focused on visual similarity retrieval instead of direct classification. We trained CNN backbones (ResNet-50, DenseNet121) to generate deep embeddings, then used K-Nearest Neighbors (KNN) over these features to find the most visually similar images. The setup is useful for search engines, recommendation systems, and tasks requiring semantic similarity.
| Input Size | Backbone | Optimizer | Learning Rate | Loss Function | Batch Size | Epochs |
|---|---|---|---|---|---|---|
| 224×224 | ResNet-50 / DenseNet121 | Adam | 0.001 | CrossEntropyLoss | 64 | 10 |
-
Experiment 1 - ResNet-50 + KNN (k = 3, Euclidean) - Embeddings were extracted from a fine-tuned ResNet-50 and used with KNN (k = 3, Euclidean).
-
Experiment 2 - ResNet-50 + KNN (k = 10, Euclidean) - Same ResNet-50 embeddings, but KNN was configured with a larger neighborhood (k = 10) to improve robustness.
-
Experiment 3 - DenseNet121 + KNN (k = 7, Cosine Similarity) - Switched to DenseNet121 as backbone, with embeddings of size 1024 and cosine similarity instead of Euclidean distance.
| Experiment | Model Description | Accuracy | F1 Score | Precision | Recall | Predict Time |
|---|---|---|---|---|---|---|
| Exp. 1 | ResNet-50 + KNN (k = 3, Euclidean) | 74.93% | 75.23% | 77.38% | 74.93% | 1.1998s |
| Exp. 2 | ResNet-50 + KNN (k = 10, Euclidean) | 76.77% | 76.77% | 78.81% | 76.77% | 1.3873s |
| Exp. 3 | DenseNet121 + KNN (k = 7, Cosine Similarity) | 73.41% | 73.49% | 75.33% | 73.41% | 1.0178s |
This configuration highlighted the potential of deep feature embeddings for image-to-image search:
- Even small neighborhoods (k = 3) yielded good results, but increasing k improved stability.
- Using cosine similarity with DenseNet provided competitive performance and the fastest prediction time.
- The retrieval pipeline is highly flexible, making it well-suited for applications requiring fast, interpretable, and semantic similarity.
The best model in this configuration was Experiment 2, which used ResNet-50 with KNN (k = 10, Euclidean) and achieved the highest accuracy and F1 score.
In this configuration, we built and trained a CNN model entirely from scratch, with no pre-trained weights. This allowed full control over the architecture and learning process. The goal was to evaluate the viability of training a custom network on a complex fine-grained dataset like Stanford Cars.
| Input Size | Optimizer | Learning Rate | Loss Function | Batch Size | Pooling | Activation | Weight Init | Max Epochs |
|---|---|---|---|---|---|---|---|---|
| 224×224 | Adam | 0.001 | CrossEntropyLoss | 64 | MaxPooling | ReLU | He Initialization | 40 |
-
Experiment 1 - Basic CNN (No Regularization) - A minimal CNN with 3 convolutional blocks, no normalization, and no data augmentation.
-
Experiment 2 - CNN + BatchNorm + Dropout - Expanded to 4 convolutional blocks, added Batch Normalization and Dropout, still without augmentation.
-
Experiment 3 - Advanced CNN + Data Augmentation - Same as Experiment 2, with added RandomCrop, HorizontalFlip, and ColorJitter augmentations. Trained for 40 epochs.
| Experiment | Model Description | Accuracy | F1 Score | Precision | Recall | Loss |
|---|---|---|---|---|---|---|
| Exp. 1 | Basic CNN (3 blocks, no regularization) | 8.88% | 8.52% | 9.80% | 8.88% | 4.6955 |
| Exp. 2 | CNN + BatchNorm + Dropout | 8.16% | 6.23% | 7.17% | 8.16% | 4.5436 |
| Exp. 3 | CNN + Augmentation + Regularization | 72.26% | 72.17% | 72.99% | 72.26% | 1.1026 |
This configuration showed a steep learning curve but promising results:
- Basic models without augmentation or normalization failed to generalize.
- Adding BatchNorm and Dropout without augmentation was not enough.
- The full combination of regularization, deeper architecture, and data augmentation led to competitive results.
The best model in this configuration was Experiment 3, which demonstrated that with the right setup, custom CNNs can achieve results comparable to pre-trained solutions.
This milestone provided a comprehensive comparison between three distinct approaches to solving a fine-grained visual classification task.
-
Transfer Learning offered the best accuracy for classification with the least engineering effort. It leveraged powerful pre-trained features and benefited significantly from fine-tuning and dropout. This approach is ideal for fast deployment and works especially well when annotated data is limited but domain similarity is high.
-
Image Retrieval proved effective for semantic similarity tasks. It outperformed classification in raw accuracy when evaluated using KNN over learned embeddings. This approach is especially useful for search interfaces and user-facing systems where interpretability of similarity matters.
-
End-to-End CNN required more careful design and training but ultimately proved that custom networks can perform competitively when equipped with solid regularization and data augmentation. While resource-intensive, it gives full control over the learning process and is valuable for research and experimentation.
In summary, each configuration serves different practical needs:
- Use Transfer Learning for classification pipelines requiring strong accuracy with minimal compute.
- Use Image Retrieval for interactive tools or similarity-based search.
- Use End-to-End CNN when full customization or architectural research is needed.
This project deepened our understanding of deep learning design, training stability, and performance evaluation under different constraints - from off-the-shelf reuse to scratch-built solutions.