# Image and Text Embeddings for Product Category Classification
This project implements a multimodal product classification pipeline that classifies e-commerce products into predefined categories using pre-trained deep learning models for feature extraction (image and text embeddings) followed by classical machine learning and MLP classifiers.
## Table of Contents

- Problem Statement
- Dataset Description
- Modeling Approach
- Evaluation Metrics
- Results
- How to Run
- Architecture Diagram
- Project Structure
- Technologies Used
## Problem Statement

E-commerce platforms require accurate product categorization to enable search, recommendations, inventory management, and user navigation. Manual categorization does not scale for large catalogs.
This project addresses the following:
- Multiclass product classification into predefined category hierarchies
- Multimodal input: Product images and text descriptions must be leveraged jointly
- Performance targets:
  - Multimodal model: at least 85% accuracy and 80% F1-score
  - Text-only model: at least 85% accuracy and 80% F1-score
  - Image-only model: at least 75% accuracy and 70% F1-score
### Objectives

- Extract image embeddings using pre-trained computer vision models (ConvNextV2, ResNet50)
- Extract text embeddings using pre-trained language models (e.g., MiniLM-L6-v2)
- Train and evaluate classical ML models (Random Forest, Logistic Regression)
- Train and evaluate MLP models for single-modality and multimodal fusion
- Generate reproducible results with classification reports and confusion matrices
## Dataset Description

The `processed_products_with_images.csv` dataset contains product listings from BestBuy.com with structured metadata, descriptions, and image URLs.
| Column | Description |
|---|---|
| `sku` | Unique product identifier |
| `name` | Product name |
| `description` | Product description (text) |
| `image` | URL of the product image |
| `type` | Product type (e.g., HardGood, Software) |
| `price` | Product price |
| `shipping` | Shipping cost |
| `manufacturer` | Manufacturer name |
| `class_id` | Product class identifier |
| `sub_class1_id` | Product sub-class identifier |
| `image_path` | Local path where the image is stored |
- Images are downloaded from URLs and stored in `data/images/`
- Default image resolution: 224x224 pixels (JPG format)
- Optional: `categories.json` provides hierarchical category relationships
- Place the preprocessed dataset at `data/processed_products_with_images.csv`
- Download the images zip file and extract to `data/images/`
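The download-and-resize step can be sketched as below. This is an illustrative helper, not the project's actual code: the `save_product_image` name is hypothetical, and the 224x224 JPG layout under `data/images/` follows the conventions described above.

```python
# Hypothetical sketch: decode raw image bytes (e.g., from an HTTP download of
# the `image` URL column), resize to the expected 224x224 JPG, and save under
# data/images/ keyed by the product's `sku`.
from io import BytesIO
from pathlib import Path

from PIL import Image


def save_product_image(raw_bytes: bytes, sku: str, out_dir: str = "data/images") -> Path:
    """Decode image bytes, resize to 224x224, and save as <out_dir>/<sku>.jpg."""
    img = Image.open(BytesIO(raw_bytes)).convert("RGB")
    img = img.resize((224, 224))
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / f"{sku}.jpg"
    img.save(path, format="JPEG")
    return path
```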
## Modeling Approach

### Image Embeddings

- Backbone models: ConvNextV2 (Hugging Face) and ResNet50 (TensorFlow Keras)
- Input: 224x224 RGB images, normalized to [0, 1]
- Output: Fixed-size embedding vectors per image
- Models supported: ResNet50, ResNet101, DenseNet121/169, InceptionV3, ConvNextV2, ViT, Swin Transformer
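As a rough sketch of how one of these backbones produces fixed-size embeddings (this is not the project's `src/vision_embeddings_tf.py`, just a minimal Keras example): `pooling="avg"` collapses the convolutional feature map into a single 2048-dimensional vector per image.

```python
# Sketch: ResNet50 as a frozen feature extractor. include_top=False drops the
# classification head; pooling="avg" yields one fixed-size vector per image.
import numpy as np
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input

backbone = ResNet50(weights="imagenet", include_top=False, pooling="avg")


def embed_image(img_01: np.ndarray) -> np.ndarray:
    """img_01: a (224, 224, 3) RGB array with values in [0, 1]."""
    # ResNet50's preprocess_input expects pixel values in 0-255.
    batch = preprocess_input(img_01[None] * 255.0)
    return backbone.predict(batch, verbose=0)[0]  # shape: (2048,)
```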
### Text Embeddings

- Primary model: `sentence-transformers/all-MiniLM-L6-v2` (Hugging Face)
- Alternative: OpenAI API (GPT-class models) for proprietary embeddings
- Input: Product name and description
- Output: Fixed-size text embedding vectors
### Preprocessing

- Merge image and text embeddings into a single feature matrix
- Split dataset into train and test sets
- Extract feature columns for text-only, image-only, and combined inputs
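These three steps can be illustrated with a toy DataFrame; the `txt_`/`img_` column prefixes and the `class_id` label column are assumptions for this sketch (the label name matches the dataset table above).

```python
# Sketch: early fusion = concatenating text and image embedding columns into
# one feature matrix, then holding out a stratified test split.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(100, 6)),
    columns=[f"txt_{i}" for i in range(3)] + [f"img_{i}" for i in range(3)],
)
df["class_id"] = rng.integers(0, 3, size=100)

text_cols = [c for c in df.columns if c.startswith("txt_")]
image_cols = [c for c in df.columns if c.startswith("img_")]

X = df[text_cols + image_cols].to_numpy()  # combined (early-fusion) features
y = df["class_id"].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```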
### Classical ML Models

- Models: Random Forest, Logistic Regression (minimum required)
- Modalities: Image-only, text-only, and early-fusion (combined embeddings)
- Embedding visualization: PCA and t-SNE (2D and 3D) for exploratory analysis
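A minimal sketch of the two classical baselines on synthetic stand-in features (a real run would use the merged embedding matrix from the preprocessing step):

```python
# Sketch: fit Random Forest and Logistic Regression on synthetic embedding-like
# features and collect macro F1 per model.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = {}
for clf in (RandomForestClassifier(random_state=0), LogisticRegression(max_iter=1000)):
    clf.fit(X_tr, y_tr)
    scores[type(clf).__name__] = f1_score(y_te, clf.predict(X_te), average="macro")
```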
### MLP Models

- Architecture: Early-fusion MLP with BatchNorm, Dropout
- Input: Supports single-modality or multimodal concatenated embeddings
- Training: Categorical cross-entropy, Adam optimizer, early stopping
- Modalities: Image-only, text-only, and combined embeddings
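A sketch of such an MLP in Keras, matching the description above (BatchNorm, Dropout, cross-entropy loss, Adam, early stopping). Layer sizes and dropout rates are illustrative, not the project's exact `src/classifiers_mlp.py`; the sparse variant of categorical cross-entropy is used here assuming integer labels.

```python
# Sketch: early-fusion MLP head over concatenated embeddings.
from tensorflow import keras


def build_mlp(input_dim: int, n_classes: int) -> keras.Model:
    model = keras.Sequential([
        keras.layers.Input(shape=(input_dim,)),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.BatchNormalization(),
        keras.layers.Dropout(0.3),
        keras.layers.Dense(128, activation="relu"),
        keras.layers.Dropout(0.3),
        keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",  # integer class labels
        metrics=["accuracy"],
    )
    return model


# Early stopping on validation loss, restoring the best weights:
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True
)
# model.fit(X_train, y_train, validation_split=0.1, callbacks=[early_stop], ...)
```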
## Evaluation Metrics

- Accuracy: Overall proportion of correct predictions
- F1-score: Macro-averaged F1 (balance of precision and recall across classes)
- Precision and Recall: Per-class and macro-averaged
- Confusion matrix: Per-class performance visualization
- Classification report: Precision, recall, F1 per class
- Train/test split for held-out evaluation
- Unit tests validate model implementations and expected behavior (`pytest tests/`)
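All of these metrics come straight from scikit-learn; the labels below are toy stand-ins for held-out test predictions.

```python
# Sketch: computing the reported metrics with scikit-learn.
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
)

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")

print(classification_report(y_true, y_pred))  # per-class precision/recall/F1
print(confusion_matrix(y_true, y_pred))       # rows: true class, cols: predicted
```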
## Results

Targets:

| Model Type | Modality | Target Accuracy | Target F1-Score |
|---|---|---|---|
| MLP | Text-only | 85% | 80% |
| MLP | Image-only | 75% | 70% |
| MLP | Multimodal | 85% | 80% |
Achieved (held-out test set):

| Model Type | Modality | Accuracy | Macro F1 |
|---|---|---|---|
| MLP | Text-only | 93% | 88% |
| MLP | Image-only | 84% | 77% |
| MLP | Multimodal | 89% | 84% |
Output files:

- `results/multimodal_results.csv`: Predictions for combined embeddings
- `results/image_results.csv`: Predictions for image-only model
- `results/text_results.csv`: Predictions for text-only model
- Classification reports and confusion matrices rendered in the Jupyter notebook
## How to Run

### Prerequisites

- Python 3.9+
- Git (for cloning the repository)
- GPU recommended for embedding extraction (optional; CPU supported)
### Installation

```shell
git clone <your-repo-url>
cd sprint4
```

Windows (PowerShell):

```powershell
python -m venv venv
.\venv\Scripts\Activate.ps1
# If you hit an execution policy error:
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
```

Linux/macOS:

```shell
python3 -m venv venv
source venv/bin/activate
```

Install dependencies:

```shell
python -m pip install --upgrade pip
pip install -r requirements.txt
```

Mac (M-series GPU):

```shell
pip install -r requirements_mac.txt
```

### Data Setup

- Place `processed_products_with_images.csv` in `data/`
- Download images and extract to `data/images/`
Verify the structure:

```
data/
├── processed_products_with_images.csv
└── images/
    └── (product images)
```
### Run the Notebook

```shell
jupyter notebook
```

Open `AnyoneAI - Sprint Project 04.ipynb` and run the cells in order:

1. EDA and Image Downloading – Load data, download images if needed
2. Generating Image Embeddings – Extract image features (ConvNextV2, ResNet50)
3. Generating Text Embeddings – Extract text features (MiniLM-L6-v2)
4. Merge Embeddings – Combine image and text embeddings
5. Classical ML Training – Train Random Forest and Logistic Regression
6. MLP Training – Train MLP for image-only, text-only, and multimodal
### Run Tests

```shell
pytest tests/
```

or without warnings:

```shell
pytest tests/ --disable-warnings
```

### Docker (Optional)

```shell
docker build -t anyoneai-project .
docker run -p 8888:8888 -v $(pwd):/app anyoneai-project
```

Then open the Jupyter URL with the token from the console.
### Code Formatting

```shell
black --line-length=88 .
```

## Architecture Diagram

```
┌─────────────────────────────────────────────────────────────────┐
│                         DATA SOURCES                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────────────────────────┐  ┌─────────────────────────┐  │
│  │  processed_products_         │  │  Product Images         │  │
│  │  with_images.csv             │  │  (data/images/)         │  │
│  │  (metadata + descriptions)   │  │  224x224 JPG            │  │
│  └──────────────────────────────┘  └─────────────────────────┘  │
│               │                              │                  │
└───────────────┼──────────────────────────────┼──────────────────┘
                │                              │
                ▼                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                     EMBEDDING EXTRACTION                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌────────────────────────────┐  ┌──────────────────────────┐   │
│  │  Text Embeddings           │  │  Image Embeddings        │   │
│  │  (src/nlp_models.py)       │  │  (src/vision_embeddings_ │   │
│  ├────────────────────────────┤  │  tf.py)                  │   │
│  │  • MiniLM-L6-v2            │  ├──────────────────────────┤   │
│  │  • BERT / GPT (optional)   │  │  • ConvNextV2            │   │
│  └────────────────────────────┘  │  • ResNet50              │   │
│               │                  └──────────────────────────┘   │
│               │                              │                  │
└───────────────┼──────────────────────────────┼──────────────────┘
                │                              │
                └──────────────┬───────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                        PREPROCESSING                            │
│                       (src/utils.py)                            │
├─────────────────────────────────────────────────────────────────┤
│  • Merge text + image embeddings                                │
│  • Train/test split                                             │
│  • Feature extraction (text_cols, image_cols, label_col)        │
└─────────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                        CLASSIFICATION                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────┐  ┌────────────────────────────┐    │
│  │  Classical ML           │  │  MLP (src/classifiers_mlp) │    │
│  │  (src/classifiers_      │  ├────────────────────────────┤    │
│  │  classic_ml.py)         │  │  • Early fusion            │    │
│  ├─────────────────────────┤  │  • BatchNorm, Dropout      │    │
│  │  • Random Forest        │  │  • Early stopping          │    │
│  │  • Logistic Regression  │  └────────────────────────────┘    │
│  └─────────────────────────┘                                    │
│                                                                 │
│  Modalities: Image-only | Text-only | Combined                  │
└─────────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                           OUTPUT                                │
├─────────────────────────────────────────────────────────────────┤
│  • results/multimodal_results.csv                               │
│  • results/image_results.csv                                    │
│  • results/text_results.csv                                     │
│  • Classification reports and confusion matrices                │
│  • Embeddings/*.csv (for reuse)                                 │
└─────────────────────────────────────────────────────────────────┘
```
### Pipeline Steps

1. Extract – Load products and images from CSV and disk
2. Embed – Generate image embeddings (ConvNextV2/ResNet50) and text embeddings (MiniLM)
3. Merge – Combine embeddings into a single feature matrix
4. Train – Fit classical ML and MLP models per modality
5. Evaluate – Report accuracy, F1, confusion matrices, and save predictions
## Project Structure

```
sprint4/
├── data/                             # Data files (download separately)
│   ├── processed_products_with_images.csv
│   └── images/                       # Product images
├── Embeddings/                       # Generated embeddings (large; exclude from repo)
│   ├── Embeddings_*.csv
│   └── text_embeddings_*.csv
├── results/                          # Model predictions
│   ├── multimodal_results.csv
│   ├── image_results.csv
│   └── text_results.csv
├── src/                              # Source code
│   ├── vision_embeddings_tf.py       # Image embedding extraction
│   ├── nlp_models.py                 # Text embedding extraction
│   ├── utils.py                      # Preprocessing and train/test split
│   ├── classifiers_classic_ml.py     # Random Forest, Logistic Regression
│   └── classifiers_mlp.py            # MLP classifier
├── tests/                            # Unit tests
│   └── test_models.py
├── AnyoneAI - Sprint Project 04.ipynb  # Main notebook
├── README.md                         # This file
├── README_og.md                      # Original project instructions
├── requirements.txt                  # Python dependencies
└── Dockerfile                        # Optional container setup
```
## Technologies Used

- Python 3.9+: Core programming language
- TensorFlow: Image embedding backbones (ResNet50, ConvNextV2)
- Transformers (Hugging Face): ConvNextV2, ViT, MiniLM-L6-v2
- scikit-learn: Classical ML models, preprocessing, metrics
- Pandas: Data manipulation
- NumPy: Numerical operations
- Matplotlib / Seaborn / Plotly: Visualizations
- Pytest: Testing framework
- Black: Code formatter
This project is part of an educational course assignment.

## Author

Marissa Singh

## Acknowledgments

- Best Buy for product data (educational use)
- AnyoneAI for the project framework and guidance
- Hugging Face and TensorFlow for pre-trained models