Data Refactoring & Tag System Design - Summary

Overview

This document summarizes the data refactoring phase that transformed the raw dataset into a clean, normalized, and production-ready tagging dataset suitable for hybrid tagging systems (classification + retrieval).

What Changed

1. Entity Name Normalization

Problem: 1,150 unique entity names, many of them duplicates caused by:

Persian/Arabic character variations (ی/ي, ک/ك)
Inconsistent formatting (spaces, underscores, hyphens)
Typos and naming inconsistencies

Solution: Implemented robust normalization pipeline:

Normalized Persian/Arabic characters to standard forms
Lowercased all text
Removed punctuation, extra spaces, underscores, hyphens
Detected and merged obvious duplicates
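
The normalization steps above could be sketched roughly as follows. This is a minimal illustration, not the contents of entity_name_normalizer.py: the function names are hypothetical, only the two character pairs named above are mapped, and replacing separators with a single space (rather than deleting them) is an assumption.

```python
import re
import unicodedata

# Arabic yeh (U+064A) and kaf (U+0643) mapped to Persian yeh (U+06CC) and keheh (U+06A9).
ARABIC_TO_PERSIAN = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9"})


def normalize_name(raw: str) -> str:
    """Return a canonical key for an entity name (hypothetical helper)."""
    text = unicodedata.normalize("NFKC", raw)
    text = text.translate(ARABIC_TO_PERSIAN)
    text = text.lower()
    # Assumption: punctuation, hyphens and underscores act as separators,
    # so they become spaces instead of being deleted outright.
    text = re.sub(r"[^\w\s]|_", " ", text)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


def build_mapping(names):
    """Map every original name to its normalized form."""
    return {name: normalize_name(name) for name in names}
```

Names that end up with the same normalized form are the "obvious duplicates"; typo-level duplicates would still need a separate review step.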

Results:

Before: 1,150 unique entity names
After: 1,110 normalized entity names
Reduction: 40 names (3.5% reduction)
Impact: Reduced tag space fragmentation

Output: entity_name_mapping.json - mapping from original to normalized names

2. Entity Value Cleaning

Problem: Entity values contained:

Noise (codes, model numbers, random IDs)
Inconsistent formatting
Near-duplicates (e.g., "مشکی براق" vs "مشکی")

Solution: Implemented cleaning pipeline:

Normalized text (same rules as entity names)
Removed meaningless tokens (codes, IDs)
Collapsed near-duplicates into canonical forms
Filtered out noise values
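
A rough sketch of the cleaning steps above. The noise rule (any token containing a digit is treated as a code or ID) and the hand-written canonical map are assumptions made for illustration; the actual rules in entity_value_cleaner.py may differ.

```python
import re

# Assumption: a token is "code-like" if it contains at least one digit and
# consists only of word characters and hyphens (e.g. "AB-1234", "X9981").
CODE_LIKE = re.compile(r"^(?=.*\d)[\w-]+$")

# Hypothetical canonical map; the single entry is the example pair given above.
CANONICAL = {"مشکی براق": "مشکی"}


def clean_value(value, canonical=CANONICAL):
    """Drop code-like tokens, then collapse known near-duplicates.

    Returns None when nothing meaningful is left, so the caller can
    filter the record out as noise.
    """
    tokens = [tok for tok in value.split() if not CODE_LIKE.match(tok)]
    cleaned = " ".join(tokens)
    if not cleaned:
        return None
    return canonical.get(cleaned, cleaned)
```

A digit-based rule would also discard legitimate numeric values such as sizes, so a real pipeline would probably need per-entity exceptions.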

Results:

Before: 116,857 unique values
After: 115,834 unique values
Reduction: 1,023 values (0.9% reduction)
Records removed: 30,506 (2.8% of records filtered as noise)

Impact: Cleaner, more consistent tag values

3. Controlled Tag Vocabulary Construction

Problem: 149,604 unique tags (before normalization) with an extreme long-tail distribution:

90.8% of tags appeared ≤5 times
Only 427 tags accounted for 50% of occurrences
Standard multi-label classification is impractical with this many rare classes

Solution: Built controlled tag vocabulary with frequency threshold (≥20):

Computed tag frequencies after normalization
Split tags into:
- Controlled tags: Frequent, stable, model-friendly (frequency ≥ 20)
- Long-tail tags: Rare, to be handled via retrieval (frequency < 20)
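
The split itself is a frequency count followed by a threshold. The sketch below assumes tags arrive as one list per image; the function names are illustrative and not taken from controlled_tag_builder.py.

```python
from collections import Counter

MIN_FREQUENCY = 20  # threshold stated above


def split_vocabulary(tag_lists, min_frequency=MIN_FREQUENCY):
    """Split tags into controlled and long-tail dicts of tag -> count."""
    counts = Counter(tag for tags in tag_lists for tag in tags)
    controlled = {t: c for t, c in counts.items() if c >= min_frequency}
    long_tail = {t: c for t, c in counts.items() if c < min_frequency}
    return controlled, long_tail


def occurrence_coverage(controlled, long_tail):
    """Share of all tag occurrences that fall on controlled tags."""
    covered = sum(controlled.values())
    total = covered + sum(long_tail.values())
    return covered / total if total else 0.0
```

Coverage here is measured over occurrences, not over distinct tags, which is why a small share of the vocabulary can still cover most of the data.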

Results:

Total unique tags: 143,846 (after normalization)
Controlled tags: 4,620 (3.2% of tags)
Long-tail tags: 139,226 (96.8% of tags)
Controlled tag coverage: 74.8% of all tag occurrences

Impact:

Reduced tag space from 143,846 → 4,620 (97% reduction)
Maintained 74.8% coverage of tag occurrences
Made multi-label classification feasible

Outputs:

controlled_tags.json - 4,620 controlled tags with frequencies
long_tail_tags.json - 139,226 long-tail tags for retrieval

4. Dataset Refactoring

Problem: Raw dataset had:

Images with zero tags (6,485 images, 1.76%)
Inconsistent tag structure
No separation between controlled and long-tail tags

Solution: Created cleaned dataset version (v3):

Removed images with zero controlled tags
Preserved mapping to original data (image_url, product_id)
Produced normalized flat table with:
- image_id: Unique identifier
- image_url: Image URL
- controlled_tags: List of controlled tags (for classification)
- long_tail_tags: List of long-tail tags (for retrieval)
- Metadata: title, group, product
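
One way to produce such a table, shown in plain Python to stay dependency-free. The input record shape (a dict with a `tags` list per image) is an assumption; the output keys follow the column list above.

```python
def build_rows(records, controlled_vocab):
    """Build one flat row per image, dropping images with no controlled tag.

    records: iterable of dicts with image_id, image_url, tags (list of
    "entity_name:entity_value" strings) and optional metadata.
    controlled_vocab: set of controlled tag strings.
    """
    rows = []
    for rec in records:
        controlled = [t for t in rec["tags"] if t in controlled_vocab]
        if not controlled:
            continue  # no controlled tag: not usable for classification
        rows.append({
            "image_id": rec["image_id"],
            "image_url": rec["image_url"],
            "controlled_tags": controlled,
            "long_tail_tags": [t for t in rec["tags"] if t not in controlled_vocab],
            "title": rec.get("title"),
            "group": rec.get("group"),
            "product": rec.get("product"),
        })
    return rows
```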

Results:

Original images: 368,101
Cleaned images: 341,578 (with at least one controlled tag)
Image retention: 92.8%
Images removed: 26,523 (7.2%), all of them images with no controlled tag, including the 6,485 that had no tags at all in the raw data

Impact: Clean, consistent dataset ready for modeling

Output: dataset_v3_cleaned.csv / dataset_v3_cleaned.parquet

5. Validation & Quality Checks

Validation Results:

✅ No missing image URLs: 0 missing (100% complete)
✅ No zero controlled tags: All images have ≥1 controlled tag
✅ No duplicate image IDs: All unique
✅ 100% of controlled tags in use: all 4,620 appear on at least one image (this is distinct from the 74.8% occurrence coverage)
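
The four checks can be expressed as simple counts that should all be zero. This is a sketch under the assumption that each row is a dict with `image_id`, `image_url` and `controlled_tags`; it is not the code in data_validation.py.

```python
def validate(rows, controlled_vocab):
    """Return a dict of problem counts; every value should be 0."""
    ids = [r["image_id"] for r in rows]
    used = {t for r in rows for t in r["controlled_tags"]}
    return {
        "missing_urls": sum(1 for r in rows if not r.get("image_url")),
        "zero_controlled": sum(1 for r in rows if not r["controlled_tags"]),
        "duplicate_ids": len(ids) - len(set(ids)),
        "unused_controlled_tags": len(set(controlled_vocab) - used),
    }
```

A pipeline could fail fast with `assert not any(validate(rows, vocab).values())`.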

Tag Distribution:

Mean controlled tags per image: 2.28
Median: 2.0
Range: 1 - 20 tags
Percentiles: p25=1, p50=2, p75=3, p90=4, p95=5, p99=6
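
Statistics of this kind can be recomputed with the standard library, as in the sketch below. The percentile interpolation method is an assumption, so results may differ slightly from the figures reported above, and the function needs at least two rows.

```python
import statistics


def tag_count_summary(rows):
    """Summarize the number of controlled tags per image."""
    counts = sorted(len(r["controlled_tags"]) for r in rows)
    # quantiles(n=100) returns 99 cut points; cuts[p - 1] is the p-th percentile.
    cuts = statistics.quantiles(counts, n=100, method="inclusive")
    return {
        "mean": statistics.mean(counts),
        "median": statistics.median(counts),
        "min": counts[0],
        "max": counts[-1],
        "percentiles": {p: cuts[p - 1] for p in (25, 50, 75, 90, 95, 99)},
    }
```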

Top Tags:

رنگ:مشکی (color: black): 18,450 occurrences (5.40% of images)
نوع کلی:تیشرت (general type: T-shirt): 14,517 occurrences (4.25% of images)
نوع کلی:جاکلیدی (general type: keychain): 8,133 occurrences (2.38% of images)

Why This Dataset is Now Suitable for Production Modeling

1. Controlled Tag Space ✅

4,620 controlled tags is manageable for multi-label classification
74.8% coverage means most tag occurrences are covered
Tags are frequent enough (≥20 occurrences) for reliable learning

2. Normalized & Consistent ✅

Entity names normalized (1,110 unique types)
Entity values cleaned and deduplicated
Consistent tag format: entity_name:entity_value

3. High Quality ✅

All images have at least one controlled tag
No missing image URLs
No duplicate image IDs
Every controlled tag is used by at least one image

4. Proper Structure ✅

Clear separation: controlled tags (classification) vs long-tail tags (retrieval)
Image-level aggregation with metadata preserved
Ready for train/val/test splits

5. Production-Ready ✅

341,578 images is substantial for training
Average 2.28 controlled tags per image is reasonable for multi-label
92.8% image retention maintains good coverage

Comparison: Before vs After

| Metric | Before (Raw) | After (Cleaned) | Change |
| --- | --- | --- | --- |
| Total Images | 368,101 | 341,578 | -7.2% (removed images with no controlled tags) |
| Unique Entity Names | 1,150 | 1,110 | -3.5% (normalized) |
| Unique Entity Values | 116,857 | 115,834 | -0.9% (cleaned) |
| Unique Tags | 149,604 | 4,620 (controlled) | -97% (filtered) |
| Long-tail Share of Tags | 90.8% (frequency ≤ 5) | 96.8% (frequency < 20) | Separated for retrieval |
| Tag Coverage | N/A | 74.8% of tag occurrences (controlled) | Defined |
| Avg Tags/Image | 2.98 | 2.28 (controlled) | More focused |
| Images with 0 Tags | 6,485 (1.76%) | 0 (0%) | ✅ Fixed |

Key Improvements

1. Made Multi-Label Classification Feasible

Reduced tag space from 149K → 4.6K (97% reduction)
Maintained 74.8% coverage of tag occurrences
All tags have sufficient frequency (≥20) for learning

2. Normalized Inconsistencies

Entity name variations consolidated
Entity value duplicates merged
Consistent tag format throughout

3. Separated Concerns

Controlled tags: For classification models (frequent, stable)
Long-tail tags: For retrieval systems (rare, handled separately)

4. Production Quality

All images have at least one controlled tag
No missing image URLs and no duplicate image IDs
Clean, consistent structure

Next Steps for Modeling

1. Train/Validation/Test Split

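Use a multi-label-aware stratified split (for example, iterative stratification) so that rare controlled tags appear in all three splits
Recommended: 70% train, 15% val, 15% test (a tag at the frequency threshold of 20 then has roughly 14, 3, and 3 examples)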
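
Because each image carries several tags, ordinary single-label stratification does not apply directly. The sketch below is a simplified greedy variant of iterative stratification, written for illustration only: it orders images by their rarest tag once (rather than re-ranking after every assignment) and it does not guarantee that every tag lands in every split. It assumes every row has at least one controlled tag.

```python
import random
from collections import Counter


def stratified_split(rows, ratios=(0.70, 0.15, 0.15), seed=0):
    """Greedy multi-label split into train/val/test (illustrative sketch)."""
    names = ("train", "val", "test")
    order = list(rows)
    random.Random(seed).shuffle(order)
    freq = Counter(t for r in order for t in r["controlled_tags"])
    # Handle images with the scarcest tags first (stable sort keeps the shuffle).
    order.sort(key=lambda r: min(freq[t] for t in r["controlled_tags"]))
    # How many examples of each tag, and how many images, each split still wants.
    want_tag = {n: {t: c * ratio for t, c in freq.items()}
                for n, ratio in zip(names, ratios)}
    want_size = {n: len(order) * ratio for n, ratio in zip(names, ratios)}
    splits = {n: [] for n in names}
    for row in order:
        rarest = min(row["controlled_tags"], key=freq.__getitem__)
        # Choose the split that still needs this tag most; break ties by capacity.
        target = max(names, key=lambda n: (want_tag[n][rarest], want_size[n]))
        splits[target].append(row)
        want_size[target] -= 1
        for t in row["controlled_tags"]:
            want_tag[target][t] -= 1
    return splits
```

After splitting, it is worth checking directly which controlled tags are missing from the validation or test set, since the greedy order only makes full coverage likely.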

2. Model Architecture

Multi-label classification for 4,620 controlled tags
Consider:
- Multi-label CNN (e.g., ResNet with sigmoid output)
- Vision Transformer (ViT) as an alternative backbone that may improve accuracy
- Hybrid approach: Classification + Retrieval
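
Whatever backbone is chosen, the multi-label head works the same way: one logit per controlled tag, an independent sigmoid per logit, and binary cross-entropy against a multi-hot target. The framework-free sketch below shows only that logic; in a real model the logits would come from the network, and the 0.5 threshold is an assumption that would normally be tuned.

```python
import math


def multi_hot(tags, tag_index):
    """Encode a tag list as a 0/1 vector over the controlled vocabulary."""
    vec = [0.0] * len(tag_index)
    for tag in tags:
        if tag in tag_index:
            vec[tag_index[tag]] = 1.0
    return vec


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def bce_loss(logits, targets, eps=1e-12):
    """Mean binary cross-entropy over all tags for one image."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        total += y * math.log(p + eps) + (1.0 - y) * math.log(1.0 - p + eps)
    return -total / len(logits)


def decode(logits, index_to_tag, threshold=0.5):
    """Return every tag whose probability reaches the threshold."""
    return [index_to_tag[i] for i, z in enumerate(logits) if sigmoid(z) >= threshold]
```

Unlike softmax, the per-tag sigmoid lets an image receive zero, one, or many tags, which matches the 1 to 20 tags per image seen in the data.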

3. Evaluation Metrics

Primary: F1-score (macro, micro, per-class)
Secondary: Precision@K, Recall@K
Long-tail handling: Focus on per-class F1 for rare tags
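
For reference, these metrics can be computed from per-image tag sets as sketched below. Treating a tag with no true and no predicted occurrences as F1 = 0 is an assumption (libraries make this configurable), and every predicted tag is assumed to be in `tags`.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_scores(y_true, y_pred, tags):
    """y_true, y_pred: lists of tag sets, one per image.

    Returns (per_class, macro, micro).
    """
    stats = {t: [0, 0, 0] for t in tags}  # tp, fp, fn per tag
    for truth, pred in zip(y_true, y_pred):
        for t in pred & truth:
            stats[t][0] += 1
        for t in pred - truth:
            stats[t][1] += 1
        for t in truth - pred:
            stats[t][2] += 1
    per_class = {t: f1(*s) for t, s in stats.items()}
    macro = sum(per_class.values()) / len(per_class)      # every tag weighs the same
    micro = f1(*[sum(s[i] for s in stats.values()) for i in range(3)])  # pooled counts
    return per_class, macro, micro


def precision_at_k(truth, ranked, k):
    """Fraction of the top-k ranked tags that are correct."""
    return sum(1 for t in ranked[:k] if t in truth) / k
```

Macro F1 is the more sensitive of the two to rare tags, because frequent tags such as the top colors dominate the pooled counts behind micro F1.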

4. Retrieval System (for long-tail tags)

Build separate retrieval system for 139,226 long-tail tags
Use image embeddings + semantic search
Combine with classification predictions
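
A minimal sketch of how retrieval and classification might be combined: find the most similar indexed images by cosine similarity, let them vote for their long-tail tags, and append the winners to the classifier's output. The brute-force search, the similarity-weighted voting, and the `k` and `min_score` defaults are all assumptions; a production system would likely use an approximate nearest-neighbor index.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_long_tail(query, index, k=5, min_score=0.5):
    """index: list of (embedding, long_tail_tags) pairs for known images."""
    scored = sorted(((cosine(query, emb), tags) for emb, tags in index),
                    key=lambda pair: pair[0], reverse=True)[:k]
    votes = defaultdict(float)
    for sim, tags in scored:
        for tag in tags:
            votes[tag] += sim  # closer neighbors count for more
    ranked = sorted(votes.items(), key=lambda kv: kv[1], reverse=True)
    return [tag for tag, score in ranked if score >= min_score]


def hybrid_tags(classifier_tags, query, index, **kwargs):
    """Classifier tags first, then retrieved long-tail tags not already present."""
    retrieved = retrieve_long_tail(query, index, **kwargs)
    return list(classifier_tags) + [t for t in retrieved if t not in classifier_tags]
```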

Files Generated

Core Outputs

entity_name_mapping.json - Entity name normalization mapping
controlled_tags.json - 4,620 controlled tags with frequencies
long_tail_tags.json - 139,226 long-tail tags for retrieval
dataset_v3_cleaned.csv - Cleaned dataset (341,578 images)
dataset_v3_cleaned.parquet - Same dataset in parquet format

Validation Outputs

data_validation_report.md - Comprehensive validation report
data_validation_results.json - Validation results (JSON)
data_validation_plots.png - Distribution plots

Scripts

entity_name_normalizer.py - Entity name normalization
entity_value_cleaner.py - Entity value cleaning
controlled_tag_builder.py - Controlled tag vocabulary builder
build_clean_dataset.py - Dataset refactoring
data_validation.py - Validation checks
run_refactoring_pipeline.py - Main orchestration script

Conclusion

The data refactoring phase successfully transformed the raw dataset into a production-ready tagging dataset:

✅ Normalized: Entity names and values are consistent
✅ Filtered: Controlled tag space is manageable (4,620 tags)
✅ Coverage: 74.8% of tag occurrences are covered
✅ Quality: All images have controlled tags, no missing image URLs, no duplicate image IDs
✅ Structure: Clear separation between controlled and long-tail tags

The dataset is now ready for modeling.

The controlled tags (4,620) can be used for multi-label classification, while the long-tail tags (139,226) can be handled via a separate retrieval system, creating a hybrid tagging approach suitable for production deployment.

Generated: 2024-12-22
Pipeline Version: v1.0
Status: ✅ Complete
