This project performs Natural Language Processing (NLP) on a large dataset of online retail product reviews written in Portuguese. The goal is to analyze customer sentiments through review scores and text mining techniques like unigrams, bigrams, and trigrams.
It also visualizes rating distributions and frequent word patterns to better understand customer satisfaction and feedback trends.
This dataset contains over 100,000 product reviews from an online retail platform. The analysis focuses on:
- Understanding review score distribution (1–5 stars)
- Cleaning and normalizing Portuguese text
- Tokenizing text and removing stopwords
- Generating and analyzing N-grams (Unigrams, Bigrams, Trigrams)
- Visualizing frequent patterns for positive (5-star) and negative (1-star) reviews
- 📊 5-star reviews dominate the dataset, indicating overall customer satisfaction.
- ⚠️ 1-star reviews show recurring issues related to delivery delays and product quality.
- 🔠 Top unigrams and bigrams reveal common terms customers use in both positive and negative feedback.
- 🗣️ Portuguese-specific stopwords were removed to retain only meaningful words.
- 🎨 Visualizations provide comparative frequency distributions for positive and negative review language.
| Tool | Purpose |
|---|---|
| Python | Core programming language |
| NLTK (Natural Language Toolkit) | Tokenization, stopword removal, and N-gram generation |
| Pandas | Data manipulation and handling |
| Matplotlib & Seaborn | Data visualization |
| unicodedata (Python standard library) | Removing text accents for normalization |
| Portuguese Stopwords | Filtering non-informative words from reviews |
- Load dataset: `order_reviews.csv` (≈100,000 rows)
- Convert `review_creation_date` to datetime format
- Visualize review score distribution using pie charts and bar plots
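The loading steps can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the `review_score` column name is an assumption (only `review_creation_date` is named here), and a three-row inline CSV stands in for `order_reviews.csv` so the snippet runs on its own.

```python
import io

import pandas as pd


def load_reviews(source):
    """Read the reviews CSV and parse the creation date column."""
    df = pd.read_csv(source)
    df["review_creation_date"] = pd.to_datetime(df["review_creation_date"])
    return df


def score_distribution(df, score_col="review_score"):
    """Count reviews per star rating, ordered from lowest to highest score.

    `score_col` is an assumed column name; adjust it to the real CSV header.
    """
    return df[score_col].value_counts().sort_index()


# Toy stand-in for order_reviews.csv (made-up rows, for illustration only)
sample = io.StringIO(
    "review_score,review_creation_date\n"
    "5,2018-01-18 00:00:00\n"
    "1,2018-03-10 00:00:00\n"
    "5,2018-02-17 00:00:00\n"
)

reviews = load_reviews(sample)
counts = score_distribution(reviews)
print(counts)
```

With the real file, `load_reviews("order_reviews.csv")` would replace the in-memory sample.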
- Convert all text to lowercase
- Remove accents using `unicodedata.normalize()`
- Tokenize text using `nltk.word_tokenize()`
- Remove Portuguese stopwords using NLTK’s corpus
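The preprocessing steps above could look something like the sketch below. To keep it runnable without downloading NLTK data, it makes two substitutions that are assumptions of this example: a regex-based `simple_tokenize` stands in for `nltk.word_tokenize()`, and a tiny hand-picked stopword set stands in for NLTK's Portuguese stopword corpus. The helper names are hypothetical.

```python
import re
import unicodedata


def strip_accents(text):
    """Drop diacritics, e.g. 'não' -> 'nao' and 'ótimo' -> 'otimo'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def simple_tokenize(text):
    # Stand-in for nltk.word_tokenize: keeps runs of ASCII letters only,
    # which is adequate once the text is lowercased and accent-free.
    return re.findall(r"[a-z]+", text)


def preprocess(text, stopwords, tokenize=simple_tokenize):
    """Lowercase, strip accents, tokenize, then remove stopwords."""
    normalized = strip_accents(text.lower())
    # Normalize the stopwords the same way as the text; otherwise an
    # accented stopword such as 'não' would never match the token 'nao'.
    stop = {strip_accents(word.lower()) for word in stopwords}
    return [tok for tok in tokenize(normalized) if tok not in stop]


# Tiny illustrative list, not the full NLTK Portuguese stopword corpus
sample_stopwords = {"o", "a", "no", "não", "de"}
tokens = preprocess("O produto NÃO chegou no prazo!", sample_stopwords)
print(tokens)
```

To follow the project's stated tooling, one would pass `tokenize=nltk.word_tokenize` and `stopwords=nltk.corpus.stopwords.words("portuguese")` instead. The point about normalizing the stopwords matters either way, since that list contains accented entries.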
Generate:
- Unigrams → single words
- Bigrams → two-word combinations
- Trigrams → three-word combinations
Used for both:
- 5-star reviews (positive sentiment)
- 1-star reviews (negative sentiment)
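As a rough illustration of n-gram generation and counting, the sketch below uses only the standard library. It is a stand-in, not the project's code: `ngrams` and `top_ngrams` are hypothetical helpers, and the token lists are toy data. Counting is done per review so that n-grams never span two different reviews, which is a design choice of this sketch.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token tuples from a list of tokens."""
    return list(zip(*(tokens[i:] for i in range(n))))


def top_ngrams(token_lists, n, k=25):
    """Count n-grams across many tokenized reviews and return the k most common."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(ngrams(tokens, n))
    return counts.most_common(k)


# Toy tokenized reviews (already lowercased, accent-free, stopwords removed)
reviews_tokens = [
    ["produto", "chegou", "antes", "prazo"],
    ["produto", "chegou", "prazo"],
]

top_unigrams = top_ngrams(reviews_tokens, n=1, k=3)
top_bigrams = top_ngrams(reviews_tokens, n=2, k=3)
print(top_bigrams[0])
```

`nltk.util.ngrams(tokens, n)` yields the same tuples, and `nltk.FreqDist` is a `Counter` subclass, so `most_common()` behaves the same way there. Running the function separately on the 5-star and 1-star subsets gives the two lists to compare.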
The top 25 most frequent n-grams for both positive and negative reviews are plotted using:

    nltk.FreqDist(tokens).plot(25, cumulative=False, color=color)

Project structure:

    Online_Retail_Order_Reviews/
    │
    ├── order_reviews_analysis.py # Main Python analysis script
    ├── order_reviews.csv # Dataset (100k+ Portuguese reviews)
    ├── README.md # Project documentation
    └── images/ # (Optional) visualizations and n-gram plots
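Install all dependencies:

    pip install pandas matplotlib seaborn nltk

Also download NLTK datasets (only once):

    import nltk
    nltk.download('punkt')      # tokenizer models used by nltk.word_tokenize
    nltk.download('stopwords')  # includes the Portuguese stopword list
    # Newer NLTK releases may additionally require nltk.download('punkt_tab')

Run the analysis script:

    python order_reviews_analysis.py

- Pie charts and bar plots display review distributions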
- N-gram frequency charts visualize word usage patterns
- Pie Chart: Distribution of review scores (1–5)
- Bar Graph: Count of reviews per score on a dark grid background
- N-gram Frequency Graphs:
  - Top 25 Unigrams for 5-star reviews
  - Top 25 Bigrams for 1-star reviews
  - Top 25 Trigrams for both categories
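The pie chart and bar graph of review scores could be produced along these lines. This is a hedged sketch using plain Matplotlib: `plot_score_distribution` is a hypothetical helper, the counts are made-up placeholders rather than results from the dataset, and the dark grid background (presumably a Seaborn style such as `darkgrid`) is left out to keep the example small.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt


def plot_score_distribution(score_counts):
    """Draw a pie chart and a bar chart from a {score: review_count} mapping."""
    scores = sorted(score_counts)
    counts = [score_counts[s] for s in scores]

    fig, (ax_pie, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

    # Pie chart: share of reviews per score
    ax_pie.pie(counts, labels=[f"{s} star" for s in scores], autopct="%1.1f%%")
    ax_pie.set_title("Share of reviews per score")

    # Bar chart: absolute number of reviews per score
    ax_bar.bar([str(s) for s in scores], counts)
    ax_bar.set_xlabel("Review score")
    ax_bar.set_ylabel("Number of reviews")
    ax_bar.set_title("Reviews per score")

    fig.tight_layout()
    return fig, ax_pie, ax_bar


# Placeholder counts for illustration only, not taken from the dataset
example_counts = {1: 12, 2: 3, 3: 8, 4: 19, 5: 58}
fig, ax_pie, ax_bar = plot_score_distribution(example_counts)
```

Calling `fig.savefig(...)` with a path under `images/` would write the figure to disk. In an interactive session, drop the `matplotlib.use("Agg")` line and call `plt.show()` instead.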
- Add sentiment polarity analysis using TextBlob or Vader
- Translate Portuguese text for global reporting
- Implement topic modeling (LDA) for deeper insight into customer issues
- Deploy a Streamlit NLP dashboard for interactive visualization
- Integrate word clouds for better visual storytelling
Rachakonda Ganesh 📧 [rachakondaganesh60@gmail.com] 🔗 GitHub 🔗 LinkedIn
This project highlights the power of NLP in retail analytics, converting unstructured text data into meaningful insights. By combining Python, NLTK, and visualization tools, it offers a data-driven view of customer satisfaction and recurring pain points, which is essential for enhancing user experience and improving product quality.