The Urban Environmental Intelligence Engine (UEIE-2025) is a scalable smart-city analytics system that detects environmental anomalies in hourly air-quality data from 100 global sensor stations, collected via the OpenAQ Global Air Quality API for the year 2025.
The diagnostic engine analyzes hourly values of PM2.5, PM10, NO2, Ozone, Temperature, and Humidity across all stations.
Data Scienece Assignment 2/
├── data_fetcher.py # OpenAQ API data fetching module
├── data_processor.py # Big data processing module
├── task1_dimensionality.py # Task 1: PCA dimensionality reduction
├── task2_temporal.py # Task 2: High-density temporal analysis
├── task3_distribution.py # Task 3: Distribution modeling
├── task4_visual_integrity.py # Task 4: Visual integrity audit
├── main.py # Main pipeline script
├── dashboard.py # Streamlit interactive dashboard
├── requirements.txt # Python dependencies
└── README.md # This file
- Install Python 3.8 or higher
- Install dependencies:
pip install -r requirements.txt
Run the main pipeline to fetch data and execute all tasks:
python main.py
This will:
- Fetch data from the OpenAQ API (or create synthetic data if the API is unavailable)
- Process and cache the data
- Execute all four tasks
- Generate visualizations in the outputs/ directory
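The fetch-or-fall-back behaviour in the first step can be sketched as follows. This is a minimal illustration, not the actual contents of data_fetcher.py: the names `load_measurements`, `synthetic_hourly_pm25`, and `unavailable_api` are hypothetical, and the live API call is replaced by a stand-in that always fails.

```python
import math
import random
from typing import Callable, Dict, List

Record = Dict[str, float]


def synthetic_hourly_pm25(hours: int, seed: int = 0) -> List[Record]:
    """Generate placeholder PM2.5 readings: a daily cycle plus Gaussian noise."""
    rng = random.Random(seed)
    records = []
    for hour in range(hours):
        daily_cycle = 10.0 * math.sin(2 * math.pi * (hour % 24) / 24)
        value = max(0.0, 25.0 + daily_cycle + rng.gauss(0, 5))
        records.append({"hour": hour, "pm25": value})
    return records


def load_measurements(fetch: Callable[[], List[Record]], hours: int = 48) -> List[Record]:
    """Try the live fetch first; fall back to synthetic data if it fails or is empty."""
    try:
        data = fetch()
        if data:
            return data
    except (OSError, TimeoutError, ValueError):
        # ConnectionError is a subclass of OSError, so network failures land here.
        pass
    return synthetic_hourly_pm25(hours)


def unavailable_api() -> List[Record]:
    """Hypothetical stand-in for an API call that cannot reach the server."""
    raise ConnectionError("API unreachable")


records = load_measurements(unavailable_api, hours=48)
print(len(records), "records loaded")
```

Passing the fetch function in as an argument keeps the fallback logic testable without network access.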
Each task can be run independently:
python task1_dimensionality.py
python task2_temporal.py
python task3_distribution.py
python task4_visual_integrity.py
Launch the interactive dashboard:
streamlit run dashboard.py
The dashboard will open in your browser at http://localhost:8501
- Applies PCA to project 6-dimensional environmental data into 2D
- Visualizes Industrial vs Residential zone clustering
- Analyzes PCA loadings to identify main pollution drivers
Outputs:
- outputs/task1_clusters.png: Zone clustering visualization
- outputs/task1_loadings.png: PCA loadings visualization
- outputs/task1_loadings.csv: Loadings data
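As a rough sketch of the PCA step described above, the following projects six standardized features onto two components using NumPy's SVD. It is an assumption-laden illustration rather than the code in task1_dimensionality.py: the data are random, the function name `pca_2d` is hypothetical, and "loadings" here means the unit-length component vectors.

```python
import numpy as np


def pca_2d(X):
    """Standardize columns, then project onto the first two principal components."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # SVD of the standardized matrix: the rows of Vt are the principal axes.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    components = Vt[:2]                       # shape (2, n_features)
    scores = Z @ components.T                 # shape (n_samples, 2)
    explained = (S ** 2) / np.sum(S ** 2)     # variance share per component
    return scores, components.T, explained[:2]


features = ["PM2.5", "PM10", "NO2", "Ozone", "Temperature", "Humidity"]
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
# Make PM10 track PM2.5, purely so the toy data has some structure.
X[:, 1] = 0.9 * X[:, 0] + 0.1 * X[:, 1]

scores, loadings, explained = pca_2d(X)
for name, (pc1, pc2) in zip(features, loadings):
    print(f"{name:12s} PC1={pc1:+.2f} PC2={pc2:+.2f}")
```

Features with the largest absolute loading on a component contribute most to it, which is how the loadings can point to the main pollution drivers.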
- Identifies PM2.5 > 35 µg/m³ violations across 100 sensors
- Uses heatmap visualization to avoid overplotting
- Identifies periodic signatures (daily vs monthly patterns)
Outputs:
- outputs/task2_heatmap.png: High-density heatmap
- outputs/task2_small_multiples.png: Temporal pattern analysis
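One way to prepare data for such a heatmap is to reduce the hourly series to a station-by-day grid of exceedance counts, and to average by hour of day to expose a daily signature. The sketch below assumes a dense NumPy array of shape (stations, hours); the function names and toy data are hypothetical and not taken from task2_temporal.py.

```python
import numpy as np

THRESHOLD = 35.0  # PM2.5 limit quoted above


def violation_grid(pm25, hours_per_day=24):
    """Count threshold exceedances per station and day.

    pm25 has shape (n_stations, n_hours), with n_hours a multiple of hours_per_day.
    Returns an integer array of shape (n_stations, n_days).
    """
    n_stations, n_hours = pm25.shape
    exceed = pm25 > THRESHOLD
    return exceed.reshape(n_stations, n_hours // hours_per_day, hours_per_day).sum(axis=2)


def hour_of_day_profile(pm25, hours_per_day=24):
    """Mean PM2.5 for each hour of day, averaged over all stations and days."""
    n_stations, _ = pm25.shape
    return pm25.reshape(n_stations, -1, hours_per_day).mean(axis=(0, 1))


# Toy data: 3 stations, 2 days; station 1 exceeds the limit in the first 5 hours of each day.
pm25 = np.full((3, 48), 20.0)
pm25[1, 0:5] = 50.0
pm25[1, 24:29] = 50.0

grid = violation_grid(pm25)
profile = hour_of_day_profile(pm25)
print(grid)
```

Each cell of `grid` becomes one heatmap tile, so 100 stations over a year need 36,500 tiles instead of hundreds of thousands of overlapping points.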
- Creates peak-optimized and tail-optimized distribution plots
- Determines 99th percentile of pollution levels
- Provides technical justification for plot selection
Outputs:
- outputs/task3_peak_optimized.png: Peak-optimized plot
- outputs/task3_tail_optimized.png: Tail-optimized plot
- outputs/task3_comparison.png: Side-by-side comparison
- outputs/task3_percentiles.csv: Percentile statistics
- outputs/task3_justification.txt: Technical justification
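The 99th-percentile calculation can be illustrated with the standard library alone. This sketch uses a log-normal toy sample as an assumed stand-in for right-skewed pollution readings; `percentile_99` and `tail_fraction` are hypothetical helper names, and the interpolation method may differ from the one used in task3_distribution.py.

```python
import random
import statistics


def percentile_99(values):
    """99th percentile using linear interpolation between order statistics."""
    cut_points = statistics.quantiles(values, n=100, method="inclusive")
    return cut_points[98]  # 99 cut points; index 98 is the 99th percentile


def tail_fraction(values, threshold):
    """Empirical share of observations strictly above threshold."""
    return sum(v > threshold for v in values) / len(values)


rng = random.Random(0)
# Log-normal toy sample: most mass near the peak, with a long right tail.
sample = [rng.lognormvariate(3.0, 0.6) for _ in range(10_000)]

p99 = percentile_99(sample)
print(f"median={statistics.median(sample):.1f}  p99={p99:.1f}  "
      f"share above p99={tail_fraction(sample, p99):.4f}")
```

A linear-scale histogram shows the peak well but compresses the 1% of readings beyond `p99` into a few barely visible bins, which is the usual motivation for a separate tail-optimized view.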
- Evaluates 3D bar chart proposal (REJECTED)
- Implements Bivariate Mapping and Small Multiples alternatives
- Justifies Sequential color scale choice
Outputs:
- outputs/task4_bivariate_mapping.png: Bivariate mapping
- outputs/task4_small_multiples.png: Small multiples
- outputs/task4_evaluation.txt: 3D proposal evaluation
- outputs/task4_color_justification.txt: Color scale justification
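To show what a sequential color scale does, the sketch below maps a value onto a light-to-dark ramp by piecewise-linear interpolation. The anchor colors are illustrative assumptions in a yellow-orange-red style, not the exact YlOrRd table, and `sequential_color` is a hypothetical helper rather than part of task4_visual_integrity.py.

```python
def lerp_channel(a, b, t):
    """Linearly interpolate one 0-255 color channel."""
    return round(a + (b - a) * t)


def sequential_color(value, vmin, vmax, ramp):
    """Map value in [vmin, vmax] to an RGB tuple along a list of anchor colors."""
    if vmax <= vmin:
        raise ValueError("vmax must exceed vmin")
    t = min(1.0, max(0.0, (value - vmin) / (vmax - vmin)))  # clamp to [0, 1]
    position = t * (len(ramp) - 1)
    i = min(int(position), len(ramp) - 2)  # index of the lower anchor
    frac = position - i
    lo, hi = ramp[i], ramp[i + 1]
    return tuple(lerp_channel(a, b, frac) for a, b in zip(lo, hi))


# Assumed anchors: light yellow -> orange -> dark red.
RAMP = [(255, 255, 204), (253, 141, 60), (128, 0, 38)]

for pm25 in (0, 25, 50, 75, 100):
    print(pm25, sequential_color(pm25, 0, 100, RAMP))
```

Because the ramp darkens monotonically, higher concentrations always read as darker, which suits an ordered quantity with no meaningful midpoint.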
- Big Data Handling: Uses Parquet format and chunked processing for efficiency
- No Graphical Ducks: All visualizations avoid 3D effects, shadows, and unnecessary grids
- Reproducibility: Modular Python pipeline (not Jupyter notebooks)
- Data Location: All data stored in D:/Data Scienece Assignment 2/data/
- API: OpenAQ Global Air Quality API
- Parameters: PM2.5, PM10, NO2, Ozone, Temperature, Humidity
- Stations: 100 global sensor nodes
- Period: Entire year 2025 (hourly values)
- Modular Design: Each task is a separate module
- Efficient Processing: Handles multi-gigabyte datasets
- Interactive Dashboard: Streamlit-based visualization interface
- Comprehensive Analysis: Covers dimensionality reduction, temporal analysis, distribution modeling, and visual integrity
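The chunked processing mentioned above can be illustrated without any file format at all. The sketch below streams values in fixed-size chunks and keeps only running totals in memory. Parquet I/O is deliberately left out, and `chunks` and `chunked_mean_and_max` are hypothetical names, not functions from data_processor.py.

```python
from itertools import islice


def chunks(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def chunked_mean_and_max(readings, chunk_size=10_000):
    """Compute mean and max in one pass while holding only one chunk in memory."""
    total, count, peak = 0.0, 0, float("-inf")
    for chunk in chunks(readings, chunk_size):
        total += sum(chunk)
        count += len(chunk)
        peak = max(peak, max(chunk))
    if count == 0:
        raise ValueError("no readings supplied")
    return total / count, peak


# A generator stands in for a dataset too large to load at once.
readings = (float(i % 50) for i in range(100_000))
mean, peak = chunked_mean_and_max(readings, chunk_size=8_192)
print(f"mean={mean:.2f} max={peak:.1f}")
```

The same pattern applies to any statistic that can be updated incrementally, such as sums, counts, minima, and maxima.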
outputs/
├── task1_clusters.png
├── task1_loadings.png
├── task1_loadings.csv
├── task2_heatmap.png
├── task2_small_multiples.png
├── task3_peak_optimized.png
├── task3_tail_optimized.png
├── task3_comparison.png
├── task3_percentiles.csv
├── task3_justification.txt
├── task4_bivariate_mapping.png
├── task4_small_multiples.png
├── task4_evaluation.txt
└── task4_color_justification.txt
- If the OpenAQ API is unavailable, the system automatically generates synthetic data for demonstration
- All visualizations follow Tufte's principles: maximize data-ink ratio, minimize chartjunk
- Sequential color scales (YlOrRd) are used for quantitative data visualization
Muhammad Anas
- GitHub: @MuhammadAnas4774
- Repository: Urban Environmental Intelligence Engine (UEIE-2025)
Data Science Assignment 2 - Urban Environmental Intelligence Challenge