Skip to content

salsabila-rahmah/SQL-Based-Data-Preparation-and-Validation-for-BI-Analysis

Repository files navigation

SQL-Based Data Preparation and Validation for BI Analysis

PROJECT OVERVIEW

DATA VALIDATION & MODELING PROCESS


PROJECT OVERVIEW

Two quarterly sales datasets were consolidated & rebuilt into a controlled analytical layer using SQL. This project focuses on making sure KPIs & profitability analysis are based on structurally sound, logically valid, & statistically controlled data before any BI dashboard is built. All transformations are rule-based & fully traceable. The project covers the full pipeline from raw ingestion to validated output.

  • Total processed: 25,234 transactions
  • Final validated dataset: 24,735 transactions → Interactive Tableau Public dashboard built from this dataset [VIEW HERE]

1. DATASETS

  • Superstore Sales: The Data Quality Challenge (Kaggle) as the primary transactional source
  • List of U.S. Cities (Britannica ─ scraped & cleaned using Google Sheets) for city & postal code validation

2. DATA QUALITY METRICS

  • 25,234 rows consolidated from two sources
  • 40 exact duplicate transactions removed
  • 459 logically invalid records eliminated
  • 1,483 product identity conflicts resolved (canonical mapping)
  • 14,988+ category values standardized
  • 1,022 statistical outliers flagged (P01/P99 method)
  • 100% of changes logged through audit_log tables with triggers back up

3. TECHNICAL APPROACH & TOOLS

This project built using SQL (SQLite) in Visual Studio Code, with dbdiagram.io for ERD design & Google Sheets for supporting data cleanup. The process covers:

  • Data cleaning & normalization
  • Rule-based transaction validation
  • Percentile outlier detection (P01/P99)
  • Audit logging with trigger automation
  • Star-schema modeling
  • CTEs, window functions, joins, & aggregation
  • Conditional feature engineering
  • BI-ready view creation

DATA VALIDATION & MODELING PROCESS

1. STRUCTURAL CLEANING & NORMALIZATION

The raw dataset had fragmented categories, inconsistent naming formats, missing values, & duplicate rows. Main corrections:

  • 1,766 blank discount values standardized
  • 2,457 invalid region entries corrected
  • 14,988+ category values unified into proper domains (South, West, Central, East)
  • 860 inconsistent product name formats standardized
  • 63 postal code–city mismatches reconciled
  • 40 duplicate rows removed

2. ENTITY INTEGRITY CONTROLS

The dataset contained identity conflicts that would break aggregation & dimensional modeling. Canonical mapping was implemented to ensure consistent product representation.

  • 835 cases where one product_name mapped to multiple product_id
  • 648 cases where one canonical_id mapped to multiple product_name

Total entity conflicts resolved: 1,483

3. BUSINESS RULE VALIDATION

Transaction-level validation was applied to align with real business logic.

  • 352 cases of profit < -sales
  • 7 cases of profit > sales
  • 100 invalid sales records

Total logically invalid rows removed: 459

Additional checks performed to prevents incorrect KPI & profitability calculations, included:

  • Sales & quantity alignment
  • Discount range validation
  • Date hierarchy consistency
  • Cross-table aggregation checks

4. STATISTICAL CONTROL

Outliers were identified using percentile thresholds (P01/P99) per sub_category. Outliers remain in the dataset but are clearly labeled to protects aggregate metrics while keeping full transactional history.

  • 1,022 rows flagged across unit_price, sales, profit, & margin
  • Outliers were flagged, not deleted

5. DATA MODELING

The ERD was designed after cleaning & validation were completed. The final structure follows a star-schema approach & supports BI reporting (see figure). Fact & dimension tables were built from the validated dataset to ensure:

  • Clean transactional grain
  • Standardized dimensions
  • Consistent product identifiers
  • No duplicate or logically invalid records
image

6. AUDIT & CHANGE GOVERNANCE

Every step of cleaning, validation, & transformation is tracked for full traceability:

  • Actions are recorded in audit_log tables & BACKUP_audit_log
  • Triggers automatically capture updates, deletions, & anomalies into BACKUP_audit_log tables
  • Each change is linked to a defined rule for full accountability

7. ANALYTICAL LAYER

The final analytical layer provides a BI-ready dataset:

  • Primary view: v_orders_analysis_ready integrates all cleaned, validated, & standardized data
  • Invalid or inconsistent transactions are excluded
  • Aggregation-ready attributes are exposed for KPI & profitability analysis
  • Supporting views (v_kpi_core, v_discount_analysis) allow exploration without touching the core dataset

Releases

No releases published

Packages

 
 
 

Contributors