DATA VALIDATION & MODELING PROCESS
- 1. Structural Cleaning & Normalization
- 2. Entity Integrity Controls
- 3. Business Rule Validation
- 4. Statistical Control
- 5. Data Modeling
- 6. Audit & Change Governance
- 7. Analytical Layer
Two quarterly sales datasets were consolidated & rebuilt into a controlled analytical layer using SQL. The goal is to ensure that KPIs & profitability analysis are based on structurally sound, logically valid, & statistically controlled data before any BI dashboard is built. All transformations are rule-based & fully traceable, covering the full pipeline from raw ingestion to validated output.
- Total processed: 25,234 transactions
- Final validated dataset: 24,735 transactions → Interactive Tableau Public dashboard built from this dataset [VIEW HERE]
- Superstore Sales: The Data Quality Challenge (Kaggle) as the primary transactional source
- List of U.S. Cities (Britannica, scraped & cleaned using Google Sheets) for city & postal code validation
- 25,234 rows consolidated from two sources
- 40 exact duplicate transactions removed
- 459 logically invalid records eliminated
- 1,483 product identity conflicts resolved (canonical mapping)
- 14,988+ categorical values standardized
- 1,022 statistical outliers flagged (P01/P99 method)
- 100% of changes logged through audit_log tables, backed up automatically by triggers
This project was built using SQL (SQLite) in Visual Studio Code, with dbdiagram.io for ERD design & Google Sheets for supporting data cleanup. The process covers:
- Data cleaning & normalization
- Rule-based transaction validation
- Percentile outlier detection (P01/P99)
- Audit logging with trigger automation
- Star-schema modeling
- CTEs, window functions, joins, & aggregation
- Conditional feature engineering
- BI-ready view creation
The raw dataset had fragmented categories, inconsistent naming formats, missing values, & duplicate rows. Main corrections:
- 1,766 blank discount values standardized
- 2,457 invalid region entries corrected
- 14,988+ categorical values unified into their proper domains (e.g., region: South, West, Central, East)
- 860 inconsistent product name formats standardized
- 63 postal code–city mismatches reconciled
- 40 duplicate rows removed
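The corrections above can be sketched in SQLite (run here through Python's stdlib `sqlite3`). The table, column names, & standardization rules below are illustrative, not the project's actual schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders_raw (order_id TEXT, region TEXT, discount REAL);
INSERT INTO orders_raw VALUES
  ('O-1', 'south', 0.2),
  ('O-1', 'south', 0.2),          -- exact duplicate
  ('O-2', 'WEST ', NULL),         -- inconsistent casing/whitespace, blank discount
  ('O-3', 'Central', 0.0);
""")

con.executescript("""
-- 1. Unify the region domain (trim, then map case variants to canonical values).
UPDATE orders_raw SET region = TRIM(region);
UPDATE orders_raw SET region =
  CASE LOWER(region)
    WHEN 'south'   THEN 'South'
    WHEN 'west'    THEN 'West'
    WHEN 'central' THEN 'Central'
    WHEN 'east'    THEN 'East'
    ELSE region
  END;
-- 2. Standardize blank discounts to 0 (one plausible rule, assumed here).
UPDATE orders_raw SET discount = 0 WHERE discount IS NULL;
-- 3. Remove exact duplicates, keeping the lowest rowid per identical group.
DELETE FROM orders_raw
WHERE rowid NOT IN (
  SELECT MIN(rowid) FROM orders_raw
  GROUP BY order_id, region, discount
);
""")

rows = con.execute(
    "SELECT order_id, region, discount FROM orders_raw ORDER BY order_id"
).fetchall()
print(rows)  # [('O-1', 'South', 0.2), ('O-2', 'West', 0.0), ('O-3', 'Central', 0.0)]
```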
The dataset contained identity conflicts that would break aggregation & dimensional modeling. Canonical mapping was implemented to ensure consistent product representation.
- 835 cases where one product_name mapped to multiple product_id
- 648 cases where one canonical_id mapped to multiple product_name
Total entity conflicts resolved: 1,483
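A minimal sketch of canonical mapping, assuming a most-frequent-id rule with ties broken by id; the project's actual canonicalization rule & column names may differ:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (product_id TEXT, product_name TEXT);
INSERT INTO orders VALUES
  ('P-100', 'Stapler'),
  ('P-100', 'Stapler'),
  ('P-101', 'Stapler'),     -- same name, conflicting id
  ('P-200', 'Desk Lamp');
""")

con.executescript("""
-- Build a canonical map: for each product_name, keep its most frequent
-- product_id (assumed rule; ties broken by lowest id).
CREATE TABLE canonical_map AS
SELECT product_name,
       (SELECT product_id FROM orders o2
        WHERE o2.product_name = o1.product_name
        GROUP BY product_id
        ORDER BY COUNT(*) DESC, product_id
        LIMIT 1) AS canonical_id
FROM orders o1
GROUP BY product_name;
-- Rewrite transactional rows to the canonical id.
UPDATE orders
SET product_id = (SELECT canonical_id FROM canonical_map m
                  WHERE m.product_name = orders.product_name);
""")

rows = con.execute(
    "SELECT DISTINCT product_id, product_name FROM orders ORDER BY product_id"
).fetchall()
print(rows)  # [('P-100', 'Stapler'), ('P-200', 'Desk Lamp')]
```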
Transaction-level validation was applied to align with real business logic.
- 352 cases of profit < -sales
- 7 cases of profit > sales
- 100 invalid sales records
Total logically invalid rows removed: 459
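The removal of logically invalid rows can be expressed as one rule-based DELETE. The first two predicates mirror the rules above; `sales <= 0` is an assumed definition of an invalid sales record:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (order_id TEXT, sales REAL, profit REAL);
INSERT INTO orders VALUES
  ('O-1', 100.0,   20.0),   -- valid
  ('O-2', 100.0, -150.0),   -- profit < -sales: loss exceeds revenue
  ('O-3', 100.0,  120.0),   -- profit > sales: profit exceeds revenue
  ('O-4',  -5.0,    1.0);   -- invalid (non-positive) sales
""")

con.execute("""
DELETE FROM orders
WHERE profit < -sales      -- impossible loss
   OR profit >  sales      -- impossible gain
   OR sales  <= 0          -- invalid sales record (assumed rule)
""")

rows = con.execute("SELECT order_id FROM orders").fetchall()
print(rows)  # [('O-1',)]
```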
Additional checks were performed to prevent incorrect KPI & profitability calculations, including:
- Sales & quantity alignment
- Discount range validation
- Date hierarchy consistency
- Cross-table aggregation checks
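Checks like these can be run as violation queries that label each failing row; the column names & check labels below are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (order_id TEXT, order_date TEXT, ship_date TEXT,
                     discount REAL, quantity INTEGER, sales REAL);
INSERT INTO orders VALUES
  ('O-1', '2024-01-05', '2024-01-08', 0.2, 3, 90.0),
  ('O-2', '2024-02-10', '2024-02-01', 0.1, 2, 50.0),  -- ships before order date
  ('O-3', '2024-03-01', '2024-03-03', 1.5, 1, 10.0);  -- discount out of range
""")

violations = con.execute("""
SELECT order_id,
       CASE
         WHEN ship_date < order_date        THEN 'date_hierarchy'
         WHEN discount NOT BETWEEN 0 AND 1  THEN 'discount_range'
         WHEN quantity <= 0 OR sales <= 0   THEN 'sales_quantity'
       END AS failed_check
FROM orders
WHERE ship_date < order_date
   OR discount NOT BETWEEN 0 AND 1
   OR quantity <= 0 OR sales <= 0
ORDER BY order_id
""").fetchall()
print(violations)  # [('O-2', 'date_hierarchy'), ('O-3', 'discount_range')]
```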
Outliers were identified using percentile thresholds (P01/P99) per sub_category. Outliers remain in the dataset but are clearly labeled, which protects aggregate metrics while preserving the full transactional history.
- 1,022 rows flagged across unit_price, sales, profit, & margin
- Outliers were flagged, not deleted
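SQLite has no built-in percentile function, so a rank-based window-function approximation with CUME_DIST() can stand in for P01/P99 flagging; the column names & flagging rule below are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE orders (sub_category TEXT, profit REAL, is_outlier INTEGER DEFAULT 0)"
)
# 100 evenly spaced profits in one sub_category, so the tails are predictable.
con.executemany("INSERT INTO orders (sub_category, profit) VALUES (?, ?)",
                [("Chairs", float(v)) for v in range(1, 101)])

# Flag rows outside the per-sub_category P01/P99 band. CUME_DIST() gives the
# fraction of rows with profit <= the current row, within each partition.
con.execute("""
UPDATE orders SET is_outlier = 1
WHERE rowid IN (
  SELECT rowid FROM (
    SELECT rowid,
           CUME_DIST() OVER (PARTITION BY sub_category ORDER BY profit) AS cd
    FROM orders
  )
  WHERE cd <= 0.01 OR cd > 0.99
)
""")

flagged = [r[0] for r in con.execute(
    "SELECT profit FROM orders WHERE is_outlier = 1 ORDER BY profit")]
print(flagged)  # [1.0, 100.0]
```

Flagging via UPDATE rather than DELETE keeps every transaction queryable while letting aggregate views filter on `is_outlier`.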
The ERD was designed after cleaning & validation were completed. The final structure follows a star-schema approach & supports BI reporting (see figure). Fact & dimension tables were built from the validated dataset to ensure:
- Clean transactional grain
- Standardized dimensions
- Consistent product identifiers
- No duplicate or logically invalid records
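A sketch of what such star-schema DDL can look like in SQLite; the table & column names are assumptions, not the project's actual ERD:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Dimension tables hold standardized, deduplicated descriptive attributes.
CREATE TABLE dim_product (
  canonical_id TEXT PRIMARY KEY,
  product_name TEXT NOT NULL,
  category     TEXT NOT NULL,
  sub_category TEXT NOT NULL
);
CREATE TABLE dim_customer (
  customer_id   TEXT PRIMARY KEY,
  customer_name TEXT,
  segment       TEXT
);
CREATE TABLE dim_geography (
  geo_id      INTEGER PRIMARY KEY,
  region TEXT, state TEXT, city TEXT, postal_code TEXT
);
-- Fact table at clean transactional grain, referencing the dimensions.
CREATE TABLE fact_orders (
  order_id     TEXT,
  order_date   TEXT,
  canonical_id TEXT REFERENCES dim_product(canonical_id),
  customer_id  TEXT REFERENCES dim_customer(customer_id),
  geo_id       INTEGER REFERENCES dim_geography(geo_id),
  sales REAL, quantity INTEGER, discount REAL, profit REAL
);
""")

tables = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # ['dim_customer', 'dim_geography', 'dim_product', 'fact_orders']
```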
Every step of cleaning, validation, & transformation is tracked for full traceability:
- Actions are recorded in audit_log tables & BACKUP_audit_log
- Triggers automatically capture updates, deletions, & anomalies into BACKUP_audit_log tables
- Each change is linked to a defined rule for full accountability
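A minimal sketch of trigger-based audit capture in SQLite; the table, trigger, & rule names here are illustrative stand-ins for the project's audit_log setup:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (order_id TEXT PRIMARY KEY, profit REAL);
CREATE TABLE BACKUP_audit_log (
  logged_at  TEXT DEFAULT CURRENT_TIMESTAMP,
  action     TEXT,
  order_id   TEXT,
  old_profit REAL,
  new_profit REAL,
  rule       TEXT
);
-- Capture every UPDATE with the rule that drove it (rule names are assumed).
CREATE TRIGGER trg_orders_update AFTER UPDATE ON orders
BEGIN
  INSERT INTO BACKUP_audit_log (action, order_id, old_profit, new_profit, rule)
  VALUES ('UPDATE', OLD.order_id, OLD.profit, NEW.profit, 'manual_correction');
END;
-- Capture every DELETE with the row's last known values.
CREATE TRIGGER trg_orders_delete AFTER DELETE ON orders
BEGIN
  INSERT INTO BACKUP_audit_log (action, order_id, old_profit, rule)
  VALUES ('DELETE', OLD.order_id, OLD.profit, 'business_rule_violation');
END;

INSERT INTO orders VALUES ('O-1', 20.0), ('O-2', -500.0);
UPDATE orders SET profit = 25.0 WHERE order_id = 'O-1';
DELETE FROM orders WHERE order_id = 'O-2';
""")

log = con.execute(
    "SELECT action, order_id, rule FROM BACKUP_audit_log").fetchall()
print(log)  # [('UPDATE', 'O-1', 'manual_correction'), ('DELETE', 'O-2', 'business_rule_violation')]
```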
The final analytical layer provides a BI-ready dataset:
- Primary view: v_orders_analysis_ready integrates all cleaned, validated, & standardized data
- Invalid or inconsistent transactions are excluded
- Aggregation-ready attributes are exposed for KPI & profitability analysis
- Supporting views (v_kpi_core, v_discount_analysis) allow exploration without touching the core dataset
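A sketch of how such layered views can be defined, with assumed flag columns (`is_valid`, `is_outlier`) & illustrative view bodies; only the view names come from the project:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (order_id TEXT, region TEXT, sales REAL, profit REAL,
                     discount REAL, is_outlier INTEGER, is_valid INTEGER);
INSERT INTO orders VALUES
  ('O-1', 'West', 100.0,   20.0, 0.0, 0, 1),
  ('O-2', 'West',  50.0,   10.0, 0.2, 0, 1),
  ('O-3', 'East', 999.0, -999.0, 0.8, 1, 0);  -- invalid row, excluded below

-- Core view: only validated rows, with a derived margin attribute exposed.
CREATE VIEW v_orders_analysis_ready AS
SELECT order_id, region, sales, profit, discount,
       ROUND(profit / sales, 4) AS margin
FROM orders
WHERE is_valid = 1;

-- Supporting KPI view built on the core view, never on the base table,
-- so exploration cannot bypass the validation filter.
CREATE VIEW v_kpi_core AS
SELECT region,
       SUM(sales)  AS total_sales,
       SUM(profit) AS total_profit
FROM v_orders_analysis_ready
GROUP BY region;
""")

kpis = con.execute("SELECT * FROM v_kpi_core").fetchall()
print(kpis)  # [('West', 150.0, 30.0)]
```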