🧪 GAN-Data-Corruption-Risk-Assessment

"Generative models are powerful, but are they faithful to your original data manifold?"

⚠️ Copyright Notice Copyright (c) 2026 Kang Gyu Min. All rights reserved.

📢 Note: Project Context & Re-evaluation

This repository serves as a critical re-evaluation of the project that won 1st Place at the Kyung Hee University Internal Competition.

Original Premise: The initial work was awarded based on the hypothesis that GAN-based augmentation would successfully enhance data quality for predictive maintenance.
The Discovery: On closer investigation with data integrity metrics, I found that the specific GAN implementation used in the original project actually degrades model performance.
The Goal: This repository documents the deep-dive analysis that uncovered this "hidden corruption." It serves as a warning for practitioners: what appears to be "enhanced" data may actually be statistically compromised.

🚩 Abstract

This project analyzes, quantitatively and qualitatively, how Generative Adversarial Network (GAN)-based data augmentation compromises the underlying data manifold. Beyond comparing model performance, it exposes a critical flaw: the 'Original' baseline data used in the earlier GAN experiment was already corrupted, which shows why data integrity must be verified rigorously before any augmentation.

📉 Manifold Distortion Analysis (t-SNE)

t-SNE visualization shows that GAN-based augmentation distorts the cluster structure of the data and shifts its distribution away from the original. In contrast, Gaussian noise largely preserves the original manifold.

[Figure: t-SNE plots. Panels: Gaussian Augmentation (Preserved), GAN Augmentation (Distorted)]
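A minimal sketch of how such a comparison could be set up, assuming scikit-learn is available. The source does not state which t-SNE implementation or settings were used, so the function name `joint_tsne`, the perplexity value, and the toy data are illustrative assumptions. Original and augmented samples are embedded together because t-SNE layouts from separate runs are not directly comparable.

```python
import numpy as np
from sklearn.manifold import TSNE


def joint_tsne(original, augmented, perplexity=30.0, seed=0):
    """Embed original and augmented samples in one shared 2-D t-SNE space."""
    stacked = np.vstack([original, augmented])
    embedding = TSNE(
        n_components=2, perplexity=perplexity, init="pca", random_state=seed
    ).fit_transform(stacked)
    n_orig = len(original)
    # Split the shared embedding back into its two sources.
    return embedding[:n_orig], embedding[n_orig:]


rng = np.random.default_rng(0)
# Toy data (assumption): two clusters standing in for two machine states.
original = np.vstack([
    rng.normal(loc=0.0, size=(100, 5)),
    rng.normal(loc=6.0, size=(100, 5)),
])
# Light Gaussian noise as a stand-in augmentation.
augmented = original + 0.05 * rng.normal(size=original.shape)

emb_orig, emb_aug = joint_tsne(original, augmented)
print(emb_orig.shape, emb_aug.shape)
```

The two returned arrays can then be scattered in different colors. Augmented points that settle away from every original cluster would be a visual hint of distortion, which the MAE figures later in this README quantify.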

📊 Quantitative Analysis: Distortion Rate (MAE)

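We quantified the distortion rate as the mean absolute error (MAE) between the correlation coefficients of the original data and those of each augmented dataset. GAN augmentation introduces uncontrolled distortion, damaging the correlation structure of the dataset approximately 79.7 times more than Gaussian noise (dividing the rounded values in the table gives about 77.6, so the reported ratio presumably reflects unrounded MAEs).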

Technique	Distortion Rate (MAE)	Note
Gaussian Noise	0.0011	Excellent preservation
GAN Augmentation	0.0854	79.7x higher distortion

🚨 Data Integrity Audit: The "Fake Original" Problem

The most critical finding of this analysis is that the 'Original' dataset used in the GAN experimental group deviates significantly from the true 'Ground Truth' distribution.

Audit Target	Distortion Rate (MAE) vs. Ground Truth
GAN 'Fake' Original (Baseline)	0.0107
GAN Augmented (Output)	0.0889

Conclusion: Before even considering the limitations of GANs, the experimental setup was inherently flawed because the starting point (the 'Original' dataset) was already corrupted, with a correlation error averaging 0.0107.
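The audit idea can be sketched as below: every dataset, including the supposed "original", is scored against one trusted reference. This is an illustrative sketch on synthetic data, not the project's actual audit script. The names `pairwise_corr`, `audit_against_ground_truth`, `drifted_original`, and `augmented` are hypothetical, and the sketch reads the strict lower triangle directly with `np.tril_indices` instead of masking with NaN.

```python
import numpy as np


def pairwise_corr(data):
    """Return each feature pair's Pearson correlation exactly once."""
    corr = np.corrcoef(data, rowvar=False)
    rows, cols = np.tril_indices(corr.shape[0], k=-1)  # strict lower triangle
    return corr[rows, cols]


def audit_against_ground_truth(ground_truth, candidates):
    """MAE of correlation coefficients for each candidate vs. the ground truth."""
    reference = pairwise_corr(ground_truth)
    return {
        name: float(np.mean(np.abs(pairwise_corr(data) - reference)))
        for name, data in candidates.items()
    }


rng = np.random.default_rng(42)
shared = rng.normal(size=(5000, 1))
# Synthetic ground truth (assumption): four features driven by one shared signal.
ground_truth = shared + 0.5 * rng.normal(size=(5000, 4))

# Hypothetical "original" that was altered before the experiment started.
drifted_original = ground_truth.copy()
drifted_original[:, 0] += 0.3 * rng.normal(size=5000)

# Hypothetical augmented output built on top of the drifted baseline.
augmented = drifted_original.copy()
augmented[:, 1] = rng.permutation(augmented[:, 1])

report = audit_against_ground_truth(
    ground_truth,
    {
        "ground_truth_copy": ground_truth.copy(),
        "drifted_original": drifted_original,
        "augmented": augmented,
    },
)
for name, mae in report.items():
    print(f"{name}: {mae:.4f}")
```

A non-zero score for the baseline itself is the signal to stop and repair the data before attributing any further drift to the augmentation method.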

🔍 Detailed Correlation Heatmap Analysis

We used correlation heatmaps to check how the dependencies between variables changed. (The heatmaps are displayed individually to allow for precise verification of correlation coefficients.)

1. Original vs. Gaussian Noise (Preserved)

Gaussian noise keeps the inherent correlation structure of the data essentially intact.
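A short sketch of why this holds for small, independent noise. The noise level used in the project is not stated, so `noise_ratio=0.05` and the helper `gaussian_augment` are assumptions for illustration. Independent noise leaves covariances unchanged and inflates each variance by a factor of `1 + noise_ratio**2`, so every correlation shrinks by that same small factor.

```python
import numpy as np


def gaussian_augment(data, noise_ratio=0.05, seed=0):
    """Add independent zero-mean Gaussian noise scaled per feature.

    noise_ratio is the noise standard deviation as a fraction of each
    feature's own standard deviation (an assumed parameterisation).
    """
    rng = np.random.default_rng(seed)
    sigma = noise_ratio * data.std(axis=0, keepdims=True)
    return data + rng.normal(size=data.shape) * sigma


rng = np.random.default_rng(1)
shared = rng.normal(size=(20000, 1))
# Synthetic correlated features (assumption), not the turbine data.
data = shared + 0.5 * rng.normal(size=(20000, 3))

noisy = gaussian_augment(data, noise_ratio=0.05)

corr_before = np.corrcoef(data, rowvar=False)[0, 1]
corr_after = np.corrcoef(noisy, rowvar=False)[0, 1]

# Independent noise inflates each variance by (1 + noise_ratio**2) and leaves
# the covariance unchanged, so each correlation shrinks by that factor.
expected_after = corr_before / (1 + 0.05 ** 2)
print(corr_before, corr_after, expected_after)
```

With a 5% noise ratio the expected shrinkage is about 0.25% of each coefficient. Larger noise ratios weaken correlations more, so the noise level is itself a parameter worth auditing.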

2. Original vs. GAN Data (Distorted)

GAN augmentation severely distorts the original correlation distribution, introducing noise and reducing the predictive reliability of diagnostic models.

⚙️ Methodology & Domain Application

This analysis uses vibration data from wind turbine drive trains.

Turbine Drive Train & Vibration Data Analysis

[Figure: Panels: Drive Train Configuration, Vibration Amplitude Comparison]

🔍 Data Integrity Verification Logic

Correlation Matrix Cleansing: Used np.triu_indices to mask the redundant upper triangle of the symmetric correlation matrix together with the diagonal (always 1.0), so that each variable pair is compared exactly once.
Distortion Calculation: Calculated the Mean Absolute Error (MAE) between the masked original and augmented correlation matrices via np.nanmean(np.abs(orig - aug)) to quantify the damage.
Control Group Anchoring: For each technique (Gaussian and GAN), compared both its original and its augmented data against the corresponding Ground Truth to measure the exact level of corruption.
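The steps above can be sketched as follows. This is a reconstruction from the description, not the repository's actual code: the function name `correlation_mae` and the synthetic data are assumptions, while `np.triu_indices` and `np.nanmean(np.abs(orig - aug))` are the calls named in the text.

```python
import numpy as np


def correlation_mae(orig_data, aug_data):
    """Mean absolute difference between two correlation matrices.

    Both inputs are 2-D arrays shaped (n_samples, n_features).
    """
    # Pearson correlation between feature columns.
    orig_corr = np.corrcoef(orig_data, rowvar=False)
    aug_corr = np.corrcoef(aug_data, rowvar=False)

    # Mask the upper triangle and the diagonal (k=0 includes the diagonal),
    # leaving each feature pair exactly once in the lower triangle.
    upper = np.triu_indices(orig_corr.shape[0], k=0)
    orig_corr[upper] = np.nan
    aug_corr[upper] = np.nan

    # nanmean skips the masked cells.
    return float(np.nanmean(np.abs(orig_corr - aug_corr)))


rng = np.random.default_rng(0)
# Toy stand-in for vibration features (assumption): three correlated columns.
base = rng.normal(size=(2000, 1))
orig = np.hstack([base + 0.5 * rng.normal(size=(2000, 1)) for _ in range(3)])

identical = correlation_mae(orig, orig.copy())

shuffled = orig.copy()
shuffled[:, 0] = rng.permutation(shuffled[:, 0])  # break column 0's dependencies
broken = correlation_mae(orig, shuffled)

print(identical, broken)
```

An unchanged copy scores exactly zero, while shuffling a single column (which destroys its relationship to the others without changing its marginal distribution) produces a large score. That makes the metric sensitive to exactly the kind of structural damage that per-feature statistics would miss.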

🛡️ Key Takeaways for Practitioners

Data Integrity First: Before deploying any augmentation model, rigorous data hygiene (integrity verification) must be your top priority.
Quantify the Bias: Do not rely solely on accuracy metrics. Calculate the distortion rate (MAE) against your control group to understand how much the data structure has drifted.
Simple is Best: For critical diagnostic systems where data distribution consistency is paramount, well-understood statistical techniques like Gaussian noise are often safer and more reliable than complex generative models.

📂 Dataset Citation

This analysis uses the following dataset:

Dataset: Wind Turbine Gearbox CM Vibration
Source: Kaggle - Wind Turbine Gearbox CM Vibration

Created by [Yani-Studio] | Data Integrity Research Team
