"Generative models are powerful, but are they faithful to your original data manifold?"
⚠️ Copyright Notice: Copyright (c) 2026 Kang Gyu Min. All rights reserved.
This repository serves as a critical re-evaluation of the project that won 1st Place at the Kyung Hee University Internal Competition.
- Original Premise: The initial work was awarded based on the hypothesis that GAN-based augmentation would successfully enhance data quality for predictive maintenance.
- The Discovery: Upon further, rigorous investigation using data integrity metrics, I discovered that the specific GAN implementation used in the original project actually degrades model performance.
- The Goal: This repository documents the deep-dive analysis that uncovered this "hidden corruption." It serves as a warning for practitioners: what appears to be "enhanced" data may actually be statistically compromised.
This project quantitatively and qualitatively analyzes how Generative Adversarial Network (GAN)-based data augmentation compromises the underlying data manifold. Beyond comparing model performance, the study exposes a critical flaw: the 'Original' baseline data used in the earlier GAN experiment was already corrupted, which underscores the need for rigorous data integrity verification before any augmentation.
t-SNE visualization reveals that GAN-based augmentation distorts the clustering structure of the data and compromises the distribution. In contrast, Gaussian noise preserves the original manifold effectively.
| Gaussian Augmentation (Preserved) | GAN Augmentation (Distorted) |
|---|---|
| ![t-SNE plot, Gaussian augmentation]() | ![t-SNE plot, GAN augmentation]() |
We quantified the distortion rate, measured as the mean absolute error (MAE) between the correlation matrices of the original and augmented data, for both Gaussian noise and GAN augmentation. GAN augmentation introduces uncontrolled distortion, damaging the correlation structure of the dataset approximately 79.7 times more than Gaussian noise.
| Technique | Distortion Rate (MAE) | Note |
|---|---|---|
| Gaussian Noise | 0.0011 | Excellent preservation |
| GAN Augmentation | 0.0854 | 79.7x higher distortion |
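For reference, the distortion rate in the table above can be written out as follows. This is a sketch that assumes the metric is the mean absolute difference between Pearson correlation coefficients over each unique feature pair, with no undefined coefficients. The symbols $R^{\text{orig}}$, $R^{\text{aug}}$, and $p$ (number of features) are notation introduced here, not taken from the project code.

$$
\text{MAE} = \frac{2}{p(p-1)} \sum_{i=2}^{p} \sum_{j=1}^{i-1} \left| R^{\text{orig}}_{ij} - R^{\text{aug}}_{ij} \right|
$$

Note that the rounded values in the table give a ratio of 0.0854 / 0.0011 ≈ 77.6, so the reported 79.7x presumably reflects the unrounded MAE values.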
The most critical finding of this analysis is that the 'Original' dataset used in the GAN experimental group deviates significantly from the true 'Ground Truth' distribution.
| Audit Target | Distortion Rate (MAE) vs. Ground Truth |
|---|---|
| GAN 'Fake' Original (Baseline) | 0.0107 |
| GAN Augmented (Output) | 0.0889 |
Conclusion: Before even considering the limitations of GANs, the experimental setup was inherently flawed because the starting point (the 'Original' dataset) was already corrupted, with a correlation error averaging 0.0107.
We validated how variable dependencies changed through correlation heatmaps. (The heatmaps are displayed individually to allow for precise verification of correlation coefficients.)
Gaussian noise stably maintains the inherent correlation structure of the data.
GAN augmentation severely distorts the original correlation distribution, introducing noise and reducing the predictive reliability of diagnostic models.
This analysis was performed based on vibration data from wind turbine drive trains.
| Drive Train Configuration | Vibration Amplitude Comparison |
|---|---|
| ![Drive train configuration]() | ![Vibration amplitude comparison]() |
🔍 Data Integrity Verification Logic
- Correlation Matrix Cleansing: Used `np.triu_indices` to remove the redundant upper triangle of the symmetric correlation matrix together with the diagonal (always 1.0), isolating the pure comparison region.
- Distortion Calculation: Calculated Mean Absolute Error (MAE) via `np.nanmean(np.abs(orig - aug))` to quantify the damage.
- Control Group Anchoring: Compared both the original and augmented data for every technique (Gaussian vs. GAN) against their unique Ground Truth to measure the exact level of corruption.
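
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `correlation_distortion`, the use of `np.corrcoef` (Pearson correlation), and the synthetic stand-in data are all assumptions made for the example.

```python
import numpy as np


def correlation_distortion(reference: np.ndarray, candidate: np.ndarray) -> float:
    """Mean absolute difference between two datasets' correlation matrices.

    Both inputs are (n_samples, n_features) arrays with the same feature order.
    Hypothetical helper; Pearson correlation is an assumption.
    """
    ref_corr = np.corrcoef(reference, rowvar=False)
    cand_corr = np.corrcoef(candidate, rowvar=False)
    diff = np.abs(ref_corr - cand_corr)

    # Blank out the upper triangle and the diagonal (k=0 includes the diagonal),
    # so each feature pair is counted once and the trivial 1.0 entries are ignored.
    upper = np.triu_indices(diff.shape[0], k=0)
    diff[upper] = np.nan

    # nanmean averages only the remaining lower-triangle entries.
    return float(np.nanmean(diff))


# Synthetic stand-in data (not the wind turbine dataset): 500 samples, 4 correlated features.
rng = np.random.default_rng(0)
mixing = rng.normal(size=(4, 4))
ground_truth = rng.normal(size=(500, 4)) @ mixing

baseline = ground_truth.copy()                      # an intact baseline
shuffled = ground_truth.copy()
shuffled[:, 0] = rng.permutation(shuffled[:, 0])    # break one feature's dependencies

# Control group anchoring: audit every dataset against the same ground truth.
audit = {
    "baseline": correlation_distortion(ground_truth, baseline),
    "shuffled": correlation_distortion(ground_truth, shuffled),
}
print(audit)
```

An intact baseline scores exactly zero against its ground truth, so any nonzero value for a supposedly "original" dataset is a signal that the starting point has already drifted.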
- Data Integrity First: Before deploying any augmentation model, rigorous data hygiene (integrity verification) must be your top priority.
- Quantify the Bias: Do not rely solely on accuracy metrics. Calculate the distortion rate (MAE) against your control group to understand how much the data structure has drifted.
- Simple is Best: For critical diagnostic systems where data distribution consistency is paramount, well-understood statistical techniques like Gaussian noise are often safer and more reliable than complex generative models.
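
As an illustration of the "Simple is Best" point, Gaussian noise augmentation can be sketched in a few lines. The function name `gaussian_augment`, the per-feature scaling rule, and the default `noise_ratio` are assumptions for this example; the original project's noise settings are not documented here.

```python
import numpy as np


def gaussian_augment(data: np.ndarray, noise_ratio: float = 0.01, seed: int = 0) -> np.ndarray:
    """Return a noisy copy of `data`, shaped (n_samples, n_features).

    Noise is zero-mean Gaussian, scaled per feature to `noise_ratio` times
    that feature's standard deviation. Both the scaling rule and the default
    ratio are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    scale = noise_ratio * data.std(axis=0, keepdims=True)
    return data + rng.normal(size=data.shape) * scale


# Synthetic stand-in data (not the wind turbine dataset): 1000 samples, 3 correlated features.
rng = np.random.default_rng(1)
signal = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 3))

augmented = gaussian_augment(signal, noise_ratio=0.01)

# Largest change in any pairwise correlation coefficient after augmentation.
drift = np.abs(
    np.corrcoef(signal, rowvar=False) - np.corrcoef(augmented, rowvar=False)
).max()
print(f"max correlation drift: {drift:.6f}")
```

Because the noise is independent of the signal and small relative to each feature's spread, every pairwise correlation shrinks in expectation by a factor of only 1 / (1 + noise_ratio²), which is why this kind of augmentation tends to leave the correlation structure nearly intact.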
This analysis is based on the dataset provided from the following source:
- Dataset: Wind Turbine Gearbox CM Vibration
- Source: Kaggle - Wind Turbine Gearbox CM Vibration
Created by [Yani-Studio] | Data Integrity Research Team



