🧪 GAN-Data-Corruption-Risk-Assessment


"Generative models are powerful, but are they faithful to your original data manifold?"


⚠️ Copyright Notice: Copyright (c) 2026 Kang Gyu Min. All rights reserved.


📢 Note: Project Context & Re-evaluation

This repository serves as a critical re-evaluation of the project that won 1st Place at the Kyung Hee University Internal Competition.

  • Original Premise: The initial work was awarded based on the hypothesis that GAN-based augmentation would successfully enhance data quality for predictive maintenance.
  • The Discovery: Upon further, rigorous investigation using data integrity metrics, I discovered that the specific GAN implementation used in the original project actually degrades model performance.
  • The Goal: This repository documents the deep-dive analysis that uncovered this "hidden corruption." It serves as a warning for practitioners: what appears to be "enhanced" data may actually be statistically compromised.

🚩 Abstract

This project quantitatively and qualitatively analyzes how Generative Adversarial Network (GAN)-based data augmentation compromises the underlying data manifold. Beyond comparing model performance, the study exposes a critical flaw: the 'Original' baseline data used in the original GAN experiment was already corrupted. This underscores the need for rigorous data integrity verification before any augmentation is applied.


📉 Manifold Distortion Analysis (t-SNE)

t-SNE visualization reveals that GAN-based augmentation distorts the clustering structure of the data and compromises the distribution. In contrast, Gaussian noise preserves the original manifold effectively.

[Figures: t-SNE projections for Gaussian Augmentation (Preserved) and GAN Augmentation (Distorted)]
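
For readers who want to reproduce the Gaussian side of this comparison, here is a minimal sketch of additive Gaussian noise augmentation. The function name `gaussian_augment` and the `noise_fraction` parameter (noise standard deviation as a fraction of each column's standard deviation) are assumptions made for illustration; the noise level actually used in this project is not stated here.

```python
import numpy as np

def gaussian_augment(data, noise_fraction=0.01, seed=None):
    """Return a copy of `data` with zero-mean Gaussian noise added.

    `data` is a 2-D array of shape (samples, features). The noise scale
    is set per feature to noise_fraction * (that feature's std), which
    is a hypothetical choice, not the project's documented setting.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    sigma = noise_fraction * data.std(axis=0, keepdims=True)
    return data + rng.normal(size=data.shape) * sigma

# Usage on synthetic data (stand-in for real vibration features).
features = np.random.default_rng(1).normal(size=(200, 3))
augmented = gaussian_augment(features, noise_fraction=0.01, seed=0)
```

Because the noise is small and independent of the signal, pairwise correlations between features change only slightly, which is the property the t-SNE comparison above illustrates.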

📊 Quantitative Analysis: Distortion Rate (MAE)

We quantified the distortion rate as the mean absolute error (MAE) between the correlation matrix of the original data and that of the augmented data, for both Gaussian noise and GAN augmentation. GAN augmentation introduces uncontrolled distortion, damaging the correlation structure of the dataset approximately 79.7 times more than Gaussian noise.

| Technique | Distortion Rate (MAE) | Note |
| --- | --- | --- |
| Gaussian Noise | 0.0011 | Excellent preservation |
| GAN Augmentation | 0.0854 | 79.7x higher distortion |
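
The distortion rate can be written out explicitly. The following is a sketch under stated assumptions: the symbols $p$ (number of variables), $R^{\text{orig}}$ and $R^{\text{aug}}$ (correlation matrices of the original and augmented data) are introduced here for illustration, and the formula assumes no coefficient is NaN. Each variable pair is counted once and the diagonal is excluded.

```latex
\mathrm{MAE}
  = \frac{2}{p(p-1)} \sum_{i=2}^{p} \sum_{j=1}^{i-1}
    \left| R^{\text{orig}}_{ij} - R^{\text{aug}}_{ij} \right|
```

Dividing the rounded table values gives 0.0854 / 0.0011 ≈ 77.6, so the reported 79.7x ratio presumably comes from the unrounded MAE values.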

🚨 Data Integrity Audit: The "Fake Original" Problem

The most critical finding of this analysis is that the 'Original' dataset used in the GAN experimental group deviates significantly from the true 'Ground Truth' distribution.

| Audit Target | Distortion Rate (MAE) vs. Ground Truth |
| --- | --- |
| GAN 'Fake' Original (Baseline) | 0.0107 |
| GAN Augmented (Output) | 0.0889 |

Conclusion: Before even considering the limitations of GANs, the experimental setup was inherently flawed because the starting point (the 'Original' dataset) was already corrupted, with a correlation error averaging 0.0107.
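
An audit of this kind can be sketched as follows: every candidate dataset, including the supposed 'Original' baseline, is scored against one trusted ground-truth dataset. The function names `corr_mae` and `audit_against_ground_truth` are hypothetical, the data is synthetic, and this sketch selects the strictly lower triangle with `np.tril_indices` rather than reproducing the project's exact masking code.

```python
import numpy as np

def corr_mae(reference, candidate):
    """MAE between pairwise correlation coefficients of two datasets.

    Both inputs are 2-D arrays of shape (samples, features) with the
    same feature columns. Only the strictly lower triangle is compared,
    so each variable pair counts once and the diagonal is skipped.
    """
    ref_corr = np.corrcoef(reference, rowvar=False)
    cand_corr = np.corrcoef(candidate, rowvar=False)
    rows, cols = np.tril_indices(ref_corr.shape[0], k=-1)
    return float(np.nanmean(np.abs(ref_corr[rows, cols] - cand_corr[rows, cols])))

def audit_against_ground_truth(ground_truth, candidates):
    """Score each named candidate dataset against the ground truth."""
    return {name: corr_mae(ground_truth, data) for name, data in candidates.items()}

# Synthetic stand-ins: a correlated ground truth and a corrupted baseline.
rng = np.random.default_rng(0)
ground_truth = rng.normal(size=(500, 4))
ground_truth[:, 1] += 0.8 * ground_truth[:, 0]
corrupted_baseline = ground_truth + rng.normal(scale=0.5, size=ground_truth.shape)

report = audit_against_ground_truth(
    ground_truth,
    {"baseline": corrupted_baseline, "ground_truth_copy": ground_truth.copy()},
)
```

A baseline that scores noticeably above zero in such a report is a sign that the starting data should be checked before any augmentation result is interpreted.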


🔍 Detailed Correlation Heatmap Analysis

We validated how variable dependencies changed through correlation heatmaps. (The heatmaps are displayed individually to allow for precise verification of correlation coefficients.)

1. Original vs. Gaussian Noise (Preserved)

[Figure: Gaussian Heatmap]

Gaussian noise stably maintains the inherent correlation structure of the data.

2. Original vs. GAN Data (Distorted)

[Figure: GAN Heatmap]

GAN augmentation severely distorts the original correlation distribution, introducing noise and reducing the predictive reliability of diagnostic models.


⚙️ Methodology & Domain Application

This analysis was performed based on vibration data from wind turbine drive trains.

Turbine Drive Train & Vibration Data Analysis

[Figures: Drive Train Configuration; Vibration Amplitude Comparison]
🔍 Data Integrity Verification Logic
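  1. Correlation Matrix Cleansing: Used np.triu_indices to mask the upper triangle of each correlation matrix, which duplicates the lower triangle by symmetry, along with the diagonal (always 1.0), isolating the region where each variable pair appears exactly once.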
  2. Distortion Calculation: Calculated Mean Absolute Error (MAE) via np.nanmean(np.abs(orig - aug)) to quantify the damage.
  3. Control Group Anchoring: Compared both the original and augmented data for every technique (Gaussian vs. GAN) against their unique Ground Truth to measure the exact level of corruption.
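
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the project's actual code: the function names `masked_corr` and `distortion_rate` are hypothetical, and the data is synthetic. It follows the described approach of masking the upper triangle and diagonal with `np.triu_indices`, then taking `np.nanmean` of the absolute difference.

```python
import numpy as np

def masked_corr(data):
    """Correlation matrix with the upper triangle and diagonal set to NaN.

    `data` is a 2-D array of shape (samples, features).
    """
    corr = np.corrcoef(data, rowvar=False)  # columns are variables
    upper = np.triu_indices(corr.shape[0], k=0)  # k=0 includes the diagonal
    corr[upper] = np.nan
    return corr

def distortion_rate(orig, aug):
    """MAE between the masked correlation matrices of two datasets."""
    return float(np.nanmean(np.abs(masked_corr(orig) - masked_corr(aug))))

# Synthetic demonstration with one strongly correlated pair of columns.
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 4))
base[:, 1] += 0.8 * base[:, 0]

# Mild additive noise keeps the correlation structure nearly intact.
noisy = base + rng.normal(scale=0.01, size=base.shape)

# Shuffling each column independently destroys cross-column correlation.
shuffled = np.column_stack([rng.permutation(base[:, j]) for j in range(base.shape[1])])

mild = distortion_rate(base, noisy)
severe = distortion_rate(base, shuffled)
```

Using `np.nanmean` means the masked entries are ignored automatically, so the result is the average over the lower-triangle coefficients only.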

🛡️ Key Takeaways for Practitioners

  1. Data Integrity First: Before deploying any augmentation model, rigorous data hygiene (integrity verification) must be your top priority.
  2. Quantify the Bias: Do not rely solely on accuracy metrics. Calculate the distortion rate (MAE) against your control group to understand how much the data structure has drifted.
  3. Simple is Best: For critical diagnostic systems where data distribution consistency is paramount, well-understood statistical techniques like Gaussian noise are often safer and more reliable than complex generative models.

📂 Dataset Citation

This analysis is based on the dataset from the following source:

Dataset: Wind Turbine Gearbox CM Vibration
Source: Kaggle - Wind Turbine Gearbox CM Vibration


Created by [Yani-Studio] | Data Integrity Research Team
