Hybrid Generative Adversarial Neural Network and Copula Data Synthesizer
Model Demo: https://drive.google.com/file/d/1fd-9cMRGWv_yN45iL_XvwTw8XLJSGH6C/view?usp=drive_link
A Streamlit application for uploading a real-world tabular dataset, preprocessing it automatically, and generating synthetic datasets with Gaussian Copula and CTGAN (GAN-based) synthesis models.
This hybrid framework pairs a statistical generative model with a deep-learning one, combining the strengths of:
- Gaussian Copula models for learning statistical dependencies,
- Generative Adversarial Networks (CTGAN) for capturing complex, non-linear relationships.
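The Gaussian Copula idea from the list above can be illustrated with a minimal, standard-library-only sketch for two numeric columns. This is not the SDV implementation the app relies on; the helper names (`normal_scores`, `fit_sample_copula`) and the toy data are assumptions made purely for illustration.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def normal_scores(values):
    """Map each value to a standard-normal score based on its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        # rank / (n + 1) keeps the probability strictly inside (0, 1)
        scores[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores

def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def empirical_quantile(sorted_values, u):
    """Return the observed value at probability level u (nearest rank)."""
    idx = min(len(sorted_values) - 1, int(u * len(sorted_values)))
    return sorted_values[idx]

def fit_sample_copula(x, y, n_samples, seed=0):
    """Learn the dependence between x and y, then sample new (x, y) pairs."""
    rng = random.Random(seed)
    # Dependence is measured on the normal scores, not on the raw values.
    rho = pearson(normal_scores(x), normal_scores(y))
    xs, ys = sorted(x), sorted(y)
    samples = []
    for _ in range(n_samples):
        # Draw a correlated pair of standard normals.
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + math.sqrt(max(0.0, 1 - rho ** 2)) * rng.gauss(0, 1)
        # Map each normal back to the column's own empirical distribution.
        samples.append((empirical_quantile(xs, STD_NORMAL.cdf(z1)),
                        empirical_quantile(ys, STD_NORMAL.cdf(z2))))
    return rho, samples

# Toy data (made up): y depends on x plus noise.
data_rng = random.Random(1)
x = [data_rng.gauss(50, 10) for _ in range(300)]
y = [2 * v + data_rng.gauss(0, 5) for v in x]
rho, synth = fit_sample_copula(x, y, n_samples=500)
```

Each column keeps its own marginal distribution, while a single correlation parameter carries the dependence between them. That is why this family of models is fast to fit but limited to dependencies a Gaussian correlation can express.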
The interface is built with Streamlit, providing an intuitive and interactive web-based environment for:
- Uploading datasets,
- Configuring synthesis parameters,
- Training models,
- Evaluating generated data quality,
- Downloading synthetic datasets.
Choose between two synthesis models:
- Gaussian Copula: Fast, statistically driven model for smaller datasets.
- Generative Adversarial Network (GAN/CTGAN): Deep-learning approach for complex, large, or high-dimensional data.
Before training, the app preprocesses the uploaded data automatically:
- Detects column types.
- Cleans missing values, encodes categories, and parses datetimes.
- Handles high-cardinality columns with a configurable Top-K mapping.
- Converts incompatible types to SDV-supported formats.
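As an illustration of the Top-K mapping step, here is a minimal sketch that keeps the `k` most frequent categories and folds the rest into a single placeholder. The app's actual logic is not shown in this README; the function name `top_k_map` and the `__OTHER__` label are hypothetical.

```python
from collections import Counter

def top_k_map(values, k, other_label="__OTHER__"):
    """Keep the k most frequent categories; replace the rest with other_label."""
    counts = Counter(values)
    keep = {category for category, _ in counts.most_common(k)}
    return [v if v in keep else other_label for v in values]

# Made-up example column with five distinct categories.
cities = ["NYC", "LA", "NYC", "SF", "LA", "NYC", "Austin", "Boise"]
mapped = top_k_map(cities, k=2)
print(mapped)
```

Reducing rare categories this way keeps the number of distinct values a model has to learn bounded, at the cost of losing detail in the tail.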
After generating synthetic data, the app computes:
- Kolmogorov–Smirnov (KS) Statistic: For numeric distribution similarity.
- Jensen–Shannon (JS) Divergence: For categorical distributions.
- Correlation Matrix MSE: To assess structure preservation.
- ML Utility (AUC): Tests predictive similarity between real and synthetic data.
- Privacy Metrics (Memorization Check): Checks whether the synthetic data reproduces real rows.
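To make some of the metrics above concrete, the sketch below implements textbook versions of the KS statistic, the JS divergence, and a simple exact-match memorization rate using only the standard library. How the app computes them is not specified here, so the function names and the exact-match definition of memorization are assumptions.

```python
import math
from bisect import bisect_right
from collections import Counter

def ks_statistic(real, synth):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    r, s = sorted(real), sorted(synth)
    gap = 0.0
    for v in set(r) | set(s):
        cdf_r = bisect_right(r, v) / len(r)
        cdf_s = bisect_right(s, v) / len(s)
        gap = max(gap, abs(cdf_r - cdf_s))
    return gap

def js_divergence(real, synth):
    """Jensen-Shannon divergence in base 2, so the result lies in [0, 1]."""
    p, q = Counter(real), Counter(synth)
    n_p, n_q = len(real), len(synth)
    total = 0.0
    for category in set(p) | set(q):
        pi, qi = p[category] / n_p, q[category] / n_q
        mi = 0.5 * (pi + qi)  # mixture distribution
        if pi > 0:
            total += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            total += 0.5 * qi * math.log2(qi / mi)
    return total

def exact_match_rate(real_rows, synth_rows):
    """Share of synthetic rows that are exact copies of a real row."""
    real_set = set(real_rows)
    return sum(row in real_set for row in synth_rows) / len(synth_rows)

# Made-up inputs covering the extreme cases.
ks_same = ks_statistic([1, 2, 3, 4], [1, 2, 3, 4])
ks_disjoint = ks_statistic([1, 2, 3, 4], [5, 6, 7, 8])
js_same = js_divergence(["a", "b"], ["a", "b"])
js_disjoint = js_divergence(["a"], ["b"])
copied = exact_match_rate([(1, "a"), (2, "b")], [(1, "a"), (3, "c")])
```

For both KS and JS, 0 means the real and synthetic distributions match and 1 means they do not overlap at all. For the memorization rate, lower is better.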
The generated data is then available for export:
- Download the synthetic dataset as a CSV file.
- Columns and types are aligned with your original dataset.
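One way to keep exported columns in the original order is sketched below with the standard `csv` module. The helper name `rows_to_csv` and the sample rows are hypothetical, and the app's own export code may differ.

```python
import csv
import io

def rows_to_csv(rows, columns):
    """Serialize dict rows to CSV text, using `columns` as the header order."""
    buffer = io.StringIO()
    # extrasaction="ignore" drops keys that are not in the original schema.
    writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Made-up synthetic rows whose keys arrive in a different order.
original_columns = ["age", "city"]
synthetic_rows = [{"city": "LA", "age": 31}, {"city": "NYC", "age": 45}]
csv_text = rows_to_csv(synthetic_rows, original_columns)
print(csv_text)
```

In a Streamlit app, text like `csv_text` could be passed as the `data` argument of `st.download_button` to offer the file for download.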
| Component | Technology |
|---|---|
| Frontend | Streamlit |
| Backend / Model Engine | SDV (Synthetic Data Vault) |
| Models Used | GaussianCopula, CTGAN |
| Language | Python 3.8+ |
| Libraries | pandas, numpy, sdv, scipy, scikit-learn |
| Interface | Streamlit UI with custom CSS |