Skip to content

SandyGCabanes/Synthetic-Dataset-Generation-From-Surveys-V1

Repository files navigation

Synthetic Data Generation & Validation Workflow

Objective

Create reliable synthetic survey data that protects privacy while maintaining analytical value. Focus on parameter tuning with privacy-utility tradeoffs and rigorous validation of complex multi-select variables essential for accurate dashboards.

Action from Results

Leverage tuning reports and interactive dashboards to choose synthesis settings offering the best balance between disclosure risk and data quality. Detect and correct multi-select response inconsistencies to ensure dependable insights and stakeholder confidence.

Background

Synthetic data lets analysts explore sensitive datasets without exposing real respondents. Multi-select survey responses are especially tricky to replicate, risking inaccurate representation in summaries and visualizations. This workflow combines automated tuning with targeted validation to safeguard both privacy and data fidelity.

Process

  1. Data Preparation: Load raw data, generate uniqueness flags capturing key identifier combinations, and assemble predictor matrices excluding direct IDs.
  2. Automated Parameter Tuning: Run multiple syntheses varying CART parameters, measure privacy risk (replicated uniques), and data utility (KS tests), then find optimal parameters based on a weighted score.
  3. Summary Generation: Export tuning results summary highlighting best parameters and associated risk-utility metrics.
  4. Final Synthesis & Profiling: Generate synthetic data with optimal parameters; create profiling reports to compare real and synthetic data comprehensively.
  5. Multi-select Validation: Explode and compare frequencies of multi-response columns to detect discrepancies; visualize with interactive Plotly dashboards saved as HTML.
  6. Post-processing (Optional): Manually adjust synthetic multi-select responses to align distributions more closely with original data for dashboard accuracy.

Expected Output

  • Text file summarizing parameter tuning and optimal synthesis settings.
  • High-quality synthetic dataset respecting privacy and supporting analysis.
  • DataExplorer HTML reports for holistic data profile comparison.
  • Folder of interactive HTML dashboards comparing multi-select variable frequencies.
  • Tools and insights allowing ongoing tuning and refinement of synthetic data quality.

Designed for analysts who value transparency, reproducibility, and practical insights—this workflow empowers you to produce synthetic data you can trust.

About

Synthetic data generation from surveys using CART in R synthpop package

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors