Synthetic Data Generation & Validation Workflow

Objective

Create reliable synthetic survey data that protects privacy while maintaining analytical value. The focus is on tuning synthesis parameters against the privacy-utility tradeoff and on rigorously validating the complex multi-select variables that accurate dashboards depend on.

Action from Results

Leverage tuning reports and interactive dashboards to choose synthesis settings offering the best balance between disclosure risk and data quality. Detect and correct multi-select response inconsistencies to ensure dependable insights and stakeholder confidence.

Background

Synthetic data lets analysts explore sensitive datasets without exposing real respondents. Multi-select survey responses are especially tricky to replicate, risking inaccurate representation in summaries and visualizations. This workflow combines automated tuning with targeted validation to safeguard both privacy and data fidelity.
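
To make the multi-select problem concrete, here is a minimal Python sketch (an illustration, not the repository's code). It assumes each answer is stored as one semicolon-delimited string; the delimiter, the function names `option_shares` and `share_gaps`, and the toy answers are all hypothetical. It explodes each answer into its options and compares the share of respondents choosing each option in the real and synthetic data.

```python
from collections import Counter


def option_shares(responses, sep=";"):
    """Explode delimited multi-select answers and return, per option,
    the share of respondents who selected it."""
    counts = Counter()
    for answer in responses:
        # A set avoids double-counting an option repeated within one answer.
        options = {part.strip() for part in answer.split(sep) if part.strip()}
        counts.update(options)
    n = len(responses) or 1  # guard against an empty column
    return {option: count / n for option, count in counts.items()}


def share_gaps(real, synth, sep=";"):
    """Synthetic share minus real share for every option seen in either set."""
    real_shares = option_shares(real, sep)
    synth_shares = option_shares(synth, sep)
    options = sorted(set(real_shares) | set(synth_shares))
    return {o: synth_shares.get(o, 0.0) - real_shares.get(o, 0.0) for o in options}


# Toy data (assumed): preferred contact channels, several allowed per respondent.
real = ["email;phone", "email", "phone;post", "email;post"]
synth = ["email", "email;phone", "email;phone", "post"]

gaps = share_gaps(real, synth)
for option, gap in gaps.items():
    print(f"{option}: {gap:+.2f}")
```

In this toy case "email" and "phone" match exactly, while "post" is selected by 25 percentage points fewer synthetic respondents. A gap of that kind is what would distort a dashboard built on the synthetic data.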

Process

Data Preparation: Load raw data, generate uniqueness flags capturing key identifier combinations, and assemble predictor matrices excluding direct IDs.
Automated Parameter Tuning: Run multiple syntheses with varying CART parameters, measure privacy risk (replicated uniques) and data utility (KS tests) for each run, then select the parameters with the best weighted score.
Summary Generation: Export tuning results summary highlighting best parameters and associated risk-utility metrics.
Final Synthesis & Profiling: Generate synthetic data with optimal parameters; create profiling reports to compare real and synthetic data comprehensively.
Multi-select Validation: Explode and compare frequencies of multi-response columns to detect discrepancies; visualize with interactive Plotly dashboards saved as HTML.
Post-processing (Optional): Manually adjust synthetic multi-select responses to align distributions more closely with original data for dashboard accuracy.
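
The scoring logic behind the Automated Parameter Tuning step can be sketched as follows. This is a hedged Python illustration under stated assumptions, not the repository's implementation: the replicated-uniques measure is a simplified proxy, the weight of 0.7 on risk is an arbitrary choice, and the run labels (`minbucket=5`, `minbucket=1`) and toy records are hypothetical stand-ins for real synthesis output.

```python
from bisect import bisect_right
from collections import Counter


def replicated_unique_rate(real_keys, synth_keys):
    """Share of synthetic rows whose key combination is unique in both the
    synthetic and the real data (a simplified disclosure-risk proxy)."""
    real_counts = Counter(real_keys)
    synth_counts = Counter(synth_keys)
    hits = sum(
        1 for key, n in synth_counts.items()
        if n == 1 and real_counts.get(key) == 1
    )
    return hits / len(synth_keys)


def ks_statistic(real_values, synth_values):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the two empirical CDFs (0 means identical distributions)."""
    a, b = sorted(real_values), sorted(synth_values)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in set(a) | set(b)
    )


def weighted_score(risk, utility_loss, risk_weight=0.7):
    """Lower is better. risk_weight is an assumed setting, not from the source."""
    return risk_weight * risk + (1.0 - risk_weight) * utility_loss


# Toy real data (assumed): (sex, age) key combinations.
real = [("F", 34), ("M", 51), ("F", 34), ("M", 29)]

# Two hypothetical synthesis runs, labeled by illustrative CART settings.
runs = {
    "minbucket=5": [("F", 34), ("M", 51), ("M", 51), ("F", 40)],
    "minbucket=1": [("M", 51), ("M", 29), ("F", 34), ("F", 34)],
}

real_ages = [age for _, age in real]
results = []
for label, synth in runs.items():
    risk = replicated_unique_rate(real, synth)
    ks = ks_statistic(real_ages, [age for _, age in synth])
    results.append({"params": label, "risk": risk, "ks": ks,
                    "score": weighted_score(risk, ks)})

best = min(results, key=lambda r: r["score"])
print(best["params"], round(best["score"], 3))
```

Here the second run reproduces the real age distribution exactly (KS of 0) but replicates two unique records, so with risk weighted at 0.7 the first run wins. Changing `risk_weight` shifts that balance, which is the tradeoff the tuning report is meant to expose.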

Expected Output

Text file summarizing parameter tuning and optimal synthesis settings.
High-quality synthetic dataset respecting privacy and supporting analysis.
DataExplorer HTML reports for holistic data profile comparison.
Folder of interactive HTML dashboards comparing multi-select variable frequencies.
Tools and insights allowing ongoing tuning and refinement of synthetic data quality.

Designed for analysts who value transparency, reproducibility, and practical insights, this workflow helps you produce synthetic data you can trust.
