| title | DataGen |
|---|---|
| emoji | 🧬 |
| colorFrom | indigo |
| colorTo | pink |
| sdk | docker |
| short_description | AI-powered synthetic data generator |
Generate realistic synthetic datasets by simply describing what you need.
DataGen transforms simple descriptions into structured datasets using AI. Perfect for researchers, data scientists, and developers who need realistic test data fast.
Key Features:
- Type what you want → Get realistic synthetic data
- Multiple formats: CSV, JSON, Parquet, Markdown
- Dataset types: Tables, time-series, text data
- AI-powered: Uses GPT and Claude models
- Instant download with clean, ready-to-use datasets
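
As a rough illustration of the format options listed above, the sketch below renders the same records as CSV, JSON, and a Markdown table using only the Python standard library. It is not DataGen's actual implementation: the function names, the sample records, and the 2-space JSON indent are assumptions made for the example, and Parquet is left out because it requires a third-party library.

```python
import csv
import io
import json


def to_csv(records):
    """Serialize a list of dicts to CSV text; the first record's keys form the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_json(records):
    """Serialize records to indented JSON (indent width is an assumption)."""
    return json.dumps(records, indent=2)


def to_markdown(records):
    """Render records as a pipe-delimited Markdown table."""
    headers = list(records[0].keys())
    lines = [
        "| " + " | ".join(headers) + " |",
        "|" + "|".join("---" for _ in headers) + "|",
    ]
    for row in records:
        lines.append("| " + " | ".join(str(row[h]) for h in headers) + " |")
    return "\n".join(lines)


# Hypothetical sample records, standing in for model-generated rows.
records = [
    {"customer_id": 1, "age": 34, "total_spent": 120.5},
    {"customer_id": 2, "age": 27, "total_spent": 89.0},
]

print(to_csv(records))
print(to_json(records))
print(to_markdown(records))
```

Keeping the records as plain dictionaries until the final step means each output format is just a different serializer over the same data.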
To understand the full workflow from user input to file output, see the architecture section.
- Python 3.11+
- Docker Desktop
- uv package manager
git clone https://github.com/lisekarimi/datagen.git
cd datagen
uv sync
source .venv/bin/activate # Unix/macOS
# or .\.venv\Scripts\activate on Windows
- Copy `.env.example` to `.env`
- Populate it with the required secrets
# Local development
make run
# With hot reload
make ui
For complete setup instructions, commands, and development guidelines, see the Docs Page.
- Describe your data: "Customer purchase history with demographics"
- Choose format: CSV, JSON, Parquet, or Markdown
- Select AI model: GPT or Claude
- Set sample size: Number of records to generate
- Generate & download your dataset
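
The steps above amount to filling in four fields before generation starts. A minimal sketch of how such a request could be validated follows. The `GenerationRequest` class and its field names are hypothetical and do not come from the DataGen codebase; the allowed formats and models are taken from the lists in this README, and the 10-1000 range is the documented sample limit.

```python
from dataclasses import dataclass

# Values taken from this README; the identifiers themselves are hypothetical.
SUPPORTED_FORMATS = {"csv", "json", "parquet", "markdown"}
SUPPORTED_MODELS = {"gpt", "claude"}
MIN_SAMPLES, MAX_SAMPLES = 10, 1000


@dataclass(frozen=True)
class GenerationRequest:
    """Illustrative container for the four user inputs."""

    description: str
    output_format: str
    model: str
    sample_size: int

    def __post_init__(self):
        # Reject requests that fall outside the documented options.
        if not self.description.strip():
            raise ValueError("description must not be empty")
        if self.output_format.lower() not in SUPPORTED_FORMATS:
            raise ValueError(f"unsupported format: {self.output_format}")
        if self.model.lower() not in SUPPORTED_MODELS:
            raise ValueError(f"unsupported model: {self.model}")
        if not MIN_SAMPLES <= self.sample_size <= MAX_SAMPLES:
            raise ValueError(
                f"sample_size must be between {MIN_SAMPLES} and {MAX_SAMPLES}"
            )


request = GenerationRequest(
    description="Customer purchase history with demographics",
    output_format="csv",
    model="claude",
    sample_size=100,
)
print(request)
```

Validating up front keeps an out-of-range sample size or an unknown format from reaching the model call at all.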
DataGen maintains high standards with comprehensive test coverage, automated security scanning, and code quality enforcement.
For CI/CD setup and technical details, see the Docs Page.
- Generated files are automatically cleaned up after 5 minutes
- Supports 10-1000 samples per dataset
- JSON output includes proper indentation for readability
- Cross-platform compatibility (Windows, macOS, Linux)
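
To make the 5-minute cleanup note concrete, here is one way such a sweep could be written. This is a sketch under assumptions, not DataGen's actual mechanism: the function name `remove_expired_files` is hypothetical, and file age is judged here by modification time.

```python
import time
from pathlib import Path

MAX_AGE_SECONDS = 5 * 60  # the 5-minute retention window from the notes above


def remove_expired_files(directory, max_age_seconds=MAX_AGE_SECONDS, now=None):
    """Delete regular files older than max_age_seconds; return their names, sorted."""
    now = time.time() if now is None else now
    removed = []
    for path in Path(directory).iterdir():
        # Age is measured from the last modification time (an assumption).
        if path.is_file() and now - path.stat().st_mtime > max_age_seconds:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

Because it relies only on `pathlib` and `time`, the same sweep behaves consistently on Windows, macOS, and Linux.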
MIT
