Commit c601a55

Clean up docs, ensure examples for YAML, add in advanced SQL section

1 parent 6e9e2fe

33 files changed: 1938 additions & 26 deletions

CLAUDE.md

Lines changed: 166 additions & 0 deletions
@@ -0,0 +1,166 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Data Caterer is a test data management tool built with Scala and Apache Spark that provides automated data generation, validation, and cleanup capabilities. It supports multiple data sources including databases, files, messaging systems, and HTTP APIs.

## Build System & Common Commands

The project uses Gradle with Kotlin DSL and follows a multi-module structure:

- **Root module**: Configuration and project orchestration
- **api**: Builder patterns and models for programmatic usage
- **app**: Core execution engine, Spark integration, and UI server
- **example**: Sample implementations and Docker configurations

### Essential Commands

```bash
# Build the entire project
./gradlew build

# Build individual modules
./gradlew :app:build
./gradlew :api:build

# Run tests (use exact class names, NOT wildcards)
./gradlew :app:test --tests "io.github.datacatering.datacaterer.core.ui.plan.PlanRepositoryTest" --info
./gradlew :api:test

# Generate test coverage with Scoverage
./gradlew reportScoverage

# Create fat/shadow JAR for distribution
./gradlew :app:shadowJar

# Run specific configurations from IDE
./gradlew :app:run --args="DataCatererUI"
```

### Important Test Running Notes

ScalaTest with JUnit Platform has limitations with Gradle's `--tests` filtering:

- ✅ Use exact class names: `--tests "io.github.datacatering.datacaterer.core.ui.plan.PlanRepositoryTest"`
- ❌ Do NOT use wildcards: `--tests "*PlanRunTest*"` (runs ALL tests instead of filtering)

## Architecture Overview

### Core Domain Concepts

- **Plans**: High-level configuration defining data operations to perform
- **Tasks**: Individual data sources (databases, files, messaging systems, HTTP)
- **Steps**: Sub-operations within tasks (tables, topics, file paths)
- **Fields**: Individual data field configurations with generation rules
- **Validations**: Data quality checks and assertions
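
A hedged sketch of how these concepts compose, using the builder API shown later in this file (`PlanRun` and `execute` follow the public examples; treat the exact signatures as assumptions, not the definitive API):

```scala
// Sketch only: a plan run wiring one task (Postgres), one step (the
// "accounts" table), and two fields with generation rules.
class AccountCreationPlan extends PlanRun {
  val accountTask = postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
    .table("accounts")                                            // step: target table within the task
    .fields(
      field.name("account_id").regex("ACC[0-9]{10}").unique(true), // field-level generation rules
      field.name("status").oneOf("open", "closed", "suspended")
    )

  execute(accountTask)                                            // run the plan over the defined task
}
```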

### Module Structure

```
api/                    # Builder API and models
├── model/              # Core data models and types
├── connection/         # Data source connection builders
└── validation/         # Validation builders

app/                    # Core application
├── core/
│   ├── generator/      # Data generation engine
│   ├── validator/      # Data validation engine
│   ├── sink/           # Data output processors
│   ├── metadata/       # Metadata discovery and integration
│   ├── ui/             # Web UI server components
│   └── util/           # Utilities and helpers
└── main/resources/     # Configuration files and UI assets
```
### Key Architectural Patterns

**Builder Pattern**: All configuration uses immutable builders with method chaining

```scala
postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer")
  .table("accounts")
  .fields(field.name("account_id").regex("ACC[0-9]{10}").unique(true))
```

**Case Class Data Models**: Immutable data structures with Jackson JSON serialization

```scala
@JsonIgnoreProperties(ignoreUnknown = true)
case class DataSource(
    name: String,
    `type`: String,
    options: Map[String, String] = Map(),
    enabled: Boolean = true
)
```

**Spark Integration**: Uses Apache Spark for distributed data processing and Spark SQL for data operations
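
As a minimal, generic illustration of that pattern (illustrative only, not Data Caterer's actual generation internals), a Spark job can generate rows and query them with Spark SQL:

```scala
// Minimal Spark + Spark SQL sketch; not Data Caterer's actual code.
import org.apache.spark.sql.SparkSession

object SparkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("data-caterer-sketch")
      .master("local[*]")
      .getOrCreate()

    // Generate a small DataFrame, then run a Spark SQL query over it
    val df = spark.range(100)
      .selectExpr("id AS account_number", "CAST(rand() * 1000 AS INT) AS amount")
    df.createOrReplaceTempView("generated_accounts")
    spark.sql("SELECT COUNT(*) AS total, AVG(amount) AS avg_amount FROM generated_accounts").show()

    spark.stop()
  }
}
```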

## Development Patterns

### Code Style Requirements

- Use `com.softwaremill.quicklens.ModifyPimp` for immutable updates in builders
- Always provide parameterless constructors: `def this() = this(DefaultValue())`
- Use `@JsonIgnoreProperties(ignoreUnknown = true)` for JSON serialization compatibility
- Use `Option[T]` instead of `null` for optional values
- Follow the package structure under `io.github.datacatering.datacaterer`

### Builder Implementation Pattern

```scala
import com.softwaremill.quicklens.ModifyPimp

case class TaskBuilder(task: Task = Task()) {
  def this() = this(Task())

  def name(name: String): TaskBuilder =
    this.modify(_.task.name).setTo(name)

  def option(option: (String, String)): TaskBuilder =
    this.modify(_.task.options)(_ ++ Map(option))
}
```

### Environment Configuration

Runtime behavior is controlled via environment variables:

- `ENABLE_GENERATE_DATA`: Enable/disable data generation
- `ENABLE_DELETE_GENERATED_RECORDS`: Enable cleanup mode
- `PLAN_FILE_PATH`: Path to YAML plan configuration
- `TASK_FOLDER_PATH`: Directory containing task definitions
- `APPLICATION_CONFIG_PATH`: Custom application configuration
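
As an illustrative sketch (a hypothetical helper, not the application's actual configuration loader, which reads HOCON with `${?VAR}` substitution as shown in the configuration docs), such overrides can be read like this:

```scala
// Hypothetical sketch of reading environment-variable overrides.
object EnvConfigSketch {
  private def flag(name: String, default: Boolean): Boolean =
    sys.env.get(name).map(_.toBoolean).getOrElse(default)

  def main(args: Array[String]): Unit = {
    val enableGenerateData = flag("ENABLE_GENERATE_DATA", default = true)
    val planFilePath       = sys.env.getOrElse("PLAN_FILE_PATH", "/opt/app/plan/plan.yaml")
    println(s"generate=$enableGenerateData, plan=$planFilePath")
  }
}
```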

### Data Source Support

The system supports:

- **Databases**: Postgres, MySQL, Cassandra, BigQuery
- **Files**: CSV, JSON, Parquet, Delta Lake, Iceberg, ORC
- **Messaging**: Kafka, RabbitMQ, Solace
- **HTTP**: REST APIs with OpenAPI/Swagger integration
- **Metadata Sources**: Great Expectations, JSON Schema, Data Contract CLI, OpenMetadata, Marquez

## UI and API Integration

The application includes a web UI server that provides:

- Connection management and testing
- Interactive plan creation
- Execution history tracking
- Real-time results viewing

The UI is implemented as a separate module with a React frontend and a Scala backend using HTTP4S.

## Testing Strategy

- Use ScalaTest for unit testing
- Test both API builders and core application logic
- Mock external dependencies (databases, file systems)
- Use exact class names for test filtering, not wildcards
- Leverage the example module for integration testing
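
A minimal ScalaTest sketch following those points (a hypothetical test against the `TaskBuilder` from the Builder Implementation Pattern above):

```scala
// Hypothetical unit test; the TaskBuilder and Task fields assumed here
// come from the builder pattern sketched earlier in this file.
import org.scalatest.funsuite.AnyFunSuite

class TaskBuilderTest extends AnyFunSuite {

  test("name sets the task name without mutating the original builder") {
    val base  = TaskBuilder()
    val named = base.name("my_task")
    assert(named.task.name == "my_task") // new builder carries the update
    assert(base.task.name != "my_task")  // original builder is unchanged
  }
}
```

Run it with the exact-class-name filter described in Essential Commands, not a wildcard.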

## Key Dependencies

- **Scala**: 2.12.x
- **Apache Spark**: 3.5.x
- **Jackson**: JSON serialization
- **Quicklens**: Immutable data updates
- **ScalaTest**: Testing framework
- **HTTP4S**: Web server framework
- **Logback/Log4j**: Logging

docs/docs/configuration.md

Lines changed: 69 additions & 0 deletions
@@ -31,6 +31,7 @@ Flags are used to control which processes are executed when you run Data Caterer

| `enableRecordTracking` | false | Enable/disable tracking of which data records have been generated for any data source |
| `enableDeleteGeneratedRecords` | false | Delete all generated records based on record tracking (requires `enableRecordTracking` to be set to true) |
| `enableGenerateValidations` | false | If enabled, it will generate validations based on the data sources defined |
| `enableFastGeneration` | false | Enable fast generation to maximize throughput. This automatically disables slower features and applies runtime optimizations for maximum performance |

=== "Java"
@@ -96,6 +97,40 @@ Flags are used to control which processes are executed when you run Data Caterer

        enableGenerateValidations = ${?ENABLE_GENERATE_VALIDATIONS}
        enableAlerts = false
        enableAlerts = ${?ENABLE_ALERTS}
        # Fast generation disables slower features for maximum throughput
        enableFastGeneration = false
        enableFastGeneration = ${?ENABLE_FAST_GENERATION}
    }
    ```

### Fast generation mode

Enable fast generation to maximize throughput. It automatically disables slower features (record tracking, count, sink metadata, unique checks, report saving, validations, alerts), applies runtime optimizations (e.g. fewer shuffle partitions, adaptive query execution, the Kryo serializer), and increases `numRecordsPerBatch`.

[:material-run-fast: Scala Example](https://github.com/data-catering/data-caterer-example/blob/main/src/main/scala/io/github/datacatering/plan/FastGenerationAndReferencePlanRun.scala) | [:material-coffee: Java Example](https://github.com/data-catering/data-caterer-example/blob/main/src/main/java/io/github/datacatering/plan/FastGenerationAndReferenceJavaPlanRun.java)

=== "Java"

    ```java
    configuration()
        .enableFastGeneration(true);
    ```

=== "Scala"

    ```scala
    configuration
      .enableFastGeneration(true)
    ```

=== "application.conf"

    ```
    flags {
        enableFastGeneration = true
        enableFastGeneration = ${?ENABLE_FAST_GENERATION}
    }
    ```

@@ -216,6 +251,11 @@ when analysing the generated data if the number of records generated is large.

        oneOfDistinctCountVsCountThreshold = 0.2
        numGeneratedSamples = 10
    }

    uniqueCheck {
        uniqueBloomFilterNumItems = 100000
        uniqueBloomFilterFalsePositiveProbability = 0.1
    }
    ```

## Generation
@@ -289,6 +329,35 @@ Configurations to alter how validations are executed.

    }
    ```

### Unique generation tuning

If `enableUniqueCheck` is enabled, you can tune the Bloom filter that backs the uniqueness checks to balance memory usage against false positive probability. Lower false positive probabilities and larger item counts both increase the filter's memory footprint.

=== "Java"

    ```java
    configuration()
        .uniqueBloomFilterNumItems(100000L)
        .uniqueBloomFilterFalsePositiveProbability(0.1);
    ```

=== "Scala"

    ```scala
    configuration
      .uniqueBloomFilterNumItems(100000L)
      .uniqueBloomFilterFalsePositiveProbability(0.1)
    ```

=== "application.conf"

    ```
    uniqueCheck {
        uniqueBloomFilterNumItems = 100000
        uniqueBloomFilterFalsePositiveProbability = 0.1
    }
    ```

## Runtime

Given Data Caterer uses Spark as the base framework for data processing, you can configure the job as to your

docs/docs/generator/count.md

Lines changed: 2 additions & 2 deletions
@@ -413,7 +413,7 @@ It can generate a dataset like below where all combinations of `debit_credit` an
     csv("transactions", "app/src/test/resources/sample/csv/transactions")
       .fields(
         field().name("account_id"),
-        field().name("debit_creidt").oneOf("D", "C"),
+        field().name("debit_credit").oneOf("D", "C"),
         field().name("status").oneOf("open", "closed", "suspended")
       )
       .allCombinations(true);
@@ -425,7 +425,7 @@ It can generate a dataset like below where all combinations of `debit_credit` an
     csv("transactions", "app/src/test/resources/sample/csv/transactions")
       schema(
         field.name("account_id"),
-        field.name("debit_creidt").oneOf("D", "C"),
+        field.name("debit_credit").oneOf("D", "C"),
         field.name("status").oneOf("open", "closed", "suspended")
       )
       .allCombinations(true)
