Commit e0da001

feat: Implement new unified configuration and migration tools (#127)

* feat: Implement new unified configuration and migration tools

  - Introduced unified configuration schema and migration scripts to facilitate transitions from legacy formats.
  - Added comprehensive documentation for migration processes and examples for new configurations.
  - Enhanced memory profiling tools with new scenarios and scripts for performance analysis.
  - Updated various documentation sections to reflect recent changes in configuration and validation capabilities.
  - Removed obsolete sample plans to streamline the testing framework.

  This update aims to improve user experience during configuration migrations and enhance the overall performance profiling capabilities of the application.

* feat: Enhance SQL generation with seeded randomness and improved determinism

  - Introduced new methods for generating SQL expressions with seeded randomness in various data generators, ensuring consistent and varied outputs.
  - Updated `DataGeneratorFactory` to utilize new random expression methods for weight calculations.
  - Refactored SQL generation logic in `RandomDataGenerator`, `OneOfDataGenerator`, and `RegexNode` to support indexed random values.
  - Added tests to verify deterministic behavior of seeded generators, ensuring expected outputs across multiple runs.
  - Enhanced `DataGeneratorDeterminismTest` to validate the consistency and variability of generated values with seeded configurations.

  These changes improve the reliability of data generation processes, particularly in scenarios requiring reproducible results.

* feat: Enhance credit card generation and SQL expression handling

  - Updated `creditCard` method in `FieldBuilder` to include specific card types in the generated expressions, improving accuracy for Visa, Mastercard, and Amex.
  - Refactored `BatchDataProcessor` to ensure proper resource management with a lazy initialization of the `SinkFactory`, enhancing performance and reliability.
  - Improved SQL generation in `RandomDataGenerator` to handle credit card patterns more effectively, ensuring correct regex expressions are generated based on card type.
  - Added new utility methods for handling unique SQL generation for regex patterns, enhancing the flexibility of data generation.
  - Updated tests to validate the new credit card generation logic and SQL expression handling, ensuring expected behavior across various scenarios.

  These changes improve the robustness and accuracy of data generation processes, particularly for financial data.
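
The second commit in the message is about seeded, index-aware random values. The sketch below illustrates the general idea only, using `scala.util.Random` from the standard library; it is not the project's SQL-based implementation. `SeededSketch`, `sequence`, and `indexedValue` are hypothetical names, and mixing the seed with the row index via `seed * 31 + index` is an arbitrary choice made for illustration.

```scala
import scala.util.Random

object SeededSketch {
  // Draw `n` values from a single generator created with a fixed seed.
  // The same seed always yields the same sequence.
  def sequence(seed: Long, n: Int, bound: Int): Seq[Int] = {
    val rng = new Random(seed)
    Seq.fill(n)(rng.nextInt(bound))
  }

  // Hypothetical "indexed" variant: the value for a row depends only on
  // (seed, index), so it does not change with generation order.
  def indexedValue(seed: Long, index: Long, bound: Int): Int =
    new Random(seed * 31 + index).nextInt(bound)

  def main(args: Array[String]): Unit = {
    println(sequence(42L, 5, 100))
    println(sequence(42L, 5, 100)) // identical to the line above
    println((0L until 5L).map(i => indexedValue(42L, i, 100)))
  }
}
```

Deriving each value from the seed and the row index, rather than from a shared generator, is what keeps results reproducible when rows are produced in parallel or in a different order.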
1 parent d035ef7 commit e0da001

155 files changed

Lines changed: 22287 additions & 3317 deletions

CLAUDE.md

Lines changed: 108 additions & 0 deletions
@@ -6,6 +6,21 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co

 Data Caterer is a test data management tool built with Scala and Apache Spark that provides automated data generation, validation, and cleanup capabilities. It supports multiple data sources including databases, files, messaging systems, and HTTP APIs.

+### YAML Configuration Formats
+
+Data Caterer supports two YAML configuration formats:
+
+1. **Unified Format (v1.0+)** - Recommended for new projects
+   - Single-file configuration with `version: "1.0"`
+   - Examples in `misc/schema/examples/`
+   - Schema: `misc/schema/unified-config-schema.json`
+
+2. **Legacy Format** - Still supported but will be deprecated
+   - Separate plan and task files
+   - Examples in `example/docker/data/custom/`
+
+**Migration**: Use `migrate_yaml.py` to convert legacy YAML to unified format. See [docs/migrations/yaml-unified-format/](docs/migrations/yaml-unified-format/) for details.
+
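
The added section distinguishes the two formats by the presence of `version: "1.0"` in a single file. A minimal sketch of how a caller might tell them apart, assuming that a top-level `version:` key with a `1.x` value is a sufficient signal; that heuristic, and the names `YamlFormatSketch`, `detect`, `Unified`, and `Legacy`, are assumptions for illustration and are not taken from the project:

```scala
object YamlFormatSketch {
  sealed trait ConfigFormat
  case object Unified extends ConfigFormat
  case object Legacy extends ConfigFormat

  // Matches an unindented `version:` key with a 1.x value, quoted or not.
  private val VersionLine = """^version:\s*["']?1\.\d+["']?\s*$""".r

  // Treat a file as unified if any top-level line declares version 1.x,
  // otherwise fall back to legacy.
  def detect(yaml: String): ConfigFormat =
    if (yaml.split("\n").exists(line => VersionLine.findFirstIn(line).isDefined)) Unified
    else Legacy

  def main(args: Array[String]): Unit = {
    println(detect("version: \"1.0\"\nname: demo"))
    println(detect("name: my_plan\ntasks: []"))
  }
}
```

A line-based check avoids a YAML parser dependency but ignores nesting beyond indentation, so validating against the JSON schema mentioned above would be the more robust option.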
 ## Build System & Common Commands

 The project uses Gradle with Kotlin DSL and follows a multi-module structure:
@@ -63,6 +78,8 @@ ScalaTest with JUnit Platform has limitations with Gradle's `--tests` filtering:
 - **Unit tests** (`app/src/test`): Fast, isolated tests for individual components
 - **Integration tests** (`app/src/integrationTest`): Slower tests that verify end-to-end workflows (e.g., YAML plan processing)
 - **Performance tests** (`app/src/performanceTest`): Benchmarking tests for data generation and foreign key performance
+- **Manual tests** (`app/src/manualTest`): Standalone tests for external dependencies (Kafka, PostgreSQL, etc.) - must be run explicitly
+- **Memory profiling** (`misc/memory-profiling`): Production-grade memory validation with JFR and heap dumps - see section below

 ## Architecture Overview

@@ -250,6 +267,97 @@ DEPLOY_MODE=standalone ./gradlew :app:run --args="DataCatererUI"
 ./gradlew :app:test --tests "io.github.datacatering.datacaterer.core.generator.DataGeneratorFactoryTest"
 ```

+### Manual Tests for External Dependencies
+
+Manual tests (`app/src/manualTest`) are standalone tests designed for verifying integrations with real external services (Kafka, PostgreSQL, etc.). These tests are NOT run as part of regular test suites.
+
+**Prerequisites**:
+- Install [insta-infra](https://github.com/data-catering/insta-infra) for automatic service management
+- Or start required services manually (Docker, local installations)
+
+**Running Manual Tests**:
+```bash
+# Run specific manual test (will auto-start Kafka via insta-infra if available)
+./gradlew :app:manualTest --tests "io.github.datacatering.datacaterer.core.manual.KafkaStreamingManualTest"
+
+# Run PostgreSQL manual test
+./gradlew :app:manualTest --tests "io.github.datacatering.datacaterer.core.manual.PostgresManualTest"
+
+# Run any unified YAML file for testing
+YAML_FILE=/path/to/my-config.yaml ./gradlew :app:manualTest --tests "io.github.datacatering.datacaterer.core.manual.YamlFileManualTest"
+
+# Run an example from misc/schema/examples
+YAML_FILE=misc/schema/examples/kafka-streaming.yaml ./gradlew :app:manualTest --tests "io.github.datacatering.datacaterer.core.manual.YamlFileManualTest"
+
+# With custom service configuration
+KAFKA_BROKERS=my-kafka:9092 ./gradlew :app:manualTest --tests "*KafkaStreamingManualTest"
+POSTGRES_URL=jdbc:postgresql://myhost:5432/mydb ./gradlew :app:manualTest --tests "*PostgresManualTest"
+```
+
+**Available Manual Tests**:
+- `KafkaStreamingManualTest`: Tests Kafka streaming with real Kafka cluster
+- `PostgresManualTest`: Tests PostgreSQL data generation with real database
+- `YamlFileManualTest`: Generic runner for any unified YAML configuration file
+
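
The commands above configure the manual tests through `KAFKA_BROKERS`, `POSTGRES_URL`, and `YAML_FILE`. A sketch of how a test might resolve such settings with a fallback, assuming a plain environment lookup; the helper name `setting` and both default values are assumptions for a local setup and are not taken from the project:

```scala
object ManualTestEnvSketch {
  // Pure lookup so it can be exercised with any map, not only sys.env.
  // Blank values are treated as unset.
  def setting(env: Map[String, String], key: String, default: String): String =
    env.get(key).map(_.trim).filter(_.nonEmpty).getOrElse(default)

  def main(args: Array[String]): Unit = {
    val env = sys.env
    // Defaults below are illustrative assumptions.
    println(setting(env, "KAFKA_BROKERS", "localhost:9092"))
    println(setting(env, "POSTGRES_URL", "jdbc:postgresql://localhost:5432/postgres"))
    // YAML_FILE has no sensible default, so it stays optional.
    println(env.get("YAML_FILE"))
  }
}
```

Passing the environment in as a `Map` keeps the lookup testable without mutating process state.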
+## Memory Profiling
+
+**IMPORTANT**: Memory optimization validation uses production-grade profiling scripts in `misc/memory-profiling/`, NOT unit tests.
+
+### Why Not Use Tests?
+
+Test-based memory measurement fails due to:
+- Non-deterministic GC behavior (`System.gc()` is just a suggestion)
+- Test framework and Gradle daemon overhead
+- Inability to test real Spark jobs with HTTP streaming
+- Inconsistent results between runs (50%+ variance)
+
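
For context, the in-process style of measurement that this list argues against typically looks like the sketch below. It uses only `java.lang.Runtime` and `System.gc()`; `HeapSampleSketch`, `usedHeapBytes`, and `heapDelta` are hypothetical names and nothing here is taken from the project's tests.

```scala
object HeapSampleSketch {
  // Used heap as seen from inside the JVM: allocated heap minus free space.
  def usedHeapBytes(): Long = {
    val rt = Runtime.getRuntime
    rt.totalMemory() - rt.freeMemory()
  }

  // Sample heap growth around `work`. System.gc() is only a hint to the JVM,
  // so the reported delta can differ widely between calls.
  def heapDelta[A](work: => A): (A, Long) = {
    System.gc()
    val before = usedHeapBytes()
    val result = work
    val after = usedHeapBytes()
    (result, after - before)
  }

  def main(args: Array[String]): Unit = {
    // Allocate the same array five times and print each measured delta.
    val deltas = (1 to 5).map(_ => heapDelta(Array.fill(1000000)(1L))._2)
    println(deltas)
  }
}
```

Because a collection may or may not run between the two samples, the printed deltas need not agree with each other, which is the kind of instability the list above describes.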
+### Using Memory Profiling Scripts
+
+Located in `misc/memory-profiling/`, these scripts provide accurate memory validation using Java Flight Recorder, heap dumps, and GC analysis.
+
+**Quick Start**:
+```bash
+cd misc/memory-profiling
+
+# Quick validation (10K records)
+./scripts/run-memory-profile.sh
+
+# Bounded buffer validation (250K records)
+./scripts/run-memory-profile.sh scenarios/bounded-buffer-test.yaml 512m 2g
+
+# Stress test with OOM detection (2M records)
+./scripts/run-memory-profile.sh scenarios/stress-test-http.yaml 1g 2g --oom-dump
+
+# Full regression testing
+./scripts/run-all-scenarios.sh 1g 2g
+```
+
+**Available Scenarios**:
+- `baseline-http.yaml` - Quick smoke test (10K records)
+- `bounded-buffer-test.yaml` - Validates bounded buffer optimization (250K records)
+- `high-throughput-http.yaml` - High throughput validation (500K records)
+- `large-batch-http.yaml` - Large batch processing (500K records)
+- `sustained-load-http.yaml` - Long-running load test (1M records)
+- `stress-test-http.yaml` - Stress test with OOM detection (2M records)
+
+**Profiling Options**:
+```bash
+# With Java Flight Recorder
+./scripts/run-memory-profile.sh scenarios/stress-test-http.yaml 1g 2g --flight-recorder
+
+# With heap dump on OOM
+./scripts/run-memory-profile.sh scenarios/stress-test-http.yaml 1g 2g --oom-dump
+
+# Custom HTTP port
+./scripts/run-memory-profile.sh scenarios/baseline-http.yaml 512m 2g --port 9090
+```
+
+**Results**:
+- Memory usage reports in `misc/memory-profiling/results/`
+
+**Documentation**:
+- [misc/memory-profiling/README.md](misc/memory-profiling/README.md) - Comprehensive guide
+
 ## Key Dependencies

 - **Scala**: 2.12.x

README.md

Lines changed: 16 additions & 1 deletion
@@ -49,6 +49,19 @@ Press Enter to run the default example. Check results at `docker/sample/report/i

 ### YAML

+#### New Unified Format (v1.0+)
+
+```shell
+git clone git@github.com:data-catering/data-caterer.git
+cd data-caterer/example
+export YAML_FILE=misc/schema/examples/minimal.yaml
+./gradlew :app:run
+```
+
+Check the [unified YAML examples](misc/schema/examples/) for more configurations.
+
+#### Legacy Format (Still Supported)
+
 ```shell
 git clone git@github.com:data-catering/data-caterer.git
 cd data-caterer/example
@@ -58,10 +71,12 @@ cd data-caterer/example
 It will run the [`csv.yaml`](example/docker/data/custom/plan/csv.yaml) plan file and the [`csv_transaction_file`](example/docker/data/custom/task/file/csv/csv_transaction_file.yaml) task file.
 Check results at `docker/data/custom/report/index.html`.

+**📦 Migrating from Legacy to Unified Format?** See [Migration Guide](docs/migrations/yaml-unified-format/MIGRATION.md) for the automated migration tool.
+
 ### UI

 ```shell
-docker run -d -p 9898:9898 -e DEPLOY_MODE=standalone --name datacaterer datacatering/data-caterer:0.18.0
+docker run -d -p 9898:9898 -e DEPLOY_MODE=standalone --name datacaterer datacatering/data-caterer:0.19.0
 ```

 Open [http://localhost:9898](http://localhost:9898).
Lines changed: 139 additions & 0 deletions
@@ -0,0 +1,139 @@
+package io.github.datacatering.datacaterer.api
+
+/**
+ * Type-safe builder for conditional value generation.
+ * Allows referencing other fields and building CASE WHEN expressions.
+ *
+ * @example {{{
+ * field.name("discount").conditionalValue(
+ *   when("total").greaterThan(1000) -> 100,
+ *   when("total").greaterThan(500) -> 50,
+ *   when("total").greaterThan(100) -> 10
+ * )(elseValue = 0)
+ * }}}
+ */
+case class ConditionalBuilder(fieldName: String) {
+
+  /**
+   * Creates a condition where field is greater than a value.
+   *
+   * @param value the value to compare against
+   * @return ConditionalBranch for chaining with then value
+   */
+  def greaterThan(value: Any): ConditionalBranch =
+    ConditionalBranch(s"$fieldName > ${formatValue(value)}")
+
+  /**
+   * Creates a condition where field is less than a value.
+   *
+   * @param value the value to compare against
+   * @return ConditionalBranch for chaining with then value
+   */
+  def lessThan(value: Any): ConditionalBranch =
+    ConditionalBranch(s"$fieldName < ${formatValue(value)}")
+
+  /**
+   * Creates a condition where field equals a value.
+   *
+   * @param value the value to compare against
+   * @return ConditionalBranch for chaining with then value
+   */
+  def equalTo(value: Any): ConditionalBranch =
+    ConditionalBranch(s"$fieldName = ${formatValue(value)}")
+
+  /**
+   * Creates a condition where field is between two values (inclusive).
+   *
+   * @param min the minimum value
+   * @param max the maximum value
+   * @return ConditionalBranch for chaining with then value
+   */
+  def between(min: Any, max: Any): ConditionalBranch =
+    ConditionalBranch(s"$fieldName BETWEEN ${formatValue(min)} AND ${formatValue(max)}")
+
+  /**
+   * Creates a condition where field is in a set of values.
+   *
+   * @param values the values to check against
+   * @return ConditionalBranch for chaining with then value
+   */
+  def in(values: Any*): ConditionalBranch = {
+    val valuesList = values.map(formatValue).mkString(", ")
+    ConditionalBranch(s"$fieldName IN ($valuesList)")
+  }
+
+  /**
+   * Creates a condition where field is greater than or equal to a value.
+   *
+   * @param value the value to compare against
+   * @return ConditionalBranch for chaining with then value
+   */
+  def greaterThanOrEqual(value: Any): ConditionalBranch =
+    ConditionalBranch(s"$fieldName >= ${formatValue(value)}")
+
+  /**
+   * Creates a condition where field is less than or equal to a value.
+   *
+   * @param value the value to compare against
+   * @return ConditionalBranch for chaining with then value
+   */
+  def lessThanOrEqual(value: Any): ConditionalBranch =
+    ConditionalBranch(s"$fieldName <= ${formatValue(value)}")
+
+  /**
+   * Creates a condition where field is not equal to a value.
+   *
+   * @param value the value to compare against
+   * @return ConditionalBranch for chaining with then value
+   */
+  def notEqualTo(value: Any): ConditionalBranch =
+    ConditionalBranch(s"$fieldName != ${formatValue(value)}")
+
+  private def formatValue(value: Any): String = value match {
+    case s: String => s"'$s'"
+    case v => v.toString
+  }
+}
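
`formatValue` wraps a string in single quotes without touching its contents, so a value such as `O'Brien` would terminate the SQL literal early. The sketch below shows two common ways to escape the embedded quote. Which one is correct depends on the SQL dialect and parser settings in use, which the diff does not state, so treat the choice as an assumption to verify. `SqlLiteralSketch`, `quoteAnsi`, and `quoteBackslash` are hypothetical names and not part of the commit.

```scala
object SqlLiteralSketch {
  // ANSI SQL style: an embedded single quote is written twice.
  val quoteAnsi: String => String =
    s => "'" + s.replace("'", "''") + "'"

  // Backslash style: escape backslashes first, then single quotes.
  val quoteBackslash: String => String =
    s => "'" + s.replace("\\", "\\\\").replace("'", "\\'") + "'"

  // Same shape as formatValue in the diff, with the quoting strategy passed in.
  def formatValue(value: Any, quote: String => String = quoteAnsi): String =
    value match {
      case s: String => quote(s)
      case v         => v.toString
    }

  def main(args: Array[String]): Unit = {
    println(formatValue("O'Brien"))                 // 'O''Brien'
    println(formatValue("O'Brien", quoteBackslash)) // 'O\'Brien'
    println(formatValue(42))                        // 42
  }
}
```

Passing the strategy as a parameter keeps the matching logic identical to the original and isolates the dialect-specific part.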
+
+/**
+ * Represents a conditional branch (the condition part of WHEN ... THEN).
+ */
+case class ConditionalBranch(condition: String) {
+
+  /**
+   * Completes the conditional with a constant value.
+   * Usage: when("field").greaterThan(100) -> 50
+   *
+   * @param thenValue the value to return when condition is true
+   * @return ConditionalCase for use in conditionalValue()
+   */
+  def ->(thenValue: Any): ConditionalCase = {
+    val formattedValue = thenValue match {
+      case s: String => s"'$s'"
+      case v => v.toString
+    }
+    ConditionalCase(condition, formattedValue)
+  }
+
+  /**
+   * Alternative syntax: value ->: condition
+   * Completes the conditional with a constant value.
+   *
+   * @param thenValue the value to return when condition is true
+   * @return ConditionalCase for use in conditionalValue()
+   */
+  def ->:(thenValue: Any): ConditionalCase = ->(thenValue)
+}
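
The `value ->: condition` form works because Scala treats any operator whose name ends in `:` as right-associative, so the method is invoked on the right-hand operand. The sketch below uses simplified stand-ins for the two case classes so that it compiles on its own; unlike the code in the diff, the stand-in does not quote string values. `ArrowSketch` is a hypothetical name.

```scala
// Simplified stand-ins mirroring the shapes in the diff (no string quoting).
final case class ConditionalCase(condition: String, thenValue: String)

final case class ConditionalBranch(condition: String) {
  // Left-to-right form: branch -> value
  def ->(thenValue: Any): ConditionalCase =
    ConditionalCase(condition, thenValue.toString)

  // Name ends in ':', so `value ->: branch` is parsed as `branch.->:(value)`.
  def ->:(thenValue: Any): ConditionalCase = this.->(thenValue)
}

object ArrowSketch {
  val branch: ConditionalBranch = ConditionalBranch("total > 100")
  val left: ConditionalCase = branch -> 10
  val right: ConditionalCase = 10 ->: branch

  def main(args: Array[String]): Unit = {
    println(left)
    println(left == right) // both spellings build the same case
  }
}
```

Because `->` is a member of `ConditionalBranch`, it takes priority over the implicit `->` that `Predef` provides for building tuples.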
+
+/**
+ * A complete conditional case (condition + value).
+ * Represents one WHEN ... THEN clause in a SQL CASE expression.
+ */
+case class ConditionalCase(condition: String, thenValue: String) {
+  /**
+   * Converts this case to SQL WHEN ... THEN syntax.
+   *
+   * @return SQL string for this case
+   */
+  def toSql: String = s"WHEN $condition THEN $thenValue"
+}
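
The diff shown here stops at `toSql`; the code that joins the clauses into a full `CASE` expression is not part of this excerpt. A minimal sketch of how that assembly might look, assuming the clauses are concatenated in order and closed with `ELSE ... END`; `CaseWhenSketch` and `buildCase` are hypothetical names, and `ConditionalCase` is redeclared so that the block is self-contained:

```scala
final case class ConditionalCase(condition: String, thenValue: String) {
  def toSql: String = s"WHEN $condition THEN $thenValue"
}

object CaseWhenSketch {
  // Hypothetical helper: joins WHEN ... THEN clauses in the order given
  // and closes the expression with ELSE ... END.
  def buildCase(cases: Seq[ConditionalCase], elseValue: String): String = {
    require(cases.nonEmpty, "a CASE expression needs at least one WHEN clause")
    cases.map(_.toSql).mkString("CASE ", " ", s" ELSE $elseValue END")
  }

  def main(args: Array[String]): Unit = {
    val discount = buildCase(
      Seq(
        ConditionalCase("total > 1000", "100"),
        ConditionalCase("total > 500", "50"),
        ConditionalCase("total > 100", "10")
      ),
      elseValue = "0"
    )
    println(discount)
    // CASE WHEN total > 1000 THEN 100 WHEN total > 500 THEN 50 WHEN total > 100 THEN 10 ELSE 0 END
  }
}
```

SQL evaluates `WHEN` clauses in order and returns the first match, which is why the thresholds in the Scaladoc example are listed from largest to smallest.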
