data-catering
diff --git a/CLAUDE.md b/CLAUDE.md
Lines changed: 108 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 16 additions & 1 deletion
diff --git a/api/src/main/scala/io/github/datacatering/datacaterer/api/ConditionalBuilder.scala b/api/src/main/scala/io/github/datacatering/datacaterer/api/ConditionalBuilder.scala
Lines changed: 139 additions & 0 deletions
@@ -6,6 +6,21 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 Data Caterer is a test data management tool built with Scala and Apache Spark that provides automated data generation, validation, and cleanup capabilities. It supports multiple data sources including databases, files, messaging systems, and HTTP APIs.
 
+### YAML Configuration Formats
+
+Data Caterer supports two YAML configuration formats:
+
+1. **Unified Format (v1.0+)** - Recommended for new projects
+   - Single-file configuration with `version: "1.0"`
+   - Examples in `misc/schema/examples/`
+   - Schema: `misc/schema/unified-config-schema.json`
+
+2. **Legacy Format** - Still supported but will be deprecated
+   - Separate plan and task files
+   - Examples in `example/docker/data/custom/`
+
+**Migration**: Use `migrate_yaml.py` to convert legacy YAML to the unified format. See [docs/migrations/yaml-unified-format/](docs/migrations/yaml-unified-format/) for details.
+
 ## Build System & Common Commands
 
 The project uses Gradle with Kotlin DSL and follows a multi-module structure:
@@ -63,6 +78,8 @@ ScalaTest with JUnit Platform has limitations with Gradle's `--tests` filtering:
 - **Unit tests** (`app/src/test`): Fast, isolated tests for individual components
 - **Integration tests** (`app/src/integrationTest`): Slower tests that verify end-to-end workflows (e.g., YAML plan processing)
 - **Performance tests** (`app/src/performanceTest`): Benchmarking tests for data generation and foreign key performance
+- **Manual tests** (`app/src/manualTest`): Standalone tests for external dependencies (Kafka, PostgreSQL, etc.) - must be run explicitly
+- **Memory profiling** (`misc/memory-profiling`): Production-grade memory validation with JFR and heap dumps - see section below
 
 ## Architecture Overview
 
@@ -250,6 +267,97 @@ DEPLOY_MODE=standalone ./gradlew :app:run --args="DataCatererUI"
 ./gradlew :app:test --tests "io.github.datacatering.datacaterer.core.generator.DataGeneratorFactoryTest"
 ```
 
+### Manual Tests for External Dependencies
+
+Manual tests (`app/src/manualTest`) are standalone tests designed for verifying integrations with real external services (Kafka, PostgreSQL, etc.). These tests are NOT run as part of regular test suites.
+
+**Prerequisites**:
+- Install [insta-infra](https://github.com/data-catering/insta-infra) for automatic service management
+- Or start required services manually (Docker, local installations)
+
+**Running Manual Tests**:
+```bash
+# Run specific manual test (will auto-start Kafka via insta-infra if available)
+./gradlew :app:manualTest --tests "io.github.datacatering.datacaterer.core.manual.KafkaStreamingManualTest"
+
+# Run PostgreSQL manual test
+./gradlew :app:manualTest --tests "io.github.datacatering.datacaterer.core.manual.PostgresManualTest"
+
+# Run any unified YAML file for testing
+YAML_FILE=/path/to/my-config.yaml ./gradlew :app:manualTest --tests "io.github.datacatering.datacaterer.core.manual.YamlFileManualTest"
+
+# Run an example from misc/schema/examples
+YAML_FILE=misc/schema/examples/kafka-streaming.yaml ./gradlew :app:manualTest --tests "io.github.datacatering.datacaterer.core.manual.YamlFileManualTest"
+
+# With custom service configuration
+KAFKA_BROKERS=my-kafka:9092 ./gradlew :app:manualTest --tests "*KafkaStreamingManualTest"
+POSTGRES_URL=jdbc:postgresql://myhost:5432/mydb ./gradlew :app:manualTest --tests "*PostgresManualTest"
+```
+
+**Available Manual Tests**:
+- `KafkaStreamingManualTest`: Tests Kafka streaming with real Kafka cluster
+- `PostgresManualTest`: Tests PostgreSQL data generation with real database
+- `YamlFileManualTest`: Generic runner for any unified YAML configuration file
+
+## Memory Profiling
+
+**IMPORTANT**: Memory optimization validation uses production-grade profiling scripts in `misc/memory-profiling/`, NOT unit tests.
+
+### Why Not Use Tests?
+
+Test-based memory measurement fails due to:
+- Non-deterministic GC behavior (`System.gc()` is just a suggestion)
+- Test framework and Gradle daemon overhead
+- Inability to test real Spark jobs with HTTP streaming
+- Inconsistent results between runs (50%+ variance)
+
+### Using Memory Profiling Scripts
+
+Located in `misc/memory-profiling/`, these scripts provide accurate memory validation using Java Flight Recorder, heap dumps, and GC analysis.
+
+**Quick Start**:
+```bash
+cd misc/memory-profiling
+
+# Quick validation (10K records)
+./scripts/run-memory-profile.sh
+
+# Bounded buffer validation (250K records)
+./scripts/run-memory-profile.sh scenarios/bounded-buffer-test.yaml 512m 2g
+
+# Stress test with OOM detection (2M records)
+./scripts/run-memory-profile.sh scenarios/stress-test-http.yaml 1g 2g --oom-dump
+
+# Full regression testing
+./scripts/run-all-scenarios.sh 1g 2g
+```
+
+**Available Scenarios**:
+- `baseline-http.yaml` - Quick smoke test (10K records)
+- `bounded-buffer-test.yaml` - Validates bounded buffer optimization (250K records)
+- `high-throughput-http.yaml` - High throughput validation (500K records)
+- `large-batch-http.yaml` - Large batch processing (500K records)
+- `sustained-load-http.yaml` - Long-running load test (1M records)
+- `stress-test-http.yaml` - Stress test with OOM detection (2M records)
+
+**Profiling Options**:
+```bash
+# With Java Flight Recorder
+./scripts/run-memory-profile.sh scenarios/stress-test-http.yaml 1g 2g --flight-recorder
+
+# With heap dump on OOM
+./scripts/run-memory-profile.sh scenarios/stress-test-http.yaml 1g 2g --oom-dump
+
+# Custom HTTP port
+./scripts/run-memory-profile.sh scenarios/baseline-http.yaml 512m 2g --port 9090
+```
+
+**Results**:
+- Memory usage reports in `misc/memory-profiling/results/`
+
+**Documentation**:
+- [misc/memory-profiling/README.md](misc/memory-profiling/README.md) - Comprehensive guide
+
 ## Key Dependencies
 
 - **Scala**: 2.12.x
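
The "Why Not Use Tests?" list in the hunk above rests on `System.gc()` being only a hint to the JVM. A minimal sketch of that effect follows, assuming nothing about the repository: `HeapMeasurementSketch`, `usedHeapBytes`, and `measureOnce` are illustrative names, not code from this PR. It takes the same in-process measurement several times so the spread between runs can be observed.

```scala
object HeapMeasurementSketch {

  // Heap currently in use, as reported by the running JVM.
  def usedHeapBytes(): Long = {
    val rt = Runtime.getRuntime
    rt.totalMemory() - rt.freeMemory()
  }

  // Request a GC, allocate roughly `megabytes` MB, and return the change in used heap.
  // The result can differ between calls (and can even be negative) because the
  // collector decides on its own when to run.
  def measureOnce(megabytes: Int): Long = {
    System.gc() // a request only; the JVM is free to ignore it
    val before = usedHeapBytes()
    val blocks = Array.fill(megabytes)(new Array[Byte](1024 * 1024))
    val after = usedHeapBytes()
    require(blocks.length == megabytes) // keep the allocation reachable until measured
    after - before
  }

  def main(args: Array[String]): Unit = {
    val samples = (1 to 5).map(_ => measureOnce(16))
    samples.foreach(s => println(f"delta: ${s / 1024.0 / 1024.0}%.1f MB"))
    println(s"spread between runs: ${(samples.max - samples.min) / 1024} KB")
  }
}
```

The printed values depend on the JVM, its flags, and whatever else is running in the process, which is the reason the added section points to the external profiling scripts instead.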
 
@@ -49,6 +49,19 @@ Press Enter to run the default example. Check results at `docker/sample/report/i
 
 ### YAML
 
+#### New Unified Format (v1.0+)
+
+```shell
+git clone git@github.com:data-catering/data-caterer.git
+cd data-caterer
+export YAML_FILE=misc/schema/examples/minimal.yaml
+./gradlew :app:run
+```
+
+Check the [unified YAML examples](misc/schema/examples/) for more configurations.
+
+#### Legacy Format (Still Supported)
+
 ```shell
 git clone git@github.com:data-catering/data-caterer.git
 cd data-caterer/example
@@ -58,10 +71,12 @@ cd data-caterer/example
 It will run the [`csv.yaml`](example/docker/data/custom/plan/csv.yaml) plan file and the [`csv_transaction_file`](example/docker/data/custom/task/file/csv/csv_transaction_file.yaml) task file.
 Check results at `docker/data/custom/report/index.html`.
 
+**📦 Migrating from Legacy to Unified Format?** See [Migration Guide](docs/migrations/yaml-unified-format/MIGRATION.md) for the automated migration tool.
+
 ### UI
 
 ```shell
-docker run -d -p 9898:9898 -e DEPLOY_MODE=standalone --name datacaterer datacatering/data-caterer:0.18.0
+docker run -d -p 9898:9898 -e DEPLOY_MODE=standalone --name datacaterer datacatering/data-caterer:0.19.0
 ```
 
 Open [http://localhost:9898](http://localhost:9898).
 
@@ -0,0 +1,139 @@
+package io.github.datacatering.datacaterer.api
+
+/**
+ * Type-safe builder for conditional value generation.
+ * Allows referencing other fields and building CASE WHEN expressions.
+ *
+ * @example {{{
+ *   field.name("discount").conditionalValue(
+ *     when("total").greaterThan(1000) -> 100,
+ *     when("total").greaterThan(500) -> 50,
+ *     when("total").greaterThan(100) -> 10
+ *   )(elseValue = 0)
+ * }}}
+ */
+case class ConditionalBuilder(fieldName: String) {
+
+  /**
+   * Creates a condition where field is greater than a value.
+   *
+   * @param value the value to compare against
+   * @return ConditionalBranch for chaining with then value
+   */
+  def greaterThan(value: Any): ConditionalBranch =
+    ConditionalBranch(s"$fieldName > ${formatValue(value)}")
+
+  /**
+   * Creates a condition where field is less than a value.
+   *
+   * @param value the value to compare against
+   * @return ConditionalBranch for chaining with then value
+   */
+  def lessThan(value: Any): ConditionalBranch =
+    ConditionalBranch(s"$fieldName < ${formatValue(value)}")
+
+  /**
+   * Creates a condition where field equals a value.
+   *
+   * @param value the value to compare against
+   * @return ConditionalBranch for chaining with then value
+   */
+  def equalTo(value: Any): ConditionalBranch =
+    ConditionalBranch(s"$fieldName = ${formatValue(value)}")
+
+  /**
+   * Creates a condition where field is between two values (inclusive).
+   *
+   * @param min the minimum value
+   * @param max the maximum value
+   * @return ConditionalBranch for chaining with then value
+   */
+  def between(min: Any, max: Any): ConditionalBranch =
+    ConditionalBranch(s"$fieldName BETWEEN ${formatValue(min)} AND ${formatValue(max)}")
+
+  /**
+   * Creates a condition where field is in a set of values.
+   *
+   * @param values the values to check against
+   * @return ConditionalBranch for chaining with then value
+   */
+  def in(values: Any*): ConditionalBranch = {
+    val valuesList = values.map(formatValue).mkString(", ")
+    ConditionalBranch(s"$fieldName IN ($valuesList)")
+  }
+
+  /**
+   * Creates a condition where field is greater than or equal to a value.
+   *
+   * @param value the value to compare against
+   * @return ConditionalBranch for chaining with then value
+   */
+  def greaterThanOrEqual(value: Any): ConditionalBranch =
+    ConditionalBranch(s"$fieldName >= ${formatValue(value)}")
+
+  /**
+   * Creates a condition where field is less than or equal to a value.
+   *
+   * @param value the value to compare against
+   * @return ConditionalBranch for chaining with then value
+   */
+  def lessThanOrEqual(value: Any): ConditionalBranch =
+    ConditionalBranch(s"$fieldName <= ${formatValue(value)}")
+
+  /**
+   * Creates a condition where field is not equal to a value.
+   *
+   * @param value the value to compare against
+   * @return ConditionalBranch for chaining with then value
+   */
+  def notEqualTo(value: Any): ConditionalBranch =
+    ConditionalBranch(s"$fieldName != ${formatValue(value)}")
+
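+  private def formatValue(value: Any): String = value match {
+    case s: String => s"'$s'" // quoted as a SQL string literal; embedded single quotes are not escaped
+    case v => v.toString // numbers, booleans, etc. are emitted via toString
+  }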
+}
+
+/**
+ * Represents a conditional branch (the condition part of WHEN ... THEN).
+ */
+case class ConditionalBranch(condition: String) {
+
+  /**
+   * Completes the conditional with a constant value.
+   * Usage: when("field").greaterThan(100) -> 50
+   *
+   * @param thenValue the value to return when condition is true
+   * @return ConditionalCase for use in conditionalValue()
+   */
+  def ->(thenValue: Any): ConditionalCase = {
+    val formattedValue = thenValue match {
+      case s: String => s"'$s'"
+      case v => v.toString
+    }
+    ConditionalCase(condition, formattedValue)
+  }
+
+  /**
+   * Alternative syntax: `value ->: condition` (right-associative because the method name ends in `:`).
+   * Completes the conditional with a constant value.
+   *
+   * @param thenValue the value to return when condition is true
+   * @return ConditionalCase for use in conditionalValue()
+   */
+  def ->:(thenValue: Any): ConditionalCase = ->(thenValue)
+}
+
+/**
+ * A complete conditional case (condition + value).
+ * Represents one WHEN ... THEN clause in a SQL CASE expression.
+ */
+case class ConditionalCase(condition: String, thenValue: String) {
+  /**
+   * Converts this case to SQL WHEN ... THEN syntax.
+   *
+   * @return SQL string for this case
+   */
+  def toSql: String = s"WHEN $condition THEN $thenValue"
+}
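
The Scaladoc example in this file calls `when(...)` and `conditionalValue(...)(elseValue = 0)`, neither of which is defined in this diff. The sketch below is a guess at how the resulting `ConditionalCase` values could be joined into one SQL `CASE` expression. It uses reduced copies of the three case classes so that it compiles on its own; `when` and `caseExpression` are hypothetical helpers, not the project's API.

```scala
object ConditionalSketch {

  // Same quoting rule as the diff: strings are single-quoted, other values use toString.
  def formatValue(value: Any): String = value match {
    case s: String => s"'$s'"
    case v         => v.toString
  }

  final case class ConditionalCase(condition: String, thenValue: String) {
    def toSql: String = s"WHEN $condition THEN $thenValue"
  }

  final case class ConditionalBranch(condition: String) {
    def ->(thenValue: Any): ConditionalCase =
      ConditionalCase(condition, formatValue(thenValue))

    // Right-associative: `10 ->: branch` is the call `branch.->:(10)`.
    def ->:(thenValue: Any): ConditionalCase = this.->(thenValue)
  }

  // Reduced copy: only two of the eight comparisons from the diff.
  final case class ConditionalBuilder(fieldName: String) {
    def greaterThan(value: Any): ConditionalBranch =
      ConditionalBranch(s"$fieldName > ${formatValue(value)}")
    def equalTo(value: Any): ConditionalBranch =
      ConditionalBranch(s"$fieldName = ${formatValue(value)}")
  }

  // Hypothetical entry point (assumption): its real definition is not part of this diff.
  def when(fieldName: String): ConditionalBuilder = ConditionalBuilder(fieldName)

  // Hypothetical helper (assumption): joins the WHEN ... THEN clauses in order,
  // then appends the ELSE value.
  def caseExpression(cases: Seq[ConditionalCase], elseValue: Any): String =
    cases.map(_.toSql).mkString("CASE ", " ", s" ELSE ${formatValue(elseValue)} END")

  def main(args: Array[String]): Unit = {
    val sql = caseExpression(
      Seq(
        when("total").greaterThan(1000) -> 100,
        when("total").greaterThan(500) -> 50,
        10 ->: when("total").greaterThan(100)
      ),
      elseValue = 0
    )
    println(sql)
  }
}
```

Clause order matters, because SQL evaluates `WHEN` branches top to bottom and returns the first match. The helper also does not guard against an empty `cases` sequence, which would yield an invalid `CASE` expression.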