| 1 | +# CLAUDE.md |
| 2 | + |
| 3 | +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. |
| 4 | + |
| 5 | +## Project Overview |
| 6 | + |
| 7 | +Data Caterer is a test data management tool built with Scala and Apache Spark that provides automated data generation, validation, and cleanup capabilities. It supports multiple data sources including databases, files, messaging systems, and HTTP APIs. |
| 8 | + |
| 9 | +## Build System & Common Commands |
| 10 | + |
| 11 | +The project uses Gradle with Kotlin DSL and follows a multi-module structure: |
| 12 | +- **Root module**: Configuration and project orchestration |
| 13 | +- **api**: Builder patterns and models for programmatic usage |
| 14 | +- **app**: Core execution engine, Spark integration, and UI server |
| 15 | +- **example**: Sample implementations and Docker configurations |
| 16 | + |
| 17 | +### Essential Commands |
| 18 | + |
| 19 | +```bash |
| 20 | +# Build the entire project |
| 21 | +./gradlew build |
| 22 | + |
| 23 | +# Build individual modules |
| 24 | +./gradlew :app:build |
| 25 | +./gradlew :api:build |
| 26 | + |
| 27 | +# Run tests (use exact class names, NOT wildcards) |
| 28 | +./gradlew :app:test --tests "io.github.datacatering.datacaterer.core.ui.plan.PlanRepositoryTest" --info |
| 29 | +./gradlew :api:test |
| 30 | + |
| 31 | +# Generate test coverage with Scoverage |
| 32 | +./gradlew reportScoverage |
| 33 | + |
| 34 | +# Create fat/shadow JAR for distribution |
| 35 | +./gradlew :app:shadowJar |
| 36 | + |
| 37 | +# Run the application from the command line, passing arguments (here: DataCatererUI) |
| 38 | +./gradlew :app:run --args="DataCatererUI" |
| 39 | +``` |
| 40 | + |
| 41 | +### Important Test Running Notes |
| 42 | + |
| 43 | +ScalaTest with JUnit Platform has limitations with Gradle's `--tests` filtering: |
| 44 | +- ✅ Use exact class names: `--tests "io.github.datacatering.datacaterer.core.ui.plan.PlanRepositoryTest"` |
| 45 | +- ❌ Do NOT use wildcards: `--tests "*PlanRunTest*"` (runs ALL tests instead of filtering) |
| 46 | + |
| 47 | +## Architecture Overview |
| 48 | + |
| 49 | +### Core Domain Concepts |
| 50 | + |
| 51 | +- **Plans**: High-level configuration defining data operations to perform |
| 52 | +- **Tasks**: Individual data sources (databases, files, messaging systems, HTTP) |
| 53 | +- **Steps**: Sub-operations within tasks (tables, topics, file paths) |
| 54 | +- **Fields**: Individual data field configurations with generation rules |
| 55 | +- **Validations**: Data quality checks and assertions |
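
The relationship between these concepts can be pictured as a nested structure. The following is a minimal sketch, assuming plans contain tasks, tasks contain steps, and steps contain fields; every class and field name here is hypothetical and simplified, not the repository's actual model.

```scala
// Hypothetical, simplified model of the hierarchy described above.
// These are NOT the project's real classes; names and fields are illustrative.
object DomainSketch {
  final case class Field(name: String, options: Map[String, String] = Map())
  final case class Step(name: String, fields: List[Field] = List())
  final case class Task(name: String, dataSourceType: String, steps: List[Step] = List())
  final case class Validation(description: String)
  final case class Plan(
    name: String,
    tasks: List[Task] = List(),
    validations: List[Validation] = List()
  )

  // One Postgres task with a single "accounts" step (a table) and one field.
  val examplePlan: Plan = Plan(
    name = "customer_plan",
    tasks = List(
      Task("customer_postgres", "postgres", List(
        Step("accounts", List(Field("account_id", Map("regex" -> "ACC[0-9]{10}"))))
      ))
    ),
    validations = List(Validation("account_id is unique"))
  )

  // Walk the hierarchy: every field name, qualified by its task and step.
  def qualifiedFieldNames(plan: Plan): List[String] =
    for {
      task  <- plan.tasks
      step  <- task.steps
      field <- step.fields
    } yield s"${task.name}.${step.name}.${field.name}"
}
```

Calling `DomainSketch.qualifiedFieldNames(DomainSketch.examplePlan)` walks plan, task, step, and field in that order, which mirrors how the concepts nest.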
| 56 | + |
| 57 | +### Module Structure |
| 58 | + |
| 59 | +``` |
| 60 | +api/ # Builder API and models |
| 61 | +├── model/ # Core data models and types |
| 62 | +├── connection/ # Data source connection builders |
| 63 | +└── validation/ # Validation builders |
| 64 | + |
| 65 | +app/ # Core application |
| 66 | +├── core/ |
| 67 | +│ ├── generator/ # Data generation engine |
| 68 | +│ ├── validator/ # Data validation engine |
| 69 | +│ ├── sink/ # Data output processors |
| 70 | +│ ├── metadata/ # Metadata discovery and integration |
| 71 | +│ ├── ui/ # Web UI server components |
| 72 | +│ └── util/ # Utilities and helpers |
| 73 | +└── main/resources/ # Configuration files and UI assets |
| 74 | +``` |
| 75 | + |
| 76 | +### Key Architectural Patterns |
| 77 | + |
| 78 | +**Builder Pattern**: All configuration uses immutable builders with method chaining |
| 79 | +```scala |
| 80 | +postgres("customer_postgres", "jdbc:postgresql://localhost:5432/customer") |
| 81 | + .table("accounts") |
| 82 | + .fields(field.name("account_id").regex("ACC[0-9]{10}").unique(true)) |
| 83 | +``` |
| 84 | + |
| 85 | +**Case Class Data Models**: Immutable data structures with Jackson JSON serialization |
| 86 | +```scala |
| 87 | +@JsonIgnoreProperties(ignoreUnknown = true) |
| 88 | +case class DataSource( |
| 89 | + name: String, |
| 90 | + `type`: String, |
| 91 | + options: Map[String, String] = Map(), |
| 92 | + enabled: Boolean = true |
| 93 | +) |
| 94 | +``` |
| 95 | + |
| 96 | +**Spark Integration**: Uses Apache Spark for distributed data processing and Spark SQL for data operations |
| 97 | + |
| 98 | +## Development Patterns |
| 99 | + |
| 100 | +### Code Style Requirements |
| 101 | + |
| 102 | +- Use `com.softwaremill.quicklens.ModifyPimp` for immutable updates in builders |
| 103 | +- Always provide parameterless constructors: `def this() = this(DefaultValue())` |
| 104 | +- Use `@JsonIgnoreProperties(ignoreUnknown = true)` for JSON serialization compatibility |
| 105 | +- Use `Option[T]` instead of `null` for optional values |
| 106 | +- Follow package structure under `io.github.datacatering.datacaterer` |
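
Two of these rules (the parameterless constructor and `Option[T]` over `null`) can be shown without any external dependency. This is a sketch only: `ConnectionConfig` and its fields are hypothetical and do not come from the repository.

```scala
object StyleSketch {
  // Hypothetical model; ConnectionConfig is not a class from the repository.
  final case class ConnectionConfig(
    url: String = "jdbc:postgresql://localhost:5432/customer",
    user: Option[String] = None // Option instead of null for an optional value
  ) {
    // Parameterless constructor, as the guideline above asks for.
    def this() = this("jdbc:postgresql://localhost:5432/customer", None)
  }

  // Handle the optional value explicitly instead of null-checking.
  def describe(config: ConnectionConfig): String =
    config.user
      .map(u => s"${config.url} as $u")
      .getOrElse(s"${config.url} (no user)")
}
```

With this shape, both `new StyleSketch.ConnectionConfig()` and `StyleSketch.ConnectionConfig(user = Some("admin"))` are valid ways to create an instance.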
| 107 | + |
| 108 | +### Builder Implementation Pattern |
| 109 | + |
| 110 | +```scala |
| 111 | +case class TaskBuilder(task: Task = Task()) { |
| 112 | + def this() = this(Task()) |
| 113 | + |
| 114 | + def name(name: String): TaskBuilder = |
| 115 | + this.modify(_.task.name).setTo(name) |
| 116 | + |
| 117 | + def option(option: (String, String)): TaskBuilder = |
| 118 | + this.modify(_.task.options)(_ ++ Map(option)) |
| 119 | +} |
| 120 | +``` |
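
For readers unfamiliar with Quicklens, the two `modify` calls above have the same effect as nested `copy` calls. The sketch below deliberately swaps Quicklens for plain `copy` so it runs with the standard library alone; the `Task` class is a simplified stand-in with assumed fields, not the real model.

```scala
object BuilderSketch {
  // Simplified stand-in for the real Task model (assumed fields only).
  final case class Task(
    name: String = "default_task",
    options: Map[String, String] = Map()
  )

  final case class TaskBuilder(task: Task = Task()) {
    def this() = this(Task())

    // Same effect as this.modify(_.task.name).setTo(name)
    def name(name: String): TaskBuilder =
      copy(task = task.copy(name = name))

    // Same effect as this.modify(_.task.options)(_ ++ Map(option))
    def option(option: (String, String)): TaskBuilder =
      copy(task = task.copy(options = task.options + option))
  }
}
```

Each call returns a new builder and leaves the original untouched, so `new BuilderSketch.TaskBuilder().name("accounts").option("url" -> "jdbc:postgresql://localhost:5432/customer")` can be chained safely and intermediate builders can be reused.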
| 121 | + |
| 122 | +### Environment Configuration |
| 123 | + |
| 124 | +Runtime behavior is controlled via environment variables: |
| 125 | +- `ENABLE_GENERATE_DATA`: Enable/disable data generation |
| 126 | +- `ENABLE_DELETE_GENERATED_RECORDS`: Enable cleanup mode |
| 127 | +- `PLAN_FILE_PATH`: Path to YAML plan configuration |
| 128 | +- `TASK_FOLDER_PATH`: Directory containing task definitions |
| 129 | +- `APPLICATION_CONFIG_PATH`: Custom application configuration |
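
As an illustration of how such variables might be consumed, the sketch below reads them into a small case class. It is not the project's actual configuration loader: `RuntimeFlags`, `fromEnv`, the boolean parsing rule, and the default values are all assumptions made for the example.

```scala
object EnvConfigSketch {
  // Hypothetical holder for the variables listed above. The default values
  // are placeholders for this sketch, not the project's real defaults.
  final case class RuntimeFlags(
    enableGenerateData: Boolean,
    enableDeleteGeneratedRecords: Boolean,
    planFilePath: Option[String],
    taskFolderPath: Option[String],
    applicationConfigPath: Option[String]
  )

  // Treat "true" (any case, surrounding whitespace ignored) as enabled.
  private def flag(env: Map[String, String], key: String, default: Boolean): Boolean =
    env.get(key).map(_.trim.toLowerCase == "true").getOrElse(default)

  // The environment is a parameter so the function can be exercised
  // without touching the real process environment.
  def fromEnv(env: Map[String, String] = sys.env): RuntimeFlags =
    RuntimeFlags(
      enableGenerateData = flag(env, "ENABLE_GENERATE_DATA", default = true),
      enableDeleteGeneratedRecords = flag(env, "ENABLE_DELETE_GENERATED_RECORDS", default = false),
      planFilePath = env.get("PLAN_FILE_PATH"),
      taskFolderPath = env.get("TASK_FOLDER_PATH"),
      applicationConfigPath = env.get("APPLICATION_CONFIG_PATH")
    )
}
```

Passing an explicit map, for example `EnvConfigSketch.fromEnv(Map("ENABLE_GENERATE_DATA" -> "false"))`, makes it easy to check how a given combination of variables would be interpreted.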
| 130 | + |
| 131 | +### Data Source Support |
| 132 | + |
| 133 | +The system supports: |
| 134 | +- **Databases**: Postgres, MySQL, Cassandra, BigQuery |
| 135 | +- **Files**: CSV, JSON, Parquet, Delta Lake, Iceberg, ORC |
| 136 | +- **Messaging**: Kafka, RabbitMQ, Solace |
| 137 | +- **HTTP**: REST APIs with OpenAPI/Swagger integration |
| 138 | +- **Metadata Sources**: Great Expectations, JSON Schema, Data Contract CLI, OpenMetadata, Marquez |
| 139 | + |
| 140 | +## UI and API Integration |
| 141 | + |
| 142 | +The application includes a web UI server that provides: |
| 143 | +- Connection management and testing |
| 144 | +- Interactive plan creation |
| 145 | +- Execution history tracking |
| 146 | +- Real-time results viewing |
| 147 | + |
| 148 | +The UI is implemented as a separate module, with a React frontend and a Scala backend built on HTTP4S. |
| 149 | + |
| 150 | +## Testing Strategy |
| 151 | + |
| 152 | +- Use ScalaTest for unit testing |
| 153 | +- Test both API builders and core application logic |
| 154 | +- Mock external dependencies (databases, file systems) |
| 155 | +- Use exact class names for test filtering, not wildcards |
| 156 | +- Leverage the example module for integration testing |
| 157 | + |
| 158 | +## Key Dependencies |
| 159 | + |
| 160 | +- **Scala**: 2.12.x |
| 161 | +- **Apache Spark**: 3.5.x |
| 162 | +- **Jackson**: JSON serialization |
| 163 | +- **Quicklens**: Immutable data updates |
| 164 | +- **ScalaTest**: Testing framework |
| 165 | +- **HTTP4S**: Web server framework |
| 166 | +- **Logback/Log4j**: Logging |