feat(generator): add Avro schema generation with first-class DSL support #120

Description

@LMLiam

Scope: generation

Executive Summary

Add first-class Avro schema generation to Microsmith, including an intuitive Kotlin DSL, deterministic .avsc emission, Avro-specific validation, and fixture/documentation coverage that makes Avro a supported schema-generation target alongside the existing schema formats.

Problem Statement

Microsmith already supports schema-driven generation workflows, but it does not currently support Avro. That blocks users working with Kafka, schema-registry, analytics, data-platform, and event-contract ecosystems from using Microsmith as their primary schema authoring surface. Without native Avro support, they have to maintain Avro schemas manually or introduce a separate transformation step, which weakens Microsmith's value as a multi-format schema-generation tool.

Objectives

  • Add first-class Avro schema generation as a supported Microsmith target.
  • Provide an ergonomic, authoring-first Kotlin DSL for Avro under the existing Microsmith scripting model.
  • Keep the DSL intuitive and Kotlin-idiomatic rather than mirroring raw Avro JSON structure.
  • Emit valid, deterministic .avsc schema files suitable for downstream tooling and code review.
  • Validate Avro-specific constraints before writing invalid output.
  • Document the Avro authoring and generation contract end-to-end.

Non-Objectives

  • Avro RPC/protocol generation (.avpr) in the first release.
  • Avro IDL (.avdl) parsing or emission in the first release.
  • Language-specific source-code generation from Avro schemas in this issue.
  • Schema-registry publication, remote compatibility checks, or serializer integrations in this issue.
  • Designing the DSL as a thin Kotlin wrapper around Avro JSON.

Functional Requirements

  • Add a new Avro schema target under the Microsmith schema DSL.
  • Support the Avro named types required for a practical first release:
    • record
    • enum
    • fixed
  • Support the field/container/reference constructs required for a practical first release:
    • primitive types
    • named-type references
    • arrays
    • maps
    • unions
    • nullable fields via union modeling
  • Support practical Avro metadata and compatibility surfaces:
    • namespace
    • doc
    • aliases
    • default values
    • field order where applicable
    • logical types where supported by the chosen internal model
  • Emit .avsc files as the canonical output for the first implementation.
  • Default generated Avro output should land in a repository-level avro/ directory unless explicitly configured otherwise.
  • File naming must be deterministic and based on Avro named-type identity.
  • Explicit user-configured output paths must continue to override defaults.
  • Generation must preserve stable ordering for fields, enum symbols, union branches where policy permits, and file emission order.
  • The generator must reject or clearly diagnose unsupported or invalid Avro constructs instead of silently emitting broken schemas.
  • The generator must support cross-type references and namespace-qualified resolution within a generation run.
  • The generator must support multiple namespaces within a single Avro authoring surface.
  • The generator must support multi-file output for repositories that model multiple Avro named types.
  • The generator must define and document how shared named types are emitted and referenced across files.
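
The naming and ordering requirements above can be sketched as follows. This is a minimal illustration, not Microsmith's implementation: `NamedType`, `outputPath`, and `emissionOrder` are hypothetical names, and the flat `<full name>.avsc` layout is an assumption, since the issue leaves the exact file layout open.

```kotlin
// Hypothetical model: a named type is identified by namespace + name.
data class NamedType(val namespace: String, val name: String) {
    val fullName: String get() = if (namespace.isEmpty()) name else "$namespace.$name"
}

// Assumption: one file per named type, named after its Avro full name.
fun outputPath(type: NamedType, outputDir: String = "avro"): String =
    "$outputDir/${type.fullName}.avsc"

// Sorting by full name keeps emission order independent of declaration
// order and of any hash-based iteration order.
fun emissionOrder(types: Collection<NamedType>): List<String> =
    types.sortedBy { it.fullName }.map { outputPath(it) }

fun main() {
    val types = setOf(
        NamedType("com.example.orders", "LineItem"),
        NamedType("com.example.common", "Money"),
        NamedType("com.example.common", "Address"),
    )
    emissionOrder(types).forEach(::println)
}
```

Deriving the path only from the full name means two runs over the same DSL input cannot disagree on filenames.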

DSL Requirements

  • Add a dedicated Avro surface under schemas { avro { ... } }.
  • The DSL must be authoring-first and ergonomic, not a thin wrapper around Avro JSON.
  • The DSL should follow the same ergonomic standard as the existing Microsmith DSLs, preferring intuitive helpers over low-level constructor-style APIs.
  • The first release should optimize for readability and explicit intent over one-to-one fidelity with Avro JSON layout.

Namespace Authoring

  • Namespace blocks should use an idiomatic Kotlin DSL form as the primary syntax, for example:
    • "com.example.common" { ... }
    • "com.example.events" { ... }
  • A named namespace("...") { ... } helper may exist as a secondary/helper API if implementation convenience requires it, but the primary documented DSL should prefer the string-invoke form.
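
One way the string-invoke form could be supported is a member extension `operator fun String.invoke`. The sketch below assumes hypothetical `AvroScope` and `NamespaceScope` classes; only the call-site syntax comes from this issue.

```kotlin
// Hypothetical scope classes; names are illustrative only.
class NamespaceScope(val namespace: String) {
    val declared = mutableListOf<String>()
    fun record(name: String) { declared += "$namespace.$name" }
}

class AvroScope {
    val namespaces = mutableListOf<NamespaceScope>()

    // Primary form: "com.example.common" { ... }
    operator fun String.invoke(block: NamespaceScope.() -> Unit) {
        namespaces += NamespaceScope(this).apply(block)
    }

    // Secondary helper that delegates to the string-invoke form.
    fun namespace(name: String, block: NamespaceScope.() -> Unit) = name.invoke(block)
}

fun avro(block: AvroScope.() -> Unit): AvroScope = AvroScope().apply(block)

fun main() {
    val model = avro {
        "com.example.common" { record("Address") }
        namespace("com.example.orders") { record("LineItem") }
    }
    println(model.namespaces.flatMap { it.declared })
}
```

Because the operator is a member of the scope class, the string-invoke syntax is only available inside `avro { ... }` and does not leak into unrelated code.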

Named Types And Aliases

  • Named declarations should support variadic aliases directly in the declaration signature where practical, for example:
    • record("Address", "PostalAddress") { ... }
    • enum("CountryCode", "IsoCountryCode", "LegacyCountryCode") { ... }
    • fixed("Decimal128", "MoneyBytes") { ... }
  • Block-level aliases(...) helpers may still exist where they improve composability, especially for field aliases, but common named-type aliases should not require a separate nested call.
  • Named types should be declared top-level within avro { ... } or namespace blocks in the first release rather than allowing arbitrarily nested named-type declarations inside records.
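
The variadic-alias signature could look like the sketch below. `RecordDecl` and `RecordScope` are hypothetical, and the check that an alias must differ from the declared name is an assumed policy rather than something this issue specifies.

```kotlin
// Hypothetical model and builder; only the signature shape comes from the issue.
data class RecordDecl(val name: String, val aliases: List<String>, val fields: List<String>)

class RecordScope {
    val fields = mutableListOf<String>()
    fun string(name: String) { fields += name }
}

// Kotlin allows a vararg parameter before a trailing lambda, so aliases can
// follow the name directly: record("Address", "PostalAddress") { ... }
fun record(name: String, vararg aliases: String, block: RecordScope.() -> Unit): RecordDecl {
    require(name !in aliases) { "Alias duplicates the declared name: $name" }
    return RecordDecl(name, aliases.toList(), RecordScope().apply(block).fields)
}

fun main() {
    println(record("Address", "PostalAddress") { string("line1") })
}
```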

Fields, Types, And Containers

  • The DSL should support ergonomic, type-first field helpers where practical, for example:
    • string("name")
    • int("quantity") { default(1) }
    • string("line2") { nullable(); default(null) }
    • ref("country", "CountryCode")
    • array("tags", string)
    • map("attributes", string)
  • For Avro arrays and maps, the preferred first-release form is the concise field-first shape:
    • array("tags", string)
    • map("attributes", string)
  • Normal array/map declarations should not require extra nested container-type blocks unless a later design review finds a strong need for that complexity.
  • The DSL should prefer modifier-style nullability such as nullable() inside a field block rather than exploding the API into nullableString(...), nullableInt(...), and similar variants.
  • nullable() must be documented as syntactic sugar for a deterministic Avro union shape, not as a separate schema concept.
  • Where a dedicated field helper does not make sense, the DSL may still accept an explicit schema expression, but that should not be the primary ergonomic path.
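
To illustrate `nullable()` as sugar for a union, here is a minimal sketch. `FieldBuilder` is hypothetical, and the null-first branch order is an assumed policy: the issue still has to define the ordering, and null-first is the usual convention because Avro has historically required a union field's default to match the first branch.

```kotlin
// Hypothetical field builder. Assumed policy: nullable() wraps the type in a
// two-branch union with "null" first, so default(null) matches the first branch.
class FieldBuilder(val name: String, private val type: String) {
    private var isNullable = false
    private var hasDefault = false
    private var defaultValue: Any? = null

    fun nullable() { isNullable = true }
    fun default(value: Any?) { hasDefault = true; defaultValue = value }

    // Renders the Avro "type" attribute as JSON text.
    fun typeJson(): String = if (isNullable) "[\"null\", \"$type\"]" else "\"$type\""

    // A null default is only legal when the field is nullable.
    fun validate() {
        if (hasDefault && defaultValue == null) {
            require(isNullable) { "Field '$name': default(null) requires nullable()" }
        }
    }
}

fun string(name: String, block: FieldBuilder.() -> Unit = {}): FieldBuilder =
    FieldBuilder(name, "string").apply(block).also { it.validate() }

fun main() {
    println(string("line1").typeJson())
    println(string("line2") { nullable(); default(null) }.typeJson())
}
```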

References, Unions, And Defaults

  • The DSL must support both namespace-local references and fully-qualified cross-namespace references.
  • The DSL must support references to previously declared named types.
  • The design must explicitly define whether forward references within the same Avro block are supported; if they are not, they must fail with clear diagnostics rather than behaving ambiguously.
  • The DSL must define a deterministic reference-resolution contract. The preferred first-release rule is:
    • unqualified references resolve within the current namespace
    • fully-qualified references resolve globally
    • ambiguous or unresolved references fail validation
  • The DSL must make nullable fields and unions explicit enough to avoid ambiguous or invalid Avro generation.
  • The DSL must define and document the emitted branch ordering policy for unions, especially nullable unions.
  • The DSL must define and document how defaults are validated against the final emitted Avro schema shape, especially for union-backed fields.
  • The DSL must reject nested unions and duplicate effective union branches unless there is a strong, documented reason to support them.
  • The DSL must allow defaults to be authored in a typed, readable way where practical, rather than forcing end-users to hand-construct JSON literals.
  • The DSL should provide an ergonomic built-in empty marker for empty array/map defaults rather than requiring Kotlin implementation details such as emptyList<Any>() or emptyMap<String, String>() in normal authoring.
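
The preferred reference-resolution rule in the list above can be sketched as a single function. `resolve` is a hypothetical helper, and treating any reference that contains a dot as fully qualified is an assumption.

```kotlin
// Hypothetical resolver for the preferred first-release rule.
// Assumption: any reference containing a dot is treated as fully qualified.
fun resolve(reference: String, currentNamespace: String, declared: Set<String>): String {
    val fullName = if ('.' in reference) reference else "$currentNamespace.$reference"
    require(fullName in declared) {
        "Unresolved Avro reference '$reference' (looked up as '$fullName')"
    }
    return fullName
}

fun main() {
    val declared = setOf("com.example.common.Money", "com.example.orders.LineItem")
    println(resolve("LineItem", "com.example.orders", declared))
    println(resolve("com.example.common.Money", "com.example.orders", declared))
}
```

Under this rule an unqualified `Money` written inside `com.example.orders` fails validation instead of silently resolving to `com.example.common.Money`, so cross-namespace references stay explicit.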

Enum Ergonomics

  • Enum bodies should support concise, idiomatic value declaration styles, with both of the following considered valid first-class APIs:
    • value("GB")
    • +"GB"
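
Both enum styles can share one code path. The sketch below assumes a hypothetical `EnumScope` in which `+"GB"` is a `unaryPlus` operator delegating to `value(...)`; rejecting duplicate symbols at declaration time is one possible place for the uniqueness check.

```kotlin
// Hypothetical enum scope showing both declaration styles.
class EnumScope {
    private val symbols = LinkedHashSet<String>() // keeps declaration order

    fun value(symbol: String) {
        require(symbols.add(symbol)) { "Duplicate enum symbol: $symbol" }
    }

    // Enables +"GB" inside the enum block.
    operator fun String.unaryPlus() = value(this)

    fun build(): List<String> = symbols.toList()
}

fun enumSymbols(block: EnumScope.() -> Unit): List<String> =
    EnumScope().apply(block).build()

fun main() {
    println(enumSymbols {
        +"GB"
        +"US"
        value("FR")
    })
}
```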

Primitive And Logical Types

  • The DSL should prefer symbol-style built-in schema values for primitive and fixed logical shapes, for example:
    • string
    • boolean
    • int
    • long
    • bytes
    • uuid
    • date
    • timeMillis and/or timeMicros
    • timestampMillis and/or timestampMicros
  • Parameterized shapes may still use builder-style helpers where a singleton symbol would be insufficient, for example decimal(...).
  • The DSL should still provide a generic logical-type escape hatch for uncommon or future logical types, but that escape hatch should not be the primary ergonomic path.
  • The DSL should prefer symbol-style built-in values over *Type() factory calls for fixed primitive/logical shapes.
  • Reified generic sugar may be provided where it improves readability, for example logical<Uuid>("eventId"), but those generic markers should resolve to Microsmith-defined Avro schema markers rather than host-language platform types.
  • doc() should remain supported because Avro doc is emitted schema metadata, not just author-side commentary in the .microsmith.kts file.
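
Symbol-style built-ins could be plain top-level values over a small sealed model, as sketched below. The class layout (`AvroType`, `Primitive`, `Logical`) is an assumption; the base types shown for `uuid`, `date`, `timestamp-millis`, and `decimal` follow the Avro specification.

```kotlin
// Hypothetical schema markers; names mirror the issue's examples but the
// class layout is an assumption.
sealed interface AvroType
data class Primitive(val name: String) : AvroType
data class Logical(
    val logicalType: String,
    val base: AvroType,
    val params: Map<String, Int> = emptyMap(),
) : AvroType

val string = Primitive("string")
val int = Primitive("int")
val long = Primitive("long")
val bytes = Primitive("bytes")

// Fixed logical shapes are singletons over their underlying primitive.
val uuid = Logical("uuid", string)
val date = Logical("date", int)
val timestampMillis = Logical("timestamp-millis", long)

// Parameterized shapes use a builder-style helper instead of a singleton.
fun decimal(precision: Int, scale: Int = 0): Logical {
    require(precision > 0) { "decimal precision must be positive" }
    require(scale in 0..precision) { "decimal scale must be between 0 and precision" }
    return Logical("decimal", bytes, mapOf("precision" to precision, "scale" to scale))
}

// Generic escape hatch for uncommon or future logical types.
fun logicalType(name: String, base: AvroType): Logical = Logical(name, base)

fun main() {
    println(uuid)
    println(decimal(precision = 19, scale = 4))
    println(logicalType("local-timestamp-micros", long))
}
```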

Documentation Examples

  • The DSL contract must be documented with at least one non-trivial example.
  • README and fixtures must include at least one non-trivial Avro example covering the core first-release contract, including multiple namespaces and cross-namespace references.

Example DSL

microsmith {
    schemas {
        avro {
            "com.example.common" {
                enum("CountryCode", "IsoCountryCode") {
                    doc("ISO 3166-1 alpha-2 country code")
                    +"GB"
                    +"US"
                    +"DE"
                    value("FR")
                }

                fixed("Decimal128", "MoneyBytes") {
                    doc("128-bit fixed-width decimal backing store")
                    size(16)
                    decimal(precision = 19, scale = 4)
                }

                record("Address", "PostalAddress") {
                    doc("Reusable postal address")

                    string("line1")
                    string("line2") {
                        nullable()
                        default(null)
                        order(FieldOrder.IGNORE)
                    }
                    string("city")
                    string("region") {
                        nullable()
                        default(null)
                    }
                    string("postalCode") {
                        aliases("postcode")
                    }
                    ref("country", "CountryCode")
                }

                record("Money") {
                    doc("Currency amount expressed as fixed-point decimal")

                    string("currency")
                    ref("amount", "Decimal128")
                }
            }

            "com.example.identity" {
                record("RegisteredCustomer") {
                    doc("Known customer with a persistent account")

                    logical("customerId", uuid)
                    string("email")
                    string("loyaltyTier") {
                        nullable()
                        default(null)
                    }
                }

                record("GuestCustomer", "AnonymousCustomer") {
                    doc("Checkout identity for a guest user")

                    string("email")
                    boolean("marketingOptIn") {
                        default(false)
                    }
                }
            }

            "com.example.orders" {
                enum("OrderStatus") {
                    doc("Lifecycle state for an order")
                    +"PENDING"
                    +"CONFIRMED"
                    +"CANCELLED"
                    +"FULFILLED"
                }

                record("LineItem", "PurchaseLine", "OrderLine") {
                    doc("A single purchasable item on an order")

                    string("sku")
                    int("quantity") {
                        default(1)
                        order(FieldOrder.ASCENDING)
                    }
                    ref("unitPrice", "com.example.common.Money")
                    array("tags", string) {
                        default(empty)
                    }
                    map("attributes", string) {
                        default(empty)
                    }
                }

                record("OrderPlaced", "OrderCreated", "OrderSubmitted") {
                    doc("Canonical order-created event for downstream consumers")

                    logical("eventId", uuid)
                    string("orderId") {
                        aliases("externalOrderId")
                    }
                    ref("status", "OrderStatus") {
                        default("PENDING")
                    }
                    union(
                        "customer",
                        ref("com.example.identity.RegisteredCustomer"),
                        ref("com.example.identity.GuestCustomer"),
                    )
                    array("lineItems", ref("LineItem")) {
                        default(empty)
                    }
                    ref("billingAddress", "com.example.common.Address")
                    ref("shippingAddress", "com.example.common.Address") {
                        nullable()
                        default(null)
                    }
                    ref("total", "com.example.common.Money")
                    logical("requestedShipDate", date) {
                        nullable()
                        default(null)
                    }
                    logical("placedAt", timestampMillis) {
                        order(FieldOrder.DESCENDING)
                    }
                    logical("warehouseCutoffLocal", logicalType("local-timestamp-micros", long)) {
                        nullable()
                        default(null)
                    }
                    map("metadata", string) {
                        default(empty)
                        order(FieldOrder.IGNORE)
                    }
                }
            }
        }
    }
}

Generator Semantics And Output Contract

  • The implementation must choose and document the canonical Avro output shape for named types.
  • The preferred first-release contract is one .avsc file per top-level named type under avro/.
  • If shared named types are emitted separately, generated references must remain valid and unambiguous by namespace and name.
  • Output must be deterministic across repeated runs when the DSL input is unchanged.
  • Output must be review-friendly, including stable JSON field ordering and formatting.
  • Output directories and filenames must not depend on nondeterministic iteration order.
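
Stable formatting can be reached by writing attributes in a fixed order rather than iterating a map. The emitter below is a minimal sketch under that assumption: `Field`, `Record`, and `render` are hypothetical, the attribute order is illustrative, and field types are passed in as already-rendered JSON.

```kotlin
// Hypothetical minimal emitter. Assumption: attributes are written in a fixed
// order (type, name, namespace, fields) and fields keep declaration order.
data class Field(val name: String, val typeJson: String)
data class Record(val namespace: String, val name: String, val fields: List<Field>)

// Escapes backslashes and double quotes for use inside a JSON string.
fun quote(s: String): String =
    "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

fun render(record: Record): String = buildString {
    appendLine("{")
    appendLine("  \"type\": \"record\",")
    appendLine("  \"name\": ${quote(record.name)},")
    appendLine("  \"namespace\": ${quote(record.namespace)},")
    appendLine("  \"fields\": [")
    record.fields.forEachIndexed { i, f ->
        val comma = if (i < record.fields.lastIndex) "," else ""
        appendLine("    { \"name\": ${quote(f.name)}, \"type\": ${f.typeJson} }$comma")
    }
    appendLine("  ]")
    append("}")
}

fun main() {
    val money = Record(
        "com.example.common", "Money",
        listOf(Field("currency", "\"string\""), Field("amount", "\"Decimal128\"")),
    )
    println(render(money))
}
```

A golden-file test can then compare the rendered text byte for byte across repeated runs.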

Validation Requirements

  • Validate Avro name and namespace rules.
  • Validate enum symbol uniqueness.
  • Validate fixed size constraints.
  • Validate union legality according to supported policy.
  • Validate default values against declared field schema.
  • Validate duplicate type names within the same effective namespace.
  • Validate reference resolution for named types.
  • Validate logical-type compatibility with the underlying primitive type.
  • Validate namespace-local and fully-qualified cross-namespace references.
  • Emit actionable diagnostics that point back to the user-authored script contract.
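
A few of these checks can be sketched as pure functions that return diagnostics rather than throwing, which makes it easy to report several problems at once. The function names are hypothetical; the name pattern `[A-Za-z_][A-Za-z0-9_]*` is the one the Avro specification gives for names and enum symbols.

```kotlin
// Hypothetical validators returning diagnostics instead of throwing.
// The pattern follows the Avro specification's rule for names.
val AVRO_NAME = Regex("[A-Za-z_][A-Za-z0-9_]*")

fun validateName(name: String): List<String> =
    if (AVRO_NAME.matches(name)) emptyList() else listOf("Invalid Avro name: '$name'")

// A namespace is a dot-separated sequence of names; every segment is checked.
fun validateNamespace(namespace: String): List<String> =
    namespace.split('.').flatMap(::validateName)

// Enum symbols must be valid names and unique within the enum.
fun validateEnumSymbols(symbols: List<String>): List<String> {
    val seen = mutableSetOf<String>()
    return symbols.flatMap { symbol ->
        val duplicate =
            if (seen.add(symbol)) emptyList() else listOf("Duplicate enum symbol: '$symbol'")
        validateName(symbol) + duplicate
    }
}

fun main() {
    println(validateNamespace("com..example"))
    println(validateEnumSymbols(listOf("GB", "US", "GB")))
}
```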

Non-Functional Requirements

  • Emitted schemas must be deterministic and reproducible.
  • Generation must be fast enough for normal repository authoring loops.
  • The implementation must be maintainable and align with current Microsmith schema-module patterns.
  • Formatting of generated Avro JSON must be stable and human-reviewable.

Security Considerations

  • Output path handling must not permit writes outside the intended repository root.
  • Validation diagnostics must avoid leaking unrelated filesystem or environment state.
  • The generator must fail safely on invalid inputs rather than emitting misleading artifacts.
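
The output-path requirement can be enforced with a normalize-then-check guard, sketched below using `java.nio.file`. `resolveOutput` is a hypothetical helper; it does not follow symlinks, so a real implementation may need an additional check for those.

```kotlin
import java.nio.file.Path
import java.nio.file.Paths

// Hypothetical guard: resolves the configured output path against the
// repository root and rejects anything that escapes it after normalization.
fun resolveOutput(repoRoot: Path, configured: String): Path {
    val root = repoRoot.toAbsolutePath().normalize()
    val target = root.resolve(configured).normalize()
    // The message echoes only the user-supplied value, not absolute paths.
    require(target.startsWith(root)) {
        "Output path escapes the repository root: $configured"
    }
    return target
}

fun main() {
    val root = Paths.get("repo")
    println(resolveOutput(root, "avro/com.example.common.Money.avsc"))
    println(runCatching { resolveOutput(root, "../outside.avsc") }.isFailure)
}
```

An absolute `configured` value replaces the root during `resolve`, so it is rejected by the same `startsWith` check unless it already points inside the repository.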

Operational Readiness

  • Add README coverage for Avro support, file types, output contract, and DSL examples.
  • Add representative fixtures for Avro generation.
  • Add troubleshooting guidance for the most common Avro authoring mistakes.
  • Ensure users understand that first-release scope is .avsc generation, not .avdl or .avpr.

Backward Compatibility And Migration

  • This is a new capability and should not change existing generation behavior for other schema targets.
  • Any new default output directory for Avro must remain isolated to the Avro target.
  • Future Avro protocol or IDL support must be additive and not break .avsc generation contracts.

Observability And Metrics

  • Fixture pass rate for Avro generation scenarios.
  • Validation failure coverage for common invalid-schema cases.
  • Determinism checks across repeated generation runs.
  • Documentation issue rate for Avro onboarding after release.

Risks And Mitigations

  • Avro union/default rules are easy to get wrong: mitigate with strict validation and explicit DSL modeling.
  • Namespace/reference semantics may create subtle bugs: mitigate with dedicated cross-reference fixtures.
  • Output-shape churn could frustrate adopters: mitigate by defining the file contract early and documenting it clearly.
  • Scope creep into Avro protocol/IDL/codegen could delay delivery: mitigate by keeping the first release limited to .avsc generation.
  • DSL ergonomics can drift toward JSON-shaped boilerplate: mitigate by explicitly preferring intuitive helper-based authoring over low-level field constructor APIs.

Acceptance Criteria

  • Microsmith supports schemas { avro { ... } } as a documented schema-generation target.
  • Users can declare records, enums, fixed types, arrays, maps, unions, nullability, defaults, docs, aliases, namespaces, cross-namespace references, and logical types through an ergonomic DSL.
  • The Avro DSL supports intuitive helper-style field declarations where practical rather than forcing raw Avro-JSON-shaped authoring.
  • The Avro DSL supports string-invoke namespace blocks as the primary documented namespace syntax.
  • The Avro DSL supports variadic aliases on named declarations.
  • The Avro DSL supports modifier-style nullability such as nullable() inside field blocks.
  • Enum values can be declared ergonomically via value("...") and +"...".
  • The Avro DSL prefers symbol-style built-in types over *Type() helpers for fixed primitive/logical shapes.
  • The Avro DSL uses ref(...) as the primary named-type reference form.
  • The Avro DSL uses array("name", type) and map("name", type) as the preferred first-release collection-field syntax.
  • The Avro DSL supports an empty marker for collection defaults instead of requiring raw Kotlin collection literals in normal authoring.
  • The Avro DSL exposes typed helpers for common logical types and a generic fallback for uncommon logical types.
  • The Avro DSL has a documented contract for reference resolution, union ordering, nullability, and default validation.
  • Microsmith emits valid, deterministic .avsc output for representative multi-type and multi-namespace fixtures.
  • Default Avro output lands in ./avro unless explicitly configured otherwise.
  • Invalid Avro constructs are rejected with actionable diagnostics.
  • README and fixtures include at least one non-trivial Avro example that demonstrates multiple namespaces, cross-namespace references, ergonomic field/type helpers, and the empty default marker for collection fields where relevant.
  • Automated tests cover successful generation, invalid-schema diagnostics, determinism, and reference resolution.

Test Strategy

  • Unit tests for Avro model validation and JSON emission.
  • Unit tests for typed logical-type helpers and the generic logical-type fallback.
  • Unit tests for DSL helper ergonomics and semantic normalization, including nullability, union handling, and empty-default normalization for arrays/maps.
  • Integration tests for representative Avro fixtures with cross-type references.
  • Integration tests for representative Avro fixtures with multiple namespaces and fully-qualified references.
  • Golden-file tests for deterministic .avsc output.
  • Negative-path tests for invalid defaults, invalid unions, duplicate names, bad namespaces, and unresolved references.
  • Parser-validation tests against a real Avro implementation to confirm emitted schemas are consumable.
  • Documentation-snippet verification for the published Avro DSL example.

Dependencies

Definition Of Done

  • Avro is implemented as a documented, validated, fixture-covered Microsmith generation target with a first-class ergonomic DSL and deterministic .avsc output contract.

Metadata

Assignees

No one assigned

Labels

  • area:docs: Documentation and adoption guides
  • area:generator: Generation pipeline and file output
  • area:quality: Testing and CI quality gates
  • area:scripting: Kotlin scripting host and script execution
  • enhancement: New feature or request
  • priority:p1: High-priority roadmap item
