SQL API: Add schema-aware array-rows response + queryRef/schemaRef caching (Kalam-link friendly) #50

@jamals86

Description


Problem

Today /v1/api/sql returns result rows as JSON objects plus a columns list, and the “typed” mode wraps each value. This makes it hard for Kalam-link to:

  • reliably understand the returned schema (types/nullability/order)
  • decode efficiently (objects are heavier than arrays)
  • cache decoding metadata safely across requests (schema drift vs query hash)

Relevant code:

  • Request model: sql_request.rs
  • Response model: sql_response.rs
  • Handler: sql_handler.rs
  • ScalarValue→JSON conversions: arrow_json_conversion.rs

Goals

  1. Make the REST SQL response self-describing (Arrow schema), so clients know exact output types/order.
  2. Make the payload efficient (rows as arrays, stable column order).
  3. Enable safe client+server caching:
    • Server caches “prepared query template” (queryRef)
    • Client caches schema decoding (schemaRef → schema)
  4. Keep ScalarValue→JSON conversion logic centralized (no duplicate conversion paths).
  5. Return a schema whose field data types are expressed as KalamDataType.

Non-goals (for this issue)

  • Changing WebSocket payload formats (can follow later, but this issue targets REST /v1/api/sql).
  • Introducing a full GraphQL layer or persisted-query registry UI.

Proposed API changes

New request fields (backward-compatible)

Extend the existing request model with optional fields:

  • queryRef?: { version: number, sha256: string }
  • sql?: string (already exists; allow either sql OR queryRef)
  • params?: JsonValue[] (already exists)
  • includeSchema?: boolean (defaults to false when the client sends knownSchemaRefs, otherwise true on the first query)
  • knownSchemaRefs?: string[] (client tells server which schema hashes it already has)
  • resultFormat?: "array_rows" | "object_rows" (default object_rows for backward compat; Kalam-link uses array_rows)

Notes:

  • Keep serialization_mode as-is (simple vs typed value encoding), but do not change row shape by mode.
  • Multi-statement + params stays rejected (current behavior), unless expanded later.
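The extended request shape and its mutual-exclusion rule can be sketched as follows. This is a Python sketch for illustration only (the real model is a Rust serde struct in sql_request.rs); the `validate` helper name and the exactly-one-of rule for `sql`/`queryRef` are assumptions drawn from the field list above.

```python
from typing import List, Literal, TypedDict

class QueryRef(TypedDict):
    version: int
    sha256: str

class SqlRequest(TypedDict, total=False):
    # Existing fields
    sql: str
    params: List[object]
    serialization_mode: str
    # Proposed optional fields (all backward-compatible)
    queryRef: QueryRef
    resultFormat: Literal["array_rows", "object_rows"]
    includeSchema: bool
    knownSchemaRefs: List[str]

def validate(req: SqlRequest) -> None:
    # Assumed rule: a request must carry exactly one of sql / queryRef.
    if ("sql" in req) == ("queryRef" in req):
        raise ValueError("provide either sql or queryRef, not both or neither")
```

A request with only `queryRef` + `params` is what Step 2 of the examples below sends.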

New response fields (for SELECT-like results)

For each QueryResult (or a new QueryResultV2):

  • queryRef?: { version: number, sha256: string } (returned when request used sql)
  • schemaRef?: { version: number, sha256: string } (always returned for row results)
  • schema?: { fields: [...], metadata?: ... } (returned only when includeSchema=true OR client doesn’t have schemaRef)
  • rows: array of rows
    • when resultFormat="array_rows": rows: JsonValue[][]
    • when resultFormat="object_rows": keep current rows: object[]
  • Deprecate columns when schema is present (or keep temporarily but treat as redundant)
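The server-side decision of when to inline the full schema can be sketched as a small helper. This is an illustrative Python sketch (the real logic belongs in the Rust handler); the `attach_schema` name and signature are hypothetical.

```python
def attach_schema(result: dict, schema: dict, schema_sha: str,
                  known_schema_refs: list, include_schema: bool) -> dict:
    # Always attach schemaRef so the client can detect drift.
    result["schemaRef"] = {"version": 1, "sha256": schema_sha}
    # Inline the full schema only when the client asked for it
    # or does not already hold this hash.
    if include_schema or schema_sha not in known_schema_refs:
        result["schema"] = schema
    return result
```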

Optional endpoint (recommended for robustness)

  • GET /v1/api/schema/{schemaSha256}
    • Returns { schemaRef, schema }
    • Lets clients recover schema without re-running the query.

Hashing / caching semantics (critical)

schemaRef (correctness)

  • schemaRef.sha256 = sha256(canonical_arrow_schema_json)
  • Canonicalization requirements:
    • stable field order
    • include name, data_type, nullable (and ideally metadata)
  • Must be computed from the output schema of the query (post-projection), not just table schema.
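A minimal sketch of the canonicalization + hashing, in Python for brevity (the Rust implementation would use serde_json plus the sha2 crate). Note that field *order* stays significant because it defines the row layout, while key order within each field object is neutralized; the `schema_sha256` name is an assumption.

```python
import hashlib
import json

def schema_sha256(fields, metadata=None):
    # Keep field list order (it defines positional row layout); pick the
    # hashed keys explicitly so extra/unordered keys cannot change the hash.
    canonical = {
        "fields": [
            {"name": f["name"], "data_type": f["data_type"], "nullable": f["nullable"]}
            for f in fields
        ],
        "metadata": metadata or {},
    }
    # sort_keys + fixed separators make the JSON byte-for-byte stable.
    blob = json.dumps(canonical, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()
```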

queryRef (server perf)

Used to let the client send only {queryRef, params} on subsequent calls.

  • queryRef.sha256 should be derived from at least:
    • normalized SQL template (no params embedded)
    • relevant session settings that can affect planning/schema
    • auth/tenant scope key (so we never reuse a plan across incompatible permission contexts)

A safe starting scope key could include: user role + namespace_id + any “current user” context.
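The derivation above can be sketched as a single hash over the three inputs. This Python sketch only collapses whitespace as a stand-in for SQL normalization (a real implementation should normalize via the SQL parser's canonical form, since naive text munging can conflate distinct queries); the `query_ref_sha256` name is hypothetical.

```python
import hashlib

def query_ref_sha256(sql: str, session_settings: dict, scope_key: str) -> str:
    # Collapse whitespace so formatting differences don't fragment the cache.
    # (Placeholder: real normalization should come from the SQL parser.)
    normalized_sql = " ".join(sql.split())
    # Sort settings so map iteration order cannot change the hash.
    settings = ";".join(f"{k}={session_settings[k]}" for k in sorted(session_settings))
    # Unit-separator joins keep the three components unambiguous.
    payload = "\x1f".join([normalized_sql, settings, scope_key])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

The scope key is what prevents a plan cached for one tenant from being reused by another.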

Also define server cache behavior:

  • LRU + TTL eviction to keep memory bounded; use Moka
  • Why Moka: async-safe, supports both TTL and size-based eviction, lock-free reads
  • Cache entries invalidated on schemaRef change, or simply allow schemaRef mismatch to trigger schema resend.
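The intended eviction semantics can be illustrated with a toy cache (the production cache would be Moka in Rust; this Python class exists only to pin down the LRU + TTL behavior, and its name is hypothetical).

```python
import time
from collections import OrderedDict

class LruTtlCache:
    """Toy LRU + TTL cache illustrating the intended eviction semantics."""

    def __init__(self, max_size: int, ttl_seconds: float):
        self.max_size = max_size
        self.ttl = ttl_seconds
        self._entries = OrderedDict()  # key -> (value, inserted_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, inserted_at = entry
        if time.monotonic() - inserted_at > self.ttl:
            del self._entries[key]      # expired: drop and report a miss
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._entries[key] = (value, time.monotonic())
        self._entries.move_to_end(key)
        if len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict least recently used
```

A schemaRef mismatch does not need explicit invalidation here: the server simply resends the schema, as Step 3 below shows.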

Concrete examples (end-to-end)

Step 1: First call (client only knows SQL)

Request:

POST /v1/api/sql
{
  "sql": "select * from chat.messages where conversation_id = $1",
  "params": ["conv_123"],
  "serialization_mode": "simple",
  "resultFormat": "array_rows",
  "includeSchema": true
}

Response:

200 OK
{
  "status": "success",
  "results": [
    {
      "queryRef": { "version": 1, "sha256": "q_7a1b...e21" },
      "schemaRef": { "version": 1, "sha256": "s_55c9...91a" },
      "schema": {
        "fields": [
          {"name":"msg_id","data_type":"Int64","nullable":false},
          {"name":"conversation_id","data_type":"Utf8","nullable":false},
          {"name":"from","data_type":"Utf8","nullable":false},
          {"name":"timestamp","data_type":"Timestamp(Microsecond, None)","nullable":false},
          {"name":"content","data_type":"Utf8","nullable":true},
          {"name":"metadata","data_type":"Utf8","nullable":true}
        ]
      },
      "rows": [
        ["123456789", "conv_123", "alice", 1699000000000000, "Hello", null]
      ],
      "row_count": 1
    }
  ],
  "took": 15.0
}
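Decoding on the client side is then just a zip of the cached schema's field order with each positional row. A minimal Python sketch (the real client is Kalam-link; `decode_row` is a hypothetical helper name):

```python
def decode_row(fields, row):
    # Field order in the schema defines the positional layout of array_rows,
    # so pairing them up is all that is needed to rebuild named records.
    return {f["name"]: value for f, value in zip(fields, row)}

# Abbreviated schema and row from the Step 1 response above:
fields = [
    {"name": "msg_id", "data_type": "Int64", "nullable": False},
    {"name": "conversation_id", "data_type": "Utf8", "nullable": False},
    {"name": "from", "data_type": "Utf8", "nullable": False},
]
row = ["123456789", "conv_123", "alice"]
```

This is also why the hash must cover the post-projection output schema: a stale field order would silently mislabel every value.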

Step 2: Subsequent call (client sends queryRef + params only)

Request:

POST /v1/api/sql
{
  "queryRef": { "version": 1, "sha256": "q_7a1b...e21" },
  "params": ["conv_999"],
  "serialization_mode": "simple",
  "resultFormat": "array_rows",
  "knownSchemaRefs": ["s_55c9...91a"]
}

Response (schema omitted, but schemaRef included):

200 OK
{
  "status": "success",
  "results": [
    {
      "schemaRef": { "version": 1, "sha256": "s_55c9...91a" },
      "rows": [
        ["123456790", "conv_999", "bob", 1699000000001234, "Hi", null]
      ],
      "row_count": 1
    }
  ],
  "took": 3.2
}

Step 3: Schema drift (ALTER TABLE / view change / SELECT * expansion)

The client sends the same request as in Step 2; the server detects that the schemaRef has changed and returns the schema again:

200 OK
{
  "status": "success",
  "results": [
    {
      "schemaRef": { "version": 1, "sha256": "s_a912...44f" },
      "schema": {
        "fields": [
          {"name":"msg_id","data_type":"Int64","nullable":false},
          "...",
          {"name":"edited_at","data_type":"Timestamp(Microsecond, None)","nullable":true}
        ]
      },
      "rows": [
        ["123456789", "conv_123", "alice", 1699000000000000, "Hello", null, null]
      ],
      "row_count": 1
    }
  ],
  "took": 3.4
}
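On the client, drift handling reduces to a small cache-update step. A hedged Python sketch (the `handle_result` name and the error-on-miss behavior are assumptions; on a miss the client would recover via GET /v1/api/schema/{schemaSha256} or re-request with includeSchema=true):

```python
def handle_result(result: dict, schema_cache: dict) -> dict:
    """Return the schema for a result, updating the client-side cache.

    schema_cache maps schemaRef sha256 -> schema dict.
    """
    sha = result["schemaRef"]["sha256"]
    if "schema" in result:
        # First sighting or drift: the server inlined the schema, store it.
        schema_cache[sha] = result["schema"]
    schema = schema_cache.get(sha)
    if schema is None:
        # Miss without an inline schema: fall back to the schema endpoint.
        raise LookupError(f"unknown schemaRef {sha}")
    return schema
```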

Important considerations / pitfalls

  • Do not key server queryRef cache only by SQL text; it must be scoped to avoid cross-user leakage and schema mismatches.
  • schemaRef should be based on output schema, not physical table schema.
  • Current masking logic in the handler should be reviewed for array row format (masking by column name still possible, but row shape changes).
  • Keep ScalarValue→JSON conversion centralized (extend conversion utilities rather than adding one-off conversions).

Implementation sketch (high level)

  1. Add new response types/fields in sql_response.rs
    • Support rows as either array-rows or object-rows (feature-flagged by resultFormat)
    • Add schemaRef, optional schema, optional queryRef
  2. Extend request model in sql_request.rs
  3. Add schema serialization helper + sha256 hashing
    • Canonical Arrow schema JSON + sha256
  4. Extend conversion in arrow_json_conversion.rs
    • Add RecordBatch → array-rows conversion that still calls the single ScalarValue→JSON function
  5. Update handler logic in sql_handler.rs
    • If request has queryRef, resolve cached SQL/plan; else compute and return queryRef
    • Always attach schemaRef; include schema only if requested/needed
  6. Add tests
    • schemaRef stable for same schema
    • schema omitted when knownSchemaRefs contains it
    • schema included when schema changes

Acceptance criteria

  • REST SQL endpoint supports resultFormat="array_rows" with stable ordering matching Arrow schema fields.
  • Response includes schemaRef for row-returning statements.
  • Response includes schema only when needed (first call / cache miss / schema drift).
  • Client can send queryRef + params without sending SQL on subsequent calls.
  • No duplicate ScalarValue→JSON conversion logic introduced.
  • Backward compatibility preserved (existing clients can keep using object rows).
