Problem
Today /v1/api/sql returns result rows as JSON objects plus a columns list, and the “typed” mode wraps each value. This makes it hard for Kalam-link to:
- reliably understand the returned schema (types/nullability/order)
- decode efficiently (objects are heavier than arrays)
- cache decoding metadata safely across requests (schema drift vs query hash)
Relevant code:
- Request model: sql_request.rs
- Response model: sql_response.rs
- Handler: sql_handler.rs
- ScalarValue→JSON conversions: arrow_json_conversion.rs
Goals
- Make the REST SQL response self-describing (Arrow schema), so clients know exact output types/order.
- Make the payload efficient (rows as arrays, stable column order).
- Enable safe client+server caching:
  - Server caches the "prepared query template" (queryRef)
  - Client caches schema decoding (schemaRef → schema)
- Keep ScalarValue→JSON conversion logic centralized (no duplicate conversion paths).
- Return the schema with field data types expressed as KalamDataType.
Non-goals (for this issue)
- Changing WebSocket payload formats (can follow later; this issue targets REST /v1/api/sql).
- Introducing a full GraphQL layer or persisted-query registry UI.
Proposed API changes
New request fields (backward-compatible)
Extend the existing request model with optional fields:
- queryRef?: { version: number, sha256: string }
- sql?: string (already exists; allow either sql OR queryRef)
- params?: JsonValue[] (already exists)
- includeSchema?: boolean (default false when the client sends knownSchemaRefs, else true on the first query)
- knownSchemaRefs?: string[] (client tells the server which schema hashes it already has)
- resultFormat?: "array_rows" | "object_rows" (default object_rows for backward compatibility; Kalam-link uses array_rows)
Notes:
- Keep serialization_mode as-is (simple vs typed value encoding), but do not change row shape by mode.
- Multi-statement + params stays rejected (current behavior), unless expanded later.
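As a sketch of how the extended request model in sql_request.rs could look. Struct and field names mirror the proposal above but are assumptions, and serde derives are omitted so the sketch stands alone:

```rust
// Hypothetical extension of the existing request model in sql_request.rs.
// The real model would additionally derive serde's Serialize / Deserialize
// with #[serde(rename_all = "camelCase")]; plain derives are used here so
// the sketch compiles on its own.
#[derive(Debug, Clone, PartialEq)]
pub struct ContentRef {
    pub version: u32,
    pub sha256: String,
}

#[derive(Debug, Default)]
pub struct SqlRequestV2 {
    pub sql: Option<String>,            // existing; allow either sql OR query_ref
    pub query_ref: Option<ContentRef>,  // new: reuse a cached query template
    pub params: Vec<String>,            // existing (JsonValue[] in the API)
    pub include_schema: Option<bool>,   // new: force/suppress schema in the response
    pub known_schema_refs: Vec<String>, // new: schema hashes the client already has
    pub result_format: Option<String>,  // new: "array_rows" | "object_rows"
}
```

A queryRef-only follow-up call would then populate query_ref and params and leave sql as None.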
New response fields (for SELECT-like results)
For each QueryResult (or a new QueryResultV2):
- queryRef?: { version: number, sha256: string } (returned when the request used sql)
- schemaRef?: { version: number, sha256: string } (always returned for row results)
- schema?: { fields: [...], metadata?: ... } (returned only when includeSchema=true OR the client doesn't have the schemaRef)
- rows: array of rows
  - when resultFormat="array_rows": rows: JsonValue[][]
  - when resultFormat="object_rows": keep current rows: object[]
- Deprecate columns when schema is present (or keep it temporarily but treat it as redundant)
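A corresponding sketch of what QueryResultV2 in sql_response.rs could hold. The Rows enum and the use of String in place of the real JSON value type are assumptions:

```rust
// Hypothetical QueryResultV2 shape for sql_response.rs. String stands in for
// the real JSON value type, and serde derives are omitted so this compiles
// on its own.
#[derive(Debug)]
pub struct ContentRef {
    pub version: u32,
    pub sha256: String,
}

#[derive(Debug)]
pub struct SchemaField {
    pub name: String,
    pub data_type: String, // KalamDataType rendered as a string here
    pub nullable: bool,
}

#[derive(Debug)]
pub enum Rows {
    ArrayRows(Vec<Vec<String>>), // resultFormat="array_rows"
    ObjectRows(Vec<String>),     // resultFormat="object_rows" (current shape)
}

#[derive(Debug)]
pub struct QueryResultV2 {
    pub query_ref: Option<ContentRef>,    // returned when the request used sql
    pub schema_ref: ContentRef,           // always returned for row results
    pub schema: Option<Vec<SchemaField>>, // first call / cache miss / drift only
    pub rows: Rows,
    pub row_count: usize,
}
```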
Optional endpoint (recommended for robustness)
GET /v1/api/schema/{schemaSha256}
- Returns { schemaRef, schema }
- Lets clients recover a schema without re-running the query.
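On the client side, the schema cache this endpoint backs can be as simple as a map keyed by schemaRef.sha256. A minimal sketch (type names are assumptions; the String value stands in for a decoded schema):

```rust
use std::collections::HashMap;

// Hypothetical client-side schema cache: schemaRef.sha256 → decoded schema.
// On a miss, the client would call GET /v1/api/schema/{schemaSha256} instead
// of re-running the query with includeSchema=true.
struct SchemaCache {
    by_sha: HashMap<String, String>,
}

impl SchemaCache {
    fn new() -> Self {
        Self { by_sha: HashMap::new() }
    }
    // Values the client sends as knownSchemaRefs on subsequent requests.
    fn known_schema_refs(&self) -> Vec<String> {
        self.by_sha.keys().cloned().collect()
    }
    fn insert(&mut self, sha: &str, schema: &str) {
        self.by_sha.insert(sha.to_string(), schema.to_string());
    }
    fn get(&self, sha: &str) -> Option<&String> {
        self.by_sha.get(sha)
    }
}
```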
Hashing / caching semantics (critical)
schemaRef (correctness)
schemaRef.sha256 = sha256(canonical_arrow_schema_json)
- Canonicalization requirements:
- stable field order
- include name, data_type, nullable (and ideally metadata)
- Must be computed from the output schema of the query (post-projection), not just table schema.
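The canonicalization step can be sketched as a deterministic JSON rendering of the output fields; sha256 of this string (e.g. via the sha2 crate) would become schemaRef.sha256. Names here are assumptions:

```rust
// Hypothetical canonicalization for schemaRef hashing: render the query's
// OUTPUT fields, in output order, with only name / data_type / nullable,
// so the same logical schema always hashes to the same bytes.
struct FieldInfo {
    name: String,
    data_type: String, // Arrow DataType rendered as a stable string
    nullable: bool,
}

fn canonical_schema_json(fields: &[FieldInfo]) -> String {
    let parts: Vec<String> = fields
        .iter()
        .map(|f| {
            // {:?} on String yields a quoted, escaped JSON-compatible literal.
            format!(
                "{{\"name\":{:?},\"data_type\":{:?},\"nullable\":{}}}",
                f.name, f.data_type, f.nullable
            )
        })
        .collect();
    format!("{{\"fields\":[{}]}}", parts.join(","))
    // The real implementation would return sha256(canonical_json) here,
    // e.g. with the sha2 crate.
}
```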
queryRef (server perf)
Used to let the client send only {queryRef, params} on subsequent calls.
queryRef.sha256 should be derived from at least:
- normalized SQL template (no params embedded)
- relevant session settings that can affect planning/schema
- auth/tenant scope key (so we never reuse a plan across incompatible permission contexts)
A safe starting scope key could include: user role + namespace_id + any “current user” context.
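The material folded into queryRef.sha256 can be sketched as a scoped key string; the exact field set and separator are assumptions, but the point is that SQL text alone is not a safe cache key:

```rust
// Hypothetical derivation of the bytes hashed into queryRef.sha256.
// Everything that can change planning or permissions is part of the key,
// so a cached plan is never reused across incompatible contexts.
fn query_ref_material(
    normalized_sql: &str,   // template with $1, $2 placeholders; params NOT embedded
    session_settings: &str, // canonical rendering of planning-relevant settings
    scope_key: &str,        // e.g. role + namespace_id + "current user" context
) -> String {
    // The real implementation would feed this string into sha256
    // (e.g. the sha2 crate) rather than exposing it directly.
    format!("v1|{normalized_sql}|{session_settings}|{scope_key}")
}
```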
Also define server cache behavior:
- LRU + TTL eviction to avoid unbounded memory; use Moka
  - Why Moka: async-safe, supports TTL and size-based eviction, lock-free reads
- Cache entries invalidated on schemaRef change, or simply allow schemaRef mismatch to trigger schema resend.
Concrete examples (end-to-end)
Step 1: First call (client only knows SQL)
Request:
POST /v1/api/sql
{
"sql": "select * from chat.messages where conversation_id = $1",
"params": ["conv_123"],
"serialization_mode": "simple",
"resultFormat": "array_rows",
"includeSchema": true
}
Response:
200 OK
{
"status": "success",
"results": [
{
"queryRef": { "version": 1, "sha256": "q_7a1b...e21" },
"schemaRef": { "version": 1, "sha256": "s_55c9...91a" },
"schema": {
"fields": [
{"name":"msg_id","data_type":"Int64","nullable":false},
{"name":"conversation_id","data_type":"Utf8","nullable":false},
{"name":"from","data_type":"Utf8","nullable":false},
{"name":"timestamp","data_type":"Timestamp(Microsecond, None)","nullable":false},
{"name":"content","data_type":"Utf8","nullable":true},
{"name":"metadata","data_type":"Utf8","nullable":true}
]
},
"rows": [
["123456789", "conv_123", "alice", 1699000000000000, "Hello", null]
],
"row_count": 1
}
],
"took": 15.0
}
Step 2: Subsequent call (client sends queryRef + params only)
Request:
POST /v1/api/sql
{
"queryRef": { "version": 1, "sha256": "q_7a1b...e21" },
"params": ["conv_999"],
"serialization_mode": "simple",
"resultFormat": "array_rows",
"knownSchemaRefs": ["s_55c9...91a"]
}
Response (schema omitted, but schemaRef included):
200 OK
{
"status": "success",
"results": [
{
"schemaRef": { "version": 1, "sha256": "s_55c9...91a" },
"rows": [
["123456790", "conv_999", "bob", 1699000000001234, "Hi", null]
],
"row_count": 1
}
],
"took": 3.2
}
Step 3: Schema drift (ALTER TABLE / view change / SELECT * expansion)
Client sends same request as Step 2; server detects schemaRef changed and returns schema again:
200 OK
{
"status": "success",
"results": [
{
"schemaRef": { "version": 1, "sha256": "s_a912...44f" },
"schema": {
"fields": [
{"name":"msg_id","data_type":"Int64","nullable":false},
"...",
{"name":"edited_at","data_type":"Timestamp(Microsecond, None)","nullable":true}
]
},
"rows": [
["123456789", "conv_123", "alice", 1699000000000000, "Hello", null, null]
],
"row_count": 1
}
],
"took": 3.4
}
Important considerations / pitfalls
- Do not key server queryRef cache only by SQL text; it must be scoped to avoid cross-user leakage and schema mismatches.
- schemaRef should be based on output schema, not physical table schema.
- Current masking logic in the handler should be reviewed for the array row format (masking by column name is still possible, but the row shape changes).
- Keep ScalarValue→JSON conversion centralized (extend conversion utilities rather than adding one-off conversions).
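For the masking pitfall above, one option is to resolve masked column names to indices once per result via the output schema, then overwrite those positions in every row. A sketch (function and placeholder names are assumptions):

```rust
// Hypothetical masking under array rows: map masked column names to column
// indices using the output schema's field order, then blank those positions.
fn mask_array_rows(field_names: &[&str], masked: &[&str], rows: &mut Vec<Vec<String>>) {
    // Resolve names → indices once, not per row.
    let idxs: Vec<usize> = field_names
        .iter()
        .enumerate()
        .filter(|(_, n)| masked.contains(n))
        .map(|(i, _)| i)
        .collect();
    for row in rows.iter_mut() {
        for &i in &idxs {
            row[i] = "\"***\"".to_string(); // JSON-encoded mask placeholder
        }
    }
}
```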
Implementation sketch (high level)
- Add new response types/fields in sql_response.rs
- Support rows as either array-rows or object-rows (feature-flagged by resultFormat)
- Add schemaRef, optional schema, optional queryRef
- Extend request model in sql_request.rs
- Add schema serialization helper + sha256 hashing
- Canonical Arrow schema JSON + sha256
- Extend conversion in arrow_json_conversion.rs
- Add RecordBatch → array-rows conversion that still calls the single ScalarValue→JSON function
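The shape of that conversion can be sketched with Arrow stubbed out: the real version walks a RecordBatch column by column, but every cell still flows through the one shared ScalarValue→JSON function (here stubbed as a closure):

```rust
// Hypothetical array-rows conversion: transpose column-oriented data into
// row-oriented output while routing every cell through the single shared
// scalar-to-JSON converter, so no duplicate conversion path is introduced.
fn rows_from_columns(
    columns: &[Vec<i64>],                   // stand-in for RecordBatch columns
    scalar_to_json: impl Fn(i64) -> String, // stand-in for the shared converter
) -> Vec<Vec<String>> {
    let row_count = columns.first().map_or(0, |c| c.len());
    (0..row_count)
        .map(|r| columns.iter().map(|col| scalar_to_json(col[r])).collect())
        .collect()
}
```

Row order follows the batch and column order follows the schema, which is what gives array_rows its stable ordering guarantee.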
- Update handler logic in sql_handler.rs
- If the request has queryRef, resolve the cached SQL/plan; else compute and return queryRef
- Always attach schemaRef; include schema only if requested/needed
- Add tests
- schemaRef stable for same schema
- schema omitted when knownSchemaRefs contains it
- schema included when schema changes
Acceptance criteria
- REST SQL endpoint supports resultFormat="array_rows" with stable ordering matching Arrow schema fields.
- Response includes schemaRef for row-returning statements.
- Response includes schema only when needed (first call / cache miss / schema drift).
- Client can send queryRef + params without sending SQL on subsequent calls.
- No duplicate ScalarValue→JSON conversion logic introduced.
- Backward compatibility preserved (existing clients can keep using object rows).
resultFormat="array_rows"with stable ordering matching Arrow schema fields.schemaReffor row-returning statements.schemaonly when needed (first call / cache miss / schema drift).queryRef + paramswithout sending SQL on subsequent calls.