Velaria Python Ecosystem

This document is the entrypoint for Velaria's supported Python ecosystem layer.

Python is a supported ingress, interop, and packaging surface. It is not the execution core. Core semantics still come from the native kernel and the runtime contract in docs/runtime-contract.md.

The Python layer must assume the core kernel is column-first.

Practical implications:

  • Python interop should preserve columnar data as deep as possible
  • rows are a compatibility/export surface, not the preferred internal execution form
  • Arrow import/export work should reduce copy and rematerialization rather than normalizing row-first behavior
  • performance-sensitive changes should prefer native kernel improvements and lower-copy boundaries over Python-side workarounds

Scope

Supported

The supported Python ecosystem includes:

  • the velaria/ package and Session API
  • Arrow ingestion and Arrow output
  • uv-based local workflow
  • native extension build
  • wheel / native wheel packaging
  • the supported CLI entrypoint velaria_cli.py
  • Excel ingestion via read_excel(...)
  • Bitable adapters and stream source integration
  • custom source / custom sink adapters
  • vector search and vector explain APIs
  • offline embedding pipeline helpers for versioned vector assets
  • offline keyword-index build helpers and reusable BM25 keyword-search assets
  • public-data finance helpers for A-share / U.S. stock history and quote ingestion
  • Velaria-owned Agent CLI/TUI backed by Codex App Server or Claude Agent SDK runtime adapters
  • default Agent CLI via velaria_cli.py, with headless agent --print and agent --stream-json modes

Examples

Examples and helper assets include:

  • examples/demo_batch_sql_arrow.py
  • examples/demo_stream_sql.py
  • examples/demo_bitable_group_by_owner.py
  • examples/demo_vector_search.py
  • benchmarks/bench_arrow_ingestion.py
  • examples/demo_embedding_pipeline.py
  • examples/finance_public_data_smoke.py
  • benchmarks/bench_embedding_pipeline.py
  • local ecosystem scripts and skills

Experimental

The Python experimental area is currently reserved under experimental/.

Anything placed there is explicitly outside the supported ecosystem surface until it is promoted into velaria/, velaria_cli.py, or a supported adapter module.

Not In Scope

Python does not define:

  • execution hot-path semantics
  • a separate progress schema
  • a separate checkpoint contract
  • a separate vector scoring implementation for supported APIs
  • Python UDFs in the hot path
  • a row-first fallback policy for the native kernel

API Surface

Main Session API:

  • Session.probe(...)
  • Session.read(...)
  • Session.read_csv(...)
  • Session.read_line_file(...)
  • Session.read_json(...)
  • Session.sql(...)
  • Session.create_dataframe_from_arrow(...)
  • Session.create_stream_from_arrow(...)
  • Session.create_temp_view(...)
  • Session.read_stream_csv_dir(...)
  • Session.stream_sql(...)
  • Session.explain_stream_sql(...)
  • Session.start_stream_sql(...)
  • Session.vector_search(...)
  • Session.explain_vector_search(...)
  • build_embedding_rows(...)
  • materialize_embeddings(...)
  • load_embedding_dataframe(...)
  • embed_query_text(...)
  • SentenceTransformerEmbeddingProvider(...)
  • build_keyword_index(...)
  • search_keyword_index(...)

Additional ecosystem helpers:

  • read_excel(...)
  • CustomArrowStreamSource
  • CustomArrowStreamSink
  • create_stream_from_custom_source(...)
  • consume_arrow_batches_with_custom_sink(...)
  • finance_pack.fetch_history(...)
  • finance_pack.fetch_quotes(...)
  • finance_pack.fetch_fundamentals(...)
  • finance_pack.build_research_prompt(...)

Mapping rule:

  • Python names may be ecosystem-friendly
  • behavior must map back to the same native kernel contract exposed by C++
  • Python wrappers should not force row materialization earlier than required by the user-facing boundary

Finance Agentic Pack

The finance pack is a Python ecosystem helper for agentic monitor workflows. It does not add financial semantics to the native kernel.

Install the optional public-data provider dependency:

uv sync --project python --extra finance

Start with the product readiness check and source guide:

uv run --project python --extra finance python python/velaria_cli.py finance doctor

uv run --project python --extra finance python python/velaria_cli.py finance sources

finance sources is generated from the provider registry used by the fetch commands. It is the authoritative runtime list for provider capabilities, supported markets, command support, freshness metadata, and recommended history/quote/news paths.

Run the one-command A-share analysis workflow. This fetches a public quote, stores it as a Velaria observation, runs a monitor, and prints a readable research report with source evidence:

uv run --project python --extra finance python python/velaria_cli.py finance analyze \
  --market cn \
  --symbol 000001

Use JSON output for agent automation:

uv run --project python --extra finance python python/velaria_cli.py finance analyze \
  --market cn \
  --symbol 000001 \
  --format json

Run the complete CLI chain: fetch historical OHLCV, persist a history artifact, subscribe to live quote ticks, run a monitor, and emit analysis plus service integration metadata:

uv run --project python --extra finance python python/velaria_cli.py finance pipeline \
  --market cn \
  --symbol 000001 \
  --start-date 20250101 \
  --end-date 20250131 \
  --iterations 1 \
  --interval-sec 0

Use JSON output for the complete chain:

uv run --project python --extra finance python python/velaria_cli.py finance pipeline \
  --market cn \
  --symbol 000001 \
  --start-date 20250101 \
  --end-date 20250131 \
  --iterations 1 \
  --interval-sec 0 \
  --format json

Run the same complete chain for a U.S. stock. The default path uses Yahoo for historical OHLCV and Tencent for quote ticks; Tencent U.S. quote rows are reported as freshness=delayed by the provider contract.

uv run --project python --extra finance python python/velaria_cli.py finance pipeline \
  --market us \
  --symbol AAPL \
  --start-date 20260501 \
  --end-date 20260518 \
  --iterations 1 \
  --interval-sec 0 \
  --format json

Rank a candidate pool with quote polling, historical momentum, public news RSS, and transparent sentiment evidence. This command emits research candidates, not trading advice:

uv run --project python --extra finance python python/velaria_cli.py finance rank-candidates \
  --market us \
  --symbols AAPL,MSFT,NVDA \
  --start-date 20260501 \
  --end-date 20260518 \
  --top 3 \
  --news-limit 5 \
  --iterations 1 \
  --format json

Use continuous JSONL mode for a running monitor-style loop:

uv run --project python --extra finance python python/velaria_cli.py finance rank-candidates \
  --market us \
  --symbols AAPL,MSFT,NVDA \
  --start-date 20260501 \
  --end-date 20260518 \
  --top 3 \
  --news-limit 5 \
  --iterations 0 \
  --interval-sec 30 \
  --jsonl

Use native stream mode when the ranking loop should push normalized candidate events through Velaria's native realtime stream source/sink APIs. Add --ingest-raw when quote, history, news, derived feature metrics, and candidate rows should all be persisted as Velaria external_event sources for later inspection. Native stream sink output is also persisted as a durable stream history source named finance_<market>_rank_candidates_native_stream_signals:

uv run --project python --extra finance python python/velaria_cli.py finance rank-candidates \
  --market us \
  --symbols AAPL,MSFT,NVDA \
  --start-date 20260501 \
  --end-date 20260518 \
  --top 3 \
  --news-limit 5 \
  --native-stream \
  --ingest-raw \
  --entry-score-threshold 8 \
  --entry-return-threshold 5 \
  --exit-score-threshold 0 \
  --exit-quote-pct-threshold -3 \
  --signal-policy-preset balanced \
  --iterations 0 \
  --interval-sec 300 \
  --format json

Native stream signal flags are computed by a signal policy before they enter the generic stream SQL predicate WHERE entry_signal >= 1 OR exit_signal >= 1. Use --signal-policy-preset balanced|momentum|defensive for built-in policies, or pass --signal-policy JSON to make the condition tree explicit:

uv run --project python --extra finance python python/velaria_cli.py finance rank-candidates \
  --market us \
  --symbols AAPL,MSFT,NVDA \
  --start-date 20260501 \
  --end-date 20260518 \
  --native-stream \
  --ingest-raw \
  --signal-policy '{"entry":{"all":[{"field":"momentum_state","op":"!=","value":"bearish"},{"field":"score","op":">=","value":0}]},"exit":{"any":[{"field":"news_sentiment_label","op":"=","value":"negative"},{"field":"quote_pct_change","op":"<=","value":-2}]}}' \
  --iterations 1 \
  --interval-sec 0 \
  --format json
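
The --signal-policy JSON above nests field comparisons under all / any groups. The following is a minimal sketch of how such a condition tree could be evaluated against one candidate row. It is not Velaria's evaluator: it assumes all means logical AND, any means logical OR, and that a missing field never satisfies a condition. The names evaluate_node and evaluate_policy and the sample row are hypothetical.

```python
import json
import operator

# "=", "!=", ">=" and "<=" appear in the example policy; the strict
# "<" and ">" operators are assumed here for completeness.
OPS = {
    "=": operator.eq,
    "!=": operator.ne,
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}


def evaluate_node(node, row):
    """Evaluate one node of the condition tree against a candidate row."""
    if "all" in node:  # assumed: every child condition must hold
        return all(evaluate_node(child, row) for child in node["all"])
    if "any" in node:  # assumed: at least one child condition must hold
        return any(evaluate_node(child, row) for child in node["any"])
    value = row.get(node["field"])
    if value is None:  # assumed: a missing field never matches
        return False
    return OPS[node["op"]](value, node["value"])


def evaluate_policy(policy, row):
    """Return integer flags so a predicate such as entry_signal >= 1 applies."""
    return {
        "entry_signal": int(evaluate_node(policy["entry"], row)),
        "exit_signal": int(evaluate_node(policy["exit"], row)),
    }


POLICY = json.loads(
    '{"entry":{"all":[{"field":"momentum_state","op":"!=","value":"bearish"},'
    '{"field":"score","op":">=","value":0}]},'
    '"exit":{"any":[{"field":"news_sentiment_label","op":"=","value":"negative"},'
    '{"field":"quote_pct_change","op":"<=","value":-2}]}}'
)

# Hypothetical candidate row using the field names from the policy.
row = {
    "momentum_state": "bullish",
    "score": 3.5,
    "news_sentiment_label": "neutral",
    "quote_pct_change": -0.4,
}
flags = evaluate_policy(POLICY, row)
print(flags)
```

Emitting 0/1 integers rather than booleans keeps the flags compatible with the numeric stream SQL predicate quoted above.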

Query stored stream output later with:

uv run --project python --extra finance python python/velaria_cli.py finance stream-history \
  --market us \
  --source-id finance_us_rank_candidates_native_stream_signals \
  --limit 50 \
  --format json

Use intelligence as the productized entrypoint when the goal is the full finance loop: public data ingestion, Velaria native realtime stream signals, durable event storage, agent-readable AI briefs, and replay from persisted rows. It reuses the same watch-session runtime instead of creating a separate finance engine:

uv run --project python --extra finance python python/velaria_cli.py finance intelligence start \
  --intelligence-id us_intel_20260519 \
  --session-id us_watch_20260519 \
  --market us \
  --symbols AAPL,MSFT,NVDA \
  --market-symbols SPY,QQQ,DIA \
  --start-date 20260501 \
  --end-date 20260518 \
  --top 3 \
  --news-limit 5 \
  --iterations 0 \
  --interval-sec 300 \
  --format json

uv run --project python --extra finance python python/velaria_cli.py finance intelligence review \
  --session-id us_watch_20260519 \
  --format json

uv run --project python --extra finance python python/velaria_cli.py finance intelligence replay \
  --session-id us_watch_20260519 \
  --format json

uv run --project python --extra finance python python/velaria_cli.py finance intelligence report \
  --session-id us_watch_20260519 \
  --format json

uv run --project python --extra finance python python/velaria_cli.py finance intelligence index \
  --session-id us_watch_20260519 \
  --format json

uv run --project python --extra finance python python/velaria_cli.py finance intelligence search \
  --session-id us_watch_20260519 \
  --query "NVDA momentum risk news fundamentals" \
  --top-k 5 \
  --format json

uv run --project python --extra finance python python/velaria_cli.py finance intelligence jobs \
  --session-id us_watch_20260519 \
  --format json

uv run --project python --extra finance python python/velaria_cli.py finance intelligence status \
  --session-id us_watch_20260519 \
  --format json

uv run --project python --extra finance python python/velaria_cli.py finance intelligence stop \
  --session-id us_watch_20260519 \
  --format json

uv run --project python --extra finance python python/velaria_cli.py finance intelligence resume \
  --session-id us_watch_20260519 \
  --format json

uv run --project python --extra finance python python/velaria_cli.py finance intelligence evaluate \
  --session-id us_watch_20260519 \
  --format json

uv run --project python --extra finance python python/velaria_cli.py finance intelligence eval-report \
  --session-id us_watch_20260519 \
  --format json

The intelligence subcommands read and write these durable sources:

  • intelligence start writes finance_intelligence_sessions and finance_intelligence_ai_notes, while all quote/history/news/feature/market/fundamental and native stream rows remain under the watch-session feed sources.
  • replay reads those persisted realtime rows back as historical evidence and writes finance_intelligence_replays.
  • search hybrid-searches the persisted watch-session evidence with BM25 keyword retrieval, structured finance signals, recency, and reciprocal rank fusion, then writes finance_intelligence_searches. By default search uses --index-mode auto, which reuses a persisted evidence index when its metadata fingerprint matches the current session rows, and rebuilds it when stale.
  • index prebuilds that reusable index under $VELARIA_HOME/finance/evidence_indexes/.
  • report writes finance_intelligence_reports with a final scorecard, supervisor checks, provider quality diagnostics, and a replayable research summary.
  • jobs, status, stop, and resume expose the durable job surface backed by finance_intelligence_jobs and finance_watch_session_runs; use them when an agent needs to inspect or control a long-running finance intelligence session without scraping logs.
  • evaluate and eval-report read only persisted rows and persist finance_intelligence_evaluations with signal, provider, retrieval, and runtime quality metrics.

Finance intelligence does not use hash embeddings in the product path; retrieval.semantic.status is disabled until a real production embedding provider is explicitly configured. The ai_plane.agent_prompt is designed for velaria_cli_run and does not fabricate model output.
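
The search description above names reciprocal rank fusion as the step that combines the keyword, structured-signal, and recency rankings. As a rough illustration of that technique only, the sketch below fuses three ranked id lists with the standard formula, where each hit contributes 1 / (k + rank). The constant k=60 is the value commonly used in the literature; whether Velaria uses it, and how it weights the individual rankings, is not stated here. The evidence ids are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked id lists; each hit contributes 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending), then by id for a stable tie-break.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical per-signal rankings of persisted evidence rows.
bm25_ranking = ["news-7", "quote-2", "filing-1"]
signal_ranking = ["quote-2", "news-7", "feature-4"]
recency_ranking = ["quote-2", "feature-4", "news-7"]

fused = reciprocal_rank_fusion([bm25_ranking, signal_ranking, recency_ranking])
for doc_id, score in fused:
    print(f"{doc_id}\t{score:.5f}")
```

Because the formula only looks at ranks, it can combine rankings whose raw scores are on unrelated scales, such as BM25 scores and timestamps.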

Use watch-session when you need the lower-level durable market watch and diagnostic surface. It combines candidate ranking, native stream signal generation, raw quote/history/news storage, market context snapshots, and fundamental provider snapshots under one durable session_id. If a public provider cannot supply a requested feed, the row is persisted as a structured unavailable event instead of being mocked:

uv run --project python --extra finance python python/velaria_cli.py finance watch-session start \
  --session-id us_watch_20260519 \
  --market us \
  --symbols AAPL,MSFT,NVDA \
  --market-symbols SPY,QQQ,DIA \
  --start-date 20260501 \
  --end-date 20260518 \
  --top 3 \
  --news-limit 5 \
  --entry-score-threshold 8 \
  --entry-return-threshold 5 \
  --exit-score-threshold 0 \
  --exit-quote-pct-threshold -3 \
  --signal-policy-preset balanced \
  --iterations 0 \
  --interval-sec 300 \
  --format json

Inspect the durable session and produce the closing review from the same data:

uv run --project python --extra finance python python/velaria_cli.py finance watch-session list --format json

uv run --project python --extra finance python python/velaria_cli.py finance watch-session events \
  --session-id us_watch_20260519 \
  --feed all \
  --limit 200 \
  --format json

uv run --project python --extra finance python python/velaria_cli.py finance watch-session summarize \
  --session-id us_watch_20260519 \
  --format json

For long-running market watches, run the same session asynchronously. The background process still uses Velaria native realtime stream SQL for signal selection and writes all feeds to the same AgenticStore; the foreground CLI returns a pid, log_path, and model-readable follow-up commands:

uv run --project python --extra finance python python/velaria_cli.py finance watch-session start \
  --session-id us_watch_20260519 \
  --market us \
  --symbols AAPL,MSFT,NVDA \
  --market-symbols SPY,QQQ,DIA \
  --start-date 20260501 \
  --end-date 20260518 \
  --top 3 \
  --news-limit 5 \
  --iterations 0 \
  --interval-sec 300 \
  --async-run \
  --format json

uv run --project python --extra finance python python/velaria_cli.py finance watch-session status \
  --session-id us_watch_20260519 \
  --format json

uv run --project python --extra finance python python/velaria_cli.py finance watch-session logs \
  --session-id us_watch_20260519 \
  --limit 20 \
  --format json

uv run --project python --extra finance python python/velaria_cli.py finance watch-session review \
  --session-id us_watch_20260519 \
  --log-limit 20 \
  --format json

uv run --project python --extra finance python python/velaria_cli.py finance watch-session supervise \
  --session-id us_watch_20260519 \
  --interval-sec 60 \
  --log-limit 20 \
  --format json

uv run --project python --extra finance python python/velaria_cli.py finance watch-session stop \
  --session-id us_watch_20260519 \
  --format json

review reads the async runtime row, process state, log tail, persisted feed counts, latest signals, and provider-unavailable evidence, then appends a structured row to finance_watch_session_reviews. supervise runs that same review loop continuously inside the CLI (--iterations 0) or for a bounded number of cycles. The review output includes next_actions and an agent_prompt that is designed to be passed back through velaria_cli_run for continuous observation and adjustment.

Use agentic stream monitor mode when the ranking loop should create Velaria execution_mode=stream monitors and emit FocusEvents from the persisted ranking event stream. --until-time runs inside the CLI until the RFC3339 deadline; no external driver script is required:

uv run --project python --extra finance python python/velaria_cli.py finance rank-candidates \
  --market us \
  --symbols AAPL,MSFT,NVDA \
  --start-date 20260501 \
  --end-date 20260518 \
  --top 3 \
  --news-limit 5 \
  --stream-monitor \
  --ingest-raw \
  --entry-score-threshold 8 \
  --entry-return-threshold 5 \
  --exit-score-threshold 0 \
  --exit-quote-pct-threshold -3 \
  --until-time 2026-05-18T16:00:00-04:00 \
  --interval-sec 300 \
  --format json
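
To make the --until-time deadline semantics concrete, here is a small sketch of parsing an RFC3339 timestamp with a UTC offset and planning polling ticks up to it. It is an illustration with the standard library, not the CLI's scheduler: parse_deadline and planned_ticks are hypothetical helpers, and treating a tick that lands exactly on the deadline as still in range is an assumption.

```python
from datetime import datetime, timedelta


def parse_deadline(value):
    """Parse an RFC3339 timestamp such as 2026-05-18T16:00:00-04:00."""
    # datetime.fromisoformat accepts a trailing "Z" only on Python 3.11+,
    # so normalize it to an explicit offset first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    deadline = datetime.fromisoformat(value)
    if deadline.tzinfo is None:
        raise ValueError("deadline must carry a UTC offset")
    return deadline


def planned_ticks(now, deadline, interval_sec):
    """Return tick times from now up to and including the deadline."""
    if interval_sec <= 0:
        raise ValueError("interval_sec must be positive")
    step = timedelta(seconds=interval_sec)
    ticks = []
    current = now
    while current <= deadline:
        ticks.append(current)
        current += step
    return ticks


deadline = parse_deadline("2026-05-18T16:00:00-04:00")
start = deadline - timedelta(minutes=15)
ticks = planned_ticks(start, deadline, interval_sec=300)
print(len(ticks), ticks[-1].isoformat())
```

Keeping the offset on the parsed value means comparisons against the current time stay correct regardless of the machine's local time zone.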

Fetch public news rows directly when you need to inspect the news provider and sentiment evidence. News/filing rows include source_category, source_type, source_score, and source_score_reason so ranking, replay, and agents can distinguish broad news aggregators from finance news and regulatory filings:

uv run --project python --extra finance python python/velaria_cli.py finance fetch-news \
  --provider google-news \
  --market us \
  --symbol AAPL \
  --limit 5

uv run --project python --extra finance python python/velaria_cli.py finance fetch-news \
  --provider yahoo-finance-news \
  --market us \
  --symbol AAPL \
  --limit 5

export VELARIA_SEC_USER_AGENT="VelariaFinance/1.0 ops@example.com"

uv run --project python --extra finance python python/velaria_cli.py finance fetch-news \
  --provider sec-filings \
  --market us \
  --symbol AAPL \
  --limit 5

Fetch public U.S. fundamentals evidence through SEC Company Facts. If a symbol, market, or upstream endpoint cannot provide the data, the command returns a structured unavailable row instead of mock values. For production SEC access, set a descriptive application/contact User-Agent first:

export VELARIA_SEC_USER_AGENT="VelariaFinance/1.0 ops@example.com"

uv run --project python --extra finance python python/velaria_cli.py finance fetch-fundamentals \
  --provider sec-companyfacts \
  --market us \
  --symbols AAPL,MSFT,NVDA

Fetch A-share historical data through Yahoo chart JSON or AkShare and write a Parquet dataset:

uv run --project python --extra finance python python/velaria_cli.py finance fetch-history \
  --provider yahoo \
  --market cn \
  --symbol 000001 \
  --start-date 20250101 \
  --end-date 20250131 \
  --output /tmp/velaria-cn-history.parquet

Fetch public quote rows and ingest them as a Velaria external_event source:

uv run --project python --extra finance python python/velaria_cli.py finance ingest-quotes \
  --provider tencent \
  --market cn \
  --symbols 000001,600519 \
  --source-id finance_cn_quotes

Watch one public quote symbol, append each observation to an external_event source, run a monitor, and return FocusEvent plus analysis context:

uv run --project python --extra finance python python/velaria_cli.py finance watch \
  --market cn \
  --symbol 000001 \
  --interval-sec 30 \
  --iterations 0 \
  --jsonl

The same finance commands are available to velaria_cli.py -i through the registered agent tool velaria_cli_run. In agent mode, pass only the Velaria subcommand, for example:

  • finance doctor
  • finance sources
  • finance analyze --market cn --symbol 000001 --format json
  • finance pipeline --market cn --symbol 000001 --start-date 20250101 --end-date 20250131 --iterations 1 --format json
  • finance watch-session start --market us --symbols AAPL,MSFT,NVDA --start-date 20260501 --end-date 20260518 --iterations 0 --async-run --format json
  • finance watch-session status --session-id us_watch_20260519 --format json
  • finance watch-session review --session-id us_watch_20260519 --format json
  • finance watch-session supervise --session-id us_watch_20260519 --interval-sec 60 --format json
  • finance intelligence index --session-id us_watch_20260519 --format json
  • finance intelligence search --session-id us_watch_20260519 --query "NVDA momentum risk news fundamentals" --format json
  • finance intelligence report --session-id us_watch_20260519 --format json
  • finance intelligence jobs --session-id us_watch_20260519 --format json
  • finance intelligence status --session-id us_watch_20260519 --format json
  • finance intelligence stop --session-id us_watch_20260519 --format json
  • finance intelligence resume --session-id us_watch_20260519 --format json
  • finance intelligence evaluate --session-id us_watch_20260519 --format json
  • finance intelligence eval-report --session-id us_watch_20260519 --format json
  • finance fetch-fundamentals --provider sec-companyfacts --market us --symbols AAPL,MSFT,NVDA
  • finance watch --market cn --symbol 000001 --interval-sec 30 --iterations 0 --jsonl

Do not include uv, python, or python/velaria_cli.py in the tool arguments.
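
When a shell command from this document needs to be reused as velaria_cli_run input, the launcher prefix has to be dropped. The helper below is a hypothetical convenience sketch using only the standard library; it is not part of Velaria, and whether the tool expects a single string or a token list is an assumption (a string is returned here).

```python
import shlex


def to_tool_arguments(command_line):
    """Strip everything up to and including velaria_cli.py from a shell command."""
    # Join backslash-continued lines first; shlex would otherwise keep the
    # escaped newline as a token of its own.
    flattened = command_line.replace("\\\n", " ")
    tokens = shlex.split(flattened)
    for index, token in enumerate(tokens):
        if token.endswith("velaria_cli.py"):
            tokens = tokens[index + 1:]
            break
    # shlex.join re-quotes arguments that contain spaces.
    return shlex.join(tokens)


shell_command = (
    "uv run --project python --extra finance python python/velaria_cli.py "
    "finance analyze \\\n"
    "  --market cn \\\n"
    "  --symbol 000001 \\\n"
    "  --format json"
)
tool_arguments = to_tool_arguments(shell_command)
print(tool_arguments)
```

A command that is already a bare subcommand passes through unchanged, so the helper can be applied unconditionally.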

finance pipeline and finance rank-candidates do not require a finance-specific service route. They write to the same Velaria AgenticStore used by the local service. When finance rank-candidates --stream-monitor is used, the command also creates entry/exit execution_mode=stream monitors and executes them for each candidate ranking tick. If velaria_service is started with the same VELARIA_HOME, the generic external-events, monitors, and focus-events service routes can inspect the source, monitor, ranking observations, and events created by the CLI.

Run the public-data smoke test against real AkShare endpoints:

uv run --project python --extra finance python python/examples/finance_public_data_smoke.py

When AkShare / Eastmoney is blocked by a local proxy or upstream network policy, verify public quote ingestion through Tencent's lightweight quote endpoint:

uv run --project python --extra finance python python/examples/finance_public_data_smoke.py --quotes-only

The standardized rows include provider evidence fields:

  • provider
  • source_url
  • fetched_at
  • freshness
  • delay_sec
  • license_note

U.S. quote freshness is reported from the provider path and may be delayed or unknown. The finance pack records that metadata instead of treating every quote as exchange-grade realtime data.
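
As an illustration of how a consumer might use the evidence fields listed above, the sketch below models them as a dataclass and refuses to treat delayed or unknown quotes as realtime. The field types, the timestamp format, the sample values, and the acceptable_for_alerting helper are all assumptions made for this example; only the field names and the delayed / unknown freshness values come from this document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProviderEvidence:
    provider: str
    source_url: str
    fetched_at: str             # timestamp string; exact format is assumed
    freshness: str              # "delayed" and "unknown" are named in this doc
    delay_sec: Optional[float]  # assumed to be absent when the delay is unknown
    license_note: str


def acceptable_for_alerting(evidence, max_delay_sec):
    """Accept a quote only when its delay is known and within the budget."""
    if evidence.freshness == "unknown" or evidence.delay_sec is None:
        return False
    return evidence.delay_sec <= max_delay_sec


# Hypothetical rows; the URL and numbers are placeholders, not provider data.
delayed_quote = ProviderEvidence(
    provider="tencent",
    source_url="https://example.com/quote/AAPL",
    fetched_at="2026-05-18T15:45:00-04:00",
    freshness="delayed",
    delay_sec=900.0,
    license_note="public endpoint, research use",
)
unknown_quote = ProviderEvidence(
    provider="yahoo",
    source_url="https://example.com/quote/MSFT",
    fetched_at="2026-05-18T15:45:00-04:00",
    freshness="unknown",
    delay_sec=None,
    license_note="public endpoint, research use",
)

print(acceptable_for_alerting(delayed_quote, max_delay_sec=60))
print(acceptable_for_alerting(delayed_quote, max_delay_sec=1800))
print(acceptable_for_alerting(unknown_quote, max_delay_sec=1800))
```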

For research workflows, use finance_pack.build_research_prompt(...) to turn FocusEvent objects and related datasets into a prompt for the interactive Velaria Agent. The prompt requires live source links and states that output is research assistance, not investment advice.

File reader mapping:

  • Session.probe(path) returns inferred source kind, schema, normalized options, the final selected format, scored candidates, confidence, and warnings
  • Session.read(path, ...) is the preferred batch file front door and probes the source automatically
  • Session.read_csv(...) remains the explicit CSV override
  • Session.read_line_file(path, mappings=[...], ...) maps to the native line/regex source connector
  • Session.read_json(path, columns=[...], ...) maps to the native JSON lines / JSON array source connector
  • all four file readers share the same source materialization knobs: materialization, materialization_dir, and materialization_format
  • all four file readers also support cache_in_memory=True, which retains the projected source snapshot inside the current session for repeated same-session queries; this is a reuse-oriented tradeoff rather than a free cache hint, because it can bypass source pushdown on the first query
  • velaria-cli file-sql defaults to --input-type auto and registers batch sources through CREATE TABLE ... OPTIONS(path: '...')
  • versioned embedding datasets written as Parquet / Arrow should be loaded through pyarrow plus Session.create_dataframe_from_arrow(...) or load_embedding_dataframe(...), not Session.read(...)
  • reusable keyword indexes are directory artifacts built from Arrow / Parquet or batch file inputs and should be queried through the service / helper APIs, not treated as plain table files

Regex line usage:

  • regex is an explicit line-reader mode; it is not auto-probed
  • mappings source indexes follow regex capture-group numbering:
    • 0 is the full match
    • 1..n are capture groups
  • unmatched lines are skipped

Regex example:

regex_df = session.read_line_file(
    "events.log",
    mappings=[("uid", 1), ("action", 2), ("latency", 3), ("ok", 4), ("note", 5)],
    mode="regex",
    regex_pattern=r'^uid=(\d+) action="([^"]+)" latency=(\d+) ok=(true|false) note=(.+)$',
)
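
The capture-group numbering used by mappings can be checked with Python's own re module. The sketch below is a pure-Python emulation of the rules listed above (index 0 is the full match, 1..n are capture groups, unmatched lines are skipped); it is not the native line/regex connector, and it leaves every value as a string because the connector's type handling is not described here. parse_lines and the sample lines are hypothetical.

```python
import re

PATTERN = re.compile(
    r'^uid=(\d+) action="([^"]+)" latency=(\d+) ok=(true|false) note=(.+)$'
)
MAPPINGS = [("uid", 1), ("action", 2), ("latency", 3), ("ok", 4), ("note", 5)]


def parse_lines(lines, pattern=PATTERN, mappings=MAPPINGS):
    """Map capture groups to named columns, skipping lines that do not match."""
    rows = []
    for line in lines:
        match = pattern.match(line.rstrip("\n"))
        if match is None:
            continue  # unmatched lines are skipped
        # group(0) is the full match, group(1..n) are the capture groups.
        rows.append({name: match.group(index) for name, index in mappings})
    return rows


sample = [
    'uid=7 action="open" latency=12 ok=true note=first visit',
    "this line does not match and is dropped",
    'uid=8 action="close" latency=9 ok=false note=timeout',
]
rows = parse_lines(sample)
for row in rows:
    print(row)
```

Running a pattern through a check like this before passing it to read_line_file is a cheap way to confirm that the group indexes line up with the intended columns.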

Minimal examples:

import velaria

session = velaria.Session()

csv_df = session.read_csv("input.csv")

probe = session.probe("events.jsonl")
auto_df = session.read("events.jsonl")

# probe includes:
# kind / final_format / score / confidence / schema / candidates / warnings

json_df = session.read_json(
    "events.jsonl",
    columns=["user_id", "action", "latency"],
    format="json_lines",
)

json_array_df = session.read_json(
    "events.json",
    columns=["event", "cost"],
    format="json_array",
)

nested_json_df = session.read_json(
    "events_nested.json",
    columns=["a", "b"],
    format="json_array",
)

# JSON reader notes:
# - top-level rows must still be JSON objects
# - nested object fields are preserved as raw JSON strings
# - numeric JSON arrays are still parsed as vector values when read directly as a field

JSON source examples:

events.jsonl (json_lines):

{"user_id":1,"action":"open","latency":12.5}
{"user_id":2,"action":"close","latency":9.0}

events.json (json_array):

[
  {"event":"open","cost":1.5},
  {"event":"close","cost":2}
]

events_nested.json (json_array with nested objects):

[
  {"a":1,"b":{"b1":1}},
  {"a":2,"b":{"b1":2,"b2":["x",3,null]}}
]

Nested-object result shape with columns=["a", "b"]:

[1, "{\"b1\":1}"]
[2, "{\"b1\":2,\"b2\":[\"x\",3,null]}"]

Current JSON limits:

  • json_lines and json_array are supported
  • each top-level row must be a JSON object
  • columns=[...] is required for explicit JSON reads
  • nested object values are returned as JSON text, not flattened columns
  • a top-level scalar array such as ["a", "b"] is not supported as a table source
  • arrays nested inside an object value are preserved inside that object's JSON text
  • direct field values that are numeric JSON arrays still map to vector values
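
The limits above can be mirrored in a few lines of standard-library Python, which may help when predicting the result shape before running a read. This is only an emulation under assumptions, not the native JSON connector: compact JSON serialization for nested objects matches the result shape shown earlier, but returning None for a missing column and passing non-numeric arrays through unchanged are guesses. project_row and read_json_rows are hypothetical names.

```python
import json
from numbers import Number


def project_row(obj, columns):
    """Project one top-level JSON object onto the requested columns."""
    if not isinstance(obj, dict):
        raise ValueError("each top-level row must be a JSON object")
    out = []
    for name in columns:
        value = obj.get(name)  # assumed: a missing column yields None
        if isinstance(value, dict):
            # Nested objects are kept as compact JSON text, not flattened.
            out.append(json.dumps(value, separators=(",", ":")))
        elif isinstance(value, list) and value and all(
            isinstance(v, Number) and not isinstance(v, bool) for v in value
        ):
            # Numeric arrays used directly as a field become vector values.
            out.append([float(v) for v in value])
        else:
            out.append(value)
    return out


def read_json_rows(text, columns, fmt):
    """Read json_lines or json_array text into projected rows."""
    if fmt == "json_lines":
        objects = [json.loads(line) for line in text.splitlines() if line.strip()]
    elif fmt == "json_array":
        objects = json.loads(text)
        if not isinstance(objects, list):
            raise ValueError("json_array input must be a top-level array")
    else:
        raise ValueError(f"unsupported format: {fmt}")
    return [project_row(obj, columns) for obj in objects]


nested = '[{"a":1,"b":{"b1":1}},{"a":2,"b":{"b1":2,"b2":["x",3,null]}}]'
nested_rows = read_json_rows(nested, ["a", "b"], "json_array")
vector_rows = read_json_rows('{"id":1,"vec":[1,2.5]}', ["id", "vec"], "json_lines")
print(nested_rows)
print(vector_rows)
```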

Embedding pipeline example:

from velaria import (
    DEFAULT_LOCAL_EMBEDDING_MODEL,
    DEFAULT_EMBEDDING_WARMUP_TEXT,
    HashEmbeddingProvider,
    Session,
    SentenceTransformerEmbeddingProvider,
    build_mixed_text_embedding_rows,
    download_embedding_model,
    materialize_mixed_text_embeddings,
    run_mixed_text_hybrid_search,
)

records = [
    {
        "doc_id": "doc-1",
        "title": "Alpha",
        "summary": "Payment page timeout",
        "tags": ["billing", "checkout"],
        "bucket": 1,
        "source_updated_at": 1,
    },
    {
        "doc_id": "doc-2",
        "title": "Beta",
        "summary": "Refund delay in worker queue",
        "tags": ["refund", "queue"],
        "bucket": 2,
        "source_updated_at": 2,
    },
]
provider = HashEmbeddingProvider(dimension=8)
session = Session()
materialize_mixed_text_embeddings(
    records,
    provider=provider,
    model="hash-demo",
    template_version="text-v1",
    text_fields=("title", "summary", "tags"),
    output_path="docs_embeddings.parquet",
)

result = run_mixed_text_hybrid_search(
    session,
    "docs_embeddings.parquet",
    provider=provider,
    model="hash-demo",
    query_text="payment page hangs during checkout",
    where_sql="bucket = 1 AND doc_id = 'doc-1'",
    top_k=2,
    metric="cosine",
)

For a local semantic baseline with all-MiniLM-L6-v2, install the optional provider dependency first:

uv sync --project python --extra embedding

Then swap the provider:

provider = SentenceTransformerEmbeddingProvider(
    model_name=DEFAULT_LOCAL_EMBEDDING_MODEL,
)

If you want to avoid remote Hub resolution on every machine, put the model files in a local directory and point the provider at that directory. Supported lookup order for the default MiniLM model is:

  1. VELARIA_EMBEDDING_MODEL_DIR
  2. python/models/all-MiniLM-L6-v2
  3. fallback to the Hugging Face model id

Example:

export VELARIA_EMBEDDING_MODEL_DIR=/absolute/path/to/all-MiniLM-L6-v2
export VELARIA_EMBEDDING_CACHE_DIR=/absolute/path/to/hf-cache

Then SentenceTransformerEmbeddingProvider(model_name=DEFAULT_LOCAL_EMBEDDING_MODEL) will load from the local directory instead of the Hub.
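
The lookup order described above can be sketched as a small resolver. This is an illustrative reimplementation, not the provider's own code: resolve_model_location is a hypothetical name, falling through when VELARIA_EMBEDDING_MODEL_DIR points at a missing directory is an assumption, and the Hub id shown is the public sentence-transformers id for this model, which may differ from the value of DEFAULT_LOCAL_EMBEDDING_MODEL.

```python
import os
import tempfile
from pathlib import Path

# Public Hub id for MiniLM; assumed to be the fallback used by the provider.
HUB_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"


def resolve_model_location(env=None, repo_root="."):
    """Return a local model directory if one is available, else the Hub id."""
    env = os.environ if env is None else env
    # 1. Explicit override through the environment.
    override = env.get("VELARIA_EMBEDDING_MODEL_DIR")
    if override and Path(override).is_dir():
        return override
    # 2. Model files checked out next to the Python package.
    bundled = Path(repo_root) / "python" / "models" / "all-MiniLM-L6-v2"
    if bundled.is_dir():
        return str(bundled)
    # 3. Fall back to remote resolution by model id.
    return HUB_MODEL_ID


with tempfile.TemporaryDirectory() as repo:
    # Nothing local yet: the Hub id is returned.
    print(resolve_model_location(env={}, repo_root=repo))
    bundled_dir = Path(repo) / "python" / "models" / "all-MiniLM-L6-v2"
    bundled_dir.mkdir(parents=True)
    # The bundled directory now wins over the Hub id.
    print(resolve_model_location(env={}, repo_root=repo) == str(bundled_dir))
```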

You can also explicitly pre-download and warm up the model before serving queries:

from velaria import (
    DEFAULT_LOCAL_EMBEDDING_MODEL,
    SentenceTransformerEmbeddingProvider,
    download_embedding_model,
)

local_dir = download_embedding_model(DEFAULT_LOCAL_EMBEDDING_MODEL)
provider = SentenceTransformerEmbeddingProvider(model_name=DEFAULT_LOCAL_EMBEDDING_MODEL)
provider.warmup(
    download_if_missing=False,
    warmup_text="warmup embedding text",
)

Recommended startup flow:

  1. download_embedding_model(...) during environment/bootstrap time
  2. provider.warmup(...) once during process start
  3. run batch embedding or online query embedding after the model is already resident

CLI examples:

uv run --project python python python/velaria_cli.py file-sql \
  --csv /tmp/input.csv \
  --input-type csv \
  --query "SELECT * FROM input_table LIMIT 5"

uv run --project python python python/velaria_cli.py file-sql \
  --input-path /tmp/input.jsonl \
  --input-type auto \
  --query "SELECT * FROM input_table LIMIT 5"

uv run --project python python python/velaria_cli.py file-sql \
  --input-path /tmp/events.log \
  --input-type line \
  --line-mode regex \
  --regex-pattern '^uid=(\d+) action="([^"]+)" latency=(\d+) ok=(true|false) note=(.+)$' \
  --mappings 'uid:1,action:2,latency:3,ok:4,note:5' \
  --query "SELECT * FROM input_table LIMIT 5"

Mixed-text embedding pipeline through the CLI:

uv run --project python python python/velaria_cli.py embedding-build \
  --input-path /tmp/docs.csv \
  --input-type csv \
  --text-columns title,summary,tags \
  --provider minilm \
  --output-path /tmp/docs_embeddings.parquet

uv run --project python python python/velaria_cli.py embedding-query \
  --dataset-path /tmp/docs_embeddings.parquet \
  --provider minilm \
  --query-text "payment page hangs during checkout" \
  --where-sql "bucket = 1 AND region = 'apac'" \
  --top-k 5

# direct query from the raw file without a prebuilt embedding dataset
uv run --project python python python/velaria_cli.py embedding-query \
  --input-path /tmp/docs.csv \
  --input-type csv \
  --text-columns title,summary,tags \
  --provider minilm \
  --query-text "payment page hangs during checkout" \
  --where-sql "bucket = 1 AND region = 'apac'" \
  --top-k 5

Reusable keyword-index build and BM25 keyword search through the service:

curl -sS http://127.0.0.1:37491/api/v1/runs/keyword-index-build \
  -H 'Content-Type: application/json' \
  -d '{
    "input_path": "/tmp/docs.csv",
    "input_type": "csv",
    "text_columns": ["title", "body"],
    "analyzer": "jieba"
  }'

curl -sS http://127.0.0.1:37491/api/v1/runs/keyword-search \
  -H 'Content-Type: application/json' \
  -d '{
    "index_path": "/tmp/keyword_index",
    "query_text": "payment timeout",
    "where_sql": "bucket = 1",
    "top_k": 10
  }'
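
The same keyword-search call can be issued from Python without extra dependencies. The sketch below builds the request with urllib using the URL and payload fields from the curl example; it assumes where_sql may be omitted, and it does not interpret the response because the response schema is not described here. build_keyword_search_request is a hypothetical helper name.

```python
import json
import urllib.request

SERVICE_URL = "http://127.0.0.1:37491"


def build_keyword_search_request(index_path, query_text, where_sql=None,
                                 top_k=10, base_url=SERVICE_URL):
    """Build a POST request for the keyword-search run endpoint."""
    payload = {"index_path": index_path, "query_text": query_text, "top_k": top_k}
    if where_sql is not None:  # assumed to be optional
        payload["where_sql"] = where_sql
    return urllib.request.Request(
        f"{base_url}/api/v1/runs/keyword-search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_keyword_search_request(
    "/tmp/keyword_index",
    "payment timeout",
    where_sql="bucket = 1",
)
print(request.get_method(), request.full_url)

# To send it against a running velaria_service:
# with urllib.request.urlopen(request, timeout=30) as response:
#     result = json.loads(response.read())
```

Separating request construction from sending keeps the payload easy to inspect and test without a running service.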

Agent Runtime (Optional)

Codex and Claude SDK adapter dependencies are declared by the Python package. Configure the Claude adapter only when you want --runtime claude.

The agent runtime provides:

  • Velaria-owned Agent CLI/TUI via velaria_cli.py or velaria_cli.py agent with Codex/Claude used only as runtime adapters
  • Headless Agent turns via velaria_cli.py agent --print ... and JSONL event streaming via velaria_cli.py agent --stream-json ...
  • Thread persistence under agentRuntimeWorkspace
  • On-demand exposure of the Velaria usage skill as an MCP resource
  • Velaria local functions exposed through the runtime bridge / MCP server: velaria_read, velaria_schema, velaria_sql, velaria_explain, velaria_dataset_download, velaria_dataset_import, velaria_dataset_normalize, velaria_dataset_process, velaria_cli_run, velaria_artifact_preview, velaria_sql_capabilities, velaria_sql_function_search, velaria_sql_query_patterns
  • On-demand SQL reference resource velaria://sql/catalog for SQL v1 capabilities, scalar functions, and reusable query patterns
  • Natural language to SQL generation via velaria_cli.py ai generate-sql
  • Legacy compatibility commands under velaria_cli.py ai ...; new interactive work should use the default Agent entry or velaria_cli.py agent

Both runtimes use the same ~/.velaria/config.json and agent* config keys.

Codex runtime (default):

{
  "agentRuntime": "codex",
  "agentAuthMode": "local",
  "agentProvider": "openai",
  "agentReasoningEffort": "none",
  "agentRuntimeWorkspace": "~/.velaria/ai-runtime",
  "agentCodexNetworkAccess": true
}

Claude runtime:

{
  "agentRuntime": "claude",
  "agentAuthMode": "local",
  "agentProvider": "anthropic",
  "agentModel": "claude-sonnet-4-20250514",
  "agentReasoningEffort": "none",
  "agentRuntimeWorkspace": "~/.velaria/ai-runtime",
  "agentNetworkAccess": true
}

Defaults:

  • Model: Codex reuses the local Codex config model and falls back to gpt-5.4-mini; set agentCodexModel only when Velaria should override the local Codex model. Claude uses claude-sonnet-4-20250514.
  • Reasoning effort: both runtimes default agentReasoningEffort to none.
  • Workspace: agentRuntimeWorkspace is the runtime working directory used to save and resume agent threads; if omitted, Velaria creates a project-scoped directory under ~/.velaria/ai-runtime/.
  • Authentication: agentAuthMode: "local" reuses the local login; use agentAuthMode: "api_key" with agentApiKey and agentBaseUrl for explicit credentials.
  • Runtime paths: use agentRuntimePath / agentCodexRuntimePath only when overriding the local Codex runtime bridge, and agentClaudeRuntimePath for the Claude SDK adapter.
  • Network access: controlled by agentCodexNetworkAccess (Codex) or agentNetworkAccess (Claude); both default to true.
  • Proxies: the runtime inherits standard proxy environment variables such as http_proxy, https_proxy, and all_proxy.
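
As a rough illustration of how these defaults combine with a user config, the sketch below overlays the keys found in ~/.velaria/config.json on the defaults stated above. It is not Velaria's actual config loader: resolve_agent_config and load_agent_config are hypothetical helpers, and only defaults named in this section are encoded (the Codex model is left unset because Codex reuses the local Codex config model).

```python
import json
from pathlib import Path

CONFIG_PATH = Path.home() / ".velaria" / "config.json"

# Defaults stated in this section. No Codex model is listed on purpose:
# Codex reuses the local Codex config model when none is configured.
COMMON_DEFAULTS = {"agentRuntime": "codex", "agentReasoningEffort": "none"}
RUNTIME_DEFAULTS = {
    "codex": {"agentCodexNetworkAccess": True},
    "claude": {"agentModel": "claude-sonnet-4-20250514", "agentNetworkAccess": True},
}


def resolve_agent_config(raw: dict) -> dict:
    """Overlay user-provided keys on the documented defaults (illustrative only)."""
    runtime = raw.get("agentRuntime", COMMON_DEFAULTS["agentRuntime"])
    return {**COMMON_DEFAULTS, **RUNTIME_DEFAULTS.get(runtime, {}), **raw}


def load_agent_config(path: Path = CONFIG_PATH) -> dict:
    """Read the config file if present; a missing file means all defaults."""
    raw = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    return resolve_agent_config(raw)


print(resolve_agent_config({"agentRuntime": "claude"}))
```

Keys set by the user always win in this sketch, which mirrors the guidance that agentCodexModel should be set only when Velaria should override the local Codex model.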

Velaria Agent keeps the underlying runtime generic. Velaria-specific usage guidance and SQL function details are exposed on demand through MCP resources and local functions, not by embedding the full skill or SQL function catalog in the default prompt. Use velaria_sql_capabilities, velaria_sql_function_search, velaria_sql_query_patterns, or the velaria://sql/catalog MCP resource when an agent needs SQL details.

Agent CLI examples:

uv run --project python python python/velaria_cli.py

uv run --project python python python/velaria_cli.py agent --runtime claude

uv run --project python python python/velaria_cli.py agent --model gpt-5.4

uv run --project python python python/velaria_cli.py agent --print \
  "Read data/sales.csv, aggregate amount by region, and save the run"

uv run --project python python python/velaria_cli.py agent --stream-json \
  "summarize recent workspace runs"

uv run --project python python python/velaria_cli.py ai generate-sql \
  --prompt "top 5 by score" --schema "name,score,region"
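
For scripts that consume agent --stream-json, a small reader like the one below may be enough. It assumes each stdout line is one JSON object, which is what JSONL streaming implies; the event fields themselves are not documented in this section, so the sample events are invented placeholders and the helper names are hypothetical.

```python
import json
import subprocess
from typing import Iterable, Iterator


def iter_jsonl_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed object per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def stream_agent_turn(prompt: str) -> Iterator[dict]:
    """Run one headless turn from a source checkout and yield its events."""
    command = [
        "uv", "run", "--project", "python", "python",
        "python/velaria_cli.py", "agent", "--stream-json", prompt,
    ]
    with subprocess.Popen(command, stdout=subprocess.PIPE, text=True) as process:
        yield from iter_jsonl_events(process.stdout)


# Offline demonstration with made-up events; real event fields may differ.
sample = ['{"type": "example_start"}\n', "\n", '{"type": "example_end"}\n']
print([event["type"] for event in iter_jsonl_events(sample)])
```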

Inside the Agent TUI, use Ctrl+M to open the model picker for the current runtime. Use /model <model-name> to switch to an arbitrary provider model name without restarting the CLI. Model switches start a fresh Agent session and are blocked while a turn is running or queued.

Current SQL mapping carried by Python:

  • Session.sql(...) maps to core SQL v1 batch semantics:
    • CREATE TABLE, CREATE SOURCE TABLE, CREATE SINK TABLE
    • INSERT INTO ... VALUES
    • INSERT INTO ... SELECT
    • SELECT with projection/alias, WHERE, GROUP BY columns/scalar expressions, ORDER BY, LIMIT, UNION / UNION ALL, and the current minimal JOIN
    • batch WHERE supports single predicates, column-to-column predicates, plus AND / OR expressions
    • batch KEYWORD SEARCH(title, body) QUERY '...' TOP_K ... on single-table non-aggregate queries
    • batch HYBRID SEARCH ... QUERY ... on single-table non-aggregate queries
    • current Python service can combine reusable keyword-index recall with vector rerank by passing both index_path and dataset_path to hybrid-search
  • batch builtins currently exposed through the same core path:
    • LOWER, UPPER, TRIM, LTRIM, RTRIM
    • LENGTH, LEN, CHAR_LENGTH, CHARACTER_LENGTH, REVERSE
    • CONCAT, CONCAT_WS, LEFT, RIGHT, SUBSTR / SUBSTRING, POSITION, REPLACE, CAST
    • ABS, CEIL, FLOOR, ROUND, YEAR, MONTH, DAY, ISO_YEAR, ISO_WEEK, WEEK, YEARWEEK, NOW, TODAY, CURRENT_TIMESTAMP, currentTimestamp, UNIX_TIMESTAMP
    • supported scalar functions can be nested in projection expressions
  • Session.stream_sql(...), Session.explain_stream_sql(...), and Session.start_stream_sql(...) share the same stream SQL front-door checks:
    • source must be a source table / stream source
    • sink target must be a sink table
    • only the current stream-stable projection, filter, window, stateful aggregate, and bounded-source ORDER BY shapes are accepted
  • current SQL v1 keeps ORDER BY scoped to columns present in the SELECT output
  • unsupported SQL shapes are expected to surface as explicit parse / semantic / unsupported SQL v1 / table-kind errors from the core path, not Python-only behavior

Desktop / service import behavior:

  • local file import preview still only inspects schema + preview rows
  • when saving a dataset from the desktop app, embedding build and keyword-index build can both be configured and will run asynchronously in the background
  • bitable import can build:
    • a reusable embedding dataset from selected text columns
    • a reusable keyword index from selected keyword columns
    • both in parallel within the same background import run
  • packaged sidecar builds copy jieba dictionaries from the resolved Bazel cppjieba dependency at build time; the source repo does not need to carry those dictionary files

Repository Layout

Stable Python layout in this repo:

  • supported library:
    • python/velaria/
  • supported CLI tool:
    • python/velaria_cli.py
  • examples:
    • python/examples/
  • benchmarks:
    • python/benchmarks/
  • reserved experimental area:
    • python/experimental/
  • regression tests:
    • python/tests/

Toolchain and Environment

Repository Python commands use uv.

Recommended local baseline:

  • CPython 3.12 or 3.13
  • uv
  • local CPython headers (Python.h)

Bazel Python detection currently probes local CPython interpreters in the 3.9 to 3.13 range. If auto-discovery fails, set:

export VELARIA_PYTHON_BIN=/path/to/python3.13

That interpreter must expose Python.h; otherwise Bazel cannot build the native extension.
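
A quick way to check whether a candidate interpreter exposes Python.h is to ask that interpreter where it expects its headers. This is a minimal standard-library sketch, not part of the Velaria build; run it with the interpreter you intend to pass as VELARIA_PYTHON_BIN.

```python
import sysconfig
from pathlib import Path


def python_header_path() -> Path:
    """Location where the running interpreter expects its C headers."""
    return Path(sysconfig.get_paths()["include"]) / "Python.h"


def has_python_headers() -> bool:
    """True when Python.h is present for the running interpreter."""
    return python_header_path().is_file()


print(f"Python.h expected at: {python_header_path()}")
print(f"headers available: {has_python_headers()}")
```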

Development Workflow

Bootstrap:

bazel build //:velaria_pyext
bazel run //python:sync_native_extension
uv sync --project python --python python3.13

If you run python/velaria_cli.py or other source-checkout Python entrypoints directly, keep python/velaria/_velaria.so in sync with:

bazel run //python:sync_native_extension

Run demos:

uv run --project python python python/examples/demo_batch_sql_arrow.py
uv run --project python python python/examples/demo_stream_sql.py
uv run --project python python python/examples/demo_vector_search.py
uv run --project python python python/examples/demo_embedding_pipeline.py

Recommended regression entrypoint:

./scripts/run_python_ecosystem_regression.sh

Repository benchmark fixture generation:

  • the stage benchmark can generate synthetic data at runtime
  • if you want a locally realistic benchmark input, generate an anonymized CSV from a private raw export with:
uv run --project python python scripts/generate_stage_benchmark_fixture.py \
  --input /path/to/raw_rows_100k.csv \
  --output python/benchmarks/data/stage_input_100k_anonymized.csv
  • keep that generated CSV local and untracked; it is ignored by .gitignore

The regression script (./scripts/run_python_ecosystem_regression.sh) covers:

  • native extension build
  • wheel and native wheel build
  • Bazel Python regression targets
  • demo smoke
  • CLI smoke

Benchmark regression entrypoint:

./scripts/run_python_stage_benchmark.sh

Embedding pipeline benchmark:

uv run --project python python python/benchmarks/bench_embedding_pipeline.py

For the local MiniLM provider:

uv sync --project python --extra embedding
uv run --project python python python/benchmarks/bench_embedding_pipeline.py \
  --provider minilm \
  --model sentence-transformers/all-MiniLM-L6-v2

The benchmark reports:

  • batch embedding/materialization throughput
  • online query embedding latency
  • online hybrid search latency on the resulting embedding dataset

Core file-input benchmark entrypoint:

bazel run //:file_source_benchmark -- 200000 3

That benchmark currently reports:

  • CSV hardcode / explicit / auto-probed paths
  • CSV scan-only / full-columnar / full-row-materialize / projected / filter-pushdown / aggregate-pushdown sub-cases
  • line split hardcode / explicit / auto-probed paths and direct filter-pushdown / aggregate-pushdown cases
  • line regex parse and grouped-aggregate paths
  • JSON lines hardcode / explicit / auto-probed paths and direct filter-pushdown / aggregate-pushdown cases
  • SQL CREATE TABLE ... OPTIONS(path: '...') registration costs plus CSV / line / JSON predicate-pushdown comparisons

Current pushdown lowering also classifies source work into:

  • ConjunctiveFilterOnly
  • SingleKeyCount
  • SingleKeyNumericAggregate
  • Generic

Representative clean-main vs current snapshot on 200000 / 3:

  • read_line_regex_explicit_group_sum: 5679936 us -> 641735 us
  • sql_csv_predicate_and_group_count: 133011 us -> 109146 us
  • sql_csv_predicate_or_group_count: 307313 us -> 171556 us
  • sql_csv_predicate_mixed_group_count: 462000 us -> 275583 us
  • sql_line_predicate_or_group_count: 314852 us -> 174627 us
  • sql_json_predicate_or_group_count: 604404 us -> 420423 us
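
The timings above are easier to compare as ratios. This short sketch recomputes the speedup for each case directly from the listed microsecond values; it adds no new measurements.

```python
# (before_us, after_us) pairs copied from the snapshot above.
SNAPSHOT_US = {
    "read_line_regex_explicit_group_sum": (5679936, 641735),
    "sql_csv_predicate_and_group_count": (133011, 109146),
    "sql_csv_predicate_or_group_count": (307313, 171556),
    "sql_csv_predicate_mixed_group_count": (462000, 275583),
    "sql_line_predicate_or_group_count": (314852, 174627),
    "sql_json_predicate_or_group_count": (604404, 420423),
}


def speedup(before_us: int, after_us: int) -> float:
    """How many times faster the current build is than clean main."""
    return before_us / after_us


speedups = {name: speedup(*pair) for name, pair in SNAPSHOT_US.items()}
for name, ratio in speedups.items():
    before_us, after_us = SNAPSHOT_US[name]
    saved_ms = (before_us - after_us) / 1000
    print(f"{name}: {ratio:.2f}x faster, {saved_ms:.1f} ms saved")
```

The regex group-sum path improves by roughly 8.9x, while the SQL predicate cases land between about 1.2x and 1.8x.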

For Linux perf sampling on the native CSV path:

perf record --call-graph=dwarf bazel-bin/file_source_benchmark -- 200000 3
perf report

By default, the stage benchmark script (./scripts/run_python_stage_benchmark.sh) generates benchmark input at runtime. To use a local anonymized CSV instead, set VELARIA_STAGE_BENCH_CSV=/path/to/file.csv. The default scenario is groupby_count_max.

Benchmark scenario controls:

  • set VELARIA_STAGE_BENCH_SCENARIO=groupby_count_max for the caller_psm / count / max(latency) path
  • set VELARIA_STAGE_BENCH_SCENARIO=filter_lower_limit for the LOWER(method) + filter + LIMIT path
  • set VELARIA_STAGE_BENCH_QUERY="..." only when you intentionally want a custom Velaria query
  • pass --cache-in-memory to python/benchmarks/bench_stage_paths.py when you want the reuse path to retain projected source columns in the current session
  • when VELARIA_STAGE_BENCH_QUERY does not match the selected scenario query, also set VELARIA_STAGE_BENCH_SKIP_HARDCODE=1; otherwise the benchmark rejects the run

The hardcode baseline is reported only when it is semantically aligned with the selected scenario. The benchmark enforces row-count parity between the hardcode baseline and the Velaria result before it prints ratios.

Interpretation guardrails for the stage benchmark:

  • Session.read_csv(...) and Session.sql(...) are setup/planning calls in this harness; they do not represent file scan or query execution time by themselves
  • DataFrame.to_arrow() is the first materialization point in this harness, so its stage includes execution plus Arrow export
  • to_pylist() only measures the Python-side conversion from the already materialized Arrow table
  • the hardcode baseline in python/benchmarks/bench_stage_paths.py is a scenario-specific Python stdlib baseline built with csv.DictReader and manual logic; it is useful for relative comparison inside this harness, but it is not a native C/C++ kernel upper bound
  • packaged CLI startup is a separate measurement surface from the Python API stage benchmark
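
To show what a csv.DictReader baseline with manual logic can look like, here is a minimal sketch of the groupby_count_max shape (count and max latency per caller_psm). It is not the code in python/benchmarks/bench_stage_paths.py; the function name is hypothetical and it assumes latency holds integers.

```python
import csv
import io


def groupby_count_max(csv_file, key="caller_psm", value="latency"):
    """Return {key: (row count, max value)} using only the standard library."""
    stats = {}
    for row in csv.DictReader(csv_file):
        latency = int(row[value])
        count, current_max = stats.get(row[key], (0, latency))
        stats[row[key]] = (count + 1, max(current_max, latency))
    return stats


sample = io.StringIO(
    "caller_psm,latency\n"
    "svc.a,120\n"
    "svc.b,45\n"
    "svc.a,300\n"
)
print(groupby_count_max(sample))
```

A loop like this builds one Python dict per row, which is why it serves as a relative reference inside the harness rather than as an upper bound for a native kernel.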

Packaging

Build targets:

  • native extension:
    • //:velaria_pyext
  • sync built native extension into the source checkout:
    • //python:sync_native_extension
  • pure-Python wheel wrapper:
    • //python:velaria_whl
  • native wheel:
    • //python:velaria_native_whl
  • Python CLI:
    • //python:velaria_cli

Single-file CLI packaging:

./scripts/build_py_cli_executable.sh
./dist/velaria-cli file-sql \
  --csv /path/to/input.csv \
  --query "SELECT * FROM input_table LIMIT 5"

That single-file CLI is packaged with PyInstaller --onefile, so cold-start measurements include Python/bootstrap overhead in addition to engine work.

The CLI is part of the ecosystem layer. For supported paths, it should delegate to the same native session contract as Python and C++.

Repo-visible CLI entrypoints are:

  • source checkout:
    • uv run --project python python python/velaria_cli.py ...
  • installed wheel or local package install:
    • velaria-cli ...
    • velaria_cli ...
  • packaged binary:
    • ./dist/velaria-cli ...

The global commands are expected only after installing the wheel or package into your environment.

Every top-level command and subcommand supports --help. Running velaria-cli without a subcommand starts the Velaria Agent TUI in a TTY. Use velaria-cli agent --print "..." or velaria-cli agent --stream-json "..." for script-friendly Agent turns.

Workspace + Artifacts

The CLI also supports a local workspace layout for tracked runs and artifact indexing.

Default paths:

  • runs: ~/.velaria/runs/<run_id>/
  • index: ~/.velaria/index/artifacts.sqlite

You can override the root with:

export VELARIA_HOME=/tmp/velaria-home
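
Tools that need to locate runs or the artifact index can derive the paths the same way. The sketch below assumes VELARIA_HOME replaces the ~/.velaria root while the runs/ and index/ layout stays unchanged; the helper names are hypothetical and not part of the velaria package.

```python
import os
from pathlib import Path


def velaria_home(environ=os.environ) -> Path:
    """Workspace root: VELARIA_HOME when set, otherwise ~/.velaria."""
    override = environ.get("VELARIA_HOME")
    return Path(override) if override else Path.home() / ".velaria"


def run_dir(run_id: str, environ=os.environ) -> Path:
    """Directory that holds one tracked run."""
    return velaria_home(environ) / "runs" / run_id


def artifact_index(environ=os.environ) -> Path:
    """SQLite file that indexes artifacts."""
    return velaria_home(environ) / "index" / "artifacts.sqlite"


example_env = {"VELARIA_HOME": "/tmp/velaria-home"}
print(run_dir("example_run", example_env))
print(artifact_index(example_env))
```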

Tracked run commands:

uv run --project python python python/velaria_cli.py run start -- file-sql \
  --run-name "cn_slow_query_24h_2026-04-03" \
  --description "score filter result for demo input" \
  --tag cn \
  --tag "slow-query,demo" \
  --csv /path/to/input.csv \
  --query "SELECT * FROM input_table LIMIT 5"

./dist/velaria-cli run start -- file-sql \
  --run-name "cn_slow_query_24h_2026-04-03" \
  --description "score filter result for demo input" \
  --tag cn \
  --tag "slow-query,demo" \
  --csv /path/to/input.csv \
  --query "SELECT * FROM input_table LIMIT 5"

uv run --project python python python/velaria_cli.py run list --tag cn --query "slow query" --limit 20
uv run --project python python python/velaria_cli.py run result --run-id <run_id>
uv run --project python python python/velaria_cli.py run diff --run-id <run_id> --other-run-id <other_run_id>
uv run --project python python python/velaria_cli.py run show --run-id <run_id>
uv run --project python python python/velaria_cli.py run status --run-id <run_id>
uv run --project python python python/velaria_cli.py artifacts list --run-id <run_id>
uv run --project python python python/velaria_cli.py artifacts preview --artifact-id <artifact_id>
uv run --project python python python/velaria_cli.py run cleanup --keep-last 10

The tracked workspace contract is:

  • stdout returns JSON only
  • logs go to stdout.log / stderr.log
  • run.json can carry run_name, description, and tags for human-readable context and filtering
  • run list returns summary-friendly fields such as artifact_count and duration_ms
  • failures return structured JSON with error_type, phase, optional run_id, and details
  • stream progress appends native snapshotJson() output to progress.jsonl
  • stream explain keeps the native logical / physical / strategy structure
  • large results stay in files under artifacts/; SQLite stores only index rows and small previews
  • deleting run directories requires the explicit --delete-files switch
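
Because stdout carries JSON only, a caller can parse it unconditionally and branch on the failure fields. The sketch below assumes a failure is recognizable by the presence of error_type in a JSON object; the success payload shape is not documented in this section, so it is returned as-is. VelariaRunError and parse_cli_stdout are hypothetical names.

```python
import json


class VelariaRunError(RuntimeError):
    """Hypothetical wrapper for a structured CLI failure payload."""

    def __init__(self, payload: dict):
        super().__init__(f"{payload.get('error_type')} during {payload.get('phase')}")
        self.payload = payload
        self.run_id = payload.get("run_id")  # optional per the contract
        self.details = payload.get("details")


def parse_cli_stdout(stdout_text: str):
    """Parse CLI stdout; raise when it carries a structured failure."""
    payload = json.loads(stdout_text)
    if isinstance(payload, dict) and "error_type" in payload:
        raise VelariaRunError(payload)
    return payload


# Made-up failure payload that uses only the documented field names.
failure = '{"error_type": "example_error", "phase": "example_phase", "details": "n/a"}'
try:
    parse_cli_stdout(failure)
except VelariaRunError as error:
    print(f"failed: {error} (run_id={error.run_id})")
```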

End-to-end examples:

CSV SQL to parquet plus preview:

uv run --project python python python/velaria_cli.py run start -- file-sql \
  --run-name "high_score_rows" \
  --description "high score rows for local inspection" \
  --tag local-demo \
  --tag scores \
  --csv /path/to/input.csv \
  --query "SELECT name, score FROM input_table WHERE score > 10"

uv run --project python python python/velaria_cli.py run list --tag scores --query "high score"
uv run --project python python python/velaria_cli.py run result --run-id <run_id>
uv run --project python python python/velaria_cli.py run diff --run-id <run_id> --other-run-id <other_run_id>
uv run --project python python python/velaria_cli.py artifacts list --run-id <run_id>
uv run --project python python python/velaria_cli.py artifacts preview --artifact-id <artifact_id>

Stream SQL once plus status:

uv run --project python python python/velaria_cli.py run start -- stream-sql-once \
  --source-csv-dir /path/to/source_dir \
  --sink-schema "key STRING, value_sum INT" \
  --query "INSERT INTO output_sink SELECT key, SUM(value) AS value_sum FROM input_stream GROUP BY key"

uv run --project python python python/velaria_cli.py run status --run-id <run_id>

For this action, the query still follows the core stream SQL boundary:

  • --query must be INSERT INTO <sink> SELECT ...
  • the source side must resolve to the stream source table created by the command
  • the sink side must resolve to the sink table created from --sink-schema
  • explain output remains logical / physical / strategy, and progress stays native snapshotJson()

Vector search plus explain artifact:

uv run --project python python python/velaria_cli.py run start -- vector-search \
  --csv /path/to/vectors.csv \
  --vector-column embedding \
  --query-vector "0.1,0.2,0.3" \
  --top-k 5

uv run --project python python python/velaria_cli.py artifacts list --run-id <run_id>

Hybrid search through the CLI keeps the same command and adds optional filter / threshold controls:

uv run --project python python python/velaria_cli.py vector-search \
  --csv /path/to/vectors.csv \
  --vector-column embedding \
  --query-vector "0.1,0.2,0.3" \
  --metric cosine \
  --top-k 5 \
  --where-column bucket \
  --where-op = \
  --where-value 1 \
  --score-threshold 0.02

Batch SQL also supports a minimal hybrid search clause through file-sql:

uv run --project python python python/velaria_cli.py file-sql \
  --csv /path/to/vectors.csv \
  --query "SELECT id, bucket, vector_score FROM input_table WHERE bucket = 1 HYBRID SEARCH embedding QUERY '[0.1 0.2 0.3]' METRIC cosine TOP_K 5 SCORE_THRESHOLD 0.02"

Python ecosystem source groups:

  • supported:
    • //python:velaria_python_supported_sources
  • examples and benchmarks:
    • //python:velaria_python_example_sources
  • experimental placeholder:
    • //python:velaria_python_experimental_sources

Arrow Contract

Arrow is the preferred interop form for high-volume results.

Guidance:

  • prefer Arrow/native columnar paths over to_rows() when benchmarking or integrating large results
  • treat to_rows() as a convenience/debugging surface
  • Session.sql(...) returns a lazy batch DataFrame handle
  • to_arrow() / to_rows() trigger materialization; the first materialization stage includes execution plus conversion to the requested result form
  • if you need pandas, use session.sql(...).to_arrow().to_pandas(); there is no direct DataFrame.to_pandas() helper in the current API

Supported Arrow ingestion inputs:

  • pyarrow.Table
  • pyarrow.RecordBatch
  • pyarrow.RecordBatchReader
  • objects implementing __arrow_c_stream__
  • Python sequences of Arrow batches

Preferred Arrow shape for vectors:

  • FixedSizeList<float32>

Preferred local CSV vector text shape:

  • [1 2 3]
  • [1,2,3]
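
Both text shapes can be read by one small parser. This is a standard-library sketch for preparing or validating CSV cells on the Python side; parse_vector_text is a hypothetical helper, and the native reader may accept or reject inputs differently.

```python
def parse_vector_text(text: str) -> list[float]:
    """Parse '[1 2 3]' or '[1,2,3]' into a list of floats."""
    body = text.strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError(f"expected a bracketed vector, got {text!r}")
    # Treat commas as whitespace so both separators take the same path.
    return [float(token) for token in body[1:-1].replace(",", " ").split()]


print(parse_vector_text("[1 2 3]"))
print(parse_vector_text("[1,2,3]"))
```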

Current vector search scope:

  • local exact scan only
  • metrics: cosine, dot, l2
  • batch SQL supports a minimal HYBRID SEARCH ... QUERY ... clause
  • CLI vector-search supports optional --where-column/--where-op/--where-value and --score-threshold
  • no ANN / distributed execution / standalone vector DB behavior
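
For readers unfamiliar with the term, an exact scan scores every row against the query and keeps the best top_k, with no index or approximation. The pure-Python sketch below illustrates that idea for the three listed metrics. It is not the native implementation and should not be used in place of it; the ordering convention (cosine and dot as similarities, l2 as a distance) is an assumption, and Velaria's score values and tie-breaking may differ.

```python
import math


def score(metric: str, query: list[float], candidate: list[float]) -> float:
    """Score one candidate vector against the query."""
    dot = sum(q * c for q, c in zip(query, candidate))
    if metric == "dot":
        return dot
    if metric == "cosine":
        norm = math.sqrt(sum(q * q for q in query)) * math.sqrt(
            sum(c * c for c in candidate)
        )
        return dot / norm if norm else 0.0
    if metric == "l2":
        return math.sqrt(sum((q - c) ** 2 for q, c in zip(query, candidate)))
    raise ValueError(f"unknown metric: {metric}")


def exact_top_k(rows, query, metric="cosine", top_k=5):
    """rows is an iterable of (row_id, vector); every row is scored."""
    scored = [(row_id, score(metric, query, vector)) for row_id, vector in rows]
    # Assumption: l2 is a distance (smaller is closer); the others are similarities.
    scored.sort(key=lambda item: item[1], reverse=(metric != "l2"))
    return scored[:top_k]


rows = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
print(exact_top_k(rows, [1.0, 0.0], metric="cosine", top_k=2))
```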

Excel, Bitable, and Custom Streams

Excel

read_excel(...) reads .xlsx through:

  1. pandas.read_excel
  2. pyarrow.Table conversion
  3. Session.create_dataframe_from_arrow(...)

Example:

from velaria import Session, read_excel

session = Session()
df = read_excel(session, "/path/to/file.xlsx", sheet_name="Sheet1")
session.create_temp_view("staff", df)
print(session.sql("SELECT * FROM staff LIMIT 5").to_rows())

Bitable and Custom Streams

Supported ecosystem integrations include:

  • Bitable-backed stream source flows
  • custom Arrow stream sources
  • custom Arrow stream sinks

These are supported as ecosystem integrations, not as alternate execution cores.

Regression Matrix

Python ecosystem regression targets:

  • //python:streaming_v05_test
  • //python:arrow_stream_ingestion_test
  • //python:vector_search_test
  • //python:read_excel_test
  • //python:custom_stream_source_test
  • //python:bitable_stream_source_test
  • //python:bitable_group_by_owner_integration_test

Python-layer grouped suite:

  • //python:velaria_python_supported_regression

Root-level grouped suite:

  • //:python_ecosystem_regression

Relation to Core

Python may:

  • wrap
  • package
  • automate
  • project ecosystem-friendly names

Python may not:

  • redefine progress/checkpoint/explain semantics
  • become the source of truth for runtime decisions
  • introduce a second vector-search implementation for supported interfaces
  • pull the native kernel back toward a row-first design for ecosystem convenience

For core boundaries, see docs/core-boundary.md. For stable runtime semantics, see docs/runtime-contract.md.