This document is the entrypoint for Velaria's supported Python ecosystem layer.
Python is a supported ingress, interop, and packaging surface. It is not the execution core. Core semantics still come from the native kernel and the runtime contract in docs/runtime-contract.md.
The Python layer must assume the core kernel is column-first.
Practical implications:
- Python interop should preserve columnar data as deep as possible
- rows are a compatibility/export surface, not the preferred internal execution form
- Arrow import/export work should reduce copy and rematerialization rather than normalizing row-first behavior
- performance-sensitive changes should prefer native kernel improvements and lower-copy boundaries over Python-side workarounds
The supported Python ecosystem includes:
- the velaria/ package and Session API
- Arrow ingestion and Arrow output
- uv-based local workflow
- native extension build
- wheel / native wheel packaging
- the supported CLI entrypoint velaria_cli.py
- Excel ingestion via read_excel(...)
- Bitable adapters and stream source integration
- custom source / custom sink adapters
- vector search and vector explain APIs
- offline embedding pipeline helpers for versioned vector assets
- offline keyword-index build helpers and reusable BM25 keyword-search assets
- public-data finance helpers for A-share / U.S. stock history and quote ingestion
- Velaria-owned Agent CLI/TUI backed by Codex App Server or Claude Agent SDK runtime adapters
- default Agent CLI via velaria_cli.py, with headless agent --print and agent --stream-json modes
Examples and helper assets include:
- examples/demo_batch_sql_arrow.py
- examples/demo_stream_sql.py
- examples/demo_bitable_group_by_owner.py
- examples/demo_vector_search.py
- benchmarks/bench_arrow_ingestion.py
- examples/demo_embedding_pipeline.py
- examples/finance_public_data_smoke.py
- benchmarks/bench_embedding_pipeline.py
- local ecosystem scripts and skills
The Python experimental area is currently reserved under experimental/.
Anything placed there is explicitly outside the supported ecosystem surface until it is promoted into velaria/, velaria_cli.py, or a supported adapter module.
Python does not define:
- execution hot-path semantics
- a separate progress schema
- a separate checkpoint contract
- a separate vector scoring implementation for supported APIs
- Python UDFs in the hot path
- a row-first fallback policy for the native kernel
Main Session API:
- Session.probe(...)
- Session.read(...)
- Session.read_csv(...)
- Session.read_line_file(...)
- Session.read_json(...)
- Session.sql(...)
- Session.create_dataframe_from_arrow(...)
- Session.create_stream_from_arrow(...)
- Session.create_temp_view(...)
- Session.read_stream_csv_dir(...)
- Session.stream_sql(...)
- Session.explain_stream_sql(...)
- Session.start_stream_sql(...)
- Session.vector_search(...)
- Session.explain_vector_search(...)
- build_embedding_rows(...)
- materialize_embeddings(...)
- load_embedding_dataframe(...)
- embed_query_text(...)
- SentenceTransformerEmbeddingProvider(...)
- build_keyword_index(...)
- search_keyword_index(...)
Additional ecosystem helpers:
- read_excel(...)
- CustomArrowStreamSource
- CustomArrowStreamSink
- create_stream_from_custom_source(...)
- consume_arrow_batches_with_custom_sink(...)
- finance_pack.fetch_history(...)
- finance_pack.fetch_quotes(...)
- finance_pack.fetch_fundamentals(...)
- finance_pack.build_research_prompt(...)
Mapping rule:
- Python names may be ecosystem-friendly
- behavior must map back to the same native kernel contract exposed by C++
- Python wrappers should not force row materialization earlier than required by the user-facing boundary
The finance pack is a Python ecosystem helper for agentic monitor workflows. It does not add financial semantics to the native kernel.
Install the optional public-data provider dependency:
uv sync --project python --extra finance
Start with the product readiness check and source guide:
uv run --project python --extra finance python python/velaria_cli.py finance doctor
uv run --project python --extra finance python python/velaria_cli.py finance sources
finance sources is generated from the provider registry used by the fetch
commands. It is the authoritative runtime list for provider capabilities,
supported markets, command support, freshness metadata, and recommended
history/quote/news paths.
Run the one-command A-share analysis workflow. This fetches a public quote, stores it as a Velaria observation, runs a monitor, and prints a readable research report with source evidence:
uv run --project python --extra finance python python/velaria_cli.py finance analyze \
--market cn \
--symbol 000001
Use JSON output for agent automation:
uv run --project python --extra finance python python/velaria_cli.py finance analyze \
--market cn \
--symbol 000001 \
--format json
Run the complete CLI chain: fetch historical OHLCV, persist a history artifact, subscribe to live quote ticks, run a monitor, and emit analysis plus service integration metadata:
uv run --project python --extra finance python python/velaria_cli.py finance pipeline \
--market cn \
--symbol 000001 \
--start-date 20250101 \
--end-date 20250131 \
--iterations 1 \
--interval-sec 0
Use JSON output for the complete chain:
uv run --project python --extra finance python python/velaria_cli.py finance pipeline \
--market cn \
--symbol 000001 \
--start-date 20250101 \
--end-date 20250131 \
--iterations 1 \
--interval-sec 0 \
--format json
Run the same complete chain for a U.S. stock. The default path uses Yahoo for
historical OHLCV and Tencent for quote ticks; Tencent U.S. quote rows are
reported as freshness=delayed by the provider contract.
uv run --project python --extra finance python python/velaria_cli.py finance pipeline \
--market us \
--symbol AAPL \
--start-date 20260501 \
--end-date 20260518 \
--iterations 1 \
--interval-sec 0 \
--format json
Rank a candidate pool with quote polling, historical momentum, public news RSS, and transparent sentiment evidence. This command emits research candidates, not trading advice:
uv run --project python --extra finance python python/velaria_cli.py finance rank-candidates \
--market us \
--symbols AAPL,MSFT,NVDA \
--start-date 20260501 \
--end-date 20260518 \
--top 3 \
--news-limit 5 \
--iterations 1 \
--format json
Use continuous JSONL mode for a running monitor-style loop:
uv run --project python --extra finance python python/velaria_cli.py finance rank-candidates \
--market us \
--symbols AAPL,MSFT,NVDA \
--start-date 20260501 \
--end-date 20260518 \
--top 3 \
--news-limit 5 \
--iterations 0 \
--interval-sec 30 \
--jsonl
Use native stream mode when the ranking loop should push normalized candidate
events through Velaria's native realtime stream source/sink APIs. Add
--ingest-raw when quote, history, news, derived feature metrics, and candidate
rows should all be persisted as Velaria external_event sources for later
inspection. Native stream sink output is also persisted as a durable stream
history source named
finance_<market>_rank_candidates_native_stream_signals:
uv run --project python --extra finance python python/velaria_cli.py finance rank-candidates \
--market us \
--symbols AAPL,MSFT,NVDA \
--start-date 20260501 \
--end-date 20260518 \
--top 3 \
--news-limit 5 \
--native-stream \
--ingest-raw \
--entry-score-threshold 8 \
--entry-return-threshold 5 \
--exit-score-threshold 0 \
--exit-quote-pct-threshold -3 \
--signal-policy-preset balanced \
--iterations 0 \
--interval-sec 300 \
--format json
Native stream signal flags are computed by a signal policy before they enter
the generic stream SQL predicate WHERE entry_signal >= 1 OR exit_signal >= 1.
Use --signal-policy-preset balanced|momentum|defensive for built-in policies,
or pass --signal-policy JSON to make the condition tree explicit:
uv run --project python --extra finance python python/velaria_cli.py finance rank-candidates \
--market us \
--symbols AAPL,MSFT,NVDA \
--start-date 20260501 \
--end-date 20260518 \
--native-stream \
--ingest-raw \
--signal-policy '{"entry":{"all":[{"field":"momentum_state","op":"!=","value":"bearish"},{"field":"score","op":">=","value":0}]},"exit":{"any":[{"field":"news_sentiment_label","op":"=","value":"negative"},{"field":"quote_pct_change","op":"<=","value":-2}]}}' \
--iterations 1 \
--interval-sec 0 \
--format json
Query stored stream output later with:
uv run --project python --extra finance python python/velaria_cli.py finance stream-history \
--market us \
--source-id finance_us_rank_candidates_native_stream_signals \
--limit 50 \
--format json
Use intelligence as the productized entrypoint when the goal is the full
finance loop: public data ingestion, Velaria native realtime stream signals,
durable event storage, agent-readable AI briefs, and replay from persisted
rows. It reuses the same watch-session runtime instead of creating a separate
finance engine:
uv run --project python --extra finance python python/velaria_cli.py finance intelligence start \
--intelligence-id us_intel_20260519 \
--session-id us_watch_20260519 \
--market us \
--symbols AAPL,MSFT,NVDA \
--market-symbols SPY,QQQ,DIA \
--start-date 20260501 \
--end-date 20260518 \
--top 3 \
--news-limit 5 \
--iterations 0 \
--interval-sec 300 \
--format json
uv run --project python --extra finance python python/velaria_cli.py finance intelligence review \
--session-id us_watch_20260519 \
--format json
uv run --project python --extra finance python python/velaria_cli.py finance intelligence replay \
--session-id us_watch_20260519 \
--format json
uv run --project python --extra finance python python/velaria_cli.py finance intelligence report \
--session-id us_watch_20260519 \
--format json
uv run --project python --extra finance python python/velaria_cli.py finance intelligence index \
--session-id us_watch_20260519 \
--format json
uv run --project python --extra finance python python/velaria_cli.py finance intelligence search \
--session-id us_watch_20260519 \
--query "NVDA momentum risk news fundamentals" \
--top-k 5 \
--format json
uv run --project python --extra finance python python/velaria_cli.py finance intelligence jobs \
--session-id us_watch_20260519 \
--format json
uv run --project python --extra finance python python/velaria_cli.py finance intelligence status \
--session-id us_watch_20260519 \
--format json
uv run --project python --extra finance python python/velaria_cli.py finance intelligence stop \
--session-id us_watch_20260519 \
--format json
uv run --project python --extra finance python python/velaria_cli.py finance intelligence resume \
--session-id us_watch_20260519 \
--format json
uv run --project python --extra finance python python/velaria_cli.py finance intelligence evaluate \
--session-id us_watch_20260519 \
--format json
uv run --project python --extra finance python python/velaria_cli.py finance intelligence eval-report \
--session-id us_watch_20260519 \
--format json
intelligence start writes finance_intelligence_sessions and
finance_intelligence_ai_notes, while all quote/history/news/feature/market/
fundamental and native stream rows remain under the watch-session feed sources.
replay reads those persisted realtime rows back as historical evidence and
writes finance_intelligence_replays. search hybrid-searches the persisted
watch-session evidence with BM25 keyword retrieval, structured finance signals,
recency, and reciprocal rank fusion,
then writes finance_intelligence_searches. By default search uses
--index-mode auto, which reuses a persisted evidence index when its metadata
fingerprint matches the current session rows, and rebuilds it when stale.
Use intelligence index to prebuild that reusable index under
$VELARIA_HOME/finance/evidence_indexes/. Finance intelligence does not use
hash embeddings in the product path; retrieval.semantic.status is disabled
until a real production embedding provider is explicitly configured. report
writes finance_intelligence_reports with a final scorecard, supervisor
checks, provider quality diagnostics, and a replayable research summary.
jobs, status, stop, and resume expose the durable job surface backed by
finance_intelligence_jobs and finance_watch_session_runs; use them when an
agent needs to inspect or control a long-running finance intelligence session
without scraping logs. evaluate and eval-report read only persisted rows and
persist finance_intelligence_evaluations with signal, provider, retrieval,
and runtime quality metrics. The
ai_plane.agent_prompt is designed for velaria_cli_run and does not
fabricate model output.
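The search description above names reciprocal rank fusion as the step that combines BM25, structured-signal, and recency rankings. The following is a minimal sketch of generic reciprocal rank fusion, not Velaria's implementation: the function name, the constant k=60 (a commonly used default), and the evidence ids are all assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked id lists into one list, best first.

    Each id receives 1 / (k + rank) from every ranking it appears in,
    where rank starts at 1. k=60 is a conventional default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical evidence ids ranked by three independent signals.
bm25_ranking = ["news-3", "quote-1", "fund-2"]
signal_ranking = ["quote-1", "fund-2", "news-3"]
recency_ranking = ["quote-1", "news-3", "fund-2"]

fused = reciprocal_rank_fusion([bm25_ranking, signal_ranking, recency_ranking])
print(fused)
```

Because the fusion uses only rank positions, the individual rankers do not need comparable score scales, which is one reason the technique suits mixing keyword scores with structured signals and recency.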
Use watch-session when you need the lower-level durable market watch and
diagnostic surface. It combines candidate ranking, native stream signal
generation, raw quote/history/news storage, market context snapshots, and
fundamental provider snapshots under one durable session_id. If a public
provider cannot supply a requested feed, the row is persisted as a structured
unavailable event instead of being mocked:
uv run --project python --extra finance python python/velaria_cli.py finance watch-session start \
--session-id us_watch_20260519 \
--market us \
--symbols AAPL,MSFT,NVDA \
--market-symbols SPY,QQQ,DIA \
--start-date 20260501 \
--end-date 20260518 \
--top 3 \
--news-limit 5 \
--entry-score-threshold 8 \
--entry-return-threshold 5 \
--exit-score-threshold 0 \
--exit-quote-pct-threshold -3 \
--signal-policy-preset balanced \
--iterations 0 \
--interval-sec 300 \
--format json
Inspect the durable session and produce the closing review from the same data:
uv run --project python --extra finance python python/velaria_cli.py finance watch-session list --format json
uv run --project python --extra finance python python/velaria_cli.py finance watch-session events \
--session-id us_watch_20260519 \
--feed all \
--limit 200 \
--format json
uv run --project python --extra finance python python/velaria_cli.py finance watch-session summarize \
--session-id us_watch_20260519 \
--format json
For long-running market watches, run the same session asynchronously. The
background process still uses Velaria native realtime stream SQL for signal
selection and writes all feeds to the same AgenticStore; the foreground CLI
returns a pid, log_path, and model-readable follow-up commands:
uv run --project python --extra finance python python/velaria_cli.py finance watch-session start \
--session-id us_watch_20260519 \
--market us \
--symbols AAPL,MSFT,NVDA \
--market-symbols SPY,QQQ,DIA \
--start-date 20260501 \
--end-date 20260518 \
--top 3 \
--news-limit 5 \
--iterations 0 \
--interval-sec 300 \
--async-run \
--format json
uv run --project python --extra finance python python/velaria_cli.py finance watch-session status \
--session-id us_watch_20260519 \
--format json
uv run --project python --extra finance python python/velaria_cli.py finance watch-session logs \
--session-id us_watch_20260519 \
--limit 20 \
--format json
uv run --project python --extra finance python python/velaria_cli.py finance watch-session review \
--session-id us_watch_20260519 \
--log-limit 20 \
--format json
uv run --project python --extra finance python python/velaria_cli.py finance watch-session supervise \
--session-id us_watch_20260519 \
--interval-sec 60 \
--log-limit 20 \
--format json
uv run --project python --extra finance python python/velaria_cli.py finance watch-session stop \
--session-id us_watch_20260519 \
--format json
review reads the async runtime row, process state, log tail, persisted feed
counts, latest signals, and provider-unavailable evidence, then appends a
structured row to finance_watch_session_reviews. supervise runs that same
review loop continuously inside the CLI (--iterations 0) or for a bounded
number of cycles. The review output includes next_actions and an
agent_prompt that is designed to be passed back through velaria_cli_run for
continuous observation and adjustment.
Use agentic stream monitor mode when the ranking loop should create Velaria
execution_mode=stream monitors and emit FocusEvents from the persisted
ranking event stream. --until-time runs inside the CLI until the RFC3339
deadline; no external driver script is required:
uv run --project python --extra finance python python/velaria_cli.py finance rank-candidates \
--market us \
--symbols AAPL,MSFT,NVDA \
--start-date 20260501 \
--end-date 20260518 \
--top 3 \
--news-limit 5 \
--stream-monitor \
--ingest-raw \
--entry-score-threshold 8 \
--entry-return-threshold 5 \
--exit-score-threshold 0 \
--exit-quote-pct-threshold -3 \
--until-time 2026-05-18T16:00:00-04:00 \
--interval-sec 300 \
--format json
Fetch public news rows directly when you need to inspect the news provider and
sentiment evidence. News/filing rows include source_category, source_type,
source_score, and source_score_reason so ranking, replay, and agents can
distinguish broad news aggregators from finance news and regulatory filings:
uv run --project python --extra finance python python/velaria_cli.py finance fetch-news \
--provider google-news \
--market us \
--symbol AAPL \
--limit 5
uv run --project python --extra finance python python/velaria_cli.py finance fetch-news \
--provider yahoo-finance-news \
--market us \
--symbol AAPL \
--limit 5
export VELARIA_SEC_USER_AGENT="VelariaFinance/1.0 ops@example.com"
uv run --project python --extra finance python python/velaria_cli.py finance fetch-news \
--provider sec-filings \
--market us \
--symbol AAPL \
--limit 5
Fetch public U.S. fundamentals evidence through SEC Company Facts. If a symbol, market, or upstream endpoint cannot provide the data, the command returns a structured unavailable row instead of mock values. For production SEC access, set a descriptive application/contact User-Agent first:
export VELARIA_SEC_USER_AGENT="VelariaFinance/1.0 ops@example.com"
uv run --project python --extra finance python python/velaria_cli.py finance fetch-fundamentals \
--provider sec-companyfacts \
--market us \
--symbols AAPL,MSFT,NVDA
Fetch A-share historical data through Yahoo chart JSON or AkShare and write a Parquet dataset:
uv run --project python --extra finance python python/velaria_cli.py finance fetch-history \
--provider yahoo \
--market cn \
--symbol 000001 \
--start-date 20250101 \
--end-date 20250131 \
--output /tmp/velaria-cn-history.parquet
Fetch public quote rows and ingest them as a Velaria external_event source:
uv run --project python --extra finance python python/velaria_cli.py finance ingest-quotes \
--provider tencent \
--market cn \
--symbols 000001,600519 \
--source-id finance_cn_quotes
Watch one public quote symbol, append each observation to an external_event
source, run a monitor, and return FocusEvent plus analysis context:
uv run --project python --extra finance python python/velaria_cli.py finance watch \
--market cn \
--symbol 000001 \
--interval-sec 30 \
--iterations 0 \
--jsonl
The same finance commands are available to velaria_cli.py -i through the
registered agent tool velaria_cli_run. In agent mode, pass only the Velaria
subcommand, for example:
- finance doctor
- finance sources
- finance analyze --market cn --symbol 000001 --format json
- finance pipeline --market cn --symbol 000001 --start-date 20250101 --end-date 20250131 --iterations 1 --format json
- finance watch-session start --market us --symbols AAPL,MSFT,NVDA --start-date 20260501 --end-date 20260518 --iterations 0 --async-run --format json
- finance watch-session status --session-id us_watch_20260519 --format json
- finance watch-session review --session-id us_watch_20260519 --format json
- finance watch-session supervise --session-id us_watch_20260519 --interval-sec 60 --format json
- finance intelligence index --session-id us_watch_20260519 --format json
- finance intelligence search --session-id us_watch_20260519 --query "NVDA momentum risk news fundamentals" --format json
- finance intelligence report --session-id us_watch_20260519 --format json
- finance intelligence jobs --session-id us_watch_20260519 --format json
- finance intelligence status --session-id us_watch_20260519 --format json
- finance intelligence stop --session-id us_watch_20260519 --format json
- finance intelligence resume --session-id us_watch_20260519 --format json
- finance intelligence evaluate --session-id us_watch_20260519 --format json
- finance intelligence eval-report --session-id us_watch_20260519 --format json
- finance fetch-fundamentals --provider sec-companyfacts --market us --symbols AAPL,MSFT,NVDA
- finance watch --market cn --symbol 000001 --interval-sec 30 --iterations 0 --jsonl
Do not include uv, python, or python/velaria_cli.py in the tool arguments.
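The rule above (pass only the Velaria subcommand to velaria_cli_run) can be illustrated with a small helper that drops the launcher prefix from a full shell command. This is a sketch only: to_tool_arguments is a hypothetical name, it is not part of Velaria, and it assumes the launcher portion always ends at the token whose path ends in velaria_cli.py.

```python
import shlex


def to_tool_arguments(command_line):
    """Return only the Velaria subcommand tokens from a shell command.

    Everything up to and including the velaria_cli.py token is treated as
    launcher prefix (uv, python, script path) and removed. A command that
    does not mention velaria_cli.py is assumed to be a bare subcommand.
    """
    tokens = shlex.split(command_line)
    for index, token in enumerate(tokens):
        if token.endswith("velaria_cli.py"):
            return tokens[index + 1:]
    return tokens


full = (
    "uv run --project python --extra finance python python/velaria_cli.py "
    'finance intelligence search --session-id us_watch_20260519 '
    '--query "NVDA momentum risk news fundamentals" --format json'
)
print(to_tool_arguments(full))
```

shlex.split keeps the quoted --query value as a single token, so the resulting list can be forwarded without re-quoting.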
finance pipeline and finance rank-candidates do not require a
finance-specific service route. They write to the same Velaria AgenticStore
used by the local service. When finance rank-candidates --stream-monitor is
used, the command also creates entry/exit execution_mode=stream monitors and
executes them for each candidate ranking tick. If
velaria_service is started with the same VELARIA_HOME, the generic
external-events, monitors, and focus-events service routes can inspect
the source, monitor, ranking observations, and events created by the CLI.
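The entry/exit condition tree passed through --signal-policy earlier in this document can be read as nested all/any predicates over a candidate row, with the resulting flags feeding the stream SQL predicate WHERE entry_signal >= 1 OR exit_signal >= 1. The sketch below is one plausible reading of that JSON, not the actual policy engine: the evaluate function, the supported operator set (only the four operators that appear in the example), and the sample candidate row are assumptions.

```python
import json
import operator

# Only the operators that appear in the documented example policy.
OPS = {
    "=": operator.eq,
    "!=": operator.ne,
    ">=": operator.ge,
    "<=": operator.le,
}


def evaluate(node, row):
    """Evaluate a condition tree of all/any groups and leaf comparisons."""
    if "all" in node:
        return all(evaluate(child, row) for child in node["all"])
    if "any" in node:
        return any(evaluate(child, row) for child in node["any"])
    return OPS[node["op"]](row[node["field"]], node["value"])


policy = json.loads(
    '{"entry":{"all":[{"field":"momentum_state","op":"!=","value":"bearish"},'
    '{"field":"score","op":">=","value":0}]},'
    '"exit":{"any":[{"field":"news_sentiment_label","op":"=","value":"negative"},'
    '{"field":"quote_pct_change","op":"<=","value":-2}]}}'
)

# Hypothetical candidate row; the field names come from the policy above.
row = {
    "momentum_state": "bullish",
    "score": 3.5,
    "news_sentiment_label": "neutral",
    "quote_pct_change": -0.4,
}

flags = {
    "entry_signal": int(evaluate(policy["entry"], row)),
    "exit_signal": int(evaluate(policy["exit"], row)),
}
# Mirrors the generic predicate: entry_signal >= 1 OR exit_signal >= 1.
selected = flags["entry_signal"] >= 1 or flags["exit_signal"] >= 1
print(flags, selected)
```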
Run the public-data smoke against real AkShare endpoints:
uv run --project python --extra finance python python/examples/finance_public_data_smoke.py
When AkShare / Eastmoney is blocked by a local proxy or upstream network policy, verify public quote ingestion through Tencent's lightweight quote endpoint:
uv run --project python --extra finance python python/examples/finance_public_data_smoke.py --quotes-only
The standardized rows include provider evidence fields:
- provider
- source_url
- fetched_at
- freshness
- delay_sec
- license_note
U.S. quote freshness is reported from the provider path and may be delayed or unknown. The finance pack records that metadata instead of treating every quote as exchange-grade realtime data.
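As an illustration of carrying that evidence alongside each quote, the sketch below models the six fields as a small record and refuses to treat a row as realtime unless its metadata says so. The field names come from the list above; the types, the "realtime" label, the helper name, and the sample values (including the example.com URL) are assumptions, not the finance pack's schema.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ProviderEvidence:
    provider: str
    source_url: str
    fetched_at: str            # assumed to be an RFC 3339 timestamp string
    freshness: str             # e.g. "delayed" or "unknown"
    delay_sec: Optional[int]   # None when the provider does not report a delay
    license_note: str


def is_realtime(evidence):
    """Treat a row as realtime only when the metadata explicitly says so.

    "realtime" is a hypothetical label used for this sketch.
    """
    return evidence.freshness == "realtime" and not evidence.delay_sec


# Hypothetical delayed U.S. quote row.
sample = ProviderEvidence(
    provider="tencent",
    source_url="https://example.com/quote/AAPL",
    fetched_at="2026-05-18T15:45:00-04:00",
    freshness="delayed",
    delay_sec=900,
    license_note="public endpoint; research use only",
)
print(asdict(sample), is_realtime(sample))
```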
For research workflows, use finance_pack.build_research_prompt(...) to turn
FocusEvent objects and related datasets into a prompt for the interactive
Velaria Agent. The prompt requires live source links and states that output is
research assistance, not investment advice.
File reader mapping:
- Session.probe(path) returns inferred source kind, schema, normalized options, the final selected format, scored candidates, confidence, and warnings
- Session.read(path, ...) is the preferred batch file front door and probes the source automatically
- Session.read_csv(...) remains the explicit CSV override
- Session.read_line_file(path, mappings=[...], ...) maps to the native line/regex source connector
- Session.read_json(path, columns=[...], ...) maps to the native JSON lines / JSON array source connector
- all four file readers share the same source materialization knobs: materialization, materialization_dir, and materialization_format
- all four file readers also support cache_in_memory=True to retain the projected source snapshot inside the current session for repeated same-session queries; it is a reuse-oriented tradeoff, not a pure free cache hint, because it can bypass source pushdown on the first query
- velaria-cli file-sql defaults to --input-type auto and registers batch sources through CREATE TABLE ... OPTIONS(path: '...')
- versioned embedding datasets written as Parquet / Arrow should be loaded through pyarrow plus Session.create_dataframe_from_arrow(...) or load_embedding_dataframe(...), not Session.read(...)
- reusable keyword indexes are directory artifacts built from Arrow / Parquet or batch file inputs and should be queried through the service / helper APIs, not treated as plain table files
Regex line usage:
- regex is an explicit line-reader mode; it is not auto-probed
- mappings source indexes follow regex capture-group numbering:
  - 0 is the full match
  - 1..n are capture groups
- unmatched lines are skipped
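The numbering rules above match Python's standard re module, so they can be checked without Velaria. The sketch below emulates the mapping step with the same pattern and mappings used in this section's regex example; map_line is a hypothetical helper, and the native reader may infer column types where this sketch keeps every value as a string.

```python
import re

PATTERN = re.compile(
    r'^uid=(\d+) action="([^"]+)" latency=(\d+) ok=(true|false) note=(.+)$'
)
MAPPINGS = [("uid", 1), ("action", 2), ("latency", 3), ("ok", 4), ("note", 5)]


def map_line(line, pattern=PATTERN, mappings=MAPPINGS):
    """Map one log line to a row dict, or None when the line does not match."""
    match = pattern.match(line)
    if match is None:
        return None
    # Index 0 is the full match; 1..n are the capture groups.
    return {name: match.group(index) for name, index in mappings}


lines = [
    'uid=7 action="open" latency=12 ok=true note=first visit',
    "malformed line without the expected fields",
]
# Unmatched lines are skipped, mirroring the reader behaviour described above.
rows = [row for row in map(map_line, lines) if row is not None]
print(rows)
```

Mapping a column to index 0, for example mappings=[("raw", 0)], yields the whole matched line.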
Regex example:
regex_df = session.read_line_file(
"events.log",
mappings=[("uid", 1), ("action", 2), ("latency", 3), ("ok", 4), ("note", 5)],
mode="regex",
regex_pattern=r'^uid=(\d+) action="([^"]+)" latency=(\d+) ok=(true|false) note=(.+)$',
)
Minimal examples:
import velaria
session = velaria.Session()
csv_df = session.read_csv("input.csv")
probe = session.probe("events.jsonl")
auto_df = session.read("events.jsonl")
# probe includes:
# kind / final_format / score / confidence / schema / candidates / warnings
json_df = session.read_json(
"events.jsonl",
columns=["user_id", "action", "latency"],
format="json_lines",
)
json_array_df = session.read_json(
"events.json",
columns=["event", "cost"],
format="json_array",
)
nested_json_df = session.read_json(
"events_nested.json",
columns=["a", "b"],
format="json_array",
)
# JSON reader notes:
# - top-level rows must still be JSON objects
# - nested object fields are preserved as raw JSON strings
# - numeric JSON arrays are still parsed as vector values when read directly as a field
JSON source examples:
{"user_id":1,"action":"open","latency":12.5}
{"user_id":2,"action":"close","latency":9.0}

[
{"event":"open","cost":1.5},
{"event":"close","cost":2}
]

[
{"a":1,"b":{"b1":1}},
{"a":2,"b":{"b1":2,"b2":["x",3,null]}}
]
Nested-object result shape with columns=["a", "b"]:
[1, "{\"b1\":1}"]
[2, "{\"b1\":2,\"b2\":[\"x\",3,null]}"]
Current JSON limits:
- json_lines and json_array are supported
- each top-level row must be a JSON object
- columns=[...] is required for explicit JSON reads
- nested object values are returned as JSON text, not flattened columns
- a top-level scalar array such as ["a", "b"] is not supported as a table source
- nested arrays inside an object string are preserved inside that JSON text
- direct field values that are numeric JSON arrays still map to vector values
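The nested-object behaviour can be emulated with the standard json module. The sketch below is not the native reader: project_row is a hypothetical helper that applies two of the limits above (rows must be objects, nested objects become compact JSON text), and returning None for a missing column is an assumption.

```python
import json


def project_row(obj, columns):
    """Project one top-level JSON object onto the requested columns."""
    if not isinstance(obj, dict):
        raise ValueError("each top-level row must be a JSON object")
    row = []
    for name in columns:
        value = obj.get(name)  # missing column -> None (assumption)
        if isinstance(value, dict):
            # Nested objects are kept as compact JSON text, not flattened.
            value = json.dumps(value, separators=(",", ":"))
        row.append(value)
    return row


source = '[{"a":1,"b":{"b1":1}},{"a":2,"b":{"b1":2,"b2":["x",3,null]}}]'
rows = [project_row(obj, ["a", "b"]) for obj in json.loads(source)]
for row in rows:
    print(row)
```

With the nested sample shown earlier, the second column of each row holds the object as a JSON string, including the inner array.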
Embedding pipeline example:
from velaria import (
DEFAULT_LOCAL_EMBEDDING_MODEL,
DEFAULT_EMBEDDING_WARMUP_TEXT,
HashEmbeddingProvider,
Session,
SentenceTransformerEmbeddingProvider,
build_mixed_text_embedding_rows,
download_embedding_model,
materialize_mixed_text_embeddings,
run_mixed_text_hybrid_search,
)
records = [
{
"doc_id": "doc-1",
"title": "Alpha",
"summary": "Payment page timeout",
"tags": ["billing", "checkout"],
"bucket": 1,
"source_updated_at": 1,
},
{
"doc_id": "doc-2",
"title": "Beta",
"summary": "Refund delay in worker queue",
"tags": ["refund", "queue"],
"bucket": 2,
"source_updated_at": 2,
},
]
provider = HashEmbeddingProvider(dimension=8)
session = Session()
materialize_mixed_text_embeddings(
records,
provider=provider,
model="hash-demo",
template_version="text-v1",
text_fields=("title", "summary", "tags"),
output_path="docs_embeddings.parquet",
)
result = run_mixed_text_hybrid_search(
session,
"docs_embeddings.parquet",
provider=provider,
model="hash-demo",
query_text="payment page hangs during checkout",
where_sql="bucket = 1 AND doc_id = 'doc-1'",
top_k=2,
metric="cosine",
)
For a local semantic baseline with all-MiniLM-L6-v2, install the optional provider dependency first:
uv sync --project python --extra embedding
Then swap the provider:
provider = SentenceTransformerEmbeddingProvider(
model_name=DEFAULT_LOCAL_EMBEDDING_MODEL,
)
If you want to avoid remote Hub resolution on every machine, put the model files in a local directory and point the provider at that directory. Supported lookup order for the default MiniLM model is:
- VELARIA_EMBEDDING_MODEL_DIR
- python/models/all-MiniLM-L6-v2
- fallback to the Hugging Face model id
Example:
export VELARIA_EMBEDDING_MODEL_DIR=/absolute/path/to/all-MiniLM-L6-v2
export VELARIA_EMBEDDING_CACHE_DIR=/absolute/path/to/hf-cache
Then SentenceTransformerEmbeddingProvider(model_name=DEFAULT_LOCAL_EMBEDDING_MODEL) will load from the local directory instead of the Hub.
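The lookup order described above can be sketched as a three-step resolver. This is an illustration rather than the provider's code: resolve_model_source is a hypothetical name, the directory-existence checks are an assumption about how each step is validated, and the Hub id sentence-transformers/all-MiniLM-L6-v2 is assumed to be the fallback identifier.

```python
import os
from pathlib import Path


def resolve_model_source(
    env=None,
    repo_root=".",
    hub_id="sentence-transformers/all-MiniLM-L6-v2",
):
    """Return a local model directory if one is available, else the Hub id."""
    env = os.environ if env is None else env

    # 1. Explicit override through the environment variable.
    override = env.get("VELARIA_EMBEDDING_MODEL_DIR")
    if override and Path(override).is_dir():
        return override

    # 2. Model files placed under python/models/ in the repository.
    bundled = Path(repo_root) / "python" / "models" / "all-MiniLM-L6-v2"
    if bundled.is_dir():
        return str(bundled)

    # 3. Fall back to the Hugging Face model id (remote resolution).
    return hub_id


print(resolve_model_source())
```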
You can also explicitly pre-download and warm up the model before serving queries:
from velaria import (
DEFAULT_LOCAL_EMBEDDING_MODEL,
SentenceTransformerEmbeddingProvider,
download_embedding_model,
)
local_dir = download_embedding_model(DEFAULT_LOCAL_EMBEDDING_MODEL)
provider = SentenceTransformerEmbeddingProvider(model_name=DEFAULT_LOCAL_EMBEDDING_MODEL)
provider.warmup(
download_if_missing=False,
warmup_text="warmup embedding text",
)
Recommended startup flow:
- download_embedding_model(...) during environment/bootstrap time
- provider.warmup(...) once during process start
- run batch embedding or online query embedding after the model is already resident
CLI examples:
uv run --project python python python/velaria_cli.py file-sql \
--csv /tmp/input.csv \
--input-type csv \
--query "SELECT * FROM input_table LIMIT 5"
uv run --project python python python/velaria_cli.py file-sql \
--input-path /tmp/input.jsonl \
--input-type auto \
--query "SELECT * FROM input_table LIMIT 5"
uv run --project python python python/velaria_cli.py file-sql \
--input-path /tmp/events.log \
--input-type line \
--line-mode regex \
--regex-pattern '^uid=(\d+) action="([^"]+)" latency=(\d+) ok=(true|false) note=(.+)$' \
--mappings 'uid:1,action:2,latency:3,ok:4,note:5' \
--query "SELECT * FROM input_table LIMIT 5"
Mixed-text embedding pipeline through the CLI:
uv run --project python python python/velaria_cli.py embedding-build \
--input-path /tmp/docs.csv \
--input-type csv \
--text-columns title,summary,tags \
--provider minilm \
--output-path /tmp/docs_embeddings.parquet
uv run --project python python python/velaria_cli.py embedding-query \
--dataset-path /tmp/docs_embeddings.parquet \
--provider minilm \
--query-text "payment page hangs during checkout" \
--where-sql "bucket = 1 AND region = 'apac'" \
--top-k 5
# direct query from the raw file without a prebuilt embedding dataset
uv run --project python python python/velaria_cli.py embedding-query \
--input-path /tmp/docs.csv \
--input-type csv \
--text-columns title,summary,tags \
--provider minilm \
--query-text "payment page hangs during checkout" \
--where-sql "bucket = 1 AND region = 'apac'" \
--top-k 5
Reusable keyword-index build and BM25 keyword search through the service:
curl -sS http://127.0.0.1:37491/api/v1/runs/keyword-index-build \
-H 'Content-Type: application/json' \
-d '{
"input_path": "/tmp/docs.csv",
"input_type": "csv",
"text_columns": ["title", "body"],
"analyzer": "jieba"
}'
curl -sS http://127.0.0.1:37491/api/v1/runs/keyword-search \
-H 'Content-Type: application/json' \
-d '{
"index_path": "/tmp/keyword_index",
"query_text": "payment timeout",
"where_sql": "bucket = 1",
"top_k": 10
}'
Codex and Claude SDK adapter dependencies are declared by the Python package.
Configure the Claude adapter only when you want --runtime claude.
The agent runtime provides:
- Velaria-owned Agent CLI/TUI via velaria_cli.py or velaria_cli.py agent with Codex/Claude used only as runtime adapters
- Headless Agent turns via velaria_cli.py agent --print ... and JSONL event streaming via velaria_cli.py agent --stream-json ...
- Thread persistence under agentRuntimeWorkspace
- On-demand exposure of the Velaria usage skill as an MCP resource
- Velaria local functions exposed through the runtime bridge / MCP server: velaria_read, velaria_schema, velaria_sql, velaria_explain, velaria_dataset_download, velaria_dataset_import, velaria_dataset_normalize, velaria_dataset_process, velaria_cli_run, velaria_artifact_preview, velaria_sql_capabilities, velaria_sql_function_search, velaria_sql_query_patterns
- On-demand SQL reference resource velaria://sql/catalog for SQL v1 capabilities, scalar functions, and reusable query patterns
- Natural language to SQL generation via velaria_cli.py ai generate-sql
- Legacy compatibility commands under velaria_cli.py ai ...; new interactive work should use the default Agent entry or velaria_cli.py agent
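The JSONL event stream from agent --stream-json listed above can be consumed one line at a time. The sketch below shows generic JSONL parsing only: iter_jsonl is a hypothetical helper, and the event shapes in the sample are invented for illustration because this document does not define the event schema.

```python
import io
import json


def iter_jsonl(stream):
    """Yield one parsed JSON value per non-empty line of a text stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank separator lines
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number} is not valid JSON") from exc


# Hypothetical events; real event names and fields may differ.
sample = io.StringIO(
    '{"type": "turn.started"}\n'
    "\n"
    '{"type": "turn.completed", "text": "done"}\n'
)
events = list(iter_jsonl(sample))
print(events)
```

In practice the stream argument could be the stdout pipe of a subprocess running the CLI, since any iterable of text lines works.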
Both runtimes use the same ~/.velaria/config.json and agent* config keys.
Codex runtime (default):
{
"agentRuntime": "codex",
"agentAuthMode": "local",
"agentProvider": "openai",
"agentReasoningEffort": "none",
"agentRuntimeWorkspace": "~/.velaria/ai-runtime",
"agentCodexNetworkAccess": true
}
Claude runtime:
{
"agentRuntime": "claude",
"agentAuthMode": "local",
"agentProvider": "anthropic",
"agentModel": "claude-sonnet-4-20250514",
"agentReasoningEffort": "none",
"agentRuntimeWorkspace": "~/.velaria/ai-runtime",
"agentNetworkAccess": true
}
Defaults: Codex reuses the local Codex config model and falls back to
gpt-5.4-mini; set agentCodexModel only when Velaria should override the
local Codex model. Claude uses claude-sonnet-4-20250514.
Both default agentReasoningEffort to none. agentRuntimeWorkspace is the
runtime working directory used to save and resume agent threads; if omitted,
Velaria creates a project-scoped directory under ~/.velaria/ai-runtime/.
agentAuthMode: "local" reuses the local login; use agentAuthMode: "api_key"
with agentApiKey and agentBaseUrl for explicit credentials.
Use agentRuntimePath / agentCodexRuntimePath only when overriding the local
Codex runtime bridge. Use agentClaudeRuntimePath for the Claude SDK adapter.
Network access is controlled by agentCodexNetworkAccess (Codex) or
agentNetworkAccess (Claude); both default to true.
The runtime inherits standard proxy environment variables such as http_proxy,
https_proxy, and all_proxy.
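The defaults described above can be summarized in a short sketch. This is not Velaria's config loader: resolve_agent_config and load_agent_config are hypothetical helper names, the output keys model and networkAccess are made up for illustration, and only the defaults stated in this section are applied.

```python
import json
from pathlib import Path

CLAUDE_DEFAULT_MODEL = "claude-sonnet-4-20250514"


def resolve_agent_config(cfg: dict) -> dict:
    """Apply the documented agent* defaults to a parsed config dict (illustrative only)."""
    runtime = cfg.get("agentRuntime", "codex")  # Codex is the default runtime
    resolved = {
        "agentRuntime": runtime,
        "agentReasoningEffort": cfg.get("agentReasoningEffort", "none"),
    }
    if runtime == "claude":
        resolved["model"] = cfg.get("agentModel", CLAUDE_DEFAULT_MODEL)
        resolved["networkAccess"] = cfg.get("agentNetworkAccess", True)
    else:
        # None means "defer to the local Codex config", which in turn
        # falls back to gpt-5.4-mini when it names no model.
        resolved["model"] = cfg.get("agentCodexModel")
        resolved["networkAccess"] = cfg.get("agentCodexNetworkAccess", True)
    return resolved


def load_agent_config(path=None) -> dict:
    """Read ~/.velaria/config.json if it exists and resolve the defaults."""
    config_path = Path(path) if path else Path.home() / ".velaria" / "config.json"
    raw = json.loads(config_path.read_text(encoding="utf-8")) if config_path.exists() else {}
    return resolve_agent_config(raw)
```

Calling resolve_agent_config({}) yields the Codex runtime with network access enabled and no model override, which matches the "Codex runtime (default)" description.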
Velaria Agent keeps the underlying runtime generic. Velaria-specific usage
guidance and SQL function details are exposed on demand through MCP resources
and local functions, not by embedding the full skill or SQL function catalog in
the default prompt. Use velaria_sql_capabilities,
velaria_sql_function_search, velaria_sql_query_patterns, or the
velaria://sql/catalog MCP resource when an agent needs SQL details.
Agent CLI examples:
uv run --project python python python/velaria_cli.py
uv run --project python python python/velaria_cli.py agent --runtime claude
uv run --project python python python/velaria_cli.py agent --model gpt-5.4
uv run --project python python python/velaria_cli.py agent --print \
"Read data/sales.csv, aggregate amount by region, and save the run"
uv run --project python python python/velaria_cli.py agent --stream-json \
"summarize recent workspace runs"
uv run --project python python python/velaria_cli.py ai generate-sql \
--prompt "top 5 by score" --schema "name,score,region"
Inside the Agent TUI, use Ctrl+M to open the model picker for the current
runtime. Use /model <model-name> to switch to an arbitrary provider model
name without restarting the CLI. Model switches start a fresh Agent session and
are blocked while a turn is running or queued.
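For scripting against the headless agent --stream-json mode shown in the examples above, a consumer only needs to read stdout line by line. The sketch below assumes each event is one JSON value per line (JSONL); the event field names are not specified in this document, so the code does not inspect them. iter_jsonl_events and stream_agent_turn are hypothetical helper names.

```python
import json
import subprocess
from typing import Iterable, Iterator


def iter_jsonl_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON value per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def stream_agent_turn(prompt: str) -> Iterator[dict]:
    """Run one headless Agent turn from a source checkout and yield its events.

    Defined for illustration; it is not called in this sketch.
    """
    cmd = [
        "uv", "run", "--project", "python", "python",
        "python/velaria_cli.py", "agent", "--stream-json", prompt,
    ]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        yield from iter_jsonl_events(proc.stdout)
```

A caller would typically loop over stream_agent_turn("summarize recent workspace runs") and dispatch on whatever fields the events actually carry.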
Current SQL mapping carried by Python:
- Session.sql(...) maps to core SQL v1 batch semantics:
  - CREATE TABLE, CREATE SOURCE TABLE, CREATE SINK TABLE
  - INSERT INTO ... VALUES
  - INSERT INTO ... SELECT
  - SELECT with projection/alias, WHERE, GROUP BY columns/scalar expressions, ORDER BY, LIMIT, UNION / UNION ALL, and the current minimal JOIN
  - batch WHERE supports single predicates, column-to-column predicates, plus AND / OR expressions
  - batch KEYWORD SEARCH(title, body) QUERY '...' TOP_K ... on single-table non-aggregate queries
  - batch HYBRID SEARCH ... QUERY ... on single-table non-aggregate queries
  - current Python service can combine reusable keyword-index recall with vector rerank by passing both index_path and dataset_path to hybrid-search
  - batch builtins currently exposed through the same core path:
    - LOWER, UPPER, TRIM, LTRIM, RTRIM
    - LENGTH, LEN, CHAR_LENGTH, CHARACTER_LENGTH, REVERSE
    - CONCAT, CONCAT_WS, LEFT, RIGHT, SUBSTR / SUBSTRING, POSITION, REPLACE, CAST
    - ABS, CEIL, FLOOR, ROUND, YEAR, MONTH, DAY, ISO_YEAR, ISO_WEEK, WEEK, YEARWEEK, NOW, TODAY, CURRENT_TIMESTAMP, currentTimestamp, UNIX_TIMESTAMP
    - supported scalar functions can be nested in projection expressions
- Session.stream_sql(...), Session.explain_stream_sql(...), and Session.start_stream_sql(...) share the same stream SQL front-door checks:
  - source must be a source table / stream source
  - sink target must be a sink table
  - only the current stream-stable projection, filter, window, stateful aggregate, and bounded-source ORDER BY shapes are accepted
- current SQL v1 keeps ORDER BY scoped to columns present in the SELECT output
- unsupported SQL shapes are expected to surface as explicit parse / semantic / unsupported SQL v1 / table-kind errors from the core path, not Python-only behavior
Desktop / service import behavior:
- local file import preview still only inspects schema + preview rows
- when saving a dataset from the desktop app, embedding build and keyword-index build can both be configured and will run asynchronously in the background
- bitable import can build:
  - a reusable embedding dataset from selected text columns
  - a reusable keyword index from selected keyword columns
  - both in parallel within the same background import run
- packaged sidecar builds copy jieba dictionaries from the resolved Bazel cppjieba dependency at build time; the source repo does not need to carry those dictionary files
Stable Python layout in this repo:
- supported library: python/velaria/
- supported CLI tool: python/velaria_cli.py
- examples: python/examples/
- benchmarks: python/benchmarks/
- reserved experimental area: python/experimental/
- regression tests: python/tests/
Repository Python commands use uv.
Recommended local baseline:
- CPython 3.12 or 3.13
- uv
- local CPython headers (Python.h)
Bazel Python detection currently probes local CPython interpreters in the 3.9 to 3.13 range. If auto-discovery fails, set:
export VELARIA_PYTHON_BIN=/path/to/python3.13
That interpreter must expose Python.h; otherwise Bazel cannot build the native extension.
Bootstrap:
bazel build //:velaria_pyext
bazel run //python:sync_native_extension
uv sync --project python --python python3.13
If you run python/velaria_cli.py or other source-checkout Python entrypoints directly,
keep python/velaria/_velaria.so in sync with:
bazel run //python:sync_native_extension
Run demos:
uv run --project python python python/examples/demo_batch_sql_arrow.py
uv run --project python python python/examples/demo_stream_sql.py
uv run --project python python python/examples/demo_vector_search.py
uv run --project python python python/examples/demo_embedding_pipeline.py
Recommended regression entrypoint:
./scripts/run_python_ecosystem_regression.sh
Repository benchmark fixture generation:
- the stage benchmark can generate synthetic data at runtime
- if you want a locally realistic benchmark input, generate an anonymized CSV from a private raw export with:
uv run --project python python scripts/generate_stage_benchmark_fixture.py \
--input /path/to/raw_rows_100k.csv \
--output python/benchmarks/data/stage_input_100k_anonymized.csv
- keep that generated CSV local and untracked; it is ignored by .gitignore
The regression script (scripts/run_python_ecosystem_regression.sh) covers:
- native extension build
- wheel and native wheel build
- Bazel Python regression targets
- demo smoke
- CLI smoke
Benchmark regression entrypoint:
./scripts/run_python_stage_benchmark.sh
Embedding pipeline benchmark:
uv run --project python python python/benchmarks/bench_embedding_pipeline.py
For the local MiniLM provider:
uv sync --project python --extra embedding
uv run --project python python python/benchmarks/bench_embedding_pipeline.py \
--provider minilm \
--model sentence-transformers/all-MiniLM-L6-v2
The benchmark reports:
- batch embedding/materialization throughput
- online query embedding latency
- online hybrid search latency on the resulting embedding dataset
Core file-input benchmark entrypoint:
bazel run //:file_source_benchmark -- 200000 3
That benchmark currently reports:
- CSV hardcode / explicit / auto-probed paths
- CSV scan-only / full-columnar / full-row-materialize / projected / filter-pushdown / aggregate-pushdown sub-cases
- line split hardcode / explicit / auto-probed paths and direct filter-pushdown / aggregate-pushdown cases
- line regex parse and grouped-aggregate paths
- JSON lines hardcode / explicit / auto-probed paths and direct filter-pushdown / aggregate-pushdown cases
- SQL CREATE TABLE ... OPTIONS(path: '...') registration costs plus CSV / line / JSON predicate-pushdown comparisons
Current pushdown lowering also classifies source work into:
- ConjunctiveFilterOnly
- SingleKeyCount
- SingleKeyNumericAggregate
- Generic
Representative clean-main vs current snapshot on 200000 / 3:
- read_line_regex_explicit_group_sum: 5679936 us -> 641735 us
- sql_csv_predicate_and_group_count: 133011 us -> 109146 us
- sql_csv_predicate_or_group_count: 307313 us -> 171556 us
- sql_csv_predicate_mixed_group_count: 462000 us -> 275583 us
- sql_line_predicate_or_group_count: 314852 us -> 174627 us
- sql_json_predicate_or_group_count: 604404 us -> 420423 us
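To read the snapshot above as speedups rather than raw microseconds, divide the clean-main time by the current time. The sketch below only re-derives ratios from the six published pairs; speedups is a hypothetical helper and no new measurements are introduced.

```python
# (clean_main_us, current_us) pairs copied from the snapshot above
SNAPSHOT_US = {
    "read_line_regex_explicit_group_sum": (5679936, 641735),
    "sql_csv_predicate_and_group_count": (133011, 109146),
    "sql_csv_predicate_or_group_count": (307313, 171556),
    "sql_csv_predicate_mixed_group_count": (462000, 275583),
    "sql_line_predicate_or_group_count": (314852, 174627),
    "sql_json_predicate_or_group_count": (604404, 420423),
}


def speedups(snapshot):
    """Return {case: before / after}; values above 1.0 mean the current build is faster."""
    return {name: before / after for name, (before, after) in snapshot.items()}


for name, ratio in speedups(SNAPSHOT_US).items():
    print(f"{name}: {ratio:.2f}x")
```

The regex grouped-sum path shows by far the largest gain (close to 9x), while the SQL predicate cases land between roughly 1.2x and 1.8x.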
For Linux perf sampling on the native CSV path:
perf record --call-graph=dwarf bazel-bin/file_source_benchmark -- 200000 3
perf report
By default, scripts/run_python_stage_benchmark.sh generates benchmark input at runtime.
To use a local anonymized CSV instead, set VELARIA_STAGE_BENCH_CSV=/path/to/file.csv.
The default scenario is groupby_count_max.
Benchmark scenario controls:
- set VELARIA_STAGE_BENCH_SCENARIO=groupby_count_max for the caller_psm / count / max(latency) path
- set VELARIA_STAGE_BENCH_SCENARIO=filter_lower_limit for the LOWER(method) + filter + LIMIT path
- set VELARIA_STAGE_BENCH_QUERY="..." only when you intentionally want a custom Velaria query
- pass --cache-in-memory to python/benchmarks/bench_stage_paths.py when you want the reuse path to retain projected source columns in the current session
- when VELARIA_STAGE_BENCH_QUERY does not match the selected scenario query, also set VELARIA_STAGE_BENCH_SKIP_HARDCODE=1; otherwise the benchmark rejects the run
hardcode is only reported when it is semantically aligned with the selected scenario.
The benchmark now enforces row-count parity between the hardcode baseline and Velaria result
before it prints ratios.
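The environment controls above can be assembled in one place before launching the benchmark. This is a minimal sketch, assuming the variables behave as described in this section; stage_bench_env is a hypothetical helper, and it conservatively sets VELARIA_STAGE_BENCH_SKIP_HARDCODE=1 for every custom query because it cannot tell whether the query matches the scenario.

```python
import os
from typing import Optional


def stage_bench_env(
    scenario: str = "groupby_count_max",
    csv_path: Optional[str] = None,
    custom_query: Optional[str] = None,
    base_env: Optional[dict] = None,
) -> dict:
    """Build an environment mapping for scripts/run_python_stage_benchmark.sh."""
    env = dict(os.environ if base_env is None else base_env)
    env["VELARIA_STAGE_BENCH_SCENARIO"] = scenario
    if csv_path:
        # Use a local anonymized CSV instead of runtime-generated input.
        env["VELARIA_STAGE_BENCH_CSV"] = csv_path
    if custom_query:
        env["VELARIA_STAGE_BENCH_QUERY"] = custom_query
        # Assumption: a custom query differs from the scenario query, so the
        # hardcode baseline is skipped to avoid a rejected run.
        env["VELARIA_STAGE_BENCH_SKIP_HARDCODE"] = "1"
    return env
```

The result can be passed as env= to subprocess.run(["./scripts/run_python_stage_benchmark.sh"], check=True) from the repository root.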
Interpretation guardrails for the stage benchmark:
- Session.read_csv(...) and Session.sql(...) are setup/planning calls in this harness; they do not represent file scan or query execution time by themselves
- DataFrame.to_arrow() is the first materialization point in this harness, so its stage includes execution plus Arrow export
- to_pylist() only measures the Python-side conversion from the already materialized Arrow table
- the hardcode baseline in python/benchmarks/bench_stage_paths.py is a scenario-specific Python stdlib baseline built with csv.DictReader and manual logic; it is useful for relative comparison inside this harness, but it is not a native C/C++ kernel upper bound
- packaged CLI startup is a separate measurement surface from the Python API stage benchmark
Build targets:
- native extension: //:velaria_pyext
- sync built native extension into the source checkout: //python:sync_native_extension
- pure-Python wheel wrapper: //python:velaria_whl
- native wheel: //python:velaria_native_whl
- Python CLI: //python:velaria_cli
Single-file CLI packaging:
./scripts/build_py_cli_executable.sh
./dist/velaria-cli file-sql \
--csv /path/to/input.csv \
--query "SELECT * FROM input_table LIMIT 5"
That single-file CLI is packaged with PyInstaller --onefile, so cold-start measurements include
Python/bootstrap overhead in addition to engine work.
The CLI is part of the ecosystem layer. For supported paths, it should delegate to the same native session contract as Python and C++.
Repo-visible CLI entrypoints are:
- source checkout: uv run --project python python python/velaria_cli.py ...
- installed wheel or local package install: velaria-cli ... or velaria_cli ...
- packaged binary: ./dist/velaria-cli ...
The global commands are expected only after installing the wheel or package into your environment.
Every top-level command and subcommand supports --help.
Running velaria-cli without a subcommand starts the Velaria Agent TUI in a
TTY. Use velaria-cli agent --print "..." or
velaria-cli agent --stream-json "..." for script-friendly Agent turns.
The CLI also supports a local workspace layout for tracked runs and artifact indexing.
Default paths:
- runs: ~/.velaria/runs/<run_id>/
- index: ~/.velaria/index/artifacts.sqlite
You can override the root with:
export VELARIA_HOME=/tmp/velaria-home
Tracked run commands:
uv run --project python python python/velaria_cli.py run start -- file-sql \
--run-name "cn_slow_query_24h_2026-04-03" \
--description "score filter result for demo input" \
--tag cn \
--tag "slow-query,demo" \
--csv /path/to/input.csv \
--query "SELECT * FROM input_table LIMIT 5"
./dist/velaria-cli run start -- file-sql \
--run-name "cn_slow_query_24h_2026-04-03" \
--description "score filter result for demo input" \
--tag cn \
--tag "slow-query,demo" \
--csv /path/to/input.csv \
--query "SELECT * FROM input_table LIMIT 5"
uv run --project python python python/velaria_cli.py run list --tag cn --query "slow query" --limit 20
uv run --project python python python/velaria_cli.py run result --run-id <run_id>
uv run --project python python python/velaria_cli.py run diff --run-id <run_id> --other-run-id <other_run_id>
uv run --project python python python/velaria_cli.py run show --run-id <run_id>
uv run --project python python python/velaria_cli.py run status --run-id <run_id>
uv run --project python python python/velaria_cli.py artifacts list --run-id <run_id>
uv run --project python python python/velaria_cli.py artifacts preview --artifact-id <artifact_id>
uv run --project python python python/velaria_cli.py run cleanup --keep-last 10
The tracked workspace contract is:
- stdout returns JSON only
- logs go to stdout.log / stderr.log
- run.json can carry run_name, description, and tags for human-readable context and filtering
- run list returns summary-friendly fields such as artifact_count and duration_ms
- failures return structured JSON with error_type, phase, optional run_id, and details
- stream progress appends native snapshotJson() output to progress.jsonl
- stream explain keeps the native logical / physical / strategy structure
- large results stay in files under artifacts/; SQLite stores only index rows and small previews
- deleting run directories requires the explicit --delete-files switch
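Because stdout carries JSON only and failures are structured, a caller can treat the CLI as a small RPC surface. The sketch below is one possible wrapper, not part of Velaria: VelariaCliError and parse_cli_stdout are hypothetical names, and detecting a failure by the presence of a top-level error_type key is an assumption, since the success payload shape is not specified here.

```python
import json


class VelariaCliError(RuntimeError):
    """Carries the documented failure fields: error_type, phase, run_id, details."""

    def __init__(self, payload: dict):
        self.error_type = payload.get("error_type")
        self.phase = payload.get("phase")
        self.run_id = payload.get("run_id")  # optional in the contract
        self.details = payload.get("details")
        super().__init__(f"{self.error_type} during {self.phase}")


def parse_cli_stdout(stdout: str):
    """Parse CLI stdout as JSON and raise if it looks like a failure payload."""
    payload = json.loads(stdout)
    # Assumption: failures are objects that include an "error_type" key.
    if isinstance(payload, dict) and "error_type" in payload:
        raise VelariaCliError(payload)
    return payload
```

In practice the stdout string would come from subprocess.run(..., capture_output=True, text=True).stdout for any of the tracked run commands.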
End-to-end examples:
CSV SQL to parquet plus preview:
uv run --project python python python/velaria_cli.py run start -- file-sql \
--run-name "high_score_rows" \
--description "high score rows for local inspection" \
--tag local-demo \
--tag scores \
--csv /path/to/input.csv \
--query "SELECT name, score FROM input_table WHERE score > 10"
uv run --project python python python/velaria_cli.py run list --tag scores --query "high score"
uv run --project python python python/velaria_cli.py run result --run-id <run_id>
uv run --project python python python/velaria_cli.py run diff --run-id <run_id> --other-run-id <other_run_id>
uv run --project python python python/velaria_cli.py artifacts list --run-id <run_id>
uv run --project python python python/velaria_cli.py artifacts preview --artifact-id <artifact_id>
Stream SQL once plus status:
uv run --project python python python/velaria_cli.py run start -- stream-sql-once \
--source-csv-dir /path/to/source_dir \
--sink-schema "key STRING, value_sum INT" \
--query "INSERT INTO output_sink SELECT key, SUM(value) AS value_sum FROM input_stream GROUP BY key"
uv run --project python python python/velaria_cli.py run status --run-id <run_id>
For this action, the query still follows the core stream SQL boundary:
- --query must be INSERT INTO <sink> SELECT ...
- the source side must resolve to the stream source table created by the command
- the sink side must resolve to the sink table created from --sink-schema
- explain output remains logical / physical / strategy, and progress stays native snapshotJson()
Vector search plus explain artifact:
uv run --project python python python/velaria_cli.py run start -- vector-search \
--csv /path/to/vectors.csv \
--vector-column embedding \
--query-vector "0.1,0.2,0.3" \
--top-k 5
uv run --project python python python/velaria_cli.py artifacts list --run-id <run_id>
Hybrid search through the CLI keeps the same command and adds optional filter / threshold controls:
uv run --project python python python/velaria_cli.py vector-search \
--csv /path/to/vectors.csv \
--vector-column embedding \
--query-vector "0.1,0.2,0.3" \
--metric cosine \
--top-k 5 \
--where-column bucket \
--where-op = \
--where-value 1 \
--score-threshold 0.02
Batch SQL also supports a minimal hybrid search clause through file-sql:
uv run --project python python python/velaria_cli.py file-sql \
--csv /path/to/vectors.csv \
--query "SELECT id, bucket, vector_score FROM input_table WHERE bucket = 1 HYBRID SEARCH embedding QUERY '[0.1 0.2 0.3]' METRIC cosine TOP_K 5 SCORE_THRESHOLD 0.02"
Python ecosystem source groups:
- supported: //python:velaria_python_supported_sources
- examples and benchmarks: //python:velaria_python_example_sources
- experimental placeholder: //python:velaria_python_experimental_sources
Arrow is the preferred interop form for high-volume results.
Guidance:
- prefer Arrow/native columnar paths over to_rows() when benchmarking or integrating large results
- treat to_rows() as a convenience/debugging surface
- Session.sql(...) returns a lazy batch DataFrame handle
- to_arrow() / to_rows() trigger materialization; the first materialization stage includes execution plus conversion to the requested result form
- if you need pandas, use session.sql(...).to_arrow().to_pandas(); there is no direct DataFrame.to_pandas() helper in the current API
Supported Arrow ingestion inputs:
- pyarrow.Table
- pyarrow.RecordBatch
- pyarrow.RecordBatchReader
- objects implementing __arrow_c_stream__
- Python sequences of Arrow batches
Vector-preferred Arrow shape:
- FixedSizeList<float32>
Preferred local CSV vector text shape:
- [1 2 3]
- [1,2,3]
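The two CSV text shapes above differ only in the separator. As an illustration of what a producer or test fixture might need, the sketch below parses both into a list of floats. It is not Velaria's native parser, which may accept or reject inputs differently; parse_vector_text is a hypothetical helper.

```python
def parse_vector_text(text: str) -> list[float]:
    """Parse "[1 2 3]" or "[1,2,3]" into a list of floats (illustrative only)."""
    body = text.strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError(f"expected a bracketed vector, got: {text!r}")
    # Treat commas and whitespace as equivalent separators.
    tokens = body[1:-1].replace(",", " ").split()
    return [float(token) for token in tokens]
```

Both shapes parse to the same values, so a CSV writer can pick either one consistently for the vector column.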
Current vector search scope:
- local exact scan only
- metrics: cosine, dot, l2
- batch SQL supports a minimal HYBRID SEARCH ... QUERY ... clause
- CLI vector-search supports optional --where-column / --where-op / --where-value and --score-threshold
- no ANN / distributed execution / standalone vector DB behavior
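To make "local exact scan" concrete: every row is scored against the query vector and the best top_k are kept, with no index structure involved. The pure-Python sketch below illustrates that idea for the three metrics. It is not Velaria's kernel and is not meant to replace it; how Velaria orients scores (similarity versus distance) and how it applies the score threshold are not stated here, so the ordering below is an assumption.

```python
import math
from typing import Sequence


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def score(metric: str, a: Sequence[float], b: Sequence[float]) -> float:
    """Score two equal-length vectors under cosine, dot, or l2."""
    if metric == "dot":
        return _dot(a, b)
    if metric == "cosine":
        denom = math.sqrt(_dot(a, a)) * math.sqrt(_dot(b, b))
        return _dot(a, b) / denom if denom else 0.0
    if metric == "l2":
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    raise ValueError(f"unknown metric: {metric}")


def exact_top_k(rows, query, metric="cosine", top_k=5):
    """Score every (row_id, vector) pair and return the best top_k as (row_id, score)."""
    scored = [(row_id, score(metric, vec, query)) for row_id, vec in rows]
    # Assumption: cosine/dot rank higher-is-better, l2 ranks lower-is-better.
    scored.sort(key=lambda pair: pair[1], reverse=(metric != "l2"))
    return scored[:top_k]
```

The cost is linear in the number of rows for every query, which is the trade-off implied by having no ANN index.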
read_excel(...) reads .xlsx through:
- pandas.read_excel
- pyarrow.Table conversion
- Session.create_dataframe_from_arrow(...)
Example:
from velaria import Session, read_excel
session = Session()
df = read_excel(session, "/path/to/file.xlsx", sheet_name="Sheet1")
session.create_temp_view("staff", df)
print(session.sql("SELECT * FROM staff LIMIT 5").to_rows())
Supported ecosystem integrations include:
- Bitable-backed stream source flows
- custom Arrow stream sources
- custom Arrow stream sinks
These are supported as ecosystem integrations, not as alternate execution cores.
Python ecosystem regression targets:
- //python:streaming_v05_test
- //python:arrow_stream_ingestion_test
- //python:vector_search_test
- //python:read_excel_test
- //python:custom_stream_source_test
- //python:bitable_stream_source_test
- //python:bitable_group_by_owner_integration_test
Python-layer grouped suite:
//python:velaria_python_supported_regression
Root-level grouped suite:
//:python_ecosystem_regression
Python may:
- wrap
- package
- automate
- project ecosystem-friendly names
Python may not:
- redefine progress/checkpoint/explain semantics
- become the source of truth for runtime decisions
- introduce a second vector-search implementation for supported interfaces
- pull the native kernel back toward a row-first design for ecosystem convenience
For core boundaries, see docs/core-boundary.md. For stable runtime semantics, see docs/runtime-contract.md.