
VLA on SoCs: Deployment Notes (Landscape & Practical)

Ho T. Hung, An T. Le

Hanoi, Jan 2026

Scope: Vision-Language-Action (VLA) deployment (inference-first) for robotics / embodied agents on edge SoCs (phones/tablets, laptops-as-edge, SBCs, Jetson-class modules).
We cover: (1) model shapes (end-to-end VLA vs split VLM+policy), (2) optimization (quantization, pruning, distillation, cache/scheduling tricks), and (3) runtime lanes (Core ML/ANE, QNN, TensorRT, RKNN, LiteRT/ORT/OpenVINO) + profiling/packaging.


0) TL;DR

  • Best “quick win” today: run a sub‑1B VLA end-to-end (e.g., SmolVLA‑450M) or run a VLM/LLM + smaller action head; avoid “full-size” VLAs unless you have a Jetson AGX‑class device.
  • Main bottleneck: vision tokens + autoregressive decoding (and/or multi-step diffusion).
    Your biggest speedups usually come from:
    • Lower image resolution / fewer frames,
    • Pruning/caching visual tokens,
    • Quantizing weights (4–8 bit),
    • Action chunking + async control (decouple policy rate from control rate).
  • Deployment reality: You’ll typically mix runtimes:
    • Vision encoder on GPU/NPU (TensorRT / Core ML / QNN / RKNN),
    • LLM/VLM on GPU (Metal/CUDA) or NPU when supported,
    • Action head often stays on GPU/CPU unless it’s standard ops.

Living lists for efficient VLA


1) Reference VLA architectures

A) Autoregressive VLM -> discrete action tokens

Examples: OpenVLA, SmolVLA, many “token-based” VLAs

  • Encode image(s) -> visual tokens
  • Condition on instruction -> decode action tokens (often 7-DoF or chunked action sequences)
  • Great for generalization, but can be decode-bound and vision-token heavy.

B) VLM + diffusion / flow-matching action head

  • VLM provides task context; diffusion/flow head generates continuous actions.
  • Often more sample-efficient and smoother in control (improving with recent ideas such as Real-time Action Chunking), but multi-step inference can be too slow unless distilled (see Sec. 6).

C) Split pipeline (often the most SoC-friendly)

  • On-device perception (small VLM/ViT) + tiny policy/controller
  • Optional: off-device LLM “planner” (cloud or nearby GPU)
  • Sacrifices some end-to-end purity but is usually easier to ship.

D) Dual-system VLA (slow reasoning + fast control)

This is the “split pipeline” idea but formalized: a slow System 2 reasons, a fast System 1 executes (often at higher Hz), sometimes with partial parameter sharing.

Open implementations / references (usable code):


2) SoC landscape (pragmatic lanes)

Rule of thumb: pick the lane that matches your target device first, then pick a model + optimization strategy that the lane supports.

| Lane (typical devices) | Best-supported acceleration stack | What usually fits | Main gotchas |
|---|---|---|---|
| Apple Silicon (M‑series Macs, iPhone A‑series) | Core ML (ANE/GPU) via coremltools, + MLX / MLC‑LLM (Metal) | 1–7B LLMs on GPU; smaller VLM/VLA on-device | Core ML op coverage & model structure constraints; ANE best with int8 activations |
| Qualcomm Snapdragon / QCS / RBx | QNN via Qualcomm AI Hub, ORT QNN EP, ExecuTorch QNN, LiteRT‑LM | Small–mid LLMs/VLMs when op coverage is good; "static-ish" graphs | Compiler + op support constraints; Android packaging; memory bandwidth |
| NVIDIA Jetson Orin | TensorRT (+ TensorRT‑LLM branch), NanoLLM, CUDA kernels | Larger VLM/VLA possible (AGX Orin best) | Power/thermals; container compatibility; TensorRT op gaps |
| Rockchip RK35xx (RK3588, RK3576…) | RKNN/RKLLM toolchain | Small CV models; some LLMs via RKLLM | Toolchain/version churn; limited op support; quantization constraints |
| Other edge NPUs (MediaTek / Samsung / Intel / AMD / NXP…) | Usually LiteRT / NNAPI / OpenVINO / Vulkan / vendor SDK | Smaller models; mixed success for LLM/VLM | Ecosystem fragmentation; driver/EP maturity |

If you’re new: start with Apple (fast iteration) or Jetson (best “it just runs” for robotics). Qualcomm is excellent when you align with the QNN-supported subgraph.


3) Lane-specific toolchains

Apple (Core ML + Metal)

Caveat: MLC/MLX primarily use Metal GPU, not the ANE; ANE acceleration generally comes through Core ML.


Qualcomm (QNN ecosystem)

Caveat: QNN works best when your model can be lowered to a supported static subgraph (ops + shapes). Plan for fallbacks (GPU/CPU) for unsupported pieces.


NVIDIA Jetson Orin (CUDA + TensorRT)

Caveat: TensorRT-LLM support on Jetson exists but is more “branch / JetPack-coupled” than desktop; expect friction and pin versions.


Rockchip (RKNN / RKLLM)

Caveat: Expect frequent version changes and stricter conversion constraints than CUDA/Metal stacks.


3.1 Common model artifacts

When people say “deploy the VLA”, it often means producing multiple artifacts:

  • PyTorch checkpoint (.pt/.pth) - research + baseline correctness
  • ONNX (.onnx) - common interchange, especially for ORT / QNN EP
  • Core ML (.mlpackage / .mlmodel) - iOS/macOS app bundle, ANE/GPU targeting
  • QNN / QAIRT context binaries - Snapdragon NPU-deployable compiled assets
  • TensorRT engines (.plan) - Jetson GPU-optimized engines (build on-device or in matching containers)
  • RKNN / RKLLM formats - Rockchip-converted assets for NPU/runtime
  • GGUF / GGML-family - CPU/GPU inference stacks like llama.cpp (useful for LLM-only subsystems)
  • 1-bit quantized (BitNet) - extreme CPU-only quantization for LLM backbones; requires QAT retraining but delivers massive speedups + energy savings (Microsoft BitNet b1.58 ecosystem)

Practical implication: plan for multiple conversion + calibration passes, and budget time for “op coverage” debugging.


3.2 Profiling & debugging

You’ll want both latency and power/thermals:

  • Apple: Xcode Instruments (Time Profiler), Metal System Trace, Core ML model profiling tools
  • Android/Qualcomm: Perfetto/Systrace, Snapdragon Profiler (where available), QNN profiling logs, ORT profiling
  • Jetson: tegrastats, jtop, Nsight Systems/Compute, TensorRT verbose logs
  • General: measure p50/p95 latency and jitter; avoid “one warm run” benchmarks
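
To make the last point concrete, a minimal benchmark harness that discards warm-up runs and reports p50/p95 and jitter instead of a single number (the callable you pass in, e.g. a wrapped policy forward pass, is your own; nothing here is tied to a specific runtime):

```python
import time
import statistics

def benchmark(fn, warmup=5, iters=50):
    """Measure latency of fn(), discarding warm-up runs; returns p50/p95/jitter in ms."""
    for _ in range(warmup):              # let caches, JIT compilers, and clocks settle
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)  # ms
    p50 = statistics.median(samples)
    p95 = statistics.quantiles(samples, n=20)[18]  # last of 19 cut points = 95th percentile
    jitter = statistics.pstdev(samples)
    return {"p50_ms": p50, "p95_ms": p95, "jitter_ms": jitter}
```

Usage: `stats = benchmark(lambda: policy(obs, instruction))`, run it under sustained load (and on battery/thermal conditions matching deployment), not just once on a cold device.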

4) “Can I run a VLA on-device?” (reality check)

Large baseline: OpenVLA (7B-class)

Takeaway: feasible on AGX Orin; usually too heavy for phones/SBCs unless you heavily compromise (aggressive quantization + token reduction + low FPS).


5) Small-scale VLAs worth looking at (SoC-friendly)

For a “single-device demo” on consumer hardware, these are more realistic than 7B+ VLAs.

5.1 SmolVLA‑450M (Hugging Face / LeRobot) - practical baseline

Deployment hint: treat it like 3 deployable chunks: {vision encoder, language core, action head}. You may accelerate them with different runtimes.
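
A hedged sketch of what that split looks like in code: each chunk is just a callable, so any one of them can be swapped for a Core ML / TensorRT / QNN-backed runtime without touching the others. All names here are illustrative, not SmolVLA's actual API:

```python
from typing import Any, Callable

class SplitVLA:
    """Compose three independently deployable stages behind one policy interface."""
    def __init__(self,
                 vision: Callable[[Any], Any],          # image -> visual tokens
                 language: Callable[[Any, str], Any],   # tokens + instruction -> features
                 action_head: Callable[[Any], list]):   # features -> action chunk
        self.vision = vision            # e.g., an NPU-compiled encoder (Core ML / QNN)
        self.language = language        # e.g., a GPU runtime (Metal / CUDA)
        self.action_head = action_head  # often fine on GPU/CPU
    def __call__(self, image, instruction: str) -> list:
        tokens = self.vision(image)
        feats = self.language(tokens, instruction)
        return self.action_head(feats)
```

The payoff is that conversion failures stay local: if the language core won't lower to your NPU, only that stage falls back, and the vision encoder keeps its accelerated path.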


5.2 VLA-Adapter (tiny-scale VLA paradigm)


5.3 NORA (open VLA baseline with released code)


5.4 Evo-1 (lightweight VLA)


5.5 EdgeVLA (explicit “edge VLA” experiments)

  • Code: https://github.com/kscalelabs/evla
  • Why it’s useful for edge: explores training VLAs on small language models (e.g., Qwen2-class) and non‑autoregressive objectives.

5.6 FLOWER (rectified-flow action expert; efficient multi-step policy)


5.7 NinA (normalizing-flow action expert; FLOWER variant)


5.8 NanoVLA (routing + decoupling for edge) - paper-only


5.9 TinyVLA (tiny-scale policy learning)


5.10 RoboMamba (state-space / Mamba-style efficiency)


5.11 MoLe‑VLA (dynamic layer skipping)

5.12 SimVLA (minimal design)

6) Optimization playbook (SoC-focused)

Step 0 - Measure end-to-end (not just tokens/sec)

Track control-loop metrics:

  • Policy latency (p50/p95), jitter
  • Action update rate achieved on robot
  • Closed-loop success rate (simulation + real)
  • Power/thermals (sustained)

Step 1 - Make the model “edge-shaped”

Low-risk levers:

  • Reduce image resolution (e.g., 224² -> 160²) if success rate holds
  • Reduce frame rate and use action chunking
  • Prefer single image or a small temporal window (avoid long video token streams)
  • Enforce static shapes where your compiler needs it (QNN / TensorRT / Core ML)
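
To see why the resolution lever is so effective: in a ViT-style encoder, visual token count scales with (H/patch)·(W/patch), so a modest resolution drop roughly halves the tokens every later stage must attend over. A quick sanity check, assuming a patch size of 16:

```python
def visual_tokens(height: int, width: int, patch: int = 16) -> int:
    """Patch tokens a ViT-style encoder emits (ignoring CLS/register tokens)."""
    return (height // patch) * (width // patch)

print(visual_tokens(224, 224))  # 196 tokens
print(visual_tokens(160, 160))  # 100 tokens: ~2x fewer for a ~30% resolution cut
```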

Step 2 - Quantize (usually the biggest memory win)

  • LLM/VLM: int8 / int4 weights where kernels exist (platform-dependent)
  • Vision encoder: often int8-friendly; keep normalization consistent
  • Action head: fp16/bf16 often fine; quantize only if it’s a bottleneck

Apple note: coremltools explicitly supports 4-bit and 8-bit weight quantization (and optional 8-bit activations).
Qualcomm note: QNN compilation/EP often expects quantized graphs to reach NPU speedups.

VLA-specific quantization (recommended first step):

CPU-only note: for CPU-only edge deployment (no GPU/NPU), consider 1-bit quantization (BitNet b1.58) if you can retrain: it delivers large speedups (2.37x–6.17x on x86, 1.37x–5.07x on ARM) and energy savings (72–82% on x86, 55–70% on ARM). See [microsoft/BitNet][bitnet-repo] and the Optimizing Models notes.
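
For intuition about why retraining is required: BitNet b1.58-style weights are ternary {-1, 0, +1} with a per-tensor scale, far too coarse to impose post hoc. A sketch of the absmean quantizer as we understand it from the BitNet b1.58 paper (the runtime speedup comes from multiply-free kernels over these ternary values, not from this function itself):

```python
def absmean_ternary(weights: list[float]) -> tuple[list[int], float]:
    """BitNet b1.58-style quantization: scale by mean |w|, round-clip to {-1, 0, +1}."""
    gamma = sum(abs(w) for w in weights) / len(weights)
    q = [max(-1, min(1, round(w / gamma))) for w in weights]
    return q, gamma

q, g = absmean_ternary([0.9, -0.8, 0.05, -1.2])
# matmuls against q reduce to additions/subtractions (and skipping zeros)
```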

Step 3 - Cut visual tokens (often the biggest latency win)

Common patterns:

  • Token pruning (static or per-layer)
  • Token caching across frames (robotics has high temporal redundancy)
  • Late fusion / decoupling so you can reuse parts
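
A minimal sketch of the first pattern: score visual tokens and keep only the top-k, preserving order. Production methods derive importance from attention (which is what column-wise attention reductions like Flash-ColReduce compute efficiently); here a simple L2-norm proxy keeps the example dependency-free:

```python
def prune_tokens(tokens: list[list[float]], keep: int) -> list[list[float]]:
    """Keep the `keep` highest-scoring tokens (L2 norm as a stand-in for
    attention-based importance), preserving their original order."""
    scores = [sum(x * x for x in t) ** 0.5 for t in tokens]
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:keep]
    return [tokens[i] for i in sorted(top)]

toks = [[0.1, 0.1], [2.0, 1.0], [0.0, 0.0], [1.5, -1.5]]
print(prune_tokens(toks, keep=2))  # [[2.0, 1.0], [1.5, -1.5]]
```

Because decoder cost grows with sequence length, halving the visual tokens often cuts latency more than any weight-level optimization.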

Actionable starting points (with code):

Efficient token importance estimation (kernel library):

  • Flash-ColReduce: Triton kernels for column-wise attention reductions (sum/mean/max) with O(N) memory instead of O(N²); identifies which tokens matter most without materializing full attention; used in visual token pruning (e.g., SparseVILA). Code: https://github.com/z-lab/flash-colreduce

If you want a longer (fast-changing) list, start from:

Step 4 - Make action tokens efficient (quality + speed)

If action discretization is hurting dexterity or sequence length:

Step 5 - Reduce decoding overhead

VLM-heavy VLA note (if you have a large language decoder):

Step 6 - If you have diffusion policies, distill them

Multi-step diffusion is often the blocker for real-time on SoCs. Two complementary approaches:

Model-side (distillation):

Serving-side (inference optimization):

Step 7 - If you just need a faster expert, distill the VLA


7) Minimal deployment checklist (read this before you “start optimizing”)

  1. Choose runtime target per submodule (vision / language / action head)
  2. Confirm supported ops & shapes (QNN/Core ML/TensorRT)
  3. Choose quantization scheme that has kernels on your device
  4. Validate in sim (LIBERO / RLBench / MimicGen / Isaac Lab) + replay logs
  5. Only then chase “fancy” pruning / parallel decoding papers

8) “Good defaults” for an on-device robotics loop

  • Run the policy at 2–10 Hz (or slower), but run the controller at 50–200 Hz.
  • Use action chunking so one policy inference yields multiple fine-grained control steps.
  • Use async inference (don’t block camera capture/control threads).

Pseudo-skeleton:

```python
# policy_rate << control_rate
chunk = []  # buffer of future actions from the most recent policy call
while robot.is_running():
    obs = get_observation()  # camera + proprio
    if time_to_run_policy():
        chunk = policy(obs, instruction)   # e.g., predicts T future actions
    act = chunk.pop(0) if chunk else fallback_action()
    robot.step(act)  # high-rate control; fallback covers a drained chunk
```
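
The third bullet (async inference) can be sketched with a background worker thread, so the control loop only does non-blocking queue operations and never stalls on a slow policy. All policy/robot names are placeholders:

```python
import queue
import threading

def policy_worker(policy, obs_q: queue.Queue, chunk_q: queue.Queue):
    """Run policy inference off the control thread; exits on a None sentinel."""
    while True:
        obs = obs_q.get()
        if obs is None:
            return
        chunk_q.put(policy(obs))  # slow (ms to seconds) but never blocks control

# Control-loop side (sketch): push the freshest observation with obs_q.put(obs),
# drain chunk_q with get_nowait() when a new chunk is ready, and keep stepping
# the robot from the current chunk at full control rate in the meantime.
```

Dropping stale observations (e.g., a `Queue(maxsize=1)` that is emptied before each put) is usually better than queueing them, since the policy should always see the freshest frame.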

9) Where this note connects in the repo

  • For broader “SoC ML stacks” (training + inference, runtimes, compilers): see ML Training & Inference on SoCs.
  • For model compression theory/practice (quant/prune/distill): see the Optimizing Models notes.