Ho T. Hung, An T. Le
Hanoi, Jan 2026
Scope: Vision-Language-Action (VLA) deployment (inference-first) for robotics / embodied agents on edge SoCs (phones/tablets, laptops-as-edge, SBCs, Jetson-class modules).
We cover: (1) model shapes (end-to-end VLA vs split VLM+policy), (2) optimization (quantization, pruning, distillation, cache/scheduling tricks), and (3) runtime lanes (Core ML/ANE, QNN, TensorRT, RKNN, LiteRT/ORT/OpenVINO) + profiling/packaging.
- Best “quick win” today: run a sub‑1B VLA end-to-end (e.g., SmolVLA‑450M) or run a VLM/LLM + smaller action head; avoid “full-size” VLAs unless you have a Jetson AGX‑class device.
- Main bottleneck: vision tokens + autoregressive decoding (and/or multi-step diffusion). Your biggest speedups usually come from:
- Lower image resolution / fewer frames,
- Pruning/caching visual tokens,
- Quantizing weights (4–8 bit),
- Action chunking + async control (decouple policy rate from control rate).
- Deployment reality: You’ll typically mix runtimes:
- Vision encoder on GPU/NPU (TensorRT / Core ML / QNN / RKNN),
- LLM/VLM on GPU (Metal/CUDA) or NPU when supported,
- Action head often stays on GPU/CPU unless it’s standard ops.
- Awesome-VLA (includes an “Efficient-VLA” section): https://github.com/KwanWaiPang/Awesome-VLA
- Awesome Efficient VLA (taxonomy + code links): https://github.com/guanweifan/awesome-efficient-vla
- Survey (quick landscape): A Survey on Efficient Vision-Language-Action Models - https://arxiv.org/abs/2510.24795
Examples: OpenVLA, SmolVLA, many “token-based” VLAs
- Encode image(s) -> visual tokens
- Condition on instruction -> decode action tokens (often 7-DoF vectors or chunked action sequences)
- Great for generalization, but can be decode-bound and vision-token heavy.
- VLM provides task context; diffusion/flow head generates continuous actions.
- Often more sample efficient / smooth (improving with recent ideas such as Real-time Action Chunking), but multi-step inference can be too slow unless distilled (see Sec. 6).
- On-device perception (small VLM/ViT) + tiny policy/controller
- Optional: off-device LLM “planner” (cloud or nearby GPU)
- Sacrifices some end-to-end purity but is usually easier to ship.
This is the “split pipeline” idea but formalized: a slow System 2 reasons, a fast System 1 executes (often at higher Hz), sometimes with partial parameter sharing.
Open implementations / references (usable code):
- Fast-in-Slow (FiS-VLA): https://github.com/CHEN-H01/Fast-in-Slow (project: https://fast-in-slow.github.io/)
- OpenHelix: https://github.com/OpenHelix-Team/OpenHelix (project: https://openhelix-robot.github.io/)
- Hume: https://github.com/hume-vla/hume (project: https://hume-vla.github.io/)
Rule of thumb: pick the lane that matches your target device first, then pick a model + optimization strategy that the lane supports.
| Lane (typical devices) | Best-supported acceleration stack | What usually fits | Main gotchas |
|---|---|---|---|
| Apple Silicon (M‑series Macs, iPhone A‑series) | Core ML (ANE/GPU) via coremltools, + MLX / MLC‑LLM (Metal) | 1–7B LLMs on GPU; smaller VLM/VLA on-device | Core ML op coverage & model structure constraints; ANE best with int8 activations |
| Qualcomm Snapdragon / QCS / RBx | QNN via Qualcomm AI Hub, ORT QNN EP, ExecuTorch QNN, LiteRT‑LM | Small–mid LLMs/VLMs when op coverage is good; “static-ish” graphs | Compiler + op support constraints; Android packaging; memory bandwidth |
| NVIDIA Jetson Orin | TensorRT (+ TensorRT‑LLM branch), NanoLLM, CUDA kernels | Larger VLM/VLA possible (AGX Orin best) | Power/thermals; container compatibility; TensorRT op gaps |
| Rockchip RK35xx (RK3588, RK3576…) | RKNN/RKLLM toolchain | Small CV models; some LLMs via RKLLM | Toolchain/version churn; limited op support; quantization constraints |
| Other edge NPUs (MediaTek / Samsung / Intel / AMD / NXP…) | Usually LiteRT / NNAPI / OpenVINO / Vulkan / vendor SDK | Smaller models; mixed success for LLM/VLM | Ecosystem fragmentation; driver/EP maturity |
If you’re new: start with Apple (fast iteration) or Jetson (best “it just runs” for robotics). Qualcomm is excellent when you align with the QNN-supported subgraph.
- Core ML Tools (quantization / pruning / palettization):
- Quantization overview (supports 8-bit and 4-bit weights, optional 8-bit activations): https://apple.github.io/coremltools/docs-guides/source/opt-quantization-overview.html
- Quantization algorithms (RTN, GPTQ, QAT): https://apple.github.io/coremltools/docs-guides/source/opt-quantization-algos.html
- MLX (Metal-first research stack): https://github.com/ml-explore/mlx
- MLX-LM (practical LLM inference): https://github.com/ml-explore/mlx-lm
- MLC-LLM (Metal runtime for LLMs/VLMs): https://llm.mlc.ai/docs/deploy/ios.html
- vLLM on Apple Silicon (community): https://github.com/vllm-project/vllm-metal
Caveat: MLC/MLX primarily use Metal GPU, not the ANE; ANE acceleration generally comes through Core ML.
- Qualcomm AI Hub (models + compilation): https://aihub.qualcomm.com/get-started
- Example on-device model pages:
- Llama 3.2 3B Instruct: https://aihub.qualcomm.com/models/llama_v3_2_3b_instruct
- Qwen2.5 7B Instruct: https://aihub.qualcomm.com/models/qwen2_5_7b_instruct
- ONNX Runtime QNN Execution Provider docs:
- QNN EP overview: https://docs.qualcomm.com/bundle/publicresource/topics/80-62010-1/ort-qnn-ep.html
- ORT “build model assets for Snapdragon NPU”: https://onnxruntime.ai/docs/genai/howto/build-models-for-snapdragon.html
- ExecuTorch + QNN backend:
- Build/run Qualcomm backend: https://docs.pytorch.org/executorch/0.4/build-run-qualcomm-ai-engine-direct-backend.html
- Llama 3 3B tutorial (Android + QNN): https://docs.pytorch.org/executorch/1.0/llm/build-run-llama3-qualcomm-ai-engine-direct-backend.html
- LiteRT-LM (Google AI Edge, cross-platform LLM runtime):
Caveat: QNN works best when your model can be lowered to a supported static subgraph (ops + shapes). Plan for fallbacks (GPU/CPU) for unsupported pieces.
- Jetson AI Lab 2.0 (curated tutorials): https://www.jetson-ai-lab.com/
- OpenVLA guide (archived / deprecated but still useful): https://www.jetson-ai-lab.com/archive/openvla.html
- OpenVLA project page: https://openvla.github.io/
- Jetson Platform Services (VLM inference service): https://docs.nvidia.com/jetson/jps/inference-services/vlm.html
- Isaac ROS DNN inference (TensorRT/Triton nodes): https://github.com/NVIDIA-ISAAC-ROS/isaac_ros_dnn_inference
Caveat: TensorRT-LLM support on Jetson exists but is more “branch / JetPack-coupled” than desktop; expect friction and pin versions.
- RKNN-LLM stack (Rockchip): https://github.com/airockchip/rknn-llm
- RKNN-Toolkit2 (conversion/quant/inference): https://github.com/rockchip-linux/rknn-toolkit2
- RKNN model zoo: https://github.com/airockchip/rknn_model_zoo
Caveat: Expect frequent version changes and stricter conversion constraints than CUDA/Metal stacks.
When people say “deploy the VLA”, it often means producing multiple artifacts:
- PyTorch checkpoint (`.pt`/`.pth`) - research + baseline correctness
- ONNX (`.onnx`) - common interchange, especially for ORT / QNN EP
- Core ML (`.mlpackage`/`.mlmodel`) - iOS/macOS app bundle, ANE/GPU targeting
- QNN / QAIRT context binaries - Snapdragon NPU-deployable compiled assets
- TensorRT engines (`.plan`) - Jetson GPU-optimized engines (build on-device or in matching containers)
- RKNN / RKLLM formats - Rockchip-converted assets for NPU/runtime
- GGUF / GGML-family - CPU/GPU inference stacks like llama.cpp (useful for LLM-only subsystems)
- 1-bit quantized (BitNet) - extreme CPU-only quantization for LLM backbones; requires QAT retraining but delivers massive speedups + energy savings (Microsoft BitNet b1.58 ecosystem)
Practical implication: plan for multiple conversion + calibration passes, and budget time for “op coverage” debugging.
You’ll want both latency and power/thermals:
- Apple: Xcode Instruments (Time Profiler), Metal System Trace, Core ML model profiling tools
- Android/Qualcomm: Perfetto/Systrace, Snapdragon Profiler (where available), QNN profiling logs, ORT profiling
- Jetson: `tegrastats`, `jtop`, Nsight Systems/Compute, TensorRT verbose logs
- General: measure p50/p95 latency and jitter; avoid "one warm run" benchmarks
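The "avoid one warm run" advice can be made concrete with a small harness. A minimal, framework-agnostic sketch (the `run_inference` callable is a placeholder for one full policy forward pass on your device):

```python
import statistics
import time

def benchmark(run_inference, warmup=10, iters=100):
    """Measure p50/p95 latency and jitter over many runs.
    Warmup runs absorb cold-start, JIT, and cache-fill effects."""
    for _ in range(warmup):
        run_inference()
    samples_ms = []
    for _ in range(iters):
        t0 = time.perf_counter()
        run_inference()
        samples_ms.append((time.perf_counter() - t0) * 1e3)
    p50 = statistics.median(samples_ms)
    p95 = statistics.quantiles(samples_ms, n=20)[18]  # 95th percentile
    jitter = statistics.stdev(samples_ms)
    return {"p50_ms": p50, "p95_ms": p95, "jitter_ms": jitter}
```

On a robot, also log these per control-loop tick over a long session so thermal throttling shows up in the tail, not just the median.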
- OpenVLA is an open-source 7B VLA pretrained on Open X-Embodiment episodes: https://openvla.github.io/
- On Jetson AGX Orin-class devices, the Jetson AI Lab archive reports INT4/FP8/FP16 runs with a NanoLLM pipeline and publishes example latency/FPS/accuracy numbers: https://www.jetson-ai-lab.com/archive/openvla.html
Takeaway: feasible on AGX Orin; usually too heavy for phones/SBCs unless you heavily compromise (aggressive quantization + token reduction + low FPS).
For a “single-device demo” on consumer hardware, these are more realistic than 7B+ VLAs.
- Size: ~450M params
- Backbone: SmolVLM2 (vision encoder) + SmolLM2 (language decoder), trained as a robotics policy
- Action head: flow-matching transformer (for action prediction)
- Why it’s good for SoCs: small enough that quantization + token reduction often gets you into usable latency.
- Resources:
Deployment hint: treat it like 3 deployable chunks: {vision encoder, language core, action head}. You may accelerate them with different runtimes.
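The three-chunk idea can be sketched as a pipeline of interchangeable callables, each backed by whatever runtime your device supports. All names below are illustrative stubs, not SmolVLA's real API:

```python
# Sketch of the "three deployable chunks" idea: each stage is a callable
# that could be backed by a different runtime (Core ML, QNN, TensorRT, ...).
class SplitVLAPipeline:
    def __init__(self, vision_encoder, language_core, action_head):
        self.vision_encoder = vision_encoder    # e.g., int8 on NPU
        self.language_core = language_core      # e.g., int4 on GPU
        self.action_head = action_head          # e.g., fp16 on GPU/CPU

    def __call__(self, image, instruction):
        vis_tokens = self.vision_encoder(image)
        context = self.language_core(vis_tokens, instruction)
        return self.action_head(context)        # chunk of future actions

# Usage with trivial stubs standing in for compiled submodules:
pipe = SplitVLAPipeline(
    vision_encoder=lambda img: [img],
    language_core=lambda toks, instr: (toks, instr),
    action_head=lambda ctx: [[0.0] * 7],        # one 7-DoF action
)
actions = pipe("frame", "pick up the cube")
```

The payoff of this structure: you can swap one stage's backend (or quantization scheme) and re-benchmark without touching the others.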
- Project / paper hub: https://vla-adapter.github.io/
- Code: https://github.com/OpenHelix-Team/VLA-Adapter
- Why it’s useful for edge: focuses on tiny-scale designs and practical training/finetuning recipes.
- Code: https://github.com/declare-lab/nora
- Why it’s useful for edge: a practical open VLA codebase/checkpoints you can inspect + benchmark.
- Code: https://github.com/MINT-SJTU/Evo-1
- Why it’s useful for edge: designed for efficiency while preserving instruction/vision alignment.
- Code: https://github.com/kscalelabs/evla
- Why it’s useful for edge: explores training VLAs on small language models (e.g., Qwen2-class) and non‑autoregressive objectives.
- Code (CALVIN / LIBERO): https://github.com/intuitive-robots/flower_vla_calvin
- Code (pretraining on OXE): https://github.com/intuitive-robots/flower_vla_pret
- Why it’s useful for edge: flow-style action heads can be distilled / step-reduced more naturally than long autoregressive decoding.
- Code: https://github.com/dunnolab/NinA
- Why it’s useful for edge: swaps diffusion-style sampling for normalizing flows to reduce inference cost.
- Paper: https://arxiv.org/abs/2510.25122
- Useful ideas (even without code): late fusion (decouple V/L), routing, and chunking to reduce edge cost.
- Paper: https://arxiv.org/abs/2406.04339
- Project: https://robomamba.github.io/
Track control-loop metrics:
- Policy latency (p50/p95), jitter
- Action update rate achieved on robot
- Closed-loop success rate (simulation + real)
- Power/thermals (sustained)
Low-risk levers:
- Reduce image resolution (e.g., 224² -> 160²) if success rate holds
- Reduce frame rate and use action chunking
- Prefer single image or a small temporal window (avoid long video token streams)
- Enforce static shapes where your compiler needs it (QNN / TensorRT / Core ML)
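Back-of-envelope arithmetic for the resolution lever, assuming a ViT-style encoder with square images and a 16-px patch (common, but model-dependent):

```python
def vision_token_count(resolution, patch_size=16, frames=1):
    """Rough visual token budget: (resolution / patch)^2 per frame."""
    per_frame = (resolution // patch_size) ** 2
    return per_frame * frames

# Dropping 224^2 -> 160^2 roughly halves the visual token count:
print(vision_token_count(224))  # 196 tokens/frame
print(vision_token_count(160))  # 100 tokens/frame
```

Since prefill cost grows with token count (and attention is superlinear in sequence length), this is often the single cheapest latency win.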
- LLM/VLM: int8 / int4 weights where kernels exist (platform-dependent)
- Vision encoder: often int8-friendly; keep normalization consistent
- Action head: fp16/bf16 often fine; quantize only if it’s a bottleneck
Apple note: Core ML Tools explicitly supports 4-bit and 8-bit weight quantization (and optional 8-bit activations).
Qualcomm note: QNN compilation/EP often expects quantized graphs to reach NPU speedups.
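For intuition, here is a toy pure-Python sketch of round-to-nearest (RTN) symmetric int8 weight quantization, the simplest scheme the toolchains above implement (real toolchains typically do this per-channel with calibration):

```python
def quantize_rtn_int8(weights):
    """Round-to-nearest symmetric int8 quantization (per-tensor scale)."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # avoid scale=0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

w = [0.5, -1.27, 0.01, 1.0]
q, s = quantize_rtn_int8(w)
w_hat = dequantize(q, s)  # recovers w to within one quantization step
```

The reconstruction error is bounded by half a step (`scale / 2`) per weight; GPTQ/QAT exist precisely to shrink the *task-level* impact of that error beyond what RTN gives you.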
VLA-specific quantization (recommended first step):
- Scale-Calibrated Post-Training Quantization for VLA: QuantVLA - https://arxiv.org/abs/2602.20309
CPU-only edge note: if you have no GPU/NPU and can retrain, consider 1.58-bit quantization (BitNet b1.58): massive speedups (2.37x–6.17x on x86, 1.37x–5.07x on ARM) and energy savings (72–82% on x86, 55–70% on ARM). See microsoft/BitNet (https://github.com/microsoft/BitNet) and the Optimizing Models notes.
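The core of BitNet b1.58 is absmean-scaled ternarization: weights collapse to {-1, 0, +1} plus one scale, so matmuls become additions/subtractions. A toy sketch of that weight transform (the real method bakes this into QAT, not post-hoc conversion):

```python
def quantize_ternary(weights, eps=1e-8):
    """BitNet b1.58-style ternarization sketch: scale by the mean
    absolute weight, then round each weight into {-1, 0, +1}."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale

w = [0.9, -0.8, 0.05, -0.02]
q, s = quantize_ternary(w)  # q is a ternary mask, s restores magnitude
```

Note the zeros: small weights drop out entirely, which is where the sparsity-like energy savings come from.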
- Combined quant + token pruning (training-free): SQAP‑VLA - https://arxiv.org/abs/2509.09090 (code: https://github.com/ecdine/SQAP-VLA)
Common patterns:
- Token pruning (static or per-layer)
- Token caching across frames (robotics has high temporal redundancy)
- Late fusion / decoupling so you can reuse parts
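The temporal-redundancy argument behind token caching can be sketched in a few lines. Here each "patch" is a scalar and `encode` is a placeholder for the per-patch encoder; real systems (e.g., VLA-Cache) use learned or attention-based change metrics:

```python
def cached_tokens(prev_patches, cur_patches, prev_tokens, encode, tau=0.05):
    """Recompute a visual token only for patches that changed more than
    tau; reuse the cached token otherwise (robot scenes are mostly static)."""
    tokens, n_reused = [], 0
    for prev_p, cur_p, tok in zip(prev_patches, cur_patches, prev_tokens):
        if abs(cur_p - prev_p) <= tau:   # toy per-patch change metric
            tokens.append(tok)           # cache hit: reuse
            n_reused += 1
        else:
            tokens.append(encode(cur_p)) # cache miss: re-encode
    return tokens, n_reused

toks, reused = cached_tokens(
    prev_patches=[0.1, 0.5, 0.9],
    cur_patches=[0.1, 0.9, 0.9],         # only the middle patch moved
    prev_tokens=["t0", "t1", "t2"],
    encode=lambda p: "fresh",
)
```

With a static camera and a single moving arm, hit rates are often high, so the encoder's effective cost drops roughly in proportion.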
Actionable starting points (with code):
- Token caching across frames: VLA‑Cache - https://arxiv.org/abs/2502.02175 (code: https://github.com/siyuhsu/vla-cache)
- Quantization-aware pruning + token pruning: SQAP‑VLA - https://arxiv.org/abs/2509.09090 (code: https://github.com/ecdine/SQAP-VLA)
Efficient token importance estimation (kernel library):
- Flash-ColReduce: Triton kernels for column-wise attention reductions (sum/mean/max) with O(N) memory instead of O(N²); identifies which tokens matter most without materializing full attention; used in visual token pruning (e.g., SparseVILA). Code: https://github.com/z-lab/flash-colreduce
If you want a longer (fast-changing) list, start from:
- https://github.com/KwanWaiPang/Awesome-VLA (see “Efficient‑VLA”)
- https://github.com/guanweifan/awesome-efficient-vla
If action discretization is hurting dexterity or sequence length:
- FAST / FAST+ action tokenization: https://arxiv.org/abs/2501.09747
- Vector-quantized action tokenizers: VQ‑VLA - https://arxiv.org/abs/2507.01016 (code: https://github.com/xiaoxiao0406/VQ-VLA)
- Parallel decoding for action chunking: PD‑VLA - https://arxiv.org/abs/2503.02310 (paper-only; use as an idea bucket)
- Early-exit decoding + consistency distillation: CEED‑VLA - https://arxiv.org/abs/2506.13725 (code: https://github.com/OpenHelix-Team/CEED-VLA, project: https://irpn-eai.github.io/CEED-VLA/)
VLM-heavy VLA note (if you have a large language decoder):
- Block-diffusion parallel drafting: DFlash (code: https://github.com/z-lab/dflash)
Multi-step diffusion is often the blocker for real-time on SoCs. Two complementary approaches:
Model-side (distillation):
- One-step diffusion distillation: OneDP - https://arxiv.org/abs/2410.21257 (project: https://research.nvidia.com/labs/dir/onedp/)
- Fast generation from diffusion models: NVIDIA FastGen (code: https://github.com/NVlabs/FastGen)
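Why step count dominates: flow/diffusion heads integrate a learned velocity field from noise to action, one network call per step. A toy 1-D sketch (the constant `velocity` field and values are illustrative, not a trained model):

```python
def sample_flow(velocity, x0, steps):
    """Euler integration of a flow from t=0 to t=1. Each step costs one
    network call, so fewer steps = lower latency; distillation aims to
    make steps=1 accurate. `velocity` stands in for a learned v(x, t)."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity(x, t)
    return x

# If the field already equals the *average* velocity over [0, 1]
# (constant here), one Euler step lands exactly on the target -- the
# intuition behind average-velocity reformulations like MeanCache:
target = 0.8
v = lambda x, t: target
one_step = sample_flow(v, 0.0, steps=1)    # ~0.8
ten_steps = sample_flow(v, 0.0, steps=10)  # ~0.8
```

Distillation (OneDP-style) tries to train a student whose single-step output matches the teacher's many-step trajectory endpoint; caching (serving-side, below) instead reuses intermediate computation across steps.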
Serving-side (inference optimization):
- Cache-DiT: PyTorch-native, flexible inference engine with hybrid cache acceleration and parallelism for DiTs (code: https://github.com/vipshop/cache-dit)
- Accelerating flow-matching inference via average velocity: MeanCache - https://arxiv.org/pdf/2601.19961 (code: https://github.com/UnicomAI/MeanCache)
- Refined Policy Distillation (RL refinement to compact expert): https://arxiv.org/abs/2503.05833
- Choose runtime target per submodule (vision / language / action head)
- Confirm supported ops & shapes (QNN/Core ML/TensorRT)
- Choose quantization scheme that has kernels on your device
- Validate in sim (LIBERO / RLBench / MimicGen / Isaac Lab) + replay logs
- Only then chase “fancy” pruning / parallel decoding papers
- Run the policy at 2–10 Hz (or slower), but run the controller at 50–200 Hz.
- Use action chunking so one policy inference yields multiple fine-grained control steps.
- Use async inference (don’t block camera capture/control threads).
Pseudo-skeleton:

```python
# policy_rate << control_rate: the policy refills an action chunk at a
# low rate while the control loop drains it at a high rate.
chunk = []
while robot.is_running():
    obs = get_observation()                   # camera + proprio
    if time_to_run_policy():
        chunk = policy(obs, instruction)      # e.g., predicts T future actions
    act = chunk.pop(0) if chunk else fallback_action()
    robot.step(act)                           # high-rate control
```

- For broader “SoC ML stacks” (training + inference, runtimes, compilers): see ML Training & Inference on SoCs.
- For model compression theory/practice (quant/prune/distill): see the Optimizing Models notes.