- fix: replace AutoModelForVision2Seq with AutoModelForImageTextToText for transformers 5.x
AutoModelForVision2Seq was removed in transformers 5.x (shipped on AWS DL AMI). Use AutoModelForImageTextToText as the primary import with a fallback to AutoModelForVision2Seq for older transformers versions.
Files updated:
- openadapt_ml/training/grpo/trainer.py
- openadapt_ml/cloud/modal_cloud.py
- docs/grpo_trl_rewrite_draft.py (comment only)
Note: openadapt_ml/training/trl_trainer.py already had the correct
try/except pattern and was not modified.
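The fallback described above can be generalized into a small helper (the `import_first_available` name is hypothetical; the actual code uses an inline try/except around the transformers imports):

```python
import importlib

def import_first_available(module_name, class_names):
    """Return the first attribute of `module_name` that exists,
    trying `class_names` in order — a generalized version of the
    try/except import fallback."""
    module = importlib.import_module(module_name)
    for name in class_names:
        cls = getattr(module, name, None)
        if cls is not None:
            return cls
    raise ImportError(f"none of {class_names} found in {module_name}")

# With transformers installed, this would resolve the right auto-class
# regardless of version (sketch, not the actual openadapt-ml code):
# AutoVLM = import_first_available(
#     "transformers",
#     ["AutoModelForImageTextToText", "AutoModelForVision2Seq"],
# )
```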
Co-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com
- fix: use keyword args for Qwen VL processor to avoid positional conflict
Qwen2_5_VLProcessor.__call__() expects text= and images= as keyword arguments. Passing text positionally conflicts with the images kwarg: TypeError: got multiple values for argument 'images'
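The conflict can be reproduced with a mock that has the same parameter order as the real processor (`MockVLProcessor` is a stand-in; the real class ships with transformers):

```python
class MockVLProcessor:
    # Same parameter order as Qwen2_5_VLProcessor.__call__: images before text.
    def __call__(self, images=None, text=None):
        return {"images": images, "text": text}

proc = MockVLProcessor()

# Buggy call: the positional string binds to `images`, then the images=
# keyword collides -> TypeError: got multiple values for argument 'images'
try:
    proc("Describe the screen", images=["img.png"])
except TypeError as exc:
    error = str(exc)

# Fixed call: keyword arguments are order-independent.
out = proc(text="Describe the screen", images=["img.png"])
```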
Co-authored-by: Claude Opus 4.6 (1M context) noreply@anthropic.com
- Trigger release (7f8833c)
- Add first scored trace (Notepad Hello World, score 0.5) (ba44eaa)
6 steps, 91s, GPT-5.4-mini planner+grounder, lightweight mode. VLM judge passed milestone 2 (Hello World typed, confidence 1.00). Milestone 1 (process check) timed out during /execute_windows eval.
Co-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com
- fix: make heavy ML dependencies optional for lightweight installs
Move torch, torchvision, bitsandbytes, peft, and transformers from required dependencies to [project.optional-dependencies.training]. Wrap all top-level imports of these packages in try/except ImportError so the package can be imported without them installed.
This unblocks lightweight consumers (e.g. Wright worker installing openadapt-evals) that don't need local model training/inference. Users who need training can install with: pip install openadapt-ml[training]
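The guarded-import pattern looks roughly like this (the module name below is a stand-in guaranteed to be absent; the real code guards torch, transformers, peft, etc.):

```python
# Wrapping the heavy import means module import never fails; a clear error
# surfaces only when a training code path is actually exercised.
try:
    import _openadapt_fake_torch as torch  # stand-in for the heavy dependency
except ImportError:
    torch = None

def require_training_deps():
    """Call at the top of any function that needs the optional extras."""
    if torch is None:
        raise ImportError(
            "Training dependencies are missing; "
            "install with: pip install 'openadapt-ml[training]'"
        )
```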
Co-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com
- style: fix ruff formatting in qwen_vl.py
Co-authored-by: Claude Opus 4.6 (1M context) noreply@anthropic.com
The validation script called env.reset(task_id=...) but the actual API is env.reset(config=ResetConfig(task_id=...)). This caused Phase 2 to fail with TypeError.
Co-authored-by: Claude Opus 4.6 (1M context) noreply@anthropic.com
- Trigger release (c9da079)
Literature-backed design for task-specific LoRA adapters with runtime routing. Covers architecture, training pipeline, data collection (including correction flywheel as training data source), update economics, and validation plan. Positioned as one experiment track within the broader OpenAdapt experimentation framework.
Co-authored-by: Claude Opus 4.6 (1M context) noreply@anthropic.com
- feat: add evaluate_url, lora_checkpoint, validation script, and CLI for GRPO training
- Add evaluate_url field to GRPOConfig for separate evaluate endpoint
- Add lora_checkpoint field to resume GRPO from existing SFT LoRA adapter
- Pass evaluate_url through rollout collector to WAALiveConfig
- Load existing LoRA via PeftModel.from_pretrained() when lora_checkpoint set
- Update verl_backend.py error message with actionable instructions
- Add 5-phase validation script (connectivity → rollout → inference → train → multi-step)
- Add CLI entry point (scripts/run_grpo.py) for running GRPO without writing Python
Co-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com
- style: fix ruff formatting in config and validation script
Co-authored-by: Claude Opus 4.6 (1M context) noreply@anthropic.com
vLLM 0.11.0 pins torch==2.8.0. The GPU E2E validation (openadapt-evals PR #87) confirmed the full ML stack works with PyTorch 2.8.0+cu128. The previous >=2.9.1 constraint prevented installing openadapt-ml alongside vLLM in the same environment.
Co-authored-by: Claude Opus 4.6 noreply@anthropic.com
- feat: add dual training backend support (standalone + verl-agent)
Add backend field to GRPOConfig ("standalone" or "verl") to support switching between training backends:
- standalone: existing trainer.py (single-GPU, episode-level rewards)
- verl: verl-agent/VAGEN integration (multi-GPU, GiGPO per-step credit)
New verl_backend.py provides build_vagen_config() to map GRPOConfig to VAGEN-compatible config, and train_with_verl() as the integration point (placeholder until full end-to-end is wired up).
No existing function signatures or behavior modified.
Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com
- style: format verl_backend.py with ruff
Co-authored-by: Claude Opus 4.6 noreply@anthropic.com
- docs: add experimental roadmap and evidence context to vision
- Add 2x2 experimental matrix (retrieval × fine-tuning) to Core Thesis
- Add evidence context to benchmark table: note it's an internal synthetic benchmark (~3 UI elements) that validates the pipeline, not real-world performance. Link to openadapt-evals for ongoing WAA/OSWorld evaluation.
Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com
- fix: use 46.7% consistently in 2x2 matrix
The doc was showing a 33-47% range, which conflated preliminary (n=3) and full (n=45) results. The validated number is 46.7%.
- feat: add GRPO training module for online RL
Add openadapt_ml/training/grpo/ package with:
- GRPOConfig for training hyperparameters
- GRPORolloutCollector connecting to openadapt-evals RLEnvironment
- GRPOTrainer implementing custom GRPO loop for multimodal VLMs
- Binary reward function and group-relative advantage computation
- Chain-of-thought warm-up pipeline for SFT pre-training
- 20 unit tests passing without GPU
- fix: address review findings in GRPO module
- Replace copy.deepcopy(model) with LoRA state dict snapshot (prevents OOM)
- Mark _compute_rollout_loss as scaffold with dummy forward pass for grad flow
- Fix collect_rollout call to match RLEnvironment API (task_id in signature)
- Add model.eval()/model.train() toggling around rollout/training phases
- Remove unused gradient_accumulation_steps config field
- Use actual screen_size from RLEnvironment instead of hardcoded 1920x1200
- Clamp CLICK coordinates to [0.0, 1.0] to prevent invalid pixel values
- Validate task_ids non-empty at start of train()
- Export CoT warmup functions from package __init__
- Add BenchmarkAction fallback when openadapt-evals not installed
- Add 9 new tests: action parser (8) + empty task_ids validation (1)
- All 29 tests passing
- feat: implement GRPO loss computation and fix cot_warmup dependency
Implement the core _compute_rollout_loss method that was previously a NotImplementedError scaffold. The implementation:
- Reconstructs VLM prompts from rollout observations
- Formats actions back to DSL text via new _format_action_as_text helper
- Computes log-probabilities of action tokens under current policy
- Computes reference policy log-probs via PEFT disable_adapter() with fallback to manual LoRA weight swapping
- Returns GRPO loss: -advantage * log_prob + kl_coef * KL penalty
Also adds get_api_adapter() factory function to api_adapter.py, fixing the broken import in cot_warmup.py's generate_cot_annotations().
Additional review fixes from prior session:
- Initialize _is_unsloth and _ref_lora_state in __init__
- Remove dead else branch for task_id selection
- Fix total_loss device placement
- LoRA-only fallback save in checkpoint
- TYPE regex accepts single quotes
- Coordinate clamping in _parse_vlm_output_to_action

40 tests passing (10 new: 8 format_action + 1 roundtrip + 1 api_adapter).
- refactor: deduplicate GRPO prompts via shared _build_agent_messages
Extract prompt construction into _build_agent_messages() which imports SYSTEM_PROMPT from next_action.py (the SFT training prompt). This ensures the GRPO agent uses the same prompt distribution the model was warm-started on, and guarantees _make_agent_fn and _compute_rollout_loss use identical prompts (critical for correct log-prob computation).
- fix(grpo): address critical review findings in GRPO loss computation
- C-01: Store raw model output on action._grpo_raw_text for accurate loss
- C-02: Separate tokenization of prompt/action with concatenation to fix BPE boundary alignment
- I-01: Prefer LoRA weight swapping over disable_adapter() for reference policy (captures initial LoRA state after SFT warm-start)
- I-03: Per-step gradient accumulation via immediate backward() to prevent OOM from building the computation graph over all rollout steps
- I-04: Fix unescape order in TYPE parser (backslash before quotes)
- M-03: Pass model_name through get_api_adapter to ApiVLMAdapter
- M-07: Case-insensitive CLICK/TYPE regex in _parse_vlm_output_to_action
- L-01: Extract DEFAULT_SCREEN_SIZE constant, replace all hardcoded values
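The C-02 boundary issue can be demonstrated with a toy merge rule standing in for real BPE (this is an illustrative sketch, not the actual trainer code):

```python
def toy_bpe(text):
    """Toy tokenizer with one BPE-style merge: 'ab' becomes a single token."""
    tokens, i = [], 0
    while i < len(text):
        if text[i : i + 2] == "ab":
            tokens.append("ab")
            i += 2
        else:
            tokens.append(text[i])
            i += 1
    return tokens

prompt, action = "xa", "by"

# Tokenizing the joined string merges across the prompt/action boundary,
# so the action's tokens can no longer be located by a simple offset:
joint = toy_bpe(prompt + action)              # ['x', 'ab', 'y']

# Tokenizing separately and concatenating keeps the action span exact:
separate = toy_bpe(prompt) + toy_bpe(action)  # ['x', 'a', 'b', 'y']
action_tokens = separate[len(toy_bpe(prompt)):]  # ['b', 'y']
```

With real tokenizers the same misalignment happens whenever the prompt's last characters and the action's first characters form a mergeable pair.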
- fix(grpo): fix instruction propagation, screen size, weight swap safety
- CR-01: Task instruction was never populated during GRPO rollouts. WAALiveAdapter._get_observation() does not populate raw_observation, so the agent prompt said "Goal: " with nothing after it. Fix: store the instruction on the Rollout dataclass (populated from env._current_task in the collector) and use it in both agent_fn and _compute_rollout_loss.
- IM-01: Change DEFAULT_SCREEN_SIZE from 1920x1200 to 1920x1080 for consistency with the baselines module and standard VM configurations. Add a screen_size field to GRPOConfig so it is configurable.
- IM-02: Add try/finally around the LoRA weight swap in _compute_ref_log_probs. Without this, an exception during the reference forward pass permanently corrupts the model state.
- fix(grpo): remove unused torch import in _setup_model
The import torch at line 121 was flagged by ruff (F401) as unused. The surrounding code only calls .detach().clone() on tensor objects, which does not require the torch module directly.
- style(grpo): apply ruff formatting to GRPO module files
Run ruff format on cot_warmup.py, rollout_collector.py, and trainer.py to satisfy the CI ruff formatter check.
- refactor(grpo): replace custom trainer with minimal TRL bridge
Replace the 809-line custom GRPO trainer with ~280 lines that:
- Use standard HuggingFace AutoModelForVision2Seq + AutoProcessor + PEFT LoraConfig instead of Unsloth monkey-patching
- Implement standalone GRPO loss in ~15 lines of PyTorch (clipped surrogate) instead of custom policy gradient + KL penalty
- Use beta=0.0 (no KL penalty, no reference model) per DAPO/Open-Reasoner-Zero literature, eliminating weight-swap complexity
- Keep per-step backward to avoid OOM on long trajectories
- Use standard model.save_pretrained() for checkpointing
- Document WHY standalone GRPO math vs TRL GRPOTrainer (VLM multi-turn image pixel_values not stored in token IDs) and WHEN to switch
Preserves all public API: GRPOTrainer, _parse_vlm_output_to_action, _format_action_as_text, _build_agent_messages, DEFAULT_SCREEN_SIZE. All 50 tests pass (44 existing + 6 new for grpo_loss and trainer internals).
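The group-relative advantage and clipped surrogate can be sketched with plain floats (the real implementation operates on torch tensors of per-token log-probs; function names here are illustrative):

```python
import math

def group_relative_advantages(rewards):
    """GRPO advantage: reward minus the group mean, scaled by group std
    (std falls back to 1.0 when all rewards are equal)."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]

def grpo_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped-surrogate loss averaged over a group of rollouts."""
    losses = []
    for lp, old_lp, adv in zip(log_probs, old_log_probs, advantages):
        ratio = math.exp(lp - old_lp)  # importance ratio; 1.0 when on-policy
        clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)
```

Note that with single-epoch on-policy updates `ratio` is exactly 1.0, so the clipping never fires — which is why the later commit renames this to policy_gradient_loss.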
- feat(grpo): add E2E tests with artifact generation and architecture docs
- tests/test_grpo_e2e.py: 5 E2E tests (training loop, rollout collection, loss convergence, weight diff, mathematical properties) using a tiny mock VLM. Produces 65+ artifacts (JSON traces, PNGs, checkpoints, summaries).
- scripts/grpo_e2e_report.py: CLI report generator for test artifacts (text + optional HTML output)
- docs/grpo_e2e_test_design.md: design rationale for the E2E test approach
- docs/grpo_architecture_analysis.md: analysis of custom vs TRL-based GRPO
- docs/grpo_trl_rewrite_draft.py: TRL v0.29.0 integration research
- docs/strategic_analysis_evals_ml_synergy.md: business/economics analysis
- fix(grpo): address self-review findings (BUG-01, CLEAN-01 through -05)
- Rename grpo_loss to policy_gradient_loss with an honest docstring: single-epoch on-policy means ratio=1.0, clipping never fires, so this is REINFORCE with group-relative advantages. Keep grpo_loss as a backwards-compatible alias.
- Add public aliases: parse_vlm_output_to_action, format_action_as_text (drop underscore prefix for public API)
- Export policy_gradient_loss and public functions from __init__.py
- Remove unused config fields: kl_coef (was 0.01 but never used with beta=0), max_seq_length (never referenced)
- Fix model_name default: Qwen/Qwen2.5-VL-7B-Instruct (not the unsloth variant)
- Fix trivial test assertion: grad_norm > 0 (was >= 0, always true)
- Update loss tests to verify gradient direction, not just loss sign
- Add test_public_api_exports for new public names

56 tests pass (51 unit + 5 E2E).
Co-authored-by: Claude Opus 4.6 noreply@anthropic.com
PR titles become squash merge commit messages. Without the fix:/feat: prefix, python-semantic-release skips the release. Document this requirement prominently in CLAUDE.md.
Co-authored-by: Claude Opus 4.6 noreply@anthropic.com
- docs: add mandatory branch/PR rule to CLAUDE.md
Adds explicit instruction that all changes must go through feature branches and pull requests. enforce_admins has been enabled on GitHub to prevent admin bypass of branch protection.
Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com
- fix(modal): remove unused os import
Fixes ruff F401 lint error on modal_cloud.py.
Co-authored-by: Claude Opus 4.6 noreply@anthropic.com
- modal: Fix inference container image and multi-modal message handling (6aef712)
  - Pin transformers==4.57.3 (matches local, has Qwen3-VL support)
  - Add torchvision dependency (required by AutoVideoProcessor)
  - Add fallback: AutoModelForVision2Seq -> Qwen2_5_VLForConditionalGeneration
  - Add fallback: AutoProcessor -> Qwen2_5_VLProcessor
  - Reconstruct multi-modal messages with {"type": "image"} placeholders for proper vision token generation in apply_chat_template
  - Rename container_idle_timeout -> scaledown_window (Modal API update)
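The message reconstruction can be sketched as follows (the `(role, text, num_images)` tuple shape is an assumption for this sketch, not the actual modal_cloud.py signature):

```python
def to_multimodal_messages(turns):
    """Rebuild chat messages with one {"type": "image"} placeholder per
    attached image, so apply_chat_template emits vision tokens at the
    right positions in the prompt."""
    messages = []
    for role, text, num_images in turns:
        content = [{"type": "image"} for _ in range(num_images)]
        content.append({"type": "text", "text": text})
        messages.append({"role": role, "content": content})
    return messages

msgs = to_multimodal_messages([("user", "Click the OK button.", 1)])
```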
Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com
- modal: Add inference serving with call_inference API (f45c524)
  - Add _build_inference_app() for Modal GPU inference with PEFT adapter
  - Add upload_adapter_to_volume() for uploading adapters to Modal volume
  - Add call_inference() as the primary API for remote inference
  - Add 'serve' CLI command for interactive model serving
  - Container caches model in memory across calls (container_idle_timeout=600)
  - Support --no-adapter for zero-shot base model serving
Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com
- modal: Apply fixes from first successful Modal training run (3b38b77)
  - Add `serialized=True` to `@app.function` for non-global-scope support
  - Auto-create volume before upload, add `--force` for overwrites
  - Fix variable scoping (vol = training_volume) inside remote function
  - Add `openadapt-ml[training]` to container image dependencies
  - Use `--jsonl` flag in train subprocess for correct data path
  - Add `modal` to project dependencies
  - Update test to verify create+put two-call pattern
Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com
- cloud: Add Vast.ai and Modal GPU providers (dd0aad2)
Vast.ai (~$0.17/hr A10): SSH+rsync marketplace model with full CLI (list, launch, terminate, train) matching lambda_labs.py pattern. Includes GPU search, --gpu-wait retry, auto-convert --demo-dir flow.
Modal ($30/mo free, $1.10/hr A10G): Python-native cloud with zero-ops training via decorated functions and Modal Volumes for data transfer. CLI: train, status, download, list-volumes.
Both support the same --demo-dir end-to-end pipeline as Lambda Labs.
53 new tests (34 Vast.ai + 19 Modal), all passing.
Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com
- train: Add end-to-end pipeline automation with --demo-dir flag (d883e39)
Add prepare_bundle() and generate_screenshot_mapping() to convert_demos.py for single-call demo conversion. Extend both train.py and lambda_labs.py train commands with --demo-dir, --captures-dir, --mapping flags so the full pipeline (mapping → conversion → bundle → upload → train) runs as one command. Add --gpu-wait for Lambda GPU availability retry loop.
Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com
- SFT training pipeline with demo conversion, Lambda Labs integration, and data persistence (#29, e56c9e4)
- feat(training): add demo conversion pipeline for ms-swift SFT format
Convert annotated demo JSON files to JSONL training data compatible with ms-swift for Qwen3-VL fine-tuning. Handles coordinate conversion from [0,1] to [0,1000] range, generates blocks from observation and intent fields, and accumulates action history across steps.
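The coordinate conversion can be sketched as a small helper (the `to_qwen_range` name is hypothetical; only the [0,1] → [0,1000] mapping comes from the commit):

```python
def to_qwen_range(x, y):
    """Map normalized [0, 1] screen coordinates to the [0, 1000] integer
    grid used in the ms-swift training data, clamping out-of-range input."""
    clamp = lambda v: min(max(v, 0.0), 1.0)
    return round(clamp(x) * 1000), round(clamp(y) * 1000)
```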
Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com
- feat(training): add screenshot linking via mapping file
Support --mapping flag for pre-computed screenshot mapping JSON that maps task_id -> {step_index -> screenshot_path}. This correctly handles the coalesced step-to-raw-capture mapping (where step indices skip due to merging). Also adds --captures-dir with DB-based fallback and DOUBLE_CLICK/RIGHT_CLICK parsing.
- fix(training): align SFT format with inference prompt and add validation
- Add system role message to training conversations matching qwen3vl_agent.SYSTEM_PROMPT
- Add "Output exactly one action" / thinking instruction to user message matching _build_prompt() output
- Add coordinate range validation warning for values outside [0, 1]
- Add input schema validation for required demo/step fields
- Remove broken _resolve_screenshots_from_db() and _resolve_screenshots_direct() fallbacks that produced silently wrong mappings for coalesced demos
- Remove --screenshot-dir CLI arg (unreliable for coalesced demos)
- Keep --mapping (recommended) and capture API as screenshot resolution methods
- feat(training): add JSONL training pipeline with bundle support
Align convert_demos output with internal SFT format (images + messages), add train_from_jsonl() loader, --jsonl flag to train.py, --bundle flag to convert_demos and Lambda Labs train command. Enables training on annotated demo data without Episode objects.
- fix(training): add TRL callback, 4-bit quantization, early stopping
- Add OpenAdaptCallback for training_log.json output + early stop on loss
- Fix _load_standard_model to use BitsAndBytesConfig for 4-bit quantization
- Use AutoModelForImageTextToText (supports Qwen3-VL) instead of Qwen2VL class
- Switch demo config to 2B model for fast iteration on A10
- Hide Azure ML Jobs panel when cloud_provider is not azure
- Fix Lambda setup: remove uv.sources before uv sync on remote
- fix(cloud): use git archive for code sync, fix callback MRO
Replace rsync with `git archive HEAD | ssh tar` in sync_local_code() to send only committed tracked files (~10MB vs ~1.8GB with binary artifacts).
Fix callback class MRO: _OpenAdaptCallback must precede TrainerCallback so our on_log/on_train_begin override the no-op base implementations.
- refactor(training): consolidate duplicated SFTTrainer setup
Extract _run_sft_training() shared by train_with_trl() and train_from_jsonl(), eliminating ~80 lines of duplicated SFTConfig, SFTTrainer instantiation, and training loop code.
- feat(training): add plateau-based early stopping
Add early_stop_min_delta and early_stop_plateau_patience to stop training when loss stops improving by at least min_delta for N consecutive steps. Works alongside the existing absolute threshold.
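The plateau logic can be sketched as a tiny stateful class (the `PlateauStopper` name is hypothetical; the real code lives in the TRL callback):

```python
class PlateauStopper:
    """Stop when loss has not improved by at least min_delta for
    `patience` consecutive steps."""

    def __init__(self, min_delta=0.01, patience=10):
        self.min_delta = min_delta
        self.patience = patience
        self.best = float("inf")
        self.stale = 0

    def should_stop(self, loss):
        if loss < self.best - self.min_delta:
            self.best = loss   # meaningful improvement: reset the counter
            self.stale = 0
        else:
            self.stale += 1    # plateau step
        return self.stale >= self.patience
```

This runs alongside the absolute-threshold stop: either condition ends training.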
- docs: add GPU hosting options and training pipeline gap analysis
The GPU hosting options doc covers 24 platforms ranked by value for open-source projects needing free/credited GPU compute for VLM fine-tuning.
- docs: add Qwen3-VL-2B training results analysis
Detailed analysis of first fine-tuning run: 27.24 → 9.77 loss (64% reduction) over 50 steps on 20 annotated WAA demo samples. Includes per-epoch breakdown, compute efficiency metrics, and recommendations for future training runs.
- fix: remove absolute paths from repo, fix LoRA task_type, add tests
- Remove screenshot_mapping.json (has absolute local paths), add to .gitignore, add screenshot_mapping.example.json instead
- Fix LoRA task_type: always use CAUSAL_LM (Qwen-VL is decoder-only, not encoder-decoder like T5/BART that needs SEQ_2_SEQ_LM)
- Add 57 tests for convert_demos (action parsing, coordinate conversion, step conversion, validation, bundle creation) and training callback (log writing, threshold early stopping, plateau detection)
- fix: address self-review issues (config wiring, security, tests)
- Wire lr_scheduler_type, weight_decay, max_grad_norm, target_modules from YAML config through to SFTConfig (were silently ignored)
- Fix command injection in lambda_labs.py via shlex.quote()
- Fix callback writing loss=0 on non-loss log events (track _last_loss)
- Fix WAIT() mapping to wait() instead of finished() in convert_demos
- Fix CI: add --no-sources to uv sync for uv.sources compatibility
- Add test for non-loss log event callback behavior
- Update SYSTEM_PROMPT comment (remove stale cross-reference)
- ci: fix uv.sources with UV_NO_SOURCES env var, skip integration tests
UV_NO_SOURCES=1 covers uv sync, uv run ruff, and uv run pytest. Integration tests require openadapt_evals which is not a dependency.
- style: format annotate.py with ruff
- feat: auto-generate training plots, persist data with checkpoint
- Add plot_training.py: generates loss curve, LR schedule, and combined plots from training_log.json using matplotlib
- Copy training_log.json + plots into checkpoint directory after training so artifacts are self-contained and never lost
- Add periodic rsync of training_log.json during Lambda training (every 5 min) so data survives instance interruption
- Replace ASCII loss curve in training results doc with real PNG plots
- Add reconstructed training_log.json from Qwen3-VL-2B demo run
- test: add tests for plot generation and checkpoint co-location
11 tests covering:
- Loss plot generation (with/without LR data)
- Output directory creation and defaults
- Empty data handling
- Epoch boundary rendering
- Real training log validation
- Checkpoint co-location of log + plots
Co-authored-by: Claude Opus 4.6 noreply@anthropic.com
Replace branch = "main" (v7/v8 key) with [tool.semantic_release.branches.main] table (v9 key).
The old key is silently ignored by v9, causing releases to never trigger on the main branch.
Co-authored-by: Claude Opus 4.6 noreply@anthropic.com
- demo-prompt: Add VLM-annotated traces for 3 recorded demos (6d2e0a9)
Ran annotation pipeline (GPT-4o) on all 3 recorded captures:
- 37e10fc4 (notifications): 5 steps — turn off system notifications
- 0c9dda13 (archive): 9 steps — create Archive folder, move .docx files
- 366de66e (notepad): 6 steps — open Notepad, create draft.txt
These grounded traces replace fabricated hand-written demos for demo-conditioned evaluation.
Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com
- deps: Bump openadapt-capture to >=0.3.0, add uv.sources (c120b0e)
The new recording format uses recording.db (not capture.db). Local editable source ensures lockfile resolves correctly.
Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com
- Add demo GIFs back to README (f725872)
Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com
- Rewrite CLAUDE.md — remove migration guide, match pure ML scope (392bd0e)
Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com
- Rewrite README for professional open-source style (e26032b)
Replace 1100-line README containing stale VM/pool references with clean 220-line README reflecting what the package actually contains post-migration. Use test.yml badge instead of release.yml for accurate build status.
Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com
- feat: remove evaluation infrastructure (moved to openadapt-evals)
All evaluation infrastructure (~13,000 lines) has been migrated to openadapt-evals (PR #29). This PR removes the now-redundant code from openadapt-ml, making it a pure ML package.
Deleted files:
- benchmarks/cli.py (8,503 lines - VM/pool CLI)
- benchmarks/azure_vm.py (AzureVMManager)
- benchmarks/pool.py (PoolManager)
- benchmarks/vm_monitor.py, azure_ops_tracker.py, resource_tracker.py
- benchmarks/azure.py, viewer.py, pool_viewer.py, trace_export.py
- benchmarks/waa_deploy/ (Docker agent deployment)
- tests/test_quota_auto_detection.py, test_demo_persistence.py
- tests/benchmarks/test_api_agent.py, test_waa.py
Updated:
- benchmarks/__init__.py: only exports ML agents (PolicyAgent, etc.)
- pyproject.toml: removed azure-ai-ml, azureml-core, azure-mgmt-*
- CLAUDE.md: removed CLI/VM/pool docs, added migration guide
Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com
- fix: update stale references to migrated benchmark modules
Update all remaining references to deleted benchmark modules across source code, scripts, and tests:
- cloud/local.py: azure_ops_tracker, session_tracker, CLI subprocess calls
- scripts/: p0/p1 validation scripts, screenshot generators, quota checker
- training/benchmark_viewer.py: HTML template CLI references
- experiments/waa_demo/runner.py: docstring and print references
- deprecated/waa_deploy/__init__.py: import path
All now point to openadapt_evals equivalents.
- docs: update README references to migrated CLI
All VM/pool CLI commands moved from openadapt_ml.benchmarks.cli to openadapt-evals (oa-vm). Update all README references.
Co-authored-by: Claude Opus 4.6 noreply@anthropic.com
- demo-prompt: Add VLM annotation pipeline for recorded demos (d1c1bc7)
Converts raw recordings (coordinates + screenshots) into structured text traces matching the hand-written demo format. Uses VLM to annotate each step with screen observation, intent, semantic action, and result.
Pipeline: capture → episode → coalesce → annotate (VLM) → validate → format
Key components:
- Step coalescing (500 raw actions → 5-30 meaningful steps)
- Click marker rendering on screenshots for VLM
- Before+after frame pairs for grounded result descriptions
- Sequential context (previous step annotation feeds into next)
- Compact formatting matching hand-written demo shape
- Runner integration with annotated > hand-written priority
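The coalescing step can be sketched as a time-gap grouping (simplified; the real pipeline also merges by action type, and `coalesce_steps` is a hypothetical name):

```python
def coalesce_steps(events, gap_ms=500):
    """Group raw input events into steps: events closer than gap_ms join
    the current step; a longer pause starts a new one."""
    steps, current = [], []
    for timestamp_ms, event in events:
        if current and timestamp_ms - current[-1][0] > gap_ms:
            steps.append(current)  # pause detected: close the current step
            current = []
        current.append((timestamp_ms, event))
    if current:
        steps.append(current)
    return steps

raw = [(0, "move"), (120, "click"), (180, "move"), (2500, "key")]
steps = coalesce_steps(raw)  # two steps: the click burst, then the keypress
```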
Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com
WAA's run.py expects test config as {domain: [task_ids...]} dict, but --task wrote a bare JSON array [task_id] causing TypeError when run.py indexes by domain string key.
Now looks up the task's domain from test_all.json inside the container and writes the correct {domain: [task_id]} format.
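The lookup-and-wrap step can be sketched as (the `build_task_config` name is hypothetical; the domain → task-id mapping shape is taken from the description above):

```python
def build_task_config(task_id, test_all):
    """Return the {domain: [task_ids]} dict WAA's run.py expects,
    looking up the task's domain in the test_all.json structure."""
    for domain, task_ids in test_all.items():
        if task_id in task_ids:
            return {domain: [task_id]}
    raise KeyError(f"task {task_id!r} not found in any domain")

test_all = {"notepad": ["task_a", "task_b"], "edge": ["task_c"]}
config = build_task_config("task_c", test_all)  # {"edge": ["task_c"]}
```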
Co-authored-by: Claude Opus 4.6 noreply@anthropic.com
The vendor/WindowsAgentArena submodule pointed to unpushed local commits (a956c5b) that don't exist upstream, breaking git-based pip installs.
- Remove the submodule entirely (not a runtime dependency)
- Embed the 9-line compute-instance-startup.sh as a constant in cli.py
- Update path references in Azure ML commands to not depend on vendor/
Co-authored-by: Claude Opus 4.6 noreply@anthropic.com
GITHUB_TOKEN cannot push version-bump commits to branches with PR protection. Use org-level ADMIN_TOKEN instead, with skip-check to prevent infinite loops on release commits.
Co-authored-by: Claude Opus 4.6 noreply@anthropic.com
- refactor(benchmarks): extract library API from CLI for programmatic usage
Extract core VM management and pool lifecycle logic from cli.py into importable modules (azure_vm.py, pool.py) with clean Python APIs.
- Add AzureVMManager class with Azure SDK primary path + az CLI fallback
- Add PoolManager class for pool create/wait/run/cleanup lifecycle
- Add configurable resource_group via Settings, env var, or --resource-group flag
- Support DefaultAzureCredential for enterprise SSO/service principals
- CLI handlers become thin wrappers delegating to library classes
- Add agent_factory parameter stub on PoolManager.run() for pluggable agents
All 327 tests pass, CLI surface unchanged.
Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com
- style: fix pre-existing ruff lint errors in pool_viewer and resource_tracker
Remove unused import json and unused variable worker_re in pool_viewer.py, and unused import Optional in resource_tracker.py.
- style: run ruff formatter on benchmarks modules
- fix(azure_vm): add SDK path for set_auto_shutdown via generic resource API
Auto-shutdown schedules are Microsoft.DevTestLab/schedules resources. Use azure-mgmt-resource (already a dependency) to create them via the generic resource client, with az CLI fallback if SDK fails.
Co-authored-by: Claude Opus 4.6 noreply@anthropic.com
- cli: Run ruff formatter (714268c)
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- feat(benchmarks): add HTML viewer for WAA pool benchmark results
Add pool_viewer.py module and CLI command for generating interactive HTML viewers from WAA parallel benchmark runs.
Features:
- Parse waa-pool-*.log files to extract task results
- Summary stats (total tasks, success rate, avg time per task)
- Per-worker breakdown showing tasks per worker
- Task list with pass/fail status and step counts
- Domain breakdown with per-domain success rates
- Interactive filters for domain and status
Usage:
    uv run python -m openadapt_ml.benchmarks.cli view-pool
    uv run python -m openadapt_ml.benchmarks.cli view-pool --run-name pool_run_20260204
    uv run python -m openadapt_ml.benchmarks.cli view-pool --no-open
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- docs(claude): document VM auto-shutdown and orphan prevention
Add documentation for the auto-shutdown feature:
- Explain auto-shutdown policy (default 4 hours)
- Document --auto-shutdown-hours flag for pool-create and create
- Document -y flag for pool-cleanup (skip confirmation)
- Document test VM cleanup via try/finally
- docs(readme): update CLI commands to use pool-* workflow
Update documentation to reflect the current working CLI:
- Replace outdated vm monitor with pool-status/pool-vnc/pool-logs
- Update single VM workflow to use pool-create --workers 1
- Add analyze_pool_logs.py script for parsing benchmark results
- fix(cli): prevent orphaned test VMs during pool-create
Remove --no-wait flag from test VM creation so the VM fully exists before we attempt to delete it. Previously, the test VM would still be provisioning when delete was called, causing delete to fail silently and leave orphaned VMs consuming quota.
- fix(cli): use waa-auto image in pool-wait, wait for apt lock
Critical fixes for end-to-end pool workflow:
- Use waa-auto:latest in pool-wait (not windowsarena/winarena)
  - pool-create builds waa-auto with modern dockurr/windows v5.14
  - pool-wait was incorrectly using vanilla windowsarena/winarena (v0.00)
  - v0.00 doesn't support VERSION=11e auto-download
  - This caused "ISO file not found" errors
- Wait for apt lock before Docker install
  - Fresh Azure VMs run unattended-upgrades
  - apt-get install failed with "unable to locate package"
  - Added wait loop for /var/lib/apt/lists/lock
- fix(pool): match working waa command parameters exactly
- Use vanilla windowsarena/winarena:latest with --entrypoint /bin/bash
- Add --prepare-image false --start-client false flags (skips ISO download)
- Use 172.30.0.2 for probe and emulator_ip (matching the working waa command)
The pool-wait command was broken because it used waa-auto:latest without the proper entrypoint and flags. The working 'waa' command (line 5404-5454) uses these exact parameters successfully.
Co-authored-by: Claude Opus 4.5 noreply@anthropic.com
- cli: Resolve ruff linter errors (6084161)
- Replace bare `except:` with `except Exception:` - Remove unused f-string prefixes - Remove unused variable assignments - Remove unused imports
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- docs(readme): add parallel WAA evaluation section, fix build badge
- Fix broken build badge (publish.yml → release.yml) - Add prominent "Parallel WAA Benchmark Evaluation" section near top - Add detailed "WAA Benchmark Workflow" section (#14) with: - Single VM and parallel pool workflows - VNC access instructions - Architecture diagram - Cost estimates - Update section numbering (Limitations → 15, Roadmap → 16)
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- fix(readme): address self-review feedback
- Fix anchor placement (move before heading for proper navigation) - Correct pool-delete → pool-cleanup (actual command name) - Add pool-status example for getting worker IPs - Add "prices vary by region" caveat
Co-authored-by: Claude Opus 4.5 noreply@anthropic.com
- cli: Improve pool-create reliability and error handling (6ead5ff)
- Properly clean up test VM and associated resources during quota check - Use sudo for docker pull (usermod not effective in same session) - Add pool-cleanup command for orphaned resources - Show full error messages in pool creation failures
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- pool: Use WAA native task distribution with --worker_id/--num_workers (69082e4)
- Fixed task distribution: WAA ignores --start_idx/--num_tasks, use native --worker_id and --num_workers parameters instead - Worker 0 gets tasks 0, N, 2N... Worker 1 gets tasks 1, N+1, 2N+1... - Use vanilla windowsarena/winarena image with correct IP (20.20.20.21) - Add container reuse check (skip restart if already running) - Pass API key via env var instead of config file - Fix QMP port exposure (7200) for QEMU control - Store Windows disk on /mnt for 300GB temp storage (D8ds_v5)
Tested: 2-worker pool running 4 tasks in parallel successfully
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
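The interleaved distribution described above (worker 0 takes tasks 0, N, 2N, ...) can be sketched as a stride slice; `worker_tasks` is an illustrative helper, not the actual WAA implementation:

```python
def worker_tasks(task_ids, worker_id, num_workers):
    """Interleaved (round-robin) split: worker i gets tasks i, i+N, i+2N, ..."""
    return task_ids[worker_id::num_workers]
```

With 3 workers over tasks 0-9, worker 0 gets [0, 3, 6, 9] and worker 1 gets [1, 4, 7], matching the pattern in the commit message.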
- waa: Use D4ds_v4 VM size for quota compatibility (8dfa40d)
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- waa: Use D8ds_v5 VM size for Azure ML workers (e9dc820)
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- Add Azure ML log streaming and cost tracking guides (728f274)
Document the new CLI commands for: - Live log streaming from Azure ML jobs - Cost tracking for compute instances - Teardown procedures
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- cli: Add Azure ML log streaming, cost tracking, and teardown (401e36d)
Add comprehensive Azure ML management commands: - azure-ml-stream: Stream logs from running jobs using Python SDK with account key auth (works around DefaultAzureCredential permission issues) - azure-ml-cost: Track compute instance uptime and estimated costs - azure-ml-teardown: Cancel jobs and delete compute instances
Also improves: - azure-ml-quota: Shows both ML Dedicated quota (what Azure ML actually uses) and regular VM quota - Better error handling and logging throughout
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- cli: Add Azure ML status, VNC, and monitor commands (055ecc3)
New commands for end-to-end Azure ML automation: - azure-ml-status: Show jobs and compute instances
- azure-ml-vnc: Set up VNC tunnel to compute instance - azure-ml-monitor: Monitor jobs with auto VNC setup
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- cli: Add azure-ml-quota command for quota management (5b34170)
Semi-automated quota increase workflow: - Checks current quota for WAA-compatible VM families - Shows which families have sufficient quota - Opens Azure Portal quota page with instructions - Guides user through the request process
Usage: uv run python -m openadapt_ml.benchmarks.cli azure-ml-quota
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- cli: Add multi-VM pool commands for parallel WAA evaluation (d988e56)
Add pool-create, pool-wait, and pool-run commands for running WAA benchmarks across multiple VMs in parallel:
- pool-create --workers N: Create N VMs with Docker and WAA image - Parallel VM creation using ThreadPoolExecutor - Auto-selects available region and VM size - Configures Docker with /mnt storage - Registers pool for tracking
- pool-wait: Wait for WAA to be ready on all workers - Starts WAA containers on each worker - Polls /probe endpoint until ready - Configurable timeout
- pool-run --tasks N: Distribute tasks across pool - Round-robin task distribution - Parallel execution on all workers - Progress tracking in registry
This enables ~5x faster benchmark completion with 5 workers, or full 154-task evaluation in ~10min with 10+ workers.
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- waa: Update submodule with SDK v2 migration (241ddf8)
Updates WindowsAgentArena submodule to include Azure ML SDK v2 migration that enables job submission from macOS ARM64.
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- ci: Remove build_command from semantic-release config (6bd7ded)
The python-semantic-release action runs in a Docker container where uv is not available. Let the workflow handle building instead.
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- Add auto-release workflow (e6d067b)
Automatically bumps version and creates tags on PR merge: - feat: minor version bump - fix/perf: patch version bump - docs/style/refactor/test/chore/ci/build: patch version bump
Triggers publish.yml which deploys to PyPI.
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- Switch to python-semantic-release for automated versioning (404f26f)
Replaces manual commit parsing with python-semantic-release: - Automatic version bumping based on conventional commits - feat: -> minor, fix:/perf: -> patch - Creates GitHub releases automatically
- Publishes to PyPI on release
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- Move warnings.warn() after imports to fix E402 in viewer files - Remove unused imports (Any, base64, os, Service) to fix F401 - Remove f-string without placeholders to fix F541 - Apply ruff formatting to 5 files
Files changed (7): - benchmarks/viewer.py - E402 fix - benchmarks/waa_deploy/api_agent.py - F401 + format - benchmarks/azure_ops_tracker.py - format only - benchmarks/vm_monitor.py - format only - cloud/local.py - format only - scripts/capture_screenshots.py - F401, F541 + format - training/viewer.py - E402 fix
Co-authored-by: Claude Opus 4.5 noreply@anthropic.com
- fix(training): support VL models in standard transformers fallback
Auto-detect vision-language models (Qwen2-VL, Qwen2.5-VL) and use the appropriate model class instead of always using AutoModelForCausalLM.
Detection criteria: - "VL" in model name (case-insensitive) - "vision" in model name - vision_config attribute in model config
Model class selection: - VL models: Qwen2VLForConditionalGeneration (with AutoModelForVision2Seq fallback) - Text-only models: AutoModelForCausalLM
Also sets task_type to SEQ_2_SEQ_LM for VL models in LoRA config.
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
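A minimal sketch of the detection heuristic and class selection described above; function names are illustrative, and the try/except chain mirrors the fallback order in the commit message:

```python
def is_vision_language_model(model_name: str, config=None) -> bool:
    """Detection criteria from the commit: 'vl' or 'vision' in the name
    (case-insensitive), or a vision_config attribute on the model config."""
    name = model_name.lower()
    return "vl" in name or "vision" in name or hasattr(config, "vision_config")

def resolve_model_class(model_name: str, config=None):
    """Pick the model class for loading; VL models fall back to
    AutoModelForVision2Seq when the Qwen2-VL class is unavailable."""
    if not is_vision_language_model(model_name, config):
        from transformers import AutoModelForCausalLM
        return AutoModelForCausalLM
    try:
        from transformers import Qwen2VLForConditionalGeneration
        return Qwen2VLForConditionalGeneration
    except ImportError:
        from transformers import AutoModelForVision2Seq
        return AutoModelForVision2Seq
```

Note that AutoModelForVision2Seq was removed in transformers 5.x, so newer code should prefer AutoModelForImageTextToText as the primary import with AutoModelForVision2Seq as the legacy fallback.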
- test(training): simplify VL tests to avoid model downloads
- fix(training): improve VL model support - catch RuntimeError, disable assistant_only_loss
- Add RuntimeError and TypeError to exception handling in _load_standard_model() to catch errors when loading Qwen2.5-VL with Qwen2VLForConditionalGeneration - Disable assistant_only_loss in standard TRL config as it's not supported for VL models yet
Co-authored-by: Claude Opus 4.5 noreply@anthropic.com
- Bump version to 0.2.1 (7a1c054)
Includes VL model support fix (PR #18): - Auto-detect VL models and use correct model class - Handle Qwen2VLForConditionalGeneration properly - Set assistant_only_loss=False for VL models
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
Delete unused files: - training/viewer_migration_example.py (72 lines) - only self-referential - scripts/fix_acr_auth.py (212 lines) - one-time fix now baked into setup_azure.py - docs/azure_acr_authentication.md - docs for removed script
Update CLAUDE.md to remove references to deleted fix script.
Verified safe to delete: - None of these files are imported by cli.py - fix_acr_auth.py functionality is now in setup_azure.py (steps 10-12)
Co-authored-by: Claude Opus 4.5 noreply@anthropic.com
- Update gitignore and module exports (4ab39ea)
- Add patterns for training output, synthetic data, experiment results - Add .jsonl, benchmark_live.json, external/, demos/ to gitignore - Export new runtime and schema types in module init.py
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- Add architecture decisions, analysis, and design documentation (4c6859f)
Key documents: - ARCHITECTURE_DECISIONS.md: Technical direction and decision records - analysis_jan2026.md: Comprehensive analysis and strategic options - enterprise/: SAC, Design Roadmap, Coords vs Marks ablation research
Design docs: - safety_gate_design.md: Safety gate architecture - perception_integration.md: Grounding integration design - representation_shootout_design.md: Coords vs Marks experiment design - viewer_consolidation_design.md, viewer_redesign_proposal.md
Experiment results: - waa_benchmark_results_jan2026.md: WAA benchmark analysis - grpo_training_report.md: GRPO training experiments - trl_unsloth_integration_analysis.md: Training integration analysis
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- Add ecosystem planning documents (f3afda5)
- github_org_update_plan.md: GitHub org profile update strategy - desktop_app_plan.md: Desktop app distribution (pywebview + PyInstaller) - openadapt_integration_plan.md: Core openadapt integration roadmap
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- Add GitHub organization profile content recommendations (afaa1dc)
Add comprehensive recommendations for updating the OpenAdaptAI GitHub organization profile including:
- Organization bio (160 char max) - Organization README content for .github/profile/README.md - Pinned repositories recommendation (6 repos) - Repository descriptions for each package in the modular ecosystem
Focuses on the new modular architecture with openadapt as the unified entry point, highlighting openadapt-ml, openadapt-capture, openadapt-evals, openadapt-viewer, openadapt-grounding, and openadapt-retrieval packages.
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- Add Qwen3-VL embedding research and design documentation (30e85c5)
Add comprehensive documentation for Qwen3-VL vision-language embedding: - qwen3_vl_embedding_research.md: Literature review of VLM embedding extraction methods, including early exit strategies, hidden state extraction, and multimodal representation learning - qwen3_vl_embedding_design.md: Technical design document for extracting and using Qwen3-VL embeddings for GUI element retrieval and similarity-based action prediction
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- Add viewer architecture survey and comparison (21cc0fe)
Survey of viewer technologies and frameworks for training/benchmark visualization, comparing options like Gradio, Streamlit, Panel, and custom HTML solutions for the unified viewer architecture.
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- Add website redesign plan (6710e77)
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- Pivot desktop app to uv-first distribution and propose meta-package architecture (98e4b2c)
- desktop_app_plan.md: Switch from PyInstaller to uv-based installation - Tier 1: Single command install via uv tool - Tier 2: Optional uv bundled installer (~15MB) - Tier 3: PyInstaller full bundle (deferred) - Reduces annual cost from $500-700 to $0
- new_openadapt_architecture.md: Propose Option B+ thin CLI wrapper - Create unified 'openadapt' meta-package - Re-export common items from sub-packages - Unified CLI (openadapt capture/train/eval) - Phase-based implementation over 2 weeks
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- Update openadapt-web repository reference to new name (f4176c7)
Update repository link from OpenAdapt.web to openadapt-web following the rename to match the lowercase-hyphen naming convention.
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- Add safety gate, perception integration, and representation experiments (5778323)
New modules: - runtime/safety_gate.py: Deterministic safety gate for action validation - perception/integration.py: Bridge between openadapt-grounding and openadapt-ml - experiments/representation_shootout/: Coords vs Marks ablation framework - benchmarks/trace_export.py: Export benchmark traces to various formats
Tests: - Reorganize tests from root to tests/ directory - Add integration tests in tests/integration/ - Add test_gemini_grounding_imports.py for grounding module
Scripts: - p1_episode_success_ab_test.py: A/B test for demo-conditioned episode success
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- Add unified baseline adapters for VLM comparison (4921030)
Implements a provider abstraction layer and unified baseline system for comparing Claude, GPT, and Gemini across multiple evaluation tracks.
New modules: - openadapt_ml/models/providers/ - API provider implementations - base.py: BaseAPIProvider ABC - anthropic.py: Claude support - openai.py: GPT support - google.py: Gemini support
- openadapt_ml/baselines/ - Unified baseline system - config.py: TrackConfig, BaselineConfig, MODELS registry - prompts.py: Track-specific prompt templates - parser.py: Response parsing with JSON and regex fallback - adapter.py: UnifiedBaselineAdapter main class - cli.py: CLI commands (run, compare, list-models)
Tracks supported: - Track A: Direct coordinate prediction - Track B: ReAct-style reasoning with coordinates - Track C: Set-of-Mark element selection
Usage: uv run python -m openadapt_ml.baselines.cli list-models uv run python -m openadapt_ml.baselines.cli run --model claude-opus-4.5 --track A --image screenshot.png --goal "Click submit"
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
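The "JSON and regex fallback" parsing strategy could look roughly like this (a sketch; the real parser.py is richer and the function name is illustrative):

```python
import json
import re

def parse_response(text: str):
    """Try strict JSON first; fall back to a coordinate regex for Track A-style
    'CLICK(x, y)' answers. Returns None when nothing parses."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    match = re.search(r"CLICK\(\s*(-?\d+)\s*,\s*(-?\d+)\s*\)", text, re.IGNORECASE)
    if match:
        return {"action": "click", "x": int(match.group(1)), "y": int(match.group(2))}
    return None
```

The two-stage design keeps structured model outputs cheap to parse while still recovering actions from free-form reasoning text.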
- experiments: Add representation shootout and SOM evaluation results (e6aca89)
Add experiment results and artifacts: - representation_shootout results comparing embedding extraction methods - qwen_login 2b_dev_fixed plots showing base vs fine-tuned comparison - registration_som_eval.json evaluation metrics for SOM-based action prediction
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- waa: Refactor CLI and fix Python 3.9 compatibility (#14, e55b610)
- Refactor CLI from 6800 to ~1300 lines with flat command structure - Add analyze command to parse and summarize benchmark results - Add --num-tasks flag to limit number of tasks to run - Fix Python 3.9 compatibility by copying Python from vanilla WAA image (fixes transformers 4.46.2 compatibility with GroundingDINO) - Add coverage and analysis artifacts to .gitignore
Co-authored-by: Claude Opus 4.5 noreply@anthropic.com
- docs: add verified repo consolidation plan
- Two-package architecture: openadapt-evals (foundation) + openadapt-ml (ML) - Verified audit findings: 10 dead files confirmed, 3 previously marked dead but used - CLI namespacing: oa evals, oa ml - Dependency direction: openadapt-ml depends on openadapt-evals (not circular) - Agents with ML deps (PolicyAgent, BaselineAgent) move to openadapt-ml - adapters/waa/ subdirectory pattern for benchmark organization
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- feat: add openadapt-evals as optional dependency
Add [benchmarks] optional dependency for benchmark evaluation: - pip install openadapt-ml[benchmarks]
This is part of the repo consolidation to establish: - openadapt-evals: Foundation for benchmarks + infrastructure - openadapt-ml: ML training (depends on evals for benchmarks)
- docs(cli): clarify serve vs dashboard command naming
- oa ml serve: serve trained models for inference - oa ml dashboard: training dashboard for monitoring
This distinguishes the two use cases clearly: - serve = model inference endpoint - dashboard = training progress UI
- refactor(benchmarks): consolidate to re-export from openadapt-evals
Migrate benchmark infrastructure to two-package architecture: - openadapt-evals: Foundation package with all adapters, agents, runner - openadapt-ml: ML-specific agents that wrap openadapt-ml internals
Changes: - Convert base.py, waa.py, waa_live.py, runner.py, data_collection.py, live_tracker.py to deprecation stubs that re-export from openadapt-evals - Keep only ML-specific agents in agent.py: PolicyAgent, APIBenchmarkAgent, UnifiedBaselineAgent - Update init.py to import from openadapt-evals with deprecation warning - Update tests to import from correct locations - Remove test_waa_live.py (tests belong in openadapt-evals)
Net: -3540 lines of duplicate code removed
- refactor(benchmarks): delete deprecation stubs, import from openadapt-evals
Remove deprecation stubs since there are no external users. Tests now import directly from openadapt-evals (canonical location).
Deleted: - base.py, waa.py, waa_live.py, runner.py, data_collection.py, live_tracker.py
Kept: - agent.py (ML-specific agents: PolicyAgent, APIBenchmarkAgent, UnifiedBaselineAgent) - init.py (simplified to only export ML-specific agents)
- docs(readme): add WAA benchmark results section with placeholders
Add section 15 for Windows Agent Arena benchmark results with clearly marked placeholders. Results will be filled in when full evaluation completes. Warning banner indicates PR should not merge until placeholders are replaced.
Sections added: - 15.1 Benchmark Overview - 15.2 Baseline Reproduction (paper vs our run) - 15.3 Model Comparison (GPT-4o, Claude, Qwen variants) - 15.4 Domain Breakdown
- docs(readme): move WAA benchmark results to openadapt-evals
WAA benchmark results belong in openadapt-evals (the benchmark infrastructure package) rather than openadapt-ml (the training package).
See: OpenAdaptAI/openadapt-evals#22
- feat(cli): add VNC auto-launch and --fast VM option
- Add setup_vnc_tunnel_and_browser() helper for automatic VNC access - Add VM_SIZE_FAST constants with D8 series sizes - Add VM_SIZE_FAST_FALLBACKS for automatic region/size retry - Add --fast flag to create command for faster installations - Add --fast flag to start command for more QEMU resources (6 cores, 16GB) - Opens browser automatically after container starts
- docs: add WAA speedup options documentation
- Document --fast VM flag usage - Explain parallelization options - Detail golden image approach for future optimization
- docs(readme): add benchmark execution logs section
- Add section 13.5 with log viewing commands - Add benchmark run commands with examples - Renumber screenshot capture tool section to 13.6
- docs(readme): clarify --run flag for benchmark execution logs
- Add logs --run command for viewing task progress - Add logs --run -f for live streaming - Add logs --run --tail N for last N lines
- docs(readme): add example output for logs commands
- Add example output for `logs` (container status) - Add example output for `logs --run -f` (benchmark execution)
- feat(cli): add --progress flag for benchmark ETA
- Add _show_benchmark_progress() function - Parse run logs for completed task count - Calculate elapsed time and estimated remaining - Show progress percentage
Example usage: uv run python -m openadapt_ml.benchmarks.cli logs --progress
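The ETA math behind --progress is simple; a sketch with an illustrative helper name, assuming a constant per-task rate:

```python
def estimate_remaining_seconds(completed: int, total: int, elapsed_s: float):
    """Estimate remaining time from the completed-task count parsed out of the
    run logs. Returns None until at least one task has finished."""
    if completed <= 0:
        return None
    per_task = elapsed_s / completed  # average seconds per completed task
    return per_task * (total - completed)
```

For example, 10 of 154 tasks done in 600 s gives 60 s/task, so about 8640 s remaining.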
- docs(research): add cua.ai vs openadapt-ml WAA comparison
Comprehensive analysis of Cua (YC X25) computer-use agent platform: - Architecture comparison (composite agents, sandbox-first) - Benchmark framework differences (cua-bench vs openadapt-evals)
- Training data generation (trajectory replotting) - Recommendations: adopt patterns, not full migration
Key findings: - Cua's parallelization uses multiple sandboxes (like our multi-VM plan) - Composite agent pattern could reduce API costs - HTML capture enables training data diversity
- feat(cli): add parallelization support with --worker-id and --num-workers
WAA natively supports parallel execution by distributing tasks across workers.
Usage: # Run on single VM (default) run --num-tasks 154
VM2: run --num-tasks 154 --worker-id 1 --num-workers 3
VM3: run --num-tasks 154 --worker-id 2 --num-workers 3
Tasks auto-distribute: worker 0 gets tasks 0-51, worker 1 gets 52-103, etc.
- docs(research): add market positioning and strategic differentiation
Expand cua_waa_comparison.md with: - Success rate gap analysis (38.1% vs 19.5%) - Market positioning comparison (TAM, buyers, value props) - Where sandbox approach fails (Citrix, licensed SW, compliance) - Shell applications convergence opportunities - Bottom line: Windows enterprise automation is hard, validates OpenAdapt approach
- docs(waa): add parallelization and scalable benchmark design docs
- Add WAA_PARALLELIZATION_DESIGN.md documenting: - Official WAA approach (Azure ML Compute) - Our dedicated VM approach (dev/debug) - When to use each approach
- Add WAA_UNATTENDED_SCALABLE.md documenting: - Goal: unattended, scalable, programmatic WAA - Synthesized approach using official run_azure.py - Implementation plan and cost estimates
- Update Dockerfile comments to clarify: - API agents (api-claude, api-openai) run externally - openadapt-evals CLI connects via SSH tunnel - No internal run.py patching needed
- style: fix ruff formatting
- fix(imports): update internal code to import from openadapt-evals
Replace imports from deleted benchmark files with direct imports from openadapt-evals:
- azure.py: BenchmarkResult, BenchmarkTask, WAAAdapter - waa_demo/runner.py: BenchmarkAction, WAAMockAdapter, etc.
This completes the migration to the two-package architecture where openadapt-evals is the canonical source for benchmark infrastructure.
- fix(imports): add missing EvaluationConfig import
- Update azure.py to import BenchmarkAgent from openadapt_evals - Add EvaluationConfig to runner.py imports
Fixes CI failure: F821 Undefined name EvaluationConfig
- fix(deps): require openadapt-evals>=0.1.1
v0.1.0 uses task ID format "browser_1" but tests expect "mock_browser_001" which was added in v0.1.1.
Co-authored-by: Claude Opus 4.5 noreply@anthropic.com
- benchmarks: Migrate to openadapt-evals package (e6f63c7)
BREAKING CHANGE: Benchmark code moved to openadapt-evals.
- Update CLAUDE.md with migration guide - Add deprecation warning to benchmarks/init.py - Old imports still work but emit DeprecationWarning
Migration: # OLD (deprecated) from openadapt_ml.benchmarks import WAAMockAdapter
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- Add unit tests for providers and baselines modules (a56ec04)
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- Resolve test failures and SSE dashboard state conflicts (041912d)
Test fixes: - test_action_parsing.py: Handle 4-value return from predict_action_from_sample() - test_api_adapter.py: Fix mock patch locations (openai.OpenAI, anthropic.Anthropic) - trainer.py: Change logger.save() to logger._save_log() - policy.py: Allow negative coords in CLICK regex for clamping tests
SSE dashboard fixes: - Add phase: "ready" to Azure VM Host tasks to prevent Starting+completed conflict - Improve frontend phase inference from status when phase is missing - Add debug console logging for SSE troubleshooting
🤖 Generated with Claude Code
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
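The CLICK regex change above boils down to an optional minus sign, so out-of-bounds predictions reach the clamping logic instead of failing to parse; a sketch with illustrative names:

```python
import re

# -? admits negative coordinates so they can be clamped rather than rejected
CLICK_RE = re.compile(r"CLICK\(\s*(-?\d+)\s*,\s*(-?\d+)\s*\)")

def parse_and_clamp(text: str, width: int = 1920, height: int = 1080):
    """Parse CLICK(x, y), then clamp coordinates into the screen bounds."""
    m = CLICK_RE.search(text)
    if m is None:
        return None
    x, y = int(m.group(1)), int(m.group(2))
    return (min(max(x, 0), width - 1), min(max(y, 0), height - 1))
```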
- cli: Use localhost for VNC URLs via SSH tunnel (947816d)
Probe output now correctly shows localhost:8006 instead of public IP which is not accessible without SSH tunnel.
- waa: Add full Python dependencies for benchmark client (0544199)
- Add build-essential, ffmpeg, and X11 libs for package compilation - Install core packages: gymnasium, fabric, transformers, torch (CPU) - Install ML packages: opencv, easyocr, matplotlib, accelerate - Create python -> python3 symlink for compatibility - Separate pip installs into layers for better caching
- waa: Add missing pydrive and other client dependencies (238ff91)
- waa: Add remaining WAA client dependencies (openpyxl, docx, etc.) (678df44)
- waa: Copy OEM files to Samba share at container startup (04d5f94)
Add /copy-oem.sh startup script that copies OEM files from /oem to /tmp/smb (Samba share) at container startup. This fixes Windows not finding setup scripts because smb.conf is generated at runtime.
Also update experiment doc to remove timeline estimates and add WAA baseline as in-progress.
- waa: Copy Python env from official image to avoid 3.13 compat issues (4a27a22)
- Bump version to 0.2.0 for PyPI release (6aedda3)
Features in this release: - TRL + Unsloth training integration (2x faster, 50% less VRAM) - Standardized on uv for package management - Enhanced VM CLI and WAA deployment - Comprehensive documentation updates
🤖 Generated with Claude Code
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- Remove old waa/Dockerfile (moved to waa_deploy/) (ad78e78)
- Standardize on uv for package management (eb3aecd)
- Replace all `pip install` with `uv add` in docs - Update cloud GPU training to use `curl ... | sh` for uv install - Update CLAUDE.md with enhanced VM operations guidance - Consistent `uv sync` for local development
🤖 Generated with Claude Code
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- Update CLAUDE.md and minor fixes (84507df)
- Updated CLAUDE.md with new features and documentation - trainer.py: minor improvements - eval_policy.py: updated for new schema - uv.lock: dependency updates
- Update uv.lock (63797af)
Explores options for data format interoperability: - Option A: Native Episode output from openadapt-capture - Option B: Conversion layer in openadapt-ml (recommended) - Option C: Shared schema package - Option D: Dual output
Recommends Option B with clear guidelines for the conversion layer. Includes text demo format specification for WAA experiment.
🤖 Generated with Claude Code
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- Add comprehensive capture migration guide with a11y integration (779ba58)
- Add design documents for SSE, retrieval, and parallelization (464861c)
- SSE architecture and integration guides - Demo retrieval design and experiments - WAA parallelization and live adapter plans - Chrome extension design for capture - Benchmark viewer UX improvements
- Add openadapt-capture to openadapt-ml migration plan (447bd7e)
- Add schema consolidation plan (4d7c18a)
Detailed migration plan for consolidating from two schema modules to one: - DELETE: openadapt_ml/schemas/ (dataclass-based, legacy) - KEEP: openadapt_ml/schema/ (Pydantic-based, canonical)
Includes: - Dependency analysis (22 files affected) - Field mapping between old and new - 7-phase migration strategy - Testing strategy - Rollback plan - Timeline estimate (~8-10 hours)
🤖 Generated with Claude Code
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- Add staged export hierarchy to enterprise guide (f093304)
- Episode JSON as canonical, Parquet/WebDataset as projections - Expand data loss table for flat formats - Mark exporters as Planned with design doc links - Add multi-step evaluation caveat
- Add WAA demo recording guide for Windows captures (6d8bdb1)
Step-by-step instructions for recording the 3 complex demos: - Task #4: Fill blank cells (LibreOffice Calc) - Task #5: Create chart (LibreOffice Calc) - Task #9: Archive folder (File Explorer)
Includes setup, recording steps, export, and transfer instructions.
🤖 Generated with Claude Code
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- Consolidate WAA CLI workflow documentation (dbe00d0)
- Add Quick Start: Single VM section to azure_waa_setup.md - Replace manual SSH steps in CLAUDE.md with CLI commands - Document custom waa-auto Docker image that fixes OEM folder issue - Add vm probe, vm reset-windows, and other useful commands
- Strengthen enterprise integration guide positioning (c093f93)
Add decision boundary, requirements, retrofitting cost sections. Add data portability note addressing vendor lock-in concern. Add optional metadata extension pattern. Add typical integration workflow. Add open schema independence statement.
- Update README for TRL training and PyPI installation (1c8e899)
- Add Installation section with PyPI instructions (uv add openadapt-ml) - Update training section to reflect TRL + Unsloth integration - Update repository structure with trl_trainer.py reference - Add PyPI badge - Fix section numbering throughout - Update test descriptions for TRL trainer
🤖 Generated with Claude Code
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- Validate demo-conditioning at n=45 (46.7% → 100%) (bce5a11)
- Zero-shot: 46.7% (21/45), Demo: 100% (45/45), Control: 57.8% - Improvement: +53.3 percentage points across 15 macOS categories - Add Parquet export design doc (derived analytics format) - Update enterprise guide with validated result
- experiment: Strengthen demo-conditioning doc for expert review (b56faa8)
- Add interpretation note framing result as "trajectory-conditioned disambiguation" not general task-solving - Highlight length-matched control in executive summary - Frame shared first action as intentional controlled variable - Add "Positioning Relative to Fine-Tuning" section connecting to prompting-first methodology - Expand limitations with actionable specifics (WAA running, SoM conventions, episode success vs first-action)
- schema: Comprehensive documentation for Episode schema (0963b75)
- Add detailed Quick Start with code examples - Document all 24 action types with categories - Explain pixel vs normalized coordinate systems - Add validation and format conversion examples - Document extension points (raw, metadata fields) - Add docs/schema/README.md as standalone reference
🤖 Generated with Claude Code
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
- Add demo-conditioned prompting experiment and retrieval module (f80921e)
Validated that demo-conditioning improves first-action accuracy from 33% to 100% (n=3, preliminary signal). Key findings: - Benefit is semantic, not token-length (length-matched control: 67%) - Demos generalize across task variations (toggle polarity, parameters) - Zero-shot has systematic spatial bias that demos correct
Added retrieval module (TF-IDF + domain bonus) for automatic demo selection. Added demo-conditioned training mode to train_from_json.py. Added enterprise integration guide for workflow data export.
Statistical note: n=3 is insufficient for significance. Validation at n≥30 on expanded task set in progress.
- Add Episode JSON schema and polish benchmark viewer (d9e669e)
Episode Schema (openadapt_ml/schema/): - Pydantic models for Episode, Step, Action, Observation - Schema version 1.0.0 with semver evolution policy - WAA format converter (from_waa_trajectory, to_waa_trajectory) - JSON Schema export for documentation/tooling - 20 action types (click, type, key, hotkey, scroll, drag, etc.)
Benchmark Viewer Improvements: - Fix SSE memory leak (clearAllIntervals on reconnect) - Add ThreadedTCPServer for concurrent request handling - Polish UI with color-coded status, loading spinners, error banners - Add refresh buttons to all panels with feedback - Prominent VNC button with copy-to-clipboard IP
CLI Enhancements: - Add --auto-shutdown flag to deallocate VM after benchmark - Add --timeout flag for Azure ML job auto-cancellation - Add vm cleanup-stale command for finding stale jobs/VMs - Add refresh button support in Azure Jobs API
🤖 Generated with Claude Code
Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com
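The shape of the Episode schema, sketched with stdlib dataclasses to stay self-contained (the real openadapt_ml/schema/ models are Pydantic, and the field names here are illustrative, not the exact schema):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Action:
    type: str                    # one of the 20 action types (click, type, key, ...)
    x: Optional[float] = None    # pointer coordinates, when applicable
    y: Optional[float] = None
    text: Optional[str] = None   # payload for type/key actions

@dataclass
class Step:
    action: Action
    screenshot: Optional[str] = None  # path or URL to the observation image

@dataclass
class Episode:
    episode_id: str
    goal: str
    steps: List[Step] = field(default_factory=list)
    schema_version: str = "1.0.0"    # semver evolution policy per the commit
```

Converters like from_waa_trajectory would then map benchmark-specific records into this structure rather than each consumer parsing raw trajectories.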
- Add WAA demo-conditioned experiment with 7 manual demos (4f9799a)
Implements hybrid demo approach for WAA benchmark: - 7 manual demos for simple tasks (settings, toggles, linear flows) - 3 placeholders for complex tasks needing recorded demos
Tasks covered: 1. Do Not Track (Edge) - manual 2. Bookmark to bar (Edge) - manual 3. Font size (Edge) - manual 4. Fill blank cells (Calc) - needs recording 5. Create chart (Calc) - needs recording 6. Center align (Writer) - manual 7. Notifications (Settings) - manual 8. Night Light schedule (Settings) - manual 9. Archive folder (Explorer) - needs recording 10. Details view (Explorer) - manual
Includes runner CLI for listing tasks and viewing demos.
- Enhanced VM CLI and WAA deployment (cd95c43)
CLI improvements: - Add vm deallocate, start, exec, fix-oem, docker-prune, stop-build actions - SSH keepalive settings (60s interval) to prevent timeouts - Docker startup check after VM restart - Better probe checking (curl from inside container)
WAA deployment: - Move Dockerfile to waa_deploy/ with api_agent.py - Add api-claude and api-openai agent support - P0 demo persistence validation script
Demo persistence validated: - scripts/p0_validate_demo_persistence.py confirms demo included at all steps - ApiAgent properly passes demo through multi-step episodes
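The 60-second SSH keepalive mentioned above corresponds to standard OpenSSH client options. Shown here as an `ssh_config` fragment for illustration; the host alias is invented, and whether the CLI uses a config file or equivalent `-o` flags is an assumption.

```
Host waa-vm
    ServerAliveInterval 60
    ServerAliveCountMax 3
```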
- TRL + Unsloth training integration (c006f2b)
Add trl_trainer.py with SFTTrainer + Unsloth optimizations - Update train_from_json.py to use TRL trainer (2x faster, 50% less VRAM) - Remove legacy custom training loop from trainer.py - Add [training] optional dependencies (trl, datasets) - Support --use-unsloth / --no-unsloth flags
Training command: uv run python examples/train_from_json.py --data episodes/ --output results/
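Because trl and datasets live behind the `[training]` extra, modules that use them presumably guard their imports so a lightweight install can still import the package. A sketch of that pattern (the flag name `HAS_TRAINING_DEPS`, helper name, and error message are illustrative assumptions):

```python
# Sketch of the optional-dependency guard for the [training] extra.
try:
    from trl import SFTTrainer  # heavy optional dependency
    HAS_TRAINING_DEPS = True
except ImportError:
    SFTTrainer = None
    HAS_TRAINING_DEPS = False

def require_training_deps():
    """Raise a helpful error if training extras are not installed."""
    if not HAS_TRAINING_DEPS:
        raise ImportError(
            "Training dependencies missing; install with "
            "`pip install openadapt-ml[training]`"
        )
```

Callers invoke `require_training_deps()` at the top of training entry points, so import of the package itself never fails.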
- benchmark: Add Run Benchmark UI panel and fix VNC/Docker issues (a86269e)
Add Run Benchmark panel to benchmark viewer (model, tasks, agent, domain selection) - Add POST /api/benchmark/start endpoint to launch benchmarks from UI - Add --domain and --task-ids CLI flags for filtered benchmark runs - Fix VNC link to use localhost:8006 (SSH tunnel) instead of direct VM IP - Fix Docker build to use --no-cache --pull to prevent stale dockurr/windows layers - Add docs/waa_network_architecture.md explaining the localhost-based network topology - Add docs/benchmark_run_ui_design.md with UI design specification
The Docker cache issue caused dockurr/windows v0.00 scripts (no auto-download) to be used instead of v5.14 (with auto-download). Fixed by adding --no-cache --pull.
- benchmarks: WAA CLI improvements, result analysis, and viewer enhancements (c8d1ce0)
WAA CLI: - Add analyze command for programmatic WAA result analysis - Remote analysis via SSH (--vm-ip --remote) - Local directory analysis (--results-dir) - Per-domain success rates, JSON export - Fix invalid model name: gpt-5.2 → gpt-4o - Add --skip-build tip for faster reruns - Add vm_monitor.py for VM status tracking - Add live_tracker.py for real-time benchmark progress
Documentation: - Add docs/waa_setup.md - WAA setup guide - Add docs/GEMINI_GROUNDING_QUICKSTART.md - Add docs/background_task_visibility.md - Add implementation summaries
Viewer: - Add benchmark_viewer.py for WAA result visualization - Enhance local.py serve command - Integrate benchmarks into unified viewer
- export: Add Parquet exporter and toolbox positioning (bb11d62)
Add first-class Parquet export support: - to_parquet() / from_parquet() for Episode serialization - CLI: python -m openadapt_ml.export parquet --input --output - Optional summary table generation - pyarrow as optional dependency
Update enterprise integration guide: - Add "What is openadapt-ml?" toolbox section - Frame as composable utilities, not monolithic framework - Update Parquet section with real implementation examples - Scope canonical claim to within openadapt-ml
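A Parquet export along these lines typically flattens episodes into one row per step. The sketch below is not the real `to_parquet()` implementation: the function name, column choices, and episode dict shape are assumptions, and pyarrow is imported optionally to mirror the entry above.

```python
# Sketch: flatten an episode's steps into a pyarrow Table suitable for
# Parquet. pyarrow is an optional dependency, as in the changelog entry.
try:
    import pyarrow as pa
    HAS_PYARROW = True
except ImportError:
    HAS_PYARROW = False

def episode_to_table(episode: dict):
    """One row per step; episode-level fields are repeated per row."""
    rows = [
        {
            "episode_id": episode["episode_id"],
            "instruction": episode["instruction"],
            "step_index": s["step_index"],
            "action_type": s["action_type"],
        }
        for s in episode["steps"]
    ]
    return pa.Table.from_pylist(rows)
```

Writing the table is then a single `pyarrow.parquet.write_table()` call.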
- retrieval: Add demo retrieval system and WAA live adapter (b62b6f1)
Demo Retrieval: - embeddings.py: improved embedding generation with caching - demo_retriever.py: semantic search for relevant demonstrations - Support for goal-based and screenshot-based retrieval
Benchmark Viewer: - viewer.py: standalone HTML viewer for benchmark results - waa_live.py: live evaluation adapter for WAA benchmarks - Integrated with dashboard for real-time monitoring
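The embedding caching mentioned for embeddings.py can be sketched as a hash-keyed memo around whatever embedding backend is in use. The class name, hit counter, and callable interface below are illustrative assumptions, not the real API.

```python
# Sketch: cache embeddings by a hash of the input text so repeated
# retrieval calls skip recomputation.
import hashlib

class EmbeddingCache:
    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # backend that maps text -> vector
        self._cache = {}
        self.hits = 0

    def __call__(self, text: str):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._cache:
            self.hits += 1
        else:
            self._cache[key] = self._embed_fn(text)
        return self._cache[key]
```

Hashing the text (rather than keying on the raw string) keeps keys fixed-size, which also makes an on-disk cache straightforward.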
- schema: Add converters between internal and external formats (3124992)
from_internal_episode(): Convert schemas.sessions.Episode to schema.Episode - to_internal_episode(): Convert schema.Episode back to internal dict format - Document field mapping in README
This enables bidirectional conversion between: - Internal training format (schemas.sessions): id, goal, t, image_path, x/y - External interop format (schema.episode): episode_id, instruction, step_index, etc.
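The bidirectional conversion amounts to a key-renaming pass over the episode and its steps. The sketch below uses plain dicts and only a subset of the field map (the real converters return schema models and handle more fields); the `FIELD_MAP`/`STEP_MAP` tables are illustrative.

```python
# Sketch: rename internal field names to external interop names and back.
FIELD_MAP = {"id": "episode_id", "goal": "instruction"}
STEP_MAP = {"t": "step_index", "image_path": "screenshot_path"}

def from_internal_episode(internal: dict) -> dict:
    external = {FIELD_MAP.get(k, k): v for k, v in internal.items() if k != "steps"}
    external["steps"] = [
        {STEP_MAP.get(k, k): v for k, v in step.items()}
        for step in internal.get("steps", [])
    ]
    return external

def to_internal_episode(external: dict) -> dict:
    inv = {v: k for k, v in FIELD_MAP.items()}
    inv_step = {v: k for k, v in STEP_MAP.items()}
    internal = {inv.get(k, k): v for k, v in external.items() if k != "steps"}
    internal["steps"] = [
        {inv_step.get(k, k): v for k, v in step.items()}
        for step in external.get("steps", [])
    ]
    return internal
```

Because both directions are pure renames here, a round trip returns the original dict; the real converters additionally split/merge fields (e.g. `t` into `step_index` plus `timestamp`).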
- schema: Add select_monitor and normalized coordinates (da9fc16)
Add action types: select_monitor, window_focus, window_resize, window_move - Add monitor_id field for select_monitor action - Add window_title field for window_focus action - Add normalized_coordinates (0.0-1.0) as alternative to pixel coords - Add normalized_start/end for resolution-independent drag actions
This enables cu-episode-v1 alignment without loss of information.
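The pixel-to-normalized conversion behind resolution independence is straightforward; the helper names below are illustrative, not part of the schema API.

```python
# Sketch: map between pixel coordinates and the 0.0-1.0 normalized
# range, so a recorded action replays correctly at any resolution.
def to_normalized(x: int, y: int, width: int, height: int) -> tuple:
    return (x / width, y / height)

def to_pixels(nx: float, ny: float, width: int, height: int) -> tuple:
    return (round(nx * width), round(ny * height))
```

A click recorded at the center of a 1920x1080 screen normalizes to (0.5, 0.5) and replays at the center of any target display.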
- waa: Auto-build Docker image and fix tunnel detection (f71e556)
CLI run-waa now automatically builds waa-auto image if missing - Added --rebuild flag to force image rebuild - Dockerfile: fixed IP patching, added playwright for web automation - ssh_tunnel.py: fixed tunnel status to check actual port state instead of just internal tracking, correctly reports external tunnels
Closes #XX
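Checking actual port state rather than internal bookkeeping can be done with a plain socket probe, which is what makes externally created tunnels detectable. This is a sketch of the idea only; the real ssh_tunnel.py logic is more involved, and the function name is an assumption.

```python
# Sketch: report a tunnel as "up" iff something is actually listening
# on the forwarded port, regardless of which process opened it.
import socket

def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Because the probe is process-agnostic, a tunnel opened manually (outside the CLI) is reported correctly instead of being missed by internal tracking.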
- Consolidate schema to single Pydantic-based Episode module (c05ad6b)
Migrate from dual schema modules (schemas/ dataclass-based) to single canonical Pydantic schema (schema/episode.py) - Delete old openadapt_ml/schemas/ directory (sessions.py, validation.py) - Update all imports across 27 files to use openadapt_ml.schema
Schema field mappings (old -> new): - Episode.id -> Episode.episode_id - Episode.goal -> Episode.instruction (required), Episode.goal (optional) - Step.t -> Step.step_index (int) + Step.timestamp (float) - Step.thought -> Step.reasoning - Observation.image_path -> Observation.screenshot_path - Action.x, Action.y -> Action.normalized_coordinates (tuple) - Action.type (str) -> Action.type (ActionType enum) - Action.element_index -> Action.element.element_id (via UIElement)
Added fields to Observation: app_name, url for benchmark compatibility
Converters in schema/converters.py handle legacy format conversion.
- Add comprehensive tests for WAA demo experiment module (ae5930b)
28 tests covering: - Task definitions (10 tasks, domains, difficulties) - Demo content (7 complete, 3 placeholder) - Integration (task/demo consistency, retrieval) - Format validation (DEMONSTRATION header, Goal line)
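The format checks can be sketched as simple assertions over the demo text. The `DEMONSTRATION` header and `Goal` line come from the entry above; the exact marker spelling, the helper name, and the fixture text are assumptions about how the real tests are written.

```python
# Sketch: the format-validation conventions the test suite checks.
def validate_demo_format(demo_text: str) -> None:
    lines = demo_text.strip().splitlines()
    assert lines and lines[0].startswith("DEMONSTRATION"), "missing header"
    assert any(line.startswith("Goal:") for line in lines), "missing Goal line"
```

Run against each of the 7 complete demos, this catches demos that drift from the prompt template the agent expects.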
- Add tests for demo retrieval and WAA live adapter (d9b6eab)
test_demo_retrieval.py: comprehensive tests for demo retriever - test_waa_live.py: tests for live WAA adapter - Updated test_retrieval.py for new embedding features - demo_retrieval_example.py: usage examples
- Update tests for TRL trainer refactor (0694d8d)
- Consistent 'Viewer' label in nav tabs (4a6f627)
- dashboard: Improve viewer/dashboard consistency and CLI commands (27a9857)
- dashboard: Remove duplicate job ID and make header full-width (5c3c22b)
- dashboard: Use subprocess for simpler http server startup (4e74142)
- plots: Add model size labels to hardened benchmark plots (4c828f2)
- serve: Remove stale refresh args, stop button now works (b7206bc)
- stub: Auto-copy real screenshot for evaluation samples (53c9f50)
- viewer: Extract predictions from window.comparisonData and fix elapsed time loading (475c38d)
- viewer: Sync audio speed with playback, add visual feedback to overlay toggles (e58cc22)
- gitignore: Ignore synthetic and ephemeral training artifacts (ff3a6e3)
- plots: Track hardened v2 experiment plots and scope ignore to top-level (1511e6d)
- readme: Point synthetic plots at hardened v2 experiment artifacts (6f35079)
- Add benchmark viewer integration design and update TODOs (f21c828)
- Add early termination controls as high priority TODO (6ed8042)
Document need for auto-termination, dashboard stop button, checkpoint download - Fix shared header in unified viewer template (trainer.py) - Remove 'Dashboards:' label from compare.py nav
- Add GUI-Actor integration plan for coordinate-free grounding (a67abc3)
- Add training feedback UX critical path analysis (a87b9bb)
- Add unified compute architecture design and PyPI TODO (e2abfe3)
- readme: Add 2b training log snippet and clarify qwen3 masking roadmap (f02ebfd)
- roadmap: Mark Priority 5a complete and update plotting achievements (2da03f2)
- viewer: Add timeline visualizer and eval integration design (d3e93ed)
- Initial commit of openadapt-ml pipeline (synthetic login, qwen adapters, eval + training) (ec92d6b)
- V0.1.0 release with benchmark integration, grounding module, and cloud training (b29d558)
- benchmark: Add qwen login orchestrator and refine docs (efcce00)
- cloud: Add Lambda Labs training, benchmarks, and training visualization (2063aea)
- config: Add pydantic-settings configuration and API benchmarks (2e7bfd1)
- dashboard: Add early termination controls and /api/stop endpoint (7c27f47)
- dashboard: Enhance evaluation samples with model thinking display (a0e2b09)
- dashboard: Show model thinking by default, add legend (fd26952)
- docs: Add dashboard screenshots and fix viewer predictions (8ea00d0)
- lambda: Add early termination controls with auto-stop, checkpoint download, and dashboard stop button (4936fe0)
- lambda: Auto-symlink capture screenshots and rewrite paths (1bef82c)
- local: Add local training CLI for CUDA/Apple Silicon (1efb175)
- plots: Add legend to comprehensive comparison and streamline README (b710d49)
- plots: Add legend to qwen_vs_apis comparison plot (eee3f25)
- plots: Update individual plots with consistent color coding and legend (e934e05)
- qwen-login: Harden benchmark and add plots, GIF, and output docs (1b29003)
- synthetic-login: Harden jitter, prompts, and Qwen3-VL 2B/8B results (9b27555)
- training: Add job-scoped directories and HTTP server for dashboards (cb69e6a)
- training: Add stub adapter for rapid UI testing without GPU (5f236aa)
- viewer: Add benchmark tab with WAA integration WIP state (a707acb)
- viewer: Add parseModelOutput for SoM parsing and truncation (1217f64)
- viewer: Add screenshots to README and smart auto-scroll (e147f09)
- viewer: Add SoM action parsing for model predictions (6a89164)
- viewer: Add transcript/audio sync, copy-all button, and extract shared UI (6beff41)
- viewer: Extract viewer module with evaluation gallery and badges (911faac)
- Rename --eval-on-training-data to --overfit (3cb9dc1)
- viewer: Consolidate to standalone HTML generation (bbda9c9)
- local: Add tests for local CLI with early stopping (e371687)