fix: align GRPO prompt format with SFT training format (#61)
The GRPO rollout prompt was missing the "Thought:" line and the action
history that SFT training uses. Models fine-tuned via SFT output
"Thought: ...\nAction: CLICK(...)", but the GRPO prompt did not ask
for this format. The model instead produced verbose free-form output
that could not be parsed, so rollouts received reward 0.0 and
produced zero gradients.
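The last step of that chain (uniform reward 0.0 leading to zero gradients) can be illustrated with a minimal sketch of group-normalized advantages. This assumes the common GRPO formulation, where each rollout's advantage is its reward minus the group mean, divided by the group standard deviation; the function name `group_advantages` and the `eps` value are hypothetical and not taken from this repository.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages (assumed formulation, for illustration only)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every reward in the group is identical.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every rollout unparseable: all rewards are 0.0, so every advantage is 0.0
# and the policy-gradient term carries no learning signal.
all_failed = group_advantages([0.0, 0.0, 0.0, 0.0])

# One rollout parsed and rewarded: advantages differ, so there is a signal.
one_success = group_advantages([1.0, 0.0, 0.0, 0.0])
```

Under this assumption, fixing the prompt matters less for any single reward than for getting at least some variation in reward within a group.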
Changes:
- Add "Thought:" and "Action:" prompt lines matching SFT format
- Add action_history parameter for step context
- Parser extracts action from "Action: ..." line before regex matching
- Parser handles JSON format {"action_type": "click", "coordinate": [x,y]}
- Debug logging of raw VLM output for zero-reward diagnosis
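
The changes above might look roughly like the following sketch. It is not the code from this commit: the prompt layout, the names `build_prompt` and `parse_action`, and the `CLICK(x=..., y=...)` argument syntax are all assumptions made for illustration, since the commit message only shows `CLICK(...)` and the JSON shape.

```python
import json
import logging
import re

logger = logging.getLogger(__name__)

# Hypothetical CLICK syntax: two numbers, optionally written as x=... and y=...
_CLICK_RE = re.compile(
    r"CLICK\(\s*(?:x\s*=\s*)?(-?\d+(?:\.\d+)?)\s*,"
    r"\s*(?:y\s*=\s*)?(-?\d+(?:\.\d+)?)\s*\)",
    re.IGNORECASE,
)
_ACTION_LINE_RE = re.compile(r"^\s*Action:\s*(.+)$", re.MULTILINE)


def build_prompt(instruction, action_history=None):
    """Build a rollout prompt that asks for the Thought/Action format (assumed layout)."""
    lines = [f"Instruction: {instruction}"]
    if action_history:
        lines.append("Previous actions:")
        lines.extend(f"{i}. {a}" for i, a in enumerate(action_history, start=1))
    lines.append("Respond in this format:")
    lines.append("Thought: <reasoning>")
    lines.append("Action: <action>")
    return "\n".join(lines)


def parse_action(raw_output):
    """Return {'action_type': 'click', 'coordinate': [x, y]} or None if unparseable."""
    # Use the last "Action: ..." line if present, otherwise the whole output.
    found = _ACTION_LINE_RE.findall(raw_output)
    candidate = found[-1].strip() if found else raw_output.strip()

    # JSON form: {"action_type": "click", "coordinate": [x, y]}
    if candidate.startswith("{"):
        try:
            data = json.loads(candidate)
        except json.JSONDecodeError:
            data = None
        if isinstance(data, dict) and data.get("action_type") == "click":
            coord = data.get("coordinate")
            if (
                isinstance(coord, list)
                and len(coord) == 2
                and all(isinstance(c, (int, float)) for c in coord)
            ):
                return {"action_type": "click",
                        "coordinate": [float(coord[0]), float(coord[1])]}

    # Function-call form, matched by the hypothetical regex above.
    match = _CLICK_RE.search(candidate)
    if match:
        return {"action_type": "click",
                "coordinate": [float(match.group(1)), float(match.group(2))]}

    # Log the raw output so zero-reward rollouts can be diagnosed later.
    logger.debug("Unparseable VLM output: %r", raw_output)
    return None
```

Extracting the "Action:" line first keeps coordinates mentioned in the "Thought:" text from being mistaken for the action.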
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>