Use `prime eval run` with a small sample:
```bash
prime eval run my-environment -m openai/gpt-4.1-mini -n 5
```

The `-s` flag prints sample outputs so you can see what's happening.
If using `prime eval run`: results are saved automatically. Browse them interactively with:

```bash
prime eval tui
```

The TUI opens a single run browser (environment -> model -> run). Press `Enter` on a run to open rollout details, `b` to go back, `Tab` to cycle panes, `e` and `x` to expand or collapse history, `PageUp` and `PageDown` to scroll history, and `c` for Copy Mode.
If using the Python API (`env.generate()` / `env.evaluate()`):

```python
vf.print_prompt_completions_sample(outputs, n=3)
```

Set the `VF_LOG_LEVEL` environment variable:
```bash
VF_LOG_LEVEL=DEBUG prime eval run my-environment -m openai/gpt-4.1-mini -n 5
```

- `SingleTurnEnv`: One prompt, one response (Q&A, classification)
- `MultiTurnEnv`: Custom back-and-forth interaction (games, simulations)
- `ToolEnv`: Model calls Python functions (search, calculator)
- `StatefulToolEnv`: Tools that need per-rollout state (sandbox IDs, sessions)
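For the tool-calling environments, tools are plain Python functions: the model calls them by name, and the type hints and docstring describe the tool. A self-contained sketch of a calculator tool (the `calculator` name and the safe-expression-evaluation approach are illustrative, not part of the library):

```python
import ast
import operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculator(expression: str) -> str:
    """Evaluate a basic arithmetic expression, e.g. '2 * (3 + 4)'."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError(f"Unsupported expression: {expression!r}")
    # Parse in 'eval' mode so only a single expression is accepted
    return str(_eval(ast.parse(expression, mode="eval")))
```

Walking the AST instead of calling `eval()` keeps the tool safe against arbitrary code in model-generated input; unsupported syntax raises a `ValueError`, which the environment can surface back to the model.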
Unlimited turns. The rollout continues until a stop condition is triggered (e.g., model stops calling tools, or a custom condition you define).
Use the `@vf.stop` decorator on a method that returns `True` to end the rollout:
```python
@vf.stop
async def task_completed(self, state: State) -> bool:
    return "DONE" in state["completion"][-1]["content"]
```

In `ToolEnv`, customize error handling:
```python
env = ToolEnv(
    tools=[my_tool],
    error_formatter=lambda e: f"Error: {type(e).__name__}: {e}",
    stop_errors=[CriticalError],  # These errors end the rollout
)
```

Non-critical errors are returned to the model as tool responses so it can retry.
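A pure-Python sketch of the string the model receives when a non-critical error occurs (the `format_tool_error` helper is hypothetical, mirroring the `error_formatter` lambda above):

```python
def format_tool_error(e: Exception) -> str:
    # Hypothetical helper mirroring the error_formatter lambda above;
    # the model receives this string as the tool's response and can retry.
    return f"Error: {type(e).__name__}: {e}"

try:
    int("not a number")  # simulate a failing tool call
except ValueError as e:
    tool_response = format_tool_error(e)
# tool_response starts with "Error: ValueError:"
```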
Reward functions receive any of these via `**kwargs`:

- `completion` - the model's response
- `answer` - ground truth from dataset
- `prompt` - the input prompt
- `state` - full rollout state
- `parser` - the rubric's parser (if set)
- `task` - task identifier
- `info` - metadata dict from dataset
Just include the ones you need in your function signature.
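For example, a reward that only needs `completion` and `answer` can accept just those and absorb the rest via `**kwargs`. A minimal sketch (the `exact_match` name and the chat-message shape of `completion` are assumptions for illustration):

```python
def exact_match(completion, answer, **kwargs) -> float:
    """Reward 1.0 if the final assistant message exactly matches the ground truth."""
    # completion may be a chat-message list or a plain string (assumed here)
    text = completion[-1]["content"] if isinstance(completion, list) else completion
    return 1.0 if text.strip() == answer.strip() else 0.0
```

Because unused arguments fall into `**kwargs`, the same function works regardless of which other fields the rubric passes.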
Group reward functions receive plural arguments (`completions`, `answers`, `states`) and return a list of floats. They're detected automatically by parameter names:
```python
def relative_reward(completions: list, answers: list, **kwargs) -> list[float]:
    # Score all completions for an example together
    scores = [compute_score(c, a) for c, a in zip(completions, answers)]
    # Normalize relative to the group (guard against empty or all-zero groups)
    max_score = max(scores, default=1.0) or 1.0
    return [s / max_score for s in scores]
```

Point the client to your local server:
```python
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)
outputs = await env.evaluate(client, model="your-model-name", ...)
```