Pr1/nlu interfacing minimal mazegen #8
Conversation
Add nlu_pipeline (env, runner, agents, observation, examples) with sample mazes and smoke test scripts. Include mazegen models/generators/solver and render_dataset for PNG rendering and solver checks. Omit automatic dataset generation (orchestrator, validator, mechanisms, generate_dataset) and bulk generated_mazes for a follow-up PR. Ignore smoke_tests/results and terminal_output.txt. Co-authored-by: Cursor <cursoragent@cursor.com>
Results directory pattern is now smoke_{maze}_bfs (was smoke_{maze}_smart_manual).
Co-authored-by: Cursor <cursoragent@cursor.com>
Drop other sample maze JSON/PNGs from the PR. Update run_llm and run_local_llm examples to use V01_empty_room.json. Co-authored-by: Cursor <cursoragent@cursor.com>
seanrivera
left a comment
Generally this needs to be moved to integrate better with the rest of the system. The build artifacts need to be .gitignored, and the rest needs to be rebased to work with the full pipeline.
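To address the build-artifact point, a scoped ignore file covering the paths this PR's description already names (`smoke_tests/results`, `terminal_output.txt`, and the deferred `generated_mazes/`) would be enough; a sketch, with paths assumed relative to the pipeline directory:

```gitignore
# Smoke-test outputs and captured terminal logs (named in the PR description)
smoke_tests/results/
terminal_output.txt

# Bulk generated mazes deferred to a follow-up PR
generated_mazes/
```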
This file path may need some cleanup to work with the rest of the repo
Please try not to commit generated artifacts to the repo.
Can you move these into the same directory as the other JSONs?
This generates its own maze schema, instead of using the existing one. I understand the need for this for testing, but it shouldn't be merged in.
Similar to the maze generator above, this conflicts with the other BFS solvers we have. It should probably not be added to the codebase.
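For context on the duplication concern: a grid BFS solver is small enough that the repo only needs one shared implementation. A minimal sketch of the standard approach follows — the grid encoding (`#` for walls, strings of cells) is hypothetical and not the repo's actual maze schema:

```python
from collections import deque

def bfs_shortest_path(grid, start, goal):
    """Return a shortest list of (row, col) cells from start to goal, or None.

    grid: sequence of equal-length strings; '#' marks a wall (assumed encoding).
    """
    rows, cols = len(grid), len(grid[0])
    parents = {start: None}          # visited set doubling as a parent map
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk the parent chain back to start to recover the path.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != "#" and (nr, nc) not in parents):
                parents[(nr, nc)] = cell
                queue.append((nr, nc))
    return None  # goal unreachable
```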
Can you refactor this to support the EvaluationHarness object?
You can remove this and just use the GridRunners or other versions of the abstract grid backend.
PR review guidelines — NLU benchmark (v2 interfacing)
Use this as a checklist when reviewing the PR that adds
`src/v2/nlu_pipeline/` plus the minimal `automatic_maze_generation` pieces needed for smoke tests (solver, layout generators, `render_dataset`).

## What this PR is for
Included: `ExperimentRunner`, `ExperimentConfig`, observation building, prompting strategies, querying loop, parser, smoke scripts, and two sample mazes (`V01_empty_room`, `V04_single_key`) with previews.

Omitted for now: `orchestrator`, `validator`, `mechanisms`, `generate_dataset`, and bulk `generated_mazes/`. Those land in later PRs.

## Suggested review order
1. `config.py` — Read the docstring on `ExperimentConfig`. It defines the experimental axes (`prompting`, `observation`, `context_window`, `querying`). Confirm the combinations you care about are meaningful and documented.
2. `runner.py` — How the runner wires prompt strategy + `ObservationBuilder` + `QueryingMode` into a single episode loop; what gets appended to system vs user; when the model is called.
3. `observation.py` and `renderer.py` — How text vs PNG observations are built with history; lazy load of `render_dataset.py` for maze PNGs.
4. `prompt_strategies.py` / `querying.py` — Whether instructions match the parser and env action set.
5. `smoke_tests/` — Scripts are the fastest way to sanity-check behavior without a real LLM API (see below).
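To make the four experimental axes concrete before reading `config.py`, here is a hypothetical sketch of what `ExperimentConfig` might look like. The allowed values and validation logic are illustrative guesses based on terminology elsewhere in this document, not the PR's actual definition — check `config.py` for the real one:

```python
from dataclasses import dataclass

# Allowed values are illustrative guesses, not the PR's actual enums.
OBSERVATION_MODES = ("text_only", "screenshot_only", "image_text")
CONTEXT_WINDOWS = ("full", "last3")

@dataclass(frozen=True)
class ExperimentConfig:
    """One point in the grid: prompting x observation x context_window x querying."""
    prompting: str        # which prompt strategy to use
    observation: str      # how the maze is shown to the model
    context_window: str   # how much history each turn carries
    querying: str         # e.g. single-shot vs iterative querying loop

    def __post_init__(self):
        # Fail fast on unsupported axis values so smoke tests catch typos early.
        if self.observation not in OBSERVATION_MODES:
            raise ValueError(f"unknown observation mode: {self.observation!r}")
        if self.context_window not in CONTEXT_WINDOWS:
            raise ValueError(f"unknown context window: {self.context_window!r}")
```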
## How to smoke test locally

- BFS
- Claude
- Different Experiment Modes
Expect: `success=True` / "All checks passed" (or equivalent), and outputs under `smoke_tests/results/` (gitignored).

After `smoke_prompting_observation_querying.py`: the following reloads `detailed_logs.json` and re-checks the same structural invariants as the smoke test (query counts vs stored queries, observation mode vs message shape, prompting markers, context-window text, etc.), printing any ISSUE lines plus a per-run summary.

The outputs for each of these smoke tests are stored inside a results folder, with detailed logs at each query step as well as a PNG rendering of the maze after each agent action.
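A hedged sketch of that kind of post-hoc invariant check — the log layout assumed here (a top-level `runs` list, each run with `num_queries`, `queries`, and `observation` fields) is invented for illustration and the real `detailed_logs.json` schema may differ:

```python
import json

def recheck_logs(path):
    """Reload a detailed-logs file and return a list of ISSUE strings."""
    with open(path) as f:
        logs = json.load(f)
    issues = []
    for i, run in enumerate(logs["runs"]):
        # Invariant: the recorded query count matches the stored queries.
        if run["num_queries"] != len(run["queries"]):
            issues.append(f"ISSUE run {i}: num_queries={run['num_queries']} "
                          f"but {len(run['queries'])} queries stored")
        # Invariant: screenshot-only runs must not embed a text maze layout.
        if run["observation"] == "screenshot_only":
            for q in run["queries"]:
                if "maze layout:" in q.get("user_text", "").lower():
                    issues.append(f"ISSUE run {i}: text layout in screenshot_only run")
    return issues
```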
## Clarification: “image-only” vs text layout in the pipeline
Some reviewers prefer an interface that accepts screenshots only (no natural-language rendering of the maze anywhere).
In this code, that mode already exists as a configuration choice, not a separate subsystem:
- `ExperimentConfig.observation == "screenshot_only"` — no initial NL maze block in the system prompt for that axis; user turns emphasize the live PNG (and action-only history when `context_window` is `last3`). See `observation.py` and `config.py`.
- `text_only` / `image_text` add NL layout and/or per-step text via the same `ObservationBuilder` and runner; the branching is localized to prompting / observation / querying config handling.

The author is not splitting out a second “screenshot-only pipeline” in this PR: the shared runner and config axes keep one code path, avoid duplicating the episode loop, and keep smoke tests and future experiments (ablations across observation modes) cheap.
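A hypothetical sketch of how that localized branching might look inside a single builder — the class shape, method, and field names here are illustrative assumptions, not the PR's actual `ObservationBuilder` API:

```python
class ObservationBuilder:
    """Build one user-turn observation; all modes share this single code path."""

    def __init__(self, observation_mode, context_window):
        self.observation_mode = observation_mode  # text_only / screenshot_only / image_text
        self.context_window = context_window      # full / last3

    def build(self, text_layout, png_path, history):
        parts = []
        # Text layout appears only in the text-bearing modes.
        if self.observation_mode in ("text_only", "image_text"):
            parts.append(f"Maze layout:\n{text_layout}")
        # PNG reference appears only in the image-bearing modes.
        if self.observation_mode in ("screenshot_only", "image_text"):
            parts.append(f"[image: {png_path}]")  # stand-in for a real image attachment
        # last3 keeps only the most recent actions, as action-only history.
        recent = history[-3:] if self.context_window == "last3" else history
        if recent:
            parts.append("Previous actions: " + ", ".join(recent))
        return "\n\n".join(parts)
```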
## Validation on current 214 benchmark mazes

Copy (not committed here) all the mazes inside `nlu_benchmark/benchmark_mazes`, then run:

## Reviewer checklist (short)
- `ExperimentConfig`.
- `smoke_tests/results/` outputs.