Pr1/nlu interfacing minimal mazegen #8

Open

arushi-jain-27 wants to merge 14 commits into main from pr1/nlu-interfacing-minimal-mazegen

Conversation

@arushi-jain-27 (Collaborator) commented May 2, 2026

PR review guidelines — NLU benchmark (v2 interfacing)

Use this as a checklist when reviewing the PR that adds src/v2/nlu_pipeline/ plus the minimal automatic_maze_generation pieces needed for smoke tests (solver, layout generators, render_dataset).

What this PR is for

  • Primary goal: a configurable experiment runner that turns a maze JSON into model-ready (LLM or VLM) input: a system prompt (via a prompting strategy; not needed for initial experiments), per-step user content (text and/or image, plus observation history), and querying semantics (step-by-step vs subgoal vs full trajectory).
  • In scope here: ExperimentRunner, ExperimentConfig (sketched just after this list), observation building, prompting strategies, querying loop, parser, smoke scripts, and two sample mazes (V01_empty_room, V04_single_key) with previews.
  • Out of scope for this PR (follow-up): full automatic maze dataset generation (orchestrator, validator, mechanisms, generate_dataset, bulk generated_mazes/). Those land in later PRs.
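
For orientation, here is a minimal sketch of the configuration surface described above. The four axis names come from this PR; the field types, example values, and defaults are illustrative assumptions, not the actual API:

    from dataclasses import dataclass

    @dataclass
    class ExperimentConfig:
        # Four experimental axes (names per this PR; values shown are assumptions).
        prompting: str = "basic"           # which prompting strategy builds the system prompt
        observation: str = "text_only"     # "text_only" | "screenshot_only" | "image_text"
        context_window: str = "full"       # e.g. "full" vs "last3" (action-only history)
        querying: str = "step_by_step"     # "step_by_step" | "subgoal" | "full_trajectory"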

Suggested review order

  1. config.py — Read the docstring on ExperimentConfig. It defines the experimental axes (prompting, observation, context_window, querying). Confirm the combinations you care about are meaningful and documented.
  2. runner.py — How the runner wires prompt strategy + ObservationBuilder + QueryingMode into a single episode loop; what gets appended to system vs user; when the model is called (see the loop sketch after this list).
  3. observation.py and renderer.py — How text vs PNG observations are built with history; lazy load of render_dataset.py for maze PNGs.
  4. prompt_strategies.py / querying.py — Whether instructions match the parser and env action set.
  5. smoke_tests/ — Scripts are the fastest way to sanity-check behavior without a real LLM API (see below).
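
To make item 2 concrete, here is a hedged pseudocode sketch of what such an episode loop typically looks like. Every name below (run_episode, make_prompt_strategy, parse_action, the env methods) is an assumption for illustration, not the runner's real interface:

    # Illustrative episode loop; names are assumed, not the PR's actual API.
    def run_episode(config, maze, model, env):
        strategy = make_prompt_strategy(config.prompting)    # builds the system prompt, if any
        observer = ObservationBuilder(config.observation, config.context_window)
        system = strategy.system_prompt(maze)                # may include an NL maze block
        history = []
        while not env.done():
            user = observer.build(env.state(), history)      # text and/or PNG, plus history
            reply = model.query(system=system, user=user)    # the one place the model is called
            action = parse_action(reply)                     # must match the env action set
            env.step(action)
            history.append((user, reply, action))
        return env.result()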

How to smoke test locally

BFS

python src\v2\nlu_pipeline\nlu_benchmark\smoke_tests\smoke_bfs.py --maze V04_single_key.json

Claude

python src\v2\nlu_pipeline\nlu_benchmark\smoke_tests\smoke_llm.py -v --log-level DEBUG

Different Experiment Modes

python src\v2\nlu_pipeline\nlu_benchmark\smoke_tests\smoke_prompting_observation_querying.py --maze V01_empty_room.json

Expect: success=True / “All checks passed” (or equivalent), and outputs under smoke_tests/results/ (gitignored).

After smoke_prompting_observation_querying.py: the following script reloads detailed_logs.json and re-checks the same structural invariants as the smoke test (query counts vs stored queries, observation mode vs message shape, prompting markers, context-window text, etc.), printing any ISSUE lines plus a per-run summary.

python src\v2\nlu_pipeline\nlu_benchmark\smoke_tests\analyze_smoke_runner_logs.py --maze V01_empty_room.json

The outputs for each of these smoke tests are stored in a results folder, which contains detailed logs for each query step as well as a PNG rendering of the maze after each agent action.
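
As an illustration of the kind of invariant analyze_smoke_runner_logs.py re-checks, here is a minimal sketch that reloads a run's detailed_logs.json and compares a reported query count against the stored queries. The key names ("queries", "num_queries") are assumptions; the real log schema may differ:

    import json
    from pathlib import Path

    # Sketch only: key names are assumed, not the actual detailed_logs.json schema.
    def check_query_counts(run_dir: str) -> None:
        log = json.loads(Path(run_dir, "detailed_logs.json").read_text())
        stored = log.get("queries", [])
        reported = log.get("num_queries", len(stored))
        if reported != len(stored):
            print(f"ISSUE: reported {reported} queries but {len(stored)} stored")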

Clarification: “image-only” vs text layout in the pipeline

Some reviewers prefer an interface that accepts screenshots only (no natural-language rendering of the maze anywhere).

In this code, that mode already exists as a configuration choice, not a separate subsystem:

  • ExperimentConfig.observation == "screenshot_only" — no initial NL maze block in the system prompt for that axis; user turns emphasize live PNG (and action-only history when context_window is last3). See observation.py and config.py.
  • text_only / image_text add NL layout and/or per-step text via the same ObservationBuilder and runner; the branching is localized to prompting / observation / querying config handling.

This PR deliberately does not split out a second "screenshot-only pipeline": the shared runner and config axes keep one code path, avoid duplicating the episode loop, and keep smoke tests and future experiments (ablations across observation modes) cheap.
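
A hedged sketch of how one code path can serve all three observation modes; the branching below is illustrative, and the helper names (describe_state_as_text, actions_only, full_history) are hypothetical rather than taken from observation.py:

    # Illustrative only: one builder, branching on the configured observation mode.
    def build_user_turn(mode, context_window, state, history, renderer):
        content = []
        if mode in ("text_only", "image_text"):
            content.append(describe_state_as_text(state))    # NL layout / per-step text
        if mode in ("screenshot_only", "image_text"):
            content.append(renderer.render_png(state))       # live PNG of the maze
        if mode == "screenshot_only" and context_window == "last3":
            content.append(actions_only(history[-3:]))       # action-only history
        else:
            content.append(full_history(history))
        return content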

Validation on the current 214 benchmark mazes:

Copy all the mazes (not committed here) into nlu_benchmark/benchmark_mazes, then run:

python3 src/v2/nlu_pipeline/nlu_benchmark/smoke_tests/smoke_benchmark_mazes.py

Reviewer checklist (short)

  • Config axes and defaults are intentional and documented.
  • Runner correctly distinguishes system vs user content and respects ExperimentConfig.
  • Smoke tests pass on a clean checkout (paths, imports, matplotlib).
  • No accidental commit of secrets, large generated datasets, or smoke_tests/results/ outputs.
  • Validation on the current benchmark mazes runs as expected.
  • Follow-up work (full maze generation, bulk JSON/PNGs) is clearly scoped outside this PR if omitted.

arushi-jain-27 and others added 14 commits May 1, 2026 20:13
Add nlu_pipeline (env, runner, agents, observation, examples) with sample
mazes and smoke test scripts. Include mazegen models/generators/solver and
render_dataset for PNG rendering and solver checks.

Omit automatic dataset generation (orchestrator, validator, mechanisms,
generate_dataset) and bulk generated_mazes for a follow-up PR.

Ignore smoke_tests/results and terminal_output.txt.

Co-authored-by: Cursor <cursoragent@cursor.com>
Results directory pattern is now smoke_{maze}_bfs (was smoke_{maze}_smart_manual).

Co-authored-by: Cursor <cursoragent@cursor.com>
Drop other sample maze JSON/PNGs from the PR. Update run_llm and run_local_llm examples to use V01_empty_room.json.

Co-authored-by: Cursor <cursoragent@cursor.com>
@seanrivera self-assigned this May 9, 2026

@seanrivera (Member) left a comment:

Generally this needs to be moved to integrate better with the rest of the system. The build artifacts need to be .gitignored, and the rest needs to be rebased to work with the full pipeline.

@seanrivera: This file path may need some cleanup to work with the rest of the repo.

@seanrivera: Please try not to commit generated artifacts to the repo.

@seanrivera: This as well.

@seanrivera: Can you move these into the same directory as the other JSONs?

@seanrivera: This generates its own maze schema, instead of using the existing one. I understand the need for this for testing, but it shouldn't be merged in.

@seanrivera: Similar to the maze generator above, this conflicts with the other BFS solvers we have. It should probably not be added to the codebase.

@seanrivera: Can you refactor this to support the EvaluationHarness object?

@seanrivera: You can remove this and just use the GridRunners or other versions of the Abstract grid backend.
