Pr1/nlu interfacing minimal mazegen #8

Open

arushi-jain-27 wants to merge 14 commits into main from pr1/nlu-interfacing-minimal-mazegen

Conversation

@arushi-jain-27 (Collaborator) commented May 2, 2026

PR review guidelines — NLU benchmark (v2 interfacing)

Use this as a checklist when reviewing the PR that adds src/v2/nlu_pipeline/ plus the minimal automatic_maze_generation pieces needed for smoke tests (solver, layout generators, render_dataset).

What this PR is for

  • Primary goal: a configurable experiment runner that turns a maze JSON into model-ready (LLM or VLM) input: a system prompt (via a prompting strategy; not needed for initial experiments), per-step user content (text and/or image, plus observation history), and querying semantics (step-by-step vs subgoal vs full trajectory).
  • In scope here: ExperimentRunner, ExperimentConfig (sketched just after this list), observation building, prompting strategies, querying loop, parser, smoke scripts, and two sample mazes (V01_empty_room, V04_single_key) with previews.
  • Out of scope for this PR (follow-up): full automatic maze dataset generation (orchestrator, validator, mechanisms, generate_dataset, bulk generated_mazes/). Those land in later PRs.
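
For orientation, here is a minimal sketch of the configuration surface described above. The four axis names come from this PR; the field types, example values, and defaults are illustrative assumptions, not the actual API:

    from dataclasses import dataclass

    @dataclass
    class ExperimentConfig:
        # Four experimental axes (names per this PR; values shown are assumptions).
        prompting: str = "basic"           # which prompting strategy builds the system prompt
        observation: str = "text_only"     # "text_only" | "screenshot_only" | "image_text"
        context_window: str = "full"       # e.g. "full" vs "last3" (action-only history)
        querying: str = "step_by_step"     # "step_by_step" | "subgoal" | "full_trajectory"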

Suggested review order

  1. config.py — Read the docstring on ExperimentConfig. It defines the experimental axes (prompting, observation, context_window, querying). Confirm the combinations you care about are meaningful and documented.
  2. runner.py — How the runner wires prompt strategy + ObservationBuilder + QueryingMode into a single episode loop; what gets appended to system vs user; when the model is called (see the loop sketch after this list).
  3. observation.py and renderer.py — How text vs PNG observations are built with history; lazy load of render_dataset.py for maze PNGs.
  4. prompt_strategies.py / querying.py — Whether instructions match the parser and env action set.
  5. smoke_tests/ — Scripts are the fastest way to sanity-check behavior without a real LLM API (see below).
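
To make item 2 concrete, here is a hedged pseudocode sketch of what such an episode loop typically looks like. Every name below (run_episode, make_prompt_strategy, parse_action, the env methods) is an assumption for illustration, not the runner's real interface:

    # Illustrative episode loop; names are assumed, not the PR's actual API.
    def run_episode(config, maze, model, env):
        strategy = make_prompt_strategy(config.prompting)    # builds the system prompt, if any
        observer = ObservationBuilder(config.observation, config.context_window)
        system = strategy.system_prompt(maze)                # may include an NL maze block
        history = []
        while not env.done():
            user = observer.build(env.state(), history)      # text and/or PNG, plus history
            reply = model.query(system=system, user=user)    # the one place the model is called
            action = parse_action(reply)                     # must match the env action set
            env.step(action)
            history.append((user, reply, action))
        return env.result()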

How to smoke test locally

BFS

python src\v2\nlu_pipeline\nlu_benchmark\smoke_tests\smoke_bfs.py --maze V04_single_key.json

Claude

python src\v2\nlu_pipeline\nlu_benchmark\smoke_tests\smoke_llm.py -v --log-level DEBUG

Different Experiment Modes

python src\v2\nlu_pipeline\nlu_benchmark\smoke_tests\smoke_prompting_observation_querying.py --maze V01_empty_room.json

Expect: success=True / “All checks passed” (or equivalent), and outputs under smoke_tests/results/ (gitignored).

After smoke_prompting_observation_querying.py: the following script reloads detailed_logs.json and re-checks the same structural invariants as the smoke test (query counts vs stored queries, observation mode vs message shape, prompting markers, context-window text, etc.), printing any ISSUE lines plus a per-run summary.

python src\v2\nlu_pipeline\nlu_benchmark\smoke_tests\analyze_smoke_runner_logs.py --maze V01_empty_room.json

The outputs for each of these smoke tests are stored in a results folder, which contains detailed logs for each query step as well as a PNG rendering of the maze after each agent action.
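
As an illustration of the kind of invariant analyze_smoke_runner_logs.py re-checks, here is a minimal sketch that reloads a run's detailed_logs.json and compares a reported query count against the stored queries. The key names ("queries", "num_queries") are assumptions; the real log schema may differ:

    import json
    from pathlib import Path

    # Sketch only: key names are assumed, not the actual detailed_logs.json schema.
    def check_query_counts(run_dir: str) -> None:
        log = json.loads(Path(run_dir, "detailed_logs.json").read_text())
        stored = log.get("queries", [])
        reported = log.get("num_queries", len(stored))
        if reported != len(stored):
            print(f"ISSUE: reported {reported} queries but {len(stored)} stored")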

Clarification: “image-only” vs text layout in the pipeline

Some reviewers prefer an interface that accepts screenshots only (no natural-language rendering of the maze anywhere).

In this code, that mode already exists as a configuration choice, not a separate subsystem:

  • ExperimentConfig.observation == "screenshot_only" — no initial NL maze block in the system prompt for that axis; user turns emphasize live PNG (and action-only history when context_window is last3). See observation.py and config.py.
  • text_only / image_text add NL layout and/or per-step text via the same ObservationBuilder and runner; the branching is localized to prompting / observation / querying config handling.

This PR deliberately does not split out a second "screenshot-only pipeline": the shared runner and config axes keep one code path, avoid duplicating the episode loop, and keep smoke tests and future experiments (ablations across observation modes) cheap.
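
A hedged sketch of how one code path can serve all three observation modes; the branching below is illustrative, and the helper names (describe_state_as_text, actions_only, full_history) are hypothetical rather than taken from observation.py:

    # Illustrative only: one builder, branching on the configured observation mode.
    def build_user_turn(mode, context_window, state, history, renderer):
        content = []
        if mode in ("text_only", "image_text"):
            content.append(describe_state_as_text(state))    # NL layout / per-step text
        if mode in ("screenshot_only", "image_text"):
            content.append(renderer.render_png(state))       # live PNG of the maze
        if mode == "screenshot_only" and context_window == "last3":
            content.append(actions_only(history[-3:]))       # action-only history
        else:
            content.append(full_history(history))
        return content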

Validation on the current 214 benchmark mazes:

Copy all the mazes (not committed here) into nlu_benchmark/benchmark_mazes, then run:

python3 src/v2/nlu_pipeline/nlu_benchmark/smoke_tests/smoke_benchmark_mazes.py

Reviewer checklist (short)

  • Config axes and defaults are intentional and documented.
  • Runner correctly distinguishes system vs user content and respects ExperimentConfig.
  • Smoke tests pass on a clean checkout (paths, imports, matplotlib).
  • No accidental commit of secrets, large generated datasets, or smoke_tests/results/ outputs.
  • Validation on the current benchmark mazes runs as expected.
  • Follow-up work (full maze generation, bulk JSON/PNGs) is clearly scoped outside this PR if omitted.

arushi-jain-27 and others added 14 commits May 1, 2026 20:13
Add nlu_pipeline (env, runner, agents, observation, examples) with sample
mazes and smoke test scripts. Include mazegen models/generators/solver and
render_dataset for PNG rendering and solver checks.

Omit automatic dataset generation (orchestrator, validator, mechanisms,
generate_dataset) and bulk generated_mazes for a follow-up PR.

Ignore smoke_tests/results and terminal_output.txt.

Co-authored-by: Cursor <cursoragent@cursor.com>
Results directory pattern is now smoke_{maze}_bfs (was smoke_{maze}_smart_manual).

Co-authored-by: Cursor <cursoragent@cursor.com>
Drop other sample maze JSON/PNGs from the PR. Update run_llm and run_local_llm examples to use V01_empty_room.json.

Co-authored-by: Cursor <cursoragent@cursor.com>
@seanrivera self-assigned this May 9, 2026

@seanrivera (Member) left a comment:

Generally this needs to be moved to integrate better with the rest of the system. The build artifacts need to be .gitignored, and the rest needs to be rebased to work with the full pipeline.

@seanrivera: This file path may need some cleanup to work with the rest of the repo.

@seanrivera: Please try not to commit generated artifacts to the repo.

@seanrivera: This as well.

@seanrivera: Can you move these into the same directory as the other JSONs?

@seanrivera: This generates its own maze schema, instead of using the existing one. I understand the need for this for testing, but it shouldn't be merged in.

@seanrivera: Similar to the maze generator above, this conflicts with the other BFS solvers we have. It should probably not be added to the codebase.

@seanrivera: Can you refactor this to support the EvaluationHarness object?

@seanrivera: You can remove this and just use the GridRunners or other versions of the Abstract grid backend.
