
Maze Generation and NL Interface Setup #1

Open

arushi-jain-27 wants to merge 1 commit into main from maze_gen_and_interface

Conversation


arushi-jain-27 commented Apr 25, 2026

NLU Interface

Purpose: nlu_benchmark evaluates models on language-guided navigation in a grid world loaded from the same maze JSON (maze + mechanisms) that automatic maze generation produces. One run = one episode: reset the env, query the model, parse its actions, step the simulator, and repeat until success, timeout, or repeated failure.
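A sketch of that episode loop; every name below (run_episode, the agent/parse/env_step callables) is an illustrative stand-in, not the runner's real API:

```python
def run_episode(agent, parse, env_step, max_steps=20, max_failures=3):
    """Episode skeleton: query the model, parse its actions, step the
    simulator; stop on success, timeout, or repeated parse failures."""
    transcript, failures = [], 0
    while len(transcript) < max_steps:
        actions = parse(agent(transcript))
        if not actions:                      # unparseable reply
            failures += 1
            if failures >= max_failures:
                break
            continue
        for action in actions:
            event = env_step(action)
            transcript.append((action, event))
            if event in ("DONE", "WRONG_DONE"):
                return {"success": event == "DONE", "transcript": transcript}
    return {"success": False, "transcript": transcript}

# Scripted stand-ins for the model and simulator:
replies = iter(["FINAL_OUTPUT: MOVE_FORWARD", "FINAL_OUTPUT: DONE"])
print(run_episode(
    agent=lambda history: next(replies),
    parse=lambda text: [text.split(":", 1)[1].strip()],
    env_step=lambda a: "DONE" if a == "DONE" else "MOVED",
))
```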

Simulation: GridWorldEnv implements TURN_LEFT / TURN_RIGHT / MOVE_FORWARD / PICKUP / TOGGLE / DONE, with walls, color-matched keys/doors, switches that control gates, and clear StepEvent outcomes (MOVED, BLOCKED, PICKUP, TOGGLED, DONE, WRONG_DONE, etc.). That logic is the ground truth; the model only sees what the prompt (and optional image) exposes.
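A toy version of that action/event surface, reduced to a one-row corridor; the enum members come from the list above, but the class shape and method names are invented for illustration:

```python
from enum import Enum, auto

class Action(Enum):
    TURN_LEFT = auto()
    TURN_RIGHT = auto()
    MOVE_FORWARD = auto()
    PICKUP = auto()
    TOGGLE = auto()
    DONE = auto()

class StepEvent(Enum):
    MOVED = auto()
    BLOCKED = auto()
    PICKUP = auto()
    TOGGLED = auto()
    DONE = auto()
    WRONG_DONE = auto()

class ToyEnv:
    """One-row corridor, goal at the east end; the keys, doors, switches,
    and gates of the real GridWorldEnv are elided."""

    def __init__(self, length=4):
        self.length, self.col, self.goal = length, 0, length - 1

    def step(self, action: Action) -> StepEvent:
        if action is Action.MOVE_FORWARD:
            if self.col + 1 < self.length:
                self.col += 1
                return StepEvent.MOVED
            return StepEvent.BLOCKED            # ran into the east wall
        if action is Action.DONE:
            return (StepEvent.DONE if self.col == self.goal
                    else StepEvent.WRONG_DONE)  # DONE away from the goal
        return StepEvent.MOVED                  # turns trivially "succeed" here

env = ToyEnv()
plan = [Action.MOVE_FORWARD] * 3 + [Action.DONE]
print([env.step(a).name for a in plan])  # ['MOVED', 'MOVED', 'MOVED', 'DONE']
```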

Experimental knobs (ExperimentConfig; a config sketch follows the list):

  • prompting: minimal vs standard (mechanism list) vs verbose (rules + richer per-turn hints).
  • observation: text-only, text + live PNG each turn, or screenshot-first (minimal text, action-only history).
  • context_window: current turn only vs last 3 steps in the user message.
  • querying: step-by-step (one actionable token per model call), subgoal (batched SUB_GOAL + ACTIONS, re-query when the queue empties or after failures), or full trajectory (one long ACTIONS list per episode).
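A hypothetical sketch of the config surface these knobs imply; the field names and literal values are guesses from this description, not the actual ExperimentConfig definition:

```python
from dataclasses import dataclass, asdict
from typing import Literal

@dataclass(frozen=True)
class ExperimentConfig:
    prompting: Literal["minimal", "standard", "verbose"] = "standard"
    observation: Literal["text", "text+image", "screenshot_first"] = "text"
    context_window: Literal["current", "last_3"] = "current"
    querying: Literal["step", "subgoal", "full_trajectory"] = "step"

# Serialized alongside each run so results are reproducible from the config:
print(asdict(ExperimentConfig(prompting="verbose", querying="subgoal")))
```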

Prompting pipeline: The runner (ExperimentRunner / build_runner) assembles system instructions (task, valid actions, output format) and, for text/image+text modes, a fixed initial maze description once per episode. Each user turn adds the current situation (position, facing, inventory, live mechanism state), last step result, optional history, and optional base64 PNG renders. PNGs use the same drawing path as automatic_maze_generation/render_dataset.py so benchmark images match dataset style.
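Roughly how one per-turn user message could be assembled from those pieces; the helper name, state fields, and OpenAI-style message dicts are assumptions, not code lifted from ExperimentRunner:

```python
import base64

def build_user_turn(state, last_event, history=None, png_bytes=None):
    """Compose one user message: current situation, last step result,
    optional history, optional inline PNG (all field names assumed)."""
    lines = [
        f"Position: {state['pos']}, facing {state['facing']}.",
        f"Inventory: {state['inventory'] or 'empty'}.",
        f"Mechanisms: {state['mechanisms']}.",
        f"Last step result: {last_event}.",
    ]
    if history:  # last-3-steps context window
        lines.append("Recent steps: " + "; ".join(history[-3:]))
    content = [{"type": "text", "text": "\n".join(lines)}]
    if png_bytes is not None:  # image+text / screenshot-first modes
        b64 = base64.b64encode(png_bytes).decode("ascii")
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})
    return {"role": "user", "content": content}

msg = build_user_turn(
    {"pos": (2, 3), "facing": "north", "inventory": [],
     "mechanisms": "red door: closed"},
    last_event="BLOCKED",
    history=["MOVE_FORWARD -> MOVED", "MOVE_FORWARD -> BLOCKED"],
)
print(msg["content"][0]["text"])
```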

Model interface: The “agent” is any function that maps a message list to reply text; included helpers cover a random baseline, the Hugging Face Router, and local Transformers models. Replies are parsed via FINAL_OUTPUT: and/or SUB_GOAL / ACTIONS depending on the querying mode. Each run returns success, steps used, final state, and a step transcript, plus the serialized config for reproducibility.
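A minimal parser for those reply conventions, assuming one action token after FINAL_OUTPUT: in step mode and a comma-separated list after ACTIONS: otherwise; the exact formats in the repo may differ:

```python
import re

def parse_reply(text: str, querying: str) -> list:
    """Pull action tokens out of a model reply (line formats assumed)."""
    if querying == "step":
        m = re.search(r"FINAL_OUTPUT:\s*([A-Z_]+)", text)
        return [m.group(1)] if m else []
    m = re.search(r"ACTIONS:\s*(.+)", text)
    return [tok.strip() for tok in m.group(1).split(",")] if m else []

print(parse_reply("Thinking...\nFINAL_OUTPUT: MOVE_FORWARD", "step"))
print(parse_reply("SUB_GOAL: fetch the red key\nACTIONS: TURN_LEFT, MOVE_FORWARD",
                  "subgoal"))
```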

Smoke testing: Sample mazes are solved with a BFS shortest-path search over the full mechanism state; the resulting path is converted into actions and stepped through GridWorldEnv (e.g. nlu_benchmark/smoke_tests/smoke_smart_manual.py) to sanity-check loading, rendering, and env semantics without an LLM.
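The smoke-test idea in miniature: BFS on a toy grid, then a path-to-actions conversion. The real solver searches the full mechanism state (keys held, switches flipped); this sketch assumes a plain grid where walls are just blocked cells:

```python
from collections import deque

DIRS = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}
ORDER = ["N", "E", "S", "W"]  # clockwise, for computing turns

def bfs_path(blocked, size, start, goal):
    """Shortest cell path; `blocked` is a set of impassable cells."""
    prev, queue = {start: None}, deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = [cell]
            while prev[path[-1]] is not None:
                path.append(prev[path[-1]])
            return path[::-1]
        for dr, dc in DIRS.values():
            nxt = (cell[0] + dr, cell[1] + dc)
            if (0 <= nxt[0] < size and 0 <= nxt[1] < size
                    and nxt not in blocked and nxt not in prev):
                prev[nxt] = cell
                queue.append(nxt)
    return None

def path_to_actions(path, facing="E"):
    """Convert a cell path into TURN_LEFT/TURN_RIGHT/MOVE_FORWARD tokens."""
    actions = []
    for a, b in zip(path, path[1:]):
        want = next(d for d, (dr, dc) in DIRS.items()
                    if (a[0] + dr, a[1] + dc) == b)
        turn = (ORDER.index(want) - ORDER.index(facing)) % 4
        actions += {0: [], 1: ["TURN_RIGHT"], 2: ["TURN_RIGHT"] * 2,
                    3: ["TURN_LEFT"]}[turn] + ["MOVE_FORWARD"]
        facing = want
    return actions + ["DONE"]

path = bfs_path(blocked={(1, 1)}, size=3, start=(0, 0), goal=(2, 2))
print(path_to_actions(path))
```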

Maze generation

Generation is spec-driven: a MazeGenSpec picks a backbone topology (winding corridor, multi-route, side vault, sequential chain, dense maze), a logic chain (none, key–door, switch–gate, ordered multi-step chains, etc.), and optional distractors (wrong keys, dead ends, distractor chains). The pipeline is roughly sample spec → layout generation → mechanism placement → validation (orchestrator, generators, mechanisms, validator). Validation checks things like solvability, avoiding unintended shortcuts, and respecting prerequisite order before blockers.
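A sketch of that spec-to-pipeline flow; the option lists echo the topologies and chains named above, while the stub stages and function shapes are assumptions about the orchestrator, not its real code:

```python
import random
from dataclasses import dataclass

BACKBONES = ["winding_corridor", "multi_route", "side_vault",
             "sequential_chain", "dense_maze"]
CHAINS = ["none", "key_door", "switch_gate", "ordered_multi_step"]
DISTRACTORS = ["wrong_key", "dead_end", "distractor_chain"]

@dataclass
class MazeGenSpec:
    backbone: str
    chain: str
    distractors: list

def sample_spec(rng):
    return MazeGenSpec(
        backbone=rng.choice(BACKBONES),
        chain=rng.choice(CHAINS),
        distractors=rng.sample(DISTRACTORS, k=rng.randint(0, 2)),
    )

def generate(spec):
    """Pipeline skeleton: layout -> mechanisms -> validation.
    Each stage is a stub standing in for the real generators/validator."""
    layout = {"backbone": spec.backbone}      # layout generation
    mechanisms = {"chain": spec.chain,        # mechanism placement
                  "distractors": spec.distractors}
    # The real validator checks solvability, unintended shortcuts, and
    # prerequisite ordering; here we only record a placeholder verdict.
    return {"maze": layout, "mechanisms": mechanisms,
            "validation": {"valid": True, "reasons": []}}

print(generate(sample_spec(random.Random(0))))
```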

The artifact is a JSON payload: maze (grid dimensions, wall segments, start, goal), mechanisms (keys, doors, switches, gates, plus reserved slots for other types), and validation (validity, reasons, solver cost/path and interaction traces where applicable). mazegen/generate_dataset.py turns validated layouts into these files for dataset builds. render_dataset.py reads the same JSON and draws PNG layouts with matplotlib (including a row/col → display-space conversion helper).
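An illustrative, hand-written instance of that payload shape; the three top-level keys come from the description, but the nested field names and values are guesses:

```python
import json

payload = {
    "maze": {"rows": 7, "cols": 7, "start": [0, 0], "goal": [6, 6],
             "walls": [[[1, 1], [1, 2]]]},       # wall segments (assumed format)
    "mechanisms": {"keys": [{"color": "red", "pos": [2, 3]}],
                   "doors": [{"color": "red", "pos": [4, 3]}],
                   "switches": [], "gates": []},  # plus reserved slots
    "validation": {"valid": True, "reasons": [],
                   "solver_cost": 14},            # solver path/trace elided
}
print(json.dumps(payload, indent=2))
```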
