Limited-Horizon Bellman Learning (LHBL)

This repository contains the code for the paper "Beyond Single-Step Updates: Reinforcement Learning of Heuristics with Limited-Horizon Search" (LHBL), published at AAAI 2026.

It currently contains code for using LHBL and DeepCubeA to solve the Rubik's Cube, 15-puzzle, 24-puzzle, 35-puzzle, 48-puzzle, and Lights Out.

It is a modified version of the original DeepCubeA repository.

Setup

For the required Python packages, see requirements.txt or cube.yml. You should be able to install them with pip or conda.

Python version used: 3.7.2

Training and A* Search

Training uses Approximate Value Iteration (AVI) to learn a cost-to-go (heuristic) function. The learned heuristic is then used with A* search to solve puzzle instances.

There are two training methods available:

  • SSBL (Single-step Bellman Learning): Single-step Bellman backup updates
  • LHBL (Limited-Horizon Bellman Learning): Multi-step updates using graph-based Bellman backups within a search horizon, which can improve heuristic quality
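
To make the difference concrete, here is a minimal, illustrative Python sketch of the two backup styles. All names here (env.expand, env.is_goal, heuristic) are hypothetical placeholders rather than this repository's actual API, and the exhaustive recursive lookahead is a simplification: LHBL backs values up through a bounded search graph (A* or GBFS) instead of a full-width tree.

# Hypothetical sketch; names do not reflect this repository's API.

def ssbl_target(state, env, heuristic):
    # Single-step Bellman backup: expand once, then bootstrap with
    # the current heuristic at the successors.
    if env.is_goal(state):
        return 0.0
    succs, costs = env.expand(state)
    return min(c + heuristic(s) for s, c in zip(succs, costs))

def lhbl_target(state, env, heuristic, horizon):
    # Limited-horizon backup: look ahead up to `horizon` steps before
    # bootstrapping with the heuristic at the frontier.
    if env.is_goal(state):
        return 0.0
    if horizon == 0:
        return heuristic(state)
    succs, costs = env.expand(state)
    return min(c + lhbl_target(s, env, heuristic, horizon - 1)
               for s, c in zip(succs, costs))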

Pre-trained models are provided in the saved_models/ directory, along with output.txt files that show what output to expect.

There are pre-computed results of A* search in the results/ directory.

Commands to train LHBL to solve the 15-puzzle

Train cost-to-go function - LHBL(50) (Limited-Horizon Bellman Learning with horizon=50)

python ctg_approx/avi.py --env puzzle15 --states_per_update 50000000 --batch_size 10000 --nnet_name puzzle15 --max_itrs 1000000 --loss_thresh 0.1 --back_max 500 --num_update_procs 30 --max_update_steps 50 --update_method ASTAR --limited_horizon_update

Train cost-to-go function - SSBL

python ctg_approx/avi.py --env puzzle15 --states_per_update 50000000 --batch_size 10000 --nnet_name puzzle15 --max_itrs 1000000 --loss_thresh 0.1 --back_max 500 --num_update_procs 30

The key additional flags for LHBL are:

  • --limited_horizon_update: Enable limited-horizon Bellman backup
  • --max_update_steps: Number of search steps to take (search horizon)
  • --update_method ASTAR: Use A* search during updates (alternative: GBFS)

Solve with A* search; use --verbose for more information

python search_methods/astar.py --states data/puzzle15/test/data_0.pkl --model_dir saved_models/puzzle15/current/ --env puzzle15 --weight 0.8 --batch_size 20000 --results_dir results/puzzle15/ --language cpp --nnet_batch_size 10000

Improving Results

During approximate value iteration (AVI), one can get better results by increasing the batch size (--batch_size) and number of states per update (--states_per_update). Decreasing the threshold before the target network is updated (--loss_thresh) can also help.

One can also add states to the training set by performing greedy best-first search (GBFS) during the update stage and adding the states encountered during GBFS to those used for approximate value iteration (--max_update_steps). Setting --max_update_steps to 1 is the same as doing plain approximate value iteration.

During A* search, increasing the weight on the path cost (--weight, which should be in [0,1]) and the batch size (--batch_size) generally improves results.

These improvements often come at the expense of time.
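
For reference, --weight is the path-cost weight in weighted A*. Assuming the f(n) = weight * g(n) + h(n) convention of DeepCubeA-style search, the priority computation is simply:

# Weighted A* node priority. weight in [0,1] scales the path cost g
# relative to the heuristic h: weight = 1.0 recovers standard A*, while
# smaller values behave more greedily (faster, but longer solutions).
def priority(g, h, weight):
    return weight * g + h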

Reproducing Results

To reproduce the paper's results, use the provided train.sh and test.sh scripts. They include examples of how to run SSBL and LHBL(50); using different horizons is also possible and is explained in train.sh.

Training

bash train.sh

The script contains training commands for all environments (Rubik's Cube, 35-Puzzle, Lights Out) with both SSBL and LHBL methods. Edit the script to uncomment the specific experiment you want to run.

Testing

bash test.sh

The script runs A* search using trained models on test instances for all environments. Adjust the variables at the top of the script (start_idx, end_idx, solution_time_cap) as needed.

Parallelism

Training and solving can be easily parallelized across multiple CPUs and GPUs.

When training with ctg_approx/avi.py, set the number of workers for approximate value iteration with --num_update_procs. During the update process, the target DNN is spawned on each available GPU, and they work in parallel during the update step.

The number of GPUs used can be controlled by setting the CUDA_VISIBLE_DEVICES environment variable.

e.g., export CUDA_VISIBLE_DEVICES="0,1,2,3"
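
As a quick sanity check, you can confirm which GPUs the process will actually see. A minimal sketch, assuming a PyTorch environment (PyTorch only sees the devices listed in CUDA_VISIBLE_DEVICES):

import os
import torch

# Report the devices visible to this process.
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"))
print("Visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i} -> {torch.cuda.get_device_name(i)}")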

Memory

When generating training data with approximate value iteration and when solving with A* search, the batch size of the data given to the DNN can be controlled with --update_nnet_batch_size (avi.py) and --nnet_batch_size (astar.py). Reduce these values if your GPUs run out of memory during approximate value iteration or during A* search.
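
Conceptually, these flags cap how many states are pushed through the DNN at once. A minimal sketch of the idea, using hypothetical names that do not match the repository's code:

import numpy as np

def evaluate_in_batches(states, nnet, nnet_batch_size):
    # Run the heuristic DNN over the states in fixed-size chunks so
    # the GPU never holds more than nnet_batch_size states at a time.
    outputs = []
    for start in range(0, len(states), nnet_batch_size):
        chunk = states[start:start + nnet_batch_size]
        outputs.append(nnet(chunk))  # hypothetical forward call
    return np.concatenate(outputs)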

Compiling C++ for A* Search

cd cpp/

make

If you are not able to get the C++ version working on your computer, you can change the --language switch for search_methods/astar.py from --language cpp to --language python. Note that the C++ version is generally faster.

Citation

To cite this project, please use

@misc{hadar2025singlestepupdatesreinforcementlearning,
      title={Beyond Single-Step Updates: Reinforcement Learning of Heuristics with Limited-Horizon Search}, 
      author={Gal Hadar and Forest Agostinelli and Shahaf S. Shperberg},
      year={2025},
      eprint={2511.10264},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2511.10264}, 
}
