LearningToOptimize · andrewrosemberg · Jun 22, 2026 · Jun 19, 2026 · Jun 19, 2026 · Jun 20, 2026
diff --git a/.gitignore b/.gitignore
@@ -51,4 +51,7 @@ examples/**/.CondaPkg/*
 *.err
 *.tsv
 *.pdf
-plan.md
+plan/
+*_cuts.json
+settings.json
+*.sh
diff --git a/Project.toml b/Project.toml
@@ -16,6 +16,7 @@ Logging = "56ddb016-857b-54e1-b83d-db4d58db5568"
 MathOptInterface = "b8f27783-ece8-5eb3-8dc8-9495eed66fee"
 ParametricOptInterface = "0ce4ce61-57bf-432b-a095-efac525d185e"
 Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
+Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
 Zygote = "e88e6eb3-aa80-5325-afca-941959d7151f"
 
 [compat]
@@ -35,6 +36,7 @@ MadNLP = "0.8, 0.9, 0.10"
 MadNLPGPU = "0.7, 0.8, 0.9, 0.10"
 MathOptInterface = "1.48.0"
 ParametricOptInterface = "0.14.1, 0.15, 0.16"
+Statistics = "1.10, 1.11"
 Zygote = "0.6.77, 0.7"
 julia = "1.10, 1.11, 1.12"
 

diff --git a/README.md b/README.md
@@ -27,7 +27,7 @@ DecisionRules.jl implements this workflow in three flavors:
 
 ```julia
 using Pkg
-Pkg.add(url="https://github.com/LearningToOptimize/DecisionRules.jl.git")
+Pkg.add("DecisionRules")
 ```
 
 ## What you need to provide
@@ -202,6 +202,18 @@ Each evaluation reports (a) the rollout objective **excluding** the target-slack
 
 Per-sample debugging hooks can be attached with `SampleLog(on_sample=(s, models, log) -> ...)`; the training loop calls the hook after each sample's solve with the live JuMP model(s). The previous `record_loss=(iter, model, loss, tag) -> ...` keyword keeps working as a deprecated adapter.
 
+## GPU acceleration with DecisionRulesExa.jl
+
+For large-scale problems where the inner NLP solve is the bottleneck (e.g., AC-OPF with hundreds of buses), [DecisionRulesExa.jl](https://github.com/LearningToOptimize/DecisionRulesExa.jl) provides a GPU-accelerated backend that replaces JuMP with [ExaModels.jl](https://github.com/exanauts/ExaModels.jl) and solves with [MadNLP.jl](https://github.com/MadNLP/MadNLP.jl) + CUDSS on GPU.
+
+DecisionRulesExa.jl implements the same TS-DDR algorithm (deterministic-equivalent mode) with the same envelope-theorem gradient computation, but formulates the NLP in ExaModels' SIMD-compatible modeling layer. This enables:
+
+- **GPU-native interior-point solves** via MadNLP + CUDSS
+- **Parallel GPU solves** for multiple training samples per gradient step
+- **Runtime parameter updates** via `ExaModels.set_parameter!` (no model reconstruction)
+
+See the [GPU Acceleration](https://LearningToOptimize.github.io/DecisionRules.jl/dev/gpu_acceleration/) page in the documentation for a tutorial on getting started with DecisionRulesExa.jl.
+
 ## Examples and tests
 
 Examples live in `examples/`. Run tests with:

diff --git a/docs/Project.toml b/docs/Project.toml
@@ -3,9 +3,12 @@ DecisionRules = "47937410-f832-486f-8300-12c95b225dfc"
 DiffOpt = "930fe3bc-9c6b-11ea-2d94-6184641e85e7"
 Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
 Flux = "587475ba-b771-5e3f-ad9e-33799f191a9c"
+Functors = "d9f16b24-f501-4c13-a1f2-28368ffc5196"
+HiGHS = "87dc4568-4c63-4d18-b0c0-bb2238e4078b"
 Ipopt = "b6b21f68-93f8-5de0-b562-5493be1d77c9"
 JuMP = "4076af6c-e467-56ae-b986-b466b2749572"
 Literate = "98b081ad-f1c9-55d3-8b20-4c87d4299306"
+MathOptInterface = "b8f27783-ece8-5eb3-8dc8-9495eed66fee"
 Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
 Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
 

diff --git a/docs/make.jl b/docs/make.jl
@@ -24,6 +24,9 @@ makedocs(;
     pages=[
         "Home" => "index.md",
         "Algorithm" => "algorithm.md",
+        "Gradient Fallback" => "gradient_fallback.md",
+        "Uncertainty Sampling" => "sampling.md",
+        "GPU Acceleration" => "gpu_acceleration.md",
         "Examples" => [
             "Hydropower Scheduling" => "examples/hydro.md",
             "Rocket Control" => "examples/rocket.md",

diff --git a/docs/src/algorithm.md b/docs/src/algorithm.md
@@ -108,6 +108,86 @@ for k = 1, ..., ⌈T/W⌉:
 **Pros**: balances coupling (within windows) with tractability; parallelizable windows.
 **Cons**: continuity gaps between windows require penalty tuning.
 
+## Mixed gradient: score-function (REINFORCE) correction
+
+For problems with integer variables or non-smooth subproblems, the dual
+gradient can be biased — it is local to a fixed integer assignment and cannot
+see the effect of discrete switches (e.g., opening a setup variable).
+
+DecisionRules provides a **score-function (REINFORCE)** correction that mixes
+the dual gradient with a model-free policy gradient estimated from stage-wise
+rollouts under perturbed targets.
+
+### How the score-function estimator works
+
+1. **Perturb**: add Gaussian noise to the policy targets:
+   ``\tilde{x}_t = \hat{x}_t(\theta) + \delta_t``, where
+   ``\delta_t \sim \mathcal{N}(0, \sigma^2 I)``.
+
+2. **Rollout**: solve the stage-wise subproblems with the perturbed targets to
+   obtain realized costs ``R_m`` for ``m = 1, \ldots, M`` rollouts. These
+   rollouts solve the models exactly as built (MIPs stay MIPs), so the costs
+   reflect true integer-feasible decisions.
+
+3. **Advantage**: center the costs, ``A_m = R_m - \bar{R}``, where ``\bar{R}`` is the mean
+   over the ``M`` rollouts (a baseline reduces variance without changing the expected gradient).
+
+4. **Surrogate loss**: the differentiable scalar whose gradient recovers the
+   REINFORCE estimate:
+
+```math
+L_{\text{sf}}(\theta)
+\;=\;
+\frac{1}{M} \sum_{m=1}^{M}
+  A_m
+  \sum_{t=1}^{T}
+  \left\langle
+    \frac{\delta_{m,t}}{\sigma^2},\;
+    \hat{x}_{t}(\theta)
+  \right\rangle.
+```
+
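+This is the standard score-function estimator for Gaussian perturbations.
+The key identity is
+``\nabla_{\hat{x}_t} \log p(\tilde{x}_t \mid \hat{x}_t(\theta)) = \delta_t / \sigma^2``
+for a Gaussian centered at ``\hat{x}_t(\theta)``; the chain rule through ``\hat{x}_t(\theta)`` then gives the gradient with respect to ``\theta``.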
+
+### Mixed gradient
+
+The final training gradient combines both signals:
+
+```math
+\nabla L
+\;=\;
+\alpha\, \nabla L_{\text{dual}}
++ (1 - \alpha)\, \nabla L_{\text{sf}},
+```
+
+where ``\alpha \in [0, 1]`` is the `dual_weight`.
+
+There are two separate solve paths in the mixed-gradient training loop:
+
+- **Dual path**: controlled by `integer_strategy`, which determines how local
+  dual information is read from the deterministic equivalent
+  (e.g., [`FixedDiscreteIntegerStrategy`](@ref) solves the MIP, fixes integers,
+  re-solves the LP, and reads LP duals).
+- **Score-function path**: controlled by [`ScoreFunctionConfig`](@ref), which
+  owns separate rollout subproblems. These are solved exactly as built, and
+  their realized costs define the Monte Carlo score-function term.
+
+### Scheduled ramp-in
+
+A [`ScoreFunctionSchedule`](@ref) can ramp ``\alpha`` from 1 (pure dual) to
+its final value over a warmup period.  Let ``k`` be the current iteration, ``k_0`` the iteration at which the ramp starts, ``r`` the ramp length in iterations, and
+``\rho_k = \operatorname{clip}((k - k_0) / r,\, 0,\, 1)``.  The effective
+score-function weight is ``\rho_k (1 - \alpha)``.
+
+This lets the DE dual gradient establish a good initial policy before
+introducing the higher-variance REINFORCE signal.
+
+See the [Stochastic Lot-Sizing with Fixed Ordering Costs](@ref) example for a
+complete worked example with integer variables and mixed gradients.
+
 ## Penalty annealing
 
 The target penalty ``\lambda`` is critical: too small and the optimizer ignores
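
Note on the `docs/src/algorithm.md` hunk above: the score-function steps (perturb, rollout, advantage, surrogate) and the ramp-in weight can be illustrated with a minimal sketch. This is not the package's implementation. It assumes a single stage, an identity policy ``\hat{x}(\theta) = \theta`` (so no automatic differentiation is needed), and a hypothetical smooth cost `rollout_cost` standing in for a subproblem solve. The names `rollout_cost`, `score_function_gradient`, and `sf_weight` are illustrative and do not appear in DecisionRules.jl.

```julia
using Random, Statistics

# Hypothetical stand-in for the realized cost of one rollout.
# In the real setting this would be a (possibly mixed-integer) subproblem solve.
rollout_cost(x) = sum(abs2, x)

# Score-function gradient estimate for an identity policy x̂(θ) = θ (assumption).
function score_function_gradient(rng, theta, sigma, M)
    d = length(theta)
    deltas = sigma .* randn(rng, d, M)   # step 1: one Gaussian perturbation per column
    costs = [rollout_cost(theta .+ view(deltas, :, m)) for m in 1:M]  # step 2: rollouts
    advantages = costs .- mean(costs)    # step 3: mean baseline
    # Step 4: gradient of the surrogate; the policy Jacobian is the identity here.
    return (deltas * advantages) ./ (M * sigma^2)
end

# Effective score-function weight ρ_k (1 - α) with ρ_k = clip((k - k0) / r, 0, 1).
sf_weight(k; alpha, k0, r) = clamp((k - k0) / r, 0.0, 1.0) * (1 - alpha)

rng = MersenneTwister(1234)
theta = [1.0, -2.0]
grad = score_function_gradient(rng, theta, 0.5, 200_000)
```

For this quadratic cost the gradient of the smoothed objective ``\mathbb{E}[R(\theta + \delta)]`` is ``2\theta``, so `grad` should land close to `2 .* theta` up to Monte Carlo noise. With a non-identity policy, the same surrogate would instead be differentiated through ``\hat{x}_t(\theta)``.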

diff --git a/docs/src/api.md b/docs/src/api.md
@@ -16,4 +16,5 @@ Private = false
 ```@autodocs
 Modules = [DecisionRules]
 Public = false
+Filter = t -> t != DecisionRules
 ```
diff --git a/docs/src/assets/hydro_generation_comparison.png b/docs/src/assets/hydro_generation_comparison.png
diff --git a/docs/src/assets/hydro_volume_comparison.png b/docs/src/assets/hydro_volume_comparison.png
diff --git a/docs/src/assets/inventory_integer_results.png b/docs/src/assets/inventory_integer_results.png
diff --git a/docs/src/assets/inventory_relaxed_results.png b/docs/src/assets/inventory_relaxed_results.png