Policy decay to child value #2394

Draft
Menkib64 wants to merge 29 commits into LeelaChessZero:master from Menkib64:policy-decay-based-on-value

Conversation

@Menkib64
Contributor

@Menkib64 Menkib64 commented Mar 1, 2026

This is an attempt to decay the prior policy toward a value-defined policy. The default values are rough guesses that might not be completely unreasonable. They would have to be tuned to work well.

@Menkib64 Menkib64 force-pushed the policy-decay-based-on-value branch from 2c66bb5 to 34f3b52 March 1, 2026 20:29
@Menkib64 Menkib64 marked this pull request as ready for review March 4, 2026 14:55
Copilot AI review requested due to automatic review settings March 4, 2026 14:55

Copilot AI left a comment

Pull request overview

This PR introduces a “policy decay” mechanism that blends the prior policy with a value-derived policy as node visits increase, configurable via new search parameters. It applies to both classic and dag_classic search paths and also surfaces the decayed policy in verbose stats.

Changes:

  • Add PolicyDecay(...) and apply it during child selection (classic + dag_classic).
  • Add new UCI/CLI options for decay configuration (temperature, visit horizon, value share).
  • Enhance dag_classic verbose stats to display decayed policy (PD) and use it in U/S reporting.
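
The blending described above can be sketched as follows. This is only an illustration under stated assumptions, not the PR's `PolicyDecay(...)`: the diff is not shown here, so the function name `BlendPolicySketch`, the softmax over child Q values, and the linear ramp over the visit horizon are all hypothetical choices. `tau`, `decay_visits`, and `value_share` stand in for the temperature, visit horizon, and value share options.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch: blend the prior policy with a softmax over child Q
// values. The weight of the value-derived policy grows linearly with the
// parent's visit count and is capped at `value_share` (a fraction in [0, 1]).
// The real PR may use a different formula.
std::vector<float> BlendPolicySketch(const std::vector<float>& prior,
                                     const std::vector<float>& q, float tau,
                                     int decay_visits, float value_share,
                                     int parent_visits) {
  assert(prior.size() == q.size());
  assert(tau > 0.0f && decay_visits > 0);
  std::vector<float> out(prior);
  const std::size_t n = prior.size();
  if (n == 0) return out;

  // Softmax of q / tau, shifted by the maximum for numerical stability.
  const float max_q = *std::max_element(q.begin(), q.end());
  std::vector<float> value_policy(n);
  float sum = 0.0f;
  for (std::size_t i = 0; i < n; ++i) {
    value_policy[i] = std::exp((q[i] - max_q) / tau);
    sum += value_policy[i];
  }

  // Fraction of the visit horizon reached so far, clamped to [0, 1].
  const float progress =
      std::min(1.0f, static_cast<float>(parent_visits) /
                         static_cast<float>(decay_visits));
  const float w = value_share * progress;

  for (std::size_t i = 0; i < n; ++i) {
    out[i] = (1.0f - w) * prior[i] + w * value_policy[i] / sum;
  }
  return out;
}
```

With zero parent visits the sketch returns the prior unchanged. Once the visit horizon is reached, a `value_share` fraction of the probability mass follows the children's values, so a child with a high Q but a low prior gains weight.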

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| src/search/dag_classic/search.cc | Adds PolicyDecay and uses decayed policy in verbose stats and U/S calculations. |
| src/search/dag_classic/node.h | Adds EdgeAndNode::GetIndex() helper for mapping edges to array indices. |
| src/search/classic/search.cc | Adds PolicyDecay and applies it to current_pol during node picking. |
| src/search/classic/params.h | Exposes new decay/temperature getters on search params. |
| src/search/classic/params.cc | Registers new options and stores decay parameters. |

Comment thread src/search/classic/search.cc Outdated
Comment thread src/search/classic/search.cc Outdated
Comment thread src/search/classic/params.cc Outdated
Comment thread src/search/classic/params.cc Outdated
Comment on lines +678 to +680
kPolicyValueTemperature(1.0f / options.Get<float>(kPolicyValueTemperatureId)),
kPolicyDecayVisits(options.Get<int>(kPolicyDecayVisitsId)),
kPolicyDecayValueShare(options.Get<float>(kPolicyDecayValueShareId) / 100.0f),

Copilot AI Mar 4, 2026

kPolicyValueTemperature is initialized as 1.0f / PolicyValueTemperature, but the getter is named GetPolicyValueTemperature() and the option help refers to "Tau". This inversion is easy to misread later; consider renaming the cached field/getter to indicate it's an inverse temperature, or store the raw tau and perform the inversion at the call site.
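
One way to act on this suggestion is to keep the raw tau and expose the inverse under an explicit name. The sketch below is hypothetical: `DecayParamsSketch` and `GetPolicyValueInverseTemperature()` are illustrative names, not part of lc0, and the real `SearchParams` class reads its values from an options dictionary rather than a constructor argument.

```cpp
#include <cassert>

// Hypothetical stand-in for the relevant part of the search params class.
class DecayParamsSketch {
 public:
  explicit DecayParamsSketch(float tau)
      : policy_value_tau_(tau), policy_value_inv_tau_(1.0f / tau) {
    // Tau must be positive for the inverse to be meaningful.
    assert(tau > 0.0f);
  }

  // Returns tau exactly as the user configured it.
  float GetPolicyValueTemperature() const { return policy_value_tau_; }

  // Returns the cached 1 / tau; the name makes the inversion explicit.
  float GetPolicyValueInverseTemperature() const {
    return policy_value_inv_tau_;
  }

 private:
  const float policy_value_tau_;
  const float policy_value_inv_tau_;
};
```

Hot-path code can then multiply by `GetPolicyValueInverseTemperature()`, while anything that reports the option back to the user reads `GetPolicyValueTemperature()`.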

Comment thread src/search/dag_classic/search.cc Outdated
Comment thread src/search/classic/search.cc Outdated
Comment thread src/search/classic/params.cc Outdated
Comment on lines +733 to +735
unsigned GetIndex(const Node* parent) const {
  return edge_ - parent->GetLowNode()->GetEdges();
}

Copilot AI Mar 4, 2026

edge_ - parent->GetLowNode()->GetEdges() is a ptrdiff_t; returning it as unsigned can hide negative values and may trigger sign-conversion warnings. Consider returning size_t/uint16_t and static_casting from the pointer difference after an assertion that the pointers are in the same array.
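
A minimal, self-contained sketch of the suggested pattern follows. `EdgeSketch` and `EdgeIndexSketch` are hypothetical stand-ins for the real `Edge` type and `EdgeAndNode::GetIndex()`, and the explicit `count` parameter is an assumption, since the real code would obtain the edge count from the low node.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the real Edge type.
struct EdgeSketch {
  int move;
};

// Returns the index of `edge` within the array of `count` edges that starts
// at `first`. The assertion is a best-effort debug check that the pointer
// lies inside that array before the signed difference is converted.
inline std::size_t EdgeIndexSketch(const EdgeSketch* edge,
                                   const EdgeSketch* first,
                                   std::size_t count) {
  assert(edge >= first && edge < first + count);
  return static_cast<std::size_t>(edge - first);
}
```

The explicit `static_cast` documents the intended conversion and avoids sign-conversion warnings, while the assertion catches an edge passed with the wrong parent in debug builds.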

Comment thread src/search/dag_classic/search.cc Outdated
Comment thread src/search/dag_classic/search.cc Outdated
@Menkib64
Contributor Author

Menkib64 commented Mar 4, 2026

The proposed defaults are based on a little local exploration. They reach about -10 Elo against SF in ultra-bullet. The best configurations without decay reach -5 Elo. My search has already found a few configurations reaching up to +18 Elo when decay is enabled. I think the default tune needs updating. That update should be a separate pull request once there is more data about good configurations. My tuning model hasn't yet learned the parameter interactions, which are fairly complicated.

@zz4032
Contributor

zz4032 commented Mar 22, 2026

Gauntlet: SF vs. ba4dbdf (outdated by now) and its master base 702d4b8, both with parameters tuned:

RANK  NAME                      :  ELO  ERROR        LOS(%)  PAIRS
==================================================================
1     lc0_master_b38ec9e        :  7.5  (-4.5/+5.0)    89.0   2500
2     lc0_PR2394_ba4dbdf        :  3.3  (-5.1/+4.7)    90.3   2500
3     stockfish_master_702d4b8  :  0.0  (-0.0/+0.0)     0.0   5000

Network: 791556, MinibatchSize=384 (2x GPU), Backend=roundrobin
lc0_master_b38ec9e:
CPuct=2.027, FpuValue=0.413, PolicyTemperature=1.295
NPM (median): 103505
lc0_PR2394_ba4dbdf:
CPuct=1.523, FpuValue=0.416, PolicyTemperature=1.397, PolicyValueTemperature=1.114, PolicyDecayValueShare=18.4, PolicyDecayVisits=58300
NPM (median): 93297 (-9.9% to master base)

@Menkib64
Contributor Author

> Network: 791556, MinibatchSize=384 (2x GPU), Backend=roundrobin
> lc0_master_b38ec9e:
> CPuct=2.027, FpuValue=0.413, PolicyTemperature=1.295
> NPM (median): 103505
> lc0_PR2394_ba4dbdf:
> CPuct=1.523, FpuValue=0.416, PolicyTemperature=1.397, PolicyValueTemperature=1.114, PolicyDecayValueShare=18.4, PolicyDecayVisits=58300
> NPM (median): 93297 (-9.9% to master base)

Your tune wants to reduce search width more than my tune for the older version did. I see a similar performance drop when my tune tests similar configurations. I think depth-first prefetching might be an important feature for improving search without losing performance.

I have made a few new iterations since the initial version. The current version seems to produce more consistent policy shapes across different types of positions. I'm still working on finding the best configuration.

@Menkib64 Menkib64 marked this pull request as draft April 11, 2026 13:24
