Policy decay to child value #2394
This is an attempt to decay the prior policy toward a value-defined policy. The default values are rough guesses that might not be completely stupid. They would have to be tuned to work well.
Pull request overview
This PR introduces a “policy decay” mechanism that blends the prior policy with a value-derived policy as node visits increase, configurable via new search parameters. It applies to both classic and dag_classic search paths and also surfaces the decayed policy in verbose stats.
Changes:
- Add PolicyDecay(...) and apply it during child selection (classic + dag_classic).
- Add new UCI/CLI options for decay configuration (temperature, visit horizon, value share).
- Enhance dag_classic verbose stats to display decayed policy (PD) and use it in U/S reporting.
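To make the blending idea concrete, here is a minimal sketch of one way such a decay could work. It is not the PR's PolicyDecay(): the function name BlendPolicy, the linear ramp over visits, and the softmax over child values are all assumptions chosen for illustration, with parameters loosely mirroring the three new options (temperature, visit horizon, value share).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch, not the PR's PolicyDecay(). Blends a prior policy with
// a softmax over child values. The linear ramp and all names are assumptions.
std::vector<float> BlendPolicy(const std::vector<float>& prior,
                               const std::vector<float>& child_q,
                               int visits, int decay_visits,
                               float value_share, float tau) {
  assert(prior.size() == child_q.size() && !prior.empty());
  assert(visits >= 0 && decay_visits > 0 && tau > 0.0f);
  // Softmax of child values at temperature tau (max subtracted for stability).
  const float max_q = *std::max_element(child_q.begin(), child_q.end());
  std::vector<float> value_policy(child_q.size());
  float sum = 0.0f;
  for (std::size_t i = 0; i < child_q.size(); ++i) {
    value_policy[i] = std::exp((child_q[i] - max_q) / tau);
    sum += value_policy[i];
  }
  // Weight of the value policy ramps linearly from 0 up to value_share.
  const float ramp = std::min(
      1.0f, static_cast<float>(visits) / static_cast<float>(decay_visits));
  const float w = value_share * ramp;
  std::vector<float> blended(prior.size());
  for (std::size_t i = 0; i < prior.size(); ++i) {
    blended[i] = (1.0f - w) * prior[i] + w * value_policy[i] / sum;
  }
  return blended;
}
```

With zero visits the result equals the prior. At or beyond the visit horizon, a fraction value_share of the probability mass follows the value-derived policy, so a low-prior move with a good value gains weight.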
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| src/search/dag_classic/search.cc | Adds PolicyDecay and uses decayed policy in verbose stats and U/S calculations. |
| src/search/dag_classic/node.h | Adds EdgeAndNode::GetIndex() helper for mapping edges to array indices. |
| src/search/classic/search.cc | Adds PolicyDecay and applies it to current_pol during node picking. |
| src/search/classic/params.h | Exposes new decay/temperature getters on search params. |
| src/search/classic/params.cc | Registers new options and stores decay parameters. |
    kPolicyValueTemperature(1.0f / options.Get<float>(kPolicyValueTemperatureId)),
    kPolicyDecayVisits(options.Get<int>(kPolicyDecayVisitsId)),
    kPolicyDecayValueShare(options.Get<float>(kPolicyDecayValueShareId) / 100.0f),
kPolicyValueTemperature is initialized as 1.0f / PolicyValueTemperature, but the getter is named GetPolicyValueTemperature() and the option help refers to "Tau". This inversion is easy to misread later; consider renaming the cached field/getter to indicate it's an inverse temperature, or store the raw tau and perform the inversion at the call site.
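One way to apply the renaming suggestion is sketched below. The class DecayParams, the getter GetPolicyValueInverseTemperature, and the helper ScaledExp are hypothetical names invented for this example; they do not appear in the PR.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical parameter holder; names are illustrative, not from the PR.
class DecayParams {
 public:
  explicit DecayParams(float tau) : inverse_tau_(Invert(tau)) {}

  // The name states what is stored, so call sites multiply rather than divide.
  float GetPolicyValueInverseTemperature() const { return inverse_tau_; }

 private:
  // Validates tau before inverting it once at construction time.
  static float Invert(float tau) {
    assert(tau > 0.0f);
    return 1.0f / tau;
  }

  const float inverse_tau_;
};

// Example call site: scales a value by the inverse temperature before exp().
inline float ScaledExp(const DecayParams& params, float q) {
  return std::exp(q * params.GetPolicyValueInverseTemperature());
}
```

Caching the inverse keeps the division out of the hot path, while the explicit name avoids a later reader dividing by a value that is already inverted.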
    unsigned GetIndex(const Node* parent) const {
      return edge_ - parent->GetLowNode()->GetEdges();
    }
edge_ - parent->GetLowNode()->GetEdges() is a ptrdiff_t; returning it as unsigned can hide negative values and may trigger sign-conversion warnings. Consider returning size_t/uint16_t and static_casting from the pointer difference after an assertion that the pointers are in the same array.
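A minimal sketch of that suggestion follows. It uses a stand-in Edge struct and a free function EdgeIndex that takes the array bounds explicitly; both are assumptions made so the example compiles on its own and do not reflect the actual node.h interfaces.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in type for illustration; the real Edge class lives in node.h.
struct Edge {
  float p;
};

// Returns the position of `edge` within the array [first, first + count).
// Pointer subtraction is only defined within one array, so the assertion is
// a debug-time sanity check on the result, not a full guarantee.
inline std::size_t EdgeIndex(const Edge* first, std::size_t count,
                             const Edge* edge) {
  assert(first != nullptr && edge != nullptr);
  const std::ptrdiff_t diff = edge - first;
  assert(diff >= 0 && static_cast<std::size_t>(diff) < count);
  return static_cast<std::size_t>(diff);
}
```

Keeping the difference in a ptrdiff_t until it has been range-checked makes the narrowing explicit and avoids sign-conversion warnings at the return statement.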
The proposed defaults are based on a little local exploration. They reach about -10 Elo against SF in ultra-bullet. The best configurations without decay reach -5 Elo. My search has already found a few configurations reaching up to +18 Elo with decay enabled. I think the default tune needs to be updated, but that should be a separate pull request once there is more data about good configurations. My tuning model hasn't yet learned the parameter interactions, which are fairly complicated.
Using parent visits to decay policy can suppress low policy values completely after the first visit evaluates them badly. That is likely not enough exploration, and I suspect it caused the problems with 100% decay in an earlier test. Using child visits aims to suppress a move only after it has had a fair chance to prove the early evaluation wrong.
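The difference can be illustrated with a small sketch. The helper DecayRamp and its linear shape are assumptions for this example only; the PR's actual decay curve may differ.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical ramp: 0 at zero visits, 1 once `visits` reaches `horizon`.
inline float DecayRamp(int visits, int horizon) {
  assert(visits >= 0 && horizon > 0);
  return std::min(
      1.0f, static_cast<float>(visits) / static_cast<float>(horizon));
}
```

Suppose the parent has 1000 visits, a low-prior child has been visited once, and the horizon is 100. Driving the ramp from the parent gives DecayRamp(1000, 100), which is fully decayed, so one bad evaluation can already dominate that child's policy. Driving it from the child gives DecayRamp(1, 100), which leaves the prior almost untouched until the child has collected more visits.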
Gauntlet: SF vs. Network: 791556, MinibatchSize=384 (2x GPU), Backend=roundrobin
Your tune wants to reduce search width more than my tune for the older version did. I see a similar performance drop when my tune tests similar configurations. I think depth-first prefetching might be an important feature for improving search without losing performance. I have made a few new iterations since the initial version. The current version seems to produce more consistent policy shapes across different types of positions. I'm still working on finding the best configuration.