[Deepin-Kernel-SIG] [linux 6.18.y] [Upstream] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope#1742
[Upstream commit 5920d046f7ae3bf9cf51b9d915c1fff13d299d84]

On systems where many CPUs share one LLC, unbound workqueues using WQ_AFFN_CACHE collapse to a single worker pool, causing heavy spinlock contention on pool->lock. For example, Chuck Lever measured 39% of cycles lost to native_queued_spin_lock_slowpath on a 12-core shared-L3 NFS-over-RDMA system.

The existing affinity hierarchy (cpu, smt, cache, numa, system) offers no intermediate option between per-LLC and per-SMT-core granularity.

Add WQ_AFFN_CACHE_SHARD, which subdivides each LLC into groups of at most wq_cache_shard_size cores (default 8, tunable via boot parameter). Shards are always split on core (SMT group) boundaries so that Hyper-Threading siblings are never placed in different pods. Cores are distributed across shards as evenly as possible -- for example, 36 cores in a single LLC with max shard size 8 produces 5 shards of 8+7+7+7+7 cores.

The implementation follows the same comparator pattern as other affinity scopes: precompute_cache_shard_ids() pre-fills the cpu_shard_id[] array from the already-initialized WQ_AFFN_CACHE and WQ_AFFN_SMT topology, and cpus_share_cache_shard() is passed to init_pod_type().

Benchmarks on NVIDIA Grace (72 CPUs, single LLC, 50k items/thread) show that cache_shard delivers ~5x the throughput and ~6.5x lower p50 latency compared to the cache scope.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
[WangYuli: Fix conflicts]
Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
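The 36-core example can be checked with a small userspace sketch of the layout arithmetic. This is not the kernel code: the struct, its field names other than nr_shards, and the helper names are hypothetical, and the split assumes the round-to-nearest shard count plus remainder distribution that the commit message and the reviewed diff excerpt describe.

```c
#include <assert.h>

/* Hypothetical userspace mirror of the shard layout; not the kernel struct. */
struct shard_layout {
	int nr_shards;       /* total number of shards in this LLC */
	int cores_per_shard; /* base number of cores in each shard */
	int nr_large_shards; /* shards that receive one extra core */
};

/* Round-to-nearest integer division for positive operands. */
static int div_round_closest(int n, int d)
{
	return (n + d / 2) / d;
}

/* Pick the shard count closest to the target size, then spread the rest. */
static struct shard_layout calc_shard_layout(int nr_cores, int shard_size)
{
	struct shard_layout layout;

	assert(nr_cores > 0 && shard_size > 0);

	layout.nr_shards = div_round_closest(nr_cores, shard_size);
	if (layout.nr_shards < 1)
		layout.nr_shards = 1; /* always keep at least one shard */
	layout.cores_per_shard = nr_cores / layout.nr_shards;
	layout.nr_large_shards = nr_cores % layout.nr_shards;
	return layout;
}
```

With 36 cores and a target size of 8, this yields 5 shards with a base size of 7 and one large shard, which matches the 8+7+7+7+7 split quoted above.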
Reviewer's Guide
Adds a new WQ_AFFN_CACHE_SHARD affinity scope for unbound workqueues by introducing CPU-to-cache-shard mapping and topology initialization helpers so that large shared-LLC systems can distribute workqueue pools across synthetic cache shards instead of a single LLC pod.
Flow diagram for cache shard initialization and lookup (WQ_AFFN_CACHE_SHARD)
flowchart LR
A[workqueue_init_topology] --> B[init_pod_type WQ_AFFN_CACHE cpus_share_cache]
A --> C[init_pod_type WQ_AFFN_SMT cpus_share_smt]
A --> D[precompute_cache_shard_ids]
subgraph precompute_cache_shard_ids
D --> E[validate wq_cache_shard_size]
E --> F[for each LLC pod]
F --> G[llc_count_cores]
G --> H[llc_populate_cpu_shard_id]
H --> I[llc_calc_shard_layout]
I --> J[assign cpu_shard_id for each CPU]
end
A --> K[init_pod_type WQ_AFFN_CACHE_SHARD cpus_share_cache_shard]
subgraph cpus_share_cache_shard
L[cpu0,cpu1] --> M[cpus_share_cache]
M -->|false| N[return false]
M -->|true| O[compare cpu_shard_id]
O -->|equal| P[return true]
O -->|not equal| N
end
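The lookup half of the diagram can be modelled in userspace as follows. This is a minimal sketch, not the kernel implementation: the arrays, the toy topology, and the function names are hypothetical, and shard IDs are assumed to restart at 0 in each LLC so that the cache check visibly matters.

```c
#include <stdbool.h>

#define NR_CPUS_MODEL 8 /* hypothetical toy system size */

/* Toy topology: CPUs 0-3 sit in LLC 0, CPUs 4-7 in LLC 1. */
static const int llc_id_model[NR_CPUS_MODEL] = { 0, 0, 0, 0, 1, 1, 1, 1 };

/* Assumed per-LLC shard numbering: each LLC is split into shards 0 and 1. */
static const int cpu_shard_id_model[NR_CPUS_MODEL] = { 0, 0, 1, 1, 0, 0, 1, 1 };

/* Stand-in for the cache comparator: same LLC or not. */
static bool share_cache_model(int cpu0, int cpu1)
{
	return llc_id_model[cpu0] == llc_id_model[cpu1];
}

/* Two CPUs share a shard only if they share an LLC and a shard ID. */
static bool share_cache_shard_model(int cpu0, int cpu1)
{
	if (!share_cache_model(cpu0, cpu1))
		return false;
	return cpu_shard_id_model[cpu0] == cpu_shard_id_model[cpu1];
}
```

In this toy layout CPUs 0 and 4 carry the same shard ID but sit in different LLCs, so the comparator reports them as not sharing a shard.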
Hey - I've left some high level feedback:
- In llc_populate_cpu_shard_id(), it may be safer to assert that shard_id never exceeds layout.nr_shards (e.g., WARN_ON_ONCE(shard_id >= layout.nr_shards) inside the loop) so that any future topology or layout changes that would overrun the last shard are caught early rather than only checking the final shard_id value.
- For the cache_shard_size handling in precompute_cache_shard_ids(), consider clamping the parameter (e.g., via a param_ops or helper) instead of mutating the module parameter at runtime, so that the effective valid range (>= 1) is enforced consistently before topology initialization uses it.
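The first suggestion can be illustrated with a userspace sketch that checks the shard index on every iteration rather than once at the end. The function name, its signature, and the choice to fill large shards first are assumptions made for illustration. In kernel code the check would be a WARN_ON_ONCE() rather than assert().

```c
#include <assert.h>

/*
 * Hypothetical model: assign a shard ID to each of nr_cores cores.
 * The first (nr_cores % nr_shards) shards take one extra core.
 * Returns the number of shards that were filled.
 */
static int assign_core_shards(int nr_cores, int nr_shards, int *core_shard)
{
	int base = nr_cores / nr_shards;
	int nr_large = nr_cores % nr_shards;
	int shard_id = 0;
	int filled = 0;

	for (int core = 0; core < nr_cores; core++) {
		int cap = base + (shard_id < nr_large ? 1 : 0);

		/* Per-iteration check: flags an overrun at the offending core. */
		assert(shard_id < nr_shards);

		core_shard[core] = shard_id;
		if (++filled == cap) {
			shard_id++;
			filled = 0;
		}
	}
	return shard_id;
}
```

Checking inside the loop means a layout mismatch would be reported before any core is tagged with an out-of-range shard ID, instead of after the whole LLC has been walked.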
Pull request overview
Adds a new cache_shard affinity scope for unbound workqueues to split large shared-LLC CPU groups into smaller synthetic shards and reduce worker-pool contention.
Changes:
- Adds WQ_AFFN_CACHE_SHARD to the workqueue affinity-scope enum and name table.
- Adds a workqueue.cache_shard_size parameter and shard-layout helpers.
- Initializes cache-shard topology after SMT/cache pod discovery.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| include/linux/workqueue.h | Adds the public WQ_AFFN_CACHE_SHARD affinity scope. |
| kernel/workqueue.c | Implements cache-shard topology computation, naming, parameter handling, and initialization. |
	[WQ_AFFN_CACHE]		= "cache",
	[WQ_AFFN_CACHE_SHARD]	= "cache_shard",

 * Chooses the number of shards that keeps average shard size closest to
 * wq_cache_shard_size. Returns a struct describing the total number of shards,
 * the base size of each, and how many are large shards.
 */
static struct llc_shard_layout __init llc_calc_shard_layout(int nr_cores)
{
	struct llc_shard_layout layout;

	/* Ensure at least one shard; pick the count closest to the target size */
	layout.nr_shards = max(1, DIV_ROUND_CLOSEST(nr_cores, wq_cache_shard_size));

static unsigned int wq_cache_shard_size = 8;
module_param_named(cache_shard_size, wq_cache_shard_size, uint, 0444);

	WQ_AFFN_CPU,			/* one pod per CPU */
	WQ_AFFN_SMT,			/* one pod poer SMT */
	WQ_AFFN_CACHE,			/* one pod per LLC */
	WQ_AFFN_CACHE_SHARD,		/* synthetic sub-LLC shards */
Note: CI often fails on the "Upload Kernel Artifact" step. Please check the network on the deepin25-Xeon-8352S and kp920 runners.
workqueue: validate cpumask_first() result in llc_populate_cpu_shard_id() |
[Upstream commit 76af54648899abbd6b449c035583e47fd407078a]
On uniprocessor (UP) configs such as nios2, NR_CPUS is 1, so
cpu_shard_id[] is a single-element array (int[1]). In
llc_populate_cpu_shard_id(), cpumask_first(sibling_cpus) returns an
unsigned int that the compiler cannot prove is always 0, triggering
a -Warray-bounds warning when the result is used to index
cpu_shard_id[]:
  kernel/workqueue.c:8321:55: warning: array subscript 1 is above
  array bounds of 'int[1]' [-Warray-bounds]
   8321 |   cpu_shard_id[c] = cpu_shard_id[cpumask_first(sibling_cpus)];
        |                     ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This is a false positive: sibling_cpus can never be empty here because
'c' itself is always set in it, so cpumask_first() will always return a
valid CPU. However, the compiler cannot prove this statically, and the
warning only manifests on UP configs where the array size is 1.
Add a bounds check with WARN_ON_ONCE to silence the warning, and store
the result in a local variable to make the code clearer and avoid calling
cpumask_first() twice.
Fixes: 5920d046f7ae ("workqueue: add WQ_AFFN_CACHE_SHARD affinity scope")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202604022343.GQtkF2vO-lkp@intel.com/
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Wentao Guan <guanwentao@uniontech.com>
Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
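The shape of the fix described in this commit message (one lookup stored in a local variable, then a range check before indexing) can be sketched in userspace as below. Everything here is a hypothetical model: mask_first() only stands in for cpumask_first(), the array and function names are invented, and a plain error return takes the place of WARN_ON_ONCE(). The actual upstream change may differ in detail.

```c
#define NR_CPUS_MODEL 4 /* hypothetical; a UP build would have 1 */

static int cpu_shard_id_model[NR_CPUS_MODEL];

/* Stand-in for cpumask_first(): lowest set bit, or nbits if none is set. */
static unsigned int mask_first(unsigned long mask, unsigned int nbits)
{
	for (unsigned int i = 0; i < nbits; i++)
		if (mask & (1UL << i))
			return i;
	return nbits;
}

/*
 * Copy the shard ID of the first sibling to CPU c.
 * Returns 0 on success, -1 if either index would be out of range.
 */
static int inherit_shard_id(unsigned int c, unsigned long sibling_mask)
{
	/* Look up once and keep the result in a local variable. */
	unsigned int first = mask_first(sibling_mask, NR_CPUS_MODEL);

	/* Explicit bound lets the compiler see the index is in range. */
	if (first >= NR_CPUS_MODEL || c >= NR_CPUS_MODEL)
		return -1;

	cpu_shard_id_model[c] = cpu_shard_id_model[first];
	return 0;
}
```

Because the guard is written against the array size, the same code stays in bounds when NR_CPUS_MODEL is 1, which is the configuration that triggered the warning.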
done
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: opsiff
Summary by Sourcery
Add a new cache-shard affinity scope for unbound workqueues to improve scalability on large shared-LLC systems by subdividing last-level cache pods into tunable core shards.