[Deepin-Kernel-SIG] [linux 6.18.y] [Upstream] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope by Avenger-285714 · Pull Request #1742 · deepin-community/kernel

Avenger-285714 · 2026-05-18T08:58:40Z

[Upstream commit 5920d046f7ae3bf9cf51b9d915c1fff13d299d84]

On systems where many CPUs share one LLC, unbound workqueues using WQ_AFFN_CACHE collapse to a single worker pool, causing heavy spinlock contention on pool->lock. For example, Chuck Lever measured 39% of cycles lost to native_queued_spin_lock_slowpath on a 12-core shared-L3 NFS-over-RDMA system.

The existing affinity hierarchy (cpu, smt, cache, numa, system) offers no intermediate option between per-LLC and per-SMT-core granularity.

Add WQ_AFFN_CACHE_SHARD, which subdivides each LLC into groups of at most wq_cache_shard_size cores (default 8, tunable via boot parameter). Shards are always split on core (SMT group) boundaries so that Hyper-Threading siblings are never placed in different pods. Cores are distributed across shards as evenly as possible -- for example, 36 cores in a single LLC with max shard size 8 produces 5 shards of 8+7+7+7+7 cores.
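The arithmetic behind the 8+7+7+7+7 example can be sketched in userspace C. This is a minimal sketch, not the kernel code: it assumes the shard count follows the `DIV_ROUND_CLOSEST(nr_cores, wq_cache_shard_size)` rule visible in the diff excerpt quoted by the reviewers, and the struct name, the field names other than `nr_shards`, and both helper names are hypothetical.

```c
#include <assert.h>

/* Hypothetical userspace mirror of the layout described in the commit. */
struct demo_shard_layout {
	int nr_shards;		/* total shards in this LLC */
	int cores_per_shard;	/* base number of cores per shard */
	int nr_large_shards;	/* shards that hold one extra core */
};

/* Round-to-nearest integer division for positive operands. */
static int demo_div_round_closest(int n, int d)
{
	return (n + d / 2) / d;
}

/* Split nr_cores into shards whose sizes differ by at most one core. */
static struct demo_shard_layout demo_calc_shard_layout(int nr_cores,
							int shard_size)
{
	struct demo_shard_layout l;

	assert(nr_cores > 0 && shard_size > 0);

	/* At least one shard; otherwise the count closest to the target. */
	l.nr_shards = demo_div_round_closest(nr_cores, shard_size);
	if (l.nr_shards < 1)
		l.nr_shards = 1;

	/* Spread the remainder one core at a time over the large shards. */
	l.cores_per_shard = nr_cores / l.nr_shards;
	l.nr_large_shards = nr_cores % l.nr_shards;
	return l;
}
```

For 36 cores and a target of 8 this gives 5 shards with a base size of 7 and one large shard, which matches the 8+7+7+7+7 split in the description.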

The implementation follows the same comparator pattern as other affinity scopes: precompute_cache_shard_ids() pre-fills the cpu_shard_id[] array from the already-initialized WQ_AFFN_CACHE and WQ_AFFN_SMT topology, and cpus_share_cache_shard() is passed to init_pod_type().
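A rough userspace illustration of the pre-fill step follows. It is a sketch under stated assumptions: `core_of[]` is a toy stand-in for the SMT topology, the function and macro names are hypothetical, and placing the large shards first is an assumption chosen to match the 8+7+7+7+7 ordering in the description. The property it demonstrates is the one the commit message states: a shard boundary only ever falls between cores, so SMT siblings always receive the same shard ID.

```c
#include <assert.h>

#define DEMO_MAX_CPUS 16	/* hypothetical bound for this sketch */

/*
 * Assign a shard ID to every CPU of one LLC.
 * core_of[cpu] identifies the core (SMT group) a CPU belongs to; CPUs with
 * the same value are siblings. Core IDs must lie in [0, DEMO_MAX_CPUS).
 */
static void demo_populate_shard_ids(const int *core_of, int nr_cpus,
				    int cores_per_shard, int nr_large_shards,
				    int *shard_id)
{
	int core_shard[DEMO_MAX_CPUS];	/* shard per core, -1 = not seen yet */
	int cur_shard = 0, cores_in_cur = 0;

	assert(nr_cpus > 0 && nr_cpus <= DEMO_MAX_CPUS && cores_per_shard > 0);

	for (int i = 0; i < DEMO_MAX_CPUS; i++)
		core_shard[i] = -1;

	for (int cpu = 0; cpu < nr_cpus; cpu++) {
		int core = core_of[cpu];

		assert(core >= 0 && core < DEMO_MAX_CPUS);

		if (core_shard[core] < 0) {
			/* Assumption: the first shards take the extra core. */
			int cap = cores_per_shard +
				  (cur_shard < nr_large_shards ? 1 : 0);

			/* Open a new shard only when a new core arrives. */
			if (cores_in_cur == cap) {
				cur_shard++;
				cores_in_cur = 0;
			}
			core_shard[core] = cur_shard;
			cores_in_cur++;
		}
		/* Siblings inherit the shard already chosen for their core. */
		shard_id[cpu] = core_shard[core];
	}
}
```

With five two-thread cores laid out as `{0,1,2,3,4,0,1,2,3,4}` and a 3+2 layout, CPUs 0-2 and 5-7 land in shard 0 and CPUs 3-4 and 8-9 in shard 1.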

Benchmarks on NVIDIA Grace (72 CPUs, single LLC, 50k items/thread) show that cache_shard delivers ~5x the throughput and ~6.5x lower p50 latency compared to the cache scope.

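Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
[WangYuli: Fix conflicts]
Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>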

Summary by Sourcery

Add a new cache-shard affinity scope for unbound workqueues to improve scalability on large shared-LLC systems by subdividing last-level cache pods into tunable core shards.

New Features:

Introduce WQ_AFFN_CACHE_SHARD affinity scope for workqueues to group CPUs into sub-LLC shards instead of full-LLC pods.
Add a tunable cache_shard_size module parameter to control the target number of cores per shard.

Enhancements:

Precompute per-CPU shard IDs within each LLC based on SMT topology to balance cores evenly across shards and keep SMT siblings together.
Extend workqueue topology initialization to build pod types using the new cache-shard affinity comparator.

sourcery-ai · 2026-05-18T08:58:46Z

Reviewer's Guide

Adds a new WQ_AFFN_CACHE_SHARD affinity scope for unbound workqueues by introducing CPU-to-cache-shard mapping and topology initialization helpers so that large shared-LLC systems can distribute workqueue pools across synthetic cache shards instead of a single LLC pod.

Flow diagram for cache shard initialization and lookup (WQ_AFFN_CACHE_SHARD)

flowchart LR
    A[workqueue_init_topology] --> B[init_pod_type WQ_AFFN_CACHE cpus_share_cache]
    A --> C[init_pod_type WQ_AFFN_SMT cpus_share_smt]
    A --> D[precompute_cache_shard_ids]

    subgraph precompute_cache_shard_ids
        D --> E[validate wq_cache_shard_size]
        E --> F[for each LLC pod]
        F --> G[llc_count_cores]
        G --> H[llc_populate_cpu_shard_id]
        H --> I[llc_calc_shard_layout]
        I --> J[assign cpu_shard_id for each CPU]
    end

    A --> K[init_pod_type WQ_AFFN_CACHE_SHARD cpus_share_cache_shard]

    subgraph cpus_share_cache_shard
        L[cpu0,cpu1] --> M[cpus_share_cache]
        M -->|false| N[return false]
        M -->|true| O[compare cpu_shard_id]
        O -->|equal| P[return true]
        O -->|not equal| N
    end
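The lookup half of the flowchart reduces to a two-step comparison. The sketch below is a userspace approximation with made-up topology tables; only the shape of the check (same LLC first, then equal shard ID) is taken from the diagram, and every `demo_` name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_NR_CPUS 8

/* Toy topology: two LLCs of four CPUs, each split into two shards. */
static const int demo_llc_id[DEMO_NR_CPUS]   = { 0, 0, 0, 0, 1, 1, 1, 1 };
static const int demo_shard_id[DEMO_NR_CPUS] = { 0, 0, 1, 1, 0, 0, 1, 1 };

/* Stand-in for the existing cache comparator. */
static bool demo_share_cache(int a, int b)
{
	assert(a >= 0 && a < DEMO_NR_CPUS && b >= 0 && b < DEMO_NR_CPUS);
	return demo_llc_id[a] == demo_llc_id[b];
}

/* Two CPUs share a shard only if they share an LLC and a shard ID. */
static bool demo_share_cache_shard(int a, int b)
{
	if (!demo_share_cache(a, b))
		return false;
	return demo_shard_id[a] == demo_shard_id[b];
}
```

The LLC check comes first because shard IDs in this sketch are only unique within one LLC: CPUs 0 and 4 both carry shard ID 0 but must not be grouped together.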

File-Level Changes

Change	Details	Files
Introduce cache-shard topology data structures, helpers, and initialization for mapping CPUs to sub-LLC shards.	Add llc_shard_layout struct and cpu_shard_id[] array to describe and store per-CPU shard membership within an LLC pod. Implement helpers to count distinct cores per LLC using SMT topology, compute an even shard layout given a target shard size, and determine when a shard is full while respecting SMT core boundaries. Populate cpu_shard_id[] per LLC pod based on computed shard layout, assigning all SMT siblings to the same shard and validating shard indices via WARN_ON_ONCE. Add precompute_cache_shard_ids() to iterate LLC pods, validate cache_shard_size > 0, compute core counts, and fill shard IDs before cache-shard affinity is used. Provide cpus_share_cache_shard() comparator that relies on existing cpus_share_cache() plus matching cpu_shard_id values to define shard affinity.	`kernel/workqueue.c`
Expose and wire up the new WQ_AFFN_CACHE_SHARD affinity scope and configuration knob into the workqueue API and topology initialization.	Extend the wq_affn_names array and wq_affn_scope enum with the WQ_AFFN_CACHE_SHARD scope. Add a module parameter cache_shard_size (default 8) to control the target cores per shard at boot time. Update workqueue_init_topology() to precompute cache shard IDs after cache and SMT pods are initialized and to register the new WQ_AFFN_CACHE_SHARD pod type using cpus_share_cache_shard().	`kernel/workqueue.c` `include/linux/workqueue.h`

sourcery-ai

Hey - I've left some high level feedback:

In llc_populate_cpu_shard_id(), it may be safer to assert that shard_id always stays below layout.nr_shards (e.g., WARN_ON_ONCE(shard_id >= layout.nr_shards) inside the loop), so that any future topology or layout change that would overrun the last shard is caught early instead of only when the final shard_id value is checked.
For the cache_shard_size handling in precompute_cache_shard_ids(), consider clamping the parameter (e.g., via a param_ops or helper) instead of mutating the module parameter at runtime, so that the effective valid range (>= 1) is enforced consistently before topology initialization uses it.

Copilot

Pull request overview

Adds a new cache_shard affinity scope for unbound workqueues to split large shared-LLC CPU groups into smaller synthetic shards and reduce worker-pool contention.

Changes:

Adds WQ_AFFN_CACHE_SHARD to the workqueue affinity-scope enum and name table.
Adds a workqueue.cache_shard_size parameter and shard-layout helpers.
Initializes cache-shard topology after SMT/cache pod discovery.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
`include/linux/workqueue.h`	Adds the public `WQ_AFFN_CACHE_SHARD` affinity scope.
`kernel/workqueue.c`	Implements cache-shard topology computation, naming, parameter handling, and initialization.

 	[WQ_AFFN_CACHE]		= "cache",
+	[WQ_AFFN_CACHE_SHARD]	= "cache_shard",


+ * Chooses the number of shards that keeps average shard size closest to
+ * wq_cache_shard_size. Returns a struct describing the total number of shards,
+ * the base size of each, and how many are large shards.
+ */
+static struct llc_shard_layout __init llc_calc_shard_layout(int nr_cores)
+{
+	struct llc_shard_layout layout;
+
+	/* Ensure at least one shard; pick the count closest to the target size */
+	layout.nr_shards = max(1, DIV_ROUND_CLOSEST(nr_cores, wq_cache_shard_size));



+static unsigned int wq_cache_shard_size = 8;
+module_param_named(cache_shard_size, wq_cache_shard_size, uint, 0444);


 	WQ_AFFN_CPU,			/* one pod per CPU */
 	WQ_AFFN_SMT,			/* one pod poer SMT */
 	WQ_AFFN_CACHE,			/* one pod per LLC */
+	WQ_AFFN_CACHE_SHARD,		/* synthetic sub-LLC shards */


Avenger-285714 · 2026-05-18T09:29:38Z

Note: CI often fails on the "Upload Kernel Artifact" step. Please check the network on the deepin25-Xeon-8352S and kp920 runners.

opsiff · 2026-05-18T10:26:09Z

workqueue: validate cpumask_first() result in llc_populate_cpu_shard_id()

On uniprocessor (UP) configs such as nios2, NR_CPUS is 1, so
cpu_shard_id[] is a single-element array (int[1]). In
llc_populate_cpu_shard_id(), cpumask_first(sibling_cpus) returns an
unsigned int that the compiler cannot prove is always 0, triggering
a -Warray-bounds warning when the result is used to index
cpu_shard_id[]:

  kernel/workqueue.c:8321:55: warning: array subscript 1 is above
  array bounds of 'int[1]' [-Warray-bounds]
   8321 |  cpu_shard_id[c] = cpu_shard_id[cpumask_first(sibling_cpus)];
        |                    ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is a false positive: sibling_cpus can never be empty here because
'c' itself is always set in it, so cpumask_first() will always return a
valid CPU. However, the compiler cannot prove this statically, and the
warning only manifests on UP configs where the array size is 1.

Add a bounds check with WARN_ON_ONCE to silence the warning, and store
the result in a local variable to make the code clearer and avoid calling
cpumask_first() twice.

Fixes: 5920d046f7ae ("workqueue: add WQ_AFFN_CACHE_SHARD affinity scope")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202604022343.GQtkF2vO-lkp@intel.com/
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
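The fix itself is not shown in this thread, so the following is only a userspace sketch of the pattern the commit message describes: read the first sibling once into a local variable and check it against the array size before indexing. All names are hypothetical, and a plain error return stands in for the kernel's WARN_ON_ONCE.

```c
#include <assert.h>
#include <stdio.h>

#define DEMO_NR_CPUS 1	/* mimic a uniprocessor build */

static_assert(DEMO_NR_CPUS >= 1, "need at least one CPU slot");

static int demo_cpu_shard_id[DEMO_NR_CPUS];

/*
 * Copy the shard ID of the first sibling to cpu.
 * Returns 0 on success, -1 if either index is out of range.
 */
static int demo_inherit_shard(unsigned int cpu, unsigned int first_sibling)
{
	/* Explicit bounds check: the compiler can now see both indices are valid. */
	if (cpu >= DEMO_NR_CPUS || first_sibling >= DEMO_NR_CPUS) {
		fprintf(stderr, "shard lookup out of range: %u <- %u\n",
			cpu, first_sibling);
		return -1;
	}

	demo_cpu_shard_id[cpu] = demo_cpu_shard_id[first_sibling];
	return 0;
}
```

With a one-element array, only index 0 passes the check; any other value is rejected before the array is touched.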

…id()

[Upstream commit 76af54648899abbd6b449c035583e47fd407078a]

On uniprocessor (UP) configs such as nios2, NR_CPUS is 1, so
cpu_shard_id[] is a single-element array (int[1]). In
llc_populate_cpu_shard_id(), cpumask_first(sibling_cpus) returns an
unsigned int that the compiler cannot prove is always 0, triggering
a -Warray-bounds warning when the result is used to index
cpu_shard_id[]:

  kernel/workqueue.c:8321:55: warning: array subscript 1 is above
  array bounds of 'int[1]' [-Warray-bounds]
   8321 |  cpu_shard_id[c] = cpu_shard_id[cpumask_first(sibling_cpus)];
        |                    ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is a false positive: sibling_cpus can never be empty here because
'c' itself is always set in it, so cpumask_first() will always return a
valid CPU. However, the compiler cannot prove this statically, and the
warning only manifests on UP configs where the array size is 1.

Add a bounds check with WARN_ON_ONCE to silence the warning, and store
the result in a local variable to make the code clearer and avoid calling
cpumask_first() twice.

Fixes: 5920d046f7ae ("workqueue: add WQ_AFFN_CACHE_SHARD affinity scope")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202604022343.GQtkF2vO-lkp@intel.com/
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Wentao Guan <guanwentao@uniontech.com>
Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>

Avenger-285714 · 2026-05-18T11:57:52Z

done

deepin-ci-robot · 2026-05-18T17:13:20Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: opsiff

Needs approval from an approver in each of these files:

~~deepin/OWNERS~~ [opsiff]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Avenger-285714 requested review from Copilot and opsiff May 18, 2026 08:58

deepin-ci-robot requested review from huangbibo and myml May 18, 2026 08:59

Copilot started reviewing on behalf of Avenger-285714 May 18, 2026 09:00

sourcery-ai Bot reviewed May 18, 2026

Copilot AI reviewed May 18, 2026

opsiff approved these changes May 18, 2026

opsiff merged commit 4540a4f into deepin-community:linux-6.18.y May 18, 2026
7 of 12 checks passed

deepin-ci-robot added the approved label May 18, 2026

opsiff merged 2 commits into deepin-community:linux-6.18.y from Avenger-285714:WQ_AFFN_CACHE_SHARD-6.18
