[Deepin-Kernel-SIG] [linux 6.18.y] [Upstream] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope #1742

Merged
opsiff merged 2 commits into deepin-community:linux-6.18.y from Avenger-285714:WQ_AFFN_CACHE_SHARD-6.18 on May 18, 2026

Avenger-285714 (Member) commented May 18, 2026

[Upstream commit 5920d046f7ae3bf9cf51b9d915c1fff13d299d84]

On systems where many CPUs share one LLC, unbound workqueues using WQ_AFFN_CACHE collapse to a single worker pool, causing heavy spinlock contention on pool->lock. For example, Chuck Lever measured 39% of cycles lost to native_queued_spin_lock_slowpath on a 12-core shared-L3 NFS-over-RDMA system.

The existing affinity hierarchy (cpu, smt, cache, numa, system) offers no intermediate option between per-LLC and per-SMT-core granularity.

Add WQ_AFFN_CACHE_SHARD, which subdivides each LLC into groups of at most wq_cache_shard_size cores (default 8, tunable via boot parameter). Shards are always split on core (SMT group) boundaries so that Hyper-Threading siblings are never placed in different pods. Cores are distributed across shards as evenly as possible -- for example, 36 cores in a single LLC with max shard size 8 produces 5 shards of 8+7+7+7+7 cores.
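The 36-core example can be reproduced with a small userspace sketch. It mirrors the round-to-nearest shard count visible in the reviewed `llc_calc_shard_layout()` excerpt; the struct field names other than `nr_shards`, the helper names, and the choice to place the larger shards first are assumptions made for illustration, not the kernel's actual code.

```c
#include <assert.h>

/* Hypothetical userspace mirror of the per-LLC shard layout. */
struct shard_layout {
	int nr_shards;        /* total shards in this LLC */
	int cores_per_shard;  /* base size of each shard */
	int nr_large_shards;  /* shards that hold one extra core */
};

/* Round-to-nearest integer division for positive operands. */
static int div_round_closest(int n, int d)
{
	return (n + d / 2) / d;
}

static struct shard_layout calc_shard_layout(int nr_cores, int shard_size)
{
	struct shard_layout layout;

	assert(nr_cores > 0 && shard_size > 0);

	/* At least one shard; otherwise the count closest to the target size. */
	layout.nr_shards = div_round_closest(nr_cores, shard_size);
	if (layout.nr_shards < 1)
		layout.nr_shards = 1;

	/* Spread cores evenly: the remainder becomes "large" shards. */
	layout.cores_per_shard = nr_cores / layout.nr_shards;
	layout.nr_large_shards = nr_cores % layout.nr_shards;
	return layout;
}

/* Size of shard idx, assuming the large shards come first. */
static int shard_size_at(const struct shard_layout *layout, int idx)
{
	return layout->cores_per_shard + (idx < layout->nr_large_shards ? 1 : 0);
}
```

With `calc_shard_layout(36, 8)` this gives 5 shards, a base size of 7 and one large shard, i.e. 8+7+7+7+7. Because the count is rounded to nearest, this sketch yields 9+9+9+8 for 35 cores, slightly above the target of 8.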

The implementation follows the same comparator pattern as other affinity scopes: precompute_cache_shard_ids() pre-fills the cpu_shard_id[] array from the already-initialized WQ_AFFN_CACHE and WQ_AFFN_SMT topology, and cpus_share_cache_shard() is passed to init_pod_type().
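The comparator pattern can be sketched in userspace as follows. The two arrays stand in for the kernel's LLC topology and the precomputed `cpu_shard_id[]`; all `demo_` names and the 8-CPU layout are hypothetical and exist only to show the two-step check (same LLC first, then same shard ID).

```c
#include <stdbool.h>

#define NR_CPUS_DEMO 8   /* hypothetical CPU count for the sketch */

/* Stand-ins for kernel topology: LLC id and precomputed shard id per CPU. */
static const int demo_llc_id[NR_CPUS_DEMO]   = { 0, 0, 0, 0, 0, 0, 1, 1 };
static const int demo_shard_id[NR_CPUS_DEMO] = { 0, 0, 0, 1, 1, 1, 0, 0 };

/* Stand-in for cpus_share_cache(): true when both CPUs sit under one LLC. */
static bool demo_cpus_share_cache(int cpu0, int cpu1)
{
	return demo_llc_id[cpu0] == demo_llc_id[cpu1];
}

/*
 * Two CPUs share a cache shard only if they share the LLC and carry the
 * same shard id. Shard ids are per-LLC, so the LLC check must come first.
 */
static bool demo_cpus_share_cache_shard(int cpu0, int cpu1)
{
	if (!demo_cpus_share_cache(cpu0, cpu1))
		return false;
	return demo_shard_id[cpu0] == demo_shard_id[cpu1];
}
```

In this layout CPUs 0 and 6 both have shard ID 0 but belong to different LLCs, so the comparator reports them as not sharing a shard.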

Benchmarks on NVIDIA Grace (72 CPUs, single LLC, 50k items/thread) show that cache_shard delivers ~5x the throughput and ~6.5x lower p50 latency than the cache scope.

Suggested-by: Tejun Heo <tj@kernel.org>

[WangYuli: Fix conflicts]

Summary by Sourcery

Add a new cache-shard affinity scope for unbound workqueues to improve scalability on large shared-LLC systems by subdividing last-level cache pods into tunable core shards.

New Features:

  • Introduce WQ_AFFN_CACHE_SHARD affinity scope for workqueues to group CPUs into sub-LLC shards instead of full-LLC pods.
  • Add a tunable cache_shard_size module parameter to control the target number of cores per shard.

Enhancements:

  • Precompute per-CPU shard IDs within each LLC based on SMT topology to balance cores evenly across shards and keep SMT siblings together.
  • Extend workqueue topology initialization to build pod types using the new cache-shard affinity comparator.

[Upstream commit 5920d046f7ae3bf9cf51b9d915c1fff13d299d84]

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
[WangYuli: Fix conflicts]
Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>
Avenger-285714 requested review from Copilot and opsiff on May 18, 2026 08:58

sourcery-ai Bot commented May 18, 2026

Reviewer's Guide

Adds a new WQ_AFFN_CACHE_SHARD affinity scope for unbound workqueues by introducing CPU-to-cache-shard mapping and topology initialization helpers so that large shared-LLC systems can distribute workqueue pools across synthetic cache shards instead of a single LLC pod.

Flow diagram for cache shard initialization and lookup (WQ_AFFN_CACHE_SHARD)

flowchart LR
    A[workqueue_init_topology] --> B[init_pod_type WQ_AFFN_CACHE cpus_share_cache]
    A --> C[init_pod_type WQ_AFFN_SMT cpus_share_smt]
    A --> D[precompute_cache_shard_ids]

    subgraph precompute_cache_shard_ids
        D --> E[validate wq_cache_shard_size]
        E --> F[for each LLC pod]
        F --> G[llc_count_cores]
        G --> H[llc_populate_cpu_shard_id]
        H --> I[llc_calc_shard_layout]
        I --> J[assign cpu_shard_id for each CPU]
    end

    A --> K[init_pod_type WQ_AFFN_CACHE_SHARD cpus_share_cache_shard]

    subgraph cpus_share_cache_shard
        L[cpu0,cpu1] --> M[cpus_share_cache]
        M -->|false| N[return false]
        M -->|true| O[compare cpu_shard_id]
        O -->|equal| P[return true]
        O -->|not equal| N
    end

File-Level Changes

Change: Introduce cache-shard topology data structures, helpers, and initialization for mapping CPUs to sub-LLC shards.
Details:
  • Add llc_shard_layout struct and cpu_shard_id[] array to describe and store per-CPU shard membership within an LLC pod.
  • Implement helpers to count distinct cores per LLC using SMT topology, compute an even shard layout given a target shard size, and determine when a shard is full while respecting SMT core boundaries.
  • Populate cpu_shard_id[] per LLC pod based on computed shard layout, assigning all SMT siblings to the same shard and validating shard indices via WARN_ON_ONCE.
  • Add precompute_cache_shard_ids() to iterate LLC pods, validate cache_shard_size > 0, compute core counts, and fill shard IDs before cache-shard affinity is used.
  • Provide cpus_share_cache_shard() comparator that relies on existing cpus_share_cache() plus matching cpu_shard_id values to define shard affinity.
Files: kernel/workqueue.c

Change: Expose and wire up the new WQ_AFFN_CACHE_SHARD affinity scope and configuration knob into the workqueue API and topology initialization.
Details:
  • Extend the wq_affn_names array and wq_affn_scope enum with the WQ_AFFN_CACHE_SHARD scope.
  • Add a module parameter cache_shard_size (default 8) to control the target cores per shard at boot time.
  • Update workqueue_init_topology() to precompute cache shard IDs after cache and SMT pods are initialized and to register the new WQ_AFFN_CACHE_SHARD pod type using cpus_share_cache_shard().
Files: kernel/workqueue.c, include/linux/workqueue.h
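The shard-population step described in the file-level changes (walk the CPUs of one LLC, count cores rather than CPUs, and give every SMT sibling its core's shard) could look roughly like the userspace sketch below. The fixed 12-CPU, 2-way-SMT topology, the `core_leader[]` table (a stand-in for taking the first CPU of a sibling mask) and all function names are assumptions for illustration only.

```c
#include <assert.h>

#define DEMO_CPUS 12

/*
 * Hypothetical single-LLC topology: core_leader[cpu] is the lowest-numbered
 * CPU in the same SMT core.
 */
static const int core_leader[DEMO_CPUS] = { 0, 0, 2, 2, 4, 4, 6, 6, 8, 8, 10, 10 };
static int shard_of[DEMO_CPUS];

/* Assign shard ids core by core; returns the number of shards used. */
static int populate_shard_ids(int nr_shards, int cores_per_shard,
			      int nr_large_shards)
{
	int shard = 0, cores_in_shard = 0;

	for (int cpu = 0; cpu < DEMO_CPUS; cpu++) {
		if (core_leader[cpu] != cpu) {
			/* SMT sibling: inherit the shard of the core's first CPU. */
			shard_of[cpu] = shard_of[core_leader[cpu]];
			continue;
		}

		/* New core: advance to the next shard if the current one is full. */
		int limit = cores_per_shard + (shard < nr_large_shards ? 1 : 0);
		if (cores_in_shard == limit) {
			shard++;
			cores_in_shard = 0;
		}

		/* The layout must have room for every core. */
		assert(shard < nr_shards);
		shard_of[cpu] = shard;
		cores_in_shard++;
	}
	return shard + 1;
}
```

Because the full-shard test runs only when a new core starts, a shard boundary can never fall between two siblings of the same core.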

sourcery-ai Bot left a comment

Hey - I've left some high-level feedback:

  • In llc_populate_cpu_shard_id(), it may be safer to assert that shard_id never exceeds layout.nr_shards (e.g., WARN_ON_ONCE(shard_id >= layout.nr_shards) inside the loop) so that any future topology or layout changes that would overrun the last shard are caught early rather than only checking the final shard_id value.
  • For the cache_shard_size handling in precompute_cache_shard_ids(), consider clamping the parameter (e.g., via a param_ops or helper) instead of mutating the module parameter at runtime, so that the effective valid range (>= 1) is enforced consistently before topology initialization uses it.

Copilot AI left a comment

Pull request overview

Adds a new cache_shard affinity scope for unbound workqueues to split large shared-LLC CPU groups into smaller synthetic shards and reduce worker-pool contention.

Changes:

  • Adds WQ_AFFN_CACHE_SHARD to the workqueue affinity-scope enum and name table.
  • Adds a workqueue.cache_shard_size parameter and shard-layout helpers.
  • Initializes cache-shard topology after SMT/cache pod discovery.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File: Description
include/linux/workqueue.h: Adds the public WQ_AFFN_CACHE_SHARD affinity scope.
kernel/workqueue.c: Implements cache-shard topology computation, naming, parameter handling, and initialization.

Comment thread kernel/workqueue.c
Comment on lines 419 to +420
	[WQ_AFFN_CACHE] = "cache",
	[WQ_AFFN_CACHE_SHARD] = "cache_shard",
Comment thread kernel/workqueue.c
Comment on lines +8146 to +8155
 * Chooses the number of shards that keeps average shard size closest to
 * wq_cache_shard_size. Returns a struct describing the total number of shards,
 * the base size of each, and how many are large shards.
 */
static struct llc_shard_layout __init llc_calc_shard_layout(int nr_cores)
{
	struct llc_shard_layout layout;

	/* Ensure at least one shard; pick the count closest to the target size */
	layout.nr_shards = max(1, DIV_ROUND_CLOSEST(nr_cores, wq_cache_shard_size));
Comment thread kernel/workqueue.c
Comment on lines 442 to +444

static unsigned int wq_cache_shard_size = 8;
module_param_named(cache_shard_size, wq_cache_shard_size, uint, 0444);
Comment thread include/linux/workqueue.h
	WQ_AFFN_CPU, /* one pod per CPU */
	WQ_AFFN_SMT, /* one pod poer SMT */
	WQ_AFFN_CACHE, /* one pod per LLC */
	WQ_AFFN_CACHE_SHARD, /* synthetic sub-LLC shards */

Avenger-285714 (Member, Author) commented May 18, 2026

Note: CI often fails at the "Upload Kernel Artifact" step. Please check the network on the deepin25-Xeon-8352S and kp920 runners.

opsiff (Member) commented May 18, 2026

workqueue: validate cpumask_first() result in llc_populate_cpu_shard_id()

On uniprocessor (UP) configs such as nios2, NR_CPUS is 1, so
cpu_shard_id[] is a single-element array (int[1]). In
llc_populate_cpu_shard_id(), cpumask_first(sibling_cpus) returns an
unsigned int that the compiler cannot prove is always 0, triggering
a -Warray-bounds warning when the result is used to index
cpu_shard_id[]:

  kernel/workqueue.c:8321:55: warning: array subscript 1 is above
  array bounds of 'int[1]' [-Warray-bounds]
   8321 |  cpu_shard_id[c] = cpu_shard_id[cpumask_first(sibling_cpus)];
        |                    ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is a false positive: sibling_cpus can never be empty here because
'c' itself is always set in it, so cpumask_first() will always return a
valid CPU. However, the compiler cannot prove this statically, and the
warning only manifests on UP configs where the array size is 1.

Add a bounds check with WARN_ON_ONCE to silence the warning, and store
the result in a local variable to make the code clearer and avoid calling
cpumask_first() twice.

Fixes: 5920d046f7ae ("workqueue: add WQ_AFFN_CACHE_SHARD affinity scope")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202604022343.GQtkF2vO-lkp@intel.com/
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>

…id()

[Upstream commit 76af54648899abbd6b449c035583e47fd407078a]

Suggested-by: Wentao Guan <guanwentao@uniontech.com>
Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn>

Avenger-285714 (Member, Author), in reply to opsiff's comment above:

done

opsiff merged commit 4540a4f into deepin-community:linux-6.18.y on May 18, 2026
7 of 12 checks passed

deepin-ci-robot commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: opsiff

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
