Skip to content

[linux-6.6.y] PCI/pciehp: Robust Hotplug Event Handling and Safe Device Management#1691

Open
leoliu-oc wants to merge 8 commits into
deepin-community:linux-6.6.yfrom
leoliu-oc:linux-6.6.y-137-pci-dpc-waiting-time
Open

[linux-6.6.y] PCI/pciehp: Robust Hotplug Event Handling and Safe Device Management#1691
leoliu-oc wants to merge 8 commits into
deepin-community:linux-6.6.yfrom
leoliu-oc:linux-6.6.y-137-pci-dpc-waiting-time

Conversation

@leoliu-oc
Copy link
Copy Markdown
Contributor

@leoliu-oc leoliu-oc commented May 12, 2026

This patch set consists of 6 patches in total.
Among them, patches 1 to 5 are upstream patches, and the 6th one is a custom patch for the target.

Upstream patches:

  1. PCI: pciehp: Detect device replacement during system sleep
  2. PCI: Make pci_destroy_dev() concurrent safe
  3. PCI: pciehp: Avoid unnecessary device replacement check
  4. PCI: Fix NULL dereference in SR-IOV VF creation error path
  5. PCI: pciehp: Ignore Presence Detect Changed caused by DPC
  6. PCI: pciehp: Ignore Link Down/Up caused by Secondary Bus Reset
  7. PCI: pciehp: Ignore belated Presence Detect Changed caused by DPC

Target patch:

  1. PCI: dpc: Increase pciehp waiting time for DPC recovery

Summary by Sourcery

Improve robustness of PCIe hotplug event handling, especially around DPC, link changes, secondary bus resets, and device replacement during sleep, while making device removal concurrency-safe and extending DPC recovery timeouts.

New Features:

  • Detect and handle replacement of hotplugged PCIe devices across system sleep using cached device identifiers and serial number.
  • Provide hotplug APIs to mark and mask spurious link changes during operations that intentionally toggle links, such as secondary bus resets or power transitions.

Bug Fixes:

  • Prevent races in pci_destroy_dev() by making device removal idempotent via a new removed flag.
  • Avoid acting on Presence Detect Changed and Link Down/Up events caused by DPC, secondary bus reset, or similar benign operations, reducing false hotplug handling.
  • Ensure drivers treat old devices as disconnected when a different device appears in the slot after resume, avoiding unsafe accesses.

Enhancements:

  • Extend DPC recovery wait timeout from 4 seconds to 16 seconds to accommodate slower completions in the field.
  • Cache device serial numbers in the hotplug controller and update them when configuring devices to support more reliable device identity tracking.

l1k and others added 2 commits May 12, 2026 20:50
Ricky reports that replacing a device in a hotplug slot during ACPI sleep
state S3 does not cause re-enumeration on resume, as one would expect.
Instead, the new device is treated as if it was the old one.

There is no bulletproof way to detect device replacement, but as a
heuristic, check whether the device identity in config space matches cached
data in struct pci_dev (Vendor ID, Device ID, Class Code, Revision ID,
Subsystem Vendor ID, Subsystem ID).  Additionally, cache and compare the
Device Serial Number (PCIe r6.2 sec 7.9.3).  If a mismatch is detected,
mark the old device disconnected (to prevent its driver from accessing the
new device) and synthesize a Presence Detect Changed event.

The device identity in config space which is compared here is the same as
the one included in the signed Subject Alternative Name per PCIe r6.1 sec
6.31.3.  Thus, the present commit prevents attacks where a valid device is
replaced with a malicious device during system sleep and the valid device's
driver obliviously accesses the malicious device.

This is about as much as can be done at the PCI layer.  Drivers may have
additional ways to identify devices (such as reading a WWID from some
register) and may trigger re-enumeration when detecting an identity change
on resume.

Link: https://lore.kernel.org/r/a1afaa12f341d146ecbea27c1743661c71683833.1716992815.git.lukas@wunner.de
Reported-by: Ricky Wu <ricky_wu@realtek.com>
Closes: https://lore.kernel.org/r/a608b5930d0a48f092f717c0e137454b@realtek.com
Tested-by: Ricky Wu <ricky_wu@realtek.com>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: LeoLiu-oc <leoliu-oc@zhaoxin.com>
Use an atomic flag instead of the racy check against the device's kobj
parent. We shouldn't be poking into device implementation details at this
level anyway.

Link: https://lore.kernel.org/r/20241022224851.340648-3-kbusch@meta.com
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: LeoLiu-oc <leoliu-oc@zhaoxin.com>
@deepin-ci-robot deepin-ci-robot requested review from myml and winnscode May 12, 2026 13:00
@sourcery-ai
Copy link
Copy Markdown

sourcery-ai Bot commented May 12, 2026

Reviewer's Guide

Refactors PCIe hotplug (pciehp) and PCI core hotplug handling to robustly distinguish real device/link changes from spurious ones (DPC recovery, secondary bus reset, power state transitions, firmware/FPGA updates), safely detect device replacement across system sleep, and make pci_destroy_dev() concurrency-safe while also extending DPC recovery wait time.

Sequence diagram for handling spurious PCIe link changes during maintenance operations

sequenceDiagram
    actor EndpointDriver
    participant PCIePort as pci_dev
    participant HotplugCore as pci_hotplug_core
    participant pciehp as pciehp_ist

    EndpointDriver->>HotplugCore: pci_hp_ignore_link_change(PCIePort)
    HotplugCore->>PCIePort: set_bit(PCI_LINK_CHANGING)

    pciehp->>HotplugCore: pci_hp_spurious_link_change(PCIePort)
    HotplugCore->>HotplugCore: wait_event(pci_hp_link_change_wq)
    HotplugCore->>PCIePort: test_bit(PCI_LINK_CHANGING)
    HotplugCore-->>pciehp: return true/false
    pciehp->>pciehp: pciehp_ignore_link_change(ctrl, pdev, irq, ignored_events)

    EndpointDriver->>HotplugCore: pci_hp_unignore_link_change(PCIePort)
    HotplugCore->>PCIePort: set_bit(PCI_LINK_CHANGED)
    HotplugCore->>PCIePort: clear_bit(PCI_LINK_CHANGING)
    HotplugCore->>HotplugCore: wake_up_all(pci_hp_link_change_wq)
Loading

Sequence diagram for detecting device replacement across system sleep

sequenceDiagram
    participant PM as pciehp_resume_noirq
    participant Ctrl as controller
    participant Port as pcie_port
    participant Dev as pci_dev

    PM->>Ctrl: pcie_clear_hotplug_events(ctrl)
    PM->>Ctrl: pciehp_device_replaced(ctrl)
    alt device replaced
        Ctrl->>Port: pci_get_slot(subordinate, PCI_DEVFN(0, 0))
        Ctrl->>Dev: pci_read_config_dword(...)
        Ctrl-->>PM: return true
        PM->>Port: pci_walk_bus(subordinate, pci_dev_set_disconnected)
        PM->>Ctrl: pciehp_request(ctrl, PCI_EXP_SLTSTA_PDC)
    else device unchanged
        Ctrl-->>PM: return false
    end
Loading

File-Level Changes

Change Details Files
Detect and handle device replacement in hotplug slots, especially across system sleep and after link-change storms.
  • Introduce pciehp_device_replaced() to compare current Function 0 identity (vendor/device/class/subsystem/DSN) against cached values to detect replacement.
  • Cache Device Serial Number (DSN) for Function 0 in struct controller and populate it during pcie_init() and pciehp_configure_device().
  • On resume_noirq, if a device is present and appears replaced, mark all functions on the subordinate bus as disconnected and synthesize a Presence Detect Changed (PDC) event to trigger reprobe.
  • Extend link-change handling so that if a device is replaced or link is down after an ignored event, a new hotplug event is synthesized.
drivers/pci/hotplug/pciehp_hpc.c
drivers/pci/hotplug/pciehp_core.c
drivers/pci/hotplug/pciehp.h
drivers/pci/hotplug/pciehp_pci.c
Generalize and centralize handling of spurious link and presence-detect events due to DPC, secondary bus reset, power transitions, and similar operations.
  • Replace pciehp_ignore_dpc_link_change() with generic pciehp_ignore_link_change() that can mask both DLLSC and PDC events and log a unified message.
  • In pciehp_ist(), treat both Presence Detect Changed and Data Link Layer State Changed bits as ignorable when they coincide with successful DPC recovery or core-detected spurious link changes, honoring inband presence configuration.
  • Ensure pending hotplug events are atomically cleared for the set of ignored events and that polling-mode drivers write back those status bits.
  • Use pci_hp_ignore_link_change()/pci_hp_unignore_link_change() to bracket secondary bus reset operations instead of manually twiddling slot control/status, avoiding spurious notifications.
  • Add pci_hp_spurious_link_change() helper so hotplug drivers can synchronize with and ignore link changes caused by concurrent core/endpoint operations.
drivers/pci/hotplug/pciehp_hpc.c
drivers/pci/hotplug/pci_hotplug_core.c
drivers/pci/hotplug/pciehp_core.c
include/linux/pci.h
drivers/pci/pci.h
Provide PCI core primitives to tag and synchronize with code paths that intentionally cause link changes on hotplug bridges.
  • Add waitqueue and helper functions pci_hp_ignore_link_change(), pci_hp_unignore_link_change(), and pci_hp_spurious_link_change() to mark, complete, and wait for spurious-link-causing sections, backed by new priv_flags bits PCI_LINK_CHANGING and PCI_LINK_CHANGED.
  • Export pci_hp_spurious_link_change() and declare ignore/unignore helpers for CONFIG_HOTPLUG_PCI, with no-op inline stubs when hotplug support is disabled.
  • Document semantics of these helpers so both core and endpoint/hotplug drivers can use them to coordinate around firmware updates, FPGA reconfig, secondary bus resets, and D3cold transitions.
drivers/pci/hotplug/pci_hotplug_core.c
drivers/pci/pci.h
include/linux/pci.h
Harden pci_destroy_dev() against concurrent calls and mark devices as removed via a priv_flag.
  • Introduce PCI_DEV_REMOVED priv_flag and accessor pci_dev_test_and_set_removed() to atomically test-and-set the removed state.
  • Change pci_destroy_dev() to early-return if the removed flag is already set, instead of relying on dev.kobj.parent, ensuring device_del() is only invoked once.
drivers/pci/remove.c
drivers/pci/pci.h
Increase robustness of DPC-based recovery and associated hotplug coordination by extending the wait time.
  • Extend DPC completion wait timeout in pci_dpc_recovered() from 4 seconds to 16 seconds to better accommodate slower hardware or firmware, while still clearing the PCI_DPC_RECOVERED flag when done.
drivers/pci/pcie/dpc.c

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@deepin-ci-robot
Copy link
Copy Markdown

Hi @leoliu-oc. Thanks for your PR.

I'm waiting for a deepin-community member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link
Copy Markdown

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've left some high level feedback:

  • In pciehp_configure_device() you call pci_get_slot(parent, PCI_DEVFN(0, 0)) and immediately dereference the result with pci_get_dsn(dev); consider guarding against a NULL return in case function 0 is absent so you don’t risk a NULL dereference.
  • Similarly, in pciehp_device_replaced() (and in pcie_init() when caching the DSN) you assume ctrl->pcie->port->subordinate / the subordinate bus and function 0 exist; it would be safer to null-check the bus (and the pci_get_slot() result) before use to avoid issues when the slot is empty or the bus hasn’t been enumerated.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `pciehp_configure_device()` you call `pci_get_slot(parent, PCI_DEVFN(0, 0))` and immediately dereference the result with `pci_get_dsn(dev)`; consider guarding against a NULL return in case function 0 is absent so you don’t risk a NULL dereference.
- Similarly, in `pciehp_device_replaced()` (and in `pcie_init()` when caching the DSN) you assume `ctrl->pcie->port->subordinate` / the `subordinate` bus and function 0 exist; it would be safer to null-check the bus (and the `pci_get_slot()` result) before use to avoid issues when the slot is empty or the bus hasn’t been enumerated.

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This patch set improves PCIe native hotplug robustness in pciehp/DPC scenarios by filtering spurious hotplug events (DPC recovery, secondary bus reset, etc.), detecting device replacement across system sleep, and hardening device teardown against concurrent removal.

Changes:

  • Add a hotplug helper API to mark/await “spurious link change” sections and use it to ignore benign Link Down/Up / Presence Detect changes in pciehp.
  • Track device identity (incl. cached DSN) and detect device replacement across suspend/resume, marking the old device disconnected and synthesizing a hotplug event.
  • Make pci_destroy_dev() idempotent for concurrency safety and extend DPC recovery wait timeout to 16s.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
include/linux/pci.h Public declarations for link-change ignore/unignore helpers.
drivers/pci/remove.c Make pci_destroy_dev() removal idempotent via a new removed flag.
drivers/pci/pcie/dpc.c Increase DPC recovery wait timeout from 4s to 16s.
drivers/pci/pci.h Add priv_flags bits and internal hotplug helper prototype.
drivers/pci/hotplug/pciehp.h Add cached DSN and device replacement detection prototype.
drivers/pci/hotplug/pciehp_pci.c Cache DSN after enumeration completes.
drivers/pci/hotplug/pciehp_hpc.c Implement replacement detection; ignore spurious link/presence events; integrate ignore-link-change helpers around SBR.
drivers/pci/hotplug/pciehp_core.c On resume_noirq, detect replacement and mark old devices disconnected + synthesize PDC.
drivers/pci/hotplug/pci_hotplug_core.c Implement link-change ignore/unignore/spurious detection primitives.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +76 to +77
ctrl->dsn = pci_get_dsn(dev);
pci_dev_put(dev);
Comment on lines +580 to +628
void pci_hp_ignore_link_change(struct pci_dev *pdev)
{
set_bit(PCI_LINK_CHANGING, &pdev->priv_flags);
smp_mb__after_atomic(); /* pairs with implied barrier of wait_event() */
}

/**
* pci_hp_unignore_link_change - end code section causing spurious link changes
* @pdev: PCI hotplug bridge
*
* Mark the end of a code section causing spurious link changes on the
* Secondary Bus of @pdev. Must be paired with pci_hp_ignore_link_change().
*/
void pci_hp_unignore_link_change(struct pci_dev *pdev)
{
set_bit(PCI_LINK_CHANGED, &pdev->priv_flags);
mb(); /* ensure pci_hp_spurious_link_change() sees either bit set */
clear_bit(PCI_LINK_CHANGING, &pdev->priv_flags);
wake_up_all(&pci_hp_link_change_wq);
}

/**
* pci_hp_spurious_link_change - check for spurious link changes
* @pdev: PCI hotplug bridge
*
* Check whether a code section is executing concurrently which is causing
* spurious link changes on the Secondary Bus of @pdev. Await the end of the
* code section if so.
*
* May be called by hotplug drivers to check whether a link change is spurious
* and can be ignored.
*
* Because a genuine link change may have occurred in-between a spurious link
* change and the invocation of this function, hotplug drivers should perform
* sanity checks such as retrieving the current link state and bringing down
* the slot if the link is down.
*
* Return: %true if such a code section has been executing concurrently,
* otherwise %false. Also return %true if such a code section has not been
* executing concurrently, but at least once since the last invocation of this
* function.
*/
bool pci_hp_spurious_link_change(struct pci_dev *pdev)
{
wait_event(pci_hp_link_change_wq,
!test_bit(PCI_LINK_CHANGING, &pdev->priv_flags));

return test_and_clear_bit(PCI_LINK_CHANGED, &pdev->priv_flags);
}
@leoliu-oc leoliu-oc force-pushed the linux-6.6.y-137-pci-dpc-waiting-time branch from bc712b5 to 65eb412 Compare May 12, 2026 13:21
l1k and others added 6 commits May 14, 2026 16:57
Hot-removal of nested PCI hotplug ports suffers from a long-standing race
condition which can lead to a deadlock:  A parent hotplug port acquires
pci_lock_rescan_remove(), then waits for pciehp to unbind from a child
hotplug port.  Meanwhile that child hotplug port tries to acquire
pci_lock_rescan_remove() as well in order to remove its own children.

The deadlock only occurs if the parent acquires pci_lock_rescan_remove()
first, not if the child happens to acquire it first.

Several workarounds to avoid the issue have been proposed and discarded
over the years, e.g.:

https://lore.kernel.org/r/4c882e25194ba8282b78fe963fec8faae7cf23eb.1529173804.git.lukas@wunner.de/

A proper fix is being worked on, but needs more time as it is nontrivial
and necessarily intrusive.

Recent commit 9d573d1 ("PCI: pciehp: Detect device replacement during
system sleep") provokes more frequent occurrence of the deadlock when
removing more than one Thunderbolt device during system sleep.  The commit
sought to detect device replacement, but also triggered on device removal.
Differentiating reliably between replacement and removal is impossible
because pci_get_dsn() returns 0 both if the device was removed, as well as
if it was replaced with one lacking a Device Serial Number.

Avoid the more frequent occurrence of the deadlock by checking whether the
hotplug port itself was hot-removed.  If so, there's no sense in checking
whether its child device was replaced.

This works because the ->resume_noirq() callback is invoked in top-down
order for the entire hierarchy:  A parent hotplug port detecting device
replacement (or removal) marks all children as removed using
pci_dev_set_disconnected() and a child hotplug port can then reliably
detect being removed.

Link: https://lore.kernel.org/r/02f166e24c87d6cde4085865cce9adfdfd969688.1741674172.git.lukas@wunner.de
Fixes: 9d573d1 ("PCI: pciehp: Detect device replacement during system sleep")
Reported-by: Kenneth Crudup <kenny@panix.com>
Closes: https://lore.kernel.org/r/83d9302a-f743-43e4-9de2-2dd66d91ab5b@panix.com/
Reported-by: Chia-Lin Kao (AceLan) <acelan.kao@canonical.com>
Closes: https://lore.kernel.org/r/20240926125909.2362244-1-acelan.kao@canonical.com/
Tested-by: Kenneth Crudup <kenny@panix.com>
Tested-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Cc: stable@vger.kernel.org # v6.11+
Signed-off-by: LeoLiu-oc <leoliu-oc@zhaoxin.com>
Clean up when virtfn setup fails to prevent NULL pointer dereference
during device removal. The kernel oops below occurred due to incorrect
error handling flow when pci_setup_device() fails.

Add pci_iov_scan_device(), which handles virtfn allocation and setup and
cleans up if pci_setup_device() fails, so pci_iov_add_virtfn() doesn't need
to call pci_stop_and_remove_bus_device().  This prevents accessing
partially initialized virtfn devices during removal.

  BUG: kernel NULL pointer dereference, address: 00000000000000d0
  RIP: 0010:device_del+0x3d/0x3d0
  Call Trace:
   pci_remove_bus_device+0x7c/0x100
   pci_iov_add_virtfn+0xfa/0x200
   sriov_enable+0x208/0x420
   mlx5_core_sriov_configure+0x6a/0x160 [mlx5_core]
   sriov_numvfs_store+0xae/0x1a0

Link: https://lore.kernel.org/r/20250310084524.599225-1-shayd@nvidia.com
Fixes: e3f30d5 ("PCI: Make pci_destroy_dev() concurrent safe")
Signed-off-by: Shay Drory <shayd@nvidia.com>
[bhelgaas: commit log, return ERR_PTR(-ENOMEM) directly]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: LeoLiu-oc <leoliu-oc@zhaoxin.com>
Commit a97396c ("PCI: pciehp: Ignore Link Down/Up caused by DPC")
amended PCIe hotplug to not bring down the slot upon Data Link Layer State
Changed events caused by Downstream Port Containment.

However Keith reports off-list that if the slot uses in-band presence
detect (i.e. Presence Detect State is derived from Data Link Layer Link
Active), DPC also causes a spurious Presence Detect Changed event.

This needs to be ignored as well.

Unfortunately there's no register indicating that in-band presence detect
is used.  PCIe r5.0 sec 7.5.3.10 introduced the In-Band PD Disable bit in
the Slot Control Register.  The PCIe hotplug driver sets this bit on
ports supporting it.  But older ports may still use in-band presence
detect.

If in-band presence detect can be disabled, Presence Detect Changed events
occurring during DPC must not be ignored because they signal device
replacement.  On all other ports, device replacement cannot be detected
reliably because the Presence Detect Changed event could be a side effect
of DPC.  On those (older) ports, perform a best-effort device replacement
check by comparing the Vendor ID, Device ID and other data in Config Space
with the values cached in struct pci_dev.  Use the existing helper
pciehp_device_replaced() to accomplish this.  It is currently #ifdef'ed to
CONFIG_PM_SLEEP in pciehp_core.c, so move it to pciehp_hpc.c where most
other functions accessing config space reside.

Reported-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://patch.msgid.link/fa264ff71952915c4e35a53c89eb0cde8455a5c5.1744298239.git.lukas@wunner.de
Signed-off-by: LeoLiu-oc <leoliu-oc@zhaoxin.com>
When a Secondary Bus Reset is issued at a hotplug port, it causes a Data
Link Layer State Changed event as a side effect.  On hotplug ports using
in-band presence detect, it additionally causes a Presence Detect Changed
event.

These spurious events should not result in teardown and re-enumeration of
the device in the slot.  Hence commit 2e35afa ("PCI: pciehp: Add
reset_slot() method") masked the Presence Detect Changed Enable bit in the
Slot Control register during a Secondary Bus Reset.  Commit 06a8d89
("PCI: pciehp: Disable link notification across slot reset") additionally
masked the Data Link Layer State Changed Enable bit.

However masking those bits only disables interrupt generation (PCIe r6.2
sec 6.7.3.1).  The events are still visible in the Slot Status register
and picked up by the IRQ handler if it runs during a Secondary Bus Reset.
This can happen if the interrupt is shared or if an unmasked hotplug event
occurs, e.g. Attention Button Pressed or Power Fault Detected.

The likelihood of this happening used to be small, so it wasn't much of a
problem in practice.  That has changed with the recent introduction of
bandwidth control in v6.13-rc1 with commit 665745f ("PCI/bwctrl:
Re-add BW notification portdrv as PCIe BW controller"):

Bandwidth control shares the interrupt with PCIe hotplug.  A Secondary Bus
Reset causes a Link Bandwidth Notification, so the hotplug IRQ handler
runs, picks up the masked events and tears down the device in the slot.

As a result, Joel reports VFIO passthrough failure of a GPU, which Ilpo
root-caused to the incorrect handling of masked hotplug events.

Clearly, a more reliable way is needed to ignore spurious hotplug events.

For Downstream Port Containment, a new ignore mechanism was introduced by
commit a97396c ("PCI: pciehp: Ignore Link Down/Up caused by DPC").
It has been working reliably for the past four years.

Adapt it for Secondary Bus Resets.

Introduce two helpers to annotate code sections which cause spurious link
changes:  pci_hp_ignore_link_change() and pci_hp_unignore_link_change()
Use those helpers in lieu of masking interrupts in the Slot Control
register.

Introduce a helper to check whether such a code section is executing
concurrently and if so, await it:  pci_hp_spurious_link_change()
Invoke the helper in the hotplug IRQ thread pciehp_ist().  Re-use the
IRQ thread's existing code which ignores DPC-induced link changes unless
the link is unexpectedly down after reset recovery or the device was
replaced during the bus reset.

That code block in pciehp_ist() was previously only executed if a Data
Link Layer State Changed event has occurred.  Additionally execute it for
Presence Detect Changed events.  That's necessary for compatibility with
PCIe r1.0 hotplug ports because Data Link Layer State Changed didn't exist
before PCIe r1.1.  DPC was added with PCIe r3.1 and thus DPC-capable
hotplug ports always support Data Link Layer State Changed events.
But the same cannot be assumed for Secondary Bus Reset, which already
existed in PCIe r1.0.

Secondary Bus Reset is only one of many causes of spurious link changes.
Others include runtime suspend to D3cold, firmware updates or FPGA
reconfiguration.  The new pci_hp_{,un}ignore_link_change() helpers may be
used by all kinds of drivers to annotate such code sections, hence their
declarations are publicly visible in <linux/pci.h>.  A case in point is
the Mellanox Ethernet driver which disables a firmware reset feature if
the Ethernet card is attached to a hotplug port, see commit 3d7a3f2
("net/mlx5: Nack sync reset request when HotPlug is enabled").  Going
forward, PCIe hotplug will be able to cope gracefully with all such use
cases once the code sections are properly annotated.

The new helpers internally use two bits in struct pci_dev's priv_flags as
well as a wait_queue.  This mirrors what was done for DPC by commit
a97396c ("PCI: pciehp: Ignore Link Down/Up caused by DPC").  That may
be insufficient if spurious link changes are caused by multiple sources
simultaneously.  An example might be a Secondary Bus Reset issued by AER
during FPGA reconfiguration.  If this turns out to happen in real life,
support for it can easily be added by replacing the PCI_LINK_CHANGING flag
with an atomic_t counter incremented by pci_hp_ignore_link_change() and
decremented by pci_hp_unignore_link_change().  Instead of awaiting a zero
PCI_LINK_CHANGING flag, the pci_hp_spurious_link_change() helper would
then simply await a zero counter.

Fixes: 665745f ("PCI/bwctrl: Re-add BW notification portdrv as PCIe BW controller")
Reported-by: Joel Mathew Thomas <proxy0@tutamail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219765
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Joel Mathew Thomas <proxy0@tutamail.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://patch.msgid.link/d04deaf49d634a2edf42bf3c06ed81b4ca54d17b.1744298239.git.lukas@wunner.de
Signed-off-by: LeoLiu-oc <leoliu-oc@zhaoxin.com>
Commit c3be50f ("PCI: pciehp: Ignore Presence Detect Changed caused by
DPC") sought to ignore Presence Detect Changed events occurring as a side
effect of Downstream Port Containment.

The commit awaits recovery from DPC and then clears events which occurred
in the meantime.  However if the first event seen after DPC is Data Link
Layer State Changed, only that event is cleared and not Presence Detect
Changed.  The object of the commit is thus defeated.

That's because pciehp_ist() computes the events to clear based on the local
"events" variable instead of "ctrl->pending_events".  The former contains
the events that had occurred when pciehp_ist() was entered, whereas the
latter also contains events that have accumulated while awaiting DPC
recovery.

In practice, the order of PDC and DLLSC events is arbitrary and the delay
in-between can be several milliseconds.

So change the logic to always clear PDC events, even if they come after an
initial DLLSC event.

Fixes: c3be50f ("PCI: pciehp: Ignore Presence Detect Changed caused by DPC")
Reported-by: Lương Việt Hoàng <tcm4095@gmail.com>
Reported-by: Joel Mathew Thomas <proxy0@tutamail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219765#c165
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Lương Việt Hoàng <tcm4095@gmail.com>
Tested-by: Joel Mathew Thomas <proxy0@tutamail.com>
Link: https://patch.msgid.link/d9c4286a16253af7e93eaf12e076e3ef3546367a.1750257164.git.lukas@wunner.de
Signed-off-by: LeoLiu-oc <leoliu-oc@zhaoxin.com>
zhaoxin inclusion
category: feature

--------------------

Commit a97396c ("PCI: pciehp: Ignore Link Down/Up caused by DPC")
amended PCIe hotplug to not bring down the slot upon Data Link Layer State
Changed events caused by Downstream Port Containment.

However, PCIe hotplug (pciehp) waits up to 4 seconds before assuming that
DPC recovery has failed and disabling the slot. This timeout period is
insufficient for some PCIe devices.
For example, the E810 dual-port network card driver needs to take over
10 seconds to execute its err_detected() callback.
Since this exceeds the maximum wait time allowed for DPC recovery by the
hotplug IRQ threads, a race condition occurs between the hotplug thread and
the dpc_handler() thread.

Signed-off-by: leoliu-oc <leoliu-oc@zhaoxin.com>
@leoliu-oc leoliu-oc force-pushed the linux-6.6.y-137-pci-dpc-waiting-time branch from 65eb412 to 27ae6dd Compare May 14, 2026 09:09
@deepin-ci-robot
Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: opsiff

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants