
F #213: Add 'set_switchdev' option to manage eswitch mode #214

Open
sk4zuzu wants to merge 2 commits into master from f-213

@sk4zuzu
Collaborator

@sk4zuzu sk4zuzu commented May 24, 2026

  • Add 'set_switchdev' option
  • Manage eswitch mode in UDEV
  • Make VF;PF;FN detection safer (fix)
  • Update README.md

@sk4zuzu sk4zuzu requested review from dann1, rsmontero and tinova May 24, 2026 22:59
- Add 'set_switchdev' option
- Manage eswitch mode in UDEV
- Make VF;PF;FN detection safer (fix)
- Update README.md

Signed-off-by: Michal Opala <sk4zuzu@gmail.com>
@dann1
Collaborator

dann1 commented May 25, 2026

Hi @sk4zuzu, the mode activation is being managed by a udev rule. Testing the pre playbook with this inventory snippet:

node:
  hosts:
    sm15:
      ansible_host: sm15
      pci_devices:
        - address: "0000:81:00.1"
          set_driver: omit
          set_numvfs: max
          set_switchdev: true

yields this rule

[root@sm15 ~]# cat /etc/udev/rules.d/98-eswitch.rules
# managed by one-deploy
# --- PCI
SUBSYSTEM=="pci", ACTION=="add", ENV{ID_PATH}=="pci-0000:81:01.2", \
RUN+="/usr/sbin/devlink dev eswitch set 'pci/0000:81:00.1' mode switchdev"

The issue is that the command is hooked to the appearance of the first virtual function of the PF being set to switchdev, so it only runs once VFs already exist.

[root@sm15 ~]# ip link show pf811
6: pf811: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP mode DEFAULT group default qlen 1000
    link/ether 7c:c2:55:88:48:fb brd ff:ff:ff:ff:ff:ff
    vf 0     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
    vf 1     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
    vf 2     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
    vf 3     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
    vf 4     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
    vf 5     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
    vf 6     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
    vf 7     link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
    altname enp129s0f1np1
[root@sm15 ~]# dpdk-devbind.py --status-dev net | grep -i 81:01.2
0000:81:01.2 'MT27710 Family [ConnectX-4 Lx Virtual Function] 1016' numa_node=0 if=enp129s0f1v0 drv=mlx5_core unused=vfio-pci
[root@sm15 ~]# ethtool -i enp129s0f1v0
driver: mlx5_core
version: 5.14.0-570.62.1.el9_6.x86_64
firmware-version: 14.32.1250 (SM_1321000001000)
expansion-rom-version:
bus-info: 0000:81:01.2
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

Setting the mode has to be done with 0 VFs, otherwise it fails

[root@sm15 ~]# cat /sys/bus/pci/devices/0000\:81\:00.1/sriov_numvfs
8
[root@sm15 ~]# devlink dev eswitch show pci/0000:81:00.1
pci/0000:81:00.1: mode legacy inline-mode link encap-mode basic
[root@sm15 ~]# devlink dev eswitch set pci/0000:81:00.1 mode switchdev
Error: mlx5_core: Failed setting eswitch to offloads.
kernel answers: Invalid argument

[root@sm15 ~]# dmesg | tail -n 5
[606792.231599] mlx5_core 0000:81:00.1: E-Switch: Disable: mode(LEGACY), nvfs(8), necvfs(0), active vports(9)
[606795.217415] mlx5_core 0000:81:00.1: mlx5_cmd_out_err:811:(pid 3267096): CREATE_FLOW_TABLE(0x930) op_mod(0x0) failed, status bad resource state(0x9), syndrome (0x98afbb), err(-22)
[606795.218741] mlx5_core 0000:81:00.1: E-Switch: Failed to create slow path FDB Table err -22
[606796.816561] mlx5_core 0000:81:00.1: mlx5_cmd_modify_header_alloc:990:(pid 2955916): too many modify header actions 1, max supported 0
[606796.816570] mlx5_core 0000:81:00.1 pf811: Failed to create tc offload table

The result is that Ansible succeeds and the rule is created, but the mode remains legacy. We need to make sure the mode is set in the current power cycle as well as in the next one.
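
Since a successful Ansible run says nothing about the resulting mode, one option is to parse the 'devlink dev eswitch show' output and fail loudly when the mode is not the expected one. A minimal sketch, assuming the output keeps the single-line format shown above; eswitch_mode_from_line and check_eswitch_mode are hypothetical helper names, not part of one-deploy:

```shell
#!/bin/sh
# Extract the eswitch mode from one line of `devlink dev eswitch show` output,
# e.g. "pci/0000:81:00.1: mode legacy inline-mode link encap-mode basic".
eswitch_mode_from_line() {
    # Print the word that follows the standalone "mode" keyword, if any.
    printf '%s\n' "$1" |
        awk '{ for (i = 1; i < NF; i++) if ($i == "mode") { print $(i + 1); exit } }'
}

# Succeed only when the reported mode equals the wanted one.
# Hypothetical helper for illustration only.
check_eswitch_mode() {
    wanted="$1"
    line="$2"
    actual="$(eswitch_mode_from_line "$line")"
    if [ "$actual" != "$wanted" ]; then
        echo "eswitch mode is '${actual:-unknown}', expected '$wanted'" >&2
        return 1
    fi
}
```

On a real host this could be called as: check_eswitch_mode switchdev "$(devlink dev eswitch show pci/0000:81:00.1)"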

This rule worked for me

[root@sm15 ~]# devlink dev eswitch show pci/0000:81:00.1
pci/0000:81:00.1: mode legacy inline-mode link encap-mode basic
[root@sm15 ~]# vim /etc/udev/rules.d/97-eswitch.rules
[root@sm15 ~]# udevadm control --reload-rules
[root@sm15 ~]# udevadm trigger --action=add --subsystem-match=pci --sysname-match=0000:81:00.1
[root@sm15 ~]# devlink dev eswitch show pci/0000:81:00.1
pci/0000:81:00.1: mode switchdev inline-mode link encap-mode basic
[root@sm15 ~]# cat /etc/udev/rules.d/97-eswitch.rules
SUBSYSTEM=="pci", ACTION=="add", KERNELS=="0000:81:00.1", \
RUN+="/usr/sbin/devlink dev eswitch set 'pci/0000:81:00.1' mode switchdev"
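
The rule above is keyed on the PF address rather than on a VF. As a rough sketch of how such a rule could be generated per PF, assuming the same KERNELS match and devlink path as in the hand-written file; render_eswitch_rule is a hypothetical helper name:

```shell
#!/bin/sh
# Print a udev rule that sets a PF's eswitch to switchdev when the PF itself
# is added. Hypothetical helper; mirrors the hand-written 97-eswitch.rules.
render_eswitch_rule() {
    pf="$1"   # full PCI address of the PF, e.g. 0000:81:00.1
    # First line ends with a literal backslash to continue the rule.
    printf 'SUBSYSTEM=="pci", ACTION=="add", KERNELS=="%s", \\\n' "$pf"
    printf "RUN+=\"/usr/sbin/devlink dev eswitch set 'pci/%s' mode switchdev\"\n" "$pf"
}
```

For example, 'render_eswitch_rule 0000:81:00.1' prints the two-line rule for that PF, which could then be written to a rules file and loaded with 'udevadm control --reload-rules'.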

@sk4zuzu
Collaborator Author

sk4zuzu commented May 25, 2026

@dann1 This is interesting, because with netdevsim I've seen the opposite, hence the implementation. I mean, setting switchdev mode after the VFs are created seemed to work fine, and I used UDEV specifically so that this survives a reboot. 🤔 I will research further then, thanks. I hope I will not have to implement different procedures for different cards.. 😅

@sk4zuzu
Collaborator Author

sk4zuzu commented May 25, 2026

@dann1 So the procedure I implemented is not entirely incorrect. ☝️😌

Setting the mode has to be done with 0 VFs, otherwise it fails

This statement is imprecise and your conclusion is actually not true, because you can unbind the drivers of all VFs, then change the mode, and finally bind them again:

devlink dev eswitch show pci/0000:c1:00.1
pci/0000:c1:00.1: mode legacy inline-mode link encap-mode basic
echo 2 > /sys/bus/pci/devices/0000:c1:00.1/sriov_numvfs
c1:01.2 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function]
c1:01.3 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function]
echo '0000:c1:01.2' > /sys/bus/pci/drivers/mlx5_core/unbind
echo '0000:c1:01.3' > /sys/bus/pci/drivers/mlx5_core/unbind
devlink dev eswitch set 'pci/0000:c1:00.1' mode switchdev
echo '0000:c1:01.2' > /sys/bus/pci/drivers/mlx5_core/bind
echo '0000:c1:01.3' > /sys/bus/pci/drivers/mlx5_core/bind
devlink dev eswitch show pci/0000:c1:00.1
pci/0000:c1:00.1: mode switchdev inline-mode link encap-mode basic

It works.. 🤗

Of course doing this with 0 VFs is easier, though 👍 🥰
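
The unbind, set mode, bind sequence above could be generalized to any number of VFs by walking the PF's virtfnN symlinks in sysfs. A minimal sketch that only prints the commands instead of running them; plan_switchdev_rebind and the SYSFS_ROOT override are hypothetical names, and the VFs are assumed to be bound to mlx5_core as in the transcript:

```shell
#!/bin/sh
# Print (not run) the unbind -> set mode -> bind sequence for one PF.
# SYSFS_ROOT is a hypothetical override used to point at a fake sysfs tree.
plan_switchdev_rebind() {
    pf="$1"                    # PF address, e.g. 0000:c1:00.1
    driver="${2:-mlx5_core}"   # assumed VF driver
    root="${SYSFS_ROOT:-/sys}"
    vfs=""
    # Each VF shows up as a virtfnN symlink under the PF's device directory.
    for link in "$root/bus/pci/devices/$pf"/virtfn*; do
        [ -L "$link" ] || continue   # skips the unexpanded glob when no VFs exist
        vfs="$vfs $(basename "$(readlink "$link")")"
    done
    for vf in $vfs; do
        echo "echo '$vf' > $root/bus/pci/drivers/$driver/unbind"
    done
    echo "devlink dev eswitch set 'pci/$pf' mode switchdev"
    for vf in $vfs; do
        echo "echo '$vf' > $root/bus/pci/drivers/$driver/bind"
    done
}
```

Printing the plan first makes it easy to review before anything is written to sysfs. VFs bound to a different driver, such as vfio-pci, would need their own handling.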

@dann1
Collaborator

dann1 commented May 25, 2026

Whatever works best. I tried rebinding mlx5_core -> vfio-pci -> mlx5_core and it didn't work, though. Activating 1 VF on each Mellanox interface:

[root@sm15 ~]# dpdk-devbind.py --status-dev net

Network devices using DPDK-compatible driver
============================================
0000:81:00.2 'MT27710 Family [ConnectX-4 Lx Virtual Function] 1016' numa_node=0 drv=vfio-pci unused=mlx5_core
0000:81:01.2 'MT27710 Family [ConnectX-4 Lx Virtual Function] 1016' numa_node=0 drv=vfio-pci unused=mlx5_core

Network devices using kernel driver
===================================
0000:01:00.0 'BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller 16d8' numa_node=0 if=vmnic0 drv=bnxt_en unused=vfio-pci
0000:01:00.1 'BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller 16d8' numa_node=0 if=vmnic1 drv=bnxt_en unused=vfio-pci
0000:81:00.0 'MT27710 Family [ConnectX-4 Lx] 1015' numa_node=0 if=pf810,eth0 drv=mlx5_core unused=vfio-pci
0000:81:00.1 'MT27710 Family [ConnectX-4 Lx] 1015' numa_node=0 if=pf811 drv=mlx5_core unused=vfio-pci
[root@sm15 ~]# devlink dev eswitch show pci/0000:81:00.1
pci/0000:81:00.1: mode legacy inline-mode link encap-mode basic
[root@sm15 ~]# dpdk-devbind.py -b mlx5_core 0000:81:01.2
Error: bind failed for 0000:81:01.2 - Cannot bind to driver mlx5_core: [Errno 19] No such device
Error: unbind failed for 0000:81:01.2 - Cannot open /sys/bus/pci/drivers//unbind: [Errno 13] Permission denied: '/sys/bus/pci/drivers//unbind'
[root@sm15 ~]# driverctl unset-override 0000:81:01.2
[root@sm15 ~]# driverctl unset-override 0000:81:00.2
[root@sm15 ~]# dpdk-devbind.py -b mlx5_core 0000:81:01.2
[root@sm15 ~]# dpdk-devbind.py -b mlx5_core 0000:81:00.2
[root@sm15 ~]# devlink dev eswitch show pci/0000:81:00.1
pci/0000:81:00.1: mode legacy inline-mode link encap-mode basic
[root@sm15 ~]# dpdk-devbind.py -b vfio-pci 0000:81:01.2
[root@sm15 ~]# dpdk-devbind.py -b vfio-pci 0000:81:00.2
[root@sm15 ~]# devlink dev eswitch show pci/0000:81:00.1
pci/0000:81:00.1: mode legacy inline-mode link encap-mode basic
[root@sm15 ~]# dpdk-devbind.py -b mlx5_core 0000:81:00.2
[root@sm15 ~]# dpdk-devbind.py -b mlx5_core 0000:81:01.2
[root@sm15 ~]# devlink dev eswitch show pci/0000:81:00.1
pci/0000:81:00.1: mode legacy inline-mode link encap-mode basic
[root@sm15 ~]# cat /etc/udev/rules.d/98-eswitch.rules
# managed by one-deploy
# --- PCI
SUBSYSTEM=="pci", ACTION=="add", ENV{ID_PATH}=="pci-0000:81:00.2", \
RUN+="/usr/sbin/devlink dev eswitch set 'pci/0000:81:00.0' mode switchdev"
SUBSYSTEM=="pci", ACTION=="add", ENV{ID_PATH}=="pci-0000:81:01.2", \
RUN+="/usr/sbin/devlink dev eswitch set 'pci/0000:81:00.1' mode switchdev"

From the Intel 810 eswitch guide:

. Remove all VFs from the PF under test.
The 800 Series Network Adapter allows switching in and out of switchdev mode
only if there are no VFs created/associated with related PF.
a. Stop all VMs, containers, or DPDK applications using VFs connected to the PF.
b. Unload all VFs from the PF by setting the number of VFs to 0:
echo 0 > /sys/class/net/$/device/sriov_numvfs
NOTE
When the PF driver is already in switchdev mode, for each VF that is attached to
the PF, there is a corresponding VF_PR netdev. When the VF is removed, the
corresponding VF_PR netdev is automatically removed as well.

However, since you mention driver re-binding, please consider for the follow-up commit(s) that we will most likely use vfio-pci on the virtual functions. There seems to be a race condition (the permissions set by the udev rule race with the passthrough) that prevents using libvirt-managed vfio-pci devices. To avoid that, all VFs will be pre-bound. For example:

    sm15:
      ansible_host: sm15
      pci_devices:
        - vendor: "15b3"
          device: "1015"
          class: "0200"
          set_switchdev: true
          set_driver: omit
          set_numvfs: 1
          set_name: "pf{0[1]}{0[3]}"
          unlisted: false
        - vendor: "15b3"
          device: "1016"
          class: "0200"
          set_name: "pf{1[1]}{1[3]}vf{2}"
          set_driver: vfio-pci
          virtual: true
          unlisted: false

@sk4zuzu
Collaborator Author

sk4zuzu commented May 25, 2026

@dann1 so yes there are 2 obvious cases:

  1. Just after reboot, where we can assume that there are 0 VFs.
  2. Later when there may be some configuration done already, by someone or something and there may be VFs pre-configured.

Because 2. is always possible, we cannot assume 1. is the only situation we deal with, so we actually need to combine both cases into a single solution. I'd say the best option would be to extend the sriov-manage.sh script, but it has to include code that deals with devices that already have VFs enabled. It is then probably also easier to see that something failed than by looking at UDEV logs. 🤔
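
One way to cover both cases in a single path is to check sriov_numvfs first and only remove and restore VFs when some already exist. A minimal sketch that prints the steps rather than executing them; it is not the code from sriov-manage.sh, plan_switchdev_reset and SYSFS_ROOT are hypothetical names, and it assumes nothing is using the VFs while they are removed:

```shell
#!/bin/sh
# Print (not run) the steps to put one PF into switchdev mode, whether or not
# VFs already exist. SYSFS_ROOT is a hypothetical override for a fake tree.
plan_switchdev_reset() {
    pf="$1"   # PF address, e.g. 0000:81:00.1
    root="${SYSFS_ROOT:-/sys}"
    numvfs_file="$root/bus/pci/devices/$pf/sriov_numvfs"
    current="$(cat "$numvfs_file" 2>/dev/null || echo 0)"
    current="${current:-0}"
    if [ "$current" -gt 0 ]; then
        # Case 2: VFs are already configured, so remove them first.
        echo "echo 0 > $numvfs_file"
    fi
    # Case 1 (0 VFs) needs only this step.
    echo "devlink dev eswitch set 'pci/$pf' mode switchdev"
    if [ "$current" -gt 0 ]; then
        # Restore the previous VF count once the mode has changed.
        echo "echo $current > $numvfs_file"
    fi
}
```

With 8 VFs present this prints three steps (zero the VFs, set the mode, restore 8), and with none it prints only the devlink command.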

- Migrate bash/systemd code to 'template' tasks
- Extend sriov-manage.sh script to accept switchdev option
- Revert UDEV attempt

Signed-off-by: Michal Opala <sk4zuzu@gmail.com>
@sk4zuzu
Collaborator Author

sk4zuzu commented May 25, 2026

@dann1

What about 84bd83c? I think this should be a much simpler approach (mostly what you suggested). 🤔

I also simplified my mock https://github.com/sk4zuzu/libvirt-qemu-research/blob/ba34a6853e093b0dd749d3882c1924ff50f0e751/files/switchdevmock.sh and kept only the essential stuff, so the kernel no longer panics.. 🤭
