qat_init.sh: check_driver() times out on 420xx hardware when both qat_4xxx and qat_420xx are loaded #149

@aleksandrov-denis

Problem

qat.service fails to start on 420xx hardware with:

QAT driver is still not present after 20s. Aborting qat_init

even when the hardware and drivers are fully functional.

Root cause

get_module_state() builds a grep pattern for every name in SUPPORTED_DRIVER_NAMES (qat_4xxx qat_420xx) and returns the fifth field (module state) for every match in /proc/modules:

get_module_state() {
    CMD=""
    for SUPPORTED_DRIVER_NAME in $SUPPORTED_DRIVER_NAMES;
    do
        CMD="$CMD -e ^$SUPPORTED_DRIVER_NAME"
    done
    echo "$(cat /proc/modules | grep $CMD | cut -d' ' -f5)"
}

On 420xx hardware the kernel loads both modules — qat_420xx for the physical devices and qat_4xxx as a dependency — so get_module_state() returns two lines:

Live
Live

check_driver() then compares this multi-line string against the literal "Live":

while [ "$CURRENT_STATE" != "Live" ]

"Live\nLive" != "Live" is always true, so the loop runs until the 20-second timeout and the service exits with an error, despite both drivers being fully ready.
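The failing comparison can be reproduced without QAT hardware. The following is a minimal sketch, not code from qat_init.sh: is_live_exact is a hypothetical helper that mirrors the string-equality test, and the two-line value stands in for what get_module_state() returns when both modules are loaded.

```shell
# Hypothetical helper mirroring the equality test in check_driver():
# returns 0 only when the whole string is exactly "Live".
is_live_exact() {
    [ "$1" = "Live" ]
}

# One matching module yields one line; two modules yield two lines.
one_module="Live"
two_modules="$(printf 'Live\nLive')"

if is_live_exact "$one_module"; then
    echo "one module: treated as ready"
fi

if ! is_live_exact "$two_modules"; then
    echo "two modules: never treated as ready"
fi
```

With a single module the test passes; with two it cannot pass no matter how long the loop waits, because the embedded newline makes the strings differ.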

Reproduction

On any 420xx host (PCI device ID 0x4946):

# Both modules are present even though qat_4xxx has no devices bound
lsmod | grep qat
# qat_420xx  ...
# qat_4xxx   ...   ← loaded, no bound devices
# intel_qat  ... 2 qat_4xxx,qat_420xx

# Simulate what get_module_state() returns
grep -e '^qat_4xxx' -e '^qat_420xx' /proc/modules | cut -d' ' -f5
# Live
# Live    ← two lines; never matches the string "Live"

systemctl start qat
# → "QAT driver is still not present after 20s. Aborting qat_init"

The workaround is to manually unload the idle module first (rmmod qat_4xxx), after which only one line is returned and the comparison succeeds.

Proposed Fix

Replace the string-equality check with a grep that succeeds as soon as any driver in the output reaches the Live state:

# before
while [ "$CURRENT_STATE" != "Live" ]

# after
while ! echo "$CURRENT_STATE" | grep -q "^Live$"

This matches the intended semantics: the relevant hardware driver (qat_420xx) being Live is a sufficient signal to proceed. The state of co-loaded but idle drivers (qat_4xxx) is irrelevant.
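As a rough illustration of the line-based check, the sketch below runs it against a sample file instead of /proc/modules. The names get_module_state_from and any_driver_live are hypothetical, and the module sizes and addresses in the sample are made-up placeholder values; only the field layout (state in the fifth space-separated field) follows /proc/modules.

```shell
# Hypothetical variant of get_module_state() that reads a given file
# rather than /proc/modules, so it can run on a machine without QAT.
get_module_state_from() {
    grep -e '^qat_4xxx' -e '^qat_420xx' "$1" | cut -d' ' -f5
}

# Line-based check: succeeds if at least one line is exactly "Live".
any_driver_live() {
    echo "$1" | grep -q '^Live$'
}

# Sample data with placeholder sizes and addresses (assumed values).
sample="$(mktemp)"
cat > "$sample" <<'EOF'
qat_420xx 49152 0 - Live 0x0000000000000000
qat_4xxx 81920 0 - Live 0x0000000000000000
intel_qat 499712 2 qat_4xxx,qat_420xx, Live 0x0000000000000000
EOF

state="$(get_module_state_from "$sample")"
rm -f "$sample"

if any_driver_live "$state"; then
    echo "driver ready"
else
    echo "driver not ready"
fi
```

Note that this check passes when any one matched module is Live, so a module stuck in another state alongside a Live one would not block startup; that is the trade-off the proposed change accepts.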

Testing

Verified on an Intel 420xx SR630v3 system running RHEL 9 (kernel 5.14.0-687.el9.x86_64, qatlib 25.08.0) with both qat_4xxx and qat_420xx loaded. Prior to the fix, systemctl start qat timed out consistently; after the fix, the service starts cleanly and all 8 PFs are configured correctly.
