Slowness in SP-Pilot and SP-MGS traffic in switch zones #241

@askfongjojo

Pilot requests are timing out frequently on switch zone 1 of rack2. The issue isn't seen on switch zone 0. Based on the dendrite logs, the main difference is the large number of errors like the one below in switch 1's log:

15:38:32.892Z DEBG dpd: received response for message that is not outstanding
    header = Header { version: 1, message_id: 7757853, message_kind: SpResponse }
    message = Message { version: 6, body: SpResponse(Read { modules: ModuleId(0x80000), failed_modules: ModuleId(0x0), read: MemoryRead { page: Sff8636(Lower), offset: 0, len: 1 } }) }
    outstanding_message_id = 7757855
    peer = [fe80::aa40:25ff:fe05:400%78]:11112
    task = io
    unit = transceiver-controller
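
To gauge how far behind these stale responses are, one could compare the `message_id` in the header against `outstanding_message_id`. The following is a minimal sketch, assuming the log text has the same multi-line layout as the record above; `stale_response_lags` is a hypothetical helper for triage, not part of dendrite.

```python
import re

# Pairs the header's message_id with the outstanding_message_id that follows it.
# Assumes every "not outstanding" record has both fields, as in the excerpt above.
STALE_RE = re.compile(
    r"received response for message that is not outstanding"
    r".*?message_id: (\d+)"
    r".*?outstanding_message_id = (\d+)",
    re.DOTALL,
)

def stale_response_lags(log_text):
    """Return (received_id, outstanding_id, lag) for each stale response."""
    return [
        (int(got), int(want), int(want) - int(got))
        for got, want in STALE_RE.findall(log_text)
    ]
```

For the record above this gives a lag of 2 between the received and outstanding message IDs. A consistently non-zero lag would suggest that replies are arriving after the controller has already moved on to a newer message.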

There are also errors on what appear to be non-existent QSFP ports(?):

15:38:32.058Z DEBG dpd: timed out without response, retrying
    task = io
    unit = transceiver-controller
15:38:32.159Z DEBG dpd: timed out without response, retrying
    task = io
    unit = transceiver-controller
15:38:32.260Z ERRO dpd: failed to send message within the retry limit
    limit = 3
    task = io
    unit = transceiver-controller
15:38:32.267Z ERRO dpd: controller error reading from module
    bank = 0
    len = 3
    module = 50
    offset = 6
    page = 0
    reason = MaxRetries(3)
    unit = qsfp-ffi
15:38:32.267Z ERRO dpd: qsfp 50 read failed
    module = Pltfm
    unit = bf-sde
15:38:32.368Z DEBG dpd: timed out without response, retrying
    task = io
    unit = transceiver-controller
15:38:32.469Z DEBG dpd: timed out without response, retrying
    task = io
    unit = transceiver-controller
15:38:32.571Z ERRO dpd: failed to send message within the retry limit
    limit = 3
    task = io
    unit = transceiver-controller
15:38:32.578Z ERRO dpd: controller error reading from module
    bank = 0
    len = 3
    module = 52
    offset = 6
    page = 0
    reason = MaxRetries(3)
    unit = qsfp-ffi
15:38:32.578Z ERRO dpd: qsfp 52 read failed
    module = Pltfm
    unit = bf-sde
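
To check which module numbers are affected, and whether they correspond to populated ports, the read failures can be tallied per module. This is a minimal sketch, assuming the log lines match the format in the excerpt above; `failed_reads_by_module` is a hypothetical helper, not a dendrite tool.

```python
import re
from collections import Counter

# Matches lines such as "15:38:32.267Z ERRO dpd: qsfp 50 read failed".
READ_FAILED_RE = re.compile(r"ERRO dpd: qsfp (\d+) read failed")

def failed_reads_by_module(log_text):
    """Count 'qsfp N read failed' errors per module number."""
    return Counter(int(m) for m in READ_FAILED_RE.findall(log_text))
```

Run over the excerpt above, this reports one failure each for modules 50 and 52. Judging by the timestamps, each failed read costs roughly 300 ms (three attempts about 100 ms apart), so repeated failures on several modules could add up quickly.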

The dendrite slowness was also observed about a week ago on rack2, at that time affecting both switch zones. The slowness went away after an online update and didn't recur until ~4 days after the most recent software update. We collected some system metrics and core files of dendrite, mgd, and the scrimlet and are tracking them in https://github.com/oxidecomputer/meta/issues/915.
