Improve ways to bulk manage nodes in Matter networks #182

Description

@mkerstner

Problem statement

A Matter network ("fabric") usually consists of one administrator and multiple nodes/devices.

Managing such a network sometimes requires actions that span multiple nodes, because configurations or data need to be kept in sync. Examples are:

  • Encryption keys
  • Bindings
  • Group configurations
  • Permissions/Access Control entries

The Matter Server should have a way to execute such actions and keep track of the process and status of such configurations covering all nodes in the network.

One example of such a more complex task is an encryption key rotation, which requires:

  • Generating the new key
  • Deploying the new key as an additional key to all nodes and monitoring the progress; nodes that are offline need special consideration because the new key is not deployed to them
  • Ideally, providing a way to view the status of such a multi-node data deployment and to guide the user
  • Once the new key has been deployed to all nodes, removing the old key in the controller and updating all nodes again so that the old key is cleaned up there as well
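
The rotation steps above can be sketched as a small per-node state tracker. This is only an illustration under stated assumptions, not the Matter Server's implementation: the names `KeyRotation`, `NodeKeyState`, `Phase` and `step` are hypothetical, and the actual key transfer to a node is reduced to a state change. The point it shows is that the old key is only removed once every node, including previously offline ones, holds the new key.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeKeyState(Enum):
    OLD_ONLY = "old_only"  # node only knows the old key
    BOTH = "both"          # new key deployed alongside the old key
    NEW_ONLY = "new_only"  # old key removed, rotation finished for this node


class Phase(Enum):
    DEPLOYING_NEW = "deploying_new"
    REMOVING_OLD = "removing_old"
    DONE = "done"


@dataclass
class KeyRotation:
    """Hypothetical tracker for a one-time key rotation across all nodes."""

    node_ids: list[int]
    states: dict[int, NodeKeyState] = field(default_factory=dict)

    def __post_init__(self) -> None:
        for node_id in self.node_ids:
            self.states.setdefault(node_id, NodeKeyState.OLD_ONLY)

    @property
    def phase(self) -> Phase:
        values = list(self.states.values())
        if all(s is NodeKeyState.NEW_ONLY for s in values):
            return Phase.DONE
        # As long as a single node lacks the new key, the old key must stay.
        if any(s is NodeKeyState.OLD_ONLY for s in values):
            return Phase.DEPLOYING_NEW
        return Phase.REMOVING_OLD

    def pending_nodes(self) -> list[int]:
        """Nodes that still need an update in the current phase."""
        if self.phase is Phase.DEPLOYING_NEW:
            target = NodeKeyState.OLD_ONLY
        else:
            target = NodeKeyState.BOTH
        return [n for n, s in self.states.items() if s is target]

    def step(self, online: set[int]) -> None:
        """Update every reachable pending node; offline nodes stay pending."""
        phase = self.phase
        for node_id in self.pending_nodes():
            if node_id not in online:
                continue
            if phase is Phase.DEPLOYING_NEW:
                self.states[node_id] = NodeKeyState.BOTH
            elif phase is Phase.REMOVING_OLD:
                self.states[node_id] = NodeKeyState.NEW_ONLY


rotation = KeyRotation([1, 2, 3])
rotation.step(online={1, 2})     # node 3 is offline and blocks old-key removal
print(rotation.phase, rotation.pending_nodes())
```

Calling `step` repeatedly with the currently reachable nodes moves the rotation forward; `pending_nodes()` is what a status view could show to the user.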

While the list of potential use cases that could be handled by an additional (bulk) node management layer is broad (see community signals below), we want to establish a solid foundation as a first step and validate it on one of the core aspects: one-time key rotation.

Community signals

Note that the list of potential use cases that could be handled by an additional node management layer is broad. Examples mentioned above and below should surface more concrete scenarios based on community feedback.

  1. No visibility into which nodes have received (or missed) configuration changes
    A community member managing 43 Matter/Thread devices built a full custom integration ("Matter Saver") specifically because there was no central overview: devices go offline silently after restarts, with no way to see which node is down, what its Node ID is, how it connects through the mesh, or why it has communication errors. The lack of per-node status tracking is the core gap this tool fills, and it's directly analogous to what network-wide key rotation or ACL deployment tracking would require.
    community.home-assistant.io: Matter Saver

  2. Device bindings require coordinating configuration across multiple nodes, and the tooling is raw
    Matter bindings allow devices to communicate directly with each other without going through a central controller. There's a need for functionality and UI to create, maintain, and view bindings, including listing all existing bindings between devices with cluster details and setting up new ones with compatibility checking. A community member went further: while the Matter Server can create direct bindings in principle, the UI is really not user-friendly, which prompted building a new integration specifically for this.
    github.com/orgs/home-assistant: Matter Device Binding Support #2729 / community.home-assistant.io: Device to Device Bindings

  3. Offline nodes silently miss configuration, with no retry or pending-state tracking
    Users with mixed battery and mains-powered networks repeatedly hit situations where a subset of nodes go offline and miss updates. In one case, after a Matter Server update, only 5 out of 15 devices came back online, with the add-on still running normally but no indication of which nodes had missed the state change or how to recover them. There's no concept of a "pending" or "needs sync" state visible to the user.
    community.home-assistant.io: Most Matter Thread devices go offline after Matter Server update

  4. Firmware OTA updates, the closest existing multi-node operation, are unreliable and opaque
    Users report that Home Assistant can't push firmware updates to Thread/Matter nodes nearly as reliably as Apple Home can. Failures are silent or cryptic, with no per-node progress view. This is the most direct existing analog to key rotation: a network-wide operation that needs to reach every node, handle failures gracefully, and report status.
    github.com/home-assistant/core: Issue #145478

  5. Thread operational dataset changes have a pending/migration path in the spec, but HA exposes none of it
    The openHAB Matter binding documentation notes that for nodes containing a Thread Border Router Management Cluster, a "pending" operational dataset can be configured to coordinate configuration changes with current Thread devices without requiring those devices to be reconfigured individually, as a live migration path. Home Assistant has no equivalent pending or migration state exposed to users. Thread network key rotation is therefore a live, under-served version of the exact multi-node deployment problem in the problem statement.
    openhab.org: Matter Bindings

Scope & Boundaries

In scope

We are focusing on two things:

  1. Build a node management layer inside the Matter Server to manage relevant data and establish a solid foundation for further improvements thereof.

  2. Validate the layer by executing a one-time key rotation (needed anyway when certain counters, e.g. for groups, overflow)

Not in scope

  • Additional bulk actions on nodes

Foreseen solution

The Matter specification already has the idea of such a multi-node management layer in the section about "Joint Fabric", so the idea is to start by implementing the "Joint Fabric Datastore" cluster logic.

We do not want to use exactly this as a cluster, but want to evolve it for internal needs with added convenience, through an additional layer on top that manages not only "data deployments" but also "multi-step tasks" like a key rotation.
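
As a rough illustration of what a "data deployment" record in such a layer could look like, the sketch below tracks one piece of data per target node with a pending/committed/failed state. All names here (`NodeDatastore`, `DataDeployment`, `EntryState`, `needs_sync`) are hypothetical and chosen for this example; they are not the attributes or commands of the Joint Fabric Datastore cluster, and the payload is an arbitrary placeholder.

```python
from dataclasses import dataclass, field
from enum import Enum


class EntryState(Enum):
    PENDING = "pending"      # not yet written to the node (e.g. node offline)
    COMMITTED = "committed"  # node confirmed the data
    FAILED = "failed"        # write attempt failed and needs a retry


@dataclass
class DataDeployment:
    """One piece of data (e.g. an ACL entry) that should exist on a set of nodes."""

    name: str
    payload: dict
    node_states: dict[int, EntryState] = field(default_factory=dict)

    def summary(self) -> dict[str, int]:
        """Count nodes per state, e.g. for a status overview."""
        counts = {state.value: 0 for state in EntryState}
        for state in self.node_states.values():
            counts[state.value] += 1
        return counts


class NodeDatastore:
    """Hypothetical in-memory store of deployments, keyed by name."""

    def __init__(self) -> None:
        self._deployments: dict[str, DataDeployment] = {}

    def deploy(self, name: str, payload: dict, node_ids: list[int]) -> DataDeployment:
        # Every target node starts as pending until it reports back.
        states = {node_id: EntryState.PENDING for node_id in node_ids}
        deployment = DataDeployment(name, payload, states)
        self._deployments[name] = deployment
        return deployment

    def report(self, name: str, node_id: int, ok: bool) -> None:
        state = EntryState.COMMITTED if ok else EntryState.FAILED
        self._deployments[name].node_states[node_id] = state

    def needs_sync(self, node_id: int) -> list[str]:
        """Names of deployments this node is a target of but has not committed."""
        return [
            d.name
            for d in self._deployments.values()
            if node_id in d.node_states
            and d.node_states[node_id] is not EntryState.COMMITTED
        ]


store = NodeDatastore()
acl = store.deploy("acl-admin", {"privilege": "administer"}, [1, 2, 3])
store.report("acl-admin", 1, ok=True)
store.report("acl-admin", 2, ok=False)
print(acl.summary(), store.needs_sync(2))
```

A multi-step task such as a key rotation could then be expressed as a sequence of such deployments, where a later step only starts once the previous deployment is committed on every node.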

Risks & open questions

Risks

  • Bulk permission changes and key rotation in particular are critical processes that could remove access to all nodes in the Matter fabric. Consequently, we need to ensure that the process is bulletproof and thoroughly tested with our community.

Appetite

Medium - 3-4 weeks

Execution issues

No response

Decision log

Date Decision Outcome

Metadata

Projects

Status
Shaping
