improvements to controller/fsm disconnection handling #883

Open
Aurashk wants to merge 11 commits into develop from aurashk/improve-handling-of-connection-problems-in-fsm

@Aurashk
Contributor

@Aurashk Aurashk commented Apr 20, 2026

Description

Fixes issue #516
Several improvements and fixes to the controller, particularly disconnection handling.

Type of change

  • New feature / enhancement
  • Optimization
  • Bug fix
  • Breaking change
  • Documentation

Change log

  1. The status command now checks for a valid connection to every child. Children that can't be reached are shown as disconnected without changing their internal state, and are no longer marked as in error.
  2. Before any FSM command is issued, the connection to every child in the chain is checked. If any child is disconnected, the shell emits a warning and aborts the command, suggesting that you exclude the disconnected child if you want to carry on without it. Once all disconnected children are excluded, FSM commands can go ahead.
  3. Killed REST apps now work properly after a restart (status shows them as not in error, and FSM commands are possible); see this comment for an example. If a connection can't be established, the new REST endpoint is refreshed from the connectivity service, similar to the connection refreshing for gRPC children.
  4. When you exclude a particular target application, its parent is no longer excluded by default, which avoids this behaviour.
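
A minimal sketch of the check described in item 2, under stated assumptions: `Child`, `is_reachable`, `disconnected_children` and `run_fsm_command` are hypothetical names chosen for illustration and are not taken from the drunc code base. It only shows the intended behaviour, where excluded children are skipped by the connectivity check.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Child:
    """Hypothetical stand-in for a controller child node."""

    name: str
    included: bool = True
    # Assumed probe: returns True if the child answers a connection check.
    is_reachable: Callable[[], bool] = lambda: True


def disconnected_children(children: list[Child]) -> list[str]:
    """Return the names of included children that cannot be reached.

    Excluded children are skipped, so excluding every disconnected
    child lets the command go ahead.
    """
    return [c.name for c in children if c.included and not c.is_reachable()]


def run_fsm_command(command: str, children: list[Child]) -> str:
    """Abort with a warning if any included child is disconnected."""
    unreachable = disconnected_children(children)
    if unreachable:
        names = ", ".join(unreachable)
        return (
            f"warning: '{command}' aborted, disconnected children: {names}. "
            "Exclude them to continue without them."
        )
    return f"'{command}' sent to included children"
```

In this sketch, calling `run_fsm_command("conf", ...)` with a killed `df-01` returns the warning, and the same call goes ahead once `df-01.included` is set to `False`.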

Suggested manual testing checklist

  1. drunc-unified-shell ssh-standalone daqsystemtest/config/daqsystemtest/example-configs.data.xml local-1x1-config MyTest
     boot
     status
     kill --name df-01
     status
  2. drunc-unified-shell ssh-standalone daqsystemtest/config/daqsystemtest/example-configs.data.xml local-1x1-config MyTest
     boot
     status
     kill --name df-01
     conf
     exclude --target root-controller/df-controller/df-01
     conf
  3. drunc-unified-shell ssh-standalone daqsystemtest/config/daqsystemtest/example-configs.data.xml local-1x1-config MyTest
     boot
     status
     kill --name df-01
     status
     restart --name df-01
     status
     conf

     Note: you may need to do status twice, re-connection is asynchronous and doesn't happen instantly
  4. Same as 2, observe that the parent df-controller will be configured

Developer checklist

Prior to marking this as "Ready for Review"

Tests ran on: hep cluster from develop

Unit tests - some tests can't be run on the CI. This is documented. If this PR checks a feature that can't be tested with CI, this has been marked appropriately.

Integration tests - the daqsystemtest_integtest_bundle requires a lot of resources, and connections to the EHN1 infrastructure. Check the cross referenced list if you can't run these. The developer needs to run at least the .

  • Unit tests (pytest --marker) passed
    • With relevant marker
    • Without marker
  • Integration tests passed
    • Only daqsystemtest_integtest_bundle.sh -k minimal_system_quick_test.py
    • Full daqsystemtest_integtest_bundle.sh
  • Testing skipped as there are no core code changes in this PR, this only relates to documentation/CI workflows

Final checklist prior to marking this as "Ready for Review"

  • Code is clearly commented.
  • New unit tests have been added, or is documented in # ISSUE NUMBER
  • A suitable reviewer has been chosen from this list.

Reviewer checklist

  • This branch has been rebased with develop prior to testing.
  • Suggested manual tests show changes.
  • CI workflow failures documented (if present)
  • Integration tests passed
    • Only concern yourself if failures related to drunc are in the log files
    • If non-drunc failure appears:
      • Validate failure in fresh working area
      • Contact Pawel if unsure

Once the features are validated and both the unit and integration tests pass, the PR is ready to be merged.

Prior to merging

Choose one of the following and complete all substeps
  • Changes only affect the Run Control, are in a single repository, and do not affect the end user.
    • Changes are documented in docstrings and code comments
    • Wiki has been updated if architectural or endpoint changes
  • Otherwise
    • Workflow changes demonstrated in the Change Log (if necessary)
    • Wiki has been updated (if necessary)
    • #daq-sw-librarians Slack channel notified (see below)

Once completed, the reviewer can merge the PR.

Notification message for a Slack channel

Note - this should be to #dunedaq-integration for general workflow that isn't during a release candidate period, and to #daq-release-prep otherwise.

For a single merge that changes the user workflow

The CCM WG has an isolated PR ready to merge that affects user workflows. The PR is:

_URL_

I will leave time for any comments, otherwise will merge these at the end of the work day _Insert your time zone_.

For co-ordinated merge

The CCM WG has a set of co-ordinated merges ready to merge. The PRs are:

_URL_

_URL_


I will leave time for any comments, otherwise will merge these at the end of the day.

@Aurashk Aurashk changed the title from "improvements to contorller handling" to "improvements to controller/fsm disconnection handling" Apr 20, 2026
@Aurashk Aurashk requested a review from jamesturner246 April 22, 2026 13:41
@Aurashk Aurashk marked this pull request as ready for review April 22, 2026 13:41
Contributor

@jamesturner246 jamesturner246 left a comment

Definitely the right call to separate out the error classes. Way less confusing.

Just the things we talked about last week: restore self.included in the Nodes, and keep ClientSideState out of the other nodes, since it is REST/app-related. Ideally we want to reduce the use of these opaque classes, so hopefully we can eventually refactor them out.

self.name = name
self.node_type = node_type
self.included = True
self._state = ClientSideState()

ClientSideState is rest-API related so should stay in RestNode.

self.log = get_logger(f"controller.child_iface.{name}-child-node")
self.name = name
self.node_type = node_type
self.included = True

... and each node should have a self.included within the class.
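
The layout requested in these two comments could look roughly like the sketch below. It is an illustration only: the `ChildNode` and `GrpcNode` names, the constructor signatures, and the body of `ClientSideState` are assumptions, not the actual drunc classes.

```python
class ClientSideState:
    """Hypothetical placeholder for the REST-specific client-side state."""

    def __init__(self) -> None:
        self.in_error = False


class ChildNode:
    """Assumed base class: every node carries its own inclusion flag."""

    def __init__(self, name: str, node_type: str) -> None:
        self.name = name
        self.node_type = node_type
        self.included = True


class RestNode(ChildNode):
    """REST-backed node: the only place ClientSideState lives."""

    def __init__(self, name: str) -> None:
        super().__init__(name, node_type="rest")
        self._state = ClientSideState()


class GrpcNode(ChildNode):
    """gRPC-backed node: has the inclusion flag, but no ClientSideState."""

    def __init__(self, name: str) -> None:
        super().__init__(name, node_type="grpc")
```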

@jamesturner246
Contributor

Also @PawelPlesniak, I think Aurash has made enough changes to be added to the controller reviewer list 👍

@Aurashk Aurashk force-pushed the aurashk/improve-handling-of-connection-problems-in-fsm branch from 322a69b to 887f27e Compare May 14, 2026 09:03
Co-authored-by: Copilot <copilot@github.com>
@Aurashk Aurashk requested a review from jamesturner246 May 14, 2026 12:00
@Aurashk
Contributor Author

Aurashk commented May 14, 2026

Thanks @jamesturner246, I made the updates to the child node/included members

@Aurashk
Contributor Author

Aurashk commented May 14, 2026

MSQT is passing on HEP but needs full integration tests run

Contributor

@jamesturner246 jamesturner246 left a comment

Looking good, but I think we still need to explicitly set self.included in the general case -- see comments.

execute_along_path=execute_along_path,
execute_on_all_subsequent_children_in_path=execute_on_all_subsequent_children_in_path,
)
self.included = True

Still want include to set self.included = True, no?

execute_along_path=execute_along_path,
execute_on_all_subsequent_children_in_path=execute_on_all_subsequent_children_in_path,
)
self.included = False

Still want exclude to set self.included = False
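
As a rough sketch of what these two comments ask for: the node forwards the request to the remote controller and then records the result in its own flag. `RecordingStub` and `GrpcChildSketch` are hypothetical stand-ins, not the real ControllerStub or gRPCChildNode, and the real calls take arguments that are omitted here.

```python
class RecordingStub:
    """Hypothetical stand-in for a remote controller stub; it only records calls."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def include(self) -> None:
        self.calls.append("include")

    def exclude(self) -> None:
        self.calls.append("exclude")


class GrpcChildSketch:
    """Illustrative child node wrapping a stub (not the real gRPCChildNode)."""

    def __init__(self, name: str, stub: RecordingStub) -> None:
        self.name = name
        self.stub = stub
        self.included = True

    def include(self) -> None:
        self.stub.include()    # propagate to the remote controller first
        self.included = True   # then record the inclusion state locally

    def exclude(self) -> None:
        self.stub.exclude()    # propagate to the remote controller first
        self.included = False  # then record the exclusion locally
```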

@Aurashk
Contributor Author

Aurashk commented May 14, 2026

Looking good, but I think we still need to explicitly set self.included in the general case -- see comments.

Thanks for spotting that, I agree that we should do that for completeness. Although I don't think there is any behavioural difference currently, since gRPCChildNode is a wrapper around ControllerStub and the servicer Controller manages its included state through self.stateful_node.

It's all very confusing because the included state is both a server-side and a client-side state at the same time. If you are talking to a controller child, it manages its own included state on the server side; if you are talking to a RESTAPIChildNode, the included state lives on the client side. I assume this works because the remote applications don't need to know whether they are included: as long as the controller knows, it can propagate commands to included applications and ignore excluded ones.

@jamesturner246
Contributor

jamesturner246 commented May 15, 2026

include is used in e.g. address_target_path, which needs to know which children are included without having to connect to them.
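
To illustrate the point, here is a hedged sketch of selecting targets from a locally cached flag with no network access. `Node`, `connect` and `included_targets` are hypothetical names, and this is not the real address_target_path implementation.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical child node that counts how often it is contacted."""

    name: str
    included: bool = True
    connect_calls: int = 0

    def connect(self) -> None:
        # Stand-in for a real network round trip.
        self.connect_calls += 1


def included_targets(children: list[Node], target: str = "") -> list[Node]:
    """Select children to address using only the locally stored flag.

    An empty target selects every included child; no child is contacted.
    """
    return [c for c in children if c.included and (not target or c.name == target)]
```

Because the flag lives on the client-side node object, the selection works even when an excluded child is unreachable.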

It's a bit weird that inclusion status is saved in StatefulNode, since that class should only be about FSM state. I'd gravitate towards moving inclusion and other meta outside of StatefulNode rather than adding to it, since it's currently a bit opaque, and could do with a refactor. And renaming it. Or better yet using a dedicated FSM package.

Yup. ControllerDriver connects to a Controller, and think of ChildNodes as controller 'drivers', belonging to a Controller, connecting to child Controllers.

Counter question: is there any case where a node needs to know whether it is itself included? I thought the whole point of the include/exclude system was that the command would not even be propagated to that node in the first place. Question aimed at everyone really -- what do you think @PawelPlesniak?
