improvements to controller/fsm disconnection handling #883
jamesturner246
left a comment
Definitely the right call to separate out the error classes. Way less confusing.
Just the things we talked about last week: restore the self.included in Nodes, and ClientSideState being rest/app-related so not belonging to other nodes. Ideally want to reduce the use of these opaque classes, so hopefully we can eventually refactor them out.
    self.name = name
    self.node_type = node_type
    self.included = True
    self._state = ClientSideState()
ClientSideState is rest-API related so should stay in RestNode.
    self.log = get_logger(f"controller.child_iface.{name}-child-node")
    self.name = name
    self.node_type = node_type
    self.included = True
... and each node should have a self.included within the class.
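A minimal sketch of the layout being asked for in the two comments above. The class names ChildNode and GrpcNode, the constructor signatures, and the body of ClientSideState are assumptions for illustration only; the real drunc classes may look different. The point is that every node sets self.included itself, and only the REST node carries a ClientSideState.

```python
class ClientSideState:
    """Stand-in for the REST/app-related state holder (contents are hypothetical)."""

    def __init__(self):
        self.in_error = False  # hypothetical field, not taken from the PR


class ChildNode:
    """Hypothetical base child node: owns its inclusion flag, no client-side state."""

    def __init__(self, name, node_type):
        self.name = name
        self.node_type = node_type
        self.included = True  # every node keeps this within the class


class RestNode(ChildNode):
    """REST-API child: the only node type that keeps a ClientSideState."""

    def __init__(self, name, node_type):
        super().__init__(name, node_type)
        self._state = ClientSideState()


class GrpcNode(ChildNode):
    """Hypothetical gRPC child: inherits self.included, needs no ClientSideState."""
```

With this split, code that only cares about inclusion can read node.included on any node type without touching the opaque state class.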
Also @PawelPlesniak, I think Aurash has made enough changes to be added to the controller reviewer list 👍
make exclude work if disconnected
322a69b to
887f27e
Co-authored-by: Copilot <copilot@github.com>
Thanks @jamesturner246, I made the updates to the child node/included members
MSQT is passing on HEP but needs full integration tests run
jamesturner246
left a comment
Looking good, but I think we still need to explicitly set self.included in the general case -- see comments.
        execute_along_path=execute_along_path,
        execute_on_all_subsequent_children_in_path=execute_on_all_subsequent_children_in_path,
    )
    self.included = True
Still want include to set self.included = True, no?
        execute_along_path=execute_along_path,
        execute_on_all_subsequent_children_in_path=execute_on_all_subsequent_children_in_path,
    )
    self.included = False
Still want exclude to set self.included = False
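A rough sketch of what the two comments above are asking for, under stated assumptions: SketchNode and _propagate are hypothetical stand-ins for the real node class and its propagation call, and only the execute_on_all_subsequent_children_in_path argument is taken from the diff. Whatever the propagation step does, include and exclude finish by setting self.included explicitly.

```python
class SketchNode:
    """Hypothetical node used only to illustrate the explicit flag update."""

    def __init__(self, name, children=None):
        self.name = name
        self.included = True
        self.children = list(children or [])

    def _propagate(self, command, execute_on_all_subsequent_children_in_path=True):
        """Stand-in for the real propagation call (an assumption, not drunc code)."""
        if execute_on_all_subsequent_children_in_path:
            for child in self.children:
                getattr(child, command)(
                    execute_on_all_subsequent_children_in_path=True
                )

    def include(self, execute_on_all_subsequent_children_in_path=True):
        self._propagate("include", execute_on_all_subsequent_children_in_path)
        self.included = True  # set explicitly in the general case

    def exclude(self, execute_on_all_subsequent_children_in_path=True):
        self._propagate("exclude", execute_on_all_subsequent_children_in_path)
        self.included = False  # set explicitly in the general case
```

For example, excluding a root with one leaf marks both as excluded, while re-including the root with execute_on_all_subsequent_children_in_path=False flips only the root's own flag.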
Thanks for spotting that, I agree for completeness we should do that. Although I don't think there is any behavioural difference currently? Since ... It's all very confusing because the included state is both a server-side and client-side state at the same time. If you are talking to a controller child it manages its own ...
It's a bit weird that inclusion status is saved in ...
Yup. Counter question, is there any case where a node needs to know whether it is itself included? I thought the whole point of the include/exclude system was that the command would not even be propagated to that node at all in the first place. Question aimed at everyone really -- what do you think @PawelPlesniak?
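One way to picture the question above, as a sketch only: Parent, Child, propagate and execute are hypothetical names, not the drunc API. If the parent filters on the inclusion flag before sending, an excluded child never receives the command, so the flag only needs to be readable from the parent's side.

```python
class Child:
    """Hypothetical child handle as seen from the parent."""

    def __init__(self, name, included=True):
        self.name = name
        self.included = included
        self.received = []  # commands that actually reached this child

    def execute(self, command):
        self.received.append(command)


class Parent:
    """Hypothetical parent that propagates commands to included children only."""

    def __init__(self, children):
        self.children = children

    def propagate(self, command):
        """Send command to included children and return the names reached."""
        reached = []
        for child in self.children:
            if not child.included:
                continue  # excluded children never see the command
            child.execute(command)
            reached.append(child.name)
        return reached
```

Under this reading, a child-side copy of the flag matters only if the child has to report or act on its own inclusion status, which is the open question in the thread.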
Description
Fixes issue #516
Several improvements and fixes to the controller, particularly disconnection handling.
Type of change
Change log
status command now checks that there is a valid connection to all children, showing them as disconnected without changing the internal state if they can't be reached. Disconnected children will no longer be marked as in error
exclude the disconnected child if you want to carry on without it. If you exclude all disconnected children, fsm commands can go ahead.
status will show not in error, fsm commands are possible) see this comment for an example. This is done by refreshing the new rest endpoint from the connectivity service if a connection couldn't be established, similar to the connection refreshing for gRPC children.
Suggested manual testing checklist
Note: you may need to do status twice, re-connection is asynchronous and doesn't happen instantly
Same as 2, observe that the parent df-controller will be configured
Developer checklist
Prior to marking this as "Ready for Review"
Tests ran on: hep cluster from develop
Unit tests - some tests can't be run on the CI. This is documented. If this PR checks a feature that can't be tested with CI, this has been marked appropriately.
Integration tests - the daqsystemtest_integtest_bundle requires a lot of resources, and connections to the EHN1 infrastructure. Check the cross referenced list if you can't run these. The developer needs to run at least the .
pytest --marker) passed
daqsystemtest_integtest_bundle.sh -k minimal_system_quick_test.py
daqsystemtest_integtest_bundle.sh
Final checklist prior to marking this as "Ready for Review"
Reviewer checklist
drunc are in the log files
drunc failure appears:
Once the features are validated and both the unit and integration tests pass, the PR is ready to be merged.
Choose one of the following and complete all substeps
Prior to merging
Once completed, the reviewer can merge the PR.
Notification message for a Slack channel
Note - this should be to #dunedaq-integration for general workflow that isn't during a release candidate period, and to #daq-release-prep otherwise.
For a single merge that changes the user workflow
For co-ordinated merge