
Feat/abstract radio connection #246

Open
zindello wants to merge 8 commits into pyMC-dev:dev from zindello:feat/abstract-radio-connection

feat: Decouple radio hardware from service startup; add RadioManager

⚠️ Dependency

Requires pyMC-dev/pyMC_core#70 to be merged and released before this can be merged. That PR fixes KissModemWrapper silent thread death and sys.exit() on GPIO failure — both of which this branch depends on for correct KISS disconnect detection and GPIO error recovery. It also fixes a missing endpoint in the sx1262 radio type for updating hardware config on-the-fly.


The Problem

Before this change, the pyMC Repeater had a fundamental usability issue: if radio hardware
was unavailable, misconfigured, or failed, the entire service failed to start. The HTTP
server and web UI came up only if the radio initialised successfully. This meant:

  • A user with a newly flashed device and no radio connected got a completely unreachable
    service — no web UI, no status, no way to diagnose the problem without SSH access.
  • Any radio failure mid-run (power blip, USB disconnect) took down the entire service,
    including the web UI. The user had no visibility and no way to recover without a restart.

The net result: users who hit any hardware problem were forced to SSH into the device and
configure via the CLI. For a device intended to be deployed and managed through its web
interface, this was less than ideal.

What Changed

RadioManager — radio hardware now has its own lifecycle

A new RadioManager class owns the entire radio hardware lifecycle: connect, retry on
failure with exponential backoff (5s → 10s → 30s → 60s), report status, and reconnect
if the radio is lost mid-run. It runs as an asyncio background task and calls back into
the daemon when hardware comes up or goes away.

The HTTP server now starts unconditionally before any radio connection is attempted.
The web UI is always reachable. Radio hardware connects asynchronously in the background.
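
To illustrate the startup ordering described above, here is a minimal sketch using plain asyncio. The function names `start_http_server` and `radio_connect_loop` are hypothetical stand-ins, not the names used in this branch; the only point is that the HTTP side is up before the radio task is even scheduled.

```python
import asyncio


async def start_http_server(events):
    # Hypothetical stand-in for the real HTTP server startup.
    events.append("http_started")


async def radio_connect_loop(events, radio_ready):
    # Hypothetical stand-in for the RadioManager background task.
    await asyncio.sleep(0.01)  # simulate slow hardware initialisation
    events.append("radio_connected")
    radio_ready.set()


async def main():
    events = []
    radio_ready = asyncio.Event()

    # The HTTP server starts first and is never gated on the radio.
    await start_http_server(events)

    # The radio connects in the background as its own task.
    task = asyncio.create_task(radio_connect_loop(events, radio_ready))
    events.append("serving")  # the web UI is reachable while the radio connects

    await radio_ready.wait()
    await task
    return events


startup_events = asyncio.run(main())
```

In this sketch the "serving" step is recorded before the radio reports in, which mirrors the intent that the web UI stays reachable regardless of hardware state.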

Automatic retry with backoff

If radio hardware is unavailable or fails to initialise, RadioManager retries
automatically. The user sees the current status through the web UI and /api/stats
rather than a blank page or a timeout.
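
As a rough illustration of the retry behaviour, the sketch below walks the delay schedule listed earlier (5s, 10s, 30s, 60s). It assumes the delay stays at 60s once the last step is reached, which the description does not state explicitly. `retry_delay` and `connect_with_retry` are hypothetical names, and the injectable `sleep` parameter exists only so the demo runs instantly.

```python
import asyncio

BACKOFF_SCHEDULE = (5, 10, 30, 60)  # seconds, taken from the PR description


def retry_delay(retry_count):
    """Delay before the next attempt (assumed to hold at the final step)."""
    index = min(retry_count, len(BACKOFF_SCHEDULE) - 1)
    return BACKOFF_SCHEDULE[index]


async def connect_with_retry(connect, sleep=asyncio.sleep):
    """Call `connect` until it succeeds, sleeping between failed attempts."""
    retry_count = 0
    while True:
        try:
            return await connect()
        except Exception:
            # CancelledError is a BaseException, so shutdown still propagates.
            await sleep(retry_delay(retry_count))
            retry_count += 1


async def demo():
    delays = []
    attempts = 0

    async def fake_sleep(seconds):
        delays.append(seconds)  # record instead of actually waiting

    async def flaky_connect():
        nonlocal attempts
        attempts += 1
        if attempts <= 5:
            raise OSError("radio not found")
        return "radio-handle"

    radio = await connect_with_retry(flaky_connect, sleep=fake_sleep)
    return radio, delays


demo_radio, demo_delays = asyncio.run(demo())
```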

Mid-run disconnect and reconnect

If the radio is lost while running (e.g. USB hardware failure, power blip on a KISS
modem), RadioManager detects the failure, tears down the dispatcher and all helpers
cleanly, and re-enters the retry loop. The HTTP server, GPS, and sensors keep running
throughout. When hardware comes back, everything reinitialises automatically.

Companion app connections recover from a mid-run radio disconnect: the companion disconnects
when the radio goes away, and can reconnect and resume once the radio comes back.

Clearer error messaging

The "Repeater handler not initialized" error that appeared in the packet log when the
radio wasn't connected has been replaced with "Radio not available — connecting or
hardware unavailable", which tells the user what is actually happening.

Bugs Fixed

KISS radio disconnect not detected

When a KISS modem loses its serial connection, KissModemWrapper._rx_worker and
_tx_worker die silently without setting is_connected = False. A paired fix in
pymc_core now sets is_connected = False on I/O failure. RadioManager._wait_for_disconnect()
polls radio.is_connected every second for radios that expose it, and falls back to
asyncio event-wait for those that don't (SX1262).
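
The two detection paths can be sketched roughly as follows. This is not the code from `RadioManager._wait_for_disconnect()`; the function signature, the `dispatcher_done` event, and the fake radio classes are assumptions made for illustration.

```python
import asyncio


async def wait_for_disconnect(radio, dispatcher_done, poll_interval=1.0):
    """Return once the radio is considered lost."""
    if hasattr(radio, "is_connected"):
        # KISS-style radios: poll the flag the worker threads clear on I/O failure.
        while radio.is_connected:
            await asyncio.sleep(poll_interval)
        return "polled"
    # Radios without the flag: wait for an event signalled elsewhere
    # (hypothetically, when the dispatcher exits).
    await dispatcher_done.wait()
    return "event"


class FakeKissRadio:
    def __init__(self):
        self.is_connected = True


class FakeSpiRadio:
    pass  # deliberately has no is_connected attribute


async def demo():
    loop = asyncio.get_running_loop()
    done = asyncio.Event()

    kiss = FakeKissRadio()
    loop.call_later(0.02, setattr, kiss, "is_connected", False)
    first = await wait_for_disconnect(kiss, done, poll_interval=0.005)

    loop.call_later(0.02, done.set)
    second = await wait_for_disconnect(FakeSpiRadio(), done)
    return first, second


disconnect_paths = asyncio.run(demo())
```

Polling costs one wake-up per interval, which is why the event path is preferable for radios that already signal loss some other way.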

CH341 singleton not reset on reconnect

The CH341Async singleton was not being reset on radio cleanup, causing
sx1262_ch341 radios to fail on reconnect. Fixed in RadioManager._cleanup_radio(). This has not yet been tested, as I do not have CH341 hardware to test with.

local_hash_bytes not initialised in __init__

local_hash and local_identity were initialised to None in __init__ but
local_hash_bytes was not, risking AttributeError before initialize() completes.

Radio hardware leaked on post-init CancelledError

self._current_radio was assigned after the post-init config block. If CancelledError
fired during post-init (e.g. shutdown during startup), _cleanup_radio() found
_current_radio = None and skipped hardware teardown, leaking the radio handle.
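
A simplified sketch of why the assignment order matters is shown below. `ManagerSketch` and `FakeRadio` are hypothetical, and the long sleep stands in for the post-init config block. Because the handle is recorded before the first await, a cancellation during post-init still leaves something for cleanup to release.

```python
import asyncio


class FakeRadio:
    def __init__(self):
        self.cleaned_up = False

    def cleanup(self):
        self.cleaned_up = True


class ManagerSketch:
    def __init__(self):
        self._current_radio = None

    async def _connect_once(self, radio):
        # Record the handle first, so cleanup can find it if we are cancelled.
        self._current_radio = radio
        # Stand-in for the post-init config block; cancellation lands here.
        await asyncio.sleep(3600)

    def _cleanup_radio(self):
        if self._current_radio is None:
            return  # nothing recorded, nothing to tear down
        self._current_radio.cleanup()
        self._current_radio = None


async def demo():
    manager = ManagerSketch()
    radio = FakeRadio()
    task = asyncio.create_task(manager._connect_once(radio))
    await asyncio.sleep(0)  # let the task reach its first await
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        manager._cleanup_radio()
    return radio.cleaned_up


radio_released = asyncio.run(demo())
```

If the assignment were moved below the sleep, `_cleanup_radio()` would return early and the fake radio would never be cleaned up, which is the leak described above.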

Dispatcher None during reconnect cycles

config_manager.live_update_daemon() was accessing dispatcher without a None guard.
After RadioManager was introduced, dispatcher is legitimately None during reconnect
cycles. Added guard to prevent AttributeError.

New API Surface

Adds a radio key to the /api/stats response. Additive only — no existing fields
changed.

This field is not yet consumed by the web UI but is available for future use — dashboard
status indicators, external monitoring, companion app integration.

"radio": {
  "status": "connected",
  "type": "sx1262",
  "error": null,
  "connected_at": 1747291483.925,
  "last_error_at": null,
  "retry_count": 0,
  "retry_delay_seconds": 0
}
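
For anyone wanting to consume the new field, a client might summarise it along these lines. This is only a sketch: `summarize_radio` is a hypothetical helper, and the only status values assumed here are "connected" (from the sample above) and whatever else the service reports, which is passed through unchanged.

```python
import json

# Sample payload copied from the response fragment above.
SAMPLE_STATS = """
{
  "radio": {
    "status": "connected",
    "type": "sx1262",
    "error": null,
    "connected_at": 1747291483.925,
    "last_error_at": null,
    "retry_count": 0,
    "retry_delay_seconds": 0
  }
}
"""


def summarize_radio(stats):
    """Build a one-line summary from a parsed /api/stats response."""
    radio = stats.get("radio")
    if radio is None:
        # Older services will not include the key at all.
        return "radio status not reported"
    if radio["status"] == "connected":
        return f"{radio['type']} connected"
    return (
        f"{radio['status']}: {radio['error']} "
        f"(retry {radio['retry_count']}, next in {radio['retry_delay_seconds']}s)"
    )


summary = summarize_radio(json.loads(SAMPLE_STATS))
```

Treating the key as optional keeps such a client compatible with services that predate this change, in line with the field being additive.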

Joshua Mesilane and others added 8 commits May 15, 2026 10:06
HTTP server and web UI now always start regardless of radio hardware
availability. RadioManager owns the full radio lifecycle — connection,
exponential backoff retry, mid-run reconnect, and hardware cleanup —
so the daemon is no longer blocked on hardware initialisation.

- Add RadioManager: asyncio task with retry loop, status tracking,
  on_connected/on_disconnected callbacks, and /api/stats reporting
- Refactor main.py: initialize() handles identity + sensors + GPS;
  _on_radio_connected() sets up dispatcher and all helpers; HTTP
  server starts before RadioManager so it is always reachable
- Move LocalIdentity setup to initialize() so public key is available
  from boot, not only after first radio connection
- Add engine.py async stop() for awaitable mid-run teardown
- Remove _kiss_transport_restart_required() from config_manager;
  radio_type and KISS config changes now trigger automatic reconnect
  via notify_config_changed() instead of requiring a service restart
- Fix SPI re-init bug: explicitly reset _initialized on SX1262Radio
  singleton after cleanup() so begin() is called on reconnect
- Move CH341 and radio.cleanup() out of _shutdown() into RadioManager

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
"Repeater handler not initialized" was a code-level description that
appeared in the UI and logs whenever storage endpoints were called
before the radio connected. Replaced with "Radio not available —
connecting or hardware unavailable" which reflects the actual state.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pymc_core calls sys.exit() when GPIO hardware is unavailable, raising
SystemExit (BaseException) which bypassed the except Exception handler
in _connect_loop() and killed the process. Now caught alongside
Exception so GPIO failure is treated as a normal connection error —
RadioManager logs it, sets status to error, and retries with backoff.
Required for KISS radio use on hardware without GPIO (e.g. x86 server).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…stemExit workaround

KissModemWrapper now sets is_connected=False when worker threads die on I/O error (fixed
in core), so _wait_for_disconnect() polls that attribute for radios that have it. Radios
without is_connected (SX1262) fall back to the original asyncio event-wait — no polling
needed as disconnect is signalled via the dispatcher exiting.

Removes except (Exception, SystemExit) workaround now that core raises a catchable
exception instead of calling sys.exit() on GPIO hardware failure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…init CancelledError

local_hash_bytes was not initialised to None in __init__ alongside local_hash and
local_identity, risking AttributeError before initialize() completes.

Moves self._current_radio = radio assignment to before the post-init config block so
that _cleanup_radio() correctly finds and cleans up the hardware handle if
CancelledError fires during post-init. Previously _current_radio was still None at
that point, leaking the radio object.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nt cleanup() and configure_radio()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nnect_loop

Consolidates three duplicated backoff state blocks into a single
_enter_backoff() helper, and moves the post-init hasattr block into
_apply_post_init_config(). _connect_loop shrinks from ~90 lines to ~45.
All 20 RadioManager tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- _enter_backoff on disconnect now passes "Radio disconnected" so
  get_status().error is never null/stale after a mid-run radio loss
- notify_config_changed resets _retry_delay alongside _retry_count
  so get_status() is immediately consistent after a config save
- Move get_radio_for_board import out of the while loop (cached by
  Python, but importing inside a loop is unusual)
- ensure_future → create_task in _wait_for_disconnect (modern API)
- _apply_post_init_config docstring now mentions the RF parameter
  logging it performs alongside event-loop and CAD threshold setup
- Add retry_delay_seconds assertion to notify_config_changed test

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>