
Add cooldown to ASR sensor to suppress rapid transcripts #2609

Open
openminddev wants to merge 90 commits into main from add-asr-cooldown

Conversation

@openminddev

Contributor

Introduce a cooldown mechanism to asrSensorCore to suppress follow-up ASR transcripts delivered too quickly. Adds defaultASRCooldown constant, cooldown and lastDeliver fields, and a resolveCooldown helper that disables cooldown when TTS interrupt is enabled. newSensorCore signature updated to accept a cooldown and callers (newASRCommon, NewParallelASR) now pass the default. pushTranscript now skips and logs follow-ups within the cooldown window. Tests added to verify resolveCooldown behavior and the cooldown delivery semantics.

Introduce a new Go workspace for the om1 project: add Makefile, cmd/main, go.mod and go.sum. Implement core internal packages for runtime functionality and plugins: config loader/types, actions (connectors, orchestrator, schema generation and tests), backgrounds (orchestrator and registry), inputs (sensors and orchestrator), llm interfaces, fuser (prompt fusion + KB), hooks runner, http client, and plugin entrypoints. Includes unit tests for action schema generation and action orchestrator tick/stop behavior. Provides plumbing for registering/loading plugins and basic orchestration patterns (concurrent/sequential/dependency modes).
Add a reconnecting WebSocket client and a new GoogleASR input plugin (PortAudio + WS) and include a conversation.json5 example. Refactor the inputs API: Sensor interface changed (Listen/Poll/RawToText/FormattedLatestBuffer/Stop), Message shape renamed, and Orchestrator no longer stores buffers internally (uses sensors' FormattedLatestBuffer). Update Fuser.Fuse to accept sensor buffer slices. Update runtime to pass sensor buffers and simplify LLM orchestration: Orchestrator now wraps 'llm' field and manages history; call sites adjusted. Rename and consolidate speak action plugin to speak/elevenlabs_tts (types and logging names updated) and add actions package entrypoint; remove old move and speak packages. Update telemetry API: IOProvider.RecordTick signature simplified. Add go.mod dependencies for portaudio and gorilla/websocket. Improve JSON5 loader to quote unquoted keys. Note: these changes introduce several breaking API changes (Sensor, Fuser.Fuse, RecordTick, action connector types) that require updates across callers.
Makefile: add automatic download/installation of zenoh-c, set CGO flags and platform-specific DYLD/LD env handling, and propagate the library path to build, run, lint, test, fmt, vet, and dependency targets; expand help text and fix install target.

Providers: add a new atomic Speaking flag (internal/providers/tts_state.go) to indicate when TTS is streaming audio.

ElevenLabs TTS: set providers.Speaking true before synthesis/playback and false after to mark active playback; remove an extra enqueue log line.

Google ASR: skip audio capture while providers.Speaking is set to avoid ASR picking up playback audio.

These changes ensure the zenoh-c runtime library is available during development and CI, and avoid input capture interfering with TTS playback.
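
As a rough illustration of the `Speaking` flag described above, the sketch below gates a capture step on an atomic boolean. Only the name `Speaking` comes from the commit message; the use of `atomic.Bool` and the helpers `playTTS` and `captureFrame` are hypothetical.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Speaking mirrors the flag described for internal/providers/tts_state.go.
// Using atomic.Bool is an assumption about how it is implemented.
var Speaking atomic.Bool

// playTTS marks playback as active for the duration of play (hypothetical).
func playTTS(play func()) {
	Speaking.Store(true)
	defer Speaking.Store(false)
	play()
}

// captureFrame is a hypothetical capture step that drops audio while TTS plays.
func captureFrame(frame []byte, sink *[][]byte) bool {
	if Speaking.Load() {
		return false
	}
	*sink = append(*sink, frame)
	return true
}

func main() {
	var captured [][]byte
	fmt.Println(captureFrame([]byte{1}, &captured))
	playTTS(func() {
		fmt.Println(captureFrame([]byte{2}, &captured))
	})
	fmt.Println(captureFrame([]byte{3}, &captured), len(captured))
}
```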
Add an entry to .gitignore to exclude the Zenoh runtime/cache directory (go/.zenoh-c/) and a comment marker. This prevents local Zenoh state from being accidentally committed.
Expose and refactor internal plugin/schema APIs, tighten logging, and remove unused test code. Key changes:
- Actions: clarify AgentAction fields, introduce Factory/Register/Load patterns for connectors.
- Schema: export InterfaceSpec, InterfaceRegistry and schema helpers (BuildSchema, BuildPropertySchema, KindToJSONType) and wire BuildSchemaForAction to use them.
- Backgrounds: add Factory registry, Register and Load helpers and an UnknownPluginError type.
- Providers: remove the HistoryManager/related LLM history code and unused imports.
- LLM/Runtime/Inputs/Plugins: remove debug prints, adjust buffering and minor formatting/locking fixes.
- Google ASR: add Time to ASRMessage, reduce stats ticker interval (30s→15s), improve latency logging, and tidy audio packaging/stream handling.
- Deleted obsolete tests (actions/orchestrator_test.go, actions/schema_test.go).
These changes prepare the codebase for external use of schema/registry helpers, improve observability, and clean up dead code.
Introduce a global logger package and replace ad-hoc zap creation across plugins: internal/logger provides Set/Get for a shared *zap.Logger; main sets the global logger and passes the logger into runtime. Add a CGo zenoh wrapper (internal/zenoh/session.go) that exposes Open/Put/Close to publish raw bytes via zenoh-c (requires zenoh-c headers/libs). Add a new arm_g1 Zenoh action plugin (plugins/actions/arm_g1_zenoh) that registers the arm_g1 interface and publishes Unitree G1 arm requests to a Zenoh topic, including a CDR little-endian serializer for Unitree requests. Wire the new plugin by importing it in plugins/actions/actions.go. Update existing plugins (emotion, speak/elevenlabs_tts, inputs/google_asr) to use logger.Get() instead of creating new zap instances and add an info log when enqueueing TTS text.
Change config path references in go/Makefile from ../config/*.json5 to ./config/*.json5 for the run, dev, and list-configs targets so configuration files are resolved relative to the go directory when invoking these commands.
Replace the old cgo-based zenoh-c wrapper with the pure-Go github.com/eclipse-zenoh/zenoh-go client. Implement a new Session using zenoh-go (Open with optional endpoint, Put, Close) and add a Publish helper. Simplify the arm_g1/zenoh connector to use the new session API (no config parsing, rely on default), and add debug log-level to the Makefile dev run. go.mod was updated to include the zenoh-go dependency and related indirect changes.
ws: add context with cancel to Client, use DialContext so in-progress TLS handshakes are interrupted on Close, and call cancel() from Close. Make Close idempotent for stopCh. Improve read/write loops: early-return on stopCh, log read loop stop, treat normal close as non-error, and fix conn mutex handling in writeMessage with deferred unlock. google_asr: Fix Stop() to avoid calling Stream methods while holding the sensor mutex by capturing and nil-ing paStream under lock, unlocking, then stopping/closing the stream. These changes improve shutdown correctness and avoid deadlocks during teardown.
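
A minimal sketch of the idempotent `Close` described above, assuming a `sync.Once` guards the `stopCh` close (the commit does not say which mechanism is used). The `Client` type here is a stripped-down stand-in, not the real ws client.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// Client is a hypothetical, stripped-down stand-in for the ws client.
type Client struct {
	ctx       context.Context
	cancel    context.CancelFunc
	stopCh    chan struct{}
	closeOnce sync.Once
}

// NewClient creates the cancellable context that dialing would use.
func NewClient() *Client {
	ctx, cancel := context.WithCancel(context.Background())
	return &Client{ctx: ctx, cancel: cancel, stopCh: make(chan struct{})}
}

// Close cancels the context, which would interrupt any in-flight dial that
// was started with it, and closes stopCh exactly once so repeated calls are safe.
func (c *Client) Close() {
	c.cancel()
	c.closeOnce.Do(func() { close(c.stopCh) })
}

func main() {
	c := NewClient()
	c.Close()
	c.Close() // a second call must not panic on an already-closed channel
	<-c.stopCh
	fmt.Println(c.ctx.Err())
}
```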
Add a new Unitree G1 conversation JSON5 config (go/config/unitree_g1_conversation.json5) providing a greeting mode, agent settings, actions and connectors. Fix stripJSON5 in go/internal/config/loader.go to remove backslash-newline continuations before splitting and preserve trailing-comma cleanup. Update plugin imports and package: change actions import from arm_g1_zenoh to arm_g1 and rename file/package go/plugins/actions/arm_g1_zenoh/... to go/plugins/actions/arm_g1/zenoh.go with package name adjusted accordingly.
Introduce a Publisher wrapper in the zenoh session package with DeclarePublisher/Put/Drop and adjust Session.Close to call Close(nil). Update arm_g1 connector to open a zenoh session with optional endpoint, declare a publisher for sport requests, use the publisher for puts, log/handle failures gracefully, and drop the publisher on Stop. Fix CDR alignment, buffer sizing and padding in serializeUnitreeRequest and standardize the JSON parameter formatting to match the Python connector. Convert Tick implementations (emotion, arm_g1, elevenlabs TTS) to block on ctx.Done() and simplify Stop to properly cleanup publishers and sessions.
Introduce CDR serialization helpers and integrate Zenoh publishing for ASR, plus add WS reconnect behavior.

- Add internal/zenoh/cdr.go with AppendInt32LE, AppendUint32LE, AppendInt64LE and AppendCDRString helpers for little-endian CDR encoding.
- Use the new zenoh helpers in plugins/actions/arm_g1 to replace local byte-append helpers.
- Integrate Zenoh into GoogleASR: add ZenohEndpoint config, open a session and declare a publisher during sensor init, publish serialized ASR text (serializeASRText) on buffer flush, and clean up publisher/session on stop. serializeASRText builds a CDR LE payload including a timestamp, UUID frame_id, and the text.
- Update internal/ws client Connect to support automatic reconnect when cfg.Reconnect is true, retrying every 5s and honoring context cancellation.
- Add usage of github.com/google/uuid in ASR serialization.
Replace the old log-based emotion action with a Zenoh-backed connector. Configs (conversation.json5 and unitree_g1_conversation.json5) now reference connector "zenoh" for the emotion action. The previous emotion/log connector implementation was removed and a new emotion/zenoh connector was added which opens a Zenoh session, declares a publisher on topic "om/avatar/request", and publishes serialized AvatarFaceRequest payloads (CDR little-endian encoding with timestamp, request ID, code, and face_text). The new connector gracefully handles Zenoh unavailability, logs publish results, and cleans up publisher/session on Stop.
Replace zenohsession.AppendCDRString with manual serialization: append a NUL (0x00) to faceText, write the byte length as a little-endian uint32, then append the bytes. This ensures the CDR string is encoded with an explicit length and null terminator for compatibility with the Zenoh consumer.
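
The manual encoding described above (NUL terminator, little-endian uint32 length, then the bytes) can be sketched as follows. `appendCDRString` is a hypothetical helper name used only for this example, and alignment padding is deliberately left out.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// appendCDRString appends s as a length-prefixed string: a little-endian
// uint32 byte count that includes the trailing NUL, followed by the bytes
// and the 0x00 terminator. Hypothetical helper; no alignment padding.
func appendCDRString(buf []byte, s string) []byte {
	data := append([]byte(s), 0x00)
	buf = binary.LittleEndian.AppendUint32(buf, uint32(len(data)))
	return append(buf, data...)
}

func main() {
	fmt.Printf("% x\n", appendCDRString(nil, "hi")) // 03 00 00 00 68 69 00
}
```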
Introduce an Emotion enum and EmotionInput struct for the emotion action, including EnumValues enumerating supported expressions (happy, confused, curious, excited, sad, think). Register the emotion interface with a descriptive message and register the Zenoh connector. Rename and export constructor from newZenohConnector to NewZenohConnector and update its registration. Also remove redundant comment lines in arm_g1/zenoh.go.
Adjust serializeAvatarRequest to follow CDR alignment rules: align before fields (no trailing padding after request_id since next field is int8) and insert padding before the face_text uint32 length. Implement request_id encoding without post-padding (write length + bytes explicitly), increase buffer capacity, and clarify comments about wire layout. Also remove a redundant comment in arm_g1 zenoh publisher code.
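
To illustrate the "align before the field" rule described above, the sketch below writes an `int8` followed by a `uint32`, padding only before the 4-byte field. It assumes offsets are measured from the start of the buffer and omits any encapsulation header; `alignTo` and `serializeCodeAndLength` are hypothetical helpers, not the functions in the PR.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// alignTo pads buf with zero bytes until its length is a multiple of n.
func alignTo(buf []byte, n int) []byte {
	for len(buf)%n != 0 {
		buf = append(buf, 0)
	}
	return buf
}

// serializeCodeAndLength writes an int8 followed by a uint32, the pattern the
// commit describes for the code field followed by the face_text length.
func serializeCodeAndLength(code int8, length uint32) []byte {
	buf := make([]byte, 0, 8)
	buf = append(buf, byte(code)) // a 1-byte field needs no alignment
	buf = alignTo(buf, 4)         // pad so the uint32 starts on a 4-byte boundary
	return binary.LittleEndian.AppendUint32(buf, length)
}

func main() {
	fmt.Printf("% x\n", serializeCodeAndLength(1, 3)) // 01 00 00 00 03 00 00 00
}
```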
Introduce a new AvatarProvider (go/internal/providers/avatar.go) that manages zenoh publishers/subscribers for om/avatar/request and om/avatar/response, handles CDR (pycdr2-compatible) serialization/deserialization for avatar commands and health checks, and exposes SendAvatarCommand. Extend zenoh session (go/internal/zenoh/session.go) with Subscriber type, DeclareSubscriber and Drop to support incoming messages. Update emotion plugin (go/plugins/actions/emotion/zenoh.go) to use the AvatarProvider singleton instead of managing its own zenoh session/publisher and remove duplicate serialization logic; Stop() is simplified accordingly. The provider handles unavailable zenoh sessions gracefully and replies to STATUS requests with health responses.
Add a descriptive comment for NewAvatarProvider, simplify the handleRequest comment, and remove debug log statements for health-check requests/responses to reduce log noise. Minor whitespace cleanup; no functional changes to request handling or publishing behavior.
Introduce a centralized ElevenLabs TTS provider and lifecycle hook support across the runtime.

- Add go/internal/providers/elevenlabs.go: a singleton ElevenLabsProvider that handles HTTP synthesis, persistent ffplay streaming, queueing, silence handling and lifecycle for TTS playback.
- Extend hooks runner (go/internal/hooks/hooks.go): support templated variables, improved command execution with captured stdout/stderr, message handler that forwards messages to the ElevenLabs provider, helper funcs for template formatting and safe string extraction.
- Wire lifecycle hooks into runtime and mode manager (go/internal/runtime/*): global hooks are created and invoked for OnStartup, OnEntry and OnExit with context payloads; time-based transitions fire OnTimeout hooks; Transition now includes reason and transition context.
- Refactor speak action (go/plugins/actions/speak/elevenlabs_tts.go): remove duplicated ffplay/http logic and use the new providers.ElevenLabs singleton, simplifying connector and lifecycle.
- Improve WebSocket client resilience (go/internal/ws/client.go): better logging, non-blocking read/reconnect behavior and a reconnect helper to re-dial when enabled.
- Add example lifecycle_hooks to config (go/config/unitree_g1_conversation.json5) demonstrating on_startup message handler using elevenlabs.

These changes centralize TTS handling, reduce duplicated code, and add lifecycle hook templating and global hook handling to support safe automated startup/transition messages.
Export and document several constructors and runtime types, and add brief lifecycle comments across runtime and plugin code. Renamed newModeSetup to NewModeSetup and updated runtime to use the exported constructor; added docs for loadComponents, toRuntimeConfig, buildMeta, addMeta, collectSchemas, mergePrompt, and toolCallsToMaps. Introduced ModeState/ModeManager comments and a NewModeManager constructor doc. In plugin connectors: export and rename connector constructors (NewArmG1ZenohConnector, NewEmotionZenohConnector, NewElevenLabsTTS), add Tick/Stop/no-op lifecycle comments, and simplify arm_g1 connector by removing the customActionMap and sending actions directly. Also removed an extraneous package doc comment in ws/client.go. These changes improve API visibility and add documentation for maintainability; note the behavioral change in arm_g1/zenoh where actions are no longer remapped.
Introduce clearer constructors, helper functions, and plugin registries across packages and update callers accordingly. Key changes: move logger builder to internal/logger (BuildLogger) and use it from main; rename fuser.New -> NewFuser and hooks.New -> NewHooks and update runtime callers; add doc comments and small API helpers for actions/inputs/backgrounds/llm (Call/Result types, registry Factory types, Register/Load helpers, UnknownPluginError.Error implementations); add orchestration helper methods (runConcurrent/runSequential/runWithDeps, SetSchemas/FunctionSchemas/Reset) and small runtime adjustments. These changes improve naming consistency, visibility of helpers, plugin loading ergonomics, and overall code readability without changing core behavior.
Introduce utility functions and logging to the LLM plugin common code: add parseOpenAIResponse, buildMessages, remarshal helper, and logResponseLatency to capture latency and proxy/upstream headers. Import net/http, time, logger and zap for these features. Export and rename Gemini constructor to NewGemini, update llm registration accordingly, export FunctionSchemas and SetSchemas, and wire request timing + logResponseLatency into Gemini's Call to record response metrics. These changes improve observability and provide small API/visibility refactors for the Gemini LLM integration.
Introduce context-aware mode transitions and a safe PortAudio reference counter.

- Config: change default mode to "conversation", add an "approaching" mode and a set of transition_rules to unitree_g1_conversation.json5; add go/config/memory to .gitignore.
- Types: extend TransitionRule with ContextConditions to support context-aware transitions.
- ModeManager: subscribe to a Zenoh topic (om/mode/context) for best-effort user context updates, store userContext, add CheckTransitions with ordered checks (time-based, context-aware, input-triggered), and implement helpers to evaluate context conditions and priorities. Add Close and UpdateUserContext helpers.
- Runtime: clone per-mode LLM config before adding metadata and error if no LLM configured; tidy lifecycle/orchestrator comments and ensure manager.Close() on shutdown.
- Audio: add providers/portaudio.go implementing a process-wide reference-counted PortAudio wrapper; integrate it into GoogleASRInput (Acquire/Release, captureDone coordination, safer stream stop/close) to avoid terminating PortAudio while others still use it.
- Misc: add github.com/google/uuid to go.mod; small comment and code cleanups across files.

These changes enable context-driven transitions, prevent concurrent PortAudio termination races, and avoid mutating shared LLM configuration maps.
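
The reference-counted wrapper from the Audio item above might follow the shape below. The real code presumably wraps the PortAudio library's initialize and terminate calls; here they are injected as plain functions so the sketch runs without native dependencies. The `refCounted` type and its field names are assumptions; only `Acquire` and `Release` are named in the commit message.

```go
package main

import (
	"fmt"
	"sync"
)

// refCounted initializes a shared resource on the first Acquire and tears it
// down on the last Release. Hypothetical stand-in for providers/portaudio.go.
type refCounted struct {
	mu        sync.Mutex
	refs      int
	initFn    func() error
	terminate func() error
}

// Acquire initializes the resource if nobody holds it yet.
func (r *refCounted) Acquire() error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.refs == 0 {
		if err := r.initFn(); err != nil {
			return err
		}
	}
	r.refs++
	return nil
}

// Release terminates the resource only when the last holder lets go.
func (r *refCounted) Release() error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.refs == 0 {
		return nil // unbalanced Release is ignored in this sketch
	}
	r.refs--
	if r.refs == 0 {
		return r.terminate()
	}
	return nil
}

func main() {
	inits, terms := 0, 0
	audio := &refCounted{
		initFn:    func() error { inits++; return nil },
		terminate: func() error { terms++; return nil },
	}
	_ = audio.Acquire()
	_ = audio.Acquire()
	_ = audio.Release()
	fmt.Println(inits, terms) // 1 0
	_ = audio.Release()
	fmt.Println(inits, terms) // 1 1
}
```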
Introduce greeting conversation state machine, IO provider, and related plumbing for TTS/status publishing and background detection. Adds a ConfidenceCalculator-driven GreetingConversationStateMachineProvider and IOProvider singletons, ElevenLabs greeting connector, TTS text normalization and CDR status serialization, and ApproachingPerson background task. Replaces the old slim providers.go, adds util.ToFloat/FloatFrom and updates manager to use it, increments tick counter in runtime, and registers the new plugins in main/actions. Also includes a small, likely accidental edit to google_asr.go.
Change the agent action in unitree_g1_conversation config from "speak" to "greeting_conversation" (name and llm_label) and switch the connector from "elevenlabs_tts" to "greeting_conversation_elevenlabs" to match the plugin. Also remove an extraneous debug log line in greeting_conversation_elevenlabs.Tick to reduce noisy logging.
Introduce a FacePresence provider and sensor: new providers/face_presence.go and inputs/face_presence.go implement fetching /who, snapshot shaping, and a sensor that polls and formats presence lines. Add util.Sleep to support context-aware sleeps. Refactor zenoh session handling: make Open use defaults and local-network discovery (OpenWithOptions/openClient/openDiscovery), add a default endpoint constant, and improve logging/error messages. Update dependent code to the new APIs: AvatarProvider no longer takes an endpoint and callers use providers.Avatar(), various plugins (arm_g1, emotion, greeting_conversation, google_asr, approaching_person) now call zenoh.Open() without endpoint and use util.Sleep where appropriate. Remove an unused local sleep helper and drop ZenohEndpoint config field from greeting_conversation. Overall: new face-presence feature plus cleanup and more robust zenoh session management.
Introduce a ModeContextProvider singleton as an in-process, best-effort bus for user-context updates that drive context-aware mode transitions. Replace previous Zenoh-based context propagation: ModeManager no longer subscribes to the Zenoh topic and its Close() is simplified; runtime now consumes providers.ModeContext().Updates() and uses a new scheduleTransition helper to enqueue mode changes. Update plugins to publish context updates via providers.ModeContext() (ApproachingPerson and greeting_conversation) and adjust imports/cleanup accordingly. Also add ApproachingPerson to the conversation config and adjust transition rules to target conversation mode; remove a reconnect log line from the websocket client.
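
A best-effort, in-process bus like the one described above is commonly built on a buffered channel with a non-blocking send. The sketch below assumes that design; `ModeContextBus`, `ContextUpdate` and `Publish` are hypothetical names, and only `Updates()` is taken from the commit message.

```go
package main

import "fmt"

// ContextUpdate is a hypothetical payload type for user-context updates.
type ContextUpdate map[string]any

// ModeContextBus is a hypothetical stand-in for ModeContextProvider.
type ModeContextBus struct {
	updates chan ContextUpdate
}

// NewModeContextBus creates a bus that buffers up to size pending updates.
func NewModeContextBus(size int) *ModeContextBus {
	return &ModeContextBus{updates: make(chan ContextUpdate, size)}
}

// Publish never blocks: if the buffer is full the update is dropped and
// false is returned, which is what makes delivery best-effort.
func (b *ModeContextBus) Publish(u ContextUpdate) bool {
	select {
	case b.updates <- u:
		return true
	default:
		return false
	}
}

// Updates exposes the receive side for the runtime loop to consume.
func (b *ModeContextBus) Updates() <-chan ContextUpdate { return b.updates }

func main() {
	bus := NewModeContextBus(1)
	fmt.Println(bus.Publish(ContextUpdate{"approaching_detected": true}))
	fmt.Println(bus.Publish(ContextUpdate{"approaching_detected": false}))
	fmt.Println(<-bus.Updates())
}
```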
Large repo reorganization: moved the Go project from the go/ subdirectory into the repository root (Makefile, cmd/, internal/, plugins/, go.mod/sum, etc.), preserving file contents. Removed legacy Python src/ tree, many config JSON5 files and tests, and deleted .gitmodules. Updated .pre-commit-config.yaml and .github/copilot-instructions.md as part of the cleanup. This simplifies the repository layout and centralizes the Go application.
Replace Python-centric CI and runtime with a Go-focused build pipeline and zenoh-c integration. GitHub Actions workflows updated to setup-go, cache/download zenoh-c, run make build/test/lint (go vet, golangci-lint, go test) and verify the produced binary; many Python/uv/CycloneDDS steps removed. Dockerfile switched to a multi-stage Go builder image that builds the om1 binary, bundles the zenoh-c shared libs, and provides a slimmer runtime image with a simplified entrypoint. Makefile BUILD_DIR path fixed and docker-compose env/command switched to OM1_CONFIG with updated defaults. .gitignore cleaned to remove Python artifacts and track .zenoh-c and build outputs.
Align OpenWithOptions behavior with the updated LocalNetwork comment: when LocalNetwork is true, try connecting to the local router (openClient) and fall back to discovery; when false, try discovery first and fall back to client connect. Also updated the LocalNetwork doc comment and related warning log messages to reflect the corrected semantics.
openminddev and others added 23 commits June 4, 2026 14:39
Change default robot name from Pam to Iris and streamline identity text. Expand allowed physical actions and add detailed behavior guidelines/gesture mappings to improve interaction (e.g., shake_hand, face_wave, salute, heart, shrug, come_closer, rotate_hand, flexible, speak_action(_extended), etc.). Add a new agent input entry (arm_g1, llm_label: robot_action, connector: zenoh). Add "model: \"long\"" to the TTS configuration. Also simplify some conversation prompt content and remove prior event/company-specific guidance.
* update readme for Golang migration - initial commit

* update contribution.md

* update config.md

* update input.md

* update introduction.md

* update new_mode.md for Golang migration

* update example docs files for Golang migration

* update intro and get started docs for Golang migration

* update config and input docs for Golang migration

* update llm and action docs for Golang migration

* update project structure docs for Golang migration

* update troubleshooting guide for Golang migration

* update remaining docs for Golang migration

* updated docs

* updated make commands and steps in readme and getting started
Update ASR unit tests to expect a three-word transcript "hello there world" instead of the previous two-word string. In plugins/inputs/elevenlabs_asr_test.go: add a "three english words" case (accepted) and change the "two english words" case to be rejected; update the committed message assertion to match the new three-word transcript. In plugins/inputs/google_asr_test.go: update the expected parsed reply to "hello there world" and ensure speech timing state is validated accordingly. These changes align test expectations with updated transcript handling logic.
Add a new .github/workflows/binary-release.yml workflow to build and publish OM1 binaries. The workflow supports workflow_dispatch inputs (version, publish) and automatic behavior for tag pushes and nightly builds, sets ZENOH_C_VERSION, caches zenoh-c, and builds for linux-amd64, linux-arm64, darwin-arm64, darwin-amd64 and windows-amd64. Artifacts are packaged (tar.gz on Unix, zip on Windows) with bundled zenoh-c libraries and adjusted rpaths (patchelf / install_name_tool), checksums are generated, and artifacts are uploaded. When publish is enabled the job flattens artifacts, manages a nightly tag, and creates/updates a GitHub Release with the built files.
Change the GitHub Actions binary-release workflow to trigger on pushes to the 'go' branch instead of 'main'. Tag-based triggers and manual workflow_dispatch remain unchanged. Aligns the release workflow with the repository's branch naming.
Update .github/workflows/binary-release.yml to use the macos-15-intel runner for the darwin-amd64 job (was macos-13). Remove the entire Windows build job (build-windows) including MSYS2 setup, zenoh-c download, build and packaging steps, and adjust the release job dependencies to no longer require build-windows.
Change the release name used by softprops/action-gh-release for nightly builds from 'Nightly (latest main)' to 'Development Build' in .github/workflows/binary-release.yml. No other workflow behavior was modified.
Update README.md for clarity and readability: convert the plain Note line into a markdown admonition ([!NOTE]) for consistent styling, and replace repeated plain-text MIT License mentions with links to the project's LICENSE file (./LICENSE) to make it easier for readers to view the full license.
Replace a verbose, repetitive paragraph describing the MIT License with a concise single-line reference to the LICENSE file in README.md to reduce verbosity and duplication.
Set job environment name and url based on the release version (nightly -> development, otherwise production) and wire the release URL to the environment. Add an id to the GitHub release step so its outputs can be referenced. Append a new step that writes a formatted release summary to $GITHUB_STEP_SUMMARY including release URL, channel, commit SHA and listed release assets.
Update .github/workflows/binary-release.yml to set environment.name to 'staging' when needs.setup.outputs.version == 'nightly' (previously 'development'), falling back to 'production' otherwise. This routes nightly release artifacts to the staging environment.
* Migrate Riva to Golang

* Add parallel ASR

* Remove testing functions

* Fix merge conflicts

* Shorten comments

* Run make fmt

* Optimize ASR folder structure

* Rename ASR model to provider and refactor aggregator to sensor core (#2606)

* Rename ASR model->provider and refactor aggregator

Replace the 'Model' identifier with 'Provider' across ASR configs, streams, metrics, logs, and tests. Refactor asrAggregator into asrSensorCore (renaming constructor/newAggregator to newSensorCore and updating receivers and methods). Update transcriberStream to use provider and adjust onTranscript signature, metric labels, logging keys, and parallel-ASR dedup state (lastModel -> lastProvider). Also include minor comment and formatting tweaks.

* Add VSCode Go settings and stop ignoring .vscode

Remove .vscode from .gitignore and add .vscode/settings.json to commit VS Code Go configuration. The new settings configure go test flags (-p 8, -v), set CGO include/lib paths and runtime library paths to the local .zenoh-c directory, and enable the "integration" build tag so VS Code can build and run integration tests that rely on the native zenoh-c library.

---------

Co-authored-by: openminddev <147775420+openminddev@users.noreply.github.com>
* Migrate Riva to Golang

* Add parallel ASR

* Remove testing functions

* Fix merge conflicts

* Shorten comments

* Run make fmt

* Optimize ASR folder structure

* Rename ASR model->provider and refactor aggregator

Replace the 'Model' identifier with 'Provider' across ASR configs, streams, metrics, logs, and tests. Refactor asrAggregator into asrSensorCore (renaming constructor/newAggregator to newSensorCore and updating receivers and methods). Update transcriberStream to use provider and adjust onTranscript signature, metric labels, logging keys, and parallel-ASR dedup state (lastModel -> lastProvider). Also include minor comment and formatting tweaks.

* Add VSCode Go settings and stop ignoring .vscode

Remove .vscode from .gitignore and add .vscode/settings.json to commit VS Code Go configuration. The new settings configure go test flags (-p 8, -v), set CGO include/lib paths and runtime library paths to the local .zenoh-c directory, and enable the "integration" build tag so VS Code can build and run integration tests that rely on the native zenoh-c library.

* Add Go vs Python feature comparison to README

Introduce a new "Go vs. Python Feature Comparison" section in README.md that documents current parity between the Go and Python runtimes. Adds a capabilities table showing which features are available or under development in Go (hardware connectors, VLMs, sensors, messaging, simulators, full autonomy, etc.), plus a note recommending the Python runtime for features still marked as under development and links to the Python runtime and contributing guidance.

---------

Co-authored-by: Shicai He <94800998+shicaih@users.noreply.github.com>
Introduce a new internal/providers/vlm package that implements video capture and utilities.

Adds Frame (with custom JSON/base64 marshaling), streamBase (lifecycle, buffering, drop counting), and helpers splitJPEGStream and jpegQScale. Implements VideoStream (camera capture via ffmpeg) and VideoRTSPStream (RTSP capture with reconnect logic and ffmpeg arg builders). Adds video device enumeration for Linux and macOS (avfoundation) and unit tests covering JPEG splitting, qscale mapping, Frame JSON, stream lifecycle, and defaults.

This enables capturing MJPEG frames from local devices or RTSP sources and provides safe, buffered delivery to consumers.
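
One plausible way to split a concatenated MJPEG byte stream is to scan for the JPEG start-of-image (FF D8) and end-of-image (FF D9) markers. The sketch below assumes that approach and reuses the name `splitJPEGStream` from the commit message, but its signature and edge-case handling are guesses, and it ignores embedded thumbnails that carry their own markers.

```go
package main

import (
	"bytes"
	"fmt"
)

var (
	soi = []byte{0xFF, 0xD8} // JPEG start-of-image marker
	eoi = []byte{0xFF, 0xD9} // JPEG end-of-image marker
)

// splitJPEGStream returns every complete frame found in buf plus the
// unconsumed tail, which the caller prepends to the next read.
func splitJPEGStream(buf []byte) (frames [][]byte, rest []byte) {
	for {
		start := bytes.Index(buf, soi)
		if start < 0 {
			// Keep a trailing 0xFF: it may be the first half of an SOI marker.
			if n := len(buf); n > 0 && buf[n-1] == 0xFF {
				return frames, buf[n-1:]
			}
			return frames, nil
		}
		end := bytes.Index(buf[start+2:], eoi)
		if end < 0 {
			return frames, buf[start:] // incomplete frame, wait for more data
		}
		end += start + 4 // absolute offset just past the EOI marker
		frames = append(frames, buf[start:end])
		buf = buf[end:]
	}
}

func main() {
	data := []byte("xx\xff\xd8A\xff\xd9\xff\xd8B")
	frames, rest := splitJPEGStream(data)
	fmt.Println(len(frames), len(rest)) // 1 3
}
```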
Introduce a Visual Language Model (VLM) input plugin (camera + RTSP) with a vision client, sensor implementation, Gemini defaults and unit tests; register the plugin in inputs. Add Prometheus VLM metrics and a generic RecordResponseLatency helper in internal/metrics and switch existing OpenAI-compatible LLM providers to use it (removing duplicated latency/logging code). Improve video capture handling: run-restart loop for VideoStream, lower default JPEG quality, add camera retry delay and util.Sleep usage, and minor RTSP stream cleanup. Update README and conversation config to surface VLMGemini support.
Clarify Visual Language Models (VLM) support in the README: update the table note to indicate OpenAI and Gemini VLMs are supported (removing the previous note about lack of Go support). This aligns the documentation with current capabilities.
Introduce an internal VLM describer (internal/providers/vlm) to call vision chat-completions with optional image attachment and record metrics. Add a singleton LatestFrame provider (internal/providers/latest_frame.go) with tests to store and retrieve the most-recent JPEG frame and a freshness check. Integrate vision-based greeting into the greeting hook: attempt a vision describe using the latest frame (falling back to the text-only LLM), and add related defaults and helpers. Add util.FirstNonEmpty and its tests, refactor the vlm input to use the new describer and to populate the LatestFrame, and remove the old vlm client implementation.
Remove the DecodeFormat field and its default from the VideoRTSPStream implementation and tests (internal/providers/vlm). Update VideoRTSPStreamConfig and NewVideoRTSPStream to no longer handle decode format. Add a plugin-level default RTSP URL and ensure NewRTSPSensor fills cfg.RTSPURL when empty (plugins/inputs/vlm), and adjust the constructor call to match the simplified config. Update tests to drop the DecodeFormat assertion.
Add an atomic pending counter and Busy() helper so Busy reflects queued-but-unplayed speech. Increment pending when requests are enqueued and decrement when handled. Refactor player loop into handleRequest to centralize synthesis/playback, ensure Speaking is set/cleared reliably, handle pre-roll silence, errors and interrupts, and require ffplay availability. In greeting_conversation: introduce ttsPollInterval, use tts.Busy() instead of fragile flags, remove pendingFinishedUpdate, add switchWhenTTSDone to wait (with timeout) for TTS to drain before switching modes, and simplify waitingOnTTS logic. These changes prevent mode switches while queued TTS remains and add a timeout to avoid stalling.
Add a new GreetingStatus input sensor and expose final-turn guidance from the greeting state machine.

- config/greeting_conversation.json5: register the new GreetingStatus input in the conversation config.
- internal/providers/greeting_conversation_state.go: add finalTurnGuidance constant and an EndingGuidance() method (mutex-protected) that returns guidance when the conversation is concluding or about to hit max turns.
- plugins/inputs/greeting_status.go: new sensor that registers as "GreetingStatus", retrieves the GreetingConversationStateMachineProvider, and exposes the EndingGuidance via FormattedLatestBuffer(). Other sensor methods are present as no-ops/stand-ins.

This enables the runtime to surface a short LLM guidance for the final exchange so the assistant can produce a brief, warm goodbye and mark the conversation as finished.
Register VLMGeminiRTSP in the greeting conversation config (added to two component lists). Adjust EndingGuidance in the state machine: update the comment, remove special-case checks for finished/concluding states, and change the trigger condition to only return finalTurnGuidance when turnCount+1 > maxTurnCount (tightens the ending logic/addresses an off-by-one and redundant state handling).
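
The tightened trigger condition can be illustrated with a small sketch. `EndingGuidance`, `finalTurnGuidance`, `turnCount` and `maxTurnCount` are named in the commit messages; the guidance text and the `greetingState` type are placeholders.

```go
package main

import (
	"fmt"
	"sync"
)

// finalTurnGuidance is named in the PR; this wording is a placeholder.
const finalTurnGuidance = "This is the final exchange: say a brief, warm goodbye."

// greetingState is a hypothetical stand-in for the greeting state machine.
type greetingState struct {
	mu           sync.Mutex
	turnCount    int
	maxTurnCount int
}

// EndingGuidance returns guidance only when the next turn would exceed the
// configured maximum, matching the condition turnCount+1 > maxTurnCount.
func (s *greetingState) EndingGuidance() string {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.turnCount+1 > s.maxTurnCount {
		return finalTurnGuidance
	}
	return ""
}

func main() {
	s := &greetingState{maxTurnCount: 3}
	for s.turnCount = 0; s.turnCount <= 3; s.turnCount++ {
		fmt.Println(s.turnCount, s.EndingGuidance() != "")
	}
}
```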
Delete rubail, samantha, and david persona ID entries from config/greeting_conversation.json5. This cleans up the persona mapping in the greeting conversation configuration by removing these (presumably obsolete or unused) entries.
Introduce a cooldown mechanism to asrSensorCore to suppress follow-up ASR transcripts delivered too quickly. Adds defaultASRCooldown constant, cooldown and lastDeliver fields, and a resolveCooldown helper that disables cooldown when TTS interrupt is enabled. newSensorCore signature updated to accept a cooldown and callers (newASRCommon, NewParallelASR) now pass the default. pushTranscript now skips and logs follow-ups within the cooldown window. Tests added to verify resolveCooldown behavior and the cooldown delivery semantics.
@openminddev openminddev requested a review from a team as a code owner June 5, 2026 19:04
Delete an unnecessary inline comment in TestPushTranscriptCooldownSuppressesFollowup (plugins/inputs/asr/asr_common_test.go). This is a cosmetic cleanup and does not change test behavior.
@codecov

codecov Bot commented Jun 5, 2026

Codecov Report

❌ Patch coverage is 66.66667% with 4 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| plugins/inputs/asr/asr_common.go | 72.72% | 3 Missing ⚠️ |
| plugins/inputs/asr/parallel_asr.go | 0.00% | 1 Missing ⚠️ |

Base automatically changed from go to main June 9, 2026 18:46
@openminddev openminddev requested review from a team as code owners June 9, 2026 18:46

3 participants