diff --git a/CHANGELOG.md b/CHANGELOG.md index c90f4deb..194b420a 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -7,6 +7,59 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 Copyright 2026 Firefly Software Foundation. Licensed under the Apache License 2.0. +## [26.06.11] - 2026-06-22 + +SP-3: human-in-the-loop tool approval re-based onto pydantic-ai native deferred-tools. + +### Added + +- **Native tool approval / HITL.** Tools can declare `requires_approval=True` + (`firefly_tool(...)`, `BaseTool`, and threaded through `ToolKit.as_pydantic_tools()` + / `as_toolset()`). When the model calls such a tool, the agent run **pauses before + executing it** and returns a `DeferredToolRequests` as `result.output`. Detect with + the new `is_deferred(result)` helper; **resume** via + `agent.run(message_history=..., deferred_tool_results=DeferredToolResults(approvals={call_id: True | ToolApproved(override_args=...) | ToolDenied(message=...)}))`. + `FireflyAgent` auto-detects HITL (any approval-requiring tool/`ToolKit`/`as_toolset()`, + or an `ApprovalRequiredToolset` in `toolsets`) and widens its output union to allow the + pause **only then** — non-HITL agents are unchanged. Force with `hitl=True`. +- **Inline (non-pausing) approval.** `FireflyAgent(approval_handler=...)` resolves + approvals *inside* the run via a native `HandleDeferredToolCalls` capability — for + programmatic / policy-based auto-approval. +- **Native re-exports** from `fireflyframework_agentic.tools`: `DeferredToolRequests`, + `DeferredToolResults`, `ToolApproved`, `ToolDenied`, `ApprovalRequired` (plus the + already-exported `ApprovalRequiredToolset`). `is_deferred` and the `ApprovalHandler` + type are exported from `fireflyframework_agentic.agents`. + +### Changed + +- Post-run cross-cutting code now treats a paused run as a control object, not a final + answer: `_persist_memory`, the output-guard, validation, cache, logging, and + explainability middleware all **skip** a `DeferredToolRequests` output (preventing + corrupted memory turns, spurious `OutputGuardError`/`OutputReviewError`, and caching a + pause). +- Tool **guard denials** (validation / rate-limit / sandbox) now raise `ToolGuardError` + instead of a plain `ToolError`. `ToolGuardError` subclasses `ToolError`, so existing + `except ToolError` handlers are unaffected. +- `BaseTool._guarded_execute` now lets pydantic-ai's `ApprovalRequired` / `CallDeferred` + control signals propagate untouched (like `ModelRetry`), instead of wrapping them as + `ToolError`. This makes **dynamic** approval work — a tool body (with `takes_ctx=True`) + may `raise ApprovalRequired(metadata=...)` to defer that specific call; pair with + `FireflyAgent(hitl=True)` so the output union allows the pause. + +### Removed (breaking) + +- **`ApprovalGuard`** (and the `ApprovalCallback` alias). The bespoke guard-chain approval + (sync bool callback → `ToolError` on denial, no pause/resume/metadata) is replaced by the + native protocol above. Migration: `docs/migration.md` §6. + +### Notes + +- HITL stays three distinct layers by design: tool approval (native deferred-tools, agent + layer), workflow `human()` / `WorkflowInterrupt` (journal-replay), and pipeline `Pause` + / `approve_pause` (checkpoint). They are not collapsed. +- Validated against a live Anthropic model: a `requires_approval` tool pauses the real run + (tool body does not execute), and resuming with approval runs it exactly once. + ## [26.06.10] - 2026-06-22 SP-5: native structured-output modes for reasoning patterns. diff --git a/README.md b/README.md index 8a3e6084..1e547c8b 100644 --- a/README.md +++ b/README.md @@ -156,9 +156,14 @@ create your own components; the framework discovers them via duck typing. - **Tools** — `ToolProtocol` (duck-typed) and `BaseTool` (inheritance) let you choose your extensibility style. `ToolBuilder` provides a fluent API for building tools - without subclassing. Five guard types (`ValidationGuard`, `RateLimitGuard`, - `ApprovalGuard`, `SandboxGuard`, `CompositeGuard`) intercept calls before execution. - Three composition patterns (`SequentialComposer`, `FallbackComposer`, + without subclassing. Four guard types (`ValidationGuard`, `RateLimitGuard`, + `SandboxGuard`, `CompositeGuard`) intercept calls before execution (a rejected guard + raises `ToolGuardError`). For **human-in-the-loop**, mark a tool `requires_approval=True`: + the agent run **pauses** before executing it and returns a `DeferredToolRequests` + (detected via `is_deferred(result)`), which you resume with `deferred_tool_results=` — + approving (`ToolApproved`), denying (`ToolDenied`), or auto-deciding inline via an + `approval_handler=`. The native deferred-tools types are re-exported from + `fireflyframework_agentic.tools`. Three composition patterns (`SequentialComposer`, `FallbackComposer`, `ConditionalComposer`) build higher-order tools. `ToolKit` groups tools for bulk registration. Nine built-in tools (calculator, datetime, filesystem, HTTP, JSON, search, shell, text, database) are ready to attach to any agent. @@ -177,7 +182,11 @@ create your own components; the framework discovers them via duck typing. **Reflexion** (execute → critique → retry), **Tree of Thoughts** (branch → evaluate → select), and **Goal Decomposition** (goal → phases → tasks). All produce structured `ReasoningResult` with `ReasoningTrace`. Prompts are - slot-overridable. `OutputReviewer` can validate final outputs. `ReasoningPipeline` + slot-overridable. Each pattern's structured output is wrapped in a pydantic-ai + output mode — selected per-pattern via `output_mode=` or framework-wide via the + `reasoning_output_mode` config — `"tool"` (`ToolOutput`), `"native"` (provider + structured output), or `"prompted"` (`PromptedOutput`, portable to any model). + `OutputReviewer` can validate final outputs. `ReasoningPipeline` chains patterns sequentially.

@@ -536,6 +545,12 @@ async def lookup(query: str) -> str: return f"Result for {query}" ``` +> **Human-in-the-loop:** mark a tool `@firefly_tool(name=..., requires_approval=True)` and the +> agent run **pauses** before executing it — `run()` returns a `DeferredToolRequests` +> (detect with `is_deferred(result)`). Resume with +> `agent.run(message_history=paused.all_messages(), deferred_tool_results=DeferredToolResults(approvals={call_id: True}))`. +> Full detail in [docs/tools.md](docs/tools.md#human-in-the-loop-tool-approval). + ### 4. Add Memory for Multi-Turn Conversations ```python @@ -713,11 +728,11 @@ content processing, validation, explainability, and pipelines. Detailed guides for each module: - [Architecture](docs/architecture.md) — Design principles and layer diagram -- [Agents](docs/agents.md) — Lifecycle, registry, delegation, decorators +- [Agents](docs/agents.md) — Lifecycle, registry, delegation, decorators, human-in-the-loop approval - [Template Agents](docs/templates.md) — Summarizer, classifier, extractor, conversational, router -- [Tools](docs/tools.md) — Protocol, builder, guards, composition, built-ins +- [Tools](docs/tools.md) — Protocol, builder, guards, composition, built-ins, native HITL approval (`requires_approval`, deferred resume) - [Prompts](docs/prompts.md) — Templates, versioning, composition, validation -- [Reasoning Patterns](docs/reasoning.md) — 6 patterns, structured outputs, custom patterns +- [Reasoning Patterns](docs/reasoning.md) — 6 patterns, structured outputs, output modes (`output_mode`/`reasoning_output_mode`), custom patterns - [Content](docs/content.md) — Chunking, compression, batch processing - [Memory](docs/memory.md) — Conversation history, working memory, storage backends - [Validation](docs/validation.md) — Rules, QoS guards, output reviewer diff --git a/docs/README.md b/docs/README.md index e46a00d8..c2a45470 100644 --- a/docs/README.md +++ b/docs/README.md @@ -43,7 +43,7 @@ below it, keeping the dependency graph acyclic and each module independently tes |---|---| | **[Agents](agents.md)** | `FireflyAgent`, `AgentRegistry`, `AgentLifecycle`, `@firefly_agent` decorator, middleware stack (`AgentMiddleware`, `MiddlewareChain`, `Logging`/`PromptGuard`/`CostGuard`/`Observability`/`Explainability`/`Cache`/`OutputGuard`/`Validation`/`Retry`/`PromptCache` middleware), 7 delegation strategies (round-robin, capability, content-based, cost-aware, chain, fallback, weighted), `FallbackModelWrapper` / `run_with_fallback`, `ResultCache` | | **[Template Agents](templates.md)** | Five factory functions: summarizer, classifier, extractor, conversational, router | -| **[Tools](tools.md)** | `ToolProtocol`, `BaseTool`, `ToolBuilder`, guards, composition, caching, 9 built-in tools; full-fidelity schemas via `ParameterSpec(python_type=…)`, `RunContext` opt-in (`takes_ctx`), `ToolKit.as_toolset()` + re-exported native combinators (`FilteredToolset`, `WrapperToolset`, `ApprovalRequiredToolset`, …) | +| **[Tools](tools.md)** | `ToolProtocol`, `BaseTool`, `ToolBuilder`, guards, composition, caching, 9 built-in tools; full-fidelity schemas via `ParameterSpec(python_type=…)`, `RunContext` opt-in (`takes_ctx`), `ToolKit.as_toolset()` + re-exported native combinators (`FilteredToolset`, `WrapperToolset`, `ApprovalRequiredToolset`, …); human-in-the-loop tool approval (`requires_approval` / `is_deferred` / `deferred_tool_results` / `approval_handler`) | | **[Prompts](prompts.md)** | `PromptTemplate`, `PromptRegistry`, composers, validation, loaders | | **[Content](content.md)** | `TextChunker`, `MarkdownChunker`, `DocumentSplitter`, `ImageTiler`, `BatchProcessor`, compression; binary normalization (`content.binary`, `[binary]` extra: `BinaryNormalizer`, office/PDF/image/archive/email converters) | | **[Memory](memory.md)** | `ConversationMemory`, `WorkingMemory`, `MemoryManager`, `InMemoryStore` / `FileStore` / `SQLiteStore` backends, `MemoryScope`, LLM summarisation | diff --git a/docs/agents.md b/docs/agents.md index 8f8427d8..ef3aa5df 100644 --- a/docs/agents.md +++ b/docs/agents.md @@ -137,6 +137,34 @@ agent = FireflyAgent( --- +## Human-in-the-Loop Tool Approval + +When a tool declares `requires_approval=True` (or an `ApprovalRequiredToolset` gates the +toolset), `run()` / `run_sync()` **pause before the tool executes** and return a +`DeferredToolRequests` as `result.output`. Detect this with `is_deferred(result)`, then +resume by calling the agent again with the paused messages and the human's decision: + +```python +from fireflyframework_agentic.agents import FireflyAgent, is_deferred +from fireflyframework_agentic.tools import DeferredToolResults + +result = await agent.run("Delete record 42.") +if is_deferred(result): + approvals = {c.tool_call_id: True for c in result.output.approvals} # True / ToolApproved / ToolDenied + result = await agent.run( + message_history=result.all_messages(), + deferred_tool_results=DeferredToolResults(approvals=approvals), + ) +``` + +`FireflyAgent` auto-detects HITL from approval-requiring tools/toolsets (widening its output +union only then); force it with `hitl=True`, or resolve approvals inline without pausing via +`approval_handler=`. Post-run middleware (output guard, validation, cache) and memory all skip +a paused result — it is a control object, not a final answer. See +[Human-in-the-Loop Tool Approval](tools.md#human-in-the-loop-tool-approval) for the full guide. + +--- + ## Agent Registry The `AgentRegistry` is a singleton that maps agent names to `FireflyAgent` instances. diff --git a/docs/architecture.md b/docs/architecture.md index 5b6f2973..c4d7541e 100644 --- a/docs/architecture.md +++ b/docs/architecture.md @@ -61,7 +61,7 @@ graph TD subgraph Agent Layer AGT["Agents
FireflyAgent · AgentRegistry
DelegationRouter · AgentLifecycle
@firefly_agent · 5 templates · 11 middleware
7 delegation strategies · FallbackModelWrapper
ResultCache · run timeout
"] - TOOLS["Tools
BaseTool · ToolBuilder · ToolKit · CachedTool
5 guards · 3 composers · tool timeout
ToolRegistry · 9 built-ins
"] + TOOLS["Tools
BaseTool · ToolBuilder · ToolKit · CachedTool
4 guards · 3 composers · tool timeout · HITL approval
ToolRegistry · 9 built-ins
"] PROMPTS["Prompts
PromptTemplate · PromptRegistry
3 composers · PromptValidator
PromptLoader
"] CONTENT["Content
TextChunker · DocumentSplitter · MarkdownChunker
ImageTiler · BatchProcessor
ContextCompressor · SlidingWindowManager
content.binary (BinaryNormalizer · office converters)
"] MEM["Memory
MemoryManager · ConversationMemory
WorkingMemory · TokenEstimator
InMemoryStore · FileStore · SQLiteStore
summarization · create_llm_summarizer
export/import · async wrappers
"] @@ -166,7 +166,6 @@ classDiagram ToolProtocol <|.. ConditionalComposer GuardProtocol <|.. ValidationGuard GuardProtocol <|.. RateLimitGuard - GuardProtocol <|.. ApprovalGuard GuardProtocol <|.. SandboxGuard GuardProtocol <|.. CompositeGuard ReasoningPattern <|.. AbstractReasoningPattern @@ -215,7 +214,7 @@ system. Every other module depends on at least one Core component. from environment variables and `.env` files. It actively rejects removed serving/exposure config fields (e.g. `otlp_endpoint`, `rbac_enabled`, `cors_allowed_origins`, `cost_calculator`) with a `ValueError`. -- **exceptions.py** -- A structured exception hierarchy of 34 classes rooted at +- **exceptions.py** -- A structured exception hierarchy of 42 classes rooted at `FireflyAgenticError`. - **plugin.py** -- `PluginDiscovery` discovers and loads entry-point plugins at startup. - **resilience/circuit_breaker.py** -- `CircuitBreaker` (with `CircuitState` and diff --git a/docs/migration.md b/docs/migration.md index e0cff968..13361292 100644 --- a/docs/migration.md +++ b/docs/migration.md @@ -134,6 +134,50 @@ await triage(args, runner=FireflyAgentRunner()) # both resolved from the regis --- +## 6. `ApprovalGuard` removed — human-in-the-loop is now native (breaking) + +**Why.** Tool approval was a bespoke guard (`ApprovalGuard(callback)`) that ran inside +Firefly's guard chain and raised `ToolError` on a denied call — a synchronous, all-or-nothing +gate with no pause/resume, metadata, or per-call granularity, parallel to pydantic-ai's own +deferred-tools protocol. It has been **removed** in favour of the native protocol +(`requires_approval`, `DeferredToolRequests`/`DeferredToolResults`, `ApprovalRequired`, +`ApprovalRequiredToolset`). + +```python +# Before — guard that blocks the run on denial +from fireflyframework_agentic.tools.guards import ApprovalGuard + +async def approve(tool_name, kwargs) -> bool: + return await ask_admin(tool_name, kwargs) + +@guarded(ApprovalGuard(callback=approve)) +@firefly_tool("delete_record", description="Delete a record") +async def delete_record(record_id: str) -> str: ... + +# After — native: the run PAUSES for sign-off, then resumes +from fireflyframework_agentic.agents import is_deferred +from fireflyframework_agentic.tools import DeferredToolResults + +@firefly_tool("delete_record", description="Delete a record", requires_approval=True) +async def delete_record(record_id: str) -> str: ... + +result = await agent.run("delete record 42") +if is_deferred(result): + approvals = {c.tool_call_id: await ask_admin(c) for c in result.output.approvals} # bool / ToolApproved / ToolDenied + result = await agent.run(message_history=result.all_messages(), + deferred_tool_results=DeferredToolResults(approvals=approvals)) +``` + +For the old **inline, non-pausing** behaviour (a callback decides programmatically), pass +`FireflyAgent(approval_handler=...)` — wired as a native `HandleDeferredToolCalls` capability. +See [Human-in-the-Loop Tool Approval](tools.md#human-in-the-loop-tool-approval). + +Also: guard denials (validation, rate-limit, sandbox) now raise `ToolGuardError` instead of a +plain `ToolError`. `ToolGuardError` **subclasses** `ToolError`, so existing `except ToolError` +handlers keep working. + +--- + ## Checklist - [ ] Replace every `type_annotation="..."` with `python_type=` in @@ -144,3 +188,7 @@ await triage(args, runner=FireflyAgentRunner()) # both resolved from the regis - [ ] Review workflows for global cost/budget effects now that sub-agents run through `FireflyAgent`; pass `runner=DefaultAgentRunner()` if you want the old path. - [ ] Import toolset combinators / `RunContext` from `fireflyframework_agentic.tools`. +- [ ] Replace `ApprovalGuard` with `requires_approval=True` + the native pause/resume flow + (`is_deferred()` + `deferred_tool_results=`), or an inline `approval_handler=`. +- [ ] If you matched on `ToolError` from guard denials specifically, note it is now the + `ToolGuardError` subclass (still caught by `except ToolError`). diff --git a/docs/tools.md b/docs/tools.md index 545d3995..38c072ea 100644 --- a/docs/tools.md +++ b/docs/tools.md @@ -188,7 +188,12 @@ flowchart LR Each guard implements `GuardProtocol.check(tool_name, kwargs) -> GuardResult`. When a guard returns `GuardResult(passed=False, reason=...)`, `BaseTool.execute()` raises a -`ToolError` before the handler runs. +`ToolGuardError` (a `ToolError` subclass) before the handler runs. + +> Guards are for **hard, synchronous policy** (validation, rate-limiting, sandboxing). +> For **human-in-the-loop approval** — pausing a run until a person signs off — use the +> native deferred-tools path described in +> [Human-in-the-Loop Tool Approval](#human-in-the-loop-tool-approval), not a guard. ### Built-in Guards @@ -196,8 +201,6 @@ guard returns `GuardResult(passed=False, reason=...)`, `BaseTool.execute()` rais required keyword argument is missing. - **RateLimitGuard** -- `RateLimitGuard(max_calls, period_seconds=60.0)` — sliding-window rate limiter that caps invocations per time window. -- **ApprovalGuard** -- `ApprovalGuard(callback)` — human-in-the-loop gate; `callback` is - an async callable `(tool_name, kwargs) -> bool` that must return `True` to approve. - **SandboxGuard** -- `SandboxGuard(*, allowed_patterns=(), denied_patterns=())` — converts each kwarg value to a string and rejects it if it matches a `denied_patterns` regex (unless it also matches an `allowed_patterns` regex, which takes precedence). @@ -253,6 +256,139 @@ async def fetch(url: str) -> str: --- +## Human-in-the-Loop Tool Approval + +Destructive or sensitive tools can require a human to sign off **before** they run. +Firefly delegates this to pydantic-ai's native **deferred-tools** protocol — no bespoke +mechanism — so approval is per-tool-call, carries metadata, and survives durable +execution. + +### Declaring a tool that needs approval + +Set `requires_approval=True` on the tool. It works on the decorator, `BaseTool` +subclasses, and tools added to a `ToolKit` (via either `as_pydantic_tools()` or +`as_toolset()`): + +```python +from fireflyframework_agentic.tools import firefly_tool + +@firefly_tool("delete_record", description="Delete a database record", requires_approval=True) +async def delete_record(record_id: str) -> str: + ... +``` + +### Pause → approve → resume + +When the model calls an approval-required tool, the run **pauses before executing it** +and returns a `DeferredToolRequests` as `result.output`. Check this with `is_deferred()`, +collect the human decision, then **resume** by calling `run()` again with the prior +messages and a `DeferredToolResults`: + +```python +from fireflyframework_agentic.agents import FireflyAgent, is_deferred +from fireflyframework_agentic.tools import DeferredToolResults, ToolApproved, ToolDenied + +agent = FireflyAgent("ops", model="anthropic:claude-haiku-4-5", tools=[delete_record]) + +result = await agent.run("Delete record 42.") +if is_deferred(result): + requests = result.output # DeferredToolRequests + decisions = {} + for call in requests.approvals: # each is a ToolCallPart + # call.tool_name, call.args, requests.metadata.get(call.tool_call_id) + decisions[call.tool_call_id] = True # True / ToolApproved(override_args=...) / ToolDenied(message=...) + result = await agent.run( + message_history=result.all_messages(), # required: thread the paused run's messages + deferred_tool_results=DeferredToolResults(approvals=decisions), + ) +# result.output is now the model's final answer +``` + +The full pause → approve → resume flow: + +```mermaid +sequenceDiagram + actor Caller + participant Agent as FireflyAgent + participant Model as LLM + actor Human + + Caller->>Agent: run("Delete record 42.") + Agent->>Model: prompt + tool schemas + Model-->>Agent: call delete_record(record_id="42") + Note over Agent: requires_approval → run PAUSES
before executing the tool + Agent-->>Caller: result.output = DeferredToolRequests
(is_deferred(result) is True) + + Caller->>Human: present requests.approvals (+ metadata) + Human-->>Caller: decision = True / ToolApproved(override_args=…) / ToolDenied(message=…) + + Caller->>Agent: run(message_history=result.all_messages(),
deferred_tool_results=DeferredToolResults(approvals={call_id: decision})) + alt approved + Agent->>Agent: execute delete_record
(RunContext.tool_call_approved = True) + else denied + Note over Agent: ToolDenied message returned to the model
(not a crash — the run continues) + end + Agent->>Model: continue with tool result / denial + Model-->>Agent: final answer + Agent-->>Caller: result.output = final answer +``` + +> With an `approval_handler=` (the inline, non-pausing path), the handler resolves the +> `DeferredToolRequests` **inside** the run via a native `HandleDeferredToolCalls` capability, +> so the run never returns to the caller paused. + +Decision values: `True` approves (equivalent to `ToolApproved()`); `ToolApproved(override_args={...})` +approves but replaces the call arguments; `ToolDenied(message="...")` denies — the message is +returned to the model, which continues (it is **not** a crash). On resume the tool's +`RunContext.tool_call_approved` is `True` and `RunContext.tool_call_metadata` carries any +`DeferredToolResults.metadata` you passed. + +> Do not set `conversation_id` on the resume call — an explicit `message_history` and +> memory injection are mutually exclusive. + +### Auto-detection and forcing HITL + +`FireflyAgent` widens its output type to allow the `DeferredToolRequests` pause exactly when +HITL is in play. It auto-detects this from any `requires_approval` tool (directly, inside a +`ToolKit`, or inside a `ToolKit.as_toolset()`), or an `ApprovalRequiredToolset` in `toolsets`. +If your tools defer **dynamically** (raising `pydantic_ai.exceptions.ApprovalRequired`) so +detection can't see it statically, pass `hitl=True`. + +### Dynamic, predicate-based approval + +To gate **existing** tools by a runtime predicate (e.g. approve small amounts, hold large +ones), wrap a toolset in the native `ApprovalRequiredToolset`: + +```python +from fireflyframework_agentic.tools import ApprovalRequiredToolset + +gated = ApprovalRequiredToolset( + my_toolkit.as_toolset(), + approval_required_func=lambda ctx, tool_def, args: args.get("amount", 0) > 1000, +) +agent = FireflyAgent("payments", model="...", toolsets=[gated]) +``` + +### Inline (non-pausing) approval + +For programmatic / policy-based auto-approval that resolves **inside** the run instead of +pausing, pass an `approval_handler` — wired as a native `HandleDeferredToolCalls` capability: + +```python +def auto_approve(ctx, requests): + return requests.build_results(approvals={c.tool_call_id: True for c in requests.approvals}) + +agent = FireflyAgent("ops", model="...", tools=[delete_record], approval_handler=auto_approve) +``` + +> **Three distinct HITL layers, by design.** Tool approval (this section) is native +> deferred-tools at the **agent** layer. A **workflow** pause uses `human()` / +> `WorkflowInterrupt` (journal-replay). A **pipeline** node pauses by returning `Pause(...)` +> and resumes via `approve_pause` (checkpoint). They solve different problems and are not +> collapsed into one mechanism. + +--- + ## Composition Tools can be composed into higher-level operations. Each composer implements diff --git a/docs/tutorial.md b/docs/tutorial.md index a5b6e5db..89006ad5 100644 --- a/docs/tutorial.md +++ b/docs/tutorial.md @@ -626,8 +626,9 @@ during a conversation to fetch data, trigger side-effects, or run computations. Pydantic AI already supports tool functions, but fireflyframework-agentic wraps them with a richer layer: a **protocol-based type system** (`ToolProtocol` → `BaseTool`), -**guards** that enforce validation, rate-limiting, sandboxing, and approval policies -before a tool executes, **composition** primitives (sequential, fallback, conditional), +**guards** that enforce validation, rate-limiting, and sandboxing +before a tool executes (human-in-the-loop *approval* is separate — it pauses the run +rather than rejecting it), **composition** primitives (sequential, fallback, conditional), a **global registry** for discovery, and a **ToolKit** that can convert framework tools into Pydantic AI tools for injection into any agent. @@ -654,7 +655,6 @@ graph TB VG["ValidationGuard"] RG["RateLimitGuard"] SG["SandboxGuard"] - AG["ApprovalGuard"] end subgraph "Composition" @@ -681,7 +681,6 @@ graph TB CG --> VG CG --> RG CG --> SG - CG --> AG BT -.->|"compose"| SEQ BT -.->|"compose"| FB BT -.->|"compose"| COND @@ -733,9 +732,10 @@ exchange_tool = ( ### Tool Guards In production, you rarely want a tool to run unconditionally. Guards are decorators -that wrap a tool's execution with policy checks — input validation, rate-limiting, -filesystem sandboxing, or human-in-the-loop approval. They run **before** the handler -(and optionally after), and they stack via `CompositeGuard`. +that wrap a tool's execution with **hard, synchronous policy checks** — input validation, +rate-limiting, or filesystem sandboxing. They run **before** the handler (and optionally +after), and they stack via `CompositeGuard`. (Human-in-the-loop *approval*, which pauses +the run rather than rejecting it, is handled separately — see below.) #### Validation Guard @@ -785,24 +785,26 @@ async def read_file(path: str) -> str: ... ``` -#### Approval Guard +#### Human-in-the-loop approval -Requires explicit approval before execution — useful for destructive operations. -You provide an async callback that receives the tool name and kwargs and returns -`True` to approve: +Human approval is **not** a guard — it pauses the run rather than failing it. Mark a tool +with `requires_approval=True`; the agent run then pauses before the tool executes and the +caller resumes with an approval decision. See +[Human-in-the-Loop Tool Approval](tools.md#human-in-the-loop-tool-approval) for the full flow. ```python -from fireflyframework_agentic.tools.guards import ApprovalGuard +from fireflyframework_agentic.agents import is_deferred +from fireflyframework_agentic.tools import DeferredToolResults -async def require_admin_approval(tool_name: str, kwargs: dict) -> bool: - """In production, this would check a queue, Slack webhook, or admin UI.""" - print(f"Approve {tool_name} with {kwargs}? (auto-approved in dev)") - return True - -@guarded(ApprovalGuard(callback=require_admin_approval)) -@firefly_tool(name="delete_record", description="Delete a database record") +@firefly_tool(name="delete_record", description="Delete a database record", requires_approval=True) async def delete_record(record_id: str) -> str: ... + +result = await agent.run("Delete record 42.") +if is_deferred(result): # paused for sign-off + approvals = {c.tool_call_id: True for c in result.output.approvals} # True / ToolApproved / ToolDenied + result = await agent.run(message_history=result.all_messages(), + deferred_tool_results=DeferredToolResults(approvals=approvals)) ``` ### Composing Guards diff --git a/fireflyframework_agentic/agents/__init__.py b/fireflyframework_agentic/agents/__init__.py index c8a491fe..8d708ea9 100644 --- a/fireflyframework_agentic/agents/__init__.py +++ b/fireflyframework_agentic/agents/__init__.py @@ -14,7 +14,7 @@ """Agents subpackage -- creation, registration, lifecycle, and delegation.""" -from fireflyframework_agentic.agents.base import FireflyAgent +from fireflyframework_agentic.agents.base import ApprovalHandler, FireflyAgent, is_deferred from fireflyframework_agentic.agents.builtin_middleware import ( BudgetExceededError, CacheMiddleware, @@ -69,6 +69,7 @@ "agent_lifecycle", "AgentMiddleware", "AgentRegistry", + "ApprovalHandler", "BudgetExceededError", "CacheMiddleware", "CacheStatistics", @@ -84,6 +85,7 @@ "FallbackModelWrapper", "FallbackStrategy", "FireflyAgent", + "is_deferred", "LoggingMiddleware", "MiddlewareChain", "MiddlewareContext", diff --git a/fireflyframework_agentic/agents/base.py b/fireflyframework_agentic/agents/base.py index 2708f6d2..38324283 100644 --- a/fireflyframework_agentic/agents/base.py +++ b/fireflyframework_agentic/agents/base.py @@ -29,14 +29,16 @@ import logging import re import time -from collections.abc import Sequence +from collections.abc import Awaitable, Callable, Sequence from typing import TYPE_CHECKING, Any, Generic, cast -from pydantic_ai import Agent +from pydantic_ai import Agent, DeferredToolRequests, DeferredToolResults, RunContext from pydantic_ai import Tool as PydanticTool +from pydantic_ai.capabilities import HandleDeferredToolCalls from pydantic_ai.exceptions import UserError from pydantic_ai.models import Model from pydantic_ai.settings import ModelSettings +from pydantic_ai.toolsets import ApprovalRequiredToolset from fireflyframework_agentic.agents.builtin_middleware import ( LoggingMiddleware, @@ -114,6 +116,32 @@ def _suggested_retry_delay(exc: BaseException) -> float | None: return float(match.group(1)) if match else None +#: An inline human-in-the-loop approval handler. Receives the run context and the +#: pending :class:`~pydantic_ai.DeferredToolRequests`, and returns a +#: :class:`~pydantic_ai.DeferredToolResults` resolving some/all approvals so the +#: run continues *without* pausing (return ``None`` to decline and let the run +#: pause normally). Wired into the inner agent as a ``HandleDeferredToolCalls`` +#: capability. Use ``requests.build_results(...)`` to construct the result. +ApprovalHandler = Callable[ + [RunContext[Any], DeferredToolRequests], + "DeferredToolResults | None | Awaitable[DeferredToolResults | None]", +] + + +def is_deferred(result: Any) -> bool: + """Whether an agent run *paused* instead of completing. + + Returns ``True`` when ``result.output`` is a pydantic-ai + :class:`~pydantic_ai.DeferredToolRequests` — i.e. the run hit a tool that + requires human-in-the-loop approval (``.approvals``) or external execution + (``.calls``) and is awaiting a :class:`~pydantic_ai.DeferredToolResults` + before it can continue. Resume by calling :meth:`FireflyAgent.run` (or + ``run_sync``) again with ``message_history=result.all_messages()`` and + ``deferred_tool_results=...``. + """ + return isinstance(getattr(result, "output", None), DeferredToolRequests) + + class FireflyAgent(Generic[AgentDepsT, OutputT]): """A managed agent that wraps a Pydantic AI :class:`Agent`. @@ -154,6 +182,14 @@ class FireflyAgent(Generic[AgentDepsT, OutputT]): messages back after each invocation. auto_register: When *True* (the default), the agent is automatically added to the global :class:`AgentRegistry`. + hitl: Force human-in-the-loop mode on. Normally HITL is auto-detected + from approval-requiring tools/toolsets; set this when tools defer + dynamically (raising ``ApprovalRequired``) so the output union is + widened to allow the ``DeferredToolRequests`` pause. + approval_handler: Optional inline approval handler (see + :data:`ApprovalHandler`). When supplied, approvals are resolved + *inside* the run via a ``HandleDeferredToolCalls`` capability instead + of pausing — useful for programmatic / policy-based auto-approval. """ def __init__( @@ -176,6 +212,8 @@ def __init__( middleware: list[AgentMiddleware] | None = None, default_middleware: bool = True, auto_register: bool = True, + hitl: bool = False, + approval_handler: ApprovalHandler | None = None, ) -> None: cfg = get_config() @@ -206,15 +244,33 @@ def __init__( merged_settings["temperature"] = cfg.default_temperature resolved_settings: ModelSettings | None = cast("ModelSettings", merged_settings) if merged_settings else None + # Human-in-the-loop: when any tool requires approval (declared on a + # BaseTool/ToolKit, or via an ApprovalRequiredToolset in ``toolsets``), + # or HITL is forced, or an inline approval handler is supplied, the run + # may pause and surface a ``DeferredToolRequests``. pydantic-ai only + # allows that terminal when it is in ``output_type`` — otherwise it + # raises ``UserError`` — so widen the output union exactly once here, + # gated strictly on HITL so non-HITL agents keep their declared shape. + self._hitl_enabled = hitl or approval_handler is not None or self._detect_hitl(tools, toolsets) + effective_output_type: Any = output_type + if self._hitl_enabled: + existing = list(output_type) if isinstance(output_type, list) else [output_type] + if DeferredToolRequests not in existing: + existing.append(DeferredToolRequests) + effective_output_type = existing + + capabilities = [HandleDeferredToolCalls(approval_handler)] if approval_handler is not None else [] + self._agent: Agent[AgentDepsT, OutputT] = Agent( resolved_model, instructions=instructions, - output_type=output_type, + output_type=effective_output_type, deps_type=deps_type, tools=self._resolve_tools(tools), toolsets=list(toolsets), retries=resolved_retries, model_settings=resolved_settings, + capabilities=capabilities, name=name, ) @@ -278,6 +334,7 @@ async def run( conversation_id: str | None = None, timeout: float | None = None, context: AgentContext | None = None, + deferred_tool_results: DeferredToolResults | None = None, **kwargs: Any, ) -> Any: """Run the agent asynchronously. @@ -290,8 +347,16 @@ async def run( is provided, the agent automatically loads conversation history as ``message_history`` and stores new messages after the run. + To **resume** a run that paused for human-in-the-loop approval (see + :func:`is_deferred`), pass ``deferred_tool_results`` together with + ``message_history=.all_messages()``; do not also set + *conversation_id* on the resume call (memory injection and an explicit + ``message_history`` are mutually exclusive). + Delegates to ``pydantic_ai.Agent.run``. """ + if deferred_tool_results is not None: + kwargs["deferred_tool_results"] = deferred_tool_results if context is None: context = _AgentContext() @@ -339,9 +404,15 @@ def run_sync( conversation_id: str | None = None, timeout: float | None = None, context: AgentContext | None = None, + deferred_tool_results: DeferredToolResults | None = None, **kwargs: Any, ) -> Any: - """Run the agent synchronously. Delegates to ``pydantic_ai.Agent.run_sync``.""" + """Run the agent synchronously. Delegates to ``pydantic_ai.Agent.run_sync``. + + Supports the same ``deferred_tool_results`` resume path as :meth:`run`. + """ + if deferred_tool_results is not None: + kwargs["deferred_tool_results"] = deferred_tool_results if context is None: context = _AgentContext() @@ -473,6 +544,11 @@ def _persist_memory( """After a run, store new messages in conversation memory.""" if self._memory is None or conversation_id is None: return + if is_deferred(result): + # The run paused for approval; its output is a control object, not a + # final answer. Persisting it would corrupt the conversation turn — + # the resumed run records the completed turn instead. + return if not hasattr(result, "new_messages"): return new_msgs = result.new_messages() @@ -785,12 +861,41 @@ def _resolve_tools(tools: Sequence[Any]) -> list[Any]: item.pydantic_handler(), name=item.name, description=item.description, + requires_approval=item.requires_approval, ) ) else: resolved.append(item) return resolved + @staticmethod + def _detect_hitl(tools: Sequence[Any], toolsets: Sequence[Any]) -> bool: + """Whether any tool/toolset can pause the run for approval. + + True if a :class:`BaseTool` (directly or inside a ``ToolKit``) declares + ``requires_approval``, a toolset is a native + :class:`~pydantic_ai.toolsets.ApprovalRequiredToolset`, or a toolset + (e.g. ``ToolKit.as_toolset()``) exposes a tool that declares + ``requires_approval``. Raw ``pydantic_ai.Tool`` objects with + ``requires_approval`` are also honoured; if introspection misses a case, + pass ``hitl=True`` explicitly. + """ + for item in tools: + if isinstance(item, ToolKit): + if any(getattr(t, "requires_approval", False) for t in item.tools): + return True + elif getattr(item, "requires_approval", False): + return True + for ts in toolsets: + if isinstance(ts, ApprovalRequiredToolset): + return True + # A FunctionToolset exposes its tools as a name -> Tool mapping; honour + # any that declare requires_approval (so as_toolset() HITL kits work). + ts_tools = getattr(ts, "tools", None) + if isinstance(ts_tools, dict) and any(getattr(t, "requires_approval", False) for t in ts_tools.values()): + return True + return False + # -- Dunder -------------------------------------------------------------- def __repr__(self) -> str: diff --git a/fireflyframework_agentic/agents/builtin_middleware.py b/fireflyframework_agentic/agents/builtin_middleware.py index ce407499..8017d9cc 100644 --- a/fireflyframework_agentic/agents/builtin_middleware.py +++ b/fireflyframework_agentic/agents/builtin_middleware.py @@ -47,6 +47,7 @@ from typing import Any from opentelemetry import trace as _trace +from pydantic_ai import DeferredToolRequests from fireflyframework_agentic.agents.middleware import MiddlewareContext from fireflyframework_agentic.exceptions import BudgetExceededError, OutputReviewError @@ -72,6 +73,14 @@ logger = logging.getLogger(__name__) +def _is_deferred_output(result: Any) -> bool: + """Whether a run paused (output is a ``DeferredToolRequests``) rather than + completing. Output-consuming ``after_run`` hooks must skip such a result: + it is a HITL control object, not a final answer — scanning, validating, or + caching it is wrong (and would raise on a type mismatch).""" + return isinstance(getattr(result, "output", None), DeferredToolRequests) + + # -- Errors ------------------------------------------------------------------ @@ -141,6 +150,16 @@ async def after_run(self, context: MiddlewareContext, result: Any) -> Any: elapsed_ms = (time.monotonic() - t0) * 1000 if t0 is not None else 0.0 method = context.method or "run" + if _is_deferred_output(result): + logger.log( + self._level, + "⏸ %s.%s paused after %.1fms awaiting tool approval", + context.agent_name, + method, + elapsed_ms, + ) + return result + suffix = self._usage_suffix(result) if self._include_usage else "" reasoning = self._reasoning_suffix(result) @@ -420,7 +439,9 @@ async def before_run(self, context: MiddlewareContext) -> None: async def after_run(self, context: MiddlewareContext, result: Any) -> Any: """Record a decision with output information.""" output_text = "" - if hasattr(result, "output"): + if _is_deferred_output(result): + output_text = "" + elif hasattr(result, "output"): output_text = str(result.output)[:200] self._get_recorder().record( "llm_call", @@ -458,6 +479,10 @@ async def before_run(self, context: MiddlewareContext) -> None: async def after_run(self, context: MiddlewareContext, result: Any) -> Any: """Store the result in the cache on miss.""" + if _is_deferred_output(result): + # Never cache a paused run — a later identical prompt would be served + # a stale ``DeferredToolRequests`` instead of running for real. + return result if "_cache_result" not in context.metadata: prompt_str = str(context.prompt) if context.prompt is not None else "" self._cache.put(context.agent_name, prompt_str, result) @@ -529,6 +554,8 @@ async def before_run(self, context: MiddlewareContext) -> None: async def after_run(self, context: MiddlewareContext, result: Any) -> Any: """Scan the output; reject or sanitise if unsafe.""" + if _is_deferred_output(result): + return result output_text = "" if hasattr(result, "output"): output_text = str(result.output) @@ -586,6 +613,8 @@ async def before_run(self, context: MiddlewareContext) -> None: async def after_run(self, context: MiddlewareContext, result: Any) -> Any: """Validate the output; raise on failure.""" + if _is_deferred_output(result): + return result raw = result.output if hasattr(result, "output") else result parsed, parse_errors = self._reviewer._parse_output(raw) diff --git a/fireflyframework_agentic/tools/__init__.py b/fireflyframework_agentic/tools/__init__.py index 9453b34e..bde32cde 100644 --- a/fireflyframework_agentic/tools/__init__.py +++ b/fireflyframework_agentic/tools/__init__.py @@ -22,8 +22,16 @@ ``pydantic_ai`` directly. """ -# Native pydantic-ai surface, re-exported so callers compose toolsets in one import. -from pydantic_ai import RunContext +# Native pydantic-ai surface, re-exported so callers compose toolsets, request +# human approval, and resume deferred runs in one import. +from pydantic_ai import ( + DeferredToolRequests, + DeferredToolResults, + RunContext, + ToolApproved, + ToolDenied, +) +from pydantic_ai.exceptions import ApprovalRequired from pydantic_ai.toolsets import ( ApprovalRequiredToolset, CombinedToolset, @@ -52,7 +60,6 @@ ) from fireflyframework_agentic.tools.decorators import firefly_tool, guarded, retryable from fireflyframework_agentic.tools.guards import ( - ApprovalGuard, CompositeGuard, RateLimitGuard, SandboxGuard, @@ -62,13 +69,15 @@ from fireflyframework_agentic.tools.toolkit import ToolKit, to_pydantic_handler __all__ = [ - "ApprovalGuard", + "ApprovalRequired", "ApprovalRequiredToolset", "BaseTool", "CachedTool", "CombinedToolset", "CompositeGuard", "ConditionalComposer", + "DeferredToolRequests", + "DeferredToolResults", "FallbackComposer", "FilteredToolset", "FunctionToolset", @@ -82,7 +91,9 @@ "RunContext", "SandboxGuard", "SequentialComposer", + "ToolApproved", "ToolBuilder", + "ToolDenied", "ToolInfo", "ToolKit", "ToolProtocol", diff --git a/fireflyframework_agentic/tools/base.py b/fireflyframework_agentic/tools/base.py index 9c73e635..9d965656 100644 --- a/fireflyframework_agentic/tools/base.py +++ b/fireflyframework_agentic/tools/base.py @@ -38,8 +38,9 @@ from pydantic import BaseModel, Field from pydantic_ai import ModelRetry, RunContext +from pydantic_ai.exceptions import ApprovalRequired, CallDeferred -from fireflyframework_agentic.exceptions import ToolError, ToolTimeoutError +from fireflyframework_agentic.exceptions import ToolError, ToolGuardError, ToolTimeoutError logger = logging.getLogger(__name__) @@ -151,6 +152,12 @@ class BaseTool(ABC): ``RunContext`` (agent deps, usage, retry count, messages). The generated handler injects it; ``_execute`` then receives it as the keyword-only ``_ctx``. Guards and caching never see ``ctx``. + requires_approval: When ``True``, the tool is exposed to pydantic-ai as + an approval-required tool. The agent run pauses before this tool + executes and surfaces a ``DeferredToolRequests`` for human + sign-off (resolved natively via ``DeferredToolResults``); the tool + body runs only once the call is approved. See the agent's + human-in-the-loop docs. """ def __init__( @@ -163,6 +170,7 @@ def __init__( parameters: Sequence[ParameterSpec] = (), timeout: float | None = None, takes_ctx: bool = False, + requires_approval: bool = False, ) -> None: self._name = name self._description = description @@ -171,6 +179,7 @@ def __init__( self._parameters = list(parameters) self._timeout = timeout self._takes_ctx = takes_ctx + self._requires_approval = requires_approval # -- Properties ---------------------------------------------------------- @@ -204,12 +213,18 @@ def takes_ctx(self) -> bool: """Whether this tool opts into a pydantic-ai ``RunContext``.""" return self._takes_ctx + @property + def requires_approval(self) -> bool: + """Whether calls to this tool require human-in-the-loop approval.""" + return self._requires_approval + # -- Execution ----------------------------------------------------------- async def execute(self, **kwargs: Any) -> Any: """Run the tool after evaluating all guards. - If any guard rejects, a :class:`ToolError` is raised immediately. + If any guard rejects, a :class:`ToolGuardError` (a ``ToolError`` + subclass) is raised immediately. When a *timeout* is configured, wraps the execution in :func:`asyncio.wait_for` and raises :class:`ToolTimeoutError` on expiry. @@ -230,7 +245,7 @@ async def _guarded_execute(self, kwargs: dict[str, Any], *, ctx: Any) -> Any: for guard in self._guards: result = await guard.check(self._name, kwargs) if not result.passed: - raise ToolError(f"Guard rejected execution of tool '{self._name}': {result.reason}") + raise ToolGuardError(f"Guard rejected execution of tool '{self._name}': {result.reason}") logger.debug("Executing tool '%s' with kwargs=%s", self._name, list(kwargs.keys())) # ``_ctx`` reaches ``_execute`` only for ctx-aware tools; it is never in @@ -243,6 +258,12 @@ async def _guarded_execute(self, kwargs: dict[str, Any], *, ctx: Any) -> Any: return await coro except TimeoutError: raise ToolTimeoutError(f"Tool '{self._name}' timed out after {self._timeout}s") from None + except (ApprovalRequired, CallDeferred): + # pydantic-ai human-in-the-loop / deferral control signals. Like + # ``ModelRetry`` these are NOT errors: they must reach the agent graph + # untouched so the run pauses and surfaces a ``DeferredToolRequests``. + # Wrapping them as ``ToolError`` would break dynamic tool approval. + raise except ToolError: raise except Exception as exc: diff --git a/fireflyframework_agentic/tools/decorators.py b/fireflyframework_agentic/tools/decorators.py index 3a51ff72..98484c11 100644 --- a/fireflyframework_agentic/tools/decorators.py +++ b/fireflyframework_agentic/tools/decorators.py @@ -54,8 +54,9 @@ def __init__( description: str = "", tags: Sequence[str] = (), guards: Sequence[GuardProtocol] = (), + requires_approval: bool = False, ) -> None: - super().__init__(name, description=description, tags=tags, guards=guards) + super().__init__(name, description=description, tags=tags, guards=guards, requires_approval=requires_approval) self._handler = handler async def _execute(self, **kwargs: Any) -> Any: @@ -86,6 +87,7 @@ def firefly_tool( description: str = "", tags: Sequence[str] = (), guards: Sequence[GuardProtocol] = (), + requires_approval: bool = False, auto_register: bool = True, ) -> Callable[[AsyncHandler], BaseTool]: """Create a :class:`BaseTool` from an async function and optionally register it. @@ -95,6 +97,9 @@ def firefly_tool( description: Description (falls back to the function's docstring). tags: Tags for capability-based discovery. guards: Guards evaluated before execution. + requires_approval: When *True*, calls to this tool pause the agent run + for human-in-the-loop approval via pydantic-ai's native + deferred-tools protocol (see the agent HITL docs). auto_register: When *True*, the tool is added to the global registry. Returns: @@ -109,6 +114,7 @@ def decorator(func: AsyncHandler) -> BaseTool: description=description or func.__doc__ or "", tags=tags, guards=guards, + requires_approval=requires_approval, ) if auto_register: tool_registry.register(tool) diff --git a/fireflyframework_agentic/tools/guards.py b/fireflyframework_agentic/tools/guards.py index db763ca6..6a8eb9f8 100644 --- a/fireflyframework_agentic/tools/guards.py +++ b/fireflyframework_agentic/tools/guards.py @@ -25,14 +25,11 @@ import re import threading import time -from collections.abc import Awaitable, Callable, Sequence +from collections.abc import Sequence from typing import Any from fireflyframework_agentic.tools.base import GuardProtocol, GuardResult -# Type alias for the human-in-the-loop approval callback -ApprovalCallback = Callable[[str, dict[str, Any]], Awaitable[bool]] - class ValidationGuard: """Validates that all required parameters declared in a tool's spec are present. @@ -84,26 +81,6 @@ async def check(self, tool_name: str, kwargs: dict[str, Any]) -> GuardResult: return GuardResult(passed=True) -class ApprovalGuard: - """Human-in-the-loop guard that delegates the decision to an async callback. - - The callback receives the tool name and kwargs and must return ``True`` - to approve execution. - - Parameters: - callback: An async callable ``(tool_name, kwargs) -> bool``. - """ - - def __init__(self, callback: ApprovalCallback) -> None: - self._callback = callback - - async def check(self, tool_name: str, kwargs: dict[str, Any]) -> GuardResult: - approved = await self._callback(tool_name, kwargs) - if not approved: - return GuardResult(passed=False, reason="Execution not approved") - return GuardResult(passed=True) - - class SandboxGuard: """Restricts tool arguments by matching against allow / deny regex patterns. diff --git a/fireflyframework_agentic/tools/toolkit.py b/fireflyframework_agentic/tools/toolkit.py index aa95c18c..029499d3 100644 --- a/fireflyframework_agentic/tools/toolkit.py +++ b/fireflyframework_agentic/tools/toolkit.py @@ -113,6 +113,7 @@ def as_pydantic_tools(self) -> list[PydanticTool[Any]]: to_pydantic_handler(tool), name=tool.name, description=tool.description, + requires_approval=getattr(tool, "requires_approval", False), ) ) return pydantic_tools @@ -130,7 +131,14 @@ def as_toolset(self) -> FunctionToolset[Any]: for tool in self._tools: # Forward the description too — add_function drops it otherwise, so the # LLM would see a toolset tool with no description (a real fidelity loss). - toolset.add_function(to_pydantic_handler(tool), name=tool.name, description=tool.description) + # ``requires_approval`` is forwarded so a kit's HITL tools keep pausing + # the run for approval even when the kit is added as a toolset. + toolset.add_function( + to_pydantic_handler(tool), + name=tool.name, + description=tool.description, + requires_approval=getattr(tool, "requires_approval", False), + ) return toolset def __len__(self) -> int: diff --git a/pyproject.toml b/pyproject.toml index ed069a28..3bb5eca6 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -1,6 +1,6 @@ [project] name = "fireflyframework-agentic" -version = "26.06.10" +version = "26.06.11" description = "A GenAI metaframework built on Pydantic AI for building production-grade GenAI applications with agents, reasoning patterns, prompt engineering, observability, and more." readme = "README.md" license = { text = "Apache-2.0" } diff --git a/tests/integration/test_real_anthropic_e2e.py b/tests/integration/test_real_anthropic_e2e.py index cb867e77..a86fa1f4 100644 --- a/tests/integration/test_real_anthropic_e2e.py +++ b/tests/integration/test_real_anthropic_e2e.py @@ -35,8 +35,9 @@ import pytest from pydantic import BaseModel +from pydantic_ai import DeferredToolResults -from fireflyframework_agentic.agents.base import FireflyAgent +from fireflyframework_agentic.agents.base import FireflyAgent, is_deferred from fireflyframework_agentic.memory.manager import MemoryManager from fireflyframework_agentic.observability.usage import UsageTracker from fireflyframework_agentic.pipeline.builder import PipelineBuilder @@ -144,6 +145,48 @@ async def test_reasoning_prompted_output_mode(self): assert result.steps_taken >= 1 assert "60" in str(result.output) + async def test_hitl_tool_approval_pause_resume(self): + """A requires_approval tool pauses the real run before executing; resuming + with approval runs the tool and the model uses its result.""" + calls: list[str] = [] + + class DeleteTool(BaseTool): + def __init__(self) -> None: + super().__init__( + "delete_record", + description="Delete a record by its id. This is destructive.", + parameters=[ParameterSpec(name="record_id", python_type=str)], + requires_approval=True, + ) + + async def _execute(self, *, record_id: str) -> str: + calls.append(record_id) + return f"record {record_id} deleted" + + ag = FireflyAgent( + "e2e-hitl", + model=HAIKU, + tools=[DeleteTool()], + instructions="To delete, call delete_record with the id. Then confirm to the user.", + auto_register=False, + ) + + paused = await ag.run("Please delete record 42.") + assert is_deferred(paused), "the approval-required tool should pause the run" + assert calls == [], "the tool body must NOT run before approval" + approvals = paused.output.approvals + assert [c.tool_name for c in approvals] == ["delete_record"] + call_id = approvals[0].tool_call_id + + resumed = await ag.run( + None, + message_history=paused.all_messages(), + deferred_tool_results=DeferredToolResults(approvals={call_id: True}), + ) + assert not is_deferred(resumed) + assert calls == ["42"], "the tool body must run exactly once, after approval" + assert "42" in str(resumed.output) + async def test_pipeline_agent_step(self): """A DAG pipeline runs a real agent step end to end.""" ag = FireflyAgent("e2e-pipe", model=HAIKU, instructions="Answer with a single word.", auto_register=False) diff --git a/tests/unit/agents/test_hitl_deferred.py b/tests/unit/agents/test_hitl_deferred.py new file mode 100644 index 00000000..a1faa585 --- /dev/null +++ b/tests/unit/agents/test_hitl_deferred.py @@ -0,0 +1,529 @@ +# Copyright 2026 Firefly Software Foundation +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""SP-3: human-in-the-loop tool approval via pydantic-ai native deferred-tools. + +An approval-required tool pauses the run and surfaces a ``DeferredToolRequests`` +(``is_deferred`` is True). The caller resumes by approving/denying via +``DeferredToolResults``. Post-run cross-cutting code (memory, OutputGuard, +Validation, Cache, Logging) must treat the paused result as a control object, +not a final answer. +""" + +from __future__ import annotations + +import logging +from types import SimpleNamespace +from typing import Any + +import pytest +from pydantic_ai import DeferredToolRequests, DeferredToolResults, Tool, ToolApproved, ToolDenied +from pydantic_ai.exceptions import UserError +from pydantic_ai.messages import ModelResponse, TextPart, ToolCallPart +from pydantic_ai.models.function import FunctionModel +from pydantic_ai.toolsets import ApprovalRequiredToolset, FunctionToolset + +from fireflyframework_agentic.agents.base import FireflyAgent, is_deferred +from fireflyframework_agentic.agents.builtin_middleware import ( + CacheMiddleware, + ExplainabilityMiddleware, + OutputGuardError, + OutputGuardMiddleware, + ValidationMiddleware, +) +from fireflyframework_agentic.agents.cache import ResultCache +from fireflyframework_agentic.exceptions import OutputReviewError +from fireflyframework_agentic.memory.manager import MemoryManager +from fireflyframework_agentic.tools import ApprovalRequired +from fireflyframework_agentic.tools.base import BaseTool, ParameterSpec +from fireflyframework_agentic.tools.decorators import firefly_tool +from fireflyframework_agentic.tools.toolkit import ToolKit + + +def _model_calling(*calls: tuple[str, dict[str, Any], str]) -> FunctionModel: + """A model that emits the given tool calls in its first response, then text. + + Each call is ``(tool_name, args, tool_call_id)``. Emitting several in one + response exercises a single pause carrying multiple pending approvals. + """ + state = {"n": 0} + + def fn(messages: Any, info: Any) -> ModelResponse: + state["n"] += 1 + if state["n"] == 1: + return ModelResponse(parts=[ToolCallPart(tool_name=t, args=a, tool_call_id=c) for (t, a, c) in calls]) + return ModelResponse(parts=[TextPart("done")]) + + return FunctionModel(fn) + + +class _RecordingTool(BaseTool): + """An approval-required tool that records the args it actually executed with.""" + + def __init__(self, name: str = "rec_delete") -> None: + super().__init__( + name, + description="delete by name", + parameters=[ParameterSpec(name="name", python_type=str)], + requires_approval=True, + ) + self.ran_with: list[str] = [] + + async def _execute(self, *, name: str) -> str: + self.ran_with.append(name) + return f"DELETED {name}" + + +class _DynamicTool(BaseTool): + """A ctx-aware tool that defers *dynamically* by raising ApprovalRequired until approved.""" + + def __init__(self) -> None: + super().__init__( + "dyn_delete", + description="delete a record (needs approval)", + parameters=[ParameterSpec(name="record_id", python_type=str)], + takes_ctx=True, + ) + self.seen: dict[str, Any] = {} + + async def _execute(self, *, record_id: str, _ctx: Any) -> str: + if not getattr(_ctx, "tool_call_approved", False): + raise ApprovalRequired(metadata={"reason": "destructive", "record_id": record_id}) + self.seen["approved"] = _ctx.tool_call_approved + self.seen["metadata"] = dict(_ctx.tool_call_metadata or {}) + return f"deleted {record_id}" + + +def _approval_model() -> FunctionModel: + """A model that calls ``delete_db`` first, then answers with text.""" + state = {"n": 0} + + def fn(messages: Any, info: Any) -> ModelResponse: + state["n"] += 1 + if state["n"] == 1: + return ModelResponse( + parts=[ToolCallPart(tool_name="delete_db", args={"name": "prod"}, tool_call_id="call_1")] + ) + return ModelResponse(parts=[TextPart("done")]) + + return FunctionModel(fn) + + +@firefly_tool("delete_db", description="Delete a database", requires_approval=True, auto_register=False) +async def delete_db(name: str) -> str: + return f"DELETED {name}" + + +@pytest.mark.asyncio +async def test_requires_approval_pauses_then_resumes_approve(): + agent = FireflyAgent("hitl", model=_approval_model(), tools=[delete_db], auto_register=False) + + paused = await agent.run("delete prod") + assert is_deferred(paused) + reqs = paused.output + assert isinstance(reqs, DeferredToolRequests) + assert [c.tool_name for c in reqs.approvals] == ["delete_db"] + assert not reqs.calls + call_id = reqs.approvals[0].tool_call_id + + resumed = await agent.run( + None, + message_history=paused.all_messages(), + deferred_tool_results=DeferredToolResults(approvals={call_id: True}), + ) + assert not is_deferred(resumed) + assert resumed.output == "done" + + +@pytest.mark.asyncio +async def test_resume_deny_surfaces_to_model_and_completes(): + agent = FireflyAgent("hitl-deny", model=_approval_model(), tools=[delete_db], auto_register=False) + paused = await agent.run("delete prod") + call_id = paused.output.approvals[0].tool_call_id + + resumed = await agent.run( + None, + message_history=paused.all_messages(), + deferred_tool_results=DeferredToolResults(approvals={call_id: ToolDenied(message="not allowed")}), + ) + # A denial is not a crash: the model is told and the run completes. + assert not is_deferred(resumed) + assert resumed.output == "done" + + +@pytest.mark.asyncio +async def test_inline_approval_handler_resolves_without_pausing(): + def handler(ctx: Any, requests: DeferredToolRequests) -> DeferredToolResults: + return requests.build_results(approvals={c.tool_call_id: True for c in requests.approvals}) + + agent = FireflyAgent( + "hitl-inline", model=_approval_model(), tools=[delete_db], approval_handler=handler, auto_register=False + ) + result = await agent.run("delete prod") + assert not is_deferred(result) + assert result.output == "done" + + +def test_non_hitl_agent_output_type_is_not_widened(): + agent = FireflyAgent("plain", model=_approval_model(), auto_register=False) + assert agent._hitl_enabled is False + + +def test_hitl_detected_from_requires_approval_tool(): + agent = FireflyAgent("d1", model=_approval_model(), tools=[delete_db], auto_register=False) + assert agent._hitl_enabled is True + + +def test_hitl_detected_from_toolkit(): + kit = ToolKit("danger", [delete_db]) + agent = FireflyAgent("d2", model=_approval_model(), tools=[kit], auto_register=False) + assert agent._hitl_enabled is True + + +def test_hitl_detected_from_as_toolset(): + kit = ToolKit("danger", [delete_db]) + agent = FireflyAgent("d3", model=_approval_model(), toolsets=[kit.as_toolset()], auto_register=False) + assert agent._hitl_enabled is True + + +def test_hitl_detected_from_approval_required_toolset(): + wrapped = ApprovalRequiredToolset(FunctionToolset(), approval_required_func=lambda *a: True) + agent = FireflyAgent("d4", model=_approval_model(), toolsets=[wrapped], auto_register=False) + assert agent._hitl_enabled is True + + +def test_hitl_forced_via_flag(): + agent = FireflyAgent("d5", model=_approval_model(), hitl=True, auto_register=False) + assert agent._hitl_enabled is True + + +@pytest.mark.asyncio +async def test_paused_result_not_persisted_to_memory(): + from fireflyframework_agentic.memory.manager import MemoryManager + + mem = MemoryManager() + agent = FireflyAgent("hitl-mem", model=_approval_model(), tools=[delete_db], memory=mem, auto_register=False) + paused = await agent.run("delete prod", conversation_id="c1") + assert is_deferred(paused) + # The pause must not be written as a conversation turn. + assert mem.get_message_history("c1") == [] + + +@pytest.mark.asyncio +async def test_validation_middleware_skips_deferred_output(): + class _AlwaysFailReviewer: + def _parse_output(self, raw: Any) -> tuple[Any, list[str]]: + return None, ["reviewer always fails"] + + def _validate_output(self, validated: Any) -> Any: + return None + + agent = FireflyAgent( + "hitl-val", + model=_approval_model(), + tools=[delete_db], + middleware=[ValidationMiddleware(reviewer=_AlwaysFailReviewer())], + default_middleware=False, + auto_register=False, + ) + # Would raise OutputReviewError if the deferred output were validated. + paused = await agent.run("delete prod") + assert is_deferred(paused) + + # Sanity: the same reviewer DOES fire on a real completed output. + plain = FireflyAgent( + "plain-val", + model=FunctionModel(lambda m, i: ModelResponse(parts=[TextPart("hello")])), + middleware=[ValidationMiddleware(reviewer=_AlwaysFailReviewer())], + default_middleware=False, + auto_register=False, + ) + with pytest.raises(OutputReviewError): + await plain.run("hi") + + +@pytest.mark.asyncio +async def test_cache_middleware_does_not_cache_a_pause(): + cache = ResultCache() + agent = FireflyAgent( + "hitl-cache", + model=_approval_model(), + tools=[delete_db], + middleware=[CacheMiddleware(cache=cache)], + default_middleware=False, + auto_register=False, + ) + paused = await agent.run("delete prod") + assert is_deferred(paused) + # The paused run must not have been cached. + assert cache.get("hitl-cache", "delete prod") is None + + +def test_is_deferred_helper(): + class _R: + output = DeferredToolRequests(approvals=[], calls=[], metadata={}) + + class _Done: + output = "final answer" + + assert is_deferred(_R()) is True + assert is_deferred(_Done()) is False + assert is_deferred(None) is False + + +@pytest.mark.asyncio +async def test_run_sync_requires_approval_pauses_then_resumes_approve(): + agent = FireflyAgent("hitl-sync", model=_approval_model(), tools=[delete_db], auto_register=False) + paused = agent.run_sync("delete prod") + assert is_deferred(paused) + call_id = paused.output.approvals[0].tool_call_id + resumed = agent.run_sync( + None, + message_history=paused.all_messages(), + deferred_tool_results=DeferredToolResults(approvals={call_id: True}), + ) + assert not is_deferred(resumed) + assert resumed.output == "done" + + +@pytest.mark.asyncio +async def test_resume_approve_with_override_args_replaces_tool_args(): + tool = _RecordingTool() + agent = FireflyAgent( + "override", model=_model_calling(("rec_delete", {"name": "prod"}, "c1")), tools=[tool], auto_register=False + ) + paused = await agent.run("delete prod") + call_id = paused.output.approvals[0].tool_call_id + resumed = await agent.run( + None, + message_history=paused.all_messages(), + deferred_tool_results=DeferredToolResults(approvals={call_id: ToolApproved(override_args={"name": "staging"})}), + ) + assert not is_deferred(resumed) + assert tool.ran_with == ["staging"] # override_args replaced the model's "prod" + + +@pytest.mark.asyncio +async def test_dynamic_approval_required_from_tool_body_with_hitl_flag(): + tool = _DynamicTool() + agent = FireflyAgent( + "dyn", + model=_model_calling(("dyn_delete", {"record_id": "7"}, "c1")), + tools=[tool], + hitl=True, + auto_register=False, + ) + paused = await agent.run("delete 7") + assert is_deferred(paused) + call_id = paused.output.approvals[0].tool_call_id + resumed = await agent.run( + None, + message_history=paused.all_messages(), + deferred_tool_results=DeferredToolResults(approvals={call_id: True}), + ) + assert not is_deferred(resumed) + assert tool.seen.get("approved") is True + + # Without hitl=True the dynamic deferral has nowhere to surface -> UserError. + bad = FireflyAgent( + "dyn-bad", + model=_model_calling(("dyn_delete", {"record_id": "7"}, "c1")), + tools=[_DynamicTool()], + auto_register=False, + ) + with pytest.raises(UserError): + await bad.run("delete 7") + + +@pytest.mark.asyncio +async def test_approval_metadata_round_trips_to_run_context(): + tool = _DynamicTool() + agent = FireflyAgent( + "meta", + model=_model_calling(("dyn_delete", {"record_id": "9"}, "c1")), + tools=[tool], + hitl=True, + auto_register=False, + ) + paused = await agent.run("delete 9") + call_id = paused.output.approvals[0].tool_call_id + # Outbound: ApprovalRequired(metadata=...) surfaces in DeferredToolRequests.metadata. + assert paused.output.metadata.get(call_id) == {"reason": "destructive", "record_id": "9"} + # Inbound: DeferredToolResults.metadata reaches the tool's RunContext on resume. + resumed = await agent.run( + None, + message_history=paused.all_messages(), + deferred_tool_results=DeferredToolResults(approvals={call_id: True}, metadata={call_id: {"reviewer": "alice"}}), + ) + assert not is_deferred(resumed) + assert tool.seen.get("approved") is True + assert tool.seen.get("metadata") == {"reviewer": "alice"} + + +@pytest.mark.asyncio +async def test_approval_required_toolset_pauses_and_resumes_end_to_end(): + ran: list[str] = [] + + async def do_thing(x: str) -> str: + ran.append(x) + return f"did {x}" + + fts = FunctionToolset() + fts.add_function(do_thing, name="do_thing") + gated = ApprovalRequiredToolset(fts, approval_required_func=lambda *a: True) + agent = FireflyAgent( + "ars", model=_model_calling(("do_thing", {"x": "go"}, "c1")), toolsets=[gated], auto_register=False + ) + + paused = await agent.run("do it") + assert is_deferred(paused) + assert ran == [] # gated tool did not run before approval + call_id = paused.output.approvals[0].tool_call_id + resumed = await agent.run( + None, + message_history=paused.all_messages(), + deferred_tool_results=DeferredToolResults(approvals={call_id: True}), + ) + assert not is_deferred(resumed) + assert ran == ["go"] + + +@pytest.mark.asyncio +async def test_output_guard_middleware_skips_deferred_output(): + class _BlockAll: + def scan(self, text: str) -> Any: + return SimpleNamespace( + safe=False, matched_categories=["x"], matched_patterns=["p"], reason="blocked", sanitised_output=None + ) + + agent = FireflyAgent( + "og", + model=_approval_model(), + tools=[delete_db], + middleware=[OutputGuardMiddleware(guard=_BlockAll())], + default_middleware=False, + auto_register=False, + ) + paused = await agent.run("delete prod") + assert is_deferred(paused) # the block-all guard did NOT fire on the paused output + + plain = FireflyAgent( + "og-plain", + model=FunctionModel(lambda m, i: ModelResponse(parts=[TextPart("hi")])), + middleware=[OutputGuardMiddleware(guard=_BlockAll())], + default_middleware=False, + auto_register=False, + ) + with pytest.raises(OutputGuardError): + await plain.run("x") + + +@pytest.mark.asyncio +async def test_explainability_middleware_records_paused_marker(): + records: list[tuple[str, dict[str, Any]]] = [] + + class _Recorder: + def record(self, kind: str, **kw: Any) -> None: + records.append((kind, kw)) + + agent = FireflyAgent( + "expl", + model=_approval_model(), + tools=[delete_db], + middleware=[ExplainabilityMiddleware(recorder=_Recorder())], + default_middleware=False, + auto_register=False, + ) + paused = await agent.run("delete prod") + assert is_deferred(paused) + summaries = [kw.get("output_summary") for _, kw in records] + assert "" in summaries + + +@pytest.mark.asyncio +async def test_logging_middleware_logs_paused_branch(caplog: pytest.LogCaptureFixture): + agent = FireflyAgent("logp", model=_approval_model(), tools=[delete_db], auto_register=False) + with caplog.at_level(logging.INFO, logger="fireflyframework_agentic.agents.builtin_middleware"): + paused = await agent.run("delete prod") + assert is_deferred(paused) + assert "paused" in caplog.text and "awaiting tool approval" in caplog.text + assert "completed in" not in caplog.text # the normal completion line is NOT emitted + + +@pytest.mark.asyncio +async def test_multiple_pending_approvals_in_one_pause(): + a = _RecordingTool("del_a") + b = _RecordingTool("del_b") + model = _model_calling(("del_a", {"name": "A"}, "ca"), ("del_b", {"name": "B"}, "cb")) + agent = FireflyAgent("multi", model=model, tools=[a, b], auto_register=False) + + paused = await agent.run("delete both") + assert is_deferred(paused) + ids = {c.tool_name: c.tool_call_id for c in paused.output.approvals} + assert set(ids) == {"del_a", "del_b"} + + resumed = await agent.run( + None, + message_history=paused.all_messages(), + deferred_tool_results=DeferredToolResults( + approvals={ids["del_a"]: True, ids["del_b"]: ToolDenied(message="no")} + ), + ) + assert not is_deferred(resumed) + assert a.ran_with == ["A"] # approved ran + assert b.ran_with == [] # denied did not run + + +@pytest.mark.asyncio +async def test_approval_handler_returning_none_still_pauses(): + def decline(ctx: Any, requests: DeferredToolRequests) -> None: + return None # decline to resolve inline -> the run pauses normally + + agent = FireflyAgent( + "handler-none", model=_approval_model(), tools=[delete_db], approval_handler=decline, auto_register=False + ) + paused = await agent.run("delete prod") + assert is_deferred(paused) + call_id = paused.output.approvals[0].tool_call_id + resumed = await agent.run( + None, + message_history=paused.all_messages(), + deferred_tool_results=DeferredToolResults(approvals={call_id: True}), + ) + assert not is_deferred(resumed) + + +def test_raw_pydantic_tool_requires_approval_detected(): + async def fn(x: str) -> str: + return x + + agent = FireflyAgent("raw", model=_approval_model(), tools=[Tool(fn, requires_approval=True)], auto_register=False) + assert agent._hitl_enabled is True + + +@pytest.mark.asyncio +async def test_resume_with_memory_does_not_double_inject(): + mem = MemoryManager() + agent = FireflyAgent("memr", model=_approval_model(), tools=[delete_db], memory=mem, auto_register=False) + paused = await agent.run("delete prod", conversation_id="c1") + assert is_deferred(paused) + # Resume with an explicit message_history and NO conversation_id: memory injection is + # skipped (mutually exclusive), so prior messages are not double-injected and it completes. + resumed = await agent.run( + None, + message_history=paused.all_messages(), + deferred_tool_results=DeferredToolResults(approvals={paused.output.approvals[0].tool_call_id: True}), + ) + assert not is_deferred(resumed) + assert resumed.output == "done" diff --git a/tests/unit/tools/test_requires_approval.py b/tests/unit/tools/test_requires_approval.py new file mode 100644 index 00000000..ce971049 --- /dev/null +++ b/tests/unit/tools/test_requires_approval.py @@ -0,0 +1,112 @@ +# Copyright 2026 Firefly Software Foundation +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""SP-3 tool layer: ``requires_approval`` threading + guard-denial taxonomy. + +A tool's ``requires_approval`` flag must reach the native ``pydantic_ai.Tool`` +on every conversion path (direct ``tools=``, ``ToolKit.as_pydantic_tools``, +``ToolKit.as_toolset``). Guard denials now raise ``ToolGuardError``. The legacy +``ApprovalGuard`` (bespoke approval via ``ToolError``) is removed in favour of +native deferred-tools. +""" + +from __future__ import annotations + +from typing import Any + +import pytest + +from fireflyframework_agentic.exceptions import ToolError, ToolGuardError +from fireflyframework_agentic.tools.base import BaseTool, GuardResult, ParameterSpec +from fireflyframework_agentic.tools.decorators import firefly_tool +from fireflyframework_agentic.tools.toolkit import ToolKit + + +class _Tool(BaseTool): + def __init__(self, *, requires_approval: bool = False, guards: Any = ()) -> None: + super().__init__( + "danger", + description="a dangerous tool", + parameters=[ParameterSpec(name="q", python_type=str)], + guards=guards, + requires_approval=requires_approval, + ) + + async def _execute(self, **kwargs: Any) -> Any: + return "ran" + + +def test_base_tool_requires_approval_property(): + assert _Tool().requires_approval is False + assert _Tool(requires_approval=True).requires_approval is True + + +def test_firefly_tool_requires_approval_flag(): + @firefly_tool("t", requires_approval=True, auto_register=False) + async def t(x: str) -> str: + return x + + assert t.requires_approval is True + + @firefly_tool("t2", auto_register=False) + async def t2(x: str) -> str: + return x + + assert t2.requires_approval is False + + +def test_toolkit_as_pydantic_tools_threads_requires_approval(): + # Distinguish the two tools by giving them different names. + a = _Tool(requires_approval=True) + a._name = "a" + b = _Tool(requires_approval=False) + b._name = "b" + kit = ToolKit("k", [a, b]) + + pyd = {t.name: t for t in kit.as_pydantic_tools()} + assert pyd["a"].requires_approval is True + assert pyd["b"].requires_approval is False + + +def test_toolkit_as_toolset_threads_requires_approval(): + a = _Tool(requires_approval=True) + a._name = "a" + b = _Tool(requires_approval=False) + b._name = "b" + toolset = ToolKit("k", [a, b]).as_toolset() + + assert toolset.tools["a"].requires_approval is True + assert toolset.tools["b"].requires_approval is False + + +@pytest.mark.asyncio +async def test_guard_denial_raises_tool_guard_error(): + class _Deny: + async def check(self, tool_name: str, kwargs: dict[str, Any]) -> GuardResult: + return GuardResult(passed=False, reason="nope") + + tool = _Tool(guards=[_Deny()]) + with pytest.raises(ToolGuardError) as exc: + await tool.execute(q="x") + # ToolGuardError remains a ToolError subclass — existing `except ToolError` still catches it. + assert isinstance(exc.value, ToolError) + + +def test_legacy_approval_guard_is_removed(): + import fireflyframework_agentic.tools as tools_pkg + + assert not hasattr(tools_pkg, "ApprovalGuard") + # Native replacements are exported instead. + for name in ("ApprovalRequired", "ApprovalRequiredToolset", "DeferredToolRequests", "DeferredToolResults"): + assert hasattr(tools_pkg, name) diff --git a/uv.lock b/uv.lock index c77e3b2b..44d67b0a 100644 --- a/uv.lock +++ b/uv.lock @@ -1148,7 +1148,7 @@ wheels = [ [[package]] name = "fireflyframework-agentic" -version = "26.6.10" +version = "26.6.11" source = { editable = "." } dependencies = [ { name = "genai-prices" },