fireflyframework · ancongui · Jun 22, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -7,6 +7,59 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 Copyright 2026 Firefly Software Foundation. Licensed under the Apache License 2.0.
 
+## [26.06.11] - 2026-06-22
+
+SP-3: human-in-the-loop tool approval rebased onto pydantic-ai's native deferred-tools protocol.
+
+### Added
+
+- **Native tool approval / HITL.** Tools can declare `requires_approval=True`
+  (`firefly_tool(...)`, `BaseTool`, and threaded through `ToolKit.as_pydantic_tools()`
+  / `as_toolset()`). When the model calls such a tool, the agent run **pauses before
+  executing it** and returns a `DeferredToolRequests` as `result.output`. Detect with
+  the new `is_deferred(result)` helper; **resume** via
+  `agent.run(message_history=..., deferred_tool_results=DeferredToolResults(approvals={call_id: True | ToolApproved(override_args=...) | ToolDenied(message=...)}))`.
+  `FireflyAgent` auto-detects HITL (any approval-requiring tool/`ToolKit`/`as_toolset()`,
+  or an `ApprovalRequiredToolset` in `toolsets`) and widens its output union to allow the
+  pause **only then** — non-HITL agents are unchanged. Force with `hitl=True`.
+- **Inline (non-pausing) approval.** `FireflyAgent(approval_handler=...)` resolves
+  approvals *inside* the run via a native `HandleDeferredToolCalls` capability — for
+  programmatic / policy-based auto-approval.
+- **Native re-exports** from `fireflyframework_agentic.tools`: `DeferredToolRequests`,
+  `DeferredToolResults`, `ToolApproved`, `ToolDenied`, `ApprovalRequired` (plus the
+  already-exported `ApprovalRequiredToolset`). `is_deferred` and the `ApprovalHandler`
+  type are exported from `fireflyframework_agentic.agents`.
+
+### Changed
+
+- Post-run cross-cutting code now treats a paused run as a control object, not a final
+  answer: `_persist_memory`, the output-guard, validation, cache, logging, and
+  explainability middleware all **skip** a `DeferredToolRequests` output (preventing
+  corrupted memory turns, spurious `OutputGuardError`/`OutputReviewError`, and caching a
+  pause).
+- Tool **guard denials** (validation / rate-limit / sandbox) now raise `ToolGuardError`
+  instead of a plain `ToolError`. `ToolGuardError` subclasses `ToolError`, so existing
+  `except ToolError` handlers are unaffected.
+- `BaseTool._guarded_execute` now lets pydantic-ai's `ApprovalRequired` / `CallDeferred`
+  control signals propagate untouched (like `ModelRetry`), instead of wrapping them as
+  `ToolError`. This makes **dynamic** approval work — a tool body (with `takes_ctx=True`)
+  may `raise ApprovalRequired(metadata=...)` to defer that specific call; pair with
+  `FireflyAgent(hitl=True)` so the output union allows the pause.
+
+### Removed (breaking)
+
+- **`ApprovalGuard`** (and the `ApprovalCallback` alias). The bespoke guard-chain approval
+  (sync bool callback → `ToolError` on denial, no pause/resume/metadata) is replaced by the
+  native protocol above. Migration: `docs/migration.md` §6.
+
+### Notes
+
+- HITL stays three distinct layers by design: tool approval (native deferred-tools, agent
+  layer), workflow `human()` / `WorkflowInterrupt` (journal-replay), and pipeline `Pause`
+  / `approve_pause` (checkpoint). They are not collapsed.
+- Validated against a live Anthropic model: a `requires_approval` tool pauses the real run
+  (tool body does not execute), and resuming with approval runs it exactly once.
+
 ## [26.06.10] - 2026-06-22
 
 SP-5: native structured-output modes for reasoning patterns.

diff --git a/README.md b/README.md
@@ -156,9 +156,14 @@ create your own components; the framework discovers them via duck typing.
 
 - **Tools** — `ToolProtocol` (duck-typed) and `BaseTool` (inheritance) let you choose
   your extensibility style. `ToolBuilder` provides a fluent API for building tools
-  without subclassing. Five guard types (`ValidationGuard`, `RateLimitGuard`,
-  `ApprovalGuard`, `SandboxGuard`, `CompositeGuard`) intercept calls before execution.
-  Three composition patterns (`SequentialComposer`, `FallbackComposer`,
+  without subclassing. Four guard types (`ValidationGuard`, `RateLimitGuard`,
+  `SandboxGuard`, `CompositeGuard`) intercept calls before execution (a guard that
+  rejects a call raises `ToolGuardError`). For **human-in-the-loop**, mark a tool `requires_approval=True`:
+  the agent run **pauses** before executing it and returns a `DeferredToolRequests`
+  (detected via `is_deferred(result)`), which you resume with `deferred_tool_results=` —
+  approving (`ToolApproved`), denying (`ToolDenied`), or auto-deciding inline via an
+  `approval_handler=`. The native deferred-tools types are re-exported from
+  `fireflyframework_agentic.tools`. Three composition patterns (`SequentialComposer`, `FallbackComposer`,
   `ConditionalComposer`) build higher-order tools. `ToolKit` groups tools for
   bulk registration. Nine built-in tools (calculator, datetime, filesystem, HTTP,
   JSON, search, shell, text, database) are ready to attach to any agent.
@@ -177,7 +182,11 @@ create your own components; the framework discovers them via duck typing.
   **Reflexion** (execute → critique → retry), **Tree of Thoughts** (branch →
   evaluate → select), and **Goal Decomposition** (goal → phases → tasks).
   All produce structured `ReasoningResult` with `ReasoningTrace`. Prompts are
-  slot-overridable. `OutputReviewer` can validate final outputs. `ReasoningPipeline`
+  slot-overridable. Each pattern's structured output is wrapped in a pydantic-ai
+  output mode — selected per-pattern via `output_mode=` or framework-wide via the
+  `reasoning_output_mode` config — `"tool"` (`ToolOutput`), `"native"` (provider
+  structured output), or `"prompted"` (`PromptedOutput`, portable to any model).
+  `OutputReviewer` can validate final outputs. `ReasoningPipeline`
   chains patterns sequentially.
 
 <p align="center">
@@ -536,6 +545,12 @@ async def lookup(query: str) -> str:
     return f"Result for {query}"
 ```
 
+> **Human-in-the-loop:** mark a tool `@firefly_tool(name=..., requires_approval=True)` and the
+> agent run **pauses** before executing it — `run()` returns a `DeferredToolRequests`
+> (detect with `is_deferred(result)`). Resume with
+> `agent.run(message_history=paused.all_messages(), deferred_tool_results=DeferredToolResults(approvals={call_id: True}))`.
+> Full detail in [docs/tools.md](docs/tools.md#human-in-the-loop-tool-approval).
+
 ### 4. Add Memory for Multi-Turn Conversations
 
 ```python
@@ -713,11 +728,11 @@ content processing, validation, explainability, and pipelines.
 Detailed guides for each module:
 
 - [Architecture](docs/architecture.md) — Design principles and layer diagram
-- [Agents](docs/agents.md) — Lifecycle, registry, delegation, decorators
+- [Agents](docs/agents.md) — Lifecycle, registry, delegation, decorators, human-in-the-loop approval
 - [Template Agents](docs/templates.md) — Summarizer, classifier, extractor, conversational, router
-- [Tools](docs/tools.md) — Protocol, builder, guards, composition, built-ins
+- [Tools](docs/tools.md) — Protocol, builder, guards, composition, built-ins, native HITL approval (`requires_approval`, deferred resume)
 - [Prompts](docs/prompts.md) — Templates, versioning, composition, validation
-- [Reasoning Patterns](docs/reasoning.md) — 6 patterns, structured outputs, custom patterns
+- [Reasoning Patterns](docs/reasoning.md) — 6 patterns, structured outputs, output modes (`output_mode`/`reasoning_output_mode`), custom patterns
 - [Content](docs/content.md) — Chunking, compression, batch processing
 - [Memory](docs/memory.md) — Conversation history, working memory, storage backends
 - [Validation](docs/validation.md) — Rules, QoS guards, output reviewer

diff --git a/docs/README.md b/docs/README.md
@@ -43,7 +43,7 @@ below it, keeping the dependency graph acyclic and each module independently tes
 |---|---|
 | **[Agents](agents.md)** | `FireflyAgent`, `AgentRegistry`, `AgentLifecycle`, `@firefly_agent` decorator, middleware stack (`AgentMiddleware`, `MiddlewareChain`, `Logging`/`PromptGuard`/`CostGuard`/`Observability`/`Explainability`/`Cache`/`OutputGuard`/`Validation`/`Retry`/`PromptCache` middleware), 7 delegation strategies (round-robin, capability, content-based, cost-aware, chain, fallback, weighted), `FallbackModelWrapper` / `run_with_fallback`, `ResultCache` |
 | **[Template Agents](templates.md)** | Five factory functions: summarizer, classifier, extractor, conversational, router |
-| **[Tools](tools.md)** | `ToolProtocol`, `BaseTool`, `ToolBuilder`, guards, composition, caching, 9 built-in tools; full-fidelity schemas via `ParameterSpec(python_type=…)`, `RunContext` opt-in (`takes_ctx`), `ToolKit.as_toolset()` + re-exported native combinators (`FilteredToolset`, `WrapperToolset`, `ApprovalRequiredToolset`, …) |
+| **[Tools](tools.md)** | `ToolProtocol`, `BaseTool`, `ToolBuilder`, guards, composition, caching, 9 built-in tools; full-fidelity schemas via `ParameterSpec(python_type=…)`, `RunContext` opt-in (`takes_ctx`), `ToolKit.as_toolset()` + re-exported native combinators (`FilteredToolset`, `WrapperToolset`, `ApprovalRequiredToolset`, …); human-in-the-loop tool approval (`requires_approval` / `is_deferred` / `deferred_tool_results` / `approval_handler`) |
 | **[Prompts](prompts.md)** | `PromptTemplate`, `PromptRegistry`, composers, validation, loaders |
 | **[Content](content.md)** | `TextChunker`, `MarkdownChunker`, `DocumentSplitter`, `ImageTiler`, `BatchProcessor`, compression; binary normalization (`content.binary`, `[binary]` extra: `BinaryNormalizer`, office/PDF/image/archive/email converters) |
 | **[Memory](memory.md)** | `ConversationMemory`, `WorkingMemory`, `MemoryManager`, `InMemoryStore` / `FileStore` / `SQLiteStore` backends, `MemoryScope`, LLM summarisation |

diff --git a/docs/agents.md b/docs/agents.md
@@ -137,6 +137,34 @@ agent = FireflyAgent(
 
 ---
 
+## Human-in-the-Loop Tool Approval
+
+When a tool declares `requires_approval=True` (or an `ApprovalRequiredToolset` gates the
+toolset), `run()` / `run_sync()` **pause before the tool executes** and return a
+`DeferredToolRequests` as `result.output`. Detect this with `is_deferred(result)`, then
+resume by calling the agent again with the paused messages and the human's decision:
+
+```python
+from fireflyframework_agentic.agents import FireflyAgent, is_deferred
+from fireflyframework_agentic.tools import DeferredToolResults
+
+result = await agent.run("Delete record 42.")
+if is_deferred(result):
+    approvals = {c.tool_call_id: True for c in result.output.approvals}  # True / ToolApproved / ToolDenied
+    result = await agent.run(
+        message_history=result.all_messages(),
+        deferred_tool_results=DeferredToolResults(approvals=approvals),
+    )
+```
+
+`FireflyAgent` auto-detects HITL from approval-requiring tools/toolsets (widening its output
+union only then); force it with `hitl=True`, or resolve approvals inline without pausing via
+`approval_handler=`. Post-run middleware (output guard, validation, cache) and memory all skip
+a paused result — it is a control object, not a final answer. See
+[Human-in-the-Loop Tool Approval](tools.md#human-in-the-loop-tool-approval) for the full guide.
+
+---
+
 ## Agent Registry
 
 The `AgentRegistry` is a singleton that maps agent names to `FireflyAgent` instances.

diff --git a/docs/architecture.md b/docs/architecture.md
@@ -61,7 +61,7 @@ graph TD
 
     subgraph Agent Layer
         AGT["Agents<br/><small>FireflyAgent · AgentRegistry<br/>DelegationRouter · AgentLifecycle<br/>@firefly_agent · 5 templates · 11 middleware<br/>7 delegation strategies · FallbackModelWrapper<br/>ResultCache · run timeout</small>"]
-        TOOLS["Tools<br/><small>BaseTool · ToolBuilder · ToolKit · CachedTool<br/>5 guards · 3 composers · tool timeout<br/>ToolRegistry · 9 built-ins</small>"]
+        TOOLS["Tools<br/><small>BaseTool · ToolBuilder · ToolKit · CachedTool<br/>4 guards · 3 composers · tool timeout · HITL approval<br/>ToolRegistry · 9 built-ins</small>"]
         PROMPTS["Prompts<br/><small>PromptTemplate · PromptRegistry<br/>3 composers · PromptValidator<br/>PromptLoader</small>"]
         CONTENT["Content<br/><small>TextChunker · DocumentSplitter · MarkdownChunker<br/>ImageTiler · BatchProcessor<br/>ContextCompressor · SlidingWindowManager<br/>content.binary (BinaryNormalizer · office converters)</small>"]
         MEM["Memory<br/><small>MemoryManager · ConversationMemory<br/>WorkingMemory · TokenEstimator<br/>InMemoryStore · FileStore · SQLiteStore<br/>summarization · create_llm_summarizer<br/>export/import · async wrappers</small>"]
@@ -166,7 +166,6 @@ classDiagram
     ToolProtocol <|.. ConditionalComposer
     GuardProtocol <|.. ValidationGuard
     GuardProtocol <|.. RateLimitGuard
-    GuardProtocol <|.. ApprovalGuard
     GuardProtocol <|.. SandboxGuard
     GuardProtocol <|.. CompositeGuard
     ReasoningPattern <|.. AbstractReasoningPattern
@@ -215,7 +214,7 @@ system. Every other module depends on at least one Core component.
   from environment variables and `.env` files. It actively rejects removed serving/exposure
   config fields (e.g. `otlp_endpoint`, `rbac_enabled`, `cors_allowed_origins`,
   `cost_calculator`) with a `ValueError`.
-- **exceptions.py** -- A structured exception hierarchy of 34 classes rooted at
+- **exceptions.py** -- A structured exception hierarchy of 42 classes rooted at
   `FireflyAgenticError`.
 - **plugin.py** -- `PluginDiscovery` discovers and loads entry-point plugins at startup.
 - **resilience/circuit_breaker.py** -- `CircuitBreaker` (with `CircuitState` and

diff --git a/docs/migration.md b/docs/migration.md
@@ -134,6 +134,50 @@ await triage(args, runner=FireflyAgentRunner())   # both resolved from the regis
 
 ---
 
+## 6. `ApprovalGuard` removed — human-in-the-loop is now native (breaking)
+
+**Why.** Tool approval was a bespoke guard (`ApprovalGuard(callback)`) that ran inside
+Firefly's guard chain and raised `ToolError` on a denied call: a synchronous, all-or-nothing
+gate with no pause/resume, metadata, or per-call granularity, maintained in parallel with
+pydantic-ai's own deferred-tools protocol. It has been **removed** in favour of the native
+protocol (`requires_approval`, `DeferredToolRequests`/`DeferredToolResults`, `ApprovalRequired`,
+`ApprovalRequiredToolset`).
+
+```python
+# Before — guard that blocks the run on denial
+from fireflyframework_agentic.tools.guards import ApprovalGuard
+
+async def approve(tool_name, kwargs) -> bool:
+    return await ask_admin(tool_name, kwargs)
+
+@guarded(ApprovalGuard(callback=approve))
+@firefly_tool("delete_record", description="Delete a record")
+async def delete_record(record_id: str) -> str: ...
+
+# After — native: the run PAUSES for sign-off, then resumes
+from fireflyframework_agentic.agents import is_deferred
+from fireflyframework_agentic.tools import DeferredToolResults
+
+@firefly_tool("delete_record", description="Delete a record", requires_approval=True)
+async def delete_record(record_id: str) -> str: ...
+
+result = await agent.run("delete record 42")
+if is_deferred(result):
+    approvals = {c.tool_call_id: await ask_admin(c) for c in result.output.approvals}  # bool / ToolApproved / ToolDenied
+    result = await agent.run(message_history=result.all_messages(),
+                             deferred_tool_results=DeferredToolResults(approvals=approvals))
+```
+
+For the old **inline, non-pausing** behaviour (a callback decides programmatically), pass
+`FireflyAgent(approval_handler=...)` — wired as a native `HandleDeferredToolCalls` capability.
+See [Human-in-the-Loop Tool Approval](tools.md#human-in-the-loop-tool-approval).
+
+Also: guard denials (validation, rate-limit, sandbox) now raise `ToolGuardError` instead of a
+plain `ToolError`. `ToolGuardError` **subclasses** `ToolError`, so existing `except ToolError`
+handlers keep working.
+
+---
+
 ## Checklist
 
 - [ ] Replace every `type_annotation="..."` with `python_type=<real type>` in
@@ -144,3 +188,7 @@ await triage(args, runner=FireflyAgentRunner())   # both resolved from the regis
 - [ ] Review workflows for global cost/budget effects now that sub-agents run through
       `FireflyAgent`; pass `runner=DefaultAgentRunner()` if you want the old path.
 - [ ] Import toolset combinators / `RunContext` from `fireflyframework_agentic.tools`.
+- [ ] Replace `ApprovalGuard` with `requires_approval=True` + the native pause/resume flow
+      (`is_deferred()` + `deferred_tool_results=`), or an inline `approval_handler=`.
+- [ ] If you matched on `ToolError` from guard denials specifically, note it is now the
+      `ToolGuardError` subclass (still caught by `except ToolError`).
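
The last checklist item relies on ordinary Python exception subclassing. The sketch below illustrates why an existing `except ToolError` handler keeps working, and how new code can single out guard denials. The two exception classes are stand-ins that only mirror the names in this change; they are not the framework's definitions, and `call_guarded_tool`, `legacy_handler`, and `classify` are hypothetical helpers introduced for illustration.

```python
# Stand-in classes: hypothetical mirrors of the names used in the changelog,
# NOT the framework's actual definitions.
class ToolError(Exception):
    """Base error for tool failures (stand-in)."""


class ToolGuardError(ToolError):
    """Raised when a guard rejects a call (stand-in)."""


def call_guarded_tool() -> str:
    # Simulate a rate-limit guard rejecting the call (hypothetical helper).
    raise ToolGuardError("rate limit exceeded")


def legacy_handler(fn) -> str:
    """A pre-existing handler that only knows about ToolError."""
    try:
        fn()
    except ToolError as exc:  # also catches the ToolGuardError subclass
        return f"caught: {exc}"
    return "ok"


def classify(fn) -> str:
    """New code can match the subclass first to single out guard denials."""
    try:
        fn()
    except ToolGuardError:
        return "guard denial"
    except ToolError:
        return "tool failure"
    return "ok"
```

Order matters in `classify`: the more specific `except ToolGuardError` clause must come before `except ToolError`, otherwise the base-class clause would catch the denial first.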