design: Add 0002-isolated-state proposal #551
# Isolated State

**Status**: Proposed

**Date**: 2026-02-16

**Issue**: N/A

## Context

Today, the `Agent` class stores all mutable per-invocation state as instance fields. A few examples include:

- `messages` — conversation history
- `state` (`AgentState`) — user-facing key-value state
- `event_loop_metrics` — token usage and performance metrics
- `trace_span` — the current OpenTelemetry trace span
- `_interrupt_state` — interrupt tracking

Because this state lives directly on the agent instance, two concurrent invocations would corrupt each other's data. The SDK prevents this with a `threading.Lock` that raises `ConcurrencyException` if a second call arrives while the first is still running:

```python
# From agent.py stream_async
acquired = self._invocation_lock.acquire(blocking=False)
if not acquired:
    raise ConcurrencyException(
        "Agent is already processing a request. Concurrent invocations are not supported."
    )
```
### The problem in practice

A simple concurrent use case fails today:

```python
import asyncio
from strands import Agent

agent = Agent(system_prompt="You are a helpful assistant.")

async def main():
    # This raises ConcurrencyException on the second call
    results = await asyncio.gather(
        agent.invoke_async("Summarize the Python GIL"),
        agent.invoke_async("Summarize the Rust borrow checker"),
    )

asyncio.run(main())
```

> **Contributor:** What is the expected behavior here, though? I can think of multiple ways this can/should be handled.
>
> **Member (Author):** The expected behavior is that this `invoke_async` is completely isolated from the second call below. They each interact with different state. For example, each call is editing a separate `messages` array. They do, however, share the same configuration (e.g., the model provider, system prompt, etc.).
>
> **Contributor** (on lines +43 to +44): Is this something we want to encourage? Would you say it's an anti-pattern? Should we encourage folks to just have a func that returns a new instance of their defined agent?
>
> **Contributor:** I see you address it in the "workaround" section :)
### The workaround is verbose and limiting

To get around this today, users must create separate agent instances:

```python
import asyncio
from strands import Agent

def make_agent():
    return Agent(
        model=my_model,
        tools=[tool_a, tool_b],
        system_prompt="You are a helpful assistant.",
    )

async def main():
    results = await asyncio.gather(
        make_agent().invoke_async("Summarize the Python GIL"),
        make_agent().invoke_async("Summarize the Rust borrow checker"),
    )

asyncio.run(main())
```

This works for simple scripts, but breaks down anywhere a function accepts an agent instance directly. The factory-function pattern can't help when the caller expects a pre-configured agent. `Graph.add_node` is one example — it takes an agent instance, and it validates that each node has a unique instance:

> **Contributor:** Also, this is a different pattern compared to the problem described above: the second invocation is unaware of the first invocation because they are different agents.
```python
# From graph.py _validate_node_executor
if id(executor) in seen_instances:
    raise ValueError("Duplicate node instance detected. Each node must have a unique object instance.")
```

> **Contributor:** Graph-wise, is it valid? Since we can always revisit a node.

If you have a generic agent (e.g., a summarizer) that you want to reuse across multiple graph nodes, you can't. You must create separate instances with identical configuration:

> **Contributor:** Can't we just remove that requirement? Why do we have it?
>
> **Member (Author):** Nodes can execute in parallel in `Graph`. If two nodes running in parallel share the same agent instance, a concurrency runtime error is raised.
```python
from strands import Agent
from strands.multiagent.graph import GraphBuilder

summarizer_config = dict(
    model=my_model,
    tools=[summarize_tool],
    system_prompt="You are a summarizer.",
)

graph = GraphBuilder()
# Must create separate instances even though they're identical
graph.add_node(Agent(**summarizer_config), node_id="summarize_a")
graph.add_node(Agent(**summarizer_config), node_id="summarize_b")
```

This goes against the SDK's goal of building agents in just a few lines of code.

> **Contributor:** I think this is a pretty weak use case. @zastrowm mentioned that a team had one. But introducing complexity to replace a factory pattern essentially needs a better reason. I know you call this out above, where `Graph` won't work, but this comes back to my confusion around the earlier statement. I would want a strong use case to justify the concurrent case.
### State reset is fragile

Any code that needs to reset an agent to a clean state must manually reach into its internals and know which fields to clear. This is error-prone — if the agent gains new stateful fields in the future, every reset site must be updated, or it silently leaks state between executions.

> **Contributor:** I think this section ignores the simpler answer, where we simply make the property private and add methods to make operating on them easier. I'd want to see pros and cons.

The graph implementation is a good example of this:

> **Contributor:** Nit: the graph implementation could be much better. I'd say we should rethink that before changing `Agent`.
```python
# From graph.py GraphNode.reset_executor_state
def reset_executor_state(self) -> None:
    if hasattr(self.executor, "messages"):
        self.executor.messages = copy.deepcopy(self._initial_messages)

    if hasattr(self.executor, "state"):
        self.executor.state = AgentState(self._initial_state.get())

    self.execution_status = Status.PENDING
    self.result = None
```

It deep-copies initial state at construction time and manually resets specific fields. This pattern would need to be replicated anywhere else that needs to reset agent state.
## Decision

Consider making `Agent` stateless by extracting all per-invocation mutable state into an isolated state object, managed through a session manager and keyed by an invocation key.

> **Contributor:** Making the agent stateless also enables durable orchestrators.

### Isolated invocation state

> **Member:** This proposal sounds like we are changing the agent into more of an "executor" pattern. I know that lots of folks get confused by this; they think initializing an agent is heavy. The idea of separating state from the agent makes sense, and helps with this confusion. I also think an "AgentProvider" might help to do the same thing, and might involve less change to the existing `Agent` class.
One approach would be to move all mutable state out of the agent instance and into a per-invocation state object:

> **Contributor:** System prompts, tools, models, etc. are also mutable. Considering the meta-agent concept for multi-agent systems and context management, I'd argue the line between mutable and immutable is not as clear.

```python
class InvocationState:
    """All mutable state for a single agent invocation."""
    messages: Messages
    agent_state: AgentState
    event_loop_metrics: EventLoopMetrics
    trace_span: trace_api.Span | None
    interrupt_state: _InterruptState
    ...
```

> **Contributor** (on lines +132 to +134): How are messages going to be per invocation? Given that we need to pass the entire message history to the LLM, would we bundle all messages from all invocation states?

The agent instance would retain only configuration: model, tools, system prompt, hooks, callback handler, conversation manager, etc. In the future, configuration could also be extracted into its own isolated object to allow per-invocation overrides, but this document focuses on invocation state to highlight the core problem and start the discussion.

> **Contributor:** What if one invocation modifies tools while another is running?
>
> **Contributor:** I can get on board with this, but I think it is an all-or-nothing problem. Either everything is tied to an invocation, or we cannot reasonably handle edge cases and must keep a locking mechanism. As agents become more and more autonomous and begin modifying themselves, I think having only partial coverage will be painful.
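As a rough sketch of this configuration/state split, the shape might look like the following. All class and field names here are illustrative assumptions, not the SDK's actual API:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class AgentConfig:
    # Immutable configuration retained on the agent instance (assumed fields).
    system_prompt: str
    model_id: str
    tool_names: tuple = ()

@dataclass
class InvocationState:
    # Mutable per-invocation state: one object per invocation key (assumed fields).
    messages: list = field(default_factory=list)
    metrics: dict = field(default_factory=dict)

# One shared config, two independent invocation states.
config = AgentConfig(system_prompt="You are a helpful assistant.", model_id="my-model")
state_a = InvocationState()
state_b = InvocationState()

state_a.messages.append({"role": "user", "content": "hi"})
assert state_b.messages == []  # states are independent; config is shared
```

The frozen dataclass makes accidental mutation of shared configuration an error, which sidesteps the "what if one invocation modifies tools" concern for anything that lives on the config side of the split.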
### Session manager provides state

> **Contributor:** This is essentially snapshots? How does this compare to the snapshot proposal?
>
> **Contributor:** Can we make it more pluggable? It doesn't have to be coupled with the session manager.

At invocation time, the agent could read state from a session manager using an invocation key:

> **Contributor** (on lines +146 to +147): Would state be an optional feature? Since session management is optional, how would we handle this?
```python
# Pseudo-code for agent.stream_async
async def stream_async(self, prompt, *, invocation_key=None, **kwargs):
    # Resolve the invocation key
    key = invocation_key or self._default_invocation_key

    # Load isolated state from session manager
    invocation_state = await self.session_manager.load(key)

    # Run the event loop against the isolated state (not self)
    async for event in self._run_loop(invocation_state, prompt, **kwargs):
        yield event

    # Persist state back
    await self.session_manager.save(key, invocation_state)
```

Because each invocation would operate on its own state object, there would be no shared mutable state on the agent. The `threading.Lock` and `ConcurrencyException` would no longer be needed.

> **Member:** I'm not sure this is true; the reason we added the exception was because folks were unaware, but usually they want one of two behaviors: … What the exception was solving was: … In the cases of (1) and (2), we would still need locking and an exception, no?
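The load → run → save flow in the pseudo-code above can be exercised with a toy in-memory store. Everything here is a hypothetical sketch, not the SDK's real classes:

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class InvocationState:
    messages: list = field(default_factory=list)

class InMemoryStore:
    """Toy stand-in for a session manager (hypothetical)."""

    def __init__(self) -> None:
        self._store: dict[str, InvocationState] = {}

    async def load(self, key: str) -> InvocationState:
        # A missing key gets a fresh, empty state.
        return self._store.setdefault(key, InvocationState())

    async def save(self, key: str, state: InvocationState) -> None:
        self._store[key] = state

async def invoke(store: InMemoryStore, key: str, prompt: str) -> int:
    # Load isolated state, mutate only it, persist it back.
    state = await store.load(key)
    state.messages.append({"role": "user", "content": prompt})
    await store.save(key, state)
    return len(state.messages)

async def main() -> None:
    store = InMemoryStore()
    # Two concurrent "invocations" with distinct keys never touch shared state.
    a, b = await asyncio.gather(
        invoke(store, "task-1", "hello"),
        invoke(store, "task-2", "world"),
    )
    assert a == 1 and b == 1  # each key saw only its own message

asyncio.run(main())
```

Because each coroutine mutates a different `InvocationState`, no lock is required for the keyed case; the open question in the review comment above is what should happen when two calls race on the *same* key.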
### Default behavior and backwards compatibility

One idea is to introduce a default in-memory session manager. Each agent instance would get a default invocation key that is stable across calls:

```python
class InMemorySessionManager(SessionManager):
    """Stores state in memory, keyed by invocation key."""

    def __init__(self):
        self._store: dict[str, InvocationState] = {}

    async def load(self, key: str) -> InvocationState:
        if key not in self._store:
            self._store[key] = InvocationState()
        return self._store[key]

    async def save(self, key: str, state: InvocationState) -> None:
        self._store[key] = state
```

When no invocation key is supplied, the agent would use a default key tied to the instance. This would mean:

- Sequential calls accumulate conversation history, just like today.
- A single agent instance with no invocation key behaves identically to the current implementation.
- No code changes are required for existing users who interact with the agent through `__call__` or `invoke_async`.
However, code that directly accesses instance fields like `agent.messages` or `agent.state` would be affected. These fields would no longer live on the agent instance, so existing patterns like `print(agent.messages)` or `agent.state["key"] = value` would need to change. This is the primary backwards compatibility concern and is discussed further in the Consequences section.
### Concurrent usage with invocation keys

Users who want concurrency could supply distinct invocation keys:

```python
import asyncio
from strands import Agent

agent = Agent(system_prompt="You are a helpful assistant.")

async def main():
    results = await asyncio.gather(
        agent.invoke_async("Summarize the Python GIL", invocation_key="task-1"),
        agent.invoke_async("Summarize the Rust borrow checker", invocation_key="task-2"),
    )

asyncio.run(main())
```

> **Contributor:** Assume a case of AgentCore being invoked multiple times. What makes the invocation key different?
>
> **Member:** Going to +1 this: the majority of the use cases that led to the exception were misuse of the same user/conversation. It was not forking. There is a learning/concept/documentation problem with Strands where we don't make it clear that Agent = Conversation, but this is not solving the same problem as the exception.

Each key would get its own isolated messages, agent state, metrics, and trace span. No lock contention, no `ConcurrencyException`.
### Multi-agent patterns could become simpler

With isolated state, multi-agent patterns that reuse the same agent instance become possible. For example, graph nodes could share a single agent and rely on unique invocation keys per execution:

> **Contributor:** Why aren't they possible today?

```python
from strands import Agent
from strands.multiagent.graph import GraphBuilder

summarizer = Agent(
    model=my_model,
    tools=[summarize_tool],
    system_prompt="You are a summarizer.",
)

graph = GraphBuilder()
# Same instance, different invocation keys per execution
graph.add_node(summarizer, node_id="summarize_a")
graph.add_node(summarizer, node_id="summarize_b")
```

The `_validate_node_executor` duplicate-instance check would no longer be needed. `GraphNode.reset_executor_state` could be removed — each execution would start with a fresh invocation state loaded from the session manager. No more deep-copying initial state, no more manually resetting fields, and no risk of missing new stateful fields in the future.
## Developer Experience

### Basic usage (unchanged)

```python
from strands import Agent

agent = Agent(system_prompt="You are a helpful assistant.")
result = agent("Hello!")     # Uses default invocation key
result = agent("Follow up")  # Same key, conversation continues
```

### Concurrent usage

```python
import asyncio
from strands import Agent

agent = Agent(system_prompt="You are a helpful assistant.")

async def handle_request(user_id: str, message: str):
    return await agent.invoke_async(message, invocation_key=user_id)

async def main():
    results = await asyncio.gather(
        handle_request("user-1", "What is Python?"),
        handle_request("user-2", "What is Rust?"),
    )

asyncio.run(main())
```
### State reset

Rather than reaching into agent internals:

```python
# Today: manually reset individual fields
agent.messages = []
agent.state = AgentState()
```

State could be cleared through the session manager:

```python
# Proposed: clear state for a given invocation key
await agent.session_manager.clear(invocation_key)
```
## Consequences

### What could become easier

- Concurrent agent usage with a single instance
- Resetting or clearing agent state without reaching into internals
- Adding new stateful fields without updating reset logic in graph or other consumers
- Serving multiple users/conversations from a single agent instance

### What could become harder or change

- Internal code that currently reads `self.messages` or `self.state` would need to be updated to read from the invocation state object
  - For example, hook callbacks that receive the agent and access `agent.messages` would need to be adapted
- The session manager becomes a required concept (though a default in-memory implementation could make it invisible for simple use cases)
- The `threading.Lock` and `ConcurrencyException` would be removed, which means users who relied on the exception as a signal would need to adapt
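As a hedged illustration of the hook-adaptation point, one option is for hook events to carry the isolated state explicitly rather than relying on `event.agent.messages`. The event and field names below are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class InvocationState:
    messages: list = field(default_factory=list)

@dataclass
class MessageAddedEvent:
    # Hypothetical event shape: the isolated state travels on the event
    # instead of being read off the shared agent instance.
    agent: object
    invocation_state: InvocationState

def on_message_added(event: MessageAddedEvent) -> int:
    # Before: history = event.agent.messages
    history = event.invocation_state.messages
    return len(history)

state = InvocationState(messages=[{"role": "user", "content": "hi"}])
event = MessageAddedEvent(agent=None, invocation_state=state)
assert on_message_added(event) == 1
```

Passing the state on the event keeps hooks working under concurrency, since each callback sees exactly the invocation that triggered it.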
### Backwards compatibility is the biggest concern

> **Contributor:** Yes!!

Today, users directly read and write instance fields like `agent.messages` and `agent.state`. Moving these into an isolated invocation state object would break that public API surface. Community tools, custom hooks, and user code that accesses these fields would all need updating. Providing a smooth migration path — whether through proxy accessors, a compatibility layer, or clear deprecation — is the most significant challenge with this proposal.

Given the scope of this change, it may be worth considering this as part of a v2 of the Python SDK rather than attempting it as a backwards-compatible evolution of v1.
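One possible shape for the proxy-accessor option is a property that forwards `agent.messages` to the default invocation's state. This is a minimal sketch with all internal names assumed, not the SDK's real implementation:

```python
from dataclasses import dataclass, field

@dataclass
class InvocationState:
    messages: list = field(default_factory=list)

class Agent:
    """Sketch of proxy accessors keeping `agent.messages` working."""

    def __init__(self) -> None:
        # Stand-in for a session manager, keyed by invocation key.
        self._sessions: dict[str, InvocationState] = {}
        self._default_key = "default"

    def _state(self) -> InvocationState:
        return self._sessions.setdefault(self._default_key, InvocationState())

    @property
    def messages(self) -> list:
        # Reads delegate to the default invocation's isolated state.
        return self._state().messages

    @messages.setter
    def messages(self, value: list) -> None:
        # Writes replace the default invocation's history wholesale.
        self._state().messages = value

agent = Agent()
agent.messages.append({"role": "user", "content": "hi"})
assert agent.messages == [{"role": "user", "content": "hi"}]
agent.messages = []  # the existing reset idiom keeps working
assert agent.messages == []
```

A shim like this preserves the single-instance, unkeyed API while the real state lives elsewhere; keyed, concurrent callers would bypass the property and address their own invocation state directly.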
> **Reviewer:** Overall thoughts: I am a bit hesitant on the proposal, mainly because of the backwards compatibility concern. I'm not sure the reward is worth the effort. Specific use cases would help here, like user scenarios where these things are broken.
>
> **Reviewer:** This would be great, because my understanding from those tickets is that they either don't want concurrency (in which case we want to throw errors), or they want cancellation/continuation with the next agent invocation. I don't think we have any truly concurrent use case. But I might be wrong.