Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
305 changes: 305 additions & 0 deletions designs/0002-isolated-state.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,305 @@
# Isolated State
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall thoughts:

  1. I'd like to understand more about how it relates to snapshots
  2. I'd like to work backwards from user experience. This doc tries to solve concurrency problem, but there are multiple ways a user might want to handle that. do we cover all?

and lastly, I am a bit hesitant on the proposal, mainly because of backwards compatibility concern. I'm not sure if the reward is worth the effort. Specific use cases I think would help here, like user scenarios where these things are broken

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Snapshot can help with simplifying the serializing and deserializing and so this proposal would then mainly target the concurrency issue.
  2. Graph is used as an example of user experience. I could also link to tickets raised by customers that led us to setting up the concurrency runtime error.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could also link to tickets raised by customers that led us to setting up the concurrency runtime error.

This would be great. because my understanding from those tickets is, they either don't want concurrency (in which case we want to throw errors), or they want cancellation/continuation with the next agent invocation. I don't think we have any truly concurrent use case. But I might be wrong


**Status**: Proposed

**Date**: 2026-02-16

**Issue**: N/A

## Context

Today, the `Agent` class stores all mutable per-invocation state as instance fields. A few examples include:

- `messages` — conversation history
- `state` (AgentState) — user-facing key-value state
- `event_loop_metrics` — token usage and performance metrics
- `trace_span` — the current OpenTelemetry trace span
- `_interrupt_state` — interrupt tracking

Because this state lives directly on the agent instance, two concurrent invocations would corrupt each other's data. The SDK prevents this with a `threading.Lock` that raises `ConcurrencyException` if a second call arrives while the first is still running:

```python
# From agent.py stream_async
acquired = self._invocation_lock.acquire(blocking=False)
if not acquired:
raise ConcurrencyException(
"Agent is already processing a request. Concurrent invocations are not supported."
)
```

### The problem in practice

A simple concurrent use case fails today:

```python
import asyncio
from strands import Agent

agent = Agent(system_prompt="You are a helpful assistant.")

async def main():
# This raises ConcurrencyException on the second call
results = await asyncio.gather(
agent.invoke_async("Summarize the Python GIL"),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the expected behavior here though? I can think of multiple ways this can/should be handled

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expected behavior is that this invoke_async is completely isolated from the second call below. They each interact with different state. For example, each call is editing a separate messages array. They however share the same configurations (e.g., the model provider, system prompt, etc.).

agent.invoke_async("Summarize the Rust borrow checker"),
Comment on lines +43 to +44
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this something we want to encourage? Would you say it's an anti-pattern? Should we encourage folks to just have a func that returns a new instance of their defined agent?

Copy link
Copy Markdown
Contributor

@afarntrog afarntrog Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you address in it "workaround" section :)

)

asyncio.run(main())
```

### The workaround is verbose and limiting

To get around this today, users must create separate agent instances:

```python
import asyncio
from strands import Agent

def make_agent():
return Agent(
model=my_model,
tools=[tool_a, tool_b],
system_prompt="You are a helpful assistant.",
)

async def main():
results = await asyncio.gather(
make_agent().invoke_async("Summarize the Python GIL"),
make_agent().invoke_async("Summarize the Rust borrow checker"),
)

asyncio.run(main())
```

This works for simple scripts, but breaks down anywhere a function accepts an agent instance directly. The factory-function pattern can't help when the caller expects a pre-configured agent. `Graph.add_node` is one example — it takes an agent instance, and it validates that each node has a unique instance:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this is a different pattern compared to the problem descirbed above. the second invocation is unaware of the first invocation because they are different agents.


```python
# From graph.py _validate_node_executor
if id(executor) in seen_instances:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Graph wise , it is valid? Since we can always revisit a Node

raise ValueError("Duplicate node instance detected. Each node must have a unique object instance.")
```

If you have a generic agent (e.g., a summarizer) that you want to reuse across multiple graph nodes, you can't. You must create separate instances with identical configuration:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't we just remove that requirement? why do we have it?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nodes can execute in parallel in Graph. If two nodes running in parallel share the same agent instance, a concurrency runtime error will raise.


```python
from strands import Agent
from strands.multiagent.graph import GraphBuilder

summarizer_config = dict(
model=my_model,
tools=[summarize_tool],
system_prompt="You are a summarizer.",
)

graph = GraphBuilder()
# Must create separate instances even though they're identical
graph.add_node(Agent(**summarizer_config), node_id="summarize_a")
graph.add_node(Agent(**summarizer_config), node_id="summarize_b")
```

This goes against the SDK's goal of building agents in just a few lines of code.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a pretty weak use case. @zastrowm mentioned that a team had one. But introducing complexity to replace a factory pattern which is essentially

def create_agent_instance() -> Agent:
   ...

needs a better reason than
This goes against the SDK's goal of building agents in just a few lines of code.

I know you call this out above where Graph wont work.

But this comes back to my confusion around the earlier statement two concurrent invocations

I would want a strong use case to justify the concurrent case


### State reset is fragile

Any code that needs to reset an agent to a clean state must manually reach into its internals and know which fields to clear. This is error-prone — if the agent gains new stateful fields in the future, every reset site must be updated or it silently leaks state between executions.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this section ignores the simpler answer where we simply make the property private and add methods to make operating on them easier. I'd want to see pros and cons


The graph implementation is a good example of this:
Copy link
Copy Markdown
Contributor

@mkmeral mkmeral Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: graph implementation could be much better. I'd say we should rethink that before changing agent


```python
# From graph.py GraphNode.reset_executor_state
def reset_executor_state(self) -> None:
if hasattr(self.executor, "messages"):
self.executor.messages = copy.deepcopy(self._initial_messages)

if hasattr(self.executor, "state"):
self.executor.state = AgentState(self._initial_state.get())

self.execution_status = Status.PENDING
self.result = None
```

It deep-copies initial state at construction time and manually resets specific fields. This pattern would need to be replicated anywhere else that needs to reset agent state.

## Decision

Consider making `Agent` stateless by extracting all per-invocation mutable state into an isolated state object, managed through a session manager and keyed by an invocation key.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making agent stateless also enables durable orchestrators


### Isolated invocation state
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two pattern that come to mind after reading this proposal:

  • AgentProvider: A class that provides agent instances. Each is independent of the other so they can all be invoked individually
  • AgentExecutor: Given the state object of an agent, this will run the agent loop on that state, and alter it as it goes along. All changes are applied to the state object, not the AgentExecutor.

This proposal sounds like we are changing the agent into more of this "executor" pattern. I know that lots of folks get confused by this, they think initializing an agent is heavy. The idea of separating state from the agent makes sense, and helps with this confusion. I also think an "AgentProvider" might help to do the same thing, and might involve less change to the existing Agent class


One approach would be to move all mutable state out of the agent instance and into a per-invocation state object:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

system prompts, tools, models, etc. are also mutable.

Considering meta-agent concept for multi-agent systems and context management, I'd argue the line between mutable/immutable is not as clear


```python
class InvocationState:
"""All mutable state for a single agent invocation."""
messages: Messages
Comment on lines +132 to +134
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are messages going to be per invocation? Being that we need to pass the entire message history to the llm - would we bundle all messages from all invocation states?

agent_state: AgentState
event_loop_metrics: EventLoopMetrics
trace_span: trace_api.Span | None
interrupt_state: _InterruptState
...
```

The agent instance would retain only configuration: model, tools, system prompt, hooks, callback handler, conversation manager, etc. In the future, configuration could also be extracted into its own isolated object to allow per-invocation overrides, but this document focuses on invocation state to highlight the core problem and start the discussion.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if one invocation modifies tools while another is running?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can get on board with this but I think it is an all or nothing problem.

Either everything is tied an invocation or we cannot reasonably handle edge cases and have a locking mechanism. As agents become more and more autonomous and begin modifying themselves I think having only partial coverage will be painful


### Session manager provides state
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is essentially snaphots? How does this compare to snapshot proposal

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make it more plugable? It doens't has to be coupled with session manager?


At invocation time, the agent could read state from a session manager using an invocation key:

Comment on lines +146 to +147
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would state be an optional feature? Since session management is optional how would we handle this?

```python
# Pseudo-code for agent.stream_async
async def stream_async(self, prompt, *, invocation_key=None, **kwargs):
# Resolve the invocation key
key = invocation_key or self._default_invocation_key

# Load isolated state from session manager
invocation_state = await self.session_manager.load(key)

# Run the event loop against the isolated state (not self)
async for event in self._run_loop(invocation_state, prompt, **kwargs):
yield event

# Persist state back
await self.session_manager.save(key, invocation_state)
```

Because each invocation would operate on its own state object, there would be no shared mutable state on the agent. The `threading.Lock` and `ConcurrencyException` would no longer be needed.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The threading.Lock and ConcurrencyException would no longer be needed.

I'm not sure this is true; the reason we added the exception was because folks were unaware, but usually they want one of two behaviors:

  1. They wait for the pending invocation to complete and continue
  2. They interrupt the current invocation

What the exception was solving was:

  • Strands doesn't support the above, so we throw an exception because they were unaware that neither of the above are what we do

In the cases of (1) & (2), we would still need locking and an exception, no?


### Default behavior and backwards compatibility

One idea is to introduce a default in-memory session manager. Each agent instance would get a default invocation key that is stable across calls:

```python
class InMemorySessionManager(SessionManager):
"""Stores state in memory, keyed by invocation key."""

def __init__(self):
self._store: dict[str, InvocationState] = {}

async def load(self, key: str) -> InvocationState:
if key not in self._store:
self._store[key] = InvocationState()
return self._store[key]

async def save(self, key: str, state: InvocationState) -> None:
self._store[key] = state
```

When no invocation key is supplied, the agent would use a default key tied to the instance. This would mean:

- Sequential calls accumulate conversation history, just like today.
- A single agent instance with no invocation key behaves identically to the current implementation.
- No code changes required for existing users who interact with the agent through `__call__` or `invoke_async`.

However, code that directly accesses instance fields like `agent.messages` or `agent.state` would be affected. These fields would no longer live on the agent instance, so existing patterns like `print(agent.messages)` or `agent.state["key"] = value` would need to change. This is the primary backwards compatibility concern and is discussed further in the Consequences section.

### Concurrent usage with invocation keys

Users who want concurrency could supply distinct invocation keys:

```python
import asyncio
from strands import Agent

agent = Agent(system_prompt="You are a helpful assistant.")

async def main():
results = await asyncio.gather(
agent.invoke_async("Summarize the Python GIL", invocation_key="task-1"),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assume a case of agentcore being invoked multiple times. what makes invocation key different?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going to +1 this - the majority of the use cases that led to the exceptions where mis-use using the same user/conversation. It was not forking.

There is a learning/concept/documentation problem with strands where we don't make it clear that Agent=Conversation, but this is not solving the same problem as the exceptio

agent.invoke_async("Summarize the Rust borrow checker", invocation_key="task-2"),
)

asyncio.run(main())
```

Each key would get its own isolated messages, agent state, metrics, and trace span. No lock contention, no `ConcurrencyException`.

### Multi-agent patterns could become simpler

With isolated state, multi-agent patterns that reuse the same agent instance become possible. For example, graph nodes could share a single agent and rely on unique invocation keys per execution:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why aren't they possible today?


```python
from strands import Agent
from strands.multiagent.graph import GraphBuilder

summarizer = Agent(
model=my_model,
tools=[summarize_tool],
system_prompt="You are a summarizer.",
)

graph = GraphBuilder()
# Same instance, different invocation keys per execution
graph.add_node(summarizer, node_id="summarize_a")
graph.add_node(summarizer, node_id="summarize_b")
```

The `_validate_node_executor` duplicate-instance check would no longer be needed. `GraphNode.reset_executor_state` could be removed — each execution would start with a fresh invocation state loaded from the session manager. No more deep-copying initial state, no more manually resetting fields, and no risk of missing new stateful fields in the future.

## Developer Experience

### Basic usage (unchanged)

```python
from strands import Agent

agent = Agent(system_prompt="You are a helpful assistant.")
result = agent("Hello!") # Uses default invocation key
result = agent("Follow up") # Same key, conversation continues
```

### Concurrent usage

```python
import asyncio
from strands import Agent

agent = Agent(system_prompt="You are a helpful assistant.")

async def handle_request(user_id: str, message: str):
return await agent.invoke_async(message, invocation_key=user_id)

async def main():
results = await asyncio.gather(
handle_request("user-1", "What is Python?"),
handle_request("user-2", "What is Rust?"),
)
```

### State reset

Rather than reaching into agent internals:

```python
# Today: manually reset individual fields
agent.messages = []
agent.state = AgentState()
```

State could be cleared through the session manager:

```python
# Proposed: clear state for a given invocation key
await agent.session_manager.clear(invocation_key)
```

## Consequences

### What could become easier

- Concurrent agent usage with a single instance
- Resetting or clearing agent state without reaching into internals
- Adding new stateful fields without updating reset logic in graph or other consumers
- Serving multiple users/conversations from a single agent instance

### What could become harder or change

- Internal code that currently reads `self.messages` or `self.state` would need to be updated to read from the invocation state object
Comment thread
mkmeral marked this conversation as resolved.
- For example, hook callbacks that receive the agent and access `agent.messages` would need to be adapted
- Session manager becomes a required concept (though a default in-memory implementation could make it invisible for simple use cases)
- The `threading.Lock` and `ConcurrencyException` would be removed, which means users who relied on the exception as a signal would need to adapt

### Backwards compatibility is the biggest concern
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes!!


Today, users directly read and write instance fields like `agent.messages` and `agent.state`. Moving these into an isolated invocation state object would break that public API surface. Community tools, custom hooks, and user code that accesses these fields would all need updating. Providing a smooth migration path — whether through proxy accessors, a compatibility layer, or clear deprecation — is the most significant challenge with this proposal.

Given the scope of this change, it may be worth considering this as part of a v2 of the Python SDK rather than attempting it as a backwards-compatible evolution of v1.
Loading