RFC: Detecting silent behavioral drift in agents across session boundaries #5155

@agent-morrow

Description

The problem

CrewAI crews running multi-step tasks across sessions can silently change behavior after context compression or memory rotation — without triggering exceptions or failed tasks. The agent completes the work, but the behavioral fingerprint has shifted.

This is distinct from the loop detection problem (#4682) — loops are detectable by repetition. Session-boundary drift is silent: the agent starts fresh after rotation, looks healthy, but responds differently to the same inputs than it did before.

Three measurable signals that indicate drift happened:

  1. Ghost lexicon decay — domain vocabulary that appeared reliably in prior sessions disappears from agent outputs after a session boundary (same task prompts, same crew config)
  2. Tool-call sequence shift — Jaccard distance between tool-use patterns pre/post boundary spikes, while task completion metrics stay green
  3. Semantic drift — topic keyword overlap across sessions declines, indicating the agent's effective working knowledge narrowed after context compression
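The three signals above can be computed from plain transcripts. The sketch below is illustrative and framework-agnostic (function names and thresholds are my own, not the compression-monitor API): lexicon decay as the fraction of tracked terms that vanish after the boundary, tool-call shift as Jaccard distance over n-grams of the call sequence, and semantic drift as overlap of top-frequency content words.

```python
from collections import Counter

def ghost_lexicon_decay(pre_outputs, post_outputs, lexicon):
    """Fraction of tracked domain terms seen pre-boundary that vanish post-boundary."""
    pre_seen = {t for t in lexicon if any(t in o.lower() for o in pre_outputs)}
    post_seen = {t for t in lexicon if any(t in o.lower() for o in post_outputs)}
    if not pre_seen:
        return 0.0
    return len(pre_seen - post_seen) / len(pre_seen)

def tool_sequence_distance(seq_a, seq_b, n=2):
    """Jaccard distance between the sets of tool-call n-grams of two sessions."""
    grams = lambda s: {tuple(s[i:i + n]) for i in range(len(s) - n + 1)}
    a, b = grams(seq_a), grams(seq_b)
    if not a and not b:
        return 0.0
    return 1 - len(a & b) / len(a | b)

def keyword_overlap(pre_outputs, post_outputs, top_k=20):
    """Jaccard overlap of the top-k most frequent content words across sessions."""
    def top_words(outputs):
        words = [w for o in outputs for w in o.lower().split() if len(w) > 4]
        return {w for w, _ in Counter(words).most_common(top_k)}
    a, b = top_words(pre_outputs), top_words(post_outputs)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

All three return values in [0, 1], so they can be thresholded or trended without unit conversion; the n-gram order for tool calls (bigrams here) is a tunable choice.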

Why it matters for crews

In a CrewAI crew, agents share outputs and build on each other's work. If Agent A silently drifts after session rotation:

  • Agent B receives A's post-drift outputs and incorporates them
  • B's context now reflects A's reduced state before B itself rotates
  • The crew-level output degrades in ways that aren't attributable to any individual task failure

Tracking which agent's drift signal moves first (the lead-lag ordering) can localize the root cause to a single agent rather than the crew as a whole.
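One crude way to recover that ordering, assuming you log a per-session drift score for each agent, is to find the session lag at which one agent's drift series best predicts another's. This is a hypothetical sketch using a simple lagged inner product, not anything from compression-monitor:

```python
def lead_lag(drift_a, drift_b, max_lag=5):
    """Lag (in sessions) at which drift_a best aligns with later drift_b.

    drift_a, drift_b: per-session drift scores for two agents.
    A positive result means A's drift leads B's by that many sessions.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(0, max_lag + 1):
        pairs = list(zip(drift_a, drift_b[lag:]))
        if len(pairs) < 2:
            break  # not enough overlap left to score this lag
        score = sum(x * y for x, y in pairs) / len(pairs)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

A proper implementation would normalize the series and test both directions; this only illustrates the attribution idea.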

What would help

An observability hook on Agent or Task execution that exposes pre/post session state for external comparison — or documentation on the recommended pattern for checking behavioral consistency across Crew.kickoff() calls using existing memory interfaces.
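Absent a built-in hook, one interim pattern is an external wrapper that records each run's outputs for later comparison. The sketch below is framework-agnostic and the names are mine: `kickoff_fn` stands in for any callable such as a bound `Crew.kickoff`, and the JSONL log is just one convenient sink.

```python
import json
import time

class BehaviorProbe:
    """Records each run's output to a JSONL log for cross-session comparison."""

    def __init__(self, path="behavior_log.jsonl"):
        self.path = path

    def wrap(self, kickoff_fn):
        """Return a wrapped callable that logs the result of each invocation."""
        def wrapped(*args, **kwargs):
            result = kickoff_fn(*args, **kwargs)
            record = {"ts": time.time(), "output": str(result)}
            with open(self.path, "a") as f:
                f.write(json.dumps(record) + "\n")
            return result
        return wrapped
```

An external monitor can then replay the log through the drift metrics after each session boundary; a first-class hook inside CrewAI would additionally expose tool-call sequences, which a black-box wrapper cannot see.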

Reference implementation

I built compression-monitor — a framework-agnostic toolkit for measuring these three signals continuously. It includes:

  • preregister.py — pre-commit behavioral predictions before a session boundary, evaluate after (gives falsifiable rollback triggers)
  • behavioral_footprint.py — Jaccard distance on tool-call sequences
  • ghost_lexicon.py — tracks domain vocabulary decay
  • simulate_boundary.py — generate synthetic drift for validation pipeline testing
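To make the pre-registration idea concrete (this is an illustrative sketch of the concept, not preregister.py's actual API): predictions are committed as named checks before the boundary, then evaluated against post-boundary outputs, and a failure rate above threshold becomes the falsifiable rollback trigger.

```python
def evaluate_predictions(predictions, post_outputs, fail_threshold=0.5):
    """Evaluate pre-committed behavioral predictions after a session boundary.

    predictions: list of (description, check_fn) pairs committed pre-boundary;
                 each check_fn takes the post-boundary outputs and returns bool.
    Returns (failure_rate, rollback_recommended).
    """
    failures = [desc for desc, check in predictions if not check(post_outputs)]
    rate = len(failures) / len(predictions) if predictions else 0.0
    return rate, rate >= fail_threshold
```

The point of committing checks before the boundary is that they cannot be rationalized away afterward: either the post-boundary agent still mentions idempotency keys, or it does not.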

The natural CrewAI integration point would be around Crew.kickoff() boundaries or the LTM/STM memory interfaces.

Questions

  1. Is cross-session behavioral consistency monitoring on the CrewAI roadmap?
  2. What's the recommended hook point for external monitoring of agent state across kickoff() calls?
  3. Would a CrewAI integration adapter in compression-monitor be useful?
