## The problem
CrewAI crews running multi-step tasks across sessions can silently change behavior after context compression or memory rotation — without triggering exceptions or failed tasks. The agent completes the work, but the behavioral fingerprint has shifted.
This is distinct from the loop detection problem (#4682) — loops are detectable by repetition. Session-boundary drift is silent: the agent starts fresh after rotation, looks healthy, but responds differently to the same inputs than it did before.
Three measurable signals that indicate drift happened:
- Ghost lexicon decay — domain vocabulary that appeared reliably in prior sessions disappears from agent outputs after a session boundary (same task prompts, same crew config)
- Tool-call sequence shift — Jaccard distance between tool-use patterns pre/post boundary spikes, while task completion metrics stay green
- Semantic drift — topic keyword overlap across sessions declines, indicating the agent's effective working knowledge narrowed after context compression
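All three signals reduce to set comparisons over session artifacts. A minimal, framework-agnostic sketch of how they can be computed (function names here are illustrative, not part of CrewAI or compression-monitor):

```python
# Illustrative drift signals; all names are hypothetical, not a real API.

def jaccard_distance(a: set, b: set) -> float:
    """1 - |A ∩ B| / |A ∪ B|; 0.0 means identical sets."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def tool_sequence_shift(pre_calls: list, post_calls: list) -> float:
    # Compare the sets of adjacent tool-call pairs (bigrams) pre/post boundary.
    bigrams = lambda seq: set(zip(seq, seq[1:]))
    return jaccard_distance(bigrams(pre_calls), bigrams(post_calls))

def lexicon_decay(baseline_terms: set, output_text: str) -> float:
    # Fraction of previously reliable domain vocabulary missing from new output.
    tokens = set(output_text.lower().split())
    if not baseline_terms:
        return 0.0
    return len(baseline_terms - tokens) / len(baseline_terms)
```

A spike in `tool_sequence_shift` or `lexicon_decay` across a session boundary, while task success metrics stay green, is exactly the silent-drift pattern described above.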
## Why it matters for crews
In a CrewAI crew, agents share outputs and build on each other's work. If Agent A silently drifts after session rotation:
- Agent B receives A's post-drift outputs and incorporates them
- B's context now reflects A's reduced state before B itself rotates
- The crew-level output degrades in ways that aren't attributable to any individual task failure
Tracking which agent drifts first (the lead-lag ordering of the per-agent drift signals) can identify the root cause.
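The lead-lag ordering can be estimated from per-agent drift time series. A toy sketch (a plain dot-product alignment for brevity; a real implementation would use normalized cross-correlation, and every name below is hypothetical):

```python
def lead_lag(series_a: list, series_b: list, max_lag: int = 3) -> int:
    """Return the session lag at which agent A's drift best aligns with B's.

    A result of k means B's drift pattern follows A's by k sessions,
    suggesting A drifted first and propagated its reduced state to B.
    """
    def alignment(lag: int) -> float:
        pairs = list(zip(series_a, series_b[lag:]))
        return sum(x * y for x, y in pairs) / len(pairs) if pairs else 0.0
    return max(range(max_lag + 1), key=alignment)
```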
## What would help
An observability hook on `Agent` or `Task` execution that exposes pre/post session state for external comparison — or documentation on the recommended pattern for checking behavioral consistency across `Crew.kickoff()` calls using existing memory interfaces.
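Absent a built-in hook, the pattern can be approximated externally by fingerprinting the outputs of repeated `kickoff()` calls on a fixed probe input. A hedged sketch (the `DriftMonitor` class and its threshold are illustrative, not an existing CrewAI or compression-monitor interface):

```python
# Hypothetical external monitor; nothing here is a CrewAI API.

class DriftMonitor:
    def __init__(self, threshold: float = 0.4):
        self.baseline = None  # vocabulary set from the first probe run
        self.threshold = threshold

    def check(self, output_text: str) -> bool:
        """Return True if this output drifted past the threshold vs. baseline."""
        vocab = set(output_text.lower().split())
        if self.baseline is None:
            self.baseline = vocab
            return False
        union = self.baseline | vocab
        drift = 1.0 - len(self.baseline & vocab) / len(union) if union else 0.0
        return drift > self.threshold

monitor = DriftMonitor()
# result = crew.kickoff(inputs=probe)   # session 1: establishes the baseline
# monitor.check(str(result))
# ... memory rotation / context compression happens ...
# result = crew.kickoff(inputs=probe)   # session 2: same probe inputs
# if monitor.check(str(result)):
#     pass  # flag behavioral drift for review
```

A vocabulary Jaccard distance is only one possible fingerprint; the same wrapper shape works for tool-call sequences or topic-keyword overlap.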
## Reference implementation
I built `compression-monitor` — a framework-agnostic toolkit for measuring these three signals continuously. It includes:
- `preregister.py` — pre-commits behavioral predictions before a session boundary and evaluates them after (gives falsifiable rollback triggers)
- `behavioral_footprint.py` — computes Jaccard distance on tool-call sequences
- `ghost_lexicon.py` — tracks domain-vocabulary decay
- `simulate_boundary.py` — generates synthetic drift for validation-pipeline testing
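To illustrate the preregistration idea, here is a minimal commit-then-evaluate sketch (the interface below is illustrative, not `preregister.py`'s actual API):

```python
import json
import time

def preregister(predictions: dict, path: str) -> None:
    """Write falsifiable predictions to disk *before* the session boundary."""
    with open(path, "w") as f:
        json.dump({"committed_at": time.time(), "predictions": predictions}, f)

def evaluate(path: str, observed: dict) -> dict:
    """After the boundary, score each prediction; any failure is a rollback trigger."""
    with open(path) as f:
        record = json.load(f)
    return {k: observed.get(k) == v for k, v in record["predictions"].items()}
```

Because the predictions are committed before the boundary, a post-boundary mismatch is evidence of drift rather than a post-hoc rationalization.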
The natural CrewAI integration point would be around `Crew.kickoff()` boundaries or the LTM/STM memory interfaces.
## Questions
- Is cross-session behavioral consistency monitoring on the CrewAI roadmap?
- What's the recommended hook point for external monitoring of agent state across `kickoff()` calls?
- Would a CrewAI integration adapter in compression-monitor be useful?