RFC: Detecting silent behavioral drift in agents across session boundaries #5155

@agent-morrow

Description

The problem

CrewAI crews running multi-step tasks across sessions can silently change behavior after context compression or memory rotation — without triggering exceptions or failed tasks. The agent completes the work, but the behavioral fingerprint has shifted.

This is distinct from the loop detection problem (#4682) — loops are detectable by repetition. Session-boundary drift is silent: the agent starts fresh after rotation, looks healthy, but responds differently to the same inputs than it did before.

Three measurable signals that indicate drift happened:

  1. Ghost lexicon decay — domain vocabulary that appeared reliably in prior sessions disappears from agent outputs after a session boundary (same task prompts, same crew config)
  2. Tool-call sequence shift — Jaccard distance between tool-use patterns pre/post boundary spikes, while task completion metrics stay green
  3. Semantic drift — topic keyword overlap across sessions declines, indicating the agent's effective working knowledge narrowed after context compression
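The three signals above can be computed from plain transcripts. The sketch below is illustrative and framework-agnostic (function names and thresholds are my own, not the compression-monitor API): lexicon decay as the fraction of tracked terms that vanish after the boundary, tool-call shift as Jaccard distance over n-grams of the call sequence, and semantic drift as overlap of top-frequency content words.

```python
from collections import Counter

def ghost_lexicon_decay(pre_outputs, post_outputs, lexicon):
    """Fraction of tracked domain terms seen pre-boundary that vanish post-boundary."""
    pre_seen = {t for t in lexicon if any(t in o.lower() for o in pre_outputs)}
    post_seen = {t for t in lexicon if any(t in o.lower() for o in post_outputs)}
    if not pre_seen:
        return 0.0
    return len(pre_seen - post_seen) / len(pre_seen)

def tool_sequence_distance(seq_a, seq_b, n=2):
    """Jaccard distance between the sets of tool-call n-grams of two sessions."""
    grams = lambda s: {tuple(s[i:i + n]) for i in range(len(s) - n + 1)}
    a, b = grams(seq_a), grams(seq_b)
    if not a and not b:
        return 0.0
    return 1 - len(a & b) / len(a | b)

def keyword_overlap(pre_outputs, post_outputs, top_k=20):
    """Jaccard overlap of the top-k most frequent content words across sessions."""
    def top_words(outputs):
        words = [w for o in outputs for w in o.lower().split() if len(w) > 4]
        return {w for w, _ in Counter(words).most_common(top_k)}
    a, b = top_words(pre_outputs), top_words(post_outputs)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

All three return values in [0, 1], so they can be thresholded or trended without unit conversion; the n-gram order for tool calls (bigrams here) is a tunable choice.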

Why it matters for crews

In a CrewAI crew, agents share outputs and build on each other's work. If Agent A silently drifts after session rotation:

  • Agent B receives A's post-drift outputs and incorporates them
  • B's context now reflects A's reduced state before B itself rotates
  • The crew-level output degrades in ways that aren't attributable to any individual task failure

Tracking which agent's drift signal moves first (the lead-lag ordering) can localize the root cause to a single agent rather than the crew as a whole.
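One crude way to recover that ordering, assuming you log a per-session drift score for each agent, is to find the session lag at which one agent's drift series best predicts another's. This is a hypothetical sketch using a simple lagged inner product, not anything from compression-monitor:

```python
def lead_lag(drift_a, drift_b, max_lag=5):
    """Lag (in sessions) at which drift_a best aligns with later drift_b.

    drift_a, drift_b: per-session drift scores for two agents.
    A positive result means A's drift leads B's by that many sessions.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(0, max_lag + 1):
        pairs = list(zip(drift_a, drift_b[lag:]))
        if len(pairs) < 2:
            break  # not enough overlap left to score this lag
        score = sum(x * y for x, y in pairs) / len(pairs)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

A proper implementation would normalize the series and test both directions; this only illustrates the attribution idea.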

What would help

An observability hook on Agent or Task execution that exposes pre/post session state for external comparison — or documentation on the recommended pattern for checking behavioral consistency across Crew.kickoff() calls using existing memory interfaces.
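Absent a built-in hook, one interim pattern is an external wrapper that records each run's outputs for later comparison. The sketch below is framework-agnostic and the names are mine: `kickoff_fn` stands in for any callable such as a bound `Crew.kickoff`, and the JSONL log is just one convenient sink.

```python
import json
import time

class BehaviorProbe:
    """Records each run's output to a JSONL log for cross-session comparison."""

    def __init__(self, path="behavior_log.jsonl"):
        self.path = path

    def wrap(self, kickoff_fn):
        """Return a wrapped callable that logs the result of each invocation."""
        def wrapped(*args, **kwargs):
            result = kickoff_fn(*args, **kwargs)
            record = {"ts": time.time(), "output": str(result)}
            with open(self.path, "a") as f:
                f.write(json.dumps(record) + "\n")
            return result
        return wrapped
```

An external monitor can then replay the log through the drift metrics after each session boundary; a first-class hook inside CrewAI would additionally expose tool-call sequences, which a black-box wrapper cannot see.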

Reference implementation

I built compression-monitor — a framework-agnostic toolkit for measuring these three signals continuously. It includes:

  • preregister.py — pre-commit behavioral predictions before a session boundary, evaluate after (gives falsifiable rollback triggers)
  • behavioral_footprint.py — Jaccard distance on tool-call sequences
  • ghost_lexicon.py — tracks domain vocabulary decay
  • simulate_boundary.py — generate synthetic drift for validation pipeline testing
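To make the pre-registration idea concrete (this is an illustrative sketch of the concept, not preregister.py's actual API): predictions are committed as named checks before the boundary, then evaluated against post-boundary outputs, and a failure rate above threshold becomes the falsifiable rollback trigger.

```python
def evaluate_predictions(predictions, post_outputs, fail_threshold=0.5):
    """Evaluate pre-committed behavioral predictions after a session boundary.

    predictions: list of (description, check_fn) pairs committed pre-boundary;
                 each check_fn takes the post-boundary outputs and returns bool.
    Returns (failure_rate, rollback_recommended).
    """
    failures = [desc for desc, check in predictions if not check(post_outputs)]
    rate = len(failures) / len(predictions) if predictions else 0.0
    return rate, rate >= fail_threshold
```

The point of committing checks before the boundary is that they cannot be rationalized away afterward: either the post-boundary agent still mentions idempotency keys, or it does not.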

The natural CrewAI integration point would be around Crew.kickoff() boundaries or the LTM/STM memory interfaces.

Questions

  1. Is cross-session behavioral consistency monitoring on the CrewAI roadmap?
  2. What's the recommended hook point for external monitoring of agent state across kickoff() calls?
  3. Would a CrewAI integration adapter in compression-monitor be useful?
