Monitoring and Insights #3102
Replies: 3 comments
For production monitoring, here's what works:

1. State-Based Observability

Built-in monitoring via shared state:

```python
import uuid

state = {
    "run_id": str(uuid.uuid4()),
    "metrics": {
        "total_tokens": 0,
        "agent_latencies": {},
        "tool_calls": [],
        "errors": []
    },
    "model_usage": {}
}

def track_agent(agent_name, tokens, latency, model):
    """Record per-agent tokens and latency, plus per-model usage."""
    state["metrics"]["total_tokens"] += tokens
    state["metrics"]["agent_latencies"][agent_name] = latency
    if model not in state["model_usage"]:
        state["model_usage"][model] = {"calls": 0, "tokens": 0}
    state["model_usage"][model]["calls"] += 1
    state["model_usage"][model]["tokens"] += tokens
```
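A quick usage sketch of the shared-state pattern (agent names, token counts, and latencies here are made up for illustration): call `track_agent` after each agent finishes, then read the aggregates back out of `state`:

```python
import uuid

# Same shape as the shared state above
state = {
    "run_id": str(uuid.uuid4()),
    "metrics": {"total_tokens": 0, "agent_latencies": {}, "tool_calls": [], "errors": []},
    "model_usage": {},
}

def track_agent(agent_name, tokens, latency, model):
    """Record per-agent tokens and latency, plus per-model usage."""
    state["metrics"]["total_tokens"] += tokens
    state["metrics"]["agent_latencies"][agent_name] = latency
    state["model_usage"].setdefault(model, {"calls": 0, "tokens": 0})
    state["model_usage"][model]["calls"] += 1
    state["model_usage"][model]["tokens"] += tokens

# Hypothetical bookkeeping for two agents in one run:
track_agent("researcher", tokens=1200, latency=3.4, model="gpt-4o")
track_agent("writer", tokens=800, latency=2.1, model="claude-3-5-sonnet")

print(state["metrics"]["total_tokens"])         # 2000
print(state["model_usage"]["gpt-4o"]["calls"])  # 1
```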
2. Open Source Tools

Langfuse (you mentioned Braintrust):

```python
from langfuse import Langfuse
from langfuse.decorators import observe  # v2 SDK; in v3 it's `from langfuse import observe`

langfuse = Langfuse()  # reads LANGFUSE_* keys from the environment

@observe()  # traces the whole crew run
def run_crew(task):
    result = crew.kickoff(inputs={"task": task})
    return result
```
Weights & Biases:

```python
import wandb

wandb.init(project="crewai-production")
wandb.log({"tokens": state["metrics"]["total_tokens"]})
```

3. Key Metrics to Track

* Per-agent latency — which agent is slow?
* Per-model token usage — cost attribution
* Tool call frequency — what's being used?
* Error rates — per agent, per model
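The per-model usage captured in the shared state can be turned into a rough cost report for that attribution step. A minimal sketch — the prices below are placeholders, not actual vendor rates:

```python
# Hypothetical per-1K-token prices -- placeholders, not real rates
PRICE_PER_1K = {"gpt-4o": 0.01, "claude-3-5-sonnet": 0.009}

# Same shape as state["model_usage"] above
model_usage = {
    "gpt-4o": {"calls": 3, "tokens": 4000},
    "claude-3-5-sonnet": {"calls": 2, "tokens": 2000},
}

def cost_report(usage):
    """Attribute estimated cost per model from the usage dict."""
    return {
        model: round(stats["tokens"] / 1000 * PRICE_PER_1K.get(model, 0.0), 4)
        for model, stats in usage.items()
    }

print(cost_report(model_usage))  # {'gpt-4o': 0.04, 'claude-3-5-sonnet': 0.018}
```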
4. Production Insight

The state-based approach gives you:

* Built-in audit trail
* No external dependencies
* Easy comparison between runs
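One way to get that run-to-run comparison (a minimal sketch; the file layout and helper names are assumptions): dump the state to JSON at the end of each run, then diff the metrics of any two saved runs:

```python
import json
import tempfile
from pathlib import Path

def save_run(state, directory):
    """Persist one run's state as <run_id>.json for later comparison."""
    path = Path(directory) / f"{state['run_id']}.json"
    path.write_text(json.dumps(state, indent=2))
    return path

def compare_runs(path_a, path_b):
    """Return the total-token delta between two saved runs."""
    a = json.loads(Path(path_a).read_text())
    b = json.loads(Path(path_b).read_text())
    return b["metrics"]["total_tokens"] - a["metrics"]["total_tokens"]

# Usage with two toy runs:
with tempfile.TemporaryDirectory() as d:
    run1 = {"run_id": "run-1", "metrics": {"total_tokens": 1800}}
    run2 = {"run_id": "run-2", "metrics": {"total_tokens": 2400}}
    delta = compare_runs(save_run(run1, d), save_run(run2, d))
    print(delta)  # 600
```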
More on observability patterns: https://github.com/KeepALifeUS/autonomous-agents |
[like] Kiran Jaggu reacted to your message:
From: Vladyslav Shapovalov ***@***.***>
Sent: Thursday, February 12, 2026 8:56:26 PM
To: crewAIInc/crewAI ***@***.***>
Cc: Kiran Jaggu ***@***.***>; Author ***@***.***>
Subject: Re: [crewAIInc/crewAI] Monitoring and Insights (Discussion #3102)
Langfuse and W&B are solid for the observability side. The gap I kept running into was: I can see which model is underperforming, but then what? I still have to manually change the routing or swap the model.

I built Kalibr to close that loop. It auto-instruments your LLM calls (OpenAI, Anthropic, Google), captures traces, and then uses the outcome data you report to shift routing automatically. So if Claude starts degrading on a specific task type, traffic moves to the next-best option without you touching anything. It's an open-source Python SDK, and it might be worth trying alongside Langfuse since they solve different parts of the problem.
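This is not Kalibr's actual API — just a toy sketch of the outcome-driven routing idea, with every class and parameter name invented for illustration: keep a moving score per model from reported outcomes, and route each new call to the current best, with a little exploration so lagging models can recover.

```python
import random

class OutcomeRouter:
    """Toy outcome-driven router (illustrative only, not a real SDK).

    Keeps an exponential moving average of reported outcome scores
    per model and routes new calls to the highest-scoring one.
    """

    def __init__(self, models, alpha=0.2, explore=0.1):
        self.scores = {m: 1.0 for m in models}  # optimistic start
        self.alpha = alpha       # EMA smoothing factor
        self.explore = explore   # probability of trying a random model

    def pick(self):
        """Choose a model: mostly the best, occasionally a random one."""
        if random.random() < self.explore:
            return random.choice(list(self.scores))
        return max(self.scores, key=self.scores.get)

    def report(self, model, success):
        """Feed back an outcome (1.0 = good, 0.0 = bad)."""
        s = self.scores[model]
        self.scores[model] = (1 - self.alpha) * s + self.alpha * float(success)

# Toy scenario: "claude" degrades on this task type, traffic shifts away.
router = OutcomeRouter(["claude", "gpt-4o"], explore=0.0)
for _ in range(10):
    router.report("claude", success=0.0)
router.report("gpt-4o", success=1.0)
print(router.pick())  # gpt-4o
```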
I'm looking at production readiness for my CrewAI agents, which calls for performance insights into how different models and prompts are performing, and secondly for monitoring. Any suggestions for open-source tools?
I came across https://www.braintrust.dev/. Has anyone had success integrating it with CrewAI? Any response is highly appreciated.