feat: multi-model comparison benchmark for Crusoe Managed Inference #61

Open
Sakshi3027 wants to merge 1 commit into crusoecloud:main from Sakshi3027:feat/model-comparison-crusoe

@Sakshi3027

What this adds

A benchmark that runs multiple Crusoe models concurrently across reasoning,
code generation, and summarization tasks, producing a latency and throughput
leaderboard.
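
As a rough illustration of how such a leaderboard could be assembled, here is a minimal sketch. It is not taken from compare.py: the `Result` record, its field names, and the `leaderboard` helper are hypothetical, and it assumes each call reports its wall-clock latency and the number of completion tokens it generated.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Result:
    """One model/task measurement (hypothetical shape, not from compare.py)."""
    model: str
    task: str
    latency_s: float         # wall-clock seconds for the call
    completion_tokens: int   # tokens generated by the model


def leaderboard(results):
    """Return one row per model, sorted by mean latency (fastest first)."""
    by_model = {}
    for r in results:
        by_model.setdefault(r.model, []).append(r)

    rows = []
    for model, rs in by_model.items():
        rows.append({
            "model": model,
            "mean_latency_s": mean(r.latency_s for r in rs),
            # Throughput here is completion tokens per second of wall-clock time.
            "mean_tokens_per_s": mean(r.completion_tokens / r.latency_s for r in rs),
        })
    return sorted(rows, key=lambda row: row["mean_latency_s"])
```

Ranking by mean latency is only one possible choice; sorting on `mean_tokens_per_s` instead would favor models that generate long answers quickly.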

Models compared:

  • meta-llama/Llama-3.3-70B-Instruct
  • deepseek-ai/DeepSeek-V3-0324
  • Qwen/Qwen3-235B-A22B

Tasks:

  • Reasoning (multi-step math)
  • Code generation (Sieve of Eratosthenes with type hints)
  • Summarization (SQL vs NoSQL in 3 bullets)

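All 9 combinations (3 models × 3 tasks) run concurrently via asyncio.gather,
so total benchmark time is roughly that of the slowest single call rather than
the sum of all nine, provided the endpoint does not throttle or queue
concurrent requests.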
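
The concurrency pattern can be sketched as follows. This is an illustrative stand-in, not the code in compare.py: `fake_call` and `run_all` are hypothetical names, the task labels are shorthand for the three tasks above, and the network request is replaced by `asyncio.sleep` so the example runs without an API key.

```python
import asyncio
import itertools
import time

MODELS = [
    "meta-llama/Llama-3.3-70B-Instruct",
    "deepseek-ai/DeepSeek-V3-0324",
    "Qwen/Qwen3-235B-A22B",
]
TASKS = ["reasoning", "code_generation", "summarization"]


async def fake_call(model, task, delay_s):
    """Stand-in for a real API request: sleeps instead of hitting the network."""
    start = time.perf_counter()
    await asyncio.sleep(delay_s)
    return {"model": model, "task": task, "latency_s": time.perf_counter() - start}


async def run_all(delay_s=0.05):
    """Launch every model/task pair at once and time the whole batch."""
    combos = list(itertools.product(MODELS, TASKS))  # 3 x 3 = 9 pairs
    start = time.perf_counter()
    results = await asyncio.gather(*(fake_call(m, t, delay_s) for m, t in combos))
    total_s = time.perf_counter() - start
    return results, total_s


if __name__ == "__main__":
    results, total_s = asyncio.run(run_all())
    print(f"{len(results)} calls finished in {total_s:.2f}s")
```

Because all nine coroutines wait at the same time, the batch takes about one delay rather than nine. With real requests the same holds only while the provider serves the calls in parallel.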

Why it's useful

Teams evaluating which Crusoe model to use for their workload need a
quick way to compare quality and speed across tasks. This gives them
a runnable starting point they can extend with their own prompts.

Testing

Tested locally against Groq (three Llama variants) as a drop-in replacement for the Crusoe endpoint.

To run on Crusoe:
export CRUSOE_API_KEY="your-api-key"
python compare.py
