lechmazur/nyt-connections
Extended Version

This benchmark evaluates large language models (LLMs) using 940 NYT Connections puzzles, with additional words included to increase difficulty.

As of Feb 4, 2025, there is a new version of the benchmark. The standard NYT Connections benchmark is nearing saturation, with o1 scoring 90.7 and o3 and other reasoning models expected this year. Under the standard rules, a solver only needs to identify three categories; the fourth falls into place. To increase difficulty, Extended Connections adds up to four extra trick words to each puzzle. We double-check that none of the added words fit into any category used in the corresponding puzzle. New puzzles have expanded the total from 436 to 940 as of Feb 2, 2026.
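The extension step described above can be sketched in code. This is a minimal illustration, not the benchmark's actual pipeline: the puzzle representation, the `trick_pool`, and the `fits_category` checker are all hypothetical (in the real benchmark, the no-fit check is verified by hand).

```python
import random

def extend_puzzle(puzzle, trick_pool, n_extra=4, fits_category=None):
    """Add up to `n_extra` trick words to a Connections puzzle.

    `puzzle` is assumed to be {"categories": {name: [4 words]}}.
    `fits_category(word, name, members)` is a hypothetical checker that
    returns True if `word` could plausibly belong to that category;
    candidates that fit any real category are rejected.
    """
    existing = {w for ws in puzzle["categories"].values() for w in ws}
    extras = []
    for word in random.sample(trick_pool, len(trick_pool)):
        if word in existing:
            continue  # never duplicate a word already in the puzzle
        if fits_category and any(
            fits_category(word, name, members)
            for name, members in puzzle["categories"].items()
        ):
            continue  # reject trick words that fit a real category
        extras.append(word)
        if len(extras) == n_extra:
            break
    return existing | set(extras), extras
```

The extended word list (16 originals plus up to 4 tricks) is then shuffled and presented to the model in place of the standard 16-word grid.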

Chart: Extended Version

Leaderboard

Leaderboard: Extended Version

Rank Model Score % #Puzzles
1 Gemini 3.1 Pro Preview 98.4 940
2 gemini-3-pro-preview 96.3 940
3 Claude Opus 4.6 (high reasoning) 94.7 940
4 GPT-5.4 (xhigh reasoning) 94.0 940
5 GPT-5.4 (high reasoning) 93.6 940
6 Grok 4.20 Multi-Agent Exp Beta 0304 93.4 940
7 GPT-5.4 (medium reasoning) 91.9 940
8 grok-4-1-fast-reasoning 91.7 940
9 Grok 4.20 0309 (Reasoning) 90.3 940
10 grok-4.20-experimental-beta-0304-reasoning 89.5 940
11 gpt-5.2-xhigh 88.6 940
12 Gemini 3 Flash Preview 88.4 940
13 GPT-5.2 Pro 85.7 940
14 Claude Sonnet 4.6 (high reasoning) 85.7 940
15 GLM-5.1 84.3 940
16 Claude Sonnet 4.6 Thinking 32K 82.4 940
17 GLM-5 81.7 940
18 Claude Opus 4.6 Thinking 16K 81.7 940
19 Gemma 4 31B Reasoning 79.5 940
20 Kimi K2.5 Thinking 78.3 940
21 gpt-5.2-high 77.5 940
22 GPT-5.4 Mini (xhigh reasoning) 71.8 940
23 gpt-5.2-medium 71.4 940
24 Qwen 3.6 Plus 71.3 940
25 Qwen3.5-397B-A17B 69.2 940
26 gpt-5.2-low 66.7 940
27 Qwen3.5-122B-A10B 63.6 940
28 Claude Opus 4.5 Thinking 16K 62.6 940
29 Qwen3.5-27B 60.7 940
30 Claude Opus 4.5 (no reasoning) 60.3 940
31 Claude Sonnet 4.6 Thinking 16K 57.6 940
32 Claude Opus 4.6 (no reasoning) 55.9 940
33 Claude Sonnet 4.6 (no reasoning) 55.0 940
34 DeepSeek V3.2 50.2 940
35 Claude Sonnet 4.5 Thinking 16K 49.4 940
36 Claude Sonnet 4.5 (no reasoning) 47.4 940
37 qwen3-max-2026-01-23 42.1 940
38 ByteDance Seed2.0 Pro 42.1 940
39 Claude Opus 4.7 (high reasoning) 41.0 940
40 Xiaomi MiMo V2 Pro 40.9 940
41 Step 3.5 Flash 39.9 940
42 MiniMax-M2.7 35.2 940
43 GPT-5.4 (no reasoning) 32.8 940
44 LongCat Flash Thinking 31.0 940
45 Gemma 4 31B IT 30.1 940
46 minimax-m2.5 29.6 940
47 Arcee Trinity Large Thinking 29.5 940
48 gpt-5.2-none 28.1 940
49 minimax-m2 27.0 940
50 Claude 4.5 Haiku 26.0 940
51 grok-4-1-fast-non-reasoning 25.1 940
52 qwen3-max-thinking 24.1 940
53 minimax-m2.1 22.7 940
54 Baidu Ernie 5.0 21.2 940
55 Gemini 3.1 Flash-Lite Preview 19.7 940
56 Grok 4.20 0309 (Non-Reasoning) 19.2 940
57 Llama 4 Maverick 18.4 940
58 DeepSeek V3.2 (no reasoning) 17.8 940
59 grok-4.20-experimental-beta-0304-non-reasoning 17.6 940
60 Mistral Large 3 17.2 940
61 Mistral Medium 3.1 15.5 940
62 Claude Opus 4.7 (no reasoning) 15.3 940

Correlation of puzzle-level results: heatmap

Correlations


Newest 100 puzzles

To counteract the possibility of an LLM's training data including the solutions, we have also tested only the 100 latest puzzles. Note that lower scores on this subset do not necessarily indicate that NYT Connections solutions are in the training data, since the earliest puzzles were easier.


Chart: Newest 100 puzzles, extended version



Humans vs. LLMs

To explore how top LLMs compare to humans on the New York Times Connections puzzle, we used official NYT performance data from December 2024 to February 2025, as analyzed by u/Bryschien1996, alongside a simulated gameplay setup that mirrors the human experience. In this setup, solvers iteratively propose groups of four, receive feedback ("correct," "one away," or "incorrect"), and are allowed up to four mistakes before failing. According to the NYT data, the average human player solved approximately 71% of puzzles over this three-month period, with daily solve rates ranging from 39% on the toughest day (February 2, 2025) to 98% on the easiest (February 26, 2025). It's worth noting that NYT Connections players are self-selected and likely perform better than the general population. We collected data from nine LLMs spanning a range of scores on the Extended Connections benchmark.
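The game loop described above can be sketched as follows. This is an illustrative harness, not the benchmark's published code; the `solution` representation and the `propose` callback (the human or LLM solver) are assumptions.

```python
def play_connections(solution, propose, max_mistakes=4):
    """Simulate the NYT-style game loop used in the humans-vs-LLMs comparison.

    `solution` is a list of four 4-word frozensets (the true categories).
    `propose(remaining, history)` is the solver and returns four words.
    Feedback mirrors the website: "correct", "one away" (3 of 4 words
    belong to one category), or "incorrect".
    """
    remaining = set().union(*solution)
    unsolved = list(solution)
    history, mistakes = [], 0
    while unsolved and mistakes < max_mistakes:
        guess = frozenset(propose(remaining, history))
        if guess in unsolved:
            feedback = "correct"
            unsolved.remove(guess)
            remaining -= guess
        else:
            mistakes += 1
            near = any(len(guess & grp) == 3 for grp in unsolved)
            feedback = "one away" if near else "incorrect"
        history.append((guess, feedback))
    return not unsolved  # True = solved within four mistakes
```

A win rate is then just the fraction of puzzles for which this loop returns True, which is the statistic compared against the human solve rates above.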

Chart: Humans vs. LLMs

The results reveal that top reasoning LLMs from OpenAI consistently outperform the average human player. DeepSeek R1 performs closest to the level of an average NYT Connections player.

Elite human players, however, set a higher standard, achieving a 100% win rate during the same period.

o1, with a 98.9% win rate, comes close to this elite level. o1-pro, which has not yet been tested in this gameplay simulation setup, might be able to match these top humans. Thus, directly determining whether AI achieves superhuman performance on NYT Connections could hinge on comparing the number of mistakes made before fully solving each puzzle.


Original NYT Connections LLM Benchmark

This benchmark evaluates large language models (LLMs) using 436 NYT Connections puzzles. Three different prompts are used, none optimized for LLMs through prompt engineering. Both uppercase and lowercase puzzles are assessed.

Chart: Original Version


Leaderboard: Original Version

Model Score
o1 90.7
o1-preview 87.1
o3-mini 72.4
DeepSeek R1 54.4
o1-mini 42.2
Multi-turn ensemble 37.8
Gemini 2.0 Flash Thinking Exp 01-21 37.0
GPT-4 Turbo 28.3
GPT-4o 2024-11-20 27.9
GPT-4o 2024-08-06 26.5
Llama 3.1 405B 26.3
Claude 3.5 Sonnet (2024-10-22) 25.9
Claude 3 Opus 24.8
Grok Beta 23.7
Llama 3.3 70B 23.7
Gemini 1.5 Pro (Sept) 22.7
Deepseek-V3 21.0
Gemini 2.0 Flash Exp 20.0
Gemma 2 27B 18.8
Qwen 2.5 Max 18.6
Gemini 2.0 Flash Thinking Exp 18.6
Mistral Large 2 17.4
Qwen 2.5 72B 14.8
Claude 3.5 Haiku 13.7
MiniMax-Text-01 13.6
Nova Pro 12.5
Phi-4 11.6
Mistral Small 3 10.5
DeepSeek-V2.5 9.9

Leaderboard: Older models

These models are excluded from the main leaderboard because they were run on fewer than 940 puzzles.

Rank Model Score % #Puzzles Coverage
1 Sherlock Think Alpha 92.4 759 759/940
2 Grok 4 Fast Reasoning 92.1 759 759/940
3 Grok 4 91.7 759 759/940
4 Sonoma Sky Alpha 90.7 759 759/940
5 o3-pro (medium reasoning) 87.3 759 759/940
6 GPT-5 Pro 83.9 759 759/940
7 o1-pro (medium reasoning) 82.5 651 651/940
8 o3 (high reasoning) 78.6 759 759/940
9 GPT-5 (high reasoning) 77.0 759 759/940
10 o4-mini (high reasoning) 73.6 759 759/940
11 o3 (medium reasoning) 73.0 759 759/940
12 GPT-5 (medium reasoning) 72.2 759 759/940
13 o1 (medium reasoning) 70.8 651 651/940
14 GPT-5.1 (high reasoning) 69.9 759 759/940
15 o4-mini (medium reasoning) 68.8 651 651/940
16 GPT-5 mini (medium reasoning) 66.9 759 759/940
17 GPT-5 (low reasoning) 65.4 759 759/940
18 GPT-5.1 (medium reasoning) 62.7 759 759/940
19 o3-mini (high reasoning) 61.4 651 651/940
20 GLM-4.7 59.5 767 767/940
21 Claude Opus 4.1 Thinking 16K 58.8 759 759/940
22 DeepSeek V3.1 Reasoner 57.7 759 759/940
23 Gemini 2.5 Pro 57.6 759 759/940
24 Kimi K2 Thinking 64K 57.3 924 924/940
25 Qwen 3 235B A22B 54.3 759 759/940
26 Gemini 2.5 Pro Exp 03-25 54.1 651 651/940
27 o3-mini (medium reasoning) 53.6 651 651/940
28 Claude Opus 4 Thinking 16K 49.7 759 759/940
29 DeepSeek R1 05/28 48.6 759 759/940
30 Qwen 3 235B A22B 25-07 Think 46.2 759 759/940
31 Gemini 2.5 Pro Preview 05-06 42.5 651 651/940
32 Claude Sonnet 4 Thinking 16K 40.3 759 759/940
33 Claude Sonnet 4 Thinking 64K 39.6 651 651/940
34 GPT-OSS-120B 38.7 759 759/940
35 DeepSeek R1 38.6 651 651/940
36 Claude Opus 4.1 (no reasoning) 37.1 759 759/940
37 Qwen 3 30B A3B 36.6 759 759/940
38 Qwen 3 32B 35.8 759 759/940
39 Qwen 3 30B A3B 25-07 Thinking 35.5 759 759/940
40 Claude Opus 4 (no reasoning) 34.4 759 759/940
41 GPT-4.5 Preview 34.2 651 651/940
42 Claude 3.7 Sonnet Thinking 16K 33.6 651 651/940
43 Qwen 3 Next 80B A3B Thinking 32.9 759 759/940
44 Qwen QwQ-32B 16K 31.4 651 651/940
45 Grok 3 Mini Beta (high) 30.2 759 759/940
46 GLM-4.5 30.2 759 759/940
47 Claude Opus 4.6 Thinking 32K 28.1 98 98/940
48 GPT-5 (minimal reasoning) 27.3 759 759/940
49 o1-mini 26.9 651 651/940
50 Claude Sonnet 4 (no reasoning) 26.6 759 759/940
51 Grok 3 Mini Beta (low) 26.0 651 651/940
52 Quasar Alpha 25.4 651 651/940
53 Cohere Command A Reasoning 16K 25.3 759 759/940
54 Gemini 2.5 Flash 25.2 759 759/940
55 Sherlock Dash Alpha 25.1 759 759/940
56 Grok 4 Fast Non-Reasoning 24.9 759 759/940
57 GPT-4o Mar 2025 24.5 759 759/940
58 GLM-4.6 24.2 759 759/940
59 Qwen 3 Max Preview 23.9 759 759/940
60 Kimi K2-0905 23.6 759 759/940
61 Gemini 2.0 Flash Think Exp 01-21 23.1 649 649/940
62 GPT-4.1 22.8 759 759/940
63 Sonoma Dusk Alpha 22.8 759 759/940
64 GPT-4o Feb 2025 22.7 651 651/940
65 GPT-5.1 (no reasoning) 22.1 759 759/940
66 Polaris Alpha 21.8 759 759/940
67 Gemini 2.0 Pro Exp 02-05 21.7 651 651/940
68 DeepSeek V3.1 Non-Think 21.6 759 759/940
69 MiniMax-M1 21.3 688 688/940
70 Kimi K2 19.8 759 759/940
71 Qwen 3 235B A22B 25-07 Instruct 19.8 759 759/940
72 Grok 3 Beta (no reasoning) 19.7 759 759/940
73 Grok 2 12-12 19.2 651 651/940
74 Gemini 1.5 Pro (Sept) 19.2 601 601/940
75 Claude 3 Opus 19.2 650 650/940
76 Claude 3.7 Sonnet 19.2 651 651/940
77 Gemini 2.0 Flash 18.8 651 651/940
78 GPT-4o 2024-11-20 18.7 601 601/940
79 Qwen 2.5 Max 18.0 651 651/940
80 GPT-4o 2024-08-06 17.8 601 601/940
81 Claude 3.5 Sonnet 2024-10-22 17.7 651 651/940
82 Llama 4 Scout 17.4 759 759/940
83 DeepSeek V3-0324 16.8 759 759/940
84 Llama 3.1 405B 16.2 651 651/940
85 DeepSeek V3 15.1 651 651/940
86 Llama 3.3 70B 15.1 651 651/940
87 Baidu Ernie 4.5 300B A47B 14.8 759 759/940
88 GPT-4.1 mini 14.4 759 759/940
89 LongCat Flash 13.9 660 660/940
90 MiniMax-Text-01 13.8 759 759/940
91 Cohere Command A 13.1 759 759/940
92 Mistral Large 2 12.4 759 759/940
93 Gemma 2 27B 12.2 651 651/940
94 Gemma 3 27B 11.6 759 759/940
95 Mistral Medium 3 11.5 759 759/940
96 Mistral Small 3.1 11.4 651 651/940
97 Mistral Small 3.2 11.2 759 759/940
98 Qwen 2.5 72B 10.5 759 759/940
99 Claude 3.5 Haiku 10.0 759 759/940
100 Amazon Nova Pro 9.9 759 759/940
101 Microsoft Phi-4 9.9 759 759/940
102 GPT-4o mini 9.7 759 759/940
103 Mistral Small 3 8.9 601 601/940
104 GPT-4.1 nano 8.1 759 759/940
105 GLM4-32B-0414 7.6 759 759/940
106 Claude 3 Haiku 2.2 601 601/940

Notes

  • Claude Opus 4.7 refuses many requests.
  • Partial credit is awarded if the puzzle isn't completely solved.
  • Only one attempt is allowed per puzzle. Humans solving puzzles on the NYT website get four attempts and a notification when they're one step away from a correct group.
  • Multi-turn ensemble is my unpublished system. It utilizes multiple LLMs, multi-turn dialogues, and other proprietary techniques. It is slower and more costly to run, but it performs very well, outperforming non-o1 LLMs on MMLU-Pro and GPQA.
  • This benchmark is not affiliated with the New York Times.
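The single-attempt, partial-credit scoring in the notes above can be sketched as follows. The exact weighting used by the benchmark is not specified here, so this fraction-of-categories formula is an assumption for illustration only.

```python
def score_attempt(groups_guessed, solution):
    """Partial-credit sketch: fraction of the four categories reproduced
    exactly in a model's single attempt.

    `solution` is a set of four 4-word frozensets; `groups_guessed` is the
    model's proposed grouping. An exact category match earns credit; there
    is no feedback loop, unlike the NYT website's four-attempt format.
    """
    correct = sum(1 for g in groups_guessed if frozenset(g) in solution)
    return correct / len(solution)
```

Under this sketch, a fully solved puzzle scores 1.0 and a puzzle with two of four categories correct scores 0.5.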

Other multi-agent benchmarks

Other benchmarks


Updates

  • April 16, 2026: Claude Opus 4.7 added.
  • April 15, 2026: GLM-5.1, Step 3.5 Flash, Qwen3.5-27B added.
  • April 6, 2026: GPT 5.4 (high), Gemma 4 31B Reasoning, Qwen3.5-122B-A10B added.
  • April 4, 2026: MiniMax-M2.7 added.
  • April 3, 2026: Arcee Trinity Large Thinking, Qwen 3.6 Plus, Gemma 4 31B added.
  • Mar 6, 2026: Grok 4.20 Beta Experimental, Gemini 3.1 Flash-Lite Preview added.
  • Mar 5, 2026: GPT-5.4 added.
  • Feb 23, 2026: GLM-5 added.
  • Feb 20, 2026: Gemini 3.1 Pro Preview, ByteDance Seed2.0 Pro, Baidu Ernie 5.0 added.
  • Feb 17, 2026: Claude Sonnet 4.6, Qwen3.5-397B-A17B, MiniMax-M2.5 added.
  • Feb 6, 2026: Claude Opus 4.6 added.
  • Feb 2, 2026: 940 total puzzles. Kimi K2.5 Thinking, Qwen3 Max (2026-01-23), MiniMax-M2.1, DeepSeek V3.2 added.
  • Dec 17, 2025: Gemini 3 Flash Preview added.
  • Dec 12, 2025: GPT 5.2 xhigh, GPT 5.2 Pro added.
  • Dec 11, 2025: GPT 5.2 added.
  • Dec 2, 2025: Mistral Large 3 added.
  • Nov 24, 2025: Claude Opus 4.5 added.
  • Nov 21, 2025: Grok 4.1 Fast added.
  • Nov 18, 2025: Gemini 3 Pro Preview, GPT 5.1 added.
  • Nov 12, 2025: Kimi K2 Thinking added.
  • Oct 15, 2025: Claude Haiku 4.5 added.
  • Oct 14, 2025: Claude Sonnet 4.5, Deepseek V3.2 Exp, GLM-4.6 added.
  • Sep 19, 2025: Grok 4 Fast, Qwen 3 Next 80B A3B Thinking, LongCat Flash Chat added.
  • Sep 6, 2025: Kimi K2-0905 added.
  • Sep 5, 2025: Qwen 3 Max Preview, Qwen 3 235B A22B 25-07 Instruct added.
  • Aug 23, 2025: GPT-5 high reasoning and Cohere Command A Reasoning (16K) added.
  • Aug 22, 2025: DeepSeek 3.1, Qwen 3 30B A3B 25-07, Mistral Medium 3.1, GPT-5 minimal and low reasoning added.
  • Aug 7, 2025: GPT-5 added.
  • Aug 5, 2025: Claude Opus 4.1, GPT-OSS-120B added.
  • July 28, 2025: GLM-4.5, Qwen 3 235B A22B 25-07 Thinking added.
  • July 14, 2025: 108 new puzzles added. Kimi K2 added.
  • July 10, 2025: Grok 4 added.
  • July 3, 2025: Qwen 3 32B, GLM4-32B-0414 added.
  • July 2, 2025: Baidu Ernie 4.5 300B A47B, MiniMax-M1, Mistral Small 3.2 added.
  • June 10, 2025: o3-pro added.
  • June 5, 2025: Gemini 2.5 Pro Preview 06-05 added.
  • May 28, 2025: DeepSeek R1 05/28 added.
  • May 22, 2025: Claude 4 models added.
  • May 7, 2025: Gemini 2.5 Pro Preview 05-06 added. Mistral Medium 3 added.
  • Apr 30, 2025: Qwen 3 added.
  • Apr 18, 2025: o3, o4-mini, Gemini 2.5 Flash Preview added.
  • Apr 15, 2025: GPT-4.1 added.
  • Apr 10, 2025: Grok 3 added.
  • Apr 5, 2025: Llama 4 Maverick, Llama 4 Scout added.
  • Mar 28, 2025: GPT-4o March 2025 added.
  • Mar 25, 2025: 50 new questions added. Gemini 2.5 Pro Exp 03-25 and DeepSeek V3-0324 added.
  • Mar 23, 2025: Humans vs. LLMs section added.
  • Mar 21, 2025: o1-pro added. o3-mini-high added.
  • Mar 17, 2025: Cohere Command A and Mistral Small 3.1 added.
  • Mar 12, 2025: Gemma 3 27B added.
  • Mar 7, 2025: Qwen QwQ added.
  • Feb 27, 2025: GPT-4.5 Preview added.
  • Feb 24, 2025: Claude 3.7 Sonnet Thinking, Claude 3.7 Sonnet, GPT-4o Feb 2025, Qwen 2.5 Max, GPT-4o 2024-11-20 added.
  • Feb 6, 2025: Gemini 2.0 Pro Exp 02-05 added.
  • Feb 4, 2025: A new, more challenging version with extra words in each puzzle. Separate scoring for the 100 newest questions. Correlation heatmap.
  • Jan 31, 2025: o3-mini (72.4) added.
  • Jan 30, 2025: Mistral Small 3 (10.5) added.
  • Jan 29, 2025: DeepSeek R1 (54.5) added.
  • Jan 28, 2025: Qwen 2.5 Max (18.6) added.
  • Jan 22, 2025: Phi-4 (11.6), Nova Pro (12.5), Gemini 2.0 Flash Thinking Exp 01-21 (37.0) added.
  • Jan 16, 2025: Gemini 2.0 Flash Thinking Exp, o1, MiniMax-Text-01 added. Gemini 2.0 Flash Thinking Exp sometimes hits the output token limit.
  • Dec 27, 2024: GPT-4o 2024-11-20, Llama 3.3 70B, Gemini 2.0 Flash Exp, Deepseek-V3 added. Gemini 2.0 Flash Thinking Exp could not be benchmarked because its output gets cut off for some puzzles.
  • Claude 3.5 Haiku (13.7) added.
  • Claude 3.5 Sonnet (2024-10-22) added. Improves to 25.9 from 24.4.
  • Grok Beta added. Improves from 21.3 to 23.7. It's described as "experimental language model with state-of-the-art reasoning capabilities, best for complex and multi-step use cases. It is the successor of Grok 2 with enhanced context length."
  • Follow @lechmazur on X (Twitter) for other upcoming benchmarks and more.

About

Benchmark that evaluates LLMs using 940 NYT Connections puzzles extended with extra trick words
