lechmazur/nyt-connections
Extended Version

This benchmark evaluates large language models (LLMs) using 940 NYT Connections puzzles, with additional words included to increase difficulty.

As of Feb 4, 2025, there is a new version of the benchmark. The standard NYT Connections benchmark is nearing saturation, with o1 scoring 90.7 and o3 and other reasoning models expected this year. Under the standard rules, a solver only needs to identify three categories; the fourth falls into place. To increase difficulty, Extended Connections adds up to four extra trick words to each puzzle. We double-check that none of the added words fit into any category used in the corresponding puzzle. New puzzles have expanded the total from 436 to 940 as of Feb 2, 2026.
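The extension step described above can be sketched in code. This is a minimal illustration, not the benchmark's actual pipeline: the puzzle representation, the `trick_pool`, and the `fits_category` checker are all hypothetical (in the real benchmark, the no-fit check is verified by hand).

```python
import random

def extend_puzzle(puzzle, trick_pool, n_extra=4, fits_category=None):
    """Add up to `n_extra` trick words to a Connections puzzle.

    `puzzle` is assumed to be {"categories": {name: [4 words]}}.
    `fits_category(word, name, members)` is a hypothetical checker that
    returns True if `word` could plausibly belong to that category;
    candidates that fit any real category are rejected.
    """
    existing = {w for ws in puzzle["categories"].values() for w in ws}
    extras = []
    for word in random.sample(trick_pool, len(trick_pool)):
        if word in existing:
            continue  # never duplicate a word already in the puzzle
        if fits_category and any(
            fits_category(word, name, members)
            for name, members in puzzle["categories"].items()
        ):
            continue  # reject trick words that fit a real category
        extras.append(word)
        if len(extras) == n_extra:
            break
    return existing | set(extras), extras
```

The extended word list (16 originals plus up to 4 tricks) is then shuffled and presented to the model in place of the standard 16-word grid.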

Chart: Extended Version

Leaderboard

Leaderboard: Extended Version

Rank Model Score % #Puzzles
1 Gemini 3.1 Pro Preview 98.4 940
2 gemini-3-pro-preview 96.3 940
3 Claude Opus 4.6 (high reasoning) 94.7 940
4 GPT-5.4 (xhigh reasoning) 94.0 940
5 GPT-5.4 (high reasoning) 93.6 940
6 Grok 4.20 Multi-Agent Exp Beta 0304 93.4 940
7 GPT-5.4 (medium reasoning) 91.9 940
8 grok-4-1-fast-reasoning 91.7 940
9 Grok 4.20 0309 (Reasoning) 90.3 940
10 grok-4.20-experimental-beta-0304-reasoning 89.5 940
11 gpt-5.2-xhigh 88.6 940
12 Gemini 3 Flash Preview 88.4 940
13 GPT-5.2 Pro 85.7 940
14 Claude Sonnet 4.6 (high reasoning) 85.7 940
15 GLM-5.1 84.3 940
16 Claude Sonnet 4.6 Thinking 32K 82.4 940
17 GLM-5 81.7 940
18 Claude Opus 4.6 Thinking 16K 81.7 940
19 Gemma 4 31B Reasoning 79.5 940
20 Kimi K2.5 Thinking 78.3 940
21 gpt-5.2-high 77.5 940
22 GPT-5.4 Mini (xhigh reasoning) 71.8 940
23 gpt-5.2-medium 71.4 940
24 Qwen 3.6 Plus 71.3 940
25 Qwen3.5-397B-A17B 69.2 940
26 gpt-5.2-low 66.7 940
27 Qwen3.5-122B-A10B 63.6 940
28 Claude Opus 4.5 Thinking 16K 62.6 940
29 Qwen3.5-27B 60.7 940
30 Claude Opus 4.5 (no reasoning) 60.3 940
31 Claude Sonnet 4.6 Thinking 16K 57.6 940
32 Claude Opus 4.6 (no reasoning) 55.9 940
33 Claude Sonnet 4.6 (no reasoning) 55.0 940
34 DeepSeek V3.2 50.2 940
35 Claude Sonnet 4.5 Thinking 16K 49.4 940
36 Claude Sonnet 4.5 (no reasoning) 47.4 940
37 qwen3-max-2026-01-23 42.1 940
38 ByteDance Seed2.0 Pro 42.1 940
39 Claude Opus 4.7 (high reasoning) 41.0 940
40 Xiaomi MiMo V2 Pro 40.9 940
41 Step 3.5 Flash 39.9 940
42 MiniMax-M2.7 35.2 940
43 GPT-5.4 (no reasoning) 32.8 940
44 LongCat Flash Thinking 31.0 940
45 Gemma 4 31B IT 30.1 940
46 minimax-m2.5 29.6 940
47 Arcee Trinity Large Thinking 29.5 940
48 gpt-5.2-none 28.1 940
49 minimax-m2 27.0 940
50 Claude 4.5 Haiku 26.0 940
51 grok-4-1-fast-non-reasoning 25.1 940
52 qwen3-max-thinking 24.1 940
53 minimax-m2.1 22.7 940
54 Baidu Ernie 5.0 21.2 940
55 Gemini 3.1 Flash-Lite Preview 19.7 940
56 Grok 4.20 0309 (Non-Reasoning) 19.2 940
57 Llama 4 Maverick 18.4 940
58 DeepSeek V3.2 (no reasoning) 17.8 940
59 grok-4.20-experimental-beta-0304-non-reasoning 17.6 940
60 Mistral Large 3 17.2 940
61 Mistral Medium 3.1 15.5 940
62 Claude Opus 4.7 (no reasoning) 15.3 940

Correlation of puzzle-level results: heatmap

Correlations


Newest 100 puzzles

To counteract the possibility of an LLM's training data including the solutions, we have also tested only the 100 latest puzzles. Note that lower scores on this subset do not necessarily indicate that NYT Connections solutions are in the training data, since the earliest puzzles were easier.


Chart: Newest 100 puzzles, extended version



Humans vs. LLMs

To explore how top LLMs compare to humans on the New York Times Connections puzzle, we used official NYT performance data from December 2024 to February 2025, as analyzed by u/Bryschien1996, alongside a simulated gameplay setup that mirrors the human experience. In this setup, solvers iteratively propose groups of four, receive feedback ("correct," "one away," or "incorrect"), and are allowed up to four mistakes before failing. According to the NYT data, the average human player solved approximately 71% of puzzles over this three-month period, with daily solve rates ranging from 39% on the toughest day (February 2, 2025) to 98% on the easiest (February 26, 2025). It's worth noting that NYT Connections players are self-selected and likely perform better than the general population. We collected data from nine LLMs spanning a range of scores on the Extended Connections benchmark.
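The game loop described above can be sketched as follows. This is an illustrative harness, not the benchmark's published code; the `solution` representation and the `propose` callback (the human or LLM solver) are assumptions.

```python
def play_connections(solution, propose, max_mistakes=4):
    """Simulate the NYT-style game loop used in the humans-vs-LLMs comparison.

    `solution` is a list of four 4-word frozensets (the true categories).
    `propose(remaining, history)` is the solver and returns four words.
    Feedback mirrors the website: "correct", "one away" (3 of 4 words
    belong to one category), or "incorrect".
    """
    remaining = set().union(*solution)
    unsolved = list(solution)
    history, mistakes = [], 0
    while unsolved and mistakes < max_mistakes:
        guess = frozenset(propose(remaining, history))
        if guess in unsolved:
            feedback = "correct"
            unsolved.remove(guess)
            remaining -= guess
        else:
            mistakes += 1
            near = any(len(guess & grp) == 3 for grp in unsolved)
            feedback = "one away" if near else "incorrect"
        history.append((guess, feedback))
    return not unsolved  # True = solved within four mistakes
```

A win rate is then just the fraction of puzzles for which this loop returns True, which is the statistic compared against the human solve rates above.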

Chart: Humans vs. LLMs

The results reveal that top reasoning LLMs from OpenAI consistently outperform the average human player. DeepSeek R1 performs closest to the level of an average NYT Connections player.

Elite human players, however, set a higher standard, achieving a 100% win rate during the same period.

o1, with a 98.9% win rate, comes close to this elite level. o1-pro, which has not yet been tested in this gameplay simulation setup, might be able to match these top humans. Thus, directly determining whether AI achieves superhuman performance on NYT Connections could hinge on comparing the number of mistakes made before fully solving each puzzle.


Original NYT Connections LLM Benchmark

This benchmark evaluates large language models (LLMs) using 436 NYT Connections puzzles. Three different prompts are used, none optimized for LLMs through prompt engineering. Both uppercase and lowercase puzzles are assessed.

Chart: Original Version


Leaderboard: Original Version

Model Score
o1 90.7
o1-preview 87.1
o3-mini 72.4
DeepSeek R1 54.4
o1-mini 42.2
Multi-turn ensemble 37.8
Gemini 2.0 Flash Thinking Exp 01-21 37.0
GPT-4 Turbo 28.3
GPT-4o 2024-11-20 27.9
GPT-4o 2024-08-06 26.5
Llama 3.1 405B 26.3
Claude 3.5 Sonnet (2024-10-22) 25.9
Claude 3 Opus 24.8
Grok Beta 23.7
Llama 3.3 70B 23.7
Gemini 1.5 Pro (Sept) 22.7
Deepseek-V3 21.0
Gemini 2.0 Flash Exp 20.0
Gemma 2 27B 18.8
Qwen 2.5 Max 18.6
Gemini 2.0 Flash Thinking Exp 18.6
Mistral Large 2 17.4
Qwen 2.5 72B 14.8
Claude 3.5 Haiku 13.7
MiniMax-Text-01 13.6
Nova Pro 12.5
Phi-4 11.6
Mistral Small 3 10.5
DeepSeek-V2.5 9.9

Leaderboard: Older models

These models are excluded from the main leaderboard because they were run on fewer than 940 puzzles.

Rank Model Score % #Puzzles Coverage
1 Sherlock Think Alpha 92.4 759 759/940
2 Grok 4 Fast Reasoning 92.1 759 759/940
3 Grok 4 91.7 759 759/940
4 Sonoma Sky Alpha 90.7 759 759/940
5 o3-pro (medium reasoning) 87.3 759 759/940
6 GPT-5 Pro 83.9 759 759/940
7 o1-pro (medium reasoning) 82.5 651 651/940
8 o3 (high reasoning) 78.6 759 759/940
9 GPT-5 (high reasoning) 77.0 759 759/940
10 o4-mini (high reasoning) 73.6 759 759/940
11 o3 (medium reasoning) 73.0 759 759/940
12 GPT-5 (medium reasoning) 72.2 759 759/940
13 o1 (medium reasoning) 70.8 651 651/940
14 GPT-5.1 (high reasoning) 69.9 759 759/940
15 o4-mini (medium reasoning) 68.8 651 651/940
16 GPT-5 mini (medium reasoning) 66.9 759 759/940
17 GPT-5 (low reasoning) 65.4 759 759/940
18 GPT-5.1 (medium reasoning) 62.7 759 759/940
19 o3-mini (high reasoning) 61.4 651 651/940
20 GLM-4.7 59.5 767 767/940
21 Claude Opus 4.1 Thinking 16K 58.8 759 759/940
22 DeepSeek V3.1 Reasoner 57.7 759 759/940
23 Gemini 2.5 Pro 57.6 759 759/940
24 Kimi K2 Thinking 64K 57.3 924 924/940
25 Qwen 3 235B A22B 54.3 759 759/940
26 Gemini 2.5 Pro Exp 03-25 54.1 651 651/940
27 o3-mini (medium reasoning) 53.6 651 651/940
28 Claude Opus 4 Thinking 16K 49.7 759 759/940
29 DeepSeek R1 05/28 48.6 759 759/940
30 Qwen 3 235B A22B 25-07 Think 46.2 759 759/940
31 Gemini 2.5 Pro Preview 05-06 42.5 651 651/940
32 Claude Sonnet 4 Thinking 16K 40.3 759 759/940
33 Claude Sonnet 4 Thinking 64K 39.6 651 651/940
34 GPT-OSS-120B 38.7 759 759/940
35 DeepSeek R1 38.6 651 651/940
36 Claude Opus 4.1 (no reasoning) 37.1 759 759/940
37 Qwen 3 30B A3B 36.6 759 759/940
38 Qwen 3 32B 35.8 759 759/940
39 Qwen 3 30B A3B 25-07 Thinking 35.5 759 759/940
40 Claude Opus 4 (no reasoning) 34.4 759 759/940
41 GPT-4.5 Preview 34.2 651 651/940
42 Claude 3.7 Sonnet Thinking 16K 33.6 651 651/940
43 Qwen 3 Next 80B A3B Thinking 32.9 759 759/940
44 Qwen QwQ-32B 16K 31.4 651 651/940
45 Grok 3 Mini Beta (high) 30.2 759 759/940
46 GLM-4.5 30.2 759 759/940
47 Claude Opus 4.6 Thinking 32K 28.1 98 98/940
48 GPT-5 (minimal reasoning) 27.3 759 759/940
49 o1-mini 26.9 651 651/940
50 Claude Sonnet 4 (no reasoning) 26.6 759 759/940
51 Grok 3 Mini Beta (low) 26.0 651 651/940
52 Quasar Alpha 25.4 651 651/940
53 Cohere Command A Reasoning 16K 25.3 759 759/940
54 Gemini 2.5 Flash 25.2 759 759/940
55 Sherlock Dash Alpha 25.1 759 759/940
56 Grok 4 Fast Non-Reasoning 24.9 759 759/940
57 GPT-4o Mar 2025 24.5 759 759/940
58 GLM-4.6 24.2 759 759/940
59 Qwen 3 Max Preview 23.9 759 759/940
60 Kimi K2-0905 23.6 759 759/940
61 Gemini 2.0 Flash Think Exp 01-21 23.1 649 649/940
62 GPT-4.1 22.8 759 759/940
63 Sonoma Dusk Alpha 22.8 759 759/940
64 GPT-4o Feb 2025 22.7 651 651/940
65 GPT-5.1 (no reasoning) 22.1 759 759/940
66 Polaris Alpha 21.8 759 759/940
67 Gemini 2.0 Pro Exp 02-05 21.7 651 651/940
68 DeepSeek V3.1 Non-Think 21.6 759 759/940
69 MiniMax-M1 21.3 688 688/940
70 Kimi K2 19.8 759 759/940
71 Qwen 3 235B A22B 25-07 Instruct 19.8 759 759/940
72 Grok 3 Beta (no reasoning) 19.7 759 759/940
73 Grok 2 12-12 19.2 651 651/940
74 Gemini 1.5 Pro (Sept) 19.2 601 601/940
75 Claude 3 Opus 19.2 650 650/940
76 Claude 3.7 Sonnet 19.2 651 651/940
77 Gemini 2.0 Flash 18.8 651 651/940
78 GPT-4o 2024-11-20 18.7 601 601/940
79 Qwen 2.5 Max 18.0 651 651/940
80 GPT-4o 2024-08-06 17.8 601 601/940
81 Claude 3.5 Sonnet 2024-10-22 17.7 651 651/940
82 Llama 4 Scout 17.4 759 759/940
83 DeepSeek V3-0324 16.8 759 759/940
84 Llama 3.1 405B 16.2 651 651/940
85 DeepSeek V3 15.1 651 651/940
86 Llama 3.3 70B 15.1 651 651/940
87 Baidu Ernie 4.5 300B A47B 14.8 759 759/940
88 GPT-4.1 mini 14.4 759 759/940
89 LongCat Flash 13.9 660 660/940
90 MiniMax-Text-01 13.8 759 759/940
91 Cohere Command A 13.1 759 759/940
92 Mistral Large 2 12.4 759 759/940
93 Gemma 2 27B 12.2 651 651/940
94 Gemma 3 27B 11.6 759 759/940
95 Mistral Medium 3 11.5 759 759/940
96 Mistral Small 3.1 11.4 651 651/940
97 Mistral Small 3.2 11.2 759 759/940
98 Qwen 2.5 72B 10.5 759 759/940
99 Claude 3.5 Haiku 10.0 759 759/940
100 Amazon Nova Pro 9.9 759 759/940
101 Microsoft Phi-4 9.9 759 759/940
102 GPT-4o mini 9.7 759 759/940
103 Mistral Small 3 8.9 601 601/940
104 GPT-4.1 nano 8.1 759 759/940
105 GLM4-32B-0414 7.6 759 759/940
106 Claude 3 Haiku 2.2 601 601/940

Notes

  • Claude Opus 4.7 refuses many requests.
  • Partial credit is awarded if the puzzle isn't completely solved.
  • Only one attempt is allowed per puzzle. Humans solving puzzles on the NYT website get four attempts and a notification when they're one step away from a correct group.
  • Multi-turn ensemble is my unpublished system. It utilizes multiple LLMs, multi-turn dialogues, and other proprietary techniques. It is slower and more costly to run, but it performs very well, outperforming non-o1 LLMs on MMLU-Pro and GPQA.
  • This benchmark is not affiliated with the New York Times.
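The single-attempt, partial-credit scoring in the notes above can be sketched as follows. The exact weighting used by the benchmark is not specified here, so this fraction-of-categories formula is an assumption for illustration only.

```python
def score_attempt(groups_guessed, solution):
    """Partial-credit sketch: fraction of the four categories reproduced
    exactly in a model's single attempt.

    `solution` is a set of four 4-word frozensets; `groups_guessed` is the
    model's proposed grouping. An exact category match earns credit; there
    is no feedback loop, unlike the NYT website's four-attempt format.
    """
    correct = sum(1 for g in groups_guessed if frozenset(g) in solution)
    return correct / len(solution)
```

Under this sketch, a fully solved puzzle scores 1.0 and a puzzle with two of four categories correct scores 0.5.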

Other multi-agent benchmarks

Other benchmarks


Updates

  • April 16, 2026: Claude Opus 4.7 added.
  • April 15, 2026: GLM-5.1, Step 3.5 Flash, Qwen3.5-27B added.
  • April 6, 2026: GPT 5.4 (high), Gemma 4 31B Reasoning, Qwen3.5-122B-A10B added.
  • April 4, 2026: MiniMax-M2.7 added.
  • April 3, 2026: Arcee Trinity Large Thinking, Qwen 3.6 Plus, Gemma 4 31B added.
  • Mar 6, 2026: Grok 4.20 Beta Experimental, Gemini 3.1 Flash-Lite Preview added.
  • Mar 5, 2026: GPT-5.4 added.
  • Feb 23, 2026: GLM-5 added.
  • Feb 20, 2026: Gemini 3.1 Pro Preview, ByteDance Seed2.0 Pro, Baidu Ernie 5.0 added.
  • Feb 17, 2026: Claude Sonnet 4.6, Qwen3.5-397B-A17B, MiniMax-M2.5 added.
  • Feb 6, 2026: Claude Opus 4.6 added.
  • Feb 2, 2026: 940 total puzzles. Kimi K2.5 Thinking, Qwen3 Max (2026-01-23), MiniMax-M2.1, DeepSeek V3.2 added.
  • Dec 17, 2025: Gemini 3 Flash Preview added.
  • Dec 12, 2025: GPT 5.2 xhigh, GPT 5.2 Pro added.
  • Dec 11, 2025: GPT 5.2 added.
  • Dec 2, 2025: Mistral Large 3 added.
  • Nov 24, 2025: Claude Opus 4.5 added.
  • Nov 21, 2025: Grok 4.1 Fast added.
  • Nov 18, 2025: Gemini 3 Pro Preview, GPT 5.1 added.
  • Nov 12, 2025: Kimi K2 Thinking added.
  • Oct 15, 2025: Claude Haiku 4.5 added.
  • Oct 14, 2025: Claude Sonnet 4.5, Deepseek V3.2 Exp, GLM-4.6 added.
  • Sep 19, 2025: Grok 4 Fast, Qwen 3 Next 80B A3B Thinking, LongCat Flash Chat added.
  • Sep 6, 2025: Kimi K2-0905 added.
  • Sep 5, 2025: Qwen 3 Max Preview, Qwen 3 235B A22B 25-07 Instruct added.
  • Aug 23, 2025: GPT-5 high reasoning and Cohere Command A Reasoning (16K) added.
  • Aug 22, 2025: DeepSeek 3.1, Qwen 3 30B A3B 25-07, Mistral Medium 3.1, GPT-5 minimal and low reasoning added.
  • Aug 7, 2025: GPT-5 added.
  • Aug 5, 2025: Claude Opus 4.1, GPT-OSS-120B added.
  • July 28, 2025: GLM-4.5, Qwen 3 235B A22B 25-07 Thinking added.
  • July 14, 2025: 108 new puzzles added. Kimi K2 added.
  • July 10, 2025: Grok 4 added.
  • July 3, 2025: Qwen 3 32B, GLM4-32B-0414 added.
  • July 2, 2025: Baidu Ernie 4.5 300B A47B, MiniMax-M1, Mistral Small 3.2 added.
  • June 10, 2025: o3-pro added.
  • June 5, 2025: Gemini 2.5 Pro Preview 06-05 added.
  • May 28, 2025: DeepSeek R1 05/28 added.
  • May 22, 2025: Claude 4 models added.
  • May 7, 2025: Gemini 2.5 Pro Preview 05-06 added. Mistral Medium 3 added.
  • Apr 30, 2025: Qwen 3 added.
  • Apr 18, 2025: o3, o4-mini, Gemini 2.5 Flash Preview added.
  • Apr 15, 2025: GPT-4.1 added.
  • Apr 10, 2025: Grok 3 added.
  • Apr 5, 2025: Llama 4 Maverick, Llama 4 Scout added.
  • Mar 28, 2025: GPT-4o March 2025 added.
  • Mar 25, 2025: 50 new questions added. Gemini 2.5 Pro Exp 03-25 and DeepSeek V3-0324 added.
  • Mar 23, 2025: Humans vs. LLMs section added.
  • Mar 21, 2025: o1-pro added. o3-mini-high added.
  • Mar 17, 2025: Cohere Command A and Mistral Small 3.1 added.
  • Mar 12, 2025: Gemma 3 27B added.
  • Mar 7, 2025: Qwen QwQ added.
  • Feb 27, 2025: GPT-4.5 Preview added.
  • Feb 24, 2025: Claude 3.7 Sonnet Thinking, Claude 3.7 Sonnet, GPT-4o Feb 2025, Qwen 2.5 Max, GPT-4o 2024-11-20 added.
  • Feb 6, 2025: Gemini 2.0 Pro Exp 02-05 added.
  • Feb 4, 2025: A new, more challenging version with extra words in each puzzle. Separate scoring for the 100 newest questions. Correlation heatmap.
  • Jan 31, 2025: o3-mini (72.4) added.
  • Jan 30, 2025: Mistral Small 3 (10.5) added.
  • Jan 29, 2025: DeepSeek R1 (54.5) added.
  • Jan 28, 2025: Qwen 2.5 Max (18.6) added.
  • Jan 22, 2025: Phi-4 (11.6), Nova Pro (12.5), Gemini 2.0 Flash Thinking Exp 01-21 (37.0) added.
  • Jan 16, 2025: Gemini 2.0 Flash Thinking Exp, o1, MiniMax-Text-01 added. Gemini 2.0 Flash Thinking Exp sometimes hits the output token limit.
  • Dec 27, 2024: GPT-4o 2024-11-20, Llama 3.3 70B, Gemini 2.0 Flash Exp, Deepseek-V3 added. Gemini 2.0 Flash Thinking Exp could not be benchmarked because its output gets cut off for some puzzles.
  • Claude 3.5 Haiku (13.7) added.
  • Claude 3.5 Sonnet (2024-10-22) added. Improves to 25.9 from 24.4.
  • Grok Beta added. Improves from 21.3 to 23.7. It's described as "experimental language model with state-of-the-art reasoning capabilities, best for complex and multi-step use cases. It is the successor of Grok 2 with enhanced context length."
  • Follow @lechmazur on X (Twitter) for other upcoming benchmarks and more.

About

Benchmark that evaluates LLMs using 940 NYT Connections puzzles extended with extra trick words
