CursorBench 3.1

We evaluate agents on ambiguous, multi-file tasks from real Cursor sessions. Higher scores are better.


Rank	Model	Score	Avg cost / task
1	Opus 4.7 Max	64.8%	$11.02
2	GPT-5.5 Extra High	64.3%	$4.37
3	Opus 4.8 Max	63.8%	$7.59
4	Composer 2.5	63.2%	$0.55
5	GPT-5.5 High	62.6%	$3.59
6	Opus 4.8 Extra High	62.1%	$6.14
7	Opus 4.7 Extra High	61.6%	$7.11
8	Opus 4.7 High	59.4%	$5.01
9	GPT-5.5 Medium	59.2%	$2.22
10	Opus 4.8 High	58.4%	$4.41
11	Opus 4.8 Medium	56.6%	$3.83
12	Opus 4.8 Low	54.3%	$2.93
13	Opus 4.7 Medium	52.7%	$2.93
14	Composer 2	52.2%	$0.56
15	Gemini 3.5 Flash	49.8%	$1.94
16	Sonnet 4.6 Max	49.0%	$3.09
17	GPT-5.5 Low	48.8%	$1.19
18	Sonnet 4.6 High	48.8%	$3.06
19	Opus 4.7 Low	48.3%	$1.87
20	Kimi 2.6	47.6%	$1.27
21	Sonnet 4.6 Medium	46.0%	$2.64
22	Sonnet 4.6 Low	41.5%	$1.89
23	Kimi 2.5	31.9%	$0.87

Changelog

CursorBench 3.1

Introduced problems focused on codebase understanding, bug finding, planning, and code review.
Improved grading criteria for some edit tasks.

CursorBench 3.0

Initial set of tasks focused on edit, refactor, and bugfix problems.

Avg cost / task is computed by applying each model's published per-million-token pricing (input, cache read, and output) to the tokens it used on each CursorBench 3.1 task, then averaging across tasks. Results are subject to variance; small differences in scores may not be statistically meaningful.
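
The cost calculation described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual tooling: the `Pricing` and `TaskUsage` types, the function names, and every price and token count are hypothetical, and the sketch assumes that cache-read tokens are counted separately from regular input tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-million-token prices in USD (not any real model's)."""
    input: float
    cache_read: float
    output: float


@dataclass(frozen=True)
class TaskUsage:
    """Tokens one model used on one task (assumed: cache reads are not part of input)."""
    input: int
    cache_read: int
    output: int


def task_cost(usage: TaskUsage, pricing: Pricing) -> float:
    """Apply per-million-token pricing to a single task's token usage."""
    total = (
        usage.input * pricing.input
        + usage.cache_read * pricing.cache_read
        + usage.output * pricing.output
    )
    return total / 1_000_000


def avg_cost_per_task(usages: list[TaskUsage], pricing: Pricing) -> float:
    """Average the per-task costs across all tasks."""
    if not usages:
        raise ValueError("need at least one task")
    return sum(task_cost(u, pricing) for u in usages) / len(usages)


# Made-up numbers, for illustration only.
pricing = Pricing(input=2.00, cache_read=0.50, output=8.00)
usages = [
    TaskUsage(input=100_000, cache_read=400_000, output=20_000),  # costs $0.56
    TaskUsage(input=50_000, cache_read=200_000, output=10_000),   # costs $0.28
]

print(f"${avg_cost_per_task(usages, pricing):.2f}")  # prints $0.42
```

Because each task is priced first and the results are then averaged, a few token-heavy tasks can raise a model's average cost noticeably.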
