CursorBench 3.2

We evaluate agents on ambiguous, multi-file tasks from real Cursor sessions. Higher scores are better.

Rank	Model	Score	Avg cost / task
1	Fable 5 Max	70.5%	$17.32	103,525	72
2	Opus 5 Max	70.0%	$8.23	61,838	78
3	Opus 5 Extra High	69.3%	$7.35	54,239	72
4	Fable 5 Extra High	68.4%	$11.73	64,971	56
5	GPT-5.6 Sol Max	67.2%	$5.69	28,320	48
6	Opus 5 High	66.7%	$3.91	27,932	48
7	Grok 4.5 High*	66.7%	$1.51	19,521	33
8	Fable 5 High	66.5%	$8.77	43,747	48
9	Grok 4.5 Medium*	65.4%	$1.54	18,914	34
10	Fable 5 Medium	65.2%	$6.80	30,366	41
11	GPT-5.6 Terra Max	64.9%	$2.89	32,969	47
12	GPT-5.6 Sol Extra High	64.5%	$3.88	19,699	38
13	Opus 5 Medium	64.3%	$3.29	23,612	44
14	Grok 4.5 Low*	63.5%	$1.22	15,841	31
15	GPT-5.6 Sol High	63.5%	$2.79	13,867	32
16	Opus 5 Low	62.8%	$2.55	18,529	37
17	Opus 4.8 Max	62.3%	$5.77	71,411	44
18	Fable 5 Low	62.1%	$4.46	18,182	31
19	Sonnet 5 Max	61.5%	$6.45	92,882	86
20	GPT-5.6 Luna Max	61.1%	$1.97	87,973	61
21	Kimi K3 Max	60.8%	$2.70	38,428	57
22	GPT-5.6 Sol Medium	60.0%	$1.95	9,747	27
23	Kimi K3 High	59.7%	$1.89	26,846	47
24	Opus 4.8 Extra High	59.4%	$4.50	51,121	40
25	GPT-5.6 Terra Extra High	59.2%	$1.44	16,089	29
26	Sonnet 5 Extra High	58.7%	$4.16	52,871	67
27	GPT-5.5 High	58.4%	$2.05	12,183	28
28	GPT-5.5 Extra High	58.4%	$2.85	17,534	32
29	Opus 4.8 High	58.0%	$3.15	33,548	33
30	GPT-5.6 Luna Extra High	57.7%	$1.14	22,480	48
31	Sonnet 5 High	56.9%	$3.19	39,483	57
32	GPT-5.6 Luna High	56.8%	$0.82	15,141	40
33	Opus 4.8 Medium	56.1%	$2.81	28,384	32
34	Composer 2.5	56.1%	$0.44	14,286	33
35	GLM 5.2 Max	55.0%	$1.76	35,946	58
36	GPT-5.6 Terra High	54.2%	$0.89	9,468	23
37	GPT-5.5 Medium	53.8%	$1.51	8,522	25
38	Gemini 3.6 Flash High	53.5%	$1.56	30,436	64
39	Opus 4.8 Low	53.1%	$2.02	19,624	27
40	GPT-5.6 Sol Low	52.6%	$1.01	5,104	19
41	Sonnet 5 Medium	52.4%	$2.16	26,200	46
42	GLM 5.2 High	51.5%	$1.19	21,829	49
43	Gemini 3.6 Flash Medium	51.2%	$1.48	28,511	62
44	Kimi K3 Low	50.5%	$0.99	13,007	33
45	GPT-5.6 Terra Medium	50.3%	$0.61	6,222	20
46	Kimi K2.7 Code	49.7%	$1.43	31,247	58
47	Gemini 3.5 Flash	48.8%	$2.20	46,702	77
48	GPT-5.6 Luna Medium	47.7%	$0.39	7,095	28
49	Sonnet 5 Low	47.7%	$1.30	16,269	33
50	Gemini 3.6 Flash Low	47.4%	$1.13	20,529	50
51	GPT-5.6 Terra Low	46.9%	$0.53	5,312	19
52	GPT-5.5 Low	46.6%	$0.98	5,168	20
53	GPT-5.6 Luna Low	37.6%	$0.16	3,209	17

* Grok 4.5 has an advantage on CursorBench: an earlier snapshot of the Cursor codebase was unintentionally included in its training data. The exact score impact is unclear. That data has been removed for future models. For a rundown of third-party benchmark scores, see the Grok 4.5 launch blog.

Changelog

Jul 9, 2026

Reporting

Updated GPT-5.6 Sol, Terra, and Luna results to account for cache write costs.

Jul 8, 2026

Tasks

CursorBench 3.2
- Introduced instruction following and advanced tool use problems.

May 19, 2026

Tasks

CursorBench 3.1
- Introduced problems focused on codebase understanding, bugfinding, planning, and code review.
- Improved grading criteria for some edit tasks.

Mar 11, 2026

Tasks

CursorBench 3.0
- Initial set of tasks focused on edit, refactor, and bugfix problems.

Avg cost / task is computed by applying each model's published per-million-token pricing (input, cache read, cache write, and output) to the tokens it used on each task. Results are subject to variance; small differences in scores may not be statistically meaningful.
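The cost calculation described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: the `Pricing` and `TokenUsage` types, the function names, and every price and token count in the usage example are hypothetical placeholders rather than published figures.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Pricing:
    """Per-million-token prices in USD (hypothetical values, not real pricing)."""
    input: float
    cache_read: float
    cache_write: float
    output: float


@dataclass(frozen=True)
class TokenUsage:
    """Tokens one model consumed on a single task, split by billing category."""
    input: int
    cache_read: int
    cache_write: int
    output: int


def task_cost(usage: TokenUsage, pricing: Pricing) -> float:
    """Apply per-million-token prices to the tokens used on one task."""
    total = (
        usage.input * pricing.input
        + usage.cache_read * pricing.cache_read
        + usage.cache_write * pricing.cache_write
        + usage.output * pricing.output
    )
    return total / 1_000_000


def avg_cost_per_task(usages: Sequence[TokenUsage], pricing: Pricing) -> float:
    """Average the per-task cost over all tasks a model attempted."""
    if not usages:
        raise ValueError("at least one task is required")
    return sum(task_cost(u, pricing) for u in usages) / len(usages)


if __name__ == "__main__":
    # Made-up prices and token counts, for illustration only.
    pricing = Pricing(input=2.0, cache_read=0.5, cache_write=4.0, output=10.0)
    usages = [
        TokenUsage(input=1_000_000, cache_read=2_000_000, cache_write=500_000, output=100_000),
        TokenUsage(input=500_000, cache_read=0, cache_write=0, output=100_000),
    ]
    print(f"${avg_cost_per_task(usages, pricing):.2f}")
```

Keeping cache writes as a separate category matters because they are priced separately from input and cache reads; leaving them out would understate the cost of any model that writes to the cache.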
