CursorBench 3.2

We evaluate agents on ambiguous, multi-file tasks from real Cursor sessions. Higher scores are better.

	Model	Score	Avg cost / task
1	Fable 5 Max	70.5%	$17.32	103,525	72
2	Fable 5 Extra High	68.4%	$11.73	64,971	56
3	GPT-5.6 Sol Max	67.2%	$5.69	28,320	48
4	Grok 4.5 High*	66.7%	$1.51	19,521	33
5	Fable 5 High	66.5%	$8.77	43,747	48
6	Grok 4.5 Medium*	65.4%	$1.54	18,914	34
7	Fable 5 Medium	65.2%	$6.80	30,366	41
8	GPT-5.6 Terra Max	64.9%	$2.89	32,969	47
9	GPT-5.6 Sol Extra High	64.5%	$3.88	19,699	38
10	Grok 4.5 Low*	63.5%	$1.22	15,841	31
11	GPT-5.6 Sol High	63.5%	$2.79	13,867	32
12	Opus 4.8 Max	62.3%	$5.77	71,411	44
13	Fable 5 Low	62.1%	$4.46	18,182	31
14	Sonnet 5 Max	61.5%	$6.45	92,882	86
15	GPT-5.6 Luna Max	61.1%	$1.97	87,973	61
16	GPT-5.6 Sol Medium	60.0%	$1.95	9,747	27
17	Opus 4.8 Extra High	59.4%	$4.50	51,121	40
18	GPT-5.6 Terra Extra High	59.2%	$1.44	16,089	29
19	Sonnet 5 Extra High	58.7%	$4.16	52,871	67
20	GPT-5.5 High	58.4%	$2.05	12,183	28
21	GPT-5.5 Extra High	58.4%	$2.85	17,534	32
22	Opus 4.8 High	58.0%	$3.15	33,548	33
23	GPT-5.6 Luna Extra High	57.7%	$1.14	22,480	48
24	Sonnet 5 High	56.9%	$3.19	39,483	57
25	GPT-5.6 Luna High	56.8%	$0.82	15,141	40
26	Opus 4.8 Medium	56.1%	$2.81	28,384	32
27	Composer 2.5	56.1%	$0.44	14,286	33
28	GLM 5.2 Max	55.0%	$1.76	35,946	58
29	GPT-5.6 Terra High	54.2%	$0.89	9,468	23
30	GPT-5.5 Medium	53.8%	$1.51	8,522	25
31	Opus 4.8 Low	53.1%	$2.02	19,624	27
32	GPT-5.6 Sol Low	52.6%	$1.01	5,104	19
33	Sonnet 5 Medium	52.4%	$2.16	26,200	46
34	GLM 5.2 High	51.5%	$1.19	21,829	49
35	GPT-5.6 Terra Medium	50.3%	$0.61	6,222	20
36	Kimi K2.7 Code	49.7%	$1.43	31,247	58
37	Gemini 3.5 Flash	48.8%	$2.20	46,702	77
38	GPT-5.6 Luna Medium	47.7%	$0.39	7,095	28
39	Sonnet 5 Low	47.7%	$1.30	16,269	33
40	GPT-5.6 Terra Low	46.9%	$0.53	5,312	19
41	GPT-5.5 Low	46.6%	$0.98	5,168	20
42	GPT-5.6 Luna Low	37.6%	$0.16	3,209	17

* Grok 4.5 has an advantage on CursorBench: an earlier snapshot of the Cursor codebase was unintentionally included in its training data. The exact impact on its score is unclear. That data has been removed for future models. For a rundown of third-party benchmark scores, see the Grok 4.5 launch blog.

Changelog

Jul 9, 2026

Reporting

Updated GPT-5.6 Sol, Terra, and Luna results to account for cache write costs.

Jul 8, 2026

Tasks

CursorBench 3.2
- Introduced instruction-following and advanced tool-use problems.

May 19, 2026

Tasks

CursorBench 3.1
- Introduced problems focused on codebase understanding, bug finding, planning, and code review.
- Improved grading criteria for some edit tasks.

Mar 11, 2026

Tasks

CursorBench 3.0
- Initial set of tasks focused on edit, refactor, and bugfix problems.

Avg cost / task is computed by applying each model's published per-million-token pricing (input, cache read, cache write, and output) to the tokens it used on each task across the CursorBench 3.2 benchmarks, then averaging with the same task weights as the CursorBench 3.2 score. Results are subject to variance; small differences in scores may not be statistically meaningful.
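The cost calculation described above can be sketched roughly as follows. This is a minimal illustration, not the actual CursorBench pipeline: the names `Pricing`, `TaskUsage`, `task_cost`, and `avg_cost_per_task` are hypothetical, the prices and token counts are made up, and the weighting is assumed to be a normalized weighted mean over tasks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-million-token prices in USD for one model."""
    input: float
    cache_read: float
    cache_write: float
    output: float


@dataclass(frozen=True)
class TaskUsage:
    """Hypothetical token counts for one task, plus its score weight."""
    input: int
    cache_read: int
    cache_write: int
    output: int
    weight: float = 1.0


def task_cost(usage: TaskUsage, pricing: Pricing) -> float:
    """Dollar cost of one task: tokens times per-million price, per token type."""
    total = (
        usage.input * pricing.input
        + usage.cache_read * pricing.cache_read
        + usage.cache_write * pricing.cache_write
        + usage.output * pricing.output
    )
    return total / 1_000_000


def avg_cost_per_task(usages: list[TaskUsage], pricing: Pricing) -> float:
    """Weighted mean of per-task costs (assumed: same weights as the score)."""
    total_weight = sum(u.weight for u in usages)
    if total_weight <= 0:
        raise ValueError("task weights must sum to a positive number")
    return sum(u.weight * task_cost(u, pricing) for u in usages) / total_weight


# Made-up numbers, for illustration only.
EXAMPLE_PRICING = Pricing(input=2.0, cache_read=0.5, cache_write=2.5, output=10.0)
EXAMPLE_TASKS = [
    TaskUsage(input=100_000, cache_read=400_000, cache_write=50_000, output=20_000, weight=1.0),
    TaskUsage(input=50_000, cache_read=100_000, cache_write=0, output=5_000, weight=3.0),
]

print(f"avg cost / task: ${avg_cost_per_task(EXAMPLE_TASKS, EXAMPLE_PRICING):.2f}")
```

Dropping the `cache_write` term from `task_cost` would understate cost for any model that writes to the cache, which is presumably the kind of gap the Jul 9, 2026 reporting update closed for the GPT-5.6 models.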
