CursorBench 3.1
We evaluate agents on ambiguous, multi-file tasks from real Cursor sessions. Higher scores are better.
| Rank | Model | Score | Avg cost / task | | |
|---|---|---|---|---|---|
| 1 | Fable 5 Max | 72.9% | $18.02 | 63,842 | 76 |
| 2 | Fable 5 Extra High | 72.0% | $13.74 | 48,754 | 63 |
| 3 | Fable 5 High | 70.6% | $10.81 | 37,173 | 54 |
| 4 | Fable 5 Medium | 69.8% | $8.27 | 28,507 | 47 |
| 5 | Opus 4.7 Max | 64.8% | $11.02 | 62,989 | 96 |
| 6 | GPT-5.5 Extra High | 64.3% | $4.37 | 17,905 | 46 |
| 7 | Fable 5 Low | 64.2% | $5.70 | 18,882 | 36 |
| 8 | Opus 4.8 Max | 63.8% | $7.59 | 77,370 | 60 |
| 9 | Composer 2.5 | 63.2% | $0.55 | 15,152 | 37 |
| 10 | GPT-5.5 High | 62.6% | $3.59 | 13,329 | 40 |
| 11 | Opus 4.8 Extra High | 62.1% | $6.14 | 55,622 | 54 |
| 12 | Opus 4.7 Extra High | 61.6% | $7.11 | 43,942 | 72 |
| 13 | Opus 4.7 High | 59.4% | $5.01 | 32,227 | 59 |
| 14 | GPT-5.5 Medium | 59.2% | $2.22 | 9,065 | 35 |
| 15 | Opus 4.8 High | 58.4% | $4.41 | 36,788 | 45 |
| 16 | Opus 4.8 Medium | 56.6% | $3.83 | 31,684 | 41 |
| 17 | Opus 4.8 Low | 54.3% | $2.93 | 22,726 | 36 |
| 18 | Opus 4.7 Medium | 52.7% | $2.93 | 19,193 | 41 |
| 19 | Composer 2 | 52.2% | $0.56 | 14,163 | 40 |
| 20 | Gemini 3.5 Flash | 49.8% | $1.94 | 35,105 | 79 |
| 21 | Sonnet 4.6 Max | 49.0% | $3.09 | 40,280 | 55 |
| 22 | GPT-5.5 Low | 48.8% | $1.19 | 4,923 | 24 |
| 23 | Sonnet 4.6 High | 48.8% | $3.06 | 37,352 | 57 |
| 24 | Opus 4.7 Low | 48.3% | $1.87 | 13,164 | 29 |
| 25 | Kimi 2.6 | 47.6% | $1.27 | 24,783 | 56 |
| 26 | Sonnet 4.6 Medium | 46.0% | $2.64 | 31,360 | 50 |
| 27 | Sonnet 4.6 Low | 41.5% | $1.89 | 21,211 | 50 |
| 28 | Kimi 2.5 | 31.9% | $0.87 | 9,446 | 30 |
Changelog
CursorBench 3.1
- Introduced problems focused on codebase understanding, bugfinding, planning, and code review.
- Improved grading criteria for some edit tasks.
CursorBench 3.0
- Initial set of tasks focused on edit, refactor, and bugfix problems.
Avg cost / task is computed by applying each model's published per-million-token pricing (input, cache read, cache write, and output) to the tokens it used on each CursorBench 3.1 task, then averaging across tasks. Results are subject to variance; small differences in scores may not be statistically meaningful.
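
The cost calculation described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual implementation: the `Pricing` and `TaskUsage` types, the function names, and every price and token count in the example are hypothetical placeholders, chosen only to show the arithmetic of pricing four token categories per task and then averaging across tasks.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-million-token prices in USD, one per token category."""
    input: float
    cache_read: float
    cache_write: float
    output: float


@dataclass(frozen=True)
class TaskUsage:
    """Hypothetical token counts a model consumed on a single task."""
    input: int
    cache_read: int
    cache_write: int
    output: int


def task_cost(usage: TaskUsage, pricing: Pricing) -> float:
    """Price each token category separately, then convert to USD."""
    total_per_million = (
        usage.input * pricing.input
        + usage.cache_read * pricing.cache_read
        + usage.cache_write * pricing.cache_write
        + usage.output * pricing.output
    )
    return total_per_million / 1_000_000


def avg_cost_per_task(usages: Sequence[TaskUsage], pricing: Pricing) -> float:
    """Average the per-task costs (not the cost of the average token count)."""
    if not usages:
        raise ValueError("at least one task is required")
    return sum(task_cost(u, pricing) for u in usages) / len(usages)


if __name__ == "__main__":
    # Made-up prices and usage, for illustration only.
    pricing = Pricing(input=2.00, cache_read=0.20, cache_write=2.50, output=10.00)
    usages = [
        TaskUsage(input=10_000, cache_read=200_000, cache_write=20_000, output=5_000),
        TaskUsage(input=20_000, cache_read=100_000, cache_write=0, output=12_000),
    ]
    print(f"${avg_cost_per_task(usages, pricing):.2f}")
```

With these placeholder numbers the two tasks cost about $0.16 and $0.18, so the average is about $0.17. Because cache reads are typically priced well below fresh input tokens, two models with similar total token counts can still end up with quite different average costs.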