CursorBench 3.1
We evaluate agents on ambiguous, multi-file tasks from real Cursor sessions. Higher scores are better.
| Rank | Model | Score | Avg cost / task |
| --- | --- | --- | --- |
| 1 | Opus 4.7 Max | 64.8% | $11.02 |
| 2 | GPT-5.5 Extra High | 64.3% | $4.37 |
| 3 | Opus 4.8 Max | 63.8% | $7.59 |
| 4 | Composer 2.5 | 63.2% | $0.55 |
| 5 | GPT-5.5 High | 62.6% | $3.59 |
| 6 | Opus 4.8 Extra High | 62.1% | $6.14 |
| 7 | Opus 4.7 Extra High | 61.6% | $7.11 |
| 8 | Opus 4.7 High | 59.4% | $5.01 |
| 9 | GPT-5.5 Medium | 59.2% | $2.22 |
| 10 | Opus 4.8 High | 58.4% | $4.41 |
| 11 | Opus 4.8 Medium | 56.6% | $3.83 |
| 12 | Opus 4.8 Low | 54.3% | $2.93 |
| 13 | Opus 4.7 Medium | 52.7% | $2.93 |
| 14 | Composer 2 | 52.2% | $0.56 |
| 15 | Gemini 3.5 Flash | 49.8% | $1.94 |
| 16 | Sonnet 4.6 Max | 49.0% | $3.09 |
| 17 | GPT-5.5 Low | 48.8% | $1.19 |
| 18 | Sonnet 4.6 High | 48.8% | $3.06 |
| 19 | Opus 4.7 Low | 48.3% | $1.87 |
| 20 | Kimi 2.6 | 47.6% | $1.27 |
| 21 | Sonnet 4.6 Medium | 46.0% | $2.64 |
| 22 | Sonnet 4.6 Low | 41.5% | $1.89 |
| 23 | Kimi 2.5 | 31.9% | $0.87 |
Changelog
CursorBench 3.1
- Introduced problems focused on codebase understanding, bug finding, planning, and code review.
- Improved grading criteria for some edit tasks.
CursorBench 3.0
- Initial set of tasks focused on edit, refactor, and bugfix problems.
Avg cost / task is computed by applying each model's published per-million-token pricing (input, cache read, and output) to the tokens it used on each CursorBench 3.1 task, then averaging across tasks. Results are subject to variance; small differences in scores may not be statistically meaningful.
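
The cost calculation described above can be sketched roughly as follows. This is only an illustration, not the benchmark's actual code: the `Pricing` and `TaskUsage` types, the function names, and all prices and token counts are hypothetical, and the sketch assumes that cache-read tokens are counted separately from regular input tokens.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-million-token prices in USD."""
    input: float
    cache_read: float
    output: float


@dataclass(frozen=True)
class TaskUsage:
    """Hypothetical token counts for one task (assumed to be disjoint buckets)."""
    input_tokens: int
    cache_read_tokens: int
    output_tokens: int


def task_cost(usage: TaskUsage, pricing: Pricing) -> float:
    """Cost in USD of a single task: tokens times price, per million tokens."""
    total = (
        usage.input_tokens * pricing.input
        + usage.cache_read_tokens * pricing.cache_read
        + usage.output_tokens * pricing.output
    )
    return total / 1_000_000


def avg_cost_per_task(usages: Sequence[TaskUsage], pricing: Pricing) -> float:
    """Mean of the per-task costs across all tasks."""
    if not usages:
        raise ValueError("at least one task is required")
    return sum(task_cost(u, pricing) for u in usages) / len(usages)


# Made-up numbers, chosen only to keep the arithmetic easy to follow.
example_pricing = Pricing(input=2.0, cache_read=0.5, output=10.0)
example_usages = [
    TaskUsage(input_tokens=1_000_000, cache_read_tokens=2_000_000, output_tokens=100_000),
    TaskUsage(input_tokens=500_000, cache_read_tokens=0, output_tokens=100_000),
]
example_avg = avg_cost_per_task(example_usages, example_pricing)
```

With these made-up inputs the two tasks cost $4.00 and $2.00, so the average is $3.00. Because the average is taken per task, a few long-running tasks can dominate a model's reported cost.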