CursorBench 3.1
We evaluate agents on ambiguous, multi-file tasks from real Cursor sessions. Higher scores are better.
| Rank | Model | Score | Avg cost / task |
| --- | --- | --- | --- |
| 1 | Opus 4.7 Max | 64.8% | $11.02 |
| 2 | GPT-5.5 Extra High | 64.3% | $4.37 |
| 3 | Composer 2.5 | 63.2% | $0.55 |
| 4 | GPT-5.5 High | 62.6% | $3.59 |
| 5 | Opus 4.7 Extra High | 61.6% | $7.11 |
| 6 | Opus 4.7 High | 59.4% | $5.01 |
| 7 | GPT-5.5 Medium | 59.2% | $2.22 |
| 8 | Opus 4.7 Medium | 52.7% | $2.93 |
| 9 | Composer 2 | 52.2% | $0.56 |
| 10 | Gemini 3.5 Flash | 49.8% | $1.94 |
| 11 | GPT-5.5 Low | 48.8% | $1.19 |
| 12 | Opus 4.7 Low | 48.3% | $1.87 |
| 13 | Kimi 2.6 | 47.6% | $1.27 |
| 14 | Kimi 2.5 | 31.9% | $0.87 |
Changelog
CursorBench 3.1
- Introduced problems focused on codebase understanding, bug finding, planning, and code review.
- Improved grading criteria for some edit tasks.
CursorBench 3.0
- Initial set of tasks focused on edit, refactor, and bugfix problems.
Avg cost / task is computed by applying each model's published per-million-token pricing (input, cache read, and output) to the tokens it used on each CursorBench 3.1 task, then averaging across tasks.
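
The cost calculation described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual tooling: the `Pricing` and `TaskUsage` types, the function names, and every price and token count are hypothetical placeholders, and it assumes cache-read tokens are counted separately from regular input tokens.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Pricing:
    """USD per one million tokens (placeholder values, not real prices)."""
    input: float
    cache_read: float
    output: float


@dataclass(frozen=True)
class TaskUsage:
    """Tokens one model consumed on a single task (hypothetical counts)."""
    input_tokens: int
    cache_read_tokens: int
    output_tokens: int


def task_cost(usage: TaskUsage, pricing: Pricing) -> float:
    """Apply per-million-token prices to one task's token usage."""
    total = (
        usage.input_tokens * pricing.input
        + usage.cache_read_tokens * pricing.cache_read
        + usage.output_tokens * pricing.output
    )
    return total / 1_000_000


def avg_cost_per_task(usages: Sequence[TaskUsage], pricing: Pricing) -> float:
    """Average the per-task costs across all tasks."""
    if not usages:
        raise ValueError("need at least one task to average over")
    return sum(task_cost(u, pricing) for u in usages) / len(usages)


# Made-up pricing and usage for two tasks, for illustration only.
example_pricing = Pricing(input=2.0, cache_read=0.5, output=10.0)
example_usages = [
    TaskUsage(input_tokens=100_000, cache_read_tokens=400_000, output_tokens=20_000),
    TaskUsage(input_tokens=200_000, cache_read_tokens=800_000, output_tokens=40_000),
]
example_avg = avg_cost_per_task(example_usages, example_pricing)
print(f"${example_avg:.2f}")  # prints $0.90
```

With these made-up numbers the two tasks cost $0.60 and $1.20, so the average is $0.90. Note that each task's cost is computed first and the costs are then averaged, matching the order of operations described above.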