CursorBench 3.1

We evaluate agents on ambiguous, multi-file tasks from real Cursor sessions. Higher scores are better.

[Figure: a scatter and line chart comparing Opus 4.7, GPT-5.5, Composer 2.5, and Composer 2 scores against average cost per task. Y-axis: CursorBench 3.1 score (30% to 70%). X-axis: average cost per task ($0 to $12). Points on the Opus 4.7 and GPT-5.5 lines are labeled by effort level (low to max); Opus 4.7 xhigh and GPT-5.5 medium are marked as defaults. Gemini 3.5 Flash, Kimi 2.6, and Kimi 2.5 are also plotted.]
Rank  Model                Score   $/task
1     Opus 4.7 Max         64.8%   $11.02
2     GPT-5.5 Extra High   64.3%   $4.37
3     Composer 2.5         63.2%   $0.55
4     GPT-5.5 High         62.6%   $3.59
5     Opus 4.7 Extra High  61.6%   $7.11
6     Opus 4.7 High        59.4%   $5.01
7     GPT-5.5 Medium       59.2%   $2.22
8     Opus 4.7 Medium      52.7%   $2.93
9     Composer 2           52.2%   $0.56
10    Gemini 3.5 Flash     49.8%   $1.94
11    GPT-5.5 Low          48.8%   $1.19
12    Opus 4.7 Low         48.3%   $1.87
13    Kimi 2.6             47.6%   $1.27
14    Kimi 2.5             31.9%   $0.87

Changelog

CursorBench 3.1

  • Introduced problems focused on codebase understanding, bugfinding, planning, and code review.
  • Improved grading criteria for some edit tasks.

CursorBench 3.0

  • Initial set of tasks focused on edit, refactor, and bugfix problems.

Average cost per task ($/task) is computed by applying each model's published per-million-token pricing (input, cache read, and output) to the tokens it used on each CursorBench 3.1 task, then averaging across tasks.
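The cost calculation described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual tooling: the `Pricing` and `TaskUsage` types, the function names, and every price and token count are hypothetical. It also assumes that cache-read tokens are counted separately from regular input tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Per-million-token prices in USD (hypothetical structure)."""
    input: float
    cache_read: float
    output: float


@dataclass(frozen=True)
class TaskUsage:
    """Tokens one model used on a single task (hypothetical structure).

    Assumption: cache-read tokens are not also included in input_tokens.
    """
    input_tokens: int
    cache_read_tokens: int
    output_tokens: int


def task_cost(usage: TaskUsage, pricing: Pricing) -> float:
    """Cost in USD of one task: tokens times price per million tokens."""
    total = (
        usage.input_tokens * pricing.input
        + usage.cache_read_tokens * pricing.cache_read
        + usage.output_tokens * pricing.output
    )
    return total / 1_000_000


def average_cost_per_task(usages: list[TaskUsage], pricing: Pricing) -> float:
    """Mean of the per-task costs across all tasks."""
    if not usages:
        raise ValueError("at least one task is required")
    return sum(task_cost(u, pricing) for u in usages) / len(usages)


# Made-up prices and token counts, for illustration only.
example_pricing = Pricing(input=2.00, cache_read=0.50, output=10.00)
example_usages = [
    TaskUsage(input_tokens=100_000, cache_read_tokens=400_000, output_tokens=20_000),
    TaskUsage(input_tokens=200_000, cache_read_tokens=800_000, output_tokens=40_000),
]

if __name__ == "__main__":
    avg = average_cost_per_task(example_usages, example_pricing)
    print(f"${avg:.2f} per task")
```

Because the average is taken over per-task costs, a few long-running tasks with heavy token use can raise a model's $/task figure noticeably.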