Cursor · Product Manager, Agent Harness

Our mission is to automate coding. The first step in our journey is to build the best tool for professional programmers, using a combination of inventive research, design, and engineering. Our organization is very flat, and our team is small and talent dense. We particularly like people who are truth-seeking, passionate, and creative. We enjoy spirited debate, crazy ideas, and shipping code.

About the Role

The Agent Harness is what makes Cursor's agents actually work. It determines how agents decompose tasks into subtasks, how they interact with the file system and terminal, how they handle failures and retries, and how developers observe and steer what's happening. When an agent gets stuck, loops, or hallucinates, the harness is why—and the harness is how you fix it.

As a Product Manager for the Agent Harness, you will own this framework. Agent quality is improving rapidly—we shipped Composer 2, our own frontier coding model, and are training agents through real-time RL on user data. Your job is to turn those research advances into product that developers can feel.

This is not a role where you write specs and hand them off. You'll be reading agent traces, analyzing failure modes, designing evaluation frameworks, and making judgment calls about what an agent should and shouldn't attempt. You'll work at the boundary between research and product, where the roadmap is shaped by empirical results as much as customer feedback.

Example projects include...

Owning the agent planning and execution framework: how agents decompose tasks, decide what tools to use, and recover when a step fails. Balancing autonomy with predictability.
Designing how developers observe and steer agents: real-time progress, guardrails, the ability to redirect mid-task. The experience should build trust without requiring micromanagement.
Building evaluation and benchmarking systems: defining what "good" means for agent quality—task completion rate, error recovery, hallucination frequency—and building the harnesses to measure it. These measurements drive engineering and research priorities.
Analyzing agent traces at scale: identifying where agents get stuck, loop, hallucinate, or take unproductive paths, and turning those patterns into concrete improvements.
Defining the primitives for agent extensibility: how agents use tools, access codebase context, call external services via MCPs and plugins on the Cursor Marketplace, and how developers customize agent behavior through rules and constraints.
Improving the default Cursor agent experience (the “Auto” model setting): making smart model choices based on user needs, model capabilities, and cost appetite.
Shaping multi-agent coordination: how subagents share context and avoid conflicts when executing in parallel across files and systems. This matters more as developers spin up fleets of agents simultaneously.

You may be a fit if

You have built or evaluated AI agents, LLM applications, or ML-powered developer tools.
You're deeply technical. You're comfortable reading code, analyzing traces, and reasoning about system behavior at a low level.
You have strong intuition for evaluation and measurement. You know how to define metrics that capture quality, not just activity.
You can move between the big picture and the details—from "what should agents be capable of in six months?" to "why did this agent fail on this specific task?"
You're comfortable in a research-adjacent environment where the roadmap is shaped by empirical results, not just customer requests.
You have experience with reinforcement learning, agent frameworks, or AI evaluation—either as a practitioner or working closely with researchers.
You thrive in ambiguous, fast-moving environments and enjoy making hard tradeoffs with incomplete information.

