Research & Analysis

AI Negotiating Agents

The Surprising Power of Chain-of-Thought Prompting

If you have used ChatGPT, Gemini, or Claude to write an email or summarize a document, you probably used a simple “zero-shot” prompt – that is, giving the system a straightforward instruction and receiving an immediate response. This is like asking a bright intern a question and getting the first answer that comes to mind. It’s fast and often useful for simple tasks, but for complex problems, a better approach is “Chain-of-Thought” (CoT) prompting. First introduced in a 2022 Google Brain paper, this technique instructs the AI to break down its reasoning step-by-step before delivering its final answer. Wei, J., et al. (2022). That paper found the approach "significantly improves the ability of large language models to perform complex reasoning".
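The difference between the two styles can be made concrete. The sketch below shows the same question posed zero-shot and with an explicit chain-of-thought instruction; the wording is my own illustration, not taken from any of the cited papers.

```python
# Two ways to prompt the same question (illustrative wording only).
QUESTION = ("A contract pays $2,000 up front plus $150 per month for 18 months. "
            "What is its total value?")

# Zero-shot: ask for the answer directly.
zero_shot_prompt = QUESTION

# Chain-of-Thought: ask the model to reason step by step before answering.
cot_prompt = (
    QUESTION
    + "\nThink step by step: first list the components of the payment, "
    "then compute each component, then add them, and only then state the final total."
)

print(zero_shot_prompt)
print(cot_prompt)
```

The only difference is the appended reasoning instruction; with a standard LLM, that one extra sentence is often what separates a guess from a worked answer.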

In the intervening years, many frontier models have begun automatically employing a version of this step-by-step thinking under the hood. These are often called "Large Reasoning Models" or LRMs, as distinguished from standard "Large Language Models" or LLMs. A 2025 Wharton paper analyzed whether CoT prompting remains valuable when applied to an LRM that is already performing intermediate reasoning steps on its own. Meincke, L., et al. (2025). The short answer was “no.” The authors found little benefit from CoT prompting with LRMs, probably because those models already incorporate something akin to CoT in the background.

The Right Tool for the Right Job

It is important to choose the right tool for a given job. CoT (and LRMs) work best on tasks of a certain medium complexity. Apple recently published a paper by Parshin Shojaee and others about the “illusion of thinking”, examining LRMs on problems of increasing complexity, including brain-teaser-style logic puzzles. Shojaee, P., et al. (2025). It noted that LRMs employ a form of CoT in an effort to outperform LLMs on more complex tasks. Their testing showed that there are “low-complexity tasks where standard models surprisingly outperform LRMs.” This suggests that for low-complexity tasks it is best to use an LLM without employing a CoT approach. As the problems grew more complex, the LRMs did better than LLMs. But beyond a certain level of complexity, both LLMs and LRMs “collapse” and fail to function. Thus, there is a Goldilocks middle zone where CoT and LRMs work at their best.

The Apple paper noted one counter-intuitive result that is important to keep in mind as we attempt to better understand these models. One of the puzzles researchers presented was the “Tower of Hanoi,” which involves three pegs and a stack of rings of various sizes; rings may be moved one at a time from peg to peg, under the constraint that a larger ring can never be placed on top of a smaller one. The researchers found, counterintuitively, that even when they provided the models with the solution algorithm for the puzzle, "their performance on this puzzle did not improve," leading researchers to conclude "the models struggle with precise calculation." At least for now, LLMs and LRMs are not the right tool for heavy numerical calculations, regardless of the method of prompting.
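The solution algorithm in question is the classic recursive one, which always solves an n-ring puzzle in 2^n − 1 moves. A minimal Python version (my sketch of the well-known algorithm, not the paper's code) shows how mechanical the procedure is once written down:

```python
def hanoi(n, source, target, spare, moves=None):
    """Recursively move n rings from source to target, never placing
    a larger ring on a smaller one. Returns the list of moves made."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)   # clear the n-1 smaller rings out of the way
    moves.append((source, target))               # move the largest remaining ring
    hanoi(n - 1, spare, target, source, moves)   # restack the smaller rings on top
    return moves

moves = hanoi(3, "A", "C", "B")
print(len(moves))  # 2**3 - 1 = 7 moves
```

That a model can be handed this recipe and still fail to execute it reliably is the paper's point: following a long exact procedure is a different skill from producing plausible text about it.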

The MIT AI Tournament

Despite these limitations on precise calculation, the 2025 International AI Negotiation Competition provided a massive laboratory to test AI agents prompted to negotiate with other AI agents, with the goal of obtaining a numerical advantage from the negotiations. Participants included 286 AI agents from over 40 countries. Vaccaro, M., et al. (2026). The standout performer of the tournament was an agent named NegoMate. NegoMate did not just react to offers; it utilized a sophisticated CoT structure that was hidden from its counterparts inside XML tags. Before it output any message, in the background the agent conducted a full SWOT analysis (evaluating Strengths, Weaknesses, Opportunities and Threats), quantified the importance of various deal features on a scale of 1-10, and evaluated multiple potential strategies through a decision matrix.
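The paper does not publish NegoMate's actual prompt, but the general pattern of private analysis inside XML tags followed by a public message can be sketched as follows. The tag names and instruction wording here are hypothetical illustrations, not NegoMate's real ones:

```python
# Illustrative system-prompt skeleton for hidden chain-of-thought analysis.
# Tag names ("analysis", "message") are hypothetical, not from the tournament paper.
HIDDEN_COT_TEMPLATE = """\
Before replying, reason privately inside <analysis> tags. Your counterpart
never sees that content. Inside it:
1. Run a SWOT analysis of your current negotiating position.
2. Rate the importance of each open deal term on a 1-10 scale.
3. Compare at least three candidate strategies in a decision matrix.
Then write only your outward reply inside <message> tags.
"""

def extract_message(model_output: str) -> str:
    """Strip the private analysis; return only the public reply."""
    start = model_output.index("<message>") + len("<message>")
    end = model_output.index("</message>")
    return model_output[start:end].strip()

sample = "<analysis>SWOT: ...</analysis><message>We can offer 5% more.</message>"
print(extract_message(sample))  # We can offer 5% more.
```

The key design choice is the separation: the agent gets the full benefit of exhaustive step-by-step analysis while the counterpart sees only a polished, strategically filtered message.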

One interesting side-note: some of the AI agents used dirty tricks like "prompt injection" in an effort to force the opponent AI to reveal its secret strategies. It is somewhat encouraging that none of the AI agents using such approaches prevailed. This is reminiscent of the computer tournaments of the iterated prisoner’s dilemma held by Robert Axelrod, which found that variations on the "tit-for-tat" strategy (which opens cooperatively but then mirrors the opponent’s previous move) prevailed over strategies that were preemptively aggressive. Player, N. (2023).
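Tit-for-tat itself is only a few lines of logic: cooperate on the first move, then copy whatever the opponent did last. A minimal head-to-head using the standard prisoner's dilemma payoffs (this is my sketch of the textbook game, not Axelrod's tournament code):

```python
# Standard prisoner's dilemma payoffs: (my_points, their_points).
# C = cooperate, D = defect.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(opponent_history):
    """Cooperate on move one, then mirror the opponent's previous move."""
    return "C" if not opponent_history else opponent_history[-1]

def always_defect(opponent_history):
    return "D"

def play(a, b, rounds=10):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        ma, mb = a(hist_b), b(hist_a)  # each strategy sees the opponent's history
        pa, pb = PAYOFF[(ma, mb)]
        score_a, score_b = score_a + pa, score_b + pb
        hist_a.append(ma)
        hist_b.append(mb)
    return score_a, score_b

print(play(tit_for_tat, tit_for_tat))    # mutual cooperation: (30, 30)
print(play(tit_for_tat, always_defect))  # exploited once, then matches: (9, 14)
```

Tit-for-tat loses slightly to a pure defector in a single pairing, but across a whole tournament its willingness to cooperate with cooperators accumulates far more points than preemptive aggression does, which is Axelrod's result.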

Lessons from the Tournament

The tournament highlighted a tension between warmth and dominance. Aggressive dominant AI agents did better on the deals that were reached, but had many more standoffs than other agents. Agents using a warmer approach did not make as much on each deal, but closed a higher percentage of potential deals. The winning AI agent, NegoMate, effectively balanced these competing considerations. It achieved the highest individual points in integrative bargaining while maintaining such a positive "personality" that it fostered the second-highest level of counterpart satisfaction in the entire competition.

CoT was an important part of NegoMate’s success. The paper concluded that CoT allowed the agent to "execute exhaustive, multi-dimensional analyses with remarkable consistency across hundreds of negotiations without the cognitive limitations, bounded rationality, or time constraints that limit human preparation."

How to Make a Winning CoT Prompt

The easiest way to gain the advantages of CoT is to use a Large Reasoning Model, because LRMs automatically run a version of CoT in the background. Unfortunately, LRMs are typically more expensive to run than standard LLMs. If you use an LLM, here is a step-by-step guide to making a winning CoT prompt:

  1. Establish the "Clean Slate": Begin by instructing the model to ignore its prior training biases. The MIT researchers used a specific directive: "Pretend that you have never learned anything about negotiation... determine ALL of your behaviors, strategies, and personas based on the following advice". This is a good idea, however, only if you provide good instructions on how the model should proceed instead of its prior training.
  2. Intermediate Steps: Direct the model to break the problem down into intermediate steps and to think about each one separately. If there is a series of steps you know to be optimum, describe those steps in the prompt. Tell the LLM to outline its process of tackling each narrow step before proceeding to resolve it.
  3. Examples of Output: If individual steps, or the final result, need to be in a particular form, provide an example.
  4. Create a Decision Matrix: If you are asking the model to help you make a decision, develop a decision matrix to help with the analysis. Ask the model to brainstorm at least three different approaches to the problem and evaluate them on the matrix to select the optimal one.
  5. Finalize the Strategy and Double-Check Before Speaking: Prompt the model to finalize its thinking and recommendation before it presents it to you, to double-check its work, and then to lay out the reasons for its output in a logical way with references to sources (including the decision matrix, if appropriate).
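Step 4's decision matrix is just a weighted score across criteria: rate how much each criterion matters, rate each candidate approach against each criterion, and multiply. A minimal version, with made-up options and weights (nothing here comes from the tournament data):

```python
# Weighted decision matrix (illustrative criteria, weights, and scores).
criteria_weights = {"price": 9, "closing speed": 5, "relationship": 7}  # importance, 1-10

# Each candidate approach scored 1-10 against each criterion.
options = {
    "aggressive opener":     {"price": 9, "closing speed": 3, "relationship": 2},
    "warm collaborative":    {"price": 5, "closing speed": 8, "relationship": 9},
    "split the difference":  {"price": 6, "closing speed": 7, "relationship": 6},
}

def weighted_score(scores):
    """Sum of (criterion weight x option score) across all criteria."""
    return sum(criteria_weights[c] * s for c, s in scores.items())

best = max(options, key=lambda name: weighted_score(options[name]))
for name, scores in options.items():
    print(f"{name}: {weighted_score(scores)}")
print("chosen strategy:", best)  # warm collaborative (148 vs 110 and 131)
```

In a CoT prompt you would ask the model to build and fill in this same table in its intermediate steps, so the final recommendation comes with an explicit, checkable scoring rather than an unexplained preference.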

Is Chain of Thinking a Misnomer?

The Apple piece by the Shojaee team described LRMs as creating the "illusion" of thinking. They stated "[o]ur findings reveal fundamental limitations in current models: despite sophisticated self-reflection mechanisms, these models fail to develop generalizable reasoning capabilities beyond certain complexity thresholds." Indeed, they are somewhat skeptical about such models reaching anything fairly described as thinking, stating that their findings "challenge prevailing assumptions about LRM capabilities and suggest that current approaches may be encountering fundamental barriers to generalizable reasoning."

Whatever is really going on inside those models, and regardless of the potential "thought" misnomer, the AI negotiation tournament research illustrated that CoT is a useful strategy to get the most out of an LLM for certain kinds of middle-complexity tasks.

Sources

  • Vaccaro, M., et al. (2026). Advancing AI Negotiations: A Large-Scale Autonomous Negotiation Competition. MIT Sloan / Johns Hopkins.
  • Meincke, L., et al. (2025). Prompting Science Report 2: The Decreasing Value of Chain of Thought in Prompting. Wharton Business School.
  • Shojaee, P., et al. (2025). The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. Apple.
  • Wei, J., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models.
  • Player, N. (2023). The Morality and Practicality of Tit for Tat. Virginia Tech Philosophy, Politics and Economics Review (discussing Robert Axelrod’s tournaments based on the Prisoner’s Dilemma). https://pressbooks.lib.vt.edu/pper/chapter/the-morality-and-practicality-of-tit-for-tat/