When a frontier AI agent sits down to negotiate, it can describe a deal in fluent prose. What it cannot reliably do is price the deal, push for better terms, or plan the move that pays off next quarter. That gap between linguistic competence and economic competence is the central finding of SidConArena, a new benchmark posted to arXiv that puts large language models through a deliberately stripped-down simulated economy and watches how they behave when the only constraint is the rules of the game.
The benchmark, submitted to arXiv on 24 June 2026, frames its environment as a finite-horizon partially observable stochastic game, a category familiar from multi-agent decision theory. In plain terms, several AI agents share a closed economy and make moves over a fixed number of rounds, each able to see only part of the world at any time. The economy has three coupled phases that run in sequence: agents negotiate in natural language and commit to binding trades; they run those traded resources through deterministic converters that turn inputs into outputs; and they compete in sealed-bid auctions for long-term assets whose payoff extends across the rest of the game. Because trades bind and the production rules are fixed, the economy is positive-sum: a sufficiently clever deal benefits both sides, and the losers lose mostly because they failed to find or price those deals.
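To make the positive-sum claim concrete, here is a minimal sketch of a deterministic converter and a binding swap. Everything in it is an assumption made for illustration: the resource names, the point values in `VALUE`, and the `Converter` and `trade` helpers are hypothetical and are not taken from the paper.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Converter:
    """Hypothetical deterministic recipe: consumes `inputs`, yields `outputs`."""
    inputs: dict
    outputs: dict

    def can_run(self, inventory: Counter) -> bool:
        return all(inventory[r] >= n for r, n in self.inputs.items())

    def run(self, inventory: Counter) -> Counter:
        if not self.can_run(inventory):
            raise ValueError("insufficient inputs")
        result = Counter(inventory)
        result.subtract(self.inputs)
        result.update(self.outputs)
        return +result  # unary plus drops zero counts


def trade(a: Counter, b: Counter, a_gives: dict, b_gives: dict):
    """Binding swap: both transfers happen, or the trade is rejected."""
    if any(a[r] < n for r, n in a_gives.items()):
        raise ValueError("first party cannot cover the trade")
    if any(b[r] < n for r, n in b_gives.items()):
        raise ValueError("second party cannot cover the trade")
    new_a, new_b = Counter(a), Counter(b)
    new_a.subtract(a_gives)
    new_a.update(b_gives)
    new_b.subtract(b_gives)
    new_b.update(a_gives)
    return +new_a, +new_b


# Made-up point values, chosen so that conversion creates value.
VALUE = {"ore": 1, "fuel": 1, "alloy": 5}


def score(inventory: Counter) -> int:
    return sum(VALUE[r] * n for r, n in inventory.items())


smelter = Converter(inputs={"ore": 1, "fuel": 1}, outputs={"alloy": 1})
alice, bob = Counter(ore=2), Counter(fuel=2)

before = score(alice) + score(bob)
alice, bob = trade(alice, bob, {"ore": 1}, {"fuel": 1})
alice, bob = smelter.run(alice), smelter.run(bob)
after = score(alice) + score(bob)
print(before, after)  # 4 10
```

Neither agent can run the converter alone, so the swap is what unlocks the extra value. Under these assumed numbers both sides end up better off, which is the sense in which a missed or mispriced deal is a loss rather than a transfer.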
The architecture is built to test free-form bargaining without letting language drift away from the rules. Agents receive structured observations of game state, are dispatched with awareness of which phase is active, and interact with the world through a neural-symbolic action interface that turns their text into legal moves. Asynchronous execution lets agents keep chatting while a binding trade is being processed, so the system does not artificially throttle the pace of conversation. The paper evaluates both homogeneous self-play tournaments, where copies of the same model play each other, and heterogeneous Elo-style tournaments, where frontier models from different labs compete in the same economy.
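One way to picture an interface that keeps free-form text inside the rules is a phase-gated parser: ordinary chat passes through as conversation, while only text that matches the active phase's command grammar becomes a move. The sketch below is an illustration of that idea only. The `Phase` enum, the `OFFER`, `RUN`, and `BID` commands, and `parse_action` are hypothetical names, not the paper's interface.

```python
import re
from enum import Enum


class Phase(Enum):
    NEGOTIATE = "negotiate"
    PRODUCE = "produce"
    AUCTION = "auction"


# Hypothetical command grammar: one pattern per phase.
PATTERNS = {
    Phase.NEGOTIATE: re.compile(r"OFFER (\d+) (\w+) FOR (\d+) (\w+)"),
    Phase.PRODUCE: re.compile(r"RUN (\w+)"),
    Phase.AUCTION: re.compile(r"BID (\d+)"),
}


def parse_action(phase: Phase, text: str):
    """Return a structured action, or None if `text` is not a legal
    command in the active phase (for example, plain chat)."""
    match = PATTERNS[phase].fullmatch(text.strip())
    if match is None:
        return None
    if phase is Phase.NEGOTIATE:
        give_n, give_r, get_n, get_r = match.groups()
        return {
            "type": "offer",
            "give": (give_r, int(give_n)),
            "get": (get_r, int(get_n)),
        }
    if phase is Phase.PRODUCE:
        return {"type": "run", "converter": match.group(1)}
    return {"type": "bid", "amount": int(match.group(1))}


print(parse_action(Phase.AUCTION, "BID 7"))
# {'type': 'bid', 'amount': 7}
print(parse_action(Phase.NEGOTIATE, "BID 7"))
# None
```

The second call returns `None` because a bid is not a legal move during negotiation. A real interface would also need to check the move against the agent's inventory and budget, which this sketch leaves out.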
The headline pattern in the results is the one economists would predict: stronger frontier models achieve higher economic outcomes. Scale, in the broad sense, helps. But the paper's qualitative failure modes cut across that improvement and are, if anything, more interesting than the ranking itself. Agents routinely misvalue resources, accepting trades that cost more than they return or refusing trades that would have netted a gain. They bargain passively, conceding early rather than probing for the deal structure that would have served them. And they are weak at long-horizon investment planning, treating the auction phase as a standalone event rather than as a lever for the rest of the game.
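The two misvaluation errors described here can be stated as a simple check on a trade's net value. The sketch below assumes the agent has some per-unit valuation for each resource. The `values` table, the resource names, and the `net_gain` and `classify` helpers are hypothetical; the paper's own scoring may differ.

```python
def net_gain(gives: dict, gets: dict, values: dict) -> float:
    """Value received minus value given up, under the supplied valuations."""
    received = sum(values[r] * n for r, n in gets.items())
    paid = sum(values[r] * n for r, n in gives.items())
    return received - paid


def classify(gives: dict, gets: dict, values: dict, accepted: bool) -> str:
    """Label a trade decision as one of the two misvaluation errors, or neither."""
    gain = net_gain(gives, gets, values)
    if accepted and gain < 0:
        return "accepted a losing trade"
    if not accepted and gain > 0:
        return "refused a gainful trade"
    return "consistent"


# Illustrative valuations only.
values = {"ore": 1, "fuel": 1, "alloy": 5}

print(classify({"alloy": 1}, {"ore": 3}, values, accepted=True))
# accepted a losing trade
print(classify({"ore": 2}, {"alloy": 1}, values, accepted=False))
# refused a gainful trade
```

The hard part in practice is the `values` table itself: in a converter economy, what a resource is worth depends on what it can be turned into and when, which is where the pricing errors come from.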
These are not benchmark artifacts in the usual sense. The economy is simple enough, and the production rules deterministic enough, that a competent economic actor should be able to learn its pricing within a few episodes. The fact that frontier models still misvalue resources inside that setting suggests the failure is not a matter of missing information but of how the models weigh options against each other when the answer has to be produced as a sentence in the middle of a conversation. Passive bargaining points at the same place from a different angle: the agents are fluent enough to hold a negotiation but not aggressive enough to extract surplus, which is closer to a behavioral stance encoded in training than to a perceptual limit. Weak long-horizon planning, finally, fits a pattern that has shown up in other agent evaluations: when the payoff requires composing several moves and discounting future value correctly, current LLMs tend to collapse to a near-myopic policy.
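The difference between a myopic policy and a long-horizon one is easy to see in a toy valuation of an auctioned asset. The sketch assumes an asset that pays a fixed amount every remaining round, starting immediately, with an optional per-round discount factor. The payoff structure, the numbers, and the function names are assumptions for illustration, not the benchmark's auction mechanics.

```python
def asset_value(payoff_per_round: float, rounds_left: int, discount: float = 1.0) -> float:
    """Total discounted payoff over the remaining rounds."""
    return sum(payoff_per_round * discount**t for t in range(rounds_left))


def myopic_value(payoff_per_round: float) -> float:
    """What a one-step-lookahead agent thinks the asset is worth."""
    return payoff_per_round


payoff, rounds_left = 3, 5
print(myopic_value(payoff))              # 3
print(asset_value(payoff, rounds_left))  # 15.0
```

Under these assumed numbers, an agent that values the asset at a single round's payoff would drop out of the bidding at 3, while the asset returns 15 over the rest of the game. Any bid between those two figures is one a myopic agent would wrongly refuse to make.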
Several caveats apply. SidConArena is an arXiv preprint, not a peer-reviewed paper, and the numerical figures in the paper are embedded as images and LaTeX in the PDF rather than rendered in the HTML, so the specific Elo ratings and percentage tables should be read as provisional until they are checked against the camera-ready version. The qualitative failure modes come from the paper's own analysis of one specific positive-sum economy, not from a survey across many settings. So the honest reading is narrower than "AI agents cannot trade": inside this benchmark's three-phase structure, with these production rules and these auction mechanics, scaling improves outcomes but the underlying pathologies persist. Whether those pathologies survive contact with a different economy, a richer set of goods, or repeated play with memory is the next question the benchmark will have to answer, and the one to watch as the authors release their code and the community runs its own tournaments.