Fourteen AI agents were each given the same task: run a small software business for 500 simulated days. Three finished in profit, five had already gone bankrupt before the clock ran out, and the agent that finished fourth on the leaderboard had no large language model at all.
That result comes from CEO-Bench, a benchmark project out of Princeton whose arXiv preprint frames the question with unusual directness: "Can Agents Play the Long Game?" Each agent starts the simulation as the chief executive of a virtual subscription-software company with $1 million in the bank and zero customers. From there it runs the company through a Python API of 34 tools backed by 19 internal databases: pricing levers, ad-channel mixes, R&D allocation, infrastructure scaling, customer-service staffing. The HTML version of the full paper documents the simulator, the leaderboard (Tables 1–3), and the rules of the game.
What CEO-Bench actually exercises is more specific than the "AI as CEO" framing suggests. Three of the metrics a real founder would treat as obvious are not exposed to the agent as fields: customer satisfaction, willingness to pay, and minimum quality expectation. The agent has to infer them from downstream signals such as churn, support ticket volume, and a simulated social network in which customers talk to each other. Pricing actions surface as revenue weeks later. Competitors make adversarial moves. Market preferences drift. Macroeconomic cycles shift on a schedule the agent does not know in advance. This is the regime that research on long-horizon decision-making has consistently flagged as the hardest for reinforcement learners, and now for language-model agents.
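To make "infer them from downstream signals" concrete, here is a minimal sketch of one way an agent could build a stand-in for the hidden satisfaction metric. Everything in it is an assumption for illustration: the function name, the equal weighting of the two signals, and the four-period window are hypothetical and are not taken from the CEO-Bench paper or its API.

```python
from statistics import mean


def satisfaction_proxy(churn_rates, tickets_per_customer, window=4):
    """Hypothetical proxy for a hidden satisfaction metric.

    Compares the most recent `window` periods against the `window`
    periods before them. A positive result means churn and support
    tickets are falling, which is read here as satisfaction rising.
    """
    if len(churn_rates) != len(tickets_per_customer):
        raise ValueError("both series must cover the same periods")
    if len(churn_rates) < 2 * window:
        raise ValueError("need at least two windows of history")

    def relative_change(series):
        recent = mean(series[-window:])
        baseline = mean(series[-2 * window:-window])
        # Avoid dividing by zero when the baseline window is empty of events.
        return 0.0 if baseline == 0 else (recent - baseline) / baseline

    # Equal weights are an arbitrary illustrative choice.
    return -0.5 * (relative_change(churn_rates)
                   + relative_change(tickets_per_customer))


# Churn and ticket volume both halve between the two windows.
churn = [0.04, 0.04, 0.04, 0.04, 0.02, 0.02, 0.02, 0.02]
tickets = [0.2, 0.2, 0.2, 0.2, 0.1, 0.1, 0.1, 0.1]
print(satisfaction_proxy(churn, tickets))
```

A proxy like this only tracks direction, not level, which is roughly the position the benchmark puts the agent in: it can see that something changed, but has to act to find out why.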
The behavioral pattern that fell out of 14 runs is unusually clean. Agents that explored early (raising prices, abandoning failing channels, reallocating ad spend toward retention) finished in profit. Agents that played conservatively survived but never compounded; some ended with small gains, several with small losses. The rule-based agent that placed fourth sat in between: a fixed script that took a few well-chosen actions without trying to model the world in prose, then iterated. According to a writeup of the paper at QbitAI, it earned roughly $15.76 million against a $1 million starting balance, ahead of every frontier LLM on the leaderboard except the top three.
That is not a verdict on whether language models can be CEOs. It is a sharper claim. In a task where feedback arrives weeks after the action that caused it and the environment drifts in ways the agent has to discover, the binding constraint is not how much world knowledge the model can summon on demand. It is whether the agent's policy generates the variance needed to learn the right signals at all. The rule-based agent placed fourth because its design forces a useful exploration schedule. Several frontier models finished between fifth and fourteenth because, under the same generic agent scaffolding, they converged on locally sensible conservatism and never escaped it.
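The delayed-feedback problem can be illustrated with a small sketch. It assumes the agent keeps two per-period series of its own, one of action changes (say, price moves) and one of outcome changes (say, revenue moves), and looks for the delay at which they line up best. The function name and the scoring rule, an unnormalized cross-correlation, are illustrative choices, not anything described in the paper.

```python
def best_lag(action_deltas, outcome_deltas, max_lag):
    """Return the delay, in periods, at which action changes line up
    most strongly with later outcome changes.

    The score for each lag is the sum of products of each action change
    with the outcome change `lag` periods later (an unnormalized
    cross-correlation). Ties resolve to the smallest lag.
    """
    def score(lag):
        # zip stops at the shorter sequence, so trailing actions with
        # no observed outcome yet are ignored.
        return sum(a * o for a, o in zip(action_deltas, outcome_deltas[lag:]))

    return max(range(max_lag + 1), key=lambda lag: abs(score(lag)))


# One price change in period 1; revenue moves in period 4.
price_changes = [0, 1, 0, 0, 0, 0, 0, 0]
revenue_changes = [0, 0, 0, 0, 1, 0, 0, 0]
print(best_lag(price_changes, revenue_changes, max_lag=5))  # prints 3
```

The sketch also shows why conservatism is costly here: if the action series never changes, every lag scores zero and the agent learns nothing about how long its decisions take to show up.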
The implication for builders is concrete. The next competitive layer in AI agents is not a smarter base model, and it is not a generic agent harness that bolts ReAct-style scaffolding onto a foundation model. It is role-specific constraint design. A CEO agent needs an explicit explore-exploit schedule, a policy on which signals to trust, and instrumentation that turns delayed customer behavior into fast feedback. The same argument applies to a recruiting agent, a finance agent, or a clinician copilot: long-horizon roles need vertical harness work, not raw generality. Princeton's benchmark surfaces that layer empirically, by showing that a small set of hand-coded rules can beat most general-purpose models at the same task.
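What an "explicit explore-exploit schedule" might look like in a harness can be sketched in a few lines. This is a generic epsilon-greedy policy with a decaying exploration rate, offered as an assumption about the kind of constraint meant here; it is not the rule-based agent from the benchmark, and the action names, decay shape, and rates are all hypothetical.

```python
import random


def exploration_rate(day, horizon=500, start=0.5, floor=0.05):
    """Decay linearly from `start` to `floor` over the first half of the
    horizon, then hold at `floor` for the remaining days."""
    progress = min(day / (horizon / 2), 1.0)
    return start + (floor - start) * progress


def choose_action(day, estimated_value, rng):
    """Epsilon-greedy choice over a dict of action name to estimated value.

    With probability exploration_rate(day), pick a uniformly random
    action; otherwise pick the action with the highest estimate.
    """
    if rng.random() < exploration_rate(day):
        return rng.choice(sorted(estimated_value))
    return max(estimated_value, key=estimated_value.get)


# Hypothetical action names and value estimates.
estimates = {"raise_price": 3.0, "hold": 1.0, "shift_ad_spend": 2.0}
rng = random.Random(0)
late_picks = [choose_action(500, estimates, rng) for _ in range(1000)]
print(late_picks.count("raise_price"))
```

The point of writing the schedule down is that exploration becomes a property of the harness rather than something the model has to decide to do on each step, which is where the conservative agents appear to have stalled.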
Two things bound this finding. CEO-Bench is still a preprint on arXiv, and the paper has not been peer reviewed; the project site treats the leaderboard as preliminary. And at least one of the model identifiers in the QbitAI writeup, which calls the top scorer "Claude Fable 5," does not obviously match any known frontier lab's lineup, so anyone citing the finishing order should verify against the paper's own tables before treating the numbers as canonical. What survives those caveats is the structural finding: in a long-horizon, delayed-return task, constraint design beat raw model scale, and the gap was large enough to put a rule book ahead of every frontier LLM except three.
What to watch next is the benchmark's extension path. The GitHub repositories for the simulator and a runnable harness are public, along with an independent community port. The question is whether other research groups can reproduce the rule-based fourth-place finish on independent seeds, and whether a future revision of the leaderboard includes vertical-agent harnesses rather than only foundation models. If it does, the next CEO-Bench update will tell readers something sharper than "AI is or isn't ready to be CEO." It will name which designs are ready, and which the field is still leaving on the table.