An AI Peer-Reviewed 10,000 Research Papers. Now the Experiment Is Published.


Last year, when two of the most influential conferences in computer science (ICML, focused on machine learning research, and STOC, focused on the theory of computing) accepted papers for their 2026 editions, they ran something alongside their usual review process: a Google-built AI system called the Paper Assistant Tool, or PAT, that scanned roughly 10,000 submissions and returned a written critique to authors within about half an hour. The formal paper on that experiment is now a public preprint, and it gives the research community its first citable, measurable record of what AI-assisted peer review looks like at conference scale.

PAT is built around what researchers call inference scaling, an approach that runs an AI model more than once on each submission rather than asking for a single answer. According to the preprint, the system reads the theoretical results, checks the experimental claims, suggests improvements, and flags flaws, then bundles those checks into a structured report. It is not built to make accept-or-reject decisions; it is built to deliver a faster, deeper first pass that human reviewers can build on or ignore.
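As a rough illustration of the multi-pass idea, the sketch below runs a reviewer several times over the same submission and keeps only the issues that show up in more than one pass. Everything in it is an assumption made for illustration: the preprint as described here does not say how PAT combines its passes, the voting rule is just one plausible choice, and `stub_reviewer` is a hypothetical stand-in for a real model call.

```python
from collections import Counter
from typing import Callable, List


def multi_pass_review(
    text: str,
    review_once: Callable[[str, int], List[str]],
    passes: int = 3,
    min_votes: int = 2,
) -> List[str]:
    """Run a reviewer several times and keep issues flagged in at least min_votes passes."""
    votes: Counter = Counter()
    for pass_index in range(passes):
        # set() ensures a single pass cannot vote twice for the same issue
        votes.update(set(review_once(text, pass_index)))
    return sorted(issue for issue, count in votes.items() if count >= min_votes)


def stub_reviewer(text: str, pass_index: int) -> List[str]:
    # Hypothetical stand-in for a model call; the findings are made up.
    findings = [
        ["Lemma 2 bound is off by a factor of 2"],
        ["Lemma 2 bound is off by a factor of 2", "Table 1 lacks error bars"],
        ["Table 1 lacks error bars", "Typo in abstract"],
    ]
    return findings[pass_index % len(findings)]


report = multi_pass_review("paper text", stub_reviewer)
print(report)
```

With the stub data, the two issues raised in two separate passes survive and the one-off typo flag is dropped. A single-pass baseline corresponds to `passes=1, min_votes=1`, which keeps whatever the first pass happens to say.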

The two conferences handled the experiment in different ways, and that difference is part of what the paper now puts on the table. ICML framed PAT as an explicit research program from the start. The ICML blog post from January 2026 said authors could opt in or out, and the retrospective published in March 2026 treats the deployment as an experiment the organizers want to learn from rather than a permanent feature of the conference. STOC took a narrower route. Google's writeup of the STOC deployment describes PAT as a Gemini-backed feedback tool aimed at theoretical computer scientists, which limits the scope to work that already leans heavily on proof-checking. Neither conference gave PAT a seat in the human-reviewer deliberation that decides acceptance.

The most quoted number so far is 34%. According to the paper, PAT identified 34% more mathematical errors than a baseline AI review that got only one shot at each submission. The baseline uses what researchers call zero-shot prompting, where a model is given the task with no worked examples; here it also answered in a single pass, with no second look to find its own mistakes. The comparison, in other words, is one multi-pass AI system against a single-pass AI system, not AI against humans.
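To make the arithmetic concrete, here is a minimal sketch of what "34% more" means as a relative figure. The counts are hypothetical and chosen only so the result lands on 34%; the public sources described here do not give the underlying error counts.

```python
def relative_gain(system_count: int, baseline_count: int) -> float:
    """Return how many more items the system found, as a fraction of the baseline."""
    if baseline_count <= 0:
        raise ValueError("baseline_count must be positive")
    return (system_count - baseline_count) / baseline_count


# Hypothetical counts, not taken from the preprint.
baseline_errors = 100  # errors caught by the single-pass baseline
pat_errors = 134       # errors caught by the multi-pass system

gain = relative_gain(pat_errors, baseline_errors)
print(f"{gain:.0%}")
```

The figure is relative to the baseline AI's count, so it says nothing on its own about how many errors exist in the papers or how many a human reviewer would catch.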

Three things make the 34% worth pausing on before repeating it. First, the paper was written by the PAT team itself and published on arXiv, the open repository where computer scientists post work before formal peer review. The number could shift in a later version. Second, the metric is specifically about catching mathematical errors, not about novelty, methodological soundness, reproducibility, ethics, or importance, which are the other categories peer review is supposed to cover. Third, the headline figure of roughly 10,000 papers bundles ICML and STOC together, and the per-conference breakdown, plus how often human reviewers actually used PAT's reports, is not spelled out in the public sources.

The paper also proposes a four-level taxonomy of AI-human collaboration in scientific evaluation, walking from AI as a tool for spotting typos to AI as a near-peer that human reviewers lean on heavily. That taxonomy is more durable than the 34% number, because it gives the field a vocabulary to argue about future AI review systems, including ones built by people other than Google.

Several open questions now sit on top of the public record. The ICML blog post describes an opt-in mechanism but does not say how many authors opted in. The retrospective and the Google STOC post are positive in tone but stop short of endorsing PAT as a permanent part of the review process. The conferences have not said whether PAT reports reached human reviewers, whether reviewers were required to read them, or whether any acceptance decisions were reversed on the basis of PAT's flags. Those are the questions the next round of post-mortems, or the peer-reviewed version of the arXiv preprint, will need to answer.

The thing to watch now is whether the preprint turns into a peer-reviewed paper, and whether ICML or STOC publishes a follow-up retrospective that puts numbers on author opt-in, reviewer usage of PAT reports, and any case where the AI system and a human reviewer disagreed on a paper. Those are the figures that will decide whether the experiment stays a footnote or becomes a template.

