New paper in Nature's Scientific Reports: benchmarking multimodal LLMs on dynamical astronomy

Together with Valerio Carruba I’m happy to share our new paper, just out in Scientific Reports — the Nature Portfolio journal (Volume 16, Article 10785, 2026). This is our first piece in the Nature family, which I’m quietly very pleased about. It builds directly on my 2024 Astrophysical Journal study, where I first showed that a single multimodal LLM — at the time, GPT-4-vision-preview — could classify mean-motion resonances from images of resonant arguments with near-perfect accuracy. That study answered a narrow question: can it work at all? The new paper takes the obvious next step and asks: how well does it work, across the field, two years later?

What we built

The core contribution of this paper is a set of four publicly released benchmark datasets (RB-TEST, RB-PILOT, RB-SMALL, and RB-FULL) covering clear, ambiguous, and transient cases of resonant behaviour. They span both two-body and secular resonances, both binary (resonant / non-resonant) and three-class (libration / circulation / transient) labels, and a deliberate mix of “easy” and “hard” examples, including the most pathological regimes: resonance sticking, separatrix crossing, and chaotic evolution.
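To make the three labels concrete, here is a minimal sketch that generates synthetic resonant-argument time series for each class. It is an illustration only, not the benchmark data: the function names, amplitude, period, circulation rate, and switch time are all made-up values chosen for readability.

```python
import math

def wrap_deg(angle):
    """Wrap an angle in degrees into [0, 360)."""
    return angle % 360.0

def libration(t, center=180.0, amplitude=60.0, period=50.0):
    # Librating argument: oscillates around a fixed center (hypothetical values).
    return wrap_deg(center + amplitude * math.sin(2.0 * math.pi * t / period))

def circulation(t, rate=7.0):
    # Circulating argument: sweeps through all angles at a steady rate.
    return wrap_deg(rate * t)

def transient(t, switch=500.0):
    # Transient case: librates at first, then leaves the resonance and circulates.
    return libration(t) if t < switch else circulation(t)

def span(values):
    # Range of angles visited. Naive: it would misread libration around 0 degrees,
    # where the wrapped angle jumps between ~0 and ~360.
    return max(values) - min(values)

times = [float(i) for i in range(1000)]
generators = {"libration": libration, "circulation": circulation, "transient": transient}
series = {name: [f(t) for t in times] for name, f in generators.items()}

for name, values in series.items():
    print(f"{name:12s} span over full run: {span(values):6.1f} deg")
print(f"transient    span over first half: {span(series['transient'][:500]):6.1f} deg")
```

Plotted against time, the first series stays in a band around 180°, the second fills the whole 0–360° range, and the third looks like the first for half the run and like the second afterwards. The span helper is there only to show the difference numerically; it is not a classifier and not a method from the paper.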

Each benchmark was prepared so that the input to the model is exactly what an astronomer would inspect by eye: a plot of the resonant argument over time. No tabular features, no engineered descriptors. We then ran the full suite through three categories of models:

Flagship commercial models — GPT-5, Claude 4.5 Sonnet, Gemini 2.5 Pro, and others.
Large open-source models — Llama 3.2-vision, Gemma 3, Mistral-3.2, and similar.
Small locally runnable models — including Gemma 3 1b/4b, which fit on an ordinary laptop.

For each model we used standardized prompts: a full prompt for the large models, and a simplified variant for the small ones that genuinely struggle with complex instructions.
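The evaluation loop this implies can be sketched as follows. This is a hedged illustration rather than the code used for the paper: the prompt strings are placeholders, `ask` stands for whatever function sends an image and a prompt to a given model, and `stub_model` is a fake model that only exists so the example runs offline.

```python
from typing import Callable, Iterable, List, Optional, Tuple

LABELS = ("libration", "circulation", "transient")

# Placeholder wording, NOT the prompts used in the paper.
FULL_PROMPT = (
    "You are shown a plot of a resonant argument versus time. "
    "Decide whether it shows libration, circulation, or transient behaviour, "
    "and answer with exactly one of those words."
)
SIMPLE_PROMPT = "Answer with one word: libration, circulation, or transient."

def pick_prompt(is_small_model: bool) -> str:
    """Small models get the simplified prompt, everything else the full one."""
    return SIMPLE_PROMPT if is_small_model else FULL_PROMPT

def parse_label(reply: str) -> Optional[str]:
    """Return the label named in a free-text reply, or None if zero or several appear."""
    text = reply.lower()
    found = [label for label in LABELS if label in text]
    return found[0] if len(found) == 1 else None

def evaluate(
    samples: Iterable[Tuple[str, str]],
    ask: Callable[[str, str], str],
    prompt: str,
) -> List[Tuple[str, Optional[str]]]:
    """Run every (image_path, true_label) pair through `ask`; return (truth, prediction) pairs."""
    results = []
    for image_path, truth in samples:
        reply = ask(image_path, prompt)
        results.append((truth, parse_label(reply)))
    return results

def stub_model(image_path: str, prompt: str) -> str:
    # Stand-in for a real model call: it "reads" the answer off the file name.
    if "lib" in image_path:
        return "The argument shows Libration."
    if "circ" in image_path:
        return "circulation"
    return "Could be libration or circulation."

# Hypothetical file names and labels, for illustration only.
samples = [
    ("plot_lib_001.png", "libration"),
    ("plot_circ_001.png", "circulation"),
    ("plot_trans_001.png", "transient"),
]
results = evaluate(samples, stub_model, pick_prompt(is_small_model=True))
print(results)
```

Treating an ambiguous or unparseable reply as `None` (and therefore as an error) is a design choice of this sketch; it keeps a model from getting credit for hedging between two labels.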

What we found

A few results stood out.

On unambiguous cases, commercial LLMs reach F₁ = 100%. That’s the easy part — and largely the same conclusion as the 2024 study, now confirmed across a broader and harder set of vendors and architectures.
On the three-class RB-SMALL dataset, the best commercial models reach F₁ ≈ 94%. Adding “transient” as a separate label is where things get genuinely hard. Most errors live in the resonance-sticking regime — exactly the regime where classical methods also fail.
The best open-source models also reach F₁ = 100% on clear cases, and 76% on the harder three-class set. On the full binary benchmark, they sit at F₁ ≈ 90–96% — close enough to commercial performance to matter for practical work, and good enough to run an entire population-level study locally without paying anyone per token.
Even small (1B–4B parameter) open-source models turn out to be useful. Their accuracy is not flagship-level, but it is high enough to do real screening work on a researcher’s laptop, which matters for groups without access to API budgets.
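For readers less used to the metric: the F₁ score for a class is 2·TP / (2·TP + FP + FN), the harmonic mean of precision and recall. Below is a minimal sketch of how it can be computed for a three-class task. The macro average (a plain mean over classes) is an assumption made for this illustration, not a statement about the exact averaging used in the paper, and the labels in the example are invented.

```python
def f1_per_class(y_true, y_pred, labels):
    """F1 for each label, computed from true positives, false positives, false negatives."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # By convention here, a class with no true or predicted members scores 0.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    scores = f1_per_class(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)

labels = ["libration", "circulation", "transient"]
# Invented toy labels: two mistakes, both involving the transient class.
y_true = ["libration", "libration", "circulation", "circulation", "transient", "transient"]
y_pred = ["libration", "libration", "circulation", "transient", "transient", "libration"]

per_class = f1_per_class(y_true, y_pred, labels)
print(per_class)
print(round(macro_f1(y_true, y_pred, labels), 3))
```

In this toy case four of six predictions are correct, yet the transient class scores only 0.5, which is the same pattern described above: the overall number is pulled down mainly by the hardest label.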

A more conceptual finding: across all model families, the types of mistakes are similar. Models do not fail randomly. They fail in transient and resonance-sticking regimes, which are precisely the cases where an expert astronomer would also slow down and stare at the plot. In other words, LLMs are not just classifying these images; they are making the same kinds of errors a competent human would make.

Why this matters

Three reasons, in order of increasing scope.

First, for our own field of dynamical astronomy, this means visual inspection at population scale is now feasible. Population studies in the main belt or the TNO region can involve hundreds of thousands of candidate resonant arguments per object. A trained ML classifier handles only the resonance it was trained on; an LLM handles the whole zoo, including new resonance types we haven’t even thought to label, with zero re-training. The bottleneck of expert visual inspection is, finally, breakable.

Second, for researchers without big budgets, the open-source results matter more than the commercial ones. The fact that a 4B-parameter model running locally can do 90%+ of the work cleanly is genuinely democratizing. You don’t need an API contract to do real science here.

Third, for the broader question of LLMs in natural science, this paper is a small piece of evidence that the answer to “can LLMs do real scientific classification?” is now: yes, repeatedly, across vendors, and increasingly even on cheap hardware. The 2024 result was a single point of light; this paper turns it into a curve.

The released benchmarks are intended as a reproducible standard — anyone, including future models we cannot yet test, can be evaluated on the exact same task. The full paper is available at DOI: 10.1038/s41598-026-45926-y.