New paper in Nature's Scientific Reports: benchmarking multimodal LLMs on dynamical astronomy

Posted on: April 5, 2026 at 10:00 AM

Together with Valerio Carruba I’m happy to share our new paper, just out in Scientific Reports — the Nature Portfolio journal (Volume 16, Article 10785, 2026). This is our first piece in the Nature family, which I’m quietly very pleased about. It builds directly on my 2024 Astrophysical Journal study, where I first showed that a single multimodal LLM — at the time, GPT-4-vision-preview — could classify mean-motion resonances from images of resonant arguments with near-perfect accuracy. That study answered a narrow question: can it work at all? The new paper takes the obvious next step and asks: how well does it work, across the field, two years later?

What we built

The core contribution of this paper is a set of four publicly released benchmark datasets (RB-TEST, RB-PILOT, RB-SMALL, and RB-FULL) covering clear, ambiguous, and transient cases of resonant behaviour. They span both two-body and secular resonances, both binary (resonant / non-resonant) and three-class (libration / circulation / transient) labels, and a deliberate mix of “easy” and “hard” examples, including the most pathological regimes: resonance sticking, separatrix crossing, and chaotic evolution.

Each benchmark was prepared so that the input to the model is exactly what an astronomer would inspect by eye: a plot of the resonant argument over time. No tabular features, no engineered descriptors. We then ran the full suite through three categories of models.
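To make "a plot of the resonant argument over time" concrete, here is a minimal sketch of the two textbook behaviours such a plot can show. It is not the code used to build the benchmarks: the series are synthetic, and the function names, amplitude, period, drift rate, and time grid are all assumptions chosen for illustration. A librating argument oscillates within a bounded band, while a circulating one sweeps through every angle.

```python
import math


def wrap_degrees(angle):
    """Wrap an angle in degrees into the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def librating_argument(times, amplitude=40.0, period=500.0):
    """Toy librating argument: bounded oscillation around 0 degrees."""
    return [
        wrap_degrees(amplitude * math.sin(2.0 * math.pi * t / period))
        for t in times
    ]


def circulating_argument(times, rate=1.5):
    """Toy circulating argument: steady drift through all angles."""
    return [wrap_degrees(rate * t) for t in times]


# Hypothetical time grid in arbitrary units (assumption, not from the paper).
times = [10.0 * i for i in range(1000)]

libration = librating_argument(times)
circulation = circulating_argument(times)

# The angular range covered is what the eye picks up in a plot:
# narrow for libration, nearly the full 360 degrees for circulation.
libration_span = max(libration) - min(libration)
circulation_span = max(circulation) - min(circulation)
print(f"libration span:   {libration_span:.1f} deg")
print(f"circulation span: {circulation_span:.1f} deg")
```

Plotting either list against `times` with any plotting library would give an image of the kind described above. Real cases are messier, which is where the transient and sticking examples come in.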

For each model we used standardized prompts: a full prompt for the large models, and a simplified variant for the small ones that genuinely struggle with complex instructions.

What we found

A few results stood out.

A more conceptual finding: across all model families, the type of mistakes is similar. Models do not fail randomly. They fail in transient and resonance-sticking regimes, which are precisely the cases where an expert astronomer would also slow down and stare at the plot. In other words, LLMs are not just classifying these images; they are making the same kind of errors a competent human would make.
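One way to check whether errors cluster in a particular regime is to break the error rate down by true label. The sketch below assumes the three-class labels named earlier. The helper name `error_rate_by_label` is hypothetical, and the records are placeholders invented for illustration, not results from the paper.

```python
from collections import defaultdict


def error_rate_by_label(records):
    """Return {true_label: fraction misclassified} for (true, predicted) pairs."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for true_label, predicted_label in records:
        totals[true_label] += 1
        if predicted_label != true_label:
            errors[true_label] += 1
    return {label: errors[label] / totals[label] for label in totals}


# Placeholder records invented for illustration only; NOT results from the paper.
toy_records = [
    ("libration", "libration"),
    ("libration", "libration"),
    ("circulation", "circulation"),
    ("circulation", "circulation"),
    ("transient", "libration"),
    ("transient", "transient"),
]

rates = error_rate_by_label(toy_records)
for label, rate in rates.items():
    print(f"{label}: {rate:.0%} misclassified")
```

If mistakes were random, the per-label rates would be roughly equal. A breakdown in which one label dominates the errors is the pattern the paragraph above describes.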

Why this matters

Three reasons, in order of increasing scope.

First, for our own field of dynamical astronomy, this means visual inspection at population scale is now feasible. Population studies in the main belt or the TNO region can involve hundreds of thousands of candidate resonant arguments per object. A trained ML classifier handles only the resonance it was trained on; an LLM handles the whole zoo, including new resonance types we haven’t even thought to label, with zero re-training. The bottleneck of expert visual inspection is, finally, breakable.

Second, for researchers without big budgets, the open-source results matter more than the commercial ones. The fact that a 4B-parameter model running locally can do 90%+ of the work cleanly is genuinely democratizing. You don’t need an API contract to do real science here.

Third, for the broader question of LLMs in natural science, this paper is a small piece of evidence that the answer to “can LLMs do real scientific classification?” is now: yes, repeatedly, across vendors, and increasingly even on cheap hardware. The 2024 result was a single point of light; this paper turns it into a curve.

The released benchmarks are intended as a reproducible standard — anyone, including future models we cannot yet test, can be evaluated on the exact same task. The full paper is available at DOI: 10.1038/s41598-026-45926-y.

Evgeny Smirnov is a researcher, entrepreneur, and software developer based in Barcelona.