Bayesian Statistics
Hierarchical models and uncertainty quantification for algorithm benchmarking
Context
Computational chemistry benchmarks almost always report point estimates over a small test set (5 to 50 cases is typical). A mean wall time says nothing about the spread, and a ranking built on means can reflect noise from the particular test set rather than genuine differences between the algorithms.
Hierarchical Bayesian approach
We fit hierarchical models with brms (Bayesian Regression Models using Stan) that treat each test case as drawn from a population (Goswami 2025). The model estimates both the mean performance per algorithm and the between-problem variance, producing full posterior distributions over rankings rather than single numbers.
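A minimal sketch of what such a fit could look like; the data frame `bench` and its columns `time`, `algorithm`, and `problem` are assumptions for illustration, not the actual dataset:

```r
library(brms)

# Hypothetical long-format benchmark: one row per (algorithm, problem) run.
# The population-level effects give the mean performance per algorithm;
# the per-problem random intercepts and slopes give the between-problem variance.
fit <- brm(
  time ~ algorithm + (1 + algorithm | problem),
  data   = bench,
  family = lognormal(),  # wall times are positive and right-skewed
  chains = 4, cores = 4, seed = 1
)
summary(fit)
```

The `(1 + algorithm | problem)` term is where the partial pooling happens: each test problem informs, but does not dictate, the population-level comparison.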
On 238 molecular reactions comparing the Dimer, GPDimer, and OT-GP saddle search methods, this separates real algorithmic differences from problem-selection noise. The output reads as “algorithm A beats B with 94% posterior probability” rather than “A has a lower mean than B.”
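A hedged sketch of how such a probability could be read off the fitted model above; the coefficient name `algorithmGPDimer` assumes default treatment coding with Dimer as the reference level:

```r
# Posterior probability that GPDimer has a lower expected (log) wall time
# than the Dimer reference level, i.e. P(GPDimer faster than Dimer | data).
hypothesis(fit, "algorithmGPDimer < 0")

# Equivalently, from the raw posterior draws:
draws <- as_draws_df(fit)
mean(draws$b_algorithmGPDimer < 0)
```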
Transferability
Nothing in the model is specific to saddle point searches. Any benchmark that compares algorithms across test problems (ML potentials on material classes, solvers on PDE families, samplers on target distributions) drops into the same hierarchical structure and inherits honest uncertainty on its rankings, as sketched below.
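For instance, the same formula carries over with only the names changed; `pde_bench`, `error`, `solver`, and `pde_family` are illustrative placeholders:

```r
# Same hierarchical structure, different domain: solver error across PDE families.
fit_pde <- brm(
  error ~ solver + (1 + solver | pde_family),
  data = pde_bench, family = lognormal()
)
```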
Open directions
- Applying the hierarchical Bayesian framework to benchmark ML potentials (not just saddle search algorithms) across material classes.
- Connecting performance uncertainty to experimental prediction uncertainty: when does algorithmic noise dominate measurement noise?