
Bayesian Statistics


Hierarchical models and uncertainty quantification for algorithm benchmarking

Figure: DAG of the Bayesian hierarchical benchmarking model

Context

Computational chemistry benchmarks almost always report point estimates over a small test set (5 to 50 cases is typical). A mean wall time says nothing about the spread, and a ranking built on means can be noise from the test set rather than signal about the algorithms.

Hierarchical Bayesian approach

We fit hierarchical models with brms (Bayesian Regression Models using Stan) that treat each test case as drawn from a population (Goswami 2025). The model estimates both the mean performance per algorithm and the between-problem variance, producing full posterior distributions over rankings rather than single numbers.
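A minimal sketch of that generative structure, in Python rather than brms/Stan; the parameter values and names (`algo_means`, `sigma_problem`, `sigma_noise`) are illustrative assumptions, not fitted results:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population-level means of log wall time per algorithm
# (illustrative values, not estimates from the paper).
algo_means = {"Dimer": 2.0, "GPDimer": 1.5, "OT-GP": 1.2}
sigma_problem = 0.8  # between-problem spread (shared random effect)
sigma_noise = 0.3    # residual within-problem noise

n_problems = 238
# One random effect per test problem, shared by every algorithm run on it.
problem_effect = rng.normal(0.0, sigma_problem, size=n_problems)

# Observation model: log wall time = algorithm mean + problem effect + noise.
data = {
    algo: mu + problem_effect + rng.normal(0.0, sigma_noise, size=n_problems)
    for algo, mu in algo_means.items()
}
```

The hierarchical fit runs this generative story in reverse: it estimates the per-algorithm means and the between-problem variance jointly, so algorithms are compared after accounting for how much of the spread is the problems rather than the methods.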

Applied to 238 molecular reactions comparing the Dimer, GPDimer, and OT-GP saddle search methods, the model separates real differences from problem-selection noise. The output reads as “algorithm A beats B with 94% posterior probability” rather than “A has a lower mean than B.”
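A statement like that is just the fraction of posterior draws in which one algorithm's mean comes out lower. A sketch with simulated draws standing in for real MCMC output (the means and spreads are assumptions; the paper's 94% figure is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for MCMC draws of two algorithms' mean log wall times
# (lower is better); real draws would come from the fitted model.
mu_a = rng.normal(1.2, 0.1, size=4000)
mu_b = rng.normal(1.5, 0.1, size=4000)

# Posterior probability that algorithm A outperforms algorithm B.
p_a_beats_b = float(np.mean(mu_a < mu_b))
print(f"P(A beats B) = {p_a_beats_b:.2f}")
```

Because the probability is computed draw-by-draw, it automatically reflects all the uncertainty the model carries, including the between-problem variance.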

Transferability

Nothing in the model is specific to saddle point searches. Any benchmark that compares algorithms across test problems - ML potentials on material classes, solvers on PDE families, samplers on target distributions - drops into the same hierarchical structure and inherits honest uncertainty on its rankings.

Open directions

  • Applying the hierarchical Bayesian framework to benchmark ML potentials (not just saddle search algorithms) across material classes.
  • Connecting performance uncertainty to experimental prediction uncertainty: when does algorithmic noise dominate measurement noise?

References

Goswami, Rohit. 2025. “Bayesian Hierarchical Models for Quantitative Estimates for Performance Metrics Applied to Saddle Search Algorithms.” AIP Advances 15 (8): 085210. https://doi.org/10.1063/5.0283639.
