Why is there so little statistical analysis in ML research?

Why is it so common in ML research to skip the statistical tests that would confirm results are genuinely significant? Often a single outcome is reported instead of conducting multiple runs and applying a test such as a t-test or the Mann-Whitney U test. In other fields, like psychology or medicine, drawing conclusions from a single sample would be unacceptable, so why is this not seen as an issue in ML research?
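For what it's worth, running the tests you mention is cheap once you have per-seed results. Here's a minimal sketch using SciPy, with made-up accuracy numbers standing in for ten seeded runs of two hypothetical methods:

```python
# Hypothetical example: final test accuracies of two methods, each trained
# with 10 random seeds (all numbers are made up for illustration).
from scipy.stats import ttest_ind, mannwhitneyu

method_a = [0.912, 0.908, 0.915, 0.910, 0.905, 0.913, 0.909, 0.911, 0.907, 0.914]
method_b = [0.905, 0.903, 0.910, 0.901, 0.899, 0.906, 0.904, 0.902, 0.900, 0.907]

# Welch's t-test: does not assume equal variances between the two groups.
t_stat, t_p = ttest_ind(method_a, method_b, equal_var=False)

# Mann-Whitney U: non-parametric, makes no normality assumption.
u_stat, u_p = mannwhitneyu(method_a, method_b, alternative="two-sided")

print(f"t-test p-value: {t_p:.4f}")
print(f"Mann-Whitney U p-value: {u_p:.4f}")
```

The point is that the marginal cost over the runs themselves is a couple of lines, so the barrier is doing the multiple runs, not the test.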

Also, can anyone recommend a book that covers statistical tests in the context of ML?

5 Likes

Such statistics are primarily relevant when working with a small dataset. In ML, you typically have a large amount of data, and most tests yield significant results, which isn’t very informative. What’s truly needed are cross-validation, sense checking, and possibly sensitivity and uncertainty analysis.
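To illustrate the cross-validation point: instead of a single train/test split, you get a distribution of scores, which already tells you how stable the result is. A minimal sketch with scikit-learn on a built-in toy dataset (the dataset and classifier are just placeholders):

```python
# Sketch: cross-validation yields a distribution of scores rather than a
# single point estimate (scikit-learn, toy dataset for illustration only).
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(max_iter=5000)

# 5-fold CV: five held-out scores instead of one.
scores = cross_val_score(clf, X, y, cv=5)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting mean and standard deviation across folds (or seeds) is often more informative than a p-value on a single comparison.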

4 Likes

No book, but I can recommend 3 papers on statistical testing for machine learning:

Statistical Comparisons of Classifiers over Multiple Data Sets - JMLR

An Extension on “Statistical Comparisons of Classifiers over Multiple Data Sets” for all Pairwise Comparisons - JMLR

Should We Really Use Post-Hoc Tests Based on Mean-Ranks? - JMLR

3 Likes

I’m already using a lot of GPUs for hyperparameter optimization. Running additional tests would be time-consuming and honestly a hassle for minimal benefit—especially with LLMs.

In most cases, it’s not that important. Many papers claim their method is superior because they improved the SOTA by just 0.05 percentage points, an improvement that likely wouldn’t remain statistically significant over multiple trials.

However, even if the improvement didn’t survive significance testing, the method would still be valid. They may not hold the “SOTA record,” but it wouldn’t change the fact that it’s solid research.

In short, it’s not worth conducting those tests.

2 Likes

The real reason is that they can get away with it. And you’re correct; this wouldn’t be accepted in other fields.

1 Like
