Why is it so common in ML research to skip statistical tests that would confirm the results are genuinely significant? Often a single result is reported instead of running multiple trials and applying tests like a t-test or Mann-Whitney U test. In other fields, like psychology or medicine, drawing conclusions from a single sample would be unacceptable, so why isn't this seen as an issue in ML research?
Also, can anyone recommend a book that covers statistical tests in the context of ML?
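For concreteness, this is roughly the kind of check I have in mind: run each method over a handful of seeds and compare the score distributions. The numbers below are made up purely for illustration.

```python
# Hypothetical example: accuracy over 5 random seeds for two methods.
# All numbers are invented purely for illustration.
from scipy.stats import ttest_ind, mannwhitneyu

baseline = [0.871, 0.874, 0.869, 0.872, 0.870]  # method A, one score per seed
proposed = [0.875, 0.878, 0.873, 0.876, 0.874]  # method B, one score per seed

# Welch's t-test (does not assume equal variances)
t_stat, t_p = ttest_ind(proposed, baseline, equal_var=False)

# Mann-Whitney U: non-parametric alternative, no normality assumption
u_stat, u_p = mannwhitneyu(proposed, baseline, alternative="greater")

print(f"Welch t-test:   t={t_stat:.2f}, p={t_p:.4f}")
print(f"Mann-Whitney U: U={u_stat:.1f}, p={u_p:.4f}")
```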
Such tests are primarily relevant when working with small samples. In ML you typically have a large amount of data, so almost any comparison comes back statistically significant, which isn't very informative. What's truly needed is cross-validation, sanity checking, and possibly sensitivity and uncertainty analysis.
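To illustrate the point, here's a toy sketch (all numbers invented): with a million test examples, even a practically meaningless gap of 0.001 comes back as overwhelmingly "significant".

```python
# Toy illustration: with large n, a negligible difference is "significant".
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n = 1_000_000
# Two "models" whose per-example scores differ by a trivial 0.001
a = rng.normal(loc=0.500, scale=0.1, size=n)
b = rng.normal(loc=0.501, scale=0.1, size=n)

t_stat, p = ttest_ind(a, b)
print(f"mean diff = {b.mean() - a.mean():.4f}, p = {p:.2e}")
# p is astronomically small, yet a 0.001 gap is practically meaningless.
```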
I'm already burning a lot of GPU hours on hyperparameter optimization. Repeating every run enough times for significance testing would be time-consuming and, honestly, a hassle for minimal benefit, especially with LLMs.
In most cases it's not that important. Many papers claim their method is superior because it improves SOTA by just 0.05 percentage points, a gap that probably wouldn't survive a significance test across multiple runs.
However, even if that were true, the method would still be valid. They may not hold the “SOTA record,” but it wouldn’t change the fact that it’s solid research.
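To make the 0.05-point case concrete, here's a toy simulation with invented numbers; the per-seed noise level of 0.3 points is purely an assumption. Under it, the "better" method wins only slightly more often than a coin flip on any given seed.

```python
# Toy simulation: does a 0.05-point SOTA gain survive seed-to-seed noise?
# All numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(42)
n_seeds = 1000
true_gap = 0.05    # assumed "improvement" in accuracy points
seed_noise = 0.3   # assumed per-seed standard deviation in points

baseline = rng.normal(90.00, seed_noise, n_seeds)
proposed = rng.normal(90.00 + true_gap, seed_noise, n_seeds)

wins = np.mean(proposed > baseline)
print(f"proposed beats baseline on {wins:.0%} of seeds")
# With noise ~6x the gap, this lands around 55%: a single reported
# win is close to a coin flip.
```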
In summary, it's usually not worth running those tests.