Methods for Evaluating Model Calibration

The problem I'm working on right now has a severe class imbalance, and I'm comparing different evaluation metrics. One interesting thing I've noticed is that some models achieve lower Brier scores but also produce extremely low probabilities for the positive class.

Are there any other metrics that would be more appropriate for this situation?

If you're not already correcting for the class imbalance in your minibatches or using weighted updates/loss functions, those techniques would be worth considering.
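For example, if you happen to be using PyTorch, a minimal sketch of a class-weighted loss might look like this (the class counts below are made up purely for illustration):

```python
import torch
import torch.nn as nn

# Made-up class counts for illustration: weight the rare positive class
# by the negative/positive ratio so its errors count more in the loss.
num_neg, num_pos = 99_000, 1_000
pos_weight = torch.tensor([num_neg / num_pos])

criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(8, 1)                     # raw model outputs for one minibatch
targets = torch.randint(0, 2, (8, 1)).float()  # binary labels
loss = criterion(logits, targets)
```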

Alternatively, gathering more data for under-represented classes could help.

Another approach is to bin your test predictions into a normalized histogram and assess calibration that way. For instance, if your predictions for Class 2 with probabilities between 0.9 and 0.95 turn out to be correct 99% of the time, the model is under-confident in that range.
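A rough sketch of that kind of binned check, assuming numpy arrays `y_true` (binary labels) and `y_prob` (predicted probabilities for the positive class):

```python
import numpy as np

def calibration_bins(y_true, y_prob, n_bins=20):
    """Compare mean predicted probability with the observed positive rate per bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_prob, edges[1:-1])  # assign each prediction to a bin
    rows = []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            rows.append({
                "range": (edges[b], edges[b + 1]),
                "count": int(mask.sum()),                 # predictions in this bin
                "mean_pred": float(y_prob[mask].mean()),  # average predicted probability
                "observed": float(y_true[mask].mean()),   # empirical positive rate
            })
    return rows

# e.g. a bin covering 0.90-0.95 with an observed rate of 0.99 would point to
# under-confidence in that range.
```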

1 Like

While I worked on CTR models, we used a statistic called median absolute percentage error (MedAPE). Compute the mean y value for each percentile bucket of your predicted probability distribution p. Then get the error for each bucket as e = mean(p) - mean(y), and take the median of those errors as your metric. This gives a decent summary of the reliability curve.
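If I'm reading that correctly, a minimal numpy sketch of the statistic (assuming `p` holds the predicted probabilities and `y` the binary labels) would be:

```python
import numpy as np

def medape(p, y, n_buckets=100):
    """Median of per-bucket calibration errors, e = mean(p) - mean(y)."""
    # Bucket predictions by percentiles of the predicted distribution p.
    # np.unique guards against duplicate edges when p has many tied values.
    edges = np.unique(np.percentile(p, np.linspace(0, 100, n_buckets + 1)))
    bucket_ids = np.digitize(p, edges[1:-1])
    errors = []
    for b in range(len(edges) - 1):
        mask = bucket_ids == b
        if mask.any():
            errors.append(p[mask].mean() - y[mask].mean())
    # Median of the signed errors, as described above; taking np.abs(errors)
    # first is a variant if you only care about the magnitude.
    return np.median(errors)
```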

2 Likes

Thanks for the reply. I haven't used MedAPE, although I have used MAPE in the past. A follow-up question: how does this capture calibration under such a severe class imbalance? Maybe I'm not understanding it correctly.

2 Likes

I was just referring to inspecting the calibration curve on your test data, but you could do a deeper analysis to figure out which predictions are the least confident and why.
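For reference, one quick way to pull out that curve (a sketch assuming scikit-learn; the toy arrays below are just stand-ins for your real test labels and probabilities):

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Toy stand-ins for real test-set labels and predicted probabilities.
rng = np.random.default_rng(0)
y_prob = rng.uniform(0, 1, size=1000)
y_true = (rng.uniform(0, 1, size=1000) < y_prob).astype(int)

# prob_true[i] is the observed positive rate in bin i,
# prob_pred[i] is the mean predicted probability in that bin.
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)

# Bins where the observed rate and mean prediction diverge the most
# are the ones worth digging into.
for pt, pp in zip(prob_true, prob_pred):
    print(f"mean predicted {pp:.2f}  observed rate {pt:.2f}  gap {pt - pp:+.2f}")
```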

2 Likes