Breaking the Monotony: Calibrating Without Preserving Monotonicity

Calibration refers to how well a model's predicted probabilities match the actual frequencies with which events occur. For instance, when a calibrated model predicts that an event has a 20% probability of occurring, that event occurs about 20% of the time. The intuition extends to continuous values: when a calibrated model predicts $20 of revenue, it brings in $20 on average. (The standard error of the average difference between predicted and actual values depends on how good the model is.)

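To make the definition concrete, here is a minimal sketch of a reliability check on simulated data (the bin count and the amount of overconfidence are arbitrary choices): group predictions into bins and compare the average predicted probability in each bin with the observed event rate. For a calibrated model, the two columns match.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated example: the "model" is deliberately overconfident.
p_true = rng.uniform(0.05, 0.95, size=10_000)           # true event rates
y = rng.binomial(1, p_true)                              # observed outcomes
p_pred = np.clip(p_true + 0.3 * (p_true - 0.5), 0, 1)    # overconfident predictions

# Reliability check: within each prediction bin, compare the mean predicted
# probability with the observed event rate. A calibrated model matches.
bins = np.linspace(0, 1, 11)
bin_idx = np.digitize(p_pred, bins[1:-1])
for b in range(10):
    mask = bin_idx == b
    if mask.any():
        print(f"bin {b}: mean predicted = {p_pred[mask].mean():.2f}, "
              f"observed rate = {y[mask].mean():.2f}")
```
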
Non-linear models like gradient-boosted trees (but also, famously, SVMs) often produce uncalibrated predictions, e.g., here. The conventional explanation is that loss functions like cross-entropy optimize for discrimination rather than calibration, rewarding the model for pushing predicted probabilities for positives toward 1 and for negatives toward 0. (See also, for example, Guo et al.) Given that the issue lies with the loss function, a variety of proposals change the loss function for better calibration, e.g., here, here, etc. Rather than amend the loss function, label smoothing flattens the labels, providing another antidote to overconfident predictions (see here and here). A third set of techniques applies a post-hoc fix: minimize a calibration loss subject to a monotonicity constraint on the mapping from raw scores to calibrated probabilities. Platt scaling, isotonic regression, and temperature scaling are popular examples.

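As a concrete illustration of the post-hoc route, the sketch below wraps a gradient-boosted model in scikit-learn's CalibratedClassifierCV, which implements Platt scaling (method='sigmoid') and isotonic regression (method='isotonic'). The dataset is synthetic and the settings are arbitrary; the point is only to show the mechanics.

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Uncalibrated gradient-boosted model.
gbm = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Post-hoc fixes: Platt scaling ("sigmoid") and isotonic regression, fit on
# held-out folds via cross-validation so the calibrator does not reuse the
# same labels the model was trained on.
platt = CalibratedClassifierCV(GradientBoostingClassifier(random_state=0),
                               method="sigmoid", cv=3).fit(X_train, y_train)
iso = CalibratedClassifierCV(GradientBoostingClassifier(random_state=0),
                             method="isotonic", cv=3).fit(X_train, y_train)

# Compare mean predicted probability vs. observed rate per bin.
for name, model in [("raw", gbm), ("platt", platt), ("isotonic", iso)]:
    prob = model.predict_proba(X_test)[:, 1]
    frac_pos, mean_pred = calibration_curve(y_test, prob, n_bins=10)
    print(name, [f"{m:.2f}/{f:.2f}" for m, f in zip(mean_pred, frac_pos)])
```
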
The post-hoc techniques are unmoored from the root cause. If miscalibration stems not from the objective function but from concept drift (a changing relationship between predictors and labels), which may arise from label drift (e.g., changing ads) or from changed feature mappings (data engineering issues), then preserving monotonicity may not be the right thing to do. In such circumstances, adjustments that allow for minor monotonicity violations can be useful.

Calibre, a new Python package, exposes several ways to trade off goodness-of-fit against monotonicity, including Nearly Isotonic regression and Relaxed PAVA (plausibly novel; it ignores minor violations of monotonicity).

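To show what relaxing the constraint can look like, here is a minimal sketch of nearly isotonic regression in the spirit of Tibshirani, Hoefling, and Tibshirani (2011): rather than forbidding decreases outright, it penalizes them, so small noise-level dips survive while large violations are smoothed away. This is an illustration of the technique written with cvxpy, not calibre's API; the penalty weight lam is an arbitrary tuning parameter.

```python
import cvxpy as cp
import numpy as np

def nearly_isotonic(y, lam=1.0):
    """Nearly isotonic fit: minimize ||y - b||^2 + lam * sum(max(b_i - b_{i+1}, 0)).
    Downward steps are penalized rather than forbidden, so small dips survive;
    lam -> infinity recovers ordinary isotonic regression."""
    b = cp.Variable(len(y))
    downward_steps = cp.sum(cp.pos(b[:-1] - b[1:]))     # total size of decreases
    cp.Problem(cp.Minimize(cp.sum_squares(y - b) + lam * downward_steps)).solve()
    return b.value

# Noisy, mostly increasing scores with small dips from the added noise.
rng = np.random.default_rng(0)
y = np.sort(rng.uniform(0, 1, 30)) + rng.normal(0, 0.05, 30)
print(np.round(nearly_isotonic(y, lam=0.5), 3))
```
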
Using isotonic regression can cause other downstream problems. On noisy data, monotonicity-preserving transforms like isotonic regression yield few distinct cut points, producing long plateaus in the calibrated scores. If the decision function is a simple threshold-based rule and a decision threshold lands on one of these plateaus, minor perturbations in the input data can trigger disproportionately large changes in the resulting decisions. Along with Nearly Isotonic regression and Relaxed PAVA, GAM calibration (see here for an i-spline variant) is a potential solution to the problem.

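A smooth calibrator sidesteps the plateau problem because the fitted mapping has no flat steps for a decision threshold to sit on. Below is a minimal sketch of GAM-style calibration using scikit-learn's SplineTransformer feeding a logistic regression on simulated scores. Unlike the i-spline variant linked above, this basis does not enforce monotonicity, which fits the theme of tolerating minor violations; the number of knots is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

# Simulated raw scores and labels; in practice these come from a held-out set.
rng = np.random.default_rng(0)
scores = rng.uniform(0, 1, 5_000)
labels = rng.binomial(1, scores ** 2)     # scores are miscalibrated by design

# GAM-style calibrator: a cubic spline basis over the raw score, followed by a
# logistic fit. The smooth mapping has none of the flat steps that isotonic
# regression produces on noisy data, at the cost of not guaranteeing monotonicity.
calibrator = make_pipeline(
    SplineTransformer(n_knots=7, degree=3),
    LogisticRegression(C=1.0),
)
calibrator.fit(scores.reshape(-1, 1), labels)

calibrated = calibrator.predict_proba(scores.reshape(-1, 1))[:, 1]
print(np.round(calibrated[:5], 3))
```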