I trained a random forest model with all default parameters using sklearn. The dataset has seven features and roughly 550k data points. The classification is binary, 0 or 1. All seven features are simply the probability of label 1 for a given categorical feature value.

I’m using the built-in 10-fold cross-validator. The only thing I change is standard versus stratified k-fold; the latter performs marginally worse. Why might this be the case?

Do you know the distribution of 0/1 labels in your dataset? Stratified sampling ensures that each training and test fold has approximately the same distribution of 0/1 labels as the entire dataset, so it likely provides a better estimate of performance. If you were to repeat standard k-fold with different random sets of folds, the averaged score would probably converge toward the stratified result.
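A minimal sketch of that repeated comparison, under assumptions: the data is a synthetic stand-in from `make_classification` (7 features, roughly 30% positives, far fewer rows than the real 550k), and the forest uses fewer trees than the default so it runs quickly. None of these numbers come from the original post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

# Synthetic stand-in for the real dataset (assumption): 7 features,
# binary labels with about 30% positives.
X, y = make_classification(
    n_samples=2000, n_features=7, n_informative=5, n_redundant=2,
    weights=[0.7, 0.3], random_state=0,
)

# Fewer trees than the sklearn default, only to keep the sketch fast.
clf = RandomForestClassifier(n_estimators=30, random_state=0)

def mean_cv_accuracy(splitter):
    """Mean accuracy over the folds produced by the given splitter."""
    return cross_val_score(clf, X, y, cv=splitter, scoring="accuracy").mean()

# One stratified 10-fold run.
stratified_score = mean_cv_accuracy(
    StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
)

# Standard 10-fold repeated with different random fold assignments.
plain_scores = [
    mean_cv_accuracy(KFold(n_splits=10, shuffle=True, random_state=seed))
    for seed in range(3)
]

print("stratified 10-fold:", round(stratified_score, 4))
print("standard 10-fold, per seed:", np.round(plain_scores, 4))
print("standard 10-fold, mean over seeds:", round(float(np.mean(plain_scores)), 4))
```

If the spread across seeds for standard k-fold is as large as the gap to the stratified score, the "marginally worse" result is most likely noise from how the folds happened to be drawn.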

I recommended calibration as an alternative way to determine which classifier is better. However, I don’t expect the difference you mentioned in your post to be statistically significant. Given the large number of instances and the relatively mild imbalance, both random k-fold and stratified k-fold should produce very similar label distributions in their folds.
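To see how close the fold distributions are at this scale, here is a rough sketch that compares the fraction of 1-labels in each test fold for both splitters. The 30% positive rate is an assumption for illustration (the post does not state the actual class balance), and the labels are randomly generated rather than taken from the real data.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

rng = np.random.default_rng(0)
n_samples = 550_000

# Hypothetical labels: about 30% ones (assumed, not from the post).
y = (rng.random(n_samples) < 0.3).astype(int)
# The splitters only need the number of rows, so a placeholder X is enough.
X = np.zeros((n_samples, 1))

def fold_positive_rates(splitter):
    """Fraction of 1-labels in each test fold of the given splitter."""
    return np.array([y[test_idx].mean() for _, test_idx in splitter.split(X, y)])

plain_rates = fold_positive_rates(KFold(n_splits=10, shuffle=True, random_state=0))
strat_rates = fold_positive_rates(
    StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
)

print("overall positive rate:", round(float(y.mean()), 4))
print("standard k-fold rates:  ", np.round(plain_rates, 4))
print("stratified k-fold rates:", np.round(strat_rates, 4))
```

With this many rows per fold, the standard k-fold rates should differ from the overall rate only in the third decimal place or so, which is why a large performance gap between the two schemes would be surprising.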

Let’s imagine that, with standard k-fold cross-validation, one fold happens to contain all the outliers. With stratified k-fold, these outliers will instead be spread across every fold. In your 10-fold example, roughly 10% of the outliers will be present in each fold.