I have an imbalanced dataset and I want the model to output probabilities rather than labels, so logistic regression seemed the obvious choice.

However, the classifier started predicting every data point as the majority class, which was a problem for me. I then set class_weight='balanced' in scikit-learn's LogisticRegression, which reweights the classes in the loss function. Now I get a decent model with a ROC AUC of 0.85.
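For reference, my setup looks roughly like this (the data here is synthetic for illustration, not my actual dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 9:1 imbalanced dataset standing in for the real one
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight='balanced' reweights each class by n_samples / (n_classes * n_class_k)
clf = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]  # scores, shifted by the reweighting
auc = roc_auc_score(y_te, proba)
```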

However, I have the following questions:

Do I need to adjust the predicted probabilities, given that the class weights effectively changed the training distribution?

For my evaluation set, I used a stratified split. Is this a good choice, or should my evaluation set be balanced?

Given both classes are equally important, is ROC AUC a good metric?

To understand what class_weight='balanced' does, check out this paper: Menon et al. (2013). It analyzes two main methods for imbalanced classification that target the arithmetic mean of the per-class accuracies (balanced accuracy) rather than plain accuracy.

The first method keeps the model unweighted but changes the decision rule: instead of predicting label 1 when P(Y=1|X=x) > 1/2, you predict label 1 when P(Y=1|X=x) exceeds the class prior P(Y=1). This way, the score given by your classifier still directly represents P(Y=1|X=x).
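As a rough sketch with scikit-learn (synthetic data, not your setup), the first method would look like this:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 9:1 imbalanced dataset for illustration
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)

# No class weights: predict_proba stays an honest estimate of P(Y=1|X=x)
clf = LogisticRegression(max_iter=1000).fit(X, y)

prior = y.mean()                  # estimate of the class prior P(Y=1)
p = clf.predict_proba(X)[:, 1]    # scores interpretable as P(Y=1|X=x)
y_hat = (p > prior).astype(int)   # threshold at the prior, not at 1/2
```

Because the prior is below 1/2 here, this rule flags more points as positive than the default 0.5 threshold would.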

The second method uses a weighted loss function but keeps the classification rule as s > 1/2, where s is the score from your classifier. Here, the score s doesn’t directly represent P(Y=1|X=x) because the weights adjust the output score to fit a threshold of 1/2 rather than P(Y=1).
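If you want to recover something like P(Y=1|X=x) from the weighted model's score, one common trick is to shift the odds back to the true prior. This is a sketch under the assumption that balanced weighting acts roughly like retraining on a 50/50 prior, which is only an approximation for logistic regression:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 9:1 imbalanced dataset for illustration
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
clf = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X, y)

pi = y.mean()                   # true class prior P(Y=1)
s = clf.predict_proba(X)[:, 1]  # shifted scores from the weighted model

# Approximate prior-shift correction: rescale the odds from the
# implicit 50/50 training prior back to the true prior pi.
p = s * pi / (s * pi + (1 - s) * (1 - pi))
```

With pi below 1/2, the corrected probabilities are pulled down relative to the raw scores, as expected.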

I had similar questions before reading the paper. It might seem complex at first, but there might be simpler sources that explain these concepts more clearly.

A logistic regression model usually provides reliable probabilities, but this may not hold with imbalanced datasets. The ROC AUC metric alone doesn't help you choose a threshold; for that you need to examine the ROC curve itself. Also, ROC AUC is unchanged by any monotone transformation of the probabilities, so a good AUC says nothing about calibration. For better-calibrated probabilities, you can use sklearn's CalibratedClassifierCV.
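A minimal sketch of that last suggestion, assuming scikit-learn and synthetic data:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 9:1 imbalanced dataset for illustration
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)

# Wrap the weighted model; sigmoid (Platt) calibration refits the scores
# on cross-validation folds so predict_proba is closer to P(Y=1|X=x)
base = LogisticRegression(class_weight='balanced', max_iter=1000)
cal = CalibratedClassifierCV(base, method='sigmoid', cv=5).fit(X, y)
p = cal.predict_proba(X)[:, 1]
```

method='isotonic' is an alternative when you have enough data for a nonparametric fit.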

Make sure your evaluation set resembles a real-world dataset. If you believe your entire dataset is representative, a stratified split is a good option.
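A quick sketch of what stratification buys you (synthetic data, illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic 9:1 imbalanced dataset for illustration
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Stratification preserves the minority-class rate in both splits,
# so the evaluation set mirrors the overall class distribution
train_rate, test_rate = y_tr.mean(), y_te.mean()
```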

The ROC AUC metric is useful, but if you care more about one specific class, average precision and the precision-recall curve may be more informative.
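Those metrics are a one-liner each in scikit-learn; a sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve

# Synthetic 9:1 imbalanced dataset for illustration
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
clf = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X, y)
scores = clf.predict_proba(X)[:, 1]

ap = average_precision_score(y, scores)  # summarizes the PR curve in one number
precision, recall, thresholds = precision_recall_curve(y, scores)
```

Note that AP is computed with the positive class as the class of interest; swap the labels if the other class matters more.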

Lastly, I recommend checking out this paper, which could provide you with some valuable insights.

@GradientGuru Nice! Both approaches make sense to me. However, it's not entirely up to me to determine the optimal balance between recall and precision. So the cutoff could change over time rather than being fixed based on past experience?