I get that L1 regularization promotes sparsity, which is useful when feature selection is needed. But what are the advantages of using L2 over L1 in typical scenarios? If the goal is simply to keep the weights small, why not use L4 regularization, for instance?

I’ve heard that the L2 norm corresponds to signal energy and Euclidean distance, and that it is rotation invariant. Can someone explain these concepts more clearly and how they come into play?
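For concreteness, here is my current understanding of rotation invariance, as a small numpy check (the specific vector and matrix are arbitrary; any orthogonal transform should behave the same way):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)

# Random orthogonal matrix (rotation/reflection) via QR decomposition
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))

# The L2 norm is preserved under rotation; the L1 norm generally is not
n2_before, n2_after = np.linalg.norm(x, 2), np.linalg.norm(Q @ x, 2)
n1_before, n1_after = np.linalg.norm(x, 1), np.linalg.norm(Q @ x, 1)
print(n2_before, n2_after)  # equal (up to floating-point error)
print(n1_before, n1_after)  # typically different
```

Is this the right way to think about it, and does the invariance matter in practice for regularization?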

When dealing with two highly correlated features, L2 regularization (ridge) tends to give more interpretable results because it spreads the coefficient weight roughly evenly across the correlated features. In contrast, L1 regularization tends to concentrate the weight on one of them, so the coefficients can have very different magnitudes even though the two features carry essentially the same information.

If X1 and X2 measure the same underlying concept and differ only in noise, they still form a linearly independent pair and span a 2-dimensional subspace. This can produce a better (though spurious) fit to the data. L2 regularization treats both features symmetrically, which may smooth the decision boundary without drastically altering it, while L1 regularization will favor a smaller subset of the features.

In this scenario, L2 regularization might result in a larger generalization gap (in terms of accuracy) than L1, even though the overall generalization error might be similar for both methods. (Of course, this depends on the choice of regularization strength, so the observation may not hold in general.)