I’ve come across two configurations for image-based generative models: one using a tanh or sigmoid output with binary cross-entropy, and the other using a linear output with MSE or MAE. Does anyone know why researchers tend to prefer one approach over the other? Are there any papers that explore this topic in depth?

Based on my research, there has been some interesting work comparing regression losses like mean squared error (MSE) with binary cross-entropy (BCE) as the pixel-level loss in generative models. One key paper argues that using BCE is fundamentally a modeling error: it assumes each output pixel follows a Bernoulli distribution (taking only the values 0 or 1), which doesn’t hold for continuous pixel intensities. The authors propose a continuous extension of the Bernoulli distribution, and its likelihood performed markedly better than both plain BCE and MSE. I’ve also read that a sigmoid activation paired with BCE makes it hard for the network to produce extreme pixel values of exactly 0 or 1, leading to faded or dull outputs; some researchers report better results by scaling the output to a slightly wider range such as [-0.1, 1.1] and then truncating back to [0, 1].

Ultimately, the choice between regression losses and BCE seems to depend on the specific generative task and data distribution, but this work suggests that loss functions tailored to the continuous nature of pixel data can outperform the standard choices. It’s an interesting area of exploration for improving the quality and realism of generative models.
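To make the "continuous extension of the Bernoulli" concrete, here is a minimal NumPy sketch of its negative log-likelihood. The key point is that it equals ordinary BCE minus a normalizing term log C(λ), where C(λ) = 2·atanh(1−2λ)/(1−2λ) for λ ≠ 0.5 and C(0.5) = 2. The function names are my own, and the Taylor fallback near λ = 0.5 is a numerical convenience, not part of the paper’s presentation:

```python
import numpy as np

def bce(x, lam, eps=1e-7):
    # Plain per-pixel binary cross-entropy for continuous targets x in [0, 1].
    lam = np.clip(lam, eps, 1 - eps)
    return -(x * np.log(lam) + (1 - x) * np.log(1 - lam))

def cont_bernoulli_log_norm(lam, eps=1e-7):
    # log C(lam) with C(lam) = 2*atanh(1-2*lam)/(1-2*lam); C(0.5) = 2.
    lam = np.clip(lam, eps, 1 - eps)
    near_half = np.abs(lam - 0.5) < 1e-3
    safe = np.where(near_half, 0.4, lam)  # dummy value to avoid 0/0; replaced below
    u = 1 - 2 * safe
    log_c = np.log(2 * np.abs(np.arctanh(u))) - np.log(np.abs(u))
    # Near lam = 0.5, use the expansion log C ≈ log 2 + (4/3)*(lam - 0.5)^2.
    taylor = np.log(2.0) + (4.0 / 3.0) * (lam - 0.5) ** 2
    return np.where(near_half, taylor, log_c)

def cont_bernoulli_nll(x, lam):
    # Continuous-Bernoulli NLL = BCE minus the log normalizer.
    return bce(x, lam) - cont_bernoulli_log_norm(lam)
```

Because the normalizer depends on λ, this loss actually changes the gradients the decoder sees, rather than just rescaling BCE; note also that, as a density on [0, 1], its NLL can legitimately be negative.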

In short, researchers choose between a linear output with MSE/MAE and a tanh/sigmoid output with BCE depending on how they model the pixel distribution and the output characteristics they want. The trade-offs most often analyzed are training stability, gradient behavior (with a sigmoid output, BCE’s gradient with respect to the logit is simply prediction minus target, so it avoids the saturation that MSE suffers through the sigmoid), and the realism of the reconstructions.
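The two head configurations, plus the widened-range trick mentioned above, can be sketched side by side. This is an illustrative toy with made-up array shapes and variable names, not anyone’s published setup:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 28 * 28))    # hypothetical decoder pre-activations
target = rng.uniform(size=(4, 28 * 28))   # pixel targets in [0, 1]

# Option A: sigmoid head + BCE (interprets each pixel as a Bernoulli parameter).
probs = np.clip(1.0 / (1.0 + np.exp(-logits)), 1e-7, 1 - 1e-7)
bce_loss = -(target * np.log(probs) + (1 - target) * np.log(1 - probs)).mean()

# Option B: linear head + MSE (interprets each pixel as a Gaussian mean).
mse_loss = ((logits - target) ** 2).mean()

# Widened-range trick: map the sigmoid output onto [-0.1, 1.1], then
# truncate back to [0, 1] so exact blacks and whites become reachable.
wide = -0.1 + 1.2 * probs
recon = np.clip(wide, 0.0, 1.0)
```

The point of the widening is that a plain sigmoid only reaches 0 or 1 at infinite logits, whereas the rescaled-and-clipped output hits the extremes at finite activations.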