Back in the LeNet era, ANNs and backpropagation seemed almost forgotten. But when scaled up in AlexNet, they proved their worth and gained widespread support.

Before this deep learning resurgence, some researchers already believed neural networks could succeed given the right training methods.

Let’s discuss other old algorithms that were once hard to understand or scale. Which ones do you think could succeed if given more attention today?

I think second-order optimization will eventually become the standard way we train neural networks once we can make it efficient enough.

Second-order optimization methods, like Newton’s method or quasi-Newton methods, could offer significant improvements in training neural networks if we can make them more efficient.
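To make that concrete, here's a toy sketch of my own (not from any particular library): on a quadratic loss, a single Newton step rescales each direction by its inverse curvature and lands exactly on the minimum, while a fixed-learning-rate gradient step scales every direction the same way and falls short on ill-conditioned problems.

```python
import numpy as np

# Illustrative quadratic loss f(w) = 0.5 * w^T A w - b^T w,
# with an ill-conditioned curvature matrix A.
A = np.array([[10.0, 0.0],
              [0.0, 1.0]])
b = np.array([1.0, 1.0])

def grad(w):
    return A @ w - b

def hessian(w):
    return A  # constant for a quadratic

w = np.zeros(2)

# Gradient step: one shared learning rate for every direction.
w_gd = w - 0.05 * grad(w)

# Newton step: rescales each direction by inverse curvature;
# for a quadratic it lands exactly at the minimum w* = A^{-1} b.
w_newton = w - np.linalg.solve(hessian(w), grad(w))

print(w_gd, w_newton)
```

The catch, of course, is that forming and inverting the Hessian is what makes this expensive at neural-network scale.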

Yes, momentum-based methods do have some similarities to second-order optimization, though they act more like crude diagonal approximations of the Hessian than the real thing.

They help speed up training by improving convergence, but true second-order methods can make significantly more progress per iteration due to their more accurate use of curvature information, even though each iteration costs more to compute.
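Here's a quick toy comparison I put together (hyperparameters are my own illustrative choices, not tuned recipes from any framework): heavy-ball momentum versus plain gradient descent on the same ill-conditioned quadratic, each run for the same number of steps.

```python
import numpy as np

# Quadratic loss 0.5 * w^T A w - b^T w with condition number 10.
A = np.array([[10.0, 0.0],
              [0.0, 1.0]])
b = np.array([1.0, 1.0])
w_star = np.linalg.solve(A, b)  # exact minimizer

def run(lr, beta, steps=50):
    """Heavy-ball momentum; beta=0 reduces to plain gradient descent."""
    w = np.zeros(2)
    v = np.zeros(2)
    for _ in range(steps):
        g = A @ w - b
        v = beta * v - lr * g  # velocity accumulates past gradients
        w = w + v
    return w

w_plain = run(lr=2 / 11, beta=0.0)   # well-tuned plain GD
w_mom = run(lr=0.23, beta=0.27)      # well-tuned momentum

# Momentum closes the gap to w_star much faster at equal step counts.
print(np.linalg.norm(w_plain - w_star), np.linalg.norm(w_mom - w_star))
```

Momentum's partial use of the loss geometry already buys a big speedup here, which is the intuition for why exact curvature information should buy even more.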

If we can make these true second-order methods efficient, they could greatly enhance training speed and performance.

Hessians are crucial for capturing the curvature of the loss landscape, which can make second-order methods more efficient in navigating complex optimization problems.

On a related note, meta-learning, or “learning to learn,” is gaining a lot of attention.

It focuses on improving the learning process itself, allowing models to adapt quickly to new tasks or data with minimal examples.

This could be a breakthrough in making machine learning more flexible and efficient.

Some sort of selective stability, which is honestly all that gradient descent does. In SGD, the error/cost function provides the selection, and gradients, together with a non-increasing learning-rate schedule, entrain downstream gradients and neighboring weights to stabilize themselves over time.
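A minimal sketch of that reading of SGD (the 1/sqrt(t) schedule and noise scale are illustrative choices of mine, not anything canonical): noisy gradients jostle the weight around, and the shrinking learning rate lets it settle into the attractor the cost function selects.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_grad(w):
    # gradient of f(w) = 0.5 * w^2, plus sampling noise
    return w + rng.normal(scale=0.5)

w = 5.0
for t in range(1, 2001):
    lr = 0.5 / np.sqrt(t)  # non-increasing schedule
    w -= lr * noisy_grad(w)

print(w)  # settles near the minimum at 0
```

Early on, the large learning rate lets noise explore; later, the schedule damps the fluctuations, which is the "stabilize over time" part.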

Entrainment means “is pulled towards some subspace attractor” in weight-space, not unlike niche construction in biology. The difference is that SGD, as a general local optimization procedure, isn’t optimized for deep learning architectures, typical data manifolds, or GPUs. An algorithm that could explicitly use that structure to its advantage could theoretically achieve similar results to SOTA while reducing sample complexity and training time.

As for what form that algorithm will take, evolutionary methods seem like the obvious choice to me, but the theory is lagging right now.

Singular learning theory shows some promise imo, and my intuition says that there’s probably a relationship between the effective redundancy of any singular parameter and the Fisher information for that same parameter, which itself has an interpretation in terms of natural selection.
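I don't have the theory worked out, but the Fisher side of that intuition is easy to poke at numerically. Here's a toy model of my own with a built-in redundancy (the slope is parameterized as a product a*b, so only the product is identified): the empirical Fisher information matrix is singular exactly along the redundant direction.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(scale=0.1, size=500)

a, b = 2.0, 1.0  # a * b = 2, but only the product matters

def score(a, b, xi, yi, sigma=0.1):
    # per-example gradient of the Gaussian log-likelihood w.r.t. (a, b)
    r = (yi - a * b * xi) / sigma**2
    return np.array([r * b * xi, r * a * xi])

# empirical Fisher: average outer product of per-example score vectors
F = np.mean([np.outer(s, s)
             for s in (score(a, b, xi, yi) for xi, yi in zip(x, y))],
            axis=0)

eigvals = np.linalg.eigvalsh(F)
print(eigvals)  # the direction (2, -1), which preserves a*b, is flat
```

That flat eigendirection is the "effective redundancy" of a singular parameterization showing up directly in the Fisher information, which is the correspondence I'd want the theory to make precise.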

This is also somewhat related to the idea of bisimulation that one finds in RL, in the sense that local symmetries enable exploratory variation, which improves search.