(These are excerpts from my book "Intelligence is not Artificial")
Footnote: A brief History of the Gradient
Training a neural network means minimizing its error. The neural network needs two algorithms for this: one algorithm to minimize the "gradient" in a series of steps and one algorithm to compute the "gradient" at each step. Backpropagation is the algorithm to compute the gradient. The algorithm used in conjunction with backpropagation is an optimization method because minimizing the error is another way of saying that we want to find the minimum of a complex (non-linear) function. There are many such optimization methods. The need originally arose after Newton published his equations of gravitation. It was mainly astronomers who needed to find the minima of functions. The British mathematician John Wallis published the so-called "Newton's method" in his "Treatise of Algebra both Historical and Practical" (1685), the book that introduced the calculus of limits, but it was the French mathematician Augustin Cauchy in 1847 who invented the first efficient method of this kind. His "gradient-descent" method (or "steepest-descent method") enabled astronomers to calculate the orbit of an astronomical object without having to solve Newton's differential equations of motion, but instead using (simpler) algebraic equations.
The idea is quite intuitive if one thinks of a function as a curve in space similar to the skyline of a mountain range, with peaks and valley. The goal is to descend from the top of a mountain to the parking lot. The fastest way at every step (if you are not scared of heights) is to always take the steepest way down: follow the direction of the slope downhill. The problem is that there are many cases in which this takes you to a valley, e.g. a glacier nested between several mountains, where the strategy fails: you can't go any further down. Or, mathematically speaking, the gradient at any local minima is zero, so it doesn't tell you how to continue optimizing the function. That is the problem of "local minima".
A very popular gradient method for neural networks, employed in conjunction with backpropagation, is the "stochastic gradient descent" method, pioneered by Herbert Robbins in 1951, mainly because it is much more efficient, a fact of great importance when you are training a neural network with a large dataset; but also to reduce the chances of getting stuck in local minima.
Backpropagation is what is known in calculus as a "chain rule" and it computes
the gradient at every step, after which the optimization method decides which step to take next.
The iterative application of gradient computation and optimization results in the progressive modification of the "weights" that connect neurons.
There are also gradient-free methods of optimization, notably simulated annealing and genetic algorithm, which, in fact, have been studied by A.I. since the early days.
Back to the Table of Contents
Purchase "Intelligence is not Artificial")