Intelligence is not Artificial

(These are excerpts from my book "Intelligence is not Artificial")

### The Connectionists (Neural Networks)

Meanwhile, the other branch of Artificial Intelligence was pursuing a rather different approach: simulating what the brain does at the physical level of neurons and synapses. The symbolic school of John McCarthy and Marvin Minsky believed in using mathematical logic (i.e., symbols) to simulate how the human mind works; the school of "neural networks" (or "connectionism") believed in using mathematical calculus (i.e., numbers) to simulate how the brain works.

Since in the 1950s neuroscience was just in its infancy (medical machines to study living brains would not become available until the 1970s), computer scientists only knew that the brain consists of a huge number of interconnected neurons, and neuroscientists were becoming ever more convinced that "intelligence" was due to the connections, not to the individual neurons. A brain was viewed as a network of interconnected nodes, and our mental life as due to the way signals travel through those connections from the neurons of the sensory system up to the neurons that process those sensory data and eventually down to the neurons that generate action.

The neural connections can vary in strength from zero to infinity; this strength is known as the "weight" of the connection. Change the weight of some neural connections and you change the outcome of the network's computation. In other words, the weights of the connections can be tweaked to cause different outputs for the same inputs. The problem for those designing "neural networks" consists in fine-tuning the connections so that the network as a whole comes up with the correct interpretation of the input; e.g., with the word "apple" when the image of an apple is presented. This is called "training the network". For example, showing many apples to the system and forcing the answer "APPLE" should result in the network adjusting those connections to recognize apples in general. This is called "supervised learning".

The normal operation of the neural network is quite simple. The signals coming from different neurons into a neuron are weighted according to the weight of each input connection, summed, and then fed to an "activation function" (also known as the "nonlinearity") that decides what output this neuron produces. The simplest activation function is a function that has a threshold, the "step" function: if the total input passes the threshold value, the neuron emits a one, otherwise a zero. This process goes on throughout the network, which is usually organized in layers of neurons.

The weights of the connections determine what the network computes, and they change during "training" (i.e., in response to experience): neural networks "learn" those weights. A simple approach to learning weights is to compare the output of the neural network to the correct answer and then modify the weights in the network so as to produce the correct answer. Each "correct answer" is a training example. The neural network needs to be trained with numerous such examples; today computers are powerful enough that there can be literally millions of them. If the network has been designed well, the weights will eventually converge to a stable configuration: at that point the network should provide the correct answer even for instances that were not in the training data (e.g., recognize an apple that it has never seen before).

The designer of the neural network has to decide the structure of the neural network (e.g., the number of layers, the size of each layer, and which neurons connect to which other neurons), the initial values of the weights, the activation function (the "nonlinearity"), and the training strategy. Both the initialization and the training may require the use of random numbers, and there are many different ways to generate random numbers. The term "hyperparameters" refers to all the parameters that the network designer needs to pinpoint. It may take months to come up with a neural network that can be trained.
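The operation of a single neuron and of a layer, as described above, can be sketched in a few lines of Python. This is only an illustration: the function names, the example weights and the threshold of 0.8 are assumptions made for the sketch, not values taken from any historical system.

```python
def step(total, threshold):
    """Step activation: emit 1 if the total input passes the threshold, else 0."""
    return 1 if total > threshold else 0


def neuron_output(inputs, weights, threshold):
    """Weight each incoming signal, sum them, and apply the step function."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return step(total, threshold)


def layer_output(inputs, weight_rows, thresholds):
    """A layer is a list of neurons that all read the same inputs."""
    return [neuron_output(inputs, w, t) for w, t in zip(weight_rows, thresholds)]


# Same inputs, different weights: the output of the neuron changes.
signals = [1, 0, 1]
print(neuron_output(signals, [0.5, 0.5, 0.5], threshold=0.8))  # total 1.0 passes 0.8
print(neuron_output(signals, [0.2, 0.9, 0.2], threshold=0.8))  # total 0.4 does not
```

The two calls use identical inputs and differ only in the weights, which is the sense in which "the weights of the connections determine what the network computes".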

Since the key is to adjust the strength of the connections, the alternative term for this branch of A.I. is "connectionism".

One of the most influential books in the early years of neuroscience was "The Organization of Behavior" (1949), written by the psychologist Donald Hebb at McGill University in Montreal (Canada). Hebb described how the brain learns by changing the strength of the connections between its neurons. In 1951 two Princeton University students, Marvin Minsky and Dean Edmonds, simulated Hebbian learning in a network of 40 neurons realized with 3,000 vacuum tubes, and called this machine SNARC (Stochastic Neural Analog Reinforcement Calculator). I wouldn't count it as the first neural network because SNARC was not implemented on a computer. In 1954 Wesley Clark and Belmont Farley at MIT simulated Hebbian learning on a computer, i.e. created the first artificial neural network (a two-layer network). In 1956 Hebb collaborated with IBM's research laboratory in Poughkeepsie to produce another computer model, programmed by Nathaniel Rochester's team (which included a young John Holland).

If there was something similar to the Macy conferences in Britain, it was the Ratio Club, organized in 1949 by the neurologist John Bates, a dining club of young scientists who met periodically at London's National Hospital to discuss cybernetics. McCulloch, who was traveling in Britain, became their very first invited speaker. Among its members was the neurologist William Grey Walter, who in 1948 built two tortoise-shaped robots (better known as Elmer and Elsie) that some consider the first autonomous mobile robots. Turing was a member and tested the "Turing test" at one of their meetings. John Young was a member: in 1964 he would propose the "selectionist" theory of the brain. And, finally, another member was the psychiatrist Ross Ashby, who in 1948 actually built a machine to simulate the brain, the homeostat ("the closest thing to a synthetic brain so far designed by man", as Time magazine reported). The title of the paper in which he described it became the title of his influential book "Design for a Brain" (1952). No surprise then that mathematical models of the brain proliferated in Britain, peaking just about in the year of the first conference on Artificial Intelligence: Jack Allanson at Birmingham University reported on "Some Properties of Randomly Connected Neural Nets" (1956), Raymond Beurle at Imperial College London studied "Properties of a Mass of Cells Capable of Regenerating Pulses" (1956), and Albert "Pete" Uttley at the Radar Research Establishment, a mathematician who had designed Britain's first parallel processor, wrote about "Conditional Probability Machines and Conditioned Reflexes" (1956). It is debatable whether, as argued by Christof Teuscher in his book "Turing's Connectionism" (2001), Turing truly anticipated neural networks (as well as genetic algorithms) in an unpublished 1948 paper, now known as "Intelligent Machinery", that was about "unorganized machines", i.e. random Boolean networks.

The idea that computers were "giant brains" wasn't just a myth invented by the media. Some psychologists enthusiastically signed on to this metaphor. George Miller was a psychologist at Harvard University who in 1950 visited the Institute for Advanced Study in Princeton, one of the pioneering centers in computer science. The following year he was hired by MIT to lead the psychology group at the newly formed Lincoln Laboratory (a hotbed of military technology for the Cold War) and published an influential book titled "Language and Communication" in which he launched the program of studying the human mind using the information theory just developed by Claude Shannon at Bell Labs in his article "A Mathematical Theory of Communication" (1948).

Frank Rosenblatt's Perceptron (1957) at Cornell University and Oliver Selfridge's Pandemonium (1958) at MIT defined the standard for artificial neural networks: not knowledge representation and logical inference, but pattern propagation and automatic learning. The Perceptron, first implemented in software in 1958 on the Weather Bureau's IBM 704 and then custom-built in hardware at Cornell Aeronautical Laboratory, was the first trainable neural network (called "single-layer" even though it had two layers of neurons). Its activation function was the same binary function (the "step function") used by the McCulloch-Pitts neuron but, unlike that neuron, the Perceptron had a learning rule (an algorithm for changing the weights). Its application was to separate data into two groups. The limitations of perceptrons were obvious to everybody and in the following years several studies looked for solutions. The British National Physical Laboratory organized a symposium titled "The Mechanisation of Thought Processes" (November 1958); three conferences on "Self-Organization" were held in 1959, 1960 and 1962; and Rosenblatt published his report "Principles of Neurodynamics" (1962). However, nobody could figure out how to build a multilayer perceptron.
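A learning rule of this kind can be sketched as follows. This is the common textbook form of the perceptron rule, not a reconstruction of Rosenblatt's software or hardware; the toy dataset (the logical AND of two inputs), the learning rate, and the use of a bias term in place of an explicit threshold are all assumptions made for the illustration.

```python
def predict(weights, bias, x):
    """Step-function unit: 1 if the weighted sum plus bias is positive, else 0."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, epochs=20, lr=1.0):
    """Nudge the weights toward the target whenever the prediction is wrong."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, target in samples:
            error = target - predict(weights, bias, x)  # -1, 0 or +1
            if error != 0:
                mistakes += 1
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias += lr * error
        if mistakes == 0:  # every sample classified correctly: stop
            break
    return weights, bias


# Two groups that a straight line can separate: AND is 1 only for (1, 1).
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_SAMPLES)
print([predict(weights, bias, x) for x, _ in AND_SAMPLES])
```

The rule only ever touches the weights of a single layer of trainable connections, which is why it does not extend by itself to networks with hidden layers.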

In 1960 Bernard Widrow and his student Ted Hoff at Stanford University built Adaline (Adaptive Linear Neuron), a single-layer network based on an extension of the McCulloch-Pitts neuron. It used a generalization of the Perceptron's learning rule, the "delta rule" or "least mean square" (LMS) algorithm (a way to minimize the difference between the desired and the actual signal), which was the first practical application of a "stochastic gradient descent" method to machine learning. The method of "stochastic gradient descent" had been introduced in 1951 for mathematical optimization by Herbert Robbins of the University of North Carolina ("A Stochastic Approximation Method", 1951). Trivia: Ted Hoff later joined a tiny Silicon Valley startup called Intel and helped design the world's first microprocessor.
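The delta rule can be sketched in the same style. Unlike the perceptron rule, the error is measured on the linear output, before any threshold, and each sample triggers one small update, which is what makes the procedure a stochastic gradient descent. The target function, the learning rate, the number of passes and the random seed below are assumptions chosen only so that the sketch has something to learn.

```python
import random


def train_adaline(samples, lr=0.05, epochs=200, seed=0):
    """LMS / delta rule: move the weights against the error of the linear output."""
    rng = random.Random(seed)
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)  # visit the samples in random order (the "stochastic" part)
        for x, target in data:
            output = sum(w * xi for w, xi in zip(weights, x)) + bias  # no threshold
            error = target - output  # desired signal minus actual signal
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias


# Hypothetical noise-free data generated by: target = 2*x1 - 3*x2 + 1
samples = [((x1, x2), 2 * x1 - 3 * x2 + 1) for x1 in (-1, 0, 1) for x2 in (-1, 0, 1)]
weights, bias = train_adaline(samples)
print([round(w, 3) for w in weights], round(bias, 3))
```

On this noise-free data the learned weights should approach the coefficients used to generate the targets.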

The "gradient descent" method, discovered in 1847 by the French mathematician Augustin Cauchy, was first applied to control theory by Henry Kelley at Grumman Aircraft Engineering Corporation in New York ("Gradient Theory of Optimal Flight Paths", 1960) and by Arthur Bryson at Harvard University ("A Gradient Method for Optimizing Multi-stage Allocation Processes", 1961). That was, in essence, the method that would later be called "backpropagation".
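For readers who have never seen it, gradient descent itself fits in a few lines: repeatedly take a small step in the direction opposite to the gradient. The quadratic function, the step size and the number of steps below are assumptions picked for the illustration, and have nothing to do with the flight-path problems studied by Kelley and Bryson.

```python
def gradient_descent(grad, start, lr=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    point = list(start)
    for _ in range(steps):
        g = grad(point)
        point = [p - lr * gi for p, gi in zip(point, g)]
    return point


def grad_f(point):
    """Gradient of f(x, y) = (x - 3)^2 + 2 * (y + 1)^2, whose minimum is at (3, -1)."""
    x, y = point
    return [2 * (x - 3), 4 * (y + 1)]


minimum = gradient_descent(grad_f, start=[0.0, 0.0])
print([round(v, 4) for v in minimum])
```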

Another important discovery that went unnoticed at the time was the first learning algorithms for multilayer networks, published in 1965 by the Ukrainian mathematician Alexey Ivakhnenko in his book "Cybernetic Predicting Devices".

Compared with expert systems, neural networks are dynamic systems (their configuration changes as they are used) and predisposed to learning by themselves (they can adjust their configuration). "Unsupervised" networks, in particular, can discover categories by themselves; e.g., they can discover that several images refer to the same kind of object, a cat.

There are two ways to solve a crime. One way is to hire the smartest detective in the world, who will use experience and logic to find out who did it. On the other hand, if we had enough surveillance cameras placed around the area, we would scan their tapes and look for suspicious actions. Both ways may lead to the same conclusion, but one uses a logic-driven approach (symbolic processing) and the other one uses a data-driven approach (ultimately, the visual system, which is a connectionist system).

Expert systems were the descendants of the "logical" school that looked for the exact solution to a problem. Neural nets were initially viewed as equivalent logical systems, but actually represented the other kind of thinking, probabilistic thinking, in which we content ourselves with plausible solutions, not necessarily exact ones. That is the case of speech and vision, and of pattern recognition in general.

In 1969 Stanford held the first International Joint Conference on Artificial Intelligence (IJCAI). Nils Nilsson from SRI presented Shakey. Carl Hewitt from MIT's Project MAC presented Planner, a language for planning action and manipulating models in robots. Cordell Green from SRI and Richard Waldinger from Carnegie-Mellon University presented systems for the automatic synthesis of programs (automatic program writing). Roger Schank from Stanford and Daniel Bobrow from Bolt Beranek and Newman (BBN) presented studies on how to analyze the structure of sentences.

In 1969 Marvin Minsky and Seymour Papert of MIT published a devastating critique of neural networks (titled "Perceptrons") that virtually killed the discipline. This came a decade after a review by Noam Chomsky of a book by Burrhus Skinner had turned the tide in psychology, ending the domination of behaviorism and resurrecting cognitivism; Chomsky's campaign against behaviorism culminated in an article in the New York Review of Books of December 1971. Most A.I. scientists favored the "cognitive" approach simply for computational reasons, but they felt somewhat reassured by the events in psychology that their choice was indeed wise.

Minsky's and Papert's proof came, by sheer coincidence, at the right time to avoid criticism: both Pitts and McCulloch died in 1969 (in May and September, respectively), and Rosenblatt died in a boating accident in 1971.

To be fair, Minsky and Papert simply argued that the limitations of the Perceptron could be overcome only with multilayer neural nets and, unfortunately, Rosenblatt's learning algorithm did not work for multilayer nets.
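The usual textbook illustration of this limitation is the exclusive-or (XOR) function; the choice of XOR is an assumption here, since the text above does not name a specific example. The sketch below searches a coarse grid of weights for a single threshold unit that computes XOR and finds none, while a network with one hidden layer and hand-picked (not learned) weights computes it easily.

```python
from itertools import product


def step(total):
    return 1 if total > 0 else 0


def single_unit(x1, x2, w1, w2, bias):
    """One threshold neuron reading the two inputs directly."""
    return step(w1 * x1 + w2 * x2 + bias)


def two_layer_xor(x1, x2):
    """Hidden units compute OR and AND; the output fires for 'OR but not AND'."""
    h_or = step(x1 + x2 - 0.5)
    h_and = step(x1 + x2 - 1.5)
    return step(h_or - h_and - 0.5)


XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

# Coarse grid search: weights and bias from -4.0 to 4.0 in steps of 0.5.
grid = [i / 2 for i in range(-8, 9)]
single_solutions = [
    (w1, w2, b)
    for w1, w2, b in product(grid, repeat=3)
    if all(single_unit(x1, x2, w1, w2, b) == y for (x1, x2), y in XOR.items())
]

print(len(single_solutions))  # no single unit on the grid computes XOR
print(all(two_layer_xor(x1, x2) == y for (x1, x2), y in XOR.items()))
```

The grid search is only suggestive, but the result holds for any weights: XOR is not linearly separable. The hard part, as the text says, was finding an algorithm that could learn the hidden-layer weights instead of having them set by hand.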

The gradient method was perfected as a method to optimize multi-stage dynamic systems by Bryson and his Chinese-born student Yu-Chi Ho in the book "Applied Optimal Control" (1969). At that point the mathematical theory necessary for backpropagation in multi-layer neural networks was basically ready. In 1970 the Finnish mathematician Seppo Linnainmaa invented the "reverse mode of automatic differentiation", which has backpropagation as a special case.

In 1974 Paul Werbos at Harvard University applied Bryson's backpropagation algorithm to the realm of neural networks ("Beyond Regression", 1974); and in 1975 the first multi-layered network appeared, designed by Kunihiko Fukushima in Japan, the Cognitron ("Cognitron - A Self-organizing Multilayered Neural Network", 1975).
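What backpropagation computes can be shown on a tiny network. The sketch below is the modern textbook form, not the notation of Bryson or Werbos; the sigmoid activation, the squared-error loss, the absence of bias terms and the specific weights are assumptions made to keep it short. The error signal at the output is passed backwards through the weights to obtain the gradient for the hidden layer, and the result is checked against a brute-force finite-difference estimate.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Two layers: inputs -> hidden units -> one output unit."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in params["w1"]]
    output = sigmoid(sum(w * h for w, h in zip(params["w2"], hidden)))
    return hidden, output


def loss(params, x, target):
    _, output = forward(params, x)
    return 0.5 * (output - target) ** 2


def backprop(params, x, target):
    """Chain rule applied from the output back toward the inputs."""
    hidden, output = forward(params, x)
    delta_out = (output - target) * output * (1.0 - output)
    grad_w2 = [delta_out * h for h in hidden]
    # Each hidden unit receives the output error scaled by its outgoing weight.
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(params["w2"], hidden)]
    grad_w1 = [[d * xi for xi in x] for d in delta_hidden]
    return {"w1": grad_w1, "w2": grad_w2}


def numeric_grad(params, x, target, name, index, eps=1e-6):
    """Central finite difference for one weight, used only as a check."""
    row = params[name] if name == "w2" else params[name][index[0]]
    k = index[-1]
    original = row[k]
    row[k] = original + eps
    up = loss(params, x, target)
    row[k] = original - eps
    down = loss(params, x, target)
    row[k] = original
    return (up - down) / (2 * eps)


params = {"w1": [[0.5, -0.3], [0.8, 0.2]], "w2": [1.0, -1.5]}
x, target = [1.0, 2.0], 1.0
grads = backprop(params, x, target)
print(abs(grads["w1"][0][1] - numeric_grad(params, x, target, "w1", (0, 1))) < 1e-7)
```

The finite-difference check needs two evaluations of the network per weight, whereas backpropagation obtains all the gradients from one forward and one backward pass.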

Practitioners of neural networks also took detours into cognitive science. For example, James Anderson at Rockefeller University ("A Simple Neural Network Generating an Interactive Memory", 1972) and Teuvo Kohonen in Finland ("Correlation Matrix Memories", 1972) used neural networks to model associative memories based on Donald Hebb's law. Therefore, by the mid-1970s significant progress had occurred (if not widely publicized) in neural networks.
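The idea behind such associative memories can be sketched with a Hebbian outer-product rule: each stored pair strengthens the connection between every active input and every active output. This is a generic illustration, not the specific model of Anderson or Kohonen; the function names and the stored vectors are hypothetical, and the keys are chosen to be orthogonal unit vectors so that recall is exact.

```python
def store(pairs):
    """Build the memory matrix by summing value[i] * key[j] over all stored pairs."""
    n_out, n_in = len(pairs[0][1]), len(pairs[0][0])
    memory = [[0.0] * n_in for _ in range(n_out)]
    for key, value in pairs:
        for i in range(n_out):
            for j in range(n_in):
                memory[i][j] += value[i] * key[j]  # Hebbian: co-activity strengthens
    return memory


def recall(memory, key):
    """Present a key: the output is the memory matrix applied to it."""
    return [sum(m * k for m, k in zip(row, key)) for row in memory]


pairs = [
    ([1.0, 0.0, 0.0], [1.0, 2.0]),
    ([0.0, 1.0, 0.0], [-1.0, 0.5]),
]
memory = store(pairs)
print(recall(memory, [1.0, 0.0, 0.0]))
print(recall(memory, [0.0, 1.0, 0.0]))
```

When the keys are not orthogonal, the recalled patterns interfere with each other, which limits how many associations a memory of this kind can hold.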

A much more stinging criticism could have come from neuroscience, a discipline that was beginning to use computer simulations. In 1947 Kenneth "Kacy" Cole at the Marine Biological Lab near Boston pioneered the "voltage clamp" technique to measure the electrical current flowing through the membranes of neurons. Using that technique, in 1952 the British physiologists Alan Hodgkin and Andrew Huxley at Cambridge University built the first mathematical model of a spiking neuron, which also counts as the first simulation in computational neuroscience (for the record, they simulated the giant axon of the squid). The Hodgkin-Huxley model is a set of nonlinear differential equations that approximates the electrical characteristics of neurons. The next major breakthroughs in the simulation of brain computation came in 1962, when Wilfrid Rall at the National Institutes of Health simulated a dendritic arbor, and in 1966, when Fred Dodge and James Cooley at IBM simulated a propagating impulse in an axon. Meanwhile, Donald Perkel at the RAND Corporation in Los Angeles had written computer programs to simulate the working of the neuron using one of the earliest computers, the Johnniac ("Continuous-time Simulation of Ganglion Nerve Cells in Aplysia", 1963). These simulations (by people who actually knew what a neuron looks like) bore little resemblance to the naive digital neurons of the artificial neural networks.
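To give an idea of the gap, here is the Hodgkin-Huxley model in its standard textbook form. The symbols are the conventional ones and do not appear in the text above: V is the membrane potential, C_m the membrane capacitance, I_ext the injected current, the g-bar terms are the maximal conductances of the sodium, potassium and leak channels, the E terms their reversal potentials, and m, h, n are gating variables whose rate functions (alpha and beta) were fitted by Hodgkin and Huxley to voltage-clamp data.

$$
C_m \frac{dV}{dt} = I_{\mathrm{ext}} - \bar{g}_{\mathrm{Na}}\, m^3 h \,(V - E_{\mathrm{Na}}) - \bar{g}_{\mathrm{K}}\, n^4 \,(V - E_{\mathrm{K}}) - \bar{g}_{\mathrm{L}} \,(V - E_{\mathrm{L}})
$$

$$
\frac{dx}{dt} = \alpha_x(V)\,(1 - x) - \beta_x(V)\, x, \qquad x \in \{m, h, n\}
$$

Compare these four coupled nonlinear equations with the artificial neuron, which is a weighted sum followed by a threshold.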