(These are excerpts from my book "Intelligence is not Artificial")
Brute-force A.I. doesn't Speak
Most disheartening has been the scant progress in Natural Language Processing (NLP) since 1970.
The following are examples of exchanges between a person and Terry Winograd's program SHRDLU at MIT in 1970:
Person: FIND A BLOCK WHICH IS TALLER THAN THE ONE YOU ARE HOLDING AND PUT IT INTO THE BOX.
Computer: BY "IT", i ASSUME YOU MEAN THE BLOCK WHICH IS TALLER THAN THE ONE i AM HOLDING.
Person: DOES THE SHORTEST THING THE TALLEST PYRAMID'S SUPPORT SUPPORTS SUPPORT ANYTHING GREEN?
Computer: YES, THE GREEN PYRAMID.
Person: WHAT COLOR IS IT?
Computer: BY "IT", i ASSUME YOU MEAN THE SHORTEST THING THE TALLEST PYRAMID'S SUPPORT SUPPORTS. RED
Person: DOES A STEEPLE
Computer interrupting: SORRY, i DON'T KNOW THE WORD "STEEPLE".
Person: A "STEEPLE" IS A STACK WHICH CONTAINS TWO GREEN CUBES AND A PYRAMID.
Computer: i UNDERSTAND.
Person: ARE THERE ANY STEEPLES NOW?
It beats some of today's "personal digital assistants".
We virtually abandoned the idea of having machines understand and speak our language and resorted to the opposite: make humans speak like machines. That is what you do when you talk on the phone with a machine that asks you for numbers; and that is what you do when you talk to your smartphone's "assistant" according to the rules of that assistant. Nothing illustrates the gap between machine intelligence and human intelligence better than comparing how much of a language a toddler
learns in two years and how little of any language all machines ever invented
have learned in more than 60 years.
IBM's Watson, which debuted with much fanfare in 2011 on a quiz show competing against human experts, was actually not capable of understanding the spoken questions: the questions were delivered to Watson as text files, not as spoken questions (a trick which, of course, distorted the whole game).
The most popular search engines are still keyword-based. Progress in search engines has been mainly in indexing and ranking webpages, not in understanding what the user is looking for nor in understanding what the webpage says. Try for example "Hey i had a discussion with a friend about whether Qaddafi wanted to get rid of the US dollar and he was killed because of that" and see what you get (as i write these words, Google returns first of all my own website with the exact words of that sentence and then a series of pages that discuss the assassination of the US ambassador in Libya). Communicating with a search engine is a far (far) cry from
communicating with human beings.
Products that were originally marketed as able to understand natural language, such as SIRI for Apple's iPhone, have bitterly disappointed their users. These products understand only the most elementary of sounds, and only sometimes, just like their ancestors of decades ago. Promising that a device will be able to translate speech on the fly (like Samsung did with its Galaxy S4 in 2013) is a good way to embarrass yourself and to lose credibility among your customers.
The status of natural language processing is well represented by antispam software that is totally incapable of understanding whether an email is spam or not based on its content while we can tell in a split second.
During the 1960s, following (and mostly reacting against) Noam Chomsky's "Syntactic Structures" (1957), which heralded a veritable linguistic revolution, a lot of work in A.I. was directed towards "understanding" natural-language sentences, notably Charles Fillmore's case grammar at Ohio State University (1967), Roger Schank's conceptual dependency theory at Stanford (1969, later at Yale), William Woods' augmented transition networks at Harvard (1970), Yorick Wilks' preference semantics at Stanford (1973), and semantic grammars, an evolution of ATNs by Dick Burton at BBN for one of the first "intelligent tutoring systems", Sophie (started in 1973 at UC Irvine by John Seely Brown and Burton). Unfortunately, the results were crude.
Schank and Wilks were emblematic of the revolt against Chomsky's logical approach, which did not work well in computational systems. Schank and Wilks turned to meaning-based approaches to natural language processing.
Terry Winograd's SHRDLU and Woods' LUNAR (1973), both based on Woods' theories, were limited to very narrow domains and short sentences.
Roger Schank moved to Yale in 1974 and attacked the Chomskyan model that language comprehension is all about grammar and logical thinking. Schank instead viewed language as intertwined with cognition, as Otto Selz and other cognitive psychologists had argued 50 years earlier. Minsky's "frame" and Schank's "script" (all variations on Selz's "schema") assumed a unity of perception, recognition, reasoning, understanding and memory: memory has the passive function of remembering and the active function of predicting; the comprehension of the world and its categorization proceed together; knowledge is stories.
Schank's "conceptual dependency" theory, whose tenet is that two sentences whose meaning is equivalent must have the same representation, aimed to replace Noam Chomsky's focus on syntax with a focus on concepts.
We humans use all sorts of complicated sentences, some of them very long, some of them nested into each other.
Little was done in discourse analysis before Eugene Charniak's thesis at MIT ("Towards a Model of Children's Story Comprehension", 1972), Indian-born Aravind Joshi's "Tree Adjunct Grammars" (1975) at the University of Pennsylvania, and Jerry Hobbs' work at SRI International ("Computational Approach to Discourse Analysis", 1976).
Then a handful of important theses established the field. One originated from SRI: Barbara Grosz's thesis at UC Berkeley ("The Representation and Use of Focus in a System for Understanding Dialogs", 1977). And two came from Bolt Beranek and Newman, where William Woods had pioneered natural-language processing: Bonnie Webber's thesis at Harvard ("Inference in an Approach to Discourse Anaphora", 1978) and Candace Sidner's thesis at MIT ("Towards a Computational Theory of Definite Anaphora Comprehension in English Discourse", 1979).
In 1974 Marvin Minsky at MIT introduced the "frame" for representing a stereotyped situation ("A Framework for Representing Knowledge", 1974) and in 1975 for the same purpose Roger Schank, who had already designed MARGIE (1973, which, believe it or not, stands for "Memory, Analysis, Response Generation, and Inference on English"), in collaboration with Stanford student Chris Riesbeck, and psychologist and social scientist Robert Abelson at Yale introduced the script ("Scripts, Plans, and Knowledge", 1975). Schank's students built a number of systems that used scripts to understand stories: Richard Cullingford's Script Applier Mechanism (SAM) of 1975; Robert Wilensky's PAM (Plan Applier Mechanism) of 1976; Wendy Lehnert's question-answering system QUALM of 1977; Janet Kolodner's CYRUS (Computerized Yale Retrieval and Updating System) of 1978, that learned events in the life of two politicians; Michael Lebowitz's IPP (Integrated Partial Parser) of 1978, that in order to read newspaper stories about international terrorism introduced an extension of the script, the MOP (Memory Organization Packet); Jaime Carbonell's Politics of 1978, that simulated political beliefs; Gerald DeJong's FRUMP (Fast Reading Understanding and Memory Program) of 1979, an evolution of SAM for producing summaries of newspaper stories; BORIS (Better Organized Reading and Inference System) of 1980, developed by Lehnert and her student Michael Dyer, a story-understanding and question-answering system that combined the MOP and a new extension, the Thematic Affect Unit (TAU). Starting in 1978 these systems were grouped under the general heading of "case-based reasoning". Meanwhile, Steven Rosenberg at MIT built a model to understand stories based on Minsky's frames.
In particular, Jaime Carbonell's PhD dissertation at Yale University ("Subjective Understanding", 1979) can be viewed as a precursor of the field that would be called "sentiment analysis".
It is important to realize that, despite the hype and the papers published in reputable (?) A.I. magazines, none of these systems ever worked. They "worked" only in a very narrow domain and they "understood" pretty much only what was hardwired into them by the software engineer. That's why they were never used twice. They were certainly steps forward in theoretical research, but very humble and very short steps. In 2017 Schank published on his blog an angry article titled "The fraudulent claims made by IBM about Watson and A.I." that started out with the sentence "They are not doing cognitive computing no matter how many times they say they are" but perhaps that's precisely what Schank was doing two generations earlier.
These computer scientists, as well as philosophers such as Hans Kamp in the Netherlands (founder of Discourse Representation Theory in 1981), attempted a more holistic approach to understanding "discourse", not just individual sentences; and this resulted in domain-independent systems such as the Core Language Engine, developed in 1988 by Hiyan Alshawi's team at SRI in Britain.
Meanwhile, Melvin Maron's pioneering work on statistical analysis of text
at UC Berkeley ("On Relevance, Probabilistic Indexing, and Information Retrieval", 1960)
was being resurrected by Gerard Salton at Cornell University (the project leader of SMART, System for the Mechanical Analysis and Retrieval of
Text, since 1965). This technique,
true to the motto "You shall know a word by the company it keeps" (1957) by the British linguist John Rupert Firth,
represented a text as a "bag" of words,
disregarding the order of the words and even the grammatical relationships.
Surprisingly, this method was working better than the complex grammar-based
approaches. It quickly came to be known as the "bag-of-words model" for
language analysis. Technically speaking, it was text classification using naive
Bayes classifiers. In 1998 Thorsten Joachims at the University of Dortmund replaced the naive Bayes classifier with the method of statistical learning called "Support Vector Machines", invented by Vladimir Vapnik
at Bell Labs in 1995, and other improvements followed. The bag-of-words model became the dominant paradigm for natural language processing but its statistical approach still failed to grasp the
meaning of a sentence.
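To make the idea concrete, here is a minimal sketch of bag-of-words text classification with a naive Bayes classifier. It is only an illustration, not a reconstruction of Salton's or Joachims' systems: the function names, the toy training sentences and the add-one smoothing are all assumptions made for the example.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Represent a text as word counts, ignoring word order and grammar."""
    return Counter(text.lower().split())

def train_naive_bayes(labeled_texts):
    """Collect per-class word counts and document counts from (text, label) pairs."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in labeled_texts:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(bag_of_words(text))
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    return word_counts, doc_counts, vocabulary

def classify(text, model):
    """Return the class with the highest log-probability (add-one smoothing)."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(doc_counts):
        # Prior: how common the class is among the training documents.
        score = math.log(doc_counts[label] / total_docs)
        class_total = sum(word_counts[label].values())
        for word, n in bag_of_words(text).items():
            # Likelihood of each word given the class, smoothed so that
            # unseen words do not produce a zero probability.
            p = (word_counts[label][word] + 1) / (class_total + len(vocabulary))
            score += n * math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data, invented for illustration only.
TRAINING = [
    ("cheese and wine tasting", "food"),
    ("fresh cheese recipe", "food"),
    ("parliamentary elections results", "politics"),
    ("elections campaign debate", "politics"),
]
MODEL = train_naive_bayes(TRAINING)
```

Calling classify("cheese recipe", MODEL) and classify("recipe cheese", MODEL) gives the same answer, because the representation discards word order. The classifier only counts words: it never "knows" what cheese or an election is.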
Yoshua Bengio at the University of Montreal started working on neural networks for natural language processing in 2000 ("A Neural Probabilistic Language Model", 2001). Bengio's neural language models learn to convert a word symbol into a vector within a meaning space. The word vector is the semantic equivalent of an image vector: instead of extracting features of the image, it extracts the semantic features of the word to predict the next word in the sentence. Bengio realized something peculiar about word vectors learned from a text by his neural networks: these word vectors represent precisely the kind of linguistic regularities and patterns that define the use of a language, the kind of things that one finds in the grammar, the lexicon, the thesaurus, etc; except that they are not separate databases but just one organic body of expertise about the language. Firth again: "you shall know a word by the company it keeps".
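Firth's motto can be illustrated with a much cruder technique than Bengio's neural language model: simply count which words occur near each other and compare the resulting count vectors. The sketch below is an assumption-laden toy (the corpus, the window size and the function names are invented for the example) and it learns nothing; it only shows why words used in similar contexts end up with similar vectors.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][words[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy corpus, invented for illustration only.
CORPUS = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the dog",
    "parliament passes the law",
]
VECTORS = cooccurrence_vectors(CORPUS)
```

On this toy corpus cosine(VECTORS["cat"], VECTORS["dog"]) comes out higher than cosine(VECTORS["cat"], VECTORS["parliament"]), simply because "cat" and "dog" keep similar company. A neural language model replaces these raw counts with dense vectors learned while predicting the next word.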
In 2005 Bengio developed a method to solve the "curse of dimensionality" in natural language processing, the problem of training a network with the particular data that are vocabularies ("Hierarchical Probabilistic Neural Network Language Model", 2005). After Bengio's pioneering work, several others applied deep learning to natural language processing, notably Ronan Collobert and Jason Weston at NEC Labs in Princeton ("A Unified Architecture for Natural Language Processing", 2008), one of the earliest multitask deep networks, and capable of learning recursive structures. Bengio's mixed approach (neural networks and statistical analysis) was further expanded by Andrew Ng's and Christopher Manning's student Richard Socher at Stanford with applications to natural language processing ("Learning Continuous Phrase Representations and Syntactic Parsing with Recursive Neural Networks", 2010), which improved the parser developed by Manning with Dan Klein ("Accurate Unlexicalized Parsing", 2003). The result was a neural network that learns recursive structures, just like Collobert's and Weston's. Socher introduced a language-parsing algorithm based on recursive neural networks that Socher also reused for analyzing and annotating visual scenes ("Parsing Natural Scenes and Natural Language with Recursive Neural Networks", 2010).
However, Bengio's neural network was a feed-forward network, which means that it could only use a fixed number of preceding words when predicting the next one. Czech student Tomas Mikolov of the Brno University of Technology, working at Johns Hopkins University in Sanjeev Khudanpur's team, showed that, instead, a recurrent neural network is able to process sentences of any length ("Recurrent Neural Network-based Language Model," 2010). An RNN transforms a sentence into a vector representation, or vice versa. This enables translation from one language to another: an RNN (the encoder) can transform the sentence of a language into a vector representation that another RNN (the decoder) can transform into the sentence of another language. (Mikolov was hired by Google in 2012 and by Facebook in 2014, and in between in 2013 he invented the "skip-gram" method for learning vector representations of words from large amounts of unstructured text data). Mikolov's method would be the basis for the R-Net developed by Microsoft's Chinese laboratories (Furu Wei and others) that in January 2018 would win the Stanford reading-comprehension competition, beating human beings for the first time.
Using a similar technique, Oriol Vinyals and Quoc Le at Google revolutionized the venerable branch of discourse analysis. They trained a recurrent network with a large set of chats between users and support technicians. This created the equivalent of a translation (or, better, of a sequence-to-sequence model): the question asked by a user has to be "translated" into the response of the support technician ("A Neural Conversational Model", 2015).
Then Oriol Vinyals used the same technique of machine translation to analyze images and create captions. The best architecture to represent images as vectors was the convolutional neural network, so Vinyals used a convolutional neural network as the image encoder ("Show and Tell - A Neural Image Caption Generator", 2015) and a decoder RNN turned that vector representation into sentences. It achieved the same feat of a neural network trained to describe a scene. The similarities between language parsing in natural language processing and scene analysis in machine vision had been known at least since Gabriela Csurka developed the "bag-of-visual-words" or "bag-of-features" technique. Ironically, the biggest success story of the "bag-of-words" model has been in image classification, not in text classification. In 2003 Gabriela Csurka at Xerox in France applied the same statistical method to images. The "bag-of-visual-words" model was born, which basically treats an image as a document. For the whole decade this was the dominant method for image recognition, especially when coupled with a support vector machine classifier. This approach led, for example, to the system for classification of natural scenes developed in 2005 at Caltech by Pietro Perona and his student Fei-Fei Li.
Up to this point the techniques for natural language processing included: the "bag-of-words" approach, in which sentence representations are independent of word order; the sequence models developed by Michael Jordan (1986) and Jeffrey Elman (1990) at UC San Diego; and models based on tree structures, in which a sentence's symbolic representation is derived from its constituents following a syntactic blueprint (the typical symbolic structure that results from this process resembles an inverted tree). The latter arose in the 1990s after a debate on representations in neural networks that started in 1984 when Geoffrey Hinton (then at Carnegie Mellon University) circulated a report titled "Distributed Representations" about representations in which "each entity is represented by a pattern of activity distributed over many computing elements, and each computing element is involved in representing many different entities." The problem of representing tree structures in neural networks was solved by Jordan Pollack of Ohio State University who came up with the Recursive Auto-Associative Memory or RAAM ("Recursive Distributed Representations", 1990). A few years later Christoph Goller and Andreas Kuechler in Germany extended Pollack's RAAM so that it could be used for arbitrarily complex symbolic structures, e.g. any sort of tree structure ("Learning Task-dependent Distributed Representations by Backpropagation Through Structure", 1995).
For question-answering systems Jason Weston (now at Facebook's labs in New York) developed "Memory Networks" (2014), neural networks coupled with long-term memories.
"Sequence tagging" (or "labeling") is the process of assigning each item in a sequence to a category, a process that is used in both natural language processing and bioinformatics. This process was traditionally implemented either with generative models such as the hidden Markov models employed in speech recognition or with the "conditional random fields" invented by John Lafferty (a former member of Fred Jelinek's group at IBM, now at Carnegie Mellon University), working with Andrew McCallum and Fernando Pereira ("Conditional Random Fields", 2001). Collobert's technique constituted the first major innovation, and it was countered years later by the bi-directional LSTM with conditional random fields developed by Zhiheng Huang, Wei Xu and Kai Yu of Baidu ("Bidirectional LSTM-CRF Models for Sequence Tagging", 2015).
Collobert's neural-network architecture for NLP formed the basis for Soumith Chintala's "sentiment analysis" at New York University, that learned to categorize movie reviews as positive or negative ("Sentiment Analysis using Neural Architectures", 2015). Socher at Stanford helped Kai Sheng Tai develop Tree-LSTM, a generalization of LSTMs to the tree structures used in natural language processing that further improved sentiment analysis taking advantage of the research started by Pollack 25 years earlier ("Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks", 2015). Sentiment analysis was also the objective of two projects in New England. In 2016 the Computational Story Laboratory of the University of Vermont (led by Peter Dodds and Chris Danforth) used Teuvo Kohonen's Self Organising Map (SOM) to study what Kurt Vonnegut had termed the "emotional arcs" of written stories ("The Emotional Arcs of Stories are Dominated by Six Basic Shapes", 2016). In 2017 Eric Chu of MIT's Laboratory for Social Machines directed by Deb Roy (later hired by Twitter) used deep convolutional neural networks to infer the emotional content of videos and television shows by analyzing both the story, the facial expressions and the soundtrack, i.e. for both audio and visual sentiment analysis ("Audio-Visual Sentiment Analysis for Learning Emotional Arcs in Movies", 2017).
A footnote on sentiment analysis. There are countless precursors, like Carbonell's dissertation of 1979 at Yale and Clark Elliott's PhD dissertation of 1992 at Northwestern University ("The Affective Reasoner", 1992), an implementation of Andrew Ortony's psychological theory (his "appraisal model" of 1988); but the discipline was truly born in 2002 with two studies: one by Peter Turney at the Institute for Information Technology of Canada ("Thumbs up or Thumbs down? Semantic Orientation Applied to Unsupervised Classification of Reviews", 2002) and the other one (a movie review classifier) by Bo Pang and Lillian Lee at Cornell University ("Thumbs up? Sentiment Classification using Machine Learning Techniques", 2002). Jeonghee Yi at IBM in San Jose (2003) was perhaps the first one to use "Sentiment Analysis" in the title of his paper.
Meanwhile, recurrent neural networks were maturing. In November 2016 Google switched its translation algorithm to a recurrent neural network and the jump in translation quality was noticeable.
Several startups began offering services of text analysis and summary: Narrative Science, founded in 2010 in Chicago by Northwestern University's professors Kristian Hammond and Larry Birnbaum (a student of Schank's at Yale University in 1986); Maluuba, founded in 2011 in Canada by two students of the University of Waterloo, Sam Pasupalak and Kaheer Suleman, and acquired in 2017 by Microsoft; and MetaMind, founded in 2014 in Palo Alto by Richard Socher and acquired by Salesforce in 2016. But their narrative summaries only worked in very narrow domains under very friendly circumstances.
The results are still far from human performance. The most illiterate person on the planet can understand language better than the most powerful machine.
To be fair, progress in natural language understanding was hindered by the simple fact that humans prefer not to speak to another human in our time-consuming natural language. Sometimes we prefer to skip the "Good morning, how are you?" and get straight to the "Reset my Internet connection" in
which case saying "One" to a machine is much more effective than
having to wait for a human operator to pick up the phone and to understand your
issue. Does anyone actually understand the garbled announcements in the New
York subway? Communicating in natural language is not always a solution, as
SIRI users are rapidly finding out on their smartphone. Like it or not, humans
can more effectively go about their business using the language of machines.
For a long time, therefore, Natural Language Processing remained an underfunded
research project with few visible applications. It is only recently that
interest in "virtual personal assistants" has resurrected the field.
In order to realize how far we are from having machines that truly "understand"
our language, think of a useful application that would greatly help civility
in written conversations: the equivalent of a spelling checker for hostile
moods. Imagine an app that, when you try to send an email, would warn you
"The tone of this email is rude: do you really want to send it?" or
"The tone of this email is sarcastic" or
"The tone of this email is insulting".
It is not difficult for a human to read "between the lines", to understand
the hidden motivation of a message and, in particular, to understand when
the writer is deliberately trying to hurt your feelings.
Written hostilities can escalate quickly in the age of email and texting.
The interesting fact is that we understand in a second that the tone of an
email is not friendly, even when the email is a correct reply to our question
or a positive comment to something we have done. When a friend was celebrating
the killing of Osama bin Laden, i quipped "Yes, we are very good at
assassinating people". You understand the sarcasm and indirect critique of US
foreign policy, don't you? You may also understand that i was greatly annoyed
by that operation, and, even more, by the fact that people were celebrating
in the streets.
We routinely get in trouble when we speak quickly because we say something
that we "should not have said": this doesn't mean that what we said was false,
but that we said it on purpose to cause harm and perhaps humiliate. Most of
the time we regret doing it. Conversely, we are easily ticked off by the
wrong tone in an email that was sent to us.
We immediately understand the "tone" of an email, especially when it's meant to
hurt or annoy us, and we are very good
at forging a tone that will hurt or annoy somebody. We word
our sentences according to our mood and to the mood we want to create
in the other person.
When you chat with someone, you pay little attention to the grammatical
structure of what you are saying (in fact, you make a lot of grammatical mistakes,
interrupting your own sentences, restarting them, interjecting a lot of random
noise such as "hmmm") but you pay a lot of attention to the dynamics of the
conversation, which depends heavily on the tone of your and their voices
(assertive, soothing, angry, etc).
The importance of mood in understanding what is going on cannot be overstated.
The impact of mood on comprehension was already studied by Gordon Bower at Stanford in "Mood and Memory" (1981), with a tentative computational theory based on semantic networks, and Daniel Martins in France ("Influence of Affect on Comprehension of a Text", 1982).
The collusion of emotion and cognition has been studied for a long time, from
Richard Lazarus at UC Berkeley
("Thoughts on the Relations between Emotion and Cognition", 1982) to Joseph LeDoux at New York University (his book "The Emotional Brain", 1996) via
multi-level theories of cognition-emotion interaction such as
Philip Barnard's "interacting cognitive subsystems model"
("Interacting Cognitive Subsystems", 1985) at Cambridge University
and Barnard's collaborator John Teasdale's model of nine cognitive subsystems at
Oxford University ("Emotion and two kinds of meaning", 1993).
More studies have emerged in the 1990s about how
emotional states influence what one understands, for example
Joseph-Paul Forgas' "affect infusion model" ("Mood and Judgement", 1995)
and Isabelle Tapiero's book "Situation Models and Levels of Coherence" (2007).
But little progress has been made in computing moods. The one influential paper on the subject was written by a philosopher, Laura Sizer at Hampshire College ("Towards A Computational Theory of Mood", 2000).
The other thing that we humans can do effortlessly (and frequently abuse this skill) is to generate stories. Given something that happened or an article that we read or a television show that we watched, we can easily create a story to describe it.
That's another thing that machines can't do in any reasonable fashion, despite decades of research.
The pioneering systems of automatic story generation were the Automated Novel Writer, developed since 1971 at the University of Wisconsin (a status report was published in 1973) by Sheldon Klein, who had already worked on automatic summaries at Carnegie Mellon University ("Automatic Paraphrasing in Essay Format", 1965);
James Meehan's story generator Tale-Spin ("The Metanovel", 1976), advised by Roger Schank at Yale, a program that generated stories about woodland creatures;
and Michael Lebowitz's Universe at Columbia University ("Creating Characters in a Story-Telling Universe", 1984).
Selmer Bringsjord of Rensselaer Polytechnic Institute in New York state and David Ferrucci of IBM started building Brutus in 1990 ("AI and Literary Creativity", 1999).
Then came Scott Turner's Minstrel at UCLA ("Minstrel, a computer model of creativity and storytelling", 1992) and
Rafael Perez y Perez's Mexica in Britain ("A Computer Model of Creativity in Writing", 1999).
An illiterate four-year-old child is infinitely better than these systems at
"narrating" a generic event.
Machine Translation too has disappointed. Despite recurring investments in the field by major companies, your favorite online translation system succeeds only with the simplest sentences, just like Systran in the 1970s. Here are some random Italian sentences from my old books translated into English by the most popular translation engine: "Graham Nash the content of which led nasal harmony", "On that album historian who gave the blues revival", "Started with a pompous hype on wave of hippie phenomenon".
Perhaps the only major progress in machine translation since Systran was demonstrated in 1973 by Yorick Wilks at Stanford. His system was based on something similar to conceptual dependency, "preference semantics" ("An Artificial Intelligence Approach to Machine Translation", 1973).
The method that has indeed improved the quality of automatic translation is the statistical one, pioneered in the 1980s by Fred Jelinek's team at IBM and first implemented there by Peter Brown's team
(the Candide system of 1992).
When there are plenty of examples of (human-made) translations, the computer can perform a simple statistical analysis and pick the most likely translation. Note that the computer isn't even trying to understand the sentence: it has no clue whether the sentence is about cheese or parliamentary elections. It has "learned" that those few words in that
combination are usually translated in such and such a way by humans. The
statistical approach works wonders when there are thousands of (human-made)
translations of a sentence, for example between Italian and English. It works
awfully when there are fewer, like in the case of Chinese to English.
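The statistical idea can be sketched in a few lines. This is a deliberately naive illustration and not the Candide system: it looks up whole phrases in a table of human-made translations and picks the most frequent one, whereas the IBM models estimated word alignments and probabilities over very large corpora. The function names and the example pairs are invented for the illustration.

```python
from collections import Counter, defaultdict

def build_phrase_table(parallel_examples):
    """Count how often humans translated each source phrase into each target phrase."""
    table = defaultdict(Counter)
    for source, target in parallel_examples:
        table[source.lower()][target] += 1
    return table

def translate(phrase, table):
    """Return (most frequent translation, relative frequency), or None if never seen."""
    candidates = table.get(phrase.lower())
    if not candidates:
        return None
    total = sum(candidates.values())
    best, count = candidates.most_common(1)[0]
    return best, count / total

# Toy parallel corpus (Italian to English), invented for illustration only.
EXAMPLES = [
    ("buongiorno", "good morning"),
    ("buongiorno", "good morning"),
    ("buongiorno", "hello"),
    ("formaggio", "cheese"),
]
TABLE = build_phrase_table(EXAMPLES)
```

Here translate("buongiorno", TABLE) picks "good morning" because humans chose it two times out of three, and translate("elezioni", TABLE) returns None because no example was ever seen. Nothing in the program represents what a morning or a cheese is, and with too few examples it has nothing to say.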
Bengio's "Neural Machine Translation by Jointly Learning to Align and Translate" (2014) showed that neural networks could be applied to translating texts.
In 2013 Nal Kalchbrenner and Phil Blunsom of Oxford University attempted statistical machine translation based purely on neural networks ("Two Recurrent Continuous Translation Models"). In 2014 Ilya Sutskever solved the "sequence-to-sequence problem" of deep learning using a Long Short-Term Memory ("Sequence to Sequence Learning with Neural Networks"), so the length of the input sequence of characters doesn't have to be the same as the length of the output.
Sutskever, Oriol Vinyals and Quoc Le trained a recurrent neural network that was then able to read a sentence in one language, produce a semantic representation of its meaning, and generate a translation in another language.
In November 2016 the new Google Translate feature was widely publicized because it dramatically improved the machine-translation score called BLEU (bilingual evaluation understudy), introduced in 2002 by IBM. The new Google Translate was developed by Quoc Le (born in Vietnam), Mike Schuster (born in Germany), and Yonghui Wu (a Chinese-born veteran of Google's search engine). I tried it myself on simple sentences and the improvement was obvious. I tried it on one of my old music reviews written in Italian and the result is difficult to understand (maybe the original was too!) The biggest mistake: the Italian plural "geni" got translated as the plural of "gene" but in that context it is obviously the plural of "genius".
After successfully employing that recurrent neural network to improve Google's machine translation, Ilya Sutskever announced that: "all supervised vector-to-vector problems are now solved thanks to deep feed-forward neural networks" and "all supervised sequence-to-sequence problems are now solved thanks to deep LSTM networks" (at the 2014 Neural Information Processing Systems conference in Montreal). Unbridled optimism has always been A.I.'s main enemy.
Even if we ever get to the point that a machine can translate a complex sentence, here is the real test: "'Thou' is an ancient English word". Translate that
into Italian as "'Tu' e` un'antica parola Inglese" and you get an
obviously false statement ("Tu" is not an English word). The trick is
to understand what the original sentence means, not to just mechanically
replace English words with Italian words. If you understand what it means, then
you'll translate it as "'Thou' e` un'antica parola Inglese", i.e. you
don't translate the "thou"; or, depending on the context, you might
want to substitute an ancient Italian word, as in "'Ei'
e` un'antica parola Italiana" (where "ei" actually means
"he" but plays a role similar to "thou" in the context
of words that changed over the centuries). A machine will be able to get it
right only when it fully understands the meaning and the purpose of the
sentence, not just its structure.
(There is certainly at least one quality-assurance engineer who, informed of this passage in this book, will immediately enter a few lines of code in the machine translation program to correctly translate "'Thou' is an ancient English word". That is precisely the dumb, brute-force approach that i am talking about.)
Or take Ronald Reagan's famous sarcastic statement, that the nine most terrifying words in the English language are "I'm from the government and i'm here to help". Translate this into Italian and you get "Le nove parole piu` terrificanti in Inglese sono `io lavoro per il governo e sono qui per aiutare'". Those words are no longer nine in the Italian translation (they are ten), and they are no longer "Inglese" (English) because they are now Italian. An appropriate translation would be "Le dieci parole piu` terrificanti in Italiano sono `io lavoro per il governo e sono qui per aiutare'".
Otherwise the translation, while technically impeccable, makes no practical sense.
Or take the Berry paradox, which Bertrand Russell published and credited to G. G. Berry: "the smallest positive integer number that cannot be described in fewer than fifteen words". This is a paradox because the sentence in quotes contains fourteen words. Therefore if such an integer number exists, it can be described by that sentence, which is fourteen words long, i.e. in fewer than fifteen words. When you translate this paradox into Italian, you can't just translate "fifteen" as "quindici". You first
need to count the number of words. The literal translation "il numero
intero positivo piu` piccolo che non si possa descrivere in meno di quindici
parole" does not state the same paradox because this Italian sentence
contains sixteen words, not fourteen like the original English sentence. You
need to understand the meaning of the sentence and then the nature of the
paradox in order to produce an appropriate translation. I could continue with
self-referential sentences (more and more convoluted ones) that can lead to trivial mistakes when translated "mechanically" without understanding what they are meant to do.
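The counting itself is trivial, which is what makes the mistake so telling. Here is a minimal Python sketch of the check that a translator of the fifteen-word sentence would have to perform. The function names are invented for this illustration, and a "word" is assumed to be any whitespace-separated token (so "piu`" counts as one word). The sentence undermines itself only if it has fewer words than the number it names.

```python
def word_count(sentence: str) -> int:
    """Count whitespace-separated tokens (an assumed definition of 'word')."""
    return len(sentence.split())


def is_paradoxical(sentence: str, limit: int) -> bool:
    """The sentence describes 'the smallest integer not describable in fewer
    than `limit` words'. It is self-defeating only if the sentence itself
    has fewer than `limit` words."""
    return word_count(sentence) < limit


def smallest_limit_preserving_paradox(sentence: str) -> int:
    """Smallest number the sentence must name to remain paradoxical,
    assuming that swapping the numeral does not change the word count."""
    return word_count(sentence) + 1


ENGLISH = ("the smallest positive integer number that cannot be "
           "described in fewer than fifteen words")
ITALIAN = ("il numero intero positivo piu` piccolo che non si possa "
           "descrivere in meno di quindici parole")

print(word_count(ENGLISH), is_paradoxical(ENGLISH, 15))   # 14 True
print(word_count(ITALIAN), is_paradoxical(ITALIAN, 15))   # 16 False
print(smallest_limit_preserving_paradox(ITALIAN))         # 17
```

Under these assumptions the Italian sentence would have to say "diciassette" (seventeen) instead of "quindici" to restate the paradox. The program can count, but only because someone who understood the purpose of the sentence told it what to count.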
Translations of proverbs can be quite unreliable. Take the Italian "Tra il dire e il fare c'e` di mezzo il mare", which is equivalent to the English "Easier said than done". In 2017 the most popular online translator renders it as "Between the saying and the sea there is the middle of the sea". Even the translation into Spanish fails (it is rendered as "Entre el dicho y el mar está el medio del mar") despite the fact that the equivalent Spanish proverb is very similar to the Italian ("Del dicho al hecho hay mucho trecho"). Our software engineer is now frantically entering a few lines of code in the online translator to make sure that this Italian proverb will be translated correctly into English and Spanish: alas, there are hundreds of languages and thousands of proverbs in each one, so the possible combinations number in the millions.
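A back-of-the-envelope calculation shows why patching proverbs one at a time does not scale. The figures used below (100 languages, 1,000 proverbs per language) are illustrative assumptions, not measured data, and the function name is invented for this sketch.

```python
def proverb_table_entries(num_languages: int, proverbs_per_language: int) -> int:
    """Entries needed in a hand-made lookup table that maps every proverb
    of every language into every other language."""
    # Each ordered (source, target) pair of distinct languages needs its own entries.
    ordered_pairs = num_languages * (num_languages - 1)
    return ordered_pairs * proverbs_per_language


print(proverb_table_entries(100, 1000))  # 9900000
```

Even with these conservative numbers the engineer would have to enter almost ten million special cases by hand.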
To paraphrase the physicist Max Tegmark, a good explanation is one that answers more than was asked. If i ask you "Do you know what time it is", a "Yes" is not a good answer. I expect you to at least tell me what time it is, even if it was not specifically asked. Better: if you know that i am in a hurry to catch a train, i expect you to calculate the odds of making it to the station in time and to tell me "It's too late, you won't
make it" or "Run!" If i ask you "Where is the
library?" and you know that the library is closed, i expect you to reply
with not only the location but also the important information that it is
currently closed (it is pointless to go there). If i ask you "How do i get
to 330 Hayes St?" and you know that it used to be the location of a
popular Indian restaurant that just shut down, i expect you to reply with a
question "Are you looking for the Indian restaurant?" and not with a
simple "It's that way". If i am in a foreign country and ask a simple
question about buses or trains, i might get a lengthy lecture about how public
transportation works, because the local people guess that I don't know how it
works. Speaking a language is pointless if one doesn't understand what language
is all about. A machine can easily be programmed to answer the question
"Do you know what time it is" with the time (and not a simple
"Yes"), and it can easily be programmed to answer similar questions
with meaningful information; but we "consistently" do this for all
questions, and not because someone told us to answer the former question with
the time and other questions with meaningful information, but because that is
what our intelligence does: we use our knowledge and common sense to formulate
In the near future it will still be extremely difficult to build machines that can understand the simplest of sentences. At the current rate of progress, it may take centuries before we have a machine that can have a conversation like the ones I have with my friends about the Singularity. And that would still be a far cry from what humans do: consistently provide an explanation that answers more than was asked.
A lot more is involved than simply understanding a language. If people around me speak Chinese, they are not speaking to me. But if one says "Sir?" in English, and i am the only English speaker around, i am probably supposed to pay attention.
The state of Natural Language Processing is well represented by the results returned by the most advanced search engines: the vast majority of results are precisely the
kind of commercial pages that i don't want to see. Which human would normally
answer "do you want to buy perfume Katmandu" when i inquire about
Katmandu's monuments? It is virtually impossible to find out which cities are
connected by air to a given airport because the search engines all return
hundreds of pages that offer "cheap" tickets to that airport.
Take, for example, zeroapp.email, a young startup being incubated in San Francisco in 2016. They want to use deep learning to automatically catalog the emails that you receive. Because you are a human being, you imagine that their software will read your email, understand the content, and then file it appropriately. If you were an A.I. scientist, you would have guessed instinctively that this cannot be the case. What they do is to study your behavior and learn what to do
the next time that you receive an email that is similar to past ones. If you have done X for 100 emails of this kind, most likely you want to do X also for all the future emails of this kind. This kind of "natural language processing" does not understand the text:
it analyzes statistically the past behavior of the user and then predicts what
the user will want to do in the future. The same principle is used by Gmail's
Priority Inbox, first introduced in 2010 and vastly improved over the years:
these systems learn, first and foremost, by watching you; but what they learn
is not the language that you speak.
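The following Python sketch illustrates the principle in its crudest form. It is not the code of zeroapp.email or of Gmail, whose internals are not described here: the class name, the choice of the sender's address as the "kind" of an email, and the minimum number of observations are all assumptions made for this illustration. Note that the text of the email is never read.

```python
from collections import Counter, defaultdict
from typing import Optional


class BehaviorFiler:
    """Predicts what the user will do with an email from past actions alone."""

    def __init__(self, min_observations: int = 3) -> None:
        self.min_observations = min_observations
        # Maps a sender address to a tally of what the user did with its emails.
        self.history = defaultdict(Counter)

    def observe(self, sender: str, action: str) -> None:
        """Record that the user performed `action` on an email from `sender`."""
        self.history[sender][action] += 1

    def predict(self, sender: str) -> Optional[str]:
        """Return the user's most frequent past action for this sender,
        or None if too few emails of this kind have been observed."""
        actions = self.history.get(sender)
        if not actions or sum(actions.values()) < self.min_observations:
            return None
        return actions.most_common(1)[0][0]


filer = BehaviorFiler()
for _ in range(100):
    filer.observe("newsletter@example.com", "archive")
filer.observe("newsletter@example.com", "reply")

print(filer.predict("newsletter@example.com"))  # archive
print(filer.predict("stranger@example.com"))    # None
```

A real system would use far richer statistics than a tally per sender, but the nature of the prediction is the same: it comes from the user's behavior, not from the meaning of the message.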
I like to discuss a simple situation with machine-intelligence fans. Let's say you are accused of a murder you did not commit. How many years will it take before you are willing to accept a jury of 12 robots instead of 12 humans? At first this sounds like the question "when will you trust robots to decide whether you are guilty or innocent?" but it actually isn't (i would probably trust a robot more than many human jurors, who are easily swayed by good looks,
racial prejudices and many other unpredictable factors). The question is about
understanding the infinite subtleties of legal debates, the language of lawyers
and, of course, the language of the witnesses. The odds that those 12 robots
fully understand what is going on at a trial will remain close to zero for a long time.