(These are excerpts from my book "Intelligence is not Artificial")
Understanding this Book (or any Book)
Modern computational methods for text comprehension
Gerard Salton's SMART (1965) used a very simple algorithm to represent how words occur next to other words. Thirty years later, Curt Burgess and Kevin Lund at UC Riverside recast this simple idea in more abstract terms: lexical co-occurrence can be used to create high-dimensional spaces in which each word is a multi-dimensional "vector" whose coordinates are determined by its co-occurrence with each of the other words ("Hyperspace Analogue to Language", 1995). The distances between word vectors express similarity in meaning. This was the beginning of a new approach to natural-language processing: vector-based models of word meaning.
Tom Landauer, a psychologist at the University of Colorado, in collaboration with Susan Dumais of Bellcore in New Jersey, formulated "latent semantic analysis", according to which knowledge can be created from simply studying co-occurrence of data in a large corpus of texts ("A solution to Plato's Problem", 1997). They showed that latent semantic analysis, with no prior linguistic knowledge, acquired knowledge about the full vocabulary of the English language at about the same speed as schoolchildren. The meaning of a word was expressed as a vector calculated from the word's average effect on the meaning of phrases in which it occurs.
The premise of these vector-based models was "the distributional hypothesis": words frequently occurring in the same contexts must be related (semantically similar).
By using the distributional approach, one can easily discover that two words are synonyms, or that they belong to the same category.
The beauty of these models is, of course, their simplicity: they represent meaning simply by using "distributional" information, and, mathematically speaking, this representation is a vector that lends itself to simple calculations (the whole of linear algebra is about vector operations, or, equivalently, matrices). Usually, the dimensions of a vector space stand for words, but they can also stand for non-linguistic objects such as images.
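To make the distributional idea concrete, here is a minimal sketch in Python (a toy illustration, not any particular system): count how often words co-occur within a small window, treat each row of counts as a word vector, and compare words with the cosine of the angle between their vectors.

```python
import numpy as np

corpus = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "the mouse ate the cheese",
]
window = 2
vocab = sorted({w for s in corpus for w in s.split()})
index = {w: i for i, w in enumerate(vocab)}
counts = np.zeros((len(vocab), len(vocab)))

for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[index[w], index[words[j]]] += 1  # co-occurrence count

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

# "cat" and "mouse" appear in similar contexts, so their vectors end up closer
print(cosine(counts[index["cat"]], counts[index["mouse"]]))
print(cosine(counts[index["cat"]], counts[index["cheese"]]))
```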
The obvious problem of the distributional approach is polysemes, words that have several meanings (in the English language the most infamous example is "get").
Yoshua Bengio's neural probabilistic language model of 2003 popularized the distributional approach
in the world of neural networks.
A few years later, Ronan Collobert and Jason Weston at NEC Labs in Princeton demonstrated that, using convolutional neural networks, one could build a general architecture for natural-language processing, a field which until then had mainly focused on task-specific architectures ("A Unified Architecture for Natural Language Processing", 2008); this task-independent architecture for "sequence tagging" was later refined when Jason Weston was at Google ("Natural Language Processing almost from Scratch", 2011).
Beyond vector-based models of word meaning, one would like vector-based models of phrases. These are usually constructed by combining word vectors, following the pioneering work of Mirella Lapata and her student Jeff Mitchell at the University of Edinburgh ("Vector-based Models of Semantic Composition", 2008).
There are more sophisticated approaches to "semantic composition", notably the theory developed by Stephen Clark at Oxford University in collaboration with quantum physicist Bob Coecke and logical mathematician Mehrnoosh Sadrzadeh ("A Compositional Distributional Model of Meaning", 2008). They applied techniques from mathematical logic, category theory, and quantum physics to the distributional hypothesis.
Richard Socher, working with Andrew Ng and Christopher Manning at Stanford, used recursive neural networks to learn vector space representations for multi-word phrases and sentences ("Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection", 2011).
Tomas Mikolov, while at Google, invented Word2vec, an influential method to build vector representations of words from very large datasets of texts ("Efficient Estimation of Word Representations in Vector Space", 2013). Word2vec's skip-gram method predicts the neighbors of a given word in a random sentence; or, better, it provides the probability for every word in the vocabulary of being the neighbor of that given word. An extension of Word2vec called "FastText" was developed in 2015 by the team of Tomas Mikolov at Facebook and made available as open source ("Bag of Tricks for Efficient Text Classification", 2016). Meanwhile, a representation method similar to Word2vec called "GloVe" was introduced by Christopher Manning's students Jeffrey Pennington and Richard Socher ("GloVe: Global Vectors for Word Representation", 2014). These "word embeddings" (like Word2vec and GloVe) derive a map of how words relate to each other based on the configurations in which the words appear in large amounts of text.
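As an illustration, a skip-gram model can be trained in a few lines with the open-source gensim library (a later re-implementation, not Google's original code); the toy corpus and parameters below are, of course, hypothetical, and real models are trained on billions of words.

```python
from gensim.models import Word2Vec

# toy corpus (hypothetical); real models are trained on billions of words
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "dog", "chases", "the", "cat"],
]
# sg=1 selects the skip-gram variant: predict the neighbors of each word
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["king"][:5])                   # first coordinates of the "king" vector
print(model.wv.most_similar("king", topn=3))  # nearest words in the vector space
```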
The next step was to develop models that can not only detect the relative positions of words but also capture information about sentences. The "Skip-Thoughts" method proposed by Ruslan Salakhutdinov's student Jamie Ryan Kiros at the University of Toronto is almost literally the equivalent for sentences of the skip-gram model for words: if the skip-gram model predicts the words surrounding a given word, the skip-thoughts method predicts the sentences surrounding a given sentence ("Skip-Thought Vectors", 2015). In 2018 Matt Gardner's group at the Allen Institute in collaboration with Luke Zettlemoyer's group at the University of Washington introduced ELMo ("Embeddings from Language Models"), an alternative to GloVe and Word2vec ("Deep Contextualized Word Representations", 2018). ELMo, trained on a vast amount of text, builds a representation of language use that can be transferred to a variety of natural-language processing tasks (question answering, summarizing, sentiment analysis, etc). The "Quick-Thoughts" method by Honglak Lee's student Lajanugen Logeswaran at the University of Michigan is a variant of "Skip-Thoughts" that lends itself to faster training ("An Efficient Framework for Learning Sentence Representations", 2018).
For most natural-language processing tasks one first computes a "baseline" sentence representation, either by averaging a sentence's word vectors (the so-called "bag-of-words" approach) or by taking a weighted linear combination of the word vectors, as in the method proposed by Sanjeev Arora at Princeton University ("A Simple but Tough-to-Beat Baseline for Sentence Embeddings", 2016) or the one proposed by Andreas Rueckle at the Technical University of Darmstadt ("Concatenated p-mean Word Embeddings as Universal Cross-Lingual Sentence Representations", 2018).
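A minimal sketch of the averaging baseline (toy vectors, purely for illustration): the sentence vector is just the mean of its word vectors, so word order is ignored.

```python
import numpy as np

# toy word vectors (in practice they would come from Word2vec, GloVe, etc.)
embeddings = {
    "the": np.array([0.1, 0.3]),
    "cat": np.array([0.9, 0.2]),
    "sleeps": np.array([0.4, 0.8]),
}

def sentence_vector(sentence):
    # bag-of-words baseline: the mean of the word vectors, word order ignored
    vectors = [embeddings[w] for w in sentence.split() if w in embeddings]
    return np.mean(vectors, axis=0)

print(sentence_vector("the cat sleeps"))
```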
A variation of Luong's "dot-product" attention mechanism, "self-attention", became important for all natural-language tasks. Self-attention is an attention mechanism that captures the relationships between two items of a sequence regardless of their distance within the sequence.
Self-attention was invented by Mirella Lapata's student Jianpeng Cheng ("Long Short-Term Memory-Networks for Machine Reading", 2016), and then employed by Ankur Parikh at Google ("A Decomposable Attention Model", 2016),
and by Socher's group at Salesforce ("A Deep Reinforced Model for Abstractive Summarization", 2017).
Yoshua Bengio's student Zhouhan Lin in Montreal ("A Structured Self-attentive Sentence Embedding", 2017) devised an elegant method to augment
a bidirectional LSTM with a self-attention mechanism.
Until then attention had been used as an add-on to a recurrent neural network.
Google's "transformer" architecture,
instead, disposed of the recurrent neural network altogether and relied solely on self-attention ("Attention Is All You Need", 2017).
This approach pioneered a completely different way to analyze a sentence.
A stack of encoders reads the input (of any length) and produces a representation of it using "self-attention", while a stack of decoders generates the output. The self-attention mechanism is really what makes the transformer different from an LSTM: self-attention models relationships between all the words of a sentence, regardless of their position, in order to produce the best possible representation.
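Here is a minimal sketch of (single-head) scaled dot-product self-attention, the core operation of the transformer, in plain numpy; it is illustrative only, since real transformers use multiple heads, learned projection matrices and many stacked layers.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # queries, keys and values for every word
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # how much each word "attends" to every other word
    weights = softmax(scores)                 # normalized attention weights
    return weights @ V                        # each output is a weighted mix of all words

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))                   # four "words", each a 6-dimensional embedding
Wq, Wk, Wv = (rng.normal(size=(6, 6)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (4, 6): one new contextualized vector per word
```

Note that the attention weights relate every pair of positions directly, which is why distance within the sequence does not matter.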
Google's transformer, a project started in 2017 by Jakob Uszkoreit at Google laboratories in Germany and mostly implemented by Ashish Vaswani and Noam Shazeer, boasted an improvement
in translation quality, and it also seemed to learn grammatical facts on its own.
Google's Transformer offered another advantage. During the previous year, a number of architectures had been proposed to reduce the computational cost of sequence-to-sequence modeling, notably: Nal Kalchbrenner's ByteNet and Facebook's ConvS2S. The number of operations needed to relate distant elements in a sequence grows linearly with distance in ConvS2S and logarithmically in ByteNet, but doesn't grow at all in the Transformer: it remains constant.
Abhinav Gupta's student Xiaolong Wang at Carnegie Mellon University, in collaboration with Ross Girshick and Kaiming He of Facebook, developed an algorithm similar to the self-attention mechanism of Google's transformer, but designed to capture long-range dependencies in images and videos ("Non-local Neural Networks", 2018).
Meanwhile, those working on "visual question answering", i.e. answering questions about a visual scene, had been forced to adopt different strategies. Zichao Yang at CMU in collaboration with Microsoft ("Stacked Attention Networks for Image Question Answering", 2016) used neural attention, and Jiasen Lu at Virginia Tech introduced a new form of attention, "co-attention" ("Hierarchical Question-image Co-attention for Visual Question Answering", 2016), but Peng Wang and Qi Wu in Anton van den Hengel's team at the University of Adelaide in Australia merged neural networks with the knowledge-based approach which had become almost anathema in the era of deep learning ("Explicit Knowledge-based Reasoning for Visual Question Answering", 2017). The catch here is learning a "disentangled" representation that can be interpreted. One influential method to learn a "disentangled" representation was developed by Joshua Tenenbaum's team at MIT using a variational autoencoder ("Deep Convolutional Inverse Graphics Network", 2015).
Deep reinforcement learning models, such as AlphaGo, suffer from the inability to adapt to even minor changes in the task. DeepMind then turned to a 20-year-old idea, relational reinforcement learning, a form of inductive logic programming, and used self-attention to compute relations between entities ("Relational Deep Reinforcement Learning", 2018).
Word2vec and GloVe are both kinds of unsupervised learning: they skim through vast amounts of data and build a representation of language use based on the distribution of words with no help from humans. ELMo, Skip-Thoughts and Quick-Thoughts belong to the same category. Unsupervised learning of language use ruled until Alexis Conneau of Antoine Bordes' group at Facebook in France introduced InferSent (a bi-directional LSTM architecture) that outperformed Skip-Thought ("Supervised Learning of Universal Sentence Representations from Natural Language Inference Data", 2017).
That opened the gates to supervised architectures for language representation. Sandeep Subramanian in Yoshua Bengio's group at the University of Montreal developed one based on a bidirectional "gated recurrent unit" (GRU) neural network ("Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning", 2018). Based on Jonathan Baxter's old theory of inductive bias learning, Subramanian assumed that training a network on multiple weakly-related tasks was useful to generate general representations of sentences, representations that encode the "inductive biases" of multiple models and that can therefore be employed for novel tasks. Minh-Thang Luong at Google ("Multi-task Sequence to Sequence Learning", 2015) had pioneered the idea: he had trained a model on a set of weakly-related tasks, fusing sequence-to-sequence learning and multi-task learning. Another Google team introduced two supervised models ("Universal Sentence Encoder", 2018), one based on Ashish Vaswani's "transformer" architecture at Google Brain ("Attention Is All You Need", 2017) and the other based on Mohit Iyyer's "deep averaging network" (DAN) architecture at the University of Maryland ("Deep Unordered Composition Rivals Syntactic Methods for Text Classification", 2015).
Using a technique derived from Mikolov's skip-gram, Oriol Vinyals and Quoc Le at Google revolutionized the venerable branch of discourse analysis. They trained a recurrent network with a large set of chats between users and support technicians. This created the equivalent of a translation (or, better, of a sequence-to-sequence model): the question asked by a user has to be "translated" into the response of the support technician ("A Neural Conversational Model", 2015).
Then Oriol Vinyals used the same machine-translation technique to analyze images and create captions. The best architecture to represent images as vectors was the convolutional neural network, so Vinyals used a convolutional neural network as the image encoder and a decoder RNN to turn that vector representation into sentences ("Show and Tell - A Neural Image Caption Generator", 2015): in effect, a neural network trained to describe a scene. The similarities between language parsing in natural language processing and scene analysis in machine vision had been known at least since Gabriela Csurka developed the "bag-of-visual-words" or "bag-of-features" technique. Ironically, the biggest success story of the "bag-of-words" model has been in image classification, not in text classification. In 2003 Csurka, at Xerox in France, applied the same statistical method to images, and the "bag-of-visual-words" model was born: it basically treats an image as a document. For a whole decade this was the dominant method for image recognition, especially when coupled with a support vector machine classifier. This approach led, for example, to the system for classification of natural scenes developed in 2005 at Caltech by Pietro Perona and his student Fei-fei Li.
Mikolov's method was also the basis for the R-Net developed by Microsoft's Chinese laboratories (Furu Wei and others) that in January 2018 won the Stanford reading-comprehension competition, surpassing human performance on one metric of one reading task.
Up to this point the techniques for natural language processing included: the "bag-of-words" approach, in which sentence representations are independent of word order; the sequence models developed by Michael Jordan (1986) and Jeffrey Elman (1990) at UC San Diego; and models based on tree structures, in which a sentence's symbolic representation is derived from its constituents following a syntactic blueprint (the typical symbolic structure that results from this process resembles an inverted tree). The latter arose in the 1990s after a debate on representations in neural networks that started in 1984 when Geoffrey Hinton (then at Carnegie Mellon University) circulated a report titled "Distributed Representations" about representations in which "each entity is represented by a pattern of activity distributed over many computing elements, and each computing element is involved in representing many different entities." The problem of representing tree structures in neural networks was solved by Jordan Pollack of Ohio State University, who came up with the Recursive Auto-Associative Memory or RAAM ("Recursive Distributed Representations", 1990). A few years later Christoph Goller and Andreas Kuechler in Germany extended Pollack's RAAM so that it could be used for arbitrarily complex symbolic structures, e.g. any sort of tree structure ("Learning Task-dependent Distributed Representations by Backpropagation Through Structure", 1995).
For question-answering systems Jason Weston (now at Facebook's labs in New York) developed "Memory Networks" (2014), neural networks coupled with long-term memories.
The trend towards more complex memory structures, required for analysis of lengthy text, led to a variant of memory networks that is trained end-to-end, namely Sainbayar Sukhbaatar's "end-to-end memory network" at New York University, a collaboration with the Facebook group of Jason Weston and Rob Fergus ("End-to-End Memory Networks", 2015).
"Sequence tagging" (or "labeling") is the process of assigning each item in a sequence to a category, a process that is used in both natural language processing and bioinformatics. This process was traditionally implemented either with generative models such as the hidden Markov models employed in speech recognition or with the "conditional random fields" invented by John Lafferty (a former member of Fred Jelinek's group at IBM, now at Carnegie Mellon University), working with Andrew McCallum and Fernando Pereira ("Conditional Random Fields", 2001). Collobert's technique constituted the first major innovation, and it was countered years later by the bi-directional LSTM with conditional random fields developed by Zhiheng Huang, Wei Xu and Kai Yu of Baidu ("Bidirectional LSTM-CRF Models for Sequence Tagging", 2015).
Collobert's neural-network architecture for NLP formed the basis for Soumith Chintala's sentiment-analysis system at New York University, which learned to categorize movie reviews as positive or negative ("Sentiment Analysis using Neural Architectures", 2015).
Socher, who at Stanford had created the "recursive neural tensor network" ("Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank", 2013), helped Christopher Manning's student Kai Sheng Tai to develop Tree-LSTM, a generalization of LSTMs to the tree structures used in natural language processing that further improved sentiment analysis, building on the research started by Pollack 25 years earlier ("Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks", 2015). Sentiment analysis was also the objective of two projects in New England. In 2016 the Computational Story Laboratory of the University of Vermont (led by Peter Dodds and Chris Danforth) used Teuvo Kohonen's Self Organising Map (SOM) to study what Kurt Vonnegut had termed the "emotional arcs" of written stories ("The Emotional Arcs of Stories are Dominated by Six Basic Shapes", 2016). In 2017 Eric Chu of MIT's Laboratory for Social Machines, directed by Deb Roy (later hired by Twitter), used deep convolutional neural networks to infer the emotional content of videos and television shows by analyzing the story, the facial expressions and the soundtrack, i.e. both audio and visual sentiment analysis ("Audio-Visual Sentiment Analysis for Learning Emotional Arcs in Movies", 2017).
Alec Radford at OpenAI discovered a mysterious property of a particular kind of LSTM network: trained (with 82 million Amazon reviews) to predict the next character in the text of Amazon reviews, the network develops a "sentiment neuron" that predicts the sentiment value of the review, i.e. it develops the ability to discover the sentiment of the text, and the sentiment neuron adjusts its value on a character-by-character basis ("Learning to Generate Reviews and Discovering Sentiment", 2017). On the other hand, the "deep forest" method developed by Zhi-Hua Zhou and Ji Feng at Nanjing University used the old-fashioned method of decision tree ensembles instead of neural networks and performed as well on sentiment analysis as the best neural networks ("Deep Forest", 2017).
A footnote on sentiment analysis. There are countless precursors, like Carbonell's dissertation of 1979 at Yale and Clark Elliott's PhD dissertation of 1992 at Northwestern University ("The Affective Reasoner", 1992), an implementation of Andrew Ortony's psychological theory (his "appraisal model" of 1988); but the discipline was truly born in 2002 with two studies: one by Peter Turney at the Institute for Information Technology of Canada ("Thumbs up or Thumbs down? Semantic Orientation Applied to Unsupervised Classification of Reviews", 2002) and the other (a movie-review classifier) by Bo Pang and Lillian Lee at Cornell University ("Thumbs up? Sentiment Classification using Machine Learning Techniques", 2002). Jeonghee Yi at IBM in San Jose (2003) was perhaps the first one to use "Sentiment Analysis" in the title of his paper.
Training a neural network requires a well-structured dataset. But a lot of real-world information comes in unstructured formats such as books, magazines, radio news, TV programs, etc. Hence the need for text-understanding, or reading-comprehension, technology. Understanding a text requires, first of all, determining what the real focus is. Hence a number of neural attention mechanisms were developed, mainly Jason Weston's memory networks at Facebook ("Memory Networks", 2014) and Richard Socher's dynamic memory networks at MetaMind in Palo Alto ("Ask Me Anything", 2015). Minjoon Seo in Ali Farhadi's group at the Allen Institute for Artificial Intelligence ("Bidirectional Attention Flow for Machine Comprehension", 2016) developed the Bidirectional Attention Flow (BiDAF) model, a new kind of "attention" technique, inspired by the "bi-attention" technique of Dzmitry Bahdanau's BiRNN. Seo's architecture can model the context at different levels of granularity.
Just about at the same time, extensive datasets such as SQuAD, MARCO,
CoQA and QuAC
made it possible to train neural networks for reading comprehension tasks.
Weizhu Chen's team at Microsoft developed ReasoNet in 2016, which combined memory networks with reinforcement learning, and FusionNet in 2017, which introduced a simpler attention mechanism called "History of Word". In 2017 the Reinforced Mnemonic Reader, developed in China jointly by Xipeng Qiu at Fudan University and the National University of Defense Technology, set a new record ("Reinforced Mnemonic Reader for Machine Reading Comprehension", 2017). Alas, recurrent neural networks are very slow in both training and inference, a fact that prevents them from being deployed in real-time applications.
In 2017 Quoc Le's team at Google collaborated with Wei Yu of Carnegie Mellon University to develop QANet, a system for end-to-end natural-language processing that eschewed recurrent neural networks, and instead used convolutions to model local interactions and self-attention to model global interactions. QANet showed impressive improvement at question-answering tasks ("Combining Local Convolution with Global Self-Attention for Reading Comprehension", 2018).
Ali Farhadi's former student Mark Yatskar, now at the Allen Institute, and others updated BiDAF with self-attention and obtained BiDAF++ ("A Qualitative Comparison of CoQA, SQuAD 2.0 and QuAC", 2018).
SDNet, developed by Xuedong Huang's group at Microsoft, was basically Google's BERT improved with the "context", i.e. with a history of the questions and answers that led to the current question ("Contextualized Attention-based Deep Network for Conversational Question Answering", 2018).
FlowQA, developed by Hsin-Yuan Huang and Wen-tau Yih at the Allen Institute, encoded the conversation history in a deeper manner ("Grasping Flow in History for Conversational Machine Comprehension", 2018).
Transfer learning for text classification was tried by many, and in particular by Andrew Dai and Quoc Le at Google ("Semi-supervised Sequence Learning", 2015), who proposed fine-tuning a "language model". Their project was influential because before them LSTM networks had rarely been used for natural-language processing tasks despite their power to model sequential data: they showed that a pretraining step made it possible. The problem is that their method was impractical: it required millions of documents in a given domain in order to achieve good performance in that domain.
It was the field of computer vision that first demonstrated the importance of transfer learning from large pre-trained models, for example fine-tuning models that were pre-trained on ImageNet (the classic approach pioneered by Fei-fei Li), a fact discussed in a famous paper by Hod Lipson's student Jason Yosinski at Cornell University ("How Transferable are Features in Deep Neural Networks?", 2014).
Another improvement came from the method of "hypercolumns", introduced by Jitendra Malik's student Bharath Hariharan at UC Berkeley ("Hypercolumns for Object Segmentation and Fine-grained Localization", 2015). This method too required some kind of pre-training, whether language modeling, paraphrasing, entailment or machine translation. Luke Zettlemoyer's team at the University of Washington achieved some success using this method to build a language model ("Deep Contextualized Word Representations", 2018).
Jeremy Howard at the University of San Francisco and Sebastian Ruder at the National University of Ireland combined techniques of transfer learning from computer vision, Hariharan's hypercolumns method and Dai's fine-tuning method to build a more general system to classify documents, called ULMFiT, that made transfer learning the new standard for natural-language processing ("Universal Language Model Fine-tuning for Text Classification", 2018). This was a three-layer LSTM architecture: first the language model is trained on a general corpus of texts to capture general features of the language; then such model is fine-tuned on a specific task using two new techniques ("discriminative fine-tuning" and "slanted triangular learning rates"); and finally the system is further fine-tuned using a new technique called "gradual unfreezing". These novel techniques remedied a well-known problem: neural networks are prone to catastrophic forgetting during fine-tuning.
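The following schematic PyTorch sketch (hypothetical layer groups, not the authors' fastai implementation) illustrates the idea behind "gradual unfreezing": fine-tune the last layer group first, then progressively unfreeze earlier groups, so that the pretrained features are not destroyed all at once.

```python
import torch.nn as nn
import torch.optim as optim

# stand-in for a pretrained language model: two "layer groups" plus a classifier head
model = nn.Sequential(nn.Linear(100, 100), nn.Linear(100, 100), nn.Linear(100, 2))
layer_groups = list(model.children())

for p in model.parameters():          # start with every layer frozen
    p.requires_grad = False

for stage in range(1, len(layer_groups) + 1):
    for p in layer_groups[-stage].parameters():   # unfreeze one more group per stage
        p.requires_grad = True
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = optim.SGD(trainable, lr=1e-3)
    # ... fine-tune on the target task for one epoch before unfreezing the next group ...
```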
Then systems such as OpenAI's Generative Pre-trained Transformer or GPT ("Improving Language Understanding by Generative Pre-Training", 2018) and Google's Bidirectional Encoder Representations from Transformers or BERT ("Pre-training of Deep Bidirectional Transformers for Language Understanding", 2018) combined Google's "transformer" architecture with ULMFiT-style transfer learning.
Now researchers could choose between two strategies for applying pre-trained language models to specific tasks: feature-based (such as ELMo) and fine-tuning based (such as OpenAI's GPT and Google's BERT).
The importance of pre-trained representations (when using deep learning to study written text) is due to the fact that deep learning requires data for training, and there are few data available for written text. A text can be just about anything; it can have an infinite number of meanings and purposes. The available task-specific datasets have relatively few examples that can be used for training a deep network. Luckily, it turns out that one can pretrain a deep network on generic language data (typically, the texts found on the World-wide Web) and then specialize it for a narrow domain with impressive results, and it doesn't really matter that the fine-tuning for the special domain is done using "small" data. Pretraining can be either context-free (such as Word2vec or GloVe) or contextual (such as ELMo and ULMFiT). Contextual representations can be unidirectional or bidirectional. As of 2018, BERT was the most popular of the bidirectional contextual methods. Ironically, BERT used a technique called the "cloze procedure", originally published in 1953 by the psychologist Wilson Taylor at the University of Illinois for measuring reading skills and widely used in schools worldwide. It consists in deleting random words from a text and testing whether the reader can fill them back in. BERT is trained to predict the deleted words.
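The cloze idea can be seen at work with the open-source Hugging Face re-implementation of BERT (a later library, not Google's original code): mask a word and ask the model to fill in the blank.

```python
from transformers import pipeline

# mask one word and ask BERT to fill the blank, exactly the cloze exercise
fill = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill("The doctor prescribed a new [MASK] for the patient."):
    print(candidate["token_str"], round(candidate["score"], 3))
```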
Within one year Jacob Devlin's BERT had been improved by multiple teams to achieve quasi-human performance on several tests, notably the GLUE (General Language Understanding Evaluation) benchmark for language-understanding tasks introduced in 2018 by Sam Bowman of New York University: RoBERTa by Luke Zettlemoyer's team at Facebook and at the University of Washington ("A Robustly Optimized BERT Pretraining Approach", 2019), which pushed the GLUE score to 88.5; FreeLB by Tom Goldstein's student Chen Zhu at the University of Maryland in collaboration with Jingjing Liu's team at Microsoft, which used so-called "adversarial training" to achieve 88.8 ("Enhanced Adversarial Training for Language Understanding", 2019); StructBERT by an Alibaba team in China, which incorporated language structures into pre-training and pushed the GLUE score to 89.0 ("Incorporating Language Structures into Pre-training for Deep Language Understanding", 2019); and ALBERT (short for "A Lite BERT") by Zhenzhong Lan at Google, which achieved 89.4 ("A Lite BERT for Self-supervised Learning of Language", 2019). The media indulged in headlines such as "Machines beat humans at reading", but in reality within a few weeks major papers appeared that disputed BERT's skills at comprehending language, for example one by Timothy Niven and Hung-Yu Kao of Taiwan's National Cheng Kung University ("Probing Neural Network Comprehension of Natural Language Arguments", 2019) and one by Tal Linzen of Johns Hopkins University ("Right for the Wrong Reasons", 2019).
BERT's "pre-training" was based on autoencoding and was capable of modeling bidirectional contexts. This approach looked more promising than the main alternative: pretraining based on autoregressive language modeling, as in the case of OpenAI's GPT (GUID Partition Table). However, within one year the autoregressive approach was again in vogue thanks to the work of Ruslan Salakhutdinov's team at Carnegie Mellon University that first developed an attention model named Transformer-XL ("Attentive Language Models Beyond a Fixed-length Context", 2019) and then unveiled an autoregressive pre-training method, XLNet, both collaborations with Google's scientist Quoc Le ("Generalized Autoregressive Pretraining for Language Understanding", 2019). XLNet was capable of modeling bidirectional contexts and outperformed BERT in several tasks..
The sequence-to-sequence (Seq2Seq) method introduced in 2013 by Nal Kalchbrenner and Phil Blunsom, and especially the attention-based Seq2Seq model introduced by Dzmitry Bahdanau, triggered a boom in systems for text summarization (abstractive summarization, not just reproducing a few significant sentences) because they can "write" and not only "read". Alexander Rush, Sumit Chopra and Jason Weston at Facebook developed Attention-Based Summarization ("A Neural Attention Model for Abstractive Sentence Summarization", 2015) and opened the floodgates, soon followed by Jiatao Gu's CopyNet at the University of Hong Kong ("Incorporating Copying Mechanism in Sequence-to-Sequence Learning", 2016), the Forced Attention Sentence Compression Model developed by Phil Blunsom's student Yishu Miao at Oxford University, which uses variational autoencoders ("Language as a Latent Variable", 2016), Read-Again Summarization developed by Raquel Urtasun's student Wenyuan Zeng at the University of Toronto ("Efficient Summarization with Read-Again and Copy Mechanism", 2016), Ramesh Nallapati's SummaRuNNer in Bowen Zhou's group at IBM ("A Recurrent Neural Network Based Sequence Model for Extractive Summarization of Documents", 2017), etc. Most of these systems (Gu's, Miao's, Zeng's, Nallapati's) employed the "pointer network" conceived by Oriol Vinyals of Google and Meire Fortunato of UC Berkeley, an extension of the sequence-to-sequence model using Bahdanau's attention mechanism. A major step forward for longer-text summarization was the Pointer-generator Network or PGNet developed by Christopher Manning's student Abigail See at Stanford ("Get To The Point - Summarization with Pointer-Generator Networks", 2017), synthesizing all of these ideas.
Several startups began offering services of text analysis and summary: Narrative Science, founded in 2010 in Chicago by Northwestern University's professors Kristian Hammond and Larry Birnbaum (a student of Schank's at Yale University in 1986); Maluuba, founded in 2011 in Canada by two students of the University of Waterloo, Sam Pasupalak and Kaheer Suleman, and acquired in 2017 by Microsoft;
Semantic Machines, founded in 2014 in Berkeley by Dan Roth, UC Berkeley scientist Dan Klein and former Dragon executive Larry Gillick, and acquired by Microsoft in 2018;
and MetaMind, founded in 2014 in Palo Alto by Richard Socher and acquired by Salesforce in 2016. But their narrative summaries only worked in very narrow domains under very friendly circumstances.
The results are still far from human performance. The most illiterate person on the planet can understand language better than the most powerful machine.
To be fair, progress in natural language understanding was hindered by the simple fact that humans prefer not to speak to another human in our time-consuming natural language. Sometimes we prefer to skip the "Good morning, how are you?" and get straight to the "Reset my Internet connection" in
which case saying "One" to a machine is much more effective than
having to wait for a human operator to pick up the phone and to understand your
issue. Does anyone actually understand the garbled announcements in the New
York subway?
The advantage of machine communications is that they are clear. We humans, instead, turn even a short request into a long story. A human speaking to another human may say: "Can you please open the window as i'm having difficulties breathing. You know, i had asthma as a child". A human speaking to a machine only needs to say: "Alexa, open the window". Humans are weird: they like to understand the meaning of what is going on, not just be told what to do.
Like it or not, humans
can more effectively go about their business using the language of machines.
For a long time, therefore, Natural Language Processing remained an underfunded
research project with few visible applications. It is only recently that
interest in "virtual personal assistants" has resurrected the field.
In order to realize how far we are from having machines that truly "understand"
our language, think of a useful application that would greatly help civility
in written conversations: the equivalent of a spelling checker for hostile
moods. Imagine an app that, when you try to send an email, would warn you
"The tone of this email is rude: do you really want to send it?" or
"The tone of this email is sarcastic" or
"The tone of this email is insulting".
It is not difficult for a human to read "between the lines", to understand
the hidden motivation of a message and, in particular, to understand when
the writer is deliberately trying to hurt your feelings.
Written hostilities can escalate quickly in the age of email and texting.
The interesting fact is that we understand in a second that the tone of an
email is not friendly, even when the email is a correct reply to our question
or a positive comment to something we have done. When a friend was celebrating
the killing of Osama bin Laden, i quipped "Yes, we are very good at
assassinating people". You understand the sarcasm and indirect critique of US
foreign policy, don't you? You may also understand that i was greatly annoyed
by that operation, and, even more, by the fact that people were celebrating
in the streets.
We routinely get in trouble when we speak quickly because we say something
that we "should not have said": this doesn't mean that what we said was false,
but that we said it on purpose to cause harm and perhaps humiliate. Most of
the time we regret doing it. Conversely, we are easily ticked off by the
wrong tone in an email that was sent to us.
We immediately understand the "tone" of an email, especially when it's meant to
hurt or annoy us, and we are very good
at forging a tone that will hurt or annoy somebody. We word
our sentences according to our mood and to the mood we want to create
in the other person.
When you chat with someone, you pay little attention to the grammatical
structure of what you are saying (in fact, you make a lot of grammatical mistakes,
interrupting your own sentences, restarting them, interjecting a lot of random
noise such as "hmmm") but you pay a lot of attention at the dynamics of the
conversation, which depends heavily on the tone of your and their voices
(assertive, soothing, angry, etc).
The importance of mood in understanding what is going on cannot be overstated.
The impact of mood on comprehension was already studied by Gordon Bower at Stanford in "Mood and Memory" (1981), with a tentative computational theory based on semantic networks, and Daniel Martins in France ("Influence of Affect on Comprehension of a Text", 1982).
The interplay of emotion and cognition has been studied for a long time, from
Richard Lazarus at UC Berkeley
("Thoughts on the Relations between Emotion and Cognition", 1982) to Joseph LeDoux at New York University (his book "The Emotional Brain", 1996) via
multi-level theories of cognition-emotion interaction such as
Philip Barnard's "interacting cognitive subsystems model"
("Interacting Cognitive Subsystems", 1985) at Cambridge University
and Barnard's collaborator John Teasdale's model of nine cognitive subsystems at
Oxford University ("Emotion and two kinds of meaning", 1993).
More studies about how emotional states influence what one understands
have emerged since the 1990s, for example
Joseph-Paul Forgas' "affect infusion model" ("Mood and Judgement", 1995)
and Isabelle Tapiero's book "Situation Models and Levels of Coherence" (2007).
But little progress has been made in computing moods. The one influential paper on the subject was written by a philosopher, Laura Sizer at Hampshire College ("Towards A Computational Theory of Mood", 2000).
The other thing that we humans can do effortlessly (and frequently abuse this skill) is to generate stories. Given something that happened or an article that we read or a television show that we watched, we can easily create a story to describe it.
That's another thing that machines can't do in any reasonable fashion, despite decades of research.
The pioneering systems of automatic story generation were the Automated Novel Writer, developed starting in 1971 at the University of Wisconsin (a status report was published in 1973) by Sheldon Klein, who had already worked on automatic summaries at Carnegie Mellon University ("Automatic Paraphrasing in Essay Format", 1965);
James Meehan's story generator Tale-Spin ("The Metanovel", 1976), advised by Roger Schank at Yale, a program that generated stories about woodland creatures;
and Michael Lebowitz's Universe at Columbia University ("Creating Characters in a Story-Telling Universe", 1984).
Selmer Bringsjord of Rensselaer Polytechnic Institute in New York state and David Ferrucci of IBM started building Brutus in 1990 ("AI and Literary Creativity", 1999).
Then came Scott Turner's Minstrel at UCLA ("Minstrel, a computer model of creativity and storytelling", 1992) and
Rafael Perez y Perez's Mexica in Britain ("A Computer Model of Creativity in Writing", 1999).
An illiterate four-year-old child is infinitely better than these systems at
"narrating" a generic event.
In 2019 Alec Radford and Jeffrey Wu of OpenAI demonstrated an algorithm, GPT2, capable of creating convincing articles. In February 2019 the Guardian newspaper (the printed version) published an article written by GPT2 itself. (A few days later another Guardian article, penned by the real human Hannah Jane Parkinson, screamed "AI can write just like me - Brace for the robot apocalypse", which, again, made us wonder who's more intelligent, human or machine). GPT2 was based on Ashish Vaswani's "transformer" method (with a whopping 1.5 billion parameters) and was trained on WebText (millions of web pages) to predict the next word in a text. The "distributional" approach had found its AlphaGo. The problem, of course, is that GPT2 created "fake" stories. Since it had no idea of what it was generating, it also had no idea whether these were real facts: it is merely a game in which GPT2 "predicts" the most likely word to follow the current one. It turns out that this game can produce text that sounds authentic. GPT2 is not even good for creating "fake news", because you can't really control the ideological orientation of the generated text. But GPT2 was indeed a major achievement because it achieved multitask learning without any explicit supervision: it was equally proficient at question answering, machine translation, reading comprehension, and summarization ("Language Models are Unsupervised Multitask Learners", 2019).
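GPT2's next-word game can be replayed with the smaller openly released model via the Hugging Face library (a later re-implementation, not OpenAI's original code, and far smaller than the full 1.5-billion-parameter model):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # the small released model
result = generator("The archaeologists announced a surprising discovery:",
                   max_length=40, num_return_sequences=1)
print(result[0]["generated_text"])  # plausible-sounding, but not fact-checked, prose
```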
Machine Translation too has disappointed. Despite recurring investments in the field by major companies, your favorite online translation system succeeds only with the simplest sentences, just like Systran in the 1970s. Here are some random Italian sentences from my old books translated into English by the most popular translation engine: "Graham Nash the content of which led nasal harmony", "On that album historian who gave the blues revival", "Started with a pompous hype on wave of hippie phenomenon".
In November 2016 the new Google Translate feature was widely publicized because it dramatically improved the machine-translation score called BLEU ("bilingual evaluation understudy"), introduced in 2002 by IBM. The new Google Translate was developed by Quoc Le (born in Vietnam), Mike Schuster (born in Germany), and Yonghui Wu (a Chinese-born veteran of Google's search engine). I tried it myself on simple sentences and the improvement was obvious. I tried it on one of my old music reviews written in Italian and the result was difficult to understand (maybe the original was too!). The biggest mistake: the Italian plural "geni" got translated as the plural of "gene", but in that context it is obviously the plural of "genius".
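For reference, BLEU simply rewards n-gram overlap between a candidate translation and human references; here is a toy computation with the NLTK library (illustrative only, since real evaluations average over whole test sets, not single sentences):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "is", "on", "the", "mat"]]
good = ["the", "cat", "is", "on", "the", "mat"]
poor = ["there", "is", "a", "cat", "somewhere"]

smooth = SmoothingFunction().method1   # smoothing avoids zero scores on short sentences
print(sentence_bleu(reference, good, smoothing_function=smooth))  # close to 1.0
print(sentence_bleu(reference, poor, smoothing_function=smooth))  # much lower
```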
After successfully employing that recurrent neural network to improve Google's machine translation, Ilya Sutskever announced that "all supervised vector-to-vector problems are now solved thanks to deep feed-forward neural networks" and that "all supervised sequence-to-sequence problems are now solved thanks to deep LSTM networks" (at the 2014 Neural Information Processing Systems conference in Montreal). Unbridled optimism has always been A.I.'s main enemy.
Even if we ever get to the point that a machine can translate a complex sentence, here is the real test: "'Thou' is an ancient English word". Translate that
into Italian as "'Tu' e` un'antica parola Inglese" and you get an
obviously false statement ("Tu" is not an English word). The trick is
to understand what the original sentence means, not to just mechanically
replace English words with Italian words. If you understand what it means, then
you'll translate it as "'Thou' e` un'antica parola Inglese", i.e. you
don't translate the "thou"; or, depending on the context, you might
want to replace "thou" with an ancient Italian word like "'Ei'
e` un'antica parola Italiana" (where "ei" actually means
"he" but it plays a similar role to "thou" in the context
of words that changed over the centuries). A machine will be able to get it
right only when it fully understands the meaning and the purpose of the
sentence, not just its structure.
(There is certainly at least one quality-assurance engineer who, informed of this passage in this book, will immediately enter a few lines of code in the machine translation program to correctly translate "'Thou' is an ancient English word". That is precisely the dumb, brute-force, approach that i am talking about).
Or take Ronald Reagan's famous sarcastic statement, that the nine most terrifying words in the English language are "I'm from the government and i'm here to help". Translate this into Italian and you get "Le nove parole piu` terrificanti in Inglese sono `io lavoro per il governo e sono qui per aiutare'". The words are no longer nine in the Italian translation (they are ten), nor are they "Inglese" (English), because they are now Italian. An appropriate translation would be "Le dieci parole piu` terrificanti in Italiano sono `io lavoro per il governo e sono qui per aiutare'".
Otherwise the translation, while technically impeccable, makes no practical
sense.
Or take the Berry paradox, made famous by Bertrand Russell: "the smallest positive integer number that cannot be described in fewer than fifteen words". This is a paradox because the sentence in quotes contains fourteen words. Therefore if such an integer number exists, it can be described by that sentence, which is fourteen words long. When you translate this paradox into Italian, you can't just translate fifteen with "quindici". You first
need to count the number of words. The literal translation "il numero
intero positivo piu` piccolo che non si possa descrivere in meno di quindici
parole" does not state the same paradox because this Italian sentence
contains sixteen words, not fourteen like the original English sentence. You
need to understand the meaning of the sentence and then the nature of the
paradox in order to produce an appropriate translation. I could continue with
self-referential sentences (more and more convoluted ones) that can lead to trivial mistakes when translated "mechanically" without understanding what they are meant to do.
Translations of proverbs can go badly wrong. Take the Italian "Tra il dire e il fare c'e` di mezzo il mare", which is equivalent to the English "Easier said than done". In 2017 the most popular online translator rendered it as "Between the saying and the sea there is the middle of the sea". Even the translation into Spanish fails (it is rendered as "Entre el dicho y el mar est el medio del mar") despite the fact that the equivalent Spanish proverb is very similar to the Italian ("Del dicho al hecho hay mucho trecho"). Our software engineer is now frantically entering a few lines of code in the online translator to make sure that this Italian proverb will be translated correctly into English and Spanish: alas, there are hundreds of languages and thousands of proverbs in each one, so the possible combinations run into the millions.
To paraphrase the physicist Max Tegmark, a good explanation is one that answers more than was asked. If i ask you "Do you know what time it is", a "Yes" is not a good answer. I expect you to at least tell me what time it is, even if it was not specifically asked. Better: if you know that i am in a hurry to catch a train, i expect you to calculate the odds of making it to the station in time and to tell me "It's too late, you won't
make it" or "Run!" If i ask you "Where is the
library?" and you know that the library is closed, i expect you to reply
with not only the location but also the important information that it is
currently closed (it is pointless to go there). If i ask you "How do i get
to 330 Hayes St?" and you know that it used to be the location of a
popular Indian restaurant that just shut down, i expect you to reply with a
question "Are you looking for the Indian restaurant?" and not with a
simple "It's that way". If i am in a foreign country and ask a simple
question about buses or trains, i might get a lengthy lecture about how public
transportation works, because the local people guess that I don't know how it
works. Speaking a language is pointless if one doesn't understand what language
is all about. A machine can easily be programmed to answer the question
"Do you know what time it is" with the time (and not a simple
"Yes"), and it can easily be programmed to answer similar questions
with meaningful information; but we "consistently" do this for all
questions, and not because someone told us to answer the former question with
the time and other questions with meaningful information, but because that is
what our intelligence does: we use our knowledge and common sense to formulate
the answer.
Ludwig Wittgenstein in the "Philosophical Investigations" (published posthumously in 1953) wrote that "the meaning of a word is its use in the language". That statement launched a whole new discipline, now called "pragmatics", via John Austin's analysis of speech acts (starting with a lecture at Harvard University in 1955 that in 1962 became the book "How to Do Things with Words"), Paul Grice's "conversational maxims" ("Logic and Conversation", 1975) and Dan Sperber's and Deirdre Wilson's "relevance theory" ("Relevance - Communication and Cognition", 1986). The term "pragmatics" was coined by Charles Morris, the founder of modern semiotics, in his book "Foundations of the Theory of Signs" (1938), which divided the study of language into three branches: syntax, semantics and pragmatics.
In the near future it will still be extremely difficult to build machines that can understand the simplest of sentences. At the current rate of progress, it may take centuries before we have a machine that can have a conversation like the ones I have with my friends on the Singularity. And that would still be a far cry from what humans do: consistently provide an explanation that answers more than it was asked.
A lot more is involved than simply understanding a language. If people around me speak Chinese, they are not speaking to me. But if one says "Sir?" in English, and i am the only English speaker around, i am probably supposed to pay attention.
The state of Natural Language Processing is well represented by the results returned by the most advanced search engines: the vast majority of results are precisely the
kind of commercial pages that i don't want to see. Which human would normally
answer "do you want to buy perfume Katmandu" when i inquire about
Katmandu's monuments? It is virtually impossible to find out which cities are
connected by air to a given airport because the search engines all return
hundreds of pages that offer "cheap" tickets to that airport.
Take, for example, zeroapp.email, a young startup being incubated in San Francisco in 2016. They want to use deep learning to automatically catalog the emails that you receive. Because you are a human being, you imagine that their software will read your email, understand the content, and then file it appropriately. If you were an A.I. scientist, you would have guessed instinctively that this cannot be the case. What they do is to study your behavior and learn what to do
the next time that you receive an email that is similar to past ones. If you have done X for 100 emails of this kind, most likely you want to do X also for all the future emails of this kind. This kind of "natural language processing" does not understand the text:
it analyzes statistically the past behavior of the user and then predicts what
the user will want to do in the future. The same principle is used by Gmail's
Priority Inbox, first introduced in 2010 and vastly improved over the years:
these systems learn, first and foremost, by watching you; but what they learn
is not the language that you speak.
I like to discuss with machine-intelligence fans a simple situation. Let's say you are accused of a murder you did not commit. How many years will it take before you are willing to accept a jury of 12 robots instead of 12 humans? Initially, this sounds like a question about "when will you trust robots to decide whether you are guilty or innocent?" but it actually isn't (i would probably trust a robot better than many of the jurors who are easily swayed by good looks,
racial prejudices and many other unpredictable factors). The question is about
understanding the infinite subtleties of legal debates, the language of lawyers
and, of course, the language of the witnesses. The odds that those 12 robots
fully understand what is going on at a trial will remain close to zero for a
long time.
"I am for richness of meaning rather than clarity of meaning" (Robert Venturi, architect).