Humankind 2.0

a book in progress...
Meditations on the future of technology and society...
...to be published in China in 2016

These are raw notes taken during and after conversations between piero scaruffi and Jinxia Niu of Shezhang Magazine (Hangzhou, China). Jinxia will publish the full interviews in Chinese in her magazine. I thought of posting on my website these English notes, which, while incomplete, contain most of the ideas that we discussed.
(Copyright © 2016 Piero Scaruffi)



Big Data: History, Trends and Future

(See also the slide presentation)

Narnia: "Big Data" is a vague term. What does it really mean? What is new in "data"?

piero:

I think that "Silicon Valley" should be renamed "Data Valley". Many analysts, for example IMS, forecast that there will be 20-30 billion web-connected devices by 2020, generating 2.5 quintillion bytes of new data each day. This means that we will be producing more data every year than in the previous 200,000 years. In Silicon Valley people talk about data as the new "oil". Once refined in a refinery, oil yields things like the gasoline that powers our cars and the electrical power that runs our appliances. Once refined in a "data refinery", data will yield self-driving cars (that use GPS data, traffic data, etc), drones, wearable devices, etc. The difference between oil and data is that the product of oil does not generate more oil (unfortunately), whereas the product of data (self-driving cars, drones, wearables, etc) will generate more data (where do you normally drive, how fast/well you drive, who is with you, etc). We are not just producing data, but also data that generate data. What is disappointing is that we are not doing much with those data. Mostly we do "data analytics". This has been done since at least the 1960s: we analyze data and try to detect regularities, inefficiencies, etc, in order to optimize processes and, ultimately, make more money. It is a pity that most data analytics is simply used to sell more products. We mostly use data to figure out which advertising to display on a device. Is that all we can do with quintillions of bytes of data? It is too little, and too stupid. The real application for big data has not been invented yet.

Let us focus on what happens to "big data". Who generates most data? Machines. Who reads them? Believe it or not, about 30% of the "readers" on the Internet are robots, not humans. Even the world news is read by robots. In the future the main readers of data will be robots. Machines generate data and machines read data. It is a machine-centric world of data. That's why the only useful application is data analytics: machines are good at math and statistics, not at understanding the human world. We don't have a great application of "big data" yet because it is not humans but machines that "read" those data. Machines can only do relatively stupid things like displaying an ad on your computer. That is not a very intelligent application.
We don't have the iPhone or Facebook of big data yet. Note that the software is available, and it is free. Apache Spark (developed by Matei Zaharia and others at UC Berkeley in 2009) and OpenStack (an evolution of NASA's project Nebula that was used by more than 500 companies in 2015) are open-source. The big users of data, which are essentially Google and Facebook (whose systems have to handle billions of data items in real time), have fostered an open-source big-data infrastructure: Cassandra and Hive come from Facebook, Apache Hadoop was developed at Yahoo! following Google's papers on MapReduce and the Google File System, Mesos comes from UC Berkeley, and Giraph is an open-source implementation of Google's Pregel. One simple reason is that they want more and more startups to experiment with big data. Even the big companies want the small companies to experiment with big data. We want to see if someone can invent the "killer app" for big data that is missing today. If i knew what that app is, i would become a billionaire, but let me give you an example. Very soon one object that will be producing a lot of data is the human body. There will be many wearable devices and nanorobots inside our body, and maybe chip implants, that will be producing and broadcasting data all the time. The obvious application in this case will be some software in the cloud that can capture all these data and make sure that we are in good health. If there is any sign of imbalance, the software could immediately ask the wearables to provide more data, or could instruct the person to contact a doctor and take specific medical tests.
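Here is a minimal sketch, in Python, of what such a cloud service might look like. Everything in it (the sensor names, the "normal" ranges, the actions) is invented for illustration; a real system would rely on clinically validated models.

```python
# A minimal sketch of the hypothetical health-monitoring service described
# above. All sensor names, thresholds and actions are invented for
# illustration; a real system would use clinically validated models.

WEARABLE_READINGS = [
    {"sensor": "heart_rate", "value": 62},
    {"sensor": "heart_rate", "value": 148},
    {"sensor": "body_temp", "value": 36.8},
]

# Hypothetical "normal" ranges for a resting adult.
NORMAL_RANGES = {
    "heart_rate": (50, 100),    # beats per minute
    "body_temp": (36.0, 37.5),  # degrees Celsius
}

def is_anomalous(reading):
    """Return True if the reading falls outside its normal range."""
    low, high = NORMAL_RANGES[reading["sensor"]]
    return not (low <= reading["value"] <= high)

for reading in WEARABLE_READINGS:
    if is_anomalous(reading):
        # In the scenario above, the service would first request more
        # data from the wearable, then escalate to a doctor if needed.
        print(f"Anomaly in {reading['sensor']}: {reading['value']} - "
              f"requesting more data / advising a medical check")
```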

The fact that the big companies are offering their big-data platforms as open source to everybody is a sign that sometimes even the most aggressive businesses value collaboration more than competition. Understanding big data is one field that will require a shift from competition to cooperation.

I think that "big data" will introduce a new way of thinking about human life. It may sound like a horrible world in which machines produce data, machines read data, and then machines tell other machines what to do, and, yes, our body is the ultimate object of this process, and all these machines make it less "human". However, you can also find a Buddhist way to look at data. Our existence is a chaotic flow of data. Those data don't last in time; they are just instants of existence. All combined, they are "me". They are similar to the Buddhist "dharmas". Each dharma is relative to every other dharma; each dharma is caused by another dharma. The "Visuddhimagga" says: "Only suffering exists, but no sufferer is to be found. Acts are but there is no actor". In Buddhism no being "exists" for any period of time: each moment is an entirely new existence. I think that "big data" introduces a concept of human life that is somewhat similar. We think too often of data as just numbers, but those numbers actually represent real people. If i write that, unfortunately, 600,000 people, mostly children, are killed by malaria every year, that is not just a number: it is people, and it is also the people around them, mothers and sisters and wives, who are crying. I hope that in the future, when everything is interpreted in terms of data, we will be able to interpret data as people, not numbers.

Big Data requires a new way of thinking. The data come from all sorts of sources. A specialist (whether human or machine) cannot possibly absorb all of them. What is required is an interdisciplinary approach. In the 1930s two men pioneered "big science" in the USA: Vannevar Bush at MIT and Ernest Lawrence at UC Berkeley. Unfortunately, the motivation came from the war, but the beneficiary was actually peacetime society. Bush and Lawrence realized that solving big problems requires many minds: big science gathered together scientists from different disciplines. Out of that approach we got, for example, nuclear power and the Internet. Big science was an early application of "big data", except that in those days the data were in the minds of the scientists. The approach, however, will have to be similar: in order to use big data to solve big problems, we will need a similar interdisciplinary approach.

There is an even earlier example of solving big problems with big data: ancient China. I think China can be a model for the new way of thinking because China actually invented it many centuries ago. During the Tang and Song dynasties the ideal person was an interdisciplinary scholar: politician, historian, writer, painter, poet, calligrapher... The ideal person was supposed to study all (all) the classics, not just one or two. China invented the multimedia mind (and China also invented the multitasking mind!). The ideal person was in charge of solving the big problems of society, thanks to having absorbed so much knowledge from so many different fields. What has calligraphy got to do with solving big problems? It shapes your brain. If the brain is not right, you will never find the right solution. Every discipline helps create the right way of thinking. I think it was the right approach, and it is still the right approach today. Maybe China needs to rediscover its own approach to managing a complex society, except that this approach needs to be adapted to the age of big data (i.e. the ideal person needs to use machines, not just the "maobi", the traditional writing brush).

Improved statistical and computational methods, and improved visualization methods, are being developed at many universities; but these new methods serve a simple purpose: to make fast computation cheaper (big data requires expensive computers). The progress has been impressive: decoding the human genome originally took 10 years, but now there are startups that do it in less than a day. Stanford's most popular textbook for undergraduate computer science students is "Mining of Massive Datasets", whose second edition was published by Cambridge University Press in 2014: http://www.mmds.org/ There is no secret: anybody can use those methods to analyze big data. But new math will not give us more useful applications, just cheaper data analytics, and the reason is simple: mathematicians are not the ones who know the problems of the world. This is yet another field in which an interdisciplinary approach is required to come up with applications that are not just "data analytics". Yes, we need mathematicians; but we also need scholars from all the other disciplines. Solving problems in human society is not just a math test.

For example, Gary King, director of Harvard University's Institute for Quantitative Social Science, has assembled a team of sociologists, economists, physicians, lawyers, psychologists, etc. You can see the current line-up at http://www.iq.harvard.edu/team-profiles UC Berkeley has set up the Berkeley Institute for Data Science (BIDS), staffing it with ethnographers, neuroscientists, sociologists, economists, physicists, biologists, psychologists and even a seismologist: http://bids.berkeley.edu/ And in 2012 the USA launched the "Big Data Research and Development Initiative" to apply big data to government.
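To give a flavor of those methods: much of the "massive dataset" toolbox consists of one-pass, constant-memory algorithms that never need to store the data they analyze. The Python sketch below uses one classic example of this streaming style, Welford's online algorithm for mean and variance; it is chosen here only as an illustration, not taken from the textbook.

```python
import random

def streaming_mean_variance(stream):
    """Compute count, mean and variance in one pass, with O(1) memory."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # running sum of squared deviations
    variance = m2 / (n - 1) if n > 1 else 0.0
    return n, mean, variance

# The stream is a generator: the ten million readings are never stored.
readings = (random.gauss(100.0, 15.0) for _ in range(10_000_000))
n, mean, variance = streaming_mean_variance(readings)
print(n, round(mean, 2), round(variance, 2))
```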

It is a shame that so far (since the invention of computers) the main application of data analysis has been to maximize the profits of big corporations. These days applications of big-data analysis include the "recommendation engines" of Amazon and Alibaba, which use data about other customers to suggest what you should buy. In the USA, people got upset when the media learned that Target, one of the largest retail chains, used math to guess when women were pregnant. Target's algorithm recognized purchases typically related to expectant mothers, for the sole purpose of targeting them with special promotions. Is this all we can do with big data about pregnant women?
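Target never published its model, but the general idea reported in the press can be sketched in a few lines: certain purchases each nudge a "pregnancy score", and a high total triggers the promotions. All products and weights below are invented for illustration.

```python
# Toy sketch of a Target-style scoring model. The indicator products
# and their weights are entirely invented; a real model would be fitted
# to historical purchase data.

PREGNANCY_SIGNALS = {
    "unscented_lotion": 0.3,
    "calcium_supplement": 0.25,
    "zinc_supplement": 0.2,
    "large_tote_bag": 0.1,
}

def pregnancy_score(basket):
    """Sum the weights of the indicator products found in a basket."""
    return sum(PREGNANCY_SIGNALS.get(item, 0.0) for item in basket)

basket = ["unscented_lotion", "calcium_supplement", "bread"]
if pregnancy_score(basket) > 0.5:  # arbitrary threshold
    print("send baby-product coupons")
```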

There is a broader category of big-data applications: applications that guess the future. For example, using big data we could predict when pollution will reach a dangerous level without waiting for the day that it happens; we could predict where and when crime is more likely to happen and allocate police resources accordingly. Banks already use a "predictive" kind of big-data analysis when they want to determine if a customer deserves a loan: credit underwriting. A bank could decide to underwrite a loan in seconds by using all the data available on people like you. You are likely to behave like all the other people in your age group, income group, ethnic group, etc. The bank can use big data to determine if you can be trusted. These predictive applications typically look for associations: if you have the same purchasing history as many other people who defaulted on their credit-card payments, it is very likely that you will default too. In technical terms, they look for patterns and then try to build hypotheses.

But we are back to the problem that most data are "read" and analyzed by machines, not by humans. We have known for centuries that hypothesis-formation methods have a weakness: finding correlations in very large datasets is not difficult; what is difficult is understanding "causation". If all the people who caught the flu yesterday in Turin prefer black-and-white shirts, it doesn't mean that black-and-white shirts cause the flu, or that the sellers of black-and-white shirts are contagious: it may simply mean that they are all fans of the Juventus football club, whose official shirt is black and white. Half of the population of Turin are Juventus fans. Mathematicians who don't follow football would reach the wrong conclusion. Machines that know nothing about football would be even worse at reaching a conclusion. Instead, a human being who knows the city of Turin would realize that the correlation does not tell us much about causation, except that maybe the outbreak started at a stadium where Juventus played. This problem is as old as the science of statistics, but it becomes particularly vexing with huge datasets because in huge datasets the likelihood of accidental correlations is... huge.
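The last point can be demonstrated in a few lines of code. The Python sketch below generates purely random "traits" for a population and a purely random "flu" label, then searches for the trait that best correlates with the flu; with enough traits, a noticeable correlation always emerges, with zero causation behind it. (The population size, trait count and variable names are, of course, invented.)

```python
import random

random.seed(42)
PEOPLE, TRAITS = 1000, 1000

# A purely random outcome: who "caught the flu" (10% of the population).
has_flu = [random.random() < 0.1 for _ in range(PEOPLE)]

def phi(xs, ys):
    """Correlation (phi coefficient) between two boolean lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx, sy = (mx * (1 - mx)) ** 0.5, (my * (1 - my)) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

# Generate 1000 random, meaningless traits and keep the best correlation.
best = max(
    phi([random.random() < 0.5 for _ in range(PEOPLE)], has_flu)
    for _ in range(TRAITS)
)
print(f"strongest purely accidental correlation: {best:.3f}")
```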

Predictions based on big data can be especially useful in the medical and biotech fields, where the amount of available data is virtually infinite but sometimes we don't even store it in digital format. The human genome contains billions of base pairs. Our current knowledge of what all the genes in the human genome do, and how they interact with each other to cause diseases, is ridiculously minimal. For the record, biologists also study the microbiome, the bacteria that live inside us and are crucial to the proper functioning of our body (for example, digestion): there are 100 times more genes in the microbiome than in the genome. We don't know what those billions of base pairs mean, but we have more than seven billion people on this planet whose genomes can be compared to find out which combinations of genes are likely to be a problem and which combinations can confer immunity. Some people are immune to malaria. We can find out why by studying the distributions of those billions of base pairs. Stanford hosts a yearly conference titled "Big Data in Biomedicine" with the motto "Data science will shape human health for the 21st century". Google itself once analyzed search terms by region to predict outbreaks of influenza (the Google Flu Trends project); and DNAstack, a platform built on Google's cloud, studies genetic data from around the world to predict diseases: https://www.dnastack.com

Shamefully, a lot of the big data that are needed to provide useful applications to the public are owned by corporations that don't make them available to the researchers who could use them. There are also data all over our environment that could provide useful information and that we "waste". For example, the Sloan Foundation is funding a project to collect information about the microbes we humans leave on the touchscreen ticket machines of railway stations. Those microbes can yield a lot of information about the health situation in a city.
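The genome-comparison idea mentioned above can be sketched in miniature. The Python toy below compares how often a variant appears in "malaria-immune" people versus "susceptible" people and flags positions with a large frequency gap; real association studies use millions of positions and rigorous statistics, and all data here are invented (a fake protective variant is planted so the scan finds something).

```python
import random

random.seed(7)
POSITIONS = 100

# Toy genomes: 50 malaria-immune and 50 susceptible people, each genome
# just a list of random "A"/"T" letters at 100 positions.
immune = [[random.choice("AT") for _ in range(POSITIONS)] for _ in range(50)]
susceptible = [[random.choice("AT") for _ in range(POSITIONS)] for _ in range(50)]

# Plant a fake protective variant at position 42 so the scan finds it.
for genome in immune:
    genome[42] = "T"

def variant_frequency(genomes, pos, variant="T"):
    """Fraction of genomes carrying the variant at a given position."""
    return sum(g[pos] == variant for g in genomes) / len(genomes)

for pos in range(POSITIONS):
    gap = variant_frequency(immune, pos) - variant_frequency(susceptible, pos)
    if abs(gap) > 0.4:  # arbitrary threshold for this toy example
        print(f"position {pos}: frequency gap {gap:+.2f} (candidate variant)")
```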

Narnia: Do you agree with Jaron Lanier that only corporations make money out of everybody's data?

piero:

I agree, except that i am more interested in access to knowledge than in money. There has been a process of "democratizing" knowledge since the French Enlightenment, when the French philosophers compiled the "Encyclopédie" to share all the world's knowledge with ordinary people. Then Prussia introduced compulsory primary education, and all the other countries followed suit: education became mandatory for all children. Big data today could allow us to achieve the goal of fully democratizing knowledge. Instead, the only ones who are benefiting from big data are the big corporations (and some government agencies). My frustration is that ordinary people are not benefiting from big data. We use digital tools like the smartphone to augment our bodies with "prosthetic knowledge" (a term that i borrow from British blogger Rich Oglesby), but we don't do anything with the millions of data points that surround us. We don't even have access to the data that we generate. Big corporations collect and control those data for their own purposes (typically, to sell more lucrative advertising). Ordinary people are the object, not the subject, of big data.

Narnia: What about the "quantified self" movement? In 2007 writers Gary Wolf and Kevin Kelly of Wired magazine introduced the term "quantified self". The concept caught on quickly, and in 2011 the first international conference was held in Mountain View. The "quantified self" movement started from the premise that our lives continuously produce data. We can physically record those data by wearing sensors connected to computers. These data can then be used to document a person's life: self-tracking and self-monitoring. Isn't that an example of "big data" applied to ordinary lives?

piero:

Yes, this is actually a promising idea. At least it is a case in which we own the data that we produce. But you need to integrate them with other people's data in order to get truly life-changing applications; otherwise it is not clear what those data mean. Everything is relative to something else. If i say 5, it has no meaning. If i say 5 within the set 1,1,3,2,4,1,2, you see that it is the highest number. One example is "gamification", in which people "compete" to obtain the best data for some activity, for example running in the morning. I also like to think of self-quantification as a new and more scientific form of psychological therapy. By collecting data on yourself, you can actually discover behavior that you never consciously observed. If you log the things that you do during the day and the discussions that you have with your friends, you might be surprised by what you find. You may discover aspects of yourself that are obvious to all your friends but that you never realized. It is like keeping a diary of your life, but not written by you: written by someone who follows you nonstop. The data will tell you who you really are, and that helps you improve yourself. The psychologist has to guess who you are; the data tell you who you are. You can even classify your daily activities into categories such as hobby, creative thinking, reading, etc, and, at the end of the month, see how much of your life was spent on each one and then re-balance them to match your real goals. That is the easiest way to improve yourself.
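A minimal Python sketch of that last idea, with invented categories, goals and log entries:

```python
from collections import defaultdict

# A self-tracking log: each entry records a day, a category and hours.
# All categories, goals and entries below are invented for illustration.
activity_log = [
    {"day": 1, "category": "reading",  "hours": 1.0},
    {"day": 1, "category": "work",     "hours": 9.0},
    {"day": 2, "category": "hobby",    "hours": 0.5},
    {"day": 2, "category": "work",     "hours": 10.0},
    {"day": 3, "category": "creative", "hours": 0.0},
]

monthly_goals = {"reading": 20, "hobby": 15, "creative": 10}  # hours/month

# Total the month by category, then compare against the stated goals.
totals = defaultdict(float)
for entry in activity_log:
    totals[entry["category"]] += entry["hours"]

for category, goal in monthly_goals.items():
    print(f"{category:10s} {totals[category]:5.1f}h of {goal}h goal")
```
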
Post-interview notes. Reverse causation: whenever the captain turns on the "fasten your seatbelt" sign, the airplane starts shaking, so one could draw the "scientific" conclusion that the seatbelt sign causes the shaking.
This interview was complemented with another interview: Weidong Yang, founder of Kineviz.


