A History of Silicon Valley


These are excerpts from Piero Scaruffi's book
"A History of Silicon Valley"

(Copyright © 2016 Piero Scaruffi)

The Selfies (2011-16)

Big Data

The first companies to deal successfully with "big data" were probably the big two of the 2000s: Google and Facebook. It was becoming more and more apparent that their key contributions to technology were not so much the little features added here and there but the capability to manage in real time an explosive amount of data.

A Facebook team led by Avinash Lakshman and Prashant Malik developed Cassandra, leveraging technology from Amazon and Google, to solve Facebook's data management problems. Facebook gifted it to the open-source Apache community in 2008. DataStax, founded in 2010 in Santa Clara by Jonathan Ellis and Matt Pfeil, took Cassandra and turned it into a mission-critical database management system capable of competing with Oracle, the field's superpower.

A Google team led by Jeff Dean and Sanjay Ghemawat developed (in about 2004) MapReduce, a parallel, distributed programming model that provided massive scalability across a multitude of servers, a real-life problem for a company managing billions of search queries and other user interactions. In 2005 Doug Cutting, a Yahoo! engineer, and Mike Cafarella implemented a MapReduce service and a distributed file system (HDFS), collectively known since 2006 as Hadoop, for storage and processing of large datasets on clusters of servers. Hadoop was used internally by Yahoo! and eventually became another Apache open-source framework. The first startups to graft SQL onto Hadoop were Cloudera, formed in 2008 in Palo Alto by three engineers from Google, Yahoo! and Facebook (Christophe Bisciglia, Amr Awadallah and Jeff Hammerbacher) and later joined by Doug Cutting himself (Intel bought a large minority stake in Cloudera in 2014); and Hadapt, founded in 2011 in Boston by Yale students Daniel Abadi, Kamil Bajda-Pawlikowski and Justin Borgman. Other Hadoop-based startups included Qubole, founded in 2011 in Mountain View by two Facebook engineers, Ashish Thusoo and Joydeep Sen Sarma; and Platfora, founded in 2011 in San Mateo by Ben Werther. Qubole offered a cloud-based version of Apache Hive, the project that the founders had run at Facebook (since 2007) and that was made open-source in 2008. Hive sat on top of Hadoop and provided data analysis through SQL-like queries.
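The MapReduce idea can be illustrated with a toy word count. This is only a single-process sketch in Python, not Google's or Hadoop's implementation: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and a real system would run the map and reduce steps on many servers in parallel.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all the values emitted for one key."""
    return key, sum(values)


def word_count(documents):
    # In a real cluster each document (or block of a file) would be
    # mapped on a different server; here everything runs in one process.
    mapped = chain.from_iterable(map_phase(d) for d in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["big data big servers", "data on servers"])
print(counts)
```

Because each map call looks at one document only and each reduce call looks at one key only, both steps can be spread over as many machines as there are documents or keys, which is where the scalability comes from.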

Meanwhile, Google developed its own "big data" service, Dremel, announced in 2010 (but used internally since 2006). The difference between Hadoop and Dremel was simple: Hadoop processed data in batch mode, Dremel did it in real time. Dremel was designed to query extremely large datasets on the fly. Following what Amazon had done with its cloud service, Google opened its BigQuery service, a commercial version of Dremel, to the public in 2012, selling storage and analytics at a price per gigabyte. Users of the service could analyze datasets using SQL-like queries. Dremel's project leader Theo Vassilakis went on to found Metanautix with a Facebook engineer, Apostolos Lerios, in 2012 in Palo Alto.

At the same time that it disclosed Dremel, Google published two more papers that shed some light on its internal technologies for handling big data. Caffeine (2009) was about building the index for the search engine. The other one (2010) was about Pregel, a "graph database" capable of fault-tolerant parallel processing of graphs; the idea being that graphs were becoming more and more pervasive and important (the Web itself is a graph and, of course, so are the relationships created by social media). MapReduce not being good enough for graph algorithms, and the existing parallel graph software not being fault tolerant, Google proceeded to create its own. Google's Pregel, largely the creature of Grzegorz Czajkowski, used the Bulk Synchronous Parallel model of distributed computation introduced by Leslie Valiant at Harvard (eventually codified in 1990). The Apache open-source community came up with its own variation on the same model, the Giraph project.
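The Bulk Synchronous Parallel model can be sketched with a small vertex-centric computation in which every vertex ends up knowing the largest value in the graph. This is a single-process Python illustration under simplifying assumptions, not Pregel's or Giraph's actual API: the function name propagate_max and the dictionary-based graph are hypothetical, and the "barrier" between supersteps is simply the end of a loop iteration.

```python
def propagate_max(graph, values):
    """graph maps each vertex to its out-neighbors; values maps it to a number."""
    values = dict(values)

    # Superstep 0: every vertex sends its value to its neighbors.
    inbox = {v: [] for v in graph}
    for vertex, neighbors in graph.items():
        for n in neighbors:
            inbox[n].append(values[vertex])

    # Later supersteps: a vertex that learns of a larger value adopts it
    # and notifies its neighbors; otherwise it stays silent (votes to halt).
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for vertex, messages in inbox.items():
            if messages and max(messages) > values[vertex]:
                values[vertex] = max(messages)
                for n in graph[vertex]:
                    outbox[n].append(values[vertex])
        # Barrier: messages sent in this superstep are delivered in the next.
        inbox = outbox

    return values


chain_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
result = propagate_max(chain_graph, {"a": 3, "b": 6, "c": 2, "d": 1})
print(result)
```

The computation stops when no messages are in flight, i.e. when every vertex has gone quiet. In a distributed setting each vertex would live on some server and the barrier would be a cluster-wide synchronization point.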

The open-source project Apache Mesos, inspired by the Borg system developed at Google by John Wilkes since 2004 to manage Google's own data centers, was conceived at UC Berkeley to manage large distributed pools of computers and was used and refined at Twitter. In 2014 in San Francisco a veteran of Twitter and Airbnb, Florian Leibert, founded Mesosphere to commercialize Mesos. Meanwhile at Google the old project Borg evolved into Omega. Apache Spark, a project started in 2009 by Matei Zaharia at UC Berkeley, was a platform for large-scale data processing. Zaharia later founded his own company, Databricks, but the project survived and in fact grew. In 2015 IBM pledged 3,500 researchers to Apache Spark while open-sourcing its own SystemML machine-learning technology.

The old field of "business intelligence" kept mutating, or at least changing name. As "data mining" and "data analytics" became obsolete terms, a new one was coined: "data science". For example, Looker Data Sciences, founded in 2012 in Santa Cruz by Lloyd Tabb and Ben Porterfield, provided a business-intelligence tool to dig into big data and make sense of them. At that point "big data" were mostly stored on high-performance data warehouses such as Amazon Redshift (2013, powered by technology acquired from ParAccel), Google BigQuery (2012), HP Vertica (built on top of Hadoop), IBM Netezza, and Teradata.

The world actually didn't have enough data, particularly from the developing world, a fact that skewed research and hampered remedies to problems. Premise, founded in 2012 in San Francisco by MetaMarkets' co-founder David Soloff and MetaMarkets' chief scientist Joe Reisinger, harnessed the power of the crowd to collect economic data around the world, provided in real time by ordinary individuals armed with smartphones.

As disk-based database management systems could no longer provide real-time answers when data were too big, a new paradigm came out of Germany with SAP's High Performance Analytics Appliance (HANA), first demonstrated in 2008, and Exasol's EXASolution: in-memory databases. Another pioneer was Starcounter in Sweden. These were distributed platforms for big data that stored data in RAM distributed across many processing units. Oracle had TimesTen, originally a Hewlett-Packard spinoff that Oracle acquired in 2005, and VMware had the technology developed by Salvatore Sanfilippo in Italy as the open-source project Redis in 2009 (he was hired by VMware in 2010). GridGain Systems, started in 2007 by Nikita Ivanov and Dmitriy Setrakyan in Pleasanton (and later relocated to Foster City), gifted its Apache Ignite in-memory database project to the open-source community in 2015.
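The principle of keeping data in RAM and spreading it across several units can be sketched with Python's standard sqlite3 module, which supports databases that live entirely in memory. This is only an illustration and says nothing about how HANA, EXASolution or the other products were built: the table layout, the sample rows, and the helper names make_shard and total_by_region are invented for the example, and the "units" are simply two in-memory databases inside one process.

```python
import sqlite3


def make_shard(rows):
    """Create one in-memory database (nothing is written to disk)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


def total_by_region(shards):
    """Run the same aggregate on every shard, then merge the partial sums."""
    totals = {}
    query = "SELECT region, SUM(amount) FROM sales GROUP BY region"
    for conn in shards:
        for region, amount in conn.execute(query):
            totals[region] = totals.get(region, 0) + amount
    return totals


# Hypothetical sample data, split round-robin across two shards.
rows = [("north", 10), ("south", 5), ("north", 7), ("east", 3)]
shards = [make_shard(rows[i::2]) for i in range(2)]

totals = total_by_region(shards)
print(totals)
```

Each shard answers from memory without touching a disk, and the partial results are small enough to merge quickly, which is the basic reason this style of system can answer analytical queries in real time.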

Hadoop was born for a restricted number of companies (such as Facebook and Google) that faced an exponential growth in data management, but a decade later the whole world had the same problem. That created an opportunity for startups to develop tools that could help deal with the technicalities of Hadoop. For example, Qubole, founded in 2011 in Santa Clara by former Facebook engineers Joydeep Sen Sarma (who jumpstarted the Apache Hive project) and Ashish Thusoo, offered a higher-level service to analyze data stored in Hadoop or Spark on the cloud.

ThoughtSpot was founded in 2012 in Palo Alto by Ajeet Singh (a former co-founder of Nutanix) to develop a search engine specifically for big-data apps, basically the "business intelligence" software for the age of big data (by 2017 it became a unicorn). Its competitors were San Francisco-based Splunk, which had been in the business of real-time search over streams of data since 2002, way before it was called "big data", and Elasticsearch, first released in 2010 when its creator Shay Banon lived in the Netherlands, and later relocated to Mountain View. Elasticsearch was based on the open-source Apache Lucene created in 1999 by Xerox PARC and Excite alumnus Doug Cutting (later famous for Hadoop).

Lattice Data, a spin-off of Stanford's open-source DeepDive project, was founded by Mike Cafarella of Hadoop fame and Stanford University's professor Chris Re in 2015 in Menlo Park to turn unstructured data, such as text and images, into "structured" data, easier for traditional databases to manipulate (it was acquired by Apple in 2017).

Cask, founded in 2011 in Palo Alto, offered a user-friendly tool to build applications on Hadoop and Apache Spark.

Big data processing was a top concern of smart-city infrastructure. Kwh Analytics, founded in 2014 in San Francisco by Richard Matsui, offered an insurance product backed by European giant Swiss Re that reduced the risk of solar-energy installations by guaranteeing energy production. Swiftly, also founded in 2014 in San Francisco, developed software to help transit agencies and cities improve urban mobility. Aclima, founded in 2010 in San Francisco by Davida Herzl and Reuben Herzl, developed an air-quality mobile sensing platform in collaboration with Google: the mobile sensors were placed on Google's StreetView cars as they toured the city.
