Big Data in the Public Cloud
Before we arrive at the present, however, let’s look at how big data on the public cloud got started. The early innovators in big data infrastructure (Google, Microsoft, Yahoo, and Facebook) were of course large public cloud companies, but they ran on their own private infrastructures. Although the public cloud companies have been developing big data infrastructure since their inception, only more recently have big data workloads been running in the public cloud.
Data Processing in the Cloud
Data processing was the first big data workload to run on the public cloud. Amazon launched the core parts of Amazon Web Services (AWS), Elastic Compute Cloud (EC2) and Simple Storage Service (S3), in 2006. The Apache Hadoop project added support for running Hadoop on EC2 and S3 that same year. In his article, “Self-Service, Prorated Supercomputing Fun!” Derek Gottfrid described how the New York Times used Hadoop on AWS in 2007 to create PDFs from its archives; in 2008, the Times used Hadoop to process archived images as well. In 2008, Amazon launched its Elastic MapReduce (EMR) service for large-scale data processing, arguably the first big data service offered on the public cloud. In 2013, the company reported that 5.5 million clusters had been launched since 2010.
Other large public cloud companies soon followed suit. Google launched BigQuery, a Web service for querying massive datasets, in 2010. In 2012, the company launched Compute Engine, an infrastructure-as-a-service (IaaS) offering that lets users run existing big data infrastructure on Google’s virtual machines. That year, Qubole also launched its Hadoop-based big data service, Qubole Data Service. The following year, Microsoft launched both Azure IaaS and HDInsight, a cloud-based Hadoop service. Google recently announced Cloud Dataflow, a software development kit (SDK) and managed service for big and fast parallel data analysis pipelines.
Cloud-Based Data Stores
Scale-out, schemaless data stores were the next big data systems to run on the public cloud. Just as the Google File System (GFS) and MapReduce papers led to the development of Apache Hadoop, Google’s BigTable and Amazon’s Dynamo led to the Apache HBase and Apache Cassandra projects, respectively.
As with data processing, users have run these systems on cloud infrastructure (IaaS), and the public cloud companies have launched data store services. In 2012, Amazon added Apache HBase to its Elastic MapReduce (EMR) and launched its own managed NoSQL database, DynamoDB (which, despite the name, isn’t based on Dynamo). The next year, Google released Cloud Datastore, a managed solution for storing nonrelational data based on its High Replication Datastore (HRD), which appeared as part of Google App Engine in 2008. Microsoft recently announced Azure DocumentDB, a managed, highly scalable document database.
Native Cloud Services
Although it’s increasingly common to run big data infrastructure on cloud IaaS, cloud-native platform infrastructure services (for example, for data processing, query, and search) continue to flourish, and they increasingly support higher-level activities.
An early example of this was Google’s Prediction API service, launched in 2010. Google Prediction API implements supervised learning: the user submits labeled training data, and the service trains a model and then serves queries against it. Google Prediction API can be used for everything from document classification to building recommendation systems.
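The train-then-query workflow described above can be sketched in a few lines of plain Python. This is only an illustration of supervised learning in general, not the Google Prediction API or its client library: the class `TinyTextClassifier`, its `train` and `predict` methods, and the sample documents are all hypothetical, and the naive Bayes model is an assumption chosen for brevity.

```python
import math
from collections import Counter, defaultdict


class TinyTextClassifier:
    """Hypothetical stand-in for a hosted prediction service (not a real API)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.label_counts = Counter()            # label -> number of examples
        self.vocabulary = set()

    def train(self, examples):
        """Learn from (text, label) pairs, i.e. labeled training data."""
        for text, label in examples:
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.label_counts[label] += 1
            self.vocabulary.update(words)

    def predict(self, text):
        """Return the most likely label (naive Bayes, add-one smoothing)."""
        words = text.lower().split()
        total_examples = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            # Start from the log prior, then add one log likelihood per word.
            score = math.log(count / total_examples)
            label_total = sum(self.word_counts[label].values())
            denominator = label_total + len(self.vocabulary)
            for word in words:
                score += math.log((self.word_counts[label][word] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Step 1: the user submits labeled training data (made-up examples).
training_data = [
    ("quarterly revenue and profit report", "finance"),
    ("stock market and revenue forecast", "finance"),
    ("team wins the championship game", "sports"),
    ("player scores in the final game", "sports"),
]

# Step 2: a model is trained on that data.
model = TinyTextClassifier()
model.train(training_data)

# Step 3: queries are served against the trained model.
prediction = model.predict("revenue forecast for the quarter")
print(prediction)
```

In a hosted service, the training and query steps would be remote calls and the model would live on the provider’s infrastructure; the division of labor between the user (who supplies labeled data and queries) and the service (which trains and serves the model) is the same.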
Microsoft recently launched a preview of Azure Machine Learning (ML), a public cloud service for predictive analytics. Like Google Prediction API, it lets users build, test, and deploy models.
Startups have also built cloud services targeted at analysts and data scientists. Databricks recently launched Databricks Cloud, a service based on Apache Spark that’s designed to facilitate data scientists’ common tasks. Big data infrastructure is also increasingly being exposed to users via public cloud versions of popular analytics and business intelligence (BI) tools. Recent examples are MicroStrategy Cloud, SAS Cloud Analytics, Tableau Online, and Microsoft Power BI.