Bringing Big Data Systems to the Cloud
Big data represents a new paradigm of data management (collection, processing, querying, data types, and scale) that isn’t well served by traditional data management systems. Two distinct paradigms are emerging in the big data space: working with data at rest, and working with streams of data in flight. We’ll focus on data at rest for now.
The big data ecosystem has evolved quickly. Most big data systems today are built on Hadoop-based architectures (http://hadoop.apache.org), which are quickly becoming the center of the enterprise technology stack for data management. These architectures usually consist of several components: Hadoop Distributed File System (HDFS), MapReduce, YARN, and HBase, to name a few. For the purpose of this article, we’ll collectively refer to these as Hadoop. Terms like data lake and data hub refer to HDFS serving as the central storage system: its scale and economics make it possible to store data in full fidelity for long periods of time.
Cloud computing refers to a paradigm for infrastructure, platforms, and software consumption in which users consume from a shared pool of resources that someone else manages. Users pay for what they use. There are public cloud environments, such as Amazon Web Services (AWS), Google, and Microsoft Azure, as well as software offerings, such as OpenStack and VMware, that you can use to build your own private cloud. We’ll limit the discussion to public cloud for now.
We can divide cloud computing technologies into three levels: infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS). These service levels aren’t new, but technology has evolved to make the consumption patterns look different from how they looked in the past. AWS extended the paradigm of end users interacting with and consuming a service programmatically without any human involvement in the early 2000s [2]. Other vendors, such as Microsoft, Google, and IBM, have since forayed into this business as well.
Intersection of the Two Worlds
The worlds of big data and cloud computing (mostly IaaS) share some characteristics that make the intersection intuitive in some ways and counterintuitive in others.
Motivation and Considerations
There are several motivations for using cloud environments for big data deployments as well as some considerations.
- Cost. Total cost of ownership of infrastructure includes hardware, power, racks, hosting space, and the people managing the infrastructure. Public clouds benefit from economies of scale, and vendors often pass these savings on to customers, who can simply consume the infrastructure without worrying about the operational costs.
- Ease of use. Cloud computing is all about accessing resources programmatically and automating systems as much as possible. That’s not possible with bare-metal hardware, and ease of use is a big factor when deciding whether to deploy in cloud environments.
- Elasticity. Big data workloads are often spiky in nature. Users onboard new data sources and need to perform ad hoc processing to explore the datasets. This requires the ability to scale up the environment and perhaps scale down later on. With bare-metal infrastructure, you’ll have to provision for that burst requirement up front or wait for the IT team to provision new hardware. In cloud environments, you can scale up and down programmatically in a matter of minutes.
- Operations. In public cloud environments, operations are the cloud provider’s responsibility. Users don’t have to worry about operating the infrastructure. If the system fails, they can recover by provisioning more resources.
- Reliability. Some might argue that public cloud infrastructures are less reliable than bare-metal ones because virtual machines have a higher chance of going down than physical servers. The flip side is that you can provision a new virtual machine much faster than you can procure and provision a new server. With that, reliability comes down to how you architect your system for fault tolerance.
- Flexibility. Clouds offer different kinds of infrastructure configurations with minimal customization options. With bare-metal infrastructure, you can customize at the time of procurement. Having said that, most enterprises have standard infrastructure configurations that they use and customization is uncommon.
- Performance. Virtualization incurs a performance hit, especially for I/O-intensive workloads. This overhead has decreased rapidly in recent years. For certain workloads, the hit might still not be acceptable. For others, where slight variation and possibly lower performance are acceptable, cloud environments might be sufficient.
- Security and compliance. Security and compliance are important considerations for enterprise deployments. We could write several dedicated articles to cover all aspects. The key is that both cloud environments and Hadoop have been developing rapidly and have come a long way in catering to the various requirements.
- Location. Often, users want to keep their data close to where it’s generated. This could be because of the volume of data, where it’s accessed from, or restrictions on where it can be moved. For example, certain kinds of data generated in China can’t be transferred outside the country. Public cloud environments offer the flexibility of having deployments in multiple locations without needing your own datacenters.
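To make the elasticity point above concrete, here is a minimal Python sketch of the kind of scaling decision a cluster operator might automate. It calls no real cloud API. The function names (desired_nodes, scaling_action), the per-node task capacity, and the node limits are all hypothetical, chosen only for illustration.

```python
import math


def desired_nodes(pending_tasks: int, tasks_per_node: int,
                  min_nodes: int, max_nodes: int) -> int:
    """Return how many worker nodes the backlog calls for, within limits.

    All parameters are illustrative assumptions, not values from any
    specific cloud provider or Hadoop distribution.
    """
    if tasks_per_node <= 0:
        raise ValueError("tasks_per_node must be positive")
    # Nodes needed to work through the current backlog, rounded up.
    needed = math.ceil(pending_tasks / tasks_per_node)
    # Clamp to the configured floor and ceiling.
    return max(min_nodes, min(max_nodes, needed))


def scaling_action(current_nodes: int, target_nodes: int) -> str:
    """Describe the change required to move from current to target size."""
    delta = target_nodes - current_nodes
    if delta > 0:
        return f"add {delta} nodes"
    if delta < 0:
        return f"remove {-delta} nodes"
    return "no change"


if __name__ == "__main__":
    # A burst of ad hoc processing arrives, then drains away.
    burst = desired_nodes(pending_tasks=100, tasks_per_node=10,
                          min_nodes=2, max_nodes=20)
    idle = desired_nodes(pending_tasks=0, tasks_per_node=10,
                         min_nodes=2, max_nodes=20)
    print(scaling_action(current_nodes=4, target_nodes=burst))
    print(scaling_action(current_nodes=burst, target_nodes=idle))
```

With a backlog of 100 pending tasks, 10 tasks per node, and limits of 2 to 20 nodes, the sketch asks for 10 nodes; once the backlog drains, it falls back to the 2-node floor. In a real deployment, the result would be handed to the provider’s provisioning interface, which is the step that bare-metal infrastructure can’t perform in minutes.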
Intersection in Practice
Let’s look at how the intersection of the two paradigms exists today and where future opportunities exist.
- Consumption paradigms. Two kinds of consumption paradigms exist for big data systems in public cloud environments.
In a hosted system, the vendor hosts the infrastructure on which big data software is deployed. Examples of this are enterprises deploying their own software in AWS, Azure, and Google.
In a managed and hosted system, the vendor hosts, operates, and manages the big data deployment and infrastructure for you. This could entail anything from provisioning to debugging the environment when things fail. Examples include Amazon Elastic MapReduce, Qubole Data Service, and Altiscale.
- Architectural considerations. The key architectural consideration in this intersection is the choice of persistent storage.