Tools for Big Data
Hardware and software are both fundamental to handling data well. On the hardware side, technologies such as Massively Parallel Processing (MPP) architectures help speed up data processing.
However, managing unstructured or semi-structured data requires other technologies, such as MapReduce or Hadoop, which can handle structured, unstructured and semi-structured information alike. Tools for Big Data must be able to process massive data sets within a reasonable computation time and with adequate precision.
1. Apache Hadoop
Hadoop is a framework that enables the distributed processing of large datasets across clusters of computers using simple programming models. It supports several operating systems and is commonly deployed on any cloud platform.
It has two main components: a distributed file system (HDFS) spanning the cluster nodes, used for file storage, and the MapReduce programming infrastructure. HDFS provides storage that is fault tolerant and highly available, while MapReduce allows the creation of algorithms that extract value from the analyzed data through the study of results.
2. MapReduce
MapReduce was designed by Google in 2003 and is considered the pioneering platform for massive data processing, as well as a paradigm for processing data by partitioning data files. It is used in solutions where large amounts of information, up to petabyte volumes, can be processed in parallel on the same hardware, while giving the user easy and transparent management of the underlying cluster's resources.
MapReduce divides processing into two functions: Map and Reduce. Each phase uses key-value pairs as inputs and outputs. The main elements of each phase are:
Map Function. This phase ingests and transforms the input data, and the input records can be processed in parallel. The system reads key-value pairs directly from the distributed file system and transforms them into intermediate pairs using a user-defined function. Each node is responsible for reading and transforming the pairs of one or more partitions.
Reduce Function. The master node groups the intermediate pairs by key and distributes the combined results to the Reduce processes on each node. The reduce function is applied to the list of values associated with each key and generates an output value.
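The two phases described above can be illustrated with a minimal word-count sketch in plain Python. This is only an illustration of the paradigm, not any actual MapReduce API; the function names and sample documents are invented for the example.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit an intermediate (key, value) pair for every word."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Group intermediate pairs by key, as the master node does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the list of values for a key into one output value."""
    return (key, sum(values))

documents = ["big data tools", "big data processing"]
intermediate = [pair for doc in documents for pair in map_phase(doc)]
result = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
# result == {"big": 2, "data": 2, "tools": 1, "processing": 1}
```

In a real cluster the map calls run in parallel on the nodes holding each partition, and the shuffle moves intermediate pairs across the network before the reducers run.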
3. Apache Storm
It is a distributed, open-source system whose advantage is handling data processing in real time, in contrast to Hadoop, which is designed for batch processing.
Apache Storm makes it possible to build distributed real-time processing systems that can process unbounded data streams quickly (it registers more than one million processed tuples
per second per node). It is highly scalable, easy to use and guarantees low latency (processing a very high volume of data messages with minimal delay), and it provides a very simple architecture for building applications, called a topology.
Storm is based on a topology composed of a network of spouts, bolts and streams. A spout is a source of streams, and bolts process input streams to produce output streams. Storm can be used in many scenarios, such as real-time analytics, online machine learning, continuous computation, distributed RPC and ETL, among others.
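The spout-and-bolt topology can be sketched in plain Python, with generator functions standing in for Storm components. This is only an analogy for the data-flow model, not Storm's actual API, and the sentences and function names are invented for the example.

```python
def sentence_spout():
    """Spout: a source of streams; emits tuples into the topology."""
    for sentence in ["storm processes streams", "streams of tuples"]:
        yield sentence

def split_bolt(stream):
    """Bolt: consumes an input stream and emits an output stream of words."""
    for sentence in stream:
        yield from sentence.split()

def count_bolt(stream):
    """Terminal bolt: aggregates the word stream into running counts."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Wiring spout -> bolt -> bolt mirrors how a topology chains components.
word_counts = count_bolt(split_bolt(sentence_spout()))
# word_counts["streams"] == 2
```

In Storm itself each component runs as parallel tasks across the cluster, and tuples flow between them continuously rather than terminating as they do in this finite sketch.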
4. Apache Spark
Spark was born as an alternative that addresses the limitations of MapReduce/Hadoop. It can load and query data quickly in memory, which makes it very useful for iterative processes, and it provides a simple programming model that supports a wide range of applications.
Apache Spark supports graph processing, streaming analytics, general batch processing, ad-hoc queries and machine learning, and it allows structured and semi-structured data to be queried using SQL.
Spark can perform more kinds of operations than Hadoop/MapReduce, which helps carry out Big Data projects with a smaller budget and more complex solutions. Among its main advantages, ease of use stands out, since it can be programmed in R, Python, Scala and even Java. Spark has its own cluster management system, so it uses Hadoop's HDFS only for storage.
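Spark's programming model centers on chaining transformations over an in-memory dataset, with the work deferred until a result is requested. The toy class below imitates that style in plain Python; `MiniRDD` is an invented illustration, not Spark's API.

```python
class MiniRDD:
    """Toy stand-in for a Spark RDD: transformations are recorded lazily,
    and an action (collect) runs the whole chain over in-memory data."""

    def __init__(self, data):
        self.data = list(data)    # held in memory, like a cached partition
        self.ops = []             # deferred transformations

    def map(self, fn):
        self.ops.append(("map", fn))
        return self

    def filter(self, fn):
        self.ops.append(("filter", fn))
        return self

    def collect(self):
        """Action: only now are the queued transformations applied."""
        items = self.data
        for kind, fn in self.ops:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

squares_of_evens = MiniRDD(range(10)).filter(lambda x: x % 2 == 0).map(lambda x: x * x).collect()
# squares_of_evens == [0, 4, 16, 36, 64]
```

Keeping the data in memory between the queued steps is what makes iterative algorithms (which revisit the same dataset many times) much cheaper in Spark than in disk-based MapReduce.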
5. Apache Flink
Flink is a project of the Apache Software Foundation, developed and supported by a community of more than 180 open-source contributors and used in production at several companies. It is an open-source stream processing framework that enables real-time streaming analytics over large volumes of data with a single technology.
Flink gives programmers great flexibility to correlate events through different notions of time (event time, ingestion time, processing time); it also offers low latency, high throughput, multi-language APIs, handling of out-of-order events, fault tolerance and consistency.
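The distinction between event time and arrival order can be illustrated with a small tumbling-window sketch in plain Python: events are grouped by the timestamp they carry, so out-of-order arrival does not change the result. This is a hand-rolled illustration of the concept, not Flink's API.

```python
def tumbling_windows(events, size):
    """Group (event_time, value) pairs into fixed-size windows keyed by
    event time, regardless of the order in which events arrive."""
    windows = {}
    for event_time, value in events:
        start = (event_time // size) * size   # window this event belongs to
        windows.setdefault(start, []).append(value)
    return windows

# Events arrive out of order; windowing by event time still groups them correctly.
events = [(12, "a"), (3, "b"), (17, "c"), (8, "d")]
windows = tumbling_windows(events, size=10)
# windows == {10: ["a", "c"], 0: ["b", "d"]}
```

A processing-time window, by contrast, would group events by when they happened to arrive, which is exactly the distinction Flink's time notions let the programmer choose between.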
6. Apache Flume
Flume is an ingestion, or data collection, tool commonly used with Hadoop. It is a distributed, reliable and available system that collects, aggregates and transfers data from many different sources to a centralized data store such as the Hadoop Distributed File System (HDFS). It has a simple, flexible architecture based on streaming data flows. Fault tolerance, a tunable reliability mechanism and a failure recovery service are among its features. Flume relies on a simple, extensible data model to handle massive distributed data sources.
Although Flume complements Hadoop well, it is an independent component that can work on other platforms. It is known for its ability to run several processes on a single machine. Using Flume, users can stream data from several high-volume sources (such as the Avro RPC source and syslog) into sinks (such as HDFS and HBase) for real-time analysis. In addition, Flume can transform each new batch of data before delivering it to the specified sink.
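Flume's architecture routes events from a source, through a buffering channel, to a sink. The sketch below mimics that source → channel → sink flow in plain Python using a queue as the channel; it is a conceptual illustration, not Flume's Java API, and the event names and `store` list are invented.

```python
from queue import Queue

def source(channel, records):
    """Source: ingests events from an external feed into the channel."""
    for record in records:
        channel.put(record)

def sink(channel, store):
    """Sink: drains the channel and writes events to a destination
    (a stand-in for something like HDFS or HBase)."""
    while not channel.empty():
        store.append(channel.get())

channel = Queue()   # the channel buffers events between source and sink
store = []          # stand-in for a centralized store such as HDFS
source(channel, ["evt1", "evt2", "evt3"])
sink(channel, store)
# store == ["evt1", "evt2", "evt3"]
```

The channel is what gives Flume its reliability: events stay buffered until the sink confirms delivery, so a slow or temporarily failed destination does not lose data.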