Tools for Data Analytics
This section presents several data analytics tools that are currently popular for storing, managing, and analyzing large amounts of data. The aim is to give a clearer perspective for choosing a suitable tool that makes better use of resources and data.
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. Among its characteristics:
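– Hardware: A Hadoop cluster may consist of hundreds or thousands of connected servers that store data and execute users' tasks. With so many components, the probability that some server has failed at any given time is high, so a certain percentage of the platform is always out of service. Hadoop is therefore designed to detect such failures and recover from them automatically, rather than letting the failure of a single server bring down the whole system.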
– Streaming: The applications that run on Hadoop are not general-purpose applications; they process data sets in batches, without interaction with the user.
– Data: Applications that run on this kind of tool usually work with very large data sets; Hadoop supports them with a large number of nodes across which the data is distributed. Because the information is available on several nodes, this distribution increases the aggregate bandwidth and gives high performance for data access and continuous delivery (streaming).
– Computation: An advantage Hadoop has over other systems is that information is processed in the same place where it is stored, which avoids congesting the network because the data does not have to be transmitted elsewhere to be processed.
– Portability: Hadoop, like many open-source applications, was designed to be easy to migrate across platforms.
– Accessibility: Hadoop's data can be accessed and browsed in several ways, including a Java programming interface and a web server.
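The Data and Computation points above can be illustrated with a small simulation. The sketch below is plain Python, not Hadoop code: the node names, the `BLOCKS` layout, and both helper functions are hypothetical. Each simulated node counts words in the block it stores locally, and only the compact partial counts are merged centrally, which is the idea behind moving the computation to the data.

```python
from collections import Counter

# Hypothetical layout: each node holds one block of a larger text file.
BLOCKS = {
    "node-1": "big data needs big storage",
    "node-2": "data moves slowly so code moves to data",
    "node-3": "storage nodes also run the code",
}


def map_local(block_text):
    """Runs on the node that stores the block; returns partial word counts."""
    return Counter(block_text.split())


def reduce_counts(partials):
    """Merges the small per-node results; raw blocks never leave their nodes."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# In a real cluster the map step would run in parallel on each node.
partials = [map_local(text) for text in BLOCKS.values()]
word_counts = reduce_counts(partials)
print(word_counts.most_common(3))
```

Only the per-node counters cross the simulated network here, so the traffic grows with the number of distinct words rather than with the size of the stored blocks.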
Applications where Hadoop is present: real-time response, whether for decision-making processes or immediate responses (fraud control, performance, etc.), as well as particular applications such as research and development, behavioral analysis, marketing and sales, failure detection in machinery, and the IoT universe.
MongoDB is a distributed, document-oriented NoSQL database manager, which means that it is a non-relational database. Data exchange in this manager is done by means of BSON, a binary representation of JSON-like documents used for data structuring and mapping. The manager is written in C++, runs on different operating systems, and is open source. It has the following characteristics:
• Flexible storage, since it is based on JSON-like documents and does not require schemas to be defined in advance.
• Multiple indexes may be created on any attribute, which makes it easier to use, since it is not necessary to define MapReduce or parallel processes.
• Queries are expressed as documents, and both reads and updates perform well.
• MongoDB has a high capacity for growth, replication, and scalability. More than one of these properties can be obtained by increasing the number of machines.
• Independent file storage support, for files of any size, based on GridFS, which is a storage specification implemented by all supported drivers.
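To make the binary representation used by BSON more concrete, the sketch below encodes a flat document by hand, following the publicly documented BSON layout: a little-endian 32-bit total length, a sequence of typed elements, and a trailing null byte. The helper `encode_bson_flat` is a hypothetical name for this example and handles only strings and 32-bit integers; a real application would rely on the BSON library shipped with its driver.

```python
import struct


def encode_bson_flat(doc):
    """Encode a flat dict of str -> (str | 32-bit int) values as BSON bytes."""
    body = b""
    for key, value in doc.items():
        name = key.encode("utf-8") + b"\x00"  # element name as a C string
        if isinstance(value, str):
            data = value.encode("utf-8") + b"\x00"
            # Type 0x02: UTF-8 string, prefixed by its byte length (incl. null).
            body += b"\x02" + name + struct.pack("<i", len(data)) + data
        elif isinstance(value, int) and not isinstance(value, bool):
            # Type 0x10: 32-bit signed integer, little-endian.
            body += b"\x10" + name + struct.pack("<i", value)
        else:
            raise TypeError(f"unsupported value type for key {key!r}")
    # The total length counts its own 4 bytes and the final null byte.
    return struct.pack("<i", len(body) + 5) + body + b"\x00"


encoded = encode_bson_flat({"hello": "world"})
print(len(encoded), encoded.hex())
```

The field names travel inside every encoded document, which is part of what lets documents in the same collection differ in shape without a schema defined in advance.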
MongoDB applications: It is suitable for building Internet applications that record large amounts of data, such as sensor data collection, social network infrastructure, and statistics collection (reporting), among others. In general, MongoDB may be used for almost any purpose that does not call for a rigid data structure.
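As an illustration of schema-free storage for a use case such as sensor data collection, the following sketch models readings as plain Python dictionaries with differing fields, builds an index on one attribute, and answers a query expressed as a document. It is an in-memory toy and not MongoDB's API: `TinyCollection`, `insert`, `create_index`, and `find` are hypothetical names chosen for this example, and the sensor identifiers and values are made up.

```python
from collections import defaultdict


class TinyCollection:
    """In-memory toy collection; documents are dicts with no fixed schema."""

    def __init__(self):
        self.docs = []
        self.indexes = {}  # field name -> {value: [positions in self.docs]}

    def create_index(self, field):
        """Index existing documents on `field`; later inserts keep it current."""
        index = defaultdict(list)
        for position, doc in enumerate(self.docs):
            if field in doc:
                index[doc[field]].append(position)
        self.indexes[field] = index

    def insert(self, doc):
        position = len(self.docs)
        self.docs.append(doc)
        for field, index in self.indexes.items():
            if field in doc:
                index[doc[field]].append(position)

    def find(self, query):
        """Return documents whose fields equal every key/value in `query`."""
        candidates = self.docs
        for field, value in query.items():
            if field in self.indexes:
                # Use the first available index to avoid a full scan.
                positions = self.indexes[field].get(value, [])
                candidates = [self.docs[i] for i in positions]
                break
        return [d for d in candidates
                if all(k in d and d[k] == v for k, v in query.items())]


readings = TinyCollection()
readings.create_index("sensor")
# Documents in the same collection may carry different fields.
readings.insert({"sensor": "t-01", "type": "temperature", "celsius": 21.5})
readings.insert({"sensor": "h-07", "type": "humidity", "percent": 40})
readings.insert({"sensor": "t-01", "type": "temperature", "celsius": 22.0,
                 "battery": "low"})

hits = readings.find({"sensor": "t-01", "battery": "low"})
print(hits)
```

The third reading adds a `battery` field without any change to the earlier documents, which is the lack of rigidity the paragraph above refers to.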
WEKA is a set of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a data set or called from your own Java code. WEKA is more than a single program. It is a collection of interdependent programs under the same user interface, with tools for data pre-processing, classification, regression, clustering, association rules, and visualization, and it also supports the development of new machine learning schemes.
In recent years it has been used in some cases for agricultural data analysis, for instance, to determine whether a set of rules can be found that models the decision factors in cattle culling. WEKA's basic characteristics are:
• Data pre-processing: In addition to its native file format (ARFF), WEKA supports several other formats (for instance, CSV and Matlab ASCII files) and database connectivity through JDBC.
• Classification: About 100 classification methods are available. Classifiers are divided into Bayesian methods, lazy methods (nearest neighbor), rule-based methods (decision tables, OneR, RIPPER), tree learners, function-based learners, and miscellaneous methods. In addition, WEKA includes meta-classifiers such as bagging, boosting, and stacking; multiple-instance classifiers; and interfaces for classifiers implemented in Groovy and Jython.
• Clustering: Unsupervised learning is supported by several clustering schemes, such as EM-based mixture models, k-means, and several hierarchical clustering algorithms.
• Attribute selection: The set of attributes used is essential for classification performance. Several selection criteria and search methods are available.
• Data visualization: Data may be inspected visually by plotting attribute values against the class or against other attribute values. There are also specialized visualization tools for specific methods.
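To give a feel for the rule-based methods named in the list above, here is a minimal sketch of the OneR idea in plain Python: for each attribute, predict the majority class of every attribute value, then keep the single attribute whose rule makes the fewest training errors. This is not WEKA's implementation; the function `one_r` and the small weather-style table are assumptions made for this example, and ties are broken in favor of the attribute listed first.

```python
from collections import Counter, defaultdict

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]

# Toy data: each row holds the four attribute values and the class (play).
ROWS = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]


def one_r(rows, attributes):
    """Return (attribute, rule, errors) for the best single-attribute rule."""
    best = None
    for col, name in enumerate(attributes):
        # Count class labels for every value this attribute takes.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[col]][row[-1]] += 1
        # The rule maps each attribute value to its majority class.
        rule = {value: classes.most_common(1)[0][0]
                for value, classes in counts.items()}
        # Every instance outside the majority class is a training error.
        errors = sum(sum(classes.values()) - max(classes.values())
                     for classes in counts.values())
        if best is None or errors < best[2]:
            best = (name, rule, errors)
    return best


attribute, rule, errors = one_r(ROWS, ATTRIBUTES)
print(attribute, rule, f"{errors}/{len(ROWS)} training errors")
```

A rule this simple is often used as a baseline: if a more elaborate classifier cannot beat it, the extra complexity may not be worth it.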
Besides Hadoop, MongoDB, and WEKA, there are other alternatives for managing large data stores, such as appliances. An appliance may be defined as a combination of hardware and software designed solely for managing, collecting, and analyzing large amounts of data.
An example of this is the Oracle Big Data Appliance.
Why choose MongoDB or Hadoop when both may fit a typical Big Data problem without any issues? The choice can depend on the characteristics of the project to be carried out, but in some cases there is no need to choose between these two tools. How can MongoDB and Hadoop be used together? They may be combined by using Hadoop for data processing and analysis, while MongoDB takes care of real-time operative
Other data analytics tools, such as WEKA and appliance-type products, are much more specific and limited, so they should be selected based on the requirements of the solution to a particular problem.