Introduction to Big Data
To define Big Data, we must trace its evolution over the years and link it to its characteristics. As the name implies, the term originally referred to data files too large to be handled by traditional databases. It was then extended to cover the difficulty of analyzing such data with traditional software algorithms. Big Data now denotes the whole value chain, which includes several stages: data generation, collection, acquisition, transportation, storage, preprocessing, processing, analytics, and visualization. The insights that can be extracted from this chain come from continuously growing data, using new techniques and new architectures.
According to many references, there is no clear and final definition of Big Data. One common description is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision-making. “Big Data” is also used to describe a massive volume of both structured and unstructured data that is difficult to process using traditional database and software techniques. It also refers to the technologies and storage facilities that an organization requires to handle and manage the large amounts of data that derive from multiple sources.
The data originates from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos uploaded to media portals, purchase transaction records, and cell phone GPS signals, to name a few. Gigantic volume, however, is not the only characteristic to consider.
Researchers have framed Big Data in many different ways, but all of these definitions revolve around five characteristics, the 5 V’s of Big Data: Variety, Volume, Velocity, Value, and Veracity.
The first characteristic of Big Data is Variety, which addresses the various sources that generate this data. The data is classified into the following categories:
Structured data: Structured data covers all data that can be stored in a table with rows and columns. It is considered the most organized data, but it accounts for only 5-10% of the total data available.
Semi-structured data: Semi-structured data is information that does not reside in tables but possesses some properties that make it convertible into structured data. Examples are web server logs and XML documents. It is less organized than structured data and also makes up only 5-10% of the data available.
Unstructured data: Unstructured data constitutes the biggest share of Big Data, at 80-90%. It includes text, images, video, voice recordings, web pages, emails, Word documents and all other multimedia content. Such data is very difficult to store in a database. Like structured and semi-structured data, it is both machine and human generated.
Multi-structured data: Data that is a mix of structured, semi-structured and unstructured data, for example operating system logs.
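To make the distinction between these categories concrete, the following is a minimal Python sketch. The sample CSV, JSON and text values are invented for illustration and do not come from any real dataset. It only shows that structured data maps directly onto rows and columns, semi-structured data needs a flattening step first, and unstructured data carries no schema at all.

```python
import csv
import io
import json

# Structured: fixed columns, maps directly onto table rows.
csv_text = "id,city,temp_c\n1,Oslo,4.5\n2,Cairo,31.0\n"  # hypothetical sample
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: self-describing, but the fields may differ per record.
json_text = '[{"id": 1, "tags": ["cold"]}, {"id": 2, "city": "Cairo"}]'
records = json.loads(json_text)
# Convert to a table by collecting every key and filling gaps with None.
columns = sorted({key for rec in records for key in rec})
table = [[rec.get(col) for col in columns] for rec in records]

# Unstructured: free text with no schema; any structure must be derived.
post = "Heavy rain in Cairo today, streets flooded!"  # hypothetical sample
word_count = len(post.split())
```

The flattening step is what the text means by semi-structured data being "convertible" into structured data: once every record shares the same columns, it can be stored in a table.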
Volume is the characteristic that makes data Big Data. It denotes the large amount of data generated every second. Data volumes have grown sharply, from terabytes to petabytes, exabytes and now zettabytes. Big Data can be measured in terms of:
1- Records per Area
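As a small illustration of the unit scale mentioned above, the sketch below converts a raw byte count into a readable unit. It assumes decimal (SI) prefixes, where each step is a factor of 1000. The function name `human_readable` is chosen here for illustration only.

```python
# Decimal (SI) units, each 1000 times larger than the previous one.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def human_readable(num_bytes: int) -> str:
    """Express a byte count in the largest unit that keeps the value below 1000."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit that fits, or at the last unit available.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000

terabytes = human_readable(2_500_000_000_000)
zettabyte = human_readable(10**21)
```

Under these assumptions, one zettabyte is 10^21 bytes, that is, one billion terabytes.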
As explained earlier, data arrives from multiple sources in huge amounts. Velocity is the characteristic of Big Data that describes the high rate at which data is generated.
Applications can be classified by data rate as follows:
a. Batch: Batch processing runs queries in a scheduled, sequential way without any intervention. Execution operates on a batch of input.
b. Real time: Real-time data is information that is delivered immediately after its collection, with no delay in its timeliness.
c. Interactive: Interactive processing executes tasks that require frequent user interaction.
d. Streaming: Streaming is the method of processing data as it comes in. Insight into the data is required as it arrives.
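The difference between batch and streaming processing can be sketched with a simple average. This is an illustrative example only: the readings are invented, and the function names `batch_mean` and `streaming_mean` are assumptions, not part of any particular framework.

```python
from typing import Iterable, Iterator

def batch_mean(readings: list[float]) -> float:
    """Batch: the whole input is available before processing starts."""
    return sum(readings) / len(readings)

def streaming_mean(readings: Iterable[float]) -> Iterator[float]:
    """Streaming: update the result incrementally as each reading arrives."""
    count = 0
    mean = 0.0
    for value in readings:
        count += 1
        # Incremental update, so earlier readings need not be kept in memory.
        mean += (value - mean) / count
        yield mean

readings = [20.0, 22.0, 21.0, 25.0]  # hypothetical sensor values
final = batch_mean(readings)
running = list(streaming_mean(iter(readings)))
```

The batch version produces one answer once all input has been collected, while the streaming version yields an up-to-date answer after every reading. Both agree on the final value.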
It is necessary to extract meaningful information or patterns from this huge amount of data, which can then be used for analysis or for answering queries. Value is thus the characteristic that denotes extracting meaning from Big Data. The value can be extracted from Big Data as:
The fifth V of Big Data, Veracity, concerns the correctness and accuracy of information. When dealing with Big Data, along with maintaining its privacy and security, it is also important to take care of data quality, data governance and metadata management. Factors that should be considered are:
III. Big Data Lifecycle
As mentioned earlier and depicted in Figure 3, the Big Data ecosystem is organized as a value chain lifecycle, from data inception to visualization.
The following is a brief description of the main stages of the Big Data lifecycle.
1. Data Generation/Inception: the phase where data is created. Many data sources are responsible for this data: electrophysiology signals, sensors used to gather climate information, surveillance devices, posts to social media sites, videos and still images, and transaction records, to name a few.
2. Data Acquisition: consists of data collection, data transmission, and data pre-processing.
a. Data Collection: the data is gathered in specific data formats from different sources: real-world measurements using sensors and RFID, or data from any source using a specifically designed script to crawl the web.
b. Data Transport: the collected data is transferred to storage data centers over interconnected networks.
c. Data Pre-Processing: consists of typical pre-processing activities such as data integration, enrichment, transformation, reduction, and cleansing.
3. Data Storage: the data center infrastructure where the data is stored and distributed among several clusters and geographically dispersed data centers. The storage systems provide several levels of fault tolerance to achieve reliability and efficiency.
4. Data Processing & Analytics: application of Data Mining algorithms, Machine Learning, Artificial Intelligence and Deep Learning to process the data and extract useful insight for better decision making.
Data scientists are the most likely users of this phase, since they have the expertise to apply the appropriate methods to the data that must be analyzed.
5. Data Visualization: the best way of assessing the value of processed data is to examine it visually and take decisions accordingly. Visualization methods are important in Big Data because they close the loop of the value chain.
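The lifecycle stages above can be sketched as a chain of small functions. This is a toy illustration under stated assumptions: the sensor records are invented, exact repeats are treated as duplicates, storage and transport are omitted, and the function names (`collect`, `preprocess`, `analyze`, `visualize`) are hypothetical rather than taken from any Big Data platform.

```python
from statistics import mean

def collect():
    """Data collection: hypothetical raw sensor readings, all fields as text."""
    return [
        {"sensor": "a", "temp_c": "21.5"},
        {"sensor": "a", "temp_c": ""},      # missing value
        {"sensor": "b", "temp_c": "19.0"},
        {"sensor": "b", "temp_c": "19.0"},  # exact repeat, assumed duplicate
        {"sensor": "b", "temp_c": "23.0"},
    ]

def preprocess(raw):
    """Pre-processing: cleansing (drop blanks, duplicates) and transformation."""
    seen = set()
    clean = []
    for rec in raw:
        key = (rec["sensor"], rec["temp_c"])
        if not rec["temp_c"] or key in seen:
            continue
        seen.add(key)
        # Transformation: convert the text field into a number.
        clean.append({"sensor": rec["sensor"], "temp_c": float(rec["temp_c"])})
    return clean

def analyze(records):
    """Processing and analytics: average temperature per sensor."""
    by_sensor = {}
    for rec in records:
        by_sensor.setdefault(rec["sensor"], []).append(rec["temp_c"])
    return {sensor: mean(values) for sensor, values in by_sensor.items()}

def visualize(summary):
    """Visualization: a plain-text bar chart, roughly one '#' per degree."""
    return [
        f"{sensor} {'#' * int(avg)} {avg:.1f}"
        for sensor, avg in sorted(summary.items())
    ]

summary = analyze(preprocess(collect()))
chart = visualize(summary)
```

Each function consumes the output of the previous stage, which mirrors the value chain: raw data only becomes useful for decisions after it has passed through cleansing, analysis and a visual summary.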