BIG DATA FUNDAMENTALS
In this article, we introduce some Big Data foundations and all the elements that cooperate to contribute to
I. Big Data overview
We have dealt with Big Data from the moment we started gathering data and storing it in different ways. Big Data is now considered in every domain: academia, industry, business, social media, and research. It has a lifecycle and a set of characteristics that need to be defined and followed.
1) Big Data Lifecycle
The figure below describes the most important stages that data goes through, from its inception to the purpose for which it was gathered. Data is created, collected, transported across interconnected networks, and saved in distributed storage around the world, chosen for the best price-to-quality ratio and a reliable network. It is then pre-processed to keep only good-quality data and forwarded to processing and analytics for insight extraction.
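The stages described above can be sketched as a chain of small functions. This is only a minimal illustration: the record fields, the sample values, and the function names `collect`, `preprocess`, and `analyze` are hypothetical, and an in-memory list stands in for distributed storage.

```python
from collections import defaultdict

def collect():
    """Stand-in for data inception and collection (hypothetical records)."""
    return [
        {"source": "sensor", "value": "21.5"},
        {"source": "sensor", "value": ""},     # missing reading
        {"source": "web", "value": "17.0"},
        {"source": "sensor", "value": "bad"},  # malformed reading
    ]

def preprocess(records):
    """Pre-processing: keep only records whose value parses as a number."""
    clean = []
    for record in records:
        try:
            value = float(record["value"])
        except ValueError:
            continue  # drop low-quality records
        clean.append({"source": record["source"], "value": value})
    return clean

def analyze(records):
    """Analytics: average value per source."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        totals[record["source"]] += record["value"]
        counts[record["source"]] += 1
    return {source: totals[source] / counts[source] for source in counts}

# Storage stage: a plain list stands in for distributed storage.
store = collect()
insights = analyze(preprocess(store))
print(insights)
```

In a real deployment each stage would run on separate infrastructure; the point here is only the order of the stages and the fact that pre-processing filters records before analytics sees them.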
2) Big Data Characteristics
In the annual McKinsey Global Institute report, three data dimensions characterizing Big Data were introduced: Volume, Velocity, and Variety, also called the 3 V’s. Since then, the number of dimensions has grown from 3 to 4, 5, 7, and even 10 V’s. The table below compiles the most important V’s that describe Big Data. Despite what the name suggests, Big Data is more than simply a matter of size; it is an opportunity to unearth insights that support beneficial decisions. Thus, Visualization, Variability, Volatility, Virality, Vulnerability, Viscosity, and Validity are extended characteristics.
II. Unstructured Big Data
To make decisions, we need relevant information that is extracted from data through processing and analysis. In this rich context, data exists in numerous formats and types, and comes from many sources and knowledge domains. Unstructured data is growing faster than structured data, which is explained by the number of Facebook posts, tweets, photos, and emails created every second.
1) Unstructured Data
At first glance, the name unstructured data implies mess, noise, and chaos in data organization. In fact, it refers to data that has no schema, no metadata, and no rules or constraints to follow when it is created, unlike data in a structured database model. It may have some basic low-level internal structure, but it has no pre-defined data model or schema. Unstructured data therefore has two meanings: no structure at all, or an unknown structure.
It may be textual or non-textual, and human- or machine-generated. It may also be stored in a non-relational (NoSQL) database. Table 2 illustrates some unstructured data domains and the data types they generate.
2) Unstructured Big Data Characteristics
In addition to the noticeable difference, which is the columnar data model of structured data, the major difference is how much easier structured data is to analyze than unstructured data. Existing pre-processing and analytics tools are very mature for structured data but still in an embryonic state for unstructured data. The Variety characteristic of Big Data covers the different data formats (e.g. documents, emails) that are not always stored in structured relational database systems. Unstructured Big Data (UBD) falls into two classes:
Human generated: text files, emails, social media, websites, mobile data messages, messaging chats, business applications, media files (audio, video, image)
Machine generated: scientific data, satellite imagery, digital surveillance, sensor data, network logs, IoT devices
Many different data sources in several domains feed this unstructured content, which takes the name of its data domain. It is therefore important to note that unstructured Big Data is also characterized by velocity and volume.
3) Unstructured Data Management
a) Textual (text, PDF, scanned docs, email body)
Unstructured textual data is transformed and explored using a combination of Text Mining techniques, such as data mining, machine learning, Natural Language Processing (NLP), information retrieval, and knowledge management. Moreover, search engine tools are used for indexing, cataloguing, and categorizing, which makes it easy to search the information and text that characterize unstructured data. Other techniques, ranging from text analytics and OCR to the detection and discovery of patterns, terms, and topics, are also used to structure the textual data.
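As an illustration of the indexing step mentioned above, the following is a minimal sketch of an inverted index, the basic structure behind text search. The document identifiers and contents are made up, the tokenizer is deliberately naive, and the helper names `tokenize`, `build_index`, and `search` are hypothetical rather than part of any particular search engine.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents containing every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical unstructured documents (email bodies, scanned text).
docs = {
    "mail-1": "Quarterly sales report attached",
    "mail-2": "Please review the sales contract",
    "scan-7": "Contract renewal notice",
}
index = build_index(docs)
print(search(index, "sales contract"))
```

Production search tools add stemming, stop-word removal, and ranking on top of this structure; the sketch only shows how free text gains a searchable structure once it is indexed.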
b) Social Media (Twitter, Facebook), CRM
For Twitter data, sentiment analysis and opinion mining are well-known techniques applied to extract trends in a multitude of areas, such as elections and events. In CRM systems, semantic analysis is conducted on multi-source unstructured data to annotate, extract, and rate customer feedback.
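To give a feel for sentiment analysis on short posts, here is a minimal lexicon-based sketch. It is an assumption-laden toy: the word lists `POSITIVE` and `NEGATIVE` are tiny and hypothetical, and real systems typically rely on trained models or much larger lexicons and handle negation, sarcasm, and emojis.

```python
import re

# Hypothetical, deliberately tiny sentiment lexicons.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "slow"}

def sentiment_score(text):
    """Count positive terms minus negative terms in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives - negatives

def label(text):
    """Turn the score into a coarse sentiment label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for post in [
    "I love the new release, great work",
    "Terrible support and slow delivery",
    "The package arrived on Tuesday",
]:
    print(label(post), "|", post)
```

Aggregating such labels over many posts on one topic is what allows trends to be extracted, for example around an election or an event.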
c) Media (Video, Audio, Image)
Digital photos, videos, and audio files are stored in structured formats such as JPG/PNG, MOV/MP4, and WAV/MP3, respectively. However, these formats do not express any information about what the data contains; the data must be processed to comprehend its meaning. Automatic tagging, labeling, and indexing of media data, after analysis and processing, helps to search within media files efficiently. Processing this kind of unstructured data requires advanced algorithms for image, audio, speech, and video processing to extract patterns or any information that can be indexed.
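The distinction between a structured container and unstructured content can be made concrete with a small sketch that identifies a media format from the first bytes of a file. The function name `detect_media_format` is hypothetical, and the checks are simplified assumptions: for instance, some older MOV files carry no `ftyp` marker, and MP3 files without an ID3 tag are not recognized here.

```python
def detect_media_format(header: bytes) -> str:
    """Guess the container format from the leading bytes of a file."""
    if header.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if header.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "wav"
    if header[4:8] == b"ftyp":
        return "mp4/mov"
    if header.startswith(b"ID3"):
        return "mp3"
    return "unknown"

# Hand-built headers stand in for the first bytes read from real files.
samples = {
    "photo": b"\x89PNG\r\n\x1a\n" + b"\x00" * 8,
    "clip": b"\x00\x00\x00\x18ftypmp42",
    "voice": b"RIFF\x24\x00\x00\x00WAVEfmt ",
}
for name, header in samples.items():
    print(name, "->", detect_media_format(header))
```

The header reveals how the file is encoded but nothing about what the photo shows or what is said in the recording, which is why the content itself still has to be analyzed before it can be tagged and indexed.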