Big Data

This lesson draws on Wikipedia and a Coursera course by BCG and the University of Virginia.

Introduction

  • Big data is a field that treats ways to
    • analyze,
    • systematically extract information from, or otherwise
    • deal with data sets that are too large or complex to be dealt with by traditional data-processing application software.
  • The dictionary definition of big data is
    • extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions.
  • Big data analysis challenges include
    • capturing data,
    • data storage,
    • data analysis,
    • search,
    • sharing,
    • transfer,
    • visualization,
    • querying,
    • updating,
    • information privacy, and
    • data source.
  • Big data was originally associated with three key concepts:
    • volume
    • variety
    • velocity.
  • Current usage of the term big data tends to refer to the use of
    • predictive analytics,
    • user behavior analytics, or
    • certain other advanced data analytics methods that extract value from big data.
  • The term now seldom refers to a particular size of data set.
  • There is little doubt that the quantities of data now available are indeed large, but that’s not the most relevant characteristic of this new data ecosystem.
    • Analysis of data sets can find new correlations to
      • spot business trends
      • prevent diseases
      • combat crime and so on
  • Scientists, business executives, medical practitioners, advertisers, and governments alike regularly
    • meet difficulties with large data-sets in areas including
      • Internet searches, fintech, healthcare analytics, geographic information systems, urban informatics, and business informatics.
  • The size and number of available data sets have grown rapidly as data is collected by devices such as
    • mobile devices, cheap and numerous information-sensing Internet of things devices, aerial (remote sensing) equipment, software logs, cameras, microphones, radio-frequency identification (RFID) readers, and wireless sensor networks.
  • Relational database management systems and desktop statistical software packages used to visualize data often have difficulty processing and analyzing big data.
  • The processing and analysis of big data may require “massively parallel software running on tens, hundreds, or even thousands of servers”.
  • What qualifies as “big data” varies depending on the capabilities of those analyzing it and their tools.
  • Furthermore, expanding capabilities make big data a moving target.
  • For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration.
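
The "massively parallel" processing mentioned above usually follows a split, process, and combine pattern. The following is a minimal sketch of that pattern in Python, not a production setup: it uses threads on one machine as a stand-in for many servers, and the function names (`split_into_chunks`, `map_chunk`, `parallel_word_count`) and the sample data are assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(records, n_chunks):
    """Partition the records into roughly equal chunks, one per worker."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_chunk(chunk):
    """Map step: count words within one chunk, independently of the others."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(records, n_workers=4):
    """Run the map step on every chunk concurrently, then merge the results."""
    chunks = split_into_chunks(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(map_chunk, chunks))
    # Reduce step: combine the per-chunk counts into one total.
    return reduce(lambda a, b: a + b, partial_counts, Counter())


# Hypothetical sample data standing in for a much larger data set.
sample_logs = [
    "user login success",
    "user login failure",
    "user logout success",
]
word_counts = parallel_word_count(sample_logs, n_workers=2)
```

Because each chunk is processed without reference to the others, the same map step could in principle run on separate machines, which is what lets this style of computation scale out.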

Definition

  • The term big data has been in use since the 1990s, with some giving credit to John Mashey for popularizing the term.
  • Big data usually includes
    • data sets with sizes beyond the ability of commonly used software tools
    • to capture, curate, manage, and process data within a tolerable elapsed time.
  • Big data philosophy encompasses
    • unstructured
    • semi-structured and
    • structured data; however, the main focus is on unstructured data.
  • Big data “size” is a constantly moving target;
    • as of 2012, it ranged from a few dozen terabytes to many zettabytes of data.
    • Big data requires a set of techniques and technologies with new forms of integration to reveal insights from data-sets that are diverse, complex, and of a massive scale.
  • Vs
    • “Veracity”, “value”, and various other “Vs” have been added by some organizations to the original three (volume, variety, velocity), a revision challenged by some industry authorities.
    • The Vs of big data were often referred to as the “three Vs”, “four Vs”, and “five Vs”.
    • They represented the qualities of big data in volume, variety, velocity, veracity, and value.
    • Variability is often included as an additional quality of big data.
  • A 2018 definition states
    • “Big data is where parallel computing tools are needed to handle data”.

Characteristics

  1. Volume
    • The quantity of generated and stored data.
    • The size of the data determines the value and potential insight, and whether it can be considered big data or not.
    • The size of big data is usually larger than terabytes and petabytes.[30]
  2. Variety
    • The type and nature of the data.
    • Earlier technologies such as RDBMSs were capable of handling structured data efficiently and effectively.
    • However, the change in type and nature from structured to semi-structured or unstructured challenged the existing tools and technologies.
    • The big data technologies evolved with the prime intention to
      • capture, store, and process
      • the semi-structured and unstructured (variety) data
      • generated with high speed (velocity), and huge in size (volume).
    • Later, these tools and technologies were also explored and used for handling structured data, though mainly for storage.
    • The processing of structured data remained optional: it could be done with either big data tools or traditional RDBMSs.
    • This helps in analyzing data towards effective usage of the hidden insights exposed from the data collected via social media, log files, sensors, etc.
    • Big data draws from text, images, audio, and video, and it completes missing pieces through data fusion.
  3. Velocity
    • The speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development.
    • Big data is often available in real-time.
    • Compared to small data, big data is produced more continually.
    • Two kinds of velocity related to big data are
      • the frequency of generation and
      • the frequency of handling, recording, and publishing.[31]
  4. Veracity
    • The truthfulness or reliability of the data, which refers to the data quality and the data value.[32]
    • Big data must not only be large in size, but also must be reliable in order to achieve value in the analysis of it.
    • The data quality of captured data can vary greatly, affecting an accurate analysis.[33]
  5. Value
    • The worth of the information that can be obtained by processing and analyzing large datasets.
    • Value also can be measured by an assessment of the other qualities of big data.[34]
    • Value may also represent the profitability of information that is retrieved from the analysis of big data.
  6. Variability
    • The characteristic of the changing formats, structure, or sources of big data.
    • Big data can include structured, unstructured, or combinations of structured and unstructured data.
    • Big data analysis may integrate raw data from multiple sources.
    • The processing of raw data may also involve transformations of unstructured data to structured data.
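
The transformation of unstructured data into structured data described under Variability can be sketched as follows. This is a minimal illustration in Python, assuming a hypothetical log layout of date, time, level, and free-text message; the pattern, the function names, and the sample lines are all invented for the example.

```python
import re
from typing import Optional

# Hypothetical log layout: "<date> <time> <LEVEL> <free-text message>".
LOG_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<message>.*)$"
)


def parse_line(line: str) -> Optional[dict]:
    """Turn one raw log line into a structured record, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def structure_logs(lines):
    """Keep the lines that fit the expected layout and set the rest aside for review."""
    records, rejected = [], []
    for line in lines:
        record = parse_line(line)
        if record is None:
            rejected.append(line)
        else:
            records.append(record)
    return records, rejected


# Hypothetical raw input, including one line that does not fit the layout.
raw_lines = [
    "2024-01-15 08:30:00 INFO sensor 7 reported 21.5C",
    "2024-01-15 08:30:05 ERROR sensor 9 timeout",
    "corrupted entry",
]
records, rejected = structure_logs(raw_lines)
```

Keeping the rejected lines rather than discarding them connects to the veracity point: lines that fail to parse are often the first sign of a data quality problem.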

Impact

  • Netflix
    • The competitive advantage of Netflix is not only that it makes videos available online, but that it improves the whole experience of discovering those videos in the first place.
    • Netflix collects enormous amounts of data and analyzes customer watching habits to generate personalized recommendations and offerings.
    • It goes even further: it analyzes what people like to watch and why, and uses this data as a basis for producing its own series.
    • Netflix might be an extreme example, since its whole business model is built on Big Data as a competitive advantage.
  • But Big Data can also bring incremental improvements to existing business across different areas of application.
  • The most common use of Big Data is probably personalization of offering.
    • Companies like Amazon are obvious examples.
    • But brick-and-mortar companies also use Big Data to get to know their customers and offer customized solutions.
    • You think you got those discount vouchers from your supermarket by chance? Think again.
  • Fraud Reduction is another example of how Big Data can be used to create value.
    • Credit card companies like Visa analyze billions of transactions to identify unusual patterns and thereby reduce fraud in real time.
    • According to Visa, that saves them $2 billion each year.
  • Big Data can also be used for Predictive Maintenance.
    • It means that a company can use the data it collects about operations to predict performance issues before they even happen.
    • This is extremely valuable especially in asset intensive industries.
    • An oil and gas client, for example, had hundreds of wells across three continents connected to an analytics platform that integrates data from those facilities and generates insight.
  • Big Data has many more application areas across all industries and functions.
    • The course authors estimate that leaders in Big Data generate an average of 12% more revenue than those who do not maximize their use of analytics.
    • Now, how do those leaders better manage to unlock value from Big Data?
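
The fraud reduction example above depends on spotting transactions that deviate from a customer's usual pattern. The sketch below shows one simple way to express that idea in Python, using a z-score against past amounts. It is an illustrative assumption, not a description of how Visa or any card network works; the function name `flag_unusual`, the threshold, and the sample history are hypothetical.

```python
from statistics import mean, stdev


def flag_unusual(amounts, new_amount, threshold=3.0):
    """Return True if new_amount is more than `threshold` standard
    deviations away from the mean of the past amounts."""
    if len(amounts) < 2:
        return False  # not enough history to judge
    mu = mean(amounts)
    sigma = stdev(amounts)
    if sigma == 0:
        # All past amounts were identical, so any different amount stands out.
        return new_amount != mu
    return abs(new_amount - mu) / sigma > threshold


# Hypothetical purchase history for one card, in dollars.
history = [20.0, 25.0, 22.0, 30.0, 18.0, 27.0]

typical = flag_unusual(history, 24.0)      # close to the usual range
suspicious = flag_unusual(history, 500.0)  # far outside the usual range
```

Real systems would weigh many more signals (location, merchant, timing) and would need to score each transaction quickly enough to act in real time, but the core idea of comparing new activity against an established baseline is the same.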