Big data

Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.

At the moment, data loaded in primary memory are of the order of gigabytes (\(10^9\) bytes), data stored on secondary memory are of the order of terabytes (\(10^{12}\) bytes), while data beyond this order (from petabytes, \(10^{15}\) bytes, and above) are considered big data.

A torrent of data

Data have become a torrent flowing into every area of the global economy. We can identify four landmark sources of this torrent:

  1. the birth of the World Wide Web in 1989
  2. the arrival of an array of social networks such as Facebook, launched in 2004, and Twitter, founded in 2006, among others
  3. the advent of smartphones, starting with the first-generation iPhone released in 2007
  4. the forthcoming materialization of the Internet of Things: millions of networked sensors are being embedded in the physical world in devices such as mobile phones, smart energy meters, automobiles, and industrial machines that sense, create, and communicate data

The role of big data

Big data has now reached every sector in the global economy:

  • like other essential factors of production such as hard assets and human capital, much of modern economic activity simply couldn’t take place without data
  • there is strong evidence that big data can play a significant economic role to the benefit not only of private commerce but also of national economies and their citizens (in particular in health care and government administration)

The 3 Vs

Big data are characterized by the three V’s:

  • Volume. The volume of data refers to the size of the data managed by the system. Data that are automatically generated (e.g., by networked sensors) tend to be voluminous
  • Velocity. The velocity is the speed at which data are created, accumulated, ingested, and processed
  • Variety. Big data includes structured, semistructured, and unstructured data.
    • Structured data feature a formal data model, such as the relational model, in which data are organized in tables containing rows and columns
    • Unstructured data have no identifiable formal structure and include text, audio, video, and images
    • Semistructured data do not have a rigid formal structure but fit a format with well-defined tags that separate semantic elements (e.g., HTML and, more generally, XML)

Big data concerns - Privacy and security

  • Personal data such as health and financial records are often those that can offer the most significant human benefits, such as helping to pinpoint the right medical treatment or the most appropriate financial product
  • however, consumers also view these categories of data as being the most sensitive
  • it is clear that there is a trade-off between privacy and utility
  • another closely related concern is data security, e.g., how to protect sensitive data or other data that should be kept private

Big data concerns - Legal issues

Big data solutions

  1. while the complete dataset might be big, the data needed to answer a specific question are often small; you might be able to find a sample or summary that fits in memory and still allows you to answer the question you're interested in
  2. you can scale up or scale out your hardware (more on this later)
  3. you can store your dataset (in secondary memory) in a database and use packages like dbplyr to work with remote database tables as if they were in-memory data frames (see the sketch after this list)
  4. you can take advantage of a cloud storage and computing system, like BigQuery, and access it from R with the bigrquery package
  5. finally, you can use a cluster computing platform that allows you to spread your data and your computations across multiple machines and work with packages like sparklyr
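
To make option 3 concrete, here is a minimal dbplyr sketch. The SQLite file flights.db, the flights table, and its columns are hypothetical; the same pattern works with any DBI-supported database backend.

```r
library(DBI)
library(dplyr)
library(dbplyr)

# open a connection to a (hypothetical) SQLite database stored on disk
con <- dbConnect(RSQLite::SQLite(), "flights.db")

# a remote table: no data is loaded into R memory at this point
flights_db <- tbl(con, "flights")

# dplyr verbs are translated to SQL and executed by the database;
# only the small aggregated result is brought into R by collect()
delays <- flights_db |>
  group_by(carrier) |>
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) |>
  collect()

dbDisconnect(con)
```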

Scaling up and out

  • speeding up computation is usually possible by buying faster or more capable hardware (say, increasing your machine memory, upgrading your hard drive, or procuring a machine with many more CPUs); this approach is known as scaling up
  • however, there are usually hard limits as to how much a single computer can scale up
  • we can consider spreading computation and storage across multiple machines. This approach provides the highest degree of scalability because you can potentially use an arbitrary number of machines to perform a computation. This approach is commonly known as scaling out
  • however, spreading computation effectively across many machines is a complex endeavor

A brief history of big data

  • as humans, we have been storing, retrieving, manipulating, and communicating information since the Sumerians in Mesopotamia developed writing around 3000 BC
  • mathematician George Stibitz used the word digital to describe fast electric pulses back in 1942
  • we describe information stored electronically as digital information
  • in contrast, analog information represents everything we have stored by any nonelectronic means such as handwritten notes, books, newspapers, and so on

Google File System

  • with the ambition to search all of this new digital information, many companies built what we know today as web search engines
  • search engines were unable to store all of the web-page information required to support web searches on a single computer
  • this meant that they had to split information into several files and store them across many machines
  • this approach became known as the Google File System, and was presented in a research paper published in 2003 by Google

MapReduce

  • one year later (2004), Google published a new paper describing how to perform operations across the Google File System, an approach that came to be known as MapReduce
  • users specify two functions: a map function, which processes a key/value pair to generate a set of intermediate key/value pairs
  • and a reduce function, which merges all intermediate values associated with the same intermediate key
  • both operations require custom computer code, but the MapReduce framework takes care of automatically executing them across many computers at once
  • counting words is often the most basic MapReduce example (a toy sketch follows this list), but we can also use MapReduce for much more sophisticated and interesting applications; for instance, we can use it to rank web pages in Google’s PageRank algorithm
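
As an illustration, here is a toy, single-machine word count written in plain R that mimics the map, shuffle, and reduce phases; it is only a sketch of the idea, not how a real MapReduce framework is programmed or executed.

```r
# two tiny "documents" standing in for a large distributed collection
docs <- c("big data is big", "data flows fast")

# map: each document emits intermediate (word, 1) pairs
map_fn <- function(doc) {
  words <- unlist(strsplit(doc, "\\s+"))
  lapply(words, function(w) list(key = w, value = 1L))
}
intermediate <- unlist(lapply(docs, map_fn), recursive = FALSE)

# shuffle: group the intermediate values by key
keys    <- vapply(intermediate, function(p) p$key,   character(1))
values  <- vapply(intermediate, function(p) p$value, integer(1))
grouped <- split(values, keys)

# reduce: merge all values associated with the same key
reduce_fn <- function(key, vals) sum(vals)
word_counts <- mapply(reduce_fn, names(grouped), grouped)
word_counts
```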

Hadoop

  • after these papers were released by Google, a team at Yahoo worked on implementing the Google File System and MapReduce as a single open source project
  • this project was released in 2006 as Hadoop, with the Google File System implemented as the Hadoop Distributed File System (HDFS)
  • the Hadoop project made distributed file-based computing accessible to a wider range of users and organizations, making MapReduce useful beyond web data processing

Hive

  • although Hadoop provided support to perform MapReduce operations over a distributed file system, it still required MapReduce operations to be written with code every time a data analysis was run
  • to improve upon this tedious process, the Hive project, released in 2008 by Facebook, brought Structured Query Language (SQL) support to Hadoop
  • this meant that data analysis could now be performed at large scale without the need to write code for each MapReduce operation; instead, one could write generic data analysis statements in SQL, which are much easier to understand and write (see the sketch below)
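
As an illustration, the word count from the MapReduce section can be expressed as a single SQL statement. This sketch assumes a Hive connection con has already been opened from R through DBI (for example via an ODBC driver), and a hypothetical table docs with a single column line holding one line of text per row.

```r
library(DBI)

# the query is sent to Hive, which compiles it into MapReduce jobs;
# only the resulting word counts come back to R
word_counts <- dbGetQuery(con, "
  SELECT word, COUNT(*) AS n
  FROM docs
  LATERAL VIEW explode(split(line, ' ')) w AS word
  GROUP BY word
")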

Spark

  • in 2009, Apache Spark began as a research project at UC Berkeley’s AMPLab to improve on MapReduce
  • Spark provided a richer set of verbs beyond MapReduce to facilitate optimizing code running in multiple machines
  • Spark also loaded data in memory, making its operations much faster than Hadoop’s on-disk processing
  • in 2010, Spark was released as an open source project and then donated to the Apache Software Foundation in 2013

Spark

  • a good use case for Spark is tackling problems that can be solved with multiple machines
  • for instance, when data does not fit on a single disk drive or into memory, Spark is a good candidate to consider
  • however, you can also consider it for problems that might not be large scale, but for which using multiple computers could speed up computation
  • for instance, CPU-intensive models and scientific simulations also benefit from running in Spark

Spark

  • Spark is good at tackling large-scale data-processing problems, usually known as big data
  • it is also good at tackling large-scale computation problems, known as big compute (tools and approaches using a large amount of CPU and memory resources in a coordinated way)
  • big data often requires big compute, but big compute does not necessarily require big data

Spark and R

  • when you think of the computational power that Spark provides and the ease of use of the R language, it is natural to want them to work together seamlessly
  • this is also what the R community expected: an R package providing an interface to Spark that is easy to use, compatible with other R packages, and available on CRAN; this package is sparklyr (a minimal example follows)
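
A minimal sparklyr sketch of this workflow; it assumes Spark is available locally (for example after running sparklyr::spark_install()) and uses the built-in mtcars data frame as a stand-in for a genuinely large dataset.

```r
library(sparklyr)
library(dplyr)

# connect to a local Spark instance
sc <- spark_connect(master = "local")

# copy a small local data frame to Spark; with real big data you would
# instead read directly from distributed storage (e.g., spark_read_csv())
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and executed by Spark
mtcars_tbl |>
  group_by(cyl) |>
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) |>
  collect()

spark_disconnect(sc)
```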

Learn more