Big data

Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.

At the moment, data loaded in primary memory are of the order of gigabytes (\(10^9\) bytes), data stored on secondary memory are of the order of terabytes (\(10^{12}\) bytes), while data beyond this order (from petabytes, \(10^{15}\) bytes, and above) are considered big data.

A torrent of data

Data have become a torrent flowing into every area of the global economy. We can identify four landmark sources of this torrent:

  1. the birth of the World Wide Web in 1989
  2. the arrival of an array of social networks such as Facebook, launched in 2004, and Twitter, founded in 2006, among others
  3. the advent of smartphones, starting with the first-generation iPhone released in 2007
  4. the forthcoming materialization of the Internet of Things: millions of networked sensors are being embedded in the physical world in devices such as mobile phones, smart energy meters, automobiles, and industrial machines that sense, create, and communicate data

The role of big data

Big data has now reached every sector in the global economy:

  • like other essential factors of production such as hard assets and human capital, much of modern economic activity simply couldn’t take place without data
  • there is strong evidence that big data can play a significant economic role to the benefit not only of private commerce but also of national economies and their citizens (in particular in health care and government administration)

The 3 Vs

Big data are characterized by the three V’s:

  • Volume. The volume of data refers to the size of the data managed by the system. Data that are automatically generated (e.g., by networked sensors) tend to be voluminous
  • Velocity. The velocity is the speed at which data are created, accumulated, ingested, and processed
  • Variety. Big data includes structured, semistructured, and unstructured data.
    • Structured data feature a formal data model, such as the relational model, in which data are organized in tables containing rows and columns
    • Unstructured data have no identifiable formal structure and include text, audio, video, and images
    • Semistructured data do not have a rigid formal structure but fit a format with well-defined tags that separate semantic elements (e.g., HTML and, more generally, XML)

Big data concerns - Privacy and security

  • Personal data such as health and financial records are often those that can offer the most significant human benefits, such as helping to pinpoint the right medical treatment or the most appropriate financial product
  • however, consumers also view these categories of data as being the most sensitive
  • it is clear that there is a trade-off between privacy and utility
  • another closely related concern is data security, e.g., how to protect sensitive data or other data that should be kept private

Big data concerns - Legal issues

Big data solutions

  1. while the complete dataset might be big, the data needed to answer a specific question are often small; you might be able to find a sample or summary that fits in memory and still allows you to answer the question you're interested in
  2. you can scale up or scale out your hardware (more on this later)
  3. you can store your dataset (in secondary memory) in a database and use packages like dbplyr to work with remote database tables as if they were in-memory data frames (see the sketch after this list)
  4. you can take advantage of a cloud storage and computing system, like BigQuery, and access it from R with the bigrquery package
  5. finally, you can use a cluster computing platform that allows you to spread your data and your computations across multiple machines and work with packages like sparklyr
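
To make option 3 concrete, here is a minimal dbplyr sketch. The SQLite file flights.db, the flights table, and its columns are hypothetical; the same pattern works with any DBI-supported database backend.

```r
library(DBI)
library(dplyr)
library(dbplyr)

# open a connection to a (hypothetical) SQLite database stored on disk
con <- dbConnect(RSQLite::SQLite(), "flights.db")

# a remote table: no data is loaded into R memory at this point
flights_db <- tbl(con, "flights")

# dplyr verbs are translated to SQL and executed by the database;
# only the small aggregated result is brought into R by collect()
delays <- flights_db |>
  group_by(carrier) |>
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) |>
  collect()

dbDisconnect(con)
```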

Scaling up and out

  • speeding up computation is usually possible by buying faster or more capable hardware (say, increasing your machine memory, upgrading your hard drive, or procuring a machine with many more CPUs); this approach is known as scaling up
  • however, there are usually hard limits as to how much a single computer can scale up
  • we can consider spreading computation and storage across multiple machines. This approach provides the highest degree of scalability because you can potentially use an arbitrary number of machines to perform a computation. This approach is commonly known as scaling out
  • however, spreading computation effectively across many machines is a complex endeavor

A brief history of big data

  • as humans, we have been storing, retrieving, manipulating, and communicating information since the Sumerians in Mesopotamia developed writing around 3000 BC
  • mathematician George Stibitz used the word digital to describe fast electric pulses back in 1942
  • we describe information stored electronically as digital information
  • in contrast, analog information represents everything we have stored by any nonelectronic means such as handwritten notes, books, newspapers, and so on

Google File System

  • with the ambition to search all of this new digital information, many companies built what we know today as web search engines
  • search engines were unable to store all of the web-page information required to support web searches on a single computer
  • this meant that they had to split information into several files and store them across many machines
  • this approach became known as the Google File System, and was presented in a research paper published in 2003 by Google

MapReduce

  • one year later (2004), Google published a new paper describing how to perform operations across the Google File System, an approach that came to be known as MapReduce
  • users specify two functions: a map function, which processes a key/value pair to generate a set of intermediate key/value pairs
  • and a reduce function, which merges all intermediate values associated with the same intermediate key
  • both operations require custom computer code, but the MapReduce framework takes care of automatically executing them across many computers at once
  • counting words is often the most basic MapReduce example (a toy sketch follows this list), but we can also use MapReduce for much more sophisticated and interesting applications; for instance, we can use it to rank web pages in Google’s PageRank algorithm
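
As an illustration, here is a toy, single-machine word count written in plain R that mimics the map, shuffle, and reduce phases; it is only a sketch of the idea, not how a real MapReduce framework is programmed or executed.

```r
# two tiny "documents" standing in for a large distributed collection
docs <- c("big data is big", "data flows fast")

# map: each document emits intermediate (word, 1) pairs
map_fn <- function(doc) {
  words <- unlist(strsplit(doc, "\\s+"))
  lapply(words, function(w) list(key = w, value = 1L))
}
intermediate <- unlist(lapply(docs, map_fn), recursive = FALSE)

# shuffle: group the intermediate values by key
keys    <- vapply(intermediate, function(p) p$key,   character(1))
values  <- vapply(intermediate, function(p) p$value, integer(1))
grouped <- split(values, keys)

# reduce: merge all values associated with the same key
reduce_fn <- function(key, vals) sum(vals)
word_counts <- mapply(reduce_fn, names(grouped), grouped)
word_counts
```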

Hadoop

  • after these papers were released by Google, a team at Yahoo worked on implementing the Google File System and MapReduce as a single open source project
  • this project was released in 2006 as Hadoop, with the Google File System implemented as the Hadoop Distributed File System (HDFS)
  • the Hadoop project made distributed file-based computing accessible to a wider range of users and organizations, making MapReduce useful beyond web data processing

Hive

  • although Hadoop provided support to perform MapReduce operations over a distributed file system, it still required MapReduce operations to be written with code every time a data analysis was run
  • to improve upon this tedious process, the Hive project, released in 2008 by Facebook, brought Structured Query Language (SQL) support to Hadoop
  • this meant that data analysis could now be performed at large scale without the need to write code for each MapReduce operation; instead, one could write generic data analysis statements in SQL, which are much easier to understand and write (see the sketch below)
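
As an illustration, the word count from the MapReduce section can be expressed as a single SQL statement. This sketch assumes a Hive connection con has already been opened from R through DBI (for example via an ODBC driver), and a hypothetical table docs with a single column line holding one line of text per row.

```r
library(DBI)

# the query is sent to Hive, which compiles it into MapReduce jobs;
# only the resulting word counts come back to R
word_counts <- dbGetQuery(con, "
  SELECT word, COUNT(*) AS n
  FROM docs
  LATERAL VIEW explode(split(line, ' ')) w AS word
  GROUP BY word
")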

Spark

  • in 2009, Apache Spark began as a research project at UC Berkeley’s AMPLab to improve on MapReduce
  • Spark provided a richer set of verbs beyond MapReduce to facilitate optimizing code running in multiple machines
  • Spark also loaded data in memory, making its operations much faster than Hadoop’s on-disk processing
  • in 2010, Spark was released as an open source project and then donated to the Apache Software Foundation in 2013

Spark

  • a good use case for Spark is tackling problems that can be solved with multiple machines
  • for instance, when data does not fit on a single disk drive or into memory, Spark is a good candidate to consider
  • however, you can also consider it for problems that might not be large scale, but for which using multiple computers could speed up computation
  • for instance, CPU-intensive models and scientific simulations also benefit from running in Spark

Spark

  • Spark is good at tackling large-scale data-processing problems, usually known as big data
  • it is also good at tackling large-scale computation problems, known as big compute (tools and approaches using a large amount of CPU and memory resources in a coordinated way)
  • big data often requires big compute, but big compute does not necessarily require big data

Spark and R

  • when you think of the computational power that Spark provides and the ease of use of the R language, it is natural to want them to work together seamlessly
  • this is also what the R community expected: an R package providing an interface to Spark that is easy to use, compatible with other R packages, and available on CRAN; this package is sparklyr (a minimal example follows)
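
A minimal sparklyr sketch of this workflow; it assumes Spark is available locally (for example after running sparklyr::spark_install()) and uses the built-in mtcars data frame as a stand-in for a genuinely large dataset.

```r
library(sparklyr)
library(dplyr)

# connect to a local Spark instance
sc <- spark_connect(master = "local")

# copy a small local data frame to Spark; with real big data you would
# instead read directly from distributed storage (e.g., spark_read_csv())
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and executed by Spark
mtcars_tbl |>
  group_by(cyl) |>
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) |>
  collect()

spark_disconnect(sc)
```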

Learn more