Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. This definition is intentionally subjective and incorporates a moving definition of how big a dataset needs to be in order to be considered big data (McKinsey report on big data, 2011).

At present, data loaded into primary memory are on the order of gigabytes (\(10^9\) bytes), data stored on secondary memory are on the order of terabytes (\(10^{12}\) bytes), while anything beyond this order (petabytes, \(10^{15}\) bytes, and above) is considered big data.
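For a rough sense of these scales, a back-of-the-envelope calculation (the row and column counts below are chosen purely for illustration) shows why a table of millions of rows still fits in primary memory while much longer ones do not:

\[
10^{7} \text{ rows} \times 100 \text{ columns} \times 8 \text{ bytes} = 8 \times 10^{9} \text{ bytes} \approx 8 \text{ GB},
\]

whereas the same table with \(10^{10}\) rows would need \(8 \times 10^{12}\) bytes (about 8 TB), pushing it to secondary storage, and \(10^{13}\) rows would reach petabyte scale.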

Small data first

This course proudly focuses on small, in-memory datasets. This is the right place to start because you can’t tackle big data unless you have experience with small data.

If your data is bigger, carefully consider whether your big data problem might actually be a small data problem in disguise. While the complete data might be big, often the data needed to answer a specific question is small. You might be able to find a subset, subsample, or summary that fits in memory and still allows you to answer the question that you're interested in. The challenge here is finding the right small data, which often requires a lot of iteration.
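As an illustration, the sketch below (in Python with pandas; the file name, chunk size, and sampling fraction are hypothetical) streams a large CSV in chunks and keeps only a random subsample that fits in memory:

```python
import pandas as pd

# Hypothetical file that is too large to load at once.
CSV_PATH = "transactions.csv"
SAMPLE_FRACTION = 0.01  # keep roughly 1% of the rows

pieces = []
# Stream the file in chunks so only one chunk is in memory at a time.
for chunk in pd.read_csv(CSV_PATH, chunksize=1_000_000):
    # Randomly subsample each chunk; the accumulated result stays small.
    pieces.append(chunk.sample(frac=SAMPLE_FRACTION, random_state=42))

small_data = pd.concat(pieces, ignore_index=True)
print(small_data.shape)
```

Whether a 1% subsample is enough depends entirely on the question being asked, which is exactly the iteration the paragraph above refers to.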

Another possibility is that your big data problem is actually a large number of small data problems. Each individual problem might fit in memory, but you have millions of them. For example, you might want to fit a model to each person in your dataset. That would be trivial if you had just 10 or 100 people, but instead you have a million. Fortunately, each problem is independent of the others (a setup that is sometimes called embarrassingly parallel), so you just need a system (like Spark) that allows you to send different datasets to different computers for processing. Once you have figured out how to answer the question for a single subset using small data tools, you can learn new tools to solve it for the full dataset.
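A minimal sketch of this embarrassingly parallel setup, using only the Python standard library rather than Spark (the per-person data and the "model" are made up for illustration), fits each group independently and spreads the work across local processes:

```python
from concurrent.futures import ProcessPoolExecutor
from statistics import mean

def fit_model(item):
    """Fit a trivial 'model' (here just the mean) to one person's data."""
    person_id, values = item
    return person_id, mean(values)

if __name__ == "__main__":
    # Hypothetical data: one small series of measurements per person.
    data = {f"person_{i}": [i, i + 1.0, i + 2.0] for i in range(1000)}

    # Each call to fit_model is independent of the others, so the problems
    # can run in parallel; a cluster system like Spark scales the same idea
    # beyond a single machine.
    with ProcessPoolExecutor() as pool:
        results = dict(pool.map(fit_model, data.items()))

    print(results["person_0"])
```

The key design point is that no call needs the results of any other call, which is what makes distributing the million small problems straightforward.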

Big data

Data have become a torrent flowing into every area of the global economy. We can identify four landmark sources of this torrent:

  1. The birth of the World Wide Web in 1989, when Tim Berners-Lee proposed the HyperText Markup Language (HTML), motivated by the need to keep track of experimental data at the European Organization for Nuclear Research (CERN).
  2. The arrival of an array of social networks such as Facebook, launched in 2004, and Twitter, founded in 2006.
  3. The advent of smartphones, starting with the first-generation iPhone released in 2007.
  4. The forthcoming materialization of the Internet of Things: millions of networked sensors embedded in the physical world, in devices such as mobile phones, smart energy meters, automobiles, and industrial machines that sense, create, and communicate data.

Big data has now reached every sector in the global economy, as documented, for instance, in the McKinsey report on big data cited above.

The Gartner Group, a research and advisory firm that industry looks to for technology trends, characterized big data in 2011 by the three V's: volume, velocity, and variety.

Many citizens around the world regard this collection of information with deep suspicion, seeing the data flood as nothing more than an intrusion into their privacy. As an ever larger amount of data is digitized and travels across organizational boundaries, a set of policy issues will become increasingly important, including, but not limited to, privacy, security, intellectual property, and liability.

Finally, a significant constraint on realizing value from big data will be a shortage of talent, particularly of people with deep expertise in statistics and machine learning, and the managers and analysts who know how to operate companies by using insights from big data.