Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. This definition is intentionally subjective and incorporates a moving definition of how big a dataset needs to be in order to be considered big data (McKinsey report on big data, 2011).

At present, data loaded into primary memory are on the order of gigabytes (\(10^9\) bytes), data stored on secondary memory are on the order of terabytes (\(10^{12}\) bytes), while anything beyond this order (petabytes, \(10^{15}\) bytes, and above) is considered big data.
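For a rough sense of these scales, a back-of-the-envelope calculation (the row and column counts below are chosen purely for illustration) shows why a table of millions of rows still fits in primary memory while much longer ones do not:

\[
10^{7} \text{ rows} \times 100 \text{ columns} \times 8 \text{ bytes} = 8 \times 10^{9} \text{ bytes} \approx 8 \text{ GB},
\]

whereas the same table with \(10^{10}\) rows would need \(8 \times 10^{12}\) bytes (about 8 TB), pushing it to secondary storage, and \(10^{13}\) rows would reach petabyte scale.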

Small data first

This course proudly focuses on small, in-memory datasets. This is the right place to start because you can’t tackle big data unless you have experience with small data.

If your data is bigger, carefully consider whether your big data problem might actually be a small data problem in disguise. While the complete data might be big, often the data needed to answer a specific question is small. You might be able to find a subset, subsample, or summary that fits in memory and still allows you to answer the question that you're interested in. The challenge here is finding the right small data, which often requires a lot of iteration.
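As an illustration, the sketch below (in Python with pandas; the file name, chunk size, and sampling fraction are hypothetical) streams a large CSV in chunks and keeps only a random subsample that fits in memory:

```python
import pandas as pd

# Hypothetical file that is too large to load at once.
CSV_PATH = "transactions.csv"
SAMPLE_FRACTION = 0.01  # keep roughly 1% of the rows

pieces = []
# Stream the file in chunks so only one chunk is in memory at a time.
for chunk in pd.read_csv(CSV_PATH, chunksize=1_000_000):
    # Randomly subsample each chunk; the accumulated result stays small.
    pieces.append(chunk.sample(frac=SAMPLE_FRACTION, random_state=42))

small_data = pd.concat(pieces, ignore_index=True)
print(small_data.shape)
```

Whether a 1% subsample is enough depends entirely on the question being asked, which is exactly the iteration the paragraph above refers to.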

Another possibility is that your big data problem is actually a large number of small data problems. Each individual problem might fit in memory, but you have millions of them. For example, you might want to fit a model to each person in your dataset. That would be trivial if you had just 10 or 100 people, but instead you have a million. Fortunately, each problem is independent of the others (a setup that is sometimes called embarrassingly parallel), so you just need a system (like Spark) that allows you to send different datasets to different computers for processing. Once you have figured out how to answer the question for a single subset using small data tools, you can learn new tools to solve it for the full dataset.
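A minimal sketch of this embarrassingly parallel setup, using only the Python standard library rather than Spark (the per-person data and the "model" are made up for illustration), fits each group independently and spreads the work across local processes:

```python
from concurrent.futures import ProcessPoolExecutor
from statistics import mean

def fit_model(item):
    """Fit a trivial 'model' (here just the mean) to one person's data."""
    person_id, values = item
    return person_id, mean(values)

if __name__ == "__main__":
    # Hypothetical data: one small series of measurements per person.
    data = {f"person_{i}": [i, i + 1.0, i + 2.0] for i in range(1000)}

    # Each call to fit_model is independent of the others, so the problems
    # can run in parallel; a cluster system like Spark scales the same idea
    # beyond a single machine.
    with ProcessPoolExecutor() as pool:
        results = dict(pool.map(fit_model, data.items()))

    print(results["person_0"])
```

The key design point is that no call needs the results of any other call, which is what makes distributing the million small problems straightforward.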

Big data

Data have become a torrent flowing into every area of the global economy. We can identify four landmark sources of this torrent:

  1. The birth of the World Wide Web in 1989, when Tim Berners-Lee proposed the HyperText Markup Language (HTML), motivated by the need to keep track of experimental data at the European Organization for Nuclear Research (CERN).
  2. The arrival of an array of social networks such as Facebook, launched in 2004, and Twitter, founded in 2006.
  3. The advent of smartphones, starting with the first-generation iPhone released in 2007.
  4. The forthcoming materialization of the Internet of Things: millions of networked sensors embedded in the physical world, in devices such as mobile phones, smart energy meters, automobiles, and industrial machines that sense, create, and communicate data.

Big data has now reached every sector in the global economy, as documented, for instance, in the McKinsey report on big data cited above.

The Gartner Group, a research and advisory firm that industry looks to for technology trends, characterized big data in 2011 by the three V's: volume, velocity, and variety.

Many citizens around the world regard this collection of information with deep suspicion, seeing the data flood as nothing more than an intrusion into their privacy. As an ever larger amount of data is digitized and travels across organizational boundaries, a set of policy issues will become increasingly important, including, but not limited to, privacy, security, intellectual property, and liability.

Finally, a significant constraint on realizing value from big data will be a shortage of talent, particularly of people with deep expertise in statistics and machine learning, and the managers and analysts who know how to operate companies by using insights from big data.