Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. This definition is intentionally subjective and incorporates a moving definition of how big a dataset needs to be in order to be considered big data (McKinsey report on big data, 2011).
At the moment, data loaded into primary memory are on the order of gigabytes (\(10^9\) bytes), data stored on secondary memory are on the order of terabytes (\(10^{12}\) bytes), while data beyond this scale (petabytes, \(10^{15}\) bytes, and above) are considered big data.
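These orders of magnitude can be turned into a rough, first-pass check of where a dataset sits. The Python sketch below is purely illustrative: the file name and the 16 GB of primary memory are assumptions, not part of the definition above.

```python
# A rough, illustrative check of where a dataset sits on the scales above.
# The file name and the 16 GB of primary memory are assumptions.
import os

GB = 10**9   # gigabyte: typical scale of data held in primary memory
TB = 10**12  # terabyte: typical scale of data kept on secondary memory
PB = 10**15  # petabyte and beyond: the big data regime

RAM_BYTES = 16 * GB            # assumed size of primary memory
path = "measurements.csv"      # hypothetical dataset on disk

size = os.path.getsize(path)
print(f"{path}: {size / GB:.2f} GB")
if size < RAM_BYTES:
    print("Fits in primary memory: work with it directly.")
elif size < PB:
    print("Lives on secondary memory: process it in pieces.")
else:
    print("Big data: a single machine is no longer enough.")
```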
This course proudly focuses on small, in-memory datasets. This is the right place to start because you can’t tackle big data unless you have experience with small data.
If your data is bigger, carefully consider if your big data problem might actually be a small data problem in disguise. While the complete data might be big, often the data needed to answer a specific question is small. You might be able to find a subset, subsample, or summary that fits in memory and still allows you to answer the question that you’re interested in. The challenge here is finding the right small data, which often requires a lot of iteration.
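One way to hunt for the right small data is to stream the full dataset and keep only a subsample or summary. The sketch below uses pandas to draw a random subsample from a CSV file too large to load at once; the file name, chunk size, and sampling fraction are assumptions chosen only for illustration.

```python
# Illustrative sketch: extract a small random subsample from a CSV that is
# too big to load at once, by streaming it in chunks with pandas.
import pandas as pd

pieces = []
for chunk in pd.read_csv("visits.csv", chunksize=1_000_000):
    # keep roughly 1% of each chunk; a fixed seed makes the sample reproducible
    pieces.append(chunk.sample(frac=0.01, random_state=42))

sample = pd.concat(pieces, ignore_index=True)
print(f"working subsample: {len(sample)} rows")
# From here on, `sample` is small data: it fits in memory and can be
# explored with the usual in-memory tools.
```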
Another possibility is that your big data problem is actually a large number of small data problems. Each individual problem might fit in memory, but you have millions of them. For example, you might want to fit a model to each person in your dataset. That would be trivial if you had just 10 or 100 people, but instead you have a million. Fortunately, each problem is independent of the others (a setup that is sometimes called embarrassingly parallel), so you just need a system (like Spark) that allows you to send different datasets to different computers for processing. Once you have figured out how to answer the question for a single subset using small-data tools, you can learn new tools to solve it for the full dataset.
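The structure of such a problem can be seen in miniature on a single machine. The sketch below fits a toy linear model to each person's data in parallel with Python's multiprocessing module; the directory layout, column names, and model are assumptions, and a system like Spark plays the same role when the collection of small problems no longer fits on one computer.

```python
# Illustrative sketch of the "many small problems" pattern: fit one tiny
# model per person, with each fit independent of the others.
from multiprocessing import Pool
from pathlib import Path

import numpy as np
import pandas as pd

def fit_one(path):
    """Fit a straight line to a single person's measurements."""
    df = pd.read_csv(path)                      # small: one person's data
    slope, intercept = np.polyfit(df["age"], df["height"], deg=1)
    return path.stem, slope, intercept

if __name__ == "__main__":
    paths = sorted(Path("people").glob("*.csv"))  # one small file per person
    with Pool() as pool:                          # independent fits run in parallel
        results = pool.map(fit_one, paths)
    fits = pd.DataFrame(results, columns=["person", "slope", "intercept"])
    print(fits.head())
```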
Data have become a torrent flowing into every area of the global economy, and we can identify four landmark sources of this torrent.
Big data has now reached every sector in the global economy. For instance, according to the McKinsey report cited above:
If US health care could use big data creatively and effectively to drive efficiency and quality, we estimate that the potential value from data in the sector could be more than 300 billion dollars in value every year, two-thirds of which would be in the form of reducing national health care expenditures by about 8 percent.
In the developed economies of Europe, government administration could save more than 100 billion euros in operational efficiency improvements alone by using big data. This estimate does not include big data levers that could reduce fraud, errors, and tax gaps.
In the private sector a retailer using big data to the full has the potential to increase its operating margin by more than 60 percent.
The Gartner Group, a research and advisory firm widely followed by industry for technology trends, characterized big data in 2011 by the three V's: volume, velocity, and variety.
Volume. The volume refers to the size of the data managed by the system. Data that is generated automatically, such as sensor readings or transaction logs, tends to be voluminous.
Velocity. The velocity is the speed at which data is created, accumulated, ingested, and processed; a small example of processing data incrementally as it arrives is sketched below.
Variety. The variety refers to the heterogeneity of the data: structured records coexist with semi-structured and unstructured sources such as text, images, audio, and video.
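Velocity is easiest to see in code as the difference between batch and incremental processing. The sketch below keeps a running mean over a stream of readings without ever storing them all, which is the shape high-velocity processing typically takes; the event stream here is simulated, not a real data source.

```python
# Illustrative sketch: process a high-velocity stream incrementally, keeping
# a running mean instead of accumulating all readings. The random event
# stream is simulated; a real system would read from a sensor, log, or queue.
import random

def event_stream(n):
    """Simulate n sensor readings arriving one at a time."""
    for _ in range(n):
        yield random.gauss(mu=20.0, sigma=2.0)

count, mean = 0, 0.0
for reading in event_stream(1_000_000):
    count += 1
    mean += (reading - mean) / count   # incremental (online) mean update
print(f"processed {count} readings, running mean = {mean:.3f}")
```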
Many citizens around the world regard this collection of information with deep suspicion, seeing the data flood as nothing more than an intrusion on their privacy. As an ever larger amount of data is digitized and travels across organizational boundaries, there is a set of policy issues that will become increasingly important, including, but not limited to, privacy, security, intellectual property, and liability.
Privacy is an issue whose importance, particularly to consumers, is growing as the value of big data becomes more apparent. Personal data such as health and financial records are often those that can offer the most significant human benefits, such as helping to pinpoint the right medical treatment or the most appropriate financial product. However, consumers also view these categories of data as being the most sensitive. It is clear that individuals and the societies in which they live will have to grapple with trade-offs between privacy and utility.
Another closely related concern is data security, e.g., how to protect competitively sensitive data or other data that should be kept private. Recent examples have demonstrated that data breaches can expose not only personal consumer information and confidential corporate information but even national security secrets. With serious breaches on the rise, addressing data security through technological and policy tools will become essential.
Big data’s increasing economic importance also raises a number of legal issues, especially when coupled with the fact that data are fundamentally different from many other assets. Data can be copied perfectly and easily combined with other data. The same piece of data can be used simultaneously by more than one person. All of these are unique characteristics of data compared with physical assets. Questions about the intellectual property rights attached to data will have to be answered: Who owns a piece of data and what rights come attached with a dataset? What defines fair use of data? Who is responsible when an inaccurate piece of data leads to negative consequences?
Finally, a significant constraint on realizing value from big data will be a shortage of talent, particularly of people with deep expertise in statistics and machine learning, and the managers and analysts who know how to operate companies by using insights from big data.