What is Data Science? A Complete Guide.
Big data is having a massive impact on society and business. Analysts forecast investments in the technology to reach $65 billion in 2018. Moreover, we live in a connected world collecting huge amounts of data on everything from our viewing habits to our health. Understanding big data is more important than ever, so here is a quick primer on what it is and how it's used, along with a brief history of how it came to be.
Big data refers to massive, complex structured and unstructured data sets that are rapidly generated and transmitted from a wide variety of sources. These attributes make up the three Vs of big data:
- Volume: The huge amounts of data being stored.
- Velocity: The lightning speed at which data streams must be processed and analyzed.
- Variety: The different sources and forms from which data is collected, such as numbers, text, video, images and audio.
These days, data is constantly generated any time we open an app, search Google or simply travel from place to place with our mobile devices. The result? Massive collections of valuable information that companies and organizations need to manage, store, visualize and analyze.
Traditional data tools aren't equipped to handle this kind of complexity and volume, which has led to a slew of specialized big data software and architecture solutions designed to manage the load.
Big data is essentially the wrangling of the three Vs to gain insights and make predictions, so it's useful to take a closer look at each attribute.
Big data is enormous. While traditional data is measured in familiar sizes like megabytes, gigabytes and terabytes, big data is stored in petabytes and zettabytes.
To grasp the difference in scale, consider this comparison from the Berkeley School of Information: one gigabyte is the equivalent of a seven-minute video in HD, while a single zettabyte is equal to 250 billion DVDs.
This is just the tip of the iceberg. According to a report by EMC, the digital universe is doubling in size every two years and by 2020 is expected to reach 44 zettabytes, or 44 trillion gigabytes.
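To make these scales concrete, here is a minimal Python sketch of the arithmetic. It assumes decimal (SI) units, where a zettabyte is 10^21 bytes; the `convert` and `size_after` helpers are illustrative names, not part of any library.

```python
# Decimal (SI) storage units, expressed in bytes.
UNITS = {
    "megabyte": 10**6,
    "gigabyte": 10**9,
    "terabyte": 10**12,
    "petabyte": 10**15,
    "exabyte": 10**18,
    "zettabyte": 10**21,
}


def convert(value, from_unit, to_unit):
    """Convert a quantity between two of the units listed above."""
    return value * UNITS[from_unit] / UNITS[to_unit]


def size_after(start_size, years, doubling_period_years=2):
    """Size after `years` if it doubles every `doubling_period_years`."""
    return start_size * 2 ** (years / doubling_period_years)


# One zettabyte expressed in gigabytes (a trillion of them).
print(f"1 zettabyte = {convert(1, 'zettabyte', 'gigabyte'):.0e} gigabytes")

# Doubling every two years means a 32-fold increase over a decade.
print(f"Growth over 10 years: x{size_after(1, 10):.0f}")
```

Doubling every two years compounds quickly: five doublings in ten years multiply the starting size by 32.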
Big data provides the architecture for handling this kind of data. Without appropriate solutions for storing and processing it, mining for insights would be impossible.
From the speed at which it's created to the amount of time needed to analyze it, everything about big data is fast. Some have described it as trying to drink from a fire hose.
Companies and organizations must have the capabilities to harness this data and generate insights from it in real time; otherwise, it's not very useful. Real-time processing allows decision makers to act quickly, giving them a leg up on the competition.
While some forms of data can be batch processed and remain relevant over time, much of big data is streaming into organizations at a clip and requires immediate action for the best outcomes. Sensor data from health devices is a great example. The ability to instantly process health data can provide users and physicians with potentially life-saving information.
Roughly 95% of all big data is unstructured, meaning it does not fit easily into a straightforward, traditional model. Everything from emails and videos to scientific and meteorological data can constitute a big data stream, each with its own unique attributes.
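The difference between batch and stream processing can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the heart-rate samples, the window size and the alert threshold are all hypothetical, and a production system would rely on a dedicated streaming platform rather than a simple loop.

```python
from collections import deque
from statistics import mean


def batch_average(readings):
    """Batch processing: wait for the full data set, then compute once."""
    return mean(readings)


def stream_alerts(readings, window_size=3, threshold=120):
    """Stream processing: handle each reading as it arrives.

    Yields (reading, rolling_average) whenever the rolling average of the
    last `window_size` readings exceeds `threshold`.
    """
    window = deque(maxlen=window_size)  # keeps only the newest readings
    for reading in readings:
        window.append(reading)
        rolling = mean(window)
        if len(window) == window_size and rolling > threshold:
            yield reading, rolling


# Hypothetical heart-rate samples (beats per minute) from a wearable device.
heart_rate = [72, 75, 74, 118, 131, 140, 96, 80]

print("Batch average:", batch_average(heart_rate))
for reading, rolling in stream_alerts(heart_rate):
    print(f"Alert at reading {reading}: rolling average {rolling:.1f}")
```

The batch average smooths over the spike entirely, while the streaming version raises an alert as soon as the recent readings climb, which is the point of processing time-sensitive data as it arrives.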
The diversity of big data makes it inherently complex, resulting in the need for systems capable of processing its various structural and semantic differences.
Big data requires specialized NoSQL databases that can store the data in a way that doesn't require strict adherence to a particular model. This provides the flexibility needed to cohesively analyze seemingly disparate sources of information to gain a holistic view of what is happening, how to act and when to act.
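As a rough illustration of what storing data without a strict model means, the following Python sketch keeps differently shaped records in one collection and queries them by whatever fields they happen to have. The `DocumentStore` class is a toy written for this example, not the API of any real NoSQL database, and the sample records are made up.

```python
class DocumentStore:
    """A toy in-memory document collection with no fixed schema."""

    def __init__(self):
        self._docs = []

    def insert(self, doc):
        # Any dict is accepted; documents need not share the same fields.
        self._docs.append(dict(doc))

    def find(self, **criteria):
        # Return documents whose fields match every given criterion.
        # Documents that lack a queried field simply do not match.
        return [
            d for d in self._docs
            if all(d.get(key) == value for key, value in criteria.items())
        ]


store = DocumentStore()
store.insert({"type": "email", "sender": "a@example.com", "subject": "Order"})
store.insert({"type": "sensor", "device_id": 17, "temperature_c": 21.5})
store.insert({"type": "video", "title": "Demo", "duration_s": 420, "tags": ["hd"]})
store.insert({"type": "sensor", "device_id": 18, "humidity": 0.4})

sensors = store.find(type="sensor")
print(len(sensors), "sensor documents")
```

Note that the two sensor documents carry different measurements, yet both are stored and retrieved together, something a fixed table schema would not allow without extra columns or tables.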
When big data is aggregated, processed and analyzed, it is often classified as either operational or analytical and stored accordingly.
Operational systems serve large batches of data across multiple servers and include such input as inventory, customer data and purchases: the day-to-day information within an organization.
Analytical systems are more sophisticated than their operational counterparts, capable of handling complex data analysis and providing businesses with decision-making insights. These systems will often be integrated into existing processes and infrastructure to maximize the collection and use of data.
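One way to picture the distinction is by how each kind of system touches the data. The Python sketch below is a simplified assumption, with hypothetical purchase records and helper names: the operational path fetches a single record by key, while the analytical path scans every record to produce a summary.

```python
from collections import defaultdict

# Hypothetical purchase records, as an operational system might store them.
purchases = [
    {"order_id": 1, "customer": "ana", "item": "laptop", "amount": 1200.0},
    {"order_id": 2, "customer": "ben", "item": "mouse", "amount": 25.0},
    {"order_id": 3, "customer": "ana", "item": "monitor", "amount": 300.0},
    {"order_id": 4, "customer": "cy", "item": "keyboard", "amount": 75.0},
]

# Operational access: index records by key to serve day-to-day lookups.
orders_by_id = {p["order_id"]: p for p in purchases}


def get_order(order_id):
    """Fetch one order, e.g. to show a customer their purchase."""
    return orders_by_id[order_id]


def revenue_by_customer(records):
    """Analytical access: scan all records and aggregate them."""
    totals = defaultdict(float)
    for record in records:
        totals[record["customer"]] += record["amount"]
    return dict(totals)


print(get_order(3)["item"])
print(revenue_by_customer(purchases))
```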
Regardless of how it is classified, data is everywhere. Our phones, credit cards, software applications, vehicles, records, websites and the majority of “things” in our world are capable of transmitting vast amounts of data, and this information is incredibly valuable.
Big data is used in nearly every industry to identify patterns and trends, answer questions, gain insights into customers, and tackle complex problems. Companies and organizations use the information for a multitude of reasons like growing their businesses, understanding customer decisions, enhancing research, making forecasts and targeting key audiences for advertising.
Here are a few industries in which the big data revolution is already underway:
The finance and insurance industries utilize big data and predictive analytics for fraud detection, risk assessments, credit rankings, brokerage services and blockchain technology, among other uses.
Financial institutions are also using big data to enhance their cybersecurity efforts and personalize financial decisions for customers.
Hospitals, researchers and pharmaceutical companies are adopting big data solutions to improve and advance healthcare.
With access to vast amounts of patient and population data, healthcare is enhancing treatments, performing more effective research on diseases like cancer and Alzheimer’s, developing new drugs, and gaining critical insights on patterns within population health.
Media & Entertainment
If you've ever used Netflix, Hulu or any other streaming service that provides recommendations, you've witnessed big data at work.
Media companies analyze our reading, viewing and listening habits to build individualized experiences. Netflix even uses data on graphics, titles and colors to make decisions about customer preferences.
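To give a sense of how viewing habits can drive recommendations, here is a deliberately simple Python sketch. It is not how Netflix or any other service actually works; it assumes made-up users and titles and scores unseen titles by how much the people who watched them overlap with the target user.

```python
from collections import Counter

# Hypothetical viewing histories: user -> set of titles watched.
histories = {
    "ana": {"Space Saga", "Robot Wars", "Ocean Life"},
    "ben": {"Space Saga", "Robot Wars", "Star Chef"},
    "cy": {"Ocean Life", "Star Chef"},
    "dee": {"Space Saga", "Robot Wars", "Galaxy Run"},
}


def recommend(user, histories, top_n=2):
    """Suggest unseen titles, weighted by overlap with similar viewers."""
    seen = histories[user]
    scores = Counter()
    for other, titles in histories.items():
        if other == user:
            continue
        overlap = len(seen & titles)  # number of shared titles
        if overlap == 0:
            continue
        for title in titles - seen:
            scores[title] += overlap
    return [title for title, _ in scores.most_common(top_n)]


print(recommend("ana", histories))
```

Real recommendation systems draw on far more signals, but the underlying idea is the same: patterns across many users' habits are used to predict what one user might want next.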
From engineering seeds to predicting crop yields with amazing accuracy, big data and automation are rapidly enhancing the farming industry.
With the influx of data in the last two decades, information is more abundant than food in many countries, leading researchers and scientists to use big data to tackle hunger and malnutrition. With groups like the Global Open Data for Agriculture & Nutrition (GODAN) promoting open and unrestricted access to global nutrition and agricultural data, some progress is being made in the fight to end world hunger.
Data collection can be traced back to the use of stick tallies by ancient civilizations to track food, but the history of big data really begins much later. Here is a brief timeline of some of the notable moments that have led us to where we are today.
- One of the first instances of data overload is experienced during the 1880 census. The Hollerith Tabulating Machine is invented and the work of processing census data is cut from ten years of labor to under a year.
- German-Austrian engineer Fritz Pfleumer develops magnetic data storage on tape, paving the way for how digital data would be stored in the coming century.
- Shannon’s Information Theory is developed, laying the foundation for the information infrastructure widely used today.
- Edgar F. Codd, a mathematician at IBM, presents a “relational database” showing how information in large databases can be accessed without knowing its structure or location. Such access was formerly reserved for specialists or those with extensive computer knowledge.
- Material Requirements Planning (MRP) systems are developed for commercial use to organize and schedule information, becoming more common for catalyzing business operations.
- The World Wide Web is created by Tim Berners-Lee.
- Doug Laney presents a paper describing the "3 Vs of Data," which become the fundamental characteristics of big data. That same year the term “software-as-a-service” is shared for the first time.
- Hadoop, the open-source software framework for large dataset storage, is created.
- The term “big data” is introduced to the masses in the Wired article "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete."
- A team of computer science researchers publishes the paper "Big Data Computing: Creating Revolutionary Breakthroughs in Commerce, Science and Society," describing how big data is fundamentally changing the way companies and organizations do business.
- Google CEO Eric Schmidt reveals that every two days people are creating as much information as people created from the beginning of civilization until 2003.
- More and more companies begin moving their Enterprise Resource Planning Systems (ERP) to the cloud.
- Internet of Things (IoT) becomes widely used with an estimate of 3.7 billion connected devices or things in use, transmitting large amounts of data every day.
- The Obama administration releases the "Federal Big Data Research and Development Strategic Plan," designed to drive research and development of big data applications that will directly benefit society and the economy.
- An IBM study reports that 2.5 quintillion bytes of data are created daily and that 90% of the world's data has been created in the last two years.