Big data: Will you know it when you see it?
IDC offers criteria for identifying the volume, variety and velocity of big data so that growth, changes and technology preferences can be measured and analyzed.
We've all heard of big data. While few of us may agree on exactly what the term really means or how large a data set needs to be in order to qualify as big data, most of us understand that big data is a data set so large and intricate that it can't be managed with traditional IT solutions such as database tools, spreadsheets or storage management structures.
As someone who works to verify market sizes and trends, I've spent some time reviewing how big data currently is defined.
Qualifying as big data is not just a matter of how much data is counted, but of how the information flows and how decisions are made in big data use cases. Over the past few years IDC has worked to establish a specific definition for big data, fully realizing that the definition must shift as some solutions become more mainstream and as the upper limits of big data continue to grow.
Currently, to make the big data grade, the data collected first needs to meet one of three criteria:
- The data set contains more than 100 terabytes of collected data.
- The data generated grows by more than 60 percent per year.
- The data is delivered in near real time via ultra-high-speed streaming.
Then, no matter which of those three criteria has been met, the data also needs to be deployed on a dynamically adaptable infrastructure. Beyond that, it must meet at least one of the following two criteria:
First, the data must originate from two or more formats and/or data sources.
Second, the data must be delivered as a high-speed streaming feed, as with sensor data used for real-time monitoring.
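To make that logic concrete, here is a minimal Python sketch of how a data collection could be checked against these criteria. The field names, thresholds and sample values simply mirror the criteria described in this article; they are illustrative assumptions, not an official IDC specification.

```python
# Illustrative sketch of the qualification logic described above.
# Field names and the sample profile are assumptions for this example.
from dataclasses import dataclass

@dataclass
class DataSetProfile:
    total_terabytes: float        # current size of the collection
    annual_growth_pct: float      # year-over-year growth rate
    streamed_in_real_time: bool   # delivered via high-speed streaming
    formats_or_sources: int       # distinct formats and/or data sources
    dynamically_scalable: bool    # deployed on adaptable infrastructure

def qualifies_as_big_data(p: DataSetProfile) -> bool:
    # First gate: volume, growth velocity, or near-real-time delivery.
    first_gate = (
        p.total_terabytes > 100
        or p.annual_growth_pct > 60
        or p.streamed_in_real_time
    )
    # Second gate: adaptable infrastructure plus variety or streaming.
    second_gate = p.dynamically_scalable and (
        p.formats_or_sources >= 2 or p.streamed_in_real_time
    )
    return first_gate and second_gate

# Example: a 150 TB archive growing 40 percent a year, drawn from
# three sources and deployed on scalable infrastructure.
profile = DataSetProfile(150, 40, False, 3, True)
print(qualifies_as_big_data(profile))  # True
```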
That's certainly a long list of qualifiers. So it's no wonder there is ongoing debate about what big data means. However, I strongly believe that only data collections (and associated IT systems) that meet these criteria qualify as big data under IDC's definition. This is an important point, because with this type of definition, the size of the government big data market can start to be measured and growth, changes and technology preferences noted.
As agencies have learned, there are unique challenges involved in managing extremely large data sets, including the way the data is gathered, managed, stored, searched, analyzed and transferred. A whole new IT market is evolving with new tools and technologies designed specifically to work with these oversized sets of information.
Cloud computing has helped light a fire under big data because government agencies can quickly gain access to the large data storage systems and big data analysis tools they need.
By working with information in a single collected set, rather than separately analyzing smaller sets, agencies have found that it's possible to spot trends, notice correlations between data sets and analyze real-time changes in the information. For these reasons, technologies such as broad- and narrow-scope data analysis, analytics and data visualization have become closely aligned with big data.
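As an illustration of that kind of combined analysis, the short Python sketch below merges two hypothetical agency feeds on a shared timestamp, then looks for correlations and trends. The file names and column names are invented for the example.

```python
# Minimal sketch: analyze two feeds as one collection rather than in
# isolation. File and column names are hypothetical.
import pandas as pd

traffic = pd.read_csv("sensor_traffic.csv", parse_dates=["timestamp"])
weather = pd.read_csv("weather_obs.csv", parse_dates=["timestamp"])

# Align the two sources on time so they can be analyzed together.
combined = pd.merge_asof(
    traffic.sort_values("timestamp"),
    weather.sort_values("timestamp"),
    on="timestamp",
)

# Correlation between columns drawn from the two original sources.
print(combined[["vehicle_count", "rainfall_mm"]].corr())

# A simple rolling mean to surface trends over time.
series = combined.set_index("timestamp")["vehicle_count"]
print(series.rolling("7D").mean().tail())
```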
Here are just a few examples of big data in the federal space:
The NASA Earth Observing System Data and Information System. EOSDIS manages data from EOS missions from the point of data capture to delivery to end users at near-real-time rates. It includes the collection, storage and dissemination of several terabytes of data each day.
Battlespace networks. Within the Defense Department, battlespace is a term used to describe DOD's unified military strategy, where armed forces within the military's theater of operations can communicate, share data and make decisions. It includes integrated air, land, sea and space components. The military's battlespace networks are a prominent generator of big data, which is shared via networks, satellites and in some cases huge arrays of hard drives on reconnaissance aircraft that can be offloaded as soon as the plane lands.
Big Data to Knowledge (BD2K). This National Institutes of Health initiative is meant to help biomedical scientists leverage big data from multiple medical and scientific research communities.
It's worth noting that a large amount of data is already located in government data centers. Some people might describe such data stores as big data. But much of the information in legacy data centers is spread across a variety of storage types, including both active databases and older tape silos. In their current form, many of these collections would not meet the definition of big data described here.
With the federal government now using large commercial cloud providers such as Amazon, Rackspace and Cleversafe, we expect to see more vendors partnering with those providers to develop cloud-based real-time data processing as a service. This will make it easier for vendors to pitch cloud-based big data solutions that can be ramped up fairly quickly, as long as business and analytical needs can be clearly defined.