What you need to know about big data

 

It's not one thing, but a mix of technologies that can let you make use of the exploding data generated by sensors, mobile devices and broadband. And it's in everyone's future.

EDITOR'S NOTE: This article was updated Feb. 10, 2012, to correct the name of the Oracle Big Data Appliance.

The king is dead.  Long live the king.

For the past several years, the buzzword was "the cloud."  Now it's "big data."

Thanks to the exploding use of sensors, mobile devices and social networks, coupled with broadband communications, the amount of data collected by government, private enterprise and individuals has far outstripped our ability to analyze it effectively and even to store it.

It has been estimated that Wal-Mart records more than 1 million customer transactions each hour, resulting in more than 2.5 petabytes of data being stored, the equivalent of 167 times the data stored in the Library of Congress. Facebook is reported to store 40 billion photographs. And an estimated 35 hours of video content is uploaded to YouTube every minute.


And, of course, the federal government is also a prodigious collector of data. One project alone — NASA's Earth Observing System Data and Information System — has accumulated more than 3 petabytes of data since 2005. "Data volumes are growing exponentially," the President's Council of Advisors on Science and Technology warned in a December 2010 report. "Every federal agency needs to have a 'big data' strategy."

And it's not just the amount of data that presents challenges. The types of data being collected are changing. "Eighty percent of the new information is coming in as unstructured information," said Frank Stein, director of IBM's Analytics Solution Center. "Look at the growth of YouTube and all the documents in PDF files. A huge variety of data types is really adding to the volume of data."

What's more, analyzing video streams for significant content, making it searchable and integrating it with other types of datasets is a feat beyond the reach of traditional relational database systems.

Unfortunately, the price of drowning in data can be significant. Military investigators blamed "information overload" for a Predator drone attack that killed 23 Afghan civilians in February 2010. Drone operators, it was found, could not keep up with monitoring the drone's video feeds while at the same time participating in instant messaging and radio communications with intelligence analysts and troops.

While the explosion of digital data collection poses challenges, it also offers opportunities. Various agencies at federal, state and local levels have accumulated massive amounts of data on such varied topics as pollen in trees, water quality, disease patterns, transportation infrastructure, weather statistics and satellite photography.

If the tools are developed to integrate these massive datasets across governmental boundaries and make them efficiently searchable, unexpected benefits may emerge. "When you can start integrating datasets you see insights that you never saw before," Stein said.

Hadoop a catalyst

The main reason for all the recent buzz about big data is that those tools are just now emerging.

"In the public sector we have so much data that we do nothing with," said Peter Doolan, chief technologist for Oracle's public-sector division. "We haven't even had the tools to do anything with it. We now do."

What makes it difficult to talk about big data, however, is that it is not a single technology.  Rather, it is a confluence of technological developments that allow analysts to store, manage and analyze large and diverse datasets.

"We call it big data just to give it an easy-to-remember name," said Michael Chui, senior analyst at McKinsey Global Institute, the research arm of the McKinsey & Co. management consulting firm. If there is a key technology enabling the analysis of big data, Chui said, it was the introduction of Apache Hadoop. Hadoop is essentially an open-source framework for rapidly analyzing huge datasets that may reside on multiple commodity computers. It is based on Google's MapReduce analysis engine, which parses data for distributed processing.

"We can take this commodity equipment and use it as a high-performance computer on the cheap," said Dave Ryan, CTO at General Dynamics IT. "The parallel processing idea that was only for the big labs can now be done by Web 2.0 startups." It is that processing power that allows for the analysis of extremely large datasets.
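For readers who want to see the idea in miniature, the MapReduce pattern described above can be sketched in a few lines of Python. This is an illustration only, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this example, and a local thread pool stands in for the cluster of commodity machines. Each chunk of text is processed independently, the intermediate results are grouped by key, and each group is then reduced to a single value.

```python
from collections import defaultdict
from multiprocessing.dummy import Pool  # thread pool standing in for cluster nodes


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group the emitted values by key so each key can be reduced separately."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks, workers=2):
    """Run the map step over all chunks in parallel, then shuffle and reduce."""
    with Pool(workers) as pool:
        mapped = pool.map(map_phase, chunks)
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    # Two small chunks; on a real cluster each would live on a different machine.
    print(word_count(["big data is big", "data about data"]))
```

Because no map call depends on another, the chunks can be spread across as many workers as are available, which is the property that lets inexpensive hardware be pooled for large jobs.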

But Chui says that big data isn't just about Hadoop-based data processing. It's also about faster processors, wider bandwidth communications and larger, cheaper storage. "And in addition to all the analytics, how can you make this data consumable?" he asked. "So the idea of visualization or interface technologies to make the results of analysis consumable is a strongly felt need."

Big data in a box

While the main pieces of big data analysis software are available as open-source downloads, the major database vendors are rushing to deliver packaged big data solutions.  "All of these pieces today are available for you if you wish to go ahead and download all of these pieces," said Doolan. "We're trying to put big data in a box. Oracle has announced the thing we are calling the Oracle Big Data Appliance. It has all that software and hardware in one box."

Similarly, IBM offers InfoSphere BigInsights, a Hadoop-based analysis tool.

According to Stein, IBM is also focusing on developing the ability to analyze data streams while they are still in motion. "It takes time to write to disk and put it in your database and draw statements against that," said Stein.  "We're talking about doing all of this on the fly as the data is in motion.  You can actually write a program to tell the box what you want to look for just like you tell a firewall what to look for in terms of viruses or other kinds of things. This is at the early stage."
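The general approach Stein describes, checking records against rules as they pass rather than storing them first, can be sketched with a Python generator. This is a minimal illustration under stated assumptions and says nothing about how IBM's product works: the `watch` function, the `sensor_feed` source, the `overheat` rule and the sample readings are all hypothetical.

```python
def watch(stream, rules):
    """Yield (rule_name, record) for every record that matches a rule.

    Records are inspected one at a time as they arrive; nothing is
    written to disk or kept in memory beyond the current record.
    """
    for record in stream:
        for name, predicate in rules.items():
            if predicate(record):
                yield name, record


def sensor_feed():
    """Hypothetical data source standing in for a live feed (made-up values)."""
    yield {"sensor": "a", "temp_c": 21.5}
    yield {"sensor": "b", "temp_c": 98.0}
    yield {"sensor": "a", "temp_c": 22.1}


# Rules play the role of the "what to look for" program in the quote above.
RULES = {
    "overheat": lambda record: record["temp_c"] > 90,
}

if __name__ == "__main__":
    for rule, record in watch(sensor_feed(), RULES):
        print(rule, record)
```

The rules are plain data passed to the watcher, so what to look for can be changed without touching the loop that reads the stream.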

The potential of big data solutions has agency staff and integrators alike excited.

"As an integrator what I find so interesting is figuring out how we can apply the solution to problems we have had over time in all the different domains," said Ryan, pointing to such uses as traffic management, fraud detection and natural language processing for discovery and other regulatory matters.

Already, a variety of federal agencies are using big data applications. The Office of Personnel Management is using a SAS analytic suite to scan data records from more than 400 health insurance companies participating in the Federal Employees Health Benefits Program for fraudulent claims and other irregularities. The SAS software is also being used to analyze the millions of records in the CMS Chronic Condition Data Warehouse, a repository for Medicare and Medicaid research data.

GCE Federal is currently working on a project that will combine and make searchable procurement data across the entire federal government.  "Imagine if you could combine procurement data from every agency in the government in one big database and have tools on top of it that would allow stakeholders — from public users to government organizations — to be able to go in there and slice and dice through procurement data in a highly intuitive fashion," said GCE Federal CEO Ray Muslimani.

The ability of big data applications to integrate disparate datasets for analysis offers the potential not only for unexpected insights but also for interagency cooperation that can result in major savings in an era of constricting budgets.

"As we are more and more providing Web services that take advantage of the data, there is great opportunity for agencies to collaborate on the resources," said Rob Dollison, program manager with the National Geospatial Program, a unit of the U.S. Geological Survey. "We often have use for the same types of data and in the past we just had to make copies. More and more we are able to find ways to collaborate in the acquisition of the data and how we make it available."

Dollison said USGS already shares a lot of aerial photography with the Agriculture Department. "Anytime a government agency is looking at investing in things we really are required to look at what exists, what other agencies are doing, and how can we collaborate on it," said Dollison. "In most cases there is an awful lot of incentive to collaborate.  And I think the urgency is growing."

Oracle's Doolan agrees. "I think the next two or three years you're going to see a massive leap forward," he said. "It's going to be very interesting."

New ‘data scientists’

The potential of big data also brings with it challenges for agencies and departments.

"These new technologies are coming down the pike," Chui said. "What's important is to understand how you can start to experiment with some of these new or innovative types of technologies. In general you can do it in a relatively small-scale. In some cases that will require some shifting of budget dollars."

It will also, warn other analysts, require a lot more highly trained people, a special challenge for public-sector users.

"You also have to have the right people to be able to interpret the results," said Anne Lapkin, a research vice president with the Gartner Group. "One of the things that we're seeing now is the emergence of something that we are calling the 'data scientist,' who is someone who has an innate understanding of the data, who understands the analytical techniques, who understands statistical analysis and can actually formulate the appropriate queries to get a sensible result. Those are people who are in very short supply in the private sector and the public sector. It's an entirely new skill set."

"Finding the people who can derive actionable insight from large amounts of data is a tremendous challenge," Chui said. His team has estimated a potential gap of approximately 140,000 to 190,000 positions. "That suggests some interesting things from a policy standpoint. If you are a leader of a public-sector organization, finding that type of talent, motivating them and retaining them will be very important for you if you're to be able to do your job.

"As a policy-maker, if you recognize this is going to be a basis of competition not only for companies but for countries, it will be very important to try and make sure that we have the right policies in place so that we are educating and graduating people with these types of skills," he said.

Meta standards

Another challenge of growing importance is the development of standards.

Hadoop and similar data analysis tools are efficient at dealing with massive amounts of data because they focus on manipulating metadata — data about the data — which is far more efficient than moving the data itself around. The problem with that, says Lapkin, is that current big data implementations tend to have the metadata embedded within the analytical code.

"So you can't take that information, for example, and easily integrate it with other information in another system," Lapkin said. "People who are doing Hadoop/MapReduce implementations are building themselves a new little set of data silos to replace the ones that they had previously."
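One way to picture the alternative to metadata buried in analytical code is to keep the description of a dataset in a separate document that any system can read. The Python sketch below assumes a hypothetical JSON schema (`SCHEMA_JSON`) listing column names and types; the schema layout, the `load_records` helper and the sample rows are invented for illustration and do not represent any particular product or standard.

```python
import csv
import io
import json

# Hypothetical schema kept outside the analysis code, for example in a shared file.
SCHEMA_JSON = (
    '{"columns": ["agency", "year", "amount"],'
    ' "types": {"year": "int", "amount": "float"}}'
)

# Map type names used in the schema to Python conversion functions.
CASTS = {"int": int, "float": float, "str": str}


def load_records(raw_text, schema):
    """Parse headerless CSV text using column names and types from the schema."""
    records = []
    for row in csv.reader(io.StringIO(raw_text)):
        record = {}
        for name, value in zip(schema["columns"], row):
            cast = CASTS[schema["types"].get(name, "str")]  # default to text
            record[name] = cast(value)
        records.append(record)
    return records


if __name__ == "__main__":
    schema = json.loads(SCHEMA_JSON)
    sample = "Agency A,2011,1200.50\nAgency B,2011,980.00\n"  # made-up rows
    print(load_records(sample, schema))
```

Because the column names and types live in the schema rather than in the parsing function, a second system could read the same file and interpret the data the same way without sharing any code.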

In getting one of the public sector's highest profile big data projects  — Data.gov — up and running, Marion Royal quickly saw the need for metadata standards. "When we first started we established what we called the 'metadata template,' " Royal said. "Agencies were required to use that template to submit to Data.gov. So that was the beginning of some harmonization across government in defining datasets and how they might be used."

Data.gov, which launched in May 2009 with 47 datasets, now offers more than 400,000 datasets from 172 agencies and subagencies.

"There is still a lot of work to do," Royal said. "We need to define open standards to be able to share the data and make it available to developers who can develop applications that make use of this data regardless of where the data is stored, regardless of whether it is federal or state, because most of the data that citizens might use are typically found on the local and state data sites."

Finally, some analysts warn that, despite the potential power of big data tools, the emerging technology is not a silver bullet.

Lapkin said agency strategists looking at big data should keep their focus clearly on the problems they want to solve or services they want to offer. "Fundamentally, if you haven't defined what the problem is, then throwing people or technology or technique or a buzzword isn't going to do anything except waste money," she said. "Start small and spread out. Always work on very well-defined business outcomes. The stakes are higher and higher because it is very easy to throw huge piles of money at this stuff and not get any return. Everybody gets caught up in the hype."
