Worried about security? Beware the mosaic effect
Open data increases the possibility that individual data sets can be combined and analyzed to reveal private or sensitive information.
Most people have likely never heard of the mosaic effect. But anyone working with large streams of public data should become familiar with it because it can create security holes where none existed before.
According to Marion Royal, program director of Data.gov, it occurs when very large data sets, even those containing only unclassified information, are combined. People can mix that data and reassemble it in unforeseen ways, like the pieces of a mosaic. In a worst-case scenario, those with ill intent can use the result to compromise security.
Royal described the mosaic effect during a presentation at the annual FOSE conference in Washington, D.C., presented by 1105 Media, parent company of GCN.
Data.gov is perfect for demonstrating how the mosaic effect works, though Royal said the effect was not even considered when the site was created in May 2009 as part of the Obama administration’s open government agenda.
“When we started out, we only had 47 data sets,” Royal said. “There are over 90,000 today.” Agencies publish their public data in a specific file format at a predictable location that follows the pattern agencyname.gov/data.json. Every night, Data.gov crawls those agency locations and adds the data sets it finds to the main site. The JSON format is open and non-proprietary, so the public can also view it.
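The nightly harvest Royal describes can be pictured with a short Python sketch. It is only an illustration under stated assumptions, not Data.gov's actual crawler: the agency names and titles are invented, the catalogs are in-memory strings rather than files fetched from agencyname.gov/data.json, and the "dataset" and "title" keys are assumed for the example.

```python
import json

# Hypothetical catalogs standing in for what a crawler might fetch from
# agencyname.gov/data.json. The "dataset" and "title" keys are assumptions
# made for this sketch, and every value is invented.
SAMPLE_CATALOGS = {
    "agency-a.gov": '{"dataset": [{"title": "Fleet fuel purchases"}, {"title": "Facility locations"}]}',
    "agency-b.gov": '{"dataset": [{"title": "Loading dock schedules"}]}',
}


def harvest(catalogs):
    """Merge per-agency catalogs into one list of (agency, title) pairs."""
    merged = []
    for agency, raw in sorted(catalogs.items()):
        catalog = json.loads(raw)
        for entry in catalog.get("dataset", []):
            merged.append((agency, entry.get("title", "(untitled)")))
    return merged


index = harvest(SAMPLE_CATALOGS)
for agency, title in index:
    print(f"{agency}: {title}")
```

Once the separate catalogs sit in one index, anyone can see at a glance which data sets exist side by side, which is the starting point for combining them.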
The mosaic effect came into play six months after the new site went online. The security agencies came to Data.gov warning about the dangers of people compiling data.
“We were aggregating all government data together,” Royal said. “The concern was that someone could, for example, figure out things that were never intended and not represented in a single data stream, like how often trucks were moving between agency locations. They didn’t want the bad guys to use that data in a bad way.”
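A small Python sketch shows how the kind of inference Royal describes could arise. Everything in it is hypothetical: the fuel log, the station-to-facility table, and the function name are invented for illustration, and the sketch assumes each vehicle's log entries appear in chronological order. Neither data set mentions trips between facilities, yet joining them yields a count of those trips.

```python
from collections import Counter, defaultdict

# Data set 1 (hypothetical): fuel card transactions as (date, vehicle, station).
fuel_log = [
    ("2013-05-01", "truck-1", "station-17"),
    ("2013-05-01", "truck-1", "station-42"),
    ("2013-05-02", "truck-2", "station-17"),
    ("2013-05-03", "truck-1", "station-17"),
    ("2013-05-03", "truck-1", "station-42"),
]

# Data set 2 (hypothetical): which facility each fuel station sits next to.
station_sites = {"station-17": "Facility A", "station-42": "Facility B"}


def infer_routes(fuel_log, station_sites):
    """Count inferred trips between facilities by joining the two data sets."""
    visits = defaultdict(list)
    for date, vehicle, station in fuel_log:
        site = station_sites.get(station)
        if site is not None:
            visits[(date, vehicle)].append(site)

    routes = Counter()
    for sites in visits.values():
        # Consecutive stops at different facilities on one day imply a trip.
        for origin, destination in zip(sites, sites[1:]):
            if origin != destination:
                routes[(origin, destination)] += 1
    return routes


routes = infer_routes(fuel_log, station_sites)
for (origin, destination), count in routes.items():
    print(f"{origin} -> {destination}: {count} inferred trips")
```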
David McClure is with the office of the CIO for the National Oceanic and Atmospheric Administration, though he also spent some time with Data.gov. At the same FOSE presentation, he pointed out that the mosaic effect can be used for good. In NOAA’s case, the agency has thousands of sensors deployed around the world, above and below ground and out in the ocean. They generate over a terabyte of data every day.
NOAA uses that information to help predict the weather and warn people of approaching dangers like hurricanes, but it uses only a fraction of the data collected. In some private partnerships, outside organizations have used the extra data in new ways, capitalizing on the mosaic effect, McClure said, though for the most part the data isn’t shared.
Predicting the mosaic effect is still a very young science, McClure said. “I came to the conclusion that most agencies were looking just at their data and making sure that there was no personally identifiable information, and then posting it,” he said. “But people were just whistling past the graveyard because they really didn’t know how the data could be combined and used.”
As data sets get larger, the possibility of the mosaic effect occurring increases, for good or ill, Royal said.
“We have learned to find two data points that correlate with each other where, if one changes, so does the other,” he said. “Then when you expand that to hundreds or thousands of other points, you start to really see how things are.”
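One common way to measure whether two series move together is the Pearson correlation coefficient. The Python sketch below is an illustration only: the two series are invented, and nothing in Royal's remarks says this particular measure is the one Data.gov relies on.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Two hypothetical series drawn from different data sets (values invented).
series_a = [10, 20, 30, 40, 50]
series_b = [12, 24, 33, 46, 55]

r = pearson(series_a, series_b)
print(f"correlation: {r:.3f}")
```

A value near 1 means the two series rise and fall together, a value near -1 means one rises as the other falls, and a value near 0 means no linear relationship. Repeating the calculation across hundreds or thousands of pairs is what makes the larger picture emerge.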
On the good side, Royal said that the mosaic effect can be used to predict things like flu outbreaks so more medicine can be added to the shelves. Open data is worth pursuing, he said, even if someone might also be able to figure out how to use the data for malicious purposes. Data.gov now scans all its data to try to predict how it can be combined, Royal said. Even so, cases where data has to be modified for security reasons because of a possible mosaic effect are almost non-existent, he said.
“Of all the data sets we posted on Data.gov, we only ever had one incident where we asked an agency to go back and take another look,” he said. So while the possibility exists that data sets can be combined for harmful purposes, he said, functionally it’s probably not a huge threat at the moment.