Getting big data out of the storage black hole
With fast data retrieval critical for real-time analysis, researchers are finding ways to access big data more quickly and efficiently.
Ten years ago, data in long-term storage “was almost like a black hole,” Katrin Heitmann, a computational scientist at Argonne National Laboratory, said. “You would put data on it. You would hope you could retrieve it at some point, but if you actually had to, it was almost impossible because it was so slow.”
Typically, high-performance computing (HPC) archive data has been kept in “cold” storage, which offers low cost, high capacity and durability for data that is unlikely to be accessed. However, fast data retrieval is necessary for real-time analysis, and that need is driving several efforts to combat the black hole effect.
A new data storage system being developed for Los Alamos National Laboratory will use technology from Seagate’s ClusterStor A200 system to keep massive amounts of stored data available for rapid access, while also minimizing power consumption. The joint research program, which aims to determine new ways to rapidly access archived data cost effectively, will use high-density, power-managed prototype disks and software for deep data archiving -- a challenge for organizations that must juggle increasingly massive amounts of data using very little additional energy, Seagate officials said.
Using automated, policy-driven hierarchical storage management to migrate data off expensive primary storage tiers while keeping it online for fast retrieval, the program aims to significantly cut both time and operational costs.
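In rough terms, that kind of policy-driven tiering can be sketched as a short script that moves data untouched for a set period onto a cheaper tier while leaving a pointer behind so the data stays discoverable. The sketch below is purely illustrative -- the paths and the 90-day threshold are hypothetical, and it is not the Los Alamos or Seagate implementation.

```python
# Minimal sketch of policy-driven tiering: files untouched for a set number
# of days are moved from an expensive "hot" tier to a cheaper "cold" tier,
# and a small stub records where the data went so it remains discoverable.
# Paths and the 90-day threshold are illustrative, not the lab's actual policy.
import json
import os
import shutil
import time

HOT_TIER = "/mnt/hot"        # hypothetical primary (fast, expensive) tier
COLD_TIER = "/mnt/cold"      # hypothetical archive (dense, power-managed) tier
AGE_THRESHOLD_DAYS = 90      # illustrative migration policy

def migrate_cold_files():
    cutoff = time.time() - AGE_THRESHOLD_DAYS * 86400
    for name in os.listdir(HOT_TIER):
        src = os.path.join(HOT_TIER, name)
        if not os.path.isfile(src) or os.path.getatime(src) > cutoff:
            continue  # still "hot" -- leave it on the primary tier
        dst = os.path.join(COLD_TIER, name)
        shutil.move(src, dst)
        # Leave a stub behind so tools can find (and later recall) the data.
        with open(src + ".stub", "w") as f:
            json.dump({"archived_to": dst, "archived_at": time.time()}, f)

if __name__ == "__main__":
    migrate_cold_files()
```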
Meanwhile, the Oak Ridge Leadership Computing Facility is addressing slow data retrieval by upgrading its data storage system. According to OLCF officials, the facility is increasing “its data intake rate by a factor of five” as well as more than quadrupling the disk cache of its high-performance storage system.
“We’ve significantly reduced the time it takes to ingest a petabyte of data -- almost by two-thirds,” OLCF staff member Jason Hill said. “These improvements not only help users place data in the archive, they make retrieving a dataset much faster, too.”
The OLCF storage team invested in 40-gigabit Ethernet connectivity and two separate Ethernet switches to provide a high-bandwidth path for transferring data quickly to the archival media. In addition to speeding up communication between the different storage tiers, the team wanted to help users avoid retrieving data from tapes for as long as possible. Fifteen petabytes of raw disk cache expands the amount of data that can be recalled and accessed quickly.
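The principle at work is a familiar one: serve reads from the disk cache whenever possible and fall back to the slow tape archive only on a miss, populating the cache for next time. The snippet below is a minimal sketch of that idea, assuming placeholder paths rather than OLCF’s actual HPSS interface.

```python
# Minimal sketch of the cache-in-front-of-tape idea: reads are served from a
# disk cache when possible and only fall back to the (slow) tape archive on a
# miss, populating the cache for next time. Paths are placeholders, not the
# OLCF high-performance storage system's real layout.
import os
import shutil

DISK_CACHE = "/mnt/disk_cache"   # hypothetical fast disk tier
TAPE_ARCHIVE = "/mnt/tape"       # stand-in for the tape-backed archive

def read_dataset(name: str) -> str:
    os.makedirs(DISK_CACHE, exist_ok=True)
    cached = os.path.join(DISK_CACHE, name)
    if os.path.exists(cached):
        return cached                      # cache hit: no tape recall needed
    archived = os.path.join(TAPE_ARCHIVE, name)
    shutil.copy(archived, cached)          # cache miss: slow recall from tape
    return cached
```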
When the upgrades are completed later this year, data storage will total nearly 20 petabytes of disk cache, and transfer rates will approach 200 gigabytes per second, OLCF reported.
Wrangler, a new type of data analysis and management system at the Texas Advanced Computing Center, is also taking on the HPC storage problem.
What’s different about Wrangler is its massive amount of flash storage -- 600 terabytes -- which allows users to work directly with stored data. It also has a very large distributed spinning-disk storage system and high-speed network access.
"Wrangler has enough horsepower that we can run some very large studies and get meaningful results in a single run," Oak Ridge computer scientist Joshua New told Scientific Computing.
"Wrangler fills a specific niche for us in that we're turning our analysis into an end-to-end workflow, where we define what parameters we want to vary," New said. "Doing that from beginning to end as a solid workflow on Wrangler is something that we're very excited about."