Next big things in big data: Visualization, knowledge clouds, fast clusters
The open-source cluster computing tool called Spark speeds programming and can run up to 100 times faster than Hadoop MapReduce.
This is the third in a series about big data tools. Read part one and part two.
A picture is worth a thousand words, but when it comes to data analytics, basic graphics or charts are not enough. Users need visualizations that can answer more complex questions and help solve problems.
The Florida Department of Juvenile Justice is using a tool called Tableau to present a clearer picture of children in the justice system and the effectiveness of the state’s innovative reform efforts. Tableau is a self-service business intelligence tool that helps people of any skill level create data visualizations, reports or dashboards from databases, spreadsheets and big data sources, according to Francois Ajenstat, director of product management for Tableau Software.
Sometimes a mix of data and geospatial analytics can bring data to life. Using analytics and customer relationship management software, analysts at the Texas Parks and Wildlife Department can pinpoint trends in leisure activities, park utilization and purchasing patterns, down to individual neighborhoods, according to TPWD officials.
With Business Analyst, part of Esri's ArcGIS geospatial analysis platform, TPWD analysts mined Census and ZIP code data and used probability matching to get a better handle on their customers. Combining that geographic data with SAS Analytics, TPWD has even been able to stock fish in lakes closer to where anglers live and promote special hunts to hunters.
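Probability matching scores candidate record pairs on field-by-field agreement rather than requiring exact matches, so a license record and a Census-derived record can be linked even when the names are written differently. The sketch below is a generic illustration of that idea in Scala, not TPWD's actual SAS or Esri implementation; the field names, weights and threshold are illustrative assumptions.

```scala
// Hypothetical sketch of probabilistic record matching.
// Weights reflect how discriminating a field is: a shared street
// address says more about identity than a shared ZIP code.
case class Customer(name: String, zip: String, street: String)

object ProbMatch {
  def matchScore(a: Customer, b: Customer): Double = {
    var score = 0.0
    if (a.name.equalsIgnoreCase(b.name)) score += 0.5
    if (a.zip == b.zip) score += 0.2
    if (a.street.equalsIgnoreCase(b.street)) score += 0.3
    score
  }

  def main(args: Array[String]): Unit = {
    val license = Customer("J. Smith", "78701", "12 Oak St")
    val census  = Customer("John Smith", "78701", "12 Oak St")
    // Pairs scoring above a chosen threshold are treated as the same
    // person -- here the names disagree, but ZIP and street carry the match.
    val threshold = 0.4
    val s = matchScore(license, census)
    println(f"score=$s%.2f matched=${s >= threshold}")
  }
}
```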
Future tools
The amount of data will only increase with the expansion of the Internet of Things, data-driven scientific discovery and the explosive growth of video, and agencies are already struggling to keep up. So far they have coped by adding more or faster hardware or by using tools that scale to high-performance computing workloads, but the cost of processing and storage may become prohibitive.
One solution may come from the National Institutes of Health. The agency’s Big Data to Knowledge initiative aims to advance the science and utility of big data in biomedical and behavioral research and to create innovative approaches, methods, software and tools for big data.
Meanwhile, the National Cancer Institute this year set up pilot projects to test the feasibility of a "cancer knowledge cloud" that would combine storage repositories and computing power in the cloud.
But the more likely next big thing in big data could be an open-source cluster computing system called Spark. It speeds programming and can run up to 100 times faster than Hadoop MapReduce, according to its developers. Spark offers a general execution model that can optimize arbitrary operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines such as Hadoop MapReduce. To make programming faster, Spark provides clean, concise APIs in Scala, Java and Python.
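A minimal Scala sketch of Spark's core RDD API shows where that speed comes from; the application name and HDFS path below are hypothetical, and this assumes a Spark 1.x-era deployment like the one described here.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LogQuery {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LogQuery")
    val sc = new SparkContext(conf)

    // Read a text file from HDFS into a resilient distributed dataset (RDD).
    val lines = sc.textFile("hdfs:///data/app.log")

    // cache() keeps the filtered RDD in cluster memory, so repeated
    // queries avoid rereading from disk -- the source of Spark's edge
    // over disk-based MapReduce on iterative workloads.
    val errors = lines.filter(_.contains("ERROR")).cache()

    // The first action pays the disk-read cost and populates the cache ...
    println(s"Total errors: ${errors.count()}")
    // ... later queries run against the in-memory copy.
    println(s"Timeouts: ${errors.filter(_.contains("timeout")).count()}")

    sc.stop()
  }
}
```

The second query never touches disk, which is the kind of iterative, interactive workload behind the developers' speedup claims.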
Spark was originally created at the University of California, Berkeley's AMPLab, and more than 25 companies have contributed code to it, making it the largest open-source big data development community, according to SiliconANGLE. And it's moving into the mainstream: Cloudera recently announced direct support for Apache Spark, giving Cloudera users a way to perform rapid, resilient in-memory processing of datasets stored in Hadoop, as well as general data processing.
Although it is an exciting time for big data, agencies should be cautious about the tools they select, said Dale Wickizer, chief technology officer of NetApp U.S. Public Sector. With so many companies popping up with big data offerings, agency managers have to consider whether each will still be around in five to 10 years.
“Agencies should always have a fallback plan, as well as a robust, underlying infrastructure that enables them to quickly checkpoint and restart when problems do arise,” he said. Agencies can also benefit from big data solutions built on an open ecosystem of partners that ensures a complete offering.
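Spark itself offers one concrete version of the checkpoint-and-restart pattern Wickizer describes. The sketch below assumes a Spark deployment with HDFS available; the directory and input paths are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CheckpointDemo"))
    // Durable location (typically HDFS) where Spark saves RDD snapshots.
    sc.setCheckpointDir("hdfs:///checkpoints")

    val firstColumn = sc.textFile("hdfs:///data/records.csv")
      .map(_.split(",")(0))

    // checkpoint() truncates the RDD's lineage and persists it to the
    // checkpoint directory, so a failed job can restart from this point
    // instead of recomputing everything from the raw input.
    firstColumn.checkpoint()
    println(firstColumn.count())

    sc.stop()
  }
}
```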