Hadoop: The good, the bad and the ugly
Government IT managers should be wary of technology overreach and focus on Hadoop's known success areas.
Hadoop is a disruptive force in the traditional data management space. However, there are both good and bad sides to the disruption, as well as some ugly marketing hype fueling it.
The good side of Hadoop’s disruption is in the realm of big data.
Hadoop is an open source, Java-based ecosystem of technologies that exploit many low-cost, commodity machines to process huge amounts of data in a reasonable time. The bottom line is that Hadoop works for big data, functions well at a low cost and is improving every day. A recent Forrester report called Hadoop’s momentum “unstoppable.”
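To make that concrete, the programming model behind much of that processing is MapReduce: a job is split into a map phase that runs in parallel across the cluster and a reduce phase that aggregates the results. The sketch below is the canonical word-count job written against Hadoop's MapReduce Java API; it is illustrative only, and the class and path names are arbitrary.

    // Canonical word-count job against the Hadoop MapReduce Java API.
    // Illustrative sketch only; class and path names are arbitrary.
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: runs in parallel across the cluster, one input split per mapper.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // emit (word, 1)
          }
        }
      }

      // Reduce phase: sums the counts emitted for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }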
Currently there are hundreds, even thousands, of contributors to the Hadoop community, including dozens of large companies such as Microsoft, IBM, Teradata and Intel. Hadoop has proven to be a robust way to process big data, and its ecosystem of complementary technologies is growing every day.
But there’s a bad side to Hadoop’s disruption.
First, its very success is causing many players to jump in, which increases both the confusion and the pace of change in the technology. The current state of Hadoop is in radical flux. Every part of the ecosystem is undergoing both rapid acceleration and experimentation.
Furthermore, parts of the ecosystem are extremely immature. When I tech edited the book "Professional Hadoop Solutions," I saw firsthand how some newer technologies like Oozie had configuration-file schemas that were very immature and will undergo significant change as those technologies mature.
Hadoop 2.0 only came out in 2013 with a new foundational layer called YARN. Now there is Apache Spark, a more general-purpose parallel computation engine that competes with Hadoop MapReduce and is faster than it. It is not unrealistic to say that the technology is experiencing both extreme success and extreme churn simultaneously.
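To illustrate the contrast, here is the same word count expressed with Spark's Java API. This is a rough sketch assuming a Spark 1.x installation and Java 8 (the flatMap signature changed in later Spark releases); the point is not the specific code but that Spark expresses the job as a short pipeline of transformations and keeps intermediate data in memory where it can, which is where its speed advantage over MapReduce comes from.

    // Word count with Spark's Java API. Rough sketch assuming Spark 1.x and Java 8;
    // input and output paths are arbitrary placeholders.
    import java.util.Arrays;

    import scala.Tuple2;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SparkWordCount {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spark word count");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile(args[0]);   // e.g., an HDFS path

        JavaPairRDD<String, Integer> counts = lines
            .flatMap(line -> Arrays.asList(line.split("\\s+")))  // split lines into words
            .mapToPair(word -> new Tuple2<>(word, 1))             // emit (word, 1)
            .reduceByKey((a, b) -> a + b);                        // sum counts, in memory where possible

        counts.saveAsTextFile(args[1]);
        sc.stop();
      }
    }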
Second, there is immaturity in terms of features, which increases the risk of adoption. Technologies like Facebook's Presto compete with Apache Hive, a data warehouse infrastructure built on top of Hadoop. As with any other emerging technology, it is best to steer away from the bleeding edge and toward the more stable core components.
The ugly side of Hadoop's disruption is the technology overreach fueled by the marketing departments of numerous new entrants to the Hadoop/big data space. Hortonworks Inc., which focuses on the support of Hadoop and just received a $100 million investment, recently published a whitepaper titled "A Modern Data Architecture with Apache Hadoop: The Journey to a Data Lake."
The paper makes the case for augmenting your current enterprise data warehouse and data management architecture with a Hadoop installation to create a “data lake.” Of course, data lake is a newly minted term that basically promises a single place to store all your data where it can be analyzed by numerous applications at any time.
Basically, it’s a play for “Hadoop Everywhere and Hadoop for ALL DATA.” To say this is a bold statement by Hortonworks is being kind. The vision of a data lake is not a bad vision – a store-everything approach is worthwhile. However, it is wildly unrealistic to say that Hadoop can get you that today. Executing successfully on that vision is a minimum of five years out.
On the positive side, let me add that I do believe Hadoop can achieve this vision if it continues on its current trajectory – it is just not there today. For example, the Hadoop Distributed File System is geared toward extremely large files, a design that does not accommodate the many small files a store-everything approach would collect. Additionally, Hadoop's analysis features are geared to processing homogeneous data like Web logs, sensor data and clickstream data, which is at odds with the vision of storing everything, including a wide variety of heterogeneous formats.
A reality check that compares Hadoop's current ability to handle data management tasks outside its big data realm with mature data management technologies like ETL tools and data warehouses can lead to only one conclusion: hyperbole like the Hortonworks whitepaper is a classic case of technology overreach.
So government IT managers should be wary of technology overreach and focus on Hadoop's known success areas, applying the right tool to the right challenge. For right now, Hadoop successfully tackles big data.
For any other use of Hadoop at this time, your mantra is caveat emptor.
Michael C. Daconta (mdaconta@incadencecorp.com or @mdaconta) is the Vice President of Advanced Technology at InCadence Strategic Solutions and the former Metadata Program Manager for the Homeland Security Department. His new book is titled "The Great Cloud Migration: Your Roadmap to Cloud Computing, Big Data and Linked Data."