Mapping and Reducing the State of Hadoop

Jacky Liang

In this blog post, part one of a two-part series, we look at the state of Hadoop from a macro perspective. In the second part of this series, we will look at how Hadoop and MemSQL can work together to solve many of the problems described here.

2008 was a big year for Apache Hadoop. It appeared that organizations had finally found a panacea for the exploding quantities of data generated by the rise of the mobile and desktop web.

Yahoo launched the world’s largest Apache Hadoop production application. They also won the “terabyte sort” benchmark, sorting a terabyte of data in 209 seconds. Apache Pig – a language that makes it easier to query Hadoop clusters – and Apache Hive – a SQL-ish language for Hadoop – were actively being developed, by Yahoo and Facebook respectively. Cloudera, now the biggest software and services company for Apache Hadoop, was also founded.

Data sizes were exploding with the continued rise of web and mobile traffic, pushing existing data infrastructure to its absolute limits. As a result, the term “big data” was coined around this time too.

Then came Hadoop, promising organizations that they could answer any question they had about their data.

The promise: You simply need to collect all your data in one location and run it on free Apache Hadoop software, using cheap, scalable commodity hardware. Hadoop also introduced the Hadoop Distributed File System (HDFS), which allows data to be spread across many disks and servers. Not only is the data stored, but it’s also replicated two to three times across servers (three copies is the HDFS default), protecting against data loss even when a server goes down. Another benefit of using Hadoop is that there is no practical limit on the size of files stored in HDFS, so you can continuously append data to the files, as in the case of server logs.
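To make the replication idea concrete, here is a minimal sketch in plain Python (not Hadoop code). It splits a file into fixed-size blocks and assigns each block to several nodes. The node names, the 1,000 MB file, and the round-robin placement are assumptions made for illustration; real HDFS uses a rack-aware placement policy, and the 128 MB block size simply mirrors the default in recent Hadoop versions.

```python
import math

def place_blocks(file_size_mb, block_size_mb, replication, nodes):
    """Split a file into blocks and assign each block to `replication`
    distinct nodes, round-robin (a simplification of real HDFS placement)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }

def surviving_blocks(placement, failed_node):
    """Return the blocks that still have at least one replica on a live node."""
    return [
        block for block, replicas in placement.items()
        if any(node != failed_node for node in replicas)
    ]

# Hypothetical four-node cluster and a 1,000 MB log file.
nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = place_blocks(file_size_mb=1000, block_size_mb=128,
                         replication=3, nodes=nodes)

# Raw disk consumed is the logical size multiplied by the replication factor.
raw_storage_mb = 1000 * 3

print(len(placement), "blocks,", raw_storage_mb, "MB of raw storage")
print("blocks readable after node-a fails:",
      len(surviving_blocks(placement, "node-a")))
```

The same arithmetic that protects the data also triples the raw storage it occupies, which is worth keeping in mind when reading the cost discussion later in this post.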

Facebook claimed to have the largest Hadoop cluster in the world, at 21 petabytes of storage, in 2010. By 2017, more than half of the Fortune 50 companies were using Hadoop. Cloudera and Hortonworks became multi-billion dollar public companies. For an open source project that had only begun in 2006, Hadoop became a household name in the tech industry in the span of under a decade.

The only direction is up for Hadoop, right?

However, many industry veterans and experts are saying Hadoop perhaps isn’t the panacea for big data problems that it’s been hyped up to be.

Just last year, in 2018, Cloudera and Hortonworks announced their merger. The CEO of Cloudera shared an optimistic message about the future of Hadoop, but many in the industry disagree.

“I can’t find any innovation benefits to customers in this merger,” said John Schroeder, CEO and Chairman of the Board at MapR. “It is entirely about cost cutting and rationalization. This means their customers will suffer.”

Ashish Thusoo, the CEO of Qubole, also has a grim outlook on Hadoop in general: “the market is evolving away from the Hadoop vendors – who haven’t been able to fulfill their promise to customers.”

While Hadoop promised the world a single data store for all of your data, on cheap and scalable commodity hardware, the reality of operationalizing that data was not so easy. After speaking with a number of data experts at MemSQL, reading articles by industry experts, and reviewing surveys from Gartner, we noticed a number of things that are slowing Hadoop growth and deployment within existing enterprises. The data shows that the rocketship growth of Hadoop was partly driven by fear of being left behind, especially among technology executives, who overwhelmingly initiate Hadoop adoption: according to Gartner, 68% of adoption is initiated within the C-suite. We will also explore the limitations of Hadoop in various use cases, especially as the enterprise data industry keeps changing.

Let’s dive in.

Hype

In a survey released in 2015, Gartner says that an important point to look at with Hadoop adoption is the low number of Hadoop users within an organization, which indicates that “Hadoop is failing to live up to its promise.”

Gartner says that hype and market pressure were among the main reasons for interest in Hadoop. This is not a surprise to many, as it’s hard to avoid hearing Hadoop and big data in the same sentence. Gartner offers the following piece of advice for C-suite executives interested in deploying Hadoop:

“CEOs, CIOs and CTOs (either singularly or due to pressure from their boards) may feel they are being left behind by their competitors, based on press and hype about Hadoop or big data in general. Being fresh to their roles, the new chief of innovation and data may feel pressured into taking some form of action. Adopting Hadoop, arguably ‘the tallest tree in the big data forest’, provides the opportunity.”

The survey warns against adopting Hadoop out of fear of being left behind: Hadoop adoption is still at an early-adopter phase, with skills and successes still rare. A concrete piece of advice from Gartner is to start with small projects backed by business stakeholders to see if Hadoop is helpful in addressing core business problems. Starting with small deployments allows an organization to develop skills and build a record of success before tackling larger projects.

Skills Shortage

When using Hadoop for analytics, you lose the familiar benefits of SQL.

According to the same survey cited above, around 70% of organizations have relatively few Hadoop developers and users. The low number of Hadoop users per organization is attributed to Hadoop being innately unsuitable for large numbers of simultaneous users. It also indicates how difficult it is to hire Hadoop developers amid a skills shortage, which leads to our next point: cost.

Cost

Two facts about Apache Hadoop:

  1. It’s free to use. Forever.
  2. You can use cheap commodity hardware.

But Hadoop is still very expensive. Why?

While Hadoop may have a cheap upfront cost in software use and hosting, everything after that is anything but cheap.

To make Hadoop work for more than just engineers, there need to be multiple abstraction layers on top. Having additional copies of the data for Hive, Presto, Spark, Impala, etc., means additional cost in hardware, maintenance, and operations. Adding layers on top also means additional operations and engineering work to maintain the infrastructure.

While Hadoop may seem cheap in terms of upfront cost, the costs for maintenance, hosting, storage, operations, and analysis make it anything but.

Easy to Get Data In, but Very Tough to Get Data Out

Getting data into Hadoop is very easy, but it turns out that getting data out and deriving meaningful insight from your data is very tough.

A person working on data stored in Hadoop – usually an engineer, not an analyst – is expected to have at least some knowledge of HDFS, MapReduce, and Java. One also needs to learn how to set up the Hadoop infrastructure, which is another major project in itself. In our conversations with people in the industry who have deployed Hadoop or who work closely with organizations that use it, this came up as the biggest pain point of the technology: how hard it is to run analytics on Hadoop data.

Many technologies have been built to tackle the complexities of Hadoop, such as Spark (data processing engine), Pig (data flow language), and Hive (a SQL-like extension on top of Hadoop). These extra layers add more complexity to an already-complex data infrastructure. This usually means more potential points of failure.

Hiring Software Engineers is Expensive

A range of software skills is needed to make Hadoop work. If Hadoop is used without an abstraction layer such as Hive or Impala on top, queries have to be written as MapReduce jobs, which are typically written in Java. Working in Java means hiring software engineers rather than analysts who are proficient in SQL.
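To give a feel for the gap between a MapReduce job and a SQL query, here is a minimal sketch of the map, shuffle, and reduce steps for a simple "count per country" question. It is written in plain Python rather than against the Hadoop Java API, so it only illustrates the shape of the work. The records and the `page_views` table name are hypothetical.

```python
from collections import defaultdict

# Hypothetical log records: (user_id, country)
records = [
    ("u1", "US"), ("u2", "DE"), ("u3", "US"),
    ("u4", "JP"), ("u5", "DE"), ("u6", "US"),
]

def map_phase(record):
    """Emit a (key, 1) pair for each input record."""
    _user_id, country = record
    yield (country, 1)

def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return (key, sum(values))

mapped = [pair for record in records for pair in map_phase(record)]
counts = dict(reduce_phase(key, values)
              for key, values in shuffle(mapped).items())
print(counts)

# The query an analyst would write for the same result (hypothetical table):
EQUIVALENT_SQL = "SELECT country, COUNT(*) FROM page_views GROUP BY country;"
```

Even in this toy form, the MapReduce version asks the author to think in terms of keys, phases, and grouping, while the SQL version is a single declarative line.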

Software engineers with Hadoop skills are expensive, with an average salary in the U.S. of $84,000 a year (not including bonuses, benefits, etc.). A survey by Gartner states that “obtaining the necessary skills and capabilities [is] the largest challenge for Hadoop (57%).”

Your engineering team is likely the most expensive, constrained, and tough-to-hire-for resource in your organization. When you adopt Hadoop, you then require engineers for a job that an analyst proficient in SQL could otherwise do. On top of the Hadoop infrastructure and the abstraction layers you need in order to get data out more easily, you now need to account for the engineering resources required. This is not cheap at all.

Businesses Want Answers NOW

As businesses go international and customers demand instant responsiveness around the clock, companies are pushed to become real-time enterprises. Whether the goal is deriving real-time insights into product usage, identifying financial fraud as it happens, providing customer dashboards that show investment returns in milliseconds rather than hours, or understanding ad spend results on an up-to-the-second basis, waiting for queries to Map and Reduce simply no longer serves the immediate business need.

It remains true that Hadoop is incredible for crunching through large sets of data, as batch processing is its core strength. There are ways to augment Hadoop’s real-time decision abilities, such as using Kafka streams. But in this case, what’s meant to be real-time processing slows down to micro-batching.

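Spark Streaming is another way to speed up Hadoop, but it has its own limitations: it, too, processes data in micro-batches. Finally, Apache projects like Storm also fall back on micro-batching (in Storm’s case, through its Trident API), so they still fall short of true real time.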
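A rough way to see why micro-batching is not the same as real time: each event has to wait until its batch closes before anything is done with it. The sketch below is a simplified model rather than code for any particular streaming engine. The arrival times and the 500 ms batch interval are made-up values, and processing time after the batch closes is ignored.

```python
def micro_batch_delays(arrival_times_ms, batch_interval_ms):
    """For each event, return how long it waits until its batch closes."""
    delays = []
    for t in arrival_times_ms:
        # The batch containing time t closes at the next interval boundary.
        batch_close = (t // batch_interval_ms + 1) * batch_interval_ms
        delays.append(batch_close - t)
    return delays

# Hypothetical event arrival times, in milliseconds since the stream started.
arrivals = [0, 120, 480, 510, 990]
delays = micro_batch_delays(arrivals, batch_interval_ms=500)

print("wait per event (ms):", delays)
print("worst case (ms):", max(delays))
print("average (ms):", sum(delays) / len(delays))
```

In this model the worst-case wait equals the batch interval itself, so shrinking the interval reduces latency but never removes it, and it comes before any time spent actually processing the batch.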

Another point to consider is that each of the technologies mentioned above adds another piece of complexity to an already-complex data infrastructure. Adding multiple layers between Hadoop and SQL-based analytic tools also means slower responses, multiplied cost, and additional maintenance.

In short, Hadoop is not optimized for real-time decision making. This means it may not be well-suited to the evolving information demands of businesses in the 21st century.

In part one of this two-part series on Hadoop, we talked about the rise of Hadoop, why it looked like the solution to organizations’ big data problems, and where it fell short. In the next part of this series, we will explore why combining Hadoop with MemSQL may help businesses that are already invested in Hadoop.