CASE STUDY

Case Study: Augmenting Hadoop with MemSQL at a Fortune 50 Company

Floyd Smith

A Fortune 50 company tried traditional SQL databases, then Hadoop, to manage financial updates company-wide. They finally found a successful solution by augmenting Hadoop with MemSQL.

Many companies today are straining to meet the needs for faster ad hoc analytics, dashboards, reporting, and support for modern applications that depend on fast access to the latest data – in particular, financial data.

Facing competitive and customer pressures, IT organizations must respond to a rapid increase in the volume of incoming data and the needs of the business to use this data for rapid decision-making, improving operations, and providing better customer experiences. These problems are especially severe in larger traditional companies that need to innovate.

Traditional relational databases, lacking scalability, have run out of gas for solving these problems. Many companies have turned to Hadoop, a NoSQL solution, to rapidly ingest data and move it to a data lake. However, the lack of SQL support and performance issues for queries – which affect ad hoc queries, reporting, dashboards, and apps – have made this approach unproductive for many.

A diversified Fortune 50 company approached MemSQL to help solve exactly this problem. The company first attempted to meet its analytics needs with an existing, legacy relational database. They then introduced a NoSQL solution based on Hadoop. The company found each of these solutions to be inadequate.

Ultimately, this company augmented Hadoop with MemSQL. They now use MemSQL to support an operational data store for company-wide data analytics, including access by executives. Hadoop still powers a data lake, which is used largely to support data scientists.

Detailed below is our customer’s journey and the benefits they saw from moving their core analytics processing to MemSQL. Included is a reference architecture for a combined implementation of MemSQL alongside Hadoop.

Displacing Hadoop for Analytics and Business Intelligence

The MemSQL customer in this case study needed fast analytics at the divisional level, as well as the ability to continually roll up data across the company for quarter-end reporting and communications to investors.

Originally, the company used a legacy relational database to drive analytics. Business analysts used business intelligence (BI) tools such as Business Objects and Tableau to derive insights. Analysts, knowledgeable in the needs of the business, were able to deliver real value to the company.

However, data volumes continued to increase, while more users wanted analytics access. Analytics users, from business analysts to top management, demanded greater responsiveness. And the analytics system needed to be ready to handle demands from machine learning and artificial intelligence software on both operational and legacy data.

The system that drove analytics was unable to keep up with the transaction speeds, query responsiveness, and concurrency levels required by the business. As a first attempt to solve the problem, the company augmented the legacy system with a Hadoop solution, with the final goal being to fully replace the legacy system with a Hadoop/Big Data solution.

The first try at change was a Hadoop/HDFS data lake
The company’s first try at a solution, a data lake supported by Hadoop
and HDFS, has become a common analytics architecture

Unfortunately, the move to Hadoop was unable to deliver for this customer – as it has for many others. Many data lake projects fail to make it into production, and others experience serious problems after implementation.

Hadoop is designed for unstructured data, and this causes many serious problems:

  • Batch processing. With Hadoop, input data is processed in batches, which is time-consuming. The data available to analytics users is not current.
  • Large appetite for hardware. Both batch processing and analytics responses take a lot of processing power, so companies spend a great deal to power Hadoop solutions.
  • Governance and security challenges. Because Hadoop data is less organized than data in a relational database, it’s hard to identify and protect the most sensitive items.
  • Complex system. Big Data solutions involve an entire ecosystem with many specialized components, which make for a difficult system to manage.
  • Complex queries. Queries have to be written against complex data, with careful study of each data source needed.
  • Slow queries. Queries run slowly, and users must either spend time up front optimizing them, or spend extra time waiting for results.
  • No SQL. NoSQL eliminates the ease, speed, and business intelligence (BI) tool support found with SQL, preventing companies from getting maximum value from their data.

Queries, Sorted: Augmenting Hadoop with MemSQL

The company had two competing sets of needs. Data scientists wanted access to raw data, and were willing to do the work to transform, catalog, and analyze data as needed. Business analytics users (including the executive team), BI tools, and existing applications and dashboards needed structured data and SQL support.

By augmenting Hadoop with MemSQL, the company can meet all these needs at once. MemSQL is a massively-scalable, high-performance relational database. MemSQL is a leader in a new category of databases, commonly described as NewSQL.

As a leading NewSQL database, MemSQL offers a cloud-native, distributed architecture (details here). It combines the ability to handle streaming ingest, transactions, and analytics, including SQL support.

MemSQL is a fast, scalable SQL database
MemSQL adds scalability to transactional databases, with
SQL support, offering a wide range of benefits

With MemSQL holding the operational data store, the company is once again able to unleash its business analysts on its most important data. The same operational data store is available for executive queries and financial reporting.

MemSQL drives analytics while HDFS is reserved for data science use
The client now runs an operational data store on MemSQL
for most needs, plus a data lake on HDFS for data science

Hadoop and HDFS are still used, but as a data archive, to support audits, and as a data lake for use by data scientists. (Who, of course, use the operational data store as well.) Work in the data lake yields valuable results, while the operational data store supports the ease of access, efficiency, and rapid response needed by the rest of the business.

Choosing an Ingest Strategy

With MemSQL used as an operational data store for real-time analytics, and Hadoop as a data lake, the customer had three choices as to how to ingest data:

  • MemSQL-first ingest. The customer could replicate the data, using change data capture (CDC) from some sources and MemSQL Pipelines from others. Some or all of the data could then be extracted and loaded to Hadoop/Hive.
  • Simultaneous ingest. The customer could replicate the data to both the targets in parallel. In either case, some data could be discarded on ingest, but the default choice for any incoming data would be to go to both databases at once.
  • Hadoop-first ingest. The customer could bring all data into Hadoop in a batch operation, then transfer all of the data, or a subset, to MemSQL as needed.

Each approach has either advantages or disadvantages.

Advantages Disadvantages Use Case
MemSQL-first ETL, CDC and
MemSQL Pipelines
stream data;
analytics run against
live data
Fast analytics
critical; MemSQL
needs most or all of
the data
Simultaneous Added complexity, as
both streams need to
be managed as
primary
Fast analytics
critical; both
MemSQL and
Hadoop need most
or all of the data
Hadoop-first Major delay for
MemSQL ingest &
analytics, happening
after Hadoop
batch ingest
Many users try this,
only to realize they
need an operational
data store w/SQL

The MemSQL-first approach gets you analytics against live data,
while other approaches add complexity and delays

The customer chose direct ingest to MemSQL. Data is also sent to an operational data store, then batched to Hadoop Their whole reason for using MemSQL was to speed up analytics, so putting Hadoop first – leading to delays in data getting into MemSQL – was a poor option. Because both MemSQL and Hadoop would be getting most of the data, and to simplify troubleshooting if a problem occurred, the customer decided to keep MemSQL and Hadoop ingest separate.

With MemSQL, the customer has the same advantages that customers have long sought from a traditional data warehouse: direct access to data for faster analytics performance. A traditional data warehouse requires pre-aggregation of data, where this is needed less, or not at all for MemSQL. And MemSQL adds advantages of its own: better performance than traditional data warehousing products and the ability to query live data.

MemSQL compares even more positively to the customer’s former Hadoop-based data warehouse-type implementation. Hadoop is not designed to support ad hoc queries and dashboards. It lacks SQL support, meaning that it can’t be used easily with most business intelligence (BI) tools, and it can’t be queried conveniently by business analysts and others who want to use SQL for ad hoc queries, not write a computer program.

Hadoop MemSQL
Performance Poor Excellent
Concurrency Poor Excellent
Queries Custom SQL ad hoc queries & BI tools
Accessible by Data scientists Business analysts, executives, data scientists
Governance support Poor Excellent

Hadoop lacks key analytics features supported
by MemSQL, especially SQL support

Analytics with No Limits

The company’s use of data is now on much firmer footing. Thousands of BI tool users and more than five hundred direct users, spread across roughly a dozen divisions, use MemSQL, BI tools, and SQL queries for all their analytics needs. Many terabytes of data are accessible to users across the globe. Analysts and other users work with up-to-date data, at high concurrency, with fast query performance.

Analytics work has dramatically improved:

  • From batch mode to near-real-time: Analytics now run against near-real-time data. Information that used to take days to arrive now takes minutes.
  • From spreadsheets to BI: This Fortune 50 company (like many) was doing much of its core analytics work by exporting data from Hadoop into Excel spreadsheets. With MemSQL, the full range of BI tools is now available as needed.
  • Instant access for live queries: Using SQL, any analyst can make any query, anytime, and get results quickly. In combination with BI tools, this supports the kind of interactive probing that leads to real, timely, actionable insights.
  • Executive accessibility: There’s now a single source of truth for investor reporting. The CEO and senior executives have direct access to insight into operations.
Hadoop + Hive MemSQL
Processing mode Batch Real-time
Analytics tools used Excel spreadsheets Wide range of BI tools
Access mode Custom queries SQL queries
Executive accessibility Inaccessible Direct

MemSQL has improved analytics
performance and accessibility

In addition to the outcomes, the deployment process and operations for MemSQL are also considered a success for the IT team:

  • Time to go live was less than three quarters.
  • Deployment across the company was managed by a single team of just two people.
    MemSQL runs faster on a six-node cluster than Hadoop + Hive did on fifty nodes.
    MemSQL offered superior total cost of ownership, with operational savings more than offsetting licensing costs for MemSQL.
  • As impressive as it is, this use case only shows part of what MemSQL can do. The same company is now implementing MemSQL in a different use case that takes advantage of a wider range of MemSQL’s capabilities.

Spotlight on Pipelines

MemSQL’s Pipelines capability makes it uniquely easy to ingest data. With Pipelines, MemSQL streams data in from any source and transforms it “in flight”, then loads it rapidly into MemSQL. The traditional extract, transform, and load (ETL) process is performed on streaming data in minutes, rather than on batched data in a period of hours.

MemSQL pipelines replace ETL with data streaming
MemSQL Pipelines turn a slow ETL process into a rapid streaming operation.

To extend the Pipelines capability, MemSQL has introduced Pipelines to stored procedures. Stored procedures add flexibility and transactional guarantees to the streaming capabilities of Pipelines. Hadoop and HDFS, by contrast, ingest data in a slow batch process, with little transformation capability, and no access to stored procedures.

Conclusion

The Fortune 50 company described here is facing the same challenges with its use of data as most other enterprises worldwide. MemSQL has allowed them to provide better access to data, to more people across their company, at a lower cost than other solutions. The results that they have achieved are encouraging them to increase their use of MemSQL for analytical, and also for transactional use cases. The same may be true for your organization.

For more discussion about the use of MemSQL with Hadoop, view our recent webinar and access the slides. And consider trying MemSQL for free today.

MemSQL Helios eclipse
Introducing
MemSQL Helios
The World’s Fastest Cloud Database