Traditional data warehouses and databases were built for workloads that manifested 20 years ago. They are sufficient for what they were built to do, but these systems are struggling to meet the demands of modern business with the volume, velocity, and user demand of data. IT departments are being challenged from both ends. On one side, companies want to analyze the deluge of data in real time, or near real time. On the other side, on the consumption end, the need to analyze and get value out of data is increasing exponentially. A decade ago, companies had a handful of analysts who ran reports and a few dashboards. Today, enterprises have armies of data scientists, analysts, and savvy business users wanting to slice and dice the latest data.
The “Flying Car” Dilemma
Companies have invested millions of dollars in procuring and customizing legacy systems. In addition to the capital expense, these companies have spent years hiring and training resources to support and maintain these systems. Thus, IT is under tremendous pressure to leverage these investments to meet the evolving needs of the business.
However, these systems were not designed for modern data workloads that emphasize real-time insights to changing customer or machine conditions. Expecting legacy systems to do this is analogous to expecting a car to suddenly adapt and begin flying like a jet. Putting wings on the car and making modifications might work temporarily, but it’s unlikely to work long term. In essence, it is a bad idea to modify a car to fly. This is the “flying car dilemma”.
Can You Replace a Car with a Jet?
This is the architecture of a traditional data warehouse. The source data starts with a transaction system, which is typically a database, such as Oracle or SQL Server. Periodically, you have to use ETL tools such as Informatica to batch load data from your transactional system into a data warehouse product, such as Teradata, Oracle, or Sybase IQ. Then, you run reports in the data warehousing environment. This worked fine for nightly and weekly analytics, but in today’s world this approach has the following challenges:
First, by the time data gets to the warehouse, it is 24 to 48 hours old. Companies today need real-time or near real-time data to compete against each other. To accelerate the ingest rate and speed up access to the data, you need to buy refrigerator-size appliances that are expensive and difficult to upgrade (remember putting wings on a car).
Additionally, the legacy systems are not able to support the concurrency that is required to support the demand for instant data access from executives, analysts, and data scientists.
The solution to this problem is not “rip and replace.” Replacing a legacy system is similar to trading your car in for a jet aircraft. First off, jets are not affordable. Secondly, even if you buy a jet aircraft, you need new infrastructure such as runways, mechanics, pilots, and to setup new processes for the jet to work.
Ripping and replacing the legacy system has similar challenges:
- Legacy systems are heavily customized. Migration is very costly, time consuming, and complex.
- Training the entire staff to work with a brand new system is expensive and retraining is not easy.
- Changing processes, that are embedded in an organization for decades, takes time.
There is a reason why we still have mainframes in modern enterprises.
Augment Instead of Replace: The FedEx Drone Approach
The sensible solution to the “flying car dilemma” is to take the approach a company would take with shipping. For example, let’s say a company is using FedEx Overnight for packages that need to be delivered the next day, and FedEx Ground for packages that need to be delivered in a week.
Taking this approach with a data warehouse would mean isolating workloads that contain real-time analytics, high ingest, and high concurrency [FedEx Overnight] from the workloads that do not have these requirements [FedEx Ground]. Now we put the “Fedex Overnight” workloads on a modern real-time data warehouse such as MemSQL and allow the “FedEx Ground” workloads to continue on a legacy system. Here we are not using an expensive jet to replace a car, rather we are using a drone that is inexpensive, easy to use, and most of all easy to adopt.
Using a product such as MemSQL has the following advantages:
- It works with your existing databases (integration capability)
- It works with your existing ETL tools
- It has the ability to work with open source computing frameworks and modern messaging platforms, such as Spark, Kafka, and others
- MemSQL is fully compliant with SQL, which means all the investments that you have made in manpower for your legacy systems can be leveraged
Some of the technical advantages include:
- Real-Time Ingestion: Modern, lock-free, and skip list technologies remove ingest bottlenecks
- Scalable: Scale-out on commodity hardware for high concurrency requirements
- Fast Queries: Low latency queries that leverage vectorization and compilation
- Machine Learning: Integration with machine learning tooling
Augmenting the current architecture is a pragmatic approach for solving the flying car problem. Over time, as the demand for real-time requirements increase, more and more workloads can be added to MemSQL (the drone slowly transitions into a jet). Since MemSQL converges transactional and analytical workloads, the architecture is simplified over time. As shown in the diagram below, this reduces, and sometimes eliminates, the need for costly ETL tools and complex processes.
So if you have a “flying car dilemma,” take the “FedEx approach,” and start with a drone and over time transition into a jet.