In this blog post, the first in a two-part series, I’m going to describe pre-modern databases: traditional relational databases, which support SQL but don’t scale out, and NoSQL databases, which scale out but don’t support SQL. In the next part, I’m going to talk about modern databases – which scale out, and which do support SQL – and how they are well suited for an important new workload: operational analytics.
In the Beginning: OLTP
Online transaction processing (OLTP) emerged several decades ago as a way to enable database customers to create an interactive experience for users, powered by back-end systems. Prior to the existence of OLTP, a customer would perform an activity. Only at some point, relatively far off in the future, would the back-end system be updated to reflect the activity. If the activity caused a problem, that problem would not be known for several hours, or even a day or more, after the problematic activity or activities.
The classic example (and one of the main drivers for the emergence of the OLTP pattern) was the ATM. Prior to the arrival of ATMs, a customer would go to the bank counter to withdraw or deposit money. Back-end systems, either paper or computer, would be updated at the end of the day. This was slow, inefficient, error prone, and did not allow for a real-time understanding of the current state. For instance, a customer might withdraw more money than they had in their account.
With the arrival of ATMs, around 1970, a customer could self-serve the cash withdrawal, deposits, or other transactions. The customer moved from nine to five access to 24/7 access. ATMs also allowed a customer to understand in real time what the state of their account was. With these new features, the requirements for the backend systems became a lot more complex. Specifically data lookups, transactionality, availability, reliability, and scalability – the latter being more and more important as customers demanded access to their information and money from any point on the globe.
The data access pattern for OLTP is to retrieve a small set of data, usually by doing a lookup on an ID. For example, the account information for a given customer ID. The system also must be able to write back a small amount of information based on the given ID. So the system needs the ability to do fast lookups, fast point inserts, and fast updates or deletes.
Transaction support is arguably the most important characteristic that OLTP offers, as reflected in the name itself. A database transaction means a set of actions that are either all completed, or none of them are completed; there is no middle ground. For example, an ATM has to guarantee that it either gave the customer the money and debited their account, or did not give the customer money and did not debit their account. Only giving the customer money, but not debiting their account, harms the bank; only debiting the account, but not giving the customer money, harms the customer.
Note that doing neither of the actions – not giving the money, and not debiting the account – is an unhappy customer experience, but still leaves the system in a consistent state. This is why the notion of a database transaction is so powerful. It guarantees the atomicity of a set of actions, where atomicity means that related actions happen, or don’t happen, as a unit.
Reliability is another key requirement. ATMs need to be always available, so customers can use one at any time. Uptime for the ATM is critical, even overcoming hardware or software failures, without human intervention. The system needs to be reliable because the interface is with the end customer and banks win on how well they deliver a positive customer experience. If the ATM fails every few times a customer tries it, the customer will get annoyed and switch to another bank.
Scalability is also a key requirement. Banks have millions of customers, and they will have tens of thousands of people hitting the back-end system at any given time. But the usage is not uniform. There are peak times when a lot more people hit the system.
For example, Friday is a common payday for companies. That means many customers will all be using the system around the same time to check on the balance and withdraw money. They will be seriously inconvenienced – and very unimpressed – if one, or some, or all of the ATMs go down at that point.
So banks need to scale to hundreds of thousands of users hitting the system concurrently on Friday afternoons. Hard to predict, one-off events, such as a hurricane or an earthquake, are among other examples that can also cause peaks. The worst case is often the one you didn’t see coming, so you need a very high level of resiliency even without having planned for the specific event that ends up occurring.
These requirements for the OLTP workload show up in many other use cases, such as retail transactions, billing, enterprise resource planning (widely known as ERP), customer relationship management (CRM), and just about any application where an end user is reviewing and manipulating data they have access to and where they expect to see the results of those changes immediately.
The existing legacy database systems were founded to solve these use cases over the last few decades, and they do a very good job of it, for the most part. The market for OLTP-oriented database software is in the tens of billions of dollars a year. However, with the rise of the Internet, and more and more transactional systems being built for orders of magnitude more people, legacy database systems have fallen behind in scaling to the level needed by modern applications.
The lack of scale out also makes it difficult for OLTP databases to handle analytical queries while successfully, reliably, and quickly running transactions. In addition, they lack the key technologies to perform the analytical queries efficiently. This has contributed to the need for separate, analytics-oriented databases, as described in the next section.
A key limitation is that OLTP databases have typically run on a single computing node. This means that the transactions that are the core of an OLTP database can only happen at the speed and volume dictated by the single system at the center of operations. In an IT world that is increasingly about scaling out – spreading operations across arbitrary numbers of servers – this has proven to be a very serious flaw indeed.
OLAP Emerges to Complement OLTP
After the evolution of OLTP, the other major pattern that has emerged is OLAP. OLAP emerged a bit after OLTP, as enterprises realized they needed fast and flexible access to the data stored in their OLTP systems.
OLTP system owners could, of course, directly query the OLTP system itself. However, OLTP systems were busy with transactions – any analytics use beyond the occasional query threatened to bog the OLTP systems down, limited to a single node as they were. And the OLAP queries quickly became important enough to have their own performance demands.
Analytics use would tax the resources of the OLTP system. Since the availability and reliability of the OLTP system were so important, it wasn’t safe to have just anyone running queries that might use up resources to any extent which would jeopardize the availability and reliability of the OLTP system.
In addition, people found that the kinds of analytics they wanted to do worked better with a different schema for the data than was optimal for the OLTP system. So they started copying the data over into another system, often called a data warehouse or a data mart. As part of the copying process, they would change the database schema to be optimal for the analytics queries they needed to do.
At first, OLTP databases worked reasonably well for analytics needs (as long as they ran analytics on a different server than the main OLTP workload). The legacy OLTP vendors included features such as grouping and aggregation in the SQL language to enable more complex analytics. However, the requirements of the analytics systems were different enough that a new breed of technology emerged that could satisfy analytics needs better, with features such as column-storage and read-only scale-out. Thus, the modern data warehouse was born.
The requirements for a data warehouse were the ability to run complex queries very fast; the ability to scale to handle large data sets (orders of magnitude larger than the original data from the OLTP system); and the ability to ingest large amounts of data in batches, from OLTP systems and other sources.
Unlike the OLTP data access patterns that were relatively simple, the query patterns for analytics are a lot more complicated. Trying to answer a question such as, “Show me the sales of product X, grouped by region and sales team, over the last two quarters,” requires a query that uses more complex functions and joins between multiple data sets.
These kinds of operations tend to work on aggregates of data records, grouping them across a large amount of data. Even though the result might be a small amount of data, the query has to scan a large amount of data to get to it.
Picking the right query plan to optimally fetch the data from disk requires a query optimizer. Query optimization has evolved into a specialty niche within the realm of computer science; there are only a small number of people in the world with deep expertise in it. This specialization is key to the performance of database queries, especially in the face of large data sets.
Building a really good query optimizer and query execution system in a distributed database system is hard. It requires a number of sophisticated components including statistics, cardinality estimation, plan space search, the right storage structures, fast query execution operators, intelligent shuffle, both broadcast and point-to-point data transmission, and more. Each of these components can take months or years of skilled developer effort to create, and more months and years to fine-tune.
Datasets for data warehouses can get quite big. This is because you are not just storing a copy of the current transactional data, but taking a snapshot of the state periodically and storing each snapshot going back in time.
Businesses often have a requirement to go back months, or even years, to understand how the business was doing previously and to look for trends. So while operational data sets range from a few gigabytes (GBs) to a few terabytes (TBs), a data warehouse ranges from hundreds of GBs to hundreds of TBs. For the raw data in the biggest systems, data sets can reach petabytes (PBs).
For example, imagine a bank that is storing the transactions for every customer account. The operational system just has to store the current balance for the account. But the analytics system needs to record every transaction in that account, going back for years.
As the systems grew into the multiple TBs, and into the PB range, it was a struggle to get enough computing and storage power into a single box to handle the load required. As a result, a modern data warehouse needs to be able to scale out to store and manage the data.
Scaling out a data warehouse is easier than scaling an OLTP system. This is because scaling queries is easier than scaling changes – inserts, updates, and deletes. You don’t need as much sophistication in your distributed transaction manager to maintain consistency. But the query processing needs to be aware of the fact that data is distributed over many machines, and it needs to have access to specific information about how the data is stored. Because building a distributed query processor is not easy, there have been only a few companies who have succeeded at doing this well.
Getting the Data In
Another big difference is how data is put into a data warehouse. In an OLTP system, data is entered by a user through interaction with the application. With a data warehouse, by contrast, data comes from other systems programmatically. Often, it arrives in batches and at off-peak times. The timing is chosen so that the work of sending data does not interfere with the availability of the OLTP system where the data is coming from.
Because the data is moved programmatically by data engineers, you don’t need the database platform to enforce constraints on the data to keep it consistent. Because it comes in batches, you want an API that can load large amounts of data quickly. (Many data warehouses have specialized APIs for this purpose.)
Lastly, the data warehouse is not typically available for queries during data loading. Historically, this process worked well for most businesses. For example, in a bank, customers would carry out transactions against the OLTP system, and the results could be batched and periodically pushed into the analytics system. Since statements were only sent out once a month, it didn’t matter if it took a couple of days before the data made it over to the analytics system.
So the result is a data warehouse that is queryable by a small number of data analysts. The analysts run a small number of complex queries during the day, and the system is offline for queries while loading data during the night. The availability and reliability requirements are lower than an OLTP system because it is not as big a deal if your analysts are offline. You don’t need transactions of the type supported by the OLTP system, because the data loading is controlled by your internal process.
The NoSQL Work Around
For more information on this topic, read our previous blog post: Thank You for Your Help, NoSQL, But We Got It from Here.
As the world “goes digital,” the amount of information available increases exponentially. In addition, the number of OLTP systems has increased dramatically, as has the number of users consuming them. The growth in data size and in the number of people who want to take advantage of the data has outstripped the capabilities of legacy databases to manage. As scale-out patterns have permeated more and more areas within the application tier, developers have started looking for scale-out alternatives for their data infrastructure.
In addition, the separation of OLTP and OLAP has meant that a lot of time, energy, and money go into extracting, transforming, and loading data – widely known as the ETL process – between the OLTP and OLAP sides of the house.
ETL is a huge problem. Companies spend billions of dollars on people and technology to keep the data moving. In addition to the cost, the consequence of ETL is that users are guaranteed to be working on stale data, with the newest data up to a day old.
With the crazy growth in the amount of data – and in demand for different ways of looking at the data – the OLAP systems fall further and further behind. One of my favorite quotes, from a data engineer at a large tech company facing this problem, is: “We deliver yesterday’s insights, tomorrow!”.
NoSQL came along promising an end to all this. NoSQL offered:
- Scalability. NoSQL systems offered a scale-out model that broke through the limits of the legacy database systems.
- No schema. NoSQL abandoned schema for unstructured and semi-structured formats, abandoning the rigid data typing and input checking that make database management challenging.
- Big data support. Massive processing power for large data sets.
All of this, though, came at several costs:
- No schema, no SQL. The lack of schema meant that SQL support was not only lacking from the get-go, but hard to achieve. Moreover, NoSQL application code is so intertwined with the organization of the data that application evolution becomes difficult. In other words, NoSQL systems lack the data independence found in SQL systems.
- No transactions. It’s very hard to run traditional transactions on unstructured or semi-structured data. So data was left unreconciled, but discoverable by applications, that would then have to sort things out.
- Slow analytics. Many of the NoSQL systems made it very easy to scale and to get data into the system (i.e., the data lake). While these systems did allow the ability to process larger amounts of data than ever before, they are pretty slow. Queries could take hours or even tens of hours. It was still better than not being able to ask the question at all, but it meant you had to wait a long while for the answer.
NoSQL was needed as a complement to OLTP and OLAP systems, to work around the lack of scaling. While it had great promise and solved some key problems, it did not live up to all its expectations.
The Emergence of Modern Databases
With the emergence of NewSQL systems such as MemSQL, much of the rationale for using NoSQL in production has dissipated. We have seen many of the NoSQL systems try to add back important, missing features – such as transaction support and SQL language support – but the underlying NoSQL databases are simply not architected to handle them well. NoSQL is most useful for niche use cases, such as a data lake for storing large amounts of data, or as a kind of data storage scratchpad for application data in a large web application.
The core problems still remain. How do you keep up with all the data flowing in and still make it available instantly to the people who need it? How can you reduce the cost of moving and transforming the data? How can you scale to meet the demands of all the users who want access to the data, while maintaining an interactive query response time?
These are the challenges giving rise to a new workload, operational analytics. Read our upcoming blog post to learn about the operational analytics workload, and how NewSQL systems like MemSQL allow you to handle the challenges of these modern workloads.