Scaling distributed systems is hard. Scaling a distributed database is really hard. Databases are particularly hard to scale because there are so many different aspects you have to consider. For example, which dimension is growing, and how fast, will dictate what resources you need to increase in order to handle the workload and maintain your service level agreements (SLAs).
Some problems require more CPU, some more RAM, and some more storage. Many times, you need a combination. Knowing how much you need up front is tough to determine. In addition, requirements change over time. When demand increases, you need to be able to dynamically add hardware resources, without compromising your availability. Most legacy databases only support a single-box model making it difficult to scale and maintain availability. MemSQL is a scale-out relational database with ANSI SQL support and, as such, is a much better solution for workloads where scaling up on a single machine is not viable.
Let’s look at the different dimensions.
Growth in the Amount of Data Ingested per Unit of Time
Customers often see rapid growth in the amount of data being ingested. An example is Pinterest, where the ingest increases with growth of the user base and with rising user engagement. This pattern is typical of a consumer service experiencing significant user growth. Another example is an IoT workload where the company is growing the number of devices sending back data.
Ingestion is often bottlenecked on computing cycles , so increased ingest requires adding more CPUs.
Growth in Concurrent Queries
There is a culture change happening in business that is driving companies to be more data-driven. This means empowering all of your employees to use data to make decisions across the business. Uber is a good example of a MemSQL customer that has internalized this ethos and maximizes employee access to data so they can do their jobs more effectively.
Customers often start out by having only a few people, such as dedicated analysts or data scientists, accessing data to answer questions for the business. As data sources become more mature, and other people see the results of using the data, they demand direct access. An organization’s data warehouse and other data resources are often not sized appropriately to deal with additional capacity demands that arise after the system is configured.
To handle the additional queries without long delays, or even making the data resource unstable, requires more CPU and memory.
Growth in Working Set
Growth in the working set can occur for a number of different reasons. For example, a customer could decide to change the window of time they want to look at the data from three days to seven days. Another example is that the ingest rate increased (because of one of the scenarios above) and the amount of data for the same time window is now bigger. The customer could be running a Customer 360 Dashboard, or similar dashboard, and the size of the working set grows as they acquire more customers. Another way the working set can grow is changes in the queries to accommodate more data. For example, new dimension data is added and the number of joins over large tables increases.
Growing the working set generally means needing more CPU (to do the processing) and additional memory (to hold interim and final results).
Growth in the Number of SQL Objects
Enterprise customers with large data warehouses often have a larger number of SQL objects. One big bank we worked with had a large data warehouse with hundreds of thousands of tables.
Another pattern needing a high number of SQL objects is a multi-tenant service. In this pattern, the customer offers a service (often to an enterprise) where the data from each of its customers needs to be kept separate, both for security and for namespace reasons. Using namespace boundaries allows the system to enforce security effectively and keeps users of one company from seeing another’s data.
Having this separation encoded into the system makes it easier on the application programmer. The namespace boundary allows the tables in each customer group to have the same names, making it easier to write, debug, and maintain the application code that accesses the tables. The result is that you get a set of tables per customer. So if there are a standard set of 30 tables for a customer and 1000 customers, you have 30,000 tables.
One of the most common problems in this model is that the size of customers follows a long tail distribution, where there are a few customers who each have a large amount of data, plus many small customers who each have a small amount of data. So you need a database system that can scale a single database for the large customers and also efficiently support many small databases for the other customers. Most of the PaaS database services in the market do not handle this case very well.
Growth in SQL objects that represent customers requires more storage, more CPU, and more memory to handle the increase in ingest and query from the additional customers. Growth in SQL objects for other reasons requires more storage, CPU, and memory to handle additional ingest for the additional objects and additional queries against them.
Growth in Data Storage
Growth in data storage is another example of a scaling issue. A customer wants to regularly query the last week of data. The query is run regularly, but only by a set number of users (i.e. no change in concurrency). The amount of data produced in the week is constant. However, the customer wants to retain data for two years to occasionally run longer historical queries.
This requires more storage, but does not require an increase in CPU or memory, since the amount of data required for most queries stays constant.
Handling the Growth
Most existing systems assume the customer can correctly forecast the amount of future growth and build a cluster that will accommodate it. If they forecast too little demand, the system will get bogged down, or become unstable as the demand outstrips the resources available. If they forecast too much demand, they spend too much money in over-provisioning the system.
The table below shows the different areas in which you need to scale for growth in the different kinds of demands described above.
|Ingest per unit time||✓||✓|
|# of SQL objects||✓||✓||✓|
Most legacy databases only support a single-box model. A single box limits how much you can scale. In addition, a single-box model makes it difficult to scale as an online operation, increasing downtime whenever there is unexpected growth.
MemSQL is a scale-out relational database. It runs on commodity hardware using any modern version of Linux. MemSQL scales easily by allowing an administrator to add new nodes to a cluster. Nodes can be added and data rebalanced as an online operation, with no downtime.
If you have a data intensive application and need flexibility and availability in your data operations, MemSQL should be on the list of alternatives which you consider for new requirements and improving existing operations.
For more information, view our webinar, Three Steps to Modernizing Your Data Platform.