Just like IoT, around in various forms for years, AI has been prevalent in nearly every enterprise for decades. The catch? It was hidden inside the databases and data warehouses in use under the banner of Query Optimization.
Query optimization is the process where a database or data warehouse takes the input from a user’s question, and reworks that question to deliver a quick response with as few compute resources as possible. The number of query plan choices reach far beyond what a human can calculate. An algorithm sifting through choices for the best plan makes applications faster since it receives data more quickly.
Let’s take a closer look at the basics of query optimization, how it aligns with the principles of artificial intelligence, and the benefits delivered to enterprises.
Understanding Query Optimization
When an application or analyst requests information from a database or data warehouse, the question is phrased as a query, most frequently in SQL, the Structured Query Language.
An extremely simple query might be the following:
SELECT * FROM Customers WHERE Country='Canada';
This simple question requires searching everything in the Customers table to find results where the Country equals Canada.
But SQL is a rich language that offers many permutations of how users or applications can write or assemble queries. Turning the query into a function the database can execute is the core of query optimization. And in general, there is a massive space of possible plans, which can answer a specific question asked by a user.
A more complex query might be the following Query (#17) from the well known TPCH benchmark. According to the benchmark report:
This query determines how much average yearly revenue would be lost if orders were no longer filled for small quantities of certain parts. This may reduce overhead expenses by concentrating sales on larger shipments.
Answering the following Business Question
The Small-Quantity-Order Revenue Query considers parts of a given brand and with a given container type and determines the average lineitem quantity of such parts ordered for all orders (past and pending) in the 7-year database. What would be the average yearly gross (undiscounted) loss in revenue if orders for these parts with a quantity of less than 20% of this average were no longer taken?
The query is expressed as:
SELECT Sum(l_extendedprice) / 7.0 AS avg_yearly FROM lineitem, part WHERE p_partkey = l_partkey AND p_brand = 'Brand#43' AND p_container = 'LG PACK' AND l_quantity < (SELECT 0.2 * Avg(l_quantity) FROM lineitem WHERE l_partkey = p_partkey);
Given the quantity section of the query
SELECT 0.2 * Avg(l_quantity) FROM lineitem WHERE l_partkey = p_partkey
The system would have to run this SELECT to compute the average quantity order for each part individually.
As an example, a query optimizer can rewrite this correlated subselect to the following:
SELECT Sum(l_extendedprice) / 7.0 AS avg_yearly FROM lineitem, (SELECT 0.2 * Avg(l_quantity) AS s_avg, l_partkey AS s_partkey FROM lineitem, part WHERE p_brand = 'Brand#43' AND p_container = 'LG PACK' AND p_partkey = l_partkey GROUP BY l_partkey) sub WHERE s_partkey = l_partkey AND l_quantity < s_avg
This speeds up the query by requiring fewer scans of the
lineitem table. In the original query, the computation is needed for every distinct part. But with the query optimization rewrite, we compute all of the averages at once for the different parts then use those results to complete the query. More information on the technical details of query optimization are available in this paper from the Very Large Data Bases (VLDB) Conference.
So the query optimizer will make decisions based on the query and how the data is located in the database across tables and columns. There are standard rules that apply for certain query shapes, but very few cases where the query plan chosen is always the same.
World class query optimization generally relies on three things.
Cardinality estimates include predictions such as how many rows are going to get matched from every table with this query, and once you join the tables, what is the intermediate size of the dataset. Query optimizers will use this information to estimate the costs of potential query plans.
Costing simulates the expense of the algorithm choices and picks the one it thinks is best. You need to consider and calculate alternative query plans with costing to pick the most appropriate choice. Heuristics and rules can lead towards certain directions, but as with many things, past performance is not a predictor of future results. The statistics are never perfect and the query optimization process involves mistakes and missed estimates that result from some guesses.
In general, query optimization is trying to make the best choice, but the goal is not to pick the single best plan, as much as it is to pick among the best plans so that the downside is limited. Limitations are a fact of life as the statistics will never be perfectly predictive of query results. This process closely resembles many AI workflows in use elsewhere.
In thinking about Estimates and Costing, remember that the estimates do not take anything into account about the data itself. An estimate on one database would be just as valid as an estimate on another database, whereas the Cost represents how expensive a query would be to execute on a specific dataset and potentially a specific cluster configuration.
Query optimizers need to understand the data distribution in order to figure out how the query plan will work. SQL provides a functional declaration of the desired results but it is up to the database to figure out how to orchestrate the query. Understanding data placement in the system makes this possible. The process is similar to the classic salesperson dilemma of wanting to visit several customers in one day and figuring out how to do it with the most efficiency in traveling between sites. They need a map and a way to measure distance to make an accurate assessment.
How Query Optimization Aligns with AI
Many AI workflows are based on analyzing large quantities of information with algorithms to search for patterns. Query optimization is similar with a very focused approach of calculating predictions for how the system can and should run a particular query.
With queries, it does not have to be complex for there to be a million ways to execute it. These combinations require large amounts of compute power and sophisticated algorithms to calculate the best outcomes.
Query optimization studies the data in your table and learns from the data to make predictions about your query. It then uses those predictions to identify the most efficient ways to run operations. Even when query optimization gets a prediction wrong, it will work over time by collecting more statistics to provide more context. This approach also follows the basic patterns of AI.
Benefits of Query Optimization for Enterprises
Query optimization directly benefits enterprises running data-rich applications in the following ways.
Faster results for better performance
Applications that can access query results quickly allow businesses to operate in real time. Queries that may have taken hours can complete in minutes, and queries that may have taken minutes can complete in seconds. With a powerful query processing engine, many queries can be returned in well under one second. All of this allows businesses to react faster with better decision making.
Fewer compute resource for reduced costs
Well constructed query plans, generated through optimization, consume fewer compute resources allowing companies to process more data at a lower cost.
Support more users with greater efficiency
With powerful query optimization, systems can support more users at a single time. This may include internal analysts where a large number can access a single well-curated dataset. Or it may involve keeping up with thousands or millions of end users interacting with a mobile application simultaneously.
Understanding Query Optimization and Distributed Systems
In a new world of distributed systems, query optimization becomes even more important.
Traditional databases had cost models that involved both CPU and disk I/O use. But in newer distributed systems, an additional important metric is network usage, and how much processing will take place across system nodes, instead of within a single node’s processor and disk.
There are different join algorithm choices that will affect network utilization. For example, the choice of algorithm to evaluate a join between two tables that are distributed among multiple nodes has to choose between broadcasting one or more tables to other nodes compared to joining the tables directly on the nodes. All of this has a direct impact on the networks.
The Future of Query Optimization and AI
We are likely to see continued use of AI techniques in query optimization for years to come. Today, some databases will explore or run parts of the query to see the results then make a dynamic choice on the plan. This puts more “smarts” in your query in that you don’t know the estimates so you can try running two different queries then pick one, or run pieces and decide on A or B.
Research papers explore how to use machine learning in the database to optimize data layout. TensorFlow can be used to run experiments and adapt.
If we adapt the cost model based on experiments, it is easy to envision real-time adjustments in the cost model over time. Statistics help with that over time by learning from the data.
Although a lot of work goes into cost models and statistics, every database workload is different. With the right feedback loop, and with every query acting as another experiment, the database can improve itself over time. However, one company’s database will improve itself over time for THAT company’s queries.
Long term there will continue to be millions, perhaps billions or trillions, of choices for query plans. This mandates an intelligent scheme to explore the entire space, meaning that machine learning and AI techniques will be a large part of query optimization well into the future.
A special thanks to Robert Walzer, Nick Kline, Adam Prout, and Jack Chen from MemSQL engineering for input on this blog post. If you’re interested in joining our team, check out our open positions.