The development of the relational model heralded a big step forward for the world of databases. A few years later, SQL introduced a rich vocabulary for data manipulation: filters, projections, and—most importantly—the mighty join. Joins meant that analysts could construct new reports without having to interact with those eggheads in engineering, but more importantly, the existence of complex join queries meant that theoreticians had an interesting new NP-hard problem to fawn over for the next five decades.
Ever since, the join has been the fundamental operation by which complex queries are constructed out of simpler "relations". The declarative nature of SQL means that users do not generally specify how their query is to be executed—it’s the job of a separate component of the database called the optimizer to figure that out. Since joins are so prevalent in such queries, the optimizer must take special care to handle them intelligently. As we'll see, this isn't a trivial task.
In this post, we'll look at why join ordering is so important and develop a sense of how to think of the problem space. And then, in upcoming posts, we'll begin discussing ways to implement a fast, reliable algorithm to produce good join orderings.
A Refresher on SQL and Joins
Let’s do a quick refresher in case you don’t work with SQL databases on a regular basis.
A relation or table is basically a spreadsheet. Say we have the following relations describing a simple retailer: customers, products, and orders.
The customers, or C, relation looks something like this:
The products, or P, relation looks like this:
The orders, or O, relation looks like this:
The cross product of the two relations, written P × O, is a new relation which contains every pair of rows from the two input relations. Here’s what P × O looks like:
However, for most applications, this doesn’t have much meaning, which is where joins come into play.
A join is when we have a filter (or predicate) applied to the cross product of two relations. If we filter the above table to the rows where product_id = order_product_id, we say we’re "joining P and O on product_id = order_product_id". The result looks like this:
Here we can see all of the orders that contained a given product.
We can then remove some of the columns from the output (this is called projection):
This ends up with a relation describing the products various users ordered. Through pretty basic operations, we built up some non-trivial meaning. This is why joins are such a major part of most query languages (primarily SQL): they’re very conceptually simple (a predicate applied to the cross product) but can express fairly complex operations.
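To make the "predicate applied to the cross product" framing concrete, here’s a tiny Python sketch of the P-and-O join followed by a projection. The sample rows and the extra column names (product_name, order_id) are invented for illustration, since the example tables above aren’t reproduced in text:

```python
from itertools import product as cross_product

# Hypothetical sample rows; only product_id, order_product_id and
# order_customer_id come from the post, the rest are invented.
products = [
    {"product_id": 1, "product_name": "wrench"},
    {"product_id": 2, "product_name": "hammer"},
]
orders = [
    {"order_id": 10, "order_product_id": 1, "order_customer_id": 100},
    {"order_id": 11, "order_product_id": 2, "order_customer_id": 101},
]

# Cross product: every pair of rows from P and O.
cross = [{**p, **o} for p, o in cross_product(products, orders)]

# Join: filter the cross product on the predicate.
joined = [row for row in cross if row["product_id"] == row["order_product_id"]]

# Projection: keep only the columns we care about.
result = [{"order_customer_id": r["order_customer_id"],
           "product_name": r["product_name"]} for r in joined]
print(result)
```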
You might have observed that even though the size of the cross product was quite large (|P| × |O|), the final output was pretty small. Databases will exploit this fact to perform joins much more efficiently than by producing the entire cross product and then filtering it. This is part of why it’s often useful to think of a join as a single unit, rather than two composed operations.
Some Vocabulary
To make things easier to write, we’re going to introduce a little bit of notation.
We already saw that the cross product of A and B is written A × B. Filtering a relation R on a predicate p is written σ_p(R). That is, σ_p(R) is the relation with every row of R for which p is true, for example, the rows where product_id = order_product_id. Thus a join of A and B on p could be written σ_p(A × B). Since we often like to think of joins as single cohesive units, we can also write this as A ⋈_p B.
The columns in a relation don’t need to have any particular order (we only care about their names), so we can take the cross product in either order: A × B = B × A, and further, A ⋈_p B = B ⋈_p A. You might know this as the commutative property. Joins are commutative.
We can "pull up" a filter through a cross product: σ_p(A) × B = σ_p(A × B) (when p only references columns of A). It doesn’t matter if we do the filtering before or after the product is taken. Because of this, it sometimes makes sense to think of a sequence of joins as a sequence of cross products which we filter at the very end:
(A ⋈_p B) ⋈_q C = σ_q(σ_p(A × B) × C) = σ_{p∧q}(A × B × C)
Something that becomes clear when written in this form is that we can join A with B and then join the result of that with C, or we can join B with C and then join the result of that with A. The order in which we apply those joins doesn’t matter, as long as all the necessary filtering happens at some point. You might recognize this as the associative property. Joins are associative (with the asterisk that we need to pull up predicates where appropriate).
Optimizing Joins
So we can perform our joins in any order we please. This raises a question: is one order preferable to another? Yes. It turns out that the order in which we perform our joins can result in dramatically different amounts of work required.
Consider a fairly natural query on the above relations, where we want to get a list of all customers’ names along with the location of each product they’ve ordered. In SQL we could write such a query like this:
SELECT customer_name, product_location
FROM orders
JOIN customers ON customer_id = order_customer_id
JOIN products ON product_id = order_product_id
We have two predicates:
customer_id = order_customer_id
product_id = order_product_id
Say we first join products and customers. Since neither of the two predicates above relates products with customers, we have no choice but to form the entire cross product between them. This cross product might be very large (the number of customers times the number of products) and we have to compute the entire thing.
What if we instead first compute the join between orders and customers? The sub-join of orders joined with customers only has an entry for every order placed by a customer - probably much smaller than every pair of customer and product. Since we have a predicate between these two, we can compute the much smaller result of joining and filtering them directly (there are many algorithms to do this efficiently, the three most common being the hash join, the merge join, and the nested-loop/lookup join).
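As a very rough sketch of one of those algorithms, here’s a toy hash join in Python. It avoids materializing the cross product by indexing one side on the join key; this is only meant to convey the idea, not how any particular database implements it:

```python
from collections import defaultdict

def hash_join(left, right, left_key, right_key):
    """Join two lists of dict-rows on an equality predicate without ever
    forming the full cross product."""
    # Build phase: index the (ideally smaller) left input by its join key.
    index = defaultdict(list)
    for row in left:
        index[row[left_key]].append(row)
    # Probe phase: for each right row, emit only the matching pairs.
    out = []
    for row in right:
        for match in index.get(row[right_key], []):
            out.append({**match, **row})
    return out

# e.g. hash_join(customers, orders, "customer_id", "order_customer_id")
```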
Visualizing the Problem
To better understand the structure of a join query, we can look at its query graph. The query graph of a query has a vertex for each relation being joined and an edge between any two relations for which there is a predicate.
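For the customers/orders/products query above, the query graph is a chain: orders sits in the middle, connected to customers and to products by the two predicates. A minimal sketch of how such a graph might be represented (real optimizers use richer structures):

```python
# Vertices are the relations being joined; edges are the join predicates.
# customers and products are not directly connected, since no predicate
# relates them to each other.
query_graph = {
    "vertices": {"customers", "orders", "products"},
    "edges": {
        frozenset({"customers", "orders"}): "customer_id = order_customer_id",
        frozenset({"orders", "products"}): "product_id = order_product_id",
    },
}
```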
Since a predicate filters the result of the cross product, predicates can be given a numeric value that describes how much they filter said result. This value is called their selectivity. The selectivity of a predicate p on A and B is defined as:
sel(p) = |A ⋈_p B| / |A × B| = |A ⋈_p B| / (|A| ⋅ |B|)
In practice, we tend to think about this the other way around; we assume that we can estimate the selectivity of a predicate and use that to estimate the size of a join:
|A ⋈_p B| = sel(p) ⋅ |A| ⋅ |B|
So a predicate which filters out half of the rows has selectivity 0.5, and a predicate which only allows one row out of every hundred has selectivity 0.01. Since predicates which are more selective reduce the cardinality of their output more aggressively, a decent general principle is that we want to perform joins over very selective predicates first. It’s often assumed for convenience that all the selectivities are independent, that is,
|A ⋈_p B ⋈_q C| = sel(p) ⋅ sel(q) ⋅ |A × B × C|
which, while indeed convenient, is rarely an accurate assumption in practice. Check out "How Good Are Query Optimizers, Really?" by Leis et al. for a detailed discussion of the problems with this assumption.
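To see how an optimizer might use these estimates, here’s a small sketch comparing the two orderings of the example query under the independence assumption. The row counts and selectivities are invented purely to illustrate the arithmetic:

```python
# Hypothetical statistics: |customers|, |products|, |orders| and the
# selectivities of the two predicates (all invented for this example).
C, P, O = 10_000, 1_000, 1_000_000
sel_cust_orders = 1 / C    # customer_id = order_customer_id
sel_prod_orders = 1 / P    # product_id = order_product_id

def est_join(left_card, right_card, selectivity=1.0):
    """Estimated join cardinality: sel(p) * |A| * |B|."""
    return selectivity * left_card * right_card

# Plan 1: (customers x products) first -- no predicate, so selectivity is 1.
step1 = est_join(C, P)                                         # 10,000,000 rows
final1 = est_join(step1, O, sel_cust_orders * sel_prod_orders)

# Plan 2: (customers join orders) first, then join products.
step2 = est_join(C, O, sel_cust_orders)                        # 1,000,000 rows
final2 = est_join(step2, P, sel_prod_orders)

# Both plans produce the same 1,000,000-row answer, but plan 1 materializes
# an intermediate result ten times larger along the way.
print(step1, final1, step2, final2)
```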
It turns out that the shape of a query graph plays a large part in how difficult it is to optimize a query. There are a handful of canonical archetypes of query graph "shapes", all with different optimization characteristics.
A “chain” query graph
A “star” query graph
A “cycle” query graph
Note that these shapes aren’t necessarily representative of many real queries, but they represent extremes which exhibit interesting behaviour and which permit interesting analysis.
Query Plans
To visualize a particular join ordering, we can look at its query plan diagram. Since most join execution algorithms only perform joins on one pair of relations at a time, these are generally binary trees. The query plan we ended up with for the above query has a diagram that looks something like this:
There are also two main canonical query plan shapes, the less general "left-deep plan":
Where every relation is joined in sequence.
The more general form is the "bushy plan":
In a left-deep plan, one of the two relations being joined must always be a concrete table, rather than the output of a join. In a bushy plan, such composite inners are permitted.
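Since a left-deep plan is fully determined by the sequence in which the base tables enter the join, enumerating left-deep plans is just enumerating permutations. A quick sketch, using the relations from the example query:

```python
from itertools import permutations

def left_deep_plans(relations):
    """Yield every left-deep plan as nested (left, right) pairs, where the
    right child is always a base table, never the output of a join."""
    for order in permutations(relations):
        plan = order[0]
        for rel in order[1:]:
            plan = (plan, rel)
        yield plan

for plan in left_deep_plans(["customers", "orders", "products"]):
    print(plan)
# (('customers', 'orders'), 'products') is the plan we arrived at above.
```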
Is this actually "hard"?
In the examples we’ve seen, there were only a handful of options, but as the number of tables being joined grows, the number of potential query plans grows extremely fast—and in fact, finding the optimal order in which to join a set of tables is NP-hard. This means that when faced with large join ordering problems, databases are generally forced to resort to a collection of heuristics to attempt to find a good execution plan (unless they want to spend more time optimizing than executing!).
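As a back-of-the-envelope check on "extremely fast": there are n! left-deep plans over n relations, and n! · C(n−1) bushy trees (C being the Catalan numbers), cross products included. A short computation:

```python
from math import comb, factorial

def catalan(n):
    return comb(2 * n, n) // (n + 1)

def bushy_plan_count(n):
    # n! orderings of the relations times Catalan(n - 1) binary tree shapes.
    return factorial(n) * catalan(n - 1)

for n in (3, 5, 10):
    print(n, factorial(n), bushy_plan_count(n))
# 3 relations: 6 left-deep plans, 12 bushy; 10 relations: ~3.6 million
# left-deep plans and roughly 1.8e10 bushy trees.
```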
I think it’s important to first answer the question of why we need to do this at all. Even if some join orderings are orders of magnitude better than others, why can’t we just find a good order once and then use that in the future? Why does a piece of software like a database that’s concerned with going fast need to solve an NP-hard problem every time it receives a query? It’s a fair question, and there’s probably interesting research to be done in sharing optimization work across queries.
The main answer, though, is that you’re going to want different join strategies for a query involving Justin Bieber’s twitter followers versus mine. The scale of the various relations being joined will vary dramatically depending on the query parameters, and the fact is that we just don’t know the problem we’re solving until we receive the query from the user. At that point the query optimizer will need to consult its statistics to make informed guesses about which join strategies will be good. Since those statistics will be very different for a query over Bieber’s followers, the decisions the optimizer ends up making will be different and we probably won’t be able to reuse a result from before.
Once you accept that you have to solve the problem, how do you do it? A common characteristic of NP-hard problems is that they’re strikingly non-local. Any type of local reasoning or optimization you attempt to apply to them will generally break down and doom you to look at the entire problem holistically. In the case of join ordering, what this means is that in most cases it’s difficult or impossible to make conclusive statements about how any given pair of relations should be joined - the answer can differ drastically depending on all the tables you don’t happen to be thinking about at this moment.
In our example with customers, orders, and products, it might look like our first plan was bad only because we first performed a join for which we had no predicate (such intermediate joins are just referred to as cross products), but in fact, there are queries for which the ordering that gives the smallest overall cost involves a cross product (exercise for the reader: find one).
Despite the fact that optimal plans can contain cross products, it’s very common for query optimizers to assume their inclusion won’t improve the quality of query plans that much, since disallowing them makes the space of query plans much smaller and can make finding decent plans much quicker. This assumption is sometimes called the connectivity heuristic (because it only considers joining relations which are connected in the query graph).
This post has mostly been about the vocabulary with which to speak and think about the problem of ordering joins, and hasn’t really touched on any concrete algorithms with which to find good query plans.
Join ordering is, generally, quite resistant to simplification. In the general case—and in fact, almost every case in practice—the problem of finding the optimal order in which to perform a join query is NP-hard. However, if we sufficiently restrict the set of queries we look at, and restrict ourselves to certain resulting query plans, there are some useful situations in which we can find an optimal solution. Those details, though, will come in a follow-up post.
Thanks to Andy Kimball for his technical review of this post.
If you like this post, you can go even deeper on join ordering with Join Ordering Part II: The SQL.