Ruby PostgreSQL
Using Recursive SQL with ActiveRecord trees
tl;dr When you have an ActiveRecord tree structure, using the WITH syntax for recursive SQL can provide large performance boons, especially when a tree get several levels deep.
In a previous post, I outlined a Cat Picture store application. As our store grows, more and more categories have to be created, and we end up with a tree of categories. How can we create a homepage that includes all cat pictures for a given category and all of its subcategories?
Pictorally, the category tree might look like this:
Cat Pictures
|-- Funny
| |-- LOLCats
| `-- Animated
`-- Classic
`-- Renaissance
On the landing page for the Cat Pictures category, we want to display all cat pictures for any category below Cat Pictures. Navigating to the Funny category would display all of its pictures, as well as the pictures for LOLCats and Animated. This is the kind of interaction seen on Amazon, for example. The store's categories become like an ad-hoc filtering system.
Here's what the Category class looks like:
class Category < ActiveRecord::Base
attr_accessible :name, :parent
has_many :cat_pictures
belongs_to :parent, :class_name => "Category"
has_many :children, :class_name => "Category", :foreign_key => 'parent_id'
scope :top_level, where(:parent_id => nil)
def descendents
# implement me!
end
end
Each category has a parent_id
column that points at its parent
category. In database speak, modeling a tree like this is known as
an Adjacency List; each node of the tree can only see a children
immediately adjacent to it. For this reason, crawling an Adjacency List
requires recursion. This is actually the database setup common for use
with the acts_as_tree
plugin. Let's see how we can implement the
descendents
method to get all descendent categories.
A Simple Ruby Approach
As you've probably already guessed, we need to recursively collect children for each of our category's children.
class Category < ActiveRecord::Base
# ...
def descendents
children.map do |child|
[child] + child.descendents
end.flatten
end
end
This does the job quite nicely. However, our requirements above state that we want all cat pictures for each descendent category, and our categories. Right now, we've omitted the root category, self. Let's add a new method to include it into the equation:
class Category < ActiveRecord::Base
# ...
def descendents
children.map do |child|
[child] + child.descendents
end.flatten
end
def self_and_descendents
[self] + descendents
end
end
Good deal. Now gathering all cat pictures is just a matter of collecting them for each category:
class Category < ActiveRecord::Base
# ...
def descendent_pictures
self_and_descendents.map(&:cat_pictures).flatten
end
end
For a tree like we have above, this is probably good enough. Our tree is only 3 levels deep. We've introduced plenty of N+1 queries, but given our small dataset, that shouldn't be a huge concern.
That said, as our store grows, and the tree gets deeper and more
detailed, this kind of implementation could become a bottleneck. Also,
because we're doing Array operations on the children
collection,
we lose the ability to take advantage of ActiveRelation outside of the
descendents
method itself. Among other things, this means that we
can't eager-load cat pictures unless we always eager-load them within
the descendents
method.
Surely we can do better.
SQL WITH queries
Since we're using PostgreSQL, we can take advantage of its special features. In this case, we can use a WITH query. From the [PostgreSQL documentation][with queries]:
WITH provides a way to write auxiliary statements for use in a larger query. These statements, which are often referred to as Common Table Expressions or CTEs, can be thought of as defining temporary tables that exist just for one query.
On its own, this might not seem like a big deal, but when combined with the optional RECURSIVE modifier, WITH queries can become quite powerful:
The optional RECURSIVE modifier changes WITH from a mere syntactic convenience into a feature that accomplishes things not otherwise possible in standard SQL. Using RECURSIVE, a WITH query can refer to its own output. A very simple example is this query to sum the integers from 1 through 100:
WITH RECURSIVE t(n) AS ( VALUES (1) UNION ALL SELECT n+1 FROM t WHERE n < 100 ) SELECT sum(n) FROM t;
The general form of a recursive WITH query is always a non-recursive term, then UNION (or UNION ALL), then a recursive term, where only the recursive term can contain a reference to the query's own output.
In other words, the expression contained in the AS statement has two parts. The first part is executed just once. The second part, after the UNION ALL, is executed until it returns an empty result set.
Taking advantage of WITH RECURSIVE, we can reduce our tree crawling technique from n queries to just 1! Let's how we can use this to crawl our category tree.
As a reminder, here's what our categories table looks like:
# SELECT id, name, parent_id FROM categories;
id | name | parent_id
----+--------------+-----------
1 | Cat Pictures |
2 | Funny | 1
3 | LOLCats | 2
4 | Animated | 2
5 | Classic | 1
6 | Renaissance | 5
And this is the query:
WITH RECURSIVE category_tree(id, name, path) AS (
SELECT id, name, ARRAY[id]
FROM categories
WHERE parent_id IS NULL
UNION ALL
SELECT categories.id, categories.name, path || categories.id
FROM category_tree
JOIN categories ON categories.parent_id=category_tree.id
WHERE NOT categories.id = ANY(path)
)
SELECT * FROM category_tree ORDER BY path;
Running the query above returns the following:
id | name | path
----+--------------+---------
1 | Cat Pictures | {1}
2 | Funny | {1,2}
3 | LOLCats | {1,2,3}
4 | Animated | {1,2,4}
5 | Classic | {1,5}
6 | Renaissance | {1,5,6}
Whoa! That's a lot of SQL. Let's break it down a bit.
Declare the Table Expression
First, we declare our "temporary table" using the WITH syntax. We're
going to call it category_tree
. This "table" has 3 "columns": id
,
name
, and path
. The id
and name
columns are fairly obvious; they
refer to corresponding columns on the categories table. The path
is an
array of ids that each row will have. More on this in a bit.
Define the Non-recursive Term
The non-recursive term is next:
SELECT id, name, ARRAY[id]
FROM categories
WHERE parent_id IS NULL
It grabs the id
and name
for each top-level category, that is, each
category that has no parent. It also initializes an array containing just
its id
. On its own, this isn't very interesting, but this array will
become helpful during the recursive step of the query.
Define the Recursive Term
The recursive term is the juiciest bit of the query:
SELECT categories.id, categories.name, path || categories.id
FROM category_tree
JOIN categories ON categories.parent_id=category_tree.id
WHERE NOT categories.id = ANY(path)
Notice that we're selecting from category_tree
. By doing this, we're
able to use each result set in the subsequent iteration. The first time
we recurse, the result set will be what we selected in the non-recursive
term above.
Given that we have a root result set, we join with categories
to find
its children. From our new result set, we select id
and name
, as
before. But this time, we concatenate the child id onto the path
array
using SQL's ||
operator. Having this materialized path allows us to
guard against infinite loops; the WHERE clause makes sure that the row
we're selecting has not appeared in the path before.
This infinite loop check is important. If two categories pointed at each other as parents, the query would never return. Including this check prevents such a mistake from killing our server.
Query the Common Table Expression
Finally, a WITH query is only useful if you select from it outside of its declaration, so we'll do just that:
SELECT * FROM category_tree ORDER BY path;
In addition to the infinite loop guard, the path column answers the question "How did I get here?" Like a directory structure on a file system, the path demonstrates the ids necessary to get from grandparent to parent to child, etc.
You may have noticed that we're also ordering by the path column. We do this because the default sort from a recursive query is nondeterministic. Normal array sorting works well for us here, and groups the categories just like we'd expect, with parents listed before their children.
Using WITH queries in ActiveRecord
class Category < ActiveRecord::Base
# ...
def descendents
self_and_descendents - [self]
end
def self_and_descendents
self.class.tree_for(self)
end
def descendent_pictures
subtree = self.class.tree_sql_for(self)
CatPicture.where("category_id IN (#{subtree})")
end
def self.tree_for(instance)
where("#{table_name}.id IN (#{tree_sql_for(instance)})").order("#{table_name}.id")
end
def self.tree_sql_for(instance)
tree_sql = <<-SQL
WITH RECURSIVE search_tree(id, path) AS (
SELECT id, ARRAY[id]
FROM #{table_name}
WHERE id = #{instance.id}
UNION ALL
SELECT #{table_name}.id, path || #{table_name}.id
FROM search_tree
JOIN #{table_name} ON #{table_name}.parent_id = search_tree.id
WHERE NOT #{table_name}.id = ANY(path)
)
SELECT id FROM search_tree ORDER BY path
SQL
end
end
You should notice right away where our recursive query is. The
tree_sql_for
class method returns a SQL string that can be used with
other queries. Compared to the WITH query we looked at before, there a
few differences worth mentioning.
First, and probably most importantly for our original problem, we've changed our starting place. The non-recursive term is our "start here" result set. Rather than starting with all top-level categories, we're using the id of whichever instance is passed in to scope our tree.
Another change we've made is to remove the name
column from the query.
It isn't necessary for what we're doing, but made the example easier to
demonstrate. We're also interpolating the table name. This makes the
method much more reusable. In fact, we could extract the method to a
RecursiveTree
module to tidy up our class.
One big advantage of the SQL approach here is that we can create scopes
to further filter our results within just one database round-trip.
For example, the tree_for
class method is really just a named scope
that takes a category instance as a parameter.
Likewise, the the descendent_pictures
method returns a CatPicture
relation that includes all pictures from this category and all
subcategories. In other words, what used to take 2 database round trips
for each category in the tree (one to grab children, one to get its
pictures) will now only take 1 for the entire set.
Conclusion
Taking advantage of PostgreSQL's advanced features can provide large performance boons, especially when a tree get several levels deep.
Although using database recursion is an efficient way of improving performance with our existing schema, other methods of handling tree structures in SQL exist. The [SQL Antipatterns][] book has a great breakdown of other tree solutions that would require schema changes.
Example app
As before, while writing this post, I created a [sample Rails app][] to iterate quickly. I used TDD to write the pure-ruby approach, and reused the specs while I "refactored" the implementation to the subsequent approaches. Of particular note is the history of the Category model, which mirrors the code above.
previous post: http://blog.hashrocket.com/posts/sql-window-functions "SQL Window Functions and You" [sample Rails app]: https://github.com/jgdavey/tree-sql-example "Example application with recursive tree SQL" [with queries]: http://www.postgresql.org/docs/9.1/static/queries-with.html "WITH queries" [sql antipatterns]: http://pragprog.com/book/bksqla/sql-antipatterns "SQL Antipatterns book"