Ruby PostgreSQL
Exploring the Postgres Gin index
Postgres has 4 different types of indexes, each better suited for a particular task. In this post, I will explore the Postgres Gin index and how to leverage it to quickly search text columns.
Leveraging PostgreSQL Gin index
What problem will we be solving?
Suppose we wanted to implement simple search functionality for a web app. Say, for example, we wanted to search through all users in the system. Also imagine that we have ~ 1 million users currently stored in the system.
The requirements for this search implementation state that we should be able to search via partial matches, and search via multiple columns (e.g. first_name, last_name). More concretely, if we have the following users: "Hank Lillard" and "Lilly Adams", an input query of "Lil" should return both users. We can solve this problem using a Gin index with a special option in order to achieve extremely performant searches. Let's first dive into a little bit of explanation about Gin indexes.
What is the Gin index?
From the docs:
"GIN stands for Generalized Inverted Index. GIN is designed for handling cases where the items to be indexed are composite values, and the queries to be handled by the index need to search for element values that appear within the composite items. For example, the items could be documents, and the queries could be searches for documents containing specific words."
We will be using a Gin index accross multiple columns in our table. Along with our index, we will be passing a special option called gin_trgm_ops
. We will explain more about this option and how it benefits us shortly.
Creating test data
Let's create an example table and fill it with random strings so that you can follow along at home.
CREATE TABLE users (
first_name text,
last_name text
)
Next, let's fill that table up with random data:
SELECT md5(random()::text), md5(random()::text) FROM
(SELECT * FROM generate_series(1,1000000) AS id) AS x;
This will give us a million rows of random data to search through. Now, notice we have not created our index yet. Let's try searching through this data without, and see what kind of results we get back. Make sure to type in \timing
in order to get time data back from these queries.
SELECT count(*) FROM users where first_name ilike '%aeb%';
Running this query takes about 942.97 ms on my system. Let's see what happens when we search using both first_name and last_name.
SELECT count(*) FROM users where first_name ilike '%aeb%' or last_name ilike'%aeb%';
This query takes 1641.049 ms to run on my system.
Obviously, this is completely unacceptable. We can do so much better by leveraging the amazing power of the indexes Postgres provides.
Introducing our Gin index
Let's create our Gin index, using the gin_trgm_ops option I mentioned earlier. NOTE: If you're on Ubuntu you must ensure you have the contrib packages installed. On 14.04 simply run sudo apt-get install postgresql-contrib-9.3
before running the following queries.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX users_search_idx ON users USING gin (first_name gin_trgm_ops, last_name gin_trgm_ops);
Creating this index may take a decent amount of time. Once it finishes, let's try running those same queries again and see what we get.
First query: 33.355 ms
Second query: 72.728 ms
Obviously this is a major improvement. We could actually see performing this query in request now without any issues.
What is gin_trgm_ops?
This option tells Postgres to index using trigrams over our selected columns. A trigram is a data structure that hold 3 letters of a word. Essentially, Postgres will break down each text column down into trigrams and use that in the index when we search against it.
Caveats
The only downside of this approach is that the input query must be at least 3 letters, as Postgres will need to be able to extract at least one trigram from the input query in order to use our trigram index.