

Working with Email Addresses in PostgreSQL

Josh Branchaud

Most of the data we work with day to day is not case-insensitive. For the data that is though, we need to check our assumptions. Is our database enforcing the constraints we think? Do we understand where our indexes are working for us and, more importantly, where they are not?

The more verbose title for this post would read, "Working With Email Addresses And Other Case-Insensitive Data In PostgreSQL". This is to say that while the rest of this post will focus on the email address example, the concepts will generally apply to anything you consider to be case-insensitive.

When working with email addresses from a database perspective, there are a few things I'd like to keep in mind.

First, an email address is generally used to identify a user. Because of this, I'd like the database to ensure the uniqueness of stored email addresses.

Second, email addresses should be treated as case-insensitive: person1@example.com and PERSON1@EXAMPLE.COM should be handled as the same email address. This is important for scenarios like a user logging in, where I cannot assume how they will capitalize their email address.

Third, the email column in my database will be accessed frequently. For instance, every time someone signs in, I find the user record by email address and then verify the given password. The database should be able to do lookups on the email column efficiently.

In brief, I'd like the database to provide efficient lookups of a unique email column.


With that in mind, let's do some initial setup.

create table users (
  id serial primary key,
  email text not null unique,
  password_digest text not null
);

insert into users (email, password_digest)
select
  'person' || num || '@example.com',
  md5('password' || num)
from generate_series(1,10000) as num;

This abbreviated table is representative of what I usually see for users tables storing email addresses.[1]

Here is a look at the table's description:

\d users
                              Table "public.users"
     Column      |  Type   |                     Modifiers
-----------------+---------+----------------------------------------------------
 id              | integer | not null default nextval('users_id_seq'::regclass)
 email           | text    | not null
 password_digest | text    | not null
Indexes:
    "users_pkey" PRIMARY KEY, btree (id)
    "users_email_key" UNIQUE CONSTRAINT, btree (email)

The email column has a unique constraint, which should cover our unique email addresses requirement, backed by a btree index that will help with efficient lookups. That should cover it.

The Catch

This post isn't quite over though because, unfortunately, it isn't that easy.

First, we aren't really preventing duplicates. The first record in the table is person1@example.com. I can easily insert another record with a duplicate email address:

insert into users (email, password_digest)
values ('PERSON1@EXAMPLE.COM', md5('password'));

And it's inserted. All I had to do was change the casing.

The unique constraint is only constraining the email column to a case-sensitive uniqueness. For email addresses, person1@example.com and PERSON1@EXAMPLE.COM should be treated as identical.

These kinds of issues are especially likely if we are working with an ORM, such as the one provided by ActiveRecord. Though ORMs provide lots of convenience, they can hide important details of the database and ultimately lead us to believe our database schema is tighter than it actually is.[2]

We'll have to update our constraint, but first let's explore the lookup efficiency.

We don't care if the user enters person1@example.com or PERSON1@EXAMPLE.COM when signing in; we want our application to handle it the same. This is key to providing consistent lookup of user records. We need a way to look at email addresses in a case-insensitive way.

Let's use PostgreSQL's lower() function for that.

select * from users where lower(email) = lower('PERSON5000@EXAMPLE.COM');
--   id  |         email          |         password_digest
-- ------+------------------------+----------------------------------
--  5000 | person5000@example.com | 81aa45be581a3b21e6ff4da69b8c5a15

Despite the casing not matching, we are able to find the record. If we look at the explain analyze output, though, we'll see a problem.

explain analyze select * from users where lower(email) = lower('PERSON5000@EXAMPLE.COM');
--                                              QUERY PLAN
-- ----------------------------------------------------------------------------------------------------
--  Seq Scan on users  (cost=0.00..264.00 rows=50 width=59) (actual time=5.784..11.176 rows=1 loops=1)
--    Filter: (lower(email) = 'person5000@example.com'::text)
--    Rows Removed by Filter: 10000
--  Planning time: 0.108 ms
--  Execution time: 11.243 ms

Postgres ends up doing a sequential scan of the users table. In other words, it is not using the users_email_key index. In order to provide a consistent lookup via the lower() function, we have lost the speed benefits of our index.

We've gone from feeling pretty good about our index to realizing that it both allows duplicates and cannot provide both consistent and efficient lookups on the email column.

We can remedy this, though.

A Better Index

We plan to almost exclusively do user email address lookups in conjunction with the lower() function. So, what we need is a functional index; one that indexes the email column with lower(). If we also make this a unique index, then we will have constrained the email column to email addresses that are unique after the lower() function has been applied to them.[3]

This will solve both of our problems. Let's add it.

We still have that duplicate record (PERSON1@EXAMPLE.COM), so the first thing we want to do is clean that up (and anything else that would violate the upcoming index).

delete from users where email = 'PERSON1@EXAMPLE.COM';

We can then use the create index command to add this better index:

create unique index users_unique_lower_email_idx on users (lower(email));

Because the index uses the lower() function, we call it a functional index.
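The same expression-index idea can be tried in-process. Here is a sketch using Python's built-in sqlite3 (SQLite also supports indexes on expressions); the table and index names mirror the article's, but this is an illustration, not the Postgres setup itself:

```python
import sqlite3

# In-memory database; schema mirrors the article's users table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    create table users (
      id integer primary key,
      email text not null unique,
      password_digest text not null
    )
""")

# The functional (expression) index from the article.
conn.execute(
    "create unique index users_unique_lower_email_idx on users (lower(email))"
)

conn.execute(
    "insert into users (email, password_digest) values (?, ?)",
    ("person1@example.com", "digest"),
)

# A case-variant duplicate now violates the expression index.
try:
    conn.execute(
        "insert into users (email, password_digest) values (?, ?)",
        ("PERSON1@EXAMPLE.COM", "digest"),
    )
    print("duplicate accepted")
except sqlite3.IntegrityError:
    print("duplicate rejected")  # prints "duplicate rejected"
```

The plain unique constraint on email alone would have happily accepted the second row; only the index on lower(email) catches it.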

We can see the new index by taking another look at the users table's description:

\d users
                              Table "public.users"
     Column      |  Type   |                     Modifiers
-----------------+---------+----------------------------------------------------
 id              | integer | not null default nextval('users_id_seq'::regclass)
 email           | text    | not null
 password_digest | text    | not null
Indexes:
    "users_pkey" PRIMARY KEY, btree (id)
    "users_email_key" UNIQUE CONSTRAINT, btree (email)
    "users_unique_lower_email_idx" UNIQUE, btree (lower(email))

If we try to insert a duplicate record like we did earlier, Postgres will throw up a red flag.

insert into users (email, password_digest) values ('PERSON1@EXAMPLE.COM', md5('password'));
-- ERROR:  duplicate key value violates unique constraint "users_unique_lower_email_idx"
-- DETAIL:  Key (lower(email))=(person1@example.com) already exists.


We can also use explain analyze again to get some insight into the performance with our new index:

explain analyze select * from users where lower(email) = lower('PERSON5000@EXAMPLE.COM');
--                                                              QUERY PLAN
-- -------------------------------------------------------------------------------------------------------------------------------------
--  Index Scan using users_unique_lower_email_idx on users  (cost=0.29..8.30 rows=1 width=59) (actual time=0.051..0.052 rows=1 loops=1)
--    Index Cond: (lower(email) = 'person5000@example.com'::text)
--  Planning time: 0.134 ms
--  Execution time: 0.082 ms

The part to focus on is the first line under QUERY PLAN. Earlier this same query necessitated a full sequential scan. Now, Postgres is able to do an Index Scan using users_unique_lower_email_idx. This gives us performant, consistent lookups.[4]
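The before-and-after on the query plan can also be reproduced in miniature with SQLite's EXPLAIN QUERY PLAN; a sketch (again using sqlite3 as a stand-in, not Postgres):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table users (
      id integer primary key,
      email text not null unique,
      password_digest text not null
    )
""")
conn.executemany(
    "insert into users (email, password_digest) values (?, ?)",
    (("person%d@example.com" % n, "digest") for n in range(1, 10001)),
)

query = "select * from users where lower(email) = lower('PERSON5000@EXAMPLE.COM')"

# Before the expression index exists, the only option is a full scan;
# the unique index on the raw email column cannot serve lower(email).
before = conn.execute("explain query plan " + query).fetchall()
print(before[-1][-1])  # a SCAN of the users table

conn.execute(
    "create unique index users_unique_lower_email_idx on users (lower(email))"
)

# Afterwards the planner can search the expression index instead.
after = conn.execute("explain query plan " + query).fetchall()
print(after[-1][-1])  # a SEARCH using users_unique_lower_email_idx
```

Each plan row's last field is a human-readable description, which is enough to see the scan turn into an index search.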

Another Approach

The citext module gives us another approach to this issue of handling case-insensitive data.

The citext module provides a case-insensitive character string type, citext. Essentially, it internally calls lower() when comparing values. Otherwise, it behaves almost exactly like text.

By declaring our email column with the citext type instead of text or varchar, we get the same benefits as the previous section without the additional index.

Taking the citext approach, we would create a table like the following:

create extension citext;

create table users (
  id serial primary key,
  email citext not null unique,
  password_digest varchar not null
);

\d users
                                   Table "public.users"
     Column      |       Type        |                     Modifiers
-----------------+-------------------+----------------------------------------------------
 id              | integer           | not null default nextval('users_id_seq'::regclass)
 email           | citext            | not null
 password_digest | character varying | not null
Indexes:
    "users_pkey" PRIMARY KEY, btree (id)
    "users_email_key" UNIQUE CONSTRAINT, btree (email)

There are a few benefits to using citext. We have fewer hoops to jump through because we no longer need to set up an additional index, and we don't have to include lower() on both sides of every comparison. We also get some performance gain on writes by reducing the number of indexes we have to update by one.

The main benefit for me is being able to write a case-insensitive lookup like we would any other statement:

select * from users where email = 'PERSON5000@EXAMPLE.COM';
  id  |         email          |         password_digest
------+------------------------+----------------------------------
 5000 | person5000@example.com | 81aa45be581a3b21e6ff4da69b8c5a15

And if you check the explain analyze output, you'll see that it is able to perform an index scan.[5]
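citext itself is Postgres-specific, but the behavior it gives you can be approximated elsewhere. As a rough sketch, SQLite's built-in NOCASE collation (which only folds ASCII, so it's an approximation) makes both comparisons and the unique constraint case-insensitive, much like citext does:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# 'collate nocase' makes every comparison on this column
# case-insensitive, including the unique constraint.
conn.execute("""
    create table users (
      id integer primary key,
      email text collate nocase not null unique,
      password_digest text not null
    )
""")
conn.execute(
    "insert into users (email, password_digest) values (?, ?)",
    ("person5000@example.com", "digest"),
)

# No lower() needed on either side of the comparison.
row = conn.execute(
    "select id, email from users where email = 'PERSON5000@EXAMPLE.COM'"
).fetchone()
print(row)  # → (1, 'person5000@example.com')
```

As with citext, a case-variant duplicate insert on this column would fail the unique constraint outright.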

In Summary

When it comes to case-insensitive data like email addresses, you now understand how to enforce uniqueness and get efficient lookups. Perhaps you are even considering moving from text to citext. I hope you are also a bit more wary of what your application framework's ORM is doing for you as well as what database-level details it is hiding from you.

  1. I assume some level of ubiquitousness because this is the column definition and index provided by the Devise gem. 

  2. You may have application-level uniqueness validations (a la Devise), but I'd argue that's not good enough. Your database should have the final say when it comes to preventing invalid and inconsistent data. This is an especially important point in a tech landscape that is often embracing microservices and multiple application clients (i.e. web, iOS, android). 

  3. Why don't we just lowercase all email addresses before inserting them or querying for them? Well, that is yet another thing we have to remember to do at every access point to our database. It's easier to let an index do it. Less cognitive overhead as well. 

  4. I am going to leave our old index (users_email_key) on the table. Though I expect the lower(email)-style query to be most common, I still assume there will be the occasional query on email by itself. I'd still like that query to benefit from an index. I can tolerate the very minor overhead of maintaining two indexes on a read-heavy table. 

  5. There are some limitations to the citext module that you may never run up against, but they are worth knowing about. They are detailed in the Limitations section of the docs for citext. 
