Hashrocket.com / blog

Large elepahnt upsert

Upsert Records with PostgreSQL 9.5

posted on and written by in

Image 100x100 josh branchaud

With the release of PostgreSQL 9.5, we now have a better way to upsert data.

No more making multiple trips to the database.

No more shoehorning writeable common table expressions.

No more defining custom merge functions.

We finally have the upsert feature we've been waiting for.

For those not familiar with the term upsert, it is also sometimes referred to as a merge. The idea is this. Some data needs to be put in the database. If the data has an existing associated record, that record should be updated, otherwise the data should be freshly inserted as a new record. Depending on the state of the database, a record is either updated or inserted, hence the name upsert. This type of functionality has been available in other relational databases (e.g. MySQL) for a while, so it's exciting to see it makes its way into PostgreSQL.

Traditionally, to achieve this kind of update or insert as necessary functionality, our application code would have to make two trips to the database. The first is to check if there is already an existing record that can be updated. The second is either an update or an insert depending on the result of the first query.

There are also some more clever approaches to upserting that only take a single trip to the database. Since the release of PostgreSQL 9.1, we can take advantage of Writeable Common Table Expressions to upsert records. It's also possible to use PL/pgSQL to create a custom upsert function. And now, we can do an explicit upsert using the on conflict clause of the insert statement.

Before we go any further, let's give ourselves an example database for illustrative purposes.

Pinned tweets

Let's use Twitter as an example and model the three tables we need to represent pinned tweets. Each Twitter user can have many tweets. Of their tweets, a user may choose to have one of these tweets designated as a pinned tweet. They can change their pinned tweet at any time and can even go back to having no pinned tweet at all.

For this, we, of course, need users and tweets. We'll also have a pinned_tweets table that tracks the pinned tweet for each user along with the time at which they pinned that tweet. This may be a bit contrived, but bear with me. Also, feel free to follow along in your own psql session. All the code you need is posted as we go.

So, the tables...

create table users (
  handle varchar primary key,
  name varchar not null,
  bio varchar not null default '',
  created_at timestamptz not null default now()
);

create table tweets (
  id serial primary key,
  content varchar not null,
  user_handle varchar references users (handle),
  created_at timestamptz not null default now(),
  unique (id, user_handle)
);

create table pinned_tweets (
  tweet_id integer not null,
  user_handle varchar not null,
  pinned_at timestamptz not null default now(),
  primary key (user_handle),
  foreign key (tweet_id, user_handle) references tweets (id, user_handle)
);

We'll now need some users.

insert into users (handle, name) values ('rey' 'Rey');
insert into users (handle, name) values ('kyloren', 'Kylo Ren');
insert into users (handle, name) values ('hansolo', 'Han Solo');
select handle, name from users;
--    handle  |     name
--  ----------+--------------
--   rey      | Rey
--   kyloren  | Kylo Ren
--   hansolo  | Han Solo

We will start our users out with some tweets as well (don't worry, no spoilers).

insert into tweets (user_handle, content)
  values (
    'rey',
    '@kyloren you’re afraid that you will never be as strong as Darth Vadar.'
  );
insert into tweets (user_handle, content)
  values (
    'kyloren',
    'I will fulfill our destiny.'
  );
insert into tweets (user_handle, content)
  values (
    'hansolo',
    'Well, you tell them that Han Solo just stole back the Millennium Falcon, for good!'
  );
insert into tweets (user_handle, content)
  values (
    'hansolo',
    'I’ve got a bad feeling about this.'
  );
insert into tweets (user_handle, content)
  values (
    'rey',
    'I didn’t think there was this much green in the whole galaxy...'
  );
select user_handle, count(*) from tweets group by user_handle;
--   user_handle | count
--  -------------+-------
--   rey         |     2
--   kyloren     |     1
--   hansolo     |     2

Upserting Pinned Tweets

Before we look at the new upsert functionality, let's upsert with a writeable CTE (Common Table Expression) so that we have something for comparison. We can pin Han Solo's first tweet for him.

select * from pinned_tweets where user_handle = 'hansolo';
--  tweet_id | user_handle | pinned_at
-- ----------+-------------+-----------
-- (0 rows)

with upsert as (
  update pinned_tweets
  set (tweet_id, pinned_at) = (3, clock_timestamp())
  where user_handle = 'hansolo'
  returning *
)
insert into pinned_tweets (user_handle, tweet_id, pinned_at)
select 'hansolo', 3, clock_timestamp()
where not exists (
  select 1
  from upsert
  where upsert.user_handle = 'hansolo'
);
-- INSERT 0 1

select * from pinned_tweets where user_handle = 'hansolo';
--  tweet_id | user_handle |           pinned_at
-- ----------+-------------+-------------------------------
--         3 | hansolo     | 2016-01-11 16:59:13.410657-06
-- (1 row)

Let's now try the same sort of thing, but with the on conflict clause that we get with the 9.5 release. We'll pin a tweet for Rey.

select * from pinned_tweets where user_handle = 'rey';
--  tweet_id | user_handle | pinned_at
-- ----------+-------------+-----------
-- (0 rows)

insert into pinned_tweets (user_handle, tweet_id, pinned_at)
  values (
    'rey',
    5,
    clock_timestamp()
  )
on conflict (user_handle)
do update set (tweet_id, pinned_at) = (5, clock_timestamp())
where pinned_tweets.user_handle = 'rey';
-- INSERT 0 1

select * from pinned_tweets where user_handle = 'rey';
--  tweet_id | user_handle |          pinned_at
-- ----------+-------------+------------------------------
--         5 | rey         | 2016-01-11 17:08:52.41554-06
-- (1 row)

The addition of the upsert feature means we don't have to tangle with writeable CTEs here. Instead, we get to write a standard insert statement. All we have to add is the on conflict clause. The on conflict clause is a way to tell the insert statement what to do if it is colliding with an existing record. In our case, if the user_handle conflicts with an existing record in the pinned_tweets table, then we react by doing an update instead. That's an upsert.

Now that Rey has a pinned tweet, we can run the same statement as above to update to a different tweet.

insert into pinned_tweets (user_handle, tweet_id, pinned_at)
  values (
    'rey',
    1,
    clock_timestamp()
  )
on conflict (user_handle)
do update set (tweet_id, pinned_at) = (1, clock_timestamp())
where pinned_tweets.user_handle = 'rey';
-- INSERT 0 1

select * from pinned_tweets where user_handle = 'rey';
--  tweet_id | user_handle |          pinned_at
-- ----------+-------------+------------------------------
--         1 | rey         | 2016-01-11 17:09:55.41554-06
-- (1 row)

This time, the on conflict clause is triggered which executes the update part of the clause.

It inserts if it can. It updates if there is a conflict.

I suggest choosing the new upsert feature over the writeable CTE approach for a couple reasons.

First, it is conceptually simpler. Writing an insert statement with an on conflict clause is more straightforward than a writeable CTE. The writeable CTE approach feels clumsy, whereas the upsert syntax was made for this.

Second, readability. The difference may seem small in the contrived example above, but when faced with a wall of SQL, the new upsert syntax is going to win in readability.

Lastly, it is more performant. I performed a basic benchmark to compare the relative performance of the two approaches. Though the difference wasn't significant, the new upsert feature has an edge on the writeable CTE approach.

Update (1/15/2016): a redditor (pilif) pointed out that the previous approaches can all suffer from race conditions. The new upsert feature is free of race conditions. We can now upsert in a way that is race-proof and performant. They linked to Why Is Upsert So Complicated? which goes into more detail.

Counting

You now have a basic idea of how to use insert ... on conflict ... to perform an upsert. Let's look at another example of how we might use upsert. This time we'll use it to update a count.

Without fully specifying another example database, let's consider another scenario. Imagine we are trying to model an Inventory for characters in an RPG-style video game.

Whenever a character picks up an item, the item is added to their inventory. Internally we will represent that with a counting join table between Characters and Items. If they are picking up a new item, we add a record with a count of 1. If they already have identical items in their Inventory, then we just increment the count. An upsert is just what we need.

insert into characters_items as ci (character_id, item_id, quantity)
  values (
    1,
    23,
    1
  )
on conflict (character_id, item_id)
do update set quantity = ci.quantity + 1
  where ci.character_id = 1 and ci.item_id = 23;
-- INSERT 0 1

There are a couple of details worth discussing in this example.

First, the on conflict (character_id, item_id) has two arguments. A composite key. The on conflict clause needs a unique key with which to work. A composite primary key works well in this example. This, of course, means that the characters_items table was defined with just such a composite primary key.

Second, just as we saw in the previous example, we can (and should) use a where clause to specify exactly what record is to be updated.

Lastly, there is the potential for some ambiguity. As such, we need to tell the update part of the statement from what table the referenced columns are coming. Declaring an alias for the table in the insert part of the statement keeps things concise.

That's It?

That's really all there is to the basics of upserting in PostgreSQL 9.5. There is a lot more that we can do with the on conflict clause though.

We can target constraints. Instead of specifying indexed columns, we can have the on conflict specify a particular constraint as the target of a conflict.

We can do nothing. If our use case calls for it, we can also opt to do nothing when there is a conflict. In this case, we don't have to specify any indexed rows or constraints.

We can get fancy. In the update part of the query, we can use things like DEFAULT, expressions, and even sub-selects.

Check out the latest INSERT documentation to see the full specification and to read up on all the nitty-gritty details. If you've already started using the on conflict clause in your applications, whether for upserting or otherwise, let me know how you're using it.

May The Upsert Be With You!


Thanks to Andrew Dennis and Craig Kerstiens for providing feedback on earlier drafts of this post.

Cover image credit: Internet Archive Book Image's on Flickr

Posted in PostgreSQL and tagged with PostgreSQL