PostgreSQL

Create Quick JSON Data Dumps From PostgreSQL

by Josh Branchaud on October 10, 2017

There are lots of ways that we consume the data stored in our PostgreSQL databases. We build apps that query against our database. We explore the data manually from a PSQL session. Perhaps we even occasionally create a dump of the databases' schema and data using pg_dump.

What if we want to export some subset of our data in a general purpose format that we can easily pass along to others? A JSON dump would fit that bill. In this post we'll explore how to quickly create a JSON dump of some of our data.

Why a JSON dump?

A customized JSON dump of our data is appealing for two reasons.

First, we can craft the shape of the data being dumped. We can be sure to include specific, relevant fields. This includes computed values. Likewise, we can exclude sensitive and irrelevant data. For instance, we may not want to expose serial ids that make up the primary key of our data.

Second, JSON is a clean and friendly export format. By dumping our data as JSON we are not only ensuring that it is relatively human-readable, we also know that any developer regardless of their development stack will have access to a JSON parser with which they can read that data into a program meant to consume or otherwise use the data dump.

What We Need

To dump a subset of our data from Postgres as JSON, we'll need to do a couple things. We will need to come up with the shape and content of the data as discussed above, turn the result table into JSON using json_agg, adjust some PSQL settings related to the output formatting, and, lastly, save the resulting JSON dump to a file.

Shape and Content

We are creating a JSON dump of our data because someone needs access to our data. They have specific needs though. You'll have to identify what they need and consider how cleanly that maps to your data model. Some things will map directly where as others will require computing a value based on a combination of rows or some known constant.

To make this a bit more concrete, let's consider how we would create a data dump of the Today I Learned posts from our tilex app.

The posts table can be described as follows:

\d posts
                                     Table "public.posts"
    Column    |           Type           |                     Modifiers
--------------+--------------------------+----------------------------------------------------
 id           | integer                  | not null default nextval('posts_id_seq'::regclass)
 title        | character varying        | not null
 body         | text                     | not null
 inserted_at  | timestamp with time zone | not null default now()
 updated_at   | timestamp with time zone | not null default now()
 channel_id   | integer                  | not null
 slug         | character varying(255)   | not null
 likes        | integer                  | not null default 1
 max_likes    | integer                  | not null default 1
 published_at | timestamp with time zone | not null default now()
 developer_id | integer                  |
 tweeted_at   | timestamp with time zone |

There are a lot of columns in this table. To start we only want to dump the title, body, published_at, and a permalink for each post. The first three -- title, body, and published_at -- map directly to columns on the table. There is no permalink field, but we can easily construct one from the slug.

select
  title,
  body,
  published_at,
  'https://til.hashrocket.com/posts/' || slug as permalink
from posts;

-[ RECORD 1 ]+---------------------------------------------------
title        | Start rails server in production mode
body         | `rails s -e production` ...
published_at | 2015-04-16 23:04:53.142287-05
permalink    | https://til.hashrocket.com/posts/97e26f5f68
-[ RECORD 2 ]+---------------------------------------------------
title        | Variable Hoisting in JavaScript
body         | I while back I wrote about [variable ho...
published_at | 2016-04-03 13:47:25.763974-05
permalink    | https://til.hashrocket.com/posts/eeedb8dda0
-[ RECORD 3 ]+---------------------------------------------------
title        | Three ways to compile Elixir
...

Great! We took a big step in the right direction by pulling out values we want including a computed value (the permalink). Columns like id and inserted_at have been ignored. Let's push this example a bit further by joining in the channel names for each post as well as attributing each post to the developer who wrote it.

select
  title,
  body,
  published_at,
  'https://til.hashrocket.com/posts/' || slug as permalink,
  channels.name,
  developers.username
from posts
join channels on channels.id = posts.channel_id
join developers on developers.id = posts.developer_id;

With the shape and content of our dump settled, let's move on to the JSON step.

Transitioning To JSON

Data in a Postgres database is always in a tabular form. It is stored that way, the results come back that way, even something as simple as select 1; is presented as a table. We want to transition from a tabular result to a JSON one.

Postgres has a json_agg function which will do most of the heavy lifting here. To prepare our previous statement for json_agg we need to wrap it in either a subquery or a CTE (common table expression). I'll show both and then proceed to use the CTE version throughout the rest of the post.

`json_agg` with subquery

select json_agg(t)
from (
  select
    title,
    body,
    published_at,
    'https://til.hashrocket.com/posts/' || slug as permalink,
    channels.name,
    developers.username
  from posts
  join channels on channels.id = posts.channel_id
  join developers on developers.id = posts.developer_id
) t;

`json_agg` with CTE

with t as (
  select
    title,
    body,
    published_at,
    'https://til.hashrocket.com/posts/' || slug as permalink,
    channels.name,
    developers.username
  from posts
  join channels on channels.id = posts.channel_id
  join developers on developers.id = posts.developer_id
)
select json_agg(t) from t;

Either way, we get a JSON result that looks something like the following:

--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
----------------------------------
 [{"title":"Start rails server in production mode","body":"`rails s -e production`",
   "published_at":"2015-04-16T23:04:53.142287-05:00","?column?":"https://til.hashrocket.com/posts/97e26f5f68",
   "name":"rails","username":"jakeworth"},                                 +
  {"title":"Variable Hoisting in JavaScript","body":"I while back I wrote about[variable ho",
   "published_at":"2016-04-03T13:47:25.763974-05:00","?column?":"https://til.hashrocket.com/posts/eeedb8dda0",
   "name":"javascript","username":"jakeworth"},                            +

There are two things to note regarding this output.

First, we have that odd dashed line header at the top and weird spacing between results marked by the + signs. We don't want any of that included in our final output and will deal with it in the next section.

Second, the names of some of the fields (keys in JSON speak) are not exactly what we'd like. We can and should adjust these. For the subquery approach, we can follow the pattern of the permalink field by tacking on as <preferred name> after each field as needed. Here is what our CTE approach will look like:

with t(
  title,
  body,
  published_at,
  permalink,
  channel,
  author
) as (
  select
    title,
    body,
    published_at,
    'https://til.hashrocket.com/posts/' || slug as permalink,
    channels.name,
    developers.username
  from posts
  join channels on channels.id = posts.channel_id
  join developers on developers.id = posts.developer_id
)
select json_agg(t) from t;

We name each field in the description of the CTE. This makes things rather explicit. We know exactly what json_agg will do as a result. Now, to deal with the formatting.

Output Formatting

Though json_agg is able to produce valid JSON, the PSQL output formatting quickly invalidates it. Both the table header and the result spacing delimited by + signs are not valid JSON.

With the help of a post on TIL itself, Export text exactly from psql, we can clear up these issues.

The table headers can be removed by turning Tuples Only on.

> \t on

The spacing between the results can be removed by switch PSQL's formatting to unaligned.

> \pset format unaligned

These settings will only persist for the duration of the PSQL session. The next time you connect they will be back to their defaults.

Export The Results

The last step is to take the fruits of our labor and persist them to some file. We can do this with ease right from Postgres using the \g meta-command in place of the semicolon. On its own, it will execute the query just as ; would. However, if followed by a filename, the result of the query will be written to that file.

with t(
  title,
  body,
  published_at,
  permalink,
  channel,
  author
) as (
  select
    title,
    body,
    published_at,
    'https://til.hashrocket.com/posts/' || slug as permalink,
    channels.name,
    developers.username
  from posts
  join channels on channels.id = posts.channel_id
  join developers on developers.id = posts.developer_id
)
select json_agg(t) from t \g my_data_dump.json

The output will now be in the specified file. In what directory is that file saved? Presumably the directory from which the PSQL session was initiated. To be sure, you can shell out to the pwd command.

> \! pwd

Conclusion

You now have some data to share as JSON in the exact shape you want. Hopefully you've learned a little about CTEs, json_agg, PSQL's formatting, and the \g meta-command along the way. Happy Data Dumping!

Cover image by Taylor Bryant on Unsplash.com