Create Quick JSON Data Dumps From PostgreSQL
There are lots of ways that we consume the data stored in our PostgreSQL databases. We build apps that query against our database. We explore the data manually from a PSQL session. Perhaps we even occasionally create a dump of the database's schema and data using pg_dump.
What if we want to export some subset of our data in a general purpose format that we can easily pass along to others? A JSON dump would fit that bill. In this post we'll explore how to quickly create a JSON dump of some of our data.
Why a JSON dump?
A customized JSON dump of our data is appealing for two reasons.
First, we can craft the shape of the data being dumped. We can be sure to include specific, relevant fields. This includes computed values. Likewise, we can exclude sensitive and irrelevant data. For instance, we may not want to expose serial ids that make up the primary key of our data.
Second, JSON is a clean and friendly export format. By dumping our data as JSON, we not only ensure that it is relatively human-readable, we also know that any developer, regardless of their development stack, will have access to a JSON parser with which to read that data into a program meant to consume or otherwise use the dump.
What We Need
To dump a subset of our data from Postgres as JSON, we'll need to do a couple of things. We will need to come up with the shape and content of the data as discussed above, turn the result table into JSON using json_agg, adjust some PSQL settings related to the output formatting, and, lastly, save the resulting JSON dump to a file.
Shape and Content
We are creating a JSON dump of our data because someone needs access to our data. They have specific needs though. You'll have to identify what they need and consider how cleanly that maps to your data model. Some things will map directly, whereas others will require computing a value based on a combination of rows or some known constant.
To make this a bit more concrete, let's consider how we would create a data dump of the Today I Learned posts from our tilex app. The posts table can be described as follows:
\d posts
Table "public.posts"
Column | Type | Modifiers
--------------+--------------------------+----------------------------------------------------
id | integer | not null default nextval('posts_id_seq'::regclass)
title | character varying | not null
body | text | not null
inserted_at | timestamp with time zone | not null default now()
updated_at | timestamp with time zone | not null default now()
channel_id | integer | not null
slug | character varying(255) | not null
likes | integer | not null default 1
max_likes | integer | not null default 1
published_at | timestamp with time zone | not null default now()
developer_id | integer |
tweeted_at | timestamp with time zone |
There are a lot of columns in this table. To start we only want to dump the title, body, published_at, and a permalink for each post. The first three -- title, body, and published_at -- map directly to columns on the table. There is no permalink field, but we can easily construct one from the slug.
select
title,
body,
published_at,
'https://til.hashrocket.com/posts/' || slug as permalink
from posts;
-[ RECORD 1 ]+---------------------------------------------------
title | Start rails server in production mode
body | `rails s -e production` ...
published_at | 2015-04-16 23:04:53.142287-05
permalink | https://til.hashrocket.com/posts/97e26f5f68
-[ RECORD 2 ]+---------------------------------------------------
title | Variable Hoisting in JavaScript
body | I while back I wrote about [variable ho...
published_at | 2016-04-03 13:47:25.763974-05
permalink | https://til.hashrocket.com/posts/eeedb8dda0
-[ RECORD 3 ]+---------------------------------------------------
title | Three ways to compile Elixir
...
Great! We took a big step in the right direction by pulling out the values we want, including a computed value (the permalink). Columns like id and inserted_at have been ignored. Let's push this example a bit further by joining in the channel names for each post as well as attributing each post to the developer who wrote it.
select
title,
body,
published_at,
'https://til.hashrocket.com/posts/' || slug as permalink,
channels.name,
developers.username
from posts
join channels on channels.id = posts.channel_id
join developers on developers.id = posts.developer_id;
With the shape and content of our dump settled, let's move on to the JSON step.
Transitioning To JSON
Data in a Postgres database is always in a tabular form. It is stored that way and the results come back that way; even something as simple as select 1; is presented as a table. We want to transition from a tabular result to a JSON one. Postgres has a json_agg function which will do most of the heavy lifting here. To prepare our previous statement for json_agg we need to wrap it in either a subquery or a CTE (common table expression). I'll show both and then proceed to use the CTE version throughout the rest of the post.
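Before pointing json_agg at our full query, it can help to see it on a tiny, self-contained example first (a throwaway values list rather than our posts table):

select json_agg(n) from (values (1), (2), (3)) as t(n);

This collapses the three rows into a single JSON array containing 1, 2, and 3. Aggregating whole rows, as we're about to do, works the same way except each element of the array becomes a JSON object.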
json_agg with subquery
select json_agg(t)
from (
select
title,
body,
published_at,
'https://til.hashrocket.com/posts/' || slug as permalink,
channels.name,
developers.username
from posts
join channels on channels.id = posts.channel_id
join developers on developers.id = posts.developer_id
) t;
json_agg with CTE
with t as (
select
title,
body,
published_at,
'https://til.hashrocket.com/posts/' || slug as permalink,
channels.name,
developers.username
from posts
join channels on channels.id = posts.channel_id
join developers on developers.id = posts.developer_id
)
select json_agg(t) from t;
Either way, we get a JSON result that looks something like the following:
--------------------------------------------------------------------------------
[{"title":"Start rails server in production mode","body":"`rails s -e production`",
"published_at":"2015-04-16T23:04:53.142287-05:00","?column?":"https://til.hashrocket.com/posts/97e26f5f68",
"name":"rails","username":"jakeworth"}, +
{"title":"Variable Hoisting in JavaScript","body":"I while back I wrote about[variable ho",
"published_at":"2016-04-03T13:47:25.763974-05:00","?column?":"https://til.hashrocket.com/posts/eeedb8dda0",
"name":"javascript","username":"jakeworth"}, +
There are two things to note regarding this output.
First, we have that odd dashed line header at the top and weird spacing between results marked by the + signs. We don't want any of that included in our final output and will deal with it in the next section.
Second, the names of some of the fields (keys in JSON speak) are not exactly what we'd like. We can and should adjust these. For the subquery approach, we can follow the pattern of the permalink field by tacking on as <preferred name> after each field as needed.
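As a quick sketch of that subquery approach with aliases applied (the same query as before, just with explicit as names on the last two columns):

select json_agg(t)
from (
  select
    title,
    body,
    published_at,
    'https://til.hashrocket.com/posts/' || slug as permalink,
    channels.name as channel,
    developers.username as author
  from posts
  join channels on channels.id = posts.channel_id
  join developers on developers.id = posts.developer_id
) t;

The CTE approach lets us do all of the naming in one place instead. Here is what our CTE approach will look like: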
with t(
title,
body,
published_at,
permalink,
channel,
author
) as (
select
title,
body,
published_at,
'https://til.hashrocket.com/posts/' || slug as permalink,
channels.name,
developers.username
from posts
join channels on channels.id = posts.channel_id
join developers on developers.id = posts.developer_id
)
select json_agg(t) from t;
We name each field in the description of the CTE. This makes things rather explicit. We know exactly what json_agg will do as a result.
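One optional tweak while we're here: json_agg accepts an order by clause inside the aggregate call, so if whoever consumes the dump cares about the order of the array elements, we can pin it down. A minimal sketch, reusing the CTE above and ordering by publication date:

select json_agg(t order by t.published_at) from t;

Now, to deal with the formatting.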
Output Formatting
Though json_agg is able to produce valid JSON, the PSQL output formatting quickly invalidates it. Both the table header and the result spacing delimited by + signs are not valid JSON.
With the help of a post on TIL itself, "Export text exactly from psql", we can clear up these issues.
The table headers can be removed by turning Tuples Only on.
> \t on
The spacing between the results can be removed by switching PSQL's formatting to unaligned.
> \pset format unaligned
These settings will only persist for the duration of the PSQL session. The next time you connect they will be back to their defaults.
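If you'd rather not flip these settings in every session, the same behavior is available as command-line flags when starting psql. A sketch of a one-off, non-interactive dump (the database name and SQL file here are hypothetical):

psql -t -A -d tilex_dev -f dump_posts.sql -o my_data_dump.json

The -t flag is tuples-only, -A is unaligned output, -f runs the query in the given file, and -o writes the result to a file.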
Export The Results
The last step is to take the fruits of our labor and persist them to some file. We can do this with ease right from Postgres using the \g meta-command in place of the semicolon. On its own, it will execute the query just as ; would. However, if followed by a filename, the result of the query will be written to that file.
with t(
title,
body,
published_at,
permalink,
channel,
author
) as (
select
title,
body,
published_at,
'https://til.hashrocket.com/posts/' || slug as permalink,
channels.name,
developers.username
from posts
join channels on channels.id = posts.channel_id
join developers on developers.id = posts.developer_id
)
select json_agg(t) from t \g my_data_dump.json
The output will now be in the specified file. In what directory is that file saved? Presumably the directory from which the PSQL session was initiated. To be sure, you can shell out to the pwd command.
> \! pwd
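If that isn't where you want the file, you can hand \g an absolute path (e.g. \g /tmp/my_data_dump.json), or change PSQL's working directory beforehand with the \cd meta-command (the target directory here is just an example):

> \cd /tmp

Either way, the next \g <filename> will land where you expect.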
Conclusion
You now have some data to share as JSON in the exact shape you want. Hopefully you've learned a little about CTEs, json_agg, PSQL's formatting, and the \g meta-command along the way. Happy Data Dumping!
Cover image by Taylor Bryant on Unsplash.com