Ecto Migrations: Simple to Complex
Over the lifetime of an application, its database undergoes many transformations: new tables are added, relationships are formed, columns are added and removed, data is massaged, and so on. We know this well. This constant change is what underlies our need for database migrations.
These migrations span a wide range. At one end we have the textbook modifications to our schema; we can pull up Ecto's Getting Started guide to see how we might add a table or column. At the other end are more sophisticated migrations, which may involve multiple steps or even migrating and massaging data.
In this post, we'll explore this range.
Adding A Table
Let's add a posts table for our blog app.
def change do
  create table(:posts) do
    add :title, :varchar
    add :content, :text

    timestamps()
  end
end
This gives us a primary key column id by default. This is a unique integer column starting at 1 and counting upward using a sequence managed by the database. Every post ought to have a title and some content, so we've added columns for those. Ecto gives us the flexibility to use data types native to our database, so we've done just that with some of PostgreSQL's native data types (varchar and text).

Lastly, the timestamps() function will produce inserted_at and updated_at timestamp columns.

Running mix ecto.migrate will add the posts table and a posts_id_seq sequence to our database schema.
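For context, each change/0 (or up/0 and down/0) function shown in this post lives inside a migration module that pulls in the migration DSL with use Ecto.Migration. A minimal sketch, with module and app names assumed:

defmodule MyApp.Repo.Migrations.CreatePosts do
  # Brings in create, alter, add, modify, timestamps, and friends
  use Ecto.Migration

  def change do
    # ... the body shown above ...
  end
end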
Altering A Table
We said that each post ought to have a title and some content, but we aren't actually enforcing that. Our database can enforce the presence of those columns when inserting and updating records. We just have to tell it to do so with the not null modifier.
Let's generate a new migration.
def up do
  alter table(:posts) do
    modify :title, :varchar, null: false
    modify :content, :text, null: false
  end
end
We use the alter function to target our posts table. The modify function declares how the named column is to be modified. We have to specify the data type of the column. If it is the same, nothing changes. This is, however, an opportunity to change a column's type. For our purposes though, we are just declaring that these columns cannot be null, hence the null: false.
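As an illustration of such a type change, if we later decided titles should not be length-limited, the same modify call could swap varchar for text while keeping the constraint; a sketch:

alter table(:posts) do
  # Same column, new type; null: false keeps the constraint in place
  modify :title, :text, null: false
end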
Every up migration requires a down migration. Because we are modifying existing columns, the down migration cannot be inferred, so we have to spell it out explicitly.
def down do
  alter table(:posts) do
    modify :title, :varchar, null: true
    modify :content, :text, null: true
  end
end
The down migration explicitly says that these columns can be null. This takes us back to where we were if we need to roll back.
Adding A Column With A Default
Posts take time to write and sometimes go through a number of iterations. We'd like to know whether a particular post is in a draft or published state. We can achieve this by adding a published flag in a new migration.
def change do
  alter table(:posts) do
    add :published, :boolean, null: false, default: false
  end
end
A post is either published or it isn't, so a boolean published column will do the trick. Because publishing is a binary status, null doesn't mean anything to us, so we make the column not nullable. Lastly, we want to be explicit about publishing a post, so we default the published status of a post to false.
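With that in place, any insert that leaves published out picks up the database default. A quick schemaless sketch (MyApp.Repo and the row values are assumptions for illustration):

now = NaiveDateTime.truncate(NaiveDateTime.utc_now(), :second)

# published is omitted, so the database stores the default: false
MyApp.Repo.insert_all("posts", [
  %{title: "Draft thoughts", content: "Work in progress", inserted_at: now, updated_at: now}
])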
Referencing Another Table
Our posts don't appear out of nowhere. People write them, and those people would probably like attribution. We should add a table of people and then reference those people in the posts[1].
def change do
  create table(:people) do
    add :name, :varchar
  end

  alter table(:posts) do
    add :people_id, references(:people), null: false
  end
end
This first creates the people table, again with an implicit primary key id column. The second portion of the migration alters the posts table to add a people_id column.
The references function adds a foreign key constraint to ensure the integrity of the relationship between people and their posts. In other words, the database makes sure that post records never get orphaned. With a foreign key constraint in place, we can be sure that a post pointing to a person with an id of 6 references a person that actually exists. If we try to delete the person with an id of 6, our database will stop us, reminding us that we need to deal with the posts that depend on person 6 first. Perhaps the right thing to do is delete those posts as well.
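If cascading the delete is indeed the right call, references accepts an :on_delete option that pushes that decision down to the database. The add from above could become, for example:

alter table(:posts) do
  # ON DELETE CASCADE: removing a person also removes their posts.
  # Alternatives include :nothing (the default), :nilify_all, and :restrict.
  add :people_id, references(:people, on_delete: :delete_all), null: false
end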
It should be noted that by default the references function assumes the key it references is named id. So, in our case, references(:people) means there will be a foreign key from people_id of posts to id of people. If you want to reference a differently-named column, you can include the :column option as part of the second argument to references, specifying the name of the referenced column.
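For example, if the people table used a hypothetical uuid column as its key instead of id, the reference might look like this (column name and type are assumptions for illustration):

alter table(:posts) do
  # Point the foreign key at people.uuid; :type must match the
  # referenced column's type
  add :people_id, references(:people, column: :uuid, type: :uuid), null: false
end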
Transitioning A Column
Let's imagine some time has passed and our app, now in production, has some people who are creating and publishing posts.
We'd like to be able to display in the UI the time at which a post was published. We consider utilizing the updated_at column when a post is in the published state, but a post can be updated after it is published, so this is not an accurate indicator.
Our next thought is to add a published_at timestamp column. If the post is transitioned to the published state, then we set published_at to the current time and display that time as long as the post remains in the published state. If published is changed from true to false for a post, then we can null out the published_at value. That'll work, but we've run into a data design smell.
With this second approach, we've introduced a denormalization of our data. The published and published_at fields have partially overlapping concerns, and it is possible for them to get out of sync. We can achieve the same ends with a third, simplified option that only involves one column.
We can replace the published flag with the published_at timestamp. We maintain the binary indicator because published_at being null tells us the post is in a draft state, whereas published_at being some timestamp tells us both that it is published and the time at which it was published.
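In application code, the binary check then collapses to a nil test on that one column; something along these lines:

# A post is published exactly when published_at is set
def published?(post), do: not is_nil(post.published_at)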
Let's add this migration.
def change do
  alter table(:posts) do
    add :published_at, :timestamp
    remove :published
  end
end
That will do the trick. But hold up a second. If we run this migration against our production database, we will be doing something very sad. We'll be throwing away a bunch of data and messing up the publish status of all our existing posts.
We need to transition some data before we can remove the published column. We need to set all existing unpublished posts to have null for published_at. All the existing published posts, however, need to be assigned a timestamp. We already know that our existing data model was not sufficient for representing when a post was published. We are going to have to do some approximation, and that's okay.
Let's say that, in terms of this data migration, setting each published post's published_at value to the time that post was last updated is sufficient. So, for each post with published set to true, we need to duplicate that post's updated_at value into the newly-created published_at column. That will require an update_all statement[2].
Let's modify our migration from above.
import Ecto.Query

def up do
  alter table(:posts) do
    add :published_at, :timestamp
  end

  from(p in "posts",
    update: [set: [published_at: p.updated_at]],
    where: p.published
  )
  |> MyApp.Repo.update_all([])

  alter table(:posts) do
    remove :published
  end
end
This migration looks like exactly what we want, but when we run mix ecto.migrate, we are going to see an odd error.
** (Postgrex.Error) ERROR (undefined_column): column p0.published_at does not exist
How can it not exist for our update_all function when we clearly added that column just a few lines above?
The issue is that the Ecto migrator is queueing up everything in our migration to be executed against the database. When it reaches the update_all statement, it tries to create a prepared statement that involves a column that has not yet been created.
To ensure that the first portion of our migration is executed before the update_all is reached, we can utilize the flush() function. Placing flush() after the first portion of the migration will force everything queued up so far to be executed against the database. This clears the way for the rest of our migration to work.
def up do
  alter table(:posts) do
    add :published_at, :timestamp
  end

  # Execute everything queued so far, so the new column exists
  # before the update below runs
  flush()

  from(p in "posts",
    update: [set: [published_at: p.updated_at]],
    where: p.published
  )
  |> MyApp.Repo.update_all([])

  alter table(:posts) do
    remove :published
  end
end
And like before, with any up migration, we need a down.
def down do
  alter table(:posts) do
    add :published, :boolean, null: false, default: false
  end

  # Restore the flag from the timestamp before dropping it
  flush()

  from(p in "posts",
    update: [set: [published: true]],
    where: not is_nil(p.published_at)
  )
  |> MyApp.Repo.update_all([])

  alter table(:posts) do
    remove :published_at
  end
end
This requires some of the same tricks as the up migration because we want to be able to put the data back in place, as best we can, in the event of a rollback.
Conclusion
In this post, we've explored a range of simple to complex Ecto migrations involving both schema changes and data transformation. We looked at how you can create new tables as well as alter existing ones. We even explored in detail how you would go about massaging and migrating data in the event that you need to replace one column with another. Ecto is an abstraction on top of SQL and we may need to drop down to raw SQL migrations in some cases, but from what we've seen Ecto can get us pretty far.
[1]: The following migration assumes an empty database, as if we are in the midst of creating an initial MVP of our application. If the database already contained posts, then adding a non-nullable column would cause an error due to a violation of the not null constraint.
[2]: When using the update_all function, there are two ways of specifying the terms of the update. As we did above, the :update keyword can be used within our Query struct to specify how to update the relevant records. Because we went this route, we leave the required updates argument to update_all as an empty list. If we don't specify the update as part of the query, then we need to include those details as part of the updates argument to update_all.
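For reference, the update in our down migration could equivalently move its terms into the updates argument; a sketch:

from(p in "posts", where: not is_nil(p.published_at))
|> MyApp.Repo.update_all(set: [published: true])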
Photo credit for cover image: Paul Morris, unsplash.com