Ecto Migrations: Simple to Complex
Over the lifetime of an application, its database undergoes many transformations: new tables are added, relationships are formed, columns are added and removed, data is massaged, and so on. We know this well. This constant change is what underlies our need for database migrations.
These migrations span a wide range. At one end we have the textbook modifications to our schema; we can pull up Ecto's Getting Started guide to see how we might add a table or column. At the other end are more sophisticated migrations, which may involve multiple steps or even migrating and massaging data.
In this post, we'll explore this range.
Adding A Table
Let's add a posts table for our blog app.
def change do
  create table(:posts) do
    add :title, :varchar
    add :content, :text

    timestamps()
  end
end
This gives us a primary key column id by default. This is a unique integer column starting at 1 and counting upward using a sequence managed by the database. Every post ought to have a title and some content, so we've added columns for those. Ecto gives us the flexibility to use data types native to our database, so we've done just that with some of PostgreSQL's native data types (varchar and text).

Lastly, the timestamps() function will produce inserted_at and updated_at timestamp columns.

Running mix ecto.migrate will add the posts table and a posts_id_seq sequence to our database schema.
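For context, each change/0 (or up/0 and down/0) function shown in this post lives inside a migration module that pulls in the migration DSL with use Ecto.Migration. A minimal sketch, with module and app names assumed:

defmodule MyApp.Repo.Migrations.CreatePosts do
  # Brings in create, alter, add, modify, timestamps, and friends
  use Ecto.Migration

  def change do
    # ... the body shown above ...
  end
end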
Altering A Table
We said that each post ought to have a title and some content, but we aren't actually enforcing that. Our database can enforce the presence of those columns when inserting and updating records. We just have to tell it to do so with the not null modifier.
Let's generate a new migration.
def up do
  alter table(:posts) do
    modify :title, :varchar, null: false
    modify :content, :text, null: false
  end
end
We use the alter function to target our posts table. The modify function declares how the named column is to be modified. We have to specify the data type of the column. If it is the same, nothing changes. This is, however, an opportunity to change a column's type. For our purposes though, we are just declaring that these columns cannot be null, hence the null: false.
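As an illustration of such a type change, if we later decided titles should not be length-limited, the same modify call could swap varchar for text while keeping the constraint; a sketch:

alter table(:posts) do
  # Same column, new type; null: false keeps the constraint in place
  modify :title, :text, null: false
end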
Every up migration requires a down migration. Because we are modifying existing columns, the down migration cannot be inferred, so we have to spell it out explicitly.
def down do
  alter table(:posts) do
    modify :title, :varchar, null: true
    modify :content, :text, null: true
  end
end
The down migration explicitly says that these columns can be null. This takes us back to where we were if we need to roll back.
Adding A Column With A Default
Posts take time to write and sometimes go through a number of iterations. We'd like to know whether a particular post is in a draft or published state. We can achieve this by adding a published flag in a new migration.
def change do
  alter table(:posts) do
    add :published, :boolean, null: false, default: false
  end
end
A post is either published or it isn't, so a boolean published column will do the trick. Because publishing is a binary status, null doesn't mean anything to us, so we make the column not nullable. Lastly, we want to be explicit about publishing a post, so we default the published status of a post to false.
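With that in place, any insert that leaves published out picks up the database default. A quick schemaless sketch (MyApp.Repo and the row values are assumptions for illustration):

now = NaiveDateTime.truncate(NaiveDateTime.utc_now(), :second)

# published is omitted, so the database stores the default: false
MyApp.Repo.insert_all("posts", [
  %{title: "Draft thoughts", content: "Work in progress", inserted_at: now, updated_at: now}
])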
Referencing Another Table
Our posts don't appear out of nowhere. People write them, and those people would probably like attribution. We should add a table of people and then reference those people in the posts[1].
def change do
  create table(:people) do
    add :name, :varchar
  end

  alter table(:posts) do
    add :people_id, references(:people), null: false
  end
end
This first creates the people table, again with an implicit primary key id column. The second portion of the migration alters the posts table to add a people_id column.
The references function adds a foreign key constraint to ensure the integrity of the relationship between people and their posts. In other words, the database makes sure that post records never get orphaned. With a foreign key constraint in place, we can be sure that a post pointing to a person with an id of 6 references a person that actually exists. If we try to delete the person with an id of 6, our database will stop us, reminding us that we need to deal with the posts that depend on person 6 first. Perhaps the right thing to do is delete those posts as well.
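If cascading the delete is indeed the right call, references accepts an :on_delete option that pushes that decision down to the database. The add from above could become, for example:

alter table(:posts) do
  # ON DELETE CASCADE: removing a person also removes their posts.
  # Alternatives include :nothing (the default), :nilify_all, and :restrict.
  add :people_id, references(:people, on_delete: :delete_all), null: false
end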
It should be noted that by default the references function assumes the key it references is named id. So, in our case, references(:people) means there will be a foreign key from people_id of posts to id of people. If you want to reference a differently-named column, you can include the :column option as part of the second argument to references, specifying the name of the referenced column.
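For example, if the people table used a hypothetical uuid column as its key instead of id, the reference might look like this (column name and type are assumptions for illustration):

alter table(:posts) do
  # Point the foreign key at people.uuid; :type must match the
  # referenced column's type
  add :people_id, references(:people, column: :uuid, type: :uuid), null: false
end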
Transitioning A Column
Let's imagine some time has passed and our app, now in production, has some people who are creating and publishing posts.
We'd like to be able to display in the UI the time at which a post was published. We consider utilizing the updated_at column when a post is in the published state, but a post can be updated after it is published, so this is not an accurate indicator.
Our next thought is to add a published_at timestamp column. If the post is transitioned to the published state, then we set published_at to the current time and display that time as long as the post remains in the published state. If published is changed from true to false for a post, then we can null out the published_at value. That'll work, but we've run into a data design smell.
With this second approach, we've introduced a denormalization of our data. The published and published_at fields have partially overlapping concerns, and it is possible for them to get out of sync. We can achieve the same ends with a third, simplified option that only involves one column.
We can replace the published flag with the published_at timestamp. We maintain the binary indicator because published_at being null tells us the post is in a draft state, whereas published_at being some timestamp tells us both that it is published and the time at which it was published.
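In application code, the binary check then collapses to a nil test on that one column; something along these lines:

# A post is published exactly when published_at is set
def published?(post), do: not is_nil(post.published_at)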
Let's add this migration.
def change do
  alter table(:posts) do
    add :published_at, :timestamp
    remove :published
  end
end
That will do the trick. But hold up a second. If we run this migration against our production database, we will be doing something very sad. We'll be throwing away a bunch of data and messing up the publish status of all our existing posts.
We need to transition some data before we can remove the published column. We need to set all existing unpublished posts to have null for published_at. All the existing published posts, however, need to be assigned a timestamp. We already know that our existing data model was not sufficient for representing when a post was published. We are going to have to do some approximation, and that's okay.
Let's say that, in terms of this data migration, setting each published post's published_at value to the time that post was last updated is sufficient. So, for each post with published set to true, we need to duplicate that post's updated_at value into the newly-created published_at column. That will require an update_all statement[2].
Let's modify our migration from above.
import Ecto.Query

def up do
  alter table(:posts) do
    add :published_at, :timestamp
  end

  from(p in "posts",
    update: [set: [published_at: p.updated_at]],
    where: p.published
  )
  |> MyApp.Repo.update_all([])

  alter table(:posts) do
    remove :published
  end
end
This migration looks like exactly what we want, but when we run mix ecto.migrate, we are going to see an odd error.
** (Postgrex.Error) ERROR (undefined_column): column p0.published_at does not exist
How can it not exist for our update_all function when we clearly added that column just a few lines above?
The issue is that the Ecto migrator is queueing up everything in our migration to be executed against the database. When it reaches the update_all statement, it tries to create a prepared statement that involves a column that has not yet been created.
To ensure that the first portion of our migration is executed before the update_all is reached, we can utilize the flush() function. Placing flush() after the first portion of the migration will force everything queued up so far to be executed against the database. This clears the way for the rest of our migration to work.
def up do
  alter table(:posts) do
    add :published_at, :timestamp
  end

  # Execute everything queued so far, so the new column exists
  # before the update below runs
  flush()

  from(p in "posts",
    update: [set: [published_at: p.updated_at]],
    where: p.published
  )
  |> MyApp.Repo.update_all([])

  alter table(:posts) do
    remove :published
  end
end
And like before, with any up migration, we need a down.
def down do
  alter table(:posts) do
    add :published, :boolean, null: false, default: false
  end

  # Restore the flag from the timestamp before dropping it
  flush()

  from(p in "posts",
    update: [set: [published: true]],
    where: not is_nil(p.published_at)
  )
  |> MyApp.Repo.update_all([])

  alter table(:posts) do
    remove :published_at
  end
end
This requires some of the same tricks as the up migration because we want to be able to put the data back in place, as best we can, in the event of a rollback.
Conclusion
In this post, we've explored a range of simple to complex Ecto migrations involving both schema changes and data transformation. We looked at how you can create new tables as well as alter existing ones. We even explored in detail how you would go about massaging and migrating data in the event that you need to replace one column with another. Ecto is an abstraction on top of SQL and we may need to drop down to raw SQL migrations in some cases, but from what we've seen Ecto can get us pretty far.
[1]: The following migration assumes an empty database, as if we are in the midst of creating an initial MVP of our application. If the database already contained posts, then adding a non-nullable column would cause an error due to a violation of the not null constraint.
[2]: When using the update_all function, there are two ways of specifying the terms of the update. As we did above, the :update keyword can be used within our Query struct to specify how to update the relevant records. Because we went this route, we leave the required updates argument to update_all as an empty list. If we don't specify the update as part of the query, then we need to include those details as part of the updates argument to update_all.
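For reference, the update in our down migration could equivalently move its terms into the updates argument; a sketch:

from(p in "posts", where: not is_nil(p.published_at))
|> MyApp.Repo.update_all(set: [published: true])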
Photo credit for cover image: Paul Morris, unsplash.com