Manipulating Data in Production
Manipulating data in a production environment could be tricky to perform. To start, we need to be think about performance as quite often we are dealing with huge database tables with millions of rows. And on the top of this concern, we also need to care about the data integrity itself.
Most modern apps run schema migrations right before spawning the server with the new code, so schema migrations are part of automated deploys. For data migrations however, we tend to run those manually, so we can choose a good time and day to run that, and usually when the traffic on the app is very low. So eventually they either connect to their app server and run a console where they could manipulate the data, or they could also go directly to our database server via
psql for example. This simple guide helps us to keep our deploy times on a constant base.
In addition, we need to consider the performance (run-time) of the data migration itself. Ideally, we want to ensure fewer table locks, best usage of indexes, batching changes, etc.
With data integrity, the scenario is a bit different. At this point, many developers rely on database backups, so if our data migration went terribly wrong we could "recover" from a database backup restoration. At first glance, this is an ok solution, but we will inevitably have a window of time between completing the backup and finishing the restore where data could be flowing in.
In order to make this a bit more assertive, I also have seen so many devs performing a small version of a backup in case of an eventual rollback. Sometimes they even store those backups on their local machine, which can raise a bunch of security questions. So at this point, backing up and restoring is a bit too customizable, which is good, but it takes a bit more time to perform. And a great thing to keep in mind is that in case of a problem, we want to restore with confidence and as fast as we can.
So on one hand, we have full database backups, usually stored by our databases providers on their servers, simple to backup and restore, with the potential of losing data on data changes inside the window of the backup to the restoration.
And on the other hand we have these customizable small subset of database backups that could be stored anywhere since a local dev environment, or even on a ticket card. The format of this data is also a bit open-air, it could be csv, tsv, some spreadsheet, or even just unformatted text.
A better place
There's no better place to store data than the database itself. So let's use PostgreSQL to store data changes that we care that happen in the same PostgreSQL instance. The idea is very simple and I can summarize that by creating some "temporary" table that will hold the affected data for how long we think that's necessary to validate that change. Remember that sometimes the business or the client could take a while to check these data changes, so you may have to hang in there for a bit.
Let's imagine that we have a single table called
CREATE TABLE planets ( id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY, name VARCHAR(100) NOT NULL UNIQUE, orbital_period INT NOT NULL );
In this first scenario imagine that our business team has sent up a spreadsheet with some values that they want to insert on that table. We can easily turn that to be a simple sql such as:
INSERT INTO planets (name, orbital_period) VALUES ('Venus', 10), ('Earth', 10), ('Mars', 687), ('Pluto', 90560);
Nice so we have some data provided by the business and we want to insert them. In this scenario, in case something goes wrong, our rollback strategy would be to delete those new planets. This way our strategy to create that "temporary" table as I mentioned before to insert data is something like:
CREATE TABLE _2022_04_11_planets_inserted AS ( WITH ids AS ( INSERT INTO planets (name, orbital_period) VALUES ('Venus', 10), ('Earth', 10), ('Mars', 687), ('Pluto', 90560) RETURNING id ) SELECT ids.id AS planet_id FROM ids );
Note that the main insert sql is preserved. In PostgreSQL we have to surround that
INSERT in a CTE so we can return the new planet ids and with that values create the new table
I've been quoting "temporary" here because this is just a regular PostgreSQL table, but I want to treat that as temporary in the sense of, I want to drop this table as soon as I feel confident to do so. My naming convention here is to prepend the table name with
_ so we can easily sort all those "temporary" tables, and I also want to bring the date of execution of that data migration in the table's name. If another developer wants to prune some of this data, they can use that date to make a judgement on staleness and drop those tables.
In this case
ids are enough info that we need to keep as the rollback of that is to delete, so we can rollback by simply:
DELETE FROM planets USING _2022_04_11_planets_inserted WHERE planets.id = _2022_04_11_planets_inserted.planet_id;
With deleting, I usually start my query not by the delete query itself, but by a select, as I usually wants to make sure that I am deleting the right thing. In this scenario, let's say that the business or the client noticed that
Pluto is not a planet and they asked us to delete it from the table. So we'd start with:
SELECT * FROM planets WHERE name = 'Pluto'
Again, let's wrap this query and insert the results on a new "temporary" table:
CREATE TABLE _2022_04_11_planets_deleted AS ( SELECT * FROM planets WHERE name = 'Pluto' ); SELECT * FROM _2022_04_11_planets_deleted; DELETE FROM planets USING _2022_04_11_planets_deleted WHERE planets.id = _2022_04_11_planets_deleted.id;
Note here that we can create the table
_2022_04_11_planets_deleted with the data we want to delete. We can then check if the data is alright, and if so, we can continue and perform the
For a rollback, we want to have all the deleted data so we can create them again. This way:
INSERT INTO planets ( SELECT * FROM _2022_04_11_planets_deleted );
The sql here is so simple, but it's worthwhile to remember that from the time we created the
_2022_04_11_planets_deleted and the rollback, there could be changes in the planets table that could break the rollback. In this case we could potentially alter the new
_2022_04_11_planets_deleted table to adjust to new names, column types, new columns and etc.
For the updating scenario, in order to perform a rollback, we would need to know which was the state of the data before the change. The strategy here is to create a "temporary" table with the id as a reference, and the values that are going to change, the
*_from and the
*_to versions of them.
In this example, the business team noticed that the
orbital_period are incorrect for some of the planets, so they provide us a new spreadsheet with the correct values to be. With that we can create our migration as:
CREATE TABLE _2022_04_11_planets_updated AS ( SELECT p.id AS planet_id, p.orbital_period AS orbital_period_from, v.orbital_period AS orbital_period_to FROM planets p INNER JOIN ( SELECT * FROM ( VALUES ('Earth', 365), ('Venus', 264) ) AS planets_to_update (name, orbital_period) ) v ON v.name = p.name ); SELECT * FROM _2022_04_11_planets_updated; UPDATE planets SET orbital_period = u.orbital_period_to FROM _2022_04_11_planets_updated u WHERE planets.id = u.planet_id;
And the rollback in this case is basically the same
UPDATE query with a slight change in the field from to use
UPDATE planets SET orbital_period = u.orbital_period_from FROM _2022_04_11_planets_updated u WHERE planets.id = u.planet_id;
When we need to perform data changes in production, we want to make sure that we can mitigate all risks involved in the task. To be honest, data migrations are not ideal; we usually should build admin pages so the business can fix the data themselves. But we understand that can't happen in every scenario. We need to embrace those tasks in a way that we can still feel comfortable to operate them.
With this post I wanted to share some simple strategy that could be applied to data migrations. I know that this is not a silver bullet solution, but it works fine in a lot of them. Also keep in mind that this solution does not increase the time on the task, it's a receipt to be followed. It's very simple to put in place, and it help us to get ready for rollback situations.