Materialized View Strategies Using PostgreSQL
Queries returning aggregate, summary, and computed data are frequently used in application development. Sometimes these queries are not fast enough. Caching query results using Memcached or Redis is a common approach for resolving these performance issues. However, these bring their own challenges. Before reaching for an external tool it is worth examining what techniques PostgreSQL offers for caching query results.
We will examine different approaches using the sample domain of a simplified account system. Accounts can have many transactions. Transactions can be recorded ahead of time and only take effect at post time. e.g. A debit that is effective on March 9 can be entered on March 1. The summary data we need is account balance.
create table accounts( name varchar primary key ); create table transactions( id serial primary key, name varchar not null references accounts on update cascade on delete cascade, amount numeric(9,2) not null, post_time timestamptz not null ); create index on transactions (name); create index on transactions (post_time);
Sample Data and Queries
For this example, we will create 30,000 accounts with an average of 50 transactions each.
All the sample code and data is available on Github.
Our query that we will optimize for is finding the balance of accounts. To start we will create a view that finds balances for all accounts. A PostgreSQL view is a saved query. Once created, selecting from a view is exactly the same as selecting from the original query, i.e. it reruns the query each time.
create view account_balances as select name, coalesce( sum(amount) filter (where post_time <= current_timestamp), 0 ) as balance from accounts left join transactions using(name) group by name;
Note that this uses an aggregate filter clause, an awesome feature introduced in PostgreSQL 9.4.
Now we simply select all rows with negative balances.
select * from account_balances where balance < 0;
After several runs to warm OS and PostgreSQL caches, this query takes approximately 3850ms.
We are going to examine multiple solutions. To keep them namespaced we will create separate schemas for each approach.
create schema matview; create schema eager; create schema lazy;
PostgreSQL Materialized Views
The simplest way to improve performance is to use a materialized view. A materialized view is a snapshot of a query saved into a table.
create materialized view matview.account_balances as select name, coalesce( sum(amount) filter (where post_time <= current_timestamp), 0 ) as balance from accounts left join transactions using(name) group by name;
Because a materialized view actually is a table, we can create indexes.
create index on matview.account_balances (name); create index on matview.account_balances (balance);
To retrieve the balance from each row we simple select from the materialized view.
select * from matview.account_balances where balance < 0;
The performance impact is impressive. It now only takes 13ms to retrieve all the accounts with negative balances -- 453x faster! Unfortunately, these materialized views have two substantial limitations. First, they are only updated on demand. Second, the whole materialized view must be updated; there is no way to only update a single stale row.
-- refresh all rows refresh materialized view matview.account_balances;
In the case where possibly stale data is acceptable, they are an excellent solution. But if data must always be fresh they are not a solution.
Eager Materialized View
Our next approach is to materialize the query into a table that is eagerly updated whenever a change occurs that would invalidate a row. We can do that with triggers. A trigger is a bit of code that runs when some event such as an insert or update happens.
First, we create the table to store the materialized rows.
create table eager.account_balances( name varchar primary key references accounts on update cascade on delete cascade, balance numeric(9,2) not null default 0 ); create index on eager.account_balances (balance);
Now we need to think of every way that
account_balances could become stale.
An account is inserted
On account insertion we need to create a
account_balances record with a zero balance for the new account.
create function eager.account_insert() returns trigger security definer language plpgsql as $$ begin insert into eager.account_balances(name) values(new.name); return new; end; $$; create trigger account_insert after insert on accounts for each row execute procedure eager.account_insert();
The syntax for create function and create trigger is quite extensive. Refer to the documentation for details. But the summary explanation is this: We create the function
eager.account_insert as a trigger function that will run with the permissions of the user who created it (
security definer). Inside a insert trigger function,
new is a variable that holds the new record.
An account is updated or deleted
Account update and deletion will be handled automatically because the foreign key to account is declared as
on update cascade on delete cascade.
A transaction is inserted, updated, or deleted
Transaction insert, update, and delete all have one thing in common: they invalidate the account balance. So the first step is to define a refresh account balance function.
create function eager.refresh_account_balance(_name varchar) returns void security definer language sql as $$ update eager.account_balances set balance= ( select sum(amount) from transactions where account_balances.name=transactions.name and post_time <= current_timestamp ) where name=_name; $$;
Next we can create trigger function that calls
refresh_account_balance whenever a transaction is inserted.
create function eager.transaction_insert() returns trigger security definer language plpgsql as $$ begin perform eager.refresh_account_balance(new.name); return new; end; $$; create trigger eager_transaction_insert after insert on transactions for each row execute procedure eager.transaction_insert();
Perform is how you execute a query where you do not care about the result in PL/pgSQL.
For the delete of a transaction we only get the variable
old instead of
old stores the previous value of the row.
create function eager.transaction_delete() returns trigger security definer language plpgsql as $$ begin perform eager.refresh_account_balance(old.name); return old; end; $$; create trigger eager_transaction_delete after delete on transactions for each row execute procedure eager.transaction_delete();
For the update of a transaction, we have to account for the possibility that the account the transaction belongs to was changed. We use the
new values of the row to determine which account balances are invalidated and need to be refreshed.
create function eager.transaction_update() returns trigger security definer language plpgsql as $$ begin if old.name!=new.name then perform eager.refresh_account_balance(old.name); end if; perform eager.refresh_account_balance(new.name); return new; end; $$; create trigger eager_transaction_update after update on transactions for each row execute procedure eager.transaction_update();
Finally, with all this set up we need to initialize the
-- Create the balance rows insert into eager.account_balances(name) select name from accounts; -- Refresh the balance rows select eager.refresh_account_balance(name) from accounts;
To query the negative account balances we simply select from the
select * from eager.account_balances where balance < 0;
This is really fast (13ms / 453x faster) just like the materialized view. But it has the advantage of it stays fresh even when transactions change. Unfortunately, this strategy doesn't account for one key requirement -- row invalidation by the passage of time.
Lazy Materialized View
The previous solution was not bad. It was just incomplete. The full solution lazily refreshes the materialized rows when they are stale.
As with the eager materialization strategy, our first step is to create a table to store the materialized rows. The difference is we add an expiration time column.
create table lazy.account_balances_mat( name varchar primary key references accounts on update cascade on delete cascade, balance numeric(9,2) not null default 0, expiration_time timestamptz not null ); create index on lazy.account_balances_mat (balance); create index on lazy.account_balances_mat (expiration_time);
We will create the initial rows for
-Infinity to mark them as dirty.
insert into lazy.account_balances_mat(name, expiration_time) select name, '-Infinity' from accounts;
The same data changes that could invalidate materialized rows in the eager strategy must be handled with the lazy strategy. The difference is that the triggers will only update
expiration_time -- they will not actually recalculate the data.
An account is inserted
As with the eager strategy, on account insertion we need to create a
account_balances_mat record with a zero balance for the new account. But we also need to provide an
expiration_time. The balance for an account with no transactions will be valid forever, so we provide the special PostgreSQL value
Infinity as the
Infinity is defined as greater than any other value.
create function lazy.account_insert() returns trigger security definer language plpgsql as $$ begin insert into lazy.account_balances_mat(name, expiration_time) values(new.name, 'Infinity'); return new; end; $$; create trigger lazy_account_insert after insert on accounts for each row execute procedure lazy.account_insert();
An account is updated or deleted
As before, account update and deletion will be handled by the the foreign key cascades.
A transaction is inserted
For the insert of a transaction, we update the
expiration_time if the
post_time of the transaction is less than the current
expiration_time. This means the update only happens when absolutely necessary. If the account will already be considered stale at the
post_time of the new record we avoid the IO cost of the write.
create function lazy.transaction_insert() returns trigger security definer language plpgsql as $$ begin update lazy.account_balances_mat set expiration_time=new.post_time where name=new.name and new.post_time < expiration_time; return new; end; $$; create trigger lazy_transaction_insert after insert on transactions for each row execute procedure lazy.transaction_insert();
A transaction is updated
Unlike when a transaction is inserted, when a transaction is updated, it is not possible to compute the new account
expiration_time without reading the account's transactions. This makes it cheaper to simply invalidate the account balance. We will simply set
-Infinity, a special value defined as being less than all other values. This ensures that the row will be considered stale.
create function lazy.transaction_update() returns trigger security definer language plpgsql as $$ begin update accounts set expiration_time='-Infinity' where name in(old.name, new.name) and expiration_time<>'-Infinity'; return new; end; $$; create trigger lazy_transaction_update after update on transactions for each row execute procedure lazy.transaction_update();
A transaction is deleted
For transaction deletion, we invalidate the row if the
post_time is less than or equal to the current
expiration_time. But if at is after the current
expiration_time we do not have to do anything.
create function lazy.transaction_delete() returns trigger security definer language plpgsql as $$ begin update lazy.account_balances_mat set expiration_time='-Infinity' where name=old.name and old.post_time <= expiration_time; return old; end; $$; create trigger lazy_transaction_delete after delete on transactions for each row execute procedure lazy.transaction_delete();
The penultimate step is to define a function to refresh a materialized row.
create function lazy.refresh_account_balance(_name varchar) returns lazy.account_balances_mat security definer language sql as $$ with t as ( select coalesce( sum(amount) filter (where post_time <= current_timestamp), 0 ) as balance, coalesce( min(post_time) filter (where current_timestamp < post_time), 'Infinity' ) as expiration_time from transactions where name=_name ) update lazy.account_balances_mat set balance = t.balance, expiration_time = t.expiration_time from t where name=_name returning account_balances_mat.*; $$;
This function uses a common table expression and aggregate filters to find
expiration_time in a single select. Then results are then used to update
Finally, we define the
account_balances view. The top part of the query reads fresh rows from
account_balances_mat. The bottom part reads and refreshes rows that are stale.
create view lazy.account_balances as select name, balance from lazy.account_balances_mat where current_timestamp < expiration_time union all select r.name, r.balance from lazy.account_balances_mat abm cross join lazy.refresh_account_balance(abm.name) r where abm.expiration_time <= current_timestamp;
To retrieve the all accounts with negative balances balances we simply select from the
select * from lazy.account_balances where balance < 0;
The first time the query is run it takes about 5900ms because it is caching the balance for all accounts. Subsequent runs only take about 16ms (368x faster). In general, the query run time should not be nearly so variable because only a small fraction of the rows will be refreshed in any one query.
Comparison of Techniques
PostgreSQL's built-in materialized views offer the best performance improvement for the least work, but only if stale data is acceptable. Eager materialized views offer the absolute best read performance, but can only guarantee freshness if rows do not go stale due to the passage of time. Lazy materialized views offer almost as good read performance as eager materialized views, but they can guarantee freshness under all circumstances.
One additional consideration is read-heavy vs. write-heavy workloads. Most systems are read-heavy. But for a write-heavy load you should give consider leaning toward lazy and away from eager materialized views. The reason is that eager materialized views do the refresh calculation on every write whereas lazy materialized views only pay that cost on read.
PostgreSQL materialization strategies can improve performance by a factor of hundreds or more. In contrast to caching in Memcachd or Redis, PostgreSQL materialization provides ACID guarantees. This eliminates an entire category of consistency issues that must be handled at the application layer. In addition, the infrastructure for a system as a whole is simpler with one less part.
The increased performance and system simplicity is well worth the cost of more advanced SQL.
- Example code for this post
- Chapter 12 of Enterprise Rails describes materialized views
- Materialized view talk from 2008 PGCon
- Jonathan Gardner materialized view notes
(header image via Flickr user t_buchtele)