
Lessons Learned Using dblink in Production

Mary Lee

On a recent project at Hashrocket, we had to use dblink in production as part of our setup for communicating with a legacy database. While our overall experience was positive, we had a couple of hiccups along the way. Let's talk about the lessons we learned!

dblink is a Postgres extension that allows you to connect to and execute queries on a remote database. It provides a number of commands depending on your use case (from dblink_connect to dblink_exec), but our particular use case was read-only connections querying table contents, which only required the dblink command itself.
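
dblink ships as a contrib module, so it has to be enabled in the database before any of its commands are available:

-- Enable the extension once per database.
create extension if not exists dblink;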

The dblink command requires a connection string with the details needed to connect to the remote server, and a SQL statement to execute once the connection is made.

The connection string for a dblink command must be in keyword/value format, with whitespace between the keyword/value pairs. An example connection string could look like: host=localhost port=5432 dbname=hr_local user=hashrocket password=1234.

The SQL statement can be any row-returning statement (inserts, updates, and deletes are instead done through the dblink_exec command).
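
For reference, a write against the remote database would go through dblink_exec, which returns a command status instead of rows (a hypothetical example; our integration was read-only):

-- Hypothetical write; dblink_exec returns the command status, not rows.
select dblink_exec(
  'host=localhost port=5432 dbname=hr_local user=hashrocket password=1234',
  'update users set email = lower(email)'
);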

The command returns the rows read from the remote database, but because dblink returns a generic set of records, you must also specify an alias as part of the from clause to declare the expected column names and types. This is what lets you select * against the returned data; the SQL statement provided to the dblink command should explicitly list the columns being queried, so the returned rows line up with the alias.

So a basic query with dblink would look something like this:

select *
from dblink(
  'host=localhost port=5432 dbname=hr_local user=hashrocket password=1234 connect_timeout=5',
  'select first_name, last_name, email from users'
)
as t(first_name text, last_name text, email text);

Aliasing remote column names

One of the earliest issues we encountered with our integration with the remote database was related to column names. The legacy database we were dealing with had different column naming restrictions than Postgres, and one of the tables we were querying had a column named desc, which is a reserved keyword in Postgres. This led to errors in the dblink command, so we had to figure out how to alias the column as it was coming into Postgres.
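
To illustrate, a query like this fails, because desc can't be used as a bare identifier in the from clause alias (an approximation of what we were originally running):

select *
from dblink(
  'host=localhost port=5432 dbname=hr_local user=hashrocket password=1234 connect_timeout=5',
  'select title, desc from posts'
)
as t(title text, desc text);
-- ERROR:  syntax error at or near "desc"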

This turned out to have a fairly easy solution: we aliased the column in the dblink SQL statement, and used the aliased column name in the from clause's alias:

select *
from dblink(
  'host=localhost port=5432 dbname=hr_local user=hashrocket password=1234 connect_timeout=5',
  'select title, desc as description from posts'
)
as t(title text, description text);

Naming the connection

Performance quickly became an issue for us once our dblink code hit production. Upon launch, we were hit with a large number of statement timeouts.

Our first-pass attempt at hotfixing the issue was to increase the connection timeout. We initially had a timeout of 5 seconds, but we upped it to 10 seconds to see if that would help with the timeouts we were seeing. Unsurprisingly, it did not; there was a deeper problem at play. We dug further into the documentation and found some interesting information.
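
For reference, connect_timeout is a standard libpq connection parameter measured in seconds, so the hotfix was just a tweak to the connection string:

host=localhost port=5432 dbname=hr_local user=hashrocket password=1234 connect_timeout=10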

We knew that by passing the connection string as the first argument to the dblink command, we were using an "unnamed" connection. This didn't appear to be an issue during testing or on staging; it was only under load that we saw performance issues. It turns out that only one unnamed connection is permitted at a time when using dblink. When you try to open a second unnamed connection, it has to wait for the previous one to close before replacing it. Suddenly our timeouts were starting to make sense. Our queries weren't timing out on establishing the connection to the remote database; they were queuing up and timing out waiting for the earlier unnamed connections to finish.

With named connections, you can have as many connections open as you want (provided the names are unique). In order to get things running smoothly, we refactored our code a bit to use the dblink_connect and dblink_disconnect commands, generating unique connection names each time we reached out to the remote database.

select dblink_connect(
  'unique-connection-name',
  'host=localhost port=5432 dbname=hr_local user=hashrocket password=1234 connect_timeout=5'
);

select *
from dblink(
  'unique-connection-name',
  'select first_name, last_name, email from users'
)
as t(first_name text, last_name text, email text);

select dblink_disconnect('unique-connection-name');
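
The snippet above uses a fixed name for readability; in production we generated a unique name per query. Here's a minimal sketch of that pattern, assuming Postgres 13+ for gen_random_uuid() (the function and query are illustrative, not our actual code):

-- Illustrative wrapper: connect under a generated name, query, disconnect.
create or replace function fetch_remote_users()
returns table(first_name text, last_name text, email text) as $$
declare
  conn_name text := 'users-' || gen_random_uuid()::text;
begin
  perform dblink_connect(
    conn_name,
    'host=localhost port=5432 dbname=hr_local user=hashrocket password=1234 connect_timeout=5'
  );
  return query
    select *
    from dblink(conn_name, 'select first_name, last_name, email from users')
    as t(first_name text, last_name text, email text);
  perform dblink_disconnect(conn_name);
exception when others then
  -- Don't leak the connection if the query fails.
  begin
    perform dblink_disconnect(conn_name);
  exception when others then
    null; -- the connection was never established
  end;
  raise;
end;
$$ language plpgsql;

Calling select * from fetch_remote_users(); then behaves like any local query, with each call cleaning up its own uniquely named connection.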

This solved the issue of our unnamed connections timing out, but it led to enough simultaneous named connections to overwhelm the remote database. We were able to increase the connection limit on that remote database, and suddenly things were smooth sailing!
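
For reference, on a Postgres server that limit is typically max_connections or a per-role connection limit; the change would look something like this (illustrative values):

-- Server-wide cap; raising it requires a restart.
alter system set max_connections = 200;

-- Or a per-role cap, which takes effect immediately.
alter role hashrocket connection limit 100;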

Takeaways

At Hashrocket, we always stress the importance of reading the manual whenever we're looking into a new library. In this case, one line in the Postgres documentation for dblink_connect made all the difference in our integration. Once we had things set up properly, using dblink in production was a painless experience for us. Perhaps an easier solution would have been to use foreign data wrappers, but for our use case, dblink did what we needed and brought our client's data all the way back from the 1980s to the modern web.
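
For comparison, a minimal foreign data wrapper setup with postgres_fdw might have looked something like this (a sketch using the same illustrative credentials, not something we deployed):

-- postgres_fdw exposes remote tables as local foreign tables
-- and manages the connections for you.
create extension if not exists postgres_fdw;

create server legacy_db foreign data wrapper postgres_fdw
  options (host 'localhost', port '5432', dbname 'hr_local');

create user mapping for current_user server legacy_db
  options (user 'hashrocket', password '1234');

create schema legacy;
import foreign schema public limit to (users, posts)
  from server legacy_db into legacy;

-- Remote tables then read like local ones:
-- select first_name, last_name, email from legacy.users;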
