Bulk Imports with Datomic
I've been really happy with Datomic, but doing an initial bulk import wasn't as familiar a process as a SQL dump/restore. Here are some things I've learned from doing several imports.
Use core.async
The Datomic transactor handles concurrency by transacting datoms serially, but that doesn't mean it isn't fast! In my experience, the bottleneck is actually in the reshaping of data and the formatting of transactions. I use core.async to parallelize just about everything in the import pipeline.
One example of how I've leveraged core.async for import jobs can be found in my Kevin Bacon project repository.
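As a minimal sketch of that shape (not the actual project code), here's what such a pipeline might look like. The row->tx-data function, the import! name, and the :person/name attribute are all hypothetical, standing in for whatever reshaping your data needs:

(require '[clojure.core.async :as a]
         '[datomic.api :as d])

;; Hypothetical: reshape one raw input row into Datomic tx-data.
(defn row->tx-data [row]
  [{:db/id (d/tempid :db.part/user)
    :person/name (:name row)}])

(defn import! [conn rows]
  (let [in  (a/to-chan rows)
        out (a/chan 100)]
    ;; Do the expensive reshaping work on several threads at once...
    (a/pipeline (.availableProcessors (Runtime/getRuntime))
                out
                (map row->tx-data)
                in)
    ;; ...while the transactor serializes the actual writes.
    (loop []
      (when-let [tx-data (a/<!! out)]
        @(d/transact conn tx-data)
        (recur)))))

In a real import I'd also batch the rows (e.g. with partition-all) so that each transaction carries more than a single entity.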
Run the import locally
I use DynamoDB as my storage backend in production. I used to run my import tasks directly against the production transactor and storage. Lately, though, I've found it really helpful to run them against a locally-running transactor and the dev storage backend.
Running an import locally means I don't have to worry about networking, which speeds the whole process up quite a bit; it also gives me much more freedom to iterate on the database design itself. (I rarely get an import correct the first time.) And in the case of DynamoDB, I save some money, since I don't need my "write throughput" cranked way up for as long.
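For reference, switching between the two is just a matter of the connection URI; something like the following, where the host, table, and database names are all illustrative:

;; Local dev storage, served by a locally-running transactor:
(def local-uri "datomic:dev://localhost:4334/my-import-db")

;; Production DynamoDB storage, used only once the import looks right:
(def prod-uri "datomic:ddb://us-east-1/my-datomic-table/my-import-db")

(def conn (d/connect local-uri))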
Clean up the local database
Bulk imports create some garbage, so it pays to manually reindex and garbage-collect storage before backing up. Here's what a REPL session looks like:
(require '[datomic.api :as d])
(def conn (d/connect "datomic:dev://localhost:4334/database-name"))
;; Kick off an indexing job...
(d/request-index conn)
;; ...and block until indexing has caught up to the current basis-t
(->> conn d/db d/basis-t (d/sync-index conn) deref)
;; Reclaim storage garbage, up to the present moment
(d/gc-storage conn (java.util.Date.))
For more information on why this cleanup is important, see the relevant Datomic documentation.
Use backup/restore
Once everything looks good in the local database, I use Datomic's built-in backup/restore facilities to send the database up to production. Assuming you've already deployed a production transactor and provisioned DynamoDB storage, here's the process I follow:
- Run the datomic backup-db command against the local import (see the command sketch after this list).
- Crank my "write throughput" on DynamoDB way up (on the order of 1000).
- Run the datomic restore-db command from the backup folder to the remote database.
- Turn the "write throughput" back down to whatever value I plan to use for ongoing use (see the Datomic documentation for more information).
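Concretely, those two commands look something like this; the URIs and backup directory are illustrative:

# Back up the locally-imported database to a directory on disk
bin/datomic backup-db datomic:dev://localhost:4334/database-name file:/tmp/database-name-backup

# Restore that backup into the production, DynamoDB-backed database
bin/datomic restore-db file:/tmp/database-name-backup datomic:ddb://us-east-1/my-datomic-table/database-name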
The heart of almost every business is its data. Datomic is a great choice for business data, in part because it treats all data as important: nothing is overwritten. New things are learned, but the old facts are not replaced. And knowing how to get your data into Datomic is half the battle.
Go forth and import!