Top Tweets of 2016 with Elixir and Ecto

by Josh Branchaud on January 10, 2017

As you look ahead to 2017, you may find yourself looking back on 2016. What are the goals that you achieved? What are the books that you read?

If you're an avid Twitter user you may also be wonder what your most popular tweets of the year were. With the help of Elixir and Ecto, we can figure that out.

Roadmap

Determining the most favorited and retweeted tweets of the year is going to require some investigative work. Here is a roadmap of what we will need to do.

Create a database schema for representing and storing tweets in PostgreSQL[^1].
Import Twitter statuses for a user via the Twitter API using something like ExTwitter.
Insert all the imported tweets into the database.
Ask our questions of the Twitter status data using Ecto's schemaless queries.

The focus of this blog post is on Ecto, so we will only walk through the details of steps 1, 3, and 4. For the sake of completeness, the code for importing Twitter statuses with ExTwitter has been posted to github.

Step 1: A Schema For Tweets

A database schema is often times a middle ground. We may have one or more sources of data on the one end. On the other end we have some questions we'd like to be able to ask of the data. We need to create a schema that puts us reasonably between those two ends.

If we look at the documentation for GET statuses/user_timeline we can see an example response. There is way more data there than we are probably going to need. Let's narrow it down.

We will need favorite_count and retweet_count of each tweet to answer our original questions. By including retweet_status we can also differentiate between retweets and tweets authored by the given user. To ensure that a given tweet is from 2016, we'll want created_at. Lastly, to identify each tweet, we will want the id, screen_name of the user, and of course the text of the tweet.

Here is a first look at an Ecto migration that meets our needs:

defmodule TopTweets.Repo.Migrations.CreateTweetsTable do
  use Ecto.Migration

  def change do
    create table(:tweets) do
      add :tweet_id, :bigint, null: false
      add :author, :varchar, null: false
      add :text_content, :varchar, null: false
      add :tweeted_at, :timestamptz, null: false
      add :retweeted_count, :integer, null: false
      add :favorited_count, :integer, null: false
      add :retweeted, :boolean, null: false, default: false
    end
  end
end

There are a couple details of note.

Tweets have very, very large ids. We cannot just use the integer data type for tweet_id. We need something that can handle a much larger range of values, so we use bigint. Anything that is a valid data type in our backing data store can be specified in an Ecto migration as the column's data type.

I likely could have used tweet_id as the primary key for this table. I instead chose to let Ecto give me an integer primary key for free. This is sufficient for the purposes of this blog post.

We expect to always have values for all of our columns when inserting a record. Using null: false on all columns ensures that we never leave gaps in our data when inserting records. If something is missing, our database will refuse to insert the record and complain.

I don't like throwing away timezone information, so I always choose timestamptz over timestamp. I did just that with the tweeted_at column above.

That sums up the interesting parts of the migration so far. We have a big issue though. With this table description, there is nothing to stop us from inserting duplicate tweet records. Duplicate data is detrimental to data integrity. The addition of an index will go a long way in preventing duplicate tweets from getting into our database[^2].

Let's add this index to our migration.

create unique_index(:tweets, [:tweet_id])

Lastly, we are working with a rate-limited API. In the event that we need additional data about a tweet later on, it would be handy to have the entire body of the API response in the database. Let's add a jsonb column.

Here is the final iteration of our migration:

defmodule TopTweets.Repo.Migrations.CreateTweetsTable do
  use Ecto.Migration

  def change do
    create table(:tweets) do
      add :tweet_id, :bigint, null: false
      add :author, :varchar, null: false
      add :text_content, :varchar, null: false
      add :tweeted_at, :timestamptz, null: false
      add :retweeted_count, :integer, null: false
      add :favorited_count, :integer, null: false
      add :retweeted, :boolean, null: false, default: false
      add :data, :jsonb, null: false, default: "{}"
    end

    create unique_index(:tweets, [:tweet_id])
  end
end

To make sure everything is on the up and up, we will want to run the migration. I prefer to use mix ecto.twiki for this.

Step 2: Import Twitter Statuses

The details of importing Twitter statuses are out of the scope of this blog post. You can check out the relevant code on github.

To follow along with the rest of this post, you may want to import some sample data. We are going to use @elixirlang's tweets[^3]. You can get a snapshot of the data either as an elixir data structure or as a SQL dump.

Step 3: Insert All Those Tweets

At this point, we have loaded a bunch of tweet data into memory. We figured out the structure of our tweets table in step 1. Now we just need to insert the data.

A naive approach would be to iterate over the thousands of tweets inserting each one with a separate insert statement. That's not ideal and with Ecto 2.0 we can do better. Ecto 2.0 introduced much stronger support for schemaless queries including the addition of the insert_all function. That's what we are going to use.

The beginning of the documentation for insert_all/2 reads as follows.

Inserts all entries into the repository.

It expects a schema (MyApp.User) or a source ("users") or both ({"users", MyApp.User}) as the first argument. The second argument is a list of entries to be inserted, either as keyword lists or as maps.

We aren't working with a schema, so we use "tweets", the name of our table, as the first argument. The second argument is a list of keyword lists representing the records we want to insert. We need to do some massaging of the tweet data before we can pass it into insert_all.

We'll map over the tweet data to produce the keyword lists. Most of the pieces of data will map directly, but the tweeted_at timestamp needs help and the retweeted_status needs to be converted to a boolean for the retweeted column. The harder part is getting the timestamp right, so let's start there.

The Twitter API gives timestamps in a non-standard format. We need to convert it into an Ecto.DateTime struct. Some pattern matching and Timex will help.

def cast_timestamp(timestamp) do
  [dow, mon, day, time, zone, year] = String.split(timestamp, " ")
  rfc1123 = Enum.join(["#{dow},", day, mon, year, time, zone], " ")

  rfc1123
  |> Timex.parse!("{RFC1123}")
  |> Ecto.DateTime.cast!()
end

I cover the details of this conversion in this blog post.

Translating the retweeted_status data into a boolean is a bit more straightforward.

tweet_data.retweeted_status != nil

The process of translating one instance of tweet data into a keyword list aligning with our tweets table description can be encapsulated in the following function.

def bundle_tweet(tweet_data) do
  [
    tweet_id: tweet_data.id,
    author: tweet_data.user.screen_name,
    text_content: tweet_data.text,
    tweeted_at: cast_timestamp(tweet_data.created_at),
    retweeted_count: tweet_data.retweet_count,
    favorited_count: tweet_data.favorite_count,
    retweeted: tweet_data.retweeted_status != nil,
    data: tweet_data
  ]
end

The list of keys in this keyword list are the names of the columns we defined in step 1. For each key we pull out the associated piece of data from the tweet_data map. As we discussed earlier, the entire tweet_data map can be inserted directly into our jsonb field so that we have a backup of the data for each tweet.

With all of this in place, we can finally batch insert all our tweets.

def insert_tweets(tweets) do
  bundled_tweets = Enum.map(tweets, &bundle_tweet/1)
  TopTweets.Repo.insert_all("tweets", bundled_tweets)
end

We have put in a lot of work so far; it's time for some answers.

Step 4: Ask Questions With Queries

We started off this blog post with two questions in mind.

What are our most favorited tweets of 2016?
What are our most retweeted tweets of 2016?

It will help to be a bit more specific. Let's rework these questions a little.

What are the 10 most favorited tweets authored by @elixirlang in 2016?
What are the 10 most retweeted tweets authored by @elixirlang in 2016?

To answer these questions, we'll need to construct queries with Ecto's query API. We will also need a TopTweets.Repo module which is part of the standard setup when getting started with Ecto. Let's open up IEx and iterate on a solution.

First,

$ iex -S mix

and now let's start simple. Can we just grab all of our tweets?

If we were working with a table that has a corresponding Ecto schema, we could do something like:

TopTweets.Repo.all(TopTweets.Tweet)

We aren't working with Ecto schemas though. We are sticking to schemaless queries here. Since we don't have a module to reference, we could try referencing the name of our table.

TopTweets.Repo.all("tweets")

That doesn't work, but it is promising. If we take a close look at the error, we'll see what the issue is.

** (Ecto.QueryError) PostgreSQL requires a schema module when using selector
"t0" but none was given. Please specify a schema or specify exactly which
fields from "t0" you desire in query:

from t in "tweets",
  select: t

When provided with a schema, Ecto knows exactly which fields to request from the database. Without one, we have to specify the fields ourself. We can do this using the query API. The Ecto.Query.from\2 function is a good place to start.

> import Ecto.Query
Ecto.Query
> from(t in "tweets")
#Ecto.Query<from t in "tweets">

This constructs the from portion of our SQL query. The second argument to from\2 is a keyword list that defaults to []. We can further fill out our query by specifying what the select clause should look like:

> from(t in "tweets", select: t.tweet_id)
#Ecto.Query<from t in "tweets", select: t.tweet_id>

Piping what we have so far to TopTweets.Repo.all will give us a result with all of our tweets -- or at least all the tweet_ids for our tweets[^4].

from(t in "tweets",
select: t.tweet_id
) |> TopTweets.Repo.all

15:34:52.816 [debug] QUERY OK source="tweets" db=16.4ms decode=2.7ms
SELECT t0."tweet_id" FROM "tweets" AS t0 []
[815582938918223872, 815549333236027392, 815549228818792448, 814143389759340544,
 813698248888217600, 813484243498270720, 813172504730750976, 812589506394279936,
 811867334860218368, 811318252148232193, 811234616178831360, 811218704944001025,
 ...]

This is great output. Not only does it give us a list of tweet_ids, but it provides timing information and the SQL query that was run against our database. This also gives us a query to iterate on.

A natural next step is to limit the results to 10 items. The limit clause is what we need.

from(t in "tweets",
select: t.tweet_id,
limit: 10
) |> TopTweets.Repo.all

17:29:15.103 [debug] QUERY OK source="tweets" db=9.4ms
SELECT t0."tweet_id" FROM "tweets" AS t0 LIMIT 10 []
[815582938918223872, 815549333236027392, 815549228818792448, 814143389759340544,
 813698248888217600, 813484243498270720, 813172504730750976, 812589506394279936,
 811867334860218368, 811318252148232193]

That has trimmed back the results a bit.

Adding a where clause is another way to trim down and shape our result set. We'll want to start with a where clause that ensures only tweets from @elixirlang are in our result set. This is especially important if we have imported tweets from other users already.

from(t in "tweets",
select: t.tweet_id,
limit: 10,
where: [author: "elixirlang"]
) |> TopTweets.Repo.all

17:46:10.144 [debug] QUERY OK source="tweets" db=12.7ms
SELECT t0."tweet_id" FROM "tweets" AS t0 WHERE (t0."author" = 'elixirlang') LIMIT 10 []
[815582938918223872, 815549333236027392, 815549228818792448, 814143389759340544,
 813698248888217600, 813484243498270720, 813172504730750976, 812589506394279936,
 811867334860218368, 811318252148232193]

The resulting query now includes the first part of our where clause. We need to do more than limit the result set by author. We also want to ensure the tweets are only from the year of 2016. Let's add constraints on the tweeted_at column to the where clause.

from(t in "tweets",
select: t.tweet_id,
limit: 10,
where: [author: "elixirlang"],
where: t.tweeted_at > type(^"2016-01-01T00:00:00Z", Ecto.DateTime)
   and t.tweeted_at < type(^"2017-01-01T00:00:00Z", Ecto.DateTime)
) |> TopTweets.Repo.all

09:53:36.972 [debug] QUERY OK source="tweets" db=12.7ms
SELECT t0."tweet_id" FROM "tweets" AS t0 WHERE (t0."author" = 'elixirlang') AND ((t0."tweeted_at" > $1::timestamp) AND (t0."tweeted_at" < $2::timestamp)) LIMIT 10 [{{2016, 1, 1}, {0, 0, 0, 0}}, {{2017, 1, 1}, {0, 0, 0, 0}}]
[814143389759340544, 813698248888217600, 813484243498270720, 813172504730750976,
 812589506394279936, 811867334860218368, 811318252148232193, 811234616178831360,
 811218704944001025, 810909398143029248]

Constructing SQL queries can sometimes require you to bend your mind a bit. It can be particularly challenging to construct an entire query in one take. By taking this kind of iterative approach, we have a working query every step of the way that gets us closer and closer to the solution we want. But as we go, we only have to think about any one part of it at a time.

Our query looks to be almost done, but we don't have much context about our result set yet. Let's update the select clause of our query to give us more information.

from(t in "tweets",
select: {t.tweet_id, t.favorited_count, t.text_content},
limit: 10,
where: [author: "elixirlang"],
where: t.tweeted_at > type(^"2016-01-01T00:00:00Z", Ecto.DateTime)
   and t.tweeted_at < type(^"2017-01-01T00:00:00Z", Ecto.DateTime)
) |> TopTweets.Repo.all

10:17:18.106 [debug] QUERY OK source="tweets" db=5.5ms
SELECT t0."tweet_id", t0."favorited_count", t0."text_content" FROM "tweets" AS t0 WHERE (t0."author" = 'elixirlang') AND ((t0."tweeted_at" > $1::timestamp) AND (t0."tweeted_at" < $2::timestamp)) LIMIT 10 [{{2016, 1, 1}, {0, 0, 0, 0}}, {{2017, 1, 1}, {0, 0, 0, 0}}]
[{814143389759340544, 0,
  "RT @ElixirConfEU: Looking forward to ElixirConf.EU 2017 - Very Early Bird starts 9 Jan #myelixirstatus https://t.co/3NjJhTpsjN"},
 {813698248888217600, 0,
  "RT @whatyouhide: Finally had time to send in a CFP for @ElixirConfEU in Barcelona, fingers crossed! #myelixirstatus"},
 {813484243498270720, 0,
  "RT @sasajuric: Great opportunity to grab yourself some Elixir in Action :-) https://t.co/GSKRnM76Qr"},
 ...]

Awesome. We can now see the favorite_count and content of each tweet in our result set. These don't appear to be our most popular tweets though. None of them were favorited. We need to add an order_by clause:

from(t in "tweets",
select: {t.tweet_id, t.favorited_count, t.text_content},
limit: 10,
where: [author: "elixirlang"],
where: t.tweeted_at > type(^"2016-01-01T00:00:00Z", Ecto.DateTime)
   and t.tweeted_at < type(^"2017-01-01T00:00:00Z", Ecto.DateTime),
order_by: [desc: :favorited_count]
) |> TopTweets.Repo.all

10:37:49.487 [debug] QUERY OK source="tweets" db=47.0ms queue=0.1ms
SELECT t0."tweet_id", t0."favorited_count", t0."text_content" FROM "tweets" AS t0 WHERE (t0."author" = 'elixirlang') AND ((t0."tweeted_at" > $1::timestamp) AND (t0."tweeted_at" < $2::timestamp)) ORDER BY t0."favorited_count" DESC LIMIT 10 [{{2016, 1, 1}, {0, 0, 0, 0}}, {{2017, 1, 1}, {0, 0, 0, 0}}]
[{682890953212956672, 226,
  "Elixir v1.2.0 is out - https://t.co/BuULFDtrsm! New guides and official blog post coming out soon. Happy New Year!"},
 {745240093275873281, 167,
  "Elixir v1.3.0 has been released and welcome @whatyouhide into Elixir's team: https://t.co/aw0aopnXxn"},
 {753171413821976577, 163,
  "Plug v1.2.0-rc.0 is out! Featuring safer encryption algorithms, better MIME support and a brand new debugging page! https://t.co/XuepyvTSQA"},
 ...]

That's more like it. By ordering the tweets by favorited_count in descending order, we pull the most favorited tweets to the top.

Lastly, our questions were specific when it came to pointing out that the tweets we are looking for ought to have been authored by @elixirlang. In other words, we don't want to include any popular retweeted tweets. We can make an addition to our original where clause to accomplish this.

from(t in "tweets",
select: {t.tweet_id, t.favorited_count, t.text_content},
limit: 10,
where: [author: "elixirlang", retweeted: false],
where: t.tweeted_at > type(^"2016-01-01T00:00:00Z", Ecto.DateTime)
   and t.tweeted_at < type(^"2017-01-01T00:00:00Z", Ecto.DateTime),
order_by: [desc: :favorited_count]
) |> TopTweets.Repo.all

21:37:35.068 [debug] QUERY OK source="tweets" db=24.7ms
SELECT t0."tweet_id", t0."favorited_count", t0."text_content" FROM "tweets" AS t0 WHERE ((t0."author" = 'elixirlang') AND (t0."retweeted" = FALSE)) AND ((t0."tweeted_at" > $1::timestamp) AND (t0."tweeted_at" < $2::timestamp)) ORDER BY t0."favorited_count" DESC LIMIT 10 [{{2016, 1, 1}, {0, 0, 0, 0}}, {{2017, 1, 1}, {0, 0, 0, 0}}]
[{682890953212956672, 226,
  "Elixir v1.2.0 is out - https://t.co/BuULFDtrsm! New guides and official blog post coming out soon. Happy New Year!"},
 {745240093275873281, 167,
  "Elixir v1.3.0 has been released and welcome @whatyouhide into Elixir's team: https://t.co/aw0aopnXxn"},
 {753171413821976577, 163,
  "Plug v1.2.0-rc.0 is out! Featuring safer encryption algorithms, better MIME support and a brand new debugging page! https://t.co/XuepyvTSQA"},
 ...]

That's it. This query gives us the 10 most favorited tweets authored by @elixirlang in 2016. Two small changes -- to the select clause and the order_by clause -- are all we need to get the most retweeted tweets.

from(t in "tweets",
select: {t.tweet_id, t.retweeted_count, t.text_content},
limit: 10,
where: [author: "elixirlang", retweeted: false],
where: t.tweeted_at > type(^"2016-01-01T00:00:00Z", Ecto.DateTime)
   and t.tweeted_at < type(^"2017-01-01T00:00:00Z", Ecto.DateTime),
order_by: [desc: :retweeted_count]
) |> TopTweets.Repo.all

12:53:29.310 [debug] QUERY OK source="tweets" db=20.6ms
SELECT t0."tweet_id", t0."retweeted_count", t0."text_content" FROM "tweets" AS t0 WHERE ((t0."author" = 'elixirlang') AND (t0."retweeted" = FALSE)) AND ((t0."tweeted_at" > $1::timestamp) AND (t0."tweeted_at" < $2::timestamp)) ORDER BY t0."retweeted_count" DESC LIMIT 10 [{{2016, 1, 1}, {0, 0, 0, 0}}, {{2017, 1, 1}, {0, 0, 0, 0}}]
[{682890953212956672, 278,
  "Elixir v1.2.0 is out - https://t.co/BuULFDtrsm! New guides and official blog post coming out soon. Happy New Year!"},
 {745240093275873281, 199,
  "Elixir v1.3.0 has been released and welcome @whatyouhide into Elixir's team: https://t.co/aw0aopnXxn"},
 {735157122430816256, 124,
  "Elixir v1.3 introduces calendar types (Time, Date, NaiveDateTime and DateTime) as well as sigils for building them, e.g: ~D[2016-05-24]"},
 ...]

With our queries completed, let's take a look at the two result sets.

@ElixirLang's Most Favorited Tweets of 2016

- 226 - "Elixir v1.2.0 is out - https://t.co/BuULFDtrsm! New guides and official blog post coming out soon. Happy New Year!"
- 167 - "Elixir v1.3.0 has been released and welcome @whatyouhide into Elixir's team: https://t.co/aw0aopnXxn"
- 163 - "Plug v1.2.0-rc.0 is out! Featuring safer encryption algorithms, better MIME support and a brand new debugging page! https://t.co/XuepyvTSQA"
- 155 - "Elixir v1.3 introduces calendar types (Time, Date, NaiveDateTime and DateTime) as well as sigils for building them, e.g: ~D[2016-05-24]"
- 123 - "Announcing GenStage: https://t.co/bNOLdQaB2v"
- 118 - "Ecto 2.1.0-rc.4 is out with https://t.co/PLUFjpi6Rb that streams query results without loading the whole result set into memory."
- 106 - "Elixir v1.3 will perform cross module checks to find undefined modules and functions! (tks to @antipax) https://t.co/DaMYMNu8cL"
- 104 - "Elixir v1.3 will have new accessors for nested data structures. Here is how to upcase all languages names in a map: https://t.co/FNTTqjpDeP"
- 104 - "José Valim's keynote on GenStage and Flow at @elixirconf is up: https://t.co/OOqeYKuHrt!"
- 103 - "mix test --stale is now on Elixir master thanks to @antipax! It uses static analysis to run tests only for the source files changed on disk!"

@ElixirLang's Most Retweeted Tweets of 2016

- 278 - "Elixir v1.2.0 is out - https://t.co/BuULFDtrsm! New guides and official blog post coming out soon. Happy New Year!"
- 199 - "Elixir v1.3.0 has been released and welcome @whatyouhide into Elixir's team: https://t.co/aw0aopnXxn"
- 124 - "Elixir v1.3 introduces calendar types (Time, Date, NaiveDateTime and DateTime) as well as sigils for building them, e.g: ~D[2016-05-24]"
- 110 - "Announcing GenStage: https://t.co/bNOLdQaB2v"
- 106 - "Plug v1.2.0-rc.0 is out! Featuring safer encryption algorithms, better MIME support and a brand new debugging page! https://t.co/XuepyvTSQA"
- 87 - "Elixir v1.3.4 has been released with bug fixes for those using Dialyzer and Cover on Erlang 19. Release notes: https://t.co/cyntmGTbxG"
- 74 - "mix test --stale is now on Elixir master thanks to @antipax! It uses static analysis to run tests only for the source files changed on disk!"
- 73 - "Elixir v1.2.6 has been released with three bug fixes and support for Erlang 19: https://t.co/2NcBVJzRG3"
- 72 - "Can't wait for Elixir v1.3? We have just tidied up the CHANGELOG with the upcoming release highlights: https://t.co/aFZbLwedWn"
- 71 - "Official announcement for Elixir v1.2 is up as well as updated guides and docs: https://t.co/ST6lSXIVJz

There is some overlap between the two sets of tweets, but if you look closely, you'll see that there are some tweets that show up in one list but not the other and vice versa. Nevertheless, it is clear that the big releases of Elixir this year -- 1.2.0 and 1.3.0 -- stand out as the most popular tweets. And from the looks of the 1.4.0 release tweet, that is easily going to be one of the most popular of 2017.

Conclusion

If you've made it this far, then you've learned a lot about what Ecto can do. We've seen how to create a migration that exercises a number of key features of Ecto.Migration, albeit we didn't even touch on associations. Then we got to use insert_all, an exciting new feature hot off the Ecto 2.0 press. Lastly, we took an iterative approach to building a query with Ecto's query API.

Hopefully the results of all this work highlighted some interesting @elixirlang tweets and blog posts that you haven't seen yet. If you haven't done so yet, give this a try on your own Twitter account. What were your most popular tweets of 2016?

[^1]: PostgreSQL is my preferred relational database. Though most of the concepts are applicable to any relation database, there are some PostgreSQL-specific features that I use and the example assumes you are using Postgres 9.5+.

[^2]: Sure, using tweet_id as the primary key would have warded this issue off from the outset, but then we wouldn't have had a chance to play with the unique_index function.

[^3]: Twitter's API not only enforces standard rate-limiting, but it also limits access to tweets from the statuses/user_timeline endpoint to the latest 3200 tweets. Because we are using @elixirlang's tweets, this won't be an issue. If you are following along with an account that tweets much more frequently, then this may affect your results.

[^4]: If you are following along with the examples, depending on how your data was inserted, you may be seeing different sets of tweet_ids in your results. This will become deterministic once we add an order by clause.

Elixir