Elixir
Top Tweets of 2016 with Elixir and Ecto
As you look ahead to 2017, you may find yourself looking back on 2016. What are the goals that you achieved? What are the books that you read?
If you're an avid Twitter user you may also be wonder what your most popular tweets of the year were. With the help of Elixir and Ecto, we can figure that out.
Roadmap
Determining the most favorited and retweeted tweets of the year is going to require some investigative work. Here is a roadmap of what we will need to do.
Create a database schema for representing and storing tweets in PostgreSQL[1].
Import Twitter statuses for a user via the Twitter API using something like ExTwitter.
Insert all the imported tweets into the database.
Ask our questions of the Twitter status data using Ecto's schemaless queries.
The focus of this blog post is on Ecto, so we will only walk through the details of steps 1, 3, and 4. For the sake of completeness, the code for importing Twitter statuses with ExTwitter has been posted to github.
Step 1: A Schema For Tweets
A database schema is often times a middle ground. We may have one or more sources of data on the one end. On the other end we have some questions we'd like to be able to ask of the data. We need to create a schema that puts us reasonably between those two ends.
If we look at the documentation for GET
statuses/user_timeline
we can see an example response. There is way more data there than we are
probably going to need. Let's narrow it down.
We will need favorite_count
and retweet_count
of each tweet to answer
our original questions. By including retweet_status
we can also
differentiate between retweets and tweets authored by the given user. To
ensure that a given tweet is from 2016, we'll want created_at
. Lastly, to
identify each tweet, we will want the id
, screen_name
of the user
, and
of course the text
of the tweet.
Here is a first look at an Ecto migration that meets our needs:
defmodule TopTweets.Repo.Migrations.CreateTweetsTable do
use Ecto.Migration
def change do
create table(:tweets) do
add :tweet_id, :bigint, null: false
add :author, :varchar, null: false
add :text_content, :varchar, null: false
add :tweeted_at, :timestamptz, null: false
add :retweeted_count, :integer, null: false
add :favorited_count, :integer, null: false
add :retweeted, :boolean, null: false, default: false
end
end
end
There are a couple details of note.
Tweets have very, very large id
s. We cannot just use the integer
data
type for tweet_id
. We need something that can handle a much larger range
of values, so we use
bigint
.
Anything that is a valid data type in our backing data store can be
specified in an Ecto migration as the column's data type.
I likely could have used tweet_id
as the primary key for this table. I
instead chose to let Ecto give me an integer
primary key for free. This is
sufficient for the purposes of this blog post.
We expect to always have values for all of our columns when inserting a
record. Using null: false
on all columns ensures that we never leave gaps
in our data when inserting records. If something is missing, our database
will refuse to insert the record and complain.
I don't like throwing away timezone information, so I always choose
timestamptz
over timestamp
. I did just that with the tweeted_at
column above.
That sums up the interesting parts of the migration so far. We have a big issue though. With this table description, there is nothing to stop us from inserting duplicate tweet records. Duplicate data is detrimental to data integrity. The addition of an index will go a long way in preventing duplicate tweets from getting into our database[2].
Let's add this index to our migration.
create unique_index(:tweets, [:tweet_id])
Lastly, we are working with a rate-limited API. In the event that we need
additional data about a tweet later on, it would be handy to have the entire
body of the API response in the database. Let's add a
jsonb
column.
Here is the final iteration of our migration:
defmodule TopTweets.Repo.Migrations.CreateTweetsTable do
use Ecto.Migration
def change do
create table(:tweets) do
add :tweet_id, :bigint, null: false
add :author, :varchar, null: false
add :text_content, :varchar, null: false
add :tweeted_at, :timestamptz, null: false
add :retweeted_count, :integer, null: false
add :favorited_count, :integer, null: false
add :retweeted, :boolean, null: false, default: false
add :data, :jsonb, null: false, default: "{}"
end
create unique_index(:tweets, [:tweet_id])
end
end
To make sure everything is on the up and up, we will want to run the
migration. I prefer to use mix
ecto.twiki
for this.
Step 2: Import Twitter Statuses
The details of importing Twitter statuses are out of the scope of this blog post. You can check out the relevant code on github.
To follow along with the rest of this post, you may want to import some sample data. We are going to use @elixirlang's tweets[3]. You can get a snapshot of the data either as an elixir data structure or as a SQL dump.
Step 3: Insert All Those Tweets
At this point, we have loaded a bunch of tweet data into memory. We figured
out the structure of our tweets
table in step 1. Now we just need to
insert the data.
A naive approach would be to iterate over the thousands of tweets inserting
each one with a separate insert
statement. That's not ideal and with Ecto
2.0
we can do better. Ecto 2.0 introduced much stronger support for schemaless
queries including the addition of the insert_all
function.
That's what we are going to use.
The beginning of the documentation for insert_all/2
reads as follows.
Inserts all entries into the repository.
It expects a schema (MyApp.User) or a source ("users") or both ({"users", MyApp.User}) as the first argument. The second argument is a list of entries to be inserted, either as keyword lists or as maps.
We aren't working with a schema, so we use "tweets"
, the name of our
table, as the first argument. The second argument is a list of keyword lists
representing the records we want to insert. We need to do some massaging of
the tweet data before we can pass it into insert_all
.
We'll map over the tweet data to produce the keyword lists. Most of the
pieces of data will map directly, but the tweeted_at
timestamp needs help
and the retweeted_status
needs to be converted to a boolean for the
retweeted
column. The harder part is getting the timestamp right, so let's
start there.
The Twitter API gives timestamps in a non-standard format. We need to
convert it into an Ecto.DateTime
struct. Some pattern matching and
Timex will help.
def cast_timestamp(timestamp) do
[dow, mon, day, time, zone, year] = String.split(timestamp, " ")
rfc1123 = Enum.join(["#{dow},", day, mon, year, time, zone], " ")
rfc1123
|> Timex.parse!("{RFC1123}")
|> Ecto.DateTime.cast!()
end
I cover the details of this conversion in this blog post.
Translating the retweeted_status
data into a boolean is a bit more
straightforward.
tweet_data.retweeted_status != nil
The process of translating one instance of tweet data into a keyword list
aligning with our tweets
table description can be encapsulated in the
following function.
def bundle_tweet(tweet_data) do
[
tweet_id: tweet_data.id,
author: tweet_data.user.screen_name,
text_content: tweet_data.text,
tweeted_at: cast_timestamp(tweet_data.created_at),
retweeted_count: tweet_data.retweet_count,
favorited_count: tweet_data.favorite_count,
retweeted: tweet_data.retweeted_status != nil,
data: tweet_data
]
end
The list of keys in this keyword list are the names of the columns we
defined in step 1. For each key we pull out the associated piece of data
from the tweet_data
map. As we discussed earlier, the entire tweet_data
map can be inserted directly into our jsonb
field so that we have a
backup of the data for each tweet.
With all of this in place, we can finally batch insert all our tweets.
def insert_tweets(tweets) do
bundled_tweets = Enum.map(tweets, &bundle_tweet/1)
TopTweets.Repo.insert_all("tweets", bundled_tweets)
end
We have put in a lot of work so far; it's time for some answers.
Step 4: Ask Questions With Queries
We started off this blog post with two questions in mind.
What are our most favorited tweets of 2016?
What are our most retweeted tweets of 2016?
It will help to be a bit more specific. Let's rework these questions a little.
What are the 10 most favorited tweets authored by @elixirlang in 2016?
What are the 10 most retweeted tweets authored by @elixirlang in 2016?
To answer these questions, we'll need to construct queries with Ecto's query
API. We will also need a
TopTweets.Repo
module which is part of the standard setup when getting started with Ecto.
Let's open up IEx
and iterate on a solution.
First,
$ iex -S mix
and now let's start simple. Can we just grab all of our tweets?
If we were working with a table that has a corresponding Ecto schema, we could do something like:
TopTweets.Repo.all(TopTweets.Tweet)
We aren't working with Ecto schemas though. We are sticking to schemaless queries here. Since we don't have a module to reference, we could try referencing the name of our table.
TopTweets.Repo.all("tweets")
That doesn't work, but it is promising. If we take a close look at the error, we'll see what the issue is.
** (Ecto.QueryError) PostgreSQL requires a schema module when using selector
"t0" but none was given. Please specify a schema or specify exactly which
fields from "t0" you desire in query:
from t in "tweets",
select: t
When provided with a schema, Ecto knows exactly which fields to request from
the database. Without one, we have to specify the fields ourself. We can do
this using the query API. The
Ecto.Query.from\2
function is a good place to start.
> import Ecto.Query
Ecto.Query
> from(t in "tweets")
#Ecto.Query<from t in "tweets">
This constructs the from
portion of our SQL query. The second argument to
from\2
is a keyword list that defaults to []
. We can further fill out
our query by specifying what the select
clause should look like:
> from(t in "tweets", select: t.tweet_id)
#Ecto.Query<from t in "tweets", select: t.tweet_id>
Piping what we have so far to TopTweets.Repo.all
will give us a result
with all of our tweets -- or at least all the tweet_id
s for our tweets[4].
from(t in "tweets",
select: t.tweet_id
) |> TopTweets.Repo.all
15:34:52.816 [debug] QUERY OK source="tweets" db=16.4ms decode=2.7ms
SELECT t0."tweet_id" FROM "tweets" AS t0 []
[815582938918223872, 815549333236027392, 815549228818792448, 814143389759340544,
813698248888217600, 813484243498270720, 813172504730750976, 812589506394279936,
811867334860218368, 811318252148232193, 811234616178831360, 811218704944001025,
...]
This is great output. Not only does it give us a list of tweet_id
s, but it
provides timing information and the SQL query that was run against our
database. This also gives us a query to iterate on.
A natural next step is to limit the results to 10 items. The limit
clause
is what we need.
from(t in "tweets",
select: t.tweet_id,
limit: 10
) |> TopTweets.Repo.all
17:29:15.103 [debug] QUERY OK source="tweets" db=9.4ms
SELECT t0."tweet_id" FROM "tweets" AS t0 LIMIT 10 []
[815582938918223872, 815549333236027392, 815549228818792448, 814143389759340544,
813698248888217600, 813484243498270720, 813172504730750976, 812589506394279936,
811867334860218368, 811318252148232193]
That has trimmed back the results a bit.
Adding a where
clause is another way to trim down and shape our result
set. We'll want to start with a where
clause that ensures only tweets from
@elixirlang are in our result set. This is especially important if we have
imported tweets from other users already.
from(t in "tweets",
select: t.tweet_id,
limit: 10,
where: [author: "elixirlang"]
) |> TopTweets.Repo.all
17:46:10.144 [debug] QUERY OK source="tweets" db=12.7ms
SELECT t0."tweet_id" FROM "tweets" AS t0 WHERE (t0."author" = 'elixirlang') LIMIT 10 []
[815582938918223872, 815549333236027392, 815549228818792448, 814143389759340544,
813698248888217600, 813484243498270720, 813172504730750976, 812589506394279936,
811867334860218368, 811318252148232193]
The resulting query now includes the first part of our where
clause. We
need to do more than limit the result set by author. We also want to ensure
the tweets are only from the year of 2016. Let's add constraints on the
tweeted_at
column to the where
clause.
from(t in "tweets",
select: t.tweet_id,
limit: 10,
where: [author: "elixirlang"],
where: t.tweeted_at > type(^"2016-01-01T00:00:00Z", Ecto.DateTime)
and t.tweeted_at < type(^"2017-01-01T00:00:00Z", Ecto.DateTime)
) |> TopTweets.Repo.all
09:53:36.972 [debug] QUERY OK source="tweets" db=12.7ms
SELECT t0."tweet_id" FROM "tweets" AS t0 WHERE (t0."author" = 'elixirlang') AND ((t0."tweeted_at" > $1::timestamp) AND (t0."tweeted_at" < $2::timestamp)) LIMIT 10 [{{2016, 1, 1}, {0, 0, 0, 0}}, {{2017, 1, 1}, {0, 0, 0, 0}}]
[814143389759340544, 813698248888217600, 813484243498270720, 813172504730750976,
812589506394279936, 811867334860218368, 811318252148232193, 811234616178831360,
811218704944001025, 810909398143029248]
Constructing SQL queries can sometimes require you to bend your mind a bit. It can be particularly challenging to construct an entire query in one take. By taking this kind of iterative approach, we have a working query every step of the way that gets us closer and closer to the solution we want. But as we go, we only have to think about any one part of it at a time.
Our query looks to be almost done, but we don't have much context about our
result set yet. Let's update the select
clause of our query to give us
more information.
from(t in "tweets",
select: {t.tweet_id, t.favorited_count, t.text_content},
limit: 10,
where: [author: "elixirlang"],
where: t.tweeted_at > type(^"2016-01-01T00:00:00Z", Ecto.DateTime)
and t.tweeted_at < type(^"2017-01-01T00:00:00Z", Ecto.DateTime)
) |> TopTweets.Repo.all
10:17:18.106 [debug] QUERY OK source="tweets" db=5.5ms
SELECT t0."tweet_id", t0."favorited_count", t0."text_content" FROM "tweets" AS t0 WHERE (t0."author" = 'elixirlang') AND ((t0."tweeted_at" > $1::timestamp) AND (t0."tweeted_at" < $2::timestamp)) LIMIT 10 [{{2016, 1, 1}, {0, 0, 0, 0}}, {{2017, 1, 1}, {0, 0, 0, 0}}]
[{814143389759340544, 0,
"RT @ElixirConfEU: Looking forward to ElixirConf.EU 2017 - Very Early Bird starts 9 Jan #myelixirstatus https://t.co/3NjJhTpsjN"},
{813698248888217600, 0,
"RT @whatyouhide: Finally had time to send in a CFP for @ElixirConfEU in Barcelona, fingers crossed! #myelixirstatus"},
{813484243498270720, 0,
"RT @sasajuric: Great opportunity to grab yourself some Elixir in Action :-) https://t.co/GSKRnM76Qr"},
...]
Awesome. We can now see the favorite_count
and content of each tweet in
our result set. These don't appear to be our most popular tweets though.
None of them were favorited. We need to add an order_by
clause:
from(t in "tweets",
select: {t.tweet_id, t.favorited_count, t.text_content},
limit: 10,
where: [author: "elixirlang"],
where: t.tweeted_at > type(^"2016-01-01T00:00:00Z", Ecto.DateTime)
and t.tweeted_at < type(^"2017-01-01T00:00:00Z", Ecto.DateTime),
order_by: [desc: :favorited_count]
) |> TopTweets.Repo.all
10:37:49.487 [debug] QUERY OK source="tweets" db=47.0ms queue=0.1ms
SELECT t0."tweet_id", t0."favorited_count", t0."text_content" FROM "tweets" AS t0 WHERE (t0."author" = 'elixirlang') AND ((t0."tweeted_at" > $1::timestamp) AND (t0."tweeted_at" < $2::timestamp)) ORDER BY t0."favorited_count" DESC LIMIT 10 [{{2016, 1, 1}, {0, 0, 0, 0}}, {{2017, 1, 1}, {0, 0, 0, 0}}]
[{682890953212956672, 226,
"Elixir v1.2.0 is out - https://t.co/BuULFDtrsm! New guides and official blog post coming out soon. Happy New Year!"},
{745240093275873281, 167,
"Elixir v1.3.0 has been released and welcome @whatyouhide into Elixir's team: https://t.co/aw0aopnXxn"},
{753171413821976577, 163,
"Plug v1.2.0-rc.0 is out! Featuring safer encryption algorithms, better MIME support and a brand new debugging page! https://t.co/XuepyvTSQA"},
...]
That's more like it. By ordering the tweets by favorited_count
in
descending order, we pull the most favorited tweets to the top.
Lastly, our questions were specific when it came to pointing out that the
tweets we are looking for ought to have been authored by @elixirlang. In
other words, we don't want to include any popular retweeted tweets. We can
make an addition to our original where
clause to accomplish this.
from(t in "tweets",
select: {t.tweet_id, t.favorited_count, t.text_content},
limit: 10,
where: [author: "elixirlang", retweeted: false],
where: t.tweeted_at > type(^"2016-01-01T00:00:00Z", Ecto.DateTime)
and t.tweeted_at < type(^"2017-01-01T00:00:00Z", Ecto.DateTime),
order_by: [desc: :favorited_count]
) |> TopTweets.Repo.all
21:37:35.068 [debug] QUERY OK source="tweets" db=24.7ms
SELECT t0."tweet_id", t0."favorited_count", t0."text_content" FROM "tweets" AS t0 WHERE ((t0."author" = 'elixirlang') AND (t0."retweeted" = FALSE)) AND ((t0."tweeted_at" > $1::timestamp) AND (t0."tweeted_at" < $2::timestamp)) ORDER BY t0."favorited_count" DESC LIMIT 10 [{{2016, 1, 1}, {0, 0, 0, 0}}, {{2017, 1, 1}, {0, 0, 0, 0}}]
[{682890953212956672, 226,
"Elixir v1.2.0 is out - https://t.co/BuULFDtrsm! New guides and official blog post coming out soon. Happy New Year!"},
{745240093275873281, 167,
"Elixir v1.3.0 has been released and welcome @whatyouhide into Elixir's team: https://t.co/aw0aopnXxn"},
{753171413821976577, 163,
"Plug v1.2.0-rc.0 is out! Featuring safer encryption algorithms, better MIME support and a brand new debugging page! https://t.co/XuepyvTSQA"},
...]
That's it. This query gives us the 10 most favorited tweets authored by
@elixirlang in 2016. Two small changes -- to the select
clause and the
order_by
clause -- are all we need to get the most retweeted tweets.
from(t in "tweets",
select: {t.tweet_id, t.retweeted_count, t.text_content},
limit: 10,
where: [author: "elixirlang", retweeted: false],
where: t.tweeted_at > type(^"2016-01-01T00:00:00Z", Ecto.DateTime)
and t.tweeted_at < type(^"2017-01-01T00:00:00Z", Ecto.DateTime),
order_by: [desc: :retweeted_count]
) |> TopTweets.Repo.all
12:53:29.310 [debug] QUERY OK source="tweets" db=20.6ms
SELECT t0."tweet_id", t0."retweeted_count", t0."text_content" FROM "tweets" AS t0 WHERE ((t0."author" = 'elixirlang') AND (t0."retweeted" = FALSE)) AND ((t0."tweeted_at" > $1::timestamp) AND (t0."tweeted_at" < $2::timestamp)) ORDER BY t0."retweeted_count" DESC LIMIT 10 [{{2016, 1, 1}, {0, 0, 0, 0}}, {{2017, 1, 1}, {0, 0, 0, 0}}]
[{682890953212956672, 278,
"Elixir v1.2.0 is out - https://t.co/BuULFDtrsm! New guides and official blog post coming out soon. Happy New Year!"},
{745240093275873281, 199,
"Elixir v1.3.0 has been released and welcome @whatyouhide into Elixir's team: https://t.co/aw0aopnXxn"},
{735157122430816256, 124,
"Elixir v1.3 introduces calendar types (Time, Date, NaiveDateTime and DateTime) as well as sigils for building them, e.g: ~D[2016-05-24]"},
...]
With our queries completed, let's take a look at the two result sets.
@ElixirLang's Most Favorited Tweets of 2016
- 226 - "Elixir v1.2.0 is out - https://t.co/BuULFDtrsm! New guides and official blog post coming out soon. Happy New Year!"
- 167 - "Elixir v1.3.0 has been released and welcome @whatyouhide into Elixir's team: https://t.co/aw0aopnXxn"
- 163 - "Plug v1.2.0-rc.0 is out! Featuring safer encryption algorithms, better MIME support and a brand new debugging page! https://t.co/XuepyvTSQA"
- 155 - "Elixir v1.3 introduces calendar types (Time, Date, NaiveDateTime and DateTime) as well as sigils for building them, e.g: ~D[2016-05-24]"
- 123 - "Announcing GenStage: https://t.co/bNOLdQaB2v"
- 118 - "Ecto 2.1.0-rc.4 is out with https://t.co/PLUFjpi6Rb that streams query results without loading the whole result set into memory."
- 106 - "Elixir v1.3 will perform cross module checks to find undefined modules and functions! (tks to @antipax) https://t.co/DaMYMNu8cL"
- 104 - "Elixir v1.3 will have new accessors for nested data structures. Here is how to upcase all languages names in a map: https://t.co/FNTTqjpDeP"
- 104 - "José Valim's keynote on GenStage and Flow at @elixirconf is up: https://t.co/OOqeYKuHrt!"
- 103 - "mix test --stale is now on Elixir master thanks to @antipax! It uses static analysis to run tests only for the source files changed on disk!"
@ElixirLang's Most Retweeted Tweets of 2016
- 278 - "Elixir v1.2.0 is out - https://t.co/BuULFDtrsm! New guides and official blog post coming out soon. Happy New Year!"
- 199 - "Elixir v1.3.0 has been released and welcome @whatyouhide into Elixir's team: https://t.co/aw0aopnXxn"
- 124 - "Elixir v1.3 introduces calendar types (Time, Date, NaiveDateTime and DateTime) as well as sigils for building them, e.g: ~D[2016-05-24]"
- 110 - "Announcing GenStage: https://t.co/bNOLdQaB2v"
- 106 - "Plug v1.2.0-rc.0 is out! Featuring safer encryption algorithms, better MIME support and a brand new debugging page! https://t.co/XuepyvTSQA"
- 87 - "Elixir v1.3.4 has been released with bug fixes for those using Dialyzer and Cover on Erlang 19. Release notes: https://t.co/cyntmGTbxG"
- 74 - "mix test --stale is now on Elixir master thanks to @antipax! It uses static analysis to run tests only for the source files changed on disk!"
- 73 - "Elixir v1.2.6 has been released with three bug fixes and support for Erlang 19: https://t.co/2NcBVJzRG3"
- 72 - "Can't wait for Elixir v1.3? We have just tidied up the CHANGELOG with the upcoming release highlights: https://t.co/aFZbLwedWn"
- 71 - "Official announcement for Elixir v1.2 is up as well as updated guides and docs: https://t.co/ST6lSXIVJz
There is some overlap between the two sets of tweets, but if you look closely, you'll see that there are some tweets that show up in one list but not the other and vice versa. Nevertheless, it is clear that the big releases of Elixir this year -- 1.2.0 and 1.3.0 -- stand out as the most popular tweets. And from the looks of the 1.4.0 release tweet, that is easily going to be one of the most popular of 2017.
Conclusion
If you've made it this far, then you've learned a lot about what Ecto can
do. We've seen how to create a migration that exercises a number of key
features of Ecto.Migration
,
albeit we didn't even touch on associations. Then we got to use insert_all
, an
exciting new feature hot off the Ecto 2.0 press. Lastly, we took an
iterative approach to building a query with Ecto's query API.
Hopefully the results of all this work highlighted some interesting @elixirlang tweets and blog posts that you haven't seen yet. If you haven't done so yet, give this a try on your own Twitter account. What were your most popular tweets of 2016?
[1]: PostgreSQL is my preferred relational database. Though most of the concepts are applicable to any relation database, there are some PostgreSQL-specific features that I use and the example assumes you are using Postgres 9.5+.
[2]: Sure, using tweet_id
as the primary key would have warded this issue
off from the outset, but then we wouldn't have had a chance to play with
the unique_index
function.
[3]: Twitter's API not only enforces standard rate-limiting, but it also
limits access to tweets from the
statuses/user_timeline
endpoint to the latest 3200 tweets. Because we are using @elixirlang's
tweets, this won't be an issue. If you are following along with an account
that tweets much more frequently, then this may affect your results.
[4]: If you are following along with the examples, depending on how your
data was inserted, you may be seeing different sets of tweet_id
s in your
results. This will become deterministic once we add an order by
clause.