Created: 03 April 2017
I’ve had a bit of a play with Ed Summers’s Twarc tool before, but here I’m documenting my first attempt to do something a bit more systematic.
Over the last few years ‘#australiaday’ tweets have been ‘curated’ by the National Museum of Australia and Twitter and stored in a ‘time capsule’ (what does that even mean?). I thought it was important for someone to be documenting alternative views of Australia Day, so I decided to have a go at harvesting ‘#invasionday’ tweets using Twarc.
Twarc is a Python tool and can be easily installed with `pip install twarc`. To configure it you need to obtain a set of four access tokens for the Twitter API. Once you have your keys, the `twarc configure` command will ask you for them and save them to a `.twarc` file in your home directory.
The Twarc utilities aren’t installed when you run `pip`, so if you want to use things like the deduplicate script, just clone a copy of the Twarc GitHub repository into your local directory.
Twarc provides two ways of harvesting tweets – `search` looks for matching tweets from the last week or so, while `filter` listens to the Twitter stream for anything new. I decided to combine the two approaches to try to capture as much as possible. I’m not sure if there’s an optimal mix of search versus filter – I went a bit overboard and fired off almost daily searches. I knew this would result in lots of duplicates, but I thought it might increase my coverage. Perhaps this was just wishful thinking – I don’t really know enough about the coverage of the Twitter API.
To run a search you just do:
$ twarc search invasionday > invasionday-search.json
And to filter:
$ twarc filter invasionday > invasionday-filter.json
The filter command will sit around waiting, but I found that it sometimes stopped working. I suspect my internet connection might have dropped out.
Anyway, after about a week I ended up with the following files:
-rw-r--r-- 1 tim staff 542490 23 Jan 17:35 invasionday-filter-20170122.json
-rw-r--r-- 1 tim staff 164501 24 Jan 17:46 invasionday-filter-20170123.json
-rw-r--r-- 1 tim staff 20143208 26 Jan 09:54 invasionday-filter-20170124.json
-rw-r--r-- 1 tim staff 188208128 28 Jan 10:20 invasionday-filter-20170126.json
-rw-r--r-- 1 tim staff 1768917 1 Apr 12:36 invasionday-filter-20170128.json
-rw-r--r-- 1 tim staff 9666495 22 Jan 12:50 invasionday-search-20170122.json
-rw-r--r-- 1 tim staff 8191817 23 Jan 17:38 invasionday-search-20170123.json
-rw-r--r-- 1 tim staff 11634605 24 Jan 17:46 invasionday-search-20170124.json
-rw-r--r-- 1 tim staff 15271533 25 Jan 18:39 invasionday-search-20170125.json
-rw-r--r-- 1 tim staff 29014598 26 Jan 09:55 invasionday-search-20170126-1.json
-rw-r--r-- 1 tim staff 107195725 26 Jan 18:33 invasionday-search-20170126-2.json
-rw-r--r-- 1 tim staff 160754085 27 Jan 09:14 invasionday-search-20170127.json
-rw-r--r-- 1 tim staff 176985132 28 Jan 10:36 invasionday-search-20170128.json
-rw-r--r-- 1 tim staff 175328710 29 Jan 14:48 invasionday-search-20170129.json
The first step in processing was to combine all the `.json` files into one. Just do:
$ cat invasionday* > combined.json
This resulted in one 909MB JSON file containing 134,062 tweets.
I found that it’s worth checking that none of the files have been truncated before you combine them. Otherwise you’ll get errors later.
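One way to check for truncation is to try parsing every line as JSON – a cut-off download will usually fail on its last line. Here’s a rough sketch in Python (the filename pattern is just an example; this isn’t part of Twarc):

```python
import glob
import json

def check_files(pattern):
    """Return (filename, line number) pairs for lines that aren't valid JSON."""
    bad = []
    for path in glob.glob(pattern):
        with open(path) as f:
            for n, line in enumerate(f, 1):
                try:
                    json.loads(line)
                except json.JSONDecodeError:
                    # A truncated file typically fails here on its final line.
                    bad.append((path, n))
    return bad
```

Run it over the harvest files before `cat`-ing them together; any pairs it returns point at truncated or corrupted lines.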
Twarc provides a useful utility function for removing duplicate tweets. Just do:
$ twarc/utils/deduplicate.py combined.json > deduped.json
This resulted in a 228MB JSON file containing 32,626 tweets. So yes, given that I had about 100,000 duplicates, I think you could say I overdid the searching…
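Conceptually, deduplication just needs to remember which tweet ids it has already seen and keep the first copy of each. A rough sketch of the idea in Python (not the actual Twarc script):

```python
import json

def deduplicate(lines):
    """Yield each tweet line the first time its id appears; skip later duplicates."""
    seen = set()
    for line in lines:
        tweet = json.loads(line)
        if tweet["id_str"] not in seen:
            seen.add(tweet["id_str"])
            yield line
```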
This wasn’t really necessary, but I thought I’d try it – sorting the tweets in `id` order (basically equivalent to sorting by the time tweeted):
$ twarc/utils/sort_by_id.py deduped.json > sorted.json
You can also generate a word cloud from the tweet text:
$ twarc/utils/wordcloud.py deduped.json > wordcloud.html
Twitter’s terms of service suggest that you shouldn’t share large amounts of harvested tweets. To get around this, Twarc offers a `dehydrate` function that extracts just the tweet ids. These can later be rehydrated using Twarc to retrieve the full details from Twitter.
$ twarc dehydrate sorted.json > invasionday-ids.txt
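Dehydration is conceptually simple – pull the id out of each line and throw the rest away. A sketch of the idea (not the actual Twarc code):

```python
import json

def dehydrate(lines):
    """Reduce each tweet line to its id string; the ids can later be
    'rehydrated' with Twarc to fetch the full tweets again."""
    for line in lines:
        yield json.loads(line)["id_str"]
```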
The JSON files created by Twarc have one tweet per line. So a quick way to find out how many tweets you have is to count the number of lines in the file.
$ wc -l < deduped.json
32626
How many original tweets were there? Remove retweets:
$ twarc/utils/noretweets.py unshortened.json > tweets_noretweets.json
$ wc -l < tweets_noretweets.json
7058
Twitter returns some, but not all, unshortened urls. Twarc provides an `unshorten` script to help you get as many full urls as possible. The unshortened urls are then saved back into the JSON file.
To use the script you have to have unshrtn running – this is a node-based web service that unshortens urls. I installed Docker, cloned the repository, and then followed the instructions on the GitHub site. During the build phase I got an error message saying that a user with the uid 1000 already existed. I fixed this by just editing the `Dockerfile` to remove `--uid 1000` from line 6. The run command also resulted in an error – it seems the app was trying to bind to port 80, where I was already running a webserver. I just changed the command to `docker run -p 3000:3000 -d -t unshrtn:dev` and all seemed fine.
I could then do:
$ cat sorted.json | twarc/utils/unshorten.py > unshortened.json
To save all unique urls to a file:
$ cat unshortened.json | twarc/utils/urls.py | sort | uniq > urls.txt
How many are there?
$ wc -l < urls.txt
2754
To save all unique urls to a file, count them, and rank them in order of popularity:
$ cat unshortened.json | twarc/utils/urls.py | sort | uniq -c | sort -nr > urls-ranked.txt
Neat trick, huh? The `-c` flag on `uniq` adds a count, which can then be used to re-sort.
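If you’d rather do the ranking in Python, `collections.Counter` produces the same kind of most-popular-first list:

```python
from collections import Counter

def rank_urls(urls):
    """Count occurrences and rank most-common first,
    like `sort | uniq -c | sort -nr`."""
    return Counter(urls).most_common()
```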
To get a list of unique image urls:
$ twarc/utils/image_urls.py unshortened.json | sort | uniq > images.txt
How many are there?
$ wc -l < images.txt
1208
Once you have a file containing all the image urls, you can save them all with wget:
$ mkdir images
$ cd images
$ wget -i ../images.txt
To show the first tweet in the file in a nicely formatted way:
$ head -1 deduped.json | jq '.'
If you don’t have jq installed, you’ll need to `brew install jq` first.
Note that the Twarc JSON files save one tweet (JSON object) per line. I started off trying to treat them as arrays and got into trouble.
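In practice this means `json.load()` on the whole file will fail – you need to parse each line separately. For example:

```python
import json

def read_tweets(path):
    """Twarc output is newline-delimited JSON: one tweet object per line,
    not a single JSON array, so parse each line on its own."""
    with open(path) as f:
        return [json.loads(line) for line in f]
```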