From Trove to tweets: digital tools and explorations

Under construction, 15 April 2017
This page is likely to be messy and incomplete. Check back later.

A workshop held at the University of Wollongong on 19 April 2017.

Setting up

To complete the activities described here you’ll need to create (free) accounts on a number of services.

  • An annotation service account – for web-scale document annotation
  • Trove user account – to create lists, and save your other activities
  • Once you have a Trove user account, you’ll need to get a Trove API key – to build stuff
  • GitHub – to share and manage code
  • DHBox – to play with some command line tools
  • Carto – to make maps
  • Twitter – for harvesting tweets with Twarc

Annotate this!

This page can be annotated. As we go along you can add notes, reminders, links or clarifications. It’s easy!

Just click on the tab that appears on the right of the screen and log in. You can either comment on the whole page or highlight a section of text to annotate.

Things to try

  • Go to the site and enter any old url in the ‘Annotate!’ box. You can annotate any web page! There’s also a Chrome extension and a bookmarklet to make things even easier.
  • Ever wanted to share a link to a single sentence or paragraph on a web page? Annotation makes this sort of granular addressability easy – just annotate some text then share the link to your annotation.
  • There’s lots of useful information in their Education section. Think about how you might make use of annotation in your own resources.

See it in action

Headline roulette

Headline Roulette presents you with a randomly-selected newspaper article from Trove. You have to guess the year in which it was published. Sounds easy, huh? But beware – you only get ten guesses.

I like to start off workshops with a quick round of Headline Roulette both because it’s fun (I hope) and because it gets us thinking about different ways we might use and explore cultural heritage data. In this case, Headline Roulette is drawing information about digitised newspaper articles from the Trove API.

But we can do more than play Headline Roulette: we can create our own customised version! Perhaps you’d like a version focused on a particular topic to use in class, or share with friends.

To complete this activity you’ll need to have an account on GitHub. Basic accounts are free. Create one now if you haven’t already.

Before we get building, you need to think about the topic of your game, and what keywords you might want to use to filter your results. For a feline version of Headline Roulette, for example, you might use the keywords ‘cat’ and ‘kitten’. Try your keywords out in the Trove web interface to make sure you’re getting useful results. Also make sure that there are enough results (a few thousand at least) to avoid repeats.

Once you’ve got your keywords:

  • Make sure you’re logged in to GitHub.

  • Have your Trove API key at the ready.

  • Go to the DIY Headline Roulette repository on GitHub.

  • Click on the Fork button (in the top right hand corner of the page) to save a copy of this repository under your own account. There’s more about forking here.

  • Go to your account and view the repository you’ve just created. It will look just the same as the original!

  • Click on the Settings tab and change the repository name to suit your game. Save your change.

Ok, now we’re ready to customise the content of your game. There are only two things you have to supply:

  • your query
  • your Trove API key

But you can also provide:

  • a tagline for your game (appears in the header)
  • a byline for your game (appears in the footer)

To customise your game:

  • Click on the Code tab in your repository. Open the js folder and then click on the script.js file to view it.

  • Click on the pencil icon to edit the file.

  • Look for the ‘YOU MUST EDIT THIS SECTION’ message in the script.js file. It’ll look like this.

    // You must supply a Trove API key
    var troveAPIKey = '';
    // Either provide full API query here or include options below
    var apiQuery = '';
    // Words you want to search for -- separate multiple values with spaces, eg:
    // var keywords = 'weather wragge';
    var keywords = '';
    // How you want to combine keywords -- all, any, or phrase
    var keywordType = 'all'
    // Newspaper id numbers -- separate multiple values with spaces, eg:
    // var titles = '840 35';
    var titles = '';
    // Add a byline, eg:
    var byline = 'Created by <a href="">Tim Sherratt</a>.'
    // var byline = '';
    // Add a tagline
    var tagline = 'How well do you know your Australian history?';
    // Leave this alone unless you're publishing on a non-https server
    var useHttps = 'true';
  • On the var troveAPIKey = ''; line, paste your Trove API key between the quotes.

  • On the var keywords = ''; line, type your keyword(s) between the quotes.

  • Edit the byline value to replace my name and url with yours.

  • You can also edit the tagline value as you feel necessary.

  • Once you’ve finished editing, click on the Commit changes button to save your details.
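If you’re curious how the keywords and keywordType settings might fit together, here’s a minimal Python sketch. It isn’t the code the game uses: build_query is a made-up helper, and it assumes the search understands AND, OR, and quoted phrases.

```python
def build_query(keywords, keyword_type="all"):
    """Combine space-separated keywords into a single search string."""
    words = keywords.split()
    if not words:
        return ""
    if keyword_type == "all":
        # Every word must appear.
        return " AND ".join(words)
    if keyword_type == "any":
        # At least one of the words must appear.
        return " OR ".join(words)
    if keyword_type == "phrase":
        # The words must appear together, in this order.
        return '"' + " ".join(words) + '"'
    raise ValueError("keyword_type must be 'all', 'any', or 'phrase'")


print(build_query("cat kitten", "any"))  # prints: cat OR kitten
```

Trying your combined query in the Trove web interface first is still the best check that it returns what you expect.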

That’s it! Your new game will be available at the address:

http://[your Github user name][your repository name]

For example, my user account is ‘wragge’ and I created a version of this repository called ‘canberra-headline-roulette’, so those are the two names that appear in its address. If you can’t find your game, go into your repository’s ‘Settings’ and scroll down to the section headed ‘GitHub Pages’ – you should see the link there.

The DIY Headline Roulette repository includes instructions for creating more complex Trove queries – for example, you might want to only use articles from specific newspapers. It also explains how to create and manage multiple games.

Trove Harvester

The Trove Harvester lets you download lots and lots of digitised newspaper articles in a batch for further analysis or visualisation. Instead of 20 results on a web page, you can have 2,000, or 20,000, all neatly saved in a spreadsheet. The Harvester is a command line tool (so no fancy user interface), but it’s pretty easy to use.

The easiest and best way to run TroveHarvester is inside a Python virtual environment. If you want to do this on your own computer, see these instructions.

Preparing DHBox

In this workshop we’re going to make use of DHBox to create our own temporary labs. Setting up Trove Harvester in DHBox requires a few preliminary steps:

  • Go to DHBox, click on the Sign up button under Use it for your workshop and fill in your details. Choose how long you’d like your lab to be available for. Click Launch to set up your lab.

  • If you’ve already created a laboratory, you can get back to it by signing in, clicking on your profile name, and selecting Apps from the drop down list.

  • Once your lab is ready, click on the Command Line tab and log in to the console using your DHBox username and password.

DHBox Command Line screen

First we need to download the Virtualenv software we’ll use to set up our environment. Copy and paste the following commands into DHBox. Press Enter after each one.

curl -O
tar xvfz virtualenv-15.1.0.tar.gz
cd virtualenv-15.1.0

Now we can create a virtual environment called troveharvests by entering:

python troveharvests

Let’s activate our new environment:

cd troveharvests
source bin/activate

Installing TroveHarvester

TroveHarvester is a Python program hosted on the PyPI repository. This makes it very easy to install. Assuming you’re inside your newly created and activated virtual environment, just enter:

pip install troveharvester

That’s it! You can check that it’s installed and working by typing:

troveharvester -h

You should see TroveHarvester’s help message.

Screenshot of troveharvester installation

We’re ready to harvest!

Starting your first harvest

Before you start your harvest, you need to make sure you have your Trove API key at the ready.

Ok, so what do you want to harvest? Head over to the Trove newspaper zone and start exploring. Once you’ve constructed an interesting query just copy the url so you can feed it to the harvester.

For example, this url searches for newspaper articles containing the name ‘Poon Gooey’:

If you’d like to find out more about Poon Gooey and his family have a browse of Kate Bagnall’s blog.

The Trove Harvester will automagically convert the url into a query that can be understood by the Trove API. This works well most of the time, but there currently are some differences between the way queries work in the web interface and the API. I’ve tried to work around these as much as possible, but be warned that some queries might not deliver exactly what you expect.

Armed with a query url and your Trove API key you can head back to the DHBox command line. Starting a harvest is easy. If my API key was thisistimsapikey I’d just type:

troveharvester start "" thisistimsapikey

Note the start command and the double quotes around the url.

Ok, off you go – start your own harvest!

You can stop your harvest at any time by pressing Ctrl-C on your keyboard. Try it now – it will look a bit ugly, but it’s perfectly safe.

You can then check the status of your stopped harvest by entering:

troveharvester report

And you can restart it by entering:

troveharvester restart

There are a few options you can add to modify the results of your harvest:

  • Add --pdf to save a PDF copy of every article. This will slow the harvest down a lot, and will consume significant amounts of disk space, so use wisely.
  • Add --text to save the OCR’d text content of every article to a separate file. You could then feed the text files to the text analysis platform of your choice.
  • Add --max followed by a number to limit the number of articles harvested. So --max 100 would only save the first 100 articles.

Putting these together, if I wanted to harvest details and texts for the first 1000 articles in my query, I’d enter:

troveharvester start "" thisistimsapikey --text --max 1000

Exploring the results

Harvests are stored in a data directory which will be created if it doesn’t already exist. Each new harvest creates its own directory inside here. The name of the directory will be a timestamp from the moment when the harvest was started. This means that you can keep firing off harvests and each one will have its own uniquely-named directory.

In DHBox:

  • Click on the File Manager tab.

  • You’ll see a folder with your DHBox username. Click on it to open it.

  • Continue down through the folder hierarchy by clicking on virtualenv-15.1.0, then on troveharvests, and finally on data.

  • You should now see a directory (or directories if you’ve tried multiple harvests) with a name that looks like a large number. Click on it to open.

Screenshot of troveharvester data directory

There are two main files created by the Trove Harvester:

  • metadata.json – contains information about your harvest, such as the query used and the date it was started.
  • results.csv – contains all the article metadata harvested from Trove.

If you’ve used the --text or --pdf options, you’ll also have a folder containing your text or pdf files.
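If you’d rather poke around from code than click through folders, something like the following Python 3 sketch could find your most recent harvest and summarise it. The function names are made up for illustration, and it assumes harvest directories have purely numeric names and that metadata.json and results.csv sit directly inside them, as described above.

```python
import csv
import json
import os


def latest_harvest(data_dir="data"):
    """Return the path of the newest harvest directory, or None if there isn't one."""
    if not os.path.isdir(data_dir):
        return None
    # Assumption: harvest directories are named with a numeric timestamp.
    harvests = [
        name for name in os.listdir(data_dir)
        if name.isdigit() and os.path.isdir(os.path.join(data_dir, name))
    ]
    if not harvests:
        return None
    # The largest timestamp is the most recent harvest.
    return os.path.join(data_dir, max(harvests, key=int))


def summarise(harvest_dir):
    """Load metadata.json and count the article rows in results.csv."""
    with open(os.path.join(harvest_dir, "metadata.json"), encoding="utf-8") as f:
        metadata = json.load(f)
    with open(os.path.join(harvest_dir, "results.csv"), newline="", encoding="utf-8") as f:
        article_count = sum(1 for _ in csv.DictReader(f))
    return metadata, article_count


harvest = latest_harvest("data")
if harvest is None:
    print("No harvests found yet.")
else:
    metadata, article_count = summarise(harvest)
    print(harvest, "holds", article_count, "articles")
```

Run it from the directory that contains data (in this workshop, troveharvests).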

Here’s what the contents of results.csv look like:

Screenshot of troveharvester results

  • Click on the results.csv file to download it. We’ll use it later.
  • Here’s an example

The results.csv is just an ordinary Comma Separated Values (CSV) file – it’s a format commonly used for sharing data. You can open it in any spreadsheet program, but beware that Excel might do odd things to your dates!
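If you want to sidestep the spreadsheet altogether, a few lines of Python can read the CSV as plain text, so the dates stay exactly as harvested. This sketch counts articles per year. It assumes the date column starts with a four-digit year (for example 1911-03-02), and the sample rows are invented so the example runs on its own.

```python
import csv
import io
from collections import Counter


def articles_per_year(csv_file):
    """Count rows per year, taking the year from the start of each date value."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        year = (row.get("date") or "")[:4]
        if year.isdigit():
            counts[year] += 1
    return dict(sorted(counts.items()))


# Made-up sample rows standing in for a real results.csv
sample = io.StringIO(
    "title,date\n"
    "First article,1911-03-02\n"
    "Second article,1911-11-20\n"
    "Third article,1913-01-05\n"
)
print(articles_per_year(sample))  # prints: {'1911': 2, '1913': 1}
```

To run it on your own harvest, swap the sample for open("results.csv", newline="", encoding="utf-8").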

Mapping your Trove harvest

There are lots of things you might do with the data generated by the Trove Harvester. You could plot the number of articles over time, or analyse the language used. As an example, we’re going to make a map with a few little extra data widgets using Carto.

To complete this activity you’ll need an account with Carto. They have a basic free account that includes everything you need to get started. You can sign up using your GitHub account.

  • Once you’re logged into Carto, click on the Datasets link in the top navigation bar.

  • Click on the blue New Dataset button.

  • From the Connect dataset page click on the blue Browse button and select the results.csv file you downloaded from DHBox.

  • Click on the blue Connect dataset button at the bottom of the page.

  • Carto will import the file and open it for editing.

At this point you may be wondering what we’re going to map. After all, there isn’t any geospatial data in the results from Trove. Ah ha! It just so happens that I did a little tinkering late last year to extract place names from newspaper titles. I created a CSV file that links newspapers on Trove to geolocated places. I’ve already uploaded it to Carto, so you can connect to it.

  • Click on the circle thingy in the top left corner of the screen to go back to your datasets.

  • Click on the New dataset button again.

  • This time, copy and paste the following link into the url input box and click on Submit.
  • Now click on the blue Connect dataset button to load the new dataset.

All our data is now ready and we can start to visualise it!

  • Click on the blue Create map button at the bottom of the dataset editing page.

Your map will open showing all the locations where newspapers were published. To limit this to only those titles included in our harvester results we need to join the two datasets.

  • Click on the current dataset in the left hand menu.

  • Click on the Analysis tab, and then the blue Add analysis button.

  • Select the Join columns from 2nd layer option and click on yet another Add analysis button.

  • In the box that says Select a right source choose the results file from the harvester.

  • Now you have to identify the field that links the two files. Under Foreign keys choose title_id from the locations dataset, and newspaper_id from the results dataset.

  • Finally we select the fields from each dataset that we want to appear in the joined dataset. From the locations dataset select state and place. From the results dataset select title, newspaper_title, date, page, and url.

  • Click on the Apply button.

You should notice that some of the markers on the map disappear as the files are joined. This is what we’d expect – now we’re just seeing those places where articles in our harvest were published.
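To see why markers disappear, it can help to think of the join as a simple matching exercise. Here’s a rough Python sketch of the idea, not a description of how Carto does it internally. The rows are invented, and the right_ prefix on the article fields just mirrors the names that show up on joined columns.

```python
from collections import defaultdict


def join_locations(locations, results):
    """Keep only the locations whose title_id matches an article's newspaper_id."""
    articles_by_paper = defaultdict(list)
    for article in results:
        articles_by_paper[article["newspaper_id"]].append(article)

    joined = []
    for location in locations:
        # Locations with no matching articles produce no rows at all.
        for article in articles_by_paper.get(location["title_id"], []):
            joined.append({
                "state": location["state"],
                "place": location["place"],
                "right_title": article["title"],
                "right_date": article["date"],
            })
    return joined


# Made-up rows for illustration only
locations = [
    {"title_id": "1", "state": "NSW", "place": "Sydney"},
    {"title_id": "2", "state": "VIC", "place": "Melbourne"},
    {"title_id": "3", "state": "QLD", "place": "Brisbane"},
]
results = [
    {"newspaper_id": "1", "title": "Article A", "date": "1911-03-02"},
    {"newspaper_id": "1", "title": "Article B", "date": "1912-05-10"},
    {"newspaper_id": "3", "title": "Article C", "date": "1913-01-05"},
]

for row in join_locations(locations, results):
    print(row["place"], row["right_date"], row["right_title"])
```

In this example no article comes from newspaper 2, so its place drops out of the joined data, just like the vanishing markers on the map.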

Now it’s time to style our map. I’m going to focus on creating an animated map that shows when articles were published, but feel free to play around with the other styles.

  • Click on the Style tab. Under Aggregation choose the Animated style – it looks like a moving dot.

Screenshot of Carto style tab

  • Set the Blending value to ‘multiply’.

  • Set the Column to right_date – this tells Carto which field to use as the timeline for the animation.

That’s it. A timeline will appear below the map – just click on the play button to start the animation if it isn’t running already.

That’s pretty cool, but Carto also makes it easy to add some mini visualisations – called widgets – that we can use to filter the map results.

  • Click on the Data tab.

  • Check the boxes next to state, place, and right_newspaper_title. If the widgets don’t appear, try resizing your window.

  • Click on the Edit links next to the widget names to edit settings, including their titles.

If you’re pleased with your map you can share it with the world.

  • Click on the settings icon in the blue left hand bar.

  • Click on the blue Share button, and then the Publish button.

  • Grab a copy of the link to your map from the Get the link box. Share!

You can also embed your map in another website. Here’s one of the Poon Gooey articles that I created earlier.

Political speech

Election speeches

The Museum of Australian Democracy has a great site where you can play around with election policy speeches from 1901 through to 2016. Have a look at the visualisations on their explore page and see what you can discover.

Even better, they give you access to the underlying data – you can easily download a complete set of the speeches in plain text for your own analysis.

One thing you can’t easily do on the MoAD site is to compare the language of speeches. How does Bill Shorten compare to Malcolm Turnbull? How does Bob Hawke compare to Ben Chifley? A simple way of trying out these sorts of comparisons is with the tool SameDiff.

You’ll notice that the default example is a comparison of Hillary Clinton and Donald Trump – it seems appropriate that we should have a look at that for a while. Anything interesting?

Now let’s try Australian politics.

  • Click on the ‘Upload files’ tab.

  • Click on ‘browse file 1’ and choose one of the unzipped MoAD speeches.

  • Repeat for ‘browse file 2’.

Which speeches are similar or dissimilar? How does language change over time, or by party?
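If you’d like to try this kind of comparison in code, here’s one simple way to do it in Python: list the words two texts share, the words unique to each, and a cosine similarity score based on word counts. This is a sketch of the general idea rather than SameDiff’s own method, and the two short texts are invented stand-ins for speeches.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count lower-cased words in a text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def compare(text_a, text_b):
    """Return (shared words, words only in A, words only in B)."""
    a, b = word_counts(text_a), word_counts(text_b)
    shared = sorted(set(a) & set(b))
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    return shared, only_a, only_b


def cosine_similarity(a, b):
    """Score two word-count tables from 0 (nothing shared) to 1 (same proportions)."""
    dot = sum(a[word] * b[word] for word in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Invented snippets, not real speeches
speech_a = "Jobs and growth"
speech_b = "Jobs and fairness"

print(compare(speech_a, speech_b))
print(round(cosine_similarity(word_counts(speech_a), word_counts(speech_b)), 2))
```

To compare real speeches, read two of the unzipped text files into speech_a and speech_b.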

Going further with Voyant Tools

Voyant Tools is a powerful, web-based text analysis platform. It has so many tools and settings it can be a bit overwhelming, but it’s pretty easy to get started.

  • Go to Voyant Tools

  • Click on Upload and select the .zip file of election speeches. Yep – no need to upload individual files, Voyant will accept zips!

Voyant opens in ‘Corpus view’ which includes a series of commonly-used tools. If you have any trouble uploading, you can find a ready-built version here.

  • Look under Summary. You’ll see that it includes a list of ‘distinctive’ terms for each speech.

In the top corner is the familiar word cloud, but Voyant Tools gives you a lot of control over what it displays.

  • Let’s hide the words ‘government’ and ‘australia’ by adding them to the list of ‘stop words’ (stop words are just common words like ‘a’ or ‘the’ that we don’t want to include in our analysis).

  • Hover over the title bar above the word cloud and click on the icon that says ‘Define options for this tool’ when you hover over it.

  • Click on the Edit list button next to the ‘Stopwords’ selector.

  • Hit return to start a new line, then type ‘australia’. Repeat with ‘government’.

  • Click Save and then Confirm.

  • The word cloud will automatically regenerate. How does it look now? What other stopwords might you add?

  • To create a word cloud for an individual speech, click on ‘Scale’, then ‘Documents’ and select a speech.
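Under the hood, a word cloud is just a table of word counts with the stop words left out. This Python sketch shows the idea. The tiny stop word list and the sample sentence are made up, and Voyant’s own lists are much longer.

```python
import re
from collections import Counter

# A deliberately tiny stop word list, just for illustration
STOP_WORDS = {"a", "the", "of", "and", "to", "in"}


def top_words(text, extra_stop_words=(), n=5):
    """Return the n most frequent words, ignoring stop words."""
    stop = STOP_WORDS | {word.lower() for word in extra_stop_words}
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if word not in stop)
    return counts.most_common(n)


# Made-up text, not a real speech
text = (
    "The government of Australia will govern. "
    "The government will build railways in Australia."
)

print(top_words(text))
print(top_words(text, extra_stop_words=["government", "australia"]))
```

Adding ‘government’ and ‘australia’ as extra stop words lets the less obvious words rise to the top, which is exactly what happens when you edit the list in Voyant.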

The corpus view only gives a taste of the many tools and visualisations that Voyant provides. To try out some others just:

  • Hover over the blue bar at the top of the screen and click on the icon that looks like a window.

  • A menu of available tools will appear. Have a play! See what you can find.

Some of the tools are easier to understand than others. For example, ‘Contexts’ (under Document Tools) simply displays a selected word in the various contexts in which it appears throughout the corpus.

  • Select ‘Contexts’ from the Document Tools menu.

  • Type ‘immigration’ in the input box, wait for a minute while Voyant checks to see the term exists, and then click on the term ‘immigration*’ in the results list.

  • Browse the list (you can resize the columns if necessary). Double click on an entry to see even more context!
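The idea behind ‘Contexts’ is often called keyword in context, and it’s easy to sketch in Python: find each occurrence of a term and show a slice of text on either side. This is an illustration only, not Voyant’s implementation. Like the ‘immigration*’ pattern, it matches any word that starts with the term, and the sample text is invented.

```python
import re


def contexts(text, term, width=30):
    """Yield (left, match, right) for every word that starts with term."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\w*", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        yield left, match.group(0), right


# Invented sample text
text = "Our immigration policy is firm. Immigrants and immigration laws."

for left, word, right in contexts(text, "immigration"):
    print("{:>30} [{}] {}".format(left, word, right))
```

Increase width to see more of the surrounding text, much like double clicking an entry in Voyant.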

The creators of Voyant Tools have made it really easy to embed their tools in your own website – so you can include the visualisations alongside your analysis. Any of the available tools or visualisations can be viewed in its own window, embedded, and shared.

Let’s make a Streamgraph and create a shareable link.

  • Select the tools menu as before and click on ‘StreamGraph’ under ‘Visualisation Tools’.

  • Remove the current terms by clicking on them at the top of the screen.

  • Add new terms by typing them into the input box at the bottom of the screen and selecting the results that pop up.

  • Because the speeches are arranged in chronological order, the streamgraph will show you changing frequency across time. What interesting comparisons can you create?
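The numbers behind a chart like this are straightforward: for each document, in order, work out how often a term appears relative to the length of the document. Here’s a minimal Python sketch of that calculation. The labels and texts are invented, and Voyant’s own calculations may differ in detail.

```python
import re


def term_trend(documents, term):
    """Return (label, occurrences per 1,000 words) for each document, in order."""
    trend = []
    for label, text in documents:
        words = re.findall(r"[a-z]+", text.lower())
        count = words.count(term.lower())
        rate = 1000 * count / len(words) if words else 0.0
        trend.append((label, rate))
    return trend


# Invented documents, listed in chronological order
documents = [
    ("speech 1", "tariff tariff defence trade"),
    ("speech 2", "defence banks banks banks"),
]

print(term_trend(documents, "tariff"))
print(term_trend(documents, "defence"))
```

Using a rate rather than a raw count matters because speeches vary a lot in length.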

Once you’ve created something that looks interesting, let’s get a shareable link.

  • Hover over the grey StreamGraph title bar and click on the icon that looks like an external link.

  • The ‘Export’ dialogue appears. By default the ‘URL’ box is checked, but you can also choose ‘Export view’ to get the code to embed your visualisation in a web page, or ‘Export Visualisation’ to download an image.

  • Click Export and a new window will open with your StreamGraph and a shareable url. Tweet it, email it to your family!

Here’s an example of using the embed codes. You’ll see that the StreamGraph below is not just an image, it’s live! You can add or subtract terms here, without ever going to Voyant.

More information

Try this

  • My Historic Hansard has Voyant Tools built in! Go to a day or a year and click on the Voyant button to explore some more political speech.

Using Twarc

The Documenting the Now project is working to develop technologies and guidelines for the ethical collection and use of social media. One of their tools is a Twitter archiving tool called Twarc. We’re going to see how it works.

Prepare DHBox

Twarc is a Python command line tool like the Trove Harvester, so we’re going to head back to DHBox to use it.

  • Click on the DHBox ‘Command Line’ tab as before.

  • You’re probably still in your troveharvests directory, so enter the following lines to move out of the current directory and create a new virtual environment for Twarc.

Deactivate the current environment and move back to the virtualenv directory.

deactivate
cd ~/virtualenv-15.1.0

Create and activate a new environment.

python twarcharvests
cd twarcharvests
source bin/activate

Now install Twarc!

pip install twarc

Get your Twitter keys

Perhaps the trickiest bit of all of this is just getting all the access keys and tokens you need to authenticate yourself with Twitter. Every time I try this it seems a bit different, but the basic principles should be the same.

  • Make sure you have a Twitter account!

  • Go to the Twitter developers site. Login if necessary with your normal Twitter credentials.

  • Click on the ‘My apps’ link at the top of the page.

  • Click on the Create New App button.

  • Fill in the form. The ‘name’ and ‘description’ can be anything you want. For ‘website’ you can use the url of this workshop.

  • Tick the terms and conditions and click the Create button.

  • On your new application’s page click on the ‘Keys and Access Tokens’ tab.

  • Ok, you should now see the first two of the four keys and tokens we need to collect:
    • Consumer Key (API Key)
    • Consumer Secret (API Secret)
  • To get the other two, click on the Create my Access Token button at the bottom of the page. You should now see:
    • Access Token
    • Access Token Secret
  • Keep this page open so you can copy and paste the values into Twarc.

Configure Twarc

Go back to the DHBox Command Line and enter:

twarc configure

Twarc will ask you for your four keys in turn. Just copy and paste them at the command line.

Your first Twarc harvest

There are two main ways of collecting tweets using Twarc: search and filter. Search, naturally, searches recent tweets, while filter sits and listens to the Twitter stream and tries to catch new tweets matching your query.

Given the nature of the Twitter API, it’s virtually impossible to get everything matching your query. You may have noticed that Twitter’s search results only remain current for a few days. This means that you might want to repeat your search every few days, or supplement search with filter. You can always remove duplicates later on.
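Removing duplicates is simple once you know that each tweet carries a unique id. Here’s a minimal Python sketch, assuming your harvest files hold one JSON tweet per line and that each tweet has an id_str field. The sample lines are invented.

```python
import json


def unique_tweets(lines):
    """Yield each tweet once, skipping any id that has already been seen."""
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        if tweet["id_str"] in seen:
            continue
        seen.add(tweet["id_str"])
        yield tweet


# Invented lines standing in for two overlapping harvests
lines = [
    '{"id_str": "1", "text": "first"}',
    '{"id_str": "2", "text": "second"}',
    '{"id_str": "1", "text": "first"}',
]

print([tweet["id_str"] for tweet in unique_tweets(lines)])  # prints: ['1', '2']
```

To merge real harvests, feed the lines of each file through the same function in turn, sharing one seen set.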

Let’s run a search to collect tweets using the #ozhist hashtag.

twarc search "#ozhist" > ozhist.json

Results from the Twitter API are saved in JSON format. This is a very common format for saving and sharing structured data – but it’s not very human friendly.

Let’s find out how many tweets we collected. This command counts the number of lines (each tweet is on a new line) in the JSON file.

wc -l < ozhist.json
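If you’d like something more readable than raw JSON, a short Python script can pull out a few fields from each tweet. This is a sketch under a couple of assumptions: one tweet per line, and tweets that carry created_at, user.screen_name, and either full_text or text. The sample tweet is invented.

```python
import json


def summarise_tweets(lines):
    """Return one short, readable line for each tweet."""
    summaries = []
    for line in lines:
        if not line.strip():
            continue
        tweet = json.loads(line)
        # Depending on how tweets were collected, the text may be in either field.
        text = tweet.get("full_text") or tweet.get("text", "")
        summaries.append(
            "{} @{}: {}".format(tweet["created_at"], tweet["user"]["screen_name"], text)
        )
    return summaries


# An invented tweet, for illustration only
lines = [
    '{"created_at": "Wed Apr 19 01:00:00 +0000 2017", '
    '"user": {"screen_name": "example_user"}, "text": "Hello #ozhist"}',
]

summaries = summarise_tweets(lines)
print(len(summaries), "tweets")
for summary in summaries:
    print(summary)
```

To use it on your harvest, pass in an open file instead of the sample list, for example open("ozhist.json", encoding="utf-8").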

Note that Twitter discourages you from sharing large amounts of tweets harvested from their API. Twarc offers a way around this – you can dehydrate your collection to create a file that contains only tweet ids. Other users can then rehydrate using Twarc or the web service on the Documenting the Now site.

To dehydrate:

twarc dehydrate ozhist.json > ozhist-ids.txt

To inspect your ids:

less ozhist-ids.txt

Use the arrow keys to scroll down, and type q to quit.
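Dehydrating boils down to keeping the id of each tweet and throwing the rest away. As a rough illustration (not Twarc’s own code), here’s how that might look in Python, again assuming one JSON tweet per line with an id_str field. The sample lines are invented.

```python
import json


def tweet_ids(lines):
    """Yield the id of each tweet in a collection of line-delimited JSON."""
    for line in lines:
        if line.strip():
            yield json.loads(line)["id_str"]


# Invented tweets, for illustration only
lines = [
    '{"id_str": "101", "text": "first"}',
    '{"id_str": "102", "text": "second"}',
]

for tweet_id in tweet_ids(lines):
    print(tweet_id)
```

The ids on their own say nothing about the tweets, which is why a dehydrated file is safe to share. Anyone who rehydrates it only gets back tweets that still exist.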

Let’s finish by creating some simple visualisations of our collection. First we need to grab some extra utility scripts from the Twarc repository.

At the command line enter the following to grab a copy of the Twarc repository on GitHub:

git clone

Now let’s make a wall:

twarc/utils/ ozhist.json > wall.html

And a wordcloud:

twarc/utils/ ozhist.json > cloud.html

To view the web pages we’ve just created, let’s fire up a temporary web server via the command line:

python -m SimpleHTTPServer 4000

Now click on the ‘Your Site’ tab in DHBox. If you don’t see anything, try right clicking and choosing ‘Reload Frame’.

You should see a list of the contents of your current directory. Click on wall.html or cloud.html to see the results!

Screenshot of Twarc word cloud

Things to try