Under construction, 15 April 2017
This page is likely to be messy and incomplete. Check back later.
A workshop held at the University of Wollongong on 19 April 2017.
To complete the activities described here you’ll need to create (free) accounts on a number of services.
This page can be annotated using Hypothes.is. As we go along you can add notes, reminders, links or clarifications. It’s easy!
Just click on the tab that appears on the right of the screen and log in. You can either comment on the whole page or highlight a section of text to annotate.
Headline Roulette presents you with a randomly-selected newspaper article from Trove. You have to guess the year in which it was published. Sounds easy, huh? But beware – you only get ten guesses.
I like to start off workshops with a quick round of Headline Roulette both because it’s fun (I hope) and because it gets us thinking about different ways we might use and explore cultural heritage data. In this case, Headline Roulette is drawing information about digitised newspaper articles from the Trove API.
But we can do more than play Headline Roulette – we can create our own customised version! Perhaps you’d like a version focused on a particular topic to use in class, or to share with friends.
To complete this activity you’ll need to have an account on GitHub. Basic accounts are free. Create one now if you haven’t already.
Before we get building, you need to think about the topic of your game, and what keywords you might want to use to filter your results. For a feline version of Headline Roulette, for example, you might use the keywords ‘cat’ and ‘kitten’. Try your keywords out in the Trove web interface to make sure you’re getting useful results. Also make sure that there are enough results (a few thousand at least) to avoid repeats.
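You can also check the number of matching results by asking the Trove API directly. Here’s a minimal sketch in Python, assuming you have a Trove API key and the requests library installed – the response path used below reflects the v2 JSON format, but check the API documentation if it doesn’t match:

import requests

# Replace with your own Trove API key
api_key = 'YOUR_TROVE_API_KEY'

params = {
    'q': 'cat kitten',     # the keywords you want to test
    'zone': 'newspaper',   # search the digitised newspapers
    'encoding': 'json',
    'key': api_key,
}
response = requests.get('http://api.trove.nla.gov.au/v2/result', params=params)
data = response.json()

# The total number of matching articles (path assumed from the v2 JSON format)
print(data['response']['zone'][0]['records']['total'])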
Once you’ve got your keywords:
Make sure you’re logged in to GitHub.
Have your Trove API key at the ready.
Go to the DIY Headline Roulette repository on GitHub.
Click on the Fork button (in the top right hand corner of the page) to save a copy of this repository under your own account. There’s more about forking here.
Go to your account and view the repository you’ve just created. It will look just the same as the original!
Click on the Settings tab and change the repository name to suit your game. Save your change.
Ok, now we’re ready to customise the content of your game. There are only two things you have to supply: your Trove API key, and the keywords you want to search for (or a complete API query). But you can also provide some optional extras: how your keywords should be combined, the ids of particular newspapers, a byline, and a tagline.
To customise your game:
Click on the Code tab in your repository. Open the js folder and then click on the script.js file to view it.
Click on the pencil icon to edit the file.
Look for the ‘YOU MUST EDIT THIS SECTION’ message in the script.js file. It’ll look like this:
// YOU MUST EDIT THIS SECTION
// You must supply a Trove API key
var troveAPIKey = '';
// Either provide full API query here or include options below
var apiQuery = '';
// Words you want to search for -- separate multiple values with spaces, eg:
// var keywords = 'weather wragge';
var keywords = '';
// How you want to combine keywords -- all, any, or phrase
var keywordType = 'all';
// Newspaper id numbers -- separate multiple values with spaces, eg:
// var titles = '840 35';
var titles = '';
// Add a byline, eg:
var byline = 'Created by <a href="https://timsherratt.org">Tim Sherratt</a>.'
// var byline = '';
// Add a tagline
var tagline = 'How well do you know your Australian history?';
// Leave this alone unless you're publishing on a non-https server
var useHttps = 'true';
On the var troveAPIKey = ''; line, paste your Trove API key between the quotes.
On the var keywords = ''; line, type your keyword(s) between the quotes.
Edit the byline value to replace my name and url with yours.
You can also edit the tagline value as you feel necessary.
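For example, the edited section of a feline version of the game might end up looking something like this (the API key, byline and tagline here are all made up – substitute your own):

var troveAPIKey = 'thisistimsapikey';
var apiQuery = '';
var keywords = 'cat kitten';
var keywordType = 'any';
var titles = '';
var byline = 'Created by <a href="https://example.com">Your Name</a>.';
var tagline = 'How well do you know your feline history?';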
Once you’ve finished editing, click on the Commit changes button to save your details.
That’s it! Your new game will be available at the address:
http://[your Github user name].github.io/[your repository name]
For example, my user account is ‘wragge’ and I created a version of this repository called ‘canberra-headline-roulette’, so you can find it online at http://wragge.github.io/canberra-headline-roulette/. If you can’t find it, go into your repository’s ‘Settings’ and scroll down to the section headed ‘GitHub Pages’ – you should see the link there.
The DIY Headline Roulette repository includes instructions for creating more complex Trove queries – for example, you might want to only use articles from specific newspapers. It also explains how to create and manage multiple games.
The Trove Harvester lets you download lots and lots of digitised newspaper articles in a batch for further analysis or visualisation. Instead of 20 results on a web page, you can have 2,000, or 20,000, all neatly saved in a spreadsheet. The Harvester is a command line tool (so no fancy user interface), but it’s pretty easy to use.
The easiest and best way to run TroveHarvester is inside a Python virtual environment. If you want to do this on your own computer, see these instructions.
In this workshop we’re going to make use of DHBox to create our own temporary labs. Setting Trove Harvester up in DHBox requires a few preliminary steps:
Go to DHBox, click on the Sign up button under Use it for your workshop and fill in your details. Choose how long you’d like your lab to be available for. Click Launch to set up your lab.
If you’ve already created a laboratory, you can get back to it by signing in, clicking on your profile name, and selecting Apps from the drop down list.
Once your lab is ready, click on the Command Line tab and log in to the console using your DHBox username and password.
First we need to download the Virtualenv software we’ll use to set up our environment. Copy and paste the following commands into DHBox. Press Enter after each one.
curl -O https://pypi.python.org/packages/d4/0c/9840c08189e030873387a73b90ada981885010dd9aea134d6de30cd24cb8/virtualenv-15.1.0.tar.gz
tar xvfz virtualenv-15.1.0.tar.gz
cd virtualenv-15.1.0
Now we can create a virtual environment called troveharvests by entering:
python virtualenv.py troveharvests
Let’s activate our new environment:
cd troveharvests
source bin/activate
TroveHarvester is a Python program hosted on the PyPi repository. This makes it very easy to install. Assuming you’re inside your newly created and activated virtual environment, just enter:
pip install troveharvester
That’s it! You can check that it’s installed and working by typing:
troveharvester -h
You should see TroveHarvester’s help message.
We’re ready to harvest!
Before you start your harvest, you need to make sure you have your Trove API key at the ready.
Ok, so what do you want to harvest? Head over to the Trove newspaper zone and start exploring. Once you’ve constructed an interesting query just copy the url so you can feed it to the harvester.
For example, this url searches for newspaper articles containing the name ‘Poon Gooey’:
http://trove.nla.gov.au/newspaper/result?q=%22poon+gooey%22
If you’d like to find out more about Poon Gooey and his family have a browse of Kate Bagnall’s blog.
The Trove Harvester will automagically convert the url into a query that can be understood by the Trove API. This works well most of the time, but there are currently some differences between the way queries work in the web interface and the API. I’ve tried to work around these as much as possible, but be warned that some queries might not deliver exactly what you expect.
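To illustrate the idea, the Poon Gooey url above would be translated into an API request along these lines (the exact parameters the harvester adds may differ):

http://api.trove.nla.gov.au/v2/result?q=%22poon+gooey%22&zone=newspaper&key=[your API key]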
Armed with a query url and your Trove API key you can head back to the DHBox command line. Starting a harvest is easy. If my API key was thisistimsapikey, I’d just type:
troveharvester start "http://trove.nla.gov.au/newspaper/result?q=%22poon+gooey%22" thisistimsapikey
Note the start command and the double quotes around the url.
Ok, off you go – start your own harvest!
You can stop your harvest at any time by pressing Ctrl-C on your keyboard. Try it now – it will look a bit ugly, but it’s perfectly safe.
You can then check the status of your stopped harvest by entering:
troveharvester report
And you can restart it by entering:
troveharvester restart
There are a few options you can add to modify the results of your harvest:
--pdf to save a PDF copy of every article. This will slow the harvest down a lot, and will consume significant amounts of disk space, so use wisely.
--text to save the OCR’d text content of every article to a separate file. You could then feed the text files to the text analysis platform of your choice.
--max followed by a number to limit the number of articles harvested. So --max 100 would only save the first 100 articles.
Putting these together, if I wanted to harvest details and texts for the first 1000 articles in my query, I’d enter:
troveharvester start "http://trove.nla.gov.au/newspaper/result?q=%22poon+gooey%22" thisistimsapikey --text --max 1000
Harvests are stored in a data directory which will be created if it doesn’t already exist. Each new harvest creates its own directory inside here. The name of the directory will be a timestamp from the moment the harvest was started. This means that you can keep firing off harvests and each one will have its own uniquely-named directory.
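After a couple of harvests your folders might look something like this (the timestamps will, of course, be different, and the text folder only appears if you used the --text option):

data/
  1492569200/
    metadata.json
    results.csv
  1492572800/
    metadata.json
    results.csv
    text/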
In DHBox:
Click on the File Manager tab.
You’ll see a folder with your DHBox username. Click on it to open it.
Continue down through the folder hierarchy by clicking on virtualenv-15.1.0, then on troveharvests, and finally on data.
You should now see a directory (or directories if you’ve tried multiple harvests) with a name that looks like a large number. Click on it to open.
There are two main files created by the Trove Harvester:
metadata.json – contains information about your harvest, such as the query used and the date it was started.
results.csv – contains all the article metadata harvested from Trove.
If you’ve used the --text or --pdf options, you’ll also have a folder containing your text or pdf files.
Click on the results.csv file to download it. We’ll use it later.
The results.csv file is just an ordinary Comma Separated Values (CSV) file – it’s a format commonly used for sharing data. You can open it in any spreadsheet program, but beware that Excel might do odd things to your dates!
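You can also poke at the file programmatically. Here’s a minimal sketch using Python and pandas (assuming pandas is installed – the date column name is an assumption based on the fields we’ll use later in Carto):

import pandas as pd

# Load the harvested article metadata
df = pd.read_csv('results.csv')

# How many articles did we harvest?
print(len(df))

# Count articles by year of publication
df['date'] = pd.to_datetime(df['date'], errors='coerce')
print(df['date'].dt.year.value_counts().sort_index())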
There are lots of things you might do with the data generated by the Trove Harvester. You could plot the number of articles over time, or analyse the language used. As an example, we’re going to make a map with a few little extra data widgets using Carto.
To complete this activity you’ll need an account with Carto. They have a basic free account that includes everything you need to get started. You can sign up using your GitHub account.
Once you’re logged into Carto, click on the Datasets link in the top navigation bar.
Click on the blue New Dataset button.
From the Connect dataset page click on the blue Browse button and select the results.csv
file you downloaded from DHBox.
Click on the blue Connect dataset button at the bottom of the page.
Carto will import the file and open it for editing.
At this point you may be wondering what we’re going to map – after all, there isn’t any geospatial data in the results from Trove. Ah ha! It just so happens that I did a little tinkering late last year to extract place names from newspaper titles. I created a CSV file that links newspapers on Trove to geolocated places. I’ve already uploaded it to Carto, so you can connect to it.
Click on the circle thingy in the top left corner of the screen to go back to your datasets.
Click on the New dataset button again.
This time, copy and paste the following link into the url input box and click on Submit.
https://dl.dropbox.com/s/a0jj1lepbcrtz9h/trove-newspaper-titles-locations.csv
All our data is now ready and we can start to visualise it!
Your map will open showing all the locations where newspapers were published. To limit this to only those titles included in our harvester results we need to join the two datasets.
Click on the current dataset in the left hand menu.
Click on the Analysis tab, and then the blue Add analysis button.
Select the Join columns from 2nd layer option and click on yet another Add analysis button.
In the box that says Select a right source choose the results file from the harvester.
Now you have to identify the field that links the two files. Under Foreign keys choose title_id from the locations dataset, and newspaper_id from the results dataset.
Finally we select the fields from each dataset that we want to appear in the joined dataset. From the locations dataset select state and place. From the results dataset select title, newspaper_title, date, page, and url.
Click on the Apply button.
You should notice that some of the markers on the map disappear as the files are joined. This is what we’d expect – now we’re just seeing those places where articles in our harvest were published.
Now it’s time to style our map. I’m going to focus on creating an animated map that shows when articles were published, but feel free to play around with the other styles.
Set the Blending value to ‘multiply’.
Set the Column to right_date – this tells Carto which field to use as the timeline for the animation.
That’s it. A timeline will appear below the map – just click on the play button to start the animation if it isn’t running already.
That’s pretty cool, but Carto also makes it easy to add some mini visualisations – called widgets – that we can use to filter the map results.
Click on the Data tab.
Check the boxes next to state, place, and right_newspaper_title. If the widgets don’t appear, try resizing your window.
Click on the Edit links next to the widget names to edit settings, including their titles.
If you’re pleased with your map you can share it with the world.
Click on the settings icon in the blue left hand bar.
Click on the blue Share button, and then the Publish button.
Grab a copy of the link to your map from the Get the link box. Share!
You can also embed your map in another website. Here’s a map of the Poon Gooey articles that I created earlier.
The Museum of Australian Democracy has a great site where you can play around with election policy speeches from 1901 through to 2016. Have a look at the visualisations on their explore page and see what you can discover.
Even better, they give you access to the underlying data – you can easily download a complete set of the speeches in plain text for your own analysis.
One thing you can’t easily do on the MoAD site is to compare the language of speeches. How does Bill Shorten compare to Malcolm Turnbull? How does Bob Hawke compare to Ben Chifley? A simple way of trying out these sorts of comparisons is with the tool SameDiff.
You’ll notice that the default example is a comparison of Hillary Clinton and Donald Trump – it seems appropriate that we should have a look at that for a while. Anything interesting?
Now let’s try Australian politics.
Click on the ‘Upload files’ tab.
Click on ‘browse file 1’ and choose one of the unzipped MoAD speeches.
Repeat for ‘browse file 2’.
Which speeches are similar or dissimilar? How does language change over time, or by party?
Voyant Tools is a powerful, web-based text analysis platform. It has so many tools and settings it can be a bit overwhelming, but it’s pretty easy to get started.
Go to Voyant Tools
Click on Upload and select the .zip file of election speeches. Yep – no need to upload individual files, Voyant will accept zips!
Voyant opens in ‘Corpus view’ which includes a series of commonly-used tools. If you have any trouble uploading, you can find a ready-built version here.
In the top corner is the familiar word cloud, but Voyant Tools gives you a lot of control over what it displays.
Let’s hide the words ‘government’ and ‘australia’ by adding them to the list of ‘stop words’ (stop words are just common words like ‘a’ or ‘the’ that we don’t want to include in our analysis).
Hover over the title bar above the word cloud and click on the icon labelled ‘Define options for this tool’.
Click on the Edit list button next to the ‘Stopwords’ selector.
Hit return to start a new line, then type ‘australia’. Repeat with ‘government’.
Click Save and then Confirm.
The word cloud will automatically regenerate. How does it look now? What other stopwords might you add?
To create a word cloud for an individual speech, click on ‘Scale’, then ‘Documents’ and select a speech.
The corpus view only gives a taste of the many tools and visualisations that Voyant provides. To try out some others just:
Hover over the blue bar at the top of the screen and click on the icon that looks like a window.
A menu of available tools will appear. Have a play! See what you can find.
Some of the tools are easier to understand than others. For example, ‘Contexts’ (under Document Tools) simply displays a selected word in the various contexts in which it appears throughout the corpus.
Select ‘Contexts’ from the Document Tools menu.
Type ‘immigration’ in the input box, wait a moment while Voyant checks that the term exists, and then click on the term ‘immigration*’ in the results list.
Browse the list (you can resize the columns if necessary). Double click on an entry to see even more context!
The creators of Voyant Tools have made it really easy to embed their tools in your own website – so you can include the visualisations alongside your analysis. Any of the available tools or visualisations can be viewed in its own window, embedded, and shared.
Let’s make a Streamgraph and create a shareable link.
Select the tools menu as before and click on ‘StreamGraph’ under ‘Visualisation Tools’.
Remove the current terms by clicking on them at the top of the screen.
Add new terms by typing them into the input box at the bottom of the screen and selecting the results that pop up.
Because the speeches are arranged in chronological order, the streamgraph will show you changing frequency across time. What interesting comparisons can you create?
Once you’ve created something that looks interesting, let’s get a shareable link.
Hover over the grey StreamGraph title bar and click on the icon that looks like an external link.
The ‘Export’ dialogue appears. By default the ‘URL’ box is checked, but you can also choose ‘Export view’ to get the code to embed your visualisation in a web page, or ‘Export Visualisation’ to download an image.
Click Export and a new window will open with your StreamGraph and a shareable url. Tweet it, email it to your family!
Here’s an example of using the embed codes. You’ll see that the StreamGraph below is not just an image, it’s live! You can add or subtract terms here, without ever going to Voyant.
The Documenting the Now project is working to develop technologies and guidelines for the ethical collection and use of social media. One of their tools is a Twitter archiving tool called Twarc. We’re going to see how it works.
Twarc is a python command line tool like the Trove Harvester, so we’re going to head back to DHBox to use it.
Click on the DHBox ‘Command Line’ tab as before.
You’re probably still in your troveharvests directory, so enter the following lines to move out of the current directory and create a new virtual environment for Twarc.
Deactivate the current environment.
deactivate
cd ~/virtualenv-15.1.0
Create and activate a new environment.
python virtualenv.py twarcharvests
cd twarcharvests
source bin/activate
Now install Twarc!
pip install twarc
Perhaps the trickiest bit of all of this is just getting all the access keys and tokens you need to authenticate yourself with Twitter. Every time I try this it seems a bit different, but the basic principles should be the same.
Make sure you have a Twitter account!
Go to the Twitter developers site. Login if necessary with your normal Twitter credentials.
Click on the ‘My apps’ link at the top of the page.
Click on the Create New App button.
Fill in the form. The ‘name’ and ‘description’ can be anything you want. For ‘website’ you can use the url of this workshop.
Tick the terms and conditions and click the Create button.
On your new application’s page click on the ‘Keys and Access Tokens’ tab.
Go back to the DHBox Command Line and enter:
twarc configure
Twarc will ask you for your four keys in turn – the consumer key, consumer secret, access token, and access token secret. Just copy and paste them at the command line.
There are two main ways of collecting tweets using Twarc: search and filter. Search naturally searches current tweets, while filter sits and listens to the Twitter stream and tries to catch new tweets matching your query.
Given the nature of the Twitter API, it’s virtually impossible to get everything matching your query. You may have noticed that Twitter’s search results only remain current for a few days. This means that you might want to repeat your search every few days, or supplement search with filter. You can always remove duplicates later on.
Let’s run a search to collect tweets using the #ozhist hashtag.
twarc search "#ozhist" > ozhist.json
Results from the Twitter API are saved in JSON format. This is a very common format for saving and sharing structured data – but it’s not very human friendly.
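For a more human-friendly view, you could pull a few fields out of each tweet with a short Python script. This is just a sketch – it assumes the standard fields Twitter includes with every tweet (created_at, the user’s screen_name, and text):

import json

# Print a readable summary of each harvested tweet
with open('ozhist.json') as f:
    for line in f:
        tweet = json.loads(line)
        print(tweet['created_at'] + '  @' + tweet['user']['screen_name'])
        print(tweet['text'])
        print('---')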
Let’s find out how many tweets we collected. This command counts the number of lines (each tweet is on a new line) in the JSON file.
wc -l < ozhist.json
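If you combine the results of several searches you’ll probably end up with duplicates. Here’s a minimal sketch of one way to remove them, assuming each line of the file holds a single tweet in JSON (as Twarc saves them) and using Twitter’s standard id_str field:

import json

# Keep only the first copy of each tweet, identified by its id
seen = set()
with open('ozhist.json') as infile, open('ozhist-deduped.json', 'w') as outfile:
    for line in infile:
        tweet = json.loads(line)
        if tweet['id_str'] not in seen:
            seen.add(tweet['id_str'])
            outfile.write(line)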
Note that Twitter discourages you from sharing large amounts of tweets harvested from their API. Twarc offers a way around this – you can dehydrate your collection to create a file that contains only tweet ids. Other users can then rehydrate it using Twarc or the web service on the Documenting the Now site.
To dehydrate:
twarc dehydrate ozhist.json > ozhist-ids.txt
To inspect your ids:
less ozhist-ids.txt
Use the arrow keys to scroll down, and type q to quit.
Let’s finish by creating some simple visualisations of our collection. First we need to grab some extra utility scripts from the Twarc repository.
At the command line enter the following to grab a copy of the Twarc repository on GitHub:
git clone https://github.com/DocNow/twarc.git
Now let’s make a wall:
twarc/utils/wall.py ozhist.json > wall.html
And a wordcloud:
twarc/utils/wordcloud.py ozhist.json > cloud.html
To view the web pages we’ve just created, let’s fire up a temporary web server via the command line:
python -m SimpleHTTPServer 4000
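This is Python 2 syntax. If your lab happens to be running Python 3 instead, the equivalent command is:

python -m http.server 4000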
Now click on the ‘Your Site’ tab in DHBox. If you don’t see anything, try right clicking and choosing ‘Reload Frame’.
You should see a list of the contents of your current directory. Click on wall.html or cloud.html to see the results!