Under construction, 06 November 2016
This page is likely to be messy and incomplete. Check back later.
A workshop held at Griffith University on 8 November 2016.
In the seminar this morning we looked at some of the ways in which web interfaces can deceive us. As researchers we need to ask questions of digital cultural collections – what do they exclude, what do they hide, how is our access to them controlled?
One way we can do this is to step outside existing collection interfaces and work with the data that sits underneath. This workshop will look at how we can do this. Along the way, we’ll explore different techniques for analysing, exhibiting, using, and seeing digital collections.
I like to start off workshops with a quick round of Headline Roulette both because it’s fun (I hope) and because it gets us thinking about different ways we might use and explore cultural heritage data. In this case, Headline Roulette is drawing information about digitised newspaper articles from the Trove API.
This page can be annotated using Hypothes.is. As we go along you can add notes, reminders, links or clarifications. It’s easy!
Just click on the tab that appears on the right of the screen and log in. You can either comment on the whole page or highlight a section of text to annotate.
We’re also going to create a simple scrapbook to store some things as we go along. It uses GitHub, which is a site for sharing, managing and collaborating on code.
Let’s set some things up:
Log in to your GitHub account.
This is a repository that contains all the bits and pieces of code that I’m going to be using today. It’s easy to make your own copy:
That’s it. GitHub should open up your brand new repository.
One of the nice things about GitHub is that as well as sharing code, it provides a simple web page hosting service. We’re going to use that today.
I’ve already set up the skeleton of a web site inside the docs directory. We just have to tell GitHub where our webpages are:
Make sure you’re in your copy of the repository.
Click on the ‘Settings’ tab.
Scroll down until you see ‘GitHub Pages’.
In the ‘Source’ dropdown list choose ‘master branch /docs folder’.
Your website will be published at http://[your GitHub username].github.io/asalinks-workshop/. You’ll find a link on your ‘Settings’ page. (It can take a few minutes before the site is ready.)
Let’s customise your site a little. One cool thing about GitHub is that you can edit files.
docs and then
Click on the pencil icon to start editing.
Look for the title surrounded by <h1></h1> tags and start typing underneath.
<p>This scrapbook was created by [insert your name].</p>
The Museum of Australian Democracy has a great site where you can play around with election policy speeches from 1901 through to 2016. Have a look at the visualisations on their explore page and see what you can discover.
Even better, they give you access to the underlying data – you can easily download a complete set of the speeches in plain text for your own analysis.
One thing you can’t easily do on the MoAD site is to compare the language of speeches. How does Bill Shorten compare to Malcolm Turnbull? How does Bob Hawke compare to Ben Chifley? A simple way of trying out these sorts of comparisons is with the tool SameDiff.
You’ll notice that the default example is a comparison of Hillary Clinton and Donald Trump – it seems appropriate that we should have a look at that for a while. Anything interesting?
Now let’s try Australian politics.
Click on the ‘Upload files’ tab.
Click on ‘browse file 1’ and choose one of the unzipped MoAD speeches.
Repeat for ‘browse file 2’.
Which speeches are similar or dissimilar? How does language change over time, or by party?
Let’s take this analysis further by finding the most ‘significant’ phrases in each speech by calculating their TF-IDF (Term Frequency - Inverse Document Frequency) values. As I described when talking about In a word, TF-IDF can show us words or phrases that appear frequently in a particular document compared to a whole collection of documents.
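If you’d like to see the idea in miniature, here’s a rough sketch of a TF-IDF calculation in plain Python. It isn’t the script we’ll run below: the `tf_idf` function, the three tiny ‘speeches’, and the particular formula (raw counts multiplied by `ln(N / df) + 1`) are all assumptions made for illustration, and real libraries add further weighting and normalisation.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document.

    documents is a list of token lists. The score is the raw term count
    multiplied by ln(N / df) + 1, which is just one common formulation.
    """
    n_docs = len(documents)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))

    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        scores.append({
            term: count * (math.log(n_docs / df[term]) + 1)
            for term, count in counts.items()
        })
    return scores


# Three tiny made-up 'speeches', already split into words.
docs = [
    "we will build railways and we will build roads".split(),
    "we will cut taxes".split(),
    "we will protect the reef".split(),
]

scores = tf_idf(docs)
top = sorted(scores[0].items(), key=lambda kv: kv[1], reverse=True)[:3]
print(top)
```

Words like ‘we’ and ‘will’ turn up in every document, so they score low even though they’re frequent. ‘build’ only appears in the first document, so it floats to the top.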
To do this we’re going to log on to an external server that I set up, using SSH (Secure Shell).
We’re using an external server just because of the limitations of working in a computer lab. If you’re following these instructions at home you’ll probably want to learn about the command line, and set up Python, pip, and virtualenv.
There are different SSH tools available depending on your system. Today we’ll make use of an SSH Chrome plugin.
Install and launch the SSH plugin.
In the ‘hostname’ field enter ‘22.214.171.124’
In the ‘username’ field enter your user name!
In the ‘port’ field enter ‘4444’.
Click on ‘Enter’.
When prompted, enter your password.
Hooray! You now have your very own server to play with. Let’s grab some code. Earlier we ‘forked’ the ‘griffith2016’ repository; this time we’re going to ‘clone’ it, so we have a copy on the server. Type (or cut and paste):
git clone https://github.com/wragge/griffith2016.git
This command created a new folder called griffith2016 containing the files from GitHub. Let’s look inside:
cd griffith2016
ls
The ls command lists the contents of the current directory.
Now we’re going to grab a copy of the MoAD election speeches:
Then unzip the file:
Type ls again. You’ll see the newly downloaded and unzipped files.
I’ve written a simple little script in the programming language Python to extract the TF-IDF values. If you look at the file you’ll see it’s not very long. Most of the work is being done by a Python library called Scikit-Learn.
All you need to do is point the script at the directory of speeches:
python tf-idf.py moad-election-speeches
After a little bit of processing, the script will list the top 20 ‘trigrams’ (3 word phrases) for each speech. Anything interesting?
Let’s add some of the results to our scrapbook.
Select and copy some of the results.
Go back to the index.html page and click the pencil icon to edit.
Type in the following tags:
These tags will keep the formatting of the results. Just paste the results in between the sets of tags.
When you’re finished click on the green Commit changes button.
If you’re feeling adventurous you can try fiddling with some of the settings in the script. Don’t worry, you can’t do any damage!
tf = TfidfVectorizer(input='filename', analyzer='word', ngram_range=(3, 3), min_df=0, smooth_idf=False, sublinear_tf=True)
Try changing the
ngram_range value to
Use Ctrl-X to save your changes. Hit enter at the prompts.
Now run the script again:
python tf-idf.py moad-election-speeches
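If you’re wondering what ngram_range actually controls, here’s a small sketch. The `word_ngrams` helper is hypothetical (it’s not part of the script or of Scikit-Learn), and it splits on whitespace, which is simpler than Scikit-Learn’s own tokeniser. The idea is the same though: the two numbers are the shortest and longest runs of consecutive words to collect.

```python
def word_ngrams(text, ngram_range=(3, 3)):
    """List every run of consecutive words whose length is within ngram_range."""
    words = text.lower().split()
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n words along the text.
        for start in range(len(words) - n + 1):
            grams.append(" ".join(words[start:start + n]))
    return grams


sentence = "a fair go for all"
print(word_ngrams(sentence, (3, 3)))  # trigrams only
print(word_ngrams(sentence, (1, 2)))  # single words and pairs
```

With (3, 3) you only get three-word phrases, which is why the script lists trigrams. A wider range mixes phrases of different lengths in the one ranking.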
There’s all sorts of interesting and useful data sitting on websites, but it’s often not easy to use or understand.
For example, here’s a page showing the number of sitting days in Parliament since 1901. It’s useful information, but it’s not very easy to read as a table – let’s turn it into a chart!
You’ll need a spreadsheet of some sort open – either Google Sheets or Excel should work fine.
Just select and copy the values in the first table, from 1901 to 1958.
Click on the first cell in your spreadsheet and paste the data. It should be properly formatted across 3 columns.
Repeat with the second table (leaving out 2016), pasting the results underneath the first set.
Add a header row if necessary – ‘Year’, ‘Senate’, ‘Reps’.
If you’re using Google Sheets, download your data as a CSV file.
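If you’d rather script this step than use a spreadsheet, something like the following sketch would do a similar job. It assumes the copied table arrives as tab-separated text with three columns in the order Year, Senate, Reps. The `rows_to_csv` function is my own invention for this example, and the numbers are placeholders, not the real sitting-day figures.

```python
import csv
import io


def rows_to_csv(pasted_text, header=("Year", "Senate", "Reps")):
    """Convert tab-separated rows (as copied from a web table) into CSV text."""
    output = io.StringIO()
    writer = csv.writer(output, lineterminator="\n")
    writer.writerow(header)
    for line in pasted_text.strip().splitlines():
        cells = [cell.strip() for cell in line.split("\t")]
        # Skip blank or malformed rows rather than guessing at them.
        if len(cells) == len(header):
            writer.writerow(cells)
    return output.getvalue()


# Placeholder values only, NOT the real sitting-day figures.
pasted = "1901\t10\t20\n1902\t30\t40\n"
csv_text = rows_to_csv(pasted)
print(csv_text)
```

You could write `csv_text` to a file and import it in the same way as a CSV downloaded from Google Sheets.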
We’re going to use Plot.ly to make our chart.
Go to Plot.ly and log in.
Click the + Create button and choose ‘Chart’.
Click Import and select your CSV file.
In the ‘Chart type’ drop down select ‘Bar chart’.
For the ‘Y’ axis select the ‘Reps’ column, for ‘X’ select ‘Year’.
You’ll see a chart of the Reps sitting days magically appear. Now let’s add the Senate.
Click on + Trace.
In the new trace dialogue change the ‘Y’ axis to ‘Senate’.
Now you can customise the chart:
Click on the ‘Y’ axis label and change it to ‘Number of sitting days’. Hit enter to save.
In a similar way give your chart a suitable title.
Now click Save. If it gives you an option for the chart to be ‘public’ or ‘private’, make sure it’s ‘public’.
We’re now going to embed a copy of our chart in our scrapbook!
Click on Share and then on the ‘Embed’ tab.
Copy the contents of the ‘iframe’ box.
Now back to GitHub and the
Click on the pencil to edit, then paste in the embed code from Plot.ly.
Save your changes.
Sometimes you can’t easily copy and paste the data you want. It might be strangely formatted, or spread over multiple pages. This is where screen scraping can be useful.
An easy way to get started with screen scraping is through a web service such as import.io – it’s basically just all point and click. I’ve already prepared a tutorial on using Import.io, so let’s work through it.
When data is available through APIs it makes things a bit easier, but usually it means fiddling with code.
Trove has an API which enables people to build things using Trove data. Have a look at the application gallery for some examples.
To get an idea of how the API works, have a look at the Trove API Console. Just click on some of the sample queries to see what they return. Try changing some of the query parameters to see what happens.
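Under the hood, each of those sample queries is just a URL with some parameters attached. Here’s a rough sketch of building one in Python. The endpoint and the parameter names (`key`, `zone`, `q`, `encoding`) are my assumptions based on version 2 of the API, so check the current Trove documentation before relying on them. `build_query_url` is a made-up helper, and `YOUR_API_KEY` is a placeholder for your own key.

```python
from urllib.parse import urlencode

# Assumption: the version 2 endpoint; the API has changed over time.
API_URL = "https://api.trove.nla.gov.au/v2/result"


def build_query_url(api_key, query, zone="newspaper", encoding="json"):
    """Assemble a search URL; nothing is sent over the network here."""
    params = {"key": api_key, "zone": zone, "q": query, "encoding": encoding}
    return API_URL + "?" + urlencode(params)


url = build_query_url("YOUR_API_KEY", "sitting days")
print(url)
```

Changing a query parameter in the console is the same as changing one of the values in `params` here.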
To save you the trouble of working directly with the API, I’ve created a Trove Harvester that will harvest digitised newspaper articles in bulk. It’s a command-line tool, but it’s easy, I promise.
Full documentation is available here. At home you can follow these instructions to get it set up on your own computer. But for now, we’re going to get it up and running on our test server.
View the texts in Voyant Tools.
I’ve got various other bits of code around for harvesting other things from Trove or RecordSearch.
The Wayback Machine is the Internet Archive’s archive of the web itself – billions of preserved pages! One of the most useful things about the Wayback Machine is that little box on the home page that says ‘Save page now’. Just feed it a url and it will instantly attempt to archive the page. Once it’s done you have a permanent link to the contents of that page – so even if the page disappears from the original site, you’ll always be able to go back to the archive and view it.
This is particularly useful if you want to cite webpages and are worried that they’re going to disappear. Perma.cc is another archive more directly aimed at preserving citations.
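The Wayback Machine’s links follow a predictable pattern, so you can build them yourself. This sketch only assembles the URLs, it doesn’t contact the Internet Archive. The function names are my own, and `http://example.com/` is a stand-in for whatever page you want to keep.

```python
WAYBACK = "https://web.archive.org"


def save_page_url(page_url):
    """URL that asks the Wayback Machine to archive page_url when visited."""
    return f"{WAYBACK}/save/{page_url}"


def snapshot_url(page_url, timestamp):
    """URL of an archived copy.

    timestamp is YYYYMMDDhhmmss, or just the start of one (such as '2016'),
    in which case the Wayback Machine looks for the nearest capture.
    """
    return f"{WAYBACK}/web/{timestamp}/{page_url}"


page = "http://example.com/"
print(save_page_url(page))
print(snapshot_url(page, "20161108"))
```

The snapshot link is the one to put in your citation, since it points at a particular capture and not at the live page.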
Here’s an example:
As it loads, the exhibition uses the Trove API to pull in its content. But these API requests aren’t captured by the Internet Archive, so it just shows the fallback version.
The problems of archiving dynamic content have led to the development of new tools such as Webrecorder.io. This video gives a good introduction to the way it works:
Create a free account and record some web pages! Try some of the sites that didn’t work properly using the Wayback Machine and see if Webrecorder does the job. Try complicated sites with videos, or digital artworks. What works and what doesn’t?
Here’s Webrecorder’s version of the Chinese in NSW exhibition. Try navigating around the site.
Now it’s time to turn your Trove lists into a beautiful online exhibition. It’s easy with my DIY Trove Exhibition repository – all you need is a Trove API key and a GitHub account. Just follow the instructions in the repository.
There are plenty of other ways you can start to present your data and collections online without any coding. Here are some of my favourites: