ASALinks Archives Hacking 101

Under construction, 20 October 2016
This page is likely to be messy and incomplete. Check back later.


Hacking your archives

This is not a workshop about making websites, or designing cool interfaces. Sorry. Nor is it a survey of best practice in online archives. I’ve got no idea what that might be. It’s a workshop aimed at giving you a toolkit that you can use to start looking at archives in different ways.

I hope that you’ll go away from this workshop with a better understanding of what’s possible, a few tools to play around with, and some new ideas to work on.

We’re going to make use of a couple of web services, so you might as well start by setting up some (free) accounts:


Play time

Headline Roulette

I like to start off workshops with a quick round of Headline Roulette both because it’s fun (I hope) and because it gets us thinking about different ways we might use and explore cultural heritage data. In this case, Headline Roulette is drawing information about digitised newspaper articles from the Trove API.

Things to try


Hacking the web

After

This thing we call ‘the web’ isn’t a thing at all. We talk about ‘web publishing’ and ‘web pages’ as if they’re products of the print world. But if you look underneath what’s presented in your browser you’ll see each ‘page’ is an assemblage of files, standards, protocols, and technologies, all pulled together and rendered in a human-readable version by your browser.

This is important because we can play around with these layers. We don’t have to take the web we’re given. We can change it.

Let’s have a peek beneath the hood and see if we can break something…

  • Go to asio.gov.au (of course)

  • Right click somewhere on the page and select ‘Inspect’ or ‘Inspect Element’ from the menu. (You might need to use Chrome.)

  • A complicated looking control panel opens up that tells you all about the innards of the web.

  • Click on the ‘Network’ tab and then reload the page.

  • Whoosh! Did you see lots of little things fly past? That was your browser assembling all the bits and pieces to build the page. This one web page is made up of about 25 separate components.

  • Now click on the ‘Console’ tab.

  • Cut and paste the code in the box below into the console, then hit enter.

document.getElementsByTagName('body')[0].innerHTML = '<h1>All your secrets are belong to us!</h1>';

Warning: you’ve just hacked ASIO’s website! (No, not really… just reload to get it back.)

The ASIO web page exists in your browser, and you can change it. Try clicking on the ‘Elements’ tab – you can edit text, change colours, whatever you want. Go crazy!

Things to try

  • X-Ray Goggles – a great educational tool from The Mozilla Foundation. Deconstruct and remix websites. Share the results!

Hacking RecordSearch

Yes, hacking RecordSearch is one of my favourite things to do. In my ASALinks keynote I talked about some ‘userscripts’ I’d created to change the way RecordSearch looks and behaves. Just like our little ASIO hack, these userscripts rewrite the version of RecordSearch that lives in your browser.

By playing with X-Ray Goggles or creating your own userscripts you could redesign your institution’s website. Experiment with no risk!

Things to try

Here are my userscripts for you to try. Once you’ve installed a userscript manager just click on the Raw button and it should recognise the scripts and ask if you want to install them.


Your own GitHub repository

Enough fiddling around with other people’s websites – let’s make our own! Today we’re going to use an assortment of tools to build a simple dashboard that gives us a few new ways of looking at the John Ellis collection of protest photographs held by the University of Melbourne Archives.

First we need to set some things up:

This is a repository that contains all the bits and pieces of code that I’m going to be using today. It’s easy to make your own copy:

  • Click on the Fork button.

That’s it. GitHub should open up your brand new repository.

One of the nice things about GitHub is that as well as sharing code, it provides a simple web page hosting service. We’re going to use that today.

I’ve already set up the skeleton of a web site inside the docs directory. We just have to tell GitHub where our webpages are:

  • Make sure you’re in your copy of the repository.

  • Click on the ‘Settings’ tab.

  • Scroll down until you see ‘GitHub Pages’.

  • In the ‘Source’ dropdown list choose ‘master branch /docs folder’.

  • Click Save.

Your website will be published at http://[your GitHub username].github.io/asalinks-workshop/. You’ll find a link on your ‘Settings’ page. (It can take a few minutes before the site is ready.)

Things to try


Running the code

The repository also includes a couple of Python scripts that I’m going to use to prepare data for our dashboard. To run these yourself you need to have a few things set up on your own computer. I thought it would probably be a bit too much to fit into the workshop today, but you might like to try it out at a later stage.

First you need to make sure you have Python, Virtualenv, Pip, and Git.

Once that’s all installed, you’re ready to go:

  • Open up a terminal

  • Type virtualenv asalinks to create a new virtual environment in a directory called asalinks.

  • Type cd asalinks to move to the new directory.

  • Activate your virtual environment – on a Mac type source bin/activate.

  • You need to install my Trove-Python library using Pip. Just copy and paste this:

pip install git+https://github.com/Trove-Toolshed/trove-python.git#egg=trove_python-master

  • Now you can grab a copy of the asalinks-workshop repository. Just copy and paste:

git clone https://github.com/wragge/asalinks-workshop.git

  • Type cd asalinks-workshop.

  • We need to set up a few empty directories for later. Enter mkdir images, mkdir data and mkdir faces one at a time.

  • Now would be a good time to get yourself a Trove API key.

  • Once you have your key, open the file credentials-edit-me.py with your favourite text editor and paste in your key. Save the file as credentials.py.

If you want to try facial detection, you’re also going to have to install OpenCV. This can be a bit tricky – on a Mac it’s easiest if you use Homebrew. I don’t know of any really good docs, so I’d suggest you Google it.

You’ll also need to make sure OpenCV will work within your virtual environment. I create links like this:

ln -s /usr/local/lib/python2.7/site-packages/cv* ~/[your code directory]/lib/python2.7/site-packages

Getting data

We all have data. Lots of it. But it’s often surprisingly hard to get it in a form that you can play around with.

Fortunately the John Ellis photographs have been harvested by Trove, and we can use the Trove API to download all the metadata.

Are your organisation’s collections in Trove? Did you know that you can construct a Trove search limited to your collections?

If you want to explore Trove’s collection of collections you can make use of the advanced search to filter results by organisation:

  • Go to Trove and click on the ‘Advanced Search’ link.

  • Scroll down to ‘Library’.

  • Where it says ‘Enter the name of a library or institution’ type in a name or type of organisation – try ‘University of Melbourne Archives’ and click on ‘Find locations’.

  • You’ll see a list of matching collections. Check ‘University of Melbourne Archives’ and click Search.

You’ll see all the items from the University of Melbourne Archives in Trove.

You’ll notice that ‘(nuc:"VUMA")’ appears in the Trove search box. NUC stands for National Union Catalogue, and the code after ‘nuc:’ is a unique identifier for the Archives. NUC codes are handy because you can use them to construct links to specific collections in Trove. For example, a link to the University of Melbourne Archives collections on Trove is just: http://trove.nla.gov.au/result?q=(nuc:"VUMA"). See Trove help for more examples.

You can of course add extra keywords to a NUC search to filter the collection. Add “Ellis, John” to the search box to find the John Ellis photos.

We’re going to use a very similar query to retrieve data from the Trove API.

The Trove web interface is designed for humans. The Trove API, on the other hand, delivers data in a nice structured form that computers can understand. You can get an idea of how it works using the Trove API Console. Click on some of the potted queries. Try changing parameters – don’t worry, you can’t break anything.

Here’s the query for the John Ellis photos:

http://api.trove.nla.gov.au/result?q=nuc:"VUMA"+"Ellis,+John"&zone=picture&encoding=json&reclevel=full&include=workVersions

The include=workVersions is important because Trove sometimes groups photos with the same title as ‘versions’ of a single ‘work’. The workVersions parameter will make sure we get all of the grouped photos.
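To see how the pieces of that query fit together, here is a minimal Python 3 sketch that assembles the same URL from its parts. The function name build_query_url is made up for illustration and is not taken from harvest.py. The endpoint and parameters are simply the ones shown in the URL above, and passing the API key as a 'key' parameter is an assumption.

```python
from urllib.parse import urlencode

API_URL = "http://api.trove.nla.gov.au/result"  # endpoint as given in the text


def build_query_url(nuc, keywords, api_key=None):
    """Assemble a Trove API query for one organisation's pictures.

    Hypothetical helper for illustration; not part of harvest.py.
    """
    params = [
        ("q", 'nuc:"{}" "{}"'.format(nuc, keywords)),
        ("zone", "picture"),
        ("encoding", "json"),
        ("reclevel", "full"),
        ("include", "workVersions"),  # ask for every photo grouped under a 'work'
    ]
    if api_key:
        # Assumption: the API key travels as a 'key' parameter.
        params.append(("key", api_key))
    return API_URL + "?" + urlencode(params)


print(build_query_url("VUMA", "Ellis, John"))
```

The printed URL is a percent-encoded version of the query above (quotes, colon and comma are escaped), which is what a script would normally send.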

Here’s what the results of the query look like.

Like I said, APIs are for computers, not humans!

Ok, if you’ve set up your Python environment as described above, you can try harvesting some data. Type python inside the asalinks-workshop directory to open up the Python interpreter. Then:

>>> import harvest
>>> harvest.do_harvest()

The harvest script asks Trove for results in batches of 100. It then works through each result, saving the data to a CSV file and downloading a copy of each image. You could easily modify the script to harvest other collections or content.
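The batch-by-batch pattern described here can be sketched roughly as follows. This is not the code in harvest.py: harvest_in_batches and fetch_batch are hypothetical names, the API request is replaced by a stand-in function so the sketch runs without touching the network, and the image download step is left out.

```python
import csv
import os
import tempfile


def harvest_in_batches(fetch_batch, csv_path, fieldnames, batch_size=100):
    """Page through results and write each record as a CSV row.

    fetch_batch(start, size) stands in for the real API request. It should
    return a list of dicts, and an empty list when nothing is left.
    """
    start = 0
    total = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        while True:
            records = fetch_batch(start, batch_size)
            if not records:
                break  # no more results, so stop asking
            for record in records:
                writer.writerow(record)
            total += len(records)
            start += batch_size  # move on to the next batch
    return total


# Invented records standing in for API results.
fake_records = [{"id": str(i), "title": "Photo {}".format(i)} for i in range(250)]


def fake_fetch(start, size):
    """Return one slice of the invented records, like one page of results."""
    return fake_records[start:start + size]


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
    print(harvest_in_batches(fake_fetch, demo_path, ["id", "title"]), "records saved")
```

Keeping the request behind a function like fetch_batch is one way to point the same loop at a different collection later.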

Once the script has finished you’ll have a CSV file containing more than 900 records, and a folder with the same number of images. Here’s some I prepared earlier:

Things to try

  • Would you like your own collection of ASIO files? The code I used to harvest them is here together with some instructions. You could use it to harvest any series from RecordSearch.
  • Sometimes the data you want is online, but not in a convenient form like a CSV or API. If that’s the case you might have to resort to screen scrapers like Import.io – try this tutorial to see how it works.
  • DataBasic.io’s WTFcsv – learn how to find out what’s hiding in CSV files, starting with The Titanic!

Sites to visit

  • Pre-harvested data – a small (and slightly weird) collection of data sources that I’ve packaged up for easy access. Hansard, ASIO files, faces & more!
  • Building with Trove – Trove is not just a website, it’s a platform! Grab collections data from the Trove API and build things.

Viewing images

Now we have the data we can start to play! First of all it would be nice to see the images. There are lots of ways we might do this, but to keep things simple I’m going to use Javascript to add the images to a web page in our demo site.

The harvest.py script includes a little function that takes the CSV file and turns it into JSON – it’s just another way of representing structured data, but it’s the one that Javascript understands. Again start up Python and then:

>>> import harvest
>>> harvest.save_json()

And here’s the result.
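If you're curious what a CSV-to-JSON conversion involves, here is a minimal Python 3 sketch using only the standard library. It is an illustration of the general idea, not the save_json function from harvest.py, and the name csv_to_json and the file paths are assumptions.

```python
import csv
import json


def csv_to_json(csv_path, json_path):
    """Read every CSV row as a dict and save the whole list as JSON."""
    with open(csv_path, newline="", encoding="utf-8") as csv_file:
        # DictReader uses the header row as the keys for each record.
        rows = list(csv.DictReader(csv_file))
    with open(json_path, "w", encoding="utf-8") as json_file:
        json.dump(rows, json_file, indent=2)
    return len(rows)
```

Calling csv_to_json with the path of a harvested CSV file and an output path would leave you with a JSON array holding one object per photo, which is a shape Javascript can loop over directly.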

I’ve already copied this data file and the images over to our demo website, and written a little bit of Javascript to load the images. I’m also using the Packery library to pack all the images on the page. Here’s the demo:

https://wragge.github.io/asalinks-workshop/grid.html

Change the URL above to view the same page in your own repository. It may be a little slow to load the first time. I normally wouldn’t try and load 900 images in one go, but it’s ok for a demo. If you reload the page you’ll see the images in a different random order.

Let’s try changing a few things:

  • Go to your GitHub repository and click on the docs directory.

  • Click on grid.html.

  • One cool thing about GitHub is that you can edit files. Click on the pencil icon to start editing.

  • Find the line that says <h1>ASALinks demo</h1>. Change the ‘ASALinks demo’ part to something a bit more interesting.

  • When you’re finished click on the green Commit changes button.

While we’re at it, let’s limit the number of photos displayed at a time to speed things up a bit.

  • Under docs click on the js directory.

  • Click on script-grid.js to open it.

  • Click on the pencil icon to start editing.

  • Look for the line that says data = shuffle(data);. Add a new line after it and insert data = data.slice(0, 50);. It just says that we want the first 50 records.

  • When you’re finished click on the green Commit changes button.

Now reload the demo page (it might take a minute or two for the changes to flow through). That’s a lot faster!

Hey look! I found myself!

Protest photo

Plotting data

There are many data visualisation tools around these days that can help you get a different perspective on your metadata. We’re going to use an online charting tool called Plot.ly to show us the dates of photos in the Ellis collection.

There is no date field in the Ellis records on Trove. However, dates appear frequently in the titles of photos. I’ve created a simple function to look for years in titles, and count the number of times each year appears.

The key thing in the function is a regular expression – a way of defining a pattern of characters that I want to find. In this case the pattern \b(19\d{2})\b says that I want to find four-digit numbers that start with ‘19’ and stand on their own, rather than sitting inside a longer string of letters or digits. Regular expressions can be very powerful and it’s worth getting to know how they work.
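Here is a small Python 3 sketch of that pattern at work. The pattern is the one quoted above, but count_years and the sample titles are invented for illustration and are not taken from harvest.py or the real records.

```python
import re
from collections import Counter

# The pattern from the text: a stand-alone four-digit number starting with 19.
YEAR_PATTERN = re.compile(r"\b(19\d{2})\b")


def count_years(titles):
    """Count how many times each 19xx year appears across the titles."""
    counts = Counter()
    for title in titles:
        counts.update(YEAR_PATTERN.findall(title))
    return counts


sample_titles = [  # invented examples, not real records
    "Moratorium march, Melbourne, 1970",
    "May Day rally 1975",
    "Demonstration 1970 (negative 1901A)",
]

for year, count in sorted(count_years(sample_titles).items()):
    print(year, count)
```

Note that ‘1901A’ in the last title is not counted: the \b at the end of the pattern needs a word boundary, and there isn't one between the final digit and the letter that follows it.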

To find the dates we just start Python and:

>>> import harvest
>>> harvest.start_dates()

The results are saved to a CSV file. Here’s what it ends up like.

If you haven’t already, get yourself a free Plotly account and log in.

  • Download the years.csv and save it somewhere you can find it again.

  • From the Plotly home page click +Create and select Chart.

  • At the top of the page click on the ‘Switch back to Plot.ly 1’ link. (The new version is still missing some things.)

  • Click on Import and select Upload a file.

  • Select the file you just downloaded.

  • From the dropdown on the left choose ‘Bar chart’.

  • Click on the blue Bar chart button.

We have a chart! Now to make it a bit nicer.

  • Click on ‘Click to enter Plot title’ and enter a suitable title. Hit enter.

  • Do the same for the labels on the X and Y axes.

  • Click on Axes on the left menu.

  • Select ‘X axis’ from the dropdown.

  • Click on the ‘Ticks’ tab.

  • Click on ‘Linear’ to show all years.

  • Click on the ‘Labels’ tab and change the angle to -90.

That’ll do! Now we’re going to get a copy of this chart that we can embed in our dashboard.

  • Click on Share.

  • Click on Save.

  • Click on the ‘Embed’ tab and copy the text in the box.

  • Now go back to your GitHub repository and open the index.html file in your docs directory.

  • Click on the pencil icon to start editing.

  • Paste the Plotly code just after the <h1>ASALinks demo</h1> title line.

  • Click on the Commit changes button.

Reload your web page. You have a chart! One of the great things about Plot.ly charts is that they’re interactive. Visitors can even open your data and make their own charts.

  • You may think that the proportions of your chart are a bit odd. Edit the page again and change the height attribute of the chart to 500.

While we’re here, why not change the title of this page as well. Let’s also add a link to the photos page.

  • Open the page for editing as before.

  • Under the title add the following code:

<ul class="nav nav-pills">
<li><a href="grid.html">View photos</a></li>
</ul>
  • Commit your changes and reload your web page.

You can also use Plot.ly’s Javascript Graphing Library to build charts in your own site (not embedded). You can then define behaviours so that, for example, if someone clicked on ‘1978’ they’d load all the images from 1978. That’s what I’ve done on the Closed Access site. If you play around you’ll see that most of the charts are ways of exploring the files themselves.

Things to try

  • Here’s another activity I’ve written for Plot.ly if you’d like some more practice.
  • There’s also a lesson I gave for my ‘Exploring digital heritage’ class that works through a number of different data visualisation tools. (Scroll past the ASIO stuff!)

Text as data

In a word

Sometimes the text is the data. Using digital tools we can break texts down into their component parts – words, phrases, and parts of speech – and manipulate them. How are certain words used within collections of texts? We can analyse things like occurrence, frequency, and context to better understand what’s going on.
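As a rough illustration of frequency and context, here is a minimal Python 3 sketch that counts words and lists each occurrence of a keyword with its neighbours. The function names and the sample text are invented for this example, and the tokenising is deliberately crude (lower-case letters and apostrophes only).

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into simple words."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text, top=5):
    """Return the most common words with their counts."""
    return Counter(tokenize(text)).most_common(top)


def keyword_in_context(text, keyword, window=2):
    """List each occurrence of keyword with a few words either side."""
    words = tokenize(text)
    hits = []
    for i, word in enumerate(words):
        if word == keyword:
            hits.append(" ".join(words[max(0, i - window): i + window + 1]))
    return hits


sample = (  # invented text, not from the collection
    "Anti-war march in Melbourne. Students march past the Town Hall. "
    "Police watch the march."
)

print(word_frequencies(sample, top=2))
for line in keyword_in_context(sample, "march"):
    print(line)
```

Real text analysis tools do a lot more (removing common words, handling punctuation and names), but the underlying moves are much the same: split, count, and look at the neighbours.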

Once again we’re going to do a bit of manipulation of our photos dataset. This time we’re just going to extract the text fields – the title and description – and save them to a new file.

>>> import harvest
>>> harvest.save_text()

And here’s the descriptions.txt file that’s produced.
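For a sense of what this extraction step involves, here is a minimal Python 3 sketch. It is not the save_text function from harvest.py; the name save_text_fields and the column names 'title' and 'description' are assumptions about how the CSV is laid out.

```python
import csv


def save_text_fields(csv_path, text_path, fields=("title", "description")):
    """Write the chosen text fields of every record to a plain text file."""
    lines_written = 0
    with open(csv_path, newline="", encoding="utf-8") as csv_file, \
            open(text_path, "w", encoding="utf-8") as text_file:
        for row in csv.DictReader(csv_file):
            for field in fields:
                # Skip fields that are missing or empty for this record.
                value = (row.get(field) or "").strip()
                if value:
                    text_file.write(value + "\n")
                    lines_written += 1
    return lines_written
```

The result is one line of text per non-empty field, which is all a tool like Voyant needs to start counting words.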

We’re going to analyse this file using Voyant Tools, and create another widget for our dashboard.

  • Download descriptions.txt

  • Go to Voyant Tools.

  • Click on Upload and select the file you’ve downloaded.

  • The default corpus view gives you a number of tools for exploring your texts. Have a play around!

We’re going to embed a copy of the word cloud.

  • Hover on the bar above the word cloud and click on the ‘Export a url’ icon.

  • Click on ‘Export view’ and check ‘an HTML snippet’.

  • Copy the code in the box.

  • Go back to your repository and open the index.html file for editing.

  • Paste the Voyant code after the title line.

  • Commit your changes.

Reload your web page. Hmmm… once again you might want to adjust the height and width of the Voyant window.

There are many other text analysis tools and techniques that might be useful. Have a look, for example, at the links below on topic modelling, which is a statistical method for finding clusters or themes within a set of texts (and texts could just mean file titles!).

Named entity recognition is also a useful and powerful technique for finding ‘things’ – people, organisations, places – within unstructured text.

For a quick example of named entity recognition go to AlchemyAPI’s demo site. AlchemyAPI provides online services such as entity recognition.

  • Click the ‘url’ tab on the ‘Analyze text’ box and enter this link to our descriptions.txt file.
https://raw.githubusercontent.com/wragge/asalinks-workshop-demo/master/data/descriptions.txt
  • Click analyze!

What do you see?

Things to try

  • There are many more tools and options in Voyant, so you might want to look at the Getting Started guide.
  • Once again I gave a class on text analysis that has a variety of different tools and examples.
  • DataBasic.io’s WordCounter and SameDiff – some simple and fun tools that introduce you to the possibilities of exploring text as data. Compare the lyrics of Beyonce and Aretha Franklin!
  • Getting started with Topic Modelling and MALLET by Shawn Graham, Scott Weingart and Ian Milligan, Programming Historian – topic modelling is a way of finding ‘themes’ or ‘topics’ within a collection of texts.

Sites to visit

Finding faces

Real face of White Australia

Computers are getting better at seeing. The things we take for granted, like the ability to recognise a face, are challenging tasks for computer vision. But recent years have brought great advances.

Computers can be taught to find shapes and patterns within images. Facial detection (finding a face in a photo) is pretty straightforward. This offers interesting possibilities for historians, but the use of such technologies for surveillance also presents political and social challenges.

We’re going to try and extract faces from the John Ellis photos and use them as another way into the collection.

Again I’ve created a basic script to do the job. It’s actually fairly simple. To run it:

> python face_detect.py

You end up with a folder full of faces! You’ll see that there are a few false positives, but it’s not too bad. We’ve probably missed quite a few because the images are small. You can adjust the scaleFactor, minNeighbors and minSize settings in the script to see if you can improve the accuracy.
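To show where those three settings fit, here is a minimal Python 3 sketch of face detection with OpenCV's cascade classifier. It is not the face_detect.py script: the function names, the padding step and the default values are illustrative assumptions, and cascade_path needs to point at a cascade file on your own machine (for example the haarcascade_frontalface_default.xml file that is distributed with OpenCV).

```python
def pad_box(x, y, w, h, pad, img_w, img_h):
    """Grow a face box by pad pixels on each side, staying inside the image."""
    left = max(0, x - pad)
    top = max(0, y - pad)
    right = min(img_w, x + w + pad)
    bottom = min(img_h, y + h + pad)
    return left, top, right, bottom


def extract_faces(image_path, cascade_path, out_prefix,
                  scale_factor=1.1, min_neighbors=5, min_size=(30, 30), pad=10):
    """Detect faces in one image and save each one as its own file."""
    import cv2  # imported here so pad_box can be used without OpenCV installed

    classifier = cv2.CascadeClassifier(cascade_path)
    image = cv2.imread(image_path)
    if image is None:  # the file was missing or unreadable
        return 0
    grey = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = classifier.detectMultiScale(
        grey,
        scaleFactor=scale_factor,    # how much the image shrinks at each pass
        minNeighbors=min_neighbors,  # higher values mean fewer false positives
        minSize=min_size,            # ignore anything smaller than this
    )
    img_h, img_w = grey.shape[:2]
    for index, (x, y, w, h) in enumerate(faces):
        left, top, right, bottom = pad_box(x, y, w, h, pad, img_w, img_h)
        cv2.imwrite("{}-{}.jpg".format(out_prefix, index), image[top:bottom, left:right])
    return len(faces)
```

With small images like these, lowering min_size or min_neighbors may pick up more faces, at the cost of more false positives to weed out afterwards.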

I’ve removed the false positives and included the faces in the demo website. I also generated a little data file to load all the faces onto a page. Try:

https://wragge.github.io/asalinks-workshop/faces.html

Again, change this link to point to your own repository.

By now you should be able to edit this page to change the title.

You could also try adding a link to the index.html page. (Just copy the link to the photos page, but change grid.html to faces.html.)

I’ve also created our own little embeddable widget. Open index.html for editing again and paste in this code:

<iframe width="500" height="100" frameborder="0" src="face-widget.html"></iframe>
  • Commit your changes and reload your page!

Try changing the height and width of the iframe.

Would you like more faces?

  • Go to the js folder and open up script-widget.js for editing.

  • Find the data = data.slice(0, 20) line and change 20 to whatever you like! (But note that you might need to increase the size of the iframe to see them.)

Things to try

Sites to visit

Homework