Draft, 28 August 2016
This page should be in a useful state, but still needs work before it's finished.
There’s a lot of hype around data. ‘Big data’ in particular is talked up as the source of new insights in research and business. But what is the ‘data’ that we’re interested in as cultural heritage workers? Miriam Posner’s article explores some of the complexities of ‘data’ from a humanities perspective. She notes that humanities researchers are:
not generally creating data through experimentation or observation — more often than not, we’re mining data from historical documents. You name it, we’ve tried to mine it, from whaling logs to menus to telephone directories.
Rather than working with the outputs of machines, we’re usually working with artefacts created by people which, as Miriam notes, are often ‘eccentric, disparate, and historically and geographically uneven’. Nonetheless, working with this information in a digital environment allows us to do and see things differently. I wrote a bit about this in ‘“A map and some pins”: Open data and unlimited horizons’.
This week we’re going to look at some varieties of data that can be useful within the cultural heritage sector. Where do you find it? How do you get it in a usable form? How can you clean up some of the messiness?
Let’s start off with something simple. The Australian Data Archive provides historical census data from 1833 to 1901. The data is available in a variety of formats. Some of it is still locked up in scanned images, so you’d have to transcribe it before you could do anything interesting. But some of it is available as nicely formatted HTML tables.
The federal and most of the state governments maintain portals to help you find government datasets. This includes an increasing amount of data from cultural heritage organisations. Here are the ones I know about:
Have a poke around on these sites to see what you can find. If you’re not sure where to start, try searching for things like ‘archives’, ‘history’, ‘heritage’, or ‘library’. Here are a few examples:
Note that the first two of these are available as CSV (Comma Separated Values) files. CSV files are like spreadsheets and can be easily opened and used in tools like Excel, or many of the web services we’ll be playing with in coming weeks. The ‘Portraits and people’ data is available as an XML file. XML enables you to represent hierarchies within your data, so you can have more complex structures than in a flat CSV file. Many of the tools we’ll look at can also use XML.
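To get a feel for what a CSV file actually looks like under the hood, here’s a minimal sketch in Python using the standard `csv` module. The data itself is made up for illustration, but it has the same shape as the collection datasets above: a header row naming the columns, then one record per line.

```python
import csv
import io

# A made-up sample in CSV form: a header row, then one record per line.
sample = """title,year,place
"Portrait of a settler",1888,Sydney
"Goldfields view",1862,Ballarat
"""

# DictReader uses the header row to turn each record into a dictionary,
# so you can refer to columns by name rather than position.
rows = list(csv.DictReader(io.StringIO(sample)))

for row in rows:
    print(row["title"], "|", row["year"], "|", row["place"])
```

To work with a real downloaded file, you’d replace `io.StringIO(sample)` with `open("yourfile.csv", newline="")` — everything else stays the same.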
Once you’ve found an interesting dataset, think about how you might use it in your project. Could you visualise the data as a chart? Locate it on a map? Link it to other sources of data?
If you have a CSV file, but you’re not sure what to make of it, WTFCSV is a fun little tool that can provide some useful insights. Try one of their sample datasets, or upload your own. What does it tell you about the data?
I’ve mentioned APIs (Application Programming Interfaces) before. They provide data in a form that computers can understand, and allow us to build new tools and interfaces such as Headline Roulette. To make use of an API you generally need some coding experience, but not always. Yvonne Perkins has written a great tutorial to show how you can use Excel to harvest newspaper articles from Trove using its API. My DIY exhibitions and Headline Roulette also allow you to make use of the Trove API without any coding.
To understand how the Trove API works, play around with my Trove API Console. Just click on a few of the examples and see what happens. Look for ‘q=’ in the query box and try changing the value that comes after it. Don’t worry – you can’t break anything!
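If you’re curious about what the console is doing behind the scenes, an API request is just a URL with some parameters attached. Here’s a sketch in Python of building such a query URL. The endpoint and parameter names reflect my understanding of the Trove API, and `YOUR_API_KEY` is a placeholder — check the Trove API documentation for the current details before relying on them.

```python
from urllib.parse import urlencode

# Assumed Trove API endpoint -- confirm against the current API docs.
BASE = "http://api.trove.nla.gov.au/result"

params = {
    "q": "wragge",          # the search term -- the 'q=' value in the console
    "zone": "newspaper",    # which Trove zone to search
    "encoding": "json",     # ask for JSON rather than the default XML
    "key": "YOUR_API_KEY",  # placeholder: replace with your own API key
}

url = BASE + "?" + urlencode(params)
print(url)
```

Changing the value of `"q"` here is exactly the same as editing ‘q=’ in the console’s query box — the console just saves you building the URL by hand.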
The Trove Help Centre provides more information about the API, including a number of examples and experiments.
Would you like a dataset containing hundreds or thousands of newspaper articles on the topic of your choice? If you do then my Trove Harvester can help! It’s a tool that uses the API to save data about newspaper articles to a CSV file for further analysis. There’s detailed documentation here – it requires a bit of setup and the good old command line, but it’s pretty simple to use.
Sometimes the data you want isn’t available in a convenient form like CSVs or APIs. But if it’s on a web page, you might be able to extract it using a process called screen scraping. Remember how X-Ray Goggles showed you the structures underlying web pages? Screen scraping uses those structures to identify the bits of data you want on a web page – once you’ve identified them, they can be extracted and saved in a nice structured format, such as a CSV file.
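To make the idea concrete, here’s a minimal sketch of screen scraping using only Python’s built-in `html.parser`. The HTML fragment is made up for illustration, but the approach is the real thing: walk the page’s structure, pick out the tags that hold your data (here, table cells), and collect their contents into rows you could save as CSV.

```python
from html.parser import HTMLParser

# A made-up fragment of the kind of HTML table you might want to scrape.
page = """
<table>
  <tr><td>Edmund Barton</td><td>Protectionist</td></tr>
  <tr><td>Alfred Deakin</td><td>Protectionist</td></tr>
</table>
"""

class TableScraper(HTMLParser):
    """Collect the text of every <td> cell, grouped into rows."""
    def __init__(self):
        super().__init__()
        self.rows = []       # completed rows of cell text
        self._row = None     # the row currently being built
        self._in_td = False  # are we inside a <td> right now?

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self._row.append(data.strip())

scraper = TableScraper()
scraper.feed(page)
print(scraper.rows)
```

Tools like Import.io do this same structure-matching for you through a point-and-click interface, which is why we’ll use one of those rather than writing parsers by hand.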
I’ve spent a lot of time screen scraping data from sources such as the National Archives of Australia (here’s 70GB worth of ASIO files, for example). But fortunately there’s a growing assortment of tools to make the process much less painful. Let’s try one!
Follow the instructions in Screen scraping with Import.io to create your very own dataset of members of the House of Representatives.
As we noted, cultural heritage data can often be messy and inconsistent. Fortunately there’s a great tool for cleaning up messy data – OpenRefine.
Intersect have prepared a useful tutorial to introduce you to the power of OpenRefine. In particular, its ability to cluster variations in spelling and standardise them all with a single click is just like magic. It can save you many hours of painstaking manual work.
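If you’re wondering how that clustering ‘magic’ works, here’s a simplified sketch in Python of OpenRefine’s ‘fingerprint’ keying idea: lowercase each value, strip punctuation, then sort and dedupe its words, so variant spellings and word orders collapse to the same key. This is a loose reimplementation for illustration, not OpenRefine’s actual code.

```python
import string
from collections import defaultdict

def fingerprint(value):
    """Simplified 'fingerprint' key: lowercase, strip punctuation,
    then sort and dedupe the words, so variants share one key."""
    value = value.strip().lower()
    value = value.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(value.split())))

# Made-up messy values of the kind you'd meet in heritage data.
names = [
    "State Library of NSW",
    "state library of n.s.w",
    "Library, State, of NSW",
    "National Archives",
]

# Group values that share the same fingerprint into clusters.
clusters = defaultdict(list)
for name in names:
    clusters[fingerprint(name)].append(name)

for key, members in clusters.items():
    print(key, "->", members)
```

The first three values all reduce to the same key, so OpenRefine would offer to merge them into one standard spelling with a single click — that’s the hours of manual work it saves you.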
To use OpenRefine you’ll need to download and install it. Use the latest version, ‘2.6-rc2’.
Once you have it installed, work through the tutorial. Try to at least complete up to section 7. If you’re feeling adventurous, go on to try some geocoding in section 8. We’ll cover geocoding in a few weeks when we play around with maps. Don’t worry about section 9, as the State Records of NSW website is currently undergoing some changes and the examples probably won’t work.
There’s another OpenRefine tutorial using collection data from the Powerhouse Museum on the Programming Historian site.