UC10153 -- Working with collections -- Week 11

Draft, 17 October 2016
This page should be in a useful state, but still needs work before it's finished.

REMINDER: No on-campus workshop this week – work through the activities and readings below and share the results on Slack or in your Moodle post. Have fun!

Enriching collection data

We’ve talked a lot about the importance of creating good quality metadata throughout this unit. Metadata captures the characteristics and context of collections to support management, preservation and discovery.

But what can you do when your metadata isn’t up to scratch? Perhaps you’ve inherited a spreadsheet of collection items that has been added to by a series of volunteers – each with their own ideas about what makes a good description. How might you go about imposing some level of standardisation on the metadata?

Or perhaps you’ve done as much descriptive work as you can given your available time and resources. Are there ways of moving beyond these limits, taking advantage of volunteers or even automated systems to add further metadata such as tags?

What about converting your existing metadata into other forms? In our discussion of Linked Open Data, we’ve seen how collection records can be enriched by linking to identity records for people and places. What about converting place names to coordinates so we can plot them on a map?

There are many ways we can improve and enrich collection metadata. This week we’re going to play around with a few useful tools and approaches.


The tool of choice for cleaning up messy metadata is OpenRefine (formerly known as Google Refine). It looks a lot like a normal spreadsheet, but it offers powerful options for finding and fixing all those annoying inconsistencies in your data.

This video gives a useful introduction to the basic functions of OpenRefine.

The best way to understand what’s possible is to try it out. There are lots of online tutorials available, but I’d suggest you start with this one from Intersect:

  • Download the tutorial (it’s a pdf)

  • Work through sections 1 to 7. This will walk you through installation, creating a project, and basic techniques for organising, clustering, and cleaning your data. Spend time getting to know the clustering feature in particular – play around with the different clustering techniques to get a sense of how they work.

  • If you’re feeling confident you can move on to section 8, which will show you how to automatically find coordinates (latitude and longitude) of the places in your dataset. This is a process known as geocoding or geolocation – we’ll look at other ways of doing this below.

  • Don’t do section 9 – the State Records NSW API doesn’t seem to be working at the moment.
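
If you’d like a feel for what clustering is doing behind the scenes, here’s a minimal Python sketch of the ‘key collision’ idea: each value is boiled down to a normalised ‘fingerprint’, and values that end up with the same fingerprint are grouped as likely variants of one another. This is a simplified illustration rather than OpenRefine’s own code, and the function names and the sample list of maker names are made up for the example.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a normalised key so that near-identical strings collide."""
    # Fold accented characters to their closest ASCII equivalents.
    ascii_text = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Lowercase the text and strip out punctuation.
    cleaned = ascii_text.lower().translate(
        str.maketrans("", "", string.punctuation)
    )
    # Split into words, drop repeated words, sort them, and rejoin.
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values by fingerprint, keeping only groups with variants."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {
        key: sorted(variants)
        for key, variants in groups.items()
        if len(variants) > 1
    }


# Hypothetical column of maker names entered by different volunteers.
makers = [
    "Smith, John",
    "John Smith",
    "smith john ",
    "Jane Doe",
    "DOE, Jane",
    "Mary Lee",
]
clusters = cluster(makers)
for key, variants in clusters.items():
    print(key, "<-", variants)
```

Differences in word order, capitalisation and punctuation all disappear in the fingerprint, which is why ‘Smith, John’ and ‘John Smith’ land in the same group. A misspelling such as ‘Jon Smith’ would not be caught this way, and that’s where the other clustering techniques come in.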

There are a number of other tutorials available on the Free Your Metadata site. Their ‘Cleanup’ tutorial covers some similar ground to Intersect’s, but includes useful tips for finding and removing duplicates, and splitting column values. It also comes with a helpful screencast!

There’s another version of this tutorial available on the Programming Historian site.

The other tutorials on Free Your Metadata cover more advanced techniques such as reconciliation and named entity extraction. Reconciliation is the process of matching things in your own dataset with external standards, vocabularies, or identifiers. It can help turn your collection into Linked Open Data. Named Entity Extraction finds the names of people, organisations, and places within free text fields. Once they’re identified they can be used as tags, or fed through the reconciliation process to match with authority lists.

The reconciliation tutorial shows you how you can match categories in the Powerhouse collection database with subject headings from the Library of Congress. The creators of the tutorial have also prepared a paper that examines in detail the usefulness of this technique for the GLAM sector. Even if you don’t work through the tutorial, it’s worth having a look to see what’s possible.
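
To make the idea of reconciliation a bit more concrete, here’s a rough Python sketch that matches local category terms against a small authority list using fuzzy string matching from the standard library. Everything in it is an assumption made for illustration: the headings, the ‘ex:’ identifiers and the 0.8 similarity cutoff are invented, and a real reconciliation service works against a far larger vocabulary with more sophisticated matching.

```python
import difflib

# Hypothetical authority list: preferred heading -> made-up identifier.
AUTHORITY = {
    "Photography": "ex:0001",
    "Textile fabrics": "ex:0002",
    "Agricultural machinery": "ex:0003",
}


def reconcile(term, authority=AUTHORITY, cutoff=0.8):
    """Return (heading, identifier, score) for the closest heading, or None."""
    # Compare in lowercase so that capitalisation doesn't affect the match.
    lookup = {heading.lower(): heading for heading in authority}
    query = term.strip().lower()
    matches = difflib.get_close_matches(query, list(lookup), n=1, cutoff=cutoff)
    if not matches:
        # Nothing similar enough: leave the term for a person to review.
        return None
    heading = lookup[matches[0]]
    score = difflib.SequenceMatcher(None, query, matches[0]).ratio()
    return heading, authority[heading], round(score, 2)


# Hypothetical category terms taken from a collection spreadsheet.
for term in ["photography ", "Agricultural machinary", "Ceramics"]:
    print(repr(term), "->", reconcile(term))
```

Terms that fall below the cutoff come back as None instead of being forced onto the nearest heading. That mirrors the way reconciliation usually works in practice: confident matches are accepted automatically, and doubtful ones are left for a human to judge.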


Last week we talked about OCR and transcription in the context of digitisation. As we noted, OCR results can be a bit messy and Trove lets users correct the OCR to improve the accuracy of its search results. This is one example of how cultural institutions are seeking the help of their users in cleaning, enriching and extending their collection data.

This collaboration between institutions and their publics is often called crowdsourcing – though, as Trevor Owens points out, it’s really not a very useful term. If you’d like to learn more about what crowdsourcing is (and isn’t), Mia Ridge’s FAQ is a good place to start.

Zooniverse is a crowdsourcing platform that hosts a wide variety of projects. It started off in the sciences, but has been steadily adding interesting cultural heritage initiatives. Browse to their History section and explore the projects on offer.

You’ll note that the projects tend to fall into two broad categories:

Within these outlines, projects might assign volunteers additional tasks such as categorising documents, or marking up texts.

Try out some of the projects above, and think about how they might be useful in enriching collection data. You can even build your own project using the Zooniverse platform.

The New York Public Library has been one of the leaders in this field, and their projects are often held up as exemplars of what’s possible.

  • What’s on the menu? is one of everybody’s favourite crowdsourcing projects. Make sure you read the about page to understand the context. How has this project changed access to the Library’s menu collection?

  • Building Inspector is another highly innovative project in which volunteers extract and correct information from historic maps. Note the slogan – ‘Kill time. Make history.’

A lot of questions remain about why people get involved in crowdsourcing projects and what best practice looks like in the cultural heritage sector. Crowdsourcing is best thought of as a collaboration, rather than a source of cheap labour. By working with their communities, cultural organisations can open their collections to new forms of analysis and discovery.

Metadata games?

One way of getting people involved in crowdsourcing projects is to turn the whole thing into a game. A project in the US has been doing just that – devising a whole series of ‘metadata games’.

Most of the games tend to revolve around tagging – getting players to provide descriptive information about collection items. Play a few and see what you think. Could this be a good way of getting the public involved with collections?


One common enrichment task is to geolocate places mentioned in collection descriptions. If you completed section 8 in the OpenRefine tutorial you will have seen how you can use the Google Maps API to obtain latitudes and longitudes. But there are a few simpler tools around that can help you get started.
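
Before turning to those tools, it may help to see that geocoding is, at heart, a lookup: a place name goes in and a pair of coordinates comes out. The Python sketch below uses a tiny hand-made gazetteer in place of a real geocoding service. The place list, the approximate coordinates, the sample records and the function names are all assumptions for illustration only.

```python
# Hypothetical mini-gazetteer: place name -> (latitude, longitude).
# The coordinates are approximate and only here for illustration.
GAZETTEER = {
    "canberra": (-35.28, 149.13),
    "sydney": (-33.87, 151.21),
    "melbourne": (-37.81, 144.96),
}


def geocode(place):
    """Look up a place name, ignoring case and stray whitespace."""
    key = " ".join(place.lower().split())
    return GAZETTEER.get(key)


def geocode_records(records, field="place"):
    """Return copies of the records with latitude and longitude added."""
    enriched = []
    for record in records:
        coords = geocode(record.get(field, ""))
        new_record = dict(record)
        # Unknown places get empty coordinates rather than a guess.
        new_record["latitude"], new_record["longitude"] = coords or (None, None)
        enriched.append(new_record)
    return enriched


# Hypothetical collection records with a free-text place field.
records = [
    {"title": "Harbour view", "place": " Sydney"},
    {"title": "Parliament House", "place": "CANBERRA"},
    {"title": "Unidentified street", "place": "Somewhere"},
]
for row in geocode_records(records):
    print(row["title"], row["latitude"], row["longitude"])
```

A real geocoder also has to cope with ambiguous names, since plenty of towns share a name, and with historical place names that no longer appear on modern maps. That’s why it’s always worth checking a sample of the results by eye.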