UC10153 -- Working with collections -- Week 9

Draft, 01 October 2016
This page should be in a useful state, but still needs work before it's finished.

Discovery, access, and use

How do people find things in cultural heritage collections?

In our last session we explored the importance of documenting context, both for preserving the value of collections, and for opening up new avenues for exploration.


While both cultural heritage professionals and researchers understand the value of context, where do we go when we want to find something online?

Yep, Google. A 2013 study of Dutch scholars highlighted the ‘paradoxical attitude’ of scholars who, while emphasising the importance of provenance and context to research, nonetheless relied heavily on Google to discover resources.

What’s wrong with that? Nothing – as long as we understand the biases and limitations of search engines like Google. As the study notes, search engines ‘do not simply retrieve information, but co-produce information by ranking and indicating the importance of information’. The results we see when we search for something in a search box are the result of complex ranking algorithms that try to predict what will be most relevant.

Most relevant to who? Your results might not be the same as mine. Search results are increasingly personalised based on things like your location and web browsing activity. The exact formulae used for results ranking are largely unknown. Search engine algorithms tend to be ‘black boxes’ – we use them, but we don’t really understand how they work.

Try the following search in Google: site:trove.nla.gov.au. The site: modifier is a handy little trick that tells Google to limit results to a particular site or domain – in this case Trove. How many results does Google find? When I tried, Google thought there were ‘About 1,900,000 results’. But you might get a different estimate – it can vary between computers and users. Don’t be tricked into thinking that Google tells the ‘truth’.

Hopefully you will remember that Trove holds considerably more than 1,900,000 resources (try half a billion). So what’s going on? As well as the mysteries of results ranking, and the inaccuracy of its estimates, Google only indexes what Google indexes. While Search Engine Optimisation is now a business in itself, you simply can’t make Google index all your stuff. All you can do is to try and make your online content as Google-able as possible. And hope.

Discovery happens elsewhere

Back in 2007, the library thinker Lorcan Dempsey coined the phrase ‘discovery happens elsewhere’ to encourage libraries to think about how they were exposing their metadata to the world. It’s not enough to have your own shiny web interface because, like it or not, a large proportion of your users will prefer to use Google.

Ok, so I just said that we really don’t have any control over what Google indexes and displays. That’s true, but there are things we can do to make collections more discoverable on the open web.

  • Use persistent urls. If you want people to find your stuff DON’T CHANGE YOUR URLS! It seems sort of common sense, but it’s amazing the number of times institutions install new systems or redesign their sites and break all their links. Have a look at ‘Cool URIs don’t change’ for the basic arguments. And if you don’t think this is a problem, have a look at the Wikipedia article on Link Rot.

  • Expose links to your collection items. If the only access to your collection is through a search box, Google can’t index your collection. You need to provide ways for search engines (and human beings) to browse your collection – to follow links that lead through to individual items. You can also use things like sitemaps and RSS feeds to publish lists of links in a form that machines can understand.

  • Include metadata. Remember our discussion about Linked Open Data and Schema.org? Schema.org was developed by search engines to aid discovery. You can use it to share basic information about collection items – like their title and creator – with search engines or other data aggregators. At the very least, you can use ‘meta’ tags to provide basic page-level data. Here’s an interesting blog post about using Schema.org and Google to provide library search services.

There are more Google-specific strategies you can implement, but the three points above are fundamental to web resource discovery in all its forms.
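
To make the sitemap idea a bit more concrete, here's a minimal sketch in Python that turns a list of item urls into a sitemap file that search engines can read. The item urls and the function name are made up for illustration; a real collection system would pull its urls from its own database.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(item_urls):
    """Return a sitemap XML string with one <url> entry per item url."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in item_urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical persistent urls for two collection items
items = [
    "https://collection.example.org/items/1",
    "https://collection.example.org/items/2",
]
sitemap_xml = build_sitemap(items)
print(sitemap_xml)
```

Notice that this only works if every item has its own stable url, which is why persistent urls come first in the list above.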

Sharing collections online

Flickr commons

But as well as optimising your own collection site for improved discovery, you can push your collections out into the wider world. Set your collections free!

For example, cultural institutions often share their photographic collections using platforms such as Flickr. In fact, there’s a whole special section called the Flickr Commons where cultural organisations can share public domain images. (What’s public domain? We’ll talk about that below…) By sharing images on Flickr, organisations open their collections to new audiences. Often the images will be viewed many more times on Flickr than they will on the institution’s own site. Some small organisations without their own web presence have used Flickr as their main collection site. Have a browse of the Commons’ list of ‘participating organisations’ – how many Australian organisations can you find? How many images do they share on Flickr?

There are doubts nowadays about the future of Flickr, but there are plenty of other ways to share collections. The State Library of Queensland, for example, donated 50,000 images to the Wikimedia Commons. Indeed, GLAM organisations all over the world are collaborating with the Wikimedia community in a number of different ways to highlight the contents of our cultural collections. See the GLAM-Wiki site for more information and ideas.

One great example of the possibilities is provided by the British Library which, in 2013, released over 1 million images on Flickr. The images came from scanned books from the 17th, 18th and 19th centuries and were shared without any restrictions for people to use, research and play with. This generated enormous excitement and large amounts of web traffic on Flickr. The Wikimedia community started adding the images to the Wikimedia Commons as well, but they also developed a series of indexes to make it easier for people to explore the collection. Here’s the Australian images.

See what you can find and share the results on Slack.

The images have even been used in an art installation at the Burning Man festival in Nevada.

Life on the outside

Why should you expect people to come to your web site to find and use collections? Why not push collections out into the spaces where people already congregate? Social media provides another way of making collections discoverable.

As well as being on Flickr and the Wikimedia Commons, the British Library’s million images are also shared on Tumblr and Twitter. Computer scripts (usually called bots) choose and post random images. The British Library’s @MechCuratorBot is just one example of a growing army of Twitter bots sharing random collection items.

I’ve made a few myself! @TroveNewsBot shares newspaper articles from Trove, while @TroveBot does the same for the rest of Trove’s resources. But these bots do more than just post random stuff. Tweet keywords at them and they’ll search Trove, tweeting back the most relevant result. Yes, you can search Trove without ever leaving Twitter! There are more search options on the GitHub pages for @TroveNewsBot and @TroveBot. Here’s a few things you can try:

  • Tweet #luckydip at either bot for a random resource.
  • Tweet a url and the hashtag #keywords to find resources relating to a web page.
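
If you're wondering what a bot like this does under the hood, here's a rough sketch of the 'choose a random item and compose a post' step. This is not the code behind any of the bots mentioned above. The item records, the function names, and the 140-character limit are all assumptions for illustration, and the sketch stops short of actually posting anything.

```python
import random

MAX_TWEET_LENGTH = 140  # assumed limit; real platforms also count urls differently


def pick_random_item(items, rng=random):
    """Choose one collection item at random."""
    return rng.choice(items)


def compose_tweet(item, max_length=MAX_TWEET_LENGTH):
    """Build the post text: a (possibly shortened) title followed by the item url."""
    url = item["url"]
    room_for_title = max_length - len(url) - 1  # leave space for the separator
    title = item["title"]
    if len(title) > room_for_title:
        title = title[: room_for_title - 3].rstrip() + "..."
    return f"{title} {url}"


# Hypothetical collection items
collection = [
    {"title": "Floods in Maitland", "url": "https://collection.example.org/items/1"},
    {"title": "Letter forwarding an umbrella", "url": "https://collection.example.org/items/2"},
]
print(compose_tweet(pick_random_item(collection)))
```

The url always survives intact and only the title is shortened, so whoever sees the post can still get back to the source.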

But social media can hold dangers for cultural heritage collections. Have you ever come across the @HistoryInPics (or some similar) Twitter feed? It’s an account that tweets out historical images. Sounds good huh? Hmmm, perhaps you should read Sarah Werner’s post ‘It’s history not a viral feed’.

While it’s great to see cultural heritage collections turning up on social media, accounts like @HistoryInPics aren’t very interested in fundamental things like attribution or accuracy. Often their posts are generated by bots – just scraped from other picture sites. The captions are often wrong, the photos sometimes faked. You can learn a few of their tricks by following @PicPedant.

So what can we do? Again, the main thing is to get the basics right. If your collections are easily discoverable through Google and elsewhere then someone trying to uncover the source of a posting by @HistoryInPics should be able to find it using Google’s reverse image search or a service like Tin Eye.

  • Go to @HistoryInPics and download one of the images they share. (Right click > Save Image)

  • Then go to Tin Eye, click on the upload button and select the image you downloaded.

  • What can you find out? Why might a service like Tin Eye be useful for cultural heritage workers?

If you want people to share your collections then make it easy for them to do it properly:

  • Use persistent urls. I know I’ve mentioned it already, but IT’S SO IMPORTANT I can’t help but mention it at every opportunity. Some collection sites still don’t have individual urls for items. Seriously. Some collection urls break when you try to share them (looking at you RecordSearch). If we want people to attribute the source of collection items we need to give them urls that work!

  • Provide example citations. If you want people to properly attribute a collection item, show them how. Provide an automatically generated citation that they can just copy and paste into their post.

  • Include metadata. This one’s worth repeating as well. Services like Twitter and Facebook define specific guidelines for publishing item metadata so that it can be easily retrieved and displayed when someone shares that item. It’s this metadata that’s used to create things like Twitter cards.

  • Tell people how they can use your collection. Persistent urls and the rest are not much good if the licensing information you provide makes people frightened to share your collection. If something is out-of-copyright, then say so. If it’s openly licensed, include details. Blanket statements about checking with the institution before use do not encourage sharing or creativity.
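
As a rough illustration of the citation and metadata points, here's a small Python sketch that generates a copy-and-paste citation and a few sharing tags from a single item record. The record fields, the citation layout, and the function names are my own assumptions. The tag names follow the Twitter card and Open Graph conventions, but check each platform's current guidelines before relying on them.

```python
from html import escape


def build_citation(item):
    """Assemble a simple citation; the field order here is just one possible style."""
    parts = [item["creator"], item["title"], item["date"], item["institution"], item["url"]]
    return ", ".join(parts)


def build_share_tags(item):
    """Return <meta> tags that sharing services can read from the item page."""
    tags = [
        ("name", "twitter:card", "summary"),
        ("property", "og:title", item["title"]),
        ("property", "og:url", item["url"]),
        ("property", "og:description", build_citation(item)),
    ]
    return "\n".join(
        f'<meta {attr}="{key}" content="{escape(value, quote=True)}">'
        for attr, key, value in tags
    )


# Hypothetical item record
record = {
    "creator": "Unknown photographer",
    "title": 'Street scene ("Main Street")',
    "date": "c. 1900",
    "institution": "Example Library",
    "url": "https://collection.example.org/items/1",
}
print(build_citation(record))
print(build_share_tags(record))
```

Because both outputs are built from the same record, the citation people paste and the card that appears when they share the link will always agree.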

Sarah Werner provides some important tips in her talk on how to destroy special collections with social media.

Beyond the silos of the LAMs

A 2008 study called ‘Beyond the silos of the LAMs’ explored ways that libraries, archives and museums could collaborate to improve discovery. I mention it here mainly because it has the best title of all time.

A lot has happened since 2008. In particular a number of large-scale aggregation services have been developed to pull together collection items from across the GLAM sector. There’s:

  • Trove (of course, aggregating content from Australia)
  • DigitalNZ (New Zealand)
  • DPLA (Digital Public Library of America)
  • Europeana (many European countries)

Each of these services has a different emphasis and uses different technologies, but they all aim to make cultural collections easier to find and use. They do this by:

  • Harvesting and indexing collection item metadata from lots of different GLAMs (and other organisations). This is the aggregation part of things – it creates a big centralised database of stuff.
  • Making all this stuff easily available for discovery and reuse.

Discovery and reuse? Each of these services provides a familiar-looking search portal – just type in your queries and off you go. But, more importantly, each of these services makes its aggregated data available in a form that computers can understand via an API (Application Programming Interface). This means that anybody can build new tools and interfaces that open up cultural collections in innovative and interesting ways.

Europeana’s WWI portal provides a good example of what APIs make possible.

You’ll see results from Europeana’s own collections, but you’ll also notice a series of tabs headed ‘New Zealand sources’, ‘American sources’, and ‘Australian sources’.

  • Click on ‘Australian sources’. Where do you think these pictures are coming from?

  • Click on one of the photos.

  • Click on the ‘View on partner’s website’ link.

  • Where are you now?

What’s going on? Europeana is using APIs to pull WWI content from Trove, DigitalNZ, and the DPLA. When you clicked on the ‘Australian sources’ tab, the site fired off a request to the Trove API. Trove replied with lots of nice structured, machine-readable data that Europeana could easily display within its own portal.
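
Here's a rough sketch of what that request-and-reply exchange might look like in Python. It's not Europeana's code. The endpoint, the parameter names, and the shape of the sample reply are assumptions modelled on version 2 of the Trove API, so check the current API documentation before using them. To keep things self-contained the sketch builds the request url and reads a made-up reply rather than calling the live service.

```python
import json
from urllib.parse import urlencode

API_BASE = "https://api.trove.nla.gov.au/v2/result"  # assumed endpoint


def build_search_url(query, api_key, zone="picture"):
    """Build the url a portal might request; parameter names are assumptions."""
    params = {"key": api_key, "zone": zone, "q": query, "encoding": "json"}
    return f"{API_BASE}?{urlencode(params)}"


def extract_records(response_text):
    """Pull (title, link) pairs out of a JSON reply shaped like SAMPLE_RESPONSE."""
    data = json.loads(response_text)
    records = []
    for zone in data["response"]["zone"]:
        for work in zone["records"].get("work", []):
            records.append((work["title"], work["troveUrl"]))
    return records


# Made-up reply, for illustration only; real replies carry many more fields
SAMPLE_RESPONSE = json.dumps({
    "response": {
        "zone": [{
            "name": "picture",
            "records": {
                "work": [{
                    "title": "Hypothetical photograph",
                    "troveUrl": "https://trove.example.org/work/1",
                }],
            },
        }],
    },
})

print(build_search_url("world war", api_key="YOUR_API_KEY"))
print(extract_records(SAMPLE_RESPONSE))
```

The point is that the reply is structured data, not a web page, so the receiving site can lay it out however it likes.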

Aggregation brings metadata together. APIs send it back out into the world. Together these technologies enable new forms of discovery. We’ll explore more examples of this in a few weeks when we look at collection interfaces.

As I noted above, if we want people to share and use online collections we need to provide information on how they can use them. Institutions often attach rights statements to items that are really not very useful (and are sometimes inaccurate). An analysis of the records harvested by the DPLA showed that more than 26,000 different rights statements were being used across institutions. How are users supposed to make sense of all that?

Standard licensing schemes like Creative Commons can help if you’re the creator or copyright holder of a particular digital object. But CC licences don’t cover all of the situations that cultural institutions confront. That’s why the DPLA and Europeana have developed a set of 12 standard rights statements for cultural heritage organisations. By using these we can lessen the confusion of users, and make it easier for them to find resources that they are free to share or use in their own projects.
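
To see why a small standard set helps, here's a toy Python sketch that counts the distinct free-text statements in a handful of harvested records, then counts again after mapping them to standard statement urls. The harvested statements and the mapping table are invented for illustration. In practice, deciding which standard statement applies to an item needs human judgement about its actual copyright status, not just string matching.

```python
from collections import Counter

# Two of the standardised statements, identified by their urls
IN_COPYRIGHT = "http://rightsstatements.org/vocab/InC/1.0/"
NOT_EVALUATED = "http://rightsstatements.org/vocab/CNE/1.0/"

# Hypothetical mapping from local wording to a standard statement
LOCAL_TO_STANDARD = {
    "copyright reserved": IN_COPYRIGHT,
    "all rights reserved.": IN_COPYRIGHT,
}


def standardise(statement):
    """Normalise case and spacing, then look up a standard statement url."""
    key = " ".join(statement.lower().split())
    return LOCAL_TO_STANDARD.get(key, NOT_EVALUATED)


# Hypothetical free-text statements from harvested records
harvested = [
    "Copyright reserved",
    "copyright  reserved",
    "All rights reserved.",
    "Contact the library before use",
]

before = Counter(harvested)
after = Counter(standardise(s) for s in harvested)
print(len(before), "distinct statements before,", len(after), "after")
```

Anything the mapping doesn't recognise falls back to 'copyright not evaluated' instead of being guessed at, which is the honest answer when nobody has checked.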

But while institutions can make better use of standard licences and statements, the copyright system makes it hard for them to provide accurate information.

Here’s a collection of resources. Have a look at the images and related information and sort them into two groups – ‘in copyright’ and ‘out of copyright’.

Gathered Moss by Charles Fenner, published in 1946 in Melbourne. (Another image.)

Gathered moss

Two Frontiers by E.J. Brady, published in Sydney in 1944. (Another image.)

Two frontiers

A photo of a macaque, taken in Indonesia in 2011.


A photo of floods in Maitland, taken by the Newcastle Morning Herald in 1950.

Floods 1950

A photo of floods in Maitland, taken by Jim Lucy in 1955.

Floods 1955

Letter from Atlee Hunt to Edmund Barton forwarding an umbrella, 8 October 1903.

Barton letter

What did you base your decisions on and why?

Here’s the answers (don’t peek!).

Here’s a few useful resources on copyright: