Investigating the Hansard black hole

Update 10 July: I’ve now harvested all of the House of Reps and the Senate from 1901 to 1980, so some of the figures below have been adjusted.

Background

While harvesting XML files from Senate Hansard I noticed that some of the downloaded files had differently formatted filenames and were very small (around 300 bytes). Sure enough, when I opened them I found they were effectively empty, containing nothing but a session header:

<?xml version="1.0" ?>
<hansard xsi:noNamespaceSchemaLocation="../../hansard.xsd" version="2.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><session.header>
<date>1907-09-20</date>
<parliament.no>3</parliament.no>
<session.no>2</session.no>
<period.no></period.no>
<chamber>Senate</chamber>
<page.no>3556</page.no>
<proof>0</proof>
</session.header>
</hansard>

At first I thought they were just some artefact of the system and didn’t really worry about them. But as they started piling up, I thought I’d better investigate.
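A check for this kind of file could be sketched in Python along the following lines. This is not the harvesting script itself, just a minimal illustration. It assumes that an 'empty' file is one whose hansard root element holds a single session.header child and nothing else, as in the example above. The function name is_empty_hansard and the trimmed sample are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the 20 September 1907 file shown above.
SAMPLE_EMPTY = """<?xml version="1.0" ?>
<hansard version="2.1"><session.header>
<date>1907-09-20</date>
<chamber>Senate</chamber>
<page.no>3556</page.no>
</session.header>
</hansard>"""


def is_empty_hansard(xml_text: str) -> bool:
    """Return True if <hansard> contains a <session.header> and nothing else.

    Assumption: a properly marked-up day has further child elements
    (debates, questions and so on) after the header.
    """
    root = ET.fromstring(xml_text)
    child_tags = [child.tag for child in root]
    return root.tag == "hansard" and child_tags == ["session.header"]


print(is_empty_hansard(SAMPLE_EMPTY))  # True
```

Checking the structure rather than the file size avoids having to guess at a byte threshold, though the small size (around 300 bytes) is what gives these files away in the first place.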

What’s on ParlInfo

If you search ParlInfo for the date in the XML file above, 20 September 1907, just one result is returned. This is odd – normally you’d expect to see a list of the debates and questions that made up the day’s proceedings. For example the previous day, 19 September 1907, returns 75 results.

If you click on the single result for 20 September 1907, you’ll notice again that there isn’t the usual nested list of the day’s proceedings in the left-hand column. It’s just empty. If you click on the ‘View/Save XML’ link you’ll see the same empty XML file I harvested. The ‘Download fragment’ link provides an empty PDF template. Interestingly, if you click on the ‘Download full day’s Hansard’ link you get a PDF file that says it has 30 pages, but only the first page is visible. (Update, 5 July 2016: I think this problem with PDFs loading was just a browser issue.)

Ok, so something clearly has gone wrong in the scanning/OCR/markup process. These things happen. The question is, what is the impact on users who have assumed that ParlInfo provides a full record of Hansard?

Is it searchable?

After poking around some more I noticed that some days did seem to have complete PDFs. I’m not sure what the ParlInfo search is actually searching, so I thought I’d better try a few tests to see whether the content of these empty or malformed files was searchable.

As noted above, you can find them by searching for a date range. But what happens when you add a keyword to that search?

The first page of the PDF included a question on the price of starch. So let’s add the keyword ‘starch’ to our query for 20 September 1907 – no results. Let’s check that our query’s ok by broadening the date range to the end of September. This time we get four results – it seems that the question on the price of starch was asked again on 25 September 1907.

So it seems that content from the ‘empty’ days is not searchable. They are effectively missing from ParlInfo searches.

The scale of the problem

How many days are missing? I modified my harvesting script to look for the strangely formatted filenames and write the dates and XML file URLs to a new file. From 1901 to 1980, there seem to be 94 days where the proceedings are unsearchable. You’ll find the full list at the bottom of this page.
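The general shape of that step might look something like the sketch below. It is an illustration only, not the actual harvesting script. Because the filename pattern isn't spelled out here, the sketch swaps in a check on the XML content instead (a hansard root with only a session.header child). The function names and the CSV layout are assumptions.

```python
import csv
import xml.etree.ElementTree as ET
from typing import Iterable, List, Tuple


def header_date(root: ET.Element) -> str:
    """Pull the text of <date> out of <session.header>, or '' if absent."""
    for header in root:
        if header.tag == "session.header":
            for field in header:
                if field.tag == "date":
                    return (field.text or "").strip()
    return ""


def find_empty_days(files: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Take (url, xml_text) pairs and return sorted (date, url) pairs
    for every file that has a session header and nothing else."""
    empty = []
    for url, xml_text in files:
        root = ET.fromstring(xml_text)
        if [child.tag for child in root] == ["session.header"]:
            empty.append((header_date(root), url))
    return sorted(empty)


def write_empty_days(rows: List[Tuple[str, str]], path: str) -> None:
    """Write the (date, url) pairs to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["date", "xml_url"])
        writer.writerows(rows)
```

Sorting on the ISO-formatted date string puts the rows in chronological order, which makes it easier to spot runs of consecutive missing days.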

On the APH website, I found a page that listed the number of sitting days per year. I used this to compare the number of sitting days with the number of properly-formatted XML files I’d harvested (there should be one per day) and the ‘empty’ or unsearchable days. Here are the results:

You’ll notice that in some years the sum of the harvested and empty files doesn’t equal the number of sitting days. I’ve labelled the difference as ‘Missing?’, but I don’t know whether there are additional files missing from ParlInfo or not. It seems more likely that the number of sitting days is wrong, or that there was no Hansard recorded on some sitting days. I won’t know for sure until I (or someone else!) have sat down with a hardcopy of Hansard and my list of dates. (Update 1 September 2016: This mystery is solved!)
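The ‘Missing?’ figure is just the number of sitting days minus the harvested and empty files. A minimal sketch of that calculation follows. The class and field names are made up for illustration, and no real yearly figures are included.

```python
from dataclasses import dataclass


@dataclass
class YearCount:
    """Counts for one year of one chamber (names are illustrative)."""
    year: int
    sitting_days: int   # from the APH list of sitting days
    harvested: int      # properly formatted XML files
    empty: int          # header-only, unsearchable files

    @property
    def unaccounted(self) -> int:
        """Sitting days with neither a harvested nor an empty file."""
        return self.sitting_days - (self.harvested + self.empty)
```

A positive value means there are sitting days with no file at all. A negative value would suggest the sitting-day count itself is too low.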

But even ignoring the possibly ‘missing’ files you can see that substantial blocks of Senate Hansard are not being searched by ParlInfo. The impact seems greatest around the WWI period, from 1910 to 1919. In 1917, for example, 21 of 47 sitting days are not being searched.

Ninety-four days over 80 years might not seem like much. But when you consider that the average number of sitting days per year is 51, you can see that nearly two years’ worth of Hansard is effectively invisible.
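For anyone who wants to check the arithmetic, here it is as a few lines of Python, using only the figures quoted above (94 empty days, an average of 51 sitting days a year, and 21 of 47 sitting days in 1917):

```python
EMPTY_DAYS = 94
AVG_SITTING_DAYS_PER_YEAR = 51

# Express the unsearchable days as a number of average sitting years.
years_equivalent = EMPTY_DAYS / AVG_SITTING_DAYS_PER_YEAR
print(f"{years_equivalent:.2f} years of sittings")  # 1.84 years of sittings

# Share of 1917 sitting days that are not being searched.
share_1917 = 21 / 47
print(f"{share_1917:.0%} of 1917 sitting days")  # 45% of 1917 sitting days
```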

So if you’ve been using ParlInfo to search through Hansard in the period 1910 to 1919 you might want to have a think about what you could have missed!

What about the House of Reps?

As I noted, I originally wasn’t paying much attention to the empty files, and I think there were a couple that popped up when I was harvesting the House of Reps. But I’m pretty sure the scale of the problem is nothing like what I’ve observed in the Senate. To check, I’m running my harvesting script again across the Reps, gathering information about any empty files. So we’ll know for sure in a day or two.

Update, 5 July 2016

After reharvesting the House of Reps I found that there were eight missing days:

There was also something funny about 29 August 1945. The XML file I harvested for that day seems to be from 1 November 1935. If you search for 29 August 1945, things seem ok on the surface. But if you click on any of the results you’ll see the content is from 1935. So once again the day seems to be missing.
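This kind of mix-up won’t show up in a check for empty files, because the file has plenty of content. One way to catch it would be to compare the date in each file’s session.header with the date that was requested. The sketch below assumes the date element holds an ISO-formatted date, as in the empty file shown earlier. The function names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def header_date(xml_text: str) -> Optional[str]:
    """Return the text of <session.header><date>, or None if it is absent."""
    root = ET.fromstring(xml_text)
    for header in root:
        if header.tag == "session.header":
            for field in header:
                if field.tag == "date":
                    return (field.text or "").strip() or None
    return None


def date_mismatch(requested_date: str, xml_text: str) -> bool:
    """True if the file's own date differs from the date that was requested.

    A file with no date at all is also reported as a mismatch.
    """
    return header_date(xml_text) != requested_date
```

For the case described above, calling date_mismatch with "1945-08-29" on a file whose header says 1935-11-01 would return True.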

The empty files

Here’s the list of the empty files I’ve identified in the Senate Hansard. I’ve included links to the XML and PDF versions so you can check for yourselves.
