Investigating the Hansard black hole

Historic Hansard · 29 May 2016

Update 10 July: I’ve now harvested all of the House of Reps and the Senate from 1901 to 1980, so some of the figures below have been adjusted.

Background

While harvesting XML files from Senate Hansard I noticed that some of the downloaded files had differently formatted filenames and were very small (around 300 bytes). Sure enough, when I opened them I found they were empty:

<?xml version="1.0" ?>
<hansard xsi:noNamespaceSchemaLocation="../../hansard.xsd" version="2.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><session.header>
<date>1907-09-20</date>
<parliament.no>3</parliament.no>
<session.no>2</session.no>
<period.no></period.no>
<chamber>Senate</chamber>
<page.no>3556</page.no>
<proof>0</proof>
</session.header>
</hansard>

At first I thought they were just some artefact of the system and didn’t really worry about them. But as they started piling up, I thought I’d better investigate.

What’s on ParlInfo

If you search ParlInfo for the date in the XML file above, 20 September 1907, just one result is returned. This is odd – normally you’d expect to see a list of the debates and questions that made up the day’s proceedings. For example the previous day, 19 September 1907, returns 75 results.

If you click on the single result for 20 September 1907, you’ll notice again that there isn’t the usual nested list of the day’s proceedings in the left-hand column. It’s just empty. If you click on the ‘View/Save XML’ link you’ll see the same empty XML file I harvested. The ‘Download fragment’ link provides an empty PDF template. Interestingly, if you click on the ‘Download full day’s Hansard’ link you get a PDF file that says it has 30 pages, but only the first page is visible. (Update, 5 July 2016: I think this problem with PDFs loading was just a browser issue.)

Ok, so something clearly has gone wrong in the scanning/OCR/markup process. These things happen. The question is, what is the impact on users who have assumed that ParlInfo provides a full record of Hansard?

Is it searchable?

After poking around some more I noticed that some days did seem to have complete PDFs. I’m not sure what the ParlInfo search is actually searching, so I thought I’d better try a few tests to see whether the content of these empty or malformed files was searchable.

As noted above, you can find them by searching for a date range. But what happens when you add a keyword to that search?

The first page of the PDF included a question on the price of starch. So let’s add the keyword ‘starch’ to our query for 20 September 1907 – no results. Let’s check that our query’s ok by broadening the date range to the end of September. This time we get four results – it seems that the question on the price of starch was asked again on 25 September 1907.

So it seems that content from the ‘empty’ days is not searchable. They are effectively missing from ParlInfo searches.

The scale of the problem

How many days are missing? I modified my harvesting script to look for the strangely formatted filenames and write the dates and XML file URLs to a new file. From 1901 to 1980, there seem to be 94 days where the proceedings are unsearchable. You’ll find the full list at the bottom of this page.
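A content-based check is another way to flag these files. The sketch below is not the harvesting script itself, and it does not rely on the filename pattern (which is not spelled out here). It assumes only what the sample file above shows: an empty day has a `<hansard>` root whose only child is `<session.header>`. The function name `is_empty_hansard` is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_empty_hansard(xml_text):
    """Return True if the file holds a session header and nothing else.

    Assumption: a populated day has at least one other element
    directly under the <hansard> root.
    """
    root = ET.fromstring(xml_text)
    # Anything other than <session.header> counts as proceedings content.
    return all(child.tag == "session.header" for child in root)
```

A harvester could call this on each downloaded file and log the date and URL whenever it returns `True`.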

On the APH website, I found a page that listed the number of sitting days per year. I used this to compare the number of sitting days with the number of properly-formatted XML files I’d harvested (there should be one per day) and the ‘empty’ or unsearchable days. Here are the results:

You’ll notice that in some years the sum of the harvested and empty files doesn’t equal the number of sitting days. I’ve labelled the difference as ‘Missing?’, but I don’t know whether there are additional files missing from ParlInfo or not. It seems more likely that the number of sitting days is wrong, or that there was no Hansard recorded on some sitting days. I won’t know for sure until I (or someone else!) has sat down with a hardcopy of Hansard and my list of dates. (Update 1 September 2016: This mystery is solved!)
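The ‘Missing?’ figure is just a subtraction for each year. As a rough sketch (the function name is hypothetical, and the numbers in the example are placeholders rather than figures from the harvest):

```python
def unaccounted_days(sitting_days, harvested, empty):
    """Sitting days not explained by a properly formatted or an empty XML file."""
    return sitting_days - (harvested + empty)


# Placeholder values for illustration only; not real harvest data.
print(unaccounted_days(sitting_days=50, harvested=45, empty=3))  # prints 2
```

A result of zero means every sitting day is accounted for. A positive result is what gets labelled ‘Missing?’.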

But even ignoring the possibly ‘missing’ files you can see that substantial blocks of Senate Hansard are not being searched by ParlInfo. The impact seems greatest around the WWI period, from 1910 to 1919. In 1917, for example, 21 of 47 sitting days are not being searched.

Ninety-four days over 80 years might not seem like much. But when you consider that the average number of sitting days per year is 51, you can see that nearly two years’ worth of Hansard is effectively invisible.
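The arithmetic behind those figures, using only the numbers quoted above (the variable names are just for illustration):

```python
empty_days = 94        # Senate days with empty XML files, 1901 to 1980
avg_sitting_days = 51  # average sitting days per year

years_equivalent = empty_days / avg_sitting_days
print(round(years_equivalent, 2))  # 1.84

empty_1917, sitting_1917 = 21, 47
percent_1917 = 100 * empty_1917 / sitting_1917
print(round(percent_1917, 1))  # 44.7
```

So the empty days add up to about 1.8 average sitting years, and in 1917 roughly 45 per cent of sitting days are unsearchable.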

So if you’ve been using ParlInfo to search through Hansard in the period 1910 to 1919 you might want to have a think about what you could have missed!

What about the House of Reps?

As I noted, I originally wasn’t paying much attention to the empty files, and I think there were a couple that popped up when I was harvesting the House of Reps. But I’m pretty sure the scale of the problem is nothing like what I’ve observed in the Senate. To be sure, I’m running my harvesting script again across the Reps, gathering information about any empty files. So we’ll know for sure in a day or two.

Update, 5 July 2016

After reharvesting the House of Reps I found that there were eight missing days:

25 April 1902 – XML, PDF
27 March 1930 – XML, PDF
1 November 1935 – XML, PDF
23 August 1970 – XML, PDF
19 September 1970 – XML, PDF
26 September 1970 – XML, PDF
2 October 1970 – XML, PDF
3 October 1970 – XML, PDF

There was also something funny about 29 August 1945. The XML file I harvested for that day seems to be from 1 November 1935. If you search for 29 August 1945, things seem ok on the surface. But if you click on any of the results you’ll see the content is from 1935. So once again the day seems to be missing.
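A mismatch like this could be caught automatically by comparing the `<date>` in each file’s session header with the date that was requested. This is a minimal sketch, assuming the header layout shown in the sample file near the top of this post; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import date


def header_date(xml_text):
    """Return the date recorded in <session.header>, or None if it is absent."""
    root = ET.fromstring(xml_text)
    for child in root:
        if child.tag == "session.header":
            text = child.findtext("date")
            return date.fromisoformat(text.strip()) if text else None
    return None


def date_matches(xml_text, expected):
    """True if the file's own header date equals the date that was asked for."""
    return header_date(xml_text) == expected
```

For the case above, `date_matches(xml_text, date(1945, 8, 29))` would return `False` if the header carries `1935-11-01`.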

The empty files

Here’s the list of the empty files I’ve identified in the Senate Hansard. I’ve included links to the XML and PDF versions so you can check for yourselves.

20 September 1907 – XML, PDF
6 July 1910 – XML, PDF
7 July 1910 – XML, PDF
13 July 1910 – XML, PDF
3 August 1910 – XML, PDF
11 August 1910 – XML, PDF
12 August 1910 – XML, PDF
19 August 1910 – XML, PDF
25 August 1910 – XML, PDF
1 September 1910 – XML, PDF
7 September 1910 – XML, PDF
13 September 1910 – XML, PDF
29 September 1910 – XML, PDF
30 September 1910 – XML, PDF
5 October 1910 – XML, PDF
13 October 1910 – XML, PDF
18 October 1910 – XML, PDF
9 November 1910 – XML, PDF
18 October 1911 – XML, PDF
15 November 1911 – XML, PDF
22 November 1911 – XML, PDF
5 December 1911 – XML, PDF
4 July 1912 – XML, PDF
18 July 1912 – XML, PDF
24 July 1912 – XML, PDF
2 August 1912 – XML, PDF
7 August 1912 – XML, PDF
14 August 1912 – XML, PDF
9 October 1912 – XML, PDF
11 October 1912 – XML, PDF
6 November 1912 – XML, PDF
7 November 1912 – XML, PDF
8 November 1912 – XML, PDF
13 November 1912 – XML, PDF
27 November 1912 – XML, PDF
5 December 1912 – XML, PDF
9 July 1913 – XML, PDF
28 August 1913 – XML, PDF
24 September 1913 – XML, PDF
29 October 1913 – XML, PDF
11 December 1913 – XML, PDF
13 May 1914 – XML, PDF
3 June 1914 – XML, PDF
18 June 1914 – XML, PDF
8 October 1914 – XML, PDF
13 November 1914 – XML, PDF
27 November 1914 – XML, PDF
2 December 1914 – XML, PDF
10 December 1914 – XML, PDF
9 June 1915 – XML, PDF
19 August 1915 – XML, PDF
12 November 1915 – XML, PDF
9 May 1916 – XML, PDF
22 May 1916 – XML, PDF
2 March 1917 – XML, PDF
14 June 1917 – XML, PDF
11 July 1917 – XML, PDF
12 July 1917 – XML, PDF
13 July 1917 – XML, PDF
18 July 1917 – XML, PDF
19 July 1917 – XML, PDF
25 July 1917 – XML, PDF
26 July 1917 – XML, PDF
27 July 1917 – XML, PDF
1 August 1917 – XML, PDF
8 August 1917 – XML, PDF
9 August 1917 – XML, PDF
10 August 1917 – XML, PDF
15 August 1917 – XML, PDF
16 August 1917 – XML, PDF
17 August 1917 – XML, PDF
22 August 1917 – XML, PDF
23 August 1917 – XML, PDF
24 August 1917 – XML, PDF
25 September 1917 – XML, PDF
22 January 1918 – XML, PDF
24 January 1918 – XML, PDF
1 May 1918 – XML, PDF
8 May 1918 – XML, PDF
15 May 1918 – XML, PDF
30 May 1918 – XML, PDF
26 June 1918 – XML, PDF
10 October 1918 – XML, PDF
19 November 1918 – XML, PDF
11 December 1918 – XML, PDF
30 July 1919 – XML, PDF
29 August 1919 – XML, PDF
17 September 1919 – XML, PDF
16 October 1919 – XML, PDF
17 October 1919 – XML, PDF
22 October 1919 – XML, PDF
23 October 1919 – XML, PDF
24 October 1919 – XML, PDF
2 August 1934 – XML, PDF


Tim Sherratt
