The mystery of the missing days

Created: 01 September 2016


Hansard volumes

Background

While investigating the Senate ‘black hole’ I compared the number of Senate sitting days reported on the Parliament House site with the number of days I’d harvested for each year and found some discrepancies. These ‘missing’ days were different to the ‘invisible’ days for which I had at least harvested an empty XML file. What was going on?

The mystery solved (or not…)!

The only way to really be sure what was going on seemed to be to sit down with a print version of Hansard. So that’s what I did.

My initial comparison indicated that 8 days were missing from 1905, so that seemed like a good place to start. I simply flicked through all 6 volumes for 1905, recording Senate sitting days, and checking them off against the files I’d harvested. All was going well until I hit 12 December – the latest file I had was from 11 December, but the Senate also sat on 12, 13, 14, 15, 18, 19, 20, and 21 December – the 8 ‘missing’ days.
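The same checking-off can be sketched in code as a set difference between the days found in print and the days harvested. This is only an illustration: `missing_days` is a hypothetical helper, not part of the harvester, and the dates are just the December 1905 sample described above.

```python
from datetime import date

def missing_days(print_days, harvested_days):
    """Return the sitting days found in print that have no harvested file."""
    return sorted(set(print_days) - set(harvested_days))

# Sample only: the December 1905 Senate sitting days mentioned in this post.
print_days = [date(1905, 12, d) for d in (11, 12, 13, 14, 15, 18, 19, 20, 21)]
# The latest harvested file was from 11 December.
harvested_days = [date(1905, 12, 11)]

missing = missing_days(print_days, harvested_days)
for day in missing:
    print(day.isoformat())
```

Run over a whole year, a check like this would point straight at the dates where the harvest stops, rather than just reporting a count that doesn't match.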

So the number of sitting days reported by the Parliament website was correct. But where were the missing days? I searched ParlInfo for Senate Hansard on days between 12 and 31 December 1905 and they were all there!

There was only one place left to look – my own harvesting code… Yep, you guessed it.

start_date = '01%2F01%2F{}'.format(year)
end_date = '11%2F12%2F{}'.format(year)

These two lines set the date range for the search I used to generate a list of files to harvest. That ‘11’ in the second line should have been ‘31’. D’oh!

So I had managed not to harvest any sitting day after 11 December. Good job, Tim…

The good news is that this is very easy to fix. I just need to set the date range to 12-31 December and run my harvester again to fill the gaps. An updated version of Historic Hansard will be appearing shortly!
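Here is a minimal sketch of how the gap-filling date range could be built, assuming the search still expects URL-encoded dd/mm/yyyy strings like the two lines shown earlier. The helper names `search_date` and `gap_range` are made up for illustration; the real harvester may do this differently.

```python
from datetime import date
from urllib.parse import quote

def search_date(d):
    """Format a date as dd/mm/yyyy with the slashes URL-encoded as %2F."""
    return quote(d.strftime('%d/%m/%Y'), safe='')

def gap_range(year):
    """Return (start_date, end_date) covering 12-31 December of the given year."""
    return search_date(date(year, 12, 12)), search_date(date(year, 12, 31))

start_date, end_date = gap_range(1905)
print(start_date, end_date)
```

Building the strings from `date` objects means a typo like a non-existent day raises an error, although it wouldn't have caught my ‘11’ for ‘31’, since both are valid dates.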

But the ‘black hole’ remains

It’s probably worth emphasising that these aren’t the days I previously identified as missing from ParlInfo searches. The folks at Parliament are still looking into the ‘black hole’ and how to fix it. These are days that I missed in my harvests, so they’re present on ParlInfo, but not on Historic Hansard (for the moment).

Learning from failure

So the lessons for today are:

  • Double check and cross-reference – if I hadn’t compared my numbers with the sitting days on the Parliament site I’d have been blissfully unaware of my own stupidity.
  • Coding provides you with unparalleled opportunities to fail – admit it, fix it, and move on.
  • Don’t believe anything you see online. Seriously…

Update (2 September) – the confusion continues…

As I said, it was easy to fix my harvester, but even with the end date adjusted I haven’t managed to fill all the gaps. The numbers still don’t add up. Looks like I’ll have to spend more time with the hardcopies…

The re-harvest also revealed an extra 12 empty XML files, so the ‘black hole’ not only remains, it has grown.

Here’s the current status of the House of Representatives:

[Chart: House of Representatives]

And the Senate:

[Chart: Senate]

Tags

fail