Harvesting all NAA series summaries

Created: 27 May 2016


I want to aggregate series-level data in the National Archives of Australia to build some big pictures of what is described, digitised, and accessible. Some years ago the NAA published an XML file containing information about their record series, but this file doesn’t seem to be available any more and was pretty hard to work with anyway, so I created a harvester to scrape the data from RecordSearch.

What I wanted

Aside from the standard series information I wanted to extract the following:

  • the number of items in each series described in RecordSearch
  • the number of items in each series that have been digitised
  • the number of items in each series in each of the four access examination categories – Open, Open with exception, Closed, and Not yet examined.

The number of items described in RecordSearch is displayed on the summary page of each series, and was already being extracted by the RSSeriesClient in my RecordSearch Tools library. Details of access status and digitisation can only be found by firing off an item-level search that filters by series number and the desired parameter. That means a minimum of six RecordSearch requests for each series – the series summary plus five item searches.
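
Those six requests can be sketched as plain data. This is just an enumeration of the queries, not the actual client calls – the parameter names (digital=['on'] and the access status codes) are the ones used by the search client later in this post:

```python
# The six RecordSearch queries needed for one series (sketch).
# 'OWE' is 'Open with exception' and 'NYE' is 'Not yet examined'.
series_id = 'A5'
queries = (
    [('series summary', {'series_id': series_id})] +
    [('digitised items', {'series': series_id, 'digital': ['on']})] +
    [('access status: {}'.format(status), {'series': series_id, 'access': status})
     for status in ['OPEN', 'OWE', 'CLOSED', 'NYE']]
)
print(len(queries))  # one summary request plus five item searches
```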

Last year I updated my RecordSearch Tools library to capture this information, as well as some other series summary data I was missing. I also added an RSSeriesSearchClient class so that I could harvest results from the Series Advanced Search form in RecordSearch.

Getting everything

Series numbers begin with a letter prefix. The series number field in the Advanced Search form supports wildcard searches – a search for ‘A*’ will retrieve all series whose number starts with ‘A’. So to harvest all series I just had to cycle through the alphabet, firing off searches for each letter prefix.

I created a series harvester that would take a letter prefix, use the RSSeriesSearchClient to extract all the information, and save the results to a Mongo db.


import time

from pymongo import MongoClient

# RSSeriesSearchClient comes from my RecordSearch Tools library;
# MONGO_SERIES_URL is the connection string for the MongoDB instance.


class SeriesDetailsHarvester(object):
    def __init__(self, series_id):
        self.series_id = series_id
        self.total_pages = None
        self.pages_complete = 0
        self.client = RSSeriesSearchClient()
        self.prepare_harvest()
        db = self.get_db()
        self.series = db.series

    def get_db(self):
        dbclient = MongoClient(MONGO_SERIES_URL)
        db = dbclient.get_default_database()
        return db

    def get_total(self):
        return self.client.total_results

    def prepare_harvest(self):
        # A search with results_per_page=0 just retrieves the result count
        self.client.search_series(results_per_page=0, series_id=self.series_id)
        total_results = self.client.total_results
        print('{} series'.format(total_results))
        # Ceiling division, so an exact multiple doesn't add an empty extra page
        self.total_pages = -(-int(total_results) // self.client.results_per_page)
        print(self.total_pages)

    def start_harvest(self, page=None):
        # Supplying a page number lets an interrupted harvest be resumed
        if not page:
            page = self.pages_complete + 1
        else:
            self.pages_complete = page - 1
        while self.pages_complete < self.total_pages:
            response = self.client.search_series(series_id=self.series_id, page=page, sort='1')
            self.series.insert_many(response['results'])
            self.pages_complete += 1
            page += 1
            print('{} pages complete'.format(self.pages_complete))
            time.sleep(1)

So to harvest all series with an ‘A’ prefix:


harvester = SeriesDetailsHarvester('A*')
harvester.start_harvest()
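
Cycling through the whole alphabet is then just a loop over the 26 wildcard queries. Here's a sketch – actually running the harvests needs network access to RecordSearch and a MongoDB instance, so the harvester calls are commented out:

```python
import string

# The 26 wildcard queries, 'A*' through 'Z*'
prefixes = ['{}*'.format(letter) for letter in string.ascii_uppercase]
print(prefixes[:3])  # → ['A*', 'B*', 'C*']

# for prefix in prefixes:
#     SeriesDetailsHarvester(prefix).start_harvest()
```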

Here’s a sample record from the Mongo database for series A5:

[
  {
    "_id": "584de617cab8e588e1e93c9d",
    "accumulation_dates": {
      "date_str": "01 Jan 1923 - 31 Dec 1924",
      "start_date": {
        "date": "1923-01-01 11:00:00",
        "day": true,
        "month": true
      },
      "end_date": {
        "date": "1924-12-31 11:00:00",
        "day": true,
        "month": true
      }
    },
    "items_digitised": 14,
    "recording_agencies": [
      {
        "identifier": "CA 15",
        "date_str": "01 Jan 1923 - 31 Dec 1924",
        "start_date": {
          "date": "1923-01-01 11:00:00",
          "day": true,
          "month": true
        },
        "end_date": {
          "date": "1924-12-31 11:00:00",
          "day": true,
          "month": true
        },
        "title": "Department of Home and Territories, Central Office"
      }
    ],
    "title": "Correspondence files, annual single number series with 'NG' (New Guinea) prefix",
    "subsequent_series": [
      {
        "identifier": "A1",
        "date_str": "01 Jan 1924",
        "start_date": {
          "date": "1924-01-01 11:00:00",
          "day": true,
          "month": true
        },
        "end_date": null,
        "title": "Correspondence files, annual single number series"
      },
      {
        "identifier": "A518",
        "date_str": "10 Dec 1928",
        "start_date": {
          "date": "1928-12-10 11:00:00",
          "day": true,
          "month": true
        },
        "end_date": null,
        "title": "Correspondence files, multiple number series with alphabetical prefix"
      }
    ],
    "related_series": [
      {
        "identifier": "A14263",
        "date_str": "",
        "start_date": {
          "date": null,
          "day": false,
          "month": false
        },
        "end_date": null,
        "title": "Various records donated by former patrol officers (kiaps) in Papua New Guinea"
      },
      {
        "identifier": "A6510",
        "date_str": "01 Jan 1923 -",
        "start_date": {
          "date": "1923-01-01 11:00:00",
          "day": true,
          "month": true
        },
        "end_date": {
          "date": null,
          "day": false,
          "month": false
        },
        "title": "Classified prints of photographs relating mainly to Papua and New Guinea"
      }
    ],
    "items_described": {
      "described_note": "Click to see items listed on RecordSearch. Please contact the National Reference Service if you can't find the record you want as not all items from the series may be on RecordSearch.",
      "described_number": 200
    },
    "locations": [
      {
        "location": "ACT",
        "quantity": 1.8
      }
    ],
    "contents_dates": {
      "date_str": "01 Jan 1923 - 31 Dec 1924",
      "start_date": {
        "date": "1923-01-01 11:00:00",
        "day": true,
        "month": true
      },
      "end_date": {
        "date": "1924-12-31 11:00:00",
        "day": true,
        "month": true
      }
    },
    "arrangement": "Annual single number system with 'NG' prefix",
    "physical_format": "PAPER FILES AND DOCUMENTS",
    "previous_series": [
      {
        "identifier": "A1",
        "date_str": "01 Jan 1923",
        "start_date": {
          "date": "1923-01-01 11:00:00",
          "day": true,
          "month": true
        },
        "end_date": null,
        "title": "Correspondence files, annual single number series"
      },
      {
        "identifier": "A4",
        "date_str": "01 Jan 1924",
        "start_date": {
          "date": "1924-01-01 11:00:00",
          "day": true,
          "month": true
        },
        "end_date": null,
        "title": "Correspondence files, single number series with 'NG' prefix (Old files)"
      }
    ],
    "controlling_agencies": [
      {
        "identifier": "CA 5987",
        "date_str": "24 Jul 1987 -",
        "start_date": {
          "date": "1987-07-24 10:00:00",
          "day": true,
          "month": true
        },
        "end_date": {
          "date": null,
          "day": false,
          "month": false
        },
        "title": "Department of Foreign Affairs and Trade, Central Office"
      }
    ],
    "access_status": {
      "OWE": 0,
      "OPEN": 198,
      "CLOSED": 0,
      "NYE": 2
    },
    "controlling_series": [
      {
        "identifier": "A72",
        "date_str": "01 Jan 1923 - 10 Dec 1924",
        "start_date": {
          "date": "1923-01-01 11:00:00",
          "day": true,
          "month": true
        },
        "end_date": {
          "date": "1924-12-10 11:00:00",
          "day": true,
          "month": true
        },
        "title": "Subject index cards for CRS A1, Correspondence files, annual single number series [Papua and New Guinea cabinet]"
      },
      {
        "identifier": "A73",
        "date_str": "01 Jan 1923 - 31 Dec 1924",
        "start_date": {
          "date": "1923-01-01 11:00:00",
          "day": true,
          "month": true
        },
        "end_date": {
          "date": "1924-12-31 11:00:00",
          "day": true,
          "month": true
        },
        "title": "Name index cards, annual single number series, 'Papua, Norfolk Island cabinet'"
      },
      {
        "identifier": "A252",
        "date_str": "01 Jan 1923 - 31 Dec 1924",
        "start_date": {
          "date": "1923-01-01 11:00:00",
          "day": true,
          "month": true
        },
        "end_date": {
          "date": "1924-12-31 11:00:00",
          "day": true,
          "month": true
        },
        "title": "Number Register, 'N.G.' Series"
      }
    ],
    "control_symbols": "NG 23/22 - NG 24/3852 (with gaps)",
    "identifier": "A5"
  }
]

Problems

The most obvious problem is one of currency. It takes several days to harvest all the series data, and RecordSearch is being updated all the time, so by the time a harvest is finished it's already out of date. I don't think there's any way around this; I just have to acknowledge that the data will never be entirely accurate.

The second problem is more complex. If a search returns more than 20,000 results, RecordSearch displays a warning and asks you to refine your query. That means if more than 20,000 items in a series are digitised, or have an access status of 'Open', there's no direct way of finding the exact number of items – all RecordSearch tells me is that there are more than 20,000.

For the moment I’ve developed a semi-manual process for getting around this. By default the harvester will insert the value ‘20000+’ for any count that hits this limit. So once the harvest was complete, I generated lists of all the searches for digitised items and access status that returned this value.
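
Generating those lists is straightforward once everything is in the database. Here's a sketch of the check as a plain Python function operating on harvested records (in practice the same conditions can be expressed as a Mongo query); the field names are those in the sample record above:

```python
def needs_manual_count(record):
    """Flag a harvested series record with a count that hit the 20,000 limit."""
    if record.get('items_digitised') == '20000+':
        return True
    # Any of the four access status counts may also have hit the limit
    return '20000+' in record.get('access_status', {}).values()

# A record whose digitised count hit the limit
print(needs_manual_count({'identifier': 'B883', 'items_digitised': '20000+'}))  # → True
# The sample A5 record above has exact counts throughout
print(needs_manual_count({'identifier': 'A5', 'items_digitised': 14,
                          'access_status': {'OPEN': 198, 'OWE': 0,
                                            'CLOSED': 0, 'NYE': 2}}))  # → False
```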

I could then work through the lists, developing search strategies that slice the queries up into chunks of less than 20,000 results. I decided to construct searches using item control symbols. Like series numbers, control symbols can be searched using a single character and a wildcard (this doesn't work for barcode numbers, however). So it's just a matter of working out the systems used in assigning control symbols and firing off the necessary searches – in theory. In practice, I've found that there tends to be a lot more variation in control symbols than is documented in the series notes.

Sometimes it’s easy. For example, control symbols in series A1 start with years between 1900 and 1939, so just four searches – for ‘190*’, ‘191*’, ‘192*’, and ‘193*’ – will retrieve all the items in the series in batches of less than 20,000. For convenience, this range can be recorded as a Python list – list(range(190, 194)).
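
Turning that recorded range back into the wildcard searches is then trivial:

```python
# The four year-prefix searches for series A1
searches = ['{}*'.format(decade) for decade in range(190, 194)]
print(searches)  # → ['190*', '191*', '192*', '193*']
```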

Unfortunately, most of the series I’ve looked at require a much more complex search strategy, and in most cases I can’t quite get the number of results to match the number of items recorded in the series summary. Often I get more results than I should. One reason for this is that some items have a ‘legacy’ control symbol as well as their current one, and both are searched, resulting in duplicate results. There may be other indexing oddities that cause these discrepancies.

More worrying are the cases where I get fewer results than I should. These are just mystifying. For example, I’m missing 282 items in series D4878 even though I’m searching for every letter in the alphabet. And yes, I’ve checked that none start with numbers. What else could they start with? I’ve tried some punctuation, even though punctuation is probably stripped out by the indexing. I’m stumped.

My basic strategy for constructing these searches is to start by running through each letter and number. Sometimes this is enough, but often one of these searches will return ‘20000+’, and I then have to look for a more complex combination of letters, numbers and symbols. I’ve been expressing all these as Python lists, so that I can automate the process in the future. For example, B883 is a very large series with a complex set of prefixes. The range of searches needed to capture all the possible combinations can be expressed as:

(['{}X{}'.format(letter, num)
     for num in range(0, 10)
     for letter in ['Q', 'T', 'S', 'W', 'NG', 'D', 'P', 'UK']] +
 ['{}X{}{}'.format(letter, num1, num2)
     for num2 in range(0, 10)
     for num1 in range(0, 10)
     for letter in ['V', 'N']] +
 ['{}F'.format(letter)
     for letter in ['V', 'N', 'Q', 'T', 'S', 'W', 'NG', 'D', 'P', 'UK']] +
 ['{}G'.format(letter)
     for letter in ['V', 'Q', 'T', 'S', 'W', 'NG', 'D', 'P', 'UK']] +
 ['C', 'J'] +
 ['{}{}'.format(letter, num)
     for num in range(0, 10)
     for letter in ['V', 'N', 'Q', 'T', 'S', 'W', 'NG', 'D', 'P', 'UK']] +
 list(range(0, 10)))

This creates a big list of prefixes which I can then loop through. The code to do that is:

import string

from pymongo import MongoClient

# RSSearchClient and TooManyError come from my RecordSearch Tools library;
# MONGO_SERIES_URL is the MongoDB connection string.


def harvest_large_series(identifier, control_range=None):
    # First let's check that the defined range will get everything
    if not control_range:
        control_range = list(string.ascii_uppercase) + list(range(0, 10))  # + list(string.punctuation)
    total = 0
    digitised = 0
    access = {}
    dbclient = MongoClient(MONGO_SERIES_URL)
    db = dbclient.get_default_database()
    series = db.series.find_one({'identifier': identifier})
    described = series['items_described']['described_number']
    for control in control_range:
        client = RSSearchClient()
        try:
            client.search(series=identifier, control='{}*'.format(control))
        except TooManyError:
            print('{}: more than 20,000'.format(control))
        else:
            print('{}: {}'.format(control, client.total_results))
            total += int(client.total_results)
    print('{} of {} items found'.format(total, described))
    if total == described:
        print('\nYay! All items found!')
    else:
        print('{} items missing -- need to rework the range?'.format(described - total))
    print('\nNow checking for digitised items...\n')
    for control in control_range:
        client = RSSearchClient()
        client.search(series=identifier, control='{}*'.format(control), digital=['on'])
        print('{}: {}'.format(control, client.total_results))
        try:
            digitised += int(client.total_results)
        except TypeError:
            pass
    print('\nDigitised: {}'.format(digitised))
    print('\nNow checking for access status...\n')
    for control in control_range:
        for status in ['OPEN', 'OWE', 'CLOSED', 'NYE']:
            client = RSSearchClient()
            client.search(series=identifier, control='{}*'.format(control), access=status)
            print('{}: {} -- {}'.format(control, status, client.total_results))
            try:
                access[status] += int(client.total_results)
            except KeyError:
                access[status] = int(client.total_results)
    print('\nAccess status\n')
    for s, t in access.items():
        print('{}: {}'.format(s, t))
You can just feed this function a series number and a range of prefixes and it will aggregate all the results. At the moment it just prints the results to the terminal. If you don’t supply a range, it’ll default to using every letter and number.
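
That default range is just every letter followed by every digit – 36 prefixes in all:

```python
import string

# The default control-symbol prefixes: 'A'-'Z' then 0-9
default_range = list(string.ascii_uppercase) + list(range(0, 10))
print(len(default_range))  # → 36
```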

I’ve been recording the ranges I’m using and the results in a spreadsheet. So far I’ve only worked through the searches for digitised items – I still have quite a few to analyse for access status. As you can see in the spreadsheet, although the number of items found often doesn’t match what’s expected, the errors are mostly very small.

Series     Expected   Found    Difference   % error
B883         473641   473677       -36      -0.007600693352
B884         372932   372943       -11      -0.002949599391
B2455        376053   376224      -171      -0.04547231374
B4747         67883    67884        -1      -0.001473122873
B6295        100232   100277       -45      -0.04489584165
D4878         42634    42352       282       0.661443918
MP1103/1      44512    44522       -10      -0.02246585191
MP1103/2      42358    42354         4       0.009443316493
A1            64454    64454         0       0
A1200         86893    86932       -39      -0.04488278688
A1501         30957    30925        32       0.1033691895
A6135         91149    91178       -29      -0.03181603748
A6770        104962   104966        -4      -0.003810902993
A9301        186701   186720       -19      -0.01017669964
A12111        26481    26495       -14      -0.05286809411
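
The % error column is just the difference expressed as a percentage of the expected count:

```python
def percent_error(expected, found):
    """Difference between expected and found counts, as a percentage of expected."""
    return 100.0 * (expected - found) / expected

# For B883 above: 36 extra items out of 473,641 expected
print(round(percent_error(473641, 473677), 6))  # → -0.007601
```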

Once I’ve obtained results using these ranges I replace the ‘20000+’ values in the Mongo db with the aggregated totals.
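
The replacement itself is a simple update. Here's a sketch against a plain dict standing in for the stored record – in practice it's an update on the Mongo series collection, and the counts below are invented for illustration:

```python
def apply_manual_counts(record, digitised=None, access=None):
    """Replace '20000+' placeholders in a harvested record with aggregated totals."""
    if digitised is not None and record.get('items_digitised') == '20000+':
        record['items_digitised'] = digitised
    if access:
        for status, count in access.items():
            # Only overwrite counts that actually hit the limit
            if record['access_status'].get(status) == '20000+':
                record['access_status'][status] = count
    return record

record = {'identifier': 'B883', 'items_digitised': '20000+',
          'access_status': {'OPEN': '20000+', 'OWE': 120, 'CLOSED': 0, 'NYE': 500}}
record = apply_manual_counts(record, digitised=25000, access={'OPEN': 450000})
print(record['items_digitised'])  # → 25000
```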

Code

All the code is available on GitHub.

Tags

RecordSearch