Using OCR to identify documents

I’ve been thinking for a while about what I wanted to do next with the 3000 or so digitised ASIO files I’ve downloaded from RecordSearch. In the first instance, I want to use them as a way of exploring the process of access examination that I’m looking at as part of my Closed Access project.

Closed Access looks only at files with the access status of closed. The digitised ASIO files are (mostly) ‘Open With Exemption’, so I can use them to look at the examination process in a more fine-grained way. At the same time, I hope to learn more about ASIO’s recordkeeping processes.

Most of the files include information generated through the access examination process that summarises the number of pages removed or redacted. The way the information is reported changes over time, but many of the forms look like this:

OWE form

My plan is to find these forms within individual files and collate the information. But how? I was thinking I’d do it manually, but last night I started playing around with Tesseract – the open source OCR engine.

Installation of Tesseract using Homebrew couldn’t have been easier.

brew install tesseract

I then installed the Python wrapper, tesserocr. It requires Cython and Pillow, so all I had to do inside my virtualenv was:

pip install cython
pip install pillow
pip install tesserocr

Done! Test like this:

>>> import tesserocr
>>> print tesserocr.file_to_text('your_sample_image.jpg')

And you’ll see the extracted text. So easy.

A quick experiment showed me that I could successfully OCR the digitised files and find the word ‘exemption’. So I wrote a simple script that would work its way through all the digitised files, OCRing every page and looking for any of the following words – ‘exemption’, ‘folio’, or ‘archives’. I thought that collection of words gave me a good chance of identifying the access examination forms without generating too many false positives. Here’s the script:

import os

from PIL import Image
import tesserocr

def find_forms(rootdir):
    test_words = ['folio', 'archives', 'exemption']
    for root, dirs, files in os.walk(rootdir, topdown=True):
        for dir in dirs:
            count = 0
            for dir_path, sub_dirs, files in os.walk(os.path.join(root, dir), topdown=True):
                for file in files:
                    if file[-3:] == 'jpg':
                        image = Image.open(os.path.join(dir_path, file))
                        ocr = tesserocr.image_to_text(image)
                        text = ocr.lower()
                        for test in test_words:
                            if test in text:
                                # Save a copy of the matching page to the 'forms' folder
                                image.save(os.path.join('forms', file))
                                count += 1
                                break  # don't save or count the same page twice
            print '{}: {}'.format(dir, count)

If it finds a matching image, it saves it to a new folder. It also reports on the number of matches per file.
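The per-page test itself is just a substring check over the lowercased OCR output. Factored out on its own, it might look something like this (a sketch – `page_matches` is a name I’ve made up, not part of the script above):

```python
def page_matches(text, test_words=('folio', 'archives', 'exemption')):
    """Return True if any of the test words appears in the OCRd text."""
    text = text.lower()
    return any(word in text for word in test_words)
```

Because the match is case-insensitive and purely substring-based, ‘FOLIOS’ or ‘Exemptions’ will match too – which is what you want with noisy OCR, at the cost of the occasional false positive.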

So far it seems pretty successful. It’s not fast – it’s been running for about 18 hours and is still only part of the way through. But it’s not consuming significant resources, so I’m happy for it to chug away in the background.

Folder contents

What I’ve now got is a folder with lots and lots of the access examination forms, so I can just start entering the data from these into a database without having to manually open and browse every file.

Of course, I could also save all the OCR output to a database for further analysis – I’ll probably do that next. Given the nature of most of the files, it will be riddled with errors, but it should provide some new pathways for exploration. Such as…
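Saving the output could be as simple as writing each page’s text to a SQLite table. Here’s a minimal sketch – the table and column names (`pages`, `barcode`, `page`, `text`) are my own invention for illustration:

```python
import sqlite3

def save_ocr_text(db_path, barcode, page, text):
    """Store the OCRd text of one page, keyed by file barcode and page filename.

    A sketch only: the 'pages' schema is assumed, not taken from the script above.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        'CREATE TABLE IF NOT EXISTS pages (barcode TEXT, page TEXT, text TEXT)'
    )
    conn.execute(
        'INSERT INTO pages (barcode, page, text) VALUES (?, ?, ?)',
        (barcode, page, text)
    )
    conn.commit()
    conn.close()
```

Even error-riddled OCR text is useful in a database like this, because full-text keyword searches tend to survive a fair amount of character-level noise.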

A serendipitous discovery

One of the false positives identified by my script has opened up some interesting possibilities. Amongst the access examination forms I found this:

Immigration precis

It’s a precis of a file from the Department of Immigration. My script found it because it references ‘exemption’ certificates issued under the White Australia Policy. Despite all my work on exemption certificates, I hadn’t thought about that at all.

My project aims to explore the archives of state surveillance generated by both the White Australia Policy and ASIO – this is an example of how such systems reinforced each other. Obviously I now need to search the OCRd files for all ‘immigration’ references!
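Assuming the OCR output has been saved to a SQLite table with `barcode`, `page` and `text` columns (a schema I’m inventing for illustration), the search could be a simple case-insensitive `LIKE` query:

```python
import sqlite3

def find_pages(db_path, keyword):
    """Return (barcode, page) pairs whose OCRd text mentions the keyword.

    Assumes a 'pages' table with barcode, page and text columns --
    a hypothetical schema, not something the original script creates.
    """
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        'SELECT barcode, page FROM pages WHERE lower(text) LIKE ?',
        ('%{}%'.format(keyword.lower()),)
    ).fetchall()
    conn.close()
    return rows
```

For example, `find_pages('ocr.db', 'immigration')` would surface every page that mentions the Department of Immigration – OCR errors permitting.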
