Activities
Screen scraping with Import.io

Under construction, 31 August 2016
This page is likely to be messy and incomplete. Check back later.


It’s great when you can access data in nicely pre-packaged forms like CSVs or APIs, but sometimes the information you want is only available as a list on a web site. So what do you do? (And yes, cut and paste gets very boring, very quickly…)

Screen scraping is the process of extracting structured data from a web page. It uses the tags and text that make up an HTML page to create a set of coordinates within which individual pieces of data can be identified and extracted. It can be frustrating and fragile – a simple change to a web page can break your precious scraper – but sometimes it’s your only option.
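
To give you a sense of what a scraper is actually doing under the hood, here’s a minimal Python sketch using the requests and BeautifulSoup libraries. The URL and the selectors are made up for illustration – they’re not the real APH markup – but the pattern (fetch the page, find the repeating elements, pull out the text) is the same one Import.io automates for you.

    import requests
    from bs4 import BeautifulSoup

    # Fetch the page (the URL here is a placeholder, not the real members list).
    response = requests.get("http://example.com/members")
    soup = BeautifulSoup(response.text, "html.parser")

    # Loop over the repeating chunks of HTML and pull out the bits we want.
    # The 'div.member', 'h3' and 'dd' selectors are assumptions for this sketch.
    members = []
    for row in soup.select("div.member"):
        name = row.select_one("h3").get_text(strip=True)
        electorate = row.select_one("dd").get_text(strip=True)
        members.append({"name": name, "electorate": electorate})

    print(members)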

I’ve used screen scrapers a lot. The early versions of my Trove tools, like QueryPic, were based on screen scrapers. I’ve also harvested large quantities of data from RecordSearch, such as thousands of publicly available ASIO files. My RecordSearch scraper is custom-built in Python, but there are also a number of general-purpose screen-scraping tools. If you’re happy to tinker with code, you can use Morph.io to build, host and share the results of your scraper. If you’re looking for a no-code solution, Import.io offers a free browser-based service that allows you to build fairly complex scrapers. This tutorial is a quick introduction to using Import.io.

Have a look at this page from the Australian Parliament House site, which provides basic details of all the members of the House of Representatives. We’re going to turn this web page into a CSV file.

First sign up for a free account with Import.io. You can create scrapers without an account, but you’ll need one to save them and download the results. All done? OK, let’s get scraping!

Creating a new scraper

Import.io screenshot

  • Make sure you’re logged in, then go to import.io.

  • Copy and paste the URL for the APH page

    http://www.aph.gov.au/Senators_and_Members/Parliamentarian_Search_Results?page=1&expand=1&q=&mem=1&par=-1&gen=0&ps=100&st=1

    into the Enter a url box and click on Try it out.

  • Import.io will load the page and try to identify the interesting bits of data, which it will display as a grid.

Editing your scraper

Import.io screenshot

  • Look at the Title link column – you’ll notice that Import.io has automatically identified and extracted the list of members! That’s pretty cool, but some of the other fields aren’t very useful, so we’ll do some cleaning up.

  • Hover over the column heading for Whatsthis label and click on the down arrow. Select Delete column to remove it. Do the same for Track label and Inline text links.

  • Now let’s add some extra data. Click on the Website view tab to load the original page.

Adding columns

Import.io webview screenshot

  • The website view lets you see and edit the relationship between the web page and the data columns. If you select the Title link column, for example, each member’s name is highlighted on the web page.

  • Click on the Add column button to create a new column. In the column heading type ‘Electorate’.

  • Now try hovering over Tony Abbott’s electorate of ‘Warringah’ – you’ll see the cursor has a little green plus sign and a box is drawn around the electorate name. Click on the electorate. What happens?

  • You should notice that not only is Tony Abbott’s electorate selected, but so are the electorates for all the other members – magic! Based on your selection, Import.io expects the electorate to be the first item after each member’s name.

  • Click on the Data view tab to see how your new column will look.

I’ve found that sometimes the data doesn’t load correctly – either it doesn’t appear at all, or you see something like {"text": "Warringah"} in the data view. If this happens, just go back to the website view, click on Clear column and try again. It seems to work ok the second time.

Tricky selections

Import.io xpath screenshot

  • What about the parties of each member? Go back to the Website view, add a new column and name it ‘Party’.

  • Now we have a problem. If you scroll down the list you’ll see that some members (like Julie Bishop) have ‘Title(s)’ listed before their party. This means we can’t just tell Import.io to find the second piece of information after the member’s name – we have to be more precise.

  • Click on the gear icon in the new column header and select Manual XPath.

  • In the box that appears, paste the following

    //dt[text()="Party"]/following-sibling::dd[1]

  • Hopefully you’ll now see the ‘Party’ values highlighted.

  • Check the Data view to see how the grid’s looking now.

What just happened? XPaths are ways of identifying elements on a web page (or in an XML file). In this case, we’re looking for a <dt></dt> tag in the HTML of the page that has the text ‘Party’, and then grabbing the <dd></dd> tag that immediately follows it. Using XPaths you can uniquely identify any element on a web page.
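
If you’d like to see an XPath in action outside of Import.io, here’s a small Python sketch using the lxml library. The HTML fragment is made up for the example – it just mimics the kind of <dt>/<dd> list the XPath above is written for.

    from lxml import html

    # A made-up fragment of HTML, similar in shape to a member's details list.
    snippet = """
    <dl>
      <dt>Electorate</dt><dd>Warringah</dd>
      <dt>Title(s)</dt><dd>Some title or other</dd>
      <dt>Party</dt><dd>Liberal Party of Australia</dd>
    </dl>
    """

    tree = html.fromstring(snippet)

    # Find the <dt> whose text is 'Party', then grab the <dd> that follows it.
    party = tree.xpath('//dt[text()="Party"]/following-sibling::dd[1]/text()')
    print(party)  # ['Liberal Party of Australia']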

And finally photos

  • Let’s get their photos as well. Go back to Website view, create a new column, and call it ‘Photo’.

  • Now just click on a photo. You should see a dialogue box asking if you want to save the link – click ‘Yes’. If for some reason the photos aren’t captured, just clear the column and try again.

  • That’s it! Look at the data view to see the finished grid.

Using your new scraper

Import.io final grid screenshot

  • Now click on the Done button to save your scraper. (Import.io actually calls them ‘Extractors’!) The dashboard for your new scraper will open, and you should see a message telling you that it’s being run for the first time. When it’s finished you’ll see a link to download the data as a CSV file.

  • But we’re not quite done, because our scraper will only have harvested data from the first page of members – there are actually two pages of results. That’s easily fixed.

Adding more pages

Import.io dashboard screenshot

  • Click on Show URL Generator. Hopefully it will have already identified the page value in the URL and highlighted it as {Parameter-1}. If not, just look for page=1 and click on it.

  • Change the ‘to’ value to the last page you want to harvest – in this case ‘2’. You’ll now see two ‘generated URLs’. (There’s a short Python sketch of the same idea just after this list.)

  • Click on the Add to list button to add these to the URLs your extractor will use.

  • You’ll now have a duplicate url, so just click Remove duplicate URLs.

  • Click on Save.

  • Now you can click on the Run URLs button at the top of the page to re-run your scraper. This time it will grab the data for all members!
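
If it helps to think about what the URL generator is doing, here’s a rough Python equivalent – it simply builds one URL per results page by varying the page parameter in the APH search URL we used earlier.

    # Build one URL per results page by substituting the page number.
    base_url = (
        "http://www.aph.gov.au/Senators_and_Members/Parliamentarian_Search_Results"
        "?page={page}&expand=1&q=&mem=1&par=-1&gen=0&ps=100&st=1"
    )

    urls = [base_url.format(page=n) for n in range(1, 3)]  # pages 1 and 2
    for url in urls:
        print(url)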

Download your data

  • Once your scraper has run successfully, just click on Download CSV to get a copy of the extracted data. Here’s a CSV I created earlier. (There’s a quick example of loading the CSV in Python below.)

  • For data that changes regularly you can set up a schedule to extract data automatically. You can also access the extracted data via an API. But that seems like a topic for another tutorial…
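
If you’d like to explore the downloaded data in Python, here’s a minimal sketch using the standard csv module. The filename and the column names (‘Electorate’, ‘Party’) are assumptions – check your own file, as Import.io’s export may name the columns slightly differently.

    import csv

    # 'members.csv' is whatever you named the file you downloaded from Import.io.
    with open("members.csv", newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        print(reader.fieldnames)  # see exactly what columns your export contains
        for row in reader:
            # Column names assumed for this sketch; adjust to match your export.
            print(row.get("Electorate"), "-", row.get("Party"))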