How to scrape with OutWit Hub

Data scraping is one of those data journalism phrases that I encountered and thought – yep, never going to be able to do that. But it sounds a lot scarier than it is.

In this how-to guide I’ll run through how to do a really simple scrape using the free version of OutWit Hub.

Background

In honour of 420 (the unofficial national marijuana day in America), I wrote a piece about the general election (GE 2015) candidates from the Cannabis is Safer Than Alcohol (CISTA) Party in order to practise/refresh my scraping skills.

I will be walking you through the scraping process I went through to get the following data from the CISTA website:

a) Candidate name

b) Candidate constituency

I then mapped the candidates according to where they were standing for election, using mapping tool CartoDB.

What does OutWit Hub do?

At its most basic, OutWit Hub retrieves text from between two ‘markers’ defined by you, the user. It delves into the source code of a web page and recovers the data you want.
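If it helps to see the marker idea in code, here is a minimal Python sketch of pulling out text between a ‘before’ and an ‘after’ marker. It is only an illustration of the concept, not how OutWit Hub itself is built; the function name extract_between and the sample HTML are made up for the example.

```python
def extract_between(source, before, after):
    """Return every piece of text found between `before` and `after`."""
    results = []
    pos = 0
    while True:
        # Find the next 'marker before'; stop when there are none left.
        start = source.find(before, pos)
        if start == -1:
            break
        start += len(before)
        # Find the first 'marker after' that follows it.
        end = source.find(after, start)
        if end == -1:
            break
        results.append(source[start:end].strip())
        # Carry on searching after this match.
        pos = end + len(after)
    return results


# Hypothetical snippet of page source, invented for this example.
sample = "<li><b>Apple</b></li><li><b>Pear</b></li>"
print(extract_between(sample, "<b>", "</b>"))  # ['Apple', 'Pear']
```

The tighter the markers hug the text you want, the fewer stray matches you get back.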

Step 1: Download OutWit Hub
The free version limits you to scraping 100 rows of data, but that should be more than enough for this exercise.

Step 2: Open up OutWit Hub, then copy and paste the address of the CISTA candidate web page into the URL bar at the top of the application. This will also show the source code of that particular web page.

http://cista.org/candidates

Step 3: Click on ‘scrapers’, then create a new scraper

Step 4: For the first line, under the ‘Description’ column, put ‘Candidate name’. This line will be where we’ll pull out the candidate name.

Step 5: Go to ‘marker before’, and put in: <div class="col-md-3">

Step 6: Go to ‘marker after’, and put in: </h3>

Step 7: For the second line, under the ‘Description’ column, put ‘Constituency’. This line will be where we’ll pull out the constituency the candidate is standing in.

Step 8: Go to ‘marker before’, and put in: </a>

Step 9: Go to ‘marker after’, and put in: </p>

A screenshot probably illustrates the process better than me just writing about it, so your screen should now look something like this:

OutWit Hub showing before and after markers

Step 10: Click ‘execute’

Congratulations! You have successfully scraped using OutWit Hub.

OutWit Hub showing what final scrape should look like
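For anyone curious how the two scraper lines from Steps 4 to 9 would look outside the tool, here is a rough Python sketch using the same marker pairs. The page snippet, the candidate names and the helper scrape_field are all invented for illustration (the real CISTA source may be laid out differently), and the sketch strips leftover tags itself, which is an assumption about what you would want rather than a statement of what OutWit Hub does.

```python
import re


def scrape_field(source, before, after):
    """Collect the text between each before/after marker pair, minus any tags."""
    pattern = re.escape(before) + r"(.*?)" + re.escape(after)
    raw_matches = re.findall(pattern, source, flags=re.DOTALL)
    # Remove leftover HTML tags such as <h3>, then trim whitespace.
    return [re.sub(r"<[^>]+>", "", item).strip() for item in raw_matches]


# Hypothetical page source, NOT copied from the real CISTA site.
page = """
<div class="col-md-3"><h3>Jane Example</h3>
<p><a href="/jane">Profile</a> Anytown North</p></div>
<div class="col-md-3"><h3>John Sample</h3>
<p><a href="/john">Profile</a> Otherville South</p></div>
"""

# Same markers as Steps 5-6 and Steps 8-9.
names = scrape_field(page, '<div class="col-md-3">', "</h3>")
constituencies = scrape_field(page, "</a>", "</p>")

# Pair each name with its constituency, one row per candidate.
rows = list(zip(names, constituencies))
for name, constituency in rows:
    print(name, "|", constituency)
```

Pairing the two lists with zip only works when every candidate has both fields in the same order, which is a simplification of how a scraper lines up its rows.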

I still find it a bit touch and go in terms of deciding exactly which bits of source code to use as markers. In this case, you can see that the scraper pulled out one result (the top one) that also fit into the before and after markers we specified – it’s not an exact science, but you can delete outliers.
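Deleting outliers by hand is fine for a short list, but the same tidy-up can be sketched in Python. The rule used here (drop any row with an empty field or a suspiciously long one) and the 60-character cut-off are assumptions chosen for the example, not anything OutWit Hub applies, and the rows are made up.

```python
def drop_outliers(rows, max_length=60):
    """Keep rows where every field is non-empty and reasonably short."""
    return [
        row
        for row in rows
        if all(field.strip() and len(field) <= max_length for field in row)
    ]


# Invented rows: the first mimics a stray match from the top of a page.
scraped = [
    ("Our candidates Meet the people standing for election in constituencies "
     "across the country", "Read more"),
    ("Jane Example", "Anytown North"),
    ("John Sample", ""),
]

cleaned = drop_outliers(scraped)
print(cleaned)  # [('Jane Example', 'Anytown North')]
```

It is still worth eyeballing the results afterwards, because a rule like this can also throw away a genuine row with an unusually long constituency name.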

You can now export your results (click ‘Export’).

Excel spreadsheet of candidates and their constituencies
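If you ever have rows like these in Python rather than in OutWit Hub, a spreadsheet-friendly file can be produced with the standard csv module. This is a small sketch with made-up rows; the helper rows_to_csv and the column headings are assumptions that simply mirror the two scraper lines above.

```python
import csv
import io


def rows_to_csv(rows, header=("Candidate name", "Constituency")):
    """Turn (name, constituency) rows into CSV text that Excel can open."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Invented example rows.
csv_text = rows_to_csv([
    ("Jane Example", "Anytown North"),
    ("John Sample", "Otherville South"),
])
print(csv_text)
```

Writing csv_text to a file ending in .csv gives you something that opens directly in Excel or uploads to a mapping tool.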
