Data scraping is one of those data journalism phrases that I encountered and thought – yep, never going to be able to do that. But it sounds a lot more scary than it is.
In this how-to guide I’ll run through how to do a really simple scrape using the free version of OutWit Hub.
In honour of 420 (unofficial national marijuana day in America), I wrote a piece about GE 2015 candidates from the Cannabis is Safer Than Alcohol (CISTA) Party in order to practise/refresh my scraping skills.
I will be walking you through the scraping process I went through to get the following data from the CISTA website:
a) Candidate name
b) Candidate constituency
I then mapped the candidates according to where they were standing for election, using mapping tool CartoDB.
What does OutWit Hub do?
At its most basic, OutWit Hub retrieves text from between two ‘markers’ defined by you, the user. It delves into the source code of a web page and pulls out whatever sits between those markers, which is the data you want.
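If it helps to see the ‘text between two markers’ idea outside of the tool, here is a minimal Python sketch of it. This is not how OutWit Hub is actually built (I don’t know its internals); the function name `extract_between` and the sample HTML are my own inventions, purely for illustration.

```python
def extract_between(source: str, before: str, after: str) -> list[str]:
    """Return every piece of text found between `before` and `after`."""
    results = []
    position = 0
    while True:
        # Look for the next 'marker before', starting where we left off
        start = source.find(before, position)
        if start == -1:
            break
        start += len(before)
        # Look for the first 'marker after' that follows it
        end = source.find(after, start)
        if end == -1:
            break
        results.append(source[start:end].strip())
        position = end + len(after)
    return results


# Made-up source code standing in for a real web page
sample = "<li><b>Apple</b></li><li><b>Pear</b></li>"
names = extract_between(sample, "<b>", "</b>")
print(names)
```

Every time the ‘marker before’ turns up in the source, the function grabs everything up to the next ‘marker after’, which is why one pair of markers can return a whole column of results.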
Step 1: Download OutWit Hub
The free version limits you to scraping 100 rows of data, but that should be more than enough for a small scrape like this one.
Step 2: Open up OutWit Hub, copy and paste the CISTA candidate web page into the URL bar at the top of the application. This will also show the source code of that particular web page.
Step 3: Click on ‘scrapers’, then create a new scraper
Step 4: For the first line, under the ‘Description’ column, put ‘Candidate name’. This line will be where we’ll pull out the candidate name.
Step 5: Go to ‘marker before’, and put in: <div class="col-md-3">
Step 6: Go to ‘marker after’, and put in: </h3>
Step 7: For the second line, under the ‘Description’ column, put ‘Constituency’. This line will be where we’ll pull out the constituency each candidate is standing in.
Step 8: Go to ‘marker before’, and put in: </a>
Step 9: Go to ‘marker after’, and put in: </p>
A photo probably illustrates the process better than me just writing about it. Your screen should now look something like this:
Step 10: Click ‘execute’
Congratulations! You have successfully scraped using OutWit Hub.
I still find it a bit touch and go in terms of deciding exactly which bits of source code to use as markers. In this case, you can see that the scraper pulled out one result (the top one) that also fit into the before and after markers we specified – it’s not an exact science, but you can delete outliers.
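To show how an outlier like that can creep in, here is a rough Python sketch using the same two pairs of markers from the steps above. The HTML in `page` is entirely made up (it is not the real CISTA source code), the helper `between` is hypothetical, and stripping leftover tags from the results is my own assumption rather than something I know OutWit Hub does.

```python
import re

TAG = re.compile(r"<[^>]+>")  # matches any leftover HTML tag


def between(text, before, after):
    """Find everything between the two markers, minus stray tags."""
    pattern = re.escape(before) + r"(.*?)" + re.escape(after)
    matches = re.findall(pattern, text, flags=re.DOTALL)
    return [TAG.sub("", match).strip() for match in matches]


# Invented page: one intro line at the top, then two candidate blocks
page = """
<p><a href="/">Home</a> Meet our candidates</p>
<div class="col-md-3"><h3>Jane Example</h3>
<p><a href="/jane">Profile</a> Exampleton North</p></div>
<div class="col-md-3"><h3>John Sample</h3>
<p><a href="/john">Profile</a> Sampleford South</p></div>
"""

names = between(page, '<div class="col-md-3">', "</h3>")
constituencies = between(page, "</a>", "</p>")

# The intro line also sits between </a> and </p>, so it gets scraped too.
# Delete it by hand, just as you would delete the outlier row in OutWit Hub.
outliers = {"Meet our candidates"}
constituencies = [c for c in constituencies if c not in outliers]

rows = list(zip(names, constituencies))
print(rows)
```

The more specific the markers, the fewer strays you pick up: `</a>` and `</p>` are common tags, so anything else on the page that happens to sit between them comes along for the ride.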
You can now export your results (click ‘Export’).
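If you ever end up with scraped rows in a script rather than in OutWit Hub, a CSV file is a format mapping tools such as CartoDB will generally accept. Here is a minimal sketch of writing one out in Python. The candidate rows are made up, the function name `rows_to_csv` is hypothetical, and the column headings simply reuse the descriptions from Steps 4 and 7.

```python
import csv
import io

# Invented example rows: (candidate name, constituency)
rows = [
    ("Jane Example", "Exampleton North"),
    ("John Sample", "Sampleford South"),
]


def rows_to_csv(rows):
    """Turn scraped rows into CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Candidate name", "Constituency"])
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows)
print(csv_text)
```

To save it to disk instead of printing it, you could write `csv_text` to a file such as `candidates.csv` (again, a name I’ve picked for illustration).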