How to make a good scraper better
While this window is showing instructions, the user interface of OutWit Hub remains operational.
You can still interact normally with the application and you can move this tutorial window around on the screen to better see the parts of the interface that you want.
Here is the sample data
In the first tutorial about making a scraper, we saw how to extract the population data on this page.
Say we want more detail than this:
The country should be in a separate column. We may also want to split the coordinates into two different fields. And finally, a piece of information on the page has simply been lost: the continent.
We can grab all this and more.
Splitting City and Country
City and country are separated by a comma. We can use this separator to split them apart.
... Voilà! City and Country are now in two separate columns.
IMPORTANT NOTE: "City" is the first field. It will be considered the record delimiter by OutWit Hub. This means that if you had chosen another field as the delimiter, the data rows might have been cut in the middle. To make sure the scraper "wraps" the data rows as it should, try to follow the order in which the data elements appear in the source code when you build your scraper.
Splitting the Coordinates
Latitude and longitude are also separated by a comma. Let's use the Separator field the same way we just did.
Our data is now extracted into five separate columns, as we wanted.
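The splitting the Separator field performs can be pictured in a few lines of Python. This is only an illustrative sketch, not OutWit Hub's actual code, and the sample rows below are hypothetical placeholders for the scraped data:

```python
# Hypothetical rows, roughly as the scraper sees them before splitting:
rows = [
    "Tokyo, Japan | 35.68, 139.69",
    "Paris, France | 48.85, 2.35",
]

records = []
for row in rows:
    place, coords = row.split(" | ")
    # Splitting on the comma separator, as in the City/Country step:
    city, country = [s.strip() for s in place.split(",")]
    # The same separator trick applied to the coordinates:
    latitude, longitude = [s.strip() for s in coords.split(",")]
    records.append((city, country, latitude, longitude))
```

Each record now carries city, country, latitude, and longitude as separate fields, which is exactly what the two Separator entries achieve in the scraper.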
The only one missing now is the continent...
The continent is at a different (higher) level in the HTML list. This is why we couldn't grab it until now.
The continent names are not repeated on every row. They are on the first hierarchical level of the HTML list.
This means that the scraper, which has only one level of records, must create another field for this piece of information and repeat it in every data row.
(Aren't we lucky?)
Directives are additional commands that let you alter the scraper's normal behaviour.
A directive must be entered in the Description field and surrounded with pound signs (see help). By typing #repeat#myFieldName in the description of a scraper line, you can add the data scraped by this line to an additional column named 'myFieldName' for each grabbed record.
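The effect of the #repeat# directive can be sketched in Python. This is only an illustrative model (not OutWit Hub's implementation), with hypothetical city and continent values: a value captured on one scraper line is remembered and copied into every record grabbed afterwards, until a new value replaces it.

```python
# Hypothetical scraper output, in source-code order: continent headings
# appear once, at a higher level than the city rows beneath them.
lines = [
    ("continent", "Asia"),
    ("city", "Tokyo"),
    ("city", "Shanghai"),
    ("continent", "Europe"),
    ("city", "Paris"),
]

repeated = None  # the value the #repeat# directive carries forward
records = []
for kind, value in lines:
    if kind == "continent":
        repeated = value  # remembered, not emitted as its own row
    else:
        records.append({"City": value, "Continent": repeated})
# Every city row now has a Continent column, even though the continent
# name appears only once in the page.
```

This is why the directive solves our problem: the continent, which sits on a higher level of the HTML list and is never repeated per row, still ends up filled in on every data row.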
Yesss!... This looks more like it.
The last elements we need to extract are the flags.
We have now added a new field to our scraper to extract the URL of the country flag for each city. The image URLs are located in img tags, and we just need to grab the string between the double quotes.
The image URLs are now scraped, but two problems remain:
1. the URLs are offset by one row.
2. they are "relative links": the first part of the URL is missing.
Solving problem #1 is easy: the flag URL is on the last line of our scraper, but it appears first in the page source code. Dragging the fifth line of the scraper up to the first position will solve this issue.
The Flag row is now first, which solves the offset problem.
As for the partial URL, the solution consists in adding a variable in the Replace field: #BASEURL# is the path to the current file (see help).
When typing a replacement value, \0 refers to the extracted data. So the replacement string #BASEURL#\0 means the concatenation of the path to the current page and the part of the URL that was extracted.
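The effect of the #BASEURL#\0 replacement can be pictured with Python's standard `urljoin`. The URLs below are hypothetical examples; the point is only to show how the page's path and the extracted fragment combine:

```python
from urllib.parse import urljoin

# Hypothetical values: base_url plays the role of #BASEURL# (the path to
# the current page), and extracted plays the role of \0 (the string
# captured by the scraper line).
base_url = "http://www.example.com/data/cities.html"
extracted = "images/flags/jp.png"

# #BASEURL#\0 prepends the page's path to the captured relative link:
absolute = urljoin(base_url, extracted)
# -> "http://www.example.com/data/images/flags/jp.png"
```

The relative link is resolved against the directory of the current page, turning the partial URL into a full, usable one.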
Here we are.
Congratulations!
Now, you can really be proud
of your first geeky scraper.
You can now export the results directly from the 'scraped' view or move them to your Catch and keep the data there until you decide what to do with it.
We will progressively publish other tutorials to lead you through the main features of OutWit Hub. Stay tuned.
You still haven't had enough of scrapers?... Try making your own and, if you feel a tutorial on a specific point is missing, don't hesitate to send us your suggestions.