So Many Ways to Explore the World...
While this window is showing instructions, the user interface of OutWit Hub remains operational.
You can still interact normally with the application, and you can move this tutorial window around on the screen to better see the parts of the interface you need.
The 5 main ways to auto-explore pages
1) Auto-browse series of results
2) Dig into a Website
3) Use grabbed or imported lists of URLs
4) Generate URLs
5) Build a self-navigating scraper
We are going to experiment with each of these methods in the following pages.
The other important exploration function (fast-scraping) is covered in a separate tutorial.
One Continent at a Time
Here is a page with a table of all African countries.
Let's simply grab this data with the 'tables' view.
Done. Now we can do this for the whole world... But first, we need to make sure we will not lose our data.
For this we can either keep the extracted data in the tables view by choosing Empty on Demand, or move it to the Catch by selecting Auto-Catch.
(Do not do both, or you will send masses of redundant data to the Catch at each page load.)
Let's disable 'Auto-Empty' and keep the data in tables.
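Conceptually, what the 'tables' view does is parse the page's <table> markup and turn each row into a row of data. Here is a minimal Python sketch of that idea, with made-up markup; it illustrates the principle only, not OutWit Hub's actual implementation.

```python
# Minimal sketch of the idea behind the 'tables' view: collect the text
# of every cell, row by row, from <table> markup. Illustrative only;
# this is not OutWit Hub's actual implementation.
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.in_cell = True
            self.row.append("")

    def handle_data(self, data):
        if self.in_cell:
            self.row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False
        elif tag == "tr" and self.row:
            self.rows.append(self.row)

parser = TableExtractor()
parser.feed("<table><tr><th>Country</th><th>Code</th></tr>"
            "<tr><td>Algeria</td><td>DZ</td></tr></table>")
print(parser.rows)  # [['Country', 'Code'], ['Algeria', 'DZ']]
```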
1) Auto-Browsing
We are ready.
As there is a nice simple Next Page link in our web page, OutWit couldn't miss it. We just need to click on the Next button (single right arrow) to go to the next page, or on the Browse button (double right arrow) to go through all the pages.
Let's click on Browse.
The program is now loading all the pages of the series, until no more Next Page link is found.
Once we have reached Oceania, all countries will have been collected in the tables view.
(Note that if you are running the light version of OutWit Hub, the number of extracted rows is limited to 100, so you will not see them all.)
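For the curious, auto-browsing amounts to something like the following hedged Python sketch: load a page, find a link whose text contains "Next", follow it, and stop when none is found. The regex-based link detection and the auto_browse() function are simplified assumptions; OutWit Hub's real heuristics are more elaborate.

```python
# Hedged sketch of auto-browsing: follow "Next Page" links until none
# is found. The link detection here is a simplified assumption, not
# OutWit Hub's actual heuristics.
import re
import urllib.request
from urllib.parse import urljoin

def auto_browse(start_url, max_pages=50):
    """Yield (url, html) for each page in the series."""
    url, seen = start_url, set()
    for _ in range(max_pages):
        if url in seen:
            break                       # avoid loops
        seen.add(url)
        with urllib.request.urlopen(url) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        yield url, html
        match = re.search(r'<a[^>]+href="([^"]+)"[^>]*>[^<]*Next', html, re.I)
        if not match:
            break                       # no more Next Page link: stop
        url = urljoin(url, match.group(1))
```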
2) The Dig Function
'Browse' was the first exploration tool. It is mostly useful for search engine or database result pages, with a succession of pages to be explored. OutWit does its best to find the link from one page to the other (and succeeds in most cases).
Digging (Navigation>Dig menu or double down arrow) means exploring the links of a page at a given depth, possibly with a filter criterion.
If you did a Dig without any limitations, the program would go from page to page, adding every link it finds to a queue, then exploring each queued page to collect more links, and so on... It would basically explore the Internet.
We don't want to do that.
We only wish to explore the six continent pages again (with URLs containing the string "countries/ow-explore"). Let's use this string to limit the exploration.
Here is the Advanced Settings dialog you get from the navigation menu:
(Note that Dig's Advanced Settings are a pro-only feature.)
The pages of this example do not contain outside links, so you can play with the Dig function with or without filters; the exploration will remain within the list of countries. Once you are done, we will see other ways to do this...
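If it helps to picture what Dig is doing, here is a small Python sketch of a depth-limited, filtered exploration: a queue of (URL, depth) pairs, where only links containing the filter string are enqueued. The dig() function and its fetch callback are illustrative stand-ins, not OutWit internals.

```python
# Sketch of a depth-limited, filtered dig: breadth-first exploration
# where only links containing a filter string are queued. fetch() is a
# stand-in callback (url -> html).
import re
from collections import deque
from urllib.parse import urljoin

def dig(fetch, start_url, depth=2, must_contain="countries/ow-explore"):
    queue, visited = deque([(start_url, 0)]), set()
    while queue:
        url, d = queue.popleft()
        if url in visited or d > depth:
            continue
        visited.add(url)
        html = fetch(url)
        for href in re.findall(r'href="([^"]+)"', html):
            link = urljoin(url, href)
            # The filter is what keeps the dig from wandering off
            # into the whole Internet.
            if must_contain in link:
                queue.append((link, d + 1))
    return visited
```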
3) Exploring lists of URLs
Another way to proceed is to select the list of URLs you want to visit in any datasheet of the application (or in a directory of the queries view)...
Then right-click on one of the selected links and choose Auto-Explore, then Browse, Dig or Fast Scrape. Try this with a few of the links.
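In script form, this is nothing more than iterating over the selected URLs and applying the chosen action to each one. The visit() function and the example.com URLs below are placeholders, not OutWit internals.

```python
# Script-form equivalent of "select URLs, right-click, Auto-Explore":
# visit each URL in a list and apply the action you chose.
urls = [
    "https://example.com/countries/country_001.html",  # placeholder URLs
    "https://example.com/countries/country_002.html",
    "https://example.com/countries/country_003.html",
]

def visit(url):
    print("exploring", url)  # stand-in for Browse, Dig or a scrape

for url in urls:
    visit(url)
```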
4) Exploring generated URLs
If the links you want to explore follow a logical format and are not conveniently gathered in a single page, the easiest approach is often to ask OutWit to generate them.
You may have already noticed that all country page file names share the same format: country_XXX.html. In such cases, you can generate the URLs to explore. A simple pattern format allows you to do this easily in OutWit Hub.
In this pattern, we have simply replaced the number with the range [001:194] in the URL. If you right-click on it and choose 'insert rows', it will generate the strings in the directory. If you right-click on it and choose Browse or Dig in the Auto-Explore submenu, it will load all the pages one by one... Finally, you can use it in a macro, apply a scraper to all the URLs it generates, etc.
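As a point of comparison, here is how the same [001:194] expansion could be written in Python; the base URL is a placeholder, since only the country_XXX.html file-name format comes from the example.

```python
# Expanding a [001:194] range inside a URL pattern, as described above.
# The base URL is a placeholder.
pattern = "https://example.com/countries/country_{:03d}.html"
urls = [pattern.format(n) for n in range(1, 195)]  # 001 through 194
print(urls[0])    # https://example.com/countries/country_001.html
print(urls[-1])   # https://example.com/countries/country_194.html
```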
5) Self-Navigating Scrapers
We have learned about the 'Browse' and 'Dig' functions and used them on a page, on a list of URLs, and on a URL generation pattern. We could also have done a fast scrape on these pages, but 'Fast Scraping' is the subject of a separate tutorial. (You should refer to it after this one.)
The final function we will experiment with involves advanced scraper features. This is often the last resort when no simple and logical naming format can be found or when the list of URLs is too difficult to collect: including the navigation within the scraper itself.
Let's make a little scraper to grab the country name and its international two-letter code. We will then give it the ability to navigate through the pages on its own.
Let's make sure our scraper doesn't apply only to country pages (URLs containing "country_") but also to list pages: it must apply to URLs containing "tutorials/work/countries". (In other cases, you may have to create several scrapers.)
Then, let's add a #nextPage# directive for the country pages, and set it to #BACK# so that after scraping a country, the program backtracks to the list. We put #nextPage#0# so that, with the lowest rating, this scraper line is only applied if no other next page link is found.
When on a page with a list of countries, line 4 will grab the next page and line 5 will add country URLs to the queue to be explored.
Line 6 tells OutWit to explore the URLs of the queue one after another (with the highest priority).
Now, we just need to go to the top page (the list of continents), keep the data in the scraped view (by unchecking 'Empty')...
...and click on Browse.
The program first visits all the list pages, storing the URLs to explore in the queue thanks to the #addToQueue# directive; it then grabs the data from the country pages.
(To be kind to our servers and to your patience, we are only exploring the first pages, so you will have data for just a handful of countries, but you get the point.)
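To summarize the mechanics in code form, here is a hedged Python sketch of the navigation logic these directives set up: list pages enqueue country URLs (the role of #addToQueue#) and the next list page, while country pages are scraped and control falls back to the queue, which plays the role of #nextPage#BACK# at the lowest rating. The fetch callback, the markup patterns, and the crawl() function are assumptions for illustration, not OutWit's actual engine.

```python
# Hedged sketch of a self-navigating crawl. Assumes absolute URLs and
# simple illustrative HTML; none of this is OutWit's actual engine.
import re
from collections import deque

def crawl(fetch, start_url):
    queue, results, seen = deque([start_url]), [], set()
    while queue:
        url = queue.popleft()          # queued URLs have priority
        if url in seen:
            continue
        seen.add(url)
        html = fetch(url)
        if "country_" in url:
            # Country page: grab name and two-letter code (assumed to
            # follow the name), then "go back" by simply taking the
            # next queued URL instead of following a link.
            m = re.search(r"<h1>([^<]+)</h1>.*?\b([A-Z]{2})\b", html, re.S)
            if m:
                results.append((m.group(1), m.group(2)))
        else:
            # List page: queue every country URL, then the next list page.
            queue.extend(re.findall(r'href="([^"]*country_[^"]*)"', html))
            nxt = re.search(r'<a[^>]+href="([^"]+)"[^>]*>[^<]*Next', html, re.I)
            if nxt:
                queue.append(nxt.group(1))
    return results
```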
Our Tutorial is Over
(Your turn now.)
Auto-Browsing through series of pages, Digging down the hierarchy of a site, combining both, exploring lists of URLs or automatically generated URLs, fast scraping... all these functions are described in the Help. The basic features of OutWit Hub are as simple as a click or two. When confronted with complex cases, though, choosing the most appropriate option can save you hours or even days...