WIZ-SCRAPER01.ALT

This is an OutWit Tutorial file.


OutWit Hub - Making a Scraper
Application walkthrough: creating a simple scraper
userSpace.restoreOriginalPrefs = function restoreOriginalPrefs(){
  if(userSpace.originalPrefs){
    for(var thePref in userSpace.originalPrefs){
      witscript.setPreference(thePref, userSpace.originalPrefs[thePref]);
    }
  }
};
userSpace.storeOriginalPrefs = function storeOriginalPrefs(){
  if(!userSpace.originalPrefs){
    userSpace.originalPrefs = {};
    userSpace.originalPrefs["browse.tempo.min"] = witscript.getPreference("browse.tempo.min");
    userSpace.originalPrefs["browse.tempo.max"] = witscript.getPreference("browse.tempo.max");
    userSpace.originalPrefs["images.ondemandonly"] = witscript.getPreference("images.ondemandonly");
    userSpace.originalPrefs["page.ignorePlugins"] = witscript.getPreference("page.ignorePlugins");
    userSpace.originalPrefs["page.ignoreImages"] = witscript.getPreference("page.ignoreImages");
    userSpace.originalPrefs["tableMinRows"] = witscript.getPreference("tableMinRows");
  }
};
userSpace.setWizardPrefs = function setWizardPrefs(){
  witscript.setPreference("browse.tempo.min", "2000");
  witscript.setPreference("browse.tempo.max", "3500");
  witscript.setPreference("images.ondemandonly", false);
  // witscript.setPreference("page.ignorePlugins", false);
  witscript.setPreference("page.ignoreImages", false);
  witscript.setPreference("tableMinRows", "1");
};


This tutorial is going to walk you through your first OutWit Scraper.


While this window is showing instructions, the user interface of OutWit Hub remains operational.

You can still interact normally with the application and you can move this tutorial window around on the screen to better see the parts of the interface that you want.

userSpace.waitOK = witscript.version("4") || !/Firefox\/2\d\./.test(navigator.userAgent);
userSpace.eyeCatcherOK = !(wizardKit.platform=="mac" && /firefox/i.test(navigator.userAgent) && /rv:1[2-7]/i.test(navigator.userAgent));
if (/Firefox\/2\d\./.test(navigator.userAgent) && !witscript.version("4")) {
  userSpace.eyeCatcherOK = false;
  wizardKit.typeCellValue = function typeCellValue(tree, row, column, value){
    //alert("This wizard may not work with this version of the application.");
    tree.setCellValue(row, column, value);
    tree.startEditing(row, column);
    //witscript.wait(200);
    tree.stopEditing(true);
    //witscript.wait(200);
  };
}
if (/Firefox\/[23]\./.test(navigator.userAgent)){
  alert("OutWit wizards cannot run on your version of Firefox. Please update to the current version and try again.");
  wizard.close();
} else if (!("witscript" in window) || !witscript.version || !witscript.version("2.0.1")){
  alert("This wizard is not compatible with your version of the OutWit Kernel. Please download the latest version (2.0.1 or higher)");
  wizard.close();
}
if(witscript.version()=="2.1.1.4"){
  alert("Version 2.1.1.5 was released with important fixes. Please update and restart the tutorial.");
}
if(witscript.version("2.1")){$(".owui-wizard-homelink").html("Hub Tutorials")};
userSpace.storeOriginalPrefs();
userSpace.setWizardPrefs();
witscript.logPanel.setAttribute("height",0);

wizardKit.say(this.parentNode);

// For some reason a script in owui-wizard-page is not executed if there are steps:
// alert("This is page 2.0");

wizardKit.say(this.parentNode);

// Note: if several scripts for same event in a step, all are executed
// alert("page 2.1");
wizardKit.hideCatch();
wizardKit.hideLog();

What is a scraper?

A Scraper is a template telling OutWit how to extract information from a page.

It is simply a list of the fields you want to recognize and extract in a page. For each field, it specifies the name of the field, the strings surrounding the data to extract and the format of the data.
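To make the idea concrete, here is a minimal sketch of a scraper as a plain list of field definitions, with a helper that pulls out whatever sits between two markers. It is not OutWit's internal format or implementation: the property names (`description`, `before`, `after`, `format`), the helper functions and the sample string are all assumptions made for illustration.

```javascript
// Hypothetical representation of a scraper: one object per field.
// The property names are assumptions, not OutWit's storage format.
const scraper = [
  { description: "Title", before: "<h1>", after: "</h1>", format: "" },
];

// Escape characters that are special in regular expressions,
// so that markers are matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Return every string found between field.before and field.after.
function extractField(source, field) {
  const pattern = new RegExp(
    escapeRegExp(field.before) + "([\\s\\S]*?)" + escapeRegExp(field.after),
    "g"
  );
  return Array.from(source.matchAll(pattern), (match) => match[1]);
}

// Made-up page source, for illustration only.
const source = "<h1>Sample Data For Extraction</h1><p>Some text</p>";
console.log(extractField(source, scraper[0]));
```

The non-greedy group `([\s\S]*?)` stops at the first occurrence of the closing marker, which is what keeps each extracted value as short as possible.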

wizardKit.say(this.parentNode);

When automatic extraction is not enough:

When the tables, lists or guess views fail to recognize the structure of a page automatically and extract its data the way you want, you can still create a Scraper manually and tell OutWit how to handle this specific URL (or all the pages of a given Web site, a sub-section of it, etc.).

witscript.views.page.load("http://www.outwit.com/support/help/hub/tutorials/sample_list.html"); witscript.views.page.display(); witscript.menutree.focus(); wizardKit.say(this.parentNode);

Here is a sample page with data

In this page, under the title "Sample Data For Extraction", you will find a short list of cities, with their country and flag, GPS coordinates and population.

If you go through the lists, tables and guess views, you will see that this data is not laid out in a way that allows OutWit to propose a satisfactory export structure.

Don't forget that you can drag this window around the screen as you wish to reveal hidden parts of the interface.

witscript.setPreference("tableMinRows",1); witscript.views.tables.display(); witscript.menutree.focus(); if (userSpace.eyeCatcherOK) wizardKit.eyeCatcher(witscript.menutree.treechildren,1,.35,0,90); wizardKit.say(this.parentNode);

In this particular page:

The tables view shows all the data in a single line

This view extracts all the HTML tables from the source code. It usually works well, but the result depends on how the page was built. In this sample page, all the data is coded as a single-cell table. Note that, depending on your advanced preferences (Tools > Preferences > Advanced), tables of fewer than 3 rows may be ignored.

witscript.views.lists.display(); witscript.menutree.focus(); if (userSpace.eyeCatcherOK) wizardKit.eyeCatcher(witscript.menutree.treechildren,1,.35,0,105); wizardKit.say(this.parentNode);

The lists extractor is giving better results, but...

The lists view extracts all the HTML lists from the source code. As the data in this page is indeed an HTML list, we do get each city in a separate row, but the data is still not split into separate cells.
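The limitation can be sketched in a few lines. The snippet below is a naive illustration, not how OutWit's lists view works: the `extractListItems` helper and the HTML fragment (with invented city names and figures) are assumptions. It yields one string per list item, so every row still holds a single unsplit cell.

```javascript
// Made-up HTML, for illustration only; the real sample page may differ.
const html =
  "<ul>" +
  "<li>Alphaville (10.0, 20.0): 1,000 inhab.</li>" +
  "<li>Betatown (30.5, 40.5): 2,500 inhab.</li>" +
  "</ul>";

// Return the inner content of every <li> element, one entry per row.
// A regular expression is enough for this flat example, but it is
// not a general HTML parser.
function extractListItems(source) {
  const items = [];
  const pattern = /<li\b[^>]*>([\s\S]*?)<\/li>/gi;
  let match;
  while ((match = pattern.exec(source)) !== null) {
    items.push(match[1].trim());
  }
  return items;
}

const rows = extractListItems(html);
rows.forEach((row) => console.log(row));
```

Each element of `rows` contains the city, coordinates and population in one string, which is why a scraper with explicit markers is needed to separate them.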

witscript.views.guess.display(); witscript.menutree.focus(); if (userSpace.eyeCatcherOK) wizardKit.eyeCatcher(witscript.menutree.treechildren,1,.35,0,120); wizardKit.say(this.parentNode);

Guess... well... doesn't.

The guess view cannot find a structure in the data layout: there are too few recognizable markers and separators for the program to do a good job, so rather than risk showing bad results, it returns none.

witscript.views.scrapers.manager.display(); if(witscript.views.scrapers.editor.isVisible){ if(witscript.views.scrapers.editor.isVisible()){ witscript.views.scrapers.editor.manageButton.click(); if(userSpace.waitOK) witscript.wait(7000,function(){return witscript.views.scrapers.manager.isVisible();}); } } else { witscript.views.scrapers.editor.saveButton.click(); if(userSpace.waitOK) witscript.wait(1000); witscript.views.scrapers.editor.manageButton.click(); } witscript.menutree.focus(); wizardKit.say(this.parentNode);

The Scrapers View.

In cases like this, you still have the option to create your own specific extractor in the scrapers view.

In this view, looking at the source code of the page, you can create the template that will help OutWit recognize the content to grab.

The scraper manager:

It lists all the scrapers saved in your profile and allows you to import, export, delete or duplicate existing scrapers or to alter their properties.

witscript.views.scrapers.source.findBar.toggleHighlight(false);
//making sure the datasheet is sorted by AID
//XXXXXXX would be better to make sure it is not sorted at all
witscript.views.scrapers.manager.datasheet.sort(0,0);
witscript.views.scrapers.manager.datasheet.scrollToRow(witscript.views.scrapers.manager.datasheet.getRowCount()-1);
//witscript.views.scrapers.manager.datasheet.select(witscript.views.scrapers.manager.datasheet.getRowCount()-1);
wizardKit.say(this.parentNode);

var currentAutomator = witscript.views.scrapers.manager.currentAutomator();
if (currentAutomator && currentAutomator.automatorId == -1) {
  // XXX seems outdated. Check and remove
  witscript.views.scrapers.editor.manageButton.click();
}
if (!currentAutomator || !userSpace.automatorName || userSpace.automatorName != currentAutomator.name || !(/Tutorial Scraper/.test(currentAutomator.name))) {
  witscript.views.scrapers.manager.createAutomator("Tutorial Scraper");
  userSpace.automatorName = witscript.views.scrapers.manager.currentAutomator().name;
  witscript.views.scrapers.editor.url.setValue("http://www.outwit.com/support/help/hub/tutorials/sample_list.html");
}
//alert("opening scraper");
witscript.views.scrapers.manager.editButton.click();
$("#automatorName").html(userSpace.automatorName);

Creating a new scraper.

A click on the 'New' button creates and opens a new scraper. A blank scraper was created; the tutorial script requests the name "Tutorial Scraper" for it.

wizardKit.say(this.parentNode);
witscript.views.scrapers.editor.sourceSelector.static.click();
witscript.views.scrapers.editor.display();
witscript.views.scrapers.editor.datasheet.focus();
witscript.views.scrapers.editor.datasheet.select(0);
if (userSpace.eyeCatcherOK) wizardKit.eyeCatcher(witscript.views.scrapers.editor,.5,.5,0,0);
wizardKit.typeCellValue(witscript.views.scrapers.editor.datasheet, 0, 2, "City");
wizardKit.typeCellValue(witscript.views.scrapers.editor.datasheet, 0, 3, "width=\"22\"> ");
witscript.views.scrapers.source.findBar.toggleHighlight(false);
witscript.views.scrapers.source.findBar.toggleHighlight(true, witscript.views.scrapers.editor.datasheet.getCell(0, 3));
witscript.views.scrapers.source.scrollToPercent(.4);
witscript.views.scrapers.editor.display();
witscript.menutree.focus();

The scraper editor.

Entering the markers before and after the data you wish to extract is simple: you can type them, copy/paste them, or just drag them from the source to the field you want.

Creating the field "City"

In the description column of the first row, we put the name of the field: "City". In the page source, the city name is located between the markers 'width="22"> ' and ' ('.
Note: <em> tags are removed from scraped data by the clean text function (bottom panel of the 'scraped' view). By default, only the visible content is extracted.
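As a rough illustration of what such a cleaning step does, the sketch below strips tags from an extracted value and tidies the whitespace. The `cleanText` function is a hypothetical stand-in written for this example, not OutWit's actual clean text function, and the sample value is made up.

```javascript
// Remove HTML tags, collapse runs of whitespace and trim the result.
// A naive sketch: it does not decode entities or handle malformed markup.
function cleanText(value) {
  return value
    .replace(/<[^>]*>/g, "")
    .replace(/\s+/g, " ")
    .trim();
}

// Made-up extracted value containing an <em> tag.
const raw = " <em>Alphaville</em>, Examplia ";
console.log(cleanText(raw));
```

Keeping the cleaning separate from the marker matching means the markers can be chosen from the raw source, tags included, while the exported value stays readable.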

witscript.views.scrapers.editor.datasheet.focus();
wizardKit.typeCellValue(witscript.views.scrapers.editor.datasheet, 0, 4, " (");
// witscript.views.scrapers.editor.datasheet.setCellValue(0, 4, " (");
// witscript.views.scrapers.editor.datasheet.startEditing(0, 4);
// witscript.wait(700);
// witscript.views.scrapers.editor.datasheet.stopEditing(true);
witscript.views.scrapers.source.findBar.toggleHighlight(false);
witscript.views.scrapers.source.findBar.toggleHighlight(true, witscript.views.scrapers.editor.datasheet.getCell(0, 4));
witscript.views.scrapers.source.scrollToPercent(.4);
witscript.views.scrapers.editor.display();
witscript.menutree.focus();

witscript.views.scrapers.editor.datasheet.focus();
witscript.views.scrapers.editor.datasheet.scrollToRow(0);
witscript.views.scrapers.editor.datasheet.select(0);
if(witscript.version("5.0")) {
  thePattern = witscript.lineRegExp({
    ok:true,
    description:views.scrapers.editor.datasheet.getCell(0, 2),
    before:witscript.convertLiterals(views.scrapers.editor.datasheet.getCell(0, 3)),
    after:witscript.convertLiterals(views.scrapers.editor.datasheet.getCell(0, 4)),
    format:""
  });
} else {
  thePattern = witscript.lineRegExp([
    witscript.views.scrapers.editor.datasheet.getCell(0, 2),
    witscript.convertLiterals(witscript.views.scrapers.editor.datasheet.getCell(0, 3)),
    witscript.convertLiterals(witscript.views.scrapers.editor.datasheet.getCell(0, 4)),
    ""
  ]);
}
witscript.views.scrapers.source.findBar.toggleHighlight(false);
witscript.views.scrapers.source.findBar.toggleHighlight(true, "/"+thePattern+"/gi");
//if (userSpace.eyeCatcherOK) wizardKit.eyeCatcher(witscript.views.scrapers.editor,.5,.8,0,0);
witscript.views.scrapers.source.scrollToPercent(.4);
witscript.views.scrapers.editor.display();
witscript.menutree.focus();

witscript.views.scrapers.editor.datasheet.focus();
witscript.views.scrapers.editor.datasheet.scrollToRow(1);
witscript.views.scrapers.editor.datasheet.select(1);
witscript.views.scrapers.editor.datasheet.setCellValue(1, 1, true);
witscript.views.scrapers.editor.datasheet.setCellValue(1, 2, "Coordinates");
witscript.views.scrapers.editor.datasheet.setCellValue(1, 3, "(");
witscript.views.scrapers.source.findBar.toggleHighlight(false);
witscript.views.scrapers.source.findBar.toggleHighlight(true, witscript.views.scrapers.editor.datasheet.getCell(1, 3));
witscript.views.scrapers.source.scrollToPercent(.4);
witscript.views.scrapers.editor.display();
witscript.menutree.focus();
wizardKit.say(this.parentNode);

Creating the field "Coordinates"

Then come the coordinates, located between "(" and "):". (Note that we are reusing the "(", which was part of the 'Marker After' of the previous field.)

witscript.views.scrapers.editor.datasheet.focus();
witscript.views.scrapers.editor.datasheet.setCellValue(1, 4, "):");
witscript.views.scrapers.source.findBar.toggleHighlight(false);
witscript.views.scrapers.source.findBar.toggleHighlight(true, witscript.views.scrapers.editor.datasheet.getCell(1,4));
witscript.views.scrapers.source.scrollToPercent(.4);
witscript.views.scrapers.editor.display();
witscript.menutree.focus();

witscript.views.scrapers.editor.datasheet.focus();
witscript.views.scrapers.editor.datasheet.scrollToRow(1);
witscript.views.scrapers.editor.datasheet.select(1);
if(witscript.version("5.0")) {
  thePattern = witscript.lineRegExp({
    ok:true,
    description:views.scrapers.editor.datasheet.getCell(1, 2),
    before:witscript.convertLiterals(views.scrapers.editor.datasheet.getCell(1, 3)),
    after:witscript.convertLiterals(views.scrapers.editor.datasheet.getCell(1, 4)),
    format:""
  });
} else {
  thePattern = witscript.lineRegExp([
    witscript.views.scrapers.editor.datasheet.getCell(1, 2),
    witscript.convertLiterals(witscript.views.scrapers.editor.datasheet.getCell(1, 3)),
    witscript.convertLiterals(witscript.views.scrapers.editor.datasheet.getCell(1, 4)),
    ""
  ]);
}
witscript.views.scrapers.source.findBar.toggleHighlight(false);
witscript.views.scrapers.source.findBar.toggleHighlight(true, "/"+thePattern+"/gi");
//if (userSpace.eyeCatcherOK) wizardKit.eyeCatcher(witscript.views.scrapers.editor,.5,.8,0,15);
witscript.views.scrapers.source.scrollToPercent(.4);
witscript.views.scrapers.editor.display();
witscript.menutree.focus();

witscript.views.scrapers.editor.datasheet.focus();
witscript.views.scrapers.editor.datasheet.scrollToRow(2);
witscript.views.scrapers.editor.datasheet.select(2);
witscript.views.scrapers.editor.datasheet.setCellValue(2, 1, true);
witscript.views.scrapers.editor.datasheet.setCellValue(2, 2, "Population");
witscript.views.scrapers.editor.datasheet.setCellValue(2, 3, "):");
witscript.views.scrapers.source.findBar.toggleHighlight(false);
witscript.views.scrapers.source.findBar.toggleHighlight(true, witscript.views.scrapers.editor.datasheet.getCell(2,3));
//witscript.views.scrapers.source.findBar.highlight.setValue(true);
witscript.views.scrapers.source.scrollToPercent(.4);
witscript.views.scrapers.editor.display();
witscript.menutree.focus();
wizardKit.say(this.parentNode);

Creating the field "Population"

The same process applies to the population figures, located between "):" and "inhab.</li>".

witscript.views.scrapers.editor.datasheet.focus();
witscript.views.scrapers.editor.datasheet.setCellValue(2, 4, "inhab.</li>");
witscript.views.scrapers.source.findBar.toggleHighlight(false);
witscript.views.scrapers.source.findBar.toggleHighlight(true, witscript.views.scrapers.editor.datasheet.getCell(2,4));
witscript.views.scrapers.source.scrollToPercent(.4);
witscript.views.scrapers.editor.display();
witscript.menutree.focus();

witscript.views.scrapers.editor.datasheet.focus();
if(witscript.version("5.0")) {
  thePattern = witscript.lineRegExp({
    ok:true,
    description:views.scrapers.editor.datasheet.getCell(2, 2),
    before:witscript.convertLiterals(views.scrapers.editor.datasheet.getCell(2, 3)),
    after:witscript.convertLiterals(views.scrapers.editor.datasheet.getCell(2, 4)),
    format:""
  });
} else {
  thePattern = witscript.lineRegExp([
    witscript.views.scrapers.editor.datasheet.getCell(2, 2),
    witscript.convertLiterals(witscript.views.scrapers.editor.datasheet.getCell(2, 3)),
    witscript.convertLiterals(witscript.views.scrapers.editor.datasheet.getCell(2, 4)),
    ""
  ]);
}
witscript.views.scrapers.source.findBar.toggleHighlight(false);
witscript.views.scrapers.source.findBar.toggleHighlight(true, "/"+thePattern+"/gi");
//if (userSpace.eyeCatcherOK) wizardKit.eyeCatcher(witscript.views.scrapers.editor,.5,.8,0,30);
witscript.views.scrapers.source.scrollToPercent(.4);
witscript.views.scrapers.editor.display();
witscript.menutree.focus();

if (userSpace.eyeCatcherOK) wizardKit.eyeCatcher(witscript.views.scrapers.editor.executeButton);
wizardKit.say(this.parentNode);

Testing your new scraper.

When you hit the Execute button, OutWit applies the scraper to the current page. The 'Scraped' view is displayed, with the results of your new extractor.
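The following sketch imitates, outside of OutWit, what applying the three fields of this tutorial to a page might look like. The markers are the ones entered above; everything else is an assumption: the source lines are invented (the real sample page may be coded differently), the helper functions are hypothetical, and pairing the n-th match of each field into the n-th row is a simplification, not a description of OutWit's engine.

```javascript
// Field definitions using the markers from this tutorial.
const fields = [
  { description: "City", before: 'width="22"> ', after: " (" },
  { description: "Coordinates", before: "(", after: "):" },
  { description: "Population", before: "):", after: "inhab.</li>" },
];

// Made-up source lines with invented cities and figures.
const source = [
  '<li><img src="a.png" width="22"> Alphaville (10.0, 20.0): 1,000 inhab.</li>',
  '<li><img src="b.png" width="22"> Betatown (30.5, 40.5): 2,500 inhab.</li>',
].join("\n");

// Escape regular expression metacharacters so markers match literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Return all trimmed values found between the two markers of a field.
function extractField(text, field) {
  const pattern = new RegExp(
    escapeRegExp(field.before) + "([\\s\\S]*?)" + escapeRegExp(field.after),
    "g"
  );
  return Array.from(text.matchAll(pattern), (match) => match[1].trim());
}

// Build one record per row by pairing the n-th match of every field
// (an assumption made to keep the sketch short).
function applyScraper(text, fieldList) {
  const columns = fieldList.map((field) => extractField(text, field));
  const rowCount = Math.max(0, ...columns.map((column) => column.length));
  const rows = [];
  for (let i = 0; i < rowCount; i++) {
    const row = {};
    fieldList.forEach((field, j) => {
      row[field.description] = columns[j][i] || "";
    });
    rows.push(row);
  }
  return rows;
}

const rows = applyScraper(source, fields);
console.log(rows);
```

With this made-up source, each record ends up with separate City, Coordinates and Population values, which is the tabular structure the tables, lists and guess views could not produce.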

witscript.views.scrapers.editor.executeButton.click(); witscript.menutree.focus(); //document.getElementById("Comment6.2").play();
userSpace.setWizardPrefs();

witscript.views.scrapers.display();
witscript.views.scrapers.editor.manageButton.click();
witscript.views.scraped.display();
// wizardKit.showCatch();
// XXX JC: This should not be here. Move to the close button (or event)
userSpace.restoreOriginalPrefs();
if(witscript.version("2.1")){$(".owui-wizard-homelink").attr("style","color: #DFFFF9 !important; float:left;").html("More Tutorials")};

wizardKit.say(this.parentNode);

Congratulations!
You have created your first scraper.

You can now export the results directly from the view or move them to your Catch and keep the data there until you decide what to do with it.

We will progressively publish other tutorials to lead you through the main features of OutWit Hub. Stay tuned.

Except if...

You haven't had enough of scrapers, it's so much fun... The data is not presented exactly as you'd like...
In these cases, brace yourself and click for more!