Here are answers to the most frequent questions. [updated:2/23/18]
What is OutWit Hub and when should I use it?
When you are looking for something on the Web, search engines give you lists of links to the answers. The purpose of OutWit Hub is to actually go retrieve the answers for you and save them on your disk as data files, Excel tables, lists of email addresses, collections of documents, images...
If your question has one simple answer, it will be at the top of Wikipedia or Google results and you don't need OutWit for that. When you know, however, that it would take you 20, 50, 500 clicks to get what you want, then odds are you do need OutWit Hub:
The Hub is an all-in-one application for extracting and organizing data, images, documents from online and local sources. It offers a wealth of data recognition, autonomous exploration, extraction and export features to simplify Web research. OutWit Hub exists both as an add-on for Firefox (up to FF version 42) and as a standalone application for Windows, Mac OS, and Linux.
OK, I have downloaded OutWit Hub and I am running it. Now what?
The best first thing to do is certainly to run the built-in tutorials from the Help menu (Help>Tutorials).
I want OutWit Hub to browse through a series of result pages but the ‘Next in Series’ and ‘Browse’ buttons are disabled. How come?
When opening a Web page, OutWit analyzes the source code and tries to understand as many things as possible about the page. The first thing it does is look for navigation links (next, previous…) and, when it finds them, the ‘Next in Series’ arrow and ‘Browse’ double arrow become active. If they are inactive, it is because OutWit did not find any additional pages. There are several workarounds to do the scrape without having to click on all links manually. Depending on the case, the best alternatives are: using the Dig function (with advanced settings in the Pro, Expert & Enterprise editions); generating the URLs to explore; making a 'self-navigating' scraper with the #nextPage# directive; or, finally, grabbing the URLs you want to scrape, putting them in a directory of queries and using this directory to do a new automatic exploration. (Note that for the latter, it is also possible to grab the links to the Catch in one macro and address the column of the Catch by the name you gave it in a second macro, by typing 'catch/yourColumnName' in the Start Page textbox.)
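As an illustration of the 'generating the URLs to explore' workaround, here is a minimal Python sketch that builds a list of result-page URLs outside the Hub, one per line, so they can be pasted into a directory of queries. The URL template, the 'page' parameter and the page range are hypothetical; adapt them to the site you are working on.

```python
def generate_page_urls(template, first, last, step=1):
    """Return one URL per page number, substituting {n} in the template."""
    return [template.format(n=n) for n in range(first, last + 1, step)]


# Hypothetical site and parameter name, used only for illustration.
urls = generate_page_urls("https://www.example.com/results?page={n}", 1, 5)
print("\n".join(urls))
```

If the site paginates by offset rather than by page number, the step argument can be used (for instance first=0, last=80, step=20).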
Some links are not working in the Standalone version of the Hub. What should I do?
These are links for which target="_blank" was specified in the source code. OutWit Hub cannot open separate popup windows but you can open them within the Hub. For this, check the "Open popup links in the application window" preference (Tools>Preferences>General).
Auto-Browse Function keeps stalling. What can I do?
There can be several reasons for this but the most likely is that some pages do not load fast enough for the timeout set in the preferences. In this case the program doesn't have time to find the link to the next page and stops browsing. The first thing you should do is increase the timeout values in Tools>Preferences>Exploration. Flash ads and banners could also be slowing down the process. To avoid this, disable everything you do not need for the extraction: Right-click on the page and choose Options. Disable at least images and plugins, even Javascript, if you can (don't forget to reactivate them afterwards).
Auto-Explore Functions and Fast Scraping are slower in the current version than in the previous. Why is that?
They are not, in fact. The program's exploration functions work exactly the same way. It is possible, though, that your preference settings have changed during the upgrade. Temporization and pause settings should actually be more precise and reliable than in previous versions. You can fine-tune all this in Tools>Preferences>Time Settings. Another recent preference which may have an impact on the exploration speed is 'Bypass Browser Cache' in the 'Advanced' panel: not using the cache does slow the browsing down, so you may want to set it to 'Never'. If, after this, you are still experiencing performance issues, consider disabling processes you may not need by right-clicking on 'page' in the left side bar.
The next page button functions correctly but when trying to do a Browse to capture the information, the application runs only 2 pages then stops. Why?
- Cause: the next page link is probably a javascript link and it is probably the same in all pages, so the program thinks this URL has already been visited and stops the exploration.
- Solution: there is a preference (Tools>Preferences>Exploration) just for this. Uncheck "Only visit pages once...". Important: Do not forget to check it back afterwards or your next Dig would probably last forever and bring back huge amounts of redundant data.
I have made a scraper which works fine on the page I want to scrape, but when I do a browse and set the 'scraped' view to collect the data, it grabs the data of the first page over and over again. What is happening?
You are probably trying to scrape information from AJAX pages where the data is dynamically added to the page by Javascript scripts. You need to set the type of source to be used by your scraper to Dynamic. When you do, the source code of the page will be displayed on a pale yellow background. Note that you will probably have to adapt your scraper if it was created for the Original source, as the code may have changed slightly.
When I browse through thousands of urls from a directory of queries, the program sometimes slows down dramatically or even freezes. How can I avoid that?
First, check that your machine resources are not too limited (particularly in terms of RAM: 4GB can be small for very large explorations, 8GB is more comfortable and with 16GB, you should never experience RAM problems). Close the error console if it is open as it consumes RAM and slows the process. Check that the datasheet or the Catch in which the extracted data goes is not sorted because, above a few thousand rows, it becomes pretty heavy to sort all incoming lines of data. Avoid very large numbers of columns: this is something OutWit particularly dislikes. 20 or 50 columns is fine, 100 is usually OK, more than 150 or 200 can make performance collapse dramatically. Also check that you do not have too many applications working at the same time.
Then, a good precaution is to ask the application to move the results to the Catch: choose 'Auto-Catch' in the bottom panel of the view. ALWAYS USE AUTO-EMPTY IF AUTO-CATCH IS ON, or you will end up with millions of useless lines in your Catch. This is also true in a macro: if you send data to the Catch or to an export file during the execution, make sure that the data is deleted from the datasheet, or it will accumulate and be re-sent every time, inflating the destination and slowing down the process progressively until it stalls. The Catch is saved regularly and will not disappear in case of problem.
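To see why Auto-Empty matters, here is a small Python simulation. It is not OutWit code: the number of pages and rows per page are made-up figures, and it assumes that the whole datasheet is sent to the Catch after every page and that nothing is deduplicated on the way.

```python
def rows_sent_to_catch(pages, rows_per_page, auto_empty):
    """Simulate sending the datasheet to the Catch after every page."""
    datasheet = 0  # rows currently sitting in the datasheet
    catch = 0      # rows accumulated in the Catch
    for _ in range(pages):
        datasheet += rows_per_page  # new rows extracted from this page
        catch += datasheet          # the whole datasheet is sent
        if auto_empty:
            datasheet = 0           # datasheet is cleared after the transfer
    return catch


with_empty = rows_sent_to_catch(pages=1000, rows_per_page=20, auto_empty=True)
without_empty = rows_sent_to_catch(pages=1000, rows_per_page=20, auto_empty=False)
print(with_empty, without_empty)
```

Under these assumptions, the same 20,000 extracted rows become about 10 million rows in the Catch when the datasheet is never emptied.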
This being said, if the application freezes for more than say a minute or even crashes, there is a reason. It should not happen. So if you have problem cases that can be reproduced, please tell us and we will do our best to correct them.
In any case, for large and broad explorations, it is always a good idea to disable everything you do not need for the extraction: Right-click on the page and choose Options. Disable at least images and plugins, even Javascript, if you can (don't forget to reactivate them afterwards). It will save a lot of processing time and eliminate many potential causes of problems. The culprits could be badly written Flash ads or looping scripts, in particular those trying to update content or banners in real time.
During an exploration or an extraction process, I sometimes see an alert saying that a script is not responding. What do I do?
Some extraction processes can be demanding and their execution can be long on very large Web pages or textual files that you load. You have several options at this point: click on the Cancel button if you believe that this is a one-time problem on an uninteresting oversized page; click Continue (several times, if necessary) if you believe that there is data to be found but this is an exceptional case; or cancel the process and increase the time allowed for script execution in the General preference panel if this is happening frequently. Note that it is better to set a value (a long one, if you wish) than to set it to unlimited: setting the preference to 0 (or checking the don't-show box in the alert) means that you will not be prompted for this in the future. This can be a problem in case of a real bug on a page: you would eventually have to force-quit the application.
How can I import lists of links (URLs) or other strings into OutWit Hub?
There are many different ways to do this. Here are a few:
How can I import CSV or other tabulated data into OutWit Hub?
Simply open the file (.txt, .csv ...) from the File menu. (Note that on some systems, the program may try to open .csv files with another application. In this case, just rename your file with the .txt extension.) If the original data was correctly tabulated, you should find the data well structured in the guess view. If the data was less structured, well, the Hub will do what it can.
How can I convert a list of values into a String Generation Pattern?
If the values are in one of the Hub's datasheets, just select them, right-click on one of them and select "Insert Rows...". If they are in a file on your hard disk, simply import them into a directory of queries (see above) and do the same.
I see the information I want on the page but it is not in the source code... How is this possible?
You are probably looking at the original source code of a dynamic AJAX page. The information is added to the page after the page is loaded. For this type of page, you need to work with the 'Dynamic' source code. Set the type of source to be used by your scraper to Dynamic. When you do, the source code of the page will be displayed on a pale yellow background. Note that the Light edition of OutWit Hub cannot scrape dynamic data.
I would like to extract the details of all the products/events/companies in this site/directory/list of subsidiaries... Could you please advise me on how to do that?
Unfortunately this is the purpose of the hundreds of features covered in the present Help, so it is difficult to answer in one sentence, but the general principle is this:
Go through the standard extractors (documents, lists, tables, guess...) by clicking in the left side panel. Either you find that one of them gives you the results you want, --in which case it is just a matter of exporting the data-- or you need to create a scraper for that site. In the second case, you first need to go to one of the detail pages, build a scraper in the 'scrapers' view for that page, and test it on a few other pages. Then go to the list of results you need to grab and have OutWit browse through all the links and apply your new scraper. This can be done in two ways: either by actually going to each page ('browse' or 'dig' or a combination of both if you have a Pro, Expert or Enterprise edition) or by 'Fast Scraping' them (applying your scraper to selected URLs --right-click: Auto-Explore>Fast Scrape in any datasheet-- or 'Fast Scrape' in a macro).
The program doesn't find all the email addresses in this website. Why is that?
There are several ways to have OutWit look for emails in a site. The fastest is to select Fast-Search For emails>In Current Domain, either from the Navigation menu or from the popup menu you get when you right-click on the page. This method, however, doesn't explore all pages in the site. It only looks for the most obvious (contacts, team, about us...) pages that can be found. If you want to systematically explore all pages in a site, you will have to use the Dig function, within domain, at the depth level you wish.
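The idea of a Dig "within domain, at the depth level you wish" can be pictured with the following Python sketch. It is only a conceptual illustration (a breadth-first walk over a small in-memory link map), not the way OutWit Hub is implemented; the URLs and the 'links' dictionary are invented for the example.

```python
from collections import deque
from urllib.parse import urlparse


def dig(start, links, max_depth):
    """Breadth-first walk of `links`, staying in the start page's domain."""
    domain = urlparse(start).netloc
    seen = {start}                 # each page is visited only once
    queue = deque([(start, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue               # do not follow links beyond the chosen depth
        for target in links.get(url, []):
            if urlparse(target).netloc == domain and target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return order


# Hypothetical site structure: page -> links found on that page.
links = {
    "https://example.com/": ["https://example.com/contact", "https://example.com/blog"],
    "https://example.com/blog": ["https://example.com/blog/post-1", "https://other.example.org/"],
    "https://example.com/blog/post-1": ["https://example.com/authors"],
}
shallow = dig("https://example.com/", links, max_depth=1)
deeper = dig("https://example.com/", links, max_depth=2)
print(len(shallow), len(deeper))
```

In this toy site, an address that only appears on the 'authors' page would be missed at depth 1 or 2 and would require a deeper exploration.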
Why doesn't the program find contact information (phone, address...) for some of the email addresses?
First, of course, the info has to be present in the page. Then, if it is there, no technology allows for perfect semantic recognition. An address or a phone number can take so many different forms, depending on the country, on the way it is presented or on how words are abbreviated, that we can never expect to reach a 100% success rate.
Email address recognition is nearly exhaustive in OutWit; phone numbers are recognized rather well in general; physical addresses are more of a challenge: they are better recognized for US, Canada, Australia and some European countries than for the rest of the world. The program recognizes names in many cases. As for other fields like the title, for instance, automatic recognition in unstructured data is too complex at this point and results would not be reliable enough for us to include them unless they are clearly labeled. We are constantly improving our algorithms so you should make sure to keep your application up-to-date.
In the meantime, if automatic recognition is not sufficient, the way to grab precisely the data you want in the format you want, is to create a custom scraper.
I am observing the progress and I see that no new line is added for some pages when I am sure there is an email address or other info that should be found. Why is that?
This page (or one containing similar info) was probably visited before. Results are automatically deduplicated. This means that if an email address --or just a phone number or physical address-- has already been found, the row containing this data will be updated (and no new row, created) when a new occurrence is found.
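The 'update instead of add' behavior can be pictured with the following Python sketch. It is an illustration only, not OutWit's actual algorithm: the field names and the rule of matching on a shared email or phone value are assumptions made for the example.

```python
def merge_contact(rows, found):
    """Update the row sharing an email or phone with `found`, else append."""
    for row in rows:
        same_email = found.get("email") and found.get("email") == row.get("email")
        same_phone = found.get("phone") and found.get("phone") == row.get("phone")
        if same_email or same_phone:
            # Fill in fields that were missing; no new row is created.
            for key, value in found.items():
                if value and not row.get(key):
                    row[key] = value
            return rows
    rows.append(dict(found))
    return rows


rows = []
merge_contact(rows, {"email": "info@example.com", "phone": ""})
merge_contact(rows, {"email": "info@example.com", "phone": "555-0100"})
merge_contact(rows, {"email": "sales@example.com", "phone": ""})
print(len(rows))
```

Three pages were processed here, but only two rows exist at the end: the second occurrence completed the first row instead of adding a line.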
What is the maximum number of rows of data OutWit Hub can extract and export? After a certain number of rows, when exporting, I get a dialog telling me a script is unresponsive. What should I do?
In our tests, we have extracted and successfully exported up to 1.3 million rows (of two or three columns).
Obviously, the limit varies a lot from system to system, depending on the platform, the RAM, the number of columns, the amount of data in each cell, etc. (Avoid very large numbers of columns: this is something OutWit particularly dislikes. 20 or 50 columns is fine, 100 is usually OK, more than 150 or 200 can make performance collapse dramatically.)
When exporting more than 50,000 or 100,000 rows, you may see unresponsive script dialogs, even several times in a row, when you click on Continue. There is a checkbox to stop this dialog from coming back.
(Note that Excel XML export is always much more demanding than CSV or TXT. RAM problems are more likely to happen with Excel than with CSV or TXT as, for the last two, export files are split when they exceed 250MB. You can lower this limit by typing about:config in the address bar and modifying the preference called 'extensions.outwit.export.maxFileSize'. Important: use caution when you change preferences this way. They will alter the behavior of the application and they will not be reverted to factory settings if you use the 'Restore Original Preferences' button in the preference panel.)
A recently added directive, available in the Expert and Enterprise editions, makes it possible to break down the export load into several files: #exportAndDeleteEvery# allows you to export as soon as the chosen number of rows is reached, relieving the application memory for very large extractions. This may produce large numbers of files in case of extensive scraping, so the ultimate step is to use the Enterprise edition and, using this directive, send the data directly to an SQLite database.
Finally, don't forget that you can simply move your results to the catch and save the catch itself in a file if you need to reuse the contents or just for backup purposes (File Menu). A catch file can only be read again in OutWit Hub but saving is always much faster than exporting the data.
How do I make a hidden column visible in a datasheet?
In the top right corner of every datasheet in the application is a little icon figuring a table with its header: the Column Picker. If you click on this icon, a popup menu will allow you to hide or show the different columns of the datasheet. Only visible columns are moved to the Catch and exported by default (this behavior can be changed with a custom export layout).
What is the Ordinal ID?
The Ordinal column is hidden by default in all datasheets. Use the column picker (icon at the top right corner of any datasheet) to display it. The Ordinal ID is an index composed of three groups of digits separated by dots. In browse mode, the first number is the number of the page from which the data line was extracted (it can only be higher than 1 if the 'empty' checkbox is unchecked). The second number is the position of the data block in the page (can only be more than 1 in 'tables', 'lists', 'scraped' and 'news' views). The last number is the position of the data line in the block (or in the page, if there is only one data block in the page). In fast scrape mode, the first number is the scrape execution, and the second is the page number. (If multiple queries were sent from a directory of queries, this second number matches the order in which the queries were sent, not always the order in which responses were received.)
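If you process exported data outside the Hub, the Ordinal ID can be split back into its three numbers. The Python sketch below assumes values shaped like '3.2.15' (three groups of digits separated by dots, as described above); the field names follow the browse-mode meaning and are chosen for this example only.

```python
from typing import NamedTuple


class OrdinalID(NamedTuple):
    page: int   # browse mode: page the data line was extracted from
    block: int  # position of the data block in the page
    line: int   # position of the data line in the block


def parse_ordinal(ordinal: str) -> OrdinalID:
    """Split an Ordinal ID such as '3.2.15' into three integers."""
    parts = ordinal.strip().split(".")
    if len(parts) != 3:
        raise ValueError(f"expected three dot-separated numbers, got {ordinal!r}")
    return OrdinalID(*(int(part) for part in parts))


oid = parse_ordinal("3.2.15")
print(oid.page, oid.block, oid.line)

# Sorting on the parsed numbers avoids the pitfalls of text sorting,
# where '10.1.1' would come before '2.1.3'.
ordered = sorted(["10.1.1", "2.1.12", "2.1.3"], key=parse_ordinal)
print(ordered)
```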
I just purchased the product. Where do I enter my serial number?
If you haven't already downloaded and installed the free version, do so from the home page of outwit.com, then run the application. In the menu bar at the top of the screen, the rightmost menu is 'Upgrade' if no serial number has been entered yet and 'Registration' if a key has already been entered. In this menu, choose 'Enter Serial Number' and enter the key. (To avoid errors, do not retype email and serial number. Instead, copy and paste them from the email you received from OutWit when you ordered.)
I do not manage to enter my serial number in the Registration Dialog of OutWit Hub. The program keeps saying the key is invalid.
Your key was sent to you by email when you purchased the application. It is a series of letters and digits similar to this: 6YT3X-IU6TR-9V45E-AFS43-89U64. It must not be confused with the login password to your account on outwit.com which was also sent to you by email (if you miss one of these email messages, please check your spam box).
If you are wondering whether the Hub you are using is the Light, Pro, Expert or Enterprise edition, you will simply find the answer in the window title. Up to now, we haven't had a single case where a valid serial number would not work. You might be experiencing a very rare bug but this seems very unlikely after several years. The most likely causes are the simple ones: First make sure you have the right program and that you are not trying to enter a Sourcer Pro key in OutWit Hub, for instance. The key needs to be entered exactly as it appears in the mail you received. So, either you are not typing it precisely right (in which case you should simply copy and paste the email address and the key from our original mail) or you are typing something completely different (the login to your outwit.com account, for instance?). If you have changed email addresses since you purchased your license, remember that the one to use is the one with which you originally placed your order.
This section gives you important info on how to find and back up your profile files and how to revert to factory settings. You should also read Help>Help>Standalone Application for more info on the same topic.
All my scrapers have disappeared. Help!
Don't panic! If, for some reason, your profile directory (or folder on Mac) was not found when the program started or if it doesn't have the name that was expected, OutWit Hub will create a new install with a blank profile. However, unless you have explicitly deleted the previous directory, your old profile and all your automators are not lost: they are sitting in your old profile directory, in a file called User_Gear.owg. You can either learn below how to work with multiple profiles or simply open the old User_Gear.owg file from the application with File>Manage Automators>Import Automators From... Read the following and the help page on the Standalone Application to know how to locate and manage profiles.
On OutWit Hub For Firefox, I have been experiencing new issues recently: unresponsive scripts, timeouts, strange behaviors on pages that used to work fine... what can I do to revert to factory settings?
We are not aware of incompatibilities with other add-ons but it can always happen. Some of your Firefox preferences could also have been changed by another extension, or files may have been corrupted in your profile. You can try to create a blank profile and reinstall OutWit Hub (or other OutWit extensions) from outwit.com. This will bring you back to the initial state. Here is how to proceed on Windows:
http://kb.mozillazine.org/Creating_a_new_Firefox_profile_on_Windows
and on other platforms:
http://support.mozilla.com/kb/Managing+profiles
Can I create a new profile in OutWit Hub Standalone?
With the standalone version, the principle is almost exactly identical to the way it works in Firefox (see above paragraph).
Windows: click "Start" > "Run", and type :
"C:\Program Files (x86)\OutWit\OutWit Hub\outwit-hub.exe" -no-remote -ProfileManager
Macintosh: Run the Terminal application and type :
/Applications/OutWit\ Hub.app/Contents/MacOS/outwit-hub -no-remote -ProfileManager
Linux: open a terminal and type :
[path to directory]/outwit-hub -no-remote -ProfileManager
If you need instructions to go further, refer to the profile manager instructions for Firefox:
http://support.mozilla.org/en-US/kb/profile-manager-create-and-remove-firefox-profiles
Where is my profile directory?
In OutWit Hub (Standalone or Firefox Add-on), if you type about:support in the address bar, you will get a page with important information about your system and configuration. In this page, you will find a button that will lead you to your current profile directory with a name like "u3p9be0z.Default". If you have multiple profiles or if you are looking for an old profile, look in the parent directory called "Profiles". Among the files you will see in the profile directory, the ones with .owc extensions are Catch files, and files ending with .owg are User Gear files (the User Gear is the database where all your automators are stored). You can back these files up or rename them if you plan to alter your profile.
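For those who prefer to script the backup, here is a minimal Python sketch that copies the Catch (.owc) and User Gear (.owg) files from a profile directory to a backup folder. It relies only on the file extensions mentioned above; the function name and both paths are placeholders, and the profile path must be replaced with the one shown on your about:support page.

```python
import shutil
from pathlib import Path


def backup_outwit_files(profile_dir, backup_dir):
    """Copy Catch (.owc) and User Gear (.owg) files to backup_dir."""
    profile = Path(profile_dir)
    target = Path(backup_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(profile.iterdir()):
        if path.is_file() and path.suffix.lower() in {".owc", ".owg"}:
            shutil.copy2(path, target / path.name)  # keeps file timestamps
            copied.append(path.name)
    return copied


# Example call (placeholder paths, to be replaced with your own):
# backup_outwit_files("/path/to/Profiles/u3p9be0z.Default", "/path/to/backup")
```

The function returns the names of the files it copied, so an empty list means that no .owc or .owg file was found in the directory you pointed it to.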