What is New in OutWit Hub v9

Here are the main additions and changes between v8.x and v9.x of OutWit Hub.
(The minimum required license type is indicated in parentheses.)
This page is just an overview; please look for usage details in the corresponding Help sections. [updated: 6/11/20]

Scrapers

New Directives and Advanced Replacement Functions
Here are some scraper directives and functions that were added in version 9.0
(check the Scraper Editor Help for details and editing):

Data Extraction & Enhancement

Many substantial enhancements in contact recognition and filtering
Implementation of job title recognition (mostly English for now)
Better elimination of example/bogus email addresses and phone numbers
Enhancements in name, company, address and copyright fields recognition
Enhancements throughout the program in first/last names split and in physical address split
Better handling of obfuscated email addresses

Recognition of other elements
Enhancement of date recognition, including dates without a year.

Data Refining: Editors & Datasheet Functions

More data editing & refining tools were added to the right-click menu on datasheets:
Delete columns > to the right: deletes the selected column and all additional columns to its right.
Clean up > Normalize All Figures: enhanced and optimized.
Shuffle Rows was enhanced to function in more contexts.
Compute: allows you to perform basic operations on the selected numerical cells of a column.
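
The Compute tool operates directly on the datasheet. As a rough outside-the-program illustration of the kind of operations involved (plain Python, with made-up cell values, not OutWit's implementation):

    # Hypothetical selection of numerical cells from one datasheet column.
    cells = ["12.5", "8", "n/a", "19.25", ""]
    values = []
    for cell in cells:
        try:
            values.append(float(cell))
        except ValueError:
            pass  # ignore cells that are not numerical
    if values:
        print("sum:", sum(values))
        print("average:", sum(values) / len(values))
        print("min/max:", min(values), max(values))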

Exporting Data

It is now possible to export each scraped row to a separate file.
Fixes and performance enhancements in export functions.

User Interface

A camera shutter sound was added when executing the #screenshot# directive.
The application is resized at launch if it exceeds the dimensions of the screen.
Fixed source colorizing problems in the case of line breaks inside HTML tags.

Other Functions

Multiple fixes and enhancements in String Generation functions.
Updated the list of User Agents.
Added a default referer preference, complementing the #downloadReferer# & #nextPageReferer# directives in scrapers (see the sketch below).
Added several actions in job definitions (alert & suspend, play sound, empty datasheet...).
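
For readers who are not familiar with referers: the preference and the two directives above control which Referer header is sent with a request. As a rough illustration in plain Python (placeholder URLs, not OutWit code), the equivalent raw HTTP request would be built like this:

    import urllib.request

    # Placeholder URLs; the Referer header tells the server which page "linked" to this request.
    req = urllib.request.Request(
        "https://example.com/file-to-download",
        headers={"Referer": "https://example.com/listing-page"},
    )
    with urllib.request.urlopen(req) as response:
        data = response.read()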

And more...

v9 includes many more enhancements and fixes that are not listed here, improving overall security, reliability and stability of the program.




New in OutWit Hub v8

Here are the main additions and changes between v7.x and v8.x of OutWit Hub.
(The minimum required license type is indicated in parentheses.)
This page is just an overview; please look for usage details in the corresponding Help sections. [updated: 6/11/19]

Scrapers

New Directives and Advanced Replacement Functions
Here are some scraper directives and functions that were added in version 8.0
(check the Scraper Editor Help for details and editing):

Data Extraction & Enhancement

Faster volume scraping
Faster start preparation and end-of-process cleanup in large-volume Fast Scrapes.

Periodic Scraper Executions (Expert & Enterprise)
When scraping a self-updating AJAX page, the #reapply# directive now allows you to perform the extraction n times, at the frequency you choose.

Contact Recognition (Pro & Above)
The contact recognition module and its dictionary were enhanced; lax recognition and the elimination of dummy email addresses were improved.

Recognition (Expert & Enterprise)
Improved the dictionary of multilingual words, acronyms and roots frequently used in company names, addresses, etc., to enhance recognition.

Data Refining: Editors & Datasheet Functions

Several tools were added to the right-click menu on datasheets:
Insert Index Column: inserts a column with an incremented index.
Duplicate Column: copies values from one column to another.
Indexed Duplicate Column: inserts a new column with the values from another, where duplicate cells are suffixed with an index.
Copy from Column...: copies values from one column to another.
Select if in...: selects rows in a datasheet if they are also found in another datasheet. Particularly useful for identifying the rows of an extraction that have already been extracted and sent to the Catch (see the sketch after this list).
Duplicate cells are now underlined in the scraper editor to be clearly identifiable.
Copy & paste allows you to transfer content between scrapers or jobs.
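
The "Select if in..." function described above is essentially a membership test between two datasheets. A minimal sketch of the idea in plain Python (hypothetical field name, not OutWit's implementation):

    # One list of freshly extracted rows, one list of rows already sent to the Catch.
    extracted = [{"email": "a@x.com"}, {"email": "b@y.com"}, {"email": "c@z.com"}]
    already_in_catch = [{"email": "b@y.com"}]

    catch_keys = {row["email"] for row in already_in_catch}
    selected = [row for row in extracted if row["email"] in catch_keys]
    print(selected)  # rows already sent to the Catch -> [{'email': 'b@y.com'}]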

SQLite Data Import (Enterprise)
It is now possible to open an SQLite database file in OutWit Hub. This can be very useful for resuming previous extractions, merging several files or simply refining the data after extraction.
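
Outside OutWit Hub, reading back such a file is straightforward with any SQLite client; a minimal sketch in plain Python (the file and table names are hypothetical):

    import sqlite3

    # Hypothetical file and table names; an actual export may use different ones.
    con = sqlite3.connect("previous_extraction.sqlite")
    for row in con.execute("SELECT * FROM extraction LIMIT 5"):
        print(row)
    con.close()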

Exporting Data

Added the semicolon-separated CSV export format for countries that use commas as decimal separators (see the sketch below).
Multiple fixes and enhancements in export functions.
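
The reason for the semicolon variant: in locales where the decimal separator is a comma, numbers such as 3,14 would collide with the comma that separates the values. A minimal sketch in plain Python (made-up data):

    import csv

    # With a comma as decimal separator, "3,14" would break a comma-separated file,
    # so the values are separated by semicolons instead.
    rows = [["product", "price"], ["widget", "3,14"], ["gadget", "12,50"]]
    with open("export.csv", "w", newline="") as f:
        csv.writer(f, delimiter=";").writerows(rows)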

Automatic Exploration

New URL Edition Tools
In addition to the URL Editor and the String Generation Panel (Right-Click>Insert>Insert Rows...), a Search Query Builder is now available in the Tools Menu and as a Toolbar button. It allows you to generate complex queries for the most commonly used search engines.

Enhancements and Fixes in POST Query Generation Syntax
#HEADER# allows you to add custom parameters to the header of the query (see the POST query format). #CHARSET# defines the encoding, #TYPE# the contentType and #REFERER# the referrer.
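
These directives describe pieces of the underlying HTTP request rather than anything OutWit-specific. A rough Python equivalent of the resulting request (placeholder URL and values, not OutWit's query syntax):

    import urllib.request

    # #TYPE# and #CHARSET# map to the Content-Type header, #REFERER# to Referer,
    # and #HEADER# to any additional custom header.
    body = "query=outwit&page=1".encode("utf-8")
    req = urllib.request.Request(
        "https://example.com/search",
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded; charset=utf-8",
            "Referer": "https://example.com/",
            "X-Custom-Parameter": "value",
        },
    )
    with urllib.request.urlopen(req) as response:
        print(response.status)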

User Interface

Application Info Box
Added an info box at the bottom left corner of the application window.

Display Mode Buttons
A line of display mode buttons, located at the bottom of the window to the left of the status bar, allows you to easily toggle on and off the display of images or videos, the highlighting of nodes or series of links, and the activation of plugins and JavaScript.

Error console
The error console is automatically closed if it seems to slow down the process of a fast scrape.

Automator Property Dialog
Multiple selection is now possible when opening the automator property dialog, allowing you to edit a field for multiple items at once.

And more...

v8 includes many more enhancements and fixes that are not listed here, improving overall security, reliability and stability of the program.




New in OutWit Hub v7

Here are the main additions and changes between v6.x and v7.x of OutWit Hub.
(The minimum required license type is indicated in parentheses.)
This page is just an overview; please look for usage details in the corresponding Help sections. [updated: 1/30/18]

Scrapers

Over 60 New Directives
A very long list of scraper directives and functions was added in version 7.0.
We cannot list all of them, but here is a selection (check the Scraper Editor Help for details and editing):

New advanced replacement functions (Expert & Enterprise)
The #match()# function, used in the replacement column, allows you to search for other occurrences of a string (or matches of a RegExp) that you grab (or build) from the page itself. It enables very powerful conditional extractions.

Limiting multiple & duplicate results (Expert & Enterprise)
The syntax myFieldName<n, myFieldName>n, myFieldName= in the description column allows you to manage multiple results, duplicates, etc.

Data Extraction & Enhancement

News / RSS feed Extraction (Pro & Above)
Improved recognition and extraction of RSS feeds, publication dates in more locales, addition of a universal identifier (GUID)...

Contact Recognition (Pro & Above)
The contact recognition module was further enhanced; lax recognition and the elimination of dummy email addresses were improved.

Name Recognition (Expert & Enterprise)
A large dictionary of multilingual words, acronyms and roots frequently used in company names, addresses, etc. was added to enhance recognition.

Exporting Data

Extract millions of rows (Enterprise)
Used in the Enterprise edition, the #exportAndDeleteEvery#n# scraping directive can define an SQLite database as the destination (using a filename with the .sqlite extension), allowing you to process and store extremely large volumes of data.
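
The point of exporting (and deleting) every n rows is to keep the in-memory datasheet small while the data accumulates on disk. A minimal sketch of that idea in plain Python with SQLite (hypothetical table and column names, not OutWit's implementation):

    import sqlite3

    N = 1000  # flush to disk every N rows
    con = sqlite3.connect("extraction.sqlite")
    con.execute("CREATE TABLE IF NOT EXISTS extraction (url TEXT, title TEXT)")
    buffer = []

    def add_row(row):
        buffer.append(row)
        if len(buffer) >= N:
            con.executemany("INSERT INTO extraction VALUES (?, ?)", buffer)
            con.commit()
            buffer.clear()  # rows are dropped from memory once exported

    for i in range(5000):  # stand-in for scraped rows
        add_row(("https://example.com/%d" % i, "Page %d" % i))

    if buffer:  # export whatever remains at the end of the run
        con.executemany("INSERT INTO extraction VALUES (?, ?)", buffer)
        con.commit()
    con.close()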

Saving Preferences (Expert & Enterprise)
You can now save the state of preferences to (and restore it from) a directory of the queries view.

New OutWit Fetcher

If you need copies of OutWit Hub Expert for a fraction of the price, just to run extractions, consider OutWit Fetcher.
OutWit Hub still comes in three different editions (license levels): Pro, Expert and Enterprise, but we now offer a streamlined version of the Hub that can perform Web explorations and run scrapers but has no editing capabilities. Don't hesitate to enquire about this on the customer support system.

And much more

v7 brings many additional enhancements and fixes that are not listed here, improving overall security, reliability and stability of the program.




New in OutWit Hub v6.x

Here are the main additions and changes between v5.x and v6.0 of OutWit Hub.
(The minimum required license type is indicated in parentheses.)
This page is just an overview; please look for usage details in the corresponding Help sections. [updated: 1/30/18]

Automators

Projects (Pro & Above)
Pro users can now organize their automators (scrapers, macros, jobs, queries), grouping them by projects and saving them as coherent collections.

New Directives (Pro & Above)
A large series of scraper directives and functions was added to the pro version:
Use #autoEmpty#, #autoCatch#, #emptyOnDemand#, #deduplicate# to set the value of the scraped view options from within a scraper.
The #default# directive allows you to set a default value to all fields, #default#fieldName# sets a default value for the passed field name.
Use #pauseBefore# to instruct the program to wait for the passed number of seconds before extracting the data...
#checkIfURL# and #checkIfNotURL# directives allow you to include URL-based conditions in a scraper.
#SECOND# to #FIFTH# were added to the replacement functions in scrapers, allowing you to extract the corresponding occurrence of a matching string.
The #LOCALIP# replacement function allows you to access the current local IP from scrapers (can be useful when rotating proxy IPs).

New Directives (Expert & Enterprise)
The #storeVariables# directive makes variables set in a scrape available to subsequent scrapes.
#scope# defines the scope of the extraction in fast scraping/digging mode with the Expert edition (outside or within the domain, all links or with a depth of 1 or 2).
#coalesceOnStop# instructs the program to merge extracted data rows, grouping them by the passed field.
#deduplicateOnStop#criterionColumnName# does a smart deduplication of the extracted data (row by row) in the datasheet, once the current automatic exploration is complete. (This prevents the deduplication from slowing down the whole process.)
#deduplicateWithinPage# does a smart deduplication of the extracted data (row by row) for each scraped page, before sending the results to the datasheet. (This prevents the deduplication of potentially tens of thousands of rows from slowing down the whole process.)
#scrollToEnd#cssSelector# was added, in order to address a specific HTML element and scroll down within this element. Very useful for recent AJAX interfaces.
The #encodeURL()# replacement function was added to encode special characters as they should be in a URL (the space character becoming %20, etc.). This is required in some cases when adding URLs to the queue of pages to explore (see the sketch below).
#base64()# allows you to convert a small image into a self-contained data element using the data: URI scheme.
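
Both #encodeURL()# and #base64()# correspond to standard transformations. A minimal sketch of each in plain Python (icon.png is a hypothetical local image file):

    import base64
    import urllib.parse

    # What #encodeURL()# amounts to: percent-encoding special characters in a URL component.
    print(urllib.parse.quote("two words & more"))  # -> two%20words%20%26%20more

    # What #base64()# amounts to: embedding a small image as a self-contained data: URI.
    with open("icon.png", "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    print("data:image/png;base64," + encoded[:40] + "...")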

Data Extraction & Enhancement

Email Address Recognition (Pro & Above)
The email recognition module was enhanced. It now allows for diacritic characters, more dummy email addresses (user@example.com...) are eliminated, and lax recognition (jackie at mysite dot com...) is much more efficient.
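
As an illustration of what lax recognition has to cope with (a toy regular expression, not OutWit's recognition module), an obfuscated address of the "name at domain dot com" kind can be normalized like this:

    import re

    text = "Contact jackie at mysite dot com for details."
    pattern = re.compile(r"\b([\w.-]+)\s+at\s+([\w-]+)\s+dot\s+(\w+)\b", re.IGNORECASE)
    print(pattern.sub(lambda m: f"{m.group(1)}@{m.group(2)}.{m.group(3)}", text))
    # -> Contact jackie@mysite.com for details.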

Names and Genders (Expert & Enterprise)
A preference instructs the program to create an additional Gender column when using the Insert First/Last Name function in the right-click menu. This column will contain the string defined in the preference (e.g. "Dear Mr", "Ms", "Herr", "Chère Madame"...) when the gender is recognized, and a fallback value ("Dear Customer"...) otherwise.
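
A minimal sketch of the fallback logic in plain Python (the tiny dictionaries are made up for the example and are not OutWit's name/gender data):

    salutations = {"male": "Dear Mr", "female": "Ms"}
    first_names = {"jack": "male", "jackie": "female"}

    def salutation(first_name):
        gender = first_names.get(first_name.lower())
        return salutations.get(gender, "Dear Customer")  # fallback when the gender is not recognized

    print(salutation("Jackie"))  # -> Ms
    print(salutation("Robin"))   # -> Dear Customer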

Name/Gender replacement functions (Expert & Enterprise)
New directives allow you to get the First Name, Last Name, First & Last Names, and Gender from an extracted string directly in the scraper.

Word Count (Expert & Enterprise)
The words view now includes a text box where you can type or paste the words to count in the page. You can also paste a whole text to count common words between the Web page and this reference text.
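
Counting the words a page has in common with a reference text amounts to intersecting two word lists. A minimal sketch in plain Python (simplistic tokenization and made-up texts, not the words view's implementation):

    from collections import Counter

    page_text = "outwit hub extracts data from web pages"
    reference_text = "paste a reference text to count common words in web pages"

    page_words = Counter(page_text.lower().split())
    reference_words = set(reference_text.lower().split())
    common = {word: count for word, count in page_words.items() if word in reference_words}
    print(common)  # -> {'web': 1, 'pages': 1}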

Exporting Data

General enhancement and optimization of the Export module (All)
The export module was refactored and optimized in v6.0, fixing bugs, enhancing data cleaning and performance, and adding features like additional preference settings for SQL exports (list of fields in INSERT statements, VARCHAR(xxx)...).

Appending extractions to previously exported data (Expert & Enterprise)
Macros can append extracted data to an existing txt, csv or SQL file.

FTP upload (Expert & Enterprise)
Extracted data can be uploaded to an FTP server, adding an index to the file name if the same name already exists, or overwriting it. When FTP upload is selected as a destination in a macro, the FTP server info set in the advanced preferences is proposed by default.
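
A minimal sketch of the "add an index if the name already exists" behavior in plain Python with ftplib (placeholder server, credentials and file name; not OutWit's upload code):

    from ftplib import FTP

    ftp = FTP("ftp.example.com")
    ftp.login("user", "password")

    name, ext = "extraction", ".csv"
    target = name + ext
    existing = set(ftp.nlst())
    index = 1
    while target in existing:  # keep incrementing the index until the name is free
        target = "%s_%d%s" % (name, index, ext)
        index += 1

    with open("extraction.csv", "rb") as f:
        ftp.storbinary("STOR " + target, f)
    ftp.quit()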

SQLite Export (Enterprise)
OutWit Hub Datasheets and Catch can now be exported directly to an SQLite database. (SQLite is by far the most widely deployed SQL database engine in the world, installed on billions of computers, smart phones, tablets, TVs etc. It is powerful, easy to set up and use and it can of course be exported to any other database format.)

Automatic Exploration

Shuffle Rows (Pro & Above)
Added a Shuffle function to the right-click datasheet menu which allows you (in particular in the 'queries' view) to randomly reorder rows, to avoid sending queries to a server in numerical or alphabetical order.

Check HTTP headers before querying a page (Pro & Above)
Added a preference to instruct the program to check the page's HTTP headers before loading it, in order to avoid errors and login dialogs that could block an automatic exploration.

Stacks in Database (Enterprise)
Added "Stacks in Database" preference which instructs the application to store the current exploration stacks (urls to visit, already visited urls...) into a database instead of in the RAM. This configuration is very interesting for large volume explorations and extractions that span over several days. It can multiply the maximum number of processed web pages you can process by a factor of 5 to 10 and fast scrapes of hundreds of thousands to several millions of pages become possible without running into memory limitations.
A series of commands was added to directly alter the exploration stacks: backToQueue, addToVisited, removeFromVisited, addToURLsToVisit, removeFromURLsToVisit, emptyStack. They can be accessed either from the datasheet right-click menu or by typing them in the address bar with the prefix outwit:
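
A minimal sketch of the idea behind "Stacks in Database" in plain Python with SQLite (hypothetical file and table names, not OutWit's implementation): the queue of URLs to visit and the set of visited URLs live on disk rather than in RAM, so their size is no longer limited by memory.

    import sqlite3

    con = sqlite3.connect("stacks.sqlite")
    con.execute("CREATE TABLE IF NOT EXISTS to_visit (url TEXT PRIMARY KEY)")
    con.execute("CREATE TABLE IF NOT EXISTS visited (url TEXT PRIMARY KEY)")

    def add_to_urls_to_visit(url):
        con.execute("INSERT OR IGNORE INTO to_visit VALUES (?)", (url,))
        con.commit()

    def next_url():
        row = con.execute("SELECT url FROM to_visit LIMIT 1").fetchone()
        if row is None:
            return None
        con.execute("DELETE FROM to_visit WHERE url = ?", (row[0],))
        con.execute("INSERT OR IGNORE INTO visited VALUES (?)", (row[0],))
        con.commit()
        return row[0]

    add_to_urls_to_visit("https://example.com/page1")
    print(next_url())  # -> https://example.com/page1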