What is New in OutWit Hub v9

Here are the main additions and changes between v8.x and v9.x of OutWit Hub.
(The minimum required license type is indicated in parentheses.)
This page is just an overview; please look for usage details in the corresponding Help sections. [updated: 6/11/20]

Scrapers

New Directives and Advanced Replacement Functions
Here are some scraper directives and functions that were added in version 9.0
(check the Scraper Editor Help for details and editing):

Data Extraction & Enhancement

Many substantial enhancements in contact recognition and filtering
Implementation of job title recognition (mostly English for now)
Better elimination of example/bogus email addresses and phone numbers
Enhancements in name, company, address and copyright fields recognition
Enhancements throughout the program in first/last names split and in physical address split
Better handling of obfuscated email addresses

Recognition of other elements
Enhancement of date recognition, including dates without a year.

Data Refining: Editors & Datasheet Functions

More data editing & refining tools were added to the right-click menu on datasheets:
Delete columns > to the right: deletes the selected column and all additional columns to its right.
Clean up > Normalize All Figures: enhanced and optimized.
Shuffle Rows was enhanced to function in more contexts.
Compute: allows you to perform basic operations on the selected numerical cells of a column.
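
The Compute tool operates directly on the datasheet. As a rough outside-the-program illustration of the kind of operations involved (plain Python, with made-up cell values, not OutWit's implementation):

    # Hypothetical selection of numerical cells from one datasheet column.
    cells = ["12.5", "8", "n/a", "19.25", ""]
    values = []
    for cell in cells:
        try:
            values.append(float(cell))
        except ValueError:
            pass  # ignore cells that are not numerical
    if values:
        print("sum:", sum(values))
        print("average:", sum(values) / len(values))
        print("min/max:", min(values), max(values))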

Exporting Data

It is now possible to export each scraped row to a separate file.
Fixes and performance enhancements in export functions.

User Interface

A camera shutter sound was added when executing the #screenshot# directive.
The application is resized at launch if it exceeds the dimensions of the screen.
Fixed source colorizing problems in the case of line breaks inside HTML tags.

Other Functions

Multiple fixes and enhancements in String Generation functions.
Updated the list of User Agents.
Added a default referer preference, complementing the #downloadReferer# & #nextPageReferer# directives in scrapers (see the sketch below).
Added several actions in job definitions (alert & suspend, play sound, empty datasheet...).
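
For readers who are not familiar with referers: the preference and the two directives above control which Referer header is sent with a request. As a rough illustration in plain Python (placeholder URLs, not OutWit code), the equivalent raw HTTP request would be built like this:

    import urllib.request

    # Placeholder URLs; the Referer header tells the server which page "linked" to this request.
    req = urllib.request.Request(
        "https://example.com/file-to-download",
        headers={"Referer": "https://example.com/listing-page"},
    )
    with urllib.request.urlopen(req) as response:
        data = response.read()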

And more...

v9 includes many more enhancements and fixes that are not listed here, improving overall security, reliability and stability of the program.




New in OutWit Hub v8

Here are the main additions and changes between v7.x and v8.x of OutWit Hub.
(The minimum required license type is indicated in parentheses.)
This page is just an overview; please look for usage details in the corresponding Help sections. [updated: 6/11/19]

Scrapers

New Directives and Advanced Replacement Functions
Here are some scraper directives and functions that were added in version 8.0
(check the Scraper Editor Help for details and editing):

Data Extraction & Enhancement

Faster volume scraping
Faster start preparation and end-of-process cleanup in large-volume Fast Scrapes.

Periodic Scraper Executions (Expert & Enterprise)
When scraping a self-updating AJAX page, the #reapply# directive now allows you to perform the extraction n times, at the frequency you choose.

Contact Recognition (Pro & Above)
The contact recognition module and its dictionary were enhanced; lax recognition and the elimination of dummy email addresses were improved.

Recognition (Expert & Enterprise)
Improved the dictionary of multilingual words, acronyms and roots frequently used in company names, addresses, etc., to enhance recognition.

Data Refining: Editors & Datasheet Functions

Several tools were added to the right-click menu on datasheets:
Insert Index Column: inserts a column with an incremented index.
Duplicate Column: copies values from one column to another.
Indexed Duplicate Column: inserts a new column with the values from another, where duplicate cells are suffixed with an index.
Copy from Column...: copies values from one column to another.
Select if in...: selects rows in a datasheet if they are also found in another datasheet. Particularly useful for identifying the rows of an extraction that have already been extracted and sent to the Catch (see the sketch after this list).
Duplicate cells are now underlined in the scraper editor to be clearly identifiable.
Copy & paste allows you to transfer content between scrapers or jobs.
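
The "Select if in..." function described above is essentially a membership test between two datasheets. A minimal sketch of the idea in plain Python (hypothetical field name, not OutWit's implementation):

    # One list of freshly extracted rows, one list of rows already sent to the Catch.
    extracted = [{"email": "a@x.com"}, {"email": "b@y.com"}, {"email": "c@z.com"}]
    already_in_catch = [{"email": "b@y.com"}]

    catch_keys = {row["email"] for row in already_in_catch}
    selected = [row for row in extracted if row["email"] in catch_keys]
    print(selected)  # rows already sent to the Catch -> [{'email': 'b@y.com'}]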

SQLite Data Import (Enterprise)
It is now possible to open an SQLite database file in OutWit Hub. This can be very useful for resuming previous extractions, merging several files or simply refining the data after extraction.
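
Outside OutWit Hub, reading back such a file is straightforward with any SQLite client; a minimal sketch in plain Python (the file and table names are hypothetical):

    import sqlite3

    # Hypothetical file and table names; an actual export may use different ones.
    con = sqlite3.connect("previous_extraction.sqlite")
    for row in con.execute("SELECT * FROM extraction LIMIT 5"):
        print(row)
    con.close()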

Exporting Data

Added the semicolon-separated CSV export format for countries that use commas as decimal separators (see the sketch below).
Multiple fixes and enhancements in export functions.
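
The reason for the semicolon variant: in locales where the decimal separator is a comma, numbers such as 3,14 would collide with the comma that separates the values. A minimal sketch in plain Python (made-up data):

    import csv

    # With a comma as decimal separator, "3,14" would break a comma-separated file,
    # so the values are separated by semicolons instead.
    rows = [["product", "price"], ["widget", "3,14"], ["gadget", "12,50"]]
    with open("export.csv", "w", newline="") as f:
        csv.writer(f, delimiter=";").writerows(rows)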

Automatic Exploration

New URL Edition Tools
In addition to the URL Editor and the String Generation Panel (Right-Click>Insert>Insert Rows...), a Search Query Builder is now available in the Tools Menu and as a Toolbar button. It allows you to generate complex queries for the most commonly used search engines.

Enhancements and Fixes in POST Query Generation Syntax
#HEADER# allows you to add custom parameters to the header of the query (see the POST query format). #CHARSET# defines the encoding, #TYPE# the contentType and #REFERER# the referrer.
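
These directives describe pieces of the underlying HTTP request rather than anything OutWit-specific. A rough Python equivalent of the resulting request (placeholder URL and values, not OutWit's query syntax):

    import urllib.request

    # #TYPE# and #CHARSET# map to the Content-Type header, #REFERER# to Referer,
    # and #HEADER# to any additional custom header.
    body = "query=outwit&page=1".encode("utf-8")
    req = urllib.request.Request(
        "https://example.com/search",
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded; charset=utf-8",
            "Referer": "https://example.com/",
            "X-Custom-Parameter": "value",
        },
    )
    with urllib.request.urlopen(req) as response:
        print(response.status)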

User Interface

Application Info Box
Added an info box at the bottom left corner of the application window.

Display Mode Buttons
A line of display mode buttons, located at the bottom of the window to the left of the status bar, allows you to easily toggle on and off the display of images or videos, the highlighting of nodes or series of links, and the activation of plugins and JavaScript.

Error console
The error console is automatically closed if it seems to slow down the process of a fast scrape.

Automator Property Dialog
Multiple selection is now possible when opening the automator property dialog, allowing you to edit a field for multiple items at once.

And more...

v8 includes many more enhancements and fixes that are not listed here, improving overall security, reliability and stability of the program.




New in OutWit Hub v7

Here are the main additions and changes between v6.x and v7.x of OutWit Hub.
(The minimum required license type is indicated in parentheses.)
This page is just an overview; please look for usage details in the corresponding Help sections. [updated: 1/30/18]

Scrapers

Over 60 New Directives
A very long list of scraper directives and functions was added in version 7.0.
We cannot list all of them, but here is a selection (check the Scraper Editor Help for details and editing):

New advanced replacement functions (Expert & Enterprise)
The #match()# function, used in the replacement column, allows you to search for other occurrences of a string (or matches of a RegExp) that you grab (or build) from the page itself. It enables very powerful conditional extractions.

Limiting multiple & duplicate results (Expert & Enterprise)
The syntax myFieldName<n, myFieldName>n, myFieldName= in the description column allows you to manage multiple results, duplicates, etc.

Data Extraction & Enhancement

News / RSS feed Extraction (Pro & Above)
Improved recognition and extraction of RSS feeds, publication dates in more locales, addition of a universal identifier (GUID)...

Contact Recognition (Pro & Above)
The contact recognition module was further enhanced; lax recognition and the elimination of dummy email addresses were improved.

Name Recognition (Expert & Enterprise)
A large dictionary of multilingual words, acronyms and roots frequently used in company names, addresses, etc. was added to enhance recognition.

Exporting Data

Extract millions of rows (Enterprise)
Used in the Enterprise edition, the #exportAndDeleteEvery#n# scraping directive can define an SQLite database as the destination (using a filename with the .sqlite extension), allowing you to process and store extremely large volumes of data.
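
The point of exporting (and deleting) every n rows is to keep the in-memory datasheet small while the data accumulates on disk. A minimal sketch of that idea in plain Python with SQLite (hypothetical table and column names, not OutWit's implementation):

    import sqlite3

    N = 1000  # flush to disk every N rows
    con = sqlite3.connect("extraction.sqlite")
    con.execute("CREATE TABLE IF NOT EXISTS extraction (url TEXT, title TEXT)")
    buffer = []

    def add_row(row):
        buffer.append(row)
        if len(buffer) >= N:
            con.executemany("INSERT INTO extraction VALUES (?, ?)", buffer)
            con.commit()
            buffer.clear()  # rows are dropped from memory once exported

    for i in range(5000):  # stand-in for scraped rows
        add_row(("https://example.com/%d" % i, "Page %d" % i))

    if buffer:  # export whatever remains at the end of the run
        con.executemany("INSERT INTO extraction VALUES (?, ?)", buffer)
        con.commit()
    con.close()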

Saving Preferences (Expert & Enterprise)
You can now save the state of preferences to (and restore it from) a directory of the queries view.

New OutWit Fetcher

If you need copies of OutWit Hub Expert for a fraction of the price, just to run extractions, consider OutWit Fetcher.
OutWit Hub still comes in three different editions (license levels): Pro, Expert and Enterprise, but we now offer a streamlined version of the Hub that can perform Web explorations and run scrapers but has no editing capabilities. Don't hesitate to enquire about this on the customer support system.

And much more

v7 brings many additional enhancements and fixes that are not listed here, improving overall security, reliability and stability of the program.




New in OutWit Hub v6.x

Here are the main additions and changes between v5.x and v6.0 of OutWit Hub.
(The minimum required license type is indicated in parentheses.)
This page is just an overview; please look for usage details in the corresponding Help sections. [updated: 1/30/18]

Automators

Projects (Pro & Above)
Pro users can now organize their automators (scrapers, macros, jobs, queries), grouping them by projects and saving them as coherent collections.

New Directives (Pro & Above)
A large series of scraper directives and functions was added to the pro version:
Use #autoEmpty#, #autoCatch#, #emptyOnDemand#, #deduplicate# to set the value of the scraped view options from within a scraper.
The #default# directive allows you to set a default value to all fields, #default#fieldName# sets a default value for the passed field name.
Use #pauseBefore# to instruct the program to wait for the passed number of seconds before extracting the data...
#checkIfURL# and #checkIfNotURL# directives allow you to include URL-based conditions in a scraper.
#SECOND# to #FIFTH# were added to the replacement functions in scrapers, allowing you to extract the corresponding occurrence of a matching string.
The #LOCALIP# replacement function allows you to access the current local IP from scrapers (can be useful when rotating proxy IPs).

New Directives (Expert & Enterprise)
The #storeVariables# directive makes variables set in a scrape available to subsequent scrapes.
#scope# defines the scope of the extraction in fast scraping/digging mode with the Expert edition (outside or within the domain, all links or with a depth of 1 or 2).
#coalesceOnStop# instructs the program to merge extracted data rows, grouping them by the passed field.
#deduplicateOnStop#criterionColumnName# does a smart deduplication of the extracted data (row by row) in the datasheet, once the current automatic exploration is complete. (This prevents the deduplication from slowing down the whole process.)
#deduplicateWithinPage# does a smart deduplication of the extracted data (row by row) for each scraped page, before sending the results to the datasheet. (This prevents the deduplication of potentially tens of thousands of rows from slowing down the whole process.)
#scrollToEnd#cssSelector# was added, in order to address a specific HTML element and scroll down within this element. Very useful for recent AJAX interfaces.
The #encodeURL()# replacement function was added to encode special characters as they should be in a URL (the space character becoming %20, etc.). This is required in some cases when adding URLs to the queue of pages to explore (see the sketch below).
#base64()# allows you to convert a small image into a self-contained data element using the data: URI scheme.
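
Both #encodeURL()# and #base64()# correspond to standard transformations. A minimal sketch of each in plain Python (icon.png is a hypothetical local image file):

    import base64
    import urllib.parse

    # What #encodeURL()# amounts to: percent-encoding special characters in a URL component.
    print(urllib.parse.quote("two words & more"))  # -> two%20words%20%26%20more

    # What #base64()# amounts to: embedding a small image as a self-contained data: URI.
    with open("icon.png", "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    print("data:image/png;base64," + encoded[:40] + "...")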

Data Extraction & Enhancement

Email Address Recognition (Pro & Above)
The email recognition module was enhanced. It now allows for diacritic characters, more dummy email addresses (user@example.com...) are eliminated, and lax recognition (jackie at mysite dot com...) is much more efficient.
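
As an illustration of what lax recognition has to cope with (a toy regular expression, not OutWit's recognition module), an obfuscated address of the "name at domain dot com" kind can be normalized like this:

    import re

    text = "Contact jackie at mysite dot com for details."
    pattern = re.compile(r"\b([\w.-]+)\s+at\s+([\w-]+)\s+dot\s+(\w+)\b", re.IGNORECASE)
    print(pattern.sub(lambda m: f"{m.group(1)}@{m.group(2)}.{m.group(3)}", text))
    # -> Contact jackie@mysite.com for details.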

Names and Genders (Expert & Enterprise)
A preference instructs the program to create an additional Gender column when using the Insert First/Last Name function in the right-click menu. This column will contain the string defined in the preference (e.g. "Dear Mr", "Ms", "Herr", "Chère Madame"...) when the gender is recognized, and a fallback value ("Dear Customer"...) otherwise.
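
A minimal sketch of the fallback logic in plain Python (the tiny dictionaries are made up for the example and are not OutWit's name/gender data):

    salutations = {"male": "Dear Mr", "female": "Ms"}
    first_names = {"jack": "male", "jackie": "female"}

    def salutation(first_name):
        gender = first_names.get(first_name.lower())
        return salutations.get(gender, "Dear Customer")  # fallback when the gender is not recognized

    print(salutation("Jackie"))  # -> Ms
    print(salutation("Robin"))   # -> Dear Customer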

Name/Gender replacement functions (Expert & Enterprise)
New directives allow you to get the First Name, Last Name, First & Last Names, and Gender from an extracted string directly in the scraper.

Word Count (Expert & Enterprise)
The words view now includes a text box where you can type or paste the words to count in the page. You can also paste a whole text to count common words between the Web page and this reference text.
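
Counting the words a page has in common with a reference text amounts to intersecting two word lists. A minimal sketch in plain Python (simplistic tokenization and made-up texts, not the words view's implementation):

    from collections import Counter

    page_text = "outwit hub extracts data from web pages"
    reference_text = "paste a reference text to count common words in web pages"

    page_words = Counter(page_text.lower().split())
    reference_words = set(reference_text.lower().split())
    common = {word: count for word, count in page_words.items() if word in reference_words}
    print(common)  # -> {'web': 1, 'pages': 1}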

Exporting Data

General enhancement and optimization of the Export module (All)
The export module was refactored and optimized in v6.0, fixing bugs, enhancing data cleaning and performance, and adding features like additional preference settings for SQL exports (list of fields in INSERT statements, VARCHAR(xxx)...).

Appending extractions to previously exported data (Expert & Enterprise)
Macros can append extracted data to an existing txt, csv or SQL file.

FTP upload (Expert & Enterprise)
Extracted data can be uploaded to an FTP server, adding an index to the file name if the same name already exists, or overwriting it. When FTP upload is selected as a destination in a macro, the FTP server info set in the advanced preferences is proposed by default.
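
A minimal sketch of the "add an index if the name already exists" behavior in plain Python with ftplib (placeholder server, credentials and file name; not OutWit's upload code):

    from ftplib import FTP

    ftp = FTP("ftp.example.com")
    ftp.login("user", "password")

    name, ext = "extraction", ".csv"
    target = name + ext
    existing = set(ftp.nlst())
    index = 1
    while target in existing:  # keep incrementing the index until the name is free
        target = "%s_%d%s" % (name, index, ext)
        index += 1

    with open("extraction.csv", "rb") as f:
        ftp.storbinary("STOR " + target, f)
    ftp.quit()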

SQLite Export (Enterprise)
OutWit Hub Datasheets and Catch can now be exported directly to an SQLite database. (SQLite is by far the most widely deployed SQL database engine in the world, installed on billions of computers, smart phones, tablets, TVs etc. It is powerful, easy to set up and use and it can of course be exported to any other database format.)

Automatic Exploration

Shuffle Rows (Pro & Above)
Added a Shuffle function to the right-click datasheet menu which allows you (in particular in the 'queries' view) to randomly reorder rows, to avoid sending queries to a server in numerical or alphabetical order.

Check HTTP headers before querying a page (Pro & Above)
Added a preference to instruct the program to check the page's HTTP headers before loading it, in order to avoid errors and login dialogs that could block an automatic exploration.

Stacks in Database (Enterprise)
Added "Stacks in Database" preference which instructs the application to store the current exploration stacks (urls to visit, already visited urls...) into a database instead of in the RAM. This configuration is very interesting for large volume explorations and extractions that span over several days. It can multiply the maximum number of processed web pages you can process by a factor of 5 to 10 and fast scrapes of hundreds of thousands to several millions of pages become possible without running into memory limitations.
A series of commands was added to directly alter the exploration stacks: backToQueue, addToVisited, removeFromVisited, addToURLsToVisit, removeFromURLsToVisit, emptyStack. They can be accessed either from the datasheet right-click menu or by typing them in the address bar with the prefix outwit:
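
A minimal sketch of the idea behind "Stacks in Database" in plain Python with SQLite (hypothetical file and table names, not OutWit's implementation): the queue of URLs to visit and the set of visited URLs live on disk rather than in RAM, so their size is no longer limited by memory.

    import sqlite3

    con = sqlite3.connect("stacks.sqlite")
    con.execute("CREATE TABLE IF NOT EXISTS to_visit (url TEXT PRIMARY KEY)")
    con.execute("CREATE TABLE IF NOT EXISTS visited (url TEXT PRIMARY KEY)")

    def add_to_urls_to_visit(url):
        con.execute("INSERT OR IGNORE INTO to_visit VALUES (?)", (url,))
        con.commit()

    def next_url():
        row = con.execute("SELECT url FROM to_visit LIMIT 1").fetchone()
        if row is None:
            return None
        con.execute("DELETE FROM to_visit WHERE url = ?", (row[0],))
        con.execute("INSERT OR IGNORE INTO visited VALUES (?)", (row[0],))
        con.commit()
        return row[0]

    add_to_urls_to_visit("https://example.com/page1")
    print(next_url())  # -> https://example.com/page1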