Web-Harvest – Web Data Extraction Tool

Keep on Guard!


Web-Harvest is Open Source Web Data Extraction tool written in Java. It offers a way to collect desired Web pages and extract useful data from them. In order to do that, it leverages well established techniques and technologies for text/xml manipulation such as XSLT, XQuery and Regular Expressions. Web-Harvest mainly focuses on HTML/XML based web sites which still make vast majority of the Web content. On the other hand, it could be easily supplemented by custom Java libraries in order to augment its extraction capabilities.

Process of extracting data from Web pages is also referred as Web Scraping or Web Data Mining. World Wide Web, as the largest database, often contains various data that we would like to consume for our needs. The problem is that this data is in most cases mixed together with formatting code – that way making human-friendly, but not machine-friendly content. Doing manual copy-paste is error prone, tedious and sometimes even impossible. Web software designers usually discuss how to make clean separation between content and style, using various frameworks and design patterns in order to achieve that. Anyway, some kind of merge occurs usually at the server side, so that the bunch of HTML is delivered to the web client.


Every Web site and every Web page is composed using some logic. It is therefore needed to describe reverse process – how to fetch desired data from the mixed content. Every extraction procedure in Web-Harvest is user-defined through XML-based configuration files. Each configuration file describes sequence of processors executing some common task in order to accomplish the final goal. Processors execute in the form of pipeline. Thus, the output of one processor execution is input to another one. This can be best explained using the simple configuration fragment:

When Web-Harvest executes this part of configuration, the following steps occur:

  1. http processor downloads content from the specified URL.
  2. html-to-xml processor cleans up that HTML producing XHTML content.
  3. xpath processor searches specific links in XHTML from previous step giving URL sequence as a result.

Web-Harvest supports a set of useful processors for variable manipulation, conditional branching, looping, functions, file operations, HTML and XML processing, exception handling. See User manual for technical description of provided processors.

You can download Web-Harvest 1.0 here:

webharvest1-exe.zip

Or read more here.

Posted in: Hacking News, Hacking Tools, Web Hacking

, ,


Latest Posts:


OWASP ZSC - Obfuscated Code Generator Tool OWASP ZSC – Obfuscated Code Generator Tool
OWASP ZSC is an open source obfuscated code generator tool in Python which lets you generate customized shellcodes and convert scripts to an obfuscated script.
A Look Back At 2017 – Tools & News Highlights A Look Back At 2017 – Tools & News Highlights
So here we are in 2018, taking a look back at 2017, quite a year it was. Here is a quick rundown of some of the best hacking/security tools released in 2017, the biggest news stories and the 10 most viewed posts on Darknet as a bonus.
Spectre & Meltdown Checker - Vulnerability Mitigation Tool For Linux Spectre & Meltdown Checker – Vulnerability Mitigation Tool For Linux
Spectre & Meltdown Checker is a simple shell script to tell if your Linux installation is vulnerable against the 3 "speculative execution" CVEs that were made public early 2018.
Hijacker - Reaver For Android Wifi Hacker App Hijacker – Reaver For Android Wifi Hacker App
Hijacker is a native GUI which provides Reaver for Android along with Aircrack-ng, Airodump-ng and MDK3 making it a powerful Wifi hacker app.
Sublist3r - Fast Python Subdomain Enumeration Tool Sublist3r – Fast Python Subdomain Enumeration Tool
Sublist3r is a Python-based tool designed to enumerate subdomains of websites using OSINT. It helps penetration testers and bug hunters collect and gather subdomains for the domain they are targeting.
coWPAtty Download - Audit Pre-shared WPA Keys coWPAtty Download – Audit Pre-shared WPA Keys
coWPAtty is a C-based tool for running a brute-force dictionary attack against WPA-PSK and audit pre-shared WPA keys.


2 Responses to Web-Harvest – Web Data Extraction Tool

  1. navin October 17, 2008 at 1:36 pm #

    Web harvest is a pretty nice implementaion of extraction methods I’d read about before…….a definite weekend project

  2. patate October 18, 2008 at 6:08 pm #

    I’ve been using web harvest for the past year or so for random tasks and it does the job very well! Its XML syntax is not so heavy to learn. Their IDE tool is definitely a must for the application, otherwise its a pain to go back and forth their doc.

    I’ve run in to some memory issues where i had to augment the jvm’s memory, but otherwise its a very stable tool.

    I wish that project would get more attention!! Thanks for the post Darknet!