• Skip to main content
  • Skip to primary sidebar
  • Skip to footer
  • Home
  • About Darknet
  • Hacking Tools
  • Popular Posts
  • Darknet Archives
  • Contact Darknet
    • Advertise
    • Submit a Tool
Darknet – Hacking Tools, Hacker News & Cyber Security

Darknet - Hacking Tools, Hacker News & Cyber Security

Darknet is your best source for the latest hacking tools, hacker news, cyber security best practices, ethical hacking & pen-testing.

Web-Harvest – Web Data Extraction Tool

October 17, 2008

Views: 16,183

Web-Harvest is Open Source Web Data Extraction tool written in Java. It offers a way to collect desired Web pages and extract useful data from them. In order to do that, it leverages well established techniques and technologies for text/xml manipulation such as XSLT, XQuery and Regular Expressions. Web-Harvest mainly focuses on HTML/XML based web sites which still make vast majority of the Web content. On the other hand, it could be easily supplemented by custom Java libraries in order to augment its extraction capabilities.

Process of extracting data from Web pages is also referred as Web Scraping or Web Data Mining. World Wide Web, as the largest database, often contains various data that we would like to consume for our needs. The problem is that this data is in most cases mixed together with formatting code – that way making human-friendly, but not machine-friendly content. Doing manual copy-paste is error prone, tedious and sometimes even impossible. Web software designers usually discuss how to make clean separation between content and style, using various frameworks and design patterns in order to achieve that. Anyway, some kind of merge occurs usually at the server side, so that the bunch of HTML is delivered to the web client.

Every Web site and every Web page is composed using some logic. It is therefore needed to describe reverse process – how to fetch desired data from the mixed content. Every extraction procedure in Web-Harvest is user-defined through XML-based configuration files. Each configuration file describes sequence of processors executing some common task in order to accomplish the final goal. Processors execute in the form of pipeline. Thus, the output of one processor execution is input to another one. This can be best explained using the simple configuration fragment:

1
2
3
4
5
<xpath expression="//a[@shape='rect']/@href">
    <html-to-xml>
        <http url="http://www.somesite.com/"/>
    </html-to-xml>
</xpath>

When Web-Harvest executes this part of configuration, the following steps occur:

  1. http processor downloads content from the specified URL.
  2. html-to-xml processor cleans up that HTML producing XHTML content.
  3. xpath processor searches specific links in XHTML from previous step giving URL sequence as a result.

Web-Harvest supports a set of useful processors for variable manipulation, conditional branching, looping, functions, file operations, HTML and XML processing, exception handling. See User manual for technical description of provided processors.

You can download Web-Harvest 1.0 here:

webharvest1-exe.zip

Or read more here.

Related Posts:

  • An Introduction To Web Application Security Systems
  • Privacy Implications of Web 3.0 and Darknets
  • XXEinjector - Automatic XXE Injection Tool For Exploitation
  • nbtscan Download - NetBIOS Scanner For Windows & Linux
  • HTTrack - Website Downloader Copier & Site Ripper Download
  • Ettercap - A Suite For Man-In-The-Middle Attacks
Share
Tweet
Share
Buffer
WhatsApp
Email

Filed Under: Hacking News, Hacking Tools, Web Hacking Tagged With: information gathering, information gathering tool, java



Reader Interactions

Comments

  1. navin says

    October 17, 2008 at 1:36 pm

    Web harvest is a pretty nice implementaion of extraction methods I’d read about before…….a definite weekend project

  2. patate says

    October 18, 2008 at 6:08 pm

    I’ve been using web harvest for the past year or so for random tasks and it does the job very well! Its XML syntax is not so heavy to learn. Their IDE tool is definitely a must for the application, otherwise its a pain to go back and forth their doc.

    I’ve run in to some memory issues where i had to augment the jvm’s memory, but otherwise its a very stable tool.

    I wish that project would get more attention!! Thanks for the post Darknet!

Primary Sidebar

Search Darknet

  • Email
  • Facebook
  • LinkedIn
  • RSS
  • Twitter

Advertise on Darknet

Latest Posts

Reconnoitre - Open-Source Reconnaissance and Service Enumeration Tool

Reconnoitre – Open-Source Reconnaissance and Service Enumeration Tool

Views: 315

Reconnoitre is an open-source reconnaissance tool that automates multithreaded information gathering … ...More about Reconnoitre – Open-Source Reconnaissance and Service Enumeration Tool

Scanners-Box - Open-Source Reconnaissance and Scanning Toolkit

Scanners-Box – Open-Source Reconnaissance and Scanning Toolkit

Views: 489

Scanners-Box is an open-source, community-curated collection of scanners and reconnaissance … ...More about Scanners-Box – Open-Source Reconnaissance and Scanning Toolkit

Red Teaming LLMs 2025 - Offensive Security Meets Generative AI

Red Teaming LLMs 2025 – Offensive Security Meets Generative AI

Views: 525

As enterprises deploy large language models (LLMs) at scale, the offensive security discipline of … ...More about Red Teaming LLMs 2025 – Offensive Security Meets Generative AI

gitlab-runner-research - PoC for abusing self-hosted GitLab runners

gitlab-runner-research – PoC for abusing self-hosted GitLab runners

Views: 339

gitlab-runner-research is a proof-of-concept repository and write-up that demonstrates how attackers … ...More about gitlab-runner-research – PoC for abusing self-hosted GitLab runners

mcp-scanner - Python MCP Scanner for Prompt-Injection and Insecure Agents

mcp-scanner – Python MCP Scanner for Prompt-Injection and Insecure Agents

Views: 592

mcp-scanner is an open-source Python tool that scans Model Context Protocol (MCP) servers and agent … ...More about mcp-scanner – Python MCP Scanner for Prompt-Injection and Insecure Agents

Deepfake-as-a-Service 2025 - How Voice Cloning and Synthetic Media Fraud Are Changing Enterprise Defenses

Deepfake-as-a-Service 2025 – How Voice Cloning and Synthetic Media Fraud Are Changing Enterprise Defenses

Views: 673

Deepfake operations have matured into a commercial model that attackers package as … ...More about Deepfake-as-a-Service 2025 – How Voice Cloning and Synthetic Media Fraud Are Changing Enterprise Defenses

Topics

  • Advertorial (28)
  • Apple (46)
  • Cloud Security (8)
  • Countermeasures (231)
  • Cryptography (85)
  • Dark Web (4)
  • Database Hacking (89)
  • Events/Cons (7)
  • Exploits/Vulnerabilities (433)
  • Forensics (64)
  • GenAI (12)
  • Hacker Culture (10)
  • Hacking News (236)
  • Hacking Tools (708)
  • Hardware Hacking (82)
  • Legal Issues (179)
  • Linux Hacking (74)
  • Malware (241)
  • Networking Hacking Tools (352)
  • Password Cracking Tools (107)
  • Phishing (41)
  • Privacy (219)
  • Secure Coding (119)
  • Security Software (235)
  • Site News (51)
    • Authors (6)
  • Social Engineering (37)
  • Spammers & Scammers (76)
  • Stupid E-mails (6)
  • Telecomms Hacking (6)
  • UNIX Hacking (6)
  • Virology (6)
  • Web Hacking (384)
  • Windows Hacking (171)
  • Wireless Hacking (45)

Security Blogs

  • Dancho Danchev
  • F-Secure Weblog
  • Google Online Security
  • Graham Cluley
  • Internet Storm Center
  • Krebs on Security
  • Schneier on Security
  • TaoSecurity
  • Troy Hunt

Security Links

  • Exploits Database
  • Linux Security
  • Register – Security
  • SANS
  • Sec Lists
  • US CERT

Footer

Most Viewed Posts

  • Brutus Password Cracker Hacker – Download brutus-aet2.zip AET2 (2,395,057)
  • Darknet – Hacking Tools, Hacker News & Cyber Security (2,173,814)
  • Top 15 Security Utilities & Download Hacking Tools (2,097,292)
  • 10 Best Security Live CD Distros (Pen-Test, Forensics & Recovery) (1,200,142)
  • Password List Download Best Word List – Most Common Passwords (934,347)
  • wwwhack 1.9 – wwwhack19.zip Web Hacking Software Free Download (777,069)
  • Hack Tools/Exploits (673,985)
  • Wep0ff – Wireless WEP Key Cracker Tool (531,054)

Search

Recent Posts

  • Reconnoitre – Open-Source Reconnaissance and Service Enumeration Tool November 10, 2025
  • Scanners-Box – Open-Source Reconnaissance and Scanning Toolkit November 7, 2025
  • Red Teaming LLMs 2025 – Offensive Security Meets Generative AI November 5, 2025
  • gitlab-runner-research – PoC for abusing self-hosted GitLab runners November 3, 2025
  • mcp-scanner – Python MCP Scanner for Prompt-Injection and Insecure Agents October 31, 2025
  • Deepfake-as-a-Service 2025 – How Voice Cloning and Synthetic Media Fraud Are Changing Enterprise Defenses October 29, 2025

Tags

apple botnets computer-security darknet Database Hacking ddos dos exploits fuzzing google hacking-networks hacking-websites hacking-windows hacking tool Information-Security information gathering Legal Issues malware microsoft network-security Network Hacking Password Cracking pen-testing penetration-testing Phishing Privacy Python scammers Security Security Software spam spammers sql-injection trojan trojans virus viruses vulnerabilities web-application-security web-security windows windows-security Windows Hacking worms XSS

Copyright © 1999–2025 Darknet All Rights Reserved · Privacy Policy