• Skip to main content
  • Skip to primary sidebar
  • Skip to footer
  • Home
  • About Darknet
  • Hacking Tools
  • Popular Posts
  • Darknet Archives
  • Contact Darknet
    • Advertise
    • Submit a Tool
Darknet – Hacking Tools, Hacker News & Cyber Security

Darknet - Hacking Tools, Hacker News & Cyber Security

Darknet is your best source for the latest hacking tools, hacker news, cyber security best practices, ethical hacking & pen-testing.

Web-Harvest – Web Data Extraction Tool

October 17, 2008

Views: 16,153

Web-Harvest is Open Source Web Data Extraction tool written in Java. It offers a way to collect desired Web pages and extract useful data from them. In order to do that, it leverages well established techniques and technologies for text/xml manipulation such as XSLT, XQuery and Regular Expressions. Web-Harvest mainly focuses on HTML/XML based web sites which still make vast majority of the Web content. On the other hand, it could be easily supplemented by custom Java libraries in order to augment its extraction capabilities.

Process of extracting data from Web pages is also referred as Web Scraping or Web Data Mining. World Wide Web, as the largest database, often contains various data that we would like to consume for our needs. The problem is that this data is in most cases mixed together with formatting code – that way making human-friendly, but not machine-friendly content. Doing manual copy-paste is error prone, tedious and sometimes even impossible. Web software designers usually discuss how to make clean separation between content and style, using various frameworks and design patterns in order to achieve that. Anyway, some kind of merge occurs usually at the server side, so that the bunch of HTML is delivered to the web client.

Every Web site and every Web page is composed using some logic. It is therefore needed to describe reverse process – how to fetch desired data from the mixed content. Every extraction procedure in Web-Harvest is user-defined through XML-based configuration files. Each configuration file describes sequence of processors executing some common task in order to accomplish the final goal. Processors execute in the form of pipeline. Thus, the output of one processor execution is input to another one. This can be best explained using the simple configuration fragment:

1
2
3
4
5
<xpath expression="//a[@shape='rect']/@href">
    <html-to-xml>
        <http url="http://www.somesite.com/"/>
    </html-to-xml>
</xpath>

When Web-Harvest executes this part of configuration, the following steps occur:

  1. http processor downloads content from the specified URL.
  2. html-to-xml processor cleans up that HTML producing XHTML content.
  3. xpath processor searches specific links in XHTML from previous step giving URL sequence as a result.

Web-Harvest supports a set of useful processors for variable manipulation, conditional branching, looping, functions, file operations, HTML and XML processing, exception handling. See User manual for technical description of provided processors.

You can download Web-Harvest 1.0 here:

webharvest1-exe.zip

Or read more here.

Share
Tweet
Share
Buffer
WhatsApp
Email
0 Shares

Filed Under: Hacking News, Hacking Tools, Web Hacking Tagged With: information gathering, information gathering tool, java



Reader Interactions

Comments

  1. navin says

    October 17, 2008 at 1:36 pm

    Web harvest is a pretty nice implementaion of extraction methods I’d read about before…….a definite weekend project

  2. patate says

    October 18, 2008 at 6:08 pm

    I’ve been using web harvest for the past year or so for random tasks and it does the job very well! Its XML syntax is not so heavy to learn. Their IDE tool is definitely a must for the application, otherwise its a pain to go back and forth their doc.

    I’ve run in to some memory issues where i had to augment the jvm’s memory, but otherwise its a very stable tool.

    I wish that project would get more attention!! Thanks for the post Darknet!

Primary Sidebar

Search Darknet

  • Email
  • Facebook
  • LinkedIn
  • RSS
  • Twitter

Advertise on Darknet

Latest Posts

SUDO_KILLER - Auditing Sudo Configurations for Privilege Escalation Paths

SUDO_KILLER – Auditing Sudo Configurations for Privilege Escalation Paths

Views: 82

sudo is a powerful utility in Unix-like systems that allows permitted users to execute commands with … ...More about SUDO_KILLER – Auditing Sudo Configurations for Privilege Escalation Paths

Bantam - Advanced PHP Backdoor Management Tool For Post Exploitation

Bantam – Advanced PHP Backdoor Management Tool For Post Exploitation

Views: 317

Bantam is a lightweight post-exploitation utility written in C# that includes advanced payload … ...More about Bantam – Advanced PHP Backdoor Management Tool For Post Exploitation

AI-Powered Cybercrime in 2025 - The Dark Web’s New Arms Race

AI-Powered Cybercrime in 2025 – The Dark Web’s New Arms Race

Views: 515

In 2025, the dark web isn't just a marketplace for illicit goods—it's a development lab. … ...More about AI-Powered Cybercrime in 2025 – The Dark Web’s New Arms Race

Upload_Bypass - Bypass Upload Restrictions During Penetration Testing

Upload_Bypass – Bypass Upload Restrictions During Penetration Testing

Views: 503

Upload_Bypass is a command-line tool that automates discovering and exploiting weak file upload … ...More about Upload_Bypass – Bypass Upload Restrictions During Penetration Testing

Shell3r - Powerful Shellcode Obfuscator for Offensive Security

Shell3r – Powerful Shellcode Obfuscator for Offensive Security

Views: 695

If antivirus and EDR vendors are getting smarter, so are the tools that red teamers and penetration … ...More about Shell3r – Powerful Shellcode Obfuscator for Offensive Security

Understanding the Deep Web, Dark Web, and Darknet (2025 Guide)

Understanding the Deep Web, Dark Web, and Darknet (2025 Guide)

Views: 8,670

Introduction: How Much of the Internet Can You See? You're only scratching the surface when you … ...More about Understanding the Deep Web, Dark Web, and Darknet (2025 Guide)

Topics

  • Advertorial (28)
  • Apple (46)
  • Countermeasures (227)
  • Cryptography (82)
  • Database Hacking (89)
  • Events/Cons (7)
  • Exploits/Vulnerabilities (431)
  • Forensics (65)
  • GenAI (3)
  • Hacker Culture (8)
  • Hacking News (229)
  • Hacking Tools (684)
  • Hardware Hacking (82)
  • Legal Issues (179)
  • Linux Hacking (74)
  • Malware (238)
  • Networking Hacking Tools (352)
  • Password Cracking Tools (104)
  • Phishing (41)
  • Privacy (219)
  • Secure Coding (118)
  • Security Software (233)
  • Site News (51)
    • Authors (6)
  • Social Engineering (37)
  • Spammers & Scammers (76)
  • Stupid E-mails (6)
  • Telecomms Hacking (6)
  • UNIX Hacking (6)
  • Virology (6)
  • Web Hacking (384)
  • Windows Hacking (169)
  • Wireless Hacking (45)

Security Blogs

  • Dancho Danchev
  • F-Secure Weblog
  • Google Online Security
  • Graham Cluley
  • Internet Storm Center
  • Krebs on Security
  • Schneier on Security
  • TaoSecurity
  • Troy Hunt

Security Links

  • Exploits Database
  • Linux Security
  • Register – Security
  • SANS
  • Sec Lists
  • US CERT

Footer

Most Viewed Posts

  • Brutus Password Cracker – Download brutus-aet2.zip AET2 (2,291,919)
  • Darknet – Hacking Tools, Hacker News & Cyber Security (2,173,071)
  • Top 15 Security Utilities & Download Hacking Tools (2,096,614)
  • 10 Best Security Live CD Distros (Pen-Test, Forensics & Recovery) (1,199,675)
  • Password List Download Best Word List – Most Common Passwords (933,464)
  • wwwhack 1.9 – wwwhack19.zip Web Hacking Software Free Download (776,130)
  • Hack Tools/Exploits (673,287)
  • Wep0ff – Wireless WEP Key Cracker Tool (530,143)

Search

Recent Posts

  • SUDO_KILLER – Auditing Sudo Configurations for Privilege Escalation Paths May 12, 2025
  • Bantam – Advanced PHP Backdoor Management Tool For Post Exploitation May 9, 2025
  • AI-Powered Cybercrime in 2025 – The Dark Web’s New Arms Race May 7, 2025
  • Upload_Bypass – Bypass Upload Restrictions During Penetration Testing May 5, 2025
  • Shell3r – Powerful Shellcode Obfuscator for Offensive Security May 2, 2025
  • Understanding the Deep Web, Dark Web, and Darknet (2025 Guide) April 30, 2025

Tags

apple botnets computer-security darknet Database Hacking ddos dos exploits fuzzing google hacking-networks hacking-websites hacking-windows hacking tool Information-Security information gathering Legal Issues malware microsoft network-security Network Hacking Password Cracking pen-testing penetration-testing Phishing Privacy Python scammers Security Security Software spam spammers sql-injection trojan trojans virus viruses vulnerabilities web-application-security web-security windows windows-security Windows Hacking worms XSS

Copyright © 1999–2025 Darknet All Rights Reserved · Privacy Policy