• Skip to main content
  • Skip to primary sidebar
  • Skip to footer
  • Home
  • About Darknet
  • Hacking Tools
  • Popular Posts
  • Darknet Archives
  • Contact Darknet
    • Advertise
    • Submit a Tool
Darknet – Hacking Tools, Hacker News & Cyber Security

Darknet - Hacking Tools, Hacker News & Cyber Security

Darknet is your best source for the latest hacking tools, hacker news, cyber security best practices, ethical hacking & pen-testing.

Web-Harvest – Web Data Extraction Tool

October 17, 2008

Views: 16,077

Web-Harvest is Open Source Web Data Extraction tool written in Java. It offers a way to collect desired Web pages and extract useful data from them. In order to do that, it leverages well established techniques and technologies for text/xml manipulation such as XSLT, XQuery and Regular Expressions. Web-Harvest mainly focuses on HTML/XML based web sites which still make vast majority of the Web content. On the other hand, it could be easily supplemented by custom Java libraries in order to augment its extraction capabilities.

Process of extracting data from Web pages is also referred as Web Scraping or Web Data Mining. World Wide Web, as the largest database, often contains various data that we would like to consume for our needs. The problem is that this data is in most cases mixed together with formatting code – that way making human-friendly, but not machine-friendly content. Doing manual copy-paste is error prone, tedious and sometimes even impossible. Web software designers usually discuss how to make clean separation between content and style, using various frameworks and design patterns in order to achieve that. Anyway, some kind of merge occurs usually at the server side, so that the bunch of HTML is delivered to the web client.

Every Web site and every Web page is composed using some logic. It is therefore needed to describe reverse process – how to fetch desired data from the mixed content. Every extraction procedure in Web-Harvest is user-defined through XML-based configuration files. Each configuration file describes sequence of processors executing some common task in order to accomplish the final goal. Processors execute in the form of pipeline. Thus, the output of one processor execution is input to another one. This can be best explained using the simple configuration fragment:

1
2
3
4
5
<xpath expression="//a[@shape='rect']/@href">
    <html-to-xml>
        <http url="http://www.somesite.com/"/>
    </html-to-xml>
</xpath>

When Web-Harvest executes this part of configuration, the following steps occur:

  1. http processor downloads content from the specified URL.
  2. html-to-xml processor cleans up that HTML producing XHTML content.
  3. xpath processor searches specific links in XHTML from previous step giving URL sequence as a result.

Web-Harvest supports a set of useful processors for variable manipulation, conditional branching, looping, functions, file operations, HTML and XML processing, exception handling. See User manual for technical description of provided processors.

You can download Web-Harvest 1.0 here:

webharvest1-exe.zip

Or read more here.

Share
Tweet
Share
Buffer
WhatsApp
Email
0 Shares

Filed Under: Hacking News, Hacking Tools, Web Hacking Tagged With: information gathering, information gathering tool, java



Reader Interactions

Comments

  1. navin says

    October 17, 2008 at 1:36 pm

    Web harvest is a pretty nice implementaion of extraction methods I’d read about before…….a definite weekend project

  2. patate says

    October 18, 2008 at 6:08 pm

    I’ve been using web harvest for the past year or so for random tasks and it does the job very well! Its XML syntax is not so heavy to learn. Their IDE tool is definitely a must for the application, otherwise its a pain to go back and forth their doc.

    I’ve run in to some memory issues where i had to augment the jvm’s memory, but otherwise its a very stable tool.

    I wish that project would get more attention!! Thanks for the post Darknet!

Primary Sidebar

Search Darknet

  • Email
  • Facebook
  • LinkedIn
  • RSS
  • Twitter

Advertise on Darknet

Latest Posts

AgentSmith HIDS - Host Based Intrusion Detection

AgentSmith HIDS – Host Based Intrusion Detection

padre - Padding Oracle Attack Tool

padre – Padding Oracle Attack Exploiter Tool

Privacy Implications of Web 3.0 and Darknets

Privacy Implications of Web 3.0 and Darknets

DataSurgeon - Extract Sensitive Information (PII) From Logs

DataSurgeon – Extract Sensitive Information (PII) From Logs

Pwnagotchi - Maximize Crackable WPA Material For Bettercap

Pwnagotchi – Maximize Crackable WPA Key Material For Bettercap

HardCIDR - Network CIDR and Range Discovery Tool

HardCIDR – Network CIDR and Range Discovery Tool

Topics

  • Advertorial (28)
  • Apple (46)
  • Countermeasures (225)
  • Cryptography (82)
  • Database Hacking (89)
  • Events/Cons (7)
  • Exploits/Vulnerabilities (430)
  • Forensics (64)
  • Hacker Culture (8)
  • Hacking News (228)
  • Hacking Tools (681)
  • Hardware Hacking (82)
  • Legal Issues (179)
  • Linux Hacking (72)
  • Malware (238)
  • Networking Hacking Tools (352)
  • Password Cracking Tools (104)
  • Phishing (41)
  • Privacy (218)
  • Secure Coding (118)
  • Security Software (233)
  • Site News (51)
    • Authors (6)
  • Social Engineering (37)
  • Spammers & Scammers (76)
  • Stupid E-mails (6)
  • Telecomms Hacking (6)
  • UNIX Hacking (6)
  • Virology (6)
  • Web Hacking (384)
  • Windows Hacking (169)
  • Wireless Hacking (45)

Security Blogs

  • Dancho Danchev
  • F-Secure Weblog
  • Google Online Security
  • Graham Cluley
  • Internet Storm Center
  • Krebs on Security
  • Schneier on Security
  • TaoSecurity
  • Troy Hunt

Security Links

  • Exploits Database
  • Linux Security
  • Register – Security
  • SANS
  • Sec Lists
  • US CERT

Footer

Most Viewed Posts

  • Brutus Password Cracker – Download brutus-aet2.zip AET2 (2,181,978)
  • Darknet – Hacking Tools, Hacker News & Cyber Security (2,172,352)
  • Top 15 Security Utilities & Download Hacking Tools (2,095,362)
  • 10 Best Security Live CD Distros (Pen-Test, Forensics & Recovery) (1,198,681)
  • Password List Download Best Word List – Most Common Passwords (931,850)
  • wwwhack 1.9 – wwwhack19.zip Web Hacking Software Free Download (774,478)
  • Hack Tools/Exploits (672,593)
  • Wep0ff – Wireless WEP Key Cracker Tool (528,864)

Search

Recent Posts

  • AgentSmith HIDS – Host Based Intrusion Detection August 31, 2023
  • padre – Padding Oracle Attack Exploiter Tool May 28, 2023
  • Privacy Implications of Web 3.0 and Darknets March 31, 2023
  • DataSurgeon – Extract Sensitive Information (PII) From Logs March 21, 2023
  • Pwnagotchi – Maximize Crackable WPA Key Material For Bettercap February 12, 2023
  • HardCIDR – Network CIDR and Range Discovery Tool December 29, 2022

Tags

apple botnets computer-security darknet Database Hacking ddos dos exploits fuzzing google hacking-networks hacking-websites hacking-windows hacking tool Information-Security information gathering Legal Issues malware microsoft network-security Network Hacking Password Cracking pen-testing penetration-testing Phishing Privacy Python scammers Security Security Software spam spammers sql-injection trojan trojans virus viruses vulnerabilities web-application-security web-security windows windows-security Windows Hacking worms XSS

Copyright © 1999–2023 Darknet All Rights Reserved · Privacy Policy