Chapter 6: Crawling the Web with Java

Learn how to use Java to implement a customized Web crawler.

Crawler technology is useful in many types of Web-related applications For example, you might use a crawler to

look for broken links in a commercial Web site. You might also use a crawler to find changes to a Web site.

Although Web crawlers are conceptually easy, in that you just follow the links from one site to another, they are a bit challenging to create. One complication is that a list of links to be crawled must be maintained, and this list grows and shrinks as sites are searched. Another complication is the complexity of handling absolute versus relative links.

Fortunately, Java contains features that help make it easier to implement a Web crawler. First, Java's support for networking makes downloading Web pages simple. Second, Java's support for regular expression processing simplifies the finding of links. Third, Java's Collection Framework supplies the mechanisms needed to store a list of links.

The Web crawler developed in this chapter from the book The Art of Java, by Herbert Schildt and James Holmes, is called Search Crawler. It crawls the Web, looking for sites that contain strings matching those specified by the user. It displays the URLs of the sites in which matches are found. Although Search Crawler is a useful utility as is, its greatest benefit is found when it is used as a starting point for your own crawler-based applications.

Click here to download this free book chapter.

This was first published in September 2004

Dig deeper on Java for Lotus Notes Domino

0 comments

Oldest 

Forgot Password?

No problem! Submit your e-mail address below. We'll send you an email containing your password.

Your password has been sent to:

SearchWinIT

Search400

  • iSeries tutorials

    Search400.com's tutorials provide in-depth information on the iSeries. Our iSeries tutorials address areas you need to know about...

  • V6R1 upgrade planning checklist

    When upgrading to V6R1, make sure your software will be supported, your programs will function and the correct PTFs have been ...

  • Connecting multiple iSeries systems through DDM

    Working with databases over multiple iSeries systems can be simple when remotely connecting logical partitions with distributed ...

SearchEnterpriseLinux

SearchVirtualDataCentre.co.uk

Close