Crawler technology is useful in many types of Web-related applications. For example, you might use a crawler to look for broken links in a commercial Web site. You might also use a crawler to find changes to a Web site.
Although Web crawlers are conceptually simple, in that you just follow the links from one site to another, they are somewhat challenging to create. One complication is that a list of links to be crawled must be maintained, and this list grows and shrinks as sites are searched. Another complication is handling absolute versus relative links: a relative link must be resolved against the URL of the page on which it was found before it can be crawled.
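To make these two complications concrete, here is a minimal sketch of a to-crawl list that also resolves relative links. It is not the Search Crawler code from the book; the class name CrawlFrontier and its methods are hypothetical, and the example.com URLs are placeholders. It assumes links are well-formed enough for java.net.URI to parse.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, for illustration only: a queue of links still to be
// crawled plus a set of links already seen, so no URL is queued twice.
final class CrawlFrontier {
    private final Deque<String> toCrawl = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    // Turns a link found on pageUrl into an absolute URL.
    // Absolute links come back unchanged; relative links are resolved
    // against pageUrl. URI.create throws IllegalArgumentException on
    // malformed input (for example, a link containing spaces).
    static String resolve(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }

    // Queues the link if it has not been seen before; returns true if queued.
    boolean add(String pageUrl, String link) {
        String absolute = resolve(pageUrl, link);
        if (seen.add(absolute)) {
            toCrawl.addLast(absolute);
            return true;
        }
        return false;
    }

    // Removes and returns the next URL to crawl, or null when the list is empty.
    String next() {
        return toCrawl.pollFirst();
    }

    // Number of URLs currently waiting to be crawled.
    int size() {
        return toCrawl.size();
    }

    public static void main(String[] args) {
        CrawlFrontier frontier = new CrawlFrontier();
        String page = "http://example.com/docs/index.html";
        frontier.add(page, "page2.html");                      // relative link
        frontier.add(page, "../about.html");                   // relative, parent directory
        frontier.add(page, "http://example.com/about.html");   // duplicate, not queued again
        for (String url = frontier.next(); url != null; url = frontier.next()) {
            System.out.println(url);
        }
    }
}
```

The list grows each time add accepts a new link and shrinks each time next hands one out, which is the behavior described above. A fuller crawler would also need to handle malformed links and decide whether URLs that differ only in a fragment count as the same page.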
Fortunately, Java contains features that make it easier to implement a Web crawler. First, Java's support for networking makes downloading Web pages simple. Second, Java's support for regular expression processing simplifies finding links. Third, Java's Collections Framework supplies the mechanisms needed to store a list of links.
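As an illustration of the second and third points, the following sketch uses java.util.regex to pull link targets out of a page and stores them in a List. The class name LinkExtractor and the pattern itself are assumptions made for this example, not the book's code; a regular expression like this handles only quoted href attributes in anchor tags and is not a full HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class LinkExtractor {
    // Matches an <a ...> tag and captures the quoted value of its href
    // attribute. Case-insensitive, so <A HREF='...'> also matches.
    private static final Pattern HREF = Pattern.compile(
        "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"'>]*)[\"']",
        Pattern.CASE_INSENSITIVE);

    // Returns the href values found in the page text, in document order.
    static List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            String link = matcher.group(1).trim();
            if (!link.isEmpty()) {
                links.add(link);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://example.com/a.html\">A</a> "
                    + "<A class='x' HREF='b.html'>B</A></p>";
        for (String link : extract(html)) {
            System.out.println(link);
        }
    }
}
```

The extracted values may be absolute or relative, so each one still has to be resolved against the URL of the page it came from before it is crawled.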
The Web crawler developed in this chapter from the book The Art of Java, by Herbert Schildt and James Holmes, is called Search Crawler. It crawls the Web, looking for sites that contain strings matching those specified by the user. It displays the URLs of the sites in which matches are found. Although Search Crawler is a useful utility as is, its greatest benefit is found when it is used as a starting point for your own crawler-based applications.