
Harvesting email addresses and phishing

Learn the tools and techniques phishers use to harvest your email address so they can spam you.

The following is tip #7 from "Phishing exposed -- 10 tips in 10 minutes," excerpted from Chapter 3 of the book Phishing Exposed, published by Syngress Publishing.

As many of you know, a major component in spamming is getting hold of valid email addresses to spam. The same goes for phishing. This part of the chapter delves into some of the more effective and creative techniques for harvesting valid email addresses. We will not attempt to cover them all, because frankly, there are many different ways to go about this task, and some are independent of our particular focus here.

The art of email harvesting is to obtain valid, high-quality, high-volume email addresses. In most cases, these factors trade off against time. High quality at high volume usually takes a lot longer to obtain, since you have to focus on more targeted mailing lists, newsgroups, and any other medium that displays static email addresses, and even then the quality of the addresses themselves isn't really known. For high volume alone, a phisher will run multiple extractor tools on Web sites, newsgroups, and mailing lists to obtain email addresses. For high quality, high volume, and high speed, a phisher will most likely require a hacker to obtain stolen information by breaking into or exploiting systems to gain access to their back-end customer databases.

Harvesting Tools, Targets, and Techniques

According to the FTC, 86 percent of the email addresses posted to Web pages receive spam. If something had an @ sign in it, no matter where it was placed on the Web page, it attracted spammers' attention. The same goes for newsgroups: 86 percent of the addresses posted to newsgroups also receive spam.

There are multiple ways to harvest email addresses off Web pages and newsgroups, but the majority of spammers and phishers use what are called bots or crawlers. These tools literally scour the Internet looking for email addresses. Crawler tools are readily available and fairly inexpensive, and they can render solid results within the first hour. Take a look at one such site (see Figure 8), and you will see that it offers multiple tools that enable this sort of activity, and the prices are very reasonable. These tools include harvesting methods that grab information from Web sites, search engines, newsgroups, and whois databases.

Figure 8 Available E-Mail Harvesting Products

If you take a closer look at this product, you will see that it consists of multiple features, including search engine queries to trivially obtain the data we need to start sending our phish emails (see Figure 9).

Figure 9 Search Engine Selection

At this point, we tell the tool to search for specific words, and it begins to look up the data by crawling all the sites it finds to extract email addresses (see Figure 10).

Figure 10 E-Mail Collection

Unfortunately, this technique does not go undetected (see Figure 11). Google interprets the automated requests against its site as malware or spyware coming from our computer and will ultimately block our IP address. This limits our searching ability, because human intervention is then required to continue crawling. Ideally, the tool would include a crawling feature that routes our searches through multiple proxies, so that the requests would not all appear to come from one IP address and we would not be blocked.

Figure 11 We Have Been Spotted!

For our more technically savvy readers with an interest in better stealth control, freely available tools do similar things while allowing a lot more extensibility (and possible evilness, such as scanning for vulnerabilities). Specifically, wget is a very powerful tool for performing this type of "research" to obtain the information you need. Using wget in combination with other UNIX tools, we can easily demonstrate the power of this technique.

The trade-off of a somewhat stealthy approach versus our apparently overt attempt is mainly the time it will take to conduct the Web crawl, especially if you are using one search engine to crawl. The fast rate at which the Web Extractor tool could crawl made us look suspicious to Google's defensive infrastructure.

First, then, we need to set up wget to be more optimal for us, so we construct or edit a .wgetrc file (this file sits in your home directory). The .wgetrc file has some options that help you control wget without writing extremely long command lines. Before we get started, it should be noted that .wgetrc requires somewhat conservative settings, or you will end up mirroring a good portion of the Web, which more than likely will not fit on your local hard drive. Previously, in Chapter 2, we observed that the /robots.txt file caused wget to ignore the other directories involved with our target. This was due to wget complying with the Robot Exclusion Standard. When we're harvesting email addresses, we must assume that we probably don't want to comply with this standard, since it limits our extracting of information. Here is what our .wgetrc should look like:

### Our .wgetrc file we will use to do our evil deeds.

# Lowering the maximum depth of the recursive retrieval is handy to
# prevent newbies from going too "deep" when they unwittingly start
# the recursive retrieval.  The default is 5; for our demo we raise it to 7.
reclevel = 7
# Set this to on to use timestamping by default:
timestamping = on

# For more stealth – we can optionally use a proxy – for our demo
# we'll keep it off, but assume that we would use it to remain stealthy.
#http_proxy =

# If you do not want to use proxy at all, set this to off.
#use_proxy = on

# Setting this to off makes Wget not download /robots.txt.  Be sure to
# know *exactly* what /robots.txt is and how it is used before changing
# the default!
robots = off

# It can be useful to make Wget wait between connections.  Set this to
# the number of seconds you want Wget to wait. We're setting this to 5 
# seconds for demo purposes, we can use 'randomwait = on' optionally. 
wait = 5

# You can turn on recursive retrieving by default (don't do this if
# you are not sure you know what it means) by setting this to on.
recursive = on
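To make concrete what the robots setting above switches off, here is a minimal Python sketch of the check a compliant crawler performs before fetching a path. It uses the standard library's urllib.robotparser rather than wget itself, and the robots.txt rules and paths are hypothetical, chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; these paths are made up for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /archive/
Disallow: /private/
"""


def allowed_paths(robots_txt, paths, agent="*"):
    """Return a mapping of path -> whether a compliant crawler may fetch it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {path: parser.can_fetch(agent, path) for path in paths}


result = allowed_paths(ROBOTS_TXT, ["/index.html", "/archive/1.html"])
print(result)
```

A crawler that honors the standard would skip every path reported as False; a crawler configured to ignore robots.txt never makes this check at all, which is why site operators should not treat robots.txt as a protection mechanism.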

We now have our wget environment ready for use and need to find a good target that will provide us some email addresses, such as any particular known mailing list. For our example, let's select a security mailing list (see Figure 12).

Figure 12 Mailing List Targets—Easy to Fetch Recursively

It is a known fact that open mailing lists are a popular target because their primary function is to draw a bunch of email users to communicate in a centralized forum. Even though harvesting email addresses from the Internet for the purpose of spamming is now illegal per the CAN-SPAM Act of 2003, literally thousands of mailing lists and organizations are targeted daily by directory harvest attacks (DHAs). DHAs are spammers' attempts to locate valid email addresses by infiltrating email servers and building a database of the legitimate email addresses they find.

Postini, an email security vendor, reported in March 2005 (http://postini.com/news_events/pr/pr033105.php) that it had processed over 26 million DHAs targeting corporate email alone, averaging more than 843,157 DHAs per day! We can only imagine how unbelievably high these daily DHA statistics would be if every mailing list targeted by spammers were monitored.
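The two Postini figures can be checked against each other with a line of arithmetic. The sketch below assumes the total covers a single 31-day month; that reporting period is an assumption on our part, not something the text above states.

```python
# Daily average quoted above.
DAILY_AVERAGE = 843_157

# Assumption: the reporting period is one 31-day month.
DAYS_IN_PERIOD = 31

monthly_total = DAILY_AVERAGE * DAYS_IN_PERIOD
print(f"{monthly_total:,}")  # 26,137,867
```

Under that assumption the daily average multiplies out to just over 26 million, which is consistent with the quoted total.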

In our case, the target we are going after is quite an easy one from which to gain some mailing addresses. The site has an open directory listing of all the lists they archive, so this could be a gold mine for us. Now, the slightly obvious part of our demo is that if we were phishers, we would probably not target a security-focused mailing list, since it would be the equivalent of trying to hold up a police station with a knife, not to mention that the quality of email addresses might not be as high, since they are either email addresses of the mailing list itself or throwaway addresses. But as noted earlier, this is why we selected this particular target for demonstration purposes. This isn't to say that spammers do not target security mailing lists, but then again, the agenda of the common spammer is quite different and a bit more arbitrary than a criminal investing time in fraudulent activity.

Taking a look at the site, we want to execute a quick command that can grab the email addresses out of the Web pages. That means we have to sample how the site attempts to protect its email addresses from harvesting. We should be able to safely assume that a set of Web-archived security mailing lists is quite aware of the problem of spam, so some protection schemes should be in place. We can hope that this will still be a "one-liner" for us to harvest the email addresses. A one-liner is one set of commands on the UNIX command prompt, for example:

ls -la | grep -i somefile.txt

To do this, we locate one of the mailing-list submissions with an email address in it and see how they handle it. Here is one:

> > To: Steve Fletcher;

We want to target security-basics and be able to ensure that we can pick this email and others out of the HTML and successfully store them as human-readable email addresses. When we view the HTML source, we see what the email address looks like to a script, as shown in Figure 13.

Figure 13 Antiharvesting Technique

Sure enough, just as suspected, the site uses an antiharvesting technique that is intended to deter and evade most email address extractors. Whether or not it will actually work is the big question. However, in our case, since we know how the site is handling antiharvesting techniques, we should be able to quickly undo them with some simple Perl scripting. The antiharvesting technique hides the email address by combining two tricks: a comment that appears only in the HTML code, and the HTML coded character set. In this situation, the site is using the character reference for the commercial at sign (@), such as &#64;, and the one for a period (.), such as &#46;. The comment is then inserted arbitrarily within the email address. A human viewing the rendered page never sees it, but wget retrieving the HTML document will, because the comment is part of the source code (see Figure 14).

Figure 14 W3C Details of the Character Set for HTML
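For site operators who want to judge how much protection this kind of obfuscation really offers, here is a minimal Python sketch. It is not the Perl filter used in this chapter; the sample string, the example.com address, and the function name are all hypothetical. It simply strips HTML comments and decodes character references using the standard library.

```python
import html
import re

# Matches an HTML comment, including one that spans several lines.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def normalize(fragment):
    """Strip HTML comments, then decode character references."""
    without_comments = COMMENT_RE.sub("", fragment)
    return html.unescape(without_comments)


# Hypothetical obfuscated address, modeled on the technique described above.
sample = "user<!-- nospam -->&#64;example<!-- x -->&#46;com"
print(normalize(sample))  # user@example.com
```

Two standard-library calls undo both layers, which suggests this style of obfuscation only slows down the least capable extractors.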

Some Perl-compatible regular expressions (regex) can bypass this filter trivially, and we can still do it all on one line. The advantage of Perl is the -e flag, or the eval flag, which takes in lines of code on the command line and executes them. So, to quickly set up our Web email extractor, we know that we can use wget to mirror the site and post the data to standard out. Then we'll pipe it to some Perl code to handle the filtering. To eliminate duplicates, we'll perform one last pipe to sort -u >> maillist.txt, which will uniquely sort the emails and append them to maillist.txt. Our command line now looks like this:

me@unix~$ wget -m -q -O - '' 
| perl -lne 's/
 if (@x) { $x[0].="@"; print @x }' | 
sort -u >> maillist.txt

Regex can be a pain to get your mind around at first, but as you get into it, it's not all that bad. What our filter is doing is eliminating the HTML comment altogether as it finds it within the HTML. Then it handles the character codes and converts them to their proper character representation. From that point it takes a variable and assigns to it the patterns that match multiple variants of the antiharvesting filters, such as user at user dot com. The regex then converts each match to a normally formatted email address and prints it to standard out (stdout). Since we are piping it to sort and sending it to a file, this will eliminate duplicates and store them in our maillist.txt file. Now we have successfully harvested email addresses from the site.
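The last stage of the pipeline, the unique sort, can be approximated in a few lines of Python. This is a sketch of the idea rather than an exact reimplementation of sort -u: it also trims whitespace and drops blank lines, and the sample addresses are hypothetical.

```python
def unique_sorted(lines):
    """Drop blank lines and duplicates, then return the rest in sorted order."""
    cleaned = {line.strip() for line in lines if line.strip()}
    return sorted(cleaned)


# Hypothetical input with a duplicate and a blank line.
raw = ["bob@example.com\n", "alice@example.com\n", "\n", "bob@example.com\n"]
print(unique_sorted(raw))  # ['alice@example.com', 'bob@example.com']
```

Note that the command line above appends with >>, so duplicates are removed only within a single run; running it again can add lines that are already in the file.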

Let's run maillist.txt through a line count using the command wc -l to see how many addresses we successfully harvested from the site. We obtained only 174 addresses on this initial pass, which is actually not bad for a light footprint of a Web site. If you tried this on a site that distributes press releases for companies, you could expect it to take days to grab all the email addresses off the site. On a site that has an overwhelming number of email addresses posted, you can lower your recursive count to get speedy results and lower your duplicate counts if you're looking to harvest at a faster rate.

In less than five minutes with this script, we were able to obtain more than 300 unique email addresses from a publicly available press release distributing firm. With a wget "in-file" full of domains to harvest from, you can spend a few days pulling a lot of e-mail addresses off the Web. Whether you're using readily available tools or homegrown, command-line regular expressions to scour the Web for e-mail addresses, all it really takes is a little time, patience, and available data storage!

Return Receipts

A very neat trick for obtaining high-quality email addresses is to be on a mailing list and use return receipts to gather addresses. I was once on a list with lots of major corporations and financial institutions, and the majority of them use Outlook or an automatic Message Disposition Notification via their IMAP server. A weakness with this mechanism is that many implementations automatically respond with a delivery notice when a user sends a message requesting a receipt. Even if the email was not read, the sender of the original email is notified with detailed information about the user. Here's an example:

Final-Recipient: RFC822;
Disposition: automatic-action/MDN-sent-automatically; displayed
X-MSExch-Correlation-Key: LKhYJD6UMU+l66CeV9Ju6g==
Original-Message-ID: <;
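For administrators auditing what their own mail systems give away, the fields in such a notification are easy to read programmatically. The following Python sketch parses a hypothetical notification with the standard email module; the addresses and message ID are placeholders, since the real values are omitted above.

```python
from email import message_from_string

# Hypothetical notification fields; the address and message ID are placeholders.
MDN_FIELDS = """\
Final-Recipient: RFC822; user@example.com
Disposition: automatic-action/MDN-sent-automatically; displayed
Original-Message-ID: <12345@example.org>
"""


def summarize_mdn(text):
    """Extract the disclosed address and whether the receipt was sent automatically."""
    msg = message_from_string(text)
    _addr_type, _, address = msg["Final-Recipient"].partition(";")
    modes, _, disposition = msg["Disposition"].partition(";")
    _action_mode, _, sending_mode = modes.partition("/")
    return {
        "address": address.strip(),
        "automatic": sending_mode.strip().lower() == "mdn-sent-automatically",
        "disposition": disposition.strip(),
    }


summary = summarize_mdn(MDN_FIELDS)
print(summary)
```

An "automatic" value of True means the receipt went out without the user approving it, which is exactly the behavior that leaks a valid address to anyone who requests a receipt.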

On an unmoderated mailing list rumored to be occupied by 1200 members, I was able to obtain over 500 unique, high-quality email addresses triggered by one message I sent to the list. Not only that, I now can use this to create a signature for the username semantics for each company that autoresponded to my receipt request. This will enable me to obtain more email addresses through guessing and some basic research:

250 +OK SMTP server V1.182.4.2 Ready
250 +OK Sender OK
rcpt to:
550 Mailbox unavailable or access denied 
- <>
550 Mailbox unavailable or access denied 
- <>
250 +OK Recipient OK

To top it off, the username semantics are verified by their mail server.

Phishing exposed -- 10 tips in 10 minutes

 Home: Introduction
 Tip 1: Phishing and email basics
 Tip 2: Phishing and the mail delivery process
 Tip 3: Anonymous email and phishing
 Tip 4: Forging headers and phishing
 Tip 5: Open relays, proxy servers and phishing
 Tip 6: Proxy chaining, onion routing, mixnets and phishing
 Tip 7: Harvesting email addresses and phishing
 Tip 8: Phishers, hackers and insiders
 Tip 9: Sending spam and phishing
 Tip 10: Fighting phishing with spam filters

This chapter excerpt from Phishing Exposed, by Lance James, is printed with permission from Syngress Publishing, Copyright 2005.
