Controlling Spam with SpamAssassin
How to set up SpamAssassin and teach it to recognize spam.
The people who produce unsolicited commercial e-mail (UCE), or spam, are the big thieves of the Information Age, spewing out messages for pharmaceuticals, timepieces, fast money and fast women. Large chunks of bandwidth that we have to pay for are eaten up by these crooks. After getting these messages, we have to waste time going through our inboxes and deleting the garbage. Further, unlike magazines, newspapers, commercial radio and television, where the advertisements reduce the cost or make the content free, spam gives nothing back to us as readers or viewers.
Although we cannot stop spam, some tools exist to make spam easier to deal with. One such tool is SpamAssassin, which looks at each incoming e-mail message and rates the probability that the e-mail is spam. Messages that are given a high probability of being spam get flagged as such, and other programs, such as Evolution, KMail or Procmail, can deal painlessly with the flagged e-mail.
SpamAssassin works by going through e-mail messages and looking for things that are associated with spam or non-spam e-mail, which add or subtract points from an e-mail's score. So, for example, the word Viagra, and close misspellings of Viagra (as they are used in many pharmaceutical spam messages), add to the total score. On the other hand, a valid Sender Policy Framework (SPF) record in the e-mail, which shows that the sender location was not forged, subtracts from the score. By default, any message that gets a total score of five or more is assumed to be spam.
One problem with the above calculations is that they are a fair bit of work for your computer, so if your machine is already straining under its workload, or if you deal with a lot of e-mail, you may want to look at a hardware upgrade (a faster CPU and/or more memory) before starting up SpamAssassin.
A number of Linux distributions include SpamAssassin by default. If yours isn't one of them, it should be very simple to add. If you have a Debian-based distribution, it should be as simple as starting up a terminal window and typing:
sudo apt-get install spamassassin
Once installed, you can start tweaking SpamAssassin's settings. SpamAssassin's per-user configuration file can be found at ~/.spamassassin/user_prefs. The first setting to look at is required_score, the total score at which a message is treated as spam.
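The setting is a single line in user_prefs. As a minimal sketch, the following appends it from the shell if the file does not set it already. PREFS is a variable name introduced only for this example, and 5 is simply the default threshold described earlier.

```shell
# PREFS is a name chosen for this sketch; it defaults to the per-user file.
PREFS="${PREFS:-$HOME/.spamassassin/user_prefs}"
mkdir -p "$(dirname "$PREFS")"

# Add a threshold line only if the file does not set one already.
if ! grep -q '^required_score' "$PREFS" 2>/dev/null; then
    {
        echo '# Total score at or above which a message is flagged as spam'
        echo 'required_score 5'
    } >> "$PREFS"
fi

# Show the value now set in this file.
grep '^required_score' "$PREFS"
```

Running spamassassin --lint afterward is a quick way to catch typos in the file.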
SpamAssassin is not perfect, no matter how you set things. There will be some spam e-mail allowed through, and some valid e-mail will be classed as spam. The goal with the configuration process is to make sure this happens as seldom as possible. The score of five is an excellent compromise for most people. But, if you find yourself getting a lot of spam coming through as non-spam, even after taking the configuration steps noted below, you may want to lower that number to a four or three (or possibly even lower). If, on the other hand, you find after configuration you have a lot of real e-mail identified as spam, you might want to raise the required_score.
There are some people that you always want to hear from, or at least, always want their e-mail to come through, such as coworkers and family members. There also are people that you never want to hear from again, such as annoying exes. SpamAssassin deals with these situations by having a whitelist and blacklist. An e-mail from someone on the whitelist gets 100 subtracted from the score; anyone on the blacklist gets 100 added to the score. To add someone to your whitelist or blacklist, add whitelist_from or blacklist_from lines with the sender's address to user_prefs.
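As a rough sketch, whitelist and blacklist entries could be appended like this. The addresses are placeholders under example domains, not real correspondents, and PREFS is a variable name introduced only for this example.

```shell
# PREFS is a name chosen for this sketch; it defaults to the per-user file.
PREFS="${PREFS:-$HOME/.spamassassin/user_prefs}"
mkdir -p "$(dirname "$PREFS")"

# Placeholder addresses; file-glob patterns such as *@example.org also work.
cat >> "$PREFS" <<'EOF'
whitelist_from alice@example.com
whitelist_from *@example.org
blacklist_from annoying.ex@example.net
EOF
```

Whitelisting a whole domain with a pattern is convenient for coworkers, but it also lets through any spam that forges that domain, so single addresses are the more cautious choice.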
Some people have specific reasons why they would want particular spam tests changed. For example, people working at a jewelry store, or watch collectors, might want to allow messages where the word Rolex has been emphasized, accepting that doing so also will increase the amount of replica-watch-related spam they will see. There is a list of SpamAssassin tests at http://spamassassin.apache.org/tests.html. For example, to change the score that an e-mail message gets when the word Rolex has been emphasized, reducing the chances that such a message would be tagged as spam, put a score line in user_prefs that names the test and gives its new score.
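A score override is one line of the form "score TEST_NAME value". The sketch below appends such a line. The test name EM_ROLEX is an assumption made for illustration, so look up the actual name for your SpamAssassin version in the tests list before relying on it. PREFS, RULE and NEW_SCORE are names introduced only for this example.

```shell
# PREFS is a name chosen for this sketch; it defaults to the per-user file.
PREFS="${PREFS:-$HOME/.spamassassin/user_prefs}"
mkdir -p "$(dirname "$PREFS")"

# ASSUMPTION: EM_ROLEX stands in for the Rolex-emphasis test; confirm the
# real test name for your version in the tests list before using it.
RULE="EM_ROLEX"
NEW_SCORE="0.0"

# A score line has the form: score TEST_NAME value
echo "score $RULE $NEW_SCORE" >> "$PREFS"
```

A score of 0 switches the test off, so it no longer contributes to a message's total.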
If too many legitimate Rolex-brand watch-related e-mail messages are still being tagged as spam, the score for that test could be changed to a negative number.
By default, SpamAssassin assumes that e-mail in a number of Asian languages, most notably, but not exclusively, Chinese, Japanese and Korean, is probably spam. This is a problem if you use one of those languages. To allow Asian languages, you need to uncomment some lines by removing the # character at the start of the last four lines of user_prefs.
Now, let's further refine SpamAssassin's taste. My first run-through with SpamAssassin was a disappointment. Out of some 2,200 spam messages, only about 10% were correctly identified as spam. Fortunately, with SpamAssassin there is a utility program called sa-learn that will "teach" SpamAssassin what you consider to be spam and ham (non-spam). This process greatly improves SpamAssassin's ability to identify spam messages correctly. The trick here is to create folders, one filled with spam and another filled with the sort of material you want to keep, and then feed each folder into sa-learn. Using the Evolution e-mail program, I created a folder called BULK, and then I manually placed all the spam messages into that folder. Next, I ran the sa-learn program with the following command:
sa-learn --mbox --spam ~/.evolution/mail/local/BULK
Evolution stores all its e-mail in the mbox mail format, thus the --mbox option in the command above. The command for the non-spam messages, which I keep in the Inbox folder, is:
sa-learn --mbox --ham ~/.evolution/mail/local/Inbox
The learning system SpamAssassin uses starts to become good at around 1,000 spam and 1,000 ham messages. With a semi-exception, the system doesn't improve noticeably until after seeing more than 5,000 e-mail messages. The semi-exception relates to the fact that spam is a moving target. Some spammers are always looking for better ways to get around filter programs, changing their spam as they go. What this means is that you need to re-train SpamAssassin periodically with new spam and new ham. How often depends on your situation, but basically you need to re-train whenever you see a noticeable increase in the amount of spam getting past SpamAssassin. Still, with training, it is very possible to reach spam-detection accuracy rates of more than 99%.
Remember that SpamAssassin remembers what e-mail it has seen before, so although some people may be tempted to run the same 1,000 e-mail messages through sa-learn five times, all this will do is waste time.
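One way to keep an eye on how much training SpamAssassin has had is sa-learn's --dump magic option, which prints the counters kept in the Bayes database. The sketch below pulls out the spam and ham counts. It assumes the counters appear on lines ending in nspam and nham, with the value in the third column, so check that against what your version prints. The function name bayes_count is made up for this example.

```shell
# bayes_count NAME: read `sa-learn --dump magic` style text on stdin and
# print the counter whose line ends in "non-token data: NAME".
# ASSUMPTION: the value sits in the third whitespace-separated column.
bayes_count() {
    awk -v name="$1" '$0 ~ ("non-token data: " name "$") { print $3 }'
}

# Only query the real database when sa-learn is installed.
if command -v sa-learn >/dev/null 2>&1; then
    dump=$(sa-learn --dump magic 2>/dev/null) || dump=""
    echo "spam learned: $(printf '%s\n' "$dump" | bayes_count nspam)"
    echo "ham learned:  $(printf '%s\n' "$dump" | bayes_count nham)"
fi
```

Comparing the two numbers against the 1,000-message marks mentioned above gives a rough idea of whether more training is worth the effort.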
8/27/2007 7:29 PM
Let's see how SpamAssassin actually rates a sample e-mail. For a test, I created a simple text file, testmail.txt, with the following:
From: [email protected]
To: [email protected]
Date: Sat, 2 Dec 2006 13:34:50 -0400 (EDT)
Subject: Back from vacation

Alice, I am back from vacation, anything important happen when I was away?
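One way to create such a file from the shell is a here-document, sketched below. The example.com addresses are placeholders chosen for this sketch, not the addresses used in the original test.

```shell
# Write a small test message; the example.com addresses are placeholders.
cat > testmail.txt <<'EOF'
From: bob@example.com
To: alice@example.com
Date: Sat, 2 Dec 2006 13:34:50 -0400 (EDT)
Subject: Back from vacation

Alice, I am back from vacation, anything important happen when I was away?
EOF
```

The blank line after the Subject header matters, because it is what separates the headers from the body of the message.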
Then, I ran SpamAssassin as a test with the following command:
spamassassin -t testmail.txt
I received an output like the following:
From: [email protected]
To: [email protected]
Date: Sat, 2 Dec 2006 13:34:50 -0400 (EDT)
Subject: Back from vacation
X-Spam-Checker-Version: SpamAssassin 3.0.3 (2005-04-27) on diamond
X-Spam-Level:
X-Spam-Status: No, score=-5.9 required=5.0 tests=ALL_TRUSTED,BAYES_00,
        NO_REAL_NAME autolearn=ham version=3.0.3

Alice, I am back from vacation, anything important happen when I was away?

Colin McGregor

Spam detection software, running on the system "diamond", has
identified this incoming email as possible spam. The original message
has been attached to this so you can view it (if it isn't spam) or label
similar future email. If you have any questions, see
the administrator of that system for details.

Content preview: Alice, I am back from vacation, anything important happen
  when I was away? Colin McGregor [...]

Content analysis details: (-5.9 points, 5.0 required)

 pts rule name        description
---- ---------------- ----------------------------------
 0.0 NO_REAL_NAME     From: does not include a real name
-3.3 ALL_TRUSTED      Did not pass through any untrusted hosts
-2.6 BAYES_00         BODY: Bayesian spam probability is 0 to 1%
                      [score: 0.0000]
With a score of -5.9, SpamAssassin would not consider the above to be actual spam. By editing testmail.txt and repeating the above, you can see how SpamAssassin reacts to various sorts of keywords, in particular terms commonly found in spam, such as luxury brand-name watches, pharmaceutical products, financial service terms and/or various pornographic terms.
It isn't clear yet what the magic bullet will be to stop spam and regain the bandwidth spam steals from all of us: better technology, new laws or better enforcement of laws currently in place. Likely an end to spam will require a mixture of actions. In the meantime, SpamAssassin does make dealing with spam a less painful, but not pain-free, experience.
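The arithmetic behind that verdict can be reproduced by hand. The following sketch adds up the three rule scores from the sample report and compares the total with the required score. It only illustrates the sum and is not how SpamAssassin itself is implemented.

```shell
# Add up the per-rule points from the report and compare with the threshold.
# The three rule lines are copied from the sample output.
REQUIRED=5.0
verdict=$(awk -v required="$REQUIRED" '
    { total += $1 }                       # first column holds the points
    END {
        spam = (total >= required) ? "yes" : "no"
        printf "total=%.1f required=%.1f spam=%s\n", total, required, spam
    }
' <<'EOF'
0.0 NO_REAL_NAME
-3.3 ALL_TRUSTED
-2.6 BAYES_00
EOF
)
echo "$verdict"   # prints: total=-5.9 required=5.0 spam=no
```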
Evolution and SpamAssassin
The Evolution e-mail display program has a good filtering system for sorting out incoming e-mail, but it is a bit weak when it comes to identifying spam. Fortunately, Evolution allows us to use external programs to help with sorting. From the main screen click on Tools→Filters. Then, click on +Add to create a new rule. You need a name for this rule, and spam should be just fine. Next, we want to send a copy of each e-mail to SpamAssassin and find out if SpamAssassin views the e-mail as spam; we do not care about the score SpamAssassin gives the e-mail, just a "yes" or "no". So, we Pipe to Program and then throw everything except the result code away. We do this with the instruction:
/usr/bin/spamassassin -e > /dev/null
If the above command returns a value of 0, it isn't spam. Anything more than 0 means we very likely have spam and want it dropped into a separate folder. In the example shown in Figure 1, I am sending the e-mail into a folder labeled BULK. After doing the above steps, we want the filter program to stop and wait for the next incoming e-mail.
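The same yes-or-no decision can be tried outside Evolution. The sketch below wraps the command in a small shell function that branches on the exit status. The function name classify, and the option to pass a different checker command, are additions made for this example only.

```shell
# classify FILE [CHECKER...]: print "ham" if the checker exits 0, else "spam".
# The checker defaults to the same command given to Evolution; passing a
# different command is a convenience added for this sketch.
classify() {
    msg=$1
    shift
    [ "$#" -gt 0 ] || set -- /usr/bin/spamassassin -e
    if "$@" < "$msg" > /dev/null 2>&1; then
        echo ham
    else
        # Any non-zero status lands here, including "command not found".
        echo spam
    fi
}
```

For example, classify testmail.txt would report how the test message from the main article is rated, assuming SpamAssassin is installed at that path.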
Figure 1. Creating and Editing Rules in Evolution
As noted previously, running sa-learn over the same e-mail twice is a waste of time. This raises another point when using Evolution and SpamAssassin: when you delete an e-mail message under Evolution, the program does not delete the e-mail from the ~/.evolution/mail/<file name> e-mail file; it just flags it for future removal. This way, if you make an error deleting an e-mail, you can get it back. To get rid of deleted e-mails completely under Evolution, you must click on Actions→Expunge. During your first days with SpamAssassin, when you might be running sa-learn several times over your BULK folder and your Inbox folder, you may want not only to delete e-mail previously seen by sa-learn, but also to Expunge it.
Source: http://www.ee.ryerson.ca/~courses/coe518/LinuxJournal/elj2007-153-SpamAssassin.pdf