[Geeks are Sexy] technology news





Thursday, March 30, 2006

The Inner Workings of a Bayesian Spam Filter

Before looking at how Bayesian filters work, let's take a look at the history and the theory behind Bayesianism. Bayesian theory is named after the Reverend Thomas Bayes, a British mathematician who lived in the 18th century. It estimates the probability of an event from prior evidence, updating a degree of belief as new observations come in. To help you understand the concept, here's an example that sums up the idea very well:

John has been drinking 2 cups of coffee every morning for the past 5 years. This morning, John drank 2 cups of coffee; it is therefore very probable that John will drink 2 cups of coffee tomorrow morning.

Bayesian theory, when applied to spam, is a very effective way of telling legitimate email from junk. Bayesian filters calculate the probability that a message is spam according to its content. The more email a filter sees, the more effective it gets. At first, the filter will not be perfect, but if you give it hundreds of emails to analyze, it should eventually reach an efficiency of approximately 99.5%.

How can it get this effective? Let's say you receive 10 messages. If you open and close them really quickly, I'm pretty sure you'll know right away which of them are spam. If you can weed them out that quickly, it means your brain has associated certain patterns with them. Patterns are natural phenomena; they are everywhere. Since they repeat, you can usually predict them using mathematical formulas, and this is why Bayesian filtering is so effective.

After inspecting hundreds of emails, a spam filter builds a list of the most common elements that characterize junk mail: words (or combinations of them), titles, HTML code, meta information, color patterns. It categorizes legitimate mail the same way, so it can compare each message you get against both lists.
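The two lists described above can be sketched in a few lines of Python. This is only an illustration, not how any particular product does it: the `tokenize` and `train` helpers, the sample messages, and the choice to look at words only (ignoring HTML, colors and meta information) are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens (hypothetical helper)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def train(messages):
    """Count, for each token, the number of messages it appears in."""
    counts = Counter()
    for message in messages:
        # set() so a word repeated inside one message is only counted once
        counts.update(set(tokenize(message)))
    return counts

# Made-up training mail, purely for illustration
spam_messages = ["Buy cheap viagra now", "Cheap pills, buy now!"]
ham_messages = ["Lunch tomorrow? Hugo", "Meeting notes attached, see you tomorrow"]

spam_counts = train(spam_messages)  # the "junk mail" list
ham_counts = train(ham_messages)    # the "legitimate mail" list

print(spam_counts["cheap"], ham_counts["cheap"])  # prints: 2 0
```

A real filter would keep these counts on disk and update them every time a message is marked as spam or as legitimate.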

Here is the mathematical formula used to determine whether a message is considered spam.

Excerpt from Wikipedia: "Bayesian email filters take advantage of Bayes' theorem. Bayes' theorem, in the context of spam, says that the probability that an email is spam, given that it has certain words in it, is equal to the probability of finding those certain words in spam email, times the probability that any email is spam, divided by the probability of finding those words in any email:"

    P(spam | words) = P(words | spam) × P(spam) / P(words)
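To make the relationship in the excerpt concrete, here is a small worked example in Python. The function name and all three input numbers are invented for illustration; the only thing taken from the text is the formula itself, with the probability of finding the word in any email expanded into its spam and non-spam parts.

```python
def p_spam_given_word(p_word_given_spam, p_spam, p_word_given_ham):
    """Bayes' theorem for a single word (hypothetical helper).

    P(spam | word) = P(word | spam) * P(spam) / P(word)
    where P(word) = P(word | spam) * P(spam) + P(word | ham) * P(ham).
    """
    p_ham = 1.0 - p_spam
    p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    if p_word == 0:
        raise ValueError("the word never appears in any email")
    return p_word_given_spam * p_spam / p_word

# Assumed numbers: the word shows up in 20% of spam, in 1% of legitimate
# mail, and half of all incoming mail is spam.
result = p_spam_given_word(p_word_given_spam=0.20, p_spam=0.50, p_word_given_ham=0.01)
print(round(result, 3))  # prints: 0.952
```

With these made-up numbers, a message containing the word would be about 95% likely to be spam before any other evidence is taken into account.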

I think we can agree that the word "Viagra" (vi@gra, v1@gr@, \/1@Gr/\, etc.) is one of the most common words afflicting junk mail today. If a filter sees this word in a message, it will most likely classify the message as spam. Now let's say you receive a message from a friend named Hugo telling you that he'll need to take Viagra because... (no need to explain further, right?). Hugo ends his message by signing his name. If you've been exchanging mail with Hugo for a while, the probability of the message ending up in the spam folder will be very small, because the filter knows that messages containing the name "Hugo" are usually legitimate. The way a filter classifies spam is not quite that simple, since its decision is based on hundreds of factors, but you get the point, right?
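The Hugo example can be sketched with the combining rule Paul Graham describes in "A Plan for Spam", which treats each word as independent evidence. That independence is a simplifying assumption, the per-word scores below are invented, and real filters such as SpamBayes use other scoring methods, so take this as an illustration only.

```python
import math

def combine(probabilities):
    """Combine per-word spam probabilities, assuming the words are independent.

    Each value is clamped to [0.01, 0.99] so that a single word can never
    force the result to exactly 0 or 1.
    """
    clamped = [min(max(p, 0.01), 0.99) for p in probabilities]
    spam_product = math.prod(clamped)
    ham_product = math.prod(1.0 - p for p in clamped)
    return spam_product / (spam_product + ham_product)

# Made-up scores: "viagra" looks very spammy, "hugo" looks very legitimate
word_scores = {"viagra": 0.95, "hugo": 0.02}

print(round(combine([word_scores["viagra"]]), 3))  # prints: 0.95
print(round(combine(word_scores.values()), 3))     # prints: 0.279
```

On its own, "viagra" pushes the message far into spam territory, but once the strongly legitimate "hugo" is added the combined score drops well below 0.5.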

Users must be careful, though. As I said before, after a couple of months a Bayesian filter can reach an efficiency of 99.5%, but that still leaves roughly a 0.5% chance that a legitimate message will end up in the spam folder. Users should inspect their spam folder regularly to see whether any good mail has been dumped in there. If so, move the message back to the inbox and tell the filter that it made a mistake, so that it is less likely to make it again in the future.
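What "telling the filter it made a mistake" might look like internally is sketched below. The `FilterCounts` class and its method names are hypothetical, and actual products handle corrections in their own ways; the idea shown is simply to take the message's words out of one list and add them to the other.

```python
from collections import Counter

class FilterCounts:
    """Hypothetical store for the per-word message counts of a filter."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def learn(self, tokens, is_spam):
        """Record one message's words in the spam or the legitimate list."""
        target = self.spam if is_spam else self.ham
        target.update(set(tokens))

    def unlearn(self, tokens, is_spam):
        """Undo an earlier learn() call for the same message."""
        target = self.spam if is_spam else self.ham
        target.subtract(set(tokens))
        # Drop words whose count fell to zero (or below)
        for token in [t for t, n in target.items() if n <= 0]:
            del target[token]

    def correct(self, tokens, was_spam):
        """Move a misclassified message from one list to the other."""
        self.unlearn(tokens, was_spam)
        self.learn(tokens, not was_spam)

counts = FilterCounts()
message_tokens = ["viagra", "hugo"]
counts.learn(message_tokens, is_spam=True)     # the filter got it wrong
counts.correct(message_tokens, was_spam=True)  # the user moves it to the inbox

print(counts.spam["hugo"], counts.ham["hugo"])  # prints: 0 1
```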

The only way for spammers to defeat a Bayesian filter is to make their message look exactly like ordinary, boring ones. Due to the nature and goal of spammers, this will not happen anytime soon, so Bayesian filters will always remain very effective.

If you work in a corporate environment that hosts its own mail on an Exchange mail server, and spam is starting to be a major problem for your users, you may want to read a little review I did of an excellent, centrally managed anti-spam solution: GFI Mail Essentials.

If you are a home user running Outlook, you may want to have a look at SpamBayes.

"The SpamBayes project is working on developing a statistical (commonly, although a little inaccurately, referred to as Bayesian) anti-spam filter, initially based on the work of Paul Graham. The major difference between this and other, similar projects is the emphasis on testing newer approaches to scoring messages. While most anti-spam projects are still working with the original graham algorithm, we found that a number of alternate methods yielded a more useful response."

Other recommendations are welcome in the comments.




4 Comments:

  • Bayesian seems to be the best way to go.
    I am very pleased with Spambully, which uses Bayesian as well as a few other techniques

    By Blogger moldwrite, at 10:13 AM  

  • "The only way for spammers to defeat a Bayesian filter is to make their message look exactly like ordinary, boring ones. Due to the nature and goal of spammers, this will not happen anytime soon, so Bayesian filters will always remain very effective."

    Wrong! Spam full of random words can get through, and often does. It does not seem very useful, but spammers still use this tactic.

    By Anonymous Anonymous, at 8:42 PM  

  • Sorry, that is just the simple theory. It is possible to argue endlessly. In practice, in the fight against spam, I very much trust Spambully. Still promise the new version...

    By Blogger moldwrite, at 5:33 PM  

  • Great article, thanks!

    I also use spam bully, excellent bit of kit!

    By Anonymous Anonymous, at 8:06 AM  
