Bayesian filtering for ORF - ORF Forums

1

I know this has been a feature request for several years now, but it has not made it into ORF. I was curious whether there is any ongoing development to include it in the next release.

I continue to get more than what I would consider "several" spam emails through ORF almost every day. I've posted about this before and implemented all the suggested changes without much improvement. The spam is the same type and likely from the same person(s). I'm pretty sure they are in North America, as they follow the traditional US workday and weekend schedule. The spam starts in the morning, continues throughout the day, and tapers off in the evening. I typically don't get any on weekends or US-observed holidays.

Anyway, I think Bayesian-type filtering is probably one of the few things that may work with this type of spam. If there are no plans to implement Bayesian filtering in a future release, can an add-on Bayesian filter be run on the same mail server, or would the mail have to be sent to another mail server?

Here are some past threads with what I've tried and discovered.

http://vamsoft.com/forum/topic/470/secondary-spam-toolprogram-or-additional-help-with-orf-config
http://vamsoft.com/forum/topic/477/blacklisting-specific-string-during-smtp-communications

Another recent post that appears to describe a similar type of spam:

http://vamsoft.com/forum/topic/499/persistent-spam

Thanks
Josh

by Josh more than 10 years ago
2

@Josh: Bayesian filtering is unlikely to get implemented in the near future in ORF for various reasons, but setting up an add-on Bayesian filter on the same server could work (or using a command line tool via the External Agent feature).

In most cases, the degraded spam filtering performance of ORF is caused by DNS-related problems: ORF relies heavily on online blacklist databases queried via DNS, and if these lookups fail, spam will be missed. So the first question is, do you see any errors/warnings in your ORF log which indicate DNS lookups are unsuccessful (timeouts, SERVFAIL RCODE2 errors, etc)? Are all recommended DNS blacklists and SURBLs enabled and functional?

Another possibility is that the spam you receive lacks the characteristics ORF recognizes, e.g., scam or phishing emails sent manually from legitimate free email providers (Gmail, Microsoft Live Mail) without any URL payload. These are technically not spam (unsolicited bulk commercial email), so regular spam filtering methods will not work. If you receive such unwanted emails, I recommend giving ClamAV a try with the third-party anti-scam and anti-phishing signatures provided by SaneSecurity:

http://vamsoft.com/support/docs/articles/using-clamav-with-ORF-part-1
http://vamsoft.com/support/docs/articles/using-clamav-with-ORF-part-2
http://sanesecurity.com/usage/signatures/

by Krisztián Fekete (Vamsoft) more than 10 years ago

3

Krisztian,

Thanks for the reply. My DNS is working fine, and the recommended blacklists and SURBLs are enabled. Unless the recommended list has changed since July, they are all set up properly. The only occasional DNS errors I get are related to checking SPF records, and they only happen for certain domains. In my July post I sent all my logs and information to Vamsoft for review, and nothing has changed in my configuration since then.

The spam is not coming from legitimate free email providers. I tried ClamAV with some of the additional signatures from Sane Security and it did not catch any of the spam emails and the performance from the ClamAV service was not that great. The service would occasionally crash and I ended up just disabling it since it wasn't catching any of the regular spam that was getting through.

If you can recommend an add-on or additional Bayesian filtering tool that I can use, I would greatly appreciate it.

Thanks
Josh

by Josh more than 10 years ago
4

@Josh: Sylfilter (http://sylpheed.sraoss.jp/sylfilter/) looks promising, though I haven't tried it myself. From the documentation, it seems it first requires you to feed it manually with ham and spam EML files (it will build two database files accordingly), then you can classify any incoming emails.

For mass-exporting spam and ham samples from Outlook, you can use this tool:

http://www.outlookfreeware.com/en/products/all/OutlookMessagesExportEML/

Return values of sylfilter:

0 junk (spam)
1 clean (non-spam)
2 uncertain
127 other errors

You can create an External Agent definition accordingly.
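
To make the mapping concrete, here is a minimal Python sketch of how a wrapper could interpret those return values. It is only an illustration, not ORF's External Agent mechanism: the `Verdict` names and the `classify` helper are hypothetical, the `-t` (classify) option is taken from the tool's help output quoted later in this thread, and it assumes `sylfilter-cui` is on the PATH.

```python
import subprocess
from enum import Enum


class Verdict(Enum):
    JUNK = "junk"
    CLEAN = "clean"
    UNCERTAIN = "uncertain"
    ERROR = "error"


# Exit codes as listed in the "Return values" above.
EXIT_CODE_VERDICTS = {0: Verdict.JUNK, 1: Verdict.CLEAN, 2: Verdict.UNCERTAIN}


def verdict_from_exit_code(code):
    """Map a sylfilter exit code to a verdict; anything unlisted counts as an error."""
    return EXIT_CODE_VERDICTS.get(code, Verdict.ERROR)


def classify(eml_path, exe="sylfilter-cui"):
    """Run sylfilter on one message file and return the verdict.

    Assumption: the executable is on PATH and -t selects classification.
    """
    result = subprocess.run([exe, "-t", eml_path])
    return verdict_from_exit_code(result.returncode)
```

Note that 0 means junk here, the reverse of the usual "zero is success" convention, so the sketch treats anything not listed (including 127) as an error rather than as clean or junk.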

by Krisztián Fekete (Vamsoft) more than 10 years ago

5

@Krisztián Fekete (Vamsoft): Krisztian,

Thank you for the information, I appreciate it! I've taken a quick look at Sylfilter's readme and will need to look over the documentation more closely to figure out exactly how to install/configure everything. I'll reply back to this thread with any questions I have (I'm sure I'll have some).

Thanks
Josh

by Josh more than 10 years ago

6

Well, I finally got around to giving Sylfilter a try. The readme file was a little misleading about the install, or I didn't really understand it. Below is the part I wasn't sure about:

"This program requires GLib and a key-value store engine. Install them before building.
Currently SQLite (enabled by default), QDBM and GDBM are supported for key-value store engine."

I just unzipped the file to c:\sylfilter and that was all I did for the "install".

I used the message export utility linked above and that worked well. One thing I ran into while getting the program to learn the spam messages was that it would error out and stop if a file name started with a number or contained characters other than Roman letters and digits. I manually renamed those files, and everything then proceeded as normal. I fed it a little over 900 junk messages. I received a few warnings like the ones below, but it continued processing the messages:


(sylfilter-cui.exe:3768): SylFilter-WARNING **: html_get_tag(): syntax error in
tag: 'font face=erdana" size="'


(sylfilter-cui.exe:3768): SylFilter-WARNING **: html_get_tag(): syntax error in
tag: 'font face=erdana" size="'


(sylfilter-cui.exe:3768): SylFilter-WARNING **: html_get_tag(): syntax error in
tag: 'td width=50" style=order-style: none;
border-width: medium" height=1" colspan="'


(sylfilter-cui.exe:3768): SylFilter-WARNING **: html_get_tag(): syntax error in
tag: 'td width=55" style=order-left-style:none;
border-left-width:medium;
border-right-style:none;
border-right-width:medium;
border-top-style:none;
border-top-width:medium;
border-bottom-style:solid;
border-bottom-width:1px"
height=2" colspan="'
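
For anyone hitting the same file-name problem during training, the manual renaming could be scripted. The following is a rough Python sketch under stated assumptions: that names made only of ASCII letters and digits and starting with a letter are accepted is a guess based on the description above, and `safe_stem`, `rename_for_training`, and the `msg` prefix are hypothetical choices, not part of Sylfilter.

```python
import re
from pathlib import Path


def safe_stem(stem):
    """Keep only ASCII letters and digits; make sure the name starts with a letter."""
    cleaned = re.sub(r"[^A-Za-z0-9]", "", stem)
    if not cleaned or cleaned[0].isdigit():
        cleaned = "msg" + cleaned
    return cleaned


def rename_for_training(folder):
    """Rename every .eml file in folder to a safe name; return (old, new) name pairs."""
    renamed = []
    for path in sorted(Path(folder).glob("*.eml")):
        base = safe_stem(path.stem)
        target = path.with_name(base + path.suffix)
        counter = 1
        # Avoid overwriting another file if two names clean up to the same stem.
        while target != path and target.exists():
            target = path.with_name(f"{base}{counter}{path.suffix}")
            counter += 1
        if target != path:
            path.rename(target)
            renamed.append((path.name, target.name))
    return renamed
```

Running it on a copy of the exported folder first is the safer option, since the renames are not reversible from the script itself.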


Just for completeness, here is the full usage/help output:

C:\Sylfilter>sylfilter-cui --help
SylFilter (tentative name) version 0.8

Usage: sylfilter [options] message [message ...]

Options:
-j learn junk (spam) messages
-c learn clean (non-spam) messages
-J unlearn junk (spam) messages
-C unlearn clean (non-spam) messages
-t classify messages
-v show verbose messages
-d show debug messages
-m n|r
specify filtering method
n : Paul Graham (Naive Bayes) method
r : Gary Robinson (Robinson-Fisher) method (default)
--min-dev
ignore if score near (default: 0.1)
--robs
Robinson's s parameter (default: 1.0)
--robx
Robinson's x parameter (default: 0.5)
-B do not bias probability for clean mail
(Paul/Naive method only, may increase false-positive)

-V print version
-h, --help
print this help message

-E <engine_name>
specify key-value store engine (show below)
-p <path>
specify database directory

Return values:
0 junk (spam)
1 clean (non-spam)
2 uncertain
127 other errors

Default database location: C:\Documents and Settings\Administrator\Application Data\SylFilter/*.db

Available key-value stores:
QDBM

C:\Sylfilter>

by Josh 9 years ago
7

@Krisztián Fekete (Vamsoft): Krisztian,

Would you mind looking at my external agent configuration to see if I've set things up correctly? I've got everything configured to just tag as "spam" for the time being, until I'm comfortable that it's working properly and not tagging legitimate emails as spam.

Here is a link to screen captures of my external agent configuration.

http://www.main.experiencetherave.com/images/orf-sylfilter

Thanks!
Josh

by Josh 9 years ago

8

@Josh: I am very interested in your results as I suffer from the same. Keep us posted and thanks for experimenting with this.

by mike.galbicka 9 years ago

9

@mike.galbicka: Mike,

Will do, I'll keep everyone posted. I had three emails get classified as "uncertain"; they were two bill/statement ready emails and one new forum post email. You can also feed "clean" emails for Sylfilter to learn, so I've since fed about 2,400 "clean" emails to it.

Here are the current Sylfilter database stats:

Junk words: 786,038
Not junk words: 2,093,804
Junk learned numbers: 926
Not junk learned numbers: 2,472
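
Feeding a whole folder of exported messages can be scripted as well. The Python sketch below only builds the command lines and does not run anything. The `-j` and `-c` options and the fact that several message files can be passed per call come from the help output earlier in this thread; `learn_commands`, the `sylfilter-cui` default, and the batch size of 50 (chosen just to keep each command line short) are assumptions.

```python
from pathlib import Path

# Learning flags from the sylfilter help output: -j learns junk, -c learns clean.
LEARN_FLAGS = {"junk": "-j", "clean": "-c"}


def learn_commands(folder, kind, exe="sylfilter-cui", batch_size=50):
    """Return one command (as an argument list) per batch of .eml files in folder."""
    flag = LEARN_FLAGS[kind]
    files = sorted(str(p) for p in Path(folder).glob("*.eml"))
    return [
        [exe, flag, *files[i:i + batch_size]]
        for i in range(0, len(files), batch_size)
    ]
```

Each returned list could then be passed to `subprocess.run`, checking the exit status after every batch so that a failure on one file does not go unnoticed.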

I've also modified the settings a bit to not log the file name, take no action on errors (code 127), and changed the subject tagging for "uncertain" emails. Depending on how many legitimate vs. spam emails I receive in this category I may just change ORF's behavior for this exit code to "no action".

The spammers start kicking things off Monday morning, so we'll see how it handles the onslaught this week.

Josh

by Josh 9 years ago

10

@Josh: Hi Josh,

Looks fine, though I would not configure any actions on errors.

by Krisztián Fekete (Vamsoft) 9 years ago

11

@Krisztián Fekete (Vamsoft): Ok, thanks for the feedback Krisztián.


The initial results today weren't great. Two spam emails got through the other checks and Sylfilter tagged them as "uncertain", but it also tagged seven legitimate emails as "uncertain". I'm going to keep feeding it the legitimate emails that get tagged as uncertain, along with the spam ones, to see if it continues to learn.

I also need to take a look at the links provided to understand exactly how it's checking/determining whether the emails are spam or not. There are also two different methods that come with it, so I may play around with the other method as well.

Josh

by Josh 9 years ago

12

Well after a bit of research and trial & error I have a more positive update.

Initially, almost everything was coming through as "uncertain", including "clean" emails, and this even happened with some weekly marketing emails that Sylfilter had already learned. I tried changing the filtering method, but was unsuccessful.

After some further digging I noticed that Sylfilter creates a separate set of databases (clean and junk) for whatever account is running the program. I had been loading all the clean and junk emails with the command line program running under the administrator account. ORF, however, runs under the system account, so it was essentially running without any database information. Once I realised this, I added the parameter in ORF to specify the database file location:

"-p "C:\Documents and Settings\Administrator\Application Data\SylFilter" {EMAILFILESPEC}"
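
A quick way to catch this kind of mismatch is to check that the directory passed with `-p` really contains database files before trusting the results. The Python sketch below is only an illustration: `has_databases` and `agent_parameters` are hypothetical helpers, the `*.db` pattern is taken from the "Default database location" line in the help output, and `{EMAILFILESPEC}` is the ORF placeholder used in the parameter above.

```python
from pathlib import Path


def has_databases(db_dir):
    """True if db_dir exists and contains at least one *.db file."""
    folder = Path(db_dir)
    return folder.is_dir() and any(folder.glob("*.db"))


def agent_parameters(db_dir):
    """Build the External Agent parameter string with an explicit database directory."""
    return f'-p "{db_dir}" {{EMAILFILESPEC}}'
```

If `has_databases` returns `False` for the chosen directory, the filter would be classifying against empty data, which matches the "everything is uncertain" symptom described above.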

I just made this change Sunday afternoon. Monday was apparently a slow spam day, but today it correctly caught and flagged two spam emails of the kind that typically get past ORF. It did, however, tag one legitimate promo email as spam.

So I will have to see how things continue for the next week or two, but this has certainly given better results than what I was getting before.

I'll keep an eye on things and provide another update in two weeks or so.

Josh

by Josh 9 years ago
13

@Josh: thanks for the update and effort

by mike.galbicka 9 years ago

14

Well I have an update for everyone; I've been running Sylfilter properly configured for just a little over 41 days. The results have been very promising and Sylfilter has caught pretty much all the spam emails that were routinely getting through the other ORF tests. Only two emails were completely missed (not tagged uncertain or junk) by Sylfilter. However, one of the emails was from a legitimate account that appeared to be hacked. And only two emails were incorrectly tagged spam.

As emails were tagged uncertain or incorrectly tagged, I continued to update Sylfilter's databases. I am happy with the results, but will likely continue to just tag the emails for a while longer so I can keep improving Sylfilter's databases and make sure legitimate emails don't get accidentally tagged as spam.

Below are the statistics for things so far. I haven't calculated out the percentages, but the data is below if you want to do that.

Days Running                  41.23
Total Emails Checked          13824
Tagged by Sylfilter           141
Tagged Uncertain              44
Tagged Uncertain but Junk     9
Tagged Uncertain but Legit    35
Tagged Junk                   97
Tagged Junk but Legit         2
Tagged Junk is Junk           95
Missed by Sylfilter           2
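
Since the percentages were left uncalculated, here is a short Python sketch that derives a few of them from the figures above. The grouping is an assumption: it counts as junk everything that got past ORF's other tests and turned out to be junk (95 tagged junk, 9 tagged uncertain, 2 missed), including the message from the apparently hacked account.

```python
# Figures copied from the statistics in this post.
total_checked = 13824
uncertain_junk = 9    # tagged uncertain, actually junk
uncertain_legit = 35  # tagged uncertain, actually legitimate
junk_legit = 2        # tagged junk, actually legitimate
junk_junk = 95        # tagged junk, actually junk
missed = 2            # junk that was not tagged at all


def pct(part, whole):
    """Percentage of part in whole, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


tagged_junk = junk_junk + junk_legit
tagged_uncertain = uncertain_junk + uncertain_legit
all_junk = junk_junk + uncertain_junk + missed  # junk that reached Sylfilter

results = {
    "share of checked mail tagged at all": pct(tagged_junk + tagged_uncertain, total_checked),
    "junk tags that were correct": pct(junk_junk, tagged_junk),
    "junk tagged as junk": pct(junk_junk, all_junk),
    "junk tagged junk or uncertain": pct(junk_junk + uncertain_junk, all_junk),
    "uncertain tags that were legitimate": pct(uncertain_legit, tagged_uncertain),
}

for label, value in results.items():
    print(f"{label}: {value}%")
```

With these figures, about 98% of the junk tags were correct, and roughly 90% of the junk that reached Sylfilter was tagged as junk outright. A false-positive rate against all legitimate mail cannot be derived, because the total above does not say how many of the 13824 checked emails were legitimate.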

Hope this information is helpful for others.

Josh

by Josh 9 years ago
15

@Josh: thanks for the update :)

by Krisztián Fekete (Vamsoft) 9 years ago
