Keyword blacklist regular expressions - ORF Forums

1

Good day, everyone

I have written a regular expression to filter every email which does not have a specific domain in its To: or CC: email source headers:

^((?!\n(to|cc):[^\n]+domain\.tld)[\S\s])+$

With the following dummy text from the ORF help manual everything works fine (this can be tested with any online tool, such as regex101.com):

Return-Path: <>
Received: from secondarymx.domain.tld ([5.5.5.5]) by primarymx.domain.tld with Microsoft SMTPSVC(5.0.6462.1250);
Mon, 24 May 2004 09:45:47 -0100
Received: from mailrelay.isp.tld ([3.3.3.3]) by secondarymx.domain.tld with Microsoft SMTPSVC(5.0.6462.1250);
Mon, 24 May 2004 09:45:45 -0100
Received: from adsl-1-2-3-4.dsl.isp.tld (1.1.1.1)
by mailrelay.isp.tld with SMTP; 24 May 2004 11:54:17 +0200
Message-ID: <223112573957.43227@>
Reply-To: "Kerri Francis" <>
From: "Kerri Francis" <>
To: "Spammed" <>
CC: "Spammed" <>
Reply-To: "Spammed" <>
Subject: Home delivery on all meds
Date: Mon, 24 May 2004 13:46:45 +0300
MIME-Version: 1.0 (produced by fleeingencapsulatevernier 61.25)
Content-Type: multipart/alternative;
boundary="--40091327580672012"

But when it comes to ORF, everything changes and it stops working properly. Here are a few things I have noticed:

- 1) ORF does not like the first \n character in this regex and matches anything, no matter what I do. If I remove the \n character and leave a plain (?!(to|cc) part, it starts working (but then I lose the condition that the header starts on a new line).
- 2) If the rule testing text is longer than 1197 characters, the regex never matches anything (but that is not a big deal).
- 3) If I switch the testing scope from email body to email header on the first tab, it starts matching anything during tests on the next tab.

At this point I do not understand at all how ORF regex testing works: what it tests, or how it alters regular expressions before applying them to the email. I cannot even rely on regex testing services or ORF's own test window, since it behaves randomly.

ORF fusion 5.0

by Konstantin 4 years ago
2

@Konstantin: Hello Konstantin,

In the case of the "Body" filter, the line ends (CRLF) are replaced with whitespace (ASCII code 32) characters, so the whole body becomes a single line.

The "Header" filter works with multi-line text, instead of text converted into a single line like the body text filter does. Take this into consideration when filtering for multi-line header fields. For example, using .* in a regular expression to match "any characters, any number of repetitions" will match line breaks as well. To limit the scope of the expression to a single line, you should use [^\r\n]* instead, which matches "any character except for line breaks, any number of repetitions".

Note that each regex pattern you create will start with an implicit ^ (caret) in front of the expression. ORF uses case-insensitive regular expression matching, except where case sensitivity can be configured.
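The difference between the two scopes can be reproduced outside ORF with a small sketch. It uses Python's `re` module, which is not the engine ORF uses, so the flags are assumptions chosen to imitate the behaviour described above: `re.DOTALL` so that `.*` crosses line breaks, `re.IGNORECASE` for case-insensitive matching, and `re.match()` to stand in for the implicit leading `^`. The sample header text is made up for illustration.

```python
import re

# Hypothetical two-line header block with CRLF line ends.
raw = "Subject: Hello\r\nTo: someone@domain.tld\r\n"

# "Body" style: line ends replaced with a space, so everything is one line.
flattened = raw.replace("\r\n", " ")

# Assumed flags, see the note above.
FLAGS = re.IGNORECASE | re.DOTALL

# ".*" runs across line breaks; "[^\r\n]*" stops at the end of the line.
across_lines = re.match(r"subject:.*", raw, FLAGS).group(0)
single_line = re.match(r"subject:[^\r\n]*", raw, FLAGS).group(0)

# The expression from the first post relies on "\n". The flattened text has
# no "\n" left, so the negative lookahead can never fail and the rule matches
# even though the To: line contains domain.tld.
first_post = re.compile(
    r"^((?!\n(to|cc):[^\n]+domain\.tld)[\S\s])+$", re.IGNORECASE
)
matches_flattened = first_post.match(flattened) is not None
matches_raw = first_post.match(raw) is not None

print(repr(across_lines))
print(repr(single_line))
print(matches_flattened, matches_raw)
```

This is one possible reading of item 1 in the first post, where the expression containing `\n` matched everything; the thread itself does not confirm it.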

by Daniel Novak (Vamsoft) 4 years ago

3

@Daniel Novak (Vamsoft): Hello, Daniel

Thank you very much for the information. With it I was able to create a rule that works great in the testing window
(email header/raw MIME scope, regular expression, case insensitive):

((?!.*^(cc|to):[^\r\n]+domain\.tld).)*$

However, in real operation it started producing false positives on emails that definitely have "To: recipient(at)domain.tld" in their source headers. I also pasted the source headers of one such email into the rule testing window, and it detected nothing. I must say I had to cut a few lines before testing because, as I said, the testing window does not match anything if its text is longer than 1173 characters (I guess newline characters are counted too, and the testing window is limited to 1200 bytes or so), but the To: header with the proper domain was there, and the test worked correctly and showed no matches.

Is there a way to get even more verbose logging? I just don't get it. I am not sure which sources are being tested by ORF.

Could it be that ORF tests the header source not as a whole chunk of data but in blocks of text, like the 1200 bytes in the testing window? This would explain the false positives, because if the first part of the headers does not include a To: or Cc: header, the email will be detected as spam.

by Konstantin 4 years ago

4

@Konstantin: Hello again, Daniel

I have modified the regular expression so that it checks only text that includes To or CC headers:

((?=.*^(cc|to))(?!.*^(cc|to):[^\r\n]+domain\.tld).)*$

And it started working properly, so it looks like (I guess) the header data really is being split up for testing. But this solution is dangerous, because if the To and CC headers end up in different chunks, it could lead to false positives again.

Could you please advise how to examine all the data as a single block, if my splitting theory is right, or point out the real reason for this behaviour?

by Konstantin 4 years ago

5

@Konstantin: Hello Konstantin,

Is there a particular reason you are trying to use the Keyword Blacklist to implement your filter? Have you considered using the Recipient Blacklist test to block every single incoming email except the ones sent to the specified address ("domain.tld")? You might have more success - and less headache - with that. Just select the "Blacklist all addresses, except the list below" option on the 'Blacklists > Recipient Blacklist' page and add the domain(s) that are allowed to receive emails.

That said, I am going to ask our developers to comment on your header "chunking" theory. I will get back to you as soon as I know more.

In the meantime, if you still have the falsely blacklisted emails, you should send them to us for analysis. We will find out what the problem is.

by Daniel Novak (Vamsoft) 4 years ago

6

Hello Daniel

We have a simple rule for detecting spam: if an email includes no one from our domain in either the To: field or the CC: field (only in BCC), it is spam. I cannot block recipients from other domains in these fields; I only need to make sure that our domain is present in one of them. So I am not sure whether ORF offers any way to check the To or CC headers other than a regexp.

This rule has not triggered false positives yet, but ORF still let one spam message through. However, when I pasted that message's header source into the test window, the rule detected it as spam. That is why I would like to know these ORF specifics :(

by Konstantin 4 years ago
7

Hello, Daniel

Unfortunately there is no information from the developers yet. If they have nothing to share about how ORF regular expressions work, I have only one question (or rather a bug report):
why does the following rule
(ORF fusion 5.0, keyword blacklist, email headers scope, regular expression, case-insensitive):

((?=.*^(cc|to))(?!.*^(cc|to):[^\r\n]+domain\.tld).)*$

detect nothing in the following text:

Received: from mail.galaxypower.eu (213.91.151.77) by spb-spb-spb2.namco.spb
Received: from mail.galaxypower.eu ([127.0.0.1]) by localhost
(mail.galaxypower.eu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id
rSgturpIlTjH; Mon, 18 Nov 2019 09:47:53 +0200 (EET)
Received: from localhost (localhost [127.0.0.1]) by mail.galaxypower.eu
(Postfix) with ESMTP id 7B3536BE01BE; Mon, 18 Nov 2019 09:47:53 +0200 (EET)
DKIM-Filter: OpenDKIM Filter v2.10.3 mail.galaxypower.eu 7B3536BE01BE
Received: from mail.galaxypower.eu ([127.0.0.1]) by localhost
(mail.galaxypower.eu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id
r61VqpUAWHlo; Mon, 18 Nov 2019 09:47:53 +0200 (EET)
Received: from [127.0.0.1] (213-91-151-74.ip.btc-net.bg [213.91.151.74]) by
mail.galaxypower.eu (Postfix) with ESMTPSA id 4AAB66BE0179; Mon, 18 Nov 2019
09:47:36 +0200 (EET)
From: Denitsa Prodanova <>
Subject: BARTIN / BURGAS,3,000 - 4,000 MT CEMENT IN BULK
Message-ID: <>
Date: Mon, 18 Nov 2019 09:47:28 +0200
Content-Language: enGB
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:60.0) Gecko/20100101
Thunderbird/60.9.1
To: undisclosed-recipients:;
Content-Type: multipart/alternative;
boundary="------------FE279E05FCB143212C423440"
Return-Path:
X-MS-Exchange-Organization-AuthSource: spb-spb-spb2.namco.spb
X-MS-Exchange-Organization-AuthAs: Anonymous
X-Auto-Response-Suppress: DR, OOF, AutoReply
MIME-Version: 1.0

But if I delete any random character before the To: header, it starts to work properly.

by Konstantin 4 years ago
8

@Konstantin: FYI, the original header source was modified by the forum engine. The following example illustrates ORF's regexp behaviour better:

tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt
tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt
tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt
tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt
tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt
tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt
tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt
tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt
tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt
tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt
tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt
tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt
tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt
tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt
tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt
tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt
tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt
tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt
tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt
tttttttttttttttt
To: ttt

The rule cannot match anything until you delete any character before the "To: ttt" text.

by Konstantin 4 years ago

9

@Konstantin: Hello Konstantin,

I have received the answer to your questions: ORF does not break up the message header into smaller blocks before performing the Keyword Blacklist and Keyword Whitelist tests; it checks the filter expressions against the whole message header stream. The "false negative" you received is caused by something else. If you can send us the original email (saved in .msg or .eml format), we might be able to figure out why the regex failed to work in this case.

by Daniel Novak (Vamsoft) 4 years ago

10

@Daniel Novak (Vamsoft): Thank you for the sample email, Konstantin. The culprit appears to be the regular expression itself, which keeps hitting a hard-coded recursion limit in the regex engine configuration.

During a discussion with our developers I was informed that we currently use a hard limit of 1200 for the number of recursions, in order to avoid exhausting the stack on certain subjects (~input data). As we cannot really control the content of emails, this limit was chosen to remain safe at all times, but it is known to be a very conservative and low limit. However, in this particular case we are hitting the limit, because the regular expression look-around constructs in your regex trigger excessive (and unnecessary) recursion.
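A rough way to see why the character count matters is sketched below in Python. It assumes, without confirmation in this thread, that every repetition of the outer `( ... .)*` group costs the engine one level of recursion; the sketch only counts those repetitions and compares them with the quoted limit of 1200. Python's `re` module is not ORF's engine and has no such limit, the `re.MULTILINE` flag is an assumption made so that the inner `^` matches at line starts, and the helper names are hypothetical.

```python
import re

RECURSION_LIMIT = 1200  # figure quoted above

# Per-character form from post 3: the lookahead is re-checked before every
# single character that the outer group consumes.
PER_CHAR = re.compile(
    r"((?!.*^(cc|to):[^\r\n]+domain\.tld).)*$",
    re.IGNORECASE | re.DOTALL | re.MULTILINE,
)


def group_repetitions(text: str) -> int:
    """Number of times the outer group repeats (one per consumed character)."""
    m = PER_CHAR.match(text)
    return len(m.group(0)) if m else 0


def would_exceed_limit(text: str) -> bool:
    """Toy model: one recursion level per repetition (an assumption)."""
    return group_repetitions(text) > RECURSION_LIMIT


# Synthetic headers in the style of the "ttt" example from post 8.
short_header = "t" * 100 + "\r\nTo: ttt"
long_header = "t" * 1300 + "\r\nTo: ttt"

print(group_repetitions(short_header), would_exceed_limit(short_header))
print(group_repetitions(long_header), would_exceed_limit(long_header))
```

Under this toy model the number of repetitions grows with the length of the header, which would be consistent with the rule going silent once the text passes roughly 1200 characters.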

I propose a new regex solution to resolve this issue:

(?!.*(^|\n)(To|Cc):[^\r\n]+domain\.tld)

I have tested this regex with a fairly large message header (~64Kb) and the recursion limit was not hit.
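For readers who want to try the expression outside ORF, here is a minimal Python sketch. Python's `re` module is not the engine ORF uses, so the setup is an approximation: `re.IGNORECASE` and `re.DOTALL` are assumed to mirror the case-insensitive, multi-line header matching described earlier, and `re.match()` stands in for the implicit leading `^`. The function name, the sample addresses and the `X-Padding` header are hypothetical.

```python
import re

# The proposed expression, copied from above.
PATTERN = r"(?!.*(^|\n)(To|Cc):[^\r\n]+domain\.tld)"

# Assumed flags, see the note above.
RULE = re.compile(PATTERN, re.IGNORECASE | re.DOTALL)


def is_blacklisted(headers: str) -> bool:
    """True when the rule matches, i.e. no To:/Cc: line mentions domain.tld."""
    return RULE.match(headers) is not None


addressed = "From: sender@isp.tld\r\nTo: user@domain.tld\r\nSubject: hi\r\n"
undisclosed = "From: sender@isp.tld\r\nTo: undisclosed-recipients:;\r\nSubject: hi\r\n"
# "Reply-To:" must not be mistaken for "To:" because it does not start a line
# with "To:".
reply_to_only = "Reply-To: user@domain.tld\r\nTo: other@isp.tld\r\n"
# Roughly 64 KB of padding in front of the interesting headers.
padding = "X-Padding: " + "t" * 64000 + "\r\n"

for name, text in [
    ("addressed", addressed),
    ("undisclosed", undisclosed),
    ("reply_to_only", reply_to_only),
    ("large undisclosed", padding + undisclosed),
    ("large addressed", padding + addressed),
]:
    print(name, is_blacklisted(text))
```

Because the lookahead is evaluated once at the start of the header instead of once per character, the amount of nesting no longer depends on how long the header is.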

Can you please try this and get back to me if it works?

by Daniel Novak (Vamsoft) 4 years ago

11

Good day, Daniel

As far as I have tested, your regular expression works without any false positives or negatives. So my problem was a badly designed regexp. Now I guess everything will work as intended. Thank you and your whole team very much.

by Konstantin 4 years ago
12

@Konstantin: My pleasure Konstantin. I am glad I was able to help :)

If anything else comes up, just let us know.

by Daniel Novak (Vamsoft) 4 years ago

13

I had a similar situation

by richiamelou 4 years ago
