viagrow check attachment unicode regex - ORF Forums


1

Hello,

We're getting a lot of spam that passes all tests and ends up as "otherwise accepted". The problem is that every message uses a different string of Unicode letters in the subject, each roughly spelling out the same thing.

I have seen some really nice looking regexes built for the old 'penny stock' scams, for example this:

(?!cialis)([cçg]|(\[|\{|\())[i1l\|\\\/!¡îíìï:;](([a\^@àáâãäå])|(\/\W{0,2}\\))[i1l\|!¡îíìï:;][i1l\|\\\/!¡îíìï:;][sz5\$]
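To see how a look-alike expression of this kind behaves, it can be tried against a few sample strings. The sketch below uses Python's re module as a stand-in for the PCRE engine in ORF, so small differences are possible; the sample strings and the names LOOKALIKE and is_lookalike are made up for illustration.

```python
import re

# The "cialis" look-alike expression quoted above, split over three
# raw strings for readability but otherwise unchanged.
LOOKALIKE = re.compile(
    r"(?!cialis)([cçg]|(\[|\{|\())[i1l\|\\\/!¡îíìï:;]"
    r"(([a\^@àáâãäå])|(\/\W{0,2}\\))"
    r"[i1l\|!¡îíìï:;][i1l\|\\\/!¡îíìï:;][sz5\$]"
)

def is_lookalike(subject):
    """True if the subject contains an obfuscated spelling."""
    return LOOKALIKE.search(subject) is not None

# Hypothetical sample strings, not taken from real spam.
for sample in ["c1al1s", "c!@li5", "cialis", "hello"]:
    print(repr(sample), is_lookalike(sample))
```

The negative lookahead at the start means the plainly spelled word is skipped by this expression, presumably because a simpler keyword rule already covers it.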

Could some regex pro out there build one which would block all of the subjects in the picture below?

[img]http://img28.imageshack.us/img28/7769/checkattachment.png[/img]

I would have copied and pasted the subjects as text, but 90% of the characters can't be pasted.

by Bryon more than 10 years ago
2

(the example i pasted is specifically for cialis of course but you get what i mean)

by Bryon more than 10 years ago
3

more info:

most are from yahoo.com addresses, many have two digits just before the @ sign, all are from different ip addresses - i can't find anything that stays the same.

by Bryon more than 10 years ago
4

@Bryon: I have some regexes constructed for "viagra" and "pharmacy", you can download the importable XML file from http://dl.dropbox.com/u/6193776/orf_keywords.xml

To import it, start the Administration Tool and select Configuration | Import | Keyword blacklist... from the main menu. Alternatively, navigate to Configuration / Filtering - On Arrival / Keyword blacklist, right-click the expressions box, select "Import list..." and import orf_keywords.xml. Click "No" when prompted to overwrite the current list (otherwise your current expressions will be wiped out).

What you need to change is the character variations: for example to add the "p with acute" character to the character variations of the pharmacy regex, modify it like this:

.*\b[ṕqgp9][\s\._*]?[h4]{1,3}[\s\._*f]?[aàáâåãäæ\@]{1,3}[\s\._*]?r{1,3}[\s\._*]?m{1,3}[\s\._*]?[aàáâåãäæ\@]{1,3}[\s\._*f]?[cç6]{1,3}[\s\._*f]?[yÿ]{1,3}[\s\._*f]?\b.*

You can add other characters this way as well. If you cannot copy/paste them, look up their Unicode code point and use that instead. Example for the above-mentioned character (http://www.fileformat.info/info/unicode/char/1e55/index.htm):

.*\b[\x{1e55}qgp9][\s\._*]?[h4]{1,3}[\s\._*f]?[aàáâåãäæ\@]{1,3}[\s\._*]?r{1,3}[\s\._*]?m{1,3}[\s\._*]?[aàáâåãäæ\@]{1,3}[\s\._*f]?[cç6]{1,3}[\s\._*f]?[yÿ]{1,3}[\s\._*f]?\b.*
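If looking characters up by hand gets tedious, the code point and the matching \x{...} escape can be produced with a few lines of script. This is a minimal Python sketch, not an ORF feature; the function names pcre_escape and describe are made up for illustration.

```python
import unicodedata

def pcre_escape(ch):
    """Return the \\x{....} escape for a single character."""
    return "\\x{%04x}" % ord(ch)

def describe(ch):
    """Return code point, official Unicode name and escape for ch."""
    return "U+%04X %s -> %s" % (ord(ch), unicodedata.name(ch), pcre_escape(ch))

# The "p with acute" character from the example above.
print(describe("\u1e55"))
```

Pasting a suspicious character into describe() (or reading it from a saved message) gives the escape to add to the character variations.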

by Krisztian Fekete (Vamsoft) more than 10 years ago

5

Thanks for that - i've been using those keywords since we installed ORF years ago - i think it's not hitting because it doesn't actually say viagra, but a sort of "viagrow"

but - is there an easy way to say "if there are weird characters in the subject, blacklist it"?

more specifically, if the subject contains any character in this range of unicode codes, blacklist it:


U+00A1 - U+00FF
U+0100 - U+017F
U+0180 - U+024F
U+0250 - U+02AE
U+02B0 - U+02FE


OR can we say "only allow the range U+0020 through U+0073"?


(http://en.wikipedia.org/wiki/List_of_Unicode_characters)
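The allow-list idea can be sketched with a negated character class, which matches any character outside the permitted range. The example below is only an illustration in Python's re module, not ORF configuration. It assumes the intended upper bound is U+007E (the last printable ASCII character) rather than U+0073, and the names NOT_PRINTABLE_ASCII and has_disallowed_chars are made up. In PCRE syntax the same class would presumably be written [^\x{0020}-\x{007e}].

```python
import re

# Matches any single character outside U+0020..U+007E (assumed bound).
NOT_PRINTABLE_ASCII = re.compile(r"[^\u0020-\u007e]")

def has_disallowed_chars(subject):
    """True if the subject contains anything but printable ASCII."""
    return NOT_PRINTABLE_ASCII.search(subject) is not None

print(has_disallowed_chars("Check attachment"))
print(has_disallowed_chars("Ch\u0115ck attachment"))
```

An allow-list this strict would also flag legitimate subjects that contain accented letters or non-Latin scripts.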

by Bryon more than 10 years ago
6

@Bryon: "more specifically, if the subject contains any character in this range of unicode codes, blacklist it"

1. Start the Administration Tool
2. Navigate to Configuration / Filtering - On Arrival / Keyword Blacklist
3. Click New
4. On the Filter Properties tab, set the Search scope to Email subject
5. Add a Comment text (e.g., "Weirdo characters in the subject")
6. On the Filter expression tab, add the following expression:

.*[\x{0100}-\x{017f}\x{0180}-\x{024f}\x{0250}-\x{02af}\x{02b0}-\x{02ff}\x{00a1}-\x{00ff}].*

7. Set the expression type to "Regular expression"
8. Click OK
9. Save your settings to apply the changes by pressing Ctrl + S.
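What the expression in step 6 is meant to express can also be written out as plain code point comparisons, which is handy for checking a subject by hand. The following Python sketch is not part of ORF; SUSPECT_RANGES and suspect_chars are hypothetical names, and the comparison is exact (no case-insensitive matching is involved).

```python
# The five ranges used in the expression from step 6.
SUSPECT_RANGES = [
    (0x00A1, 0x00FF),
    (0x0100, 0x017F),
    (0x0180, 0x024F),
    (0x0250, 0x02AF),
    (0x02B0, 0x02FF),
]

def suspect_chars(subject):
    """Return the characters of subject that fall in a suspect range."""
    return [
        ch for ch in subject
        if any(lo <= ord(ch) <= hi for lo, hi in SUSPECT_RANGES)
    ]

print(suspect_chars("Is this spam?"))
print(suspect_chars("Ch\u0115ck \u00e4ttachment"))
```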

by Krisztian Fekete (Vamsoft) more than 10 years ago

7

That's really close, but i see how to do it now.

That expression matches any subject containing a lowercase "i", an uppercase "I", or an uppercase "S".

All of the other "normal" characters pass through.

by Bryon more than 10 years ago
8

@Bryon: Hm, indeed. Not sure why lowercase i (x0069), uppercase S (x0053), and uppercase I (x0049) are considered a match by the built-in PCRE engine; those are in the Basic Latin range (0000-007F)...

by Krisztian Fekete (Vamsoft) more than 10 years ago

9

@Krisztian Fekete (Vamsoft): OK, I narrowed it down: it seems the PCRE engine treats the capital S as "LATIN SMALL LETTER LONG S" (x017f), not sure why. That is this character, which is definitely not a capital S:

ſ

by Krisztian Fekete (Vamsoft) more than 10 years ago

10

@Krisztian Fekete (Vamsoft): Update: our lead developer shed some light on this - basically the Unicode standard considers these characters equivalent during normalization:

http://en.wikipedia.org/wiki/Unicode_equivalence

http://www.fileformat.info/info/unicode/char/17f/index.htm

The only workaround I can think of is to exclude the problematic Unicode characters from these ranges. This should work (i.e., it will not match i, s or S):

.*[\x{0100}-\x{012f}\x{0131}-\x{017e}\x{0180}-\x{024f}\x{0250}-\x{02af}\x{02b0}-\x{02ff}].*
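To check how Unicode treats the characters involved, their names, case mappings and normalized forms can be printed. The sketch below uses Python's standard unicodedata module purely as an illustration; it looks at U+0130 and U+017F, which the expression above leaves out, plus the neighbouring U+0131 for comparison. The helper name describe is made up.

```python
import unicodedata

def describe(ch):
    """Summarise name, case mappings and NFKC form of one character."""
    return {
        "name": unicodedata.name(ch),
        "upper": ch.upper(),
        "lower": ch.lower(),
        "nfkc": unicodedata.normalize("NFKC", ch),
    }

for cp in (0x0130, 0x0131, 0x017F):
    print("U+%04X" % cp, describe(chr(cp)))
```

In this output U+017F uppercases to S and U+0131 uppercases to I, while U+0130 lowercases to i followed by a combining dot. A case-insensitive match is therefore one possible route by which plain ASCII letters end up matching these code points.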

by Krisztian Fekete (Vamsoft) more than 10 years ago

11

How interesting that we found out about those three characters

The more recent regex almost works - it still matches on capital I

weird!

by Bryon more than 10 years ago
12

@Bryon: Try

.*[\x{0100}-\x{012f}\x{0132}-\x{017e}\x{0180}-\x{024f}\x{0250}-\x{02af}\x{02b0}-\x{02ff}].*
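The effect of leaving out these code points can be reproduced outside ORF. The sketch below uses Python's re module with re.IGNORECASE as a stand-in for a case-insensitive PCRE match; that ORF matches case-insensitively is an assumption here, and Python may differ from PCRE in detail. WIDE, NARROW and ascii_hits are made-up names.

```python
import re

# Full ranges U+0100..U+02FF, as in the first attempt (without U+00A1..U+00FF).
WIDE = re.compile(
    r"[\u0100-\u017f\u0180-\u024f\u0250-\u02af\u02b0-\u02ff]",
    re.IGNORECASE,
)

# Same ranges without U+0130, U+0131 and U+017F, as in the expression above.
NARROW = re.compile(
    r"[\u0100-\u012f\u0132-\u017e\u0180-\u024f\u0250-\u02af\u02b0-\u02ff]",
    re.IGNORECASE,
)

def ascii_hits(pattern):
    """Printable ASCII characters that the pattern matches on their own."""
    return [chr(cp) for cp in range(0x20, 0x7F) if pattern.fullmatch(chr(cp))]

print("wide:  ", ascii_hits(WIDE))
print("narrow:", ascii_hits(NARROW))
print("still catches accented text:", bool(NARROW.search("Ch\u0115ck attachment")))
```

Running ascii_hits against any new variation of the expression is a quick way to confirm that no ordinary letter slips into the match before the rule goes live.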

by Krisztian Fekete more than 10 years ago

13

That works perfectly :)

by Bryon more than 10 years ago
14

@Bryon: Glad to hear it :)

by Krisztian Fekete more than 10 years ago
