viagrow check attachment unicode regex
(the example i pasted is specifically for cialis of course but you get what i mean)
more info:
most are from yahoo.com addresses, many have two digits just before the @ sign, all are from different ip addresses - i can't find anything that stays the same.
@Bryon:
I have some regexes constructed for "viagra" and "pharmacy", you can download the importable XML file from http://dl.dropbox.com/u/6193776/orf_keywords.xml
To import it, start the Administration Tool and select Configuration | Import | Keyword blacklist... from the main menu. Alternatively, navigate to Configuration / Filtering - On Arrival / Keyword blacklist, right-click the expressions box, select "Import list..." and import orf_keywords.xml. Click "No" when prompted to overwrite the current list (otherwise your current expressions will be wiped out).
What you need to change is the character variations. For example, to add the "p with acute" character to the character variations of the pharmacy regex, modify it like this:
.*\b[ṕqgp9][\s\._*]?[h4]{1,3}[\s\._*f]?[aàáâåãäæ\@]{1,3}[\s\._*]?r{1,3}[\s\._*]?m{1,3}[\s\._*]?[aàáâåãäæ\@]{1,3}[\s\._*f]?[cç6]{1,3}[\s\._*f]?[yÿ]{1,3}[\s\._*f]?\b.*
You can add other characters this way as well. If you cannot copy/paste them, find their Unicode code point and use that. Example for the above-mentioned character (http://www.fileformat.info/info/unicode/char/1e55/index.htm):
.*\b[\x{1e55}qgp9][\s\._*]?[h4]{1,3}[\s\._*f]?[aàáâåãäæ\@]{1,3}[\s\._*]?r{1,3}[\s\._*]?m{1,3}[\s\._*]?[aàáâåãäæ\@]{1,3}[\s\._*f]?[cç6]{1,3}[\s\._*f]?[yÿ]{1,3}[\s\._*f]?\b.*
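For anyone who wants to try the expression outside ORF first, here is a minimal Python sketch. It assumes Python's `re` module as a stand-in for ORF's PCRE engine, so the code point is written `\u1e55` instead of `\x{1e55}`. The helper name `is_pharmacy_spam` is made up for illustration and is not part of ORF.

```python
import re

# Python's re module spells a code point as \uXXXX, not PCRE's \x{XXXX}.
# U+1E55 is "p with acute"; the other characters are the existing variations.
FIRST_LETTER = "[\u1e55qgp9]"

PHARMACY = re.compile(
    r".*\b" + FIRST_LETTER +
    r"[\s\._*]?[h4]{1,3}[\s\._*f]?[aàáâåãäæ\@]{1,3}[\s\._*]?r{1,3}"
    r"[\s\._*]?m{1,3}[\s\._*]?[aàáâåãäæ\@]{1,3}[\s\._*f]?[cç6]{1,3}"
    r"[\s\._*f]?[yÿ]{1,3}[\s\._*f]?\b.*"
)


def is_pharmacy_spam(subject: str) -> bool:
    """Hypothetical helper: True if the subject matches the pharmacy regex."""
    return PHARMACY.match(subject) is not None


print(is_pharmacy_spam("\u1e55harmacy"))               # starts with U+1E55
print(is_pharmacy_spam("cheap p.h.a.r.m.a.c.y here"))  # dotted variant
print(is_pharmacy_spam("hello world"))                 # unrelated subject
```

The first two subjects should be flagged and the third should not. Python and PCRE differ in some details, so treat this as a quick sanity check rather than proof of how ORF will behave.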
Thanks for that - i've been using those keywords since we installed ORF years ago - i think it's not hitting because it doesn't actually say viagra, but a sort of "viagrow"
but - is there an easy way to say "if there are weird characters in the subject, blacklist it"?
more specifically, if the subject contains any character in this range of unicode codes, blacklist it:
U+00A1 - U+00FF
U+0100 - U017F
U+0180 - U+024F
U+0250 - U+02AE
U+02B0 - U+02FE
OR can we say "only allow the range U+0020 thru U+0073" ?
(http://en.wikipedia.org/wiki/List_of_Unicode_characters)
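The "only allow this range" idea can be sketched as a negated character class. The Python snippet below is an illustration only. It assumes the intended whitelist is printable ASCII, U+0020 through U+007E (the post says U+0073, which is the letter "s"), and the function name is hypothetical. It has not been tested in ORF.

```python
import re

# Assumption: the whitelist is printable ASCII, U+0020 (space) to U+007E (~).
# The negated class matches any single character outside that range.
OUTSIDE_WHITELIST = re.compile("[^\u0020-\u007e]")


def has_unexpected_chars(subject: str) -> bool:
    """Hypothetical helper: True if anything falls outside U+0020-U+007E."""
    return OUTSIDE_WHITELIST.search(subject) is not None


print(has_unexpected_chars("Ch\u0115ck attachment"))  # contains U+0115
print(has_unexpected_chars("Check attachment #42"))   # plain ASCII
```

A whitelist this strict would also flag legitimate subjects that contain accented letters or non-Latin scripts, so it only suits mailboxes that expect plain English subjects.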
@Bryon:
"more specifically, if the subject contains any character in this range of unicode codes, blacklist it"
1. Start the Administration Tool
2. Navigate to Configuration / Filtering - On Arrival / Keyword Blacklist
3. Click New
4. On the Filter Properties tab, set the Search scope to Email subject
5. Add a Comment text (e.g., "Weirdo characters in the subject")
6. On the Filter expression tab, add the following expression:
.*[\x{0100}-\x{017f}\x{0180}-\x{024f}\x{0250}-\x{02af}\x{02b0}-\x{02ff}\x{00a1}-\x{00ff}].*
7. Set the expression type to "Regular expression"
8. Click OK
9. Save your settings to apply the changes by pressing Ctrl + S.
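The expression from step 6 can be tried on sample subjects before saving it in ORF. This is a rough Python sketch, assuming Python's `re` module in place of ORF's PCRE engine, with the ranges rewritten as `\uXXXX` escapes. The function name is hypothetical.

```python
import re

# Same ranges as the ORF expression, in Python's \uXXXX notation.
# Matching is case-sensitive here, which may differ from ORF's behaviour.
WEIRD_CHARS = re.compile(
    "[\u0100-\u017f\u0180-\u024f\u0250-\u02af\u02b0-\u02ff\u00a1-\u00ff]"
)


def subject_is_blacklisted(subject: str) -> bool:
    """Hypothetical helper: True if the subject has a character in the ranges."""
    return WEIRD_CHARS.search(subject) is not None


print(subject_is_blacklisted("Ch\u0115ck \u0105ttachment"))  # U+0115, U+0105
print(subject_is_blacklisted("Check attachment"))            # plain ASCII
```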
That's really close, but i see how to do it now.
That string above matches on anything with a lowercase "i", an uppercase "I", or an uppercase "S".
All of the other "normal" characters pass through.
@Bryon: Hm, indeed. Not sure why lowercase i (x0069), uppercase S (x0053), and uppercase I (x0049) are considered a match by the built-in PCRE engine; those are in the Basic Latin range (0000-007F)...
@Krisztian Fekete (Vamsoft):
OK, I narrowed it down: it seems the Capital S character is actually considered as "LATIN SMALL LETTER LONG S" (x017f) by the PCRE engine, not sure why. It's this character, definitely not a capital S:
ſ
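The link between the long s and a capital S can be reproduced in Python, which is used here only as a stand-in for ORF's PCRE engine. In this sketch the two characters match each other only when case-insensitive matching is switched on. Whether ORF applies expressions case-insensitively is an assumption, not something confirmed in this thread.

```python
import re
import unicodedata

LONG_S = "\u017f"

print(unicodedata.name(LONG_S))  # LATIN SMALL LETTER LONG S
print(LONG_S.upper())            # S

# The long s does not match "S" by default, only under IGNORECASE.
case_sensitive = re.fullmatch(LONG_S, "S") is not None
case_insensitive = re.fullmatch(LONG_S, "S", re.IGNORECASE) is not None
print(case_sensitive, case_insensitive)
```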
@Krisztian Fekete (Vamsoft):
Update: our lead developer shed some light on this - basically the Unicode standard considers these characters equivalent during normalization:
http://en.wikipedia.org/wiki/Unicode_equivalence
http://www.fileformat.info/info/unicode/char/17f/index.htm
The only workaround I can think of is excluding the problematic Unicode characters from these ranges. This should work (i.e., it will not match i, s or S):
.*[\x{0100}-\x{012f}\x{0131}-\x{017e}\x{0180}-\x{024f}\x{0250}-\x{02af}\x{02b0}-\x{02ff}].*
How interesting that we found out about those three characters
The more recent regex almost works - it still matches on capital I
weird!
@Bryon:
Try
.*[\x{0100}-\x{012f}\x{0132}-\x{017e}\x{0180}-\x{024f}\x{0250}-\x{02af}\x{02b0}-\x{02ff}].*
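To compare the three versions of the character class side by side, the sketch below lists which plain ASCII letters each one matches under case-insensitive matching. It uses Python's `re` module, not the PCRE build in ORF, so the exact letters may differ from what was reported above. The labels and the helper name are made up for illustration.

```python
import re
import string

# The three character classes from this thread, in Python's \uXXXX notation.
CLASSES = {
    "original": "[\u0100-\u017f\u0180-\u024f\u0250-\u02af\u02b0-\u02ff\u00a1-\u00ff]",
    "first fix": "[\u0100-\u012f\u0131-\u017e\u0180-\u024f\u0250-\u02af\u02b0-\u02ff]",
    "second fix": "[\u0100-\u012f\u0132-\u017e\u0180-\u024f\u0250-\u02af\u02b0-\u02ff]",
}


def ascii_letters_matched(char_class: str) -> str:
    """Return the ASCII letters the class matches when case is ignored."""
    pattern = re.compile(char_class, re.IGNORECASE)
    return "".join(c for c in string.ascii_letters if pattern.fullmatch(c))


for label, char_class in CLASSES.items():
    print(label, repr(ascii_letters_matched(char_class)))
```

With CPython's `re` this should report `isIS` for the original class, `iI` for the first fix, and an empty string for the second. The three code points left out of the last class are U+0130 (capital I with dot above), U+0131 (dotless i) and U+017F (long s), which are the ones that case-map onto ASCII letters.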
Hello,
we're getting a lot of spam "otherwise accepted" which passes all tests. The problem is they all use unique strings of unicode letters in the subject which sort of spell out the same thing, but different each time.
I have seen some really nice looking regexes built for the old 'penny stock' scams, for example this:
(?!cialis)([cçg]|(\[|\{|\())[i1l\|\\\/!¡îíìï:;](([a\^@àáâãäå])|(\/\W{0,2}\\))[i1l\|!¡îíìï:;][i1l\|\\\/!¡îíìï:;][sz5\$]
Could some regex pro out there build one which would block all of the subjects in the picture below?
[img]http://img28.imageshack.us/img28/7769/checkattachment.png[/img]
i would have copied/pasted as text but 90% of the characters can't be pasted