Regex filter all emails with links except to specific tlds - ORF Forums

1

I was trying to cobble together a regular expression that would match emails with links to domains other than the TLDs I specify. I built one with the help of online tools and checked it in a few online regex testers, where it seems to work. However, it doesn't work in ORF. I'm hoping I could find out why. Here is the regex:

.*(http:\/\/|https:\/\/)(?:(?!com|org|net|gov|us|edu).)*\s.*$

by CBGraham 6 years ago

2

@CBGraham: Hello CBGraham,

If you remove the whitespace requirement (i.e. \s) after the non-capturing group, then I think your regex should work. It should be noted though, that your regex pattern is not restricted to match URLs inside the HTML <a> anchor tags but anywhere in the document - which may or may not be an issue for you.
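
One way to check this suggestion outside ORF is to run both variants through a regex engine. The sketch below uses Python's `re` module, which is not the Perl-compatible engine ORF uses but should treat these particular constructs the same way; the sample lines are made up. In this test the variant without `\s` matches every line that contains a URL, including the `.com` one, because the trailing `.*` absorbs whatever the lookahead group refuses to consume. It seems worth verifying the behavior in ORF before relying on either variant.

```python
import re

# The pattern from the first post, and the same pattern with the \s removed.
WITH_SPACE = re.compile(
    r".*(http:\/\/|https:\/\/)(?:(?!com|org|net|gov|us|edu).)*\s.*$"
)
WITHOUT_SPACE = re.compile(
    r".*(http:\/\/|https:\/\/)(?:(?!com|org|net|gov|us|edu).)*.*$"
)

# Made-up sample lines, not taken from real mail.
SAMPLES = [
    "see http://example.xyz/page for details",  # other TLD, text follows
    "see http://example.xyz/page",              # other TLD, URL ends the line
    "see http://example.com/page for details",  # allowed TLD
]


def hits(pattern, lines):
    """Return one boolean per line: does the pattern match anywhere in it?"""
    return [bool(pattern.search(line)) for line in lines]


with_space_hits = hits(WITH_SPACE, SAMPLES)        # [True, False, False]
without_space_hits = hits(WITHOUT_SPACE, SAMPLES)  # [True, True, True]

print(with_space_hits)
print(without_space_hits)
```

The second sample shows the effect of `\s`: a URL that is not followed by whitespace on the same line is missed by the original pattern.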

As for the configuration in ORF, you have two options here, either should work:

EDIT: The 'User-Defined URL Domain Blacklist' (option 2) should not be used for this. The SURBL test extracts the domains from the values of the 'src' and 'href' attributes found anywhere in the HTML source.

1) Add your regex to the Keyword Blacklist of ORF with an "Email body + Body raw HTML source" search scope (Keyword Blacklist > Keyword Filter Properties dialog > Filter Properties tab). Make sure you select the "Regular expression (Perl-compatible)" expression type on the 'Filter Expression' tab as well.

2) Create a new regex without the http(s) prefix to match the unwanted TLDs in the domains that ORF has already extracted from the mail body, then add it to the 'User-Defined URL Domain Blacklist' of ORF (Blacklists > SURBL Test > User-Defined URL Domain Blacklist - Configure button). See: User-Defined URL Domain Blacklist: https://vamsoft.com/support/docs/orf-help/5.5.1/adm-urlblacklist#manubl

by Daniel Novak (Vamsoft) 6 years ago

3

@Daniel Novak (Vamsoft): Thank you for the help Daniel. I had not noticed that User Defined URL Domain Blacklist before. I'm trying to work within that now.

I want to create a whitelist of allowed URL TLDs instead of maintaining an ever-growing blacklist.

I ended up using the following:
^((?!\.com\b|\.org\b|\.net\b|\.gov\b|\.edu\b|\.us\b).)*$

However, when I turned that on I noticed several false positives right away. I'm not certain what the URLs look like when they reach the regex test; hopefully you can let me know why I'm seeing false positives. Does it send the full URL or just the domain? My current regex is expecting just domain information.

by CBGraham 6 years ago

4

@CBGraham: The URLs that ORF extracts from an email are distilled to a "pure" domain list, meaning the resulting list contains only strings in a 'sub.domain.tld' format. http(s), ftp, www etc. prefixes and other suffixes are stripped away. Each domain is then checked against the enabled SURBLs and user-defined wildcard/regex expressions.

Although your regex pattern is a bit overcomplicated**, it should work properly without false positives - as far as I can tell. Are you sure that it was your TLD filter that triggered the blacklist event (if you add a comment to the filter expression, that will be logged on each hit)? What message was logged for the event?

** I would recommend a simpler solution: (?!.*\.(com|org|net|gov|edu|us))
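
A note on anchoring, since the thread does not say whether ORF searches for the expression anywhere in the domain or requires it to match from the start. The sketch below uses Python's `re.search` (an assumption standing in for ORF's matching) to show that under search semantics the simpler expression matches every input unless it is anchored with `^`. The `STRICT` variant with a trailing `$` is my own addition, not something proposed in the thread; it only treats an allowed TLD as allowed when it is the last label.

```python
import re

# The simplified expression as posted, plus two anchored variants.
UNANCHORED = re.compile(r"(?!.*\.(com|org|net|gov|edu|us))")
ANCHORED = re.compile(r"^(?!.*\.(com|org|net|gov|edu|us))")
STRICT = re.compile(r"^(?!.*\.(com|org|net|gov|edu|us)$)")  # hypothetical variant

# Made-up domains in the 'sub.domain.tld' form described above.
DOMAINS = ["example.com", "example.xyz", "mail.usa.xyz"]


def blocked(pattern, domain):
    """True if the blacklist expression matches somewhere in the domain."""
    return pattern.search(domain) is not None


unanchored = [blocked(UNANCHORED, d) for d in DOMAINS]  # [True, True, True]
anchored = [blocked(ANCHORED, d) for d in DOMAINS]      # [False, True, False]
strict = [blocked(STRICT, d) for d in DOMAINS]          # [False, True, True]

print(unanchored, anchored, strict)
```

Without `^`, the engine can always find a position near the end of the string where no allowed TLD follows, so the negative lookahead succeeds there. The `mail.usa.xyz` case shows that `.us` also matches inside `.usa` unless the end of the string is pinned.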

by Daniel Novak (Vamsoft) 6 years ago

5

I tried it again and immediately got some bad hits. This is the full message:

Blacklisted by the User-Defined URL Domain Blacklist. Domain: "01D424CA.AB974BC0". Filter comment: "Block URL links to domains other than the list we want".
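
For reference, the hit can be reproduced outside ORF. This is a minimal sketch using Python's `re` module as a stand-in for ORF's Perl-compatible engine; `is_blacklisted` is a hypothetical helper name and the two comparison domains are made up. The logged string contains none of the allowed TLDs, so the expression matches it exactly as written.

```python
import re

# The whitelist-style expression added to the User-Defined URL Domain Blacklist.
TLD_FILTER = re.compile(
    r"^((?!\.com\b|\.org\b|\.net\b|\.gov\b|\.edu\b|\.us\b).)*$"
)


def is_blacklisted(domain):
    """True if the expression matches, i.e. no allowed TLD appears in the domain."""
    return TLD_FILTER.search(domain) is not None


logged = is_blacklisted("01D424CA.AB974BC0")  # True: last label is not allowed
dot_com = is_blacklisted("example.com")       # False
dot_org = is_blacklisted("www.example.org")   # False

print(logged, dot_com, dot_org)
```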

by CBGraham 6 years ago

6

@CBGraham: I don't quite see how this is a bad match - unless the domain "01D424CA.AB974BC0" was not actually harvested from a URL. Do you have the email that contained this domain by any chance? If so, could you save the email in an .eml or .msg format and send it to us () for analysis, please?

by Daniel Novak (Vamsoft) 6 years ago

7

@Daniel Novak (Vamsoft): Hello CBGraham,

I have asked the devs whether the SURBL Test harvests URLs from anything other than actual HTML hyperlinks (i.e. from anchor elements) and unfortunately it does. It looks for the 'src' and 'href' attributes in the raw HTML source and extracts the domain from those. Thus, I must withdraw my previous suggestion that you should use the 'User-Defined URL Domain Blacklist' for your filter idea.

Instead, add the following regex to the Keyword Blacklist of ORF with the following settings:

1. Start the ORF Administration Tool, and connect to the local or a remote instance
2. Navigate to 'Blacklists > Keyword Blacklist' page
3. Click 'New'
4. In the 'Keyword Filter Properties' dialog, set the search scope to 'Email body' and mark the 'Body raw HTML source' checkbox enabled
5. Add a 'Comment' text (e.g., "URL TLD filter")
6. On the 'Filter Expression' tab, add the following expression:

.*<a[^>]*href=['"](?![^>]*\.(com|org|net|gov|edu|us))[^>]*>.*

7. Set the expression type to 'Regular expression (Perl-compatible)'
8. Click 'OK'
9. Save your settings to apply the changes by pressing Ctrl + S.
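
Before enabling the filter, the expression from step 6 can be tried against a few HTML fragments. The sketch below assumes the expression is meant to begin with a literal `<a`, and it uses Python's `re` module rather than ORF's Perl-compatible engine; all sample fragments are made up. It also shows one limitation: the lookahead scans the whole tag, so an allowed TLD that appears anywhere in the URL path is enough to let a link through.

```python
import re

# The Keyword Blacklist expression, assuming it starts with a literal "<a".
ANCHOR_TLD_FILTER = re.compile(
    r""".*<a[^>]*href=['"](?![^>]*\.(com|org|net|gov|edu|us))[^>]*>.*"""
)

# Made-up HTML fragments, one per case of interest.
SAMPLES = {
    "allowed_link": '<a href="https://example.com/page">ok</a>',
    "other_tld_link": '<a href="https://example.xyz/page">click</a>',
    "image_only": '<img src="https://cdn.example.xyz/logo.png">',
    "allowed_text_in_path": '<a href="https://example.xyz/login.com.html">click</a>',
}

# True means the fragment would trigger the blacklist expression.
results = {
    name: bool(ANCHOR_TLD_FILTER.search(html)) for name, html in SAMPLES.items()
}

for name, hit in results.items():
    print(f"{name}: {hit}")
# allowed_link: False
# other_tld_link: True
# image_only: False
# allowed_text_in_path: False
```

The `image_only` case is not matched because the expression only looks at anchor tags. The `allowed_text_in_path` case is a link to a `.xyz` host that slips through because `.com` appears in its path.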

Please let me know if this has helped.

by Daniel Novak (Vamsoft) 6 years ago

8

@Daniel Novak (Vamsoft): Daniel,

Thank you again for your help with this. In SURBL - Settings, I turned off "email" from URL types to be extracted from emails and it is working well for us with results matching what the regex is looking for.

by CBGraham 6 years ago

9

@CBGraham: I am afraid that will not solve the issue in the long term. It may prevent false positives for URLs such as "cid:[email protected]", but the URLs will still be harvested from every HTML element that has an 'src' or 'href' attribute.

Of course, this is not an issue if you want to filter the top-level domains everywhere in the email - even in image URLs. In that case, consider the problem solved :)
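
The difference between the two harvesting scopes can be illustrated with a short sketch. This is not ORF's implementation, only an approximation built on Python's standard `html.parser`; `LinkCollector`, `collect_hosts` and the sample HTML are hypothetical names and data made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect host names from href/src attributes (illustrative only)."""

    def __init__(self, anchors_only):
        super().__init__()
        self.anchors_only = anchors_only
        self.hosts = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:
                continue
            if self.anchors_only:
                # Only real hyperlinks: <a href="...">
                wanted = tag == "a" and name == "href"
            else:
                # Any element carrying a src or href attribute
                wanted = name in ("src", "href")
            if wanted:
                host = urlparse(value).hostname
                if host:
                    self.hosts.append(host)


def collect_hosts(html, anchors_only):
    parser = LinkCollector(anchors_only)
    parser.feed(html)
    return parser.hosts


SAMPLE = (
    '<p><a href="https://example.com/page">link</a>'
    '<img src="https://cdn.example.xyz/logo.png"></p>'
)

anchor_hosts = collect_hosts(SAMPLE, anchors_only=True)   # ['example.com']
all_hosts = collect_hosts(SAMPLE, anchors_only=False)     # ['example.com', 'cdn.example.xyz']

print(anchor_hosts)
print(all_hosts)
```

With the wider scope, the image host would be checked against the TLD filter as well, even though the only clickable link in the sample points to an allowed TLD.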

by Daniel Novak (Vamsoft) 6 years ago

