Regex to match url-only spam? - ORF Forums


1

hello

We've been using this regex to try to match emails that contain only a URL, and it seems to work OK. But now the spammers are adding a carriage return at the end, which breaks the match and lets the spam through:

^.*?http://.*\.[a-z]{2,77}(|\.|/)\w{0,50}$

When we receive an email that contains ONLY a URL (regardless of the amount of white space before or after it, including a 'return' to the next line), we want to match that.

It is of course important not to match legit emails that contain real words before or after the URL, so how would I change this?

I've tried just taking away the $, but then it matches anything that has a URL anywhere in it, as far as I can tell.
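
A rough sketch of that effect, using Python's `re` module as a stand-in (an assumption: the filter's own regex engine and options may behave differently). The two patterns are the one above with and without the trailing `$`; the sample bodies and the `matches` helper are made up for illustration.

```python
import re

# Patterns taken from the post; only the trailing $ differs.
WITH_ANCHOR = r"^.*?http://.*\.[a-z]{2,77}(|\.|/)\w{0,50}$"
WITHOUT_ANCHOR = r"^.*?http://.*\.[a-z]{2,77}(|\.|/)\w{0,50}"

# Hypothetical message bodies, not real mail.
url_only = "http://www.gametv.az/qwertyuasz.html"
legit = "Hi Bob, the doc is at http://example.com/doc.html please review"


def matches(pattern, body):
    """Return True if the pattern is found in the body."""
    return re.search(pattern, body) is not None


for name, pattern in (("with $", WITH_ANCHOR), ("without $", WITHOUT_ANCHOR)):
    print(name, "| url only:", matches(pattern, url_only),
          "| legit mail:", matches(pattern, legit))
```

Without the `$`, nothing forces the match to reach the end of the body, so the legit message is caught as soon as a URL shows up in it.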

Some example URLs that make it through (the ones I checked do match on their own, but the actual email has a 'return' after the URL, so the pattern no longer matches):
http://jpk.com.sg/components/com_ag_google_analytics2/site.php?html1
http://chipinbiz.com/holidays.php?uid=22&;detail=169&item=55
http://ontimecontact.com/holidays.php?uid=51&;detail=361&item=34
http://jardines.sc36.info/holidays.php?uid=81&;detail=151&item=21
http://tecnoboxsa.com.ar/holidays.php?uid=64&;detail=688&item=11
http://tsmcharitygolf.com.my/qwertyuasz.html
http://iprofumidiesse.com/uryeqwpkfh.html
http://www.gametv.az/qwertyuasz.html

by Bryon more than 10 years ago
2

@Bryon: How about

^.*?http://.*\.[a-z]{2,77}(|\.|/)\w{0,50}[\s\r\n]*$

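[\s\r\n] matches any single whitespace character, and the * after it means "zero or more repetitions", so a trailing LF or CRLF is covered. (\s already includes \r and \n, so listing them again is redundant but harmless.) I did not test this, but I think it should work.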
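
A quick way to check this, sketched with Python's `re` module (an assumption: the filter's own regex engine may differ in details). The sample body is one of the URLs from the first post with a CRLF appended; the variable names are illustrative only.

```python
import re

# Original pattern from post 1 and the suggested pattern from this post.
ORIGINAL = r"^.*?http://.*\.[a-z]{2,77}(|\.|/)\w{0,50}$"
SUGGESTED = r"^.*?http://.*\.[a-z]{2,77}(|\.|/)\w{0,50}[\s\r\n]*$"
# Same idea with plain \s, since \s already covers \r and \n.
SHORTER = r"^.*?http://.*\.[a-z]{2,77}(|\.|/)\w{0,50}\s*$"

# URL from the thread followed by a CRLF, as the spammers now send it.
body = "http://www.gametv.az/qwertyuasz.html\r\n"

original_hit = re.search(ORIGINAL, body) is not None
suggested_hit = re.search(SUGGESTED, body) is not None
shorter_hit = re.search(SHORTER, body) is not None

print("original: ", original_hit)
print("suggested:", suggested_hit)
print("shorter:  ", shorter_hit)
```

In this sketch the original pattern misses the CRLF-terminated body, while both the suggested pattern and the shorter `\s*` variant match it.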

by Krisztian Fekete (Vamsoft) more than 10 years ago
3

I see what you did there, and that looks good: it matches anything where the URL is the last thing in the email. But it also matches any amount of text before the URL, so I adjusted it like this:

^http://.*\.[a-z]{2,77}(|\.|/)\w{0,50}[\s\r\n]*$

which seems to mean "a body that starts with http:// and ends with any amount of white space, including returns, as long as there's no more text in the body".

But now that doesn't match if there's a return or white space before the URL.

So the final pattern that seems to work best, allowing any white space or returns before or after a URL with no other text present, is this:

^[\s\r\n]*http://.*\.[a-z]{2,77}(|\.|/)\w{0,50}[\s\r\n]*$
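
For anyone who wants to try the final pattern, here is a small test sketch in Python (an assumption: Python's `re` is only a stand-in for the filter's engine, and the sample bodies and the `is_url_only` helper are made up for illustration).

```python
import re

# Final pattern from this post.
FINAL = r"^[\s\r\n]*http://.*\.[a-z]{2,77}(|\.|/)\w{0,50}[\s\r\n]*$"


def is_url_only(body):
    """Return True if the final pattern matches the whole body."""
    return re.search(FINAL, body) is not None


# Hypothetical message bodies built around URLs from the thread.
samples = {
    "bare url": "http://iprofumidiesse.com/uryeqwpkfh.html",
    "space and returns around url":
        "\r\n  http://tsmcharitygolf.com.my/qwertyuasz.html \r\n\r\n",
    "text before url": "see http://example.com/page.html\r\n",
    "text after url": "http://example.com/page.html thanks",
    "text on next line": "http://example.com/page.html\r\nBuy now",
}

for label, body in samples.items():
    print(f"{label}: {is_url_only(body)}")
```

One caveat this kind of test can surface: `.*` also matches spaces, so a single-line body such as `http://example.com/page.html check this out.please` still matches, because the pattern can treat `.please` as the end of the URL.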

And I only got there with your help, so thanks for that :)


by Bryon more than 10 years ago
4

@Bryon: I am glad I could help :)

by Krisztian Fekete (Vamsoft) more than 10 years ago