Support for Drupal 7 is ending on 5 January 2025—it’s time to migrate to Drupal 10! Learn about the many benefits of Drupal 10 and find migration tools in our resource center.
There is a bug in the regex for $pattern in project_mail_output() in mail.inc which I just noticed today. It doesn't work correctly if there are attributes other than href in the link, such as rel="nofollow" or class="blah".
The regex is $pattern = '@(<a href="(.+?)">(.+?)</a>)@ei';
I recommend something more like (untested): $pattern = '@(<a .*?href="([^"]+?)"[^>]*>(.+?)</a>)@ei';
Comment | File | Size | Author |
---|---|---|---|
#15 | mail_regex_1.patch | 668 bytes | ChrisKennedy |
#2 | mail_regex_0.patch | 767 bytes | ChrisKennedy |
#1 | mail_regex.patch | 651 bytes | ChrisKennedy |
Comments
Comment #1
ChrisKennedy CreditAttribution: ChrisKennedy commentedHere's a patch - tested to work correctly.
Comment #2
ChrisKennedy CreditAttribution: ChrisKennedy commentedHere's a slightly better version that ensures that the URL attribute is actually "href" in its entirely, and not something like "blahhref".
Comment #3
ChrisKennedy CreditAttribution: ChrisKennedy commentedBump - easy fix.
Comment #4
dwwregexp looks fine on visual inspection (though it's not clear why the original and your copy both consider the part of the href between the " marks to be optional -- why the '?' in there?).
anyone have a chance to test this before I commit? (i'm busy juggling many other issues)...
thanks,
-derek
Comment #5
ChrisKennedy CreditAttribution: ChrisKennedy commentedThe +? doesn't make it optional - it makes it "ungreedy", in that it will stop at the earliest possible location rather than the latest. Otherwise it would eat through multiple attributes and stop at the last quotation mark.
Comment #6
ChrisKennedy CreditAttribution: ChrisKennedy commentedStill applies and still works.
I tested with a basic multiline link, then this test string:
asfdasd<a title="testing" blkaj asdf href="woot.html">woot</a> <blah href="test.html">test</a> <a blahrhef="test2.html">test</a> <a href="blah.html">hello</woot> test <a href="yah.html">uh oh</a>
. The patched version correctly cites the first and last link, compared to the current code which does not cite the first link and merges the last two test links into one incorrect citation.I tried for a while to get single parentheses to work as href delimiters too, but for some reason the back references weren't working. *shrug*
Comment #7
ChrisKennedy CreditAttribution: ChrisKennedy commentedHere's another simple test case for ya:
<a href="drupal.html" title="hey I'm broken!">don't click this</a>
Comment #8
felipensp CreditAttribution: felipensp commentedMy suggestion:
'@(<a(?:(?!href=).)+href=[\x22\x27]?\s*([^\x22\x27>\s]+)\s*[\x22\x27]?[^>]*>([^<]+)<\/a>)@eis'
Accept valids values:
href=foo.php
href=' foo.php '
And others attributes in tag.
Comment #9
ChrisKennedy CreditAttribution: ChrisKennedy commentedI believe there are two problems with the proposal in #8: 1) it will trigger on something like <a blahhref="blah.html"> even though it's not valid, and 2) it does not require the ' or " to be matched with the same character, i.e. it would link <a href="blah.html'> which is invalid.
Comment #10
ChrisKennedy CreditAttribution: ChrisKennedy commentedAlso, it doesn't support spaces in the URL. #3 fixes the original bug and is solid.
Comment #11
felipensp CreditAttribution: felipensp commentedThe problem in treat "...", '...' and value without space is the number of group capturing. In PCRE 7.2 there not this problem...
PCRE 7.2 changelog:
If can modifier the function, all right!
See my test:
Comment #12
ChrisKennedy CreditAttribution: ChrisKennedy commentedThe latest regex looks quite sweet. The only thing is that I wonder how well-supported PCRE 7.2 is. I checked my own webhost and they only have 7.0.18 on it.
Comment #13
hunmonk CreditAttribution: hunmonk commented@ChrisKennedy: in the example you offer in #6, it looks to me like the two 'good' links are:
but the links i'm getting after applying your patch are:
Comment #14
hunmonk CreditAttribution: hunmonk commentedComment #15
ChrisKennedy CreditAttribution: ChrisKennedy commentedWhoops, you're right. Here's a fixed version, also with the spacing support from #13 and with the outermost parentheses removed, which makes the patch a one-liner.
Comment #16
hunmonk CreditAttribution: hunmonk commentedtested, works as expected on the rather complex example. committed to 4.7.x-1.x, 4.7.x-2.x, 5.x-1.x
thanks chris!
Comment #17
(not verified) CreditAttribution: commentedComment #18
felipensp CreditAttribution: felipensp commentedOne question... Can happen to appear anything tag within tag ?
Example: foo bar
In this case, the regex don't matches.
Comment #19
felipensp CreditAttribution: felipensp commentedOps, rewriting example...
<a ...>foo <b>bar</b></a>
Comment #20
blackhatseo CreditAttribution: blackhatseo commentedThe current regex doesn't allow whitespace in the closing tag. Also doesn't allow for html tags inside the anchor.
Here is a much better regex for [a] tags. Accounts for any possible whitespace, attributes, single/double/no quotes,html in anchor,etc.
(via http://www.blackhat-seo.com/2007/regex-tips-and-tricks/)
Comment #21
blackhatseo CreditAttribution: blackhatseo commentedComment #22
hunmonk CreditAttribution: hunmonk commented@blackhatseo: please submit a proper patch if you want this considered for inclusion:
http://drupal.org/patch/create
Comment #23
drummThe HTML to text conversion is now handled by Drupal core.