There is a bug in the regex for $pattern in project_mail_output() in mail.inc which I just noticed today. It doesn't work correctly if there are attributes other than href in the link, such as rel="nofollow" or class="blah".

The regex is $pattern = '@(<a href="(.+?)">(.+?)</a>)@ei';

I recommend something more like (untested): $pattern = '@(<a .*?href="([^"]+?)"[^>]*>(.+?)</a>)@ei';

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

ChrisKennedy’s picture

Status: Active » Needs review
FileSize
651 bytes

Here's a patch - tested to work correctly.

ChrisKennedy’s picture

FileSize
767 bytes

Here's a slightly better version that ensures that the URL attribute is actually "href" in its entirely, and not something like "blahhref".

ChrisKennedy’s picture

Bump - easy fix.

dww’s picture

regexp looks fine on visual inspection (though it's not clear why the original and your copy both consider the part of the href between the " marks to be optional -- why the '?' in there?).

anyone have a chance to test this before I commit? (i'm busy juggling many other issues)...

thanks,
-derek

ChrisKennedy’s picture

The +? doesn't make it optional - it makes it "ungreedy", in that it will stop at the earliest possible location rather than the latest. Otherwise it would eat through multiple attributes and stop at the last quotation mark.

ChrisKennedy’s picture

Still applies and still works.

I tested with a basic multiline link, then this test string: asfdasd<a title="testing" blkaj asdf href="woot.html">woot</a> <blah href="test.html">test</a> <a blahrhef="test2.html">test</a> <a href="blah.html">hello</woot> test <a href="yah.html">uh oh</a>. The patched version correctly cites the first and last link, compared to the current code which does not cite the first link and merges the last two test links into one incorrect citation.

I tried for a while to get single parentheses to work as href delimiters too, but for some reason the back references weren't working. *shrug*

ChrisKennedy’s picture

Here's another simple test case for ya: <a href="drupal.html" title="hey I'm broken!">don't click this</a>

felipensp’s picture

My suggestion:
'@(<a(?:(?!href=).)+href=[\x22\x27]?\s*([^\x22\x27>\s]+)\s*[\x22\x27]?[^>]*>([^<]+)<\/a>)@eis'

Accept valids values:

href="
foo.php"

href=foo.php

href=' foo.php '

And others attributes in tag.

ChrisKennedy’s picture

I believe there are two problems with the proposal in #8: 1) it will trigger on something like <a blahhref="blah.html"> even though it's not valid, and 2) it does not require the ' or " to be matched with the same character, i.e. it would link <a href="blah.html'> which is invalid.

ChrisKennedy’s picture

Also, it doesn't support spaces in the URL. #3 fixes the original bug and is solid.

felipensp’s picture

The problem in treat "...", '...' and value without space is the number of group capturing. In PCRE 7.2 there not this problem...

PCRE 7.2 changelog:

(g) (?| introduces a group in which the numbering of parentheses in each
alternative starts with the same number.

If can modifier the function, all right!

See my test:

<pre>
<?php

$l = <<<LINKS
  <a href="foo.php">foo</a>
  <a href='bar.php'>bar</a>
  <a href=baz.php>baz</a>
  <a href="erro.php'>erro</a>
  <ahref="erro2.php'>erro2</a>
LINKS;
    
preg_match_all('@<a(?:(?!\shref=).)*\shref\s*=\s*(?:\x22([^\x22\x27]+)\x22|\x27([^\x27\x22]+)\x27|([^\x22\x27>\s]+))[^>]*>([^<]+)<\/a>@is', $l, $matches);
print_r($matches); 

?>
ChrisKennedy’s picture

The latest regex looks quite sweet. The only thing is that I wonder how well-supported PCRE 7.2 is. I checked my own webhost and they only have 7.0.18 on it.

hunmonk’s picture

Status: Needs review » Needs work

@ChrisKennedy: in the example you offer in #6, it looks to me like the two 'good' links are:

<a title="testing" blkaj asdf href="woot.html">woot</a>
<a href="yah.html">uh oh</a>

but the links i'm getting after applying your patch are:

http://localhost.drupal.project/?q=woot.html
http://localhost.drupal.project/?q=blah.html
hunmonk’s picture

Priority: Normal » Minor
ChrisKennedy’s picture

Status: Needs work » Needs review
FileSize
668 bytes

Whoops, you're right. Here's a fixed version, also with the spacing support from #13 and with the outermost parentheses removed, which makes the patch a one-liner.

hunmonk’s picture

Status: Needs review » Fixed

tested, works as expected on the rather complex example. committed to 4.7.x-1.x, 4.7.x-2.x, 5.x-1.x

thanks chris!

Anonymous’s picture

Status: Fixed » Closed (fixed)
felipensp’s picture

One question... Can happen to appear anything tag within tag ?

Example: foo bar

In this case, the regex don't matches.

felipensp’s picture

Ops, rewriting example...

<a ...>foo <b>bar</b></a>

blackhatseo’s picture

Status: Closed (fixed) » Active

The current regex doesn't allow whitespace in the closing tag. Also doesn't allow for html tags inside the anchor.

Here is a much better regex for [a] tags. Accounts for any possible whitespace, attributes, single/double/no quotes,html in anchor,etc.

@<\s*a[^>]*?href\s*=[\"\’\s]*(.*?)[\"\’\s]*[^>]*?>(.*?)<\s*/\s*a\s*>@is

(via http://www.blackhat-seo.com/2007/regex-tips-and-tricks/)

blackhatseo’s picture

Title: URL regex assumes href is sole attribute » URL regex assumes no whitespace in closing tag and no nested html
Status: Active » Needs review
hunmonk’s picture

Status: Needs review » Needs work

@blackhatseo: please submit a proper patch if you want this considered for inclusion:

http://drupal.org/patch/create

drumm’s picture

Issue summary: View changes
Status: Needs work » Closed (cannot reproduce)

The HTML to text conversion is now handled by Drupal core.