URL filter breaks generated href tags [#161217]

Hi all

Attached is a patch that is a major rewrite to function _filter_url($text, $format). It is based on a recent version of the 5.x development branch, but it only touches parts that are unchanged in 6.x so it can be applied as is to that too.

Rationale

The current version of the URL filter is really basic. The regular expressions used only find url's in plaintext, inside p, li and br tags. In addition it has some bad bugs, such as it might manipulate url's that are really part of some html element's attribute.

This patch extends the "contexts" in which URL's are searched for, and on the other hand makes the regular expressions more accurate so as to avoid manipulating url's that absolutely must not be touched.

What changed?

//From line 1109...
/**
 * URL filter. Automatically converts text web addresses (URLs, e-mail addresses,
 * ftp links, etc.) into hyperlinks.
 */
function _filter_url($text, $format) {
  //The following tags may contain text with a URL that should be converted to a link
  //List also contains standalone/empty tags like img and br
  $tags_begin =  'caption|td|th|div|dd|dt|li|blockquote|address|p|h[1-6]|i|b|u|tt|big|small|strike|s|em|strong|font|dfn|samp|kbd|var|cite|abbr|acronym|sub|sup|q|ins|del|img|br|object|applet';


  //transform each tagname from 'p' to '<\/?p[^>]*>'
  //this pattern will match <p>, </p>, <p class="foo"> and <p />
  //the last one is significant for xhtml standalone tags like <img /> and <br />
  //note: <img/> and <br/>, won't work. While valid xhtml their use is discouraged anyway
  //note: the pattern will obviously match many invalid tags too like <p &&&>    
  $arr_tags_begin = explode( "|", $tags_begin );
  for ( $i = 0; $i < count( $arr_tags_begin ); $i++ ){
    $arr_tags_begin[$i] = '<\/?' . $arr_tags_begin[$i] . '[^>]*>'; 
  }
  $tags_begin = implode( "|", $arr_tags_begin );

  // Pass length to regexp callback
  _filter_url_trim(NULL, variable_get('filter_url_length_'. $format, 72));

  $text   = ' '. $text .' ';

  //The structure of each three matches below is
  // 1. a tag from the list above or newline or </a>
  // 2. any characters except "<" 
  // 3. the url to match
  // 4. possible period or other marks at end of url are considered separate (end of sentence)
  // 5. end url in space, newline or tag ("<" character)

  //In many cases the patterns below only match the first url if there are many in the same paragraph. 
  //Therefore the preg_replace are encapsulate in loops that will re-run the replace as many times as needed.

  // Match absolute URLs. 
  $protocols = 'http://|https://|ftp://|mailto:|smb://|afp://|file://|gopher://|news://|ssl://|sslv2://|sslv3://|tls://|tcp://|udp://';
  $urlpattern = "($protocols)([a-zA-Z0-9@:%_+*~#?&=.,/;-]*[a-zA-Z0-9@:%_+*~#&=/;-])";
  $re = "`(($tags_begin|\n|\r|</a>)([^<])*?)($urlpattern)(([\.\,\?\!]*?)[\<\s\n\r])`i";
  while(preg_match($re, $text)) {
    $text = preg_replace_callback($re, '_filter_url_parse_full_links', $text);
  }
  
  // Match e-mail addresses.
  $urlpattern = '[A-Za-z0-9._-]+@[A-Za-z0-9._+-]+\.[A-Za-z]{2,4}';
  $re = "`(($tags_begin|\n|\r|</a>)([^<])*?)($urlpattern)(([\.\,\?\!]*?)[\<\s\n\r])`i";
  while(preg_match($re, $text)) {
    $text = preg_replace($re, '\1<a href="mailto:\4">\4</a>\5', $text);
  }
  
  // Match www domains/addresses.
  $urlpattern = 'www\.[a-zA-Z0-9@:%_+*~#?&=.,/;-]*[a-zA-Z0-9@:%_+~#\&=/;-]';
  $re = "`(($tags_begin|\n|\r|</a>)([^<])*?)($urlpattern)(([\.\,\?\!]*?)[\<\s\n\r])`i";
  while(preg_match($re, $text)) {
    $text = preg_replace_callback($re, '_filter_url_parse_partial_links', $text);
  }
  
  $text = substr($text, 1, -1);
  return $text;
}

/**
 * Make links out of absolute URLs.
 */
function _filter_url_parse_full_links($match) {
  //which parenthesis in the regexp contains the url?
  $i = 4;
  $match[$i] = decode_entities($match[$i]);
  $caption = check_plain(_filter_url_trim($match[$i]));
  $match[$i] = check_url($match[$i]);
  return $match[1] . '<a href="'. $match[$i] .'" title="'. $match[$i] .'">'. $caption .'</a>'. $match[$i+3];
}

/**
 * Make links out of domain names starting with "www."
 */
function _filter_url_parse_partial_links($match) {
  //which parenthesis in the regexp contains the url?
  $i = 4;
  $match[$i] = decode_entities($match[$i]);
  $caption = check_plain(_filter_url_trim($match[$i]));
  $match[$i] = check_plain($match[$i]);
  return $match[1] . '<a href="http://'. $match[$i] .'" title="'. $match[$i] .'">'. $caption .'</a>'. $match[$i+1];
}

Attached are a patch that can be applied against 5.x-dev and also the entire filter.module file itself. (So you can just replace your current filter.module to test.)

Comment	File	Size	Author
#153	drupal.filter-url.153.patch	35.4 KB	sun
#151	drupal.filter-url.150.patch	160.31 KB	sun
#144	drupal.filter-url.144.patch	36.21 KB	sun
#142	drupal.filter-url.141.patch	33.99 KB	sun
#139	drupal.filter-url.137.patch	28.87 KB	hingo
#139	test.tar_.gz	1.16 KB	hingo
#137	drupal.filter-url.137.patch	24.43 KB	sun
#131	drupal.filter-url.131.patch	24.51 KB	sun
#127	drupal.filter-url.127.patch	24.14 KB	sun
#126	drupal.filter-url.126.patch	23.98 KB	sun
#124	drupal.filter-url.123.patch	16.57 KB	sun
#122	drupal.filter-url.122.patch	17.49 KB	sun
#113	161217-url-filter.patch	17.3 KB	klausi
#111	161217-filter_url-3.patch	8.82 KB	redndahead
#109	161217-filter_url-2.patch	8.78 KB	redndahead
#103	filter_url.patch	7.65 KB	catch
#99	filter_url.patch	8.96 KB	catch
#91	url_filter.patch	15.92 KB	catch
#89	url_filter.patch	16.96 KB	catch
#87	url_filter.patch	16.8 KB	catch
#86	url_filter.patch	16.79 KB	catch
#84	url_filter.patch	17.26 KB	catch
#80	urlfilter-80.patch	8 KB	dmitrig01
#74	locale_confirm_280628_7x_13.patch	7.47 KB	cwgordon7
#71	urlfilter-71.patch	7.41 KB	hingo
#68	urlfilteragain_0.patch	6.29 KB	hingo
#65	urlfilter.patch	7.63 KB	chx
#64	url_filter-161217-64.patch	7.75 KB	catch
#63	url_filter-161217-63.patch	7.75 KB	floretan
#62	urlfilteragain.patch	6.29 KB	catch
	urlfilter.patch	6.29 KB	catch
	urlfilter.patch	6.29 KB	catch
#46	urlfilter-161217-46.patch	6.3 KB	floretan
#43	UrlFilterTestCases.txt	2.08 KB	hingo
#36	filter.module.urlfilter-rewrite.HEAD-0.9.patch	6.07 KB	hingo
#33	filter.module.urlfilter-rewrite.HEAD-0.8.patch	6.19 KB	hingo
#22	filter.module.urlfilter-rewrite.HEAD-0.7.patch	5.88 KB	hingo
#17	filter.testdata.txt	2.12 KB	hingo
#16	filter.module.urlfilter-rewrite.HEAD-0.6.patch	5.87 KB	hingo
#10	filter.module.urlfilter-rewrite.HEAD-0.5.patch	5.26 KB	hingo
#9	filter.module.urlfilter-rewrite.HEAD-0.4-badidea.patch	5.37 KB	hingo
#8	filter.module.urlfilter-rewrite.HEAD-0.3.patch	5.56 KB	hingo
#5	filter.module.urlfilter-rewrite.HEAD-0.2.patch	5.21 KB	hingo
#3	filter.module.urlfilter-rewrite.HEAD-0.1.patch	4.56 KB	hingo
#2	urlfilter-rewrite.testdata.txt	1.04 KB	hingo
#1	filter.module.txt	59.01 KB	hingo
	filter.module.urlfilter-rewrite.v0.1.patch	4.55 KB	hingo

Comments

Comment #1

hingo commented 22 July 2007 at 12:55

Title:

Better URL filter in filter module

» Better URL filter in filter module (filter.module as attachment)

Status	File	Size
new	filter.module.txt	59.01 KB

Here is the filter.module attached

Comment #2

hingo commented 22 July 2007 at 12:58

Title:

Better URL filter in filter module (filter.module as attachment)

» Better URL filter in filter module (testdata attachment)

Status	File	Size
new	urlfilter-rewrite.testdata.txt	1.04 KB

And here is a file that can be used as testdata.

Comment #3

hingo commented 23 July 2007 at 21:19

Title:

Better URL filter in filter module (testdata attachment)

» Better URL filter in filter module (patch against HEAD)

Status	File	Size
new	filter.module.urlfilter-rewrite.HEAD-0.1.patch	4.56 KB

Attached is the same patch against the most recent HEAD / 6.0

So no you are all set to go to review and apply this on any system you wish.

Comment #4

drumm

he/him

NY, US

commented 24 July 2007 at 17:43

Version:	5.x-dev	» 7.x-dev
Status:	Needs review	» Needs work

Changes such as this will only be committed to the development branch, which I think will be for the 7.x release cycle, 6.x is currently in feature freeze.

This patch is not properly formatted, it should be in unified diff format. See http://drupal.org/diffandpatch.

Comment #5

hingo commented 26 July 2007 at 09:16

Title:	Better URL filter in filter module (patch against HEAD)	» Better URL filter in filter module
Version:	7.x-dev	» 6.x-dev
Status:	Needs work	» Needs review

Status	File	Size
new	filter.module.urlfilter-rewrite.HEAD-0.2.patch	5.21 KB

Oh right, sorry about the patch. Here is a diff -up patch against head attached.

About the 6 vs 7 issue. I would like to point out that this patch contains absolutely no new features. It does the same replacements as the current urlfilter module (the core regexp patterns didn't change) but it does the replacements in some new places where the current one doesn't but imho should (inside many new tags) and otoh it doesn't touch several places the current one shouldn't but does. In this sense I would argue this fix is within the scope of "During the initial stage of the code freeze, documentation updates, usability improvements, and performance improvements will still be accepted. New functionality or API changes / additions, on the other hand, will not be unless deemed critical. As of today, the focus is to strengthen the code base's performance, usability, and stability. As we progress, focus will shift towards stability and, near the end of the code freeze, only bug fixes will be allowed, until no release critical bugs are left." (http://drupal.org/drupal-6.0-code-freeze)

I do realise this patch is not a simple one-liner bug fix and the URL filter has worked fine for many of you for many years now. So if you won't include this in 6.x I'm not going to whine about it more than this, but I do want to stress that nothing new is introduced in this patch, it is a bug fix.

I'm changing version back to 6.x-dev just to get attention from the right people. Please just change it back to 7.x if that is your judgement.

Comment #6

Steven commented 27 July 2007 at 00:29

Status:

Needs review

» Needs work

If the goal is to make it context-aware, then I don't see why we don't do a full tag split like other core code does (e.g. htmlcorrector, the search indexer, drupal_html_to_text). That way we can simplify the extremely complicated regular expression into a simpler, more readable heuristic. Ignoring any content between <a> </a> tags or ignoring any HTML attributes would be a simple if check at the right point.

The combined regular expressions that result from all this code seem incredibly convoluted, and contain slow, ungreedy operators.

Also, the following is just not acceptable:

  //note: <img/> and <br/>, won't work. While valid xhtml their use is discouraged anyway

Comment #7

webchick

she/they

English

Vancouver 🇨🇦

commented 27 July 2007 at 02:11

Also, from a code-style standpoint, please ensure there is one space between // and the comment and that the comment uses sentence capitalization. For example:

+  //this pattern will match <p>, </p>, <p class="foo"> and <p />

should be:

+  // This pattern will match <p>, </p>, <p class="foo"> and <p />.

There are places in core where the comments don't conform to this standard (yet), but we shouldn't introduce anymore new ones that don't. :)

Comment #8

hingo commented 2 August 2007 at 18:38

Status	File	Size
new	filter.module.urlfilter-rewrite.HEAD-0.3.patch	5.56 KB

Hi again. Thanks for the good comments. It took me a couple of days to go through them, but now I'm actually back with two new patches.

Attached here is a patch 0.3 that corrects all of the minor problems you have pointed out for 0.2. In particular:

Also, the following is just not acceptable:
//note: and
, won't work. While valid xhtml their use is discouraged anyway

Actually, it seems that comment was wrong and those tags actually did work. The comment is now removed. However, the regexp at that position has now been reworked to be a bit more precise than before, since I realised I had tought about it the wrong way.

Also, from a code-style standpoint, please ensure there is one space between // and the comment and that the comment uses sentence capitalization. For example:

Done.

Comment #9

hingo commented 2 August 2007 at 19:07

Status	File	Size
new	filter.module.urlfilter-rewrite.HEAD-0.4-badidea.patch	5.37 KB

And here is the other patch, addressing the final suggestion made by Steven.

Please not that this patch is not intended to be reviewed or included anywhere. I'm just attaching it as an attachment to the explanation that follows. Please consider 0.3 for inclusion into Drupal core.

If the goal is to make it context-aware, then I don't see why we don't do a full tag split like other core code does (e.g. htmlcorrector, the search indexer, drupal_html_to_text). That way we can simplify the extremely complicated regular expression into a simpler, more readable heuristic. Ignoring any content between tags or ignoring any HTML attributes would be a simple if check at the right point.

In the attached patch I tried to follow this path instead. It is almost a complete rewrite of the whole _filter_url_filter() function. (Whereas patch 0.3. actually is more like additions onto the simple regexps in current HEAD.)

The problem is that URL filter actually makes 3 different replacements. It makes links out of

1) http://any.full.url
2) email@addresses.com
3) www.domains.com

The problem is that once you do the split (you can just split on all tags, it's really simple)
(1) works fine,
(2) works fine since it (practically) never confilcts with (1),
but then at step (3) you end up actually hitting all of the addresses that were converted in step (1), thus messing up the links created in (1). So the problem is that while we preg_split away all tags present in the input, (1) and (2) create new <a> tags and there is no new split function and then you mess everything up at (3).

Possible solutions to this dilemma:

1) Make the last regexp more intelligent/complex so it doesn't do this -> This would end up being the "complex" regexp used in patch v0.3, kind of defeating the whole point of the excercise: Then we'd have both more complex code and the complex regexp.

2) Separate the 3 regexps so that only one is done at a time. For instance have a loop that goes 3 times, doing the preg_split anew for each loop. This way the new <a> tags created in steps (1) and (2) would become protected at step (3).

I personally am of the opinion that while the regexps in v0.3 may be a bit wordy, the complexity of the if-else structures of v0.4 are not that simple either. Adding another loop or other logic doesn't seem worthwile to me - I'd rather vote for v0.3 where at least the code is straightforward, albeit the regexps may be on the long side.

(Another comment on patch v0.2)

The combined regular expressions that result from all this code seem incredibly convoluted, and contain slow, ungreedy operators.

Is the slowness of a regexp in a filter an issue, since the filter is cached? I do admit that URL filter is one of the most used ones, but nobody runs filters without having the caching on, right?

Whether the regexps are readable or not is of course a matter of taste, to the author everything always looks simple. But I'd argue the if-else logic in v0.4 takes about as much time to understand as the regexps in v0.3. (v.0.4 also took longer to get right than modding the regexps took for v0.3.)

So in summary, I'm still proposing the patch v0.3.

Comment #10

hingo commented 3 August 2007 at 05:06

Status	File	Size
new	filter.module.urlfilter-rewrite.HEAD-0.5.patch	5.26 KB

Small restructuring of the regexp in $tags_begin variable. (Refinement based on patch 0.3, see previous comment.)

Changes line 1089 (line 10 in this patch)

  $tags_begin = '<\/?(?:' . $tags_begin . ')(?:\s[^>]|\/)*>';
  // This pattern will match <p>, </p>, <p class="foo">, <p /> and <p/>
  // The last ones are significant for xhtml standalone tags like <img/> and <br/>
  // Note: the pattern will obviously match many invalid tag-like constructs too 
  // like <p &&&>

Removes the for-loop that did the same transformation around each tag separately.

This change makes the regexps a bit shorter, so should be slightly simpler to read and also to execute.

Comment #11

hingo commented 8 August 2007 at 18:19

Status:

Needs work

» Needs review

Oops, forgot to change status...

Comment #12

hingo commented 28 August 2007 at 13:45

How should silence in a "needs review" situation be interpreted?

Where is 6.x at the moment, is it still realistic to target this for 6.x or are we close to RC time?

Comment #13

shap commented 30 August 2007 at 08:13

I applied your patch 0.5 to test. There seems to be a bug here. Given the input

blah blah http://mumble.net

the url is getting filtered, with the effect that the code example is obfuscated. This occurs even though the code tag is NOT on the list of allowable contexts for filtering. Any idea what is happening?

Hmm. Actually, things get substituted in a code tag even without the surrounding blockquote. This doesn't seem right if I am understanding the module code's intentions.

Comment #14

hingo commented 2 September 2007 at 19:03

Hi shap

Thanks for testing this.

Unfortunately, I cannot reproduce what you are getting. I tested on a vanilla 6.x-dev, with 0.5 patch applied. For instance,

This is just a www.test.com. paragraph with person@test.com. some http://www.test.com. urls thrown in and also <code>using www.test.com the code tag<  /code>.

<blockquote>
This is just a www.test.com. paragraph with person@test.com. some http://www.test.com. urls thrown in and also <code>using www.test.com the code tag<  /code>.
</blockquote>

<code>So what's this about? http://www.test.com abc<  /code>

http://www.test.com
www.test.com
person@test.com
<code>www.test.com<  /code>

All of the other url's get modified, but the url's inside the code tags are not modified.

On the other hand, the behaviour you describe matches the current drupal url filter. (And yes it is wrong and that is exactly the reason I'm trying to fix it.) So could you please check if:

* you really are running filter.module with the patch applied
* you are not seeing a cached version of your content. For instance, write foo123 at the end of the text to make sure it is different from some old version.

Comment #15

hingo commented 3 September 2007 at 13:43

Status:

Needs review

» Needs work

I found one instance where actually url's are modified when they shouldn't:

<code>In this example www.test.com the first url is not modified, but the second www.test.com is even if it shouldn't. </ code>

The problem is that v0.5 only pays attention to the innermost tags. I haven't thought of a case in normal html where this would lead to problems, but indeed the code tag is different.

I'll work on this and see how to solve it. Thanks shap for bringin this to my attention - I don't know if this is exactly the bug you were describing, but thanks still.

Comment #16

hingo commented 3 September 2007 at 17:30

Status:

Needs work

» Needs review

Status	File	Size
new	filter.module.urlfilter-rewrite.HEAD-0.6.patch	5.87 KB

Well, to fix the problem with tags nested inside code tags, I had to revert to the preg_split/implode strategy from 0.4. I've completed and refined that approach and this patch is a refined version of 0.4.

I would appreciate if you can test this out too. I'm pretty sure this is close to perfect, I dare you to come up with examples to break this :-)

Comment #17

hingo commented 3 September 2007 at 17:33

Status	File	Size
new	filter.testdata.txt	2.12 KB

Attached is the testdata I'm currently using.

Comment #18

hingo commented 16 September 2007 at 09:02

Version:

6.x-dev

» 7.x-dev

Since 6.0 has now reached beta (congratulations, btw) I'm changing this to target 7.x-dev.

Comment #19

catch

he/him

English

commented 16 September 2007 at 10:51

Version:	7.x-dev	» 6.x-dev
Category:	feature	» bug

Well I've had the url filter mess with urls created by the footnotes filter before (i.e. within a href), so I think this counts as a bug. I'm tentatively moving it back to 6.x-dev since other bugs will get fixed before rc, and changing it to a bug report since that's really what it is.

Sorry no time to review this morning.

Comment #20

hingo commented 16 September 2007 at 11:07

Great! I certainly agree with classifying this as a bug, if there are more than me who will agree.

I was trying to be cautious, since I'm under the impression that the URL filter is "good enough" for many people and some may not agree that what has now become a complete rewrite would be considered for 6.x. But as this thread too shows, there are planty of things the old URL filter does wrong, and this patch should fix all of them.

Comment #21

hass commented 21 September 2007 at 12:34

Status:

Needs review

» Needs work

You fileencoding is wrong (ISO). You must use UTF8 without BOM's. Check your editor, please.

Comment #22

hingo commented 25 September 2007 at 17:54

Status	File	Size
new	filter.module.urlfilter-rewrite.HEAD-0.7.patch	5.88 KB

Sorry about that, I didn't realise there were any characters where that would have mattered. Here is the same patch but now in UTF-8 format.

Comment #23

hingo commented 6 October 2007 at 16:15

Status:

Needs work

» Needs review

Forgot to change status again...

Comment #24

hingo commented 22 October 2007 at 04:17

...

Comment #25

stevenpatz

he/him

commented 22 October 2007 at 13:12

Version:

6.x-dev

» 7.x-dev

Beta 2 is out for 6.x, so I'm moving this to the development queue.

Comment #26

catch

he/him

English

commented 22 October 2007 at 13:48

Title:	Better URL filter in filter module	» URL filter breaks generated href tags
Version:	7.x-dev	» 6.x-dev

Bugs are still being fixed in 6.x, and at least the footnotes module is caught by this bug requiring careful weighting of the various filters. It's possible the patch may be too much of a change at this stage, but it'd be nice to get more reviews before bumping it.

Comment #27

hingo commented 29 January 2008 at 08:35

As someone who doesn't follow the dev mailing list, I'll ask here: Is HEAD already taking in new patches or is activity still focused on D6?

Comment #28

catch

he/him

English

commented 29 January 2008 at 09:37

hingo, currently 6.x and HEAD are the same thing.

Comment #29

jpsalter commented 5 March 2008 at 10:35

I just ran into this problem in 5.7. Has this thread gone cold? Or, is there a dupe somewhere?

My use case:

www.example.com --> no conversion
www.example.com --> works

Any chance this will make it into 5.x or should I hack core?

Comment #30

hingo commented 5 March 2008 at 17:16

Hi jpsalter

You've found the right place, this is THE thread and it is still active.

6.0 is now out and this wasn't applied. I'm actually wondering myself what the correct Version tag for this would be? 6.0 (since this bug exists in 6.0) or 6.x (where it should be applied) or 6.1 or 7.x?

I doubt it will be applied to 5.x, but you should be pretty safe applying it yourself. From what I can tell the _filter_url() functionality has been untouched through the whole 5.x series to 6.x. Even if the patch wouldn't apply you could still just replace the relevant functions with copy pasting.

Comment #31

jpsalter commented 5 March 2008 at 17:19

Thanks for the confirmation and all your hard work. I hope others recognize it too.

Comment #32

keith.smith commented 5 March 2008 at 17:20

Version:	6.x-dev	» 7.x-dev
Status:	Needs review	» Needs work

This should be fixed in HEAD (now 7.x-dev) and then backported if appropriate.

The patch in #22 has some code style issues with comments (lacks periods at ends of comments, for one).

Comment #33

hingo commented 8 March 2008 at 21:24

Status:

Needs work

» Needs review

Status	File	Size
new	filter.module.urlfilter-rewrite.HEAD-0.8.patch	6.19 KB

Thanks Keith

Old habits die hard... I already had the periods in place, but rewriting parts of the code it seems I introduced periodless comments again.

This patch 0.8
- fixes style issues with comments (periods)
- diff against 7.x-dev (these functions untouched but file changed so line numbers changed from 6.x-dev)

Comment #34

hingo commented 17 March 2008 at 09:09

Could someone please see if there are any more punctuation or style issues with the comments and help get this committed to 7.x.

I first submitted this patch during my last years Summer vacation, and now the snow is melting and the grass is green again, I would be disappointed not to have such a straightforward fix committed before my next Summer holiday due in 3 months...

Comment #35

keith.smith commented 17 March 2008 at 11:16

Status:

Needs review

» Needs work

hingo: Well, since you ask, on a quick read-through, we do use 'HTML' and 'URL' in comments (rather than 'html' and 'url', respectively).

Comment #36

hingo commented 18 March 2008 at 08:02

Status	File	Size
new	filter.module.urlfilter-rewrite.HEAD-0.9.patch	6.07 KB

Thanks Keith

Would you find in yourself enough energy to see if there is still something that needs to be corrected. I'm afraid next you'll tell me that www should also be uppercase? Speaking of which, is there an official style guide where I could get all this advice in one shot?

v0.9
* style: uppercase HTML and URL

Comment #37

hass commented 18 March 2008 at 14:41

http://drupal.org/coding-standards is a good starting point and try coder module... it helps, too.

Comment #38

hingo commented 18 March 2008 at 20:21

Thanks. Looking at it now I realise I already read through it last summer. But it isn't as detailed as some of the advise I got here, which is more like grammatical stuff almost. *Which I'm completely ok with*, btw, it just seems there is an additional "oral tradition" to style issues at Drupal, so it seems I will have to continue relying on the grace of core developers giving some feedback every once in a while.

Comment #39

keith.smith commented 18 March 2008 at 20:28

hingo: Some of this is contained in pages similar to the coding-standards page (but that are more general in scope), like http://drupal.org/about/authoring.

Comment #40

hingo commented 20 March 2008 at 06:32

Ah, Thanks Kev. Had not seen that one.

In any case, we are digressing now a bit from the main question? What should yet be done so that this can be committed to 7.x?

Comment #41

hingo commented 3 April 2008 at 06:31

Status:

Needs work

» Needs review

Comment #42

floretan commented 3 April 2008 at 20:15

@hingo: Can you provide some sample content with URLs, some of which are inappropriately processed by the current filter, so that the efficiency of this patch can be tested?

Comment #43

hingo commented 4 April 2008 at 11:35

Status	File	Size
new	UrlFilterTestCases.txt	2.08 KB

Attached is my most developed "test case". The patch works good with all of these. The current url-filter fails at least:
* urls inside a dl list (not processed, should be)
* urls in a title attribute in a link (are processed, shouldn't)
* urls in html comments (are processed, shouldn't)
* urls inside some other html tags also are not processed but should
(This is out of my memory, I haven't used the official urlfilter in a while :-)

Comment #44

hingo commented 20 April 2008 at 21:59

ping

Comment #45

floretan commented 21 April 2008 at 07:41

Issue #249200: Test individual filters. adds some test cases that show the failure of the URL filter in the cases mentioned in #43. Please review and extend those tests, and it will help to get a revised URL filter into core.

Comment #46

floretan commented 5 May 2008 at 23:03

Status	File	Size
new	urlfilter-161217-46.patch	6.3 KB

Re-rolled patch from #36.

The tests written in #249200: Test individual filters. have been committed. 12 assertions fail with the current link detection algorithm, but all of them pass successfully with this patch.

Comment #47

hingo commented 16 May 2008 at 15:24

Status:

Needs review

» Reviewed & tested by the community

Thanks, flobruit. I'm immediately going to take the benefit of not being the submitter of the last patch anymore and put a reviewer hat on. Let's see if we can finally get this committed!

/me reviewing... Looks good!

We've had several people looking at this without complaints for months now, and now we also have the unit tests committed in filter.test that show the improvement over the existing url filter. I vote for committing the patch by flobruit in #46.

Comment #48

catch

he/him

English

commented 4 June 2008 at 10:45

Still applies with offset, and fixes those failures in filter.test (although there's some other unrelated ones).

Comment #49

catch

he/him

English

commented 5 June 2008 at 20:34

The other failures in filter.test are dealt with here: http://drupal.org/node/266539

Comment #50

sun

German

Karlsruhe

commented 9 June 2008 at 20:04

Status:

Reviewed & tested by the community

» Needs work

Please remove line-break here:

+    // Note: PHP ensures the array consists of alternating delimiters and literals
+    // and begins and ends with a literal (inserting NULL as required).
+
+    // If an ignoretag is found, it is stored here and removed when the 
+    // closing tag is found. Until then no replacements are made.
+    // Think of this as a stack that always has 0 or 1 items.

Please read coding standards again:

+  for ( $task = 1; $task <= 3; $task++ ) {
+    for ( $i = 0; $i < count( $chunks ); $i++ ) {
+        if ( $opentag == '' ) {

Please improve these comments:

+      // Even/odd = text/tag.
+      if ( $i % 2 == 0 ) { 
+        // Text...
+
+        // Only do replacements when there are no unclosed ignoretags.
+      else { 
+        // Tag...

Please insert a newline between breaks and cases inside of switch statements:

+              break;
+            case 2:

$task seems not to be defined and poorly named.

+  for ( $task = 1; $task <= 3; $task++ ) {

Please improve these comments:

+          // No open "ignoretag".
+          // Is it an ignoretag? (opening tag.)

We do not use such "reminders" in Drupal: (if that's needed, the function body is likely too large)

+        } // if ( $opentag == '' )
+      } // if ( $i % 2 )
+    } // for chunks

...probably more, but please fix those first.

Comment #51

hingo commented 9 June 2008 at 20:58

Status:

Needs work

» Needs review

Wow, this looks like a real review! I have to admit I had given up hope...

Sun,

Please remove line-break here:

+    // Note: PHP ensures the array consists of alternating delimiters and literals
+    // and begins and ends with a literal (inserting NULL as required).
+
+    // If an ignoretag is found, it is stored here and removed when the
+    // closing tag is found. Until then no replacements are made.
+    // Think of this as a stack that always has 0 or 1 items.

Please don't! Those are not the same comment. The 2 first lines refer to the previous line, while the three last ones to the following line. (And since next you are going to say that comments cannot appear after the statement they are a comment on, I might as well add right now that this code is a copy paste from existing code in Drupal core and is like that in 4 other places too.)

Please read coding standards again:

This is a rather unhelpful comment (I'm being polite here) and no, I won't read them again. If there is a style issue, say what it is.

We do not use such "reminders" in Drupal: (if that's needed, the function body is likely too large)

Sure, they can be removed. (This must be the first time anyone complained of there being too many comments in a patch I wrote :-)

...probably more, but please fix those first.

Sorry, this won't work. We've had several people now flying in and "helpfully" pointing out punctuation, capitalisation and whatnot issues. I'm not interested in that game anymore. This patch fixes a known bug in Drupal which several people have complained about. Since September 2007 (that's 9 months, I have made a baby in less!) there hasn't been any code updates, just grammar/style discussion. If we get a list of the style/comment issues that should be fixed we will fix them. But at least for my part I'm not gonna participate in a never ending game of "fix these first" reviews. This is a short patch, it is not too hard for you to give a finite list of problems and we'll fix them and then the patch can be committed.

Thanks.

(FWIW, your comments I didn't comment on are perfectly good comments and I'll be happy to improve on those together with the complete list of remarks that I hope to get soon. I also want to point out that it's obviously not your fault that we've already had too many unhelpful helpful reviewers pass by without any progress, you just happened unluckily to be the last one - in fact your review was the best so far.)

Comment #52

sun

German

Karlsruhe

commented 9 June 2008 at 21:46

Status:

Needs review

» Needs work

We've had several people now flying in and "helpfully" pointing out punctuation, capitalisation and whatnot issues. I'm not interested in that game anymore. This patch fixes a known bug in Drupal which several people have complained about.

Hrm. This is called peer review. We, the community, have advanced in this area a lot in the past. No patch will hit core that contains coding-style issues or any other logical/syntactical weaks. If we would slip such patches in, Drupal core would quickly look like err... tryin' hard to avoid names of other CMS. Please note that I have not tested your patch, just reviewed the code. Before any other core or core-alike developer will stumble upon this issue, you should really try hard to make this patch ready to be committed. That means, your patch should include no coding-style or coding-standards issues, and it must stand the objections of performance-related freaks that are yet to come.

I wouldn't have noted the following if it wasn't contained right in the first code example snippet of the first section of Drupal's coding standards:
Please read coding standards again:

-  for ( $task = 1; $task <= 3; $task++ ) {
+  for ($task = 1; $task <= 3; $task++) {
-    for ( $i = 0; $i < count( $chunks ); $i++ ) {
+    for ($i = 0; $i < count($chunks); $i++) {
-        if ( $opentag == '' ) {
+        if ($opentag == '') {

There's no patch attached.

Comment #53

hingo commented 9 June 2008 at 22:11

Status:

Needs work

» Needs review

Sun, I have nothing against peer review. We have been waiting for soon a year now for it to happen, and for nine months this patch has been *nothing but* waiting for peer review to happen. So I'm not objecting to the issues you pointed out, I'm objecting to you saying that there are more that you didn't point out! Please do so.

I'm moving this back to Code Needs Review because that's what we really need here. When we know that we have a final list of issues to correct, I'll correct them.

(Thanks for clarification on that one point.)

Comment #54

jpsalter commented 9 June 2008 at 22:44

A search-n-replace of '( ' => ')' and ' ) => ')' cleans up the spaces in functions.

There are a few comments left that do not begin with a capital letter and end with a period (mostly comments that end loops:

+ } // if ($opentag == '')
+ } // if ($i % 2)
+ } // for chunks

I'm not sure of the Drupal policy of closing } having comments. Maybe it is easier to remove them.

I've been struggling to get my code to meet Drupal's conventions. It is hard to undo years of coding habits. But, after taking the pain on a module or two -- it is getting easier.

Thanks for keeping at this.
Hope this helps.

Comment #55

catch

he/him

English

commented 10 June 2008 at 00:18

@hingo: there are over 270 patches in the queue against D7 that need review, some of those patches have been around a lot longer than this one, and been through many, many more re-rolls, even for comment style issues.

I didn't look at the patch yet, just ran it through tests, but will do so tomorrow.

Comment #56

hingo commented 10 June 2008 at 06:47

jpsalter, catch: Thanks.

@catch: Sure, and I understand this is how projects like Drupal works, and I'm not in a hurry, for myself I can apply the patch anyway and if no-one else cares then it is ok to focus on higher priority issues.

What is unproductive imho, is that there are people passing by, pretending to do a review when it's just really very very superficial. So for instance in this last instance, 1) the reviewer himself says that his review is incomplete and he may come back for more comments later and 2) he later also says "I didn't even look at the code yet" (which I didn't want to unnecessarily point out, but was also apparent from the first comments, sorry to say). From my experience if those are fixed, then for 2 months it will be quiet again, then someone else will pop by saying "Here are a few more comments but this is not all..." And at no point in these reviews are we really moving closer to actually committing anything.

So by all means, feel free to leave even more such comments, but there is no need to change the Status to "code needs work" when you know yourself you didn't really do a proper review. What this patch needs is more review, and when we've had a proper review it's just 1 minutes work to incorporate all the issues also from the partial reviews, such as those made by Sun.

So what I'm waiting for now is for someone to say: "I have really looked at the code *and* comments and here are *all the issues as far as I can see*." Even better if we get someone to say: "After fixing these I can (or I will propose) commit this patch."

Comment #62

catch

he/him

English

commented 10 June 2008 at 09:44

Status	File	Size
new	urlfilteragain.patch	6.29 KB

Attachment problems again. This should get us at least part of the way with the code style issues.

Comment #63

floretan commented 10 June 2008 at 11:04

Status	File	Size
new	url_filter-161217-63.patch	7.75 KB

// Pass length to regexp callback.
_filter_url_trim(NULL, variable_get('filter_url_length_' . $format, 72));

Why pass a drupal variable as a parameter to _filter_url_trim()? We can simply access that variable from inside the function.

$chunks = preg_split('/(<.+?>)/i', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
// Note: PHP ensures the array consists of alternating delimiters and literals
// and begins and ends with a literal (inserting $null as required).

This comment was copied from _filter_htmlcorrector(). $null should be NULL.
Also, when positioned after the line of code, the comment is not really clear. What array?
I suggest this variant (the same comment appears in other functions in filter.module, I changed those as well):

$chunks = preg_split('/(<.+?>)/i', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
// Note: The return value of preg_split() consists of an array of 
// alternating delimiters and literals that begins and ends with a 
// literal (inserting NULL as required).

In most of core, drupal uses / or ! to delimit regular expressions. The use of ` isn't wrong, but we might as well make it consistent with the rest of the code.

Finally, a PHP-esque

foreach (range(1, 3) as $task) {

would look better in this context than the C++esque

for ($task = 1; $task <= 3; $task++) {

Comment #64

catch

he/him

English

commented 10 June 2008 at 11:25

Status	File	Size
new	url_filter-161217-64.patch	7.75 KB

@flobruit: those changes all look good.

Found a couple more stray spaces.

Comment #65

chx commented 24 June 2008 at 20:30

Status	File	Size
new	urlfilter.patch	7.63 KB

I have removed a bit of code duplication and simplified a bit.

Comment #66

catch

he/him

English

commented 24 June 2008 at 20:41

Priority:	Normal	» Critical
Status:	Needs review	» Reviewed & tested by the community

Much tidier, and it still fixes all the failures in filter.test

Since this is a blocker to all tests passing, marking to critical.

Comment #67

catch

he/him

English

commented 24 June 2008 at 21:09

Status:

Reviewed & tested by the community

» Needs work

Good old fashioned clicking and posting found:

* notice: Undefined offset: 1 in /modules/filter/filter.module on line 771.

Comment #68

hingo commented 25 June 2008 at 22:16

Status:

Needs work

» Needs review

Status	File	Size
new	urlfilteragain_0.patch	6.29 KB

Guys, I'm sorry to say, but this "cleanup work" has broken quite many things. It was nice to see so many genuinly helpful inputs and I'm sorry I wasn't able to follow up instantly on changes you made. However, looking at this patch now, there are several things that have been broken, you ahve each contributed some of it. To simplify this, could we just go back to #62 which is the last working patch as far as I can see.

Things broken in the new patch:

* A crucial } else { has inexplicably disappeared, this is a major bug of course. (missing at line 77 in patch).

* Removing $length parameter from _filter_url_trim() has broken that function ($format is now undefined). Probably buggy at least for long URL's?

Further, I strongly disagree with moving the fetching of the length parameter into the function in the first place. _filter_url_trim() is a generic utility function which should just take an input and return an output, it shouldn't have any drupal stuff inside of it. Utility functions should be generic, fetching a drupal variable from db belongs to the main code.

* I don't agree that moving the patterns into an array and away from their actual regexps calls help readability at all. If you want to be clever, go all the way and make the switch-case statement go away completely (email task needs to be modified to use callback too), that would make code shorter and at least arguably easier to read.

* Minor: in the callback functions I find the new comment "+ // Find the parenthesis in the regexp containing the URL." misleading, since this is not about finding anything. My original comment was just supposed to be a description of what variable $i is, not any actual action.

***

So, back to main topic, attached is the patch from #62 which we need to get back to. This contains the whitespace fixes but none of the restructuring of code which all introduces bugs. (It contains the new comment in the callback functions I don't like as listed above in the last minor point, so that could be reverted in the next edit.)

I'm switching back to code needs review and we can continue discussion from where it was at #62. (ie, shouldn't be any bugs, you may have style opinions you want to re-introduce...)

I'm thankful for more help on this patch, but it is not tolerable to introduce real bugs while discussing style issues. Please be more careful and remember that you are editing a working patch.

Comment #69

chx commented 28 June 2008 at 08:01

hingo, my patch removed the whole if ($i % 2 == 0) business because I found it ugly. It's not just an else disappeared -- I replace that with an $i++. I am not debating merely code style here. I was thinking on how to remove the switch, but it never came to me to replace that with a preg_replace_callback. Honestly: good idea. (Sigh, I wish we had lambda functions, would be useful here) What bugs did I introduce?

Can you reroll with the if ($i % 2 == 0) and the switch removed?

Edit: I agree wholeheartedly that introducing bugs into working patches is very frustrating. However, I am not introducing new features here, but trying to get a better code (in my eyes, better, at least). Really sorry if I broke your patch -- I know how it feels.

Comment #70

hingo commented 29 June 2008 at 20:10

Status:

Needs review

» Needs work

@chx: I tested your patch, with the particular input I had in mind. You're almost right, cause you didn't break what I thought. However, you are to blame for the 3 new notices we get:

# notice: Undefined offset: 13 in /var/www/drupal7/modules/filter/filter.module on line 771.
# notice: Undefined offset: 17 in /var/www/drupal7/modules/filter/filter.module on line 771.
# notice: Undefined offset: 21 in /var/www/drupal7/modules/filter/filter.module on line 771.

(The fourth notice is because of the _filter_url_trim() change introducing a bug.)

This is because the $chunks array always ends with a text chunk, but in your patch the code always ends with trying to process a tag, so these notices are the PHP equivalent of ArrayIndexOutOfBounds complaints. It's not fatal (output actually is still ok) but it is obviously not acceptable either. In a more general description: Your trick doesn't work if there is an odd number of array elements, and in this case there often (always?) is.

I agree that the if($i % 2 == 0) thing may not be the nicest code either. What do you think about the following:
1) Keep my original code hierarcy
2) Replace my modulo stuff with a variable that takes text values "text" and "tag" instead.
I mean:

// $chunks now contains a series of text and tags (literals and delimiters)
// always beginning with text.
$chunk_type = "text";

....

for ($i = 0; $i < count($chunks); $i++) {
  if($chunk_type == "text") {

  ...

   // Done processing text chunk, so next chunk is a tag.
   $chunk_type == "tag";
  }
  else {

  ...

   // Done processing tag chunk, so next chunk is text.
   $chunk_type == "text";  
  }
}

What do you say? Btw, this introduces an "unnecessary" variable just for readability, is that considered good or bad in the Drupal community?

***

Going forward: I need to go to sleep now and will be on a business trip the whole next week. If nothing happens during that time, I will update the patch with the below changes. If someone is still bold enough, feel free to do these updates and continue the debate, then I'll come back a week from now and tell you what I think again.

FWIW, I've been a little cranky here earlier, I am in fact happy now that the last contributions have been addressing code and other substance, not just whitespace and grammar. (Note to self: check whitespace when done, I'd hate to be caught a fourth time for it!)

TODO, wrt patch in #68:

1) Agree with chx (and others) how to replace if($i % 2 ... and replace it

2) Make the switch statement go away completely. Email regexp processing needs to be made into using a callback function too, and then use the name of each callback function as $task (See $patterns array in#65). (This is another "clever" thing I'm still lukewarm about, but it will make code significantly shorter and less indented, so if we do it with good comments I hope it will increase readability. I really never liked that switch, have to admit.)

3) Do NOT make any changes to _filter_url_trim().

4) Revert the changed comment inside the callback functions so that it is a comment introducing variable $i. If there is need to comment further on what the function is doing, do that too.

5) Double check, but I believe all issues in #50 have been addressed. (Either included or rejected.)

No patch attached, I'll be back in a week (plus perhaps a few days).

Comment #71

hingo commented 7 July 2008 at 21:09

Status:

Needs work

» Needs review

Status	File	Size
new	urlfilter-71.patch	7.41 KB

Attached patch implements changes as discussed in comment #70. There are no outstanding issues so patch is ready to be reviewed and committed.

Comment #72

catch

he/him

English

commented 7 July 2008 at 21:40

All filter tests pass, no notices. Not looked at code changes.

Comment #73

hingo commented 10 July 2008 at 22:52

Just a short note to everyone following this thread. I have today committed the above version of URL filter into the Footnotes module, of which I'm the maintainer. You can now easily install and use a working version of URL filter by

Download Footnotes 6.x-2.x-dev snapshot (it is safe, there is no wild development going on.
Activate the Footnotes module
Replace "URL filter" with "Better URL filter" everywhere
Note: You can take advantage of the Better URL filter whether you use Footnotes or not.
Note: Even though this is a Drupal 6.x module, it can be installed into earlier versions too.

Please see #281143: Better URL filter now available within Footnotes for more info. Please comment on Better URL filter in that thread and not here!

I hope this will resolve the standstill we have here and users who need a working URL filter can finally have it without letting this discussion hold them hostage for another year or two.

Comment #74

cwgordon7 commented 5 August 2008 at 06:32

Status:

Needs review

» Reviewed & tested by the community

Status	File	Size
new	locale_confirm_280628_7x_13.patch	7.47 KB

This causes core filters tests to pass, well done. I've looked through the code, and all looks good, except for one thing:

-  // Use +3 for '...' string length.
-  if (strlen($text) > $_length + 3) {
+  if (strlen($text) > $_length) {

That makes me very very very unhappy. I don't have the heart, though, to mark it back to cnw, so rerolled (just those lines changed) and happily rtbc'ing instead. :) Please don't credit me on commit.

Comment #75

cwgordon7 commented 5 August 2008 at 06:34

Please excuse mislabeled patch in previous comment. It is the correct patch, only mislabeled. :)

Comment #76

catch

he/him

English

commented 5 August 2008 at 08:19

I've patched and run tests with this, confirmed RTBC.

Comment #77

hingo commented 5 August 2008 at 12:09

@cwgordon

I originally removed the +3 since I don't understand the logic of it. If anything, it should be -3.

Now, if user has set max length for URL's to be 72 (for instance), then we should cut all URL's that are longer than 72, not longer than 75. Even so, this is not a major issue for me, so I'm letting your patch stand. I'm just saying I don't understand it and I don't see it doing what the comment says it is doing.

Comment #78

catch

he/him

English

commented 5 August 2008 at 18:16

Title:

URL filter breaks generated href tags

» URL filter breaks generated href tags (and filter.test)

;)

Comment #79

cwgordon7 commented 8 August 2008 at 22:47

Bump.

Comment #80

dmitrig01 commented 9 August 2008 at 15:17

Status	File	Size
new	urlfilter-80.patch	8 KB

don't smashtogether variable names. This patch moves comments around, removes windows linebrakes, and renames some variables. no actual logic has changed.

Comment #81

catch

he/him

English

commented 14 August 2008 at 23:18

Status:

Reviewed & tested by the community

» Needs work

Patch adds an extra _filter_url_parse_full_links() rather than replacing it.

Comment #82

catch

he/him

English

commented 15 August 2008 at 08:17

#260484: Regression: Filter tests fail was committed, so those tests need to be rolled in with this patch.

Comment #83

hingo commented 15 August 2008 at 08:53

Indeed, patch #80 doesn't seem to remove much lines at all, same problem also with the other *_parse_*_links() functions for starters. Very unhelpful.

As I'm the original proponent of this patch, I guess it is fair to announce publicly that what happened in #260484: Regression: Filter tests fail made me loose whatever little faith I had left in this project and I don't intend to lift another finger working on this patch anymore.

(The good news is I will release version 2.0 of Footnotes during the weekend and users can just enable a working URL filter (this patch) from there instead, as it will be included.)

Comment #84

catch

he/him

English

commented 15 August 2008 at 09:01

Status:

Needs work

» Needs review

Status	File	Size
new	url_filter.patch	17.26 KB

I've gone through and re-rolled #74 with some of the comment and variable naming improvements in #80 + the filter tests themselves. Back to needs review so we won't just be pretending we have no bugs in core :p.

Comment #85

hass commented 15 August 2008 at 11:45

Why are you not using "example.com" for tests?

Comment #86

catch

he/him

English

commented 15 August 2008 at 12:18

Status	File	Size
new	url_filter.patch	16.79 KB

These tests were just rolled back from core this morning, I didn't change anything when rolling into this patch. Either way, fixed.

Comment #87

catch

he/him

English

commented 15 August 2008 at 12:25

Status	File	Size
new	url_filter.patch	16.8 KB

Missed some.

Comment #88

catch

he/him

English

commented 16 August 2008 at 00:59

doppelpost

Comment #89

catch

he/him

English

commented 16 August 2008 at 00:59

Status	File	Size
new	url_filter.patch	16.96 KB

Whoops: http://drupal.org/node/260484#comment-966252

re-rolled to get this back to a functioning state, although it'd be nice to clean this up later so it's not using non-legit domain names.

Comment #90

damien tournoud commented 16 August 2008 at 09:22

Assigned:	hingo	» damien tournoud
Status:	Needs review	» Needs work

I'm tired of this, mine for a few hours.

Comment #91

catch

he/him

English

commented 14 November 2008 at 09:56

Status:

Needs work

» Needs review

Status	File	Size
new	url_filter.patch	15.92 KB

Re-rolled #89 and marking back to needs review since it still passes its own tests. We should either apply this, or get expectedFail() in. More elegant solutions can come later, leaving stuff broken in core when there's a working patch with tests for this long is bad.

Comment #92

16 November 2008 at 22:20

Status:

Needs review

» Needs work

The last submitted patch failed testing.

Comment #93

lilou commented 17 November 2008 at 04:43

Status:

Needs work

» Needs review

Test failure : #335122: Test clean HEAD after every commit

Comment #94

9 March 2009 at 12:30

Status:

Needs review

» Needs work

The last submitted patch failed testing.

Comment #95

webchick

she/they

English

Vancouver 🇨🇦

commented 28 June 2009 at 11:14

Status:

Needs work

» Needs review

Oh you silly, silly test bot.

Also, wow. Somehow this managed to stay off my radar for a very long time. I guess my RTBC queue sprints and this patch's CNW statuses were not well-timed at all. Really sorry for that, hingo. :( You might have better luck asking for reviews in IRC? There's usually a group of us hanging out and talking D7 at all hours of the day.

Comment #96

catch

he/him

English

commented 28 June 2009 at 11:30

Status:

Needs review

» Reviewed & tested by the community

Marking back to RTBC per #74 and pending non-broken test bot. There's been no significant changes to the patch since then apart from re-rolls if you don't count going around in circles.

Comment #97

webchick

she/they

English

Vancouver 🇨🇦

commented 28 June 2009 at 13:09

Status:

Reviewed & tested by the community

» Needs work

This no longer applies after committing #276597: TestingParty08: filter.module. Some tests for the URL Filter were added as a part of that patch, so the tests here will have to be re-worked slightly to fit in there. Sorry for that, but I had to make a judgment call and XSS filter testing seemed pretty important.

Here's the format for assertions used in the other patch:

+    $f = _filter_url('http://www.example.com/', 'f');
+    $this->assertEqual($f, '<a href="http://www.example.com/" title="http://www.example.com/">http://www.example.com/</a>', t('Converting URLs.'));
+
+    $f = _filter_url('http://www.example.com/?a=1&b=2', 'f');
+    $this->assertEqual($f, '<a href="http://www.example.com/?a=1&amp;b=2" title="http://www.example.com/?a=1&amp;b=2">http://www.example.com/?a=1&amp;b=2</a>', t('Converting URLs -- ampersands.'));
+

We should make this patch fit that as well. Having the entire contents of the test string in HEREDOC syntax is very odd. This exercise will also help reviewers to understand the specific issues being fixed by this patch, since right now it's not very clear.

Unsurprisingly ;), the patch itself conforms well to coding standards, and is very well documented, etc. I'm not positive about the logic therein, but if the tests prove it works, we can probably just commit it as is unless someone comes up with any alarming reasons why not to. Although before that, it would be good to have some baseline benchmarks just to ensure we're not increasing the time spent in this function by 400%.

Comment #98

sun

German

Karlsruhe

commented 28 June 2009 at 17:47

Only nitpick:

 /**
- * URL filter. Automatically converts text web addresses (URLs, e-mail addresses,
- * ftp links, etc.) into hyperlinks.
+ * URL filter. Automatically converts text web addresses (URLs, e-mail
+ * addresses, ftp links, etc.) into hyperlinks.

PHPDoc and comments should wrap at 80 chars - except the PHPDoc summary.

Comment #99

catch

he/him

English

commented 28 June 2009 at 21:26

Status:

Needs work

» Needs review

Status	File	Size
new	filter_url.patch	8.96 KB

This fixes the comment issue sun noticed, and adds two of wrwrwr's failing tests from http://drupal.org/node/276597#comment-1196760

May or may not get to looking at adding more assertions.

Comment #100

29 June 2009 at 01:22

Status:

Needs review

» Needs work

The last submitted patch failed testing.

Comment #101

webchick

she/they

English

Vancouver 🇨🇦

commented 14 July 2009 at 04:57

Note: Henrik contacted me off-list and noted that one real advantage of the HEREDOC syntax is that you can test each of these conditions on "real world" data, rather than in isolation. I think that's a really good point.

Probably what makes the most sense though rather than embedding this data within the test itself is making a separate .html file in the modules/filter/tests/ folder that we can continue to add to as we come up with other crazy ways to break various filters.

Comment #102

sun

German

Karlsruhe

commented 7 November 2009 at 04:45

Would be awesome to get this finally done for D7.

The list of tags to exclude should also contain PRE.

Comment #103

catch

he/him

English

commented 29 January 2010 at 18:26

Status	File	Size
new	filter_url.patch	7.65 KB

Untested re-roll.

Comment #104

catch

he/him

English

commented 29 January 2010 at 18:30

Status:

Needs work

» Needs review

Just to see how tests look, not really CNR 'cos this will need some tests adding back in.

Comment #105

29 January 2010 at 18:50

Status:

Needs review

» Needs work

The last submitted patch, filter_url.patch, failed testing.

Comment #106

sun

German

Karlsruhe

commented 9 February 2010 at 16:33

Assigned:

damien tournoud

» Unassigned

+++ modules/filter/filter.module	29 Jan 2010 18:25:47 -0000
@@ -1058,24 +1058,100 @@ function _filter_url_settings($form, &$f
+  // List of tags - the content of which must be skipped.
+  $ignore_tags = 'a|script|style|code';

We also need to skip HTML comments.

Powered by Dreditor.

Comment #107

hingo commented 9 February 2010 at 19:11

HTML comments are properly skipped. Not because of that line, but because of line 70 in the most recent patch.

+    // Split at all tags.
+    // This ensures that nothing that is a tagname or attribute will be processed.
+    $chunks = preg_split('/(<.+?>)/i', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

It could be added to the preceding code comment of course, if wanting to be pedantic, but it does also exclude html comments.

Comment #108

kenorb commented 2 April 2011 at 13:12

Title:	URL filter breaks generated href tags	» URL filter breaks generated href tags (and filter.test)
Assigned:	sun	» Unassigned
Priority:	Major	» Critical
Status:	Closed (fixed)	» Needs work

Comment #109

redndahead commented 22 April 2010 at 18:34

Status:

Needs work

» Needs review

Status	File	Size
new	161217-filter_url-2.patch	8.78 KB

This patch fixes the failing tests, plus this doesn't seem right. I don't see this patch adding a variable and one doesn't exist at the moment. I have changed it. Tests now pass.

+++ modules/filter/filter.module	29 Jan 2010 18:25:47 -0000
@@ -1058,24 +1058,100 @@ function _filter_url_settings($form, &$f
+  _filter_url_trim(NULL, variable_get('filter_url_length_' . $filter->settings['filter_url_length'], 72));

Powered by Dreditor.

Comment #110

sun

German

Karlsruhe

commented 22 April 2010 at 18:53

Title:	URL filter breaks generated href tags (and filter.test)	» URL filter breaks generated href tags
Priority:	Critical	» Normal
Status:	Needs review	» Needs work

Note that there are quite a few other bug fix patches for URL filter in the queue. Looking at the changes of this patch, it may be worth to tackle the other issues first. Just an idea.

Also, this issue was bumped to critical in 2008. Filter tests pass in the meantime, so this is just a regular bug.

+++ modules/filter/filter.module	22 Apr 2010 18:31:36 -0000
@@ -1333,24 +1333,100 @@ function _filter_url_settings($form, &$f
+  // List of tags - the content of which must be skipped.
+  $ignore_tags = 'a|script|style|code';

Still seems to be missing PRE.

+++ modules/filter/filter.module	22 Apr 2010 18:31:36 -0000
@@ -1333,24 +1333,100 @@ function _filter_url_settings($form, &$f
+  // is to return the replacement for the match. The array is used and ¶

Trailing white-space.

+++ modules/filter/filter.module	22 Apr 2010 18:31:36 -0000
@@ -1333,24 +1333,100 @@ function _filter_url_settings($form, &$f
+  // Note: The ICANN seems to be on track towards accepting more diverse top level domains,
+  // so this pattern has been "future-proofed" to allow for TLD's of length 2-64.
...
+    // This ensures that nothing that is a tagname or attribute will be processed.
...
+    // Note: PHP ensures the array consists of alternating delimiters and literals
...
+    // closing tag is found. Until the closing tag is found, no replacements are made.
...
+          // There is an $ignore_tag open. See if this is a matching closing tag.

Exceeds 80 chars.

94 critical left. Go review some!

Comment #111

redndahead commented 22 April 2010 at 19:28

Status:

Needs work

» Needs review

Status	File	Size
new	161217-filter_url-3.patch	8.82 KB

This integrates sun's comments.

Comment #112

hingo commented 22 April 2010 at 21:08

@sun

Note that this patch will fix almost half of the other URL filter bugs since most of them are just various symptoms of what this patch fixes. So it certainly makes sense to take this first, then see how many other bugs you can immediately close. #190466, #168227, #191744, #728380, #555280... from the first page of the list should all go away. (This is also the reason this was marked critical, the old URL filter fails in quite many ways and this addresses the root problem.)

For tests passing, it is because tests were removed, not because bugs were fixed. Add back the tests from comment #91 and you'll see a different picture. (How critical the issues are is of course another thing, but as you can see, there are quite many bugs filed against URL filter that are various permutations of this one.)

PRE tag: Personally, I don't see a reason why URL's inside a PRE tag couldn't be made clickable, but don't have a strong opinion either way. I agree that a strict interpretation of "preformatted" means there should be no other formatting inside the tag.

Only thing to still change:

Please use the tests from #91. The current "one liner" tests will not actually catch many of the bugs this patch addresses. It is important to test on data that contains multiple strings to make clickable, and multiple rows, and multiple paragraphs. If HEREDOC syntax is frowned upon, please separate the big test data chunk into a separate file and use it from there.

On a quick reading the last committed patch is good and would pass all tests from #91, so committing it is ok. But since the tests were written, I wouldn't mind adding them to the test harness too. This seems to be an area where some contributors may easily break things.

PS: I'm glad to see this patch is finally getting some serious attention. I'm happy to help if there are any questions. Unfortunately I have been affected by the European flight chaos, and will now embark on a 5 day trip back home, but will be back online mid next week.

Comment #113

klausi

he/him||they/them

German

🇦🇹 Vienna

commented 22 April 2010 at 22:20

Status	File	Size
new	161217-url-filter.patch	17.3 KB

Same patch as in #111 plus reused tests from #91.

Comment #114

David Latapie commented 26 April 2010 at 02:41

How would your patch compare to HTML Purifier? It lloks like we have at least three different tools to do the same thing (core URL filter, forked URL filter, HTML Purifier module). And I see no advantage in this.

Thanks in advance for your answer.

Comment #115

hingo commented 27 April 2010 at 23:59

(This is based on a quick read of HTML Purifier code, not actually installing it.)

HTML Purify mainly does the same job as HTML filter (making HTML safe). But indeed, it contains also the "Linkify" class as some kind of addon service.

In general HTML Purify seems to be a rather complex beast comparing with Drupal's very simple core modules for doing the same thing. Also it is (probably?) based on actually validating the HTML with an HTML parser (correct, but slower), whereas Drupal filters tend to just use some regexps to get their job done (good enough if done right, faster).

As for the URL filter vs Linkify comparison, it is surprising to note that my patch here addresses more URI types and the Linkify class in HTML Purifier turns out to be rather simple. In fact also the original URL filter in Drupal core does more than the one in HTML Purify. This can probably be explained by the fact that this is not the main task of HTML Purify, so the implementation has been left as rather simplistic.

HTML Purify will only convert into links strings beginning with
http://|https://|ftp://

This patch is more complete and will catch all of:
http://|https://|ftp://|mailto:|smb://|afp://|file://|gopher://|news://|ssl://|sslv2://|sslv3://|tls://|tcp://|udp://

Also the original Drupal URL filter was more versatile than the one in HTML purify.

In addition both this patch and the old URL filter also convert email addresses (...@...com) and "plain" URL's without the "http://" (www.example.com or ftp.example.com).

In summary, Drupal's URL filter does more than the one in HTML Purify, and is (very probably) simpler and faster.

(And to those just tuning in: there is no difference with this patch vs the forked URL filter, they are the same. The difference between this patch and the old Drupal URL filter is that the old filter is rather buggy, it will miss many URL's that should be converted (especially if other HTML is involved) and it will mangle some places it shouldn't touch.)

Comment #116

hingo commented 10 August 2010 at 22:36

Priority:

Normal

» Critical

I didn't notice earlier that this was downgraded from critical in #110. As I explain in comment #112, the statement that "tests pass" is not correct and the argument that there are many other bugs in the queue is questionable too, since this patch will fix about half of those bugs! I apologize for not setting back to critical at that point, I was stuck in an airport somewhere with bad connectivity.

If there is some other reason not to regard this as critical, feel free to revisit, I'm just saying the earlier downgrade seems to have been done based on wrong assumptions.

Comment #117

redndahead commented 10 August 2010 at 23:18

#113: 161217-url-filter.patch queued for re-testing.

Comment #118

10 August 2010 at 23:40

Status:

Needs review

» Needs work

The last submitted patch, 161217-url-filter.patch, failed testing.

Comment #119

sun

German

Karlsruhe

commented 11 August 2010 at 09:57

Priority:

Critical

» Normal

A bug in one of Filter module's filters is very very unlikely to be more important than normal. That is, because you can replace any filter with one of a contributed module, and D7 makes this even easier (e.g., swapping out the processing callback only).

Anyway, let's fix those tests. Will try to do a fresh review asap.

Comment #120

sun

German

Karlsruhe

commented 13 August 2010 at 11:13

Priority:

Normal

» Major

ok, talked to hingo. Given that this patch seems to fix a full range of other open bug reports filed against URL filter currently, promoting to major.

Ideally, we should write (only) tests for those other issues to show that they fail.

+++ modules/filter/filter.module
@@ -1333,24 +1333,104 @@ function _filter_url_settings($form, &$form_state, $filter, $format, $defaults)
+  // Create an array which contains the regexps for each type of link.
...
+  $tasks = NULL;

Why do we initialize a variable with NULL that's supposed to be an array?

+++ modules/filter/filter.module
@@ -1333,24 +1333,104 @@ function _filter_url_settings($form, &$form_state, $filter, $format, $defaults)
   // Match absolute URLs.
...
+  $replacement = "`($url_pattern)([\.\,\?\!]*?)`i";

I don't think we need to escape .,?! here. Also, I wonder why *? instead of just ? ? Lastly, we don't need that matched value, so this should cut it:

$replacement = "`($url_pattern)(?:[.,?!])?`i";

Or am I missing something? If I do, then the code is missing an inline comment.

+++ modules/filter/filter.module
@@ -1359,20 +1439,39 @@ function _filter_url($text, $filter) {
+  return '<a href="' . $match[$i] . '" title="' . $match[$i] . '">' . $caption . '</a>' . $match[$i+1];

IIRC, the (duplicating) title attribute has been removed by another patch in the meantime.

+++ modules/filter/filter.test
@@ -1059,6 +1059,114 @@ class FilterUnitTestCase extends DrupalUnitTestCase {
+$text=<<<END
+Testing wwwstring with period at end www.example1.com. Testing email with period at end person@example2.com. Testing HTTP URL with period at end http://www.example3.com. Also test < code>using www.example4.com the code tag< /code>.
...
+    $this->assertTrue(strpos($filtered_text, 'href="http://www.example1.com"'), t('Parse simple www-string but not the end-of-sentence period.'));

Some phrases in this testing text sufficiently explain what is being tested and expected, but some of them do not. We should make sure that expectations are sufficiently explained.

Alternatively, and I think I'd almost prefer it that way, we could build an array of testing strings and assertions to keep them together, and implode them afterwards, like so:

  $tests = array(
    'Testing wwwstring with period at end www .example1.com.' => array(
      'href="http://www.example1.com"' => TRUE,
      'href="http://www.example1.com."' => FALSE,
      '<a href="http://www.example1.com">http://www.example1.com</a>.' => TRUE,
      '<a href="http://www.example1.com">http://www.example1.com.</a>' => FALSE,
    ),
    // ...
  );
  $string = implode(array_keys($tests), "\n");
  // ...
  foreach ($tests as $tasks) {
    foreach ($tasks as $value => $expected) {
      // Not using assertIdentical, since combination with strpos() is hard to grok.
      if ($expected) {
        $this->assertTrue(strpos($string, $value) !== FALSE, t('@value found in @string.', array(
          '@value' => var_export($value, TRUE),
          '@string' => var_export($string, TRUE),
        )));
      }
      else {
        $this->assertTrue(strpos($string, $value) === FALSE, t('@value not found in @string.', array(
          '@value' => var_export($value, TRUE),
          '@string' => var_export($string, TRUE),
        )));
      }
    }
  }

As you can see, this allows to cleanly define a single test string and test multiple expectations. For the single period (punctuation) test string, we can assert at least 4 things. Larger/multiline testing strings are also no problem.

It might make sense to put the actual testing code into a helper method. So we can build separate $test task lists for individual filter functionality, i.e. $test_absolute, $test_mail, $test_partial, or rather, any separation that would make sense. Technically, I think one could separate by topics characters/encoding, surrounding markup, security, and potentially more - splitting assertions up into such sections probably makes more sense.

In a separate follow-up issue, we should convert most of the existing filter unit tests to use this construct, as we have a lot of awkward assertion messages there.

Powered by Dreditor.

Comment #121

hingo commented 13 August 2010 at 18:56

Hi sun, and thanks for the additional review. (Would've forgotten about the title attribute otherwise, is *that* checked by a test? :-)

So just for the public record, we did a thorough review and out of 9 open bugs, following are fixed by this patch:

#168227: URL filter fails on URLs surrounded by HTML entities
#877050: URL filter does not convert a link inside a div
#555280: Filter "Convert URLs into links" converts URLS within inline css background-image property

I can work on this during next week.

A few questions to sun:

1)
It would be easy to also fix
#190466: URL filter fails on square brackets
#728380: URL filter fails on parenthesis (RFC2396 Unreserved Characters; i.e., ( and ))
...should I alter the regexps here to cover entire unicode alphabet, or do you want to do that in a separate issue?

2)
When those other bugs have test cased posted, but they are not in D7 CVS yet, should I copy them into this patch?

I'm willing to do both if you answer yes.

Comment #122

sun

German

Karlsruhe

commented 16 August 2010 at 16:57

Status:

Needs work

» Needs review

Status	File	Size
new	drupal.filter-url.122.patch	17.49 KB

Re-rolled against HEAD.

1) Let's deal with unicode separately. 2) Let's keep those other issues open, get this one in, and revisit the other issues afterwards.

Comment #123

16 August 2010 at 17:40

Status:

Needs review

» Needs work

The last submitted patch, drupal.filter-url.122.patch, failed testing.

Comment #124

sun

German

Karlsruhe

commented 16 August 2010 at 20:31

Status:

Needs work

» Needs review

Status	File	Size
new	drupal.filter-url.123.patch	16.57 KB

Reworked the tests. Clarified some comments.

The remaining test failures are for real.

Comment #125

16 August 2010 at 21:10

Status:

Needs review

» Needs work

The last submitted patch, drupal.filter-url.123.patch, failed testing.

Comment #126

sun

German

Karlsruhe

commented 18 August 2010 at 22:17

Assigned:	Unassigned	» sun
Status:	Needs work	» Needs review

Status	File	Size
new	drupal.filter-url.126.patch	23.98 KB

Fixed those regular expressions. I think this patch is ready to fly now. Anyone available for review + potential RTBC?

The new tests are entirely replacing the old tests now, as they're much more precise. I've taken extra care and double-checked that all of the existing assertions are indeed covered by the new ones.

Also, I had to mostly inline similar changes as already proposed in #257193: URL filter: Use filter_allowed_protocols variable, since my new tests tried to assert that a link using gopher:// gets converted, but apparently, that hard-coded list of protocols was totally outdated, and thus, check_url() replaced those invalid bad protocols with http:// ;)

Comment #127

sun

German

Karlsruhe

commented 18 August 2010 at 22:24

Status	File	Size
new	drupal.filter-url.127.patch	24.14 KB

Additional took over a useful @see and assertion from my original patch over in #257193-5: URL filter: Use filter_allowed_protocols variable

Comment #128

hingo commented 19 August 2010 at 10:30

Hi sun

Good work pushing this forward and getting the patch to apply again. Here's a review. I have errands today, but can work on these myself too if nobody else does it before me.

+  // Prepare protocols pattern for absolute URLs.
+  // check_url() will replace any bad protocols with HTTP, so we need to support
+  // the identical list. While '//' is technically optional for MAILTO only,
+  // we cannot cleanly differ between protocols here without hard-coding MAILTO,
+  // so '//' is optional for all protocols.
+  // @see filter_xss_bad_protocol()
+  $protocols = variable_get('filter_allowed_protocols', array('http', 'https', 'ftp', 'news', 'nntp', 'telnet', 'mailto', 'irc', 'ssh', 'sftp', 'webcal', 'rtsp'));
+  $protocols = implode(':(?://)?|', $protocols) . ':(?://)?';
+

As you note in #257193: URL filter: Use filter_allowed_protocols variable, / and : are valid characters anywhere in the URL. So we could simply omit the addition of (?://)? and allow those characters liberally in the next part of the URL. The current regexp doesn't however allow them. So we would need to back to the plan of allowing most unicode characters anywhere in a URL. It can either be done right now, or we let the above stay as it is and then remember to remove it once we change the other regexp so that the above (?://)? isn't needed anymore.

(For clarity, the above needn't necessarily be fixed in this path.)

+  $domain = '(?:[A-Za-z0-9._+-]+\.)?[A-Za-z]{2,64}\b';
+  $ip = '(?:[0-9]{1,3}\.){3}[0-9]{1,3}';
...
+  $url_pattern = "[A-Za-z0-9._-]+@(?:$domain)";

Good call to add $ip (whoever did it).

Same comment here: All of this will be much simpler if we in the future change this regexp to something that just matches a string of unicode characters until next whitespace. (Nothing to change here.)

From #120 above:

Also, I wonder why *? instead of just ? ? ... Or am I missing something? If I do, then the code is missing an inline comment.

The *? is supposed to catch multiple trailing punctuation such as www.example.com... or (blablabla www.example.com.) or example.com!!! Note that it may not always repeat the same character, eg .) must match too.

+  $pattern = "`($url_pattern)`i";
+  $tasks['_filter_url_parse_email_links'] = $pattern;

It seems the email task doesn't contain that regexp at all so me@example.com. wouldn't be matched correctly. (This omission exists already in my patches and original code.)

+    // Split at all tags; ensures that no tags or attributes are processed.
+    $chunks = preg_split('/(<.+?>)/i', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

The above code correctly ignores also multi-line comments. A separate test for this could be added though, or rather just add \n to the existing test.

That is all for the functional code. I have a meeting now, will also review test code separately (look forward to what you've done :-). Hopefully tonight.

Comment #129

hingo commented 19 August 2010 at 10:33

Heh, funny to watch how drupal.org itself behaves when I give examples of strings that should or shouldn't be converted.

In this case, this comment:

It seems the email task doesn't contain that regexp at all so me@example.com. wouldn't be matched correctly. (This omission exists already in my patches and original code.)

...is wrong, it works correctly for other reasons. Please ignore.

Comment #130

sun

German

Karlsruhe

commented 19 August 2010 at 11:52

Thanks for reviewing, hingo!

/ and : are valid characters anywhere in the URL. So we could simply omit the addition of (?://)? and allow those characters liberally in the next part of the URL. The current regexp doesn't however allow them.
[...]
All of this will be much simpler if we in the future change this regexp to something that just matches a string of unicode characters until next whitespace.

I may be mistaken, but this is not a valid domain name:

$domain = 'foo/bar:baz.com';
$url = 'http://' . $domain;
# var_dump($url);
(string) 'http://foo/bar:baz.com'

We need a separate pattern for $domain, so as to be able to properly match valid domain names and not match invalid domain names. At minimum, there are domain name length restrictions we have to account for, but also, not all characters can be used in all parts of a URL.

For now, if this is debatable, I'd recommend to move detailed discussion about protocols into #257193: URL filter: Use filter_allowed_protocols variable

Good call to add $ip (whoever did it).

Yeah, I had to get those tests passing. ;)

The *? is supposed to catch multiple trailing punctuation such as www.example.com... or (blablabla www.example.com.) or example.com!!!

That's what I figured, too. However, we are not interested in further trailing text; just want to optionally exclude a certain (single) trailing punctuation character from the URL, if there happens to be one.

I agree that e-mails should get that punctuation treatment, too, and we should move the pattern into a new common $punctuation variable.

The above code correctly ignores also multi-line comments. A separate test for this could be added

Good catch! Let's add a separate string test for multi-line comments then. Also not sure whether we should add another containing HTML markup within a comment.

Comment #131

sun

German

Karlsruhe

commented 19 August 2010 at 12:12

Status	File	Size
new	drupal.filter-url.131.patch	24.51 KB

Alright, incorporated all points from #130.

Problem: URLs in multi-line comments are not ignored. @hingo: Can you fix that?

Comment #132

19 August 2010 at 12:50

Status:

Needs review

» Needs work

The last submitted patch, drupal.filter-url.131.patch, failed testing.

Comment #133

hingo commented 20 August 2010 at 16:18

I'm not at my home computer at the moment, so cannot test, but since you are so active on this, a few quickie comments from my head. (...which may be wrong).

Problem: URLs in multi-line comments are not ignored. @hingo: Can you fix that?

I can take a look later, but more importantly I actually tested a multi-line comment before I wrote my previous review and it worked correctly. I only had one newline, is your problem with more than that?

Adding yet another test with html markup in comments is a very good idea. (That may actually break, I realize now! The URL should be after the html...)

The *? is supposed to catch multiple trailing punctuation such as www.example.com... or (blablabla www.example.com.) or example.com!!!

That's what I figured, too. However, we are not interested in further trailing text; just want to optionally exclude a certain (single) trailing punctuation character from the URL, if there happens to be one.

Yes, but now if you have such multicharacter punctuation, it stops on the last punctuation character, not the first? Ie it produces <a href="http://www.example.com..">www.example.com..</a>.

I agree that e-mails should get that punctuation treatment, too, and we should move the pattern into a new common $punctuation variable.

Turns out emails never needed them, since they always end in a letter anyway. (like the m in .com)

Your comment about domain names is correct. But I was thinking there could be other protocols than http which don't have a domain name at the start. Even in mailto, I could see someone having "//@example.com" as their email address (yeah, probably illegal, need to check)

Comment #134

hingo commented 19 August 2010 at 13:00

Status:

Needs work

» Needs review

Looked at the failed test now. I think a multiline comment works, it's your p tag that breaks it. The problem is the > character of the p tag matches here:

$chunks = preg_split('/(<.+?>)/i', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

My first reaction is that we need to handle html comments as a special case. Probably the cleanest solution is to add a step that escapes all html inside comments, like some filters do in the $op="prepare" phase.

Comment #135

sun

German

Karlsruhe

commented 19 August 2010 at 13:24

Adding yet another test with html markup in comments is a very good idea. (That may actually break, I realize now! The URL should be after the html...)
...
My first reaction is that we need to handle html comments as a special case.

If I'm not mistaken, then #222926: HTML Corrector filter escapes HTML comments solved a similar problem for the HTMLCorrector filter in D6 (where it still was regex-based). We may borrow a solution attempt from over there.

Yes, but now if you have such multicharacter punctuation, it stops on the last punctuation character, not the first? Ie it produces www.example.com.

Let's assert this in a test. All of our expectations have to be asserted.

But I was thinking there could be other protocols than http which don't have a domain name at the start.

Let's not overcomplicate things here. ;)

Comment #136

hingo commented 20 August 2010 at 18:34

Here is a review of the tests, finally. As before, if nobody else is faster, I will also implement these myself asap.

Since I haven't really looked at these tests before, I'm reviewing the filter.test file itself, not just the most recent patch.

Line 996

    // Filter selection/pattern matching.
    $tests = array(
      // HTTP URLs.
      '
http://example.com or www.example.com
' => array(
        '<a href="http://example.com">http://example.com</a>' => TRUE,
        '<a href="http://example.com">example.com</a>' => FALSE,
        '<a href="http://www.example.com">www.example.com</a>' => TRUE,
        '<a href="http://www.example.com">http://www.example.com</a>' => FALSE,
      ),

There is no reason to test both positively and negatively the same test. If the first test passes, then the second test automatically passes. Same for 3 and 4.

Also it is in general not a very useful approach to test some specific ways things could go wrong, since there are always a million ways things go wrong. We should just assert that the result is correct and that's enough. If it is known to be correct, we also know it is not wrong.

In other words this is sufficient:

    // Filter selection/pattern matching.
    $tests = array(
      // HTTP URLs.
      '
http://example.com or www.example.com
' => array(
        '<a href="http://example.com">http://example.com</a>' => TRUE,
        '<a href="http://www.example.com">www.example.com</a>' => TRUE,
      ),

This comment applies to many of the following tests too.

Line 1050:

      // Absolute URL protocols.
      '
https://example.com,
ftp://ftp.example.com,
news://example.net,
telnet://example,
irc://example.host,
ssh://odd.geek,
sftp://secure.host?,
webcal://calendar,
rtsp://127.0.0.1,
not foo://disallowed.com.

Maybe add cross referencing comments here and line 1325 in filter.module so that one remembers to update the test if the list of supported protocls changes.

Good that you also test for a not supported protocol.

Line 1078:

    // Surrounding text/punctuation.
    $tests = array(
      '
Partial URL with trailing period www.partial.com.
E-mail with trailing period person@example.com.
Absolute URL with trailing period http://www.absolute.com.
Query string with trailing period www.query.com/index.php?a=.
'

Need to add back examples with multiple trailing punctuation characters from comment #128.

Line 1194:

      '
<!--
Skip any URLs like www.comment.com in comments.
<p>Also ignore http://commented.out/markup.</p>
-->

Learning from your previous comment, this test is ambiguous since when it fails you don't know if it fails on the HTML tag or because of the newlines. My favorite solution would be to also add newline to the previous test. Alternatively remove newline here and have a separate third test for it.

Ok, so now we have good test coverage.

From the email discussion we had, the approach you showed in building tests suggests it would be possible to easily test all HTML tags by iterating over an array that contains them. (You seem to have further developed that approach here so now it's slightly different.) Now we just test for DL list, P and some other random picks. But for me it is perhaps better to postpone that exercise to a separate issue and close this with the tests we have.

Even if I've already objected to it many times, you have now removed the one test I actually provided. The current tests do cover the various issues well in isolation, however it is also important to test on a larger document that combines many of the cases into one document. With the work you have done, it needn't be as complex as the one I did earlier, but I want to add back one test which would start with the following comment:

// It is important to also test on a more complex document than just short one 
// liners testing one function in isolation.  Note that common regexp errors 
// include: accidental (greedy)* match instead of (minimal)*? match, only 
// match first occurrence rather than all, and newlines not matching .*.
// Additionally early versions of URL filter had bugs in cases where you have
// combinations of the 3 types of URLs to convert.
// This test covers:
//  * Document with multiple newlines and paragraphs (two newlines).
//  * Mix of several HTML tags, invalid non-HTML tags, tags to ignore and
//    HTML comments.
//  * Empty HTML tags (br, img)
//  * Mix of http://... email@... and www... strings in same document.

PS:

But I was thinking there could be other protocols than http which don't have a domain name at the start.

Let's not overcomplicate things here. ;)

That's why I'm interested in the unicode support: the reason that is a tempting path to explore is that the regexps would be much simpler, like ($protocol:\S)$punctuation*?\s

Comment #137

sun

German

Karlsruhe

commented 20 August 2010 at 21:24

Status	File	Size
new	drupal.filter-url.137.patch	24.43 KB

Fixed those tests. I need you to fix the multiline comments, not the tests. :)

Comment #138

20 August 2010 at 22:00

Status:

Needs review

» Needs work

The last submitted patch, drupal.filter-url.137.patch, failed testing.

Comment #139

hingo commented 20 August 2010 at 22:41

Status:

Needs work

» Needs review

Status	File	Size
new	test.tar_.gz	1.16 KB
new	drupal.filter-url.137.patch	28.87 KB

Ok, here's a patch that
- reverts email handling to not use the punctuation stuff (I tried to say so in #129)
- includes fixes for the tests from my previous comment, including the test with large document
- adds back support for multiple trailing punctuation (and tests)
- correctly handles commented out HTML
- since I needed to touch the code anyway, finally fixes _filter_url_trim(). See http://drupal.org/node/260484#comment-966659

The code added to handle HTML comments definitively needs review.

I didn't use anything from #222926: HTML Corrector filter escapes HTML comments since it seems they don't have a working patch themselves. Maybe they can borrow my solution :-)

(Hmm... I cannot add new files with a diff, so I'm attaching a tar file separately that contains a new test/ directory.)

Comment #140

hingo commented 20 August 2010 at 22:44

Note that automated test will obviously fail now that 2 new files are missing from the patch. Tests naturally pass on my own machine.

Comment #141

20 August 2010 at 23:20

Status:

Needs review

» Needs work

The last submitted patch, drupal.filter-url.137.patch, failed testing.

Comment #142

sun

German

Karlsruhe

commented 21 August 2010 at 00:45

Status:

Needs work

» Needs review

Status	File	Size
new	drupal.filter-url.141.patch	33.99 KB

Yet another reason to put .test files into a /tests subdirectory... grml.

Comment #143

21 August 2010 at 01:20

Status:

Needs review

» Needs work

The last submitted patch, drupal.filter-url.141.patch, failed testing.

Comment #144

sun

German

Karlsruhe

commented 22 August 2010 at 20:29

Status:

Needs work

» Needs review

Status	File	Size
new	drupal.filter-url.144.patch	36.21 KB

Fixed those tests. Not entirely happy with the new test document -- takes hours + lots of debugging tools/code to find the actual cause for failing assertions. But we can go with this for now.

Comment #145

hingo commented 23 August 2010 at 11:09

Line numbers refer to the latest patch, not the source file:

Line 230:

@@ -1350,7 +1506,7 @@ function _filter_url_trim($text, $length
   }
 
   // Use +3 for '...' string length.
-  if (strlen($text) > $_length + 3) {
+  if ($_length && strlen($text) > $_length + 3) {
     $text = substr($text, 0, $_length) . '...';
   }

Line 552:

+    $filter->settings['filter_url_length'] = 20;
+    $tests = array(
+      'www.trimmed.com/d/ff.ext?a=1&b=2#a1' => array(
+        '<a href="http://www.trimmed.com/d/ff.ext?a=1&amp;b=2#a1">www.trimmed.com/d/ff...</a>' => TRUE,
+      ),
+    );

Do you want to elaborate on why you have switched back to the broken version of _filter_url_trim()? In #141 the test was correct and the code was wrong. Now you have changed the code to match the test and they are both wrong.

In particular, you set filter_url_length to 20 characters, but now it cuts after 23 characters. (For instance, try with a URL with 22 characters and you see what I mean.) This is a minor issue, I'm just curious why you took the extra effort to remove my correct code?

FWIW, I'm ok with changes to _filter_url_escape_comments(). Using boolean variable doesn't semantically contain information about everything that the argument does, but I guess it saves so many bytes it is ok.

736:

Index: modules/simpletest/drupal_web_test_case.php
===================================================================
RCS file: /cvs/drupal/drupal/modules/simpletest/drupal_web_test_case.php,v
retrieving revision 1.226
diff -u -p -r1.226 drupal_web_test_case.php
--- modules/simpletest/drupal_web_test_case.php	22 Aug 2010 15:31:18 -0000	1.226
+++ modules/simpletest/drupal_web_test_case.php	22 Aug 2010 20:23:44 -0000

This is probably here by mistake? Also, it just moves a function a little bit, doesn't seem to do anything.

Comment #146

sun

German

Karlsruhe

commented 25 August 2010 at 17:54

1) Sorry for reverting that URL length trimming change. I don't think it was mentioned in this issue yet, so I wasn't aware that it really belongs to this patch. I'd prefer to move this fix into a separate issue.

2) The verbose() method is moved from DrupalWebTestCase into the base class, DrupalTestCase, as it cannot be used in unit tests currently. To be able to debug URL filter's processing of the entire document, verbose() is needed, as that's the only way to look at the actual results.

So it looks like we're done here?

Comment #147

hingo commented 25 August 2010 at 20:17

1) Ok. It is a different issue and can be handled separately. I mentioned it in #139. I'm ok with taking it away, just wanted to raise a flag since you reverted it without commenting.

2) If it was intentional, I have no comment, I'll focus on filter.module only.

+1 for #144 then.

Comment #148

sun

German

Karlsruhe

commented 25 August 2010 at 20:31

Status:

Needs review

» Reviewed & tested by the community

I guess you meant RTBC.

Comment #149

sun

German

Karlsruhe

commented 1 September 2010 at 12:14

This patch blocks various other issues.

Comment #150

dries commented 1 September 2010 at 20:04

Instead of calling those files filter.url.input.test.txt, can't we call them filter-url-input-test.txt?

Comment #151

sun

German

Karlsruhe

commented 1 September 2010 at 20:23

Status	File	Size
new	drupal.filter-url.150.patch	160.31 KB

Since that is the only review issue, here's a proper patch. Don't forget to:

cvs rm modules/filter/filter.test
cvs add modules/filter/tests/filter.test
cvs add modules/filter/tests/filter.url-input.txt
cvs add modules/filter/tests/filter.url-output.txt

Comment #152

webchick

she/they

English

Vancouver 🇨🇦

commented 1 September 2010 at 23:01

Status:

Reviewed & tested by the community

» Needs work

We need to back out moving modules/filter/filter.test to modules/filter/tests/filter.test. That's inconsistent with every other module in core, and has nothing to do with fixing this problem.

Comment #153

sun

German

Karlsruhe

commented 2 September 2010 at 00:07

Status:

Needs work

» Reviewed & tested by the community

Status	File	Size
new	drupal.filter-url.153.patch	35.4 KB

That's not true. We have approx. the same amount of .test files directly in the module's folder as in a tests/ sub-folder. Additionally, having all testing stuff together in one location is only beneficial. Basically eliminates the t() in test assertion messages issue.

Anyway.

Comment #154

redndahead commented 2 September 2010 at 07:00

Just to support Sun's claims these modules use "tests" directories.

Aggregator
Block
Field
File
Image
Locale
Node
Openid
RDF
Search
Simpletest
Trigger
Update

Comment #155

sun

German

Karlsruhe

commented 2 September 2010 at 11:32

Let's discuss that detail in a separate issue. Latest patch no longer moves filter.test into the tests sub-directory. But of course, the testing text files have to be in there.

Comment #156

dries commented 2 September 2010 at 15:56

Status:

Reviewed & tested by the community

» Fixed

Committed to CVS HEAD. Thanks!
Let's talk about the location of tests files in a seperate issue indeed.

Comment #157

16 September 2010 at 16:00

Status:

Fixed

» Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

Comment #160

redndahead commented 8 August 2011 at 05:36

Title:	URL filter breaks generated href tags (and filter.test)	» URL filter breaks generated href tags
Assigned:	Unassigned	» sun
Priority:	Critical	» Minor
Status:	Needs work	» Closed (fixed)