URL filter breaks generated href tags (and filter.test)
| Project: | Drupal |
| Version: | 7.x-dev |
| Component: | filter.module |
| Category: | bug report |
| Priority: | critical |
| Assigned: | Damien Tournoud |
| Status: | needs work |
Hi all
Attached is a patch that is a major rewrite to function _filter_url($text, $format). It is based on a recent version of the 5.x development branch, but it only touches parts that are unchanged in 6.x so it can be applied as is to that too.
Rationale
The current version of the URL filter is really basic. The regular expressions used only find url's in plaintext, inside p, li and br tags. In addition it has some bad bugs, such as it might manipulate url's that are really part of some html element's attribute.
This patch extends the "contexts" in which URL's are searched for, and on the other hand makes the regular expressions more accurate so as to avoid manipulating url's that absolutely must not be touched.
What changed?
//From line 1109...
/**
* URL filter. Automatically converts text web addresses (URLs, e-mail addresses,
* ftp links, etc.) into hyperlinks.
*/
function _filter_url($text, $format) {
//The following tags may contain text with a URL that should be converted to a link
//List also contains standalone/empty tags like img and br
$tags_begin = 'caption|td|th|div|dd|dt|li|blockquote|address|p|h[1-6]|i|b|u|tt|big|small|strike|s|em|strong|font|dfn|samp|kbd|var|cite|abbr|acronym|sub|sup|q|ins|del|img|br|object|applet';
//transform each tagname from 'p' to '<\/?p[^>]*>'
//this pattern will match <p>, </p>, <p class="foo"> and <p />
//the last one is significant for xhtml standalone tags like <img /> and <br />
//note: <img/> and <br/>, won't work. While valid xhtml their use is discouraged anyway
//note: the pattern will obviously match many invalid tags too like <p &&&>
$arr_tags_begin = explode( "|", $tags_begin );
for ( $i = 0; $i < count( $arr_tags_begin ); $i++ ){
$arr_tags_begin[$i] = '<\/?' . $arr_tags_begin[$i] . '[^>]*>';
}
$tags_begin = implode( "|", $arr_tags_begin );
// Pass length to regexp callback
_filter_url_trim(NULL, variable_get('filter_url_length_'. $format, 72));
$text = ' '. $text .' ';
//The structure of each three matches below is
// 1. a tag from the list above or newline or </a>
// 2. any characters except "<"
// 3. the url to match
// 4. possible period or other marks at end of url are considered separate (end of sentence)
// 5. end url in space, newline or tag ("<" character)
//In many cases the patterns below only match the first url if there are many in the same paragraph.
//Therefore the preg_replace are encapsulate in loops that will re-run the replace as many times as needed.
// Match absolute URLs.
$protocols = 'http://|https://|ftp://|mailto:|smb://|afp://|file://|gopher://|news://|ssl://|sslv2://|sslv3://|tls://|tcp://|udp://';
$urlpattern = "($protocols)([a-zA-Z0-9@:%_+*~#?&=.,/;-]*[a-zA-Z0-9@:%_+*~#&=/;-])";
$re = "`(($tags_begin|\n|\r|</a>)([^<])*?)($urlpattern)(([\.\,\?\!]*?)[\<\s\n\r])`i";
while(preg_match($re, $text)) {
$text = preg_replace_callback($re, '_filter_url_parse_full_links', $text);
}
// Match e-mail addresses.
$urlpattern = '[A-Za-z0-9._-]+@[A-Za-z0-9._+-]+\.[A-Za-z]{2,4}';
$re = "`(($tags_begin|\n|\r|</a>)([^<])*?)($urlpattern)(([\.\,\?\!]*?)[\<\s\n\r])`i";
while(preg_match($re, $text)) {
$text = preg_replace($re, '\1<a href="mailto:\4">\4</a>\5', $text);
}
// Match www domains/addresses.
$urlpattern = 'www\.[a-zA-Z0-9@:%_+*~#?&=.,/;-]*[a-zA-Z0-9@:%_+~#\&=/;-]';
$re = "`(($tags_begin|\n|\r|</a>)([^<])*?)($urlpattern)(([\.\,\?\!]*?)[\<\s\n\r])`i";
while(preg_match($re, $text)) {
$text = preg_replace_callback($re, '_filter_url_parse_partial_links', $text);
}
$text = substr($text, 1, -1);
return $text;
}
/**
* Make links out of absolute URLs.
*/
function _filter_url_parse_full_links($match) {
//which parenthesis in the regexp contains the url?
$i = 4;
$match[$i] = decode_entities($match[$i]);
$caption = check_plain(_filter_url_trim($match[$i]));
$match[$i] = check_url($match[$i]);
return $match[1] . '<a href="'. $match[$i] .'" title="'. $match[$i] .'">'. $caption .'</a>'. $match[$i+3];
}
/**
* Make links out of domain names starting with "www."
*/
function _filter_url_parse_partial_links($match) {
//which parenthesis in the regexp contains the url?
$i = 4;
$match[$i] = decode_entities($match[$i]);
$caption = check_plain(_filter_url_trim($match[$i]));
$match[$i] = check_plain($match[$i]);
return $match[1] . '<a href="http://'. $match[$i] .'" title="'. $match[$i] .'">'. $caption .'</a>'. $match[$i+1];
}Attached are a patch that can be applied against 5.x-dev and also the entire filter.module file itself. (So you can just replace your current filter.module to test.)
| Attachment | Size | Status | Test result | Operations |
|---|---|---|---|---|
| filter.module.urlfilter-rewrite.v0.1.patch | 4.55 KB | Ignored | None | None |

#1
Here is the filter.module attached
#2
And here is a file that can be used as testdata.
#3
Attached is the same patch against the most recent HEAD / 6.0
So no you are all set to go to review and apply this on any system you wish.
#4
Changes such as this will only be committed to the development branch, which I think will be for the 7.x release cycle, 6.x is currently in feature freeze.
This patch is not properly formatted, it should be in unified diff format. See http://drupal.org/diffandpatch.
#5
Oh right, sorry about the patch. Here is a diff -up patch against head attached.
About the 6 vs 7 issue. I would like to point out that this patch contains absolutely no new features. It does the same replacements as the current urlfilter module (the core regexp patterns didn't change) but it does the replacements in some new places where the current one doesn't but imho should (inside many new tags) and otoh it doesn't touch several places the current one shouldn't but does. In this sense I would argue this fix is within the scope of "During the initial stage of the code freeze, documentation updates, usability improvements, and performance improvements will still be accepted. New functionality or API changes / additions, on the other hand, will not be unless deemed critical. As of today, the focus is to strengthen the code base's performance, usability, and stability. As we progress, focus will shift towards stability and, near the end of the code freeze, only bug fixes will be allowed, until no release critical bugs are left." (http://drupal.org/drupal-6.0-code-freeze)
I do realise this patch is not a simple one-liner bug fix and the URL filter has worked fine for many of you for many years now. So if you won't include this in 6.x I'm not going to whine about it more than this, but I do want to stress that nothing new is introduced in this patch, it is a bug fix.
I'm changing version back to 6.x-dev just to get attention from the right people. Please just change it back to 7.x if that is your judgement.
#6
If the goal is to make it context-aware, then I don't see why we don't do a full tag split like other core code does (e.g. htmlcorrector, the search indexer, drupal_html_to_text). That way we can simplify the extremely complicated regular expression into a simpler, more readable heuristic. Ignoring any content between
<a> </a>tags or ignoring any HTML attributes would be a simple if check at the right point.The combined regular expressions that result from all this code seem incredibly convoluted, and contain slow, ungreedy operators.
Also, the following is just not acceptable:
//note: <img/> and <br/>, won't work. While valid xhtml their use is discouraged anyway#7
Also, from a code-style standpoint, please ensure there is one space between
//and the comment and that the comment uses sentence capitalization. For example:+ //this pattern will match <p>, </p>, <p class="foo"> and <p />should be:
+ // This pattern will match <p>, </p>, <p class="foo"> and <p />.There are places in core where the comments don't conform to this standard (yet), but we shouldn't introduce anymore new ones that don't. :)
#8
Hi again. Thanks for the good comments. It took me a couple of days to go through them, but now I'm actually back with two new patches.
Attached here is a patch 0.3 that corrects all of the minor problems you have pointed out for 0.2. In particular:
Actually, it seems that comment was wrong and those tags actually did work. The comment is now removed. However, the regexp at that position has now been reworked to be a bit more precise than before, since I realised I had tought about it the wrong way.
Done.
#9
And here is the other patch, addressing the final suggestion made by Steven.
Please not that this patch is not intended to be reviewed or included anywhere. I'm just attaching it as an attachment to the explanation that follows. Please consider 0.3 for inclusion into Drupal core.
In the attached patch I tried to follow this path instead. It is almost a complete rewrite of the whole _filter_url_filter() function. (Whereas patch 0.3. actually is more like additions onto the simple regexps in current HEAD.)
The problem is that URL filter actually makes 3 different replacements. It makes links out of
1) http://any.full.url
2) email@addresses.com
3) www.domains.com
The problem is that once you do the split (you can just split on all tags, it's really simple)
(1) works fine,
(2) works fine since it (practically) never confilcts with (1),
but then at step (3) you end up actually hitting all of the addresses that were converted in step (1), thus messing up the links created in (1). So the problem is that while we preg_split away all tags present in the input, (1) and (2) create new
<a>tags and there is no new split function and then you mess everything up at (3).Possible solutions to this dilemma:
1) Make the last regexp more intelligent/complex so it doesn't do this -> This would end up being the "complex" regexp used in patch v0.3, kind of defeating the whole point of the excercise: Then we'd have both more complex code and the complex regexp.
2) Separate the 3 regexps so that only one is done at a time. For instance have a loop that goes 3 times, doing the preg_split anew for each loop. This way the new
<a>tags created in steps (1) and (2) would become protected at step (3).I personally am of the opinion that while the regexps in v0.3 may be a bit wordy, the complexity of the if-else structures of v0.4 are not that simple either. Adding another loop or other logic doesn't seem worthwile to me - I'd rather vote for v0.3 where at least the code is straightforward, albeit the regexps may be on the long side.
(Another comment on patch v0.2)
Is the slowness of a regexp in a filter an issue, since the filter is cached? I do admit that URL filter is one of the most used ones, but nobody runs filters without having the caching on, right?
Whether the regexps are readable or not is of course a matter of taste, to the author everything always looks simple. But I'd argue the if-else logic in v0.4 takes about as much time to understand as the regexps in v0.3. (v.0.4 also took longer to get right than modding the regexps took for v0.3.)
So in summary, I'm still proposing the patch v0.3.
#10
Small restructuring of the regexp in $tags_begin variable. (Refinement based on patch 0.3, see previous comment.)
Changes line 1089 (line 10 in this patch)
$tags_begin = '<\/?(?:' . $tags_begin . ')(?:\s[^>]|\/)*>';// This pattern will match <p>, </p>, <p class="foo">, <p /> and <p/>
// The last ones are significant for xhtml standalone tags like <img/> and <br/>
// Note: the pattern will obviously match many invalid tag-like constructs too
// like <p &&&>
Removes the for-loop that did the same transformation around each tag separately.
This change makes the regexps a bit shorter, so should be slightly simpler to read and also to execute.
#11
Oops, forgot to change status...
#12
How should silence in a "needs review" situation be interpreted?
Where is 6.x at the moment, is it still realistic to target this for 6.x or are we close to RC time?
#13
I applied your patch 0.5 to test. There seems to be a bug here. Given the input
the url is getting filtered, with the effect that the code example is obfuscated. This occurs even though the code tag is NOT on the list of allowable contexts for filtering. Any idea what is happening?
Hmm. Actually, things get substituted in a code tag even without the surrounding blockquote. This doesn't seem right if I am understanding the module code's intentions.
#14
Hi shap
Thanks for testing this.
Unfortunately, I cannot reproduce what you are getting. I tested on a vanilla 6.x-dev, with 0.5 patch applied. For instance,
This is just a www.test.com. paragraph with person@test.com. some http://www.test.com. urls thrown in and also <code>using www.test.com the code tag< /code>.
<blockquote>
This is just a www.test.com. paragraph with person@test.com. some http://www.test.com. urls thrown in and also <code>using www.test.com the code tag< /code>.
</blockquote>
<code>So what's this about? http://www.test.com abc< /code>
http://www.test.com
www.test.com
person@test.com
<code>www.test.com< /code>
All of the other url's get modified, but the url's inside the code tags are not modified.
On the other hand, the behaviour you describe matches the current drupal url filter. (And yes it is wrong and that is exactly the reason I'm trying to fix it.) So could you please check if:
* you really are running filter.module with the patch applied
* you are not seeing a cached version of your content. For instance, write foo123 at the end of the text to make sure it is different from some old version.
#15
I found one instance where actually url's are modified when they shouldn't:
<code>In this example www.test.com the first url is not modified, but <em>the second www.test.com is</em> even if it shouldn't. </ code>The problem is that v0.5 only pays attention to the innermost tags. I haven't thought of a case in normal html where this would lead to problems, but indeed the code tag is different.
I'll work on this and see how to solve it. Thanks shap for bringin this to my attention - I don't know if this is exactly the bug you were describing, but thanks still.
#16
Well, to fix the problem with tags nested inside code tags, I had to revert to the preg_split/implode strategy from 0.4. I've completed and refined that approach and this patch is a refined version of 0.4.
I would appreciate if you can test this out too. I'm pretty sure this is close to perfect, I dare you to come up with examples to break this :-)
#17
Attached is the testdata I'm currently using.
#18
Since 6.0 has now reached beta (congratulations, btw) I'm changing this to target 7.x-dev.
#19
Well I've had the url filter mess with urls created by the footnotes filter before (i.e. within a href), so I think this counts as a bug. I'm tentatively moving it back to 6.x-dev since other bugs will get fixed before rc, and changing it to a bug report since that's really what it is.
Sorry no time to review this morning.
#20
Great! I certainly agree with classifying this as a bug, if there are more than me who will agree.
I was trying to be cautious, since I'm under the impression that the URL filter is "good enough" for many people and some may not agree that what has now become a complete rewrite would be considered for 6.x. But as this thread too shows, there are planty of things the old URL filter does wrong, and this patch should fix all of them.
#21
You fileencoding is wrong (ISO). You must use UTF8 without BOM's. Check your editor, please.
#22
Sorry about that, I didn't realise there were any characters where that would have mattered. Here is the same patch but now in UTF-8 format.
#23
Forgot to change status again...
#24
...
#25
Beta 2 is out for 6.x, so I'm moving this to the development queue.
#26
Bugs are still being fixed in 6.x, and at least the footnotes module is caught by this bug requiring careful weighting of the various filters. It's possible the patch may be too much of a change at this stage, but it'd be nice to get more reviews before bumping it.
#27
As someone who doesn't follow the dev mailing list, I'll ask here: Is HEAD already taking in new patches or is activity still focused on D6?
#28
hingo, currently 6.x and HEAD are the same thing.
#29
I just ran into this problem in 5.7. Has this thread gone cold? Or, is there a dupe somewhere?
My use case:
<p class="foo">www.example.com</p>--> no conversion<p>www.example.com</p>--> worksAny chance this will make it into 5.x or should I hack core?
#30
Hi jpsalter
You've found the right place, this is THE thread and it is still active.
6.0 is now out and this wasn't applied. I'm actually wondering myself what the correct Version tag for this would be? 6.0 (since this bug exists in 6.0) or 6.x (where it should be applied) or 6.1 or 7.x?
I doubt it will be applied to 5.x, but you should be pretty safe applying it yourself. From what I can tell the _filter_url() functionality has been untouched through the whole 5.x series to 6.x. Even if the patch wouldn't apply you could still just replace the relevant functions with copy pasting.
#31
Thanks for the confirmation and all your hard work. I hope others recognize it too.
#32
This should be fixed in HEAD (now 7.x-dev) and then backported if appropriate.
The patch in #22 has some code style issues with comments (lacks periods at ends of comments, for one).
#33
Thanks Keith
Old habits die hard... I already had the periods in place, but rewriting parts of the code it seems I introduced periodless comments again.
This patch 0.8
- fixes style issues with comments (periods)
- diff against 7.x-dev (these functions untouched but file changed so line numbers changed from 6.x-dev)
#34
Could someone please see if there are any more punctuation or style issues with the comments and help get this committed to 7.x.
I first submitted this patch during my last years Summer vacation, and now the snow is melting and the grass is green again, I would be disappointed not to have such a straightforward fix committed before my next Summer holiday due in 3 months...
#35
hingo: Well, since you ask, on a quick read-through, we do use 'HTML' and 'URL' in comments (rather than 'html' and 'url', respectively).
#36
Thanks Keith
Would you find in yourself enough energy to see if there is still something that needs to be corrected. I'm afraid next you'll tell me that www should also be uppercase? Speaking of which, is there an official style guide where I could get all this advice in one shot?
v0.9
* style: uppercase HTML and URL
#37
http://drupal.org/coding-standards is a good starting point and try coder module... it helps, too.
#38
Thanks. Looking at it now I realise I already read through it last summer. But it isn't as detailed as some of the advise I got here, which is more like grammatical stuff almost. *Which I'm completely ok with*, btw, it just seems there is an additional "oral tradition" to style issues at Drupal, so it seems I will have to continue relying on the grace of core developers giving some feedback every once in a while.
#39
hingo: Some of this is contained in pages similar to the coding-standards page (but that are more general in scope), like http://drupal.org/about/authoring.
#40
Ah, Thanks Kev. Had not seen that one.
In any case, we are digressing now a bit from the main question? What should yet be done so that this can be committed to 7.x?
#41
#42
@hingo: Can you provide some sample content with URLs, some of which are inappropriately processed by the current filter, so that the efficiency of this patch can be tested?
#43
Attached is my most developed "test case". The patch works good with all of these. The current url-filter fails at least:
* urls inside a dl list (not processed, should be)
* urls in a title attribute in a link (are processed, shouldn't)
* urls in html comments (are processed, shouldn't)
* urls inside some other html tags also are not processed but should
(This is out of my memory, I haven't used the official urlfilter in a while :-)
#44
ping
#45
Issue #249200: Test individual filters. adds some test cases that show the failure of the URL filter in the cases mentioned in #43. Please review and extend those tests, and it will help to get a revised URL filter into core.
#46
Re-rolled patch from #36.
The tests written in #249200: Test individual filters. have been committed. 12 assertions fail with the current link detection algorithm, but all of them pass successfully with this patch.
#47
Thanks, flobruit. I'm immediately going to take the benefit of not being the submitter of the last patch anymore and put a reviewer hat on. Let's see if we can finally get this committed!
/me reviewing... Looks good!
We've had several people looking at this without complaints for months now, and now we also have the unit tests committed in
filter.testthat show the improvement over the existing url filter. I vote for committing the patch by flobruit in #46.#48
Still applies with offset, and fixes those failures in filter.test (although there's some other unrelated ones).
#49
The other failures in filter.test are dealt with here: http://drupal.org/node/266539
#50
Please remove line-break here:
+ // Note: PHP ensures the array consists of alternating delimiters and literals+ // and begins and ends with a literal (inserting NULL as required).
+
+ // If an ignoretag is found, it is stored here and removed when the
+ // closing tag is found. Until then no replacements are made.
+ // Think of this as a stack that always has 0 or 1 items.
Please read coding standards again:
+ for ( $task = 1; $task <= 3; $task++ ) {+ for ( $i = 0; $i < count( $chunks ); $i++ ) {
+ if ( $opentag == '' ) {
Please improve these comments:
+ // Even/odd = text/tag.+ if ( $i % 2 == 0 ) {
+ // Text...
+
+ // Only do replacements when there are no unclosed ignoretags.
+ else {
+ // Tag...
Please insert a newline between breaks and cases inside of switch statements:
+ break;+ case 2:
$task seems not to be defined and poorly named.
+ for ( $task = 1; $task <= 3; $task++ ) {Please improve these comments:
+ // No open "ignoretag".+ // Is it an ignoretag? (opening tag.)
We do not use such "reminders" in Drupal: (if that's needed, the function body is likely too large)
+ } // if ( $opentag == '' )+ } // if ( $i % 2 )
+ } // for chunks
...probably more, but please fix those first.
#51
Wow, this looks like a real review! I have to admit I had given up hope...
Sun,
Please don't! Those are not the same comment. The 2 first lines refer to the previous line, while the three last ones to the following line. (And since next you are going to say that comments cannot appear after the statement they are a comment on, I might as well add right now that this code is a copy paste from existing code in Drupal core and is like that in 4 other places too.)
This is a rather unhelpful comment (I'm being polite here) and no, I won't read them again. If there is a style issue, say what it is.
Sure, they can be removed. (This must be the first time anyone complained of there being too many comments in a patch I wrote :-)
Sorry, this won't work. We've had several people now flying in and "helpfully" pointing out punctuation, capitalisation and whatnot issues. I'm not interested in that game anymore. This patch fixes a known bug in Drupal which several people have complained about. Since September 2007 (that's 9 months, I have made a baby in less!) there hasn't been any code updates, just grammar/style discussion. If we get a list of the style/comment issues that should be fixed we will fix them. But at least for my part I'm not gonna participate in a never ending game of "fix these first" reviews. This is a short patch, it is not too hard for you to give a finite list of problems and we'll fix them and then the patch can be committed.
Thanks.
(FWIW, your comments I didn't comment on are perfectly good comments and I'll be happy to improve on those together with the complete list of remarks that I hope to get soon. I also want to point out that it's obviously not your fault that we've already had too many unhelpful helpful reviewers pass by without any progress, you just happened unluckily to be the last one - in fact your review was the best so far.)
#52
Hrm. This is called peer review. We, the community, have advanced in this area a lot in the past. No patch will hit core that contains coding-style issues or any other logical/syntactical weaks. If we would slip such patches in, Drupal core would quickly look like err... tryin' hard to avoid names of other CMS. Please note that I have not tested your patch, just reviewed the code. Before any other core or core-alike developer will stumble upon this issue, you should really try hard to make this patch ready to be committed. That means, your patch should include no coding-style or coding-standards issues, and it must stand the objections of performance-related freaks that are yet to come.
I wouldn't have noted the following if it wasn't contained right in the first code example snippet of the first section of Drupal's coding standards:
Please read coding standards again:
- for ( $task = 1; $task <= 3; $task++ ) {+ for ($task = 1; $task <= 3; $task++) {
- for ( $i = 0; $i < count( $chunks ); $i++ ) {
+ for ($i = 0; $i < count($chunks); $i++) {
- if ( $opentag == '' ) {
+ if ($opentag == '') {
There's no patch attached.
#53
Sun, I have nothing against peer review. We have been waiting for soon a year now for it to happen, and for nine months this patch has been *nothing but* waiting for peer review to happen. So I'm not objecting to the issues you pointed out, I'm objecting to you saying that there are more that you didn't point out! Please do so.
I'm moving this back to Code Needs Review because that's what we really need here. When we know that we have a final list of issues to correct, I'll correct them.
(Thanks for clarification on that one point.)
#54
A search-n-replace of '( ' => ')' and ' ) => ')' cleans up the spaces in functions.
There are a few comments left that do not begin with a capital letter and end with a period (mostly comments that end loops:
+ } // if ($opentag == '')
+ } // if ($i % 2)
+ } // for chunks
I'm not sure of the Drupal policy of closing } having comments. Maybe it is easier to remove them.
I've been struggling to get my code to meet Drupal's conventions. It is hard to undo years of coding habits. But, after taking the pain on a module or two -- it is getting easier.
Thanks for keeping at this.
Hope this helps.
#55
@hingo: there are over 270 patches in the queue against D7 that need review, some of those patches have been around a lot longer than this one, and been through many, many more re-rolls, even for comment style issues.
I didn't look at the patch yet, just ran it through tests, but will do so tomorrow.
#56
jpsalter, catch: Thanks.
@catch: Sure, and I understand this is how projects like Drupal works, and I'm not in a hurry, for myself I can apply the patch anyway and if no-one else cares then it is ok to focus on higher priority issues.
What is unproductive imho, is that there are people passing by, pretending to do a review when it's just really very very superficial. So for instance in this last instance, 1) the reviewer himself says that his review is incomplete and he may come back for more comments later and 2) he later also says "I didn't even look at the code yet" (which I didn't want to unnecessarily point out, but was also apparent from the first comments, sorry to say). From my experience if those are fixed, then for 2 months it will be quiet again, then someone else will pop by saying "Here are a few more comments but this is not all..." And at no point in these reviews are we really moving closer to actually committing anything.
So by all means, feel free to leave even more such comments, but there is no need to change the Status to "code needs work" when you know yourself you didn't really do a proper review. What this patch needs is more review, and when we've had a proper review it's just 1 minutes work to incorporate all the issues also from the partial reviews, such as those made by Sun.
So what I'm waiting for now is for someone to say: "I have really looked at the code *and* comments and here are *all the issues as far as I can see*." Even better if we get someone to say: "After fixing these I can (or I will propose) commit this patch."
#62
Attachment problems again. This should get us at least part of the way with the code style issues.
#63
// Pass length to regexp callback._filter_url_trim(NULL, variable_get('filter_url_length_' . $format, 72));
Why pass a drupal variable as a parameter to _filter_url_trim()? We can simply access that variable from inside the function.
$chunks = preg_split('/(<.+?>)/i', $text, -1, PREG_SPLIT_DELIM_CAPTURE);// Note: PHP ensures the array consists of alternating delimiters and literals
// and begins and ends with a literal (inserting $null as required).
This comment was copied from _filter_htmlcorrector(). $null should be NULL.
Also, when positioned after the line of code, the comment is not really clear. What array?
I suggest this variant (the same comment appears in other functions in filter.module, I changed those as well):
$chunks = preg_split('/(<.+?>)/i', $text, -1, PREG_SPLIT_DELIM_CAPTURE);// Note: The return value of preg_split() consists of an array of
// alternating delimiters and literals that begins and ends with a
// literal (inserting NULL as required).
In most of core, drupal uses
/or!to delimit regular expressions. The use of`isn't wrong, but we might as well make it consistent with the rest of the code.Finally, a PHP-esque
foreach (range(1, 3) as $task) {would look better in this context than the C++esque
for ($task = 1; $task <= 3; $task++) {#64
@flobruit: those changes all look good.
Found a couple more stray spaces.
#65
I have removed a bit of code duplication and simplified a bit.
#66
Much tidier, and it still fixes all the failures in filter.test
Since this is a blocker to all tests passing, marking to critical.
#67
Good old fashioned clicking and posting found:
* notice: Undefined offset: 1 in /modules/filter/filter.module on line 771.
#68
Guys, I'm sorry to say, but this "cleanup work" has broken quite many things. It was nice to see so many genuinly helpful inputs and I'm sorry I wasn't able to follow up instantly on changes you made. However, looking at this patch now, there are several things that have been broken, you ahve each contributed some of it. To simplify this, could we just go back to #62 which is the last working patch as far as I can see.
Things broken in the new patch:
* A crucial
} else {has inexplicably disappeared, this is a major bug of course. (missing at line 77 in patch).* Removing $length parameter from _filter_url_trim() has broken that function ($format is now undefined). Probably buggy at least for long URL's?
Further, I strongly disagree with moving the fetching of the length parameter into the function in the first place. _filter_url_trim() is a generic utility function which should just take an input and return an output, it shouldn't have any drupal stuff inside of it. Utility functions should be generic, fetching a drupal variable from db belongs to the main code.
* I don't agree that moving the patterns into an array and away from their actual regexps calls help readability at all. If you want to be clever, go all the way and make the switch-case statement go away completely (email task needs to be modified to use callback too), that would make code shorter and at least arguably easier to read.
* Minor: in the callback functions I find the new comment "+ // Find the parenthesis in the regexp containing the URL." misleading, since this is not about finding anything. My original comment was just supposed to be a description of what variable
$iis, not any actual action.***
So, back to main topic, attached is the patch from #62 which we need to get back to. This contains the whitespace fixes but none of the restructuring of code which all introduces bugs. (It contains the new comment in the callback functions I don't like as listed above in the last minor point, so that could be reverted in the next edit.)
I'm switching back to code needs review and we can continue discussion from where it was at #62. (ie, shouldn't be any bugs, you may have style opinions you want to re-introduce...)
I'm thankful for more help on this patch, but it is not tolerable to introduce real bugs while discussing style issues. Please be more careful and remember that you are editing a working patch.
#69
hingo, my patch removed the whole
if ($i % 2 == 0)business because I found it ugly. It's not just anelsedisappeared -- I replace that with an$i++. I am not debating merely code style here. I was thinking on how to remove the switch, but it never came to me to replace that with a preg_replace_callback. Honestly: good idea. (Sigh, I wish we had lambda functions, would be useful here) What bugs did I introduce?Can you reroll with the
if ($i % 2 == 0)and theswitchremoved?Edit: I agree wholeheartedly that introducing bugs into working patches is very frustrating. However, I am not introducing new features here, but trying to get a better code (in my eyes, better, at least). Really sorry if I broke your patch -- I know how it feels.
#70
@chx: I tested your patch, with the particular input I had in mind. You're almost right, cause you didn't break what I thought. However, you are to blame for the 3 new notices we get:
(The fourth notice is because of the _filter_url_trim() change introducing a bug.)
This is because the $chunks array always ends with a text chunk, but in your patch the code always ends with trying to process a tag, so these notices are the PHP equivalent of ArrayIndexOutOfBounds complaints. It's not fatal (output actually is still ok) but it is obviously not acceptable either. In a more general description: Your trick doesn't work if there is an odd number of array elements, and in this case there often (always?) is.
I agree that the
if($i % 2 == 0)thing may not be the nicest code either. What do you think about the following:1) Keep my original code hierarcy
2) Replace my modulo stuff with a variable that takes text values "text" and "tag" instead.
I mean:
What do you say? Btw, this introduces an "unnecessary" variable just for readability, is that considered good or bad in the Drupal community?
***
Going forward: I need to go to sleep now and will be on a business trip the whole next week. If nothing happens during that time, I will update the patch with the below changes. If someone is still bold enough, feel free to do these updates and continue the debate, then I'll come back a week from now and tell you what I think again.
FWIW, I've been a little cranky here earlier, I am in fact happy now that the last contributions have been addressing code and other substance, not just whitespace and grammar. (Note to self: check whitespace when done, I'd hate to be caught a fourth time for it!)
TODO, wrt patch in #68:
1) Agree with chx (and others) how to replace
if($i % 2 ...and replace it2) Make the switch statement go away completely. Email regexp processing needs to be made into using a callback function too, and then use the name of each callback function as $task (See $patterns array in#65). (This is another "clever" thing I'm still lukewarm about, but it will make code significantly shorter and less indented, so if we do it with good comments I hope it will increase readability. I really never liked that switch, have to admit.)
3) Do NOT make any changes to
_filter_url_trim().4) Revert the changed comment inside the callback functions so that it is a comment introducing variable $i. If there is need to comment further on what the function is doing, do that too.
5) Double check, but I believe all issues in #50 have been addressed. (Either included or rejected.)
No patch attached, I'll be back in a week (plus perhaps a few days).
#71
Attached patch implements changes as discussed in comment #70. There are no outstanding issues so patch is ready to be reviewed and committed.
#72
All filter tests pass, no notices. Not looked at code changes.
#73
Just a short note to everyone following this thread. I have today committed the above version of URL filter into the Footnotes module, of which I'm the maintainer. You can now easily install and use a working version of URL filter by
Please see #281143: Better URL filter now available within Footnotes for more info. Please comment on Better URL filter in that thread and not here!
I hope this will resolve the standstill we have here and users who need a working URL filter can finally have it without letting this discussion hold them hostage for another year or two.
#74
This causes core filters tests to pass, well done. I've looked through the code, and all looks good, except for one thing:
- // Use +3 for '...' string length.- if (strlen($text) > $_length + 3) {
+ if (strlen($text) > $_length) {
That makes me very very very unhappy. I don't have the heart, though, to mark it back to cnw, so rerolled (just those lines changed) and happily rtbc'ing instead. :) Please don't credit me on commit.
#75
Please excuse mislabeled patch in previous comment. It is the correct patch, only mislabeled. :)
#76
I've patched and run tests with this, confirmed RTBC.
#77
@cwgordon
I originally removed the +3 since I don't understand the logic of it. If anything, it should be -3.
Now, if user has set max length for URL's to be 72 (for instance), then we should cut all URL's that are longer than 72, not longer than 75. Even so, this is not a major issue for me, so I'm letting your patch stand. I'm just saying I don't understand it and I don't see it doing what the comment says it is doing.
#78
;)
#79
Bump.
#80
don't smashtogether variable names. This patch moves comments around, removes windows linebrakes, and renames some variables. no actual logic has changed.
#81
Patch adds an extra _filter_url_parse_full_links() rather than replacing it.
#82
#260484: Regression: Filter tests fail was committed, so those tests need to be rolled in with this patch.
#83
Indeed, patch #80 doesn't seem to remove much lines at all, same problem also with the other *_parse_*_links() functions for starters. Very unhelpful.
As I'm the original proponent of this patch, I guess it is fair to announce publicly that what happened in #260484: Regression: Filter tests fail made me loose whatever little faith I had left in this project and I don't intend to lift another finger working on this patch anymore.
(The good news is I will release version 2.0 of Footnotes during the weekend and users can just enable a working URL filter (this patch) from there instead, as it will be included.)
#84
I've gone through and re-rolled #74 with some of the comment and variable naming improvements in #80 + the filter tests themselves. Back to needs review so we won't just be pretending we have no bugs in core :p.
#85
Why are you not using "example.com" for tests?
#86
These tests were just rolled back from core this morning, I didn't change anything when rolling into this patch. Either way, fixed.
#87
Missed some.
#88
doppelpost
#89
Whoops: http://drupal.org/node/260484#comment-966252
re-rolled to get this back to a functioning state, although it'd be nice to clean this up later so it's not using non-legit domain names.
#90
I'm tired of this, mine for a few hours.
#91
Re-rolled #89 and marking back to needs review since it still passes its own tests. We should either apply this, or get expectedFail() in. More elegant solutions can come later, leaving stuff broken in core when there's a working patch with tests for this long is bad.
#92
The last submitted patch failed testing.
#93
Test failure : #335122: Test clean HEAD after every commit
#94
The last submitted patch failed testing.
#95
Oh you silly, silly test bot.
Also, wow. Somehow this managed to stay off my radar for a very long time. I guess my RTBC queue sprints and this patch's CNW statuses were not well-timed at all. Really sorry for that, hingo. :( You might have better luck asking for reviews in IRC? There's usually a group of us hanging out and talking D7 at all hours of the day.
#96
Marking back to RTBC per #74 and pending non-broken test bot. There's been no significant changes to the patch since then apart from re-rolls if you don't count going around in circles.
#97
This no longer applies after committing #276597: TestingParty08: filter.module. Some tests for the URL Filter were added as a part of that patch, so the tests here will have to be re-worked slightly to fit in there. Sorry for that, but I had to make a judgment call and XSS filter testing seemed pretty important.
Here's the format for assertions used in the other patch:
<?php+ $f = _filter_url('http://www.example.com/', 'f');
+ $this->assertEqual($f, '<a href="http://www.example.com/" title="http://www.example.com/">http://www.example.com/</a>', t('Converting URLs.'));
+
+ $f = _filter_url('http://www.example.com/?a=1&b=2', 'f');
+ $this->assertEqual($f, '<a href="http://www.example.com/?a=1&b=2" title="http://www.example.com/?a=1&b=2">http://www.example.com/?a=1&b=2</a>', t('Converting URLs -- ampersands.'));
+
?>
We should make this patch fit that as well. Having the entire contents of the test string in HEREDOC syntax is very odd. This exercise will also help reviewers to understand the specific issues being fixed by this patch, since right now it's not very clear.
Unsurprisingly ;), the patch itself conforms well to coding standards, and is very well documented, etc. I'm not positive about the logic therein, but if the tests prove it works, we can probably just commit it as is unless someone comes up with any alarming reasons why not to. Although before that, it would be good to have some baseline benchmarks just to ensure we're not increasing the time spent in this function by 400%.
#98
Only nitpick:
/**- * URL filter. Automatically converts text web addresses (URLs, e-mail addresses,
- * ftp links, etc.) into hyperlinks.
+ * URL filter. Automatically converts text web addresses (URLs, e-mail
+ * addresses, ftp links, etc.) into hyperlinks.
PHPDoc and comments should wrap at 80 chars - except the PHPDoc summary.
#99
This fixes the comment issue sun noticed, and adds two of wrwrwr's failing tests from http://drupal.org/node/276597#comment-1196760
May or may not get to looking at adding more assertions.
#100
The last submitted patch failed testing.
#101
Note: Henrik contacted me off-list and noted that one real advantage of the HEREDOC syntax is that you can test each of these conditions on "real world" data, rather than in isolation. I think that's a really good point.
Probably what makes the most sense though rather than embedding this data within the test itself is making a separate .html file in the modules/filter/tests/ folder that we can continue to add to as we come up with other crazy ways to break various filters.
#102
Would be awesome to get this finally done for D7.
The list of tags to exclude should also contain PRE.