Hi all
Attached is a patch that is a major rewrite to function _filter_url($text, $format)
. It is based on a recent version of the 5.x development branch, but it only touches parts that are unchanged in 6.x so it can be applied as is to that too.
Rationale
The current version of the URL filter is really basic. The regular expressions used only find url's in plaintext, inside p, li and br tags. In addition it has some bad bugs, such as it might manipulate url's that are really part of some html element's attribute.
This patch extends the "contexts" in which URL's are searched for, and on the other hand makes the regular expressions more accurate so as to avoid manipulating url's that absolutely must not be touched.
What changed?
//From line 1109...
/**
* URL filter. Automatically converts text web addresses (URLs, e-mail addresses,
* ftp links, etc.) into hyperlinks.
*/
function _filter_url($text, $format) {
//The following tags may contain text with a URL that should be converted to a link
//List also contains standalone/empty tags like img and br
$tags_begin = 'caption|td|th|div|dd|dt|li|blockquote|address|p|h[1-6]|i|b|u|tt|big|small|strike|s|em|strong|font|dfn|samp|kbd|var|cite|abbr|acronym|sub|sup|q|ins|del|img|br|object|applet';
//transform each tagname from 'p' to '<\/?p[^>]*>'
//this pattern will match <p>, </p>, <p class="foo"> and <p />
//the last one is significant for xhtml standalone tags like <img /> and <br />
//note: <img/> and <br/>, won't work. While valid xhtml their use is discouraged anyway
//note: the pattern will obviously match many invalid tags too like <p &&&>
$arr_tags_begin = explode( "|", $tags_begin );
for ( $i = 0; $i < count( $arr_tags_begin ); $i++ ){
$arr_tags_begin[$i] = '<\/?' . $arr_tags_begin[$i] . '[^>]*>';
}
$tags_begin = implode( "|", $arr_tags_begin );
// Pass length to regexp callback
_filter_url_trim(NULL, variable_get('filter_url_length_'. $format, 72));
$text = ' '. $text .' ';
//The structure of each three matches below is
// 1. a tag from the list above or newline or </a>
// 2. any characters except "<"
// 3. the url to match
// 4. possible period or other marks at end of url are considered separate (end of sentence)
// 5. end url in space, newline or tag ("<" character)
//In many cases the patterns below only match the first url if there are many in the same paragraph.
//Therefore the preg_replace are encapsulate in loops that will re-run the replace as many times as needed.
// Match absolute URLs.
$protocols = 'http://|https://|ftp://|mailto:|smb://|afp://|file://|gopher://|news://|ssl://|sslv2://|sslv3://|tls://|tcp://|udp://';
$urlpattern = "($protocols)([a-zA-Z0-9@:%_+*~#?&=.,/;-]*[a-zA-Z0-9@:%_+*~#&=/;-])";
$re = "`(($tags_begin|\n|\r|</a>)([^<])*?)($urlpattern)(([\.\,\?\!]*?)[\<\s\n\r])`i";
while(preg_match($re, $text)) {
$text = preg_replace_callback($re, '_filter_url_parse_full_links', $text);
}
// Match e-mail addresses.
$urlpattern = '[A-Za-z0-9._-]+@[A-Za-z0-9._+-]+\.[A-Za-z]{2,4}';
$re = "`(($tags_begin|\n|\r|</a>)([^<])*?)($urlpattern)(([\.\,\?\!]*?)[\<\s\n\r])`i";
while(preg_match($re, $text)) {
$text = preg_replace($re, '\1<a href="mailto:\4">\4</a>\5', $text);
}
// Match www domains/addresses.
$urlpattern = 'www\.[a-zA-Z0-9@:%_+*~#?&=.,/;-]*[a-zA-Z0-9@:%_+~#\&=/;-]';
$re = "`(($tags_begin|\n|\r|</a>)([^<])*?)($urlpattern)(([\.\,\?\!]*?)[\<\s\n\r])`i";
while(preg_match($re, $text)) {
$text = preg_replace_callback($re, '_filter_url_parse_partial_links', $text);
}
$text = substr($text, 1, -1);
return $text;
}
/**
* Make links out of absolute URLs.
*/
function _filter_url_parse_full_links($match) {
//which parenthesis in the regexp contains the url?
$i = 4;
$match[$i] = decode_entities($match[$i]);
$caption = check_plain(_filter_url_trim($match[$i]));
$match[$i] = check_url($match[$i]);
return $match[1] . '<a href="'. $match[$i] .'" title="'. $match[$i] .'">'. $caption .'</a>'. $match[$i+3];
}
/**
* Make links out of domain names starting with "www."
*/
function _filter_url_parse_partial_links($match) {
//which parenthesis in the regexp contains the url?
$i = 4;
$match[$i] = decode_entities($match[$i]);
$caption = check_plain(_filter_url_trim($match[$i]));
$match[$i] = check_plain($match[$i]);
return $match[1] . '<a href="http://'. $match[$i] .'" title="'. $match[$i] .'">'. $caption .'</a>'. $match[$i+1];
}
Attached are a patch that can be applied against 5.x-dev and also the entire filter.module file itself. (So you can just replace your current filter.module to test.)
Comment | File | Size | Author |
---|---|---|---|
#153 | drupal.filter-url.153.patch | 35.4 KB | sun |
#151 | drupal.filter-url.150.patch | 160.31 KB | sun |
#144 | drupal.filter-url.144.patch | 36.21 KB | sun |
#142 | drupal.filter-url.141.patch | 33.99 KB | sun |
#139 | drupal.filter-url.137.patch | 28.87 KB | hingo |
Comments
Comment #1
hingo CreditAttribution: hingo commentedHere is the filter.module attached
Comment #2
hingo CreditAttribution: hingo commentedAnd here is a file that can be used as testdata.
Comment #3
hingo CreditAttribution: hingo commentedAttached is the same patch against the most recent HEAD / 6.0
So no you are all set to go to review and apply this on any system you wish.
Comment #4
drummChanges such as this will only be committed to the development branch, which I think will be for the 7.x release cycle, 6.x is currently in feature freeze.
This patch is not properly formatted, it should be in unified diff format. See http://drupal.org/diffandpatch.
Comment #5
hingo CreditAttribution: hingo commentedOh right, sorry about the patch. Here is a diff -up patch against head attached.
About the 6 vs 7 issue. I would like to point out that this patch contains absolutely no new features. It does the same replacements as the current urlfilter module (the core regexp patterns didn't change) but it does the replacements in some new places where the current one doesn't but imho should (inside many new tags) and otoh it doesn't touch several places the current one shouldn't but does. In this sense I would argue this fix is within the scope of "During the initial stage of the code freeze, documentation updates, usability improvements, and performance improvements will still be accepted. New functionality or API changes / additions, on the other hand, will not be unless deemed critical. As of today, the focus is to strengthen the code base's performance, usability, and stability. As we progress, focus will shift towards stability and, near the end of the code freeze, only bug fixes will be allowed, until no release critical bugs are left." (http://drupal.org/drupal-6.0-code-freeze)
I do realise this patch is not a simple one-liner bug fix and the URL filter has worked fine for many of you for many years now. So if you won't include this in 6.x I'm not going to whine about it more than this, but I do want to stress that nothing new is introduced in this patch, it is a bug fix.
I'm changing version back to 6.x-dev just to get attention from the right people. Please just change it back to 7.x if that is your judgement.
Comment #6
Steven CreditAttribution: Steven commentedIf the goal is to make it context-aware, then I don't see why we don't do a full tag split like other core code does (e.g. htmlcorrector, the search indexer, drupal_html_to_text). That way we can simplify the extremely complicated regular expression into a simpler, more readable heuristic. Ignoring any content between
<a> </a>
tags or ignoring any HTML attributes would be a simple if check at the right point.The combined regular expressions that result from all this code seem incredibly convoluted, and contain slow, ungreedy operators.
Also, the following is just not acceptable:
Comment #7
webchickAlso, from a code-style standpoint, please ensure there is one space between
//
and the comment and that the comment uses sentence capitalization. For example:should be:
There are places in core where the comments don't conform to this standard (yet), but we shouldn't introduce anymore new ones that don't. :)
Comment #8
hingo CreditAttribution: hingo commentedHi again. Thanks for the good comments. It took me a couple of days to go through them, but now I'm actually back with two new patches.
Attached here is a patch 0.3 that corrects all of the minor problems you have pointed out for 0.2. In particular:
Actually, it seems that comment was wrong and those tags actually did work. The comment is now removed. However, the regexp at that position has now been reworked to be a bit more precise than before, since I realised I had tought about it the wrong way.
Done.
Comment #9
hingo CreditAttribution: hingo commentedAnd here is the other patch, addressing the final suggestion made by Steven.
Please not that this patch is not intended to be reviewed or included anywhere. I'm just attaching it as an attachment to the explanation that follows. Please consider 0.3 for inclusion into Drupal core.
In the attached patch I tried to follow this path instead. It is almost a complete rewrite of the whole _filter_url_filter() function. (Whereas patch 0.3. actually is more like additions onto the simple regexps in current HEAD.)
The problem is that URL filter actually makes 3 different replacements. It makes links out of
1) http://any.full.url
2) email@addresses.com
3) www.domains.com
The problem is that once you do the split (you can just split on all tags, it's really simple)
(1) works fine,
(2) works fine since it (practically) never confilcts with (1),
but then at step (3) you end up actually hitting all of the addresses that were converted in step (1), thus messing up the links created in (1). So the problem is that while we preg_split away all tags present in the input, (1) and (2) create new
<a>
tags and there is no new split function and then you mess everything up at (3).Possible solutions to this dilemma:
1) Make the last regexp more intelligent/complex so it doesn't do this -> This would end up being the "complex" regexp used in patch v0.3, kind of defeating the whole point of the excercise: Then we'd have both more complex code and the complex regexp.
2) Separate the 3 regexps so that only one is done at a time. For instance have a loop that goes 3 times, doing the preg_split anew for each loop. This way the new
<a>
tags created in steps (1) and (2) would become protected at step (3).I personally am of the opinion that while the regexps in v0.3 may be a bit wordy, the complexity of the if-else structures of v0.4 are not that simple either. Adding another loop or other logic doesn't seem worthwile to me - I'd rather vote for v0.3 where at least the code is straightforward, albeit the regexps may be on the long side.
(Another comment on patch v0.2)
Is the slowness of a regexp in a filter an issue, since the filter is cached? I do admit that URL filter is one of the most used ones, but nobody runs filters without having the caching on, right?
Whether the regexps are readable or not is of course a matter of taste, to the author everything always looks simple. But I'd argue the if-else logic in v0.4 takes about as much time to understand as the regexps in v0.3. (v.0.4 also took longer to get right than modding the regexps took for v0.3.)
So in summary, I'm still proposing the patch v0.3.
Comment #10
hingo CreditAttribution: hingo commentedSmall restructuring of the regexp in $tags_begin variable. (Refinement based on patch 0.3, see previous comment.)
Changes line 1089 (line 10 in this patch)
Removes the for-loop that did the same transformation around each tag separately.
This change makes the regexps a bit shorter, so should be slightly simpler to read and also to execute.
Comment #11
hingo CreditAttribution: hingo commentedOops, forgot to change status...
Comment #12
hingo CreditAttribution: hingo commentedHow should silence in a "needs review" situation be interpreted?
Where is 6.x at the moment, is it still realistic to target this for 6.x or are we close to RC time?
Comment #13
shap CreditAttribution: shap commentedI applied your patch 0.5 to test. There seems to be a bug here. Given the input
the url is getting filtered, with the effect that the code example is obfuscated. This occurs even though the code tag is NOT on the list of allowable contexts for filtering. Any idea what is happening?
Hmm. Actually, things get substituted in a code tag even without the surrounding blockquote. This doesn't seem right if I am understanding the module code's intentions.
Comment #14
hingo CreditAttribution: hingo commentedHi shap
Thanks for testing this.
Unfortunately, I cannot reproduce what you are getting. I tested on a vanilla 6.x-dev, with 0.5 patch applied. For instance,
All of the other url's get modified, but the url's inside the code tags are not modified.
On the other hand, the behaviour you describe matches the current drupal url filter. (And yes it is wrong and that is exactly the reason I'm trying to fix it.) So could you please check if:
* you really are running filter.module with the patch applied
* you are not seeing a cached version of your content. For instance, write foo123 at the end of the text to make sure it is different from some old version.
Comment #15
hingo CreditAttribution: hingo commentedI found one instance where actually url's are modified when they shouldn't:
<code>In this example www.test.com the first url is not modified, but <em>the second www.test.com is</em> even if it shouldn't. </ code>
The problem is that v0.5 only pays attention to the innermost tags. I haven't thought of a case in normal html where this would lead to problems, but indeed the code tag is different.
I'll work on this and see how to solve it. Thanks shap for bringin this to my attention - I don't know if this is exactly the bug you were describing, but thanks still.
Comment #16
hingo CreditAttribution: hingo commentedWell, to fix the problem with tags nested inside code tags, I had to revert to the preg_split/implode strategy from 0.4. I've completed and refined that approach and this patch is a refined version of 0.4.
I would appreciate if you can test this out too. I'm pretty sure this is close to perfect, I dare you to come up with examples to break this :-)
Comment #17
hingo CreditAttribution: hingo commentedAttached is the testdata I'm currently using.
Comment #18
hingo CreditAttribution: hingo commentedSince 6.0 has now reached beta (congratulations, btw) I'm changing this to target 7.x-dev.
Comment #19
catchWell I've had the url filter mess with urls created by the footnotes filter before (i.e. within a href), so I think this counts as a bug. I'm tentatively moving it back to 6.x-dev since other bugs will get fixed before rc, and changing it to a bug report since that's really what it is.
Sorry no time to review this morning.
Comment #20
hingo CreditAttribution: hingo commentedGreat! I certainly agree with classifying this as a bug, if there are more than me who will agree.
I was trying to be cautious, since I'm under the impression that the URL filter is "good enough" for many people and some may not agree that what has now become a complete rewrite would be considered for 6.x. But as this thread too shows, there are planty of things the old URL filter does wrong, and this patch should fix all of them.
Comment #21
hass CreditAttribution: hass commentedYou fileencoding is wrong (ISO). You must use UTF8 without BOM's. Check your editor, please.
Comment #22
hingo CreditAttribution: hingo commentedSorry about that, I didn't realise there were any characters where that would have mattered. Here is the same patch but now in UTF-8 format.
Comment #23
hingo CreditAttribution: hingo commentedForgot to change status again...
Comment #24
hingo CreditAttribution: hingo commented...
Comment #25
StevenPatzBeta 2 is out for 6.x, so I'm moving this to the development queue.
Comment #26
catchBugs are still being fixed in 6.x, and at least the footnotes module is caught by this bug requiring careful weighting of the various filters. It's possible the patch may be too much of a change at this stage, but it'd be nice to get more reviews before bumping it.
Comment #27
hingo CreditAttribution: hingo commentedAs someone who doesn't follow the dev mailing list, I'll ask here: Is HEAD already taking in new patches or is activity still focused on D6?
Comment #28
catchhingo, currently 6.x and HEAD are the same thing.
Comment #29
jpsalter CreditAttribution: jpsalter commentedI just ran into this problem in 5.7. Has this thread gone cold? Or, is there a dupe somewhere?
My use case:
<p class="foo">www.example.com</p>
--> no conversion<p>www.example.com</p>
--> worksAny chance this will make it into 5.x or should I hack core?
Comment #30
hingo CreditAttribution: hingo commentedHi jpsalter
You've found the right place, this is THE thread and it is still active.
6.0 is now out and this wasn't applied. I'm actually wondering myself what the correct Version tag for this would be? 6.0 (since this bug exists in 6.0) or 6.x (where it should be applied) or 6.1 or 7.x?
I doubt it will be applied to 5.x, but you should be pretty safe applying it yourself. From what I can tell the _filter_url() functionality has been untouched through the whole 5.x series to 6.x. Even if the patch wouldn't apply you could still just replace the relevant functions with copy pasting.
Comment #31
jpsalter CreditAttribution: jpsalter commentedThanks for the confirmation and all your hard work. I hope others recognize it too.
Comment #32
keith.smith CreditAttribution: keith.smith commentedThis should be fixed in HEAD (now 7.x-dev) and then backported if appropriate.
The patch in #22 has some code style issues with comments (lacks periods at ends of comments, for one).
Comment #33
hingo CreditAttribution: hingo commentedThanks Keith
Old habits die hard... I already had the periods in place, but rewriting parts of the code it seems I introduced periodless comments again.
This patch 0.8
- fixes style issues with comments (periods)
- diff against 7.x-dev (these functions untouched but file changed so line numbers changed from 6.x-dev)
Comment #34
hingo CreditAttribution: hingo commentedCould someone please see if there are any more punctuation or style issues with the comments and help get this committed to 7.x.
I first submitted this patch during my last years Summer vacation, and now the snow is melting and the grass is green again, I would be disappointed not to have such a straightforward fix committed before my next Summer holiday due in 3 months...
Comment #35
keith.smith CreditAttribution: keith.smith commentedhingo: Well, since you ask, on a quick read-through, we do use 'HTML' and 'URL' in comments (rather than 'html' and 'url', respectively).
Comment #36
hingo CreditAttribution: hingo commentedThanks Keith
Would you find in yourself enough energy to see if there is still something that needs to be corrected. I'm afraid next you'll tell me that www should also be uppercase? Speaking of which, is there an official style guide where I could get all this advice in one shot?
v0.9
* style: uppercase HTML and URL
Comment #37
hass CreditAttribution: hass commentedhttp://drupal.org/coding-standards is a good starting point and try coder module... it helps, too.
Comment #38
hingo CreditAttribution: hingo commentedThanks. Looking at it now I realise I already read through it last summer. But it isn't as detailed as some of the advise I got here, which is more like grammatical stuff almost. *Which I'm completely ok with*, btw, it just seems there is an additional "oral tradition" to style issues at Drupal, so it seems I will have to continue relying on the grace of core developers giving some feedback every once in a while.
Comment #39
keith.smith CreditAttribution: keith.smith commentedhingo: Some of this is contained in pages similar to the coding-standards page (but that are more general in scope), like http://drupal.org/about/authoring.
Comment #40
hingo CreditAttribution: hingo commentedAh, Thanks Kev. Had not seen that one.
In any case, we are digressing now a bit from the main question? What should yet be done so that this can be committed to 7.x?
Comment #41
hingo CreditAttribution: hingo commentedComment #42
floretan CreditAttribution: floretan commented@hingo: Can you provide some sample content with URLs, some of which are inappropriately processed by the current filter, so that the efficiency of this patch can be tested?
Comment #43
hingo CreditAttribution: hingo commentedAttached is my most developed "test case". The patch works good with all of these. The current url-filter fails at least:
* urls inside a dl list (not processed, should be)
* urls in a title attribute in a link (are processed, shouldn't)
* urls in html comments (are processed, shouldn't)
* urls inside some other html tags also are not processed but should
(This is out of my memory, I haven't used the official urlfilter in a while :-)
Comment #44
hingo CreditAttribution: hingo commentedping
Comment #45
floretan CreditAttribution: floretan commentedIssue #249200: Test individual filters. adds some test cases that show the failure of the URL filter in the cases mentioned in #43. Please review and extend those tests, and it will help to get a revised URL filter into core.
Comment #46
floretan CreditAttribution: floretan commentedRe-rolled patch from #36.
The tests written in #249200: Test individual filters. have been committed. 12 assertions fail with the current link detection algorithm, but all of them pass successfully with this patch.
Comment #47
hingo CreditAttribution: hingo commentedThanks, flobruit. I'm immediately going to take the benefit of not being the submitter of the last patch anymore and put a reviewer hat on. Let's see if we can finally get this committed!
/me reviewing... Looks good!
We've had several people looking at this without complaints for months now, and now we also have the unit tests committed in
filter.test
that show the improvement over the existing url filter. I vote for committing the patch by flobruit in #46.Comment #48
catchStill applies with offset, and fixes those failures in filter.test (although there's some other unrelated ones).
Comment #49
catchThe other failures in filter.test are dealt with here: http://drupal.org/node/266539
Comment #50
sunPlease remove line-break here:
Please read coding standards again:
Please improve these comments:
Please insert a newline between breaks and cases inside of switch statements:
$task seems not to be defined and poorly named.
Please improve these comments:
We do not use such "reminders" in Drupal: (if that's needed, the function body is likely too large)
...probably more, but please fix those first.
Comment #51
hingo CreditAttribution: hingo commentedWow, this looks like a real review! I have to admit I had given up hope...
Sun,
Please don't! Those are not the same comment. The 2 first lines refer to the previous line, while the three last ones to the following line. (And since next you are going to say that comments cannot appear after the statement they are a comment on, I might as well add right now that this code is a copy paste from existing code in Drupal core and is like that in 4 other places too.)
This is a rather unhelpful comment (I'm being polite here) and no, I won't read them again. If there is a style issue, say what it is.
Sure, they can be removed. (This must be the first time anyone complained of there being too many comments in a patch I wrote :-)
Sorry, this won't work. We've had several people now flying in and "helpfully" pointing out punctuation, capitalisation and whatnot issues. I'm not interested in that game anymore. This patch fixes a known bug in Drupal which several people have complained about. Since September 2007 (that's 9 months, I have made a baby in less!) there hasn't been any code updates, just grammar/style discussion. If we get a list of the style/comment issues that should be fixed we will fix them. But at least for my part I'm not gonna participate in a never ending game of "fix these first" reviews. This is a short patch, it is not too hard for you to give a finite list of problems and we'll fix them and then the patch can be committed.
Thanks.
(FWIW, your comments I didn't comment on are perfectly good comments and I'll be happy to improve on those together with the complete list of remarks that I hope to get soon. I also want to point out that it's obviously not your fault that we've already had too many unhelpful helpful reviewers pass by without any progress, you just happened unluckily to be the last one - in fact your review was the best so far.)
Comment #52
sunHrm. This is called peer review. We, the community, have advanced in this area a lot in the past. No patch will hit core that contains coding-style issues or any other logical/syntactical weaks. If we would slip such patches in, Drupal core would quickly look like err... tryin' hard to avoid names of other CMS. Please note that I have not tested your patch, just reviewed the code. Before any other core or core-alike developer will stumble upon this issue, you should really try hard to make this patch ready to be committed. That means, your patch should include no coding-style or coding-standards issues, and it must stand the objections of performance-related freaks that are yet to come.
I wouldn't have noted the following if it wasn't contained right in the first code example snippet of the first section of Drupal's coding standards:
Please read coding standards again:
There's no patch attached.
Comment #53
hingo CreditAttribution: hingo commentedSun, I have nothing against peer review. We have been waiting for soon a year now for it to happen, and for nine months this patch has been *nothing but* waiting for peer review to happen. So I'm not objecting to the issues you pointed out, I'm objecting to you saying that there are more that you didn't point out! Please do so.
I'm moving this back to Code Needs Review because that's what we really need here. When we know that we have a final list of issues to correct, I'll correct them.
(Thanks for clarification on that one point.)
Comment #54
jpsalter CreditAttribution: jpsalter commentedA search-n-replace of '( ' => ')' and ' ) => ')' cleans up the spaces in functions.
There are a few comments left that do not begin with a capital letter and end with a period (mostly comments that end loops:
+ } // if ($opentag == '')
+ } // if ($i % 2)
+ } // for chunks
I'm not sure of the Drupal policy of closing } having comments. Maybe it is easier to remove them.
I've been struggling to get my code to meet Drupal's conventions. It is hard to undo years of coding habits. But, after taking the pain on a module or two -- it is getting easier.
Thanks for keeping at this.
Hope this helps.
Comment #55
catch@hingo: there are over 270 patches in the queue against D7 that need review, some of those patches have been around a lot longer than this one, and been through many, many more re-rolls, even for comment style issues.
I didn't look at the patch yet, just ran it through tests, but will do so tomorrow.
Comment #56
hingo CreditAttribution: hingo commentedjpsalter, catch: Thanks.
@catch: Sure, and I understand this is how projects like Drupal works, and I'm not in a hurry, for myself I can apply the patch anyway and if no-one else cares then it is ok to focus on higher priority issues.
What is unproductive imho, is that there are people passing by, pretending to do a review when it's just really very very superficial. So for instance in this last instance, 1) the reviewer himself says that his review is incomplete and he may come back for more comments later and 2) he later also says "I didn't even look at the code yet" (which I didn't want to unnecessarily point out, but was also apparent from the first comments, sorry to say). From my experience if those are fixed, then for 2 months it will be quiet again, then someone else will pop by saying "Here are a few more comments but this is not all..." And at no point in these reviews are we really moving closer to actually committing anything.
So by all means, feel free to leave even more such comments, but there is no need to change the Status to "code needs work" when you know yourself you didn't really do a proper review. What this patch needs is more review, and when we've had a proper review it's just 1 minutes work to incorporate all the issues also from the partial reviews, such as those made by Sun.
So what I'm waiting for now is for someone to say: "I have really looked at the code *and* comments and here are *all the issues as far as I can see*." Even better if we get someone to say: "After fixing these I can (or I will propose) commit this patch."
Comment #62
catchAttachment problems again. This should get us at least part of the way with the code style issues.
Comment #63
floretan CreditAttribution: floretan commentedWhy pass a drupal variable as a parameter to _filter_url_trim()? We can simply access that variable from inside the function.
This comment was copied from _filter_htmlcorrector(). $null should be NULL.
Also, when positioned after the line of code, the comment is not really clear. What array?
I suggest this variant (the same comment appears in other functions in filter.module, I changed those as well):
In most of core, drupal uses
/
or!
to delimit regular expressions. The use of`
isn't wrong, but we might as well make it consistent with the rest of the code.Finally, a PHP-esque
would look better in this context than the C++esque
Comment #64
catch@flobruit: those changes all look good.
Found a couple more stray spaces.
Comment #65
chx CreditAttribution: chx commentedI have removed a bit of code duplication and simplified a bit.
Comment #66
catchMuch tidier, and it still fixes all the failures in filter.test
Since this is a blocker to all tests passing, marking to critical.
Comment #67
catchGood old fashioned clicking and posting found:
* notice: Undefined offset: 1 in /modules/filter/filter.module on line 771.
Comment #68
hingo CreditAttribution: hingo commentedGuys, I'm sorry to say, but this "cleanup work" has broken quite many things. It was nice to see so many genuinly helpful inputs and I'm sorry I wasn't able to follow up instantly on changes you made. However, looking at this patch now, there are several things that have been broken, you ahve each contributed some of it. To simplify this, could we just go back to #62 which is the last working patch as far as I can see.
Things broken in the new patch:
* A crucial
} else {
has inexplicably disappeared, this is a major bug of course. (missing at line 77 in patch).* Removing $length parameter from _filter_url_trim() has broken that function ($format is now undefined). Probably buggy at least for long URL's?
Further, I strongly disagree with moving the fetching of the length parameter into the function in the first place. _filter_url_trim() is a generic utility function which should just take an input and return an output, it shouldn't have any drupal stuff inside of it. Utility functions should be generic, fetching a drupal variable from db belongs to the main code.
* I don't agree that moving the patterns into an array and away from their actual regexps calls help readability at all. If you want to be clever, go all the way and make the switch-case statement go away completely (email task needs to be modified to use callback too), that would make code shorter and at least arguably easier to read.
* Minor: in the callback functions I find the new comment "+ // Find the parenthesis in the regexp containing the URL." misleading, since this is not about finding anything. My original comment was just supposed to be a description of what variable
$i
is, not any actual action.***
So, back to main topic, attached is the patch from #62 which we need to get back to. This contains the whitespace fixes but none of the restructuring of code which all introduces bugs. (It contains the new comment in the callback functions I don't like as listed above in the last minor point, so that could be reverted in the next edit.)
I'm switching back to code needs review and we can continue discussion from where it was at #62. (ie, shouldn't be any bugs, you may have style opinions you want to re-introduce...)
I'm thankful for more help on this patch, but it is not tolerable to introduce real bugs while discussing style issues. Please be more careful and remember that you are editing a working patch.
Comment #69
chx CreditAttribution: chx commentedhingo, my patch removed the whole
if ($i % 2 == 0)
business because I found it ugly. It's not just anelse
disappeared -- I replace that with an$i++
. I am not debating merely code style here. I was thinking on how to remove the switch, but it never came to me to replace that with a preg_replace_callback. Honestly: good idea. (Sigh, I wish we had lambda functions, would be useful here) What bugs did I introduce?Can you reroll with the
if ($i % 2 == 0)
and theswitch
removed?Edit: I agree wholeheartedly that introducing bugs into working patches is very frustrating. However, I am not introducing new features here, but trying to get a better code (in my eyes, better, at least). Really sorry if I broke your patch -- I know how it feels.
Comment #70
hingo CreditAttribution: hingo commented@chx: I tested your patch, with the particular input I had in mind. You're almost right, cause you didn't break what I thought. However, you are to blame for the 3 new notices we get:
(The fourth notice is because of the _filter_url_trim() change introducing a bug.)
This is because the $chunks array always ends with a text chunk, but in your patch the code always ends with trying to process a tag, so these notices are the PHP equivalent of ArrayIndexOutOfBounds complaints. It's not fatal (output actually is still ok) but it is obviously not acceptable either. In a more general description: Your trick doesn't work if there is an odd number of array elements, and in this case there often (always?) is.
I agree that the
if($i % 2 == 0)
thing may not be the nicest code either. What do you think about the following:1) Keep my original code hierarcy
2) Replace my modulo stuff with a variable that takes text values "text" and "tag" instead.
I mean:
What do you say? Btw, this introduces an "unnecessary" variable just for readability, is that considered good or bad in the Drupal community?
***
Going forward: I need to go to sleep now and will be on a business trip the whole next week. If nothing happens during that time, I will update the patch with the below changes. If someone is still bold enough, feel free to do these updates and continue the debate, then I'll come back a week from now and tell you what I think again.
FWIW, I've been a little cranky here earlier, I am in fact happy now that the last contributions have been addressing code and other substance, not just whitespace and grammar. (Note to self: check whitespace when done, I'd hate to be caught a fourth time for it!)
TODO, wrt patch in #68:
1) Agree with chx (and others) how to replace
if($i % 2 ...
and replace it2) Make the switch statement go away completely. Email regexp processing needs to be made into using a callback function too, and then use the name of each callback function as $task (See $patterns array in#65). (This is another "clever" thing I'm still lukewarm about, but it will make code significantly shorter and less indented, so if we do it with good comments I hope it will increase readability. I really never liked that switch, have to admit.)
3) Do NOT make any changes to
_filter_url_trim()
.4) Revert the changed comment inside the callback functions so that it is a comment introducing variable $i. If there is need to comment further on what the function is doing, do that too.
5) Double check, but I believe all issues in #50 have been addressed. (Either included or rejected.)
No patch attached, I'll be back in a week (plus perhaps a few days).
Comment #71
hingo CreditAttribution: hingo commentedAttached patch implements changes as discussed in comment #70. There are no outstanding issues so patch is ready to be reviewed and committed.
Comment #72
catchAll filter tests pass, no notices. Not looked at code changes.
Comment #73
hingo CreditAttribution: hingo commentedJust a short note to everyone following this thread. I have today committed the above version of URL filter into the Footnotes module, of which I'm the maintainer. You can now easily install and use a working version of URL filter by
Please see #281143: Better URL filter now available within Footnotes for more info. Please comment on Better URL filter in that thread and not here!
I hope this will resolve the standstill we have here and users who need a working URL filter can finally have it without letting this discussion hold them hostage for another year or two.
Comment #74
cwgordon7 CreditAttribution: cwgordon7 commentedThis causes core filters tests to pass, well done. I've looked through the code, and all looks good, except for one thing:
That makes me very very very unhappy. I don't have the heart, though, to mark it back to cnw, so rerolled (just those lines changed) and happily rtbc'ing instead. :) Please don't credit me on commit.
Comment #75
cwgordon7 CreditAttribution: cwgordon7 commentedPlease excuse mislabeled patch in previous comment. It is the correct patch, only mislabeled. :)
Comment #76
catchI've patched and run tests with this, confirmed RTBC.
Comment #77
hingo CreditAttribution: hingo commented@cwgordon
I originally removed the +3 since I don't understand the logic of it. If anything, it should be -3.
Now, if user has set max length for URL's to be 72 (for instance), then we should cut all URL's that are longer than 72, not longer than 75. Even so, this is not a major issue for me, so I'm letting your patch stand. I'm just saying I don't understand it and I don't see it doing what the comment says it is doing.
Comment #78
catch;)
Comment #79
cwgordon7 CreditAttribution: cwgordon7 commentedBump.
Comment #80
dmitrig01 CreditAttribution: dmitrig01 commenteddon't smashtogether variable names. This patch moves comments around, removes windows linebrakes, and renames some variables. no actual logic has changed.
Comment #81
catchPatch adds an extra _filter_url_parse_full_links() rather than replacing it.
Comment #82
catch#260484: Regression: Filter tests fail was committed, so those tests need to be rolled in with this patch.
Comment #83
hingo CreditAttribution: hingo commentedIndeed, patch #80 doesn't seem to remove much lines at all, same problem also with the other *_parse_*_links() functions for starters. Very unhelpful.
As I'm the original proponent of this patch, I guess it is fair to announce publicly that what happened in #260484: Regression: Filter tests fail made me loose whatever little faith I had left in this project and I don't intend to lift another finger working on this patch anymore.
(The good news is I will release version 2.0 of Footnotes during the weekend and users can just enable a working URL filter (this patch) from there instead, as it will be included.)
Comment #84
catchI've gone through and re-rolled #74 with some of the comment and variable naming improvements in #80 + the filter tests themselves. Back to needs review so we won't just be pretending we have no bugs in core :p.
Comment #85
hass CreditAttribution: hass commentedWhy are you not using "example.com" for tests?
Comment #86
catchThese tests were just rolled back from core this morning, I didn't change anything when rolling into this patch. Either way, fixed.
Comment #87
catchMissed some.
Comment #88
catchdoppelpost
Comment #89
catchWhoops: http://drupal.org/node/260484#comment-966252
re-rolled to get this back to a functioning state, although it'd be nice to clean this up later so it's not using non-legit domain names.
Comment #90
Damien Tournoud CreditAttribution: Damien Tournoud commentedI'm tired of this, mine for a few hours.
Comment #91
catchRe-rolled #89 and marking back to needs review since it still passes its own tests. We should either apply this, or get expectedFail() in. More elegant solutions can come later, leaving stuff broken in core when there's a working patch with tests for this long is bad.
Comment #93
lilou CreditAttribution: lilou commentedTest failure : #335122: Test clean HEAD after every commit
Comment #95
webchickOh you silly, silly test bot.
Also, wow. Somehow this managed to stay off my radar for a very long time. I guess my RTBC queue sprints and this patch's CNW statuses were not well-timed at all. Really sorry for that, hingo. :( You might have better luck asking for reviews in IRC? There's usually a group of us hanging out and talking D7 at all hours of the day.
Comment #96
catchMarking back to RTBC per #74 and pending non-broken test bot. There's been no significant changes to the patch since then apart from re-rolls if you don't count going around in circles.
Comment #97
webchickThis no longer applies after committing #276597: TestingParty08: filter.module. Some tests for the URL Filter were added as a part of that patch, so the tests here will have to be re-worked slightly to fit in there. Sorry for that, but I had to make a judgment call and XSS filter testing seemed pretty important.
Here's the format for assertions used in the other patch:
We should make this patch fit that as well. Having the entire contents of the test string in HEREDOC syntax is very odd. This exercise will also help reviewers to understand the specific issues being fixed by this patch, since right now it's not very clear.
Unsurprisingly ;), the patch itself conforms well to coding standards, and is very well documented, etc. I'm not positive about the logic therein, but if the tests prove it works, we can probably just commit it as is unless someone comes up with any alarming reasons why not to. Although before that, it would be good to have some baseline benchmarks just to ensure we're not increasing the time spent in this function by 400%.
Comment #98
sunOnly nitpick:
PHPDoc and comments should wrap at 80 chars - except the PHPDoc summary.
Comment #99
catchThis fixes the comment issue sun noticed, and adds two of wrwrwr's failing tests from http://drupal.org/node/276597#comment-1196760
May or may not get to looking at adding more assertions.
Comment #101
webchickNote: Henrik contacted me off-list and noted that one real advantage of the HEREDOC syntax is that you can test each of these conditions on "real world" data, rather than in isolation. I think that's a really good point.
Probably what makes the most sense though rather than embedding this data within the test itself is making a separate .html file in the modules/filter/tests/ folder that we can continue to add to as we come up with other crazy ways to break various filters.
Comment #102
sunWould be awesome to get this finally done for D7.
The list of tags to exclude should also contain PRE.
Comment #103
catchUntested re-roll.
Comment #104
catchJust to see how tests look, not really CNR 'cos this will need some tests adding back in.
Comment #106
sunWe also need to skip HTML comments.
Powered by Dreditor.
Comment #107
hingo CreditAttribution: hingo commentedHTML comments are properly skipped. Not because of that line, but because of line 70 in the most recent patch.
It could be added to the preceding code comment of course, if wanting to be pedantic, but it does also exclude html comments.
Comment #108
kenorb CreditAttribution: kenorb commentedRelated:
#362629: URL Filter doesn't make links if the chars are international
#340776: URL filter does not urldecode() the URL for link text
Comment #109
redndahead CreditAttribution: redndahead commentedThis patch fixes the failing tests, plus this doesn't seem right. I don't see this patch adding a variable and one doesn't exist at the moment. I have changed it. Tests now pass.
Powered by Dreditor.
Comment #110
sunNote that there are quite a few other bug fix patches for URL filter in the queue. Looking at the changes of this patch, it may be worth to tackle the other issues first. Just an idea.
Also, this issue was bumped to critical in 2008. Filter tests pass in the meantime, so this is just a regular bug.
Still seems to be missing PRE.
Trailing white-space.
Exceeds 80 chars.
94 critical left. Go review some!
Comment #111
redndahead CreditAttribution: redndahead commentedThis integrates sun's comments.
Comment #112
hingo CreditAttribution: hingo commented@sun
Note that this patch will fix almost half of the other URL filter bugs since most of them are just various symptoms of what this patch fixes. So it certainly makes sense to take this first, then see how many other bugs you can immediately close. #190466, #168227, #191744, #728380, #555280... from the first page of the list should all go away. (This is also the reason this was marked critical, the old URL filter fails in quite many ways and this addresses the root problem.)
For tests passing, it is because tests were removed, not because bugs were fixed. Add back the tests from comment #91 and you'll see a different picture. (How critical the issues are is of course another thing, but as you can see, there are quite many bugs filed against URL filter that are various permutations of this one.)
PRE tag: Personally, I don't see a reason why URL's inside a PRE tag couldn't be made clickable, but don't have a strong opinion either way. I agree that a strict interpretation of "preformatted" means there should be no other formatting inside the tag.
Only thing to still change:
Please use the tests from #91. The current "one liner" tests will not actually catch many of the bugs this patch addresses. It is important to test on data that contains multiple strings to make clickable, and multiple rows, and multiple paragraphs. If HEREDOC syntax is frowned upon, please separate the big test data chunk into a separate file and use it from there.
On a quick reading the last committed patch is good and would pass all tests from #91, so committing it is ok. But since the tests were written, I wouldn't mind adding them to the test harness too. This seems to be an area where some contributors may easily break things.
PS: I'm glad to see this patch is finally getting some serious attention. I'm happy to help if there are any questions. Unfortunately I have been affected by the European flight chaos, and will now embark on a 5 day trip back home, but will be back online mid next week.
Comment #113
klausiSame patch as in #111 plus reused tests from #91.
Comment #114
David Latapie CreditAttribution: David Latapie commentedHow would your patch compare to HTML Purifier? It lloks like we have at least three different tools to do the same thing (core URL filter, forked URL filter, HTML Purifier module). And I see no advantage in this.
Thanks in advance for your answer.
Comment #115
hingo CreditAttribution: hingo commented(This is based on a quick read of HTML Purifier code, not actually installing it.)
HTML Purify mainly does the same job as HTML filter (making HTML safe). But indeed, it contains also the "Linkify" class as some kind of addon service.
In general HTML Purify seems to be a rather complex beast comparing with Drupal's very simple core modules for doing the same thing. Also it is (probably?) based on actually validating the HTML with an HTML parser (correct, but slower), whereas Drupal filters tend to just use some regexps to get their job done (good enough if done right, faster).
As for the URL filter vs Linkify comparison, it is surprising to note that my patch here addresses more URI types and the Linkify class in HTML Purifier turns out to be rather simple. In fact also the original URL filter in Drupal core does more than the one in HTML Purify. This can probably be explained by the fact that this is not the main task of HTML Purify, so the implementation has been left as rather simplistic.
HTML Purify will only convert into links strings beginning with
http://|https://|ftp://
This patch is more complete and will catch all of:
http://|https://|ftp://|mailto:|smb://|afp://|file://|gopher://|news://|ssl://|sslv2://|sslv3://|tls://|tcp://|udp://
Also the original Drupal URL filter was more versatile than the one in HTML purify.
In addition both this patch and the old URL filter also convert email addresses (...@...com) and "plain" URL's without the "http://" (www.example.com or ftp.example.com).
In summary, Drupal's URL filter does more than the one in HTML Purify, and is (very probably) simpler and faster.
(And to those just tuning in: there is no difference with this patch vs the forked URL filter, they are the same. The difference between this patch and the old Drupal URL filter is that the old filter is rather buggy, it will miss many URL's that should be converted (especially if other HTML is involved) and it will mangle some places it shouldn't touch.)
Comment #116
hingo CreditAttribution: hingo commentedI didn't notice earlier that this was downgraded from critical in #110. As I explain in comment #112, the statement that "tests pass" is not correct and the argument that there are many other bugs in the queue is questionable too, since this patch will fix about half of those bugs! I apologize for not setting back to critical at that point, I was stuck in an airport somewhere with bad connectivity.
If there is some other reason not to regard this as critical, feel free to revisit, I'm just saying the earlier downgrade seems to have been done based on wrong assumptions.
Comment #117
redndahead CreditAttribution: redndahead commented#113: 161217-url-filter.patch queued for re-testing.
Comment #119
sunA bug in one of Filter module's filters is very very unlikely to be more important than normal. That is, because you can replace any filter with one of a contributed module, and D7 makes this even easier (e.g., swapping out the processing callback only).
Anyway, let's fix those tests. Will try to do a fresh review asap.
Comment #120
sunok, talked to hingo. Given that this patch seems to fix a full range of other open bug reports filed against URL filter currently, promoting to major.
Ideally, we should write (only) tests for those other issues to show that they fail.
Why do we initialize a variable with NULL that's supposed to be an array?
I don't think we need to escape
.,?!
here. Also, I wonder why*?
instead of just?
? Lastly, we don't need that matched value, so this should cut it:Or am I missing something? If I do, then the code is missing an inline comment.
IIRC, the (duplicating) title attribute has been removed by another patch in the meantime.
Some phrases in this testing text sufficiently explain what is being tested and expected, but some of them do not. We should make sure that expectations are sufficiently explained.
Alternatively, and I think I'd almost prefer it that way, we could build an array of testing strings and assertions to keep them together, and implode them afterwards, like so:
As you can see, this allows to cleanly define a single test string and test multiple expectations. For the single period (punctuation) test string, we can assert at least 4 things. Larger/multiline testing strings are also no problem.
It might make sense to put the actual testing code into a helper method. So we can build separate $test task lists for individual filter functionality, i.e. $test_absolute, $test_mail, $test_partial, or rather, any separation that would make sense. Technically, I think one could separate by topics characters/encoding, surrounding markup, security, and potentially more - splitting assertions up into such sections probably makes more sense.
In a separate follow-up issue, we should convert most of the existing filter unit tests to use this construct, as we have a lot of awkward assertion messages there.
Powered by Dreditor.
Comment #121
hingo CreditAttribution: hingo commentedHi sun, and thanks for the additional review. (Would've forgotten about the title attribute otherwise, is *that* checked by a test? :-)
So just for the public record, we did a thorough review and out of 9 open bugs, following are fixed by this patch:
#168227: URL filter fails on URLs surrounded by HTML entities
#877050: URL filter does not convert a link inside a div
#555280: Filter "Convert URLs into links" converts URLS within inline css background-image property
I can work on this during next week.
A few questions to sun:
1)
It would be easy to also fix
#190466: URL filter fails on square brackets
#728380: URL filter fails on parenthesis (RFC2396 Unreserved Characters; i.e., ( and ))
...should I alter the regexps here to cover entire unicode alphabet, or do you want to do that in a separate issue?
2)
When those other bugs have test cased posted, but they are not in D7 CVS yet, should I copy them into this patch?
I'm willing to do both if you answer yes.
Comment #122
sunRe-rolled against HEAD.
1) Let's deal with unicode separately. 2) Let's keep those other issues open, get this one in, and revisit the other issues afterwards.
Comment #124
sunReworked the tests. Clarified some comments.
The remaining test failures are for real.
Comment #126
sunFixed those regular expressions. I think this patch is ready to fly now. Anyone available for review + potential RTBC?
The new tests are entirely replacing the old tests now, as they're much more precise. I've taken extra care and double-checked that all of the existing assertions are indeed covered by the new ones.
Also, I had to mostly inline similar changes as already proposed in #257193: URL filter: Use filter_allowed_protocols variable, since my new tests tried to assert that a link using gopher:// gets converted, but apparently, that hard-coded list of protocols was totally outdated, and thus, check_url() replaced those invalid bad protocols with http:// ;)
Comment #127
sunAdditional took over a useful @see and assertion from my original patch over in #257193-5: URL filter: Use filter_allowed_protocols variable
Comment #128
hingo CreditAttribution: hingo commentedHi sun
Good work pushing this forward and getting the patch to apply again. Here's a review. I have errands today, but can work on these myself too if nobody else does it before me.
As you note in #257193: URL filter: Use filter_allowed_protocols variable, / and : are valid characters anywhere in the URL. So we could simply omit the addition of
(?://)?
and allow those characters liberally in the next part of the URL. The current regexp doesn't however allow them. So we would need to back to the plan of allowing most unicode characters anywhere in a URL. It can either be done right now, or we let the above stay as it is and then remember to remove it once we change the other regexp so that the above(?://)?
isn't needed anymore.(For clarity, the above needn't necessarily be fixed in this path.)
Good call to add $ip (whoever did it).
Same comment here: All of this will be much simpler if we in the future change this regexp to something that just matches a string of unicode characters until next whitespace. (Nothing to change here.)
From #120 above:
The *? is supposed to catch multiple trailing punctuation such as www.example.com... or (blablabla www.example.com.) or example.com!!! Note that it may not always repeat the same character, eg .) must match too.
It seems the email task doesn't contain that regexp at all so me@example.com. wouldn't be matched correctly. (This omission exists already in my patches and original code.)
The above code correctly ignores also multi-line comments. A separate test for this could be added though, or rather just add \n to the existing test.
That is all for the functional code. I have a meeting now, will also review test code separately (look forward to what you've done :-). Hopefully tonight.
Comment #129
hingo CreditAttribution: hingo commentedHeh, funny to watch how drupal.org itself behaves when I give examples of strings that should or shouldn't be converted.
In this case, this comment:
...is wrong, it works correctly for other reasons. Please ignore.
Comment #130
sunThanks for reviewing, hingo!
I may be mistaken, but this is not a valid domain name:
We need a separate pattern for $domain, so as to be able to properly match valid domain names and not match invalid domain names. At minimum, there are domain name length restrictions we have to account for, but also, not all characters can be used in all parts of a URL.
For now, if this is debatable, I'd recommend to move detailed discussion about protocols into #257193: URL filter: Use filter_allowed_protocols variable
Yeah, I had to get those tests passing. ;)
That's what I figured, too. However, we are not interested in further trailing text; just want to optionally exclude a certain (single) trailing punctuation character from the URL, if there happens to be one.
I agree that e-mails should get that punctuation treatment, too, and we should move the pattern into a new common $punctuation variable.
Good catch! Let's add a separate string test for multi-line comments then. Also not sure whether we should add another containing HTML markup within a comment.
Comment #131
sunAlright, incorporated all points from #130.
Problem: URLs in multi-line comments are not ignored. @hingo: Can you fix that?
Comment #133
hingo CreditAttribution: hingo commentedI'm not at my home computer at the moment, so cannot test, but since you are so active on this, a few quickie comments from my head. (...which may be wrong).
I can take a look later, but more importantly I actually tested a multi-line comment before I wrote my previous review and it worked correctly. I only had one newline, is your problem with more than that?
Adding yet another test with html markup in comments is a very good idea. (That may actually break, I realize now! The URL should be after the html...)
Yes, but now if you have such multicharacter punctuation, it stops on the last punctuation character, not the first? Ie it produces
<a href="http://www.example.com..">www.example.com..</a>.
Turns out emails never needed them, since they always end in a letter anyway. (like the m in .com)
Your comment about domain names is correct. But I was thinking there could be other protocols than http which don't have a domain name at the start. Even in mailto, I could see someone having "//@example.com" as their email address (yeah, probably illegal, need to check)
Comment #134
hingo CreditAttribution: hingo commentedLooked at the failed test now. I think a multiline comment works, it's your p tag that breaks it. The problem is the > character of the p tag matches here:
My first reaction is that we need to handle html comments as a special case. Probably the cleanest solution is to add a step that escapes all html inside comments, like some filters do in the $op="prepare" phase.
Comment #135
sunIf I'm not mistaken, then #222926: HTML Corrector filter escapes HTML comments solved a similar problem for the HTMLCorrector filter in D6 (where it still was regex-based). We may borrow a solution attempt from over there.
Let's assert this in a test. All of our expectations have to be asserted.
Let's not overcomplicate things here. ;)
Comment #136
hingo CreditAttribution: hingo commentedHere is a review of the tests, finally. As before, if nobody else is faster, I will also implement these myself asap.
Since I haven't really looked at these tests before, I'm reviewing the filter.test file itself, not just the most recent patch.
There is no reason to test both positively and negatively the same test. If the first test passes, then the second test automatically passes. Same for 3 and 4.
Also it is in general not a very useful approach to test some specific ways things could go wrong, since there are always a million ways things go wrong. We should just assert that the result is correct and that's enough. If it is known to be correct, we also know it is not wrong.
In other words this is sufficient:
This comment applies to many of the following tests too.
Maybe add cross referencing comments here and line 1325 in filter.module so that one remembers to update the test if the list of supported protocls changes.
Good that you also test for a not supported protocol.
Need to add back examples with multiple trailing punctuation characters from comment #128.
Learning from your previous comment, this test is ambiguous since when it fails you don't know if it fails on the HTML tag or because of the newlines. My favorite solution would be to also add newline to the previous test. Alternatively remove newline here and have a separate third test for it.
Ok, so now we have good test coverage.
From the email discussion we had, the approach you showed in building tests suggests it would be possible to easily test all HTML tags by iterating over an array that contains them. (You seem to have further developed that approach here so now it's slightly different.) Now we just test for DL list, P and some other random picks. But for me it is perhaps better to postpone that exercise to a separate issue and close this with the tests we have.
Even if I've already objected to it many times, you have now removed the one test I actually provided. The current tests do cover the various issues well in isolation, however it is also important to test on a larger document that combines many of the cases into one document. With the work you have done, it needn't be as complex as the one I did earlier, but I want to add back one test which would start with the following comment:
PS:
That's why I'm interested in the unicode support: the reason that is a tempting path to explore is that the regexps would be much simpler, like
($protocol:\S)$punctuation*?\s
Comment #137
sunFixed those tests. I need you to fix the multiline comments, not the tests. :)
Comment #139
hingo CreditAttribution: hingo commentedOk, here's a patch that
- reverts email handling to not use the punctuation stuff (I tried to say so in #129)
- includes fixes for the tests from my previous comment, including the test with large document
- adds back support for multiple trailing punctuation (and tests)
- correctly handles commented out HTML
- since I needed to touch the code anyway, finally fixes _filter_url_trim(). See http://drupal.org/node/260484#comment-966659
The code added to handle HTML comments definitively needs review.
I didn't use anything from #222926: HTML Corrector filter escapes HTML comments since it seems they don't have a working patch themselves. Maybe they can borrow my solution :-)
(Hmm... I cannot add new files with a diff, so I'm attaching a tar file separately that contains a new test/ directory.)
Comment #140
hingo CreditAttribution: hingo commentedNote that automated test will obviously fail now that 2 new files are missing from the patch. Tests naturally pass on my own machine.
Comment #142
sunYet another reason to put .test files into a /tests subdirectory... grml.
Comment #144
sunFixed those tests. Not entirely happy with the new test document -- takes hours + lots of debugging tools/code to find the actual cause for failing assertions. But we can go with this for now.
Comment #145
hingo CreditAttribution: hingo commentedLine numbers refer to the latest patch, not the source file:
Do you want to elaborate on why you have switched back to the broken version of _filter_url_trim()? In #141 the test was correct and the code was wrong. Now you have changed the code to match the test and they are both wrong.
In particular, you set filter_url_length to 20 characters, but now it cuts after 23 characters. (For instance, try with a URL with 22 characters and you see what I mean.) This is a minor issue, I'm just curious why you took the extra effort to remove my correct code?
**
FWIW, I'm ok with changes to _filter_url_escape_comments(). Using boolean variable doesn't semantically contain information about everything that the argument does, but I guess it saves so many bytes it is ok.
This is probably here by mistake? Also, it just moves a function a little bit, doesn't seem to do anything.
Comment #146
sun1) Sorry for reverting that URL length trimming change. I don't think it was mentioned in this issue yet, so I wasn't aware that it really belongs to this patch. I'd prefer to move this fix into a separate issue.
2) The verbose() method is moved from DrupalWebTestCase into the base class, DrupalTestCase, as it cannot be used in unit tests currently. To be able to debug URL filter's processing of the entire document, verbose() is needed, as that's the only way to look at the actual results.
So it looks like we're done here?
Comment #147
hingo CreditAttribution: hingo commented1) Ok. It is a different issue and can be handled separately. I mentioned it in #139. I'm ok with taking it away, just wanted to raise a flag since you reverted it without commenting.
2) If it was intentional, I have no comment, I'll focus on filter.module only.
+1 for #144 then.
Comment #148
sunI guess you meant RTBC.
Comment #149
sunThis patch blocks various other issues.
Comment #150
Dries CreditAttribution: Dries commentedInstead of calling those files
filter.url.input.test.txt
, can't we call themfilter-url-input-test.txt
?Comment #151
sunSince that is the only review issue, here's a proper patch. Don't forget to:
Comment #152
webchickWe need to back out moving modules/filter/filter.test to modules/filter/tests/filter.test. That's inconsistent with every other module in core, and has nothing to do with fixing this problem.
Comment #153
sunThat's not true. We have approx. the same amount of .test files directly in the module's folder as in a tests/ sub-folder. Additionally, having all testing stuff together in one location is only beneficial. Basically eliminates the t() in test assertion messages issue.
Anyway.
Comment #154
redndahead CreditAttribution: redndahead commentedJust to support Sun's claims these modules use "tests" directories.
Aggregator
Block
Field
File
Image
Locale
Node
Openid
RDF
Search
Simpletest
Trigger
Update
Comment #155
sunLet's discuss that detail in a separate issue. Latest patch no longer moves filter.test into the tests sub-directory. But of course, the testing text files have to be in there.
Comment #156
Dries CreditAttribution: Dries commentedCommitted to CVS HEAD. Thanks!
Let's talk about the location of tests files in a seperate issue indeed.
Comment #160
redndahead CreditAttribution: redndahead commentedgrumble grumble spammers.
Comment #161
ckngThis has not been backport to D6, open as new issue #1955590: URL filter breaks generated href tags (D6)
Comment #161.0
ckngCare home Swadlincote
Comment #161.1
sunCopy of the revision from April 2, 2011 - 13:12.