Download & Extend

Problem with Recognition of Word Boundaries eg. two consecutive smilies

Project:Smileys
Version:6.x-1.x-dev
Component:Code
Category:bug report
Priority:normal
Assigned:Unassigned
Status:reviewed & tested by the community

Issue Summary

If the stand-alone option is activated the current regular expression does not recognize boundaries of smilies or words correctly. For two consecutive Smilies separated by space, for example, the closing pattern "[ ,\.\?!:\(\)\r\n\<\>]" from the first smily "hides" the opening sequence from the second one, since it the space is part of the first recognizes expression.

Also &nbsp; is not part of the recognized pattern. Since double occurences of space should be encoded as HTML-Entities resolving this could also fix #376213: double blank space avoids smileys to appear.

I will provide a patch in a second...

Comments

#1

Status:active» needs review

The provided patch also avoids the function eregi_replace which will be deprecated from PHP 5.3 (see http://php.net/manual/en/function.eregi-replace.php)on an uses preg_replace instead.

AttachmentSize
smileys-567290.patch 1.56 KB

#2

No time to actually test now, but at least let me to confirm that this is a problem. On my site, stand-alone smilies were broken (not translated from textual form) occasionally, in particular every second one in a row of smilies, and I was never really able to put my finger on the source of the problem.

#3

Yes, that is exactly the problem addressed in the patch.

In your case the whitespace after the odd smilies are recognized as the last character of the smiley. Looking for another whitespace (or []<>...) as the beginning of the next smiley, the previously used regex fails for the even smiley and only finds the beginning of the odd one. You can turn that around by using   right before the first smiley. So much for the theory, please test if the patch solves your problem.

#4

Version:6.x-1.0-alpha5» 6.x-1.x-dev

OK, today I had a bit of time, so I tested this.

First, let me to clarify steps to reproduce the bug (as this took me quite a while to figure out): You need a row of smileys, one after another, set to "stand alone", and also they must be all the same smileys (multiple occurences of the same). Only identical smileys are processed by the same regex, which hesitate to process the space between smileys twice, resulting in this bug. If you have different smileys, they are processed separately by different instances of the regex, which re-inserts the spaces (so that they don't disappear), and the problem doesn't occur then.

I confirm that this patch fixes that.

I reviewed the regex carefully, based on the PHP manual website, and I believe it's correct (but might be improved, as explained below). Also the transition from deprecated eregi_replace() to preg_replace() is highly desirable IMO.

But there's more about these few lines of code:

- It didn't fix all my problems, I still had untranslated smileys (particularly at end of lines, if using the Texy filter in conjunction with Smileys). After a lot of debugging I finally found out that it's the issue I already attempted to fix ages ago, at #159835: More "blank" characters to recognize around smileys. Unfortunately, that issue got closed a year ago without commit, due to being considered as not "problematic for most of the websites". I forgot about it (but left the patch running on my site), and later when I upgraded my site to 6.x I got bitten by this again - and having not enough patience to troubleshoot, I just disabled "stand-alone" on every single smiley on the site.

While the addition of &nbsp; is all nice and correct, I believe we should also add the corresponding UTF-8 character (hex. A0, UTF-encoded as sequence C2 A0). This is also white-space, and what's worse, it's almost impossible to troubleshoot once some filter adds it, or it gets copy-pasted into a post. It doesn't show anywhere, I even tried to save the page from browser to disk, and dump the file hexadecimally, but it was already stripped away. Only after I added a hex-dumping script to the smileys module code, right in front of the regex, only then I was able to detect this sequence which caused my smileys to break.

I didn't dare to play with utf-8 support at regex level just for this one character, but I believe that adding it at php-level as hexadecimally-escaped sequence is safe, as both the bytes are safe themselves, and it doesn't really matter whether regex treats them as string of two codes, or a single UTF character - the match is still the same. It works fine in my use cases.

- Additionally, I also included assertions for start/end of the whole string, otherwise smileys glued to the string boundaries (with whitespace trimmed, that is) will also fail (given no other filter added some wrapping markup yet).

- While the stand-alone smileys have a css class, the non-stand-alone part was missing that class. This is obviously an oversight, we should generate the same markup independent of the stand-alone setting. (I had problems with this, it's difficult to theme smileys having no class on the img tags...) So I added that missing class too, while we touch that line.

- As I already proposed on the other issue, the usage of a regex for non-stand-alone part is unnecessary. In that case, all we actually do is plain old str_replace() - so I used that instead. Should be faster, and is therefore recommended according to PHP documentation.

- As a code-style fix, I cleaned up some unneeded escaped double quotes, where just putting the whole into single quotes is fine. Again, the two lines were out-of-sync in this regard: The non-stand-alone part was already cleaned up, so I just copied that into the stand-alone counterpart, which missed this cleanup so far.

- Also, I believe that we don't use if/else constructs without curly braces in Drupal (and in fact I got bitten by that while inserting temporary debug code), so I fixed code style in this part, too.

Quite a bit for 2 lines of code... ;-)
(I'm unsure about the usage of a GLOBAL for the base path, but I left that untouched for now.)

The attached patch contains my additional fixes. I tested it both in stand-alone and non-stand-alone mode, against the posts I had trouble with previously, and it works 100% fine for me.

Leaving at "needs review" due to changes I did in the patch. Also this is (and was, as I see) rolled against dev version (applies to alpha5 with offset).

AttachmentSize
standalone-smileys.patch 1.57 KB

#5

Hmm, thinking of it, str_replace() is fine, unless we *really* need to run the replacement as case-insensitive. I believe it's not much of a problem, but yes, this is a (tiny) change. Better option might be str_ireplace(), but that's PHP5 only, so not an option until D7... So attaching an alternative patch without this change (sticking to the [supposedly slower] preg_replace(), otherwise no difference, and I tested it all the same).

AttachmentSize
standalone-smileys2.patch 1.6 KB

#6

Status:needs review» needs work

Do not use $GLOBALS['base_url'], use $base_url instead.

#7

Status:needs work» needs review

There's no variable $base_url in the function being patched, so that won't work just like that. Obviously, we might initialize that variable, or write global $base_url, but that's extra line of code. This part was unchanged from previously existing code in the above patch. Although I already included a lot of extra cleanup, I'm really worried about scope creep in this issue, which was originally just about the regexp.

I have no problem to write that variable in a different way (although I really don't see it as more readable, with global keyword or variable initialization removed from the place of concern), but it's a debatable change introduced in the patch - I mean, the $GLOBALS['base_url'] way is now used consistently across the whole module, and we break that consistency with this change. I believe that widening the scope of this patch to cleanup of the whole module is not desirable, so I defer to module maintainer to decide whether we want to change it here or not. Matter of taste, I would say.

But OK, let's try the other way, to see which is more acceptable.

Additionally, I removed the addition of start/end of string assertions in the regex, which were unnecessary - I must've missed that line while reading the function first time, but the string is already padded with spaces on both ends earlier in code, meaning that we don't need to worry about this in the regexp.

Updated patch attached.

AttachmentSize
standalone-smileys3.patch 1.78 KB

#8

Status:needs review» needs work

Always use API functions if available, please. See http://api.drupal.org/api/function/theme_image/6

#9

Status:needs work» needs review

@hass: Please stop suggesting out-of-scope changes. This patch can't rewrite the whole Smileys module, I didn't introduce the way how markup is generated, I only just fix a problem with the regex. I already went too far with additional cleanup IMO. There are other places in the module, where images are generated the same way, and I really *really* don't have the time to rewrite that all. The missing theme call, although not really nice, doesn't break anything - while the regex problem is a real bug breaking the functionality. Also, smileys are inline contents rather than regular images, so the use of theme_image() is a subject to further discussion. Maybe we even need to introduce something like theme_smiley(). But, I emphasize, I didn't write that part of code, and I have no intention to change it. We need to draw a line, somewhere ;-)

So our options are:

1. Review the actual fix, hopefully commit it, and only then follow-up with other cleanup issues (with scope of the whole module, to avoid incosistencies)
2. Someone else (hass?) work on the patch further, to include all the desirable cleanup in a single patch
3. Otherwise, I'm sorry to say, this looks like blocking a needed bugfix (no offence intended)

#10

If you would like to add Unicode support, I think the provided regex is not quite enough. According to http://www.unicode.org/charts/PDF/U0000.pdf, http://www.unicode.org/charts/PDF/U0080.pdf and http://www.unicode.org/charts/PDF/U2000.pdf at least 10 more spaces need to be added (e.g. EN SPACE, EM SPACE, THREE-PER-EM SPACE, THIN SPACE, ZERO WIDTH SPACE, NARROW NO-BREAK SPACE...). All of these spaces might occur if text is added from Text Processing Software, so there is IMHO no reason to just add the \x02\xA0 and leave the other white space characters unhandled.

Another possibility to resolve the problem would be to use the PCRE-Expression \p{Zs} (see: http://www.php.net/manual/en/regexp.reference.unicode.php), but playing around with it, it occured, that there would be additional testing of the used encoding context and the PHP version used (there seems to be a difference between PHP-4.3.2 and PHP-4.3.3). So I'm not quite sure about the way, Unicode characters should be handled in drupal.

I think, that the Unicode-Problem is out of scope of this issue, even including the HTML-Entity &nbsp; is out of scope since the original issue was about not using lookahead expressions.

Besides that, I'm happy with the last supplied patch, it fixes the problem and probably does not create a new one...

#11

I would be happy to improve Unicode support, but my time is limited... The \xC2\xA0 sequence was specifically added, because there's already a contributed module (Texy) which adds it to filtered text right here in Drupal (converts some of &nbsp; occurences in fact, and we can't change that as it's a third-party library integration). So I would be happy to hear some suggestion how to reliably handle all Unicode spaces, but if there's none, I would rather not hold back the whole patch, and not remove the current incomplete-yet-needed pattern either. We can improve it later.

As of php version, Drupal 6 requires 4.3.5 minimum - system.module: define('DRUPAL_MINIMUM_PHP',    '4.3.5');, so that should be OK (as long as no inconsistency happens across later versions). What's the used encoding context is something I need to research before commenting on, though.

The question of the scope in this issue seems to be debatable (see above). I would say - all about the regex is in-scope, but not a blocker (as long as we don't introduce new problems), another cleanup to the same few lines of code is optional (no point in dropping what's already done, but let's not get further off), anything module-wide is out-of-scope. The goal is to improve the module in a way realistic for contrib and for the amount of contributors in there. Just my opinion on this point, feel free to disagree. (Would be nice to hear from the maintainer, too.)

#12

Thanks for pointing out the issue with Texy module and the minimum PHP version.

I found the problem I had minutes after writing my last comment. All I had to do was using the "/u"-modifier (http://php.net/manual/en/reference.pcre.pattern.modifiers.php). But at the same time I learned, that \p{Zs} only works from >=PHP-4.4.0 - so that would be no option for the current setup.

So just for the record: using $chunk = preg_replace("/(?<=&nbsp;|\p{Zs}|[,\.\?!:\(\)\r\n\<\>])". preg_quote($a) ."(?=&nbsp;|\p{Zs}|[,\.\?!:\(\)\r\n\<\>])/ui", '<img src="'. check_url($base_url .'/'. $smiley->image) .'" title="'. check_plain($alt) .'" alt="'. check_plain($alt) .'" class="smiley-content" />', $chunk); would probably be correct, but would need PHP-4.4.0 at least.

I think for the current setup the patch3 provided in #7 is the best we can do to resolve this issue. I strongly support your patch.

#13

The page you linked to says Unicode support in general works since PHP 4.1.0 / 4.2.3 / 4.3.5 (depending on operating system and feature we are looking for), so that part should be OK (and maybe it's why Drupal requires this exact version?).

As for the 4.4.0 (and 5.1.0!) version boundary - do you mean this page? It seems to only just mention some 3 additional escape sequences, the versions seems to not apply to the whole list (although I'm not 100% sure). Edit: I did misread that page on quick look, sorry. You were right.

So, your pattern might be good after all. And even if it wasn't, I guess that we might leave regular space in the explicitly listed characters, and hope that majority of users have a more recent php versions. If someone haven't, what's going to happen (not sure here)? Most probably that part of the pattern (the additional spaces) just won't match anything useful, leaving the functionality we had previously? If so, we might be fine with that for now, I guess.

Going to try and test your pattern, if I'm able to (to be determined, depends on my php version - not at my development setup while writing this comment).

#14

I tried to test that #12 pattern somehow, it was a bit hard time for me, and my result is: We cannot use the Unicode properties.

I had hoped that it would get just somehow ignored if php version was too low, but instead it breaks things badly, sometimes even if php version is ok (!). This is a feature of additional PCRE library, that may be disabled at compile time of that library. My testing setup have PHP 5.1.6 (i.e. OK, >4.4.0, >5.1.0), with PCRE library 6.7 (i.e. the one listed on php.net as bundled with 5.2.0), running in default configuration (as for PCRE) as distributed in RPM package (Mandriva Linux desktop). If I try this pattern, however, I get:
Warning: preg_replace() [function.preg-replace]: Compilation failed: support for \P, \p, and \X has not been compiled at offset [position] in [file] on line [line]
And what's worse, preg_replace() returns NULL, so the node in question show as completely empty (not just smileys unconverted - it removes all the text).

On production server it works (PHP 5.2.9), but in a situation where the new pattern truly breaks the whole display of nodes, not only on known (still supported by Drupal) older php versions, but on just any version where this or that setting was different at compile time, there's no chioce but to go with #7.

Thanks for support, frenkx.

Anyone going to either voice further concerns, or hit RTBC? ;-)

#15

Thanks for testing on different environments. Of course it would be possible to check certain things. There could be a basic pattern, just as the one we used in the current patch, but if $check = preg_match('/\p{Lu}/', 'A'); returns something but NULL or FALSE we could use an advanced unicode/PCRE - pattern.

The whole code could look like

<?php

$pattern
= preg_match('/\p{Lu}/u', 'A')?"/(?<=&nbsp;|\p{Zs}|[,\.\?!:\(\)\r\n\<\>])". preg_quote($a) ."(?=&nbsp;|\p{Zs}|[,\.\?!:\(\)\r\n\<\>])/ui":"/(?<=&nbsp;|\xC2\xA0|[ ,\.\?!:\(\)\r\n\<\>])". preg_quote($a) ."(?=&nbsp;|\xC2\xA0|[ ,\.\?!:\(\)\r\n\<\>])/i";
$chunk = preg_replace($pattern, '<img src="'. check_url($base_url .'/'. $smiley->image) .'" title="'. check_plain($alt) .'" alt="'. check_plain($alt) .'" class="smiley-content" />', $chunk);
?>

(sorry, not very readable, but that could be improved in an actual patch...). Maybe you could check, if it works on your test system? Right now I don't have some similar test system at hand - I need to set up a virtual machine, but don't really see, where that could fit into my schedule...

Otherwise, as I said, the current patch is a great improvement for the code, so please submit the patch.

#16

The #15 suggestion is likely to work (not tested yet), except that we would need to use something like @preg_match('/\p{Lu}/u', 'A'), to avoid the watchdog being flooded with warnings per #14, and possibly also cache this check in a static variable - we're running that regex over and over again, for every smiley configured on the site, for every filtered post (this may easily mean 1000 runs on frontpage after cache flush, assuming 10 promoted nodes and 6 custom blocks, 30 smileys with 2 acronyms each - i.e. :-) and :smile:). This might be noticeable in performance - per http://www.php.net/manual/en/regexp.reference.unicode.php :

Matching characters by Unicode property is not fast, because PCRE has to search a structure that contains data for over fifteen thousand characters.

But otherwise this looks to me like over-complicated. I don't think we're using such fallbacks in Drupal much (i.e. having different behavior on different PHP versions), and it's only just covering edge cases of advanced text-processing codes being pasted into a post, at a supposedly significant performance cost. So my opinion now is - let's stick with the simple list of codes for now. We may extend that list, if a use case is encountered with more characters involved, and we might mention the \p{Zs} option in a code comment as a future alternative, so that this may be revisited after we can rely on the feature (that's either 7.x, or even 8.x stuff - current 7.x HEAD requires PHP 5.2.0, which is OK version-wise, but means exactly the version of PCRE library I had trouble with).

#17

I totally agree with you but wanted to present the idea for completeness on possible solutions.

What's next? We wait for a maintainer to apply the patch?

#18

If you feel strong about the patch being OK (and you tested, preferrably), you can change status to "rewieved & tested by the community" (also known as RTBC), as this is how we put things into the maintainer's queue at Drupal. It's considered unfair if the author of the patch (i.e. me) did that himself. This workflow is perhaps not strictly followed at less-busy contributed projects (like Smileys is), but still it may help to draw appropriate attention here.

Otherwise you're right. Thanks for support.

#19

So where's the last patch I need to look at? I'm excited. JirkaRybka, do you believe that you can give me a good lending hand on the Smileys project for sometime?

#20

The patch is at #7 (unless we need to add a code comment per #16). Otherwise the extended discussion just explored additional options, and finally went back to the option already rolled and tested in #7.

(As for "good lending hand", whatever that exactly means, I'm afraid there's no more additional space on my personal time/schedule. These patches [for various projects, recently] are my maximum currently.)

#21

The attached patch consists of my initial patch for the regex using lookahead and HTML-Entity for NO-BREAK SPACE provided in #567290-1: Problem with Recognition of Word Boundaries eg. two consecutive smilies, additions for unicode NO-BREAK SPACE provided in #567290-4: Problem with Recognition of Word Boundaries eg. two consecutive smilies and additional code cleanup as mentioned in #567290-6: Problem with Recognition of Word Boundaries eg. two consecutive smilies. This is basically the patch provided in #567290-7: Problem with Recognition of Word Boundaries eg. two consecutive smilies.

As suggested in #567290-20: Problem with Recognition of Word Boundaries eg. two consecutive smilies I added a comment and link to this issue, especially to the discussion about unicode white space which is not part of the patch right now.

Since the functional part of the patch is already tested since #567290-7: Problem with Recognition of Word Boundaries eg. two consecutive smilies the testing of the current patch should be really quick and since I provided the last patch you might be able to proceed as stated in #567290-18: Problem with Recognition of Word Boundaries eg. two consecutive smilies without me blocking :-))

AttachmentSize
smileys-567290-standalone4.patch 2.19 KB

#22

To sum it up, #21 only just added the suggested code comment. My only concern about that comment is that the PHP version should actually be 5.1.0: On 4.x.x branch of PHP it works from 4.4.0 up, and on 5.x.x branch from 5.1.0 - so we have a gap >=5.0.0. <5.1.0 where it won't work. So let's simply say 5.1.0 - after all, D7 already wants 5.2.0.

It would be nice if someone re-rolled the patch with this, but since I can't do so right here in this minute, I'm not blocking the issue (so leaving at "needs review"). After all, there's a link to this issue in the comment, and the whole thing would need more research anyway if revisited (i.e. performance, availability of properly compiled PCRE library etc.)

#23

Patch re-rolled with PHP version in code comment per #22.

Otherwise identical to #21, and the actual code is identical to #7 too.

AttachmentSize
smileys-standalone5.patch 2.05 KB

#24

Status:needs review» reviewed & tested by the community

Great, so I hit the RTBC button. If I need to take further action, I ask you to guide me through.

#25

I have a problem where smileys that contain other smileys (like cry: ":(( ") don't work right because it does the inside smiley instead. I tried this patch just to see if it would fix it and it doesn't. Is that something that can be worked into this regex fixing or should I file a new issue for it?

Thanks,

Michelle

#26

I guess you should reorder the smileys, because the regex is applied repeatedly, one possible smiley at a time. So there's no way to fix this in the regex itself, as it each time knows about just one acronym.

#27

Does that suggestion help, Michelle?

#28

@Gurapartap: I don't know... Every time I try to re-arrange them, it gives me an error. :( I'll try again later. Still not an ideal solution, though, because that's going to change the order they are displayed in the box. I don't know a better solution, though, so I guess it will have to do... If it ever lets me do it...

Thanks,

Michelle

#29

As for the error - I don't know whether it's the same, but I did run into a problem with database schema on 5.x->6.x upgrade. The site in question was started on 4.7.x, and when I reached 6.x, the weight column in smileys table was still unsigned data type, and so didn't work with weights auto-generated from Drag-and-Drop. I just fixed that manually in database, because I was short on time (and short on enthusiasm regarding upgrades, after this), and therefore I didn't file an issue. Maybe we should now. A workaround is to browse with JavaScript disabled in browser, and choose weights manually so that they are all positive numbers.

Also note, that I didn't really check whether the weights have any effect on the actual processing order or not. Unless we check that in code, the only reliable testing might be with two smileys newly added in this or that order. Just a guess...

Anyway, this is heading a little bit off-topic here. I would love to see some progress with the existing patch here.

#30

Argh... I went through every single one and put in the weights... Only to discover that there's no way to save the page with javascript off.

Yeah, this is totally OT, sorry. I only meant to ask if the regex in the patch could be made to handle it, not totally derail the issue. I'll try fiddling with the order in the database.

Thanks,

Michelle

#31

Argh... I went through every single one and put in the weights... Only to discover that there's no way to save the page with javascript off.

Oops! My apologies - now I remember that I ran into this too (and re-entered all the weights again in PhpMyAdmin, I suppose), but totally forgot about that later. It's some 6 months ago... Sorry about my poor memory! (And we have another bug...)

I only meant to ask if the regex in the patch could be made to handle it

I believe that there's no way to do that, given the nature of existing code.

#32

@JirkaRybka: Changing the weights in the database had no effect, either. I'm going to put this on the back burner for now and may file an issue at some point if I figure something out. Thanks for the help. :)

Michelle

#33

RTBC +1.

Reiterating that this is RTBC as of comment #23 (or actually #7), is all I can do.

Thanks for your extensive testing and discussion - and for including the reference to the discussion in the code comment, so that a D7 version can pick up the work.

#34

Another RTBC +1.

nobody click here