Incorrect UTF-8 filter

skotch - June 3, 2008 - 12:37
Project:Wordfilter
Version:5.x-1.x-dev
Component:Code
Category:bug report
Priority:normal
Assigned:Unassigned
Status:closed
Description

Hello!
There are 2 problems with UTF-8 (non-ASCII text and replacements).
To handle UTF-8 strings, please correct the function wordfilter_filter_process():
1. Upper/lower case replacement (use '/iu' instead of '/i')
2. Standalone word replacement (use '[\W]' instead of '[^a-z0-9]')
Corrected code that I've tested:

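        if ($word->standalone) {
          // Standalone: require a non-word character on each side of the word.
          $text = preg_replace('/([\W])'. preg_quote($a, '/') .'([\W])/iu', '$1'. $replacement .'$2', $text);
        }
        else {
          // The 'u' modifier treats pattern and subject as UTF-8, so 'i' also folds non-ASCII case.
          $text = preg_replace('/'. preg_quote($a, '/') .'/iu', $replacement, $text);
        }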
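
For anyone who wants to try the two changes outside the module, here is a minimal standalone sketch. The function name example_filter_word() and its parameters are made up for illustration and are not part of Wordfilter. It also differs from the snippet above in two ways, both my own assumptions rather than the module's behavior: it uses lookarounds on \p{L} and \p{N} so that a standalone word also matches at the very start or end of the text, and it uses a callback so that characters like '$' in the replacement are not treated as backreferences.

```php
<?php
// Hypothetical helper for experimenting; not the module's actual code.
function example_filter_word($text, $word, $replacement, $standalone) {
  // Escape regex metacharacters and the '/' delimiter in the filtered word.
  $quoted = preg_quote($word, '/');
  if ($standalone) {
    // Match only when the word is not adjacent to a Unicode letter, digit or underscore.
    $pattern = '/(?<![\p{L}\p{N}_])' . $quoted . '(?![\p{L}\p{N}_])/iu';
  }
  else {
    // 'u' makes the 'i' modifier apply to non-ASCII letters as well.
    $pattern = '/' . $quoted . '/iu';
  }
  // A callback returns the replacement literally, without backreference expansion.
  return preg_replace_callback($pattern, function ($m) use ($replacement) {
    return $replacement;
  }, $text);
}

// Case-insensitive match on Cyrillic text.
echo example_filter_word('ПРИВЕТ, мир', 'привет', '***', false), "\n";  // ***, мир
// Standalone: "кот" inside "скотина" is left alone, the separate word is replaced.
echo example_filter_word('скотина и кот', 'кот', '***', true), "\n";    // скотина и ***
```

The file has to be saved as UTF-8 for the string literals to work as shown.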

#1

jaydub - June 4, 2008 - 11:34

Can you post some sample UTF-8 characters that you
are trying to work with?

I am a bit worried about what the PHP website says
about using the /u flag:

Matching characters by Unicode property is not fast, because PCRE has to search a structure that contains data for over fifteen thousand characters. That is why the traditional escape sequences such as \d and \w do not use Unicode properties in PCRE.

I think making it an option rather than the default
might be a better approach, especially if the performance hit
mentioned above is large.
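
Roughly what I have in mind, as a sketch only: the function name example_wordfilter_pattern() and the $use_utf8 flag are placeholders, not existing module code or settings. The Unicode-aware pattern is built only when the flag is on, so sites with plain ASCII word lists would keep the cheaper byte-wise match.

```php
<?php
// Hypothetical pattern builder; $use_utf8 stands in for a module setting.
function example_wordfilter_pattern($word, $standalone, $use_utf8) {
  // Escape regex metacharacters and the '/' delimiter in the filtered word.
  $quoted = preg_quote($word, '/');
  // Unicode properties and the 'u' modifier are used only when requested.
  $word_chars = $use_utf8 ? '[\p{L}\p{N}_]' : '[a-z0-9]';
  $modifiers = $use_utf8 ? 'iu' : 'i';
  if ($standalone) {
    // Lookarounds keep the neighbouring characters out of the match.
    return '/(?<!' . $word_chars . ')' . $quoted . '(?!' . $word_chars . ')/' . $modifiers;
  }
  return '/' . $quoted . '/' . $modifiers;
}

// ASCII mode.
echo preg_replace(example_wordfilter_pattern('damn', true, false), '***', 'Damn it'), "\n";  // *** it
// UTF-8 mode.
echo preg_replace(example_wordfilter_pattern('привет', false, true), '***', 'ПРИВЕТ'), "\n";  // ***
```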

#2

jaydub - June 23, 2008 - 16:06
Status:active» fixed

OK, I've added this as an option. Please try it out and post
the results of testing with the UTF-8 content you wish to
filter.

#3

Anonymous (not verified) - July 7, 2008 - 16:13
Status:fixed» closed

Automatically closed -- issue fixed for two weeks with no activity.

 
 
