I'm working on a project for a client that has their Drupal site translated into Arabic. Running zen_id_safe() on any Arabic string turns it into a single dash, as none of the characters are alphanumeric.
The fix is to use a regexp built on Unicode property classes (\p{L} for letters, \p{N} for numbers) instead of [a-z...], together with the /u flag, like this:
return strtolower(preg_replace('/[^\p{L}\p{N}]+/u', '-', $string));
This is working for me so far, though I'd like to know if an even better regexp exists for this task.
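To make the difference concrete, here is a minimal sketch comparing an ASCII-only cleanup with the Unicode-aware regexp from above. The function names example_ascii_id_safe() and example_unicode_id_safe() are hypothetical and only approximate what zen_id_safe() does; the sketch assumes the file is saved as UTF-8 and that PCRE was built with Unicode support.

```php
<?php
// Hypothetical helpers for illustration; not the theme's actual code.

// ASCII-only cleanup: every run of non-alphanumeric bytes becomes one dash.
function example_ascii_id_safe($string) {
  return strtolower(preg_replace('/[^a-zA-Z0-9]+/', '-', $string));
}

// Unicode-aware variant using the regexp proposed above:
// keep any letter (\p{L}) or number (\p{N}), dash everything else.
function example_unicode_id_safe($string) {
  return strtolower(preg_replace('/[^\p{L}\p{N}]+/u', '-', $string));
}

$title = 'مرحبا بالعالم';

// The whole Arabic string is one run of "invalid" bytes, so only a dash is left.
echo example_ascii_id_safe($title), "\n";

// The letters survive; only the space is replaced.
echo example_unicode_id_safe($title), "\n";
```

One caveat with this regexp: combining marks (\p{M}), such as Arabic vowel diacritics, are neither \p{L} nor \p{N}, so fully vocalized text would get a dash at each mark. Adding \p{M} to the character class may be worth considering if that matters for your content.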
Comments
Comment #1
samlerner commented
This is a relatively easy bug to fix; is this regexp worth a patch?
Comment #2
avpaderno
The function should try to find the equivalent Latin character; I am not sure it's always possible.
Comment #3
avpaderno
There is already a module that transliterates non-Latin characters into Latin characters; as the transliteration is not required by all Drupal installations, maybe the theme can use that module when it is installed.
It will be up to the administrator to decide whether that module is needed; it is already used by a third-party module.
Comment #4
avpaderno
The module I was referring to is Transliteration (http://drupal.org/project/transliteration).
Anyway, I would rather avoid using non-Latin characters for CSS IDs. It may seem too heavy a limitation, but HTML already uses only Latin characters (and only a subset of them).
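A rough sketch of the "use the module only when it is installed" idea follows. The wrapper name example_translit_id_safe() is hypothetical. It assumes the Transliteration module exposes a transliteration_get() function that takes the text as its first argument; check the module's code for your version before relying on that.

```php
<?php
// Hypothetical wrapper for illustration; not part of Zen or Transliteration.
function example_translit_id_safe($string) {
  // Assumption: transliteration_get() comes from the Transliteration module.
  // When the module is not installed, the string is left as-is.
  if (function_exists('transliteration_get')) {
    $string = transliteration_get($string);
  }
  // ASCII-only cleanup, applied whether or not transliteration happened.
  return strtolower(preg_replace('/[^a-zA-Z0-9]+/', '-', $string));
}
```

Without the module, non-Latin input still collapses to a dash, so this only helps sites where the administrator has chosen to install Transliteration.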
Comment #5
samlerner commented
Is there any technical reason for this? If it's a preference, that's fine, but I ran into this issue while working on a site in Arabic in which I had to automatically wrap <h2> tags with an <a name="..."> to create anchor links. The name was the same as the <h2> content, hence the need for Arabic anchor names, which I was using zen_id_safe() to create.
If there is a technical reason to avoid non-Latin anchor names, I'd like to know so I can create a workaround. They work fine on FF3, Safari 3, and IE7/8 from what I've tested.
Comment #6
avpaderno
The fact that there are compatibility problems is already a reason to avoid non-Latin characters.
The function the Zen theme implements was clearly not designed for characters outside the ASCII character set, as it simply removes the characters outside that range.
It could implement the same code as Transliteration (or use that module), but would the CSS IDs be clear and still readable (especially with scripts like Arabic)?
Likewise, if the function used the Unicode hexadecimal code rather than the Unicode character, would the CSS IDs still be readable?
What I suggested is only a personal preference that, in this specific case, would resolve the issue you are seeing. There is no technical reason, except the problems some characters cause in some browsers (otherwise the Zen theme would not implement zen_id_safe() at all).
Comment #7
avpaderno
I think that the function follows some directives for the allowed characters in a CSS ID; I would not call this a bug.
Comment #8
avpaderno
Comment #9
johnalbin
Indeed, zen_id_safe() references http://www.w3.org/TR/html4/types.html#type-name
Which shows that IDs “must begin with a letter ([A-Za-z]) and may be followed by any number of letters, digits ([0-9]), hyphens ("-"), underscores ("_"), colons (":"), and periods (".").”
The valid characters for a class are slightly different, however: http://www.w3.org/TR/CSS21/syndata.html#characters
So, it would probably be good to have a class-specific function.
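The two rule sets can be sketched side by side. Both helpers below are hypothetical and are not Zen's or Drupal's actual code. The ID check follows the HTML 4 rule quoted above; the class cleanup is a loose reading of the CSS 2.1 identifier rules (ASCII letters, digits, hyphen, underscore, and non-ASCII characters, taken here conservatively as U+00A1 and up). The "c" prefix for names that would start with a digit is an arbitrary choice for this sketch.

```php
<?php
// Hypothetical helpers for illustration only.

// HTML 4 ID/NAME rule: a letter, then letters, digits, "-", "_", ":" or ".".
function example_is_valid_html4_id($id) {
  return preg_match('/\A[A-Za-z][A-Za-z0-9_:.\-]*\z/', $id) === 1;
}

// Class cleanup loosely based on CSS 2.1 identifiers.
function example_css_class($class) {
  // Collapse each whitespace run into a single hyphen.
  $class = preg_replace('/\s+/u', '-', trim($class));
  // Drop everything except ASCII letters, digits, "_", "-", and U+00A1..U+FFFF.
  $class = preg_replace('/[^a-zA-Z0-9_\-\x{00A1}-\x{FFFF}]/u', '', $class);
  // An identifier may not start with a digit, or a hyphen followed by a digit.
  if (preg_match('/\A-?[0-9]/', $class)) {
    $class = 'c' . $class; // arbitrary prefix chosen for this sketch
  }
  return $class;
}
```

Under these assumptions an Arabic heading is rejected as an HTML 4 ID but passes through the class cleanup with its letters intact, which is the distinction a class-specific function would capture.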
Comment #10
avpaderno
+1 for the feature.
Comment #11
johnalbin
See also the D7 issue related to this: #464862: Add drupal_css_class() to clean class names and rename form_clean_id
Comment #12
johnalbin
Zen 6.x-2.x now includes a copy of Drupal 7's drupal_html_class(), which does not have the bug of stripping out valid UTF-8 characters.
http://drupalcode.org/viewvc/drupal/contributions/themes/zen/template.ph...