Localized alphabet? [#21638]

It seems to me that the module always generates an English alphabet list. In languages that have extra letters, this may be a problem. For instance, in Romanian, we also have the following letters which could start a sentence (or a node title, if you will): Ş (pronounced sch, as in "shoe") - follows the letter "S"; Ţ (pronounced tz, as in "tsunami") - follows the letter "T", and Î (pronounced igh; suitable correspondent does not exist in English). Currently, based on my limited testing, a node whose title starts with any of these characters is not included in the node list, so the alphabet listing is not complete.

I suspect the problem can be extended to other languages as well (maybe German, and the umlauts)? What about other alphabets, like Cyrillic, Hebrew, or Arabic?

On an unrelated note, why is it that the node list page is always called "content list" even though I've translated the corresponding string?

Comments

Comment #1

njivy commented 1 May 2005 at 14:17

The limitation you describe is a result of PHP's strtolower() and preg_replace() functions, used near line 126 in the current version of nodelist.module. There is a way to improve this behavior, but it requires a non-standard PHP installation with Multibyte String Functions enabled. These functions matter because, in UTF-8, every character outside ASCII is stored as more than one byte.

I cannot recompile PHP at the moment, but here is what I recommend trying:

Alter line 126 of nodelist.module, replacing strtolower with mb_strtolower.
If, when you publish something new and reload the node list page, PHP complains that the function does not exist, then you know that PHP must be recompiled.
If that is the case, follow these installation instructions for multibyte string functions.
Be sure also to use --enable-mbregex so the preg_replace() will behave properly.
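The substitution suggested above could be guarded so that the page does not break on hosts without the extension. This is only a minimal sketch: example_lowercase() is a hypothetical helper name, not a function in nodelist.module, and it assumes titles are stored as UTF-8.

```php
<?php
// Hypothetical helper, not part of nodelist.module: lowercase a UTF-8 string,
// using the mbstring extension when it is available.
function example_lowercase($text) {
  if (function_exists('mb_strtolower')) {
    // Multibyte-aware: also lowercases letters such as Ş, Ţ and Î.
    return mb_strtolower($text, 'UTF-8');
  }
  // Fallback: only ASCII letters are reliably lowercased here.
  return strtolower($text);
}
```

With this guard, a missing extension degrades to the old ASCII-only behavior instead of a "function does not exist" error.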

Until then, nodelist.module will ignore special characters and place the node in the list according to the first "normal" character in the string. Essentially, preg_replace('#^.*?(\w).*#i', '$1', $row->title) treats special characters like punctuation; these characters are ignored.
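To see the effect described above, the quoted pattern can be run on its own. This is a sketch under assumptions: the way strtolower() wraps preg_replace() here is a guess at how line 126 combines the two calls, example_first_word_char() is a hypothetical name, and the behavior noted in the comments assumes PHP's default "C" locale.

```php
<?php
// Hypothetical wrapper around the pattern quoted above. Without the /u
// modifier, PCRE matches byte by byte, so \w only sees ASCII word characters
// (assuming the default "C" locale).
function example_first_word_char($title) {
  return strtolower(preg_replace('#^.*?(\w).*#i', '$1', $title));
}

// Leading punctuation is skipped and the first ASCII letter is used.
$from_ascii_title = example_first_word_char('"Hello" world');

// The two bytes of the UTF-8 encoded "Ş" are skipped like punctuation,
// so the node is filed under the next ASCII letter instead.
$from_romanian_title = example_first_word_char('Şcoală');
```

The second call is the case from the original report: the title is filed under "c" rather than under "Ş".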

-----

Regarding your other question, the title of the node list is defined in admin/settings/nodelist, not through the locale.module.

Comment #2

baudolino commented 2 May 2005 at 03:01

Thanks for your detailed reply. Unfortunately, my host does not have PHP compiled with multibyte support, so I'm out of luck. I can't test your solution, and even if I tried it on my local machine, and it worked, I could not upload the fix :(

I was thinking about a workaround, maybe using a lookup table, but it looks like the marginal improvement might not be worth the extra effort (+ additional overhead on the module, which would take a performance hit).
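For reference, a lookup-table workaround of the kind mentioned above might look like the following. It is a rough sketch only: example_fold_initial() and its table are hypothetical, the table covers just the Romanian letters, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical lookup table mapping lower-case Romanian letters to their
// upper-case forms; it would need extending for every other language.
function example_fold_initial($char) {
  $table = array('ş' => 'Ş', 'ţ' => 'Ţ', 'î' => 'Î', 'ă' => 'Ă', 'â' => 'Â');
  if (strlen($char) === 1) {
    // A single byte is plain ASCII, which strtoupper() handles safely.
    return strtoupper($char);
  }
  // strtr() with an array replaces whole byte sequences, so it works on
  // UTF-8 text without the mbstring extension.
  return strtr($char, $table);
}
```

The table needs no recompiled PHP, but each language adds entries that have to be maintained by hand, which is the extra effort weighed above.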

However, I'd suggest adding this note in a readme file, so that people won't file future bug reports complaining about missing or incorrectly filed nodes ;-)

Comment #3

njivy commented 2 May 2005 at 15:43

I added an overview of the Unicode character problem and a link to this discussion in the module's readme file. I would like to fix this problem with a general solution, but I do not yet know what that would be.

One option is to categorize by the first character in the title, whether it is punctuation or otherwise. But I don't think this is intuitive.

Another option is to manually strip common punctuation without using \w in the regular expression. But that would result in nodes listed separately for upper- and lower-case initial characters. Again, this is not intuitive for someone trying to find a node by its title.

Comment #4

Steven commented 2 May 2005 at 15:54

Perhaps you could take the same approach as search.module: use preg_replace with /u (UTF-8) and make your own character classes.

I used this script to generate them:
http://acko.net/dumpx/ucd-charclass.phps
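A minimal sketch of that approach, for illustration only: the character class below is hand-written and far smaller than anything the script would generate, and EXAMPLE_WORD_CLASS and example_initial() are hypothetical names, not identifiers from search.module or nodelist.module.

```php
<?php
// Illustrative character class only: ASCII letters and digits plus the
// letter ranges of Latin-1 Supplement and Latin Extended-A/B. A generated
// class would cover far more of Unicode.
define('EXAMPLE_WORD_CLASS', 'a-zA-Z0-9\x{C0}-\x{D6}\x{D8}-\x{F6}\x{F8}-\x{24F}');

// Return the first character of $title that belongs to the class,
// or an empty string if there is none.
function example_initial($title) {
  // The /u modifier makes PCRE treat both pattern and subject as UTF-8,
  // so a multibyte letter is matched as one character.
  if (preg_match('/[' . EXAMPLE_WORD_CLASS . ']/u', $title, $match)) {
    return $match[0];
  }
  return '';
}
```

Because the class is spelled out in the pattern, this works without the mbstring extension; the cost is that the class has to list every range that should count as a letter.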

Comment #5

njivy commented 2 May 2005 at 16:55

That's a great script! Thanks, Steven.

I updated the cvs version of nodelist.module with an updated preg_match(). This allows nodes to be sorted according to their Unicode letters and numbers, although upper- and lower-case variants of Unicode characters are currently listed separately.

Comment #6

baudolino commented 2 May 2005 at 20:54

Yes! It _works_ !

Re: lower case vs. upper case, I propose to use strtoupper, and convert first char to upper, rather than lower capitalization. The reason is that usually, node titles are capitalized, since they're titles. There is a very low probability, IMHO, to stumble upon a node title that actually begins with a UTF-8 special character lower cap. Unless, of course, it's one of the 31333t sitez with teh haxx0rz :-)
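A sketch of what upper-case grouping could look like, for discussion only. example_group_by_initial() is a hypothetical function, not the code committed to nodelist.module. It assumes PCRE was built with Unicode property support (\p{L}, \p{N}), and it only merges non-ASCII case variants when the mbstring extension is present.

```php
<?php
// Hypothetical grouping routine: file each title under the upper-case form
// of its first letter or digit.
function example_group_by_initial(array $titles) {
  $groups = array();
  foreach ($titles as $title) {
    // Skip titles that contain no letter or digit at all.
    if (!preg_match('/[\p{L}\p{N}]/u', $title, $match)) {
      continue;
    }
    // Plain strtoupper() only changes ASCII letters, so use the multibyte
    // variant when the extension is available.
    $initial = function_exists('mb_strtoupper')
      ? mb_strtoupper($match[0], 'UTF-8')
      : strtoupper($match[0]);
    $groups[$initial][] = $title;
  }
  // Sort the groups by key, compared as strings (byte order).
  ksort($groups, SORT_STRING);
  return $groups;
}

$groups = example_group_by_initial(array('şcoală', 'Şcoală', 'Apple'));
```

With mbstring available, both Romanian titles end up under "Ş"; without it, the lower-case variant keeps its own group, which is the situation described in comment #5.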

Anyway, many, many thanks for the fix. Now I have my last third of the nodes showing up in the list!

Comment #7

nbayaman commented 5 November 2005 at 22:50

I give up. I've tried to make the nodelist module work with UTF-8 characters. As you wrote, it should be pretty easy for you, but it is not for me :) Could one of you be so kind as to share the patched module (if there is one) or give me a tip on how to implement it?

Looking forward to any help.

Comment #8

nbayaman commented 5 November 2005 at 22:55

I thought maybe somebody could explain to me how to call character_class_regexp, in other words, how to apply it to the nodelist module...

thanx for your patience with my ignorance :)

Comment #9

wmostrey commented 20 October 2006 at 09:14

Assigned:	Unassigned	» wmostrey
Status:	Active	» Fixed

This has been committed in versions 4.6 and 4.7.

Comment #10

(not verified) commented 3 November 2006 at 09:16

Status:	Fixed	» Closed (fixed)
