When tags contain non-7-bit (non-ASCII) characters, they are not sorted correctly.
In tagadelic.module, I made the following change to _tagadelic_sort_by_title so that it uses mbstring to sort tags correctly (it works at least for French tags):
Maybe iconv could be used there.
Function "to7bit" comes from php.net.
/**
 * Convert encoded text to a 7-bit (ASCII) approximation.
 *
 * @param string $text     Line of encoded text.
 * @param string $from_enc Encoding of $text (e.g. UTF-8, ISO-8859-1); detected when omitted.
 *
 * @return string 7-bit representation.
 */
function to7bit($text, $from_enc = NULL) {
  // Detect the encoding only when the caller did not supply one.
  if (!isset($from_enc)) {
    $from_enc = mb_detect_encoding($text, "auto");
  }
  $text = mb_convert_encoding($text, 'HTML-ENTITIES', $from_enc);
  $text = preg_replace(
    array('/ß/', '/&(..)lig;/', '/&([aouAOU])uml;/', '/&(.)[^;]*;/'),
    array('ss', "$1", "$1" . 'e', "$1"),
    $text);
  return $text;
}
/**
 * Callback for usort(): sort by title.
 */
function _tagadelic_sort_by_title($a, $b) {
  // Compare 7-bit approximations when mbstring is available.
  if (extension_loaded('mbstring')) {
    return strnatcasecmp(to7bit($a->name), to7bit($b->name));
  }
  return strnatcasecmp($a->name, $b->name);
}
Comment | File | Size | Author |
---|---|---|---|
#9 | tagadelic_sort.patch | 2.43 KB | chrissearle |
Comments
Comment #1
Bèr Kessels commented:
For reference, there is an old issue about this here: http://drupal.org/node/55842
Would you be so kind as to provide a real patch? This eases the review process a lot. Details here: http://drupal.org/diifandpatch
On the code:
I think it's a better idea to use the Drupal core API instead of writing your own conversion API: http://api.drupal.org/api/4.7/function/drupal_convert_to_utf8
Comment #2
Leeteq commented:
(subscribing...)
Comment #3
cquest commented:
I'm sorry for my quick and short post... I'm a bit new at digging into Drupal and module code ;-)
The bug I found is that strnatcasecmp is a single-byte comparison function, not a double-byte one, if I'm not wrong. Since the text passed to it can be double-byte (mostly UTF-8), it does not always work as expected (in my case, tags starting with 'é' were put in front of the 'a' tags instead of being mixed in with the 'e' ones).
So, to solve this I see two possibilities:
- use a real double-byte comparison for the sort,
- convert the text into plain 7bit ascii and compare this instead of the double-byte text.
As I could not find a double-byte comparison function in mbstring or iconv, I moved to the second solution.
drupal_convert_to_utf8 seems to only provide "anything" to utf8 conversion... so I could not use it because I was looking for an "anything" to ascii conversion.
I found the to7bit sample code in the comments on php.net. It doesn't look universal or rock solid.
I've just found that it could be replaced by a single call to iconv like:
iconv('UTF-8', 'ASCII//TRANSLIT',$str)
This will translate text like "Cœur déchiré" into "Coeur dechire" (note the 'œ' converted to its two-character equivalent).
So... another, much simpler version could be (not tested yet):
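A minimal sketch of what such an iconv-based version might look like. This is an illustration, not the code from the original post: the function names `_example_tag_sort_key` and `_example_sort_by_title` are hypothetical, tag names are assumed to be UTF-8, and the result of `//TRANSLIT` can vary with the platform's iconv implementation and locale.

```php
<?php
/**
 * Hypothetical helper: reduce a UTF-8 tag name to ASCII for comparison.
 * Falls back to the untouched name if iconv() reports a failure.
 */
function _example_tag_sort_key($name) {
  // Assumption: $name is UTF-8. The @ hides the notice iconv() raises
  // when it meets a character it cannot convert.
  $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $name);
  return ($ascii === FALSE) ? $name : $ascii;
}

/**
 * Hypothetical callback for usort(): sort tag objects by title.
 */
function _example_sort_by_title($a, $b) {
  return strnatcasecmp(_example_tag_sort_key($a->name), _example_tag_sort_key($b->name));
}

// Usage: sort an array of tag objects in place.
$tags = array(
  (object) array('name' => 'banana'),
  (object) array('name' => 'Apple'),
  (object) array('name' => 'cherry'),
);
usort($tags, '_example_sort_by_title');
```

Only the comparison key is transliterated; the tag names themselves are left untouched for display.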
I did not put in any code to check iconv availability, as it has been built into PHP since PHP 4 (even under Windows).
I'll post a patch if it looks OK to you.
Best,
Christian
Comment #4
Bèr Kessels commented:
The problem with iconv is that, although it may be available on most systems, it is not actually a requirement for Drupal. Hence, if you read http://api.drupal.org/api/4.7/function/drupal_convert_to_utf8 you will see that it checks for the existence of certain functions.
I have asked on #drupal for help on this, and will post a mail to the devel mailing list.
Bèr
Comment #5
Bèr Kessels commented:
Marking for HEAD. That way we can fix it for all branches, including 5 and 4.
Comment #6
Bèr Kessels commented:
Bump. Can someone please turn this into a patch?
Comment #7
Bèr Kessels commented:
Comment #8
chrissearle commented:
Bèr - what's the status here?
Reading thru - you were going to ask on #drupal etc - did you get any response?
Or do you still want a patch rolled using either to7bit or iconv?
If I know what way you want to head I can roll a patch against CVS head or CVS d6 branch - since I'm also getting bitten by this ;)
Comment #9
chrissearle commented:
Thinking - we could patch it so that tagadelic_sort_by_title calls a function similar to http://api.drupal.org/api/function/drupal_convert_to_utf8 - but one that goes the other way. That would mean we check for each function in turn before converting and, in this case, return the original string if no options are available.
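A rough sketch of that idea, assuming UTF-8 input. The function name `_example_to_ascii_or_original` is hypothetical, and only iconv is checked here; any other converter would be another branch of the same shape.

```php
<?php
/**
 * Hypothetical helper: convert a UTF-8 string to ASCII if a converter is
 * available, otherwise hand back the original string unchanged.
 */
function _example_to_ascii_or_original($name) {
  // Only use iconv when the extension is actually present.
  if (function_exists('iconv')) {
    $out = @iconv('UTF-8', 'ASCII//TRANSLIT', $name);
    if ($out !== FALSE) {
      return $out;
    }
  }
  // No usable converter (or the conversion failed): keep the original.
  return $name;
}
```

Returning the original string keeps the sort working, with today's ordering, on systems where no converter is installed.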
BUT
What encoding should we be converting from and to?
Can I assume that the tags are in UTF-8? Is it a given that converting to say ASCII will give the correct sort order? I'm not sure.
The following snippet:
(my test file encoded in iso-8859-1)
Responds
Which would mean that æ and å would be sorted as a, and ø as o. This is wildly incorrect for the Norwegian locale (all three should be sorted at the end, after z).
So - translit doesn't solve the sorting for me - as far as I can see - I get the same sorting as I do with today's system.
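To make the Norwegian ordering concrete, here is a small sketch that sorts æ, ø and å after z. It is not the attached patch and not a general solution: the function names are hypothetical, input is assumed to be UTF-8, and it relies on the ASCII characters '{', '|' and '}' coming directly after 'z'.

```php
<?php
/**
 * Hypothetical helper: build a byte-comparable sort key in which
 * ae, oe and aa (both cases) sort after 'z', in that order.
 * Caveat: names containing literal '{', '|' or '}' would collide.
 */
function _example_norwegian_sort_key($name) {
  $map = array(
    'æ' => '{', 'Æ' => '{',
    'ø' => '|', 'Ø' => '|',
    'å' => '}', 'Å' => '}',
  );
  // Replace first, then lowercase the remaining ASCII letters.
  return strtolower(strtr($name, $map));
}

/**
 * Hypothetical callback for usort(): sort tag objects by title.
 */
function _example_sort_by_title_norwegian($a, $b) {
  return strcmp(_example_norwegian_sort_key($a->name), _example_norwegian_sort_key($b->name));
}

// Usage: sort an array of tag objects in place.
$tags = array();
foreach (array('øl', 'zebra', 'ål', 'eple', 'ære') as $name) {
  $tags[] = (object) array('name' => $name);
}
usort($tags, '_example_sort_by_title_norwegian');
// Resulting order: eple, zebra, ære, øl, ål
```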
Looking at the docs for strnatcasecmp (http://no.php.net/manual/en/function.strnatcasecmp.php) I see a post about existing bugs in the PHP implementation (albeit without links). The function posted there _appears_ to give the correct sorting (very brief testing) for source characters in both Latin-1 and Unicode at least.
Would this be acceptable? If so - patch attached (against HEAD).
Comment #10
chrissearle commented:
Setting to "code needs review" just in case using the above function is OK :)
Comment #11
Bèr Kessels commented:
http://drupal.org/node/551500 was marked as a duplicate of this issue.
Comment #12
Bèr Kessels commented:
I would also really like a small investigation into how other parts of Drupal handle this:
* Views: do they take unusual characters into consideration? And if yes, how?
* Core table sorting: AFAIK it does not consider characters, but lets the database do the ordering. How does this work? (Try, e.g., admin/content/node sorted by title.)
* Tag overview: how does that sort? I would be inclined to say: let us sort exactly the way the core tag listing sorts.
That said: I like your patch, but would prefer it if you dropped the extra wrapper function and called your private function directly.
I also think we should prefix that function with tagadelic: tagadelic_strnatcasemp(), to adhere to the Drupal guidelines.
Bèr
Comment #13
Bèr Kessels commented:
Closing "needs work" that has been open for a long time, without anyone working on it.
Comment #14
Jānis Bebrītis commented:
My function now looks like this; maybe it helps someone:
What do we do to get this permanently into the module code in new releases?
Comment #15
Jānis Bebrītis commented:
Comment #16
Bèr Kessels commented:
Please open a pull request on GitHub if you still want this feature.