When tags are using non 7bits chars, they are not sorted correctly.

In tagadelic.module, I made the following change to _tagadelic_sort_by_title in order to use mbstring to sort tags correctly (it works at least for french tags):

Maybe iconv could be used there.

Function "to7bit" comes from php.net.

/**
* @args string $text line of encoded text
* string $from_enc (encoding type of $text, e.g. UTF-8, ISO-8859-1)
*
* @returns 7bit representation
*/
function to7bit($text,$from_enc) {
if (isset($from_enc))
$from_enc = mb_detect_encoding($text,"auto");
$text = mb_convert_encoding($text,'HTML-ENTITIES',$from_enc);
$text = preg_replace(
array('/ß/','/&(..)lig;/',
'/&([aouAOU])uml;/','/&(.)[^;]*;/'),
array('ss',"$1","$1".'e',"$1"),
$text);
return $text;
}

/**
* callback for usort, sort by count
*/
function _tagadelic_sort_by_title($a, $b) {

if (extension_loaded('mbstring'))
return strnatcasecmp(to7bit($a->name), to7bit($b->name));
else
return strnatcasecmp($a->name, $b->name);
}
CommentFileSizeAuthor
#9 tagadelic_sort.patch2.43 KBchrissearle
Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Bèr Kessels’s picture

For reference, an old issue about this here: http://drupal.org/node/55842

Would you be so kind as to provide a real patch? this eases the review process a lot. Details here: http://drupal.org/diifandpatch

On the code:
I think its a better idea to use the Drupal core API instead of writing your own conversion api: http://api.drupal.org/api/4.7/function/drupal_convert_to_utf8

Leeteq’s picture

(subscribing...)

cquest’s picture

I'm sorry for my quick and short post... I'm a bit new at digging in Drupal and modules code ;-)

The bug I found IS that strnatcasecmp is a single byte compare function, not a double-byte one if I'm not wrong. As text passed to it can be double-byte (mostly utf8), it does not work in all cases as expected (it was my case with obvious tags start with 'é' being put in from of 'a' tags instead of being mixed with 'e' ones).

So, to solve I see to possibilities:
- use a real double-byte comparison for the sort,
- convert the text into plain 7bit ascii and compare this instead of the double-byte text.

As I could not find a double-byte comparison function in mbstring or iconv, I moved to the second solution.

drupal_convert_to_utf8 seems to only provide "anything" to utf8 conversion... so I could not use it because I was looking for an "anything" to ascii conversion.

I found the to7bit sample code on php.net's comments. It doesn't look universal and rock solid.

I've just found that it could be replaced by a single call to iconv like:

iconv('UTF-8', 'ASCII//TRANSLIT',$str)

This will translate text like "Cœur déchiré" into "Coeur dechire" (seen the 'œ' converted to its 2 chars equivalent ?).

So... another, much simpler version could be (not tested yet):

/**
* callback for usort, sort by count
*/
function _tagadelic_sort_by_title($a, $b) {

$namea = iconv('UTF-8', 'ASCII//TRANSLIT',$a->name);
$nameb = iconv('UTF-8', 'ASCII//TRANSLIT',$b->name);
if (!$namea or !$nameb)
  return strnatcasecmp(to7bit($a->name), to7bit($b->name));
else
  return strnatcasecmp($namea, $nameb);
}

I did not put any code to check iconv availablity as it is builtin in PHP since PHP4 (even under Windows).

I'll post a patch if it looks ok for you.

Best,
Christian

Bèr Kessels’s picture

The problem with iconv is that it may be available for most systems, it is not really a requirement for Drupal. Hence, if you read http://api.drupal.org/api/4.7/function/drupal_convert_to_utf8 you see that it checks for existence of certain functions.

I have asked on #drupal for help on this, and will post a mail to the devel mailinglist.

Bèr

Bèr Kessels’s picture

Version: 4.7.x-2.0 » master

Marking for HEAD. That way we can fix it for all branches, including 5 and 4.

Bèr Kessels’s picture

Status: Active » Needs review

bump. Can someone please turn this into a patch?

Bèr Kessels’s picture

Title: tags not sorted correctly when using non 7bit ascii text (patch included) » tags not sorted correctly when using non 7bit ascii text
Status: Needs review » Active
chrissearle’s picture

Bèr - what's that status here?

Reading thru - you were going to ask on #drupal etc - did you get any response?

Or do you still want a patch rolled using either to7bit or iconv?

If I know what way you want to head I can roll a patch against CVS head or CVS d6 branch - since I'm also getting bitten by this ;)

chrissearle’s picture

FileSize
2.43 KB

Thinking - we could patch it so that tagadelic_sort_by_title calls a function similar to http://api.drupal.org/api/function/drupal_convert_to_utf8 - but that goes the other way. That would mean we check for each function in turn before converting and in this case - return the original string if no options available.

BUT

What encoding should we be converting from and to?

Can I assume that the tags are in UTF-8? Is it a given that converting to say ASCII will give the correct sort order? I'm not sure.

The following snippet:

  if (function_exists('iconv')) {
    print "1)" . iconv('UTF-8', 'ASCII', 'æ ø å') . "\n";
    print "2)" . iconv('UTF-8', 'ASCII//TRANSLIT', 'æ ø å') . "\n";
    print "3)" . iconv('ISO-8859-1', 'ASCII', 'æ ø å') . "\n";
    print "4)" . iconv('ISO-8859-1', 'ASCII//TRANSLIT', 'æ ø å') . "\n";
  }

(my test file encoded in iso-8859-1)

Responds

1)
2)
3)
4)ae o a

Which would mean that æ and å would be sorted as a and ø as o. This is wildly incorrect for the norwegian locale (all three should be sorted at the end after z).

So - translit doesn't solve the sorting for me - as far as I can see - I get the same sorting as I do with today's system.

Looking at the docs for strnatcasecmp (http://no.php.net/manual/en/function.strnatcasecmp.php) I see a post about existing bugs in the php implementation (albeit without links). This function _appears_ to give the correct sorting (very brief testing) for both source chars in latin1 and unicode at least.

Would this be acceptable? If so - patch attached (against HEAD).

chrissearle’s picture

Status: Active » Needs review

Setting to "code needs review" just in case using the above function is OK :)

Bèr Kessels’s picture

http://drupal.org/node/551500 was marked duplicate of this issue.

Bèr Kessels’s picture

Status: Needs review » Needs work

I would also really like a small investigation on how other parts of drupal handle this:
* Views: do they take strange charachters in consideration? and if yes, how?
* Core tablesorting: afaik does not consider character, but orders on database. How does this work? (try e.g, the admin/content/node sorted by title)
* Tag overview: how does that sort? I would be inclined to say: let us sort exactly the way the core tag-listing sorts.

That said: I like your patch, but would prefer it if you drop the extra wrapper function, and call your private function yourself.
I also find that we should prefix that function with tagadelic: tagadelic_strnatcasemp() to adhere the drupal guidelines.

Bèr

Bèr Kessels’s picture

Status: Needs work » Closed (won't fix)

Closing "needs work" that has been open for a long time, without anyone working on it.

Jānis Bebrītis’s picture

my function now looks like this, maybe it helps someone:

function _tagadelic_sort_by_title($a, $b) {
	if (function_exists('iconv')) {
		$namea = iconv('UTF-8', 'ASCII//TRANSLIT',$a->name);
		$nameb = iconv('UTF-8', 'ASCII//TRANSLIT',$b->name);
	} else {
		$namea = $a->name;
		$nameb = $b->name;
	}
	
	if (!$namea or !$nameb) return strnatcasecmp(to7bit($a->name), to7bit($b->name));
	else return strnatcasecmp($namea, $nameb);
}

what do we do to get this permanently into module code in new releases?

Jānis Bebrītis’s picture

Category: bug » support
Status: Closed (won't fix) » Needs review
Bèr Kessels’s picture

Status: Needs review » Closed (won't fix)

Please open a pull request on github if you still want this feature.