language_from_browser() doesn't parse language tags correctly, has a broken logic

iliphil - February 14, 2008 - 15:15
Project:Drupal
Version:7.x-dev
Component:language system
Category:bug report
Priority:normal
Assigned:Unassigned
Status:needs work
Description

MSIE sends accept language headers that look like this "en-nz,en-us;q=0.7,sv-se;q=0.3" but language_from_browser() wont match that to either 'en' or 'sv' which is what is in the language table

This drop-in replacement code for language_from_browser() works better. (I may try to send a patch later).

<?php
function language_from_browser() {
 
// Specified by the user via the browser's Accept Language setting
  // Samples: "hu, en-us;q=0.66, en;q=0.33", "hu,en-us;q=0.5"
  // MSIE 7: "en-nz,en-us;q=0.7,sv-se;q=0.3"

 
$browser_langs = array();

  if (isset(
$_SERVER['HTTP_ACCEPT_LANGUAGE'])) {
   
preg_match_all('"(((\S\S)-?(\S\S)?)(;q=([0-9.]+))?)\s*(,\s*|$)"',strtolower($_SERVER['HTTP_ACCEPT_LANGUAGE']),$browser_accept);
    for (
$i = 0; $i < count($browser_accept); $i++) {
     
// The language part is either a code or a code with a quality.
      // We cannot do anything with a * code, so it is skipped.
      // If the quality is missing, it is assumed to be 1 according to the RFC.
      
if(!empty($browser_accept[2][$i])) $browser_langs[$browser_accept[2][$i]] = ($browser_accept[6][$i]? (float) $browser_accept[6][$i] : 1.0);
       if(!empty(
$browser_accept[3][$i]) && empty($browser_langs[$browser_accept[3][$i]])) $browser_langs[$browser_accept[3][$i]] = ($browser_accept[6][$i]? (float) $browser_accept[6][$i]-0.01 : 0.99);
   
    }
  }

 
// Order the codes by quality
 
arsort($browser_langs);

 
// Try to find the first preferred language we have
 
$languages = language_list('enabled');
  foreach (
$browser_langs as $langcode => $q) {
    if (isset(
$languages['1'][$langcode])) {
      return
$languages['1'][$langcode];
    }
  }
}
?>

#1

Gábor Hojtsy - February 15, 2008 - 13:34

Well, a patch would indeed be great.

#2

iliphil - February 25, 2008 - 21:24

here is the patch

AttachmentSizeStatusTest resultOperations
language.patch1.48 KBIdleFailed: 7387 passes, 0 fails, 1 exceptionView details | Re-test

#3

Gábor Hojtsy - February 27, 2008 - 10:58
Status:active» needs review

- There are some coding style issues in the patch (no spaces after commas, bad indentation, no brackets for the ifs)
- You match on \S\S-\S\S which means two char language and variant codes. Are we sure that we can only see these? (I am not)
- Is it guaranteed that the language codes are in descending priority order? The current code expects this (while the original code expected unique list of language codes).

#4

iliphil - February 27, 2008 - 14:53

Apologies for the style

The match is on (\S\S-?(\S\S)?) so the second \S\S is not required for a match. en, en-us, en-gb will all match.
It doesn't require that languages be in descending priority order so it doesn't matter if they aren't.
I can only say try it out.
arsort($browser_langs);
print_r( $browser_langs);

#5

j.somers - September 3, 2008 - 09:16

I filed a duplicated entry for this bug.

Any changes of it getting fixed in a next release?

#6

Damien Tournoud - September 3, 2008 - 15:22
Title:function language_from_browser doesn't handle language headers from MSIE7» language_from_browser() don't parse language tags correctly, has a broken logic
Version:6.0» 7.x-dev

Well, both the parsing and the logic of language_from_browser() is broken. According to RFC 2616 section 14.4:

A language-range matches a language-tag if it exactly equals the tag, or if it exactly equals a prefix of the tag such that the first tag character following the prefix is "-". The special range "*", if present in the Accept-Language field, matches every tag not matched by any other range present in the Accept-Language field.

So we have to go thru all enabled languages, assign them a qweight using the matching logic described above and select the language that fit the most.

Here is a full test case for language_from_browser() that outline the issue in the current implementation, along with a new implementation that fully pass the test case.

AttachmentSizeStatusTest resultOperations
221712-browser-language-test.patch4.39 KBIdleFailed: 7235 passes, 16 fails, 0 exceptionsView details | Re-test
221712-browser-language.patch2.72 KBIdleFailed: 9020 passes, 3 fails, 3 exceptionsView details | Re-test

#7

Damien Tournoud - September 3, 2008 - 15:23
Title:language_from_browser() don't parse language tags correctly, has a broken logic» language_from_browser() doesn't parse language tags correctly, has a broken logic

#8

j.somers - September 3, 2008 - 15:27

Can I safely use this patch on my D6 installation or is it going to cause conflicts.

Right now I started using the previously provided patch which seems to work pretty OK for what I need it.

#9

Damien Tournoud - September 3, 2008 - 15:34

Here is the same patch for D6.

AttachmentSizeStatusTest resultOperations
221712-browser-language-d6.patch3.85 KBIgnoredNoneNone

#10

Damien Tournoud - September 3, 2008 - 15:42

Oh, and the comparison between language tags should also be case insensitive (RFC 2616 section 3.10). Here is an updated patch.

AttachmentSizeStatusTest resultOperations
221712-browser-language-d6.patch3.06 KBIgnoredNoneNone
221712-browser-language.patch3.06 KBIdleFailed: 10267 passes, 1 fail, 0 exceptionsView details | Re-test

#11

j.somers - September 3, 2008 - 17:54

Thank you.

You have just been promoted to my monthly hero!

#12

System Message - November 16, 2008 - 21:45
Status:needs review» needs work

The last submitted patch failed testing.

#13

lilou - November 17, 2008 - 13:35

#14

System Message - February 23, 2009 - 08:30
Status:needs review» needs work

The last submitted patch failed testing.

#15

Damien Tournoud - March 3, 2009 - 13:53
Status:needs work» needs review

Reroll.

AttachmentSizeStatusTest resultOperations
221712-browser-language.patch7.26 KBIdleFailed: 10592 passes, 1 fail, 0 exceptionsView details | Re-test

#16

System Message - March 22, 2009 - 11:15
Status:needs review» needs work

The last submitted patch failed testing.

#17

Bodo Maass - July 2, 2009 - 13:56
Title:language_from_browser() doesn't parse language tags correctly, has a broken logic» patch fails to give precendence to double language code over short language code

Hi Damien,
Great work on this issue. I just realized that this is still broken in D6 when I played with the language settings of my browser and always got the wrong language. While your code is a great step in the right direction, I think it still doesn't work correctly.
I set my browser (Firefox 3.5) to accept languages in the following order: de-de, en, de, en-us
This results in this header: "de-de,en;q=0.8,de;q=0.5,en-us;q=0.3"

My site has the languages 'de' and 'en' enabled. According to the setting, my site should be displayed in German (de). What happens is that your code iterates the languages available on the site ('en' and 'de'). When it finds a match for the single language code, it skips the double language code, which has a higher qvalue but gets ignored. So in this example the code incorrectly ignores de-de and finds that en has a higher value than de. Hence the incorrect result is 'en', while it should be 'de'.

#18

Bodo Maass - July 2, 2009 - 13:58
Title:patch fails to give precendence to double language code over short language code» language_from_browser() doesn't parse language tags correctly, has a broken logic

sorry, didn't mean to change the title of this issue

#19

Bodo Maass - July 2, 2009 - 21:30

Here is a slightly modified patch that works correctly for double language codes.

AttachmentSizeStatusTest resultOperations
221712-browser-language_1.patch7.42 KBIgnoredNoneNone

#20

Damien Tournoud - July 23, 2009 - 17:10

@Bodo: if you think you found a bug in my patch, please extend the test case.

But in that case, I believe you are wrong: you are saying that your site offers 'en' and 'de'. Not 'en-US' and 'de-DE', the generic 'en' and 'de'. In that case, the correct match for...

Accept-Language: de-de,en;q=0.8,de;q=0.5,en-us;q=0.3

... is really 'en', because your site doesn't offer the 'de-DE' specific language.

#21

Bodo Maass - August 12, 2009 - 08:43

Hi Damien,

I haven't had time yet to look at the D7 testing framework, but will do so and then update the test.

About the issue: As far as I understand, drupal sites don't normally offer compound locales such as 'en-US' or 'de-DE', but always just simple locales such as 'en' and 'de'. Is that correct? I guess it is possible to set up compound locales, but all the translations that ship with drupal are only simple locales.
If so, the compound locales come from the browser settings of the user.

I believe the intention of the code was that when the users browser requests a compound locale such as 'en-US', and the drupal site supports the main locale for this (in this case 'en'), there should be a match. Is that correct? I think the original code didn't work like that, but I agree that the tests should be used to define the behaviour.

#22

Damien Tournoud - August 12, 2009 - 08:58

@Bodo: the patch above is supposed to support serving compound locales.

Anyway, if you only have a main locale 'en' and 'de', the correct match a strange language configuration like:

Accept-Language: de-de,en;q=0.8,de;q=0.5,en-us;q=0.3

is definitely 'en', because your site doesn't provide the specific 'de-de' language.

#23

Bodo Maass - August 12, 2009 - 15:06

@Damien:
Fair enough about serving 'en' in the previous case.

What should happen if the site only has 'en' and 'de', and the client browser is set to accept 'de_de' and 'en_us' in that order, and nothing else? (Accept-Language: de-de,en-us;q=0.5)

My user expectation would be that the site will be served in 'de', although there is no direct match of locales. Is that what the spec would imply?

I guess the answer is NO, but this is unintuitive for a non-technical user. IE and FF also offer no help whatsoever for this setting beyond 'select your preferred languages here'. FF makes this worse by offering a multitude of compound locales that might get accidentally selected instead of the main locale.

#24

Damien Tournoud - August 12, 2009 - 15:13

RFC2616 section 14.4 (language-range is the thing returned by the browser, language-tag is the server provided languages):

A language-range matches a language-tag if it exactly equals the tag, or if it exactly equals a prefix of the tag such that the first tag character following the prefix is "-". The special range "*", if present in the Accept-Language field, matches every tag not matched by any other range present in the Accept-Language field.

ie. 'en' sent by the browser matches 'en-US' provided by the server, but 'en-US' sent by the browser doesn't match 'en' provided by the server.

In the case you described, no language would match, so the server will its default language.

#25

Damien Tournoud - August 12, 2009 - 15:14

If you are not convinced already, RFC2616 has the following node at the end of the same section:

Note: When making the choice of linguistic preference available to
the user, we remind implementors of the fact that users are not
familiar with the details of language matching as described above,
and should provide appropriate guidance. As an example, users
might assume that on selecting "en-gb", they will be served any
kind of English document if British English is not available.
A
user agent might suggest in such a case to add "en" to get the
best matching behavior.

(emphasis mine)

#26

Bodo Maass - August 12, 2009 - 15:30

Hi Damien,

I'm already convinced, and your code seems to work according to the spec. It just took me a while to understand that selecting any of the compound locales without the main locale did not do what I expected (as the RFC helpfully points out).
As I said, all browsers I checked (FF, IE, Opera) offer no help at all for this, so I might not be the only one who got this wrong. The browser should probably add the main locale (or at least suggest to do so) if it is not already present, but that is not a Drupal issue.

In any case, thanks for your explanation. I guess your original code is correct after all.

#27

Bodo Maass - August 12, 2009 - 16:29

I just realized how I got started with this whole confusion. My installation of IE7 doesn't offer two letter locales at all. Instead, only compound locales are available for selection, which is bound to produce unexpected results as discussed above. I'll have to check if this is fixed in IE8, and if this is normal for IE7 or if my install is somehow corrupted.

#28

Plascual - October 19, 2009 - 18:29

I'm running into this problem. Is there any new development ?

All browsers work as expected, but IE7, IE8 and Safari.

Even if IE or Safari has fr-CA as prefered language, it displays the front-page in the default language (English).

Does this patch apply to D6 (14) ?

 
 

Drupal is a registered trademark of Dries Buytaert.