MSIE sends accept language headers that look like this "en-nz,en-us;q=0.7,sv-se;q=0.3" but language_from_browser() wont match that to either 'en' or 'sv' which is what is in the language table

This drop-in replacement code for language_from_browser() works better. (I may try to send a patch later).


function language_from_browser() {
  // Specified by the user via the browser's Accept Language setting
  // Samples: "hu, en-us;q=0.66, en;q=0.33", "hu,en-us;q=0.5"
  // MSIE 7: "en-nz,en-us;q=0.7,sv-se;q=0.3"

  $browser_langs = array();

  if (isset($_SERVER['HTTP_ACCEPT_LANGUAGE'])) {
    preg_match_all('"(((\S\S)-?(\S\S)?)(;q=([0-9.]+))?)\s*(,\s*|$)"',strtolower($_SERVER['HTTP_ACCEPT_LANGUAGE']),$browser_accept);
    for ($i = 0; $i < count($browser_accept); $i++) {
      // The language part is either a code or a code with a quality.
      // We cannot do anything with a * code, so it is skipped.
      // If the quality is missing, it is assumed to be 1 according to the RFC.
       if(!empty($browser_accept[2][$i])) $browser_langs[$browser_accept[2][$i]] = ($browser_accept[6][$i]? (float) $browser_accept[6][$i] : 1.0);
	   if(!empty($browser_accept[3][$i]) && empty($browser_langs[$browser_accept[3][$i]])) $browser_langs[$browser_accept[3][$i]] = ($browser_accept[6][$i]? (float) $browser_accept[6][$i]-0.01 : 0.99);
	
    }
  }

  // Order the codes by quality
  arsort($browser_langs);

  // Try to find the first preferred language we have
  $languages = language_list('enabled');
  foreach ($browser_langs as $langcode => $q) {
    if (isset($languages['1'][$langcode])) {
      return $languages['1'][$langcode];
    }
  }
}


Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Gábor Hojtsy’s picture

Well, a patch would indeed be great.

iliphil’s picture

FileSize
1.48 KB

here is the patch

Gábor Hojtsy’s picture

Status: Active » Needs review

- There are some coding style issues in the patch (no spaces after commas, bad indentation, no brackets for the ifs)
- You match on \S\S-\S\S which means two char language and variant codes. Are we sure that we can only see these? (I am not)
- Is it guaranteed that the language codes are in descending priority order? The current code expects this (while the original code expected unique list of language codes).

iliphil’s picture

Apologies for the style

The match is on (\S\S-?(\S\S)?) so the second \S\S is not required for a match. en, en-us, en-gb will all match.
It doesn't require that languages be in descending priority order so it doesn't matter if they aren't.
I can only say try it out.
arsort($browser_langs);
print_r( $browser_langs);

j.somers’s picture

I filed a duplicated entry for this bug.

Any changes of it getting fixed in a next release?

Damien Tournoud’s picture

Title: function language_from_browser doesn't handle language headers from MSIE7 » language_from_browser() don't parse language tags correctly, has a broken logic
Version: 6.0 » 7.x-dev
FileSize
2.72 KB
4.39 KB

Well, both the parsing and the logic of language_from_browser() is broken. According to RFC 2616 section 14.4:

A language-range matches a language-tag if it exactly equals the tag, or if it exactly equals a prefix of the tag such that the first tag character following the prefix is "-". The special range "*", if present in the Accept-Language field, matches every tag not matched by any other range present in the Accept-Language field.

So we have to go thru all enabled languages, assign them a qweight using the matching logic described above and select the language that fit the most.

Here is a full test case for language_from_browser() that outline the issue in the current implementation, along with a new implementation that fully pass the test case.

Damien Tournoud’s picture

Title: language_from_browser() don't parse language tags correctly, has a broken logic » language_from_browser() doesn't parse language tags correctly, has a broken logic
j.somers’s picture

Can I safely use this patch on my D6 installation or is it going to cause conflicts.

Right now I started using the previously provided patch which seems to work pretty OK for what I need it.

Damien Tournoud’s picture

Here is the same patch for D6.

Damien Tournoud’s picture

Oh, and the comparison between language tags should also be case insensitive (RFC 2616 section 3.10). Here is an updated patch.

j.somers’s picture

Thank you.

You have just been promoted to my monthly hero!

Status: Needs review » Needs work

The last submitted patch failed testing.

lilou’s picture

Status: Needs work » Needs review

Status: Needs review » Needs work

The last submitted patch failed testing.

Damien Tournoud’s picture

Status: Needs work » Needs review
FileSize
7.26 KB

Reroll.

Status: Needs review » Needs work

The last submitted patch failed testing.

Bodo Maass’s picture

Title: language_from_browser() doesn't parse language tags correctly, has a broken logic » patch fails to give precendence to double language code over short language code

Hi Damien,
Great work on this issue. I just realized that this is still broken in D6 when I played with the language settings of my browser and always got the wrong language. While your code is a great step in the right direction, I think it still doesn't work correctly.
I set my browser (Firefox 3.5) to accept languages in the following order: de-de, en, de, en-us
This results in this header: "de-de,en;q=0.8,de;q=0.5,en-us;q=0.3"

My site has the languages 'de' and 'en' enabled. According to the setting, my site should be displayed in German (de). What happens is that your code iterates the languages available on the site ('en' and 'de'). When it finds a match for the single language code, it skips the double language code, which has a higher qvalue but gets ignored. So in this example the code incorrectly ignores de-de and finds that en has a higher value than de. Hence the incorrect result is 'en', while it should be 'de'.

Bodo Maass’s picture

Title: patch fails to give precendence to double language code over short language code » language_from_browser() doesn't parse language tags correctly, has a broken logic

sorry, didn't mean to change the title of this issue

Bodo Maass’s picture

Here is a slightly modified patch that works correctly for double language codes.

Damien Tournoud’s picture

@Bodo: if you think you found a bug in my patch, please extend the test case.

But in that case, I believe you are wrong: you are saying that your site offers 'en' and 'de'. Not 'en-US' and 'de-DE', the generic 'en' and 'de'. In that case, the correct match for...

Accept-Language: de-de,en;q=0.8,de;q=0.5,en-us;q=0.3

... is really 'en', because your site doesn't offer the 'de-DE' specific language.

Bodo Maass’s picture

Hi Damien,

I haven't had time yet to look at the D7 testing framework, but will do so and then update the test.

About the issue: As far as I understand, drupal sites don't normally offer compound locales such as 'en-US' or 'de-DE', but always just simple locales such as 'en' and 'de'. Is that correct? I guess it is possible to set up compound locales, but all the translations that ship with drupal are only simple locales.
If so, the compound locales come from the browser settings of the user.

I believe the intention of the code was that when the users browser requests a compound locale such as 'en-US', and the drupal site supports the main locale for this (in this case 'en'), there should be a match. Is that correct? I think the original code didn't work like that, but I agree that the tests should be used to define the behaviour.

Damien Tournoud’s picture

@Bodo: the patch above is supposed to support serving compound locales.

Anyway, if you only have a main locale 'en' and 'de', the correct match a strange language configuration like:

Accept-Language: de-de,en;q=0.8,de;q=0.5,en-us;q=0.3

is definitely 'en', because your site doesn't provide the specific 'de-de' language.

Bodo Maass’s picture

@Damien:
Fair enough about serving 'en' in the previous case.

What should happen if the site only has 'en' and 'de', and the client browser is set to accept 'de_de' and 'en_us' in that order, and nothing else? (Accept-Language: de-de,en-us;q=0.5)

My user expectation would be that the site will be served in 'de', although there is no direct match of locales. Is that what the spec would imply?

I guess the answer is NO, but this is unintuitive for a non-technical user. IE and FF also offer no help whatsoever for this setting beyond 'select your preferred languages here'. FF makes this worse by offering a multitude of compound locales that might get accidentally selected instead of the main locale.

Damien Tournoud’s picture

RFC2616 section 14.4 (language-range is the thing returned by the browser, language-tag is the server provided languages):

A language-range matches a language-tag if it exactly equals the tag, or if it exactly equals a prefix of the tag such that the first tag character following the prefix is "-". The special range "*", if present in the Accept-Language field, matches every tag not matched by any other range present in the Accept-Language field.

ie. 'en' sent by the browser matches 'en-US' provided by the server, but 'en-US' sent by the browser doesn't match 'en' provided by the server.

In the case you described, no language would match, so the server will its default language.

Damien Tournoud’s picture

If you are not convinced already, RFC2616 has the following node at the end of the same section:

Note: When making the choice of linguistic preference available to
the user, we remind implementors of the fact that users are not
familiar with the details of language matching as described above,
and should provide appropriate guidance. As an example, users
might assume that on selecting "en-gb", they will be served any
kind of English document if British English is not available.
A
user agent might suggest in such a case to add "en" to get the
best matching behavior.

(emphasis mine)

Bodo Maass’s picture

Hi Damien,

I'm already convinced, and your code seems to work according to the spec. It just took me a while to understand that selecting any of the compound locales without the main locale did not do what I expected (as the RFC helpfully points out).
As I said, all browsers I checked (FF, IE, Opera) offer no help at all for this, so I might not be the only one who got this wrong. The browser should probably add the main locale (or at least suggest to do so) if it is not already present, but that is not a Drupal issue.

In any case, thanks for your explanation. I guess your original code is correct after all.

Bodo Maass’s picture

I just realized how I got started with this whole confusion. My installation of IE7 doesn't offer two letter locales at all. Instead, only compound locales are available for selection, which is bound to produce unexpected results as discussed above. I'll have to check if this is fixed in IE8, and if this is normal for IE7 or if my install is somehow corrupted.

Plascual’s picture

I'm running into this problem. Is there any new development ?

All browsers work as expected, but IE7, IE8 and Safari.

Even if IE or Safari has fr-CA as prefered language, it displays the front-page in the default language (English).

Does this patch apply to D6 (14) ?

markus_petrux’s picture

For those insterested in a fix for D6, without the need to patch core, so until the fix is committed, I'm using the following code snippet at the bottom of settings.php

/**
 * Help Drupal identify the correct browser language.
 *
 * Make sure browser languages are lowercase, and simple langcode
 * entries exist when complex language entries are used.
 *
 * @see language_from_browser() in includes/language.inc
 */
if (isset($_SERVER['HTTP_ACCEPT_LANGUAGE'])) {
  $http_accept_language = array();
  $browser_languages = array_filter(array_map('trim', explode(',', strtolower($_SERVER['HTTP_ACCEPT_LANGUAGE']))));
  foreach ($browser_languages as $browser_language) {
    if (preg_match('!^(?:([a-z]+)([-_a-z]*))(;q=(?:[\.0-9]+))?$!', $browser_language, $found)) {
      if (!empty($found[1])) {
        if (!isset($http_accept_language[$found[1]])) {
          if (!empty($found[2])) {
            $http_accept_language[$browser_language] = $browser_language;
            $http_accept_language[$found[1]] = $found[1] . (!empty($found[3]) ? $found[3] : ';q=1.0');
          }
          else {
            $http_accept_language[$found[1]] = $browser_language;
          }
        }
      }
      else {
        $http_accept_language[$browser_language] = $browser_language;
      }
    }
    else {
      $http_accept_language[$browser_language] = $browser_language;
    }
  }
  $_SERVER['HTTP_ACCEPT_LANGUAGE'] = implode(',', $http_accept_language);
}

In IE8, I've found language entries that start with es_xxx, or es-ES, and the above snippet rebuilds the variable $_SERVER['HTTP_ACCEPT_LANGUAGE'] to ensure everything is lower cased, and the main language prefix exists in the list (ie. when it finds 'es-ES', it makes sure 'es' exists), before it reaches language_from_browser() in Drupal core.

GiorgosK’s picture

Had the problem described above with IE8 and drupal6
this code in settings.php seems to solve the problem

Damien Tournoud’s picture

Status: Needs work » Needs review
FileSize
6.62 KB

Rerolled one last time.

Status: Needs review » Needs work

The last submitted patch, 221712-fix-browser-language-detection.patch, failed testing.

Damien Tournoud’s picture

Status: Needs work » Needs review
FileSize
6.61 KB

Oups. I was returning a language object instead of a language code (and the unit test did the same).

This one should pass :)

markus_petrux’s picture

I think the preg_match_all() operation and the next foreach loop could be simplified a little if strtolower() is applied to the Accept-language header.

if (preg_match_all('@([a-z]{1,8}(?:-[a-z]{1,8})?|\*)(?:;q=(1(?:\.000)?|0(?:\.[0-9]{0,3})?))?\s*,?\s*@', strtolower($_SERVER['HTTP_ACCEPT_LANGUAGE']), $matches, PREG_SET_ORDER)) {
chx’s picture

Status: Needs review » Needs work

We talked on IRC and Damien will rewrite his C code in Drupal-style PHP.

Damien Tournoud’s picture

Status: Needs work » Needs review
FileSize
6.93 KB

This should do it.

chx’s picture

Status: Needs review » Needs work
    if (isset($browser_langs[$langcode])) {
      $qvalue = $browser_langs[$langcode];
    }
    else if (isset($browser_langs[$prefix])) {
      $qvalue = $browser_langs[$prefix];
    }
    else if (isset($browser_langs['*'])) {
      $qvalue = $browser_langs['*'];
    }

isnt

   foreach(array($langcode, $prefix, '*') as $key) {
     if (isset($browser_langs[$key])) {
       $qvalue = $browser_langs[$key];
       break;
     }
   }

shorter?

Status: Needs work » Needs review

Re-test of language.patch from comment #2 was requested by azovsky.

drubage’s picture

Subscribing. For my purposes just adding this line works for now:

      if (preg_match("!([a-z-]+)(;q=([0-9\\.]+))?!", trim($browser_accept[$i]), $found)) {
	$found[1]=str_replace('-','',$found[1]); // ADDED LINE TO REMOVE DASHES FROM LANGUAGE CODE
        $browser_langs[$found[1]] = (isset($found[3]) ? (float) $found[3] : 1.0);
      }

Although better logic would be to first search on the full langauge (like for Brazilian Portuguese it would look for "pt-br" and then, if not found, it would search for just the base code "pt").

-Drew

c960657’s picture

Status: Needs review » Needs work

It took me a while to grasp the RFC, but I think the patch follows it closely—with one exception: A language-tag may contain any number of subparts, e.g. en-GB-oed (Oxford English Dictionary spelling), and there are a few of these in the official language registry. I don't know if they have any use in practice, though. I guess we can ignore them for now—any sane user would supply en-GB in addition to en-GB-oed.

+  // RFC2616 (section 14.4) defines the Accept-Language header as followed:

Elsewhere in core (with one exception) we write “RFC 2616”, i.e. with a space.

+    else if (isset($browser_langs[$prefix])) {

This should use elseif instead.

+      $weight = isset($match[2]) ? (float) $match[2] : 1;
+      $browser_langs[$langcode] = (int) ($weight * 1000);

I think $weight is a bad name for a variable containing a q-value. The term weight is used in a comment in the same function to describe the priority of the enabled languages.

+      'en-US' => (object) array(
+        'language' => 'en-US',
+        'enabled' => 1,
+        'weight' => 0.6,
+      ),

In the database, {languages}.weight is an integer column. I think we should use integers here too.

How about a few tests that contain both fr and fr-CA. Contrary to en/en-US, the most specific has the highest priority.

Bodo Maass’s picture

I'm still not convinced that Internet Explorer 8 implements the RFC correctly. On my new install of a german Windows 7, IE8 lists only the following language preference:
Deutsch [de-DE]

That's it. No [de] to fall back on. The interface will now let me add Deutsch [de] if I go to the language settings of the browser, but the average user probably won't do that, and is probably even less likely to have read (and understood) the RFC.
Interestingly, my IE8 install on German Windows XP, which was upgraded from IE6 and then IE7, does not list the single language entries as options in the settings.

In conclusion, I'm all for supporting standards, but the reality is (as we all know), that few site owners would insist on following the standard at the cost of breaking the site for IE users. Hence I strongly request (if that's not happening yet) that some of the tests reflect the default installations of IE8 to make sure that those users get to see the correct language.

markus_petrux’s picture

Yes, I also saw non-standard things in IE8 with Spanish related options. Quoting myself from #29:

In IE8, I've found language entries that start with es_xxx, or es-ES, and the above snippet rebuilds the variable $_SERVER['HTTP_ACCEPT_LANGUAGE'] to ensure everything is lower cased, and the main language prefix exists in the list (ie. when it finds 'es-ES', it makes sure 'es' exists), before it reaches language_from_browser() in Drupal core.

I think the parser should be prepared to accept "_" as a separator between "major-minor" language codes, also, it should be ready to deal with "aa-XX" or "aa_XX" with a higher q than "bb", and fallback to "aa" when necessary.

tomsm’s picture

Status: Needs review » Needs work

subscribing

In Belgium we have the same problem. We have "nl-BE" or "fr-BE" as preferred languages in our browser.
Drupal does not recognize this and shows the English frontpage.

I added the code of #29 to settings.php. Works great with Firefox 3.6 and IE8.

Damien Tournoud’s picture

Status: Needs work » Needs review
FileSize
7.96 KB

Following c960657 review (thanks!), here is an improved patch with (1) fixed code style, (2) support for language codes with multiple dashes, (3) improved test coverage.

I replaced the strict regexp with a more lenient (but more readable) one. I'm not sure we need to add special cases for IE8, at least not until we have proof that such broken behavior is actually in the wild.

CKoch’s picture

Status: Needs work » Needs review

I can say that my fresh install of Windows 7 with IE8 does not use any single language codes, only fully qualified locales.

My browser sends:

["Accept-Language"]=>
string(5) "en-CA"

Testing out on a VM for testing,
I get the following:

["Accept-Language"]=>
string(5) "fr-CA"

The patch will need to address IE Behaviour of sending only locales and not any pure language codes.

plach’s picture

Title: language_from_browser() doesn't parse language tags correctly, has a broken logic » locale_language_from_browser() doesn't parse language tags correctly, has a broken logic

retitling

xecstasyx’s picture

tomsm’s picture

#29 works fine on my site (running on IIS), but when I enable Performance >> Page Cache >> Normal the language is not changed correctly. When I clear the cache it works one time, then it is broken again.
When I disable page cache, it works fine.
How can resolve this?

On my local WAMP server I do not have this problem. So it has something to do with caching and Windows IIS.

plach’s picture

@tomsm: are you are using D6? If so you are probably experiencing a known bug: #339958: Cached pages returned in wrong language when browser language used.

In D7 core browser language detection is skipped when page cache is enabled.

tomsm’s picture

@plach: Yes, I am using the latest version of Drupal 6.
I have tested it again on WAMP locally. I also have the issue with WAMP.
So yes, I think that this is the reason.

Disabled in D7? Isn't page cache an important feature to improve performance?

I thought that page cache was only used for anonymous users in D6.
My default language is English, but I have many Dutch and French anonymous visitors.
Is disabling the page cache the best solution?

plach’s picture

Disabled in D7? Isn't page cache an important feature to improve performance?

Sure, but browser language is a form of user preference which does not play well with the very concept of page cache, AAMOF we cache pages only for anonymous users. See #339958-85: Cached pages returned in wrong language when browser language used for details.

My default language is English, but I have many Dutch and French anonymous visitors.
Is disabling the page cache the best solution?

AFAIK it's the only solution available if you want to use the browser preference. You might want to check #339958: Cached pages returned in wrong language when browser language used to find more answers.

Anyway, please let's continue in the other issue, we are off topic here.

tomsm’s picture

@ plach: thanks for the info!

Damien Tournoud’s picture

Now implementing a fix for Internet Explorer: when the browser send a specific language tag (fr-CA) without the corresponding generic tag (fr), we implicitly add this generic tag. We take the lowest value of the specific tags as the value of the generic tag.

toemaz’s picture

D6 backport of #53 Thx Damien!

idflood’s picture

Here is a patch file from D6 backport of #53. The backport seems to be working well but I haven't tested the 7.x-dev version. Thanks to Damien and toemaz!

GiorgosK’s picture

tested #55

tested with cache off
OPERA 10.62 - en / el ("el-GR" not recognized as "el")
Chrome 6 - en / el
IE8 en / el / el-GR
IE7 en / el / el-GR

but this behavior was same as before the patch (me thinks)

tried with cache on normal
first visit to homepage determines language shown to every other visitor

is the caching compatibility handled by another patch ??

ALSO
how do you test #53 when i18n is not working yet for D7, i18n_variables are not working in D7 ?

plach’s picture

Version: 7.x-dev » 8.x-dev
Status: Needs review » Needs work
Issue tags: +Needs backport to D7
plach’s picture

Status: Needs work » Needs review
plach’s picture

@GiorgiosK:

tried with cache on normal
first visit to homepage determines language shown to every other visitor is the caching compatibility handled by another patch ??

See #339958: Cached pages returned in wrong language when browser language used.

loganfsmyth’s picture

Here's a rerolled version against current HEAD and with -p1.

loganfsmyth’s picture

Issue tags: +D8MI

Tagging D8MI because we want to reuse this logic for #1260716: Improve language onboarding user experience.

Gábor Hojtsy’s picture

Status: Needs review » Reviewed & tested by the community

Thanks for the reroll. This is an issue I've seen referenced at many places, and reviewing the code it seems to have covered the above concerns and includes a very good set of test coverage. IMHO good to go and still needs to be committed to D7 too (which has the same code, so should be easy to commit there too).

Lars Toomre’s picture

Looking over the patch in #60, I see that the new test has been added to a DrupalWebTestCase type. It really belongs in a unit test type. If such a class does not exist yet for the locale module, I would suggest we create one here; otherwise, this test should be moved to an existing unit test class.

Gábor Hojtsy’s picture

Status: Reviewed & tested by the community » Needs work

@Lars: perfect, can you help with updating the patch for that? (Locale module does not seem to have unit tests yet)?

Lars Toomre’s picture

Sorry I cannot roll a revised patch right now. I am trying to help by at least reading submitted patches and submitting comments on how they might be improved.

Lars Toomre’s picture

Looking at the tests in #60 again, I am confused about how the weight variable is defined and being used in tests. I thought that the Drupal way to use weights was that items with heavier weights dropped in importance.

Hence, in the test in this patch, I would expect that 'en-US' would be preferred over 'en'. That the test results all pass suggest that my previous understanding was wrong. Is this use of weight correct?

Gábor Hojtsy’s picture

@Lars Toomre: yeah, good find. The weights in the $languages array are in reverse order compared to how Drupal would apply them. *However*, if you look, locale_language_from_browser() will not look at the weights themselves, it will just assume that the weights are in the "right priority order". And the $languages arguments that this function gets comes from language_provider_invoke(), which uses language_list(), which returns languages in ascending weight order. So by the time a real language list arrives at the function, its in the right weight order. The test indeed lists weight numbers in reverse order, however, at this point only the order of the array items matter for it to work well, so it still proves it works well.

I've also noticed that ordering the browser languages by qvalue is not really useful with the current logic + we put in elements (at the end) after we ordered it, so we break up the order anyway. The code later is not dependent on order of $browser_langs (as far as I see), so we can just skip ordering it.

So looks like todo here:

1. Add a DrupalUnitTestCase extending test class to locale.test and put this method there.
2. Reverse the order of weights in the $languages list in the test or remove the weights altogether to avoid misunderstandings.
3. Remove the arsort() on $browser_langs.

@loganfsmyth: can you look into these?

Gábor Hojtsy’s picture

Status: Needs work » Needs review
FileSize
7.89 KB

All right, here is a quick update with those changes and also removing the enabled property on languages, which the function does not check either. It just assumes it gets the right data (because it does :), which I've documented in the phpdoc now. Good now? (Tests pass on my machine).

Gábor Hojtsy’s picture

@Lars: any more feedback?

Lars Toomre’s picture

@Gábor - Sorry, I did not notice the revised patch before. I did not test this so only commenting from a logical read standpoint. This is looking quite good. I would suggest a couple of things here to polish the patch off:

1) Type hinting for docblocks are now encouraged and will become mandatory soon for D8. I would suggest this patch include type hinting for locale_language_from_browser() function.

2) Leave in comment explaining what $_SERVER['HTTP_ACCEPT_LANGUAGE'] is.

3) I believe you want to initialize $best_match_langcode to FALSE instead of NULL to match documentation. If $languages = array() in unlikely event were passed to routine, NULL now would be returned.

4) Change s/Unit tests for the language_from_browser() function./* Unit tests for the locale_language_from_browser() function./

5) As whole, unit test looks really good and is faster than a DrupalWebTest class. I would suggest adding tests for 'en ', ' en', 'EN', and 'english' for completeness. I am thinking that the first two should pass because we should be using a trim($_SERVER['HTTP_ACCEPT_LANGUAGE']) in the function.

6) I have not checked what various browsers pass back for $_SERVER['HTTP_ACCEPT_LANGUAGE']. Do they always used mixed case ASCII strings? Should we be doing comparisons always on lower-case strings?

7) Looking at the 'de,pl' case, I am now wondering if the desired return from the function is NULL instead of FALSE as documented.

8) Also now thinking start of function should be:

function locale_language_from_browser($languages) {
  // Briefly explain what $_SERVER['HTTP_ACCEPT_LANGUAGE'] string is like  
  if (!isset($_SERVER['HTTP_ACCEPT_LANGUAGE']) || !is_array($languages)) {

As I said above, the current patch looks good. These are just final polishing to make air-tight.

Gábor Hojtsy’s picture

All right, here is an updated patch:

1) Please link in agreed standards. If its still in discussion, I prefer not to apply them, we have enough to debate about in this patch, looks like.
2) The patch does not remove any comments on that, in fact it adds several lines already above to explain what is there. I've added even more explanation, hope it looks good now: "The Accept-Language header contains information about the language preferences configured in the user's browser / operating system."
3) Initialized it to FALSE and changed tests to expect FALSE too as documented.
4) Done.
5) Added trim() and test coverage for case differences and whitespace even though the spec does not allow whitespace like you assume, we can cover that.
6) Yes, language tags are defined as mixed-case, see http://www.w3.org/International/articles/language-tags/. We do lowercase both the values coming from the header and the values in our language array, so we can compare both lowercased.
7) See 3)
8) Drupal does not babysit broken code like that. If we'd do that, we can just as well put a type hint in the function signature, right? We don't check for the internal structure of the individual language objects either. I did change the !isset() check to empty() though which should speed this up if it was not provided or was provided empty.

What do you think of this now?

Lars Toomre’s picture

Thanks Gábor. This patch now looks great!! I think it is ready to be committed assuming all comes back green.

The coding standards change for type hinting is detailed in #711918: Documentation standard for @param and @return data types. It is pretty much resolved, but reading that issue again I see that it will be only recommended for a period before being required. Everyone is encouraged to start doing so in the meantime.

Gábor Hojtsy’s picture

Status: Needs review » Reviewed & tested by the community

All right, let's get this in then.

Damien Tournoud’s picture

Status: Reviewed & tested by the community » Needs work
+  // Some browsers (especially some versions of Internet Explorer) sometimes
+  // send a specific language tag (fr-CA) without the corresponding generic
+  // tag (fr). In that case, we assume that the lowest value of the specific
+  // tags is the value of the generic language.
   arsort($browser_langs);
+  foreach ($browser_langs as $langcode => $qvalue) {
+    $generic_tag = strtok($langcode, '-');
+    if (!isset($browser_langs[$generic_tag])) {
+      $browser_langs[$generic_tag] = $qvalue;
+    }
+  }

The arsort() was required here, because of the logic that is described in the comment (we want "the lowest value of the specific tags"). Please restore it and extend the test cases to cover this.

Gábor Hojtsy’s picture

Can you suggest a test case to cover that?

Gábor Hojtsy’s picture

Status: Needs work » Needs review
FileSize
8.43 KB

Discussed this with Damien in IRC. We figured out that the arsort() was indeed "bugos" and that it picked the least important q value for the language instead of the most important one (which is IMHO what we want). Damien suggested some excellent test cases for this, which I've included in the code and added back the arsort() with updated comments.

Updated patch attached.

Gábor Hojtsy’s picture

Damien figures out with even more digging that we should take the lowest value for generic tags to conform as close to the 1.1 HTTP spec as possible. Updated comments, test and code to comply. Comment looks like this now:

// We should take pristine values from the HTTP headers, but Internet Explorer
// from version 7 sends only specific language tags (eg. fr-CA) without the
// corresponding generic tag (fr) unless explicitly configured. In that case,
// we assume that the lowest value of the specific tags is the value of the
// generic language to be as close to the HTTP 1.1 spec as possible.
// See http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.4 and
// http://blogs.msdn.com/b/ie/archive/2006/10/17/accept-language-header-for...

Damien Tournoud’s picture

Status: Needs review » Reviewed & tested by the community

Nice to see this moving forward. This looks like a good compromise between a fully conforming implementation and the real world :)

webchick’s picture

Status: Reviewed & tested by the community » Fixed

Wow, this issue is crazy! :)

This is an API change, but it looks like it is required in order to fix the problem here.

Committed and pushed to 8.x and 7.x. Thanks!

Does this deserve an API change notice?

Gábor Hojtsy’s picture

@webchick: I don't know where this would be an API change. There are certainly no arguments or parameters changed anywhere. The patch fixes language detection to be more accurate to browser settings, that is it. No new APIs provided, no APIs made different.

webchick’s picture

Ok cool. I was confused by:

-  return FALSE;
+  return $best_match_langcode;

but it looks like I just read it too fast. Thanks!

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

Gábor Hojtsy’s picture

Issue tags: +language-base

Tagging for base language system.

Gábor Hojtsy’s picture

Issue tags: +negotiation

Tagging for language negotiation too.

JonMcL’s picture

I'm just wondering if anyone has had success applying this to D7? Maybe with some manual finessing to get it to path?

JonMcL’s picture

Never mind my question above. It appears the patch is in place with v7-12!

Sephi77’s picture

If I have the following languages defined in my browser (in the following order):

en-zw
fr
en

Then the function returns fr instead of en, which is incorrect imo. Also this case is not covered by the tests.

Gábor Hojtsy’s picture

@Sephi77: the order in itself is not enough to decide. Do you have the same q values for them too? In short, what is your full language preference header in a request?

Sephi77’s picture

This one causes "fr" to be taken (in my browser, en-zw is set to take precedence over fr):

en-zw,fr;q=0.7,en;q=0.3

But this one also causes "fr" to be taken instead of "en":

en-zw;q=0.8,fr;q=0.7,en;q=0.3

Gábor Hojtsy’s picture

We have some follow-up issues for Chinese and language precedence in general. Please see #365615: Language detection not working correctly for most Chinese readers (and add a user interface for all browser language mappings) and review! Thanks.

Damien Tournoud’s picture

Version: 8.x-dev » 6.x-dev
Status: Closed (fixed) » Active
Issue tags: +needs backport to 6.x

We should consider backporting this to D6.

I just marked #1037900: language_from_browser failes on Chrome as a duplicate.

lmeurs’s picture

In the Accept-Language specification I could not find whether the order of language tags represents the importance of one language tag over another, but it seems logical to me for language tags with the same qvalue.

When ie. Accept-Language: nl,en (occurred on stock Android browser on Samsung Galaxy Tab), both languages get a qvalue of 1 in locale_language_from_browser() (includes/locale.inc, Drupal v7.19). Then the site's languages are iterated in the order of the languages' weights and the first language with qvalue 1 will be set as $best_match_langcode. This way the order of Accept-Language language tags is ignored.

I suggest Drupal should not assign qvalues of 1 to both languages, but ie. 2 for the first language tag and 1.9 for the second. I understand that the specification states that 1 is the top value, but with a higher value of 2 it will take a while before it might interfere with language tags that do have a qvalue.

The next code:

  // The Accept-Language header contains information about the language
  // preferences configured in the user's browser / operating system.
  // RFC 2616 (section 14.4) defines the Accept-Language header as follows:
  //   Accept-Language = "Accept-Language" ":"
  //                  1#( language-range [ ";" "q" "=" qvalue ] )
  //   language-range  = ( ( 1*8ALPHA *( "-" 1*8ALPHA ) ) | "*" )
  // Samples: "hu, en-us;q=0.66, en;q=0.33", "hu,en-us;q=0.5"
  $browser_langcodes = array();
  if (preg_match_all('@(?<=[, ]|^)([a-zA-Z-]+|\*)(?:;q=([0-9.]+))?(?:$|\s*,\s*)@', trim($_SERVER['HTTP_ACCEPT_LANGUAGE']), $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
      // We can safely use strtolower() here, tags are ASCII.
      // RFC2616 mandates that the decimal part is no more than three digits,
      // so we multiply the qvalue by 1000 to avoid floating point comparisons.
      $langcode = strtolower($match[1]);
      $qvalue = isset($match[2]) ? (float) $match[2] : 1;
      $browser_langcodes[$langcode] = (int) ($qvalue * 1000);
    }
  }

could become:

  // The Accept-Language header contains information about the language
  // preferences configured in the user's browser / operating system.
  // RFC 2616 (section 14.4) defines the Accept-Language header as follows:
  //   Accept-Language = "Accept-Language" ":"
  //                  1#( language-range [ ";" "q" "=" qvalue ] )
  //   language-range  = ( ( 1*8ALPHA *( "-" 1*8ALPHA ) ) | "*" )
  // Samples: "hu, en-us;q=0.66, en;q=0.33", "hu,en-us;q=0.5"
  $browser_langcodes = array();
  $language_tags_without_qvalue = 0;
  if (preg_match_all('@(?<=[, ]|^)([a-zA-Z-]+|\*)(?:;q=([0-9.]+))?(?:$|\s*,\s*)@', trim($_SERVER['HTTP_ACCEPT_LANGUAGE']), $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
      // We can safely use strtolower() here, tags are ASCII.
      // RFC2616 mandates that the decimal part is no more than three digits,
      // so we multiply the qvalue by 1000 to avoid floating point comparisons.
      $langcode = strtolower($match[1]);
      if (isset($match[2]) && $match[2] != 1) {
        $qvalue = (float) $match[2];
      }
      else {
        $qvalue = 2 - .1 * $language_tags_without_qvalue++;
      }
      $browser_langcodes[$langcode] = (int) ($qvalue * 1000);
    }
  }

Does this make sense?

Gábor Hojtsy’s picture

@lmerus: the spec says the qvalues are used to specify different priorities; with languages in the same priority, I did not find any specs either. I assume since you equally find those languages good, we can pick based on what the site prefers, which is what we do. If others believe this is not a good approach, we need to go back all the way up to Drupal 8 and fix there and then port downward.

lmeurs’s picture

@Gábor Hojtsy: I understand your point: I hadn't thought of when languages without a qvalue weigh equally, it's up to the website to decide which language to serve.

The reason why I had not thought of it and the reason why I read into this subject is that a client's tablet had a language tag of 'nl, en', without any qvalues. His tablet's UI was completely in Dutch, but his website was being served in English. It appeared to me that the order of languages should play part, but it makes reason not to.

My guess is that his tablet should have provided a qvalue for at least the 2nd langcode in his language tag. Is this behavior seen often?

By changing the order of languages (for languages of equal qvalues, 'nl' is preferred) and setting 'en' as default language (when a visitor has 'nl' nor 'en' in his language tag, 'en' is served) at admin/config/regional/language my problem seems sorted out without my modifications in the code.

Thanks for the effort!

Garrett Albright’s picture

#77 causes some interesting but perhaps not incorrect behavior, as #89 mentions. To elaborate, given an Accept-Language header of "en-us,ja;q=0.7,en;q=0.3", which is silly but not impossible (particularly since Firefox lets you define language preference order in its preferences instead of just using the operating system's settings, and still has this option in its preferences even though it threw out the checkbox to disable JavaScript because we don't want those silly users to "break the web," scheez even Safari for iOS lets me toggle JavaScript on a whim, but I digress, and this isn't even under the Advanced tab BUT I DIGRESS)…

Firefox preferences screenshot

AND MOVE UP AND MOVE DOWN BUTTONS FOR MOVING LIST ITEMS IN THE TINY UNRESIZABLE LIST WIDGET BECAUSE DRAG AND DROP IS FOR CHUMPS.

Anyway, if your user agent does this and you visit a site that can serve pages in English ("en") and Japanese ("ja"), you'll get a page in Japanese. Now I suppose this is technically correct, since the site can't serve you a page in en-us and the browser is specifying that you would prefer to see ja more than en. Nonetheless it's sort of surprising behavior.

Should we bother doing something about it?

xjm’s picture

Status: Active » Closed (outdated)

Automatically closed because Drupal 6 is no longer supported. If the issue verifiably applies to later versions, please reopen with details and update the version.