language_from_browser() doesn't parse language tags correctly, has a broken logic
iliphil - February 14, 2008 - 15:15
| Project: | Drupal |
| Version: | 7.x-dev |
| Component: | language system |
| Category: | bug report |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | needs work |
Description
MSIE sends accept language headers that look like this "en-nz,en-us;q=0.7,sv-se;q=0.3" but language_from_browser() wont match that to either 'en' or 'sv' which is what is in the language table
This drop-in replacement code for language_from_browser() works better. (I may try to send a patch later).
<?php
function language_from_browser() {
// Specified by the user via the browser's Accept Language setting
// Samples: "hu, en-us;q=0.66, en;q=0.33", "hu,en-us;q=0.5"
// MSIE 7: "en-nz,en-us;q=0.7,sv-se;q=0.3"
$browser_langs = array();
if (isset($_SERVER['HTTP_ACCEPT_LANGUAGE'])) {
preg_match_all('"(((\S\S)-?(\S\S)?)(;q=([0-9.]+))?)\s*(,\s*|$)"',strtolower($_SERVER['HTTP_ACCEPT_LANGUAGE']),$browser_accept);
for ($i = 0; $i < count($browser_accept); $i++) {
// The language part is either a code or a code with a quality.
// We cannot do anything with a * code, so it is skipped.
// If the quality is missing, it is assumed to be 1 according to the RFC.
if(!empty($browser_accept[2][$i])) $browser_langs[$browser_accept[2][$i]] = ($browser_accept[6][$i]? (float) $browser_accept[6][$i] : 1.0);
if(!empty($browser_accept[3][$i]) && empty($browser_langs[$browser_accept[3][$i]])) $browser_langs[$browser_accept[3][$i]] = ($browser_accept[6][$i]? (float) $browser_accept[6][$i]-0.01 : 0.99);
}
}
// Order the codes by quality
arsort($browser_langs);
// Try to find the first preferred language we have
$languages = language_list('enabled');
foreach ($browser_langs as $langcode => $q) {
if (isset($languages['1'][$langcode])) {
return $languages['1'][$langcode];
}
}
}
?>
#1
Well, a patch would indeed be great.
#2
here is the patch
#3
- There are some coding style issues in the patch (no spaces after commas, bad indentation, no brackets for the ifs)
- You match on \S\S-\S\S which means two char language and variant codes. Are we sure that we can only see these? (I am not)
- Is it guaranteed that the language codes are in descending priority order? The current code expects this (while the original code expected unique list of language codes).
#4
Apologies for the style
The match is on (\S\S-?(\S\S)?) so the second \S\S is not required for a match. en, en-us, en-gb will all match.
It doesn't require that languages be in descending priority order so it doesn't matter if they aren't.
I can only say try it out.
arsort($browser_langs);
print_r( $browser_langs);
#5
I filed a duplicated entry for this bug.
Any changes of it getting fixed in a next release?
#6
Well, both the parsing and the logic of
language_from_browser()is broken. According to RFC 2616 section 14.4:So we have to go thru all enabled languages, assign them a qweight using the matching logic described above and select the language that fit the most.
Here is a full test case for language_from_browser() that outline the issue in the current implementation, along with a new implementation that fully pass the test case.
#7
#8
Can I safely use this patch on my D6 installation or is it going to cause conflicts.
Right now I started using the previously provided patch which seems to work pretty OK for what I need it.
#9
Here is the same patch for D6.
#10
Oh, and the comparison between language tags should also be case insensitive (RFC 2616 section 3.10). Here is an updated patch.
#11
Thank you.
You have just been promoted to my monthly hero!
#12
The last submitted patch failed testing.
#13
See: #335122: Test clean HEAD after every commit and http://pastebin.ca/1258476
#14
The last submitted patch failed testing.
#15
Reroll.
#16
The last submitted patch failed testing.
#17
Hi Damien,
Great work on this issue. I just realized that this is still broken in D6 when I played with the language settings of my browser and always got the wrong language. While your code is a great step in the right direction, I think it still doesn't work correctly.
I set my browser (Firefox 3.5) to accept languages in the following order: de-de, en, de, en-us
This results in this header: "de-de,en;q=0.8,de;q=0.5,en-us;q=0.3"
My site has the languages 'de' and 'en' enabled. According to the setting, my site should be displayed in German (de). What happens is that your code iterates the languages available on the site ('en' and 'de'). When it finds a match for the single language code, it skips the double language code, which has a higher qvalue but gets ignored. So in this example the code incorrectly ignores de-de and finds that en has a higher value than de. Hence the incorrect result is 'en', while it should be 'de'.
#18
sorry, didn't mean to change the title of this issue
#19
Here is a slightly modified patch that works correctly for double language codes.
#20
@Bodo: if you think you found a bug in my patch, please extend the test case.
But in that case, I believe you are wrong: you are saying that your site offers 'en' and 'de'. Not 'en-US' and 'de-DE', the generic 'en' and 'de'. In that case, the correct match for...
Accept-Language: de-de,en;q=0.8,de;q=0.5,en-us;q=0.3... is really 'en', because your site doesn't offer the 'de-DE' specific language.
#21
Hi Damien,
I haven't had time yet to look at the D7 testing framework, but will do so and then update the test.
About the issue: As far as I understand, drupal sites don't normally offer compound locales such as 'en-US' or 'de-DE', but always just simple locales such as 'en' and 'de'. Is that correct? I guess it is possible to set up compound locales, but all the translations that ship with drupal are only simple locales.
If so, the compound locales come from the browser settings of the user.
I believe the intention of the code was that when the users browser requests a compound locale such as 'en-US', and the drupal site supports the main locale for this (in this case 'en'), there should be a match. Is that correct? I think the original code didn't work like that, but I agree that the tests should be used to define the behaviour.
#22
@Bodo: the patch above is supposed to support serving compound locales.
Anyway, if you only have a main locale 'en' and 'de', the correct match a strange language configuration like:
Accept-Language: de-de,en;q=0.8,de;q=0.5,en-us;q=0.3is definitely 'en', because your site doesn't provide the specific 'de-de' language.
#23
@Damien:
Fair enough about serving 'en' in the previous case.
What should happen if the site only has 'en' and 'de', and the client browser is set to accept 'de_de' and 'en_us' in that order, and nothing else? (Accept-Language: de-de,en-us;q=0.5)
My user expectation would be that the site will be served in 'de', although there is no direct match of locales. Is that what the spec would imply?
I guess the answer is NO, but this is unintuitive for a non-technical user. IE and FF also offer no help whatsoever for this setting beyond 'select your preferred languages here'. FF makes this worse by offering a multitude of compound locales that might get accidentally selected instead of the main locale.
#24
RFC2616 section 14.4 (language-range is the thing returned by the browser, language-tag is the server provided languages):
ie. 'en' sent by the browser matches 'en-US' provided by the server, but 'en-US' sent by the browser doesn't match 'en' provided by the server.
In the case you described, no language would match, so the server will its default language.
#25
If you are not convinced already, RFC2616 has the following node at the end of the same section:
(emphasis mine)
#26
Hi Damien,
I'm already convinced, and your code seems to work according to the spec. It just took me a while to understand that selecting any of the compound locales without the main locale did not do what I expected (as the RFC helpfully points out).
As I said, all browsers I checked (FF, IE, Opera) offer no help at all for this, so I might not be the only one who got this wrong. The browser should probably add the main locale (or at least suggest to do so) if it is not already present, but that is not a Drupal issue.
In any case, thanks for your explanation. I guess your original code is correct after all.
#27
I just realized how I got started with this whole confusion. My installation of IE7 doesn't offer two letter locales at all. Instead, only compound locales are available for selection, which is bound to produce unexpected results as discussed above. I'll have to check if this is fixed in IE8, and if this is normal for IE7 or if my install is somehow corrupted.
#28
I'm running into this problem. Is there any new development ?
All browsers work as expected, but IE7, IE8 and Safari.
Even if IE or Safari has fr-CA as prefered language, it displays the front-page in the default language (English).
Does this patch apply to D6 (14) ?