Updated: Comment #10
Problem/Motivation
Drupal's core/lib/Drupal/Core/Locale/CountryManager.php currently uses data from the Debian project, whose data is derived from ISO 3166-1, which takes its country names from the United Nations. The UN list is a politically charged list of countries, partly because member states decide on the list and ignore non-member states' input. The goal of the UN is to create a list that is agreeable to all governments of member states. Clarity of naming is not a consideration.
For example, the ISO/UN data lists "Korea, Democratic People's Republic of" and "Korea, Republic of" instead of the more commonly known "North Korea" and "South Korea", respectively.
Basically, the interests of parties in the UN do not align with the interests of software developers (and open-source developers in particular), so the UN list doesn't make for a good source of country/territory data. See also http://en.wikipedia.org/wiki/ISO_3166-1#Naming_and_code_construction
CLDR seems to be the emerging standard for localization and internationalization, used by many organizations that make software.
From the CLDR website:
The Unicode CLDR provides key building blocks for software to support the world's languages, with the largest and most extensive standard repository of locale data available. This data is used by a wide spectrum of companies for their software internationalization and localization, adapting software to the conventions of different languages for such common software tasks.
Proposed resolution
Start using the CLDR data as the upstream source of the country list. That data can be found by:
- Going to CLDR's downloads page at http://cldr.unicode.org/index/downloads
- Clicking on the "Data" link on the latest release.
- Choosing the json.zip file. (For example, at: http://unicode.org/Public/cldr/23.1/ )
- Using the appropriate country list data. The English language country data is found in main/en/territories.json
Update/rename the update-iso-3166.sh import script (created in #1068840: core/includes/standard.inc contains inaccurate country data) so that it uses the latest CLDR data.
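As a rough illustration of the steps above, the following sketch pulls two-letter codes and names out of data shaped like main/en/territories.json. The nesting (main → en → localeDisplayNames → territories), the sample values, and the function name example_extract_countries() are assumptions made for illustration; they are not taken from the patch.

```php
<?php
// Tiny inline stand-in for main/en/territories.json. The real file is much
// larger; the nesting and values shown here are assumptions.
$json = <<<'JSON'
{
  "main": {
    "en": {
      "localeDisplayNames": {
        "territories": {
          "001": "World",
          "419": "Latin America",
          "KP": "North Korea",
          "KR": "South Korea",
          "HK": "Hong Kong SAR China",
          "HK-alt-short": "Hong Kong",
          "TW": "Taiwan"
        }
      }
    }
  }
}
JSON;

/**
 * Extracts two-letter territory codes and their names from decoded CLDR data.
 *
 * Hypothetical helper: numeric region codes ("001") and alternate-name keys
 * ("HK-alt-short") are skipped because they are not plain two-letter codes.
 */
function example_extract_countries(array $data, string $locale = 'en'): array {
  $territories = $data['main'][$locale]['localeDisplayNames']['territories'] ?? [];
  $countries = [];
  foreach ($territories as $code => $name) {
    // PHP turns numeric-looking keys such as "419" into integers.
    $code = (string) $code;
    if (preg_match('/^[A-Z]{2}$/', $code)) {
      $countries[$code] = $name;
    }
  }
  // Sort by display name, keeping the code => name association.
  asort($countries);
  return $countries;
}

$countries = example_extract_countries(json_decode($json, TRUE));
```

With the sample data, only HK, KP, KR and TW survive the filter. A real import would read the unzipped file from disk instead of an inline string.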
Remaining tasks
Patch review!
User interface changes
Instead of using Debian's data (derived from ISO 3166-1 which is derived from the United Nations list of countries), we'd use the data provided by CLDR. This would change the text shown in Drupal 8’s installer for the list of countries.
API changes
None.
Related Issues
Original issue where Debian was used as upstream source of data: #1068840: core/includes/standard.inc contains inaccurate country data
Proposal to possibly remove the entire list from core: #1933614: [META] Locale settings in Drupal make little (UX) sense
Comment | File | Size | Author |
---|---|---|---|
#39 | 1938892-39-country-list.patch | 14.64 KB | JohnAlbin |
#39 | 1938892-14-39-interdiff.txt | 1.1 KB | JohnAlbin |
#30 | 1938892-compare-bash-scripts.diff | 4.12 KB | Pancho |
#14 | 1938892-14-country-list.patch | 13.87 KB | JohnAlbin |
#8 | 1938892-7-country-list.patch | 13.78 KB | JohnAlbin |
Comments
Comment #1
JohnAlbin
From #1068840: core/includes/standard.inc contains inaccurate country data:
Oh, god, no. Debian is doing a terrible job at maintaining a country code list. They are tone deaf to changes in the country list and are militant in their conformance to the ISO list, which is a copy-and-paste of the UN's list of countries, despite the fact that that list is a politically-biased list maintained by member states of the UN with no voice for governments who have lost their seat at the UN, like Taiwan.
I spent weeks arguing with Debian regarding Taiwan's entry. Alas, their bug tracking website is so gawd awful you can't see any of the numerous comments from numerous different people who disagree with Debian's position. Otherwise, I'd provide a link.
The selection "Taiwan, Province of China" is very problematic. To some people it would be like relabeling the USA as "US, province of Britain" or Germany as "Germany, former state of USSR". There's a somewhat rambling Wikipedia article on the subject: http://en.wikipedia.org/wiki/Taiwan,_China
And if you think I'm being over-sensitive, I'd point out that the government of Taiwan sued the ISO for using "Taiwan, Province of China" in their standard. http://www.chinapost.com.tw/taiwan/2007/10/02/124980/Taiwan-sues.htm
Automated process for getting up-to-date country lists: +1000!
Using Debian as the up-stream source for this data: -1,000,000
Using CLDR looks promising. I'll update the issue summary with links to their data.
Comment #2
JohnAlbin
Updated title and issue summary
Comment #3
JohnAlbin
Comment #3.0
JohnAlbin
Expanded issue summary
Comment #4
droplet
Where's the country list? I only saw the timezone list.
Comment #5
JohnAlbin
As posted in the issue summary, the English-language region list is in main/en/territories.json. There's one for each of several different languages.
Comment #6
junedkazi
Also, I see some more inconsistencies in the list in core right now, such as names that are not complete.
'TZ' => t('Tanzania, United Republic of'),
'VE' => t('Venezuela, Bolivarian Republic of'),
Nothing says what they are a "Republic of".
Comment #7
catch
That's just the way to write 'United Republic of Tanzania' so it shows up alphabetically as Tanzania.
Comment #8
JohnAlbin
It turns out the update-iso-3166.sh script was already broken by the move from core/includes/standard.inc to core/lib/Drupal/Core/Locale/CountryManager.php. :-\
Ok. This patch updates the script name to be update-countries.sh, adds instructions on how to use the script and also includes the changeset on core/lib/Drupal/Core/Locale/CountryManager.php after running the script.
The script includes a code stub for $alt_codes if later we want to use any of CLDR's alternate territory names instead of the default territory names. I've attached CLDR's latest territories.json for your perusal so you can see what the dataset looks like.
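To make the $alt_codes idea concrete, here is a minimal sketch of how selected alternate names could override the defaults. It is not the code from the patch: the function name example_apply_alt_names(), the "-alt-" key format and the sample values are assumptions.

```php
<?php
/**
 * Builds a code => name list, letting chosen alternate names win.
 *
 * Hypothetical helper. Keys such as "HK-alt-short" are assumed to carry the
 * alternate name for the two-letter code they start with.
 */
function example_apply_alt_names(array $territories, array $alt_codes): array {
  $countries = [];
  // First pass: collect the default names.
  foreach ($territories as $key => $name) {
    if (strpos((string) $key, '-alt-') === FALSE) {
      $countries[$key] = $name;
    }
  }
  // Second pass: the requested alternates override the defaults.
  foreach ($alt_codes as $alt_key) {
    if (isset($territories[$alt_key])) {
      $countries[substr($alt_key, 0, 2)] = $territories[$alt_key];
    }
  }
  return $countries;
}

// Illustrative sample data only.
$territories = [
  'HK' => 'Hong Kong SAR China',
  'HK-alt-short' => 'Hong Kong',
  'MO' => 'Macau SAR China',
  'MO-alt-short' => 'Macau',
  'TW' => 'Taiwan',
];
// Only the alternates listed here are used; MO keeps its default name.
$alt_codes = ['HK-alt-short'];
$countries = example_apply_alt_names($territories, $alt_codes);
```

Keeping the opt-in list explicit means a new alternate name appearing upstream never changes the country list by accident.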
Comment #8.0
JohnAlbin
Updated issue summary.
Comment #9
jimyhuang
I agree that "Taiwan, Province of China" is very problematic. Far from being a "Province of China", we elected our own president in Taiwan in 2012. How can a province have a president?
As a user in Asia, I'm sure this patch fixes many country names,
such as "Korea, Democratic People's Republic of" and "Korea, Republic of". I can't imagine that people from other countries can tell which one is South Korea and which is the other.
Comment #9.0
JohnAlbin
Updated issue summary.
Comment #10
JohnAlbin
I've added Jimmy's example of North/South Korea to the issue summary. It's not just about the naming of Taiwan (the thing that got me to write the patch), but about a normal user being able to recognize the country names. Yes, in the installer, you only need to recognize your own country, but the CountryManager.php API is supposed to be used by any functionality needing a country list, so having recognizable country names is essential. The ISO/UN data does not provide that clarity.
Comment #10.0
JohnAlbin
Add Korean examples.
Comment #11
amourow
The correction of Taiwan in the #8 patch is right.
Taiwan, officially the Republic of China, has never been a "Province of China". ISO 3166-1 doesn't reflect the actual situation of Taiwan.
The problem occurs often on the Internet. Google also helped to remove "province of China" from Google Maps.
http://news.ebrandz.com/miscellaneous/2005/433-taiwans-province-tag-foll...
Thanks to @JohnAlbin for the patch.
Comment #12
jamesliu78
The #8 patch is better.
"Some of" is not friendly for the end user. And clearly it's also shorter, easier and more comfortable.
Thanks to JohnAlbin for the issue.
Comment #13
droplet
The script looks good. A few improvements can be made:
The old script falls back to online sources. For the new script, I think we need a file existence check.
It can be
$code = strtok($code, '-');
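Both suggestions can be sketched briefly. The helper name example_read_json_file() and the error messages are made up for illustration; is_file(), is_readable(), json_decode() and strtok() are standard PHP functions.

```php
<?php
/**
 * Reads and decodes a JSON file, failing loudly if it is missing.
 *
 * Hypothetical helper illustrating the suggested file existence check.
 */
function example_read_json_file(string $path): array {
  if (!is_file($path) || !is_readable($path)) {
    throw new RuntimeException("Cannot read $path; download and unzip CLDR's json.zip first.");
  }
  $data = json_decode(file_get_contents($path), TRUE);
  if (!is_array($data)) {
    throw new RuntimeException("$path does not contain valid JSON.");
  }
  return $data;
}

// strtok() returns everything before the first "-", so an alternate-name key
// collapses to its two-letter code. A key without "-" is returned unchanged.
$base_code = strtok('HK-alt-short', '-');
$plain_code = strtok('TW', '-');
```

Here $base_code is 'HK' and $plain_code is 'TW'.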
Comment #14
JohnAlbin
Unfortunately, the territories.json file is not accessible directly from the web. It's only available as part of a downloadable .zip file.
This patch incorporates the changes droplet mentioned above.
Comment #15
agrozyme
The page http://en.wikipedia.org/wiki/ISO_3166-1 only records what happened from 2007 to 2009.
In 2009, the Federal Supreme Court of Switzerland did not take up the case.
So the problem is still unresolved.
But we can ask the people who live in Taiwan: What is the name of your country?
Comment #16
tim.plunkett
Some of these could be just as touchy as the change we're fixing here...
Comment #17
droplet
Code side, RTBC!!!
Political issue, no comments.
Comment #18
JohnAlbin
Actually, no. You're comparing the ISO data added in March of this year to this patch. If you compare the Drupal 7 country list to this patch, you'll find they are much more similar. If they were more touchy, we'd have issues in the Drupal 7 queue already. Here's the diff between the patch and D7:
As you can see, there are very few changes with my patch. It was the ISO patch in March that was touchy. See that list of changes in the Issue Summary of #1068840: core/includes/standard.inc contains inaccurate country data
In addition, the new update-countries.sh script has stub code to use any alternate country names that we wish to use. Take a look at the territories.json data I attached in the comment above. So, for example, if the Drupal community members in "United Kingdom" say we should be using "U.K." instead, we can easily add that option to our script.
Comment #19
JohnAlbin
You just did. http://drupal.org/user/32095 :-D
Also, of the commenters above, amourow, jamesliu78 and jimyhuang live in Taiwan.
Yep, the courts in Switzerland threw the case out because it was a "political matter" and not a legal one, or something, something BS.
But that just highlights why Drupal using the Debian data source is so perverse. As of right now, Drupal core says "fix this problem in the upstream Debian code". Debian says "we just use the ISO standard. we won't fork the standard.". The ISO says "take it up with the UN". But, the government of Taiwan can't enter the UN and can't sue the ISO. So…
It's literally impossible to make any changes to ANY country unless we switch the data source.
Comment #20
tim.plunkett
Ah, the comparison in #18 is very informative. That makes me worry much less. Thanks @JohnAlbin
Comment #21
agrozyme
Maybe we can think of another solution.
What if Drupal 8 is released and the list still shows "Taiwan, Province of China"?
Should we have a hook function to replace the list?
If we have the hook function, we can write a "Taiwan Patch" module to fix it.
In Taiwan, when we use Drupal for Taiwan government projects, we must use the name "Taiwan" or "Republic of China".
Of course, I don't want to write the module....
Comment #22
tim.plunkett
There is already a hook_countries_alter() you can use.
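For reference, an implementation of that hook could look roughly like the sketch below. The module name "mymodule" is hypothetical, and the plain string is used only so the snippet runs outside Drupal.

```php
<?php
/**
 * Implements hook_countries_alter().
 *
 * "mymodule" is a hypothetical module name.
 */
function mymodule_countries_alter(&$countries) {
  // In a real module the label would normally be wrapped in t().
  $countries['TW'] = 'Taiwan';
}

// Standalone demonstration; inside Drupal the hook is invoked by core.
$countries = [
  'JP' => 'Japan',
  'TW' => 'Taiwan, Province of China',
];
mymodule_countries_alter($countries);
```

After the call, $countries['TW'] is 'Taiwan' and the other entries are untouched.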
Comment #23
adammalone
As a commenter who speaks from, and lives outside of, any country altered by this patch, this issue is perhaps less emotive for me.
That being said, I'm of the opinion that the country list should expose options that citizens of, and those residing in the countries would recognise as the name of the country.
Admittedly a huge political issue although the diff in #18 does show what I would consider more widely used names for said countries.
Comment #24
jamesliu78
I don't think a hook function is a good way to fix it.
That would just lead to lots of modules, each fixing the list for its own country.
We already have a patch here, so why not just make it better?
Comment #25
tim.plunkett
Oh I agree we should proceed with the patch, I'm not suggesting the alter hook is a solution. Just that it exists to address the suggestion in #21.
Comment #26
droplet
Not a win-win game. Who knows what Kim Jong-un wanted?
RTBC to me.
Follow Up task:
To suggest additions or corrections, please file a ticket.
Comment #27
agrozyme
The hook function solution is the last choice.
In fact, the Taiwan government wants to join the UN, but we are not a member of the UN now. (see this)
That is because the UN does not recognize the Republic of China, which governs Taiwan, and considers the territory to be part of the People's Republic of China.
Maybe we can't understand what Kim Jong-un wanted, but we are not Kim Jong-un.
We can discuss this issue and get a better result.
I was born in Taiwan in 1974 and I live in Taiwan today.
I know this is very problematic, but I must say that Taiwan or Republic of China is the name of our country.
Comment #28
tky
I am not a coder but I know one thing clearly about the internet: code is law.
In any way, you are unable to separate law/political issue from code one by this or that easy excuse. When you are coding, you are making laws or extending the area of law in the real world into virtual space.
I agree with the comments made by JohnAlbin, jimyhuang, amourow, jamesliu78 and agrozyme. No matter what names Taiwanese would like to call themselves, Taiwan or ROC., the country is simply not a province of China, legally or politically.
If you know there is something wrong in the code, correct it and make things right.
Comment #29
Pancho
I absolutely agree with this being RTBC: both from a political perspective and usability-wise, CLDR data is much better.
And as a bonus it includes so much more locale data that we might want to leverage at a later point.
It would be even nicer if built into Update Manager, but that's beyond this issue's scope.
So another clear +1 from me.
Comment #30
Pancho
As the bash script is being changed and renamed, I'm just providing a diff -u of the two files for easier review.
[edit:] Didn't know files with a .diff extension are sent to testbot, so please ignore the meaningless test results. #14 remains RTBC.
Comment #32
Damien Tournoud
No concerns on my side either. We are not maintaining the list of countries ourselves, which is all that matters from my perspective.
That said, the documentation of CountryManager::getStandardList() clearly references ISO 3166-1, which we are not following anymore. This needs to be fixed.
Comment #33
Damien Tournoud
I was quoted above, so I feel like I should answer.
While I can agree that the ISO list is not a silver bullet, you have your logic totally backward here. Debian is doing a great job at maintaining an "accurate, ISO-compliant lists of language, territory, currency, script codes (*and* their translations in many languages)". The keyword here is ISO-compliant.
Comment #34
Damien Tournoud
Comment #35
Damien Tournoud
I also removed the "regression" tag, because Drupal always pretended to follow ISO-3166-1, but never actually did. The list of Drupal 8 is an accurate ISO-3166-1 list, so the only way of seeing this is that it is an improvement, not a regression.
That said, I'm in favor of switching to CLDR.
Comment #36
droplet
#14 is the correct patch.
@Pancho,
end the file name with do-not-test.patch or interdiff.patch next time :)
Comment #37
Damien Tournoud
#32 points out some work that still needs to happen.
Comment #38
JohnAlbin
@Damien "The keyword here is ISO-compliant." Fair enough! Your original statement about Debian is accurate then. Sorry about that!
I'll re-roll the patch to update the code comment about ISO.
Comment #39
JohnAlbin
Actually, "ISO 3166-1" is mentioned twice in the code comments. Nice catch, Damien! The latest patch fixes those.
Comment #40
droplet
Thanks @Damien.
OK, re-tested. Two new docs changes, plus the alt-short names and the excluded country codes, all work.
Comment #41
Dries
Committed to 8.x. Thanks!
Comment #42
JohnAlbin
Thanks, Dries! :-)
Comment #43
Alan D.
Um, did anyone check the data?
Diego Garcia is part of the British Indian Ocean Territory
Ceuta and Melilla - Spain's two autonomous cities, Ceuta and Melilla
Canary Islands - one of Spain's 17 autonomous communities
Saint Martin is an island divided roughly 60/40 between France and the Kingdom of the Netherlands. Isn't this almost like stating that North America is better than Canada, US, Mexico, just on a different scale?
Outlying Oceania.
+Falkland Islands
-Falkland Islands (Malvinas)
Politically dropping iso for cldr, the question should be: who do we want to piss off? Britain or Argentina? Same for Taiwan, Palestine and probably many other countries world wide...
Fairly ugly, would one want to stick with the official name
- 'MM' => t('Myanmar'),
+ 'MM' => t('Myanmar [Burma]'),
Not going to re-open, but really?
Comment #44
Pancho
I'm not going to reopen this for now, but want to respond to Alan's comment #43, starting with the two disputed names:
1. Myanmar / Burma:
In a majority of languages this country would be simply called something similar to "Myanmar", see CLDR Territories, and in Burmese this would simply be "ကမ္ဘာ" which probably is just "Myanmar".
However, in English, the old name "Burma" is at least as commonly used as "Myanmar", for example leading to the Wikipedia article bearing the lemma "Burma", see also:
(http://en.wikipedia.org/wiki/Burma#Etymology)
So while politically we might find it slightly ugly, adding Burma as an alternative name, is just depicting reality in a fairly politically correct way, namely adding Burma only as an alternative name for Myanmar.
Note also that with Sanmyanmar, a Myanmar-based IT company is an associate Unicode member, and still I can't find a ticket against CLDR asking for removal of the "Burma" alternative name. So the policy seems to be acceptable for all.
2. Falkland/Malvinas:
This is a very complicated issue. By customary English-language usage there is probably no reason to add "Malvinas" as an alternative name, while in Spanish, Portuguese and French this is translated to "Islas Malvinas", "Ilhas Malvinas" and "Îles Malouines", because that is customary.
While it might be more sensible to always include the other name in brackets, I think that's acceptable for us, and if not, we can file a ticket with CLDR to change the rule on http://cldr.unicode.org/translation/country-names
The other disputed territories are actually a question of inclusion rules. I agree that we didn't sufficiently address this aspect. This doesn't seem to be a deficiency of CLDR but of the way we use it. So certainly no reason for a rollback. However, we might have to file a followup.
I will cover this aspect separately, as I need to do some more research on how to get it right, and then will probably file a followup.
Comment #45
Alan D.
Sorry if the above thread was rushed, just a quickie before heading out, but diverging is going to bring up a real potpourri of issues.
1) What is a country?
The United Kingdom is a realm consisting of four countries: England, Northern Ireland, Scotland, and Wales.
The People's Republic of China (PRC) contains five autonomous regions, Guangxi, Inner Mongolia, Ningxia, Xinjiang and Tibet plus the two special administrative regions of Hong Kong and Macau.
2) Who defines a country?
Palestine / Israel
The State of Palestine has received recognition from only 132 states. Israel is not recognised as a state by 32 UN members and by the SADR
Sahrawi Arab Democratic Republic has received nearly no recognition outside of Africa.
3) Whose definition of the name should you use?
Taiwan was the example used here,
Macedonia was one of the more contested ones: http://en.wikipedia.org/wiki/Macedonia_naming_dispute
So the main question is: do we want to diverge from an internationally recognized standard? I'm guessing that everyone has gone for a fuzzy definition, using the ISO standards with the CLDR naming.
Per the ISO standard, the alpha-2 codes AA, QM to QZ, XA to XZ, and ZZ are user-defined and should be excluded from the import. Maybe this is where some of the territories came from?
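A filter for those user-defined ranges might be sketched as follows. The function name example_is_user_assigned() and the sample data are hypothetical; the ranges are the ones listed above.

```php
<?php
/**
 * Checks whether an alpha-2 code falls in the user-defined ranges.
 *
 * Hypothetical helper covering AA, QM to QZ, XA to XZ, and ZZ.
 */
function example_is_user_assigned(string $code): bool {
  return (bool) preg_match('/^(AA|Q[M-Z]|X[A-Z]|ZZ)$/', $code);
}

// Illustrative sample data only.
$territories = [
  'QO' => 'Outlying Oceania',
  'TW' => 'Taiwan',
  'XK' => 'Kosovo',
  'ZZ' => 'Unknown Region',
];

// Keep only the entries whose key is not a user-defined code.
$filtered = array_filter(
  $territories,
  fn($code) => !example_is_user_assigned((string) $code),
  ARRAY_FILTER_USE_KEY
);
```

Only TW remains in $filtered. Note the trade-off: a strict filter like this also drops XK, the user-assigned code that CLDR uses for Kosovo, so an import would probably need an explicit allow list next to it.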
Comment #46
Pancho
Good points.
Created #2036219: [policy] Inclusion criteria for CLDR territories in CountryManager::getStandardList() as a critical followup issue, so it would be great to continue discussion over there.
Comment #47
Pancho
Note that both trunk data and releases are indeed available in an online repo:
Comment #48
Pancho
Retroactively tagging.
Comment #49.0
(not verified)
Updated issue summary.
Comment #50
sun
Awesome to see this! Evolution++
My main goal for the previous issue was to automate the update of country codes as much as possible, disregarding the actual data being imported (whereas there was no clear data source before). In a sense, separating the concern of updating from the concern of which data source to use.
It's great to see that we've further improved the data source now. (And thanks for updating the script! :))
The more we do this in the future, the better our toolset will become. The same mechanism could e.g. be applied to the list of language codes. And who knows, perhaps we even want to move these declarations into XML, JSON, or YAML files at some point. :)
Comment #51
matsbla
It looks like 'AN' (Netherlands Antilles) is still inside core:
https://github.com/drupal/drupal/blob/8.0.x/core/lib/Drupal/Core/Locale/...
Should it be there? It is not a code in either CLDR or ISO (but looks like it was an ISO code before)
- 'AN' => t('Netherlands Antilles'),
'QO' (Outlying Oceania) is also in core:
https://github.com/drupal/drupal/blob/8.0.x/core/lib/Drupal/Core/Locale/...
Looks like it is used as a "subcontinent" together with 'Eastern Africa', 'Caribbean', 'Central Asia', etc:
http://www.unicode.org/cldr/charts/latest/supplemental/territory_contain...
It is listed as a "private use subtag" in 'Core Specification':
http://cldr.unicode.org/core-spec
Should this be part of the country list?
- 'QO' => t('Outlying Oceania'),
Another thing, in latest CLDR square brackets indicating alternative names have been changed to parentheses:
This change is mentioned in the release note for CLDR v24 under "Formatting":
http://cldr.unicode.org/index/downloads/cldr-24
I don't know about you, but I think it looks kind of nicer
In release note v27 it is mentioned "effort was made to clean up country names":
http://cldr.unicode.org/index/downloads/cldr-27
Maybe it could be a good idea to do the mentioned "automated process for getting up-to-date country lists" now?
The abbreviation 'St.' has been extensively implemented:
I'm not sure why but 'Sint Maarten' remains unchanged.
'and' is replaced several places with '&'
Note that this change also applies to St. Kitts & Nevis, St. Vincent & Grenadines and St. Pierre & Miquelon, which are mentioned above.
For myself, I would also simplify these names, which I find clumsy and a little bit strenuous
I guess these are more sensitive issues, but I think that is much more aligned with the names people use in everyday informal language, and for almost all other names we now ignore the official version and focus on using the simple and short version, so why make exceptions for these 3 names?
Comment #52
bojanz at Centarro
Yeah, for commerceguys/intl we have the following country filtering:
CLDR changed quite a lot from v24 to v28, so it's a good time to reimport the list.
All of this deserves a fresh issue.
Comment #53
matsbla
Okay, I opened a new issue here:
#2595595: Update the country list to match latest CLDR