It's no problem if your site language is English. But I actually have a problem, because my site language is not English. So I see three solutions:

  1. Convert: mother tongue names -> ASCII names (“kind of” mother tongue). Sometimes easy, sometimes impossible (depending on the language), but either way, it's awkward, monster-like :-).
  2. Provide translation: native names -> English names. This is painful, costly, sometimes impossible. Completely unnecessary if your site has only one language.
  3. Allow UTF-8 names of bundles (content types) and fields.

Comments

yched’s picture

Status: Active » Closed (won't fix)

There cannot be UTF-8 in bundle names: we generate DB table names from them, and possibly variables and function names.
A bundle is a machine name. Untranslatable, a-z and _. Just like current node type names.
If that's not in the API docs, we should make that clear.

Unless I'm missing something, this is a won't fix.

mki’s picture

Title: UTF-8 in bundle (content type) / field names » non-Latin letters in bundle (content type) / field names
Status: Closed (won't fix) » Postponed (maintainer needs more info)

There can not be UTF8 in bundle names, we generate db tablenames from them, possibly variables, function names

According to MySQL 4.1 manual:

Beginning with MySQL 4.1, identifiers are stored using Unicode (UTF-8). This applies to identifiers in table definitions that are stored in .frm files and to identifiers stored in the grant tables in the mysql database. The sizes of the identifier string columns in the grant tables are measured in characters. You can use multi-byte characters without reducing the number of characters allowed for values stored in these columns, something not true prior to MySQL 4.1. The allowable Unicode characters are those in the Basic Multilingual Plane (BMP). Supplementary characters are not allowed.

According to PostgreSQL 7.3 manual:

SQL identifiers and key words must begin with a letter (a-z, but also letters with diacritical marks and non-Latin letters) or an underscore (_). Subsequent characters in an identifier or key word can be letters, digits (0-9), or underscores, although the SQL standard will not define a key word that contains digits or starts or ends with an underscore.

Bundle (content type) and field names are something considered to be data, not algorithms, classes, functions, variables, or any hard-coded identifiers. Please distinguish these two areas. As data these names ARE translatable.

A bundle is a machine name.

Yes, these names are symbolic names. But there are plenty of symbolic names, most notably URIs, domain names, directory/file names. And they ARE translatable.

So I would be happy with non-Latin bundle/field names. Please consider this seriously before marking it "won't fix".

bjaspan’s picture

Bundle and field names are NEVER shown to end users as data. We are working seriously on making field data translatable (#367595: Translatable fields) but I cannot see any reason to expend serious effort to make field names translatable.

That said, I'm not sure why we actually care that field and bundle names only contain [a-z0-9_]. I thought perhaps that PHP would not accept UTF-8 identifiers but apparently it does:

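$x = 'greek_κόσμε';
print "$x\n";
// Create the object explicitly; assigning a property on an undefined variable fails on newer PHP versions.
$obj = new stdClass();
// Curly-brace syntax allows a property name that is not a plain ASCII label.
$obj->{$x} = 'foo';
print $obj->{$x}."\n";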

So, yched: What would happen if we simply removed the preg check in field_create_field() that requires fields to be alphanumeric? If a particular database can't support it, then "don't use utf-8 field names with that database."
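For illustration only, the kind of check being discussed might look roughly like the sketch below. The function names are hypothetical and this is not the actual code in field_create_field(). The first pattern is the a-z, 0-9 and _ restriction described above; the second is one possible relaxation that accepts any Unicode letter, combining mark or digit.

```php
<?php
// Hypothetical helpers for illustration; not part of Field API.

// Strict check: lowercase ASCII letters, digits and underscores only.
function example_is_ascii_machine_name($name) {
  return preg_match('/^[a-z0-9_]+$/', $name) === 1;
}

// Relaxed check: any Unicode letter, combining mark or digit, plus underscore.
// The /u modifier makes PCRE treat both pattern and subject as UTF-8.
function example_is_unicode_machine_name($name) {
  return preg_match('/^[\p{L}\p{M}\p{N}_]+$/u', $name) === 1;
}

var_dump(example_is_ascii_machine_name('field_city'));
var_dump(example_is_ascii_machine_name('pole_łódź'));
var_dump(example_is_unicode_machine_name('pole_łódź'));
var_dump(example_is_unicode_machine_name('my field'));
```

Even the relaxed pattern still rejects spaces and punctuation, so it would only widen the alphabet, not drop validation altogether.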

KarenS’s picture

We need to check what characters work as:

1) Function names
2) Table names
3) Column names
4) 'pseudo' field and table names in Views
5) Others, I feel there might be more

Plus we need to at least keep spaces out of the names (a common problem).

Also, do we care about case? Unix-like systems care about it but Windows ignores it, so 'MyField' will be the same as 'myfield' on Windows but two different fields on Unix-like systems.

We still have this check for content type names in the current code (or did last time I checked), so does the same argument apply there?

KarenS’s picture

Edit, you said field and bundle names, so forget the last comment :)

KarenS’s picture

One more thing, some databases care about case. I think I remember that DB2 required all column names to be uppercased, so you couldn't have both MyField and myfield, both would have to become MYFIELD, so we need to think about ways that this could become a problem in various databases.
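A minimal sketch of how such collisions could be detected ahead of time, assuming the mbstring extension is available. The helper name is hypothetical, and lowercasing is only a stand-in for whatever case folding a particular database or file system really applies.

```php
<?php
// Hypothetical helper: group names that become identical after lowercasing.
// Assumes the mbstring extension is installed.
function example_case_collisions(array $names) {
  $groups = array();
  foreach ($names as $name) {
    // Use the lowercased name as the key and collect the originals under it.
    $groups[mb_strtolower($name, 'UTF-8')][] = $name;
  }
  // Keep only lowercased names claimed by more than one original name.
  return array_filter($groups, function ($group) {
    return count($group) > 1;
  });
}

print_r(example_case_collisions(array('MyField', 'myfield', 'other_field')));
```

Here 'MyField' and 'myfield' end up in the same group, while 'other_field' is not reported.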

mki’s picture

Bundle and field names are NEVER shown to end users as data.

There is one place in Drupal 6 and Drupal 7.x-dev where users can see the content type name: the URL http://example.com/node/add/content-type-name. This is very important, please take a look at URL as UI (written in 1999, but still very true).

We are working seriously on making field data translatable [...] but I cannot see any reason to expend serious effort to make field names.

Note #91744: Component based translation for paths. But this is not the point of this issue.

Let's explain this issue by way of analogy. In String translation: why using t() for user specified text is evil? (2006-11-10) Gábor Hojtsy wrote:

The concept behind t() is that you write your module/theme/.info file source in English, and apply t() to literal English strings. The primary language of Drupal is English. If you add a menu item, you need to add it in English, even if you don't have a publicly visible English interface (because you only provide French and Dutch interface for example). Even if you add a menu item on the French admin interface, you need to provide it in English, so that it can be translated to other supported languages. [...] It is popular to abuse this system, if you don't have a public English interface.

In #141461: Object translation option #1: locale system, optimization strategies (2007-05-04) Gábor Hojtsy also wrote:

Drupal has a simple mechanism for translating source strings: t(). It works nicely on standalone strings, it works nicely with gettext based tools, translation import and export is possible. t() has the following important characteristics:

  • It assumes you translate from English to some other language.
  • It assumes that all text you pass at it are standalone strings of characters (no relation between chars).
  • It assumes that a generic string editing for widget is fine for translating these.

Unfortunately none of these stand for user defined data. Take aggregator fields and categories as examples (but keep in mind that this can be site settings, content type details, user profile fields, etc):

  • You define your aggregator categories and feeds in your site default language (this is what we are about to assume). This can be anything, oftentimes English.
  • You need translations of different properties of your objects at once, ie. when you get an aggregator category displayed, you need both the title and description translated. Properties of your objects have close relations.
  • You might need specific widgets to translate user defined data. Best would be to have an aggregator category title and description translated on the same screen, so the translator sees the relation.

So we need to translate against some language, which is not English, we need to be able to load related translations at once, and we might need to be able to provide a UI to translate related strings together.

To sum up, I believe that we shouldn't assume that developers will use only English or the Latin alphabet.

If your website contains simple content types, like Story, Article, and the like, that's OK. But if your website contains content types and fields that are specific to some area of knowledge (for example: medicine, economics, geography), then you will have real trouble translating these names into English or "translating" them into the Latin alphabet.

When I think about Drupal as a data repository (RDF, SPARQL, linked data, different storage engines etc.) for different areas of knowledge, not as a simple sort of website like a blog, this becomes a real problem.

I'm fine with a naming convention for bundle/field names: word_word_word (PHP, database) and word-word-word (URL). That's OK as long as I can name things in my own language, without putting effort into providing an English translation that I will never use or may not even be able to provide.
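As a rough sketch, the two spellings in that convention differ only in the separator, so converting between them is a simple replacement. The helper names below are hypothetical, and the round trip assumes names themselves never contain a hyphen.

```php
<?php
// Hypothetical helpers illustrating the two spellings of one name.

// word_word_word (PHP, database) -> word-word-word (URL).
function example_name_to_url($machine_name) {
  return str_replace('_', '-', $machine_name);
}

// word-word-word (URL) -> word_word_word (PHP, database).
function example_url_to_name($url_part) {
  return str_replace('-', '_', $url_part);
}

print example_name_to_url('typ_zawartości') . "\n";
print example_url_to_name('typ-zawartości') . "\n";
```

str_replace() works on bytes, and the bytes of '_' and '-' never occur inside a multi-byte UTF-8 sequence, so non-Latin letters pass through unchanged.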

mki’s picture

The PHP manual makes it clear:

Function names follow the same rules as other labels in PHP. A valid function name starts with a letter or underscore, followed by any number of letters, numbers, or underscores. As a regular expression, it would be expressed thus: [a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*.

(There is also the PHP issue "UNICODE support to name variables and other PHP labels", which has the status "won't fix".)

So there is no official place for Unicode identifiers in PHP. Even if such code works, it is going to be a dirty hack that may stop working someday or somewhere. I'm not happy with such a solution on a production website.
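The pattern quoted from the manual can be tried directly, which may help explain why the earlier test code ran at all. This is only a sketch: the anchors ^ and $ are added here so the whole string is tested, and it says nothing about how PHP handles encodings in general. Without the /u modifier the pattern is matched byte by byte, and every byte of a multi-byte UTF-8 character is 0x80 or above, so it falls inside the \x7f-\xff range.

```php
<?php
// The label pattern quoted from the PHP manual, anchored to the whole string.
// No /u modifier: PCRE compares single bytes, not Unicode characters.
$label = '/^[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*$/';

$name = 'greek_κόσμε';

// True when every byte of $name is allowed by the pattern.
var_dump(preg_match($label, $name) === 1);

// Byte length, which is larger than the number of visible characters.
var_dump(strlen($name));

// A leading digit and a space are both rejected.
var_dump(preg_match($label, '9lives') === 1);
var_dump(preg_match($label, 'my field') === 1);
```

So the name is accepted as a sequence of high bytes rather than as Greek letters, which fits the concern that this is accidental rather than designed Unicode support.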

I wonder if some kind drupaller could tell me why and where Drupal uses a bundle/field name as a variable, function or class name, and why it is not treated as plain configuration that appears only in the database. Maybe this issue should be marked "by design", but first I'm trying to understand what's going on. (See also these comments).

bjaspan’s picture

Okay, so UTF-8 for object properties is out, so UTF-8 for field names is out.

Bundle names show up in table names but not (as far as I can remember at the moment) in object property names. It appears that at the moment we are not checking anything about bundle names, so I doubt we would reject a UTF-8 bundle name; it would just work. But that's not a promise.

The one valid use case you have identified so far for actually caring about this issue is field and bundle names in URLs. The URLs are not actually part of Field API, they are part of the CCK UI (which may be merged into core at some point, but isn't yet). There is no particular reason we cannot have a "UI name" for fields, like Label is for humans, which the UI puts into URLs instead of the field or bundle name.
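A minimal sketch of that "UI name" idea, assuming a simple lookup table from machine name to the name used in URLs. The table, the sample names and both functions are hypothetical; nothing like this exists in Field API or the CCK UI as far as this thread states.

```php
<?php
// Hypothetical lookup table: machine name => name shown in URLs.
$example_ui_names = array(
  'field_city' => 'miasto',
  'field_voivodeship' => 'województwo',
);

// Build a URL path segment from the UI name, falling back to the machine name.
function example_url_segment($machine_name, array $ui_names) {
  $name = isset($ui_names[$machine_name]) ? $ui_names[$machine_name] : $machine_name;
  // Percent-encode so non-ASCII UI names are safe inside a URL.
  return rawurlencode($name);
}

// Reverse lookup: URL path segment => machine name.
function example_machine_name($segment, array $ui_names) {
  $name = rawurldecode($segment);
  $found = array_search($name, $ui_names, true);
  // Unknown segments are assumed to already be machine names.
  return $found === false ? $name : $found;
}

print example_url_segment('field_voivodeship', $example_ui_names) . "\n";
```

With this split the machine name can stay within a-z, 0-9 and _ for tables and properties, while the URL carries whatever the site builder chose.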

So, I suggest identifying any other locations where field or bundle names are exposed to end users, and retargeting this issue to address them specifically instead of suggesting that field or bundle names themselves be allowed to contain UTF-8.

sun.core’s picture

Status: Postponed (maintainer needs more info) » Closed (won't fix)

I agree with Barry.

mki’s picture

Status: Closed (won't fix) » Postponed

PHP 6 allows Unicode characters in PHP identifiers. More information: http://schlueters.de/blog/archives/116-Unicode-identifiers.html

bjaspan’s picture

"Postponed until we require PHP 6" is sorta like "postponed until the Second Coming." But whatever.

mki’s picture

Status: Postponed » Closed (won't fix)