Recently, the default encoding of Drupal has been changed to utf-8.
In utf-8 encoding, every character is encoded as a sequence of between 1 to 3 bytes.
This means that the while a string may be n characters long, it is 3*n bytes long for storage purposes.
So to store a 64 characters long string (Titles of user defined blocks for example), the database field should be 64*3=192 bytes long. Otherwise, long strings will be cut short in storage. This is done silently by mysql.
The solution is to triple the size of database fields which store user supplied strings.

Comments

killes@www.drop.org’s picture

This makes me want to simply keep my iso-8859-1 database
when upgrading.
Would be very much appreciated.

dries’s picture

UTF-8 uses 1 to 4 bytes, not 1 to 3 bytes. Also, characters whose ASCII value is less than 128 are stored using one byte. For most languages (such as those that use ISO-8859-1) the storage requirement is a lot less than 4 * n.

That aside I don't think this needs fixing. We are already specifying the size of each table field in _characters_, not bytes. You probably want to configure your MySQL server to store data in UTF-8 when you want to store non-ISO-8859-1 data.

steph’s picture

I am reading the mysql documentation, and it seems that mysql has support for UTF8 only from 4.1.

The requirements for Drupal is mysql 3.23.

Would it be possible to review the requirements and/or post a doc, link, or something about how to set the database to UTF8

killes@www.drop.org’s picture

Thanks for clarifying this.
However, I was not only concerned about storage requirements. I simply
do not like UTF-8. It is not needed for many languages and thus I would
like not to be bothered by it.

calanya’s picture

Since there is no released version of MySQL which supports UTF-8 (4.1 is in alpha), it's pointless to use UTF-8 as the standard encoding. You can't safely store any texts in the database. It makes UTF-8 dangerous. Please revert the standard encoding of drupal back to ISO-8859-1, and the make the encoding localizable (through locale()).

As to the 3*n vs. 4*n - There are no Unicode characters defined yet which would take 4 bytes. For all currently defined languages which Unicode supports, UTF-8 will use up to 3 bytes per character.

al’s picture

This is a seriously complicated issue, and one which I've been looking at for a while. If you use MS-SQL or PostgreSQL, this problem goes away, because they both have supported UTF-8 natively for ages. It's only MySQL which has issues. Now, storing stuff as UTF-8 in MySQL is only an issue when you issue SQL statements that need to understand multibyte charsets (things like SUBSTRING, etc.). It's not an issue most of the time, where you're just dumbly pulling data in and out.

If we start having per-site charset stuff, we then need to make PHP do conversions between charsets. There's no way to do this, other than using the non-standard iconv functions, which many people don't have in their PHP build.

For most Western Europeans and Americans, UTF-8 will look entirely like single-byte ASCII - storage requirements are the same. By using UTF-8, we can use a single charset across *all* languages, which enables sharing of content between sites of two different languages without having to worry about impossible charset conversions. What happens if you have an ISO-8859-1 site which suddenly wants to include a Polish news feed? The entire site will have to have the charset changed, which will break all the existing content, etc. With UTF-8, it all Just Works. We also gain much simplicity. No one needs to work out which charset to use. For many people, this is a non-trivial issue. In short, there are many many good reasons to use UTF-8, and we shouldn't wait for broken databases to do so when (for the most part), it won't make any difference to anyone.

I do think that all the default database fields like title, etc. should be increased in size to a more sensible max VARCHAR limit of 250, though.

killes@www.drop.org’s picture

The majority of Drupal users uses MySQL. So we should not use features
that are not properly implemented in MySQL. The fact that it will not be
an isssue "most of the time" is not enough.

Your statement that we'd need to have a means to convert charsets is
not correct. The majority of users will have only one language on their
site as multiple language sites are not yet really supported. Also, I'd
probably not want a Polish feed on a site that is in German or French.

I agree that UTF-8 is nice in principle but as long as it is not yet
fully supported by MySQL we should abstain from using it.

dries’s picture

Priority: Critical » Normal

I tend to agree with Al here. Furthermore, limiting Drupal to West European languages doesn't strike me as an option.

Maybe you can provide some concrete examples of the suggested problems? That might be convincing.

Either way, I don't think this is a critical issue so I'm setting the priority to "normal".

killes@www.drop.org’s picture

I do not at all want to limit Drupal to West European
languages! I just want to have an option to keep my
existing Site (and database) in iso-8859-1.

Forcing people to change their site encoding
is not warranted by allowing other people to use Drupal.

A practical example would be a private module that
I wrote that uses SUBSTRING in a query.

calanya’s picture

Title: utf8 and database field size limit » A concrete example

With the encoding of web pages set to UTF-8, the user enters a text in his local language in a text field.
The text field is limited to 64 characters. The user enters 60 characters, but he uses some local characters (say German umlauts), so those 60 characters are 66 bytes in UTF-8.
Drupal inserts the user's input into MySQL, where the database field is limited to 64 characters. Because MySQL doesn't know about UTF-8, it thinks you're trying to insert a 66 characters long string, and silently removes the last 2 characters.
At a later time, Drupal reads the fields, and display the first 64 bytes to the user. Because the web pages aer encoded in utf-8, the last character of the string displayed to the user comes out as garbage. It usually causes a lot of other problems (the web browser thinks that a html tag which follow this broken string is actually part of the string, and interprets the page incorrectly).
You simply cannot trust MySQL to handle UTF-8. MySQL can only use specific local encodings, which is a good solution because there are very few multilingual web sites. MySQL 4.1, with UTF-8 support, should be out in a few months time. Please hold back with UTF-8 in Drupal till that time.

ax’s picture

Title: A concrete example » utf8 and database field size limit
cdpark’s picture

See also Bug #2310.

Kjartan’s picture

For better or worse UTF-8 is here to stay, marking this as such.