See http://api.drupal.org/api/HEAD/function/user_load

The code

else {
      $query[]= "LOWER($key) = LOWER('%s')";
      $params[] = $value;
    } 

applies to the search on the name. Although names are often case insensitive due to the default collations in e.g. MySQL, nothing forbids the site admin to allow names that differ only in case by changing the collation for the table user from e.g. utf8_general_ci to utf8_bin.

Moreover, enabling the utf8_bin collation in MySQL is often unavoidable while migrating from the earlier versions of MySQL, and this post explain why.

There are three levels of fixes possible:

  • Make a separate condition for the ($key == 'name) case
  • Get rid of LOWER() in the else part
  • Review the rest of the code to drop the LOWER() function when it duplicates the database functionality.
CommentFileSizeAuthor
#3 no_lower.patch8.39 KBchx

Comments

ChrisKennedy’s picture

Version: 4.7.4 » 6.x-dev
Category: bug » feature

This is by design to prevent usernames that differ only by capitalization. It would be confusing to others and would make it difficult to differentiate the posts between users. Are you sure that having the same username but different capitalizations is something that would be useful to you as a drupal administrator?

mikhailian’s picture

Category: feature » bug

The database software (e.g.MySQL) already enforces this constraint through unique indexes. It goes even further by making this constraint configurable, allowing to specify different collations for any single charset, enabling the admins to adapt the application for the innumerable uses.

A collation does not only specify the case sensitiveness but other relationships between characters. As an example, cyrillic e and ë are considered one character or two depending on the collation. An official site would rather use one collation, a literary site would prefer another.

An admin of an english site would probably configure the system to make e, ê è and é to be recognized one and the same letter, for the sake of simplicity. He could have also chosen to differ e, ê, é and è everywhere except the search_index and search_total tables. A french admin would definitely want to differ these e-letters everywhere.

Let's go back to the case-sensitiveness. The use of LOWER() breaks the flexibility provided by the collation mechanism of the database. This is why I would rather consider the use LOWER() as a bug.

It became apparent as people started to migrate their UTF-8 date from the latin1 to the utf8 datastore (see the link in my original bugreport) as enabling case-sensitiveness is the only way to keep the data intact.

chx’s picture

Title: code should not make assumptions about case sensitivity in the database (on the example of user_load()) » [performance] code should not make assumptions about case sensitivity in the database
Component: user system » base system
Assigned: Unassigned » chx
Status: Active » Needs review
StatusFileSize
new8.39 KB

Yes. It should be the task of the database and not Drupal to decide on collation (for example lowercase collates with uppercase). Also, these LOWER hurt performance, too.

dries’s picture

I don't think it would be a good idea if there was a 'dries' on drupal.org (in addition to 'Dries'), or both an 'Admin' and an 'admin'. People really expect case-insensitivity.

So for the behavior to remain the same, the database needs to use the utf8_general_ci collation, not? How do we ensure that is the case before after we applied this patch?

chx’s picture

We can provide an update path to utf8_general_ci but I think that's the default AND http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html there are a great many utf8_something_ci sets so upping to general might not be ideal.

chx’s picture

To further clarify:

*_bin: represents binary case sensitive collation
*_cs: case sensitive collation
*_ci: case insensitive collation

and the default is _ci , so I believe we do not need to make any special steps because the only way it can be _cs or _bin is when the admin sets up so and then we should not override that. In short: being ci or cs is not Drupal's to decide.

dries’s picture

At table creation time, we now use:

  $table['mysql_suffix'] = "/*!40100 DEFAULT CHARACTER SET UTF8 */";

I'd still like to see if we can change something at table creation time to enforce '_ci'.

Also, what does postgresql think of this? :)

chx’s picture

We could but we do not want to. Leaving out explicit collation definiton from there, we honor whatever the mysql administrator set up. But, if you really want to , I can add collate utf8_general_ci there but that somewhat defies the point of this patch.

chx’s picture

postgresql people can solve for themselves:

Currently PostgreSQL relies on the underlying operating system to provide collation support. This means that each platform has a slightly different way of doing collation. Some like the BSDs do not support UTF-8 collation. Some like glibc have fairly complete collation support.

from http://developer.postgresql.org/index.php/Todo:ICU

Potentially on some systems it'd be possible to generate a case insensitive collation as part of a locale and then use such for LC_COLLATE on initdb which would make all comparisons of text fields case insensitive

from http://www.thescripts.com/forum/thread173373.html

chx’s picture

There are many solutions for postgresql even when some OS does not support proper strcoll() like OpenBSD, http://gborg.postgresql.org/project/citext/projdisplay.php for text and for example http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg37618.html for varchar. I believe those who run that obscure database system can deal with their obscure problems.

sammys’s picture

Hi all,

I'm looking at this with a flu affected head so please bear with me. :)

I agree that code shouldn't make assumptions on how data is stored.

I confirm that PostgreSQL doesn't currently support a table specific (nor field specific) collation setting making the collation setting used at database creation time effective for all tables. This is a major problem for sites with PostgreSQL. Considering the efficiency benefit is huge enough, something has to be done to ensure PostgreSQL users don't get a surprise "feature" of case sensitivity on usernames.

One solution is to keep the LOWER() parts for PostgreSQL only. Another possible fix, albeit a hack, for this LOWER() performance problem was suggested in http://drupal.org/node/83738 where a lowercase version of the string is placed in another field. This would cover any of the other DBMSs automatically should they not implement table/field based collation settings. It was shot down by Dries understandably for its hackness. However, it does provide a fix for the problem regardless of collation and DBMS.

I'm going to sit on the fence and let you guys debate which way to go. I'll just provide a patch, if necessary, for my favourite obscure DBMS.

--
Sammy Spets
Synerger
http://synerger.com

catch’s picture

catch’s picture

Status: Needs review » Closed (duplicate)
mikhailian’s picture

Status: Active » Closed (duplicate)

I don't think it would be a good idea if there was a 'dries' on drupal.org (in addition to 'Dries'), or both an 'Admin' and an 'admin'. People really expect case-insensitivity.

What would you say about 'Driеs', 'Drіes' and 'Drіеs' in addition to 'Dries'? These are perfectly valid even within utf8_general_ci ;-)

P.S. I used cyrillic 'i' and 'e'.

mikhailian’s picture

Status: Closed (duplicate) » Active
richmck’s picture

Status: Closed (duplicate) » Active

Are there any mysql optimizers that are smart enough to *NOT* do a table scan (i.e. ignore LOWER()) when LOWER() is applied to a case-insensitive column within a where clause?

This could be, perhaps, done at the mysql client library level?

mikhailian’s picture

Version: 6.x-dev » 7.x-dev
Assigned: chx » Unassigned

Shouldn't we solve this in 7.x before it is too late again?

damien tournoud’s picture

Status: Active » Closed (won't fix)

On the contrary, we *should* make assumptions about case sensitivity in the database. Some matching should be case insensitive (username, etc.), while other shouldn't (paths, cache ids, locale source strings, etc.). To do that, we will have to manage collation in both the Schema API and the query builder, and I'm pretty confident we will manage to do that in D7. But this issue by itself is a won't fix.

mikhailian’s picture

Damien,

When you say "we *should* make assumptions about case sensitivity in the database", you fail to provide further reasoning.

I already demonstrated a few comments above that checking case-sensitivity will not help to fight impersonation. I do not see any other reason checking case-sensitivity on usernames should be mandatory and not left as an option to the one who builds a drupal-based site.

For the list of items where you suggest not checking case-sensitivity, I see several problems:

  • paths: If we do not check the case-sensitivity, we fail to respect RFC2396 & Co.
  • cache ids it dangerous to say whatsoever on the cache ids, as case-sensitivity is implementation-specific. AFAIR, memcached has case-sensitive, so there is already one problem.
  • locale source strings I also disagree here, but I would have to do some research to back it by facts. Just remember that Dutch, English and French are not the only languages in the world.
mikhailian’s picture

Status: Closed (won't fix) » Active

I realise that I use wrong wording. Case-sensitivity does not require checking. There is no action needed from the developer to have it. It is case-insensitivity that bring problems and requires development effort, and, in case of Drupal, creates a preformance hit that is hard to solve elegantly.

lilou’s picture

Issue tags: +Performance

Add tag.

mikhailian’s picture

Issue tags: +Beauty, +Correctness

Added two more tags

damien tournoud’s picture

Status: Active » Closed (won't fix)

I think we agree here, even if @mikhailian understood the opposite of what I was saying in #19.

mikhailian’s picture

Status: Closed (won't fix) » Active

Can someone else from the core developers look at this issue? Damien and me have opposing views (that probably originate in the different ways we use Drupal).

I would like to hear more opinions.

damien tournoud’s picture

Version: 7.x-dev » 8.x-dev
Category: bug » feature

There is no real point in keeping this issue open. Those LOWER() calls will only go away if we introduce collation support (and the associated operator support) to the Schema API. That whole issue is a portability nightmare:

  • MySQL support value-level collation along with character sets (ie. utf8_general_ci and utf8_general_bin), but doesn't support functional indexes;
  • PostgreSQL only supports database-level string collation, but has the concept of case sensitive and case insensitive matching (LIKE vs ILIKE), and supports functional indexes;
  • SQLite doesn't really support any useful collation;
  • SQL server supports value-level collation.

Bumping to D8.

mikhailian’s picture

Status: Active » Closed (fixed)

I see that this issue has been solved in #279851: Replace LOWER() with db_select() and LIKE() where possible, closing this one as well.