LOWER(name) queries perform badly; add column name_lower and index.

robertDouglass - September 12, 2006 - 13:53
Project:Drupal
Version:7.x-dev
Component:database system
Category:bug report
Priority:critical
Assigned:Unassigned
Status:patch (code needs work)
Description

We currently have an index on the name column of the users table, but it isn't doing us any good because the typical name-based user queries look for SELECT * FROM users WHERE LOWER(name) = LOWER('%s'). This effectively blocks the database from using the index and forces it to iterate through all user rows, perform LOWER on the name, and do the string comparison. This is a significant performance bottleneck in the following situations:

  • user login
  • user_load(array('name' => 'foobar'));
  • new user validation
  • user search
  • user autocomplete

The motivation for using LOWER is that usernames should be case insensitive. They should be case insensitive to prevent people from spoofing other users' identies (you're "Big_chief" on your site and someone registers as "Big_Chief" and starts acting like the site owner).

The solution I am proposing involves adding a column to the users table, name_lower, which contains an already lowercase version of name. The name_lower column has an index and the index actually gets used because the queries are now written like this:
SELECT * FROM users WHERE name_lower = LOWER('%s').

To test whether the speed increase would be worthwhile, I created a test database with 500K users. Here is the before and after:

Before name_lower

Table size: 66.76 MB.

1237.3 0 eval select * from users where LOWER(name) = LOWER('cfcd208495d565ef66e7dff9f98764da')
1295.39 0 eval select * from users where LOWER(name) = LOWER('c4ca4238a0b923820dcc509a6f75849b')
1287.55 0 eval select * from users where LOWER(name) = LOWER('c81e728d9d4c2f636f067f89cc14862c')
1266.64 0 eval select * from users where LOWER(name) = LOWER('eccbc87e4b5ce2fe28308fd9f2a7baf3')
1273.97 0 eval select * from users where LOWER(name) = LOWER('a87ff679a2f3e71d9181a67b7542122c')

After name_lower

Table size: 76.29 MB.

0.53 0 eval select * from users where name_lower = LOWER('cfcd208495d565ef66e7dff9f98764da')
0.38 0 eval select * from users where name_lower = LOWER('c4ca4238a0b923820dcc509a6f75849b')
0.39 0 eval select * from users where name_lower = LOWER('c81e728d9d4c2f636f067f89cc14862c')
0.36 0 eval select * from users where name_lower = LOWER('eccbc87e4b5ce2fe28308fd9f2a7baf3')
0.37 0 eval select * from users where name_lower = LOWER('a87ff679a2f3e71d9181a67b7542122c')

So in a typical case we're comparing 1273.97 ms to 0.37 ms. It is a big difference.

Remaining questions

  1. Where do UTF-8 and collations fit into all of this? I'm not in a position to speak to these issues with authority.
  2. Are there other solutions that don't introduce extra columns of essentially duplicated data?
  3. Will this work as well on Postgres?

I have rolled the patch against 4.7 but will provide patches against HEAD as requested. I'm hoping to get feedback and garner discussion before I spend more time rolling patches and benchmarking.

Oh, and you really have to try the user autocomplete field before and after this patch! Where it once felt sluggish, and often fell behind the user's typing, now it is brilliantly fast, even with 500K users!!!

AttachmentSize
name_lower.patch5.06 KB

#1

chx - September 12, 2006 - 14:03
Version:4.7.3» x.y.z
Priority:normal» critical
Status:active» patch (code needs review)

This is very true. Taxonomy should follow. I can attest that the slowest query (far surpassing search and all) at a huge site I run is the LOWER(name) query of taxonomy.

#2

chx - September 12, 2006 - 14:05

Why critical? Two reasons. One, the autocomplete Robert mentions and the other is the taxonomy that should follow -- have you ever tried taxonomy pages when your table has hundreds of thousands of terms?

#3

Dries - September 12, 2006 - 14:45

This is not critical, IMO. Also, the proposed solution is a hack, imo.

#4

robertDouglass - September 12, 2006 - 14:51

I'd be thrilled with an alternate solution. But if we cannot get the same performance gain any other way, I am fully comfortable with the proposed solution as well. What are our alternatives?

#5

Dries - September 12, 2006 - 14:52

By default, MySQL searches are not case sensitive so the lower()s could be removed. Unfortunately, PostgreSQL is case sensitive so it can't be removed. Maybe there is a clean way to make PostgreSQL not case sensitive?

#6

Dries - September 12, 2006 - 14:56

Crazy idea (not tested): can we create an index on lower(field)?

CREATE INDEX my_index ON my_table (LOWER(name));

#7

robertDouglass - September 12, 2006 - 14:59

I'll give it a try =)

#8

robertDouglass - September 12, 2006 - 15:12

That would be for Postgres, right? (guess I should learn how to use it) It doesn't seem like it would do any good to have the index be lowercase since MySQL can't use the index to begin with. And the goal of creating the index that way would have to be to force Postgres into behaving like MySQL (ie case insensitive search). It seems like the best solution would be if Postgres had some setting that would configure the behavior of their string comparison operator. Lacking that, I don't see another solution (other than whipping out the database-specific queries... yuck).

#9

robertDouglass - September 12, 2006 - 15:14

Just to clarify the problem once more: any time we write SQL that DOES_SOMETHING(column-name), eg. LOWER(name), there is no chance for existing indexes of any nature to be used. Forming the index differently won't correct the problem, only taking the function off of the column name can help.

#10

webchick - September 12, 2006 - 15:32

Not sure...

CREATE INDEX my_index ON my_table (LOWER(name));

My mad DB skillz are a bit rusty, but it strikes me that even if one or two oddball DBMS systems support something like that, it's going to be the exception rather than the rule. This is especially important now that people are active developing Oracle and MS SQL abstraction layers for Drupal.

An extra column, on the other hand, is something that can be used across the board.

#11

Gerhard Killesreiter - September 12, 2006 - 16:17

Just a comment: I don't think that adding a lowered version of the column is a particularly bad hack. it is about similiarly hacky as the introduction of the node_comment_statistics table.

#12

dww - September 12, 2006 - 18:09

we're talking 4 or 5 orders of magnitude faster, here. ;) i've seen far worse hacks for far less benefit. i agree w/ killes, this doesn't strike me as particularly hacky at all, and storing an already-LOWER() copy of name seems much more portable than playing with various DB-specific hacks.

that said, a little bit of info from the pgsql docs and google results:

  • (less promising) http://archives.postgresql.org/pgsql-php/2002-03/msg00031.php
    suggests another way to do it, but i suspect the performance will be crappy:
    SELECT * FROM table WHERE field ~* 'searchterm' AND ....
    i think that's just another way of doing ILIKE the case-insensitive version of LIKE, that some other thread mentioned was about 10 times slower than LIKE
  • i'll leave it upto the core committers to argue about portability vs. simplicity vs. hackery. personally, i think storing the LOWER(name) as a seperate column is just fine, and will definitely be the easiest to port (if slightly inefficient w/ space in the DB).

    #13

    m3avrck - September 12, 2006 - 18:14

    What if we had a global variable ... kind of like $connection, but with the $db_type (or do we have this? I forget):

    if mysql:
    select * from users WHERE name = 'bob'

    else if pgsql:
    select * from users WHERE lower(name) == lower('bob')

    Wouldn't that be the cleanest way?

    #14

    chx - September 12, 2006 - 18:23

    my 2cents are on the separate column. you are heading in the a very dark woods with that if $db_type -- what happens if someone else wants to write a similar query ? Multiple that if? brrrr.

    #15

    webchick - September 12, 2006 - 18:24

    The problem with that approach is that core can conceivably have many more than just mysql and postgresql supported. There are patches in the queue to add Oracle and MS SQL support, for example. And only a matter of time before someone does an SQLite port, or a Sybase port, or a DB2, etc. That kind of database-specific code will get unmanagable really, really fast.

    #16

    webchick - September 12, 2006 - 18:26

    Also, then postgres and other database systems don't benefit from the incredible performance benefit of not having to lowercase the fields.

    So yep, duplicate column still seems the best way to go for now. :)

    #17

    m3avrck - September 12, 2006 - 18:27

    I concur about the if $db_type, just playing devils advocate there.

    Way back when I talked with Robert about this I actually suggested the extra "lower" column in the first place as probably the best solution. And I think it still stands.

    Still doesn't hurt to throw out bad suggestions ... makes the best solution look all that much better :-p

    I still agree with my original thought (and the path this patch took), have a separate "lower" column.

    #18

    jvandyk - September 12, 2006 - 18:49

    I took the extra column approach a while ago with a very popular site. I think the payoff is worth the extra column.

    #19

    drinkypoo - September 12, 2006 - 20:07

    Just in case anyone needed further confirmation, I just applied this patch, and now autocomplete is actually usable. (I'm on a shared host, so it was pretty useless before.)

    #20

    dww - September 12, 2006 - 21:14
    Status:patch (code needs review)» patch (code needs work)

    finally had a chance to look closely at the patch itself. ;) mostly, it seems fine on a quick skim (i haven't tried testing it personally). however, system_update_183() is obviously missing the pgsql cases, so this needs work before it could be committed...

    #21

    robertDouglass - September 12, 2006 - 21:50

    @dww - did you test on HEAD or 4.7? I rolled it against 4.7 and haven't had time to apply it to HEAD.

    #22

    dww - September 12, 2006 - 22:18

    i didn't test at all. i just visually inspected that there's no 'pgsql' case in system_update_183() in http://drupal.org/files/issues/name_lower.patch

    speaking of which, if this issue is targetting the trunk (as the version currently states), that update # needs to be changed, too...

    #23

    sammys - September 13, 2006 - 00:10

    -1 to the patch as it can't be tested on pgsql. Time for someone to practice some pgsql syntax! :)

    Other than that, IMHO the concept is good and the extra 10% of space for 4 orders of magnitude more speed is worth it in core.

    Sammy Spets
    Synerger
    http://synerger.com

    #24

    sammys - September 15, 2006 - 15:20
    Status:patch (code needs work)» patch (code needs review)

    patch including pgsql update code for 4.7.

    AttachmentSize
    name_lower_20060916_0118_patch.txt5.03 KB

    #25

    robertDouglass - September 15, 2006 - 15:34

    This is getting rolled out on a client site today, so we'll have some reports from the wild for you soon.

    #26

    Zen - September 15, 2006 - 18:50

    Or you could just get rid of the "preserve case" feature and store the usernames in lower case right from the get-go? Capitalisation or w/e can be done in the presentation layer.. The only people who'll get ruffled by this are those with 1337 nicks like "deEp ThRo4T" or "Marquis de Sade" ..

    Unless I'm missing the big picture here, this is a minor feature that we can just nix..

    Just a thought.
    -K

    #27

    webchick - September 15, 2006 - 18:53

    @Zen: I dislike that idea. It goes against Drupal's philosophy of not mangling user input.

    I should be able to have the nick "webchick" and someone else should be able to have the nick "Joan of Arc" -- no amount of logic on the theme layer is going to allow both of those things to happen.

    #28

    Zen - September 15, 2006 - 18:58

    Also, if this feature needs to be retained, it might be more elegant to use PgSQL's operator overloading feature to accomplish this. Not sure if it's possible, but I'll try and ask Cvbge about it. The separate column approach is IMHO rather tacky.

    Cheers,
    -K

    #29

    robertDouglass - September 15, 2006 - 22:18

    It would be good to research some of the other rdbms solutions out there as well and see what issues we might run into if we do an Oracle or MSSQL port. The one shining advantage of the name_lower approach is that we can eat our cake and port it too. It is absolutely guaranteed to run the same on any database we use.

    I'm not against a db specific approach, especially if we can apply it to the other major systems I mentioned. But the "hackishness" or "tackiness" or "this hurts my normalized eyes-ness" of the patch shouldn't prevent us from harvesting the performance gain. One way or another, this bottleneck needs to be resolved.

    #30

    drumm - September 21, 2006 - 15:44
    Version:x.y.z» 4.7.3

    Looks like a 4.7 patch.

    #31

    killes@www.drop.org - September 24, 2006 - 12:23
    Version:4.7.3» x.y.z

    The issue still exists in HEAD:

    #32

    dww - September 24, 2006 - 12:52

    i think neil's point is that the performance implications of this fix are so enormous that we could consider this a bug that deserves to be fixed in 4.7, in addition to being fixed in the trunk. i'm not going to move the version again, but i'd vote for fixing this in both branches.

    #33

    pwolanin - September 24, 2006 - 14:41

    (subscribing) I'd wondered why the autocomplete lookup of usernames was so painfully slow. I'll try to test this patch on a copy of my 4.7 site.

    #34

    pwolanin - September 24, 2006 - 15:44

    With this patch in place, username autocomplete is remarkably faster (qualitatively).

    #35

    gopherspidey - September 24, 2006 - 17:36

    Well at least for postgres, it looks like comment #6 is the correct way to help list problem. According to the postgres manual.

    http://www.postgresql.org/docs/8.1/interactive/indexes-expressional.html

    Since mysql is by default case insensitive, and does not seem to allow indexs on SQL functions. A hack around the problem, we could just remove the "LOWER()" from the where clauses in select statements before we make the query.

    #36

    dww - September 24, 2006 - 23:32

    re: pgsql -- yeah, as i linked in #12. however, instead of trying to find DB-specific code for this on every DB we currently (and in the future, might) support, we could just add the extra field. it's simple, it'll always work, we'll never have to port it, and it's 4 or 5 orders of magnitude faster than what we have now.

    #37

    JohnAlbin - October 1, 2006 - 23:34

    No one seems to be addressing Robert's comment about spoofing addresses.

    The motivation for using LOWER is that usernames should be case insensitive [...] to prevent people from spoofing other users' identies (you're "Big_chief" on your site and someone registers as "Big_Chief" and starts acting like the site owner).

    If you look at this comment (http://drupal.org/node/56487#comment-107050) on login spoofing:

    AIM (AOL Instant Messenger) [...] allows spaces in logins, but "folds" them, so an AIM login of "JimBob32" is equivalent to "Jim Bob 32" is equivalent to "J i m B o b 3 2".

    ..we could extend the "second column idea" to be both index-able and less spoof-able.

    So this indexable field would be created by lowercasing name and stripping all non-alphanumeric characters.

    Ideally, this new value could be placed in name, with a new field called display_name which has the capitalization, spacing and punctuation the user prefers to be displayed in comments, etc. (As a bonus, you could allow apostrophes in the the display name for those named O'Brian). And since the display_name can be "folded" to get the name, you could still use the display name in logins and Drupal Authentication.

    Of course, this would mean the ubiquitous $user->name would need to be changed to $user->display_name. And would affect user login validation and a slew of other components.

    I don't know if this is feasible, but I thought I would make the suggestion before we implement a second column.

    #38

    pwolanin - October 2, 2006 - 00:58

    That is certainly a valuable suggestion, but I think that changing the meaning of $user->name is unlikely to find much support (at least prior to Drupal 6.0) due to the widespread breakage that would ensue. For now, it seems it would be pretty trivial to strip the whitespace, etc. from the entry in the name_lower column and that would make the system more robust.

    #39

    robertDouglass - October 2, 2006 - 08:21

    Nice suggestions. I think we'll most likely keep $user->name (as the display name), but I can definitely see stripping white space and non-alphnumeric chars from the name_lower column. This, like you said, would make it harder to spoof user names and easier to customize your own user name.

    This is for a later patch, however, because there is still a lot of resistance to the idea of having a second column for lookups to begin with.

    #40

    Hayakawa - October 2, 2006 - 10:48

    i was really lost in the comments. is this final version of patch after the suggestions? and are there any bottlenecks or imcompatibilities caused by this patch?
    thank you very much

    #41

    robertDouglass - October 2, 2006 - 18:21

    @Hayakawa: this patch is ready to go, if it is decided that it is an approach that we want to pursue. If someone comes up with a different approach that people feel is a better architecture, it might win out. So far, many prominent developers have expressed support for the patch, and a couple have expressed misgivings. What that means is that a better solution is hoped for. But if you want to use the patch, I feel confident that it will work well for you. It is being used by the biggest client that I work for on their production site and has caused the performance gains that are hoped for.

    #42

    JohnAlbin - October 2, 2006 - 19:04

    If Robert's latest patch can be finalized and put into Drupal 5.0 (or even 4.7), I would recommend that we do it. It provides (right now) a substantial performance improvement.

    And for those developers who are reticent to add a new column simply to get a performance gain, think of it this way...

    In Drupal 6.0, the new column could be a trifecta for these issues:

    1. username lookup performance
    2. prevention of much username spoofing
    3. better display name capabilities (allowing apostrophes, ampersands, commas, etc.)

    #43

    Steven - October 3, 2006 - 14:13

    Stripping the lower column of whitespace is a neat idea, but could break existing sites. It would be cool if we could add this for new sites though.

    One possible approach would be to only use literal lowercasing in system_update_N() to create the column, but to insert the stripped version for any new users from then on. Their usernames will be checked with the strict restrictions too.

    That way we don't have to provide an option and we don't break any site.

    #44

    pwolanin - October 3, 2006 - 14:43

    @Steven - I'm not sure that would work, since you need a apply a function to lower case and strip whitespace before you run the query. If you don't know in advance whether or not to strip, that won't work.

    Is it possible to provide a pre-update tool to check whether there will be conflicting usernames after the whitespace stripping? Or is there any way to provide an interactive process during the update?

    #45

    sammys - October 4, 2006 - 05:12

    FYI: Correct syntax for creating pgsql indexes for drupal tables is:

    CREATE INDEX {table}_fieldname_idx ON {table} (fieldname)

    #46

    moshe weitzman - October 5, 2006 - 22:19
    Status:patch (code needs review)» patch (code needs work)

    needs a 5.0 port.

    but further, this is one of those updates which is difficult in our update syste, once you update your code, you can't login. we handled that at one time by having people manually update their sessions table. might be needed here too.

    #47

    chx - October 6, 2006 - 00:02
    Status:patch (code needs work)» patch (code needs review)

    I rerolled for HEAD, was surprisingly easy, only hunk failed for user.module.

    Why you can't login? I do not see it.

    AttachmentSize
    lower.patch4.96 KB

    #48

    moshe weitzman - October 6, 2006 - 01:03
    Status:patch (code needs review)» patch (reviewed & tested by the community)

    sorry, i was unclear. if a user is not logged as uid=1 after updating source, he is unable to login and perform update because the new column doesn't exist. however, i tested setting $access_check = FALSE in update.php and that works so is a good workaround for this situation. we'll just have to document this in the upgrade notes for 5.0

    i tested chx' recent patch for functionality and update path. works well. rtbc.

    #49

    Dries - October 8, 2006 - 06:30
    Category:bug report» task
    Priority:critical» normal
    Assigned to:Anonymous» Dries

    I still think this is not the cleanest approach. Also, I am going to benchmark this before it can get committed. I'm tempted to postpone this, as I think we might be able to do a better job with this ...

    Changing the status from 'critical bug report' to 'normal task'. This doesn't break your website and therefore isn't critcal.

    #50

    Dries - October 8, 2006 - 06:31
    Status:patch (reviewed & tested by the community)» patch (code needs work)

    Also changing the status to code needs work. Will update the status once benchmarked.

    #51

    robertDouglass - October 8, 2006 - 07:01

    Let me know if there is a way I can help with the benchmarking.

    #52

    coreb - November 30, 2006 - 02:22
    Version:x.y.z» 5.x-dev

    Moving out of the "x.y.z" queue to a real queue.

    It doesn't appear to be a new feature, so I'm putting it in 5.x-dev.

    #53

    m3avrck - January 7, 2007 - 07:30

    Let's bring this back into the light -- lower() sucks :-)

    #54

    Gary Feldman - January 18, 2007 - 07:02

    I just came across a problem where users can't login, because the clause "LOWER(name) = LOWER('The Users Name') was failing when the name was in mixed case (on MySQL 4.1, Drupal 4.7.5). This seems to be related to this thread. I fixed it by simply removing the lower (for now) in user_load, trusting that the unique key will prevent duplicates that differ in case.

    It seems incredible to me that a problem this severe could exist for this long (though given the number of keyboards without shift keys, perhaps it's not so incredible) . So I'm willing to assume for the moment that I'm missing something. If so, I'd appreciate it if someone could let me in on the secret. If not, should I raise the priority of this or submit it as a separate bug?

    #55

    Gary Feldman - January 18, 2007 - 07:56

    Removing the LOWER made the user id case sensitive. So I changed it to

       ...CONVERT($key using utf8) = '%s'";

    This is specific to MySql 4.1.1 and higher.

    The way I read the MySQL docs, the existing Drupal code using LOWER would work on previous versions of MySQL. Furthermore, I think it would work on a clean install of Drupal with later versions of MySQL, because the clean install sets users to be a varchar. It's the upgrade process that converts varchar columns to varbinary, for reasons I have yet to discern.

    #56

    JohnAlbin - January 18, 2007 - 17:33
    Version:5.x-dev» 6.x-dev

    Hi Gary,

    This thread is about a proposed performance improvement. And while the patch does remove the LOWER() MySQL function call, it's really unrelated to the bug you are experiencing.

    Please, submit a new bug so that we can tackle both of these issues separately.

    Thanks!

    #57

    quicksketch - April 16, 2007 - 03:26

    Revisiting Dries's idea of making pgSQL case insensitive. It's possible to override operators in PostGreSQL (so I've heard). If it would be possible to make the = operator case insensitive for varchar comparisions, maybe that would be better solution as it would work for all database columns and keep the entire problem contained in the pgsql database implementation.

    From http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:49827:

    Some operations on data can be case insensitive. So your solution is not case insensitive data, your
    solution is a case insensitve operator. Luckily it is very easy to define your own operators in
    PostgreSQL. For instance, this defines case-insensitive equality and inequality operators for text
    datatypes:

    CREATE FUNCTION case_insensitive_equality(text, text) RETURNS boolean
        AS 'SELECT LOWER($1) = LOWER($2)'
        LANGUAGE SQL
        IMMUTABLE
        RETURNS NULL ON NULL INPUT;
    CREATE FUNCTION case_insensitive_inequality(text, text) RETURNS boolean
        AS 'SELECT LOWER($1) <> LOWER($2)'

        LANGUAGE SQL
        IMMUTABLE
        RETURNS NULL ON NULL INPUT;
    CREATE OPERATOR === (
        leftarg = text,
        rightarg = text,
        procedure = case_insensitive_equality,
        commutator = ===,
        negator= <=>
    );
    CREATE OPERATOR <=> (
        leftarg = text,
        rightarg = text,
        procedure = case_insensitive_inequality,

        commutator = <=>,
        negator= ===
    );

    I would recommend against overwriting the current = and <> operators with these operators because I
    have no idea what the side-effects may be and I don't think they are indexable.

    So it looks like we have some questions to answer here. They might not indexable, but surely this will offer a performance improvement over the LOWER() method currently in use which kills the indexes anyway for all database types.

    #58

    dww - April 16, 2007 - 03:39

    They might not indexable, but surely this will offer a performance improvement over the LOWER() method currently in use

    huh? ;) read the definition of the === operator you just pasted above:

    SELECT LOWER($1) = LOWER($2)

    by defining your own case-insensitive equality operator, you're just hiding the LOWER() calls, not removing them.

    #59

    kbahey - April 16, 2007 - 04:45

    @robertDouglass

    Are you sure the query cache is not a factor in the improved results at the top of this page? Try restarting MySQL between tests and see.

    @All
    This approach is one of the no-nos in database design. The reason is that you have data redundancy and one field is a product of the other in the same row. Things get out of hand if one part of the application would update one field but forget to update the other, and then you have data integrity issues and all sorts of problems.

    This may not be such an issue if we only have one place for the INSERT and one place for the UPDATE of the name. But if we have more then we are just opening ourselves to trouble in the future.

    One approach that allows a new column, yet ensure its integrity, is if we had triggers: this would automatically update the name_lower field every time we have an insert/update to the row, and would be transparent to the application. But these are only available in MySQL 5.0.2 onwards.

    The other option to solve this is to get rid of the LOWER()s in SQL altogether. The database is case insensitive anyway, and uniqueness is guaranteed since we have a unique index on name. We can do the lower part in PHP on the name before we pass it to db_query().

    #60

    kbahey - April 16, 2007 - 04:50

    I mean instead of this:

    <?php
    db_query
    ("SELECT uid FROM {users} WHERE LOWER(name) = LOWER('%s')", $name);
    ?>

    We do something like this:

    <?php
    db_query
    ("SELECT uid FROM {users} WHERE name = '%s'", strtolower($name));
    ?>

    Can someone benchmark this?

    #61

    m3avrck - April 16, 2007 - 04:50

    I agree with Khalid. Do this at the PHP level with strtolower(), the database should not store 2 representations of the same thing, it's going to cause problems eventually.

    #62

    quicksketch - April 16, 2007 - 04:56

    by defining your own case-insensitive equality operator, you're just hiding the LOWER() calls, not removing them.

    Yeah you're right, I should read code more carefully. Shoot, another option down the drain :(

    #63

    quicksketch - April 16, 2007 - 05:02

    The problem with kbahey's approach is that the user table stores the data including mixed case usernames. Using strtolower() in PHP and then searching will cause any username containing an uppercase letter to fail when using pgSQL whose comparisons are case sensitive, the original cause of all the trouble in this issue.

    #64

    kbahey - April 16, 2007 - 13:25

    But Robert's original test and patch was on MySQL.

    The motivation was bad performance on MySQL because of the lack of using the index since the name was wrapped in LOWER().

    #65

    pwolanin - April 16, 2007 - 14:52

    Until reliable triggers are available in MYSQL, it seems very hard to think of a solution other than storing the lower-cased username as another column. While this may not be a great practice, it's database-agnostic.

    Even if Drupal was pushed in MySQL 5.0, it's not clear that most users would be able to create a TRIGGER:

    CREATE TRIGGER was added in MySQL 5.0.2. Currently, its use requires the SUPER privilege.

    http://dev.mysql.com/doc/refman/5.0/en/create-trigger.html

    #66

    robertDouglass - April 16, 2007 - 21:45

    I'm still in favor of the extra column, but rerolled to remove whitespace as was suggested earlier in the thread. Anybody is welcome to beat me to it, since I won't be doing it this week.

    #67

    pwolanin - April 16, 2007 - 22:39

    How do you deal with the update path then? by removing the white space, previously valid usernames may collide.

    #68

    kbahey - April 18, 2007 - 04:50

    I still don't like an extra derived column. It is not good practice and a whole can of worms.

    Here are some benchmarks using MySQL 5.0.22.

    I used today's HEAD and created 10,000 users using devel.

    Then I ran two queries, searching for a user called "paritoh" (lower case p in the db):

      $sql = "SELECT * FROM {users} WHERE LOWER(name) = LOWER('%s')";
      db_query($sql, 'Paritoh'));

    This is how it is in core today.

    Result: Time=22.491 Rows=1 SQL=select * from users where LOWER(name) = LOWER('%s')

      $sql = "SELECT * FROM {users} WHERE name = '%s'";
      db_query($sql, strtolower('Paritoh'));

    This is using the case conversion outside the database:

    Result: Time=0.395 Rows=1 SQL=select * from users where name = '%s'

    56X orders of magnitude improvement.

    So, for MySQL, the performance problem is solved by doing the case conversion in PHP, with no backward compatibility issues or the need for a derived column.

    This leaves PostgreSQL, which is case sensitive.

    By using database abstraction, we always have compromises such as lowest common denominator, or optimizing for one but not the other.

    Alternatives:

    - Store names in lower case. They are already unique and have a unique index, so no data issues are there, but will be offputting to users to see their names change case. In this case, we can still use LOWER in the SQL, but like so SELECT * FROM {users} WHERE name = LOWER('%s') which works with all databases, and is as fast as the PHP strtolower() for MySQL.

    - Do some magic in the abstraction layer to rewrite the SQL for each engine. For example, if we have LOWER(name), we make it just name for MySQL.

    - Other alternatives?

    #69

    catch - October 26, 2007 - 23:06
    Status:patch (code needs work)» duplicate

    http://drupal.org/node/105161 and http://drupal.org/node/181625 were duplicates.

    #70

    catch - October 26, 2007 - 23:07
    Status:duplicate» patch (code needs work)

    oops, setting status back, too many tabs.

    #71

    merlinofchaos - November 1, 2007 - 04:21
    Category:task» bug report
    Priority:normal» critical

    Coming back over from http://drupal.org/node/181625

    I am setting this back to critical and back to bug report.

    From firebus' report (I've been working with him), not fixing this made nanowrimo.org unusuable with 800 logged in users; fixing this got him up to 1400 logged in users and a slow but usable site.

    This is a critical performance problem.

    #72

    hswong3i - November 1, 2007 - 09:07

    Subscribing

    #73

    chx - November 1, 2007 - 09:57
    Assigned to:Dries» chx
    Status:patch (code needs work)» patch (code needs review)

    I would side with merinofchaos, as posted in another issue, one single term query on NowPublic runs for 2.42s w/ LOWER and 0.02 without. So, I would argue Dries' earlier

    This doesn't break your website and therefore isn't critcal.

    -- while the website does run but if its so slow that it can't be used, then IMO that's a broken. The patch is not difficult. The update is not nice but that's life.

    AttachmentSize
    lower_0.patch10.71 KB

    #74

    chx - November 1, 2007 - 10:04

    Doh! Forgot the indexes.

    AttachmentSize
    lower_1.patch10.93 KB

    #75

    chx - November 1, 2007 - 10:23

    forgot access table as I only search modules. I do not think there is any point keeping the mask case sensitive it's not the usual user submitted data that can not be altered. So, I simply lowercase that and not keep a lower_mask.

    AttachmentSize
    lower_2.patch12.91 KB

    #76

    hswong3i - November 1, 2007 - 10:23

    Give a simple scan about the patch in #74 and sounds beauty: including both upgrade and normal queries patch.

    BTW, seems not yet include the handling of drupal_is_denied() (http://drupal.org/node/174025)? It is also a critical issue that affected by non-indexed LOWER() handling. Should we also take care of that within this issue? May be also handled by 2 columns? One for compare (lower case) and the other for admin UI (natural case) ?

    #77

    chx - November 1, 2007 - 10:30

    hswong3i says, I should warn the admin about the mask change. Be it.

    AttachmentSize
    lower_3.patch13.47 KB

    #78

    John Morahan - November 1, 2007 - 11:28
    Status:patch (code needs review)» patch (code needs work)

    Applied #77, installed a new site into a fresh database, and on completing installation, user 1 was not logged in. Trying to log in gives this error: "The username admin has not been activated or is blocked."

    Also, the indexes were not created. (looks like they were only added to the upgrade path)

    #79

    hswong3i - November 1, 2007 - 12:06

    Sorry, please hold on for a while! A simple suggestion: should we split the user "real name", "account name" and "user ID" as 3 different type of fields? Let's take vexim (http://silverwraith.com/vexim/) database users table schema as example:

    1. user_id: as like as our uid, auto_increment.
    2. localpart: the local part of email address (e.g. localpart@domain). It is always unique, and restricted with email local part restriction: [a-z0-9_-]. Similar as our name field, plus lower case limitation.
    3. realname: An additional field with NO strict limitation. User will able to input whatever string as their display name.

    So my suggested changes:

    1. uid: unchange.
    2. name: field keeped, and translate all account name into lower case. As like as name_lower proposed in chx's patch.
    3. realname: new field, free typing. As backward compatible, we can simply copy the existing name into this new field.

    Benefit include:

    1. No change to name and keep its function. Contribute developers should not be confused by the lower case transformation.
    2. realname is much more meaningful and useful. E.g. I always use profile.module and create some field similar as this nature for user full/real name.
    3. name_lower is only meaningful when compare, but no other benefit. On the other hand, contribute developer may feel confusing: which field should use for which cases? Should I always use name_lower during compare? Or combine the use of both field? Or some other style? We need to document the change clearly.

    Yes, this suggestion is a bit complicated when compare with chx's patch: patch in #77 is much simple and direct, and should be functioning with no question (may be we need some more debug, but this shouldn't be difficult). This suggestion may require for more changes, e.g. most UI should update and display both user real name and account name. But this is not a new idea: most forum system supported for this style.

    Please feel free to work hard with chx's name_lower patch, it is target for today. BTW, my suggestion shouldn't be too complicated, and is suitable for a long term development. If it should belongs to D7, I will come back for this afterward, and focus on chx's work for the needs of today :)

    #80

    alpritt - November 1, 2007 - 14:57

    @#79: While I agree with a need for real display names, it seems like something that would complicate this issue. Whatever the outputted username is, will be the one people search for and must therefore be translatable into the lowercase version. So we cannot have a display name of 'John Doe' and then a username of 'jdoe' since other users would end up searching for the wrong name.

    Comment #37 seems like a much more workable idea. And probably something that should be taken into consideration in support of the above patch -- even if we intend to implement it at a later date. If this is the case, maybe the naming of the field should not be as specific as 'name_lower', but should be something more generic like 'name_unique'.

    #81

    alpritt - November 1, 2007 - 15:06

    From #79: "This suggestion may require for more changes, e.g. most UI should update and display both user real name and account name. But this is not a new idea: most forum system supported for this style."

    Yes, this would solve the problem I raised above. However, it will do so at the cost of the UI. Displaying both names would be confusing.

    #82

    webchick - November 1, 2007 - 15:09

    I think hswong's suggestion is when someone enters "John Doe" as their username on registration, behind the scenes, "john doe" is stored as the user.username field, and "John Doe" is stored in the user.realname field.

    How does all this lowercase business work with accented letters, Japanese characters, etc?

    #83

    catch - November 1, 2007 - 15:20

    @webchick: This suggests extra lower case column for accented characters so shouldn't be an issue in this case.

    Japanese doesn't have a concept of lower case, so at the db level I assume it makes no difference at all (could only find a mysql internals list discussion from 2000 to support this, and it wasn't directly related).

    #84

    hswong3i - November 1, 2007 - 15:52

    For clarify my idea (http://edin.no-ip.com/html/?q=node/346), here are some useful screenshots from vexim and eGroupWare:
    http://edin.no-ip.com/html/files/vexim_adminuseradd.png
    http://edin.no-ip.com/html/files/vexim_adminuserchange.png
    http://edin.no-ip.com/html/files/egroupware_edit_user.png
    http://edin.no-ip.com/html/files/egroupware_setup_authentication.png

    The proposal (split field into realname, name and uid) is not too complicated, backward compatible, and widely adopted by many other systems. Should this be one of the possible solution?

    #85

    hswong3i - November 1, 2007 - 16:05

    May be we should restrict "account name" (name) as strict as email local part: [a-z0-9-_.], and leave "real name" (realname) as free as possible. This can simplify the checking (whatever UTF8 or not, we only accept simple text input), and also extensible if we may adopt SMTP authentication backend.

    #86

    catch - November 1, 2007 - 16:31

    hswong3i - username currently accepts most UTF-8 characters - including Japanese etc., your suggestion would be a regression if it restricted to [a-z0-9-_.], and I really think the "real name" idea should be left for D7 - we already have profile.module for this.

    Looks like #77 needs schema declaration updated in .install, will try to reproduce John Morahan's bug later on.

    #87

    John Morahan - November 1, 2007 - 16:50

    I'm using MySQL version 5.0.22-Debian_0ubuntu6.06.5-log - if that helps.

    #88

    hswong3i - November 1, 2007 - 16:57

    @catch: agree. The idea of restricted name ([a-z0-9-_.]) may be too new and also too late within D6 dev cycle. We may need more time and discussion for this additional issue. I am just exploring about its possibility.

    BTW, It should be a good idea if we employ the pair up of realname + name, rather than name + name_lower. This should be more meaningful, and backward compatible: no harm to contribute developer even they don't update the old LOWER(name) handling, will not break existing code but just without performance boost.

    #89

    kbahey - November 1, 2007 - 17:02

    As I said earlier in #68, the problem does not exist in MySQL, and we can overcome it with a simple code change that does not involve any schema changes.

    Again, I am objecting to chx's approach of derived columns that only hold a variation of an existing column. These are just trouble down the road, and we will pay dearly for them in maintenance and complexity.

    If Postgres needs that, then why not limit the complexity to it?

    As to hswong3i's suggestion, I need clarifications on:

    - The local part of the email may not be unique. There can be someone@hotmail and someone@yahoo with the same "someone" part, but different people. Unlikely but possible.

    - Your scheme enforces that the login name ('name' in your writeup) is always lower case. This means that now we cannot have 'Dries', or 'Amazon', but only 'dries' and 'amazon'? That may not be acceptable to some people ("Why can't the login name be what I want it to be?") Also, what about UTF-8 characters? How would Jose/Andre with accents on the e be handled? Strip those as well?

    #90

    moshe weitzman - November 1, 2007 - 17:37

    your suggestion to use a different schema for pg users is is more complex than the extra column, IMO. what about al lthe other DBMS? you have objected without proposing an acceptable alternative.

    #91

    hswong3i - November 1, 2007 - 17:52

    @moshe: totally agree. We shouldn't have a hybrid solution which is case-by-case and database-by-database dependent.

    @kbahey: local part is not unique, just as like as it is not unique when using drupal.module within D5.3. SMTP authentication should always binded with A authentication server. In case of eGroupWare, we can define the "Mail domain (for Virtual mail manager)", so user will able to use both "username@example.com" and "username" if Mail domain is set as "example.com" (Sorry this topic shouldn't belongs to here, I will create a new one for D7)

    On the other hand, login name is restricted, but not realname. Realname allow UTF-8, with no strict limitation. The idea is similar as the "Full name" field which is used within drupal.org but created by profile.module. Realname should be something always binded with user, no matter how the drupal site is used for (in general speaking...). It is not necessary to split this essential field into a non-standardized profile.module handling.

    #92

    chx - November 1, 2007 - 19:30

    realname whatever it is , too big for D6 and does not belong to this thread. Bring that somewhere else. (So much banter and noone fixed the patch!)

    #93

    John Morahan - November 1, 2007 - 19:30

    Found the login bug: there was a missing break; in the switch statement in user_save(). Leaving as CNW because this still needs the schema update.

    AttachmentSize
    lower_4.patch15.18 KB

    #94

    John Morahan - November 1, 2007 - 19:39

    Also, user_save() does not populate {users}.name_lower for newly created accounts.

    #95

    John Morahan - November 1, 2007 - 19:47

    Now it does.

    AttachmentSize
    lower_5.patch15.68 KB

    #96

    John Morahan - November 1, 2007 - 19:59
    Status:patch (code needs work)» patch (code needs review)

    Added the schema change, and added placeholder-for-uid-1 to {users}.name_lower in system_install(), so as not to break the new unique key.

    AttachmentSize
    lower_6.patch17.12 KB

    #97

    catch - November 1, 2007 - 20:19

    Ok so I patched a fresh checkout and installed with no issues. offset with comment module but I think my tarball is a little out of date.

    Made a new user manually and logged in with them, then logged out and back in with the admin, all fine. Posted nodes and comments, did some user admin etc. no problems there either. Also checked the db and the name_lower column and respective index were there as expected.

    Looking pretty good!

    #98

    catch - November 1, 2007 - 20:22
    Status:patch (code needs review)» patch (reviewed & tested by the community)

    missed a bit.

    #99

    jvandyk - November 1, 2007 - 20:45
    Status:patch (reviewed & tested by the community)» patch (code needs review)

    Fixed some grammar issues.

    The tests for lowercase masks in the access rules UI were incorrect.

    AttachmentSize
    lower20071101jv.patch17.02 KB

    #100

    David Strauss - November 1, 2007 - 22:05

    The index on user.name needs to go, but this patch looks great otherwise.

    #101

    John Morahan - November 1, 2007 - 22:57

    Like so?

    AttachmentSize
    lower_7.patch17.13 KB

    #102

    John Morahan - November 1, 2007 - 23:32

    Seems user_autocomplete() got lost somewhere along the way

    AttachmentSize
    lower_8.patch17.89 KB

    #103

    kbahey - November 2, 2007 - 02:12

    @moshe and @hswong3i:

    My comment on MySQL, is that 95% of Drupal sites use MySQL already (we can debate the figure separately). So for the sake of the 5% we are introducing a derived column, denormalizing the schema, which is not needed by those 95%. Dowing a strlower() in PHP solves the problem for MySQL.

    I don't use PostgreSQL, nor my clients do, and hence have no need to solve that problem. Someone who uses it can scratch that itch.

    The issue with using derived columns is that they have to computed every time the original changes, and the probability of errors because they get out of sync is there. Yes, the UPDATE and INSERT can be in only a few places, but there are modules that can mess this up easily.

    Anyways, it seem that I am the only one voicing this concern. I wish we can poll a PostgreSQL expert on what options are there before we see this as the ultimate solution to this problem.

    #104

    hswong3i - November 2, 2007 - 03:48

    Sorry that since drupal_is_denied() is more high priority than user.name (drupal_is_denied() is called for EVERY page generation, whatever deny or not), I adopt both killes's patch (http://drupal.org/node/174025#comment-317309) and chx's patch comment (http://drupal.org/node/83738#comment-332240), and shift the focus of drupal_is_denied() back to its own issue (http://drupal.org/node/174025#comment-332305).

    drupal_is_denied() should delay no more, and shouldn't wait for user.name :)

    #105

    hswong3i - November 2, 2007 - 10:40

    I folk this issue and handle with what I proposed in #79: http://drupal.org/node/188734. That issue should not be as simple as name_lower, but should be more complete and elegant :)

    #106

    markus_petrux - November 2, 2007 - 11:23

    hmm... not sure if someone checked this already... there's a way to extend PostgreSQL by creating new data types, so it is possible to create a "Case insensitive text data type". Here's a project that does this:

    http://gborg.postgresql.org/project/citext/projdisplay.php

    #107

    chx - November 2, 2007 - 15:04

    denormalizing is not for speed purposed, next version we can even put in triggers, about citext, well, that's not civarchar so it's somewhat problematic.