LOWER(name) queries perform badly; add column name_lower and index.
| Project: | Drupal |
| Version: | 7.x-dev |
| Component: | database system |
| Category: | bug report |
| Priority: | critical |
| Assigned: | Unassigned |
| Status: | patch (code needs work) |
We currently have an index on the name column of the users table, but it isn't doing us any good because the typical name-based user queries look for SELECT * FROM users WHERE LOWER(name) = LOWER('%s'). This effectively blocks the database from using the index and forces it to iterate through all user rows, perform LOWER on the name, and do the string comparison. This is a significant performance bottleneck in the following situations:
- user login
- user_load(array('name' => 'foobar'));
- new user validation
- user search
- user autocomplete
The motivation for using LOWER is that usernames should be case insensitive. They should be case insensitive to prevent people from spoofing other users' identies (you're "Big_chief" on your site and someone registers as "Big_Chief" and starts acting like the site owner).
The solution I am proposing involves adding a column to the users table, name_lower, which contains an already lowercase version of name. The name_lower column has an index and the index actually gets used because the queries are now written like this:
SELECT * FROM users WHERE name_lower = LOWER('%s').
To test whether the speed increase would be worthwhile, I created a test database with 500K users. Here is the before and after:
Before name_lower
Table size: 66.76 MB.
1237.3 0 eval select * from users where LOWER(name) = LOWER('cfcd208495d565ef66e7dff9f98764da')
1295.39 0 eval select * from users where LOWER(name) = LOWER('c4ca4238a0b923820dcc509a6f75849b')
1287.55 0 eval select * from users where LOWER(name) = LOWER('c81e728d9d4c2f636f067f89cc14862c')
1266.64 0 eval select * from users where LOWER(name) = LOWER('eccbc87e4b5ce2fe28308fd9f2a7baf3')
1273.97 0 eval select * from users where LOWER(name) = LOWER('a87ff679a2f3e71d9181a67b7542122c')After name_lower
Table size: 76.29 MB.
0.53 0 eval select * from users where name_lower = LOWER('cfcd208495d565ef66e7dff9f98764da')
0.38 0 eval select * from users where name_lower = LOWER('c4ca4238a0b923820dcc509a6f75849b')
0.39 0 eval select * from users where name_lower = LOWER('c81e728d9d4c2f636f067f89cc14862c')
0.36 0 eval select * from users where name_lower = LOWER('eccbc87e4b5ce2fe28308fd9f2a7baf3')
0.37 0 eval select * from users where name_lower = LOWER('a87ff679a2f3e71d9181a67b7542122c')So in a typical case we're comparing 1273.97 ms to 0.37 ms. It is a big difference.
Remaining questions
- Where do UTF-8 and collations fit into all of this? I'm not in a position to speak to these issues with authority.
- Are there other solutions that don't introduce extra columns of essentially duplicated data?
- Will this work as well on Postgres?
I have rolled the patch against 4.7 but will provide patches against HEAD as requested. I'm hoping to get feedback and garner discussion before I spend more time rolling patches and benchmarking.
Oh, and you really have to try the user autocomplete field before and after this patch! Where it once felt sluggish, and often fell behind the user's typing, now it is brilliantly fast, even with 500K users!!!
| Attachment | Size |
|---|---|
| name_lower.patch | 5.06 KB |

#1
This is very true. Taxonomy should follow. I can attest that the slowest query (far surpassing search and all) at a huge site I run is the LOWER(name) query of taxonomy.
#2
Why critical? Two reasons. One, the autocomplete Robert mentions and the other is the taxonomy that should follow -- have you ever tried taxonomy pages when your table has hundreds of thousands of terms?
#3
This is not critical, IMO. Also, the proposed solution is a hack, imo.
#4
I'd be thrilled with an alternate solution. But if we cannot get the same performance gain any other way, I am fully comfortable with the proposed solution as well. What are our alternatives?
#5
By default, MySQL searches are not case sensitive so the lower()s could be removed. Unfortunately, PostgreSQL is case sensitive so it can't be removed. Maybe there is a clean way to make PostgreSQL not case sensitive?
#6
Crazy idea (not tested): can we create an index on lower(field)?
CREATE INDEX my_index ON my_table (LOWER(name));
#7
I'll give it a try =)
#8
That would be for Postgres, right? (guess I should learn how to use it) It doesn't seem like it would do any good to have the index be lowercase since MySQL can't use the index to begin with. And the goal of creating the index that way would have to be to force Postgres into behaving like MySQL (ie case insensitive search). It seems like the best solution would be if Postgres had some setting that would configure the behavior of their string comparison operator. Lacking that, I don't see another solution (other than whipping out the database-specific queries... yuck).
#9
Just to clarify the problem once more: any time we write SQL that DOES_SOMETHING(column-name), eg. LOWER(name), there is no chance for existing indexes of any nature to be used. Forming the index differently won't correct the problem, only taking the function off of the column name can help.
#10
Not sure...
CREATE INDEX my_index ON my_table (LOWER(name));My mad DB skillz are a bit rusty, but it strikes me that even if one or two oddball DBMS systems support something like that, it's going to be the exception rather than the rule. This is especially important now that people are active developing Oracle and MS SQL abstraction layers for Drupal.
An extra column, on the other hand, is something that can be used across the board.
#11
Just a comment: I don't think that adding a lowered version of the column is a particularly bad hack. it is about similiarly hacky as the introduction of the node_comment_statistics table.
#12
we're talking 4 or 5 orders of magnitude faster, here. ;) i've seen far worse hacks for far less benefit. i agree w/ killes, this doesn't strike me as particularly hacky at all, and storing an already-LOWER() copy of name seems much more portable than playing with various DB-specific hacks.
that said, a little bit of info from the pgsql docs and google results:
claims that the following works for exactly this sort of problem:
CREATE INDEX id_lower_content ON mytable(lower(column_name))for more on this, check out: http://www.pgadmin.org/docs/dev/pg/indexes-expressional.html
suggests another way to do it, but i suspect the performance will be crappy:
SELECT * FROM table WHERE field ~* 'searchterm' AND ....i think that's just another way of doing
ILIKEthe case-insensitive version ofLIKE, that some other thread mentioned was about 10 times slower thanLIKEi'll leave it upto the core committers to argue about portability vs. simplicity vs. hackery. personally, i think storing the LOWER(name) as a seperate column is just fine, and will definitely be the easiest to port (if slightly inefficient w/ space in the DB).
#13
What if we had a global variable ... kind of like $connection, but with the $db_type (or do we have this? I forget):
if mysql:
select * from users WHERE name = 'bob'
else if pgsql:
select * from users WHERE lower(name) == lower('bob')
Wouldn't that be the cleanest way?
#14
my 2cents are on the separate column. you are heading in the a very dark woods with that if $db_type -- what happens if someone else wants to write a similar query ? Multiple that if? brrrr.
#15
The problem with that approach is that core can conceivably have many more than just mysql and postgresql supported. There are patches in the queue to add Oracle and MS SQL support, for example. And only a matter of time before someone does an SQLite port, or a Sybase port, or a DB2, etc. That kind of database-specific code will get unmanagable really, really fast.
#16
Also, then postgres and other database systems don't benefit from the incredible performance benefit of not having to lowercase the fields.
So yep, duplicate column still seems the best way to go for now. :)
#17
I concur about the if $db_type, just playing devils advocate there.
Way back when I talked with Robert about this I actually suggested the extra "lower" column in the first place as probably the best solution. And I think it still stands.
Still doesn't hurt to throw out bad suggestions ... makes the best solution look all that much better :-p
I still agree with my original thought (and the path this patch took), have a separate "lower" column.
#18
I took the extra column approach a while ago with a very popular site. I think the payoff is worth the extra column.
#19
Just in case anyone needed further confirmation, I just applied this patch, and now autocomplete is actually usable. (I'm on a shared host, so it was pretty useless before.)
#20
finally had a chance to look closely at the patch itself. ;) mostly, it seems fine on a quick skim (i haven't tried testing it personally). however,
system_update_183()is obviously missing the pgsql cases, so this needs work before it could be committed...#21
@dww - did you test on HEAD or 4.7? I rolled it against 4.7 and haven't had time to apply it to HEAD.
#22
i didn't test at all. i just visually inspected that there's no 'pgsql' case in
system_update_183()in http://drupal.org/files/issues/name_lower.patchspeaking of which, if this issue is targetting the trunk (as the version currently states), that update # needs to be changed, too...
#23
-1 to the patch as it can't be tested on pgsql. Time for someone to practice some pgsql syntax! :)
Other than that, IMHO the concept is good and the extra 10% of space for 4 orders of magnitude more speed is worth it in core.
Sammy Spets
Synerger
http://synerger.com
#24
patch including pgsql update code for 4.7.
#25
This is getting rolled out on a client site today, so we'll have some reports from the wild for you soon.
#26
Or you could just get rid of the "preserve case" feature and store the usernames in lower case right from the get-go? Capitalisation or w/e can be done in the presentation layer.. The only people who'll get ruffled by this are those with 1337 nicks like "deEp ThRo4T" or "Marquis de Sade" ..
Unless I'm missing the big picture here, this is a minor feature that we can just nix..
Just a thought.
-K
#27
@Zen: I dislike that idea. It goes against Drupal's philosophy of not mangling user input.
I should be able to have the nick "webchick" and someone else should be able to have the nick "Joan of Arc" -- no amount of logic on the theme layer is going to allow both of those things to happen.
#28
Also, if this feature needs to be retained, it might be more elegant to use PgSQL's operator overloading feature to accomplish this. Not sure if it's possible, but I'll try and ask Cvbge about it. The separate column approach is IMHO rather tacky.
Cheers,
-K
#29
It would be good to research some of the other rdbms solutions out there as well and see what issues we might run into if we do an Oracle or MSSQL port. The one shining advantage of the name_lower approach is that we can eat our cake and port it too. It is absolutely guaranteed to run the same on any database we use.
I'm not against a db specific approach, especially if we can apply it to the other major systems I mentioned. But the "hackishness" or "tackiness" or "this hurts my normalized eyes-ness" of the patch shouldn't prevent us from harvesting the performance gain. One way or another, this bottleneck needs to be resolved.
#30
Looks like a 4.7 patch.
#31
The issue still exists in HEAD:
#32
i think neil's point is that the performance implications of this fix are so enormous that we could consider this a bug that deserves to be fixed in 4.7, in addition to being fixed in the trunk. i'm not going to move the version again, but i'd vote for fixing this in both branches.
#33
(subscribing) I'd wondered why the autocomplete lookup of usernames was so painfully slow. I'll try to test this patch on a copy of my 4.7 site.
#34
With this patch in place, username autocomplete is remarkably faster (qualitatively).
#35
Well at least for postgres, it looks like comment #6 is the correct way to help list problem. According to the postgres manual.
http://www.postgresql.org/docs/8.1/interactive/indexes-expressional.html
Since mysql is by default case insensitive, and does not seem to allow indexs on SQL functions. A hack around the problem, we could just remove the "LOWER()" from the where clauses in select statements before we make the query.
#36
re: pgsql -- yeah, as i linked in #12. however, instead of trying to find DB-specific code for this on every DB we currently (and in the future, might) support, we could just add the extra field. it's simple, it'll always work, we'll never have to port it, and it's 4 or 5 orders of magnitude faster than what we have now.
#37
No one seems to be addressing Robert's comment about spoofing addresses.
If you look at this comment (http://drupal.org/node/56487#comment-107050) on login spoofing:
..we could extend the "second column idea" to be both index-able and less spoof-able.
So this indexable field would be created by lowercasing
nameand stripping all non-alphanumeric characters.Ideally, this new value could be placed in
name, with a new field calleddisplay_namewhich has the capitalization, spacing and punctuation the user prefers to be displayed in comments, etc. (As a bonus, you could allow apostrophes in the the display name for those named O'Brian). And since thedisplay_namecan be "folded" to get thename, you could still use the display name in logins and Drupal Authentication.Of course, this would mean the ubiquitous
$user->namewould need to be changed to$user->display_name. And would affect user login validation and a slew of other components.I don't know if this is feasible, but I thought I would make the suggestion before we implement a second column.
#38
That is certainly a valuable suggestion, but I think that changing the meaning of $user->name is unlikely to find much support (at least prior to Drupal 6.0) due to the widespread breakage that would ensue. For now, it seems it would be pretty trivial to strip the whitespace, etc. from the entry in the name_lower column and that would make the system more robust.
#39
Nice suggestions. I think we'll most likely keep $user->name (as the display name), but I can definitely see stripping white space and non-alphnumeric chars from the name_lower column. This, like you said, would make it harder to spoof user names and easier to customize your own user name.
This is for a later patch, however, because there is still a lot of resistance to the idea of having a second column for lookups to begin with.
#40
i was really lost in the comments. is this final version of patch after the suggestions? and are there any bottlenecks or imcompatibilities caused by this patch?
thank you very much
#41
@Hayakawa: this patch is ready to go, if it is decided that it is an approach that we want to pursue. If someone comes up with a different approach that people feel is a better architecture, it might win out. So far, many prominent developers have expressed support for the patch, and a couple have expressed misgivings. What that means is that a better solution is hoped for. But if you want to use the patch, I feel confident that it will work well for you. It is being used by the biggest client that I work for on their production site and has caused the performance gains that are hoped for.
#42
If Robert's latest patch can be finalized and put into Drupal 5.0 (or even 4.7), I would recommend that we do it. It provides (right now) a substantial performance improvement.
And for those developers who are reticent to add a new column simply to get a performance gain, think of it this way...
In Drupal 6.0, the new column could be a trifecta for these issues:
#43
Stripping the lower column of whitespace is a neat idea, but could break existing sites. It would be cool if we could add this for new sites though.
One possible approach would be to only use literal lowercasing in system_update_N() to create the column, but to insert the stripped version for any new users from then on. Their usernames will be checked with the strict restrictions too.
That way we don't have to provide an option and we don't break any site.
#44
@Steven - I'm not sure that would work, since you need a apply a function to lower case and strip whitespace before you run the query. If you don't know in advance whether or not to strip, that won't work.
Is it possible to provide a pre-update tool to check whether there will be conflicting usernames after the whitespace stripping? Or is there any way to provide an interactive process during the update?
#45
FYI: Correct syntax for creating pgsql indexes for drupal tables is:
CREATE INDEX {table}_fieldname_idx ON {table} (fieldname)
#46
needs a 5.0 port.
but further, this is one of those updates which is difficult in our update syste, once you update your code, you can't login. we handled that at one time by having people manually update their sessions table. might be needed here too.
#47
I rerolled for HEAD, was surprisingly easy, only hunk failed for user.module.
Why you can't login? I do not see it.
#48
sorry, i was unclear. if a user is not logged as uid=1 after updating source, he is unable to login and perform update because the new column doesn't exist. however, i tested setting $access_check = FALSE in update.php and that works so is a good workaround for this situation. we'll just have to document this in the upgrade notes for 5.0
i tested chx' recent patch for functionality and update path. works well. rtbc.
#49
I still think this is not the cleanest approach. Also, I am going to benchmark this before it can get committed. I'm tempted to postpone this, as I think we might be able to do a better job with this ...
Changing the status from 'critical bug report' to 'normal task'. This doesn't break your website and therefore isn't critcal.
#50
Also changing the status to code needs work. Will update the status once benchmarked.
#51
Let me know if there is a way I can help with the benchmarking.
#52
Moving out of the "x.y.z" queue to a real queue.
It doesn't appear to be a new feature, so I'm putting it in 5.x-dev.
#53
Let's bring this back into the light -- lower() sucks :-)
#54
I just came across a problem where users can't login, because the clause "LOWER(name) = LOWER('The Users Name') was failing when the name was in mixed case (on MySQL 4.1, Drupal 4.7.5). This seems to be related to this thread. I fixed it by simply removing the lower (for now) in user_load, trusting that the unique key will prevent duplicates that differ in case.
It seems incredible to me that a problem this severe could exist for this long (though given the number of keyboards without shift keys, perhaps it's not so incredible) . So I'm willing to assume for the moment that I'm missing something. If so, I'd appreciate it if someone could let me in on the secret. If not, should I raise the priority of this or submit it as a separate bug?
#55
Removing the LOWER made the user id case sensitive. So I changed it to
...CONVERT($key using utf8) = '%s'";This is specific to MySql 4.1.1 and higher.
The way I read the MySQL docs, the existing Drupal code using LOWER would work on previous versions of MySQL. Furthermore, I think it would work on a clean install of Drupal with later versions of MySQL, because the clean install sets users to be a varchar. It's the upgrade process that converts varchar columns to varbinary, for reasons I have yet to discern.
#56
Hi Gary,
This thread is about a proposed performance improvement. And while the patch does remove the
LOWER()MySQL function call, it's really unrelated to the bug you are experiencing.Please, submit a new bug so that we can tackle both of these issues separately.
Thanks!
#57
Revisiting Dries's idea of making pgSQL case insensitive. It's possible to override operators in PostGreSQL (so I've heard). If it would be possible to make the = operator case insensitive for varchar comparisions, maybe that would be better solution as it would work for all database columns and keep the entire problem contained in the pgsql database implementation.
From http://www.houseoffusion.com/groups/cf-talk/thread.cfm/threadid:49827:
So it looks like we have some questions to answer here. They might not indexable, but surely this will offer a performance improvement over the LOWER() method currently in use which kills the indexes anyway for all database types.
#58
huh? ;) read the definition of the === operator you just pasted above:
SELECT LOWER($1) = LOWER($2)by defining your own case-insensitive equality operator, you're just hiding the LOWER() calls, not removing them.
#59
@robertDouglass
Are you sure the query cache is not a factor in the improved results at the top of this page? Try restarting MySQL between tests and see.
@All
This approach is one of the no-nos in database design. The reason is that you have data redundancy and one field is a product of the other in the same row. Things get out of hand if one part of the application would update one field but forget to update the other, and then you have data integrity issues and all sorts of problems.
This may not be such an issue if we only have one place for the INSERT and one place for the UPDATE of the name. But if we have more then we are just opening ourselves to trouble in the future.
One approach that allows a new column, yet ensure its integrity, is if we had triggers: this would automatically update the name_lower field every time we have an insert/update to the row, and would be transparent to the application. But these are only available in MySQL 5.0.2 onwards.
The other option to solve this is to get rid of the LOWER()s in SQL altogether. The database is case insensitive anyway, and uniqueness is guaranteed since we have a unique index on name. We can do the lower part in PHP on the name before we pass it to db_query().
#60
I mean instead of this:
<?phpdb_query("SELECT uid FROM {users} WHERE LOWER(name) = LOWER('%s')", $name);
?>
We do something like this:
<?phpdb_query("SELECT uid FROM {users} WHERE name = '%s'", strtolower($name));
?>
Can someone benchmark this?
#61
I agree with Khalid. Do this at the PHP level with strtolower(), the database should not store 2 representations of the same thing, it's going to cause problems eventually.
#62
Yeah you're right, I should read code more carefully. Shoot, another option down the drain :(
#63
The problem with kbahey's approach is that the user table stores the data including mixed case usernames. Using strtolower() in PHP and then searching will cause any username containing an uppercase letter to fail when using pgSQL whose comparisons are case sensitive, the original cause of all the trouble in this issue.
#64
But Robert's original test and patch was on MySQL.
The motivation was bad performance on MySQL because of the lack of using the index since the name was wrapped in LOWER().
#65
Until reliable triggers are available in MYSQL, it seems very hard to think of a solution other than storing the lower-cased username as another column. While this may not be a great practice, it's database-agnostic.
Even if Drupal was pushed in MySQL 5.0, it's not clear that most users would be able to create a TRIGGER:
http://dev.mysql.com/doc/refman/5.0/en/create-trigger.html
#66
I'm still in favor of the extra column, but rerolled to remove whitespace as was suggested earlier in the thread. Anybody is welcome to beat me to it, since I won't be doing it this week.
#67
How do you deal with the update path then? by removing the white space, previously valid usernames may collide.
#68
I still don't like an extra derived column. It is not good practice and a whole can of worms.
Here are some benchmarks using MySQL 5.0.22.
I used today's HEAD and created 10,000 users using devel.
Then I ran two queries, searching for a user called "paritoh" (lower case p in the db):
$sql = "SELECT * FROM {users} WHERE LOWER(name) = LOWER('%s')";db_query($sql, 'Paritoh'));
This is how it is in core today.
Result: Time=22.491 Rows=1 SQL=select * from users where LOWER(name) = LOWER('%s')
$sql = "SELECT * FROM {users} WHERE name = '%s'";db_query($sql, strtolower('Paritoh'));
This is using the case conversion outside the database:
Result: Time=0.395 Rows=1 SQL=select * from users where name = '%s'
56X orders of magnitude improvement.
So, for MySQL, the performance problem is solved by doing the case conversion in PHP, with no backward compatibility issues or the need for a derived column.
This leaves PostgreSQL, which is case sensitive.
By using database abstraction, we always have compromises such as lowest common denominator, or optimizing for one but not the other.
Alternatives:
- Store names in lower case. They are already unique and have a unique index, so no data issues are there, but will be offputting to users to see their names change case. In this case, we can still use LOWER in the SQL, but like so
SELECT * FROM {users} WHERE name = LOWER('%s')which works with all databases, and is as fast as the PHP strtolower() for MySQL.- Do some magic in the abstraction layer to rewrite the SQL for each engine. For example, if we have LOWER(name), we make it just name for MySQL.
- Other alternatives?
#69
http://drupal.org/node/105161 and http://drupal.org/node/181625 were duplicates.
#70
oops, setting status back, too many tabs.
#71
Coming back over from http://drupal.org/node/181625
I am setting this back to critical and back to bug report.
From firebus' report (I've been working with him), not fixing this made nanowrimo.org unusuable with 800 logged in users; fixing this got him up to 1400 logged in users and a slow but usable site.
This is a critical performance problem.
#72
Subscribing
#73
I would side with merinofchaos, as posted in another issue, one single term query on NowPublic runs for 2.42s w/ LOWER and 0.02 without. So, I would argue Dries' earlier
-- while the website does run but if its so slow that it can't be used, then IMO that's a broken. The patch is not difficult. The update is not nice but that's life.
#74
Doh! Forgot the indexes.
#75
forgot access table as I only search modules. I do not think there is any point keeping the mask case sensitive it's not the usual user submitted data that can not be altered. So, I simply lowercase that and not keep a lower_mask.
#76
Give a simple scan about the patch in #74 and sounds beauty: including both upgrade and normal queries patch.
BTW, seems not yet include the handling of drupal_is_denied() (http://drupal.org/node/174025)? It is also a critical issue that affected by non-indexed LOWER() handling. Should we also take care of that within this issue? May be also handled by 2 columns? One for compare (lower case) and the other for admin UI (natural case) ?
#77
hswong3i says, I should warn the admin about the mask change. Be it.
#78
Applied #77, installed a new site into a fresh database, and on completing installation, user 1 was not logged in. Trying to log in gives this error: "The username admin has not been activated or is blocked."
Also, the indexes were not created. (looks like they were only added to the upgrade path)
#79
Sorry, please hold on for a while! A simple suggestion: should we split the user "real name", "account name" and "user ID" as 3 different type of fields? Let's take vexim (http://silverwraith.com/vexim/) database
userstable schema as example:user_id: as like as ouruid, auto_increment.localpart: the local part of email address (e.g. localpart@domain). It is always unique, and restricted with email local part restriction:[a-z0-9_-]. Similar as ournamefield, plus lower case limitation.realname: An additional field with NO strict limitation. User will able to input whatever string as their display name.So my suggested changes:
uid: unchange.name: field keeped, and translate all account name into lower case. As like asname_lowerproposed in chx's patch.realname: new field, free typing. As backward compatible, we can simply copy the existingnameinto this new field.Benefit include:
nameand keep its function. Contribute developers should not be confused by the lower case transformation.realnameis much more meaningful and useful. E.g. I always use profile.module and create some field similar as this nature for user full/real name.name_loweris only meaningful when compare, but no other benefit. On the other hand, contribute developer may feel confusing: which field should use for which cases? Should I always usename_lowerduring compare? Or combine the use of both field? Or some other style? We need to document the change clearly.Yes, this suggestion is a bit complicated when compare with chx's patch: patch in #77 is much simple and direct, and should be functioning with no question (may be we need some more debug, but this shouldn't be difficult). This suggestion may require for more changes, e.g. most UI should update and display both user real name and account name. But this is not a new idea: most forum system supported for this style.
Please feel free to work hard with chx's
name_lowerpatch, it is target for today. BTW, my suggestion shouldn't be too complicated, and is suitable for a long term development. If it should belongs to D7, I will come back for this afterward, and focus on chx's work for the needs of today :)#80
@#79: While I agree with a need for real display names, it seems like something that would complicate this issue. Whatever the outputted username is, will be the one people search for and must therefore be translatable into the lowercase version. So we cannot have a display name of 'John Doe' and then a username of 'jdoe' since other users would end up searching for the wrong name.
Comment #37 seems like a much more workable idea. And probably something that should be taken into consideration in support of the above patch -- even if we intend to implement it at a later date. If this is the case, maybe the naming of the field should not be as specific as 'name_lower', but should be something more generic like 'name_unique'.
#81
From #79: "This suggestion may require for more changes, e.g. most UI should update and display both user real name and account name. But this is not a new idea: most forum system supported for this style."
Yes, this would solve the problem I raised above. However, it will do so at the cost of the UI. Displaying both names would be confusing.
#82
I think hswong's suggestion is when someone enters "John Doe" as their username on registration, behind the scenes, "john doe" is stored as the user.username field, and "John Doe" is stored in the user.realname field.
How does all this lowercase business work with accented letters, Japanese characters, etc?
#83
@webchick: This suggests extra lower case column for accented characters so shouldn't be an issue in this case.
Japanese doesn't have a concept of lower case, so at the db level I assume it makes no difference at all (could only find a mysql internals list discussion from 2000 to support this, and it wasn't directly related).
#84
For clarify my idea (http://edin.no-ip.com/html/?q=node/346), here are some useful screenshots from vexim and eGroupWare:
http://edin.no-ip.com/html/files/vexim_adminuseradd.png
http://edin.no-ip.com/html/files/vexim_adminuserchange.png
http://edin.no-ip.com/html/files/egroupware_edit_user.png
http://edin.no-ip.com/html/files/egroupware_setup_authentication.png
The proposal (split field into
realname,nameanduid) is not too complicated, backward compatible, and widely adopted by many other systems. Should this be one of the possible solution?#85
May be we should restrict "account name" (
name) as strict as email local part:[a-z0-9-_.], and leave "real name" (realname) as free as possible. This can simplify the checking (whatever UTF8 or not, we only accept simple text input), and also extensible if we may adopt SMTP authentication backend.#86
hswong3i - username currently accepts most UTF-8 characters - including Japanese etc., your suggestion would be a regression if it restricted to [a-z0-9-_.], and I really think the "real name" idea should be left for D7 - we already have profile.module for this.
Looks like #77 needs schema declaration updated in .install, will try to reproduce John Morahan's bug later on.
#87
I'm using MySQL version 5.0.22-Debian_0ubuntu6.06.5-log - if that helps.
#88
@catch: agree. The idea of restricted name ([a-z0-9-_.]) may be too new and also too late within D6 dev cycle. We may need more time and discussion for this additional issue. I am just exploring about its possibility.
BTW, It should be a good idea if we employ the pair up of realname + name, rather than name + name_lower. This should be more meaningful, and backward compatible: no harm to contribute developer even they don't update the old LOWER(name) handling, will not break existing code but just without performance boost.
#89
As I said earlier in #68, the problem does not exist in MySQL, and we can overcome it with a simple code change that does not involve any schema changes.
Again, I am objecting to chx's approach of derived columns that only hold a variation of an existing column. These are just trouble down the road, and we will pay dearly for them in maintenance and complexity.
If Postgres needs that, then why not limit the complexity to it?
As to hswong3i's suggestion, I need clarifications on:
- The local part of the email may not be unique. There can be someone@hotmail and someone@yahoo with the same "someone" part, but different people. Unlikely but possible.
- Your scheme enforces that the login name ('name' in your writeup) is always lower case. This means that now we cannot have 'Dries', or 'Amazon', but only 'dries' and 'amazon'? That may not be acceptable to some people ("Why can't the login name be what I want it to be?") Also, what about UTF-8 characters? How would Jose/Andre with accents on the e be handled? Strip those as well?
#90
your suggestion to use a different schema for pg users is is more complex than the extra column, IMO. what about al lthe other DBMS? you have objected without proposing an acceptable alternative.
#91
@moshe: totally agree. We shouldn't have a hybrid solution which is case-by-case and database-by-database dependent.
@kbahey: local part is not unique, just as like as it is not unique when using drupal.module within D5.3. SMTP authentication should always binded with A authentication server. In case of eGroupWare, we can define the "Mail domain (for Virtual mail manager)", so user will able to use both "username@example.com" and "username" if Mail domain is set as "example.com" (Sorry this topic shouldn't belongs to here, I will create a new one for D7)
On the other hand, login name is restricted, but not realname. Realname allow UTF-8, with no strict limitation. The idea is similar as the "Full name" field which is used within drupal.org but created by profile.module. Realname should be something always binded with user, no matter how the drupal site is used for (in general speaking...). It is not necessary to split this essential field into a non-standardized profile.module handling.
#92
realname whatever it is , too big for D6 and does not belong to this thread. Bring that somewhere else. (So much banter and noone fixed the patch!)
#93
Found the login bug: there was a missing
break;in the switch statement in user_save(). Leaving as CNW because this still needs the schema update.#94
Also, user_save() does not populate {users}.name_lower for newly created accounts.
#95
Now it does.
#96
Added the schema change, and added placeholder-for-uid-1 to {users}.name_lower in system_install(), so as not to break the new unique key.
#97
Ok so I patched a fresh checkout and installed with no issues. offset with comment module but I think my tarball is a little out of date.
Made a new user manually and logged in with them, then logged out and back in with the admin, all fine. Posted nodes and comments, did some user admin etc. no problems there either. Also checked the db and the name_lower column and respective index were there as expected.
Looking pretty good!
#98
missed a bit.
#99
Fixed some grammar issues.
The tests for lowercase masks in the access rules UI were incorrect.
#100
The index on user.name needs to go, but this patch looks great otherwise.
#101
Like so?
#102
Seems user_autocomplete() got lost somewhere along the way
#103
@moshe and @hswong3i:
My comment on MySQL, is that 95% of Drupal sites use MySQL already (we can debate the figure separately). So for the sake of the 5% we are introducing a derived column, denormalizing the schema, which is not needed by those 95%. Dowing a strlower() in PHP solves the problem for MySQL.
I don't use PostgreSQL, nor my clients do, and hence have no need to solve that problem. Someone who uses it can scratch that itch.
The issue with using derived columns is that they have to computed every time the original changes, and the probability of errors because they get out of sync is there. Yes, the UPDATE and INSERT can be in only a few places, but there are modules that can mess this up easily.
Anyways, it seem that I am the only one voicing this concern. I wish we can poll a PostgreSQL expert on what options are there before we see this as the ultimate solution to this problem.
#104
Sorry that since drupal_is_denied() is more high priority than user.name (drupal_is_denied() is called for EVERY page generation, whatever deny or not), I adopt both killes's patch (http://drupal.org/node/174025#comment-317309) and chx's patch comment (http://drupal.org/node/83738#comment-332240), and shift the focus of drupal_is_denied() back to its own issue (http://drupal.org/node/174025#comment-332305).
drupal_is_denied() should delay no more, and shouldn't wait for user.name :)
#105
I folk this issue and handle with what I proposed in #79: http://drupal.org/node/188734. That issue should not be as simple as name_lower, but should be more complete and elegant :)
#106
hmm... not sure if someone checked this already... there's a way to extend PostgreSQL by creating new data types, so it is possible to create a "Case insensitive text data type". Here's a project that does this:
http://gborg.postgresql.org/project/citext/projdisplay.php
#107
denormalizing is not for speed purposed, next version we can even put in triggers, about citext, well, that's not civarchar so it's somewhat problematic.