Replace LOWER() with db_select() and LIKE() where possible

catch - July 8, 2008 - 13:27
Project:Drupal
Version:7.x-dev
Component:user.module
Category:bug report
Priority:critical
Assigned:Unassigned
Status:needs work
Issue tags:Performance
Description

This patch attempts to address the issues at #188734: LOWER(name) queries perform badly; add fullname and restrict name as lowercase., #83738: LOWER(name) queries perform badly; add column name_lower and index., #102679: Add a Display Name field to core in addition to Username and #181625: LOWER() is slow and can not be indexed - but it since it doesn't fit that neatly into any of those issues, takes a new approach, and they've all got quite long discussions already about different approaches, I'm hoping a new issue will be OK. However, if you're planning to suggest a radically different approach to this, please read those issues first - since many different approaches have been discussed and rejected in those issues already. Thanks :)

What this patch does:
Adds users.username to {users} table in addition to user.name
Modifies user queries to use that table to avoid expensive LOWER() queries, however $user->name is returned just the same as now.

What it doesn't do:
Change anything in the UI or break backwards compatibility for either site owners or modules which use $user->name

What it will allow us to do:
Populate $user->name to anything they like via a contrib module or followup core patch either programmatically or via form_alter on user forms - they'll need to also handle validation and searches for custom user names if they do so - since there could be various approaches, it's best that complexity is left out of this patch.

So, we get a big performance win, and the flexibility to have more variations in user name handling down the line in either core or contrib.

Current version of the patch causes 6 failures in UserRegistrationTestCase() (starting with user_check_password(), which ought to be unrelated), but marking as needs review for general concept.

AttachmentSizeStatusTest resultOperations
userlower.patch7.86 KBIdleFailed: Failed to apply patch.View details | Re-test

#1

Senpai - July 8, 2008 - 18:27

Subscribing for a review.

#2

Arancaytar - July 8, 2008 - 20:05

You didn't mention whether {user}.username will be constrained to lower case, but I presume that that's how you got rid of the LOWER() query?

I'll review and run some test cases...

#3

Arancaytar - July 8, 2008 - 20:58

I'm getting the same 6 fails in registration, plus 25 fails in uploading a picture (haven't compared with the unpatched site yet). I'll investigate a bit.

#4

catch - July 8, 2008 - 21:22

{user}.username will be constrained to lower case, yeah.

#5

Arancaytar - July 8, 2008 - 21:23

False alarm on upload. It's just curl being stupid again, returning zero-length responses. When I'm done with this, I'll rewrite my copy of simpletest to use drupal_http_request; it's really completely unusable as it is for me.

This is the content of a debug file for user_check_password:

Password: foo
Account:  {"uid":"3","name":"simpletest_zyFE","username":"simpletest_zyfe","pass":"$P$Cy92GxnBfhOWNhWMZJq7wXKTQ3J13I.","mail":"simpletest_zyFE@example.com","theme":"","signature":"","created":"1215551808","access":"0","login":"0","status":"1","timezone":"3600","language":"","picture":"","init":"simpletest_zyFE@example.com","data":null}
Old hash: $P$Cy92GxnBfhOWNhWMZJq7wXKTQ3J13I.
New hash: $P$Cy92GxnBfSETyVzJNGe2gNaGPEMKBQ1

Password: 4xB6FEL37E
Account:  {"uid":"3","name":"simpletest_zyFE","username":"simpletest_zyfe","pass":"$P$Cy92GxnBfhOWNhWMZJq7wXKTQ3J13I.","mail":"simpletest_zyFE@example.com","theme":"","signature":"","created":"1215551808","access":"1215551816","login":"1215551816","status":"1","timezone":"3600","language":"","picture":"","init":"simpletest_zyFE@example.com","data":null,"roles":{"2":"authenticated user"}}
Old hash: $P$Cy92GxnBfhOWNhWMZJq7wXKTQ3J13I.
New hash: $P$Cy92GxnBfI76LOCSfUyxhbe1n0no1K1

Password: 4xB6FEL37E
Account:  {"uid":"3","name":"simpletest_zyFE","username":"simpletest_zyfe","pass":"$P$Cy92GxnBfhOWNhWMZJq7wXKTQ3J13I.","mail":"simpletest_zyFE@example.com","theme":"","signature":"","created":"1215551808","access":"1215551816","login":"1215551816","status":"1","timezone":"3600","language":"","picture":"","init":"simpletest_zyFE@example.com","data":null}
Old hash: $P$Cy92GxnBfhOWNhWMZJq7wXKTQ3J13I.
New hash: $P$Cy92GxnBfI76LOCSfUyxhbe1n0no1K1

I'll look at _crypt_password next...

#6

pwolanin - July 8, 2008 - 22:37
Status:needs review» needs work

bug here:

+    else if ($key == 'username') {
+      $query[] = "username = LOWER '%s')";
+      $params[] = $value;
+    }

#7

catch - July 8, 2008 - 23:14
Status:needs work» needs review

Fixed that bug. Also spoke to pwolanin about this in irc, and he suggested a way to avoid drupal_strotolower in user_save() - which means that all the lowercasing is handled by the database now instead of some php, some database.

Still 6 fails in user registration tests. Thanks Arancaytar for the debugging :)

AttachmentSizeStatusTest resultOperations
userlower.patch7.86 KBIdleFailed: Failed to apply patch.View details | Re-test

#8

pwolanin - July 9, 2008 - 01:55

perhaps this line in the test should change too?

+    $this->assertEqual($user->username, drupal_strtolower($name), t('Username matches. '));

#9

Crell - July 9, 2008 - 04:14

Bless you. :-) Visual review:

- I still do not like not having a unique constraint on the display name (nee name). As coded, what would happen if someone set their name to "Crell" and someone else to "crell"? The unique key is on username, which is the case-folded version. So what would happen is username would get set to "crell" for both of them, which would create an SQL error, wouldn't it?

- This code:

<?php
// Set the username to a lower case version of $user->name if not set.
+  if (empty($array['username'])) {
+   
$array['username'] = db_result(db_query("SELECT LOWER('%s')", $array['name']));
+  }
?>

Why are you using the database for the lowercasing? Would its result be any different than doing it in PHP space and avoiding the extra query?

- So then for a contrib module to modify the name value, it would need to:
+ form_alter to show "username"
+ form_alter to NOT show "name"
+ inject its own validate and submit hooks *before* the default ones to set the "name" field by its own logic.

Or alternatively, do the same in hook_user validate/save. Right?

Otherwise I think this is a good direction at this point.

#10

catch - July 9, 2008 - 08:38

@Crell: I've added the unique constraint back to users.name - if a contrib module wants to handle that other issue that's been discussed elsewhere, they could remove the unique key via schema API.

Using the database for lowercasing exclusively was pwolanin's idea, which I reluctantly agree with. There's a chance that certain collations on certain databases are going to operate differently to mb_strtolower (I don't have enough practical experience with collations to know how much of a risk this is) - but by using the database we ensure consistency. Also the database is only being used for lowercasing on writes now.

Additionally, php.net suggests that LOWER() is actually faster than mb_strtolower, so for micro-optimisation there's probably not much in it.

Contrib needs to do exactly what you suggest, yes.

I'll probably update user.test to attempt creating users with siMiLar usernames if that's not in there already. Still at a loss as to why the password hashing is failing.

AttachmentSizeStatusTest resultOperations
userlower.patch7.84 KBIdleFailed: Failed to apply patch.View details | Re-test

#11

Dries - July 9, 2008 - 09:37

Personally, I'd find it very confusing if my display name was 'Dries Buytaert' but my login/username was 'dries'. I'd consider that to be a serious usability regression. Unfortunately, this patch seems to take us down that road and I'm not a big fan of that. Maybe I misunderstood where this was heading though.

#12

Arancaytar - July 9, 2008 - 10:08

I would argue that it makes sense for login names to be permanent and displayed names to be changeable. The big problem with changing login names is that it is hell for the password manager - you don't necessarily want a different login each time the displayed name changes...

#13

Damien Tournoud - July 11, 2008 - 08:53

Edit: I misunderstood the scope of this patch.

@Dries: the new 'username' column is just pregenerated to 'LOWER(name)', in order to avoid a LOWER operation during SELECTs (that can't be indexed).

There is no change whatsoever here on the UI or user-experience part.

@others: would it be possible to improve that patch a little by removing spaces from the username column? That way you will not be able to register both 'Dries Buytaert' and 'DriesBuytaert'.

Did a quick visual review of the code, and it looks great.

#14

catch - July 9, 2008 - 10:46

@Dries, my primary concern here is to remove the LOWER() in user queries. Robert's comparisons from 2006 show the difference in performance with 500k users. And the queries haven't changed since then:

in a typical case we're comparing 1273.97 ms to 0.37 ms

I agree 100% on the 'display name' issue, however, there's a lot of demand for it (including from people who wouldn't argue for it without good reason like Crell and Webchick). This issue is an enabler for that - either in core or contrib, but I'm absolutely not going to introduce any explicit support for such a thing with this patch - there are zero API or UI changes here. The only connection is that this patch doesn't introduce the extra column in such a way that it would block the other issue (by requiring yet another column or something).

@Damien - existing sites might already have a Dries Buytaert and DriesBuytaert - which'd give us a duplicate key error on update and lock one of them out. This patch currently has zero effect on the username policies of any existing sites, and I'd like to keep it that way.

@Arancaytar: #102679: Add a Display Name field to core in addition to Username is over there ;)

#15

Arancaytar - July 9, 2008 - 14:19

Sorry, I misunderstood the scope of this patch.

How severe should we make this uniqueness constraint, as in how much should be stripped from the username field? Beyond spaces, we could strip everything but letters, so that "Dries Buytaert02345" would be unavailable as well. However, the stronger the constraint, the more problems people will have when updating their databases that may violate the new rules.

#16

Crell - July 9, 2008 - 15:01

@Dries: There are actually extremely good reasons to allow that distinction. In my case, I can't control the login name people choose when signing up. However, nobody on my site knows each other by their login name. We actually use real mother-given names for everything. Then when I try to select someone's name using the auto-complete author field, how do I know that "Dries Buytaert", who I'm trying to set as the author of a node, is really "trkfan849"? I don't. Allowing these fields to be split is a usability improvement, because right now if you use theme_username then you get a disconnect between user's identifying strings in different contexts. That said, this patch as pwolanin said doesn't change the default behavior other than silently setting up a pre-lowercased username for performance. I suspect Drupal.org would never override that.

@pwolanin: If that's the reason for using LOWER() in SQL, that should be documented in the code.

Also, having written a D5 module that tries to custom-set the login name field, I can tell you that it is not as easy as that to set a custom value there because the user workflow sucks. :-) I'd have to try it with this mechanism, but let's be careful that "just form alter it" isn't our answer unless we know that will actually work. Anyone have time to write up a demo module as proof of concept? (I wish I did, but I probably don't.)

#17

catch - July 9, 2008 - 15:16

The unique constraint is exactly like it is now - a unique index on the column and form validation to avoid duplicates with case variations. This means you can't have Dries and dries, but you can have Dries1 etc. - again, exactly the same as now. Anyone wanting to force more unique usernames also needs to find another issue to do so in, since it's not within scope.

#18

catch - July 9, 2008 - 15:26

@Crell. I've added documentation to both user.module and user.test to explain the justification for using LOWER(). If someone wants to come along and tell me it's not necessary, that's fine too.

AttachmentSizeStatusTest resultOperations
userlower.patch8.14 KBIdleFailed: Failed to apply patch.View details | Re-test

#19

JohnAlbin - July 9, 2008 - 19:51

I think this patch also dovetails nicely as an excellent first step on the road to "display name" capabilities in core (and that’s the last I’ll say about it in this issue.) Given how out of control the others issues have become, we definitely need to limit the scope of this issue.

Catch, are you re-rolling the patch after your IRC chat with chx? Or did I misread that discussion?

#20

catch - July 9, 2008 - 19:34

Following discussion with chx, we keep LOWER() within the database during user_save(). I've also modified user.test to use a real SELECT query as well.

Also, this version of the patch passes all user tests now (as a result of that change), and doesn't harm any other tests either.

AttachmentSizeStatusTest resultOperations
userlower.patch8.5 KBIdleFailed: Failed to apply patch.View details | Re-test

#21

catch - July 9, 2008 - 20:34

Last patch had some silly errors in it, this one should be good.

AttachmentSizeStatusTest resultOperations
userlower.patch9.46 KBIdleFailed: Failed to apply patch.View details | Re-test

#22

dahacouk - July 11, 2008 - 08:43

@catch: Well done for coming up with this approach! Will your patch make it easier for a contrib modules to add non-unique display name functionality (emphasis on "non-unique")? In other words will $user->name (the display name) be able to be made non-unique if desired?

@Dries in #11: There are many webadmins calling out for display names being distinct from login names. There are many hacks being produced for what, some see, as a core function of a web site: being able to display a real name as different from their login name. This scenario would be optional for webadmins. And there's no reason why the current situation (of just one display/login name) couldn't be maintained too. But it's good to get some standardisation for this functionality into core so that developers can start to deliver more consistent solutions. Have a peek at #102679: Add a Display Name field to core in addition to Username to see more people wanting display name functionality in core. Sure, this patch doesn't go all the way but it's a start.

@Arancaytar in #12: Why is the "big problem with changing login names... hell for the password manager"? Surely, before a login name change request is actioned, the password manager just has to check that there are no duplicates. How is that "hell"? I say keep things flexible unless there's a real security issue or big hit on performance.
-1 to permanent login names

@Damien Tournoud in #13: Why "removing spaces"? If you are talking about "removing spaces from the username" - as in display name - then we wont be able to mimic real names. So, I see no advantage here.
-1 to removing spaces from the username column

@JohnAlbin #19: Nicely restrained! ;-) You're right, it does seem that we need to do this in small steps. Looks like we'll get "display name" capabilities in core around Drupal9. ;-)

Cheers Daniel

#23

Damien Tournoud - July 11, 2008 - 10:18

@dahacouk: I misunderstood the scope of this patch at first, please don't broaden it. This patch will ease the job of contrib modules to implement display names distinct from login names. It is only a first step in that direction.

Here is an updated version of catch#21:

  • Change queries in user_save() to use uid not name as the key,
  • Change user_is_blocked() to use the username, (this needs to be discussed)
  • Fix query in user_search(),
  • Fix _user_edit_validate() to allow modules to implement not distinct names.
AttachmentSizeStatusTest resultOperations
userlower-5.patch10.16 KBIdleFailed: Failed to apply patch.View details | Re-test

#24

catch - July 11, 2008 - 10:50
Title:Add column 'username' to {user}, remove LOWER(name)» Performance: Add column 'username' to {user}, remove LOWER(name)

Changing the title to be more descriptive.

I checked coverage.cwgordon.com and noticed that some of the user queries aren't being tested yet (i.e. user search), so this adds a test for search/user to verify they work as expected.

I'm not really set up for benchmarks, but will see if I can do some basic comparisons.

AttachmentSizeStatusTest resultOperations
userlower.patch12.39 KBIdleUnable to apply patch userlower_5.patchView details | Re-test

#25

JohnAlbin - July 11, 2008 - 11:03
Title:Performance: Add column 'username' to {user}, remove LOWER(name)» Add column 'username' to {user}, remove LOWER(name)
Status:needs review» needs work

Visual review:

  1. user_is_blocked still has a $name variable in it. Should be $username.
  2. In _user_edit_validate, it should be $has_username = !empty($edit['username']); note the missing “!”.
  3. Also, Damien, I think your additions to _user_edit_validate to allow non-distinct names is scope creep.
  4. So I’m looking at the un-patched user_is_blocked and I’m wondering how WHERE status = 0 AND name = LOWER('%s') even works; shouldn't that be WHERE status = 0 AND LOWER(name) = LOWER('%s')?

#26

Damien Tournoud - July 11, 2008 - 11:11

@JohnAlbin:

1. user_is_blocked still has a $name variable in it. Should be $username.
2. In _user_edit_validate, it should be $has_username = !empty($edit['username']); note the missing “!”.

Indeed, thanks.

3. Also, Damien, I think your additions to _user_edit_validate to allow non-distinct names is scope creep.

We are not far from being able to drop the unique constraint on {user}.name, why not following the route a little further? At the minimum, we need to only validate the uniqueness of {user}.username, because if we don't, contrib modules will not be able to drop the constraint themselves.

4. So I’m looking at the un-patched user_is_blocked and I’m wondering how WHERE status = 0 AND name = LOWER('%s') even works; shouldn't that be WHERE status = 0 AND LOWER(name) = LOWER('%s')?

It works on MySQL because the default collation is case-insensitive. Obviously there is a bug here.

#27

catch - July 11, 2008 - 11:16
Title:Add column 'username' to {user}, remove LOWER(name)» Performance: Add column 'username' to {user}, remove LOWER(name)
Status:needs work» needs review

Crossposted with Damien, and looks like JohnAlbin crossposted with me. #23 introduces 5 test failures which aren't in #24 and also uses uid as a string in user_save.

Rolled the user_search fix from #23 into this one, which otherwise is identical to #24. I'm not knee-jerk against the other changes, but please no test regressions or scope creep.

AttachmentSizeStatusTest resultOperations
userlower.patch12.44 KBIdleFailed: Failed to apply patch.View details | Re-test

#28

catch - July 11, 2008 - 11:19

Or typos ;)

AttachmentSizeStatusTest resultOperations
userlower_8.patch12.48 KBIdleFailed: Failed to apply patch.View details | Re-test

#29

JohnAlbin - July 11, 2008 - 11:30
Status:needs review» needs work

Trying VERY hard not to cross post again…

My review in #25 was of Damien's patch in #23. :-p

We are not far from being able to drop the unique constraint on {user}.name, why not following the route a little further? At the minimum, we need to only validate the uniqueness of {user}.username, because if we don't, contrib modules will not be able to drop the constraint themselves.

Again, Damien, this idea is scope creep. Dropping the unique constraint on {user}.name is highly debatable (even for contrib.) But the debate is out-of-scope for this issue. We can revisit your proposed change to _user_edit_validate in a follow up issue. :-)

Catch, I think Damien is correct; the user_save db queries do need to do WHERE uid = %d instead of WHERE name = '%s'", $array['name']. That’s what user_save uses currently: http://api.drupal.org/api/function/user_save/7 Or does that cause simpletest breakage? I haven't run them as I’m still visually reviewing the patches.

Also patch in #28 has “No newline at end of file”. But that’s highly pedantic, so you can ignore me on that one if you want.

#30

Damien Tournoud - July 11, 2008 - 11:33

Dropping the unique constraint on {user}.name is highly debatable (even for contrib.) But the debate is out-of-scope for this issue. We can revisit your proposed change to _user_edit_validate in a follow up issue. :-)

Well, we can drop the INDEX constraint on the {user}.name column while keeping it on {user}.username. In the standard behavior, the uniqueness of the {user}.name column is still guaranteed (because out of the box the {user}.username column is the lowercase form of {user}.name, thus if the former is unique, the latter also is).

#31

catch - July 11, 2008 - 11:42
Status:needs work» needs review

Bug in the test. Should try to get patches right harder than I try not to crosspost.

@JohnAlbin's question in #25 - using uid instead of $array['name'] there caused simpletest breakage, yeah. Ways around this much appreciated.

AttachmentSizeStatusTest resultOperations
userlower_10.patch12.58 KBIdleFailed: Failed to apply patch.View details | Re-test

#32

Damien Tournoud - July 13, 2008 - 14:09

Catch: sorry for #23 I should have indicated that those were completely untested.

Here is a update from #31, including the change to use uid as a key in user_save().

Edit: fix bad spelling that hurt my eyes.

AttachmentSizeStatusTest resultOperations
userlower_11.patch12.85 KBIdleFailed: Failed to apply patch.View details | Re-test

#33

catch - July 11, 2008 - 12:18

#32 is fine. Thanks Damien :)

Here's a quick comparison for performance on 98,000 users generated by devel, these run from the command line:

SELECT * FROM users WHERE LOWER(name) = LOWER('nogahi');
1 row in set (0.14 sec)

Subsequent queries for different names always take between 0.13 to 0.14 seconds. query_cache is only used if the name is identical.

SELECT * FROM users WHERE username = LOWER('nogahi');
1 row in set (0.04 sec)

query_cache gets used even for different names - so 0.00 seconds for subsequent similar queries.

So we get a double improvement - 3-4 times faster on these selects the first time they're run, but the query_cache in MySQL will handle most subsequent requests.

If someone has a decent benchmarking setup, it'd be great to get more scientific data than this - or with some advice I can probably do something a bit better at home.

#34

pwolanin - July 13, 2008 - 13:51

minor concerns re: text/code comments:

The schema description text seems a little incorrect or misleading:

-        'description' => t('Unique user name.'),
+        'description' => t('User name for display.'),
+      ),
+      'username' => array(
+        'type' => 'varchar',
+        'length' => 60,
+        'not null' => TRUE,
+        'default' => '',
+        'description' => t('Unique user name for login.'),

$user->name *is* still constrained to be unique, so I think the description does not need to change at all.
The username description should reference the fact that it's a LOWER() version of $user->name used to speed lookups during login, search, etc.

+    // We use the database to do the lowercasing to ensure that there are\
+    // no discrepencies in collation handling between the database and PHP.

I think should reference character handling or character set, not collation, since I think collation just determines sort order.

something like:

+    // We use the database to do the lowercasing to ensure correct searches
+    // in case of any discrepancy in character handling between the database and PHP.

#35

Damien Tournoud - July 13, 2008 - 14:15

@pwolanin: in MySQL terminology, "collations" refers to the set of rules for comparing characters (for both inequalities and equalities). Collations for are used for sorting, for matching and for normalizing strings before putting them into index classes.

This said, I like the wording of your comment better :)

#36

catch - July 13, 2008 - 15:49

Those changes are fine, and while collation is correct, I agree with Damien that the revised version explains the motivation better.

New patch attached.

AttachmentSizeStatusTest resultOperations
user_lower.patch13.15 KBIdleFailed: Failed to apply patch.View details | Re-test

#37

pwolanin - July 13, 2008 - 17:38

to be consistent with other schema comments, this:

t('LOWER() version of $user->name for login, search performance.'),

should be written more like:

t('LOWER() version of {users}.name for login, search performance.'),

#38

catch - July 13, 2008 - 20:05

Yes. good point.

AttachmentSizeStatusTest resultOperations
user_lower.patch13.18 KBIdleFailed: Failed to apply patch.View details | Re-test

#39

R.Muilwijk - July 14, 2008 - 07:28

Tests look good to me , ran them all. It doesn't break any tests that don't fail normally. I also like the idea of not having to use a LOWER() on the database. So +1 from me!

#40

Senpai - July 14, 2008 - 20:09
Status:needs review» reviewed & tested by the community

Tested against a fresh checkout of HEAD, but not a clean DB.

I created 10 users with all manner of mixed case, spaces, symbols, and the like. All form submissions were treated as they should have been, with illegal symbols being prohibited, and mixed-case or spaces being allowed. In each instance, I checked the database to verify core's actions and they were good.

I then deleted nine of those users, and in each case the deletion happened flawlessly as it should have. I kept the last user for a Change Own Username test, and it passed that with flying colors.

This patch allows users to login with either their uppercase name, lowercase name, or any mixed-case variant as long as the $user->name field matches the same sequence of letters and numbers stored in $user->username. I don't see that as a problem, but some people might. It basically allows me to login as either 'Dries Buytaert' or 'dries buytaert' but not 'driesbuytaert' because there's a space in the original $user->name.

Simpletests
User.mod == 231 passes, 0 fails, 0 exceptions
(should we be testing for user name changes as well?)

I'm good with this patch. RTBC.

#41

dahacouk - July 15, 2008 - 06:13

Is this patch meant to enable me to have a (display name) $user->name = "Daniel Harris" and a (login name) $user->username = "dahacouk" - as I usually like to do?

Or is $user->username only just a lowercase representation of $user->name?

#42

Damien Tournoud - July 15, 2008 - 09:50

@dahacouk: this is out of this patch's scope, but this patch do open the way for contribution modules to implement what you describe.

#43

Senpai - July 15, 2008 - 23:00

Daniel, This patch creates a duplicate username of every user in the system, and automatically lowercases it. That's all it does. It's an enabler, if you will. A stepping stone to further greatness. It has to go into core before anything like the Add a Display Name to core issue could even begin to be realized.

The great thing about this patch is that it speeds up Drupal by a magnitude or two when trying to locate or manage users within the system. Once this is committed, other patches can surface to handle what actually happens to these usernames. For now, we're all going to be extremely happy when this is committed.

#44

catch - August 17, 2008 - 18:14

Re-rolled following #237381: user_save parameter poorly named

AttachmentSizeStatusTest resultOperations
user_lower.patch13.04 KBIdleFailed: Failed to apply patch.View details | Re-test

#45

JohnAlbin - August 17, 2008 - 21:53

The short description of a function's docblock should be on one line, so this updated patch only changes the docblock for user_update_7002(). No code changes from catch's patch, so credits still go to him. :-)

Re-ran ALL tests; they all still pass after the patch. “4940 passes, 0 fails, 0 exceptions”

AttachmentSizeStatusTest resultOperations
user_lower.patch12.81 KBIdleFailed: Failed to apply patch.View details | Re-test

#46

Senpai - August 23, 2008 - 15:40

Bumping this as an RRTBCTRBC (Really Really Reviewed and Tested by the Community and Thoroughly Ready to be Committed).

Commit commit commit!

#47

maartenvg - September 7, 2008 - 11:52
Status:reviewed & tested by the community» needs work

Patch no longer applies.

patching file modules/system/system.install
Hunk #1 FAILED at 366.
1 out of 1 hunk FAILED -- saving rejects to file modules/system/system.install.rej
patching file modules/user/user.install
patching file modules/user/user.module
Hunk #7 succeeded at 1545 (offset 2 lines).
patching file modules/user/user.pages.inc
Hunk #1 FAILED at 12.
1 out of 1 hunk FAILED -- saving rejects to file modules/user/user.pages.inc.rej
patching file modules/user/user.test
Hunk #3 FAILED at 468.
1 out of 3 hunks FAILED -- saving rejects to file modules/user/user.test.rej

#48

drawk - September 7, 2008 - 13:36
Status:needs work» needs review

Reroll

AttachmentSizeStatusTest resultOperations
user_lower_3.patch12.71 KBIdleFailed: Failed to apply patch.View details | Re-test

#49

catch - September 7, 2008 - 23:15

This probably ought to be re-rolled again for the new db layer changes. However, since it was at RTBC for a very long time with no comment, I'd quite like view from webchick or Dries on the general approach - which I don't expect to change after conversion.

#50

BioALIEN - September 8, 2008 - 15:45

@catch: It's a performance improvement, but I guess they want to see some more benchmarks before they commit this? What you've covered in #33 is great. Can we see some benchmarks on bigger (or even tiny) sites to better understand the performance gain?

+1 on testing this out on scratch.d.o

#51

pwolanin - September 8, 2008 - 18:47

The patch adds a new unique key for 'username' but if we are doing all look ups using the new column is there any reason to retain the unique key on user.name?

#52

David Strauss - September 8, 2008 - 19:57

@pwolanin: There is no need to keep any key/index on {user}.name. We should never be putting conditions or sort criteria on {user}.name after this patch.

#53

Crell - September 8, 2008 - 20:46

Not true. I may want to list all users alphabetically by their public name as part of a member directory. At minimum having an index on that field will keep such queries from killing the server. (I've already said my piece on the uniqueness question.)

#54

pwolanin - September 8, 2008 - 21:29

@Crell - I don't understand your comment. This patch maintains a 1-to-1 mapping of name to username.

#55

pwolanin - September 9, 2008 - 00:58

I re-rolled the patch to remove the index on name, and altered the code in other modules so all WHERE or ORDER BY should be using username rather than name.

Ran all tests - all tests pass with this patch (5431 passes, 0 fails, 0 exceptions).

AttachmentSizeStatusTest resultOperations
user-lower-279851-55.patch62.06 KBIdleFailed: Failed to apply patch.View details | Re-test

#56

maartenvg - September 9, 2008 - 10:07

It applies fine, and I get no errors. But I was wondering: although this is a database change, there is nothing for update.php to do, is that by design?

#57

pwolanin - September 9, 2008 - 11:36

If you already applied this patch before, the update might not run.

#58

maartenvg - September 9, 2008 - 13:09

Nope, that is not the case, I applied it to a fresh install.
- download fresh 7x tarball
- extract
- install
- apply patch
- run update.php

The last step says "No pending updates."

I've checked and the installation does not detect the presence of user_update_7002() (database reports last scheme to be 7001). I already came across the same strange behavior when creating a patch for #165786: DB Silliness: aggregator_item.description ought to be called body, but thought it might be my fault.

I believe this behavior was introduced when #286035: Remove update.php number dropdowns went in, not sure though.

EDIT: I've traced this back to http://drupal.org/cvs?commit=137218, therefore it is unrelated to this issue. You can ignore my comments about not functioning updates.

#59

JohnAlbin - September 9, 2008 - 16:00
Status:needs review» needs work

Yes, Maarten, #286035: Remove update.php number dropdowns is the culprit for the "no pending updates" issue. But it has just now been fixed (again).

@pwolanin: re: Crell's last comment. He meant he's already talked about the uniqueness several times. starting with #9 above in this issue and all over #102679: Add a Display Name field to core in addition to Username.

And, after a really, REALLY long discussion, we finally got everyone in #102679 to agree that name needs to remain unique.

In addition, removing the unique constraint on name isn't needed for a performance patch. Let's keep the scope focused so as not to repeat #102679.

#60

catch - September 9, 2008 - 16:22
Status:needs work» needs review

Unnecessary indexes are a performance hit on insert, there's no reason to add that hit if we don't need it. Uniqueness is enforced in PHP anyway, at least without contrib intervention, so I'm marking this back to needs review.

#61

David Strauss - September 9, 2008 - 17:07

@Crell: I do agree that a unique index on the case-preserved name would enhance data integrity a little, but it's going to be inconsistent. The index is built off the collation that DB server is using, so the uniqueness on "name" will be case-insensitive on MySQL and case-sensitive on PostgreSQL. Trying to create equivalent contraints on "name" brings us back to the problem of making a portable, case-insensitive index, which is the problem this patch is trying to solve.

#62

Crell - September 9, 2008 - 20:07

OK, I think everyone misunderstood what I was saying in #53.

1) I've said my piece on having a UNIQUE constraint on the name column, so I'm not going to say more on it here.

2) There are use cases where you will want to ORDER BY name, so having an index (unique or otherwise) on that field will help performance in those cases. So -1 on not having an index on the name column at all. That field won't be modified often enough that the extra cost on INSERT will matter, IMO.

#63

maartenvg - September 9, 2008 - 20:42

I agree with Crell on at least 2). There are enough use cases where we would want to sort on the `name` column, rather than the `username` column. Especially with the future (by core or contributions) in mind where `name` won't be a 1-to-1 mapping to `username` anymore.

#64

dahacouk - September 16, 2008 - 11:11

@JohnAlbin in #59: I am not trying to "repeat #102679" here - this is a performance patch after all.

But I must take exception to your statement "And, after a really, REALLY long discussion, we finally got everyone in #102679 to agree that name needs to remain unique."

Firstly, I did not agree to uniqueness of name so please either remove that statement for the record or qualify who exactly you mean by "everyone". Lets be accurate folks!

#102679 and this issue have different goals, remember. This issue does not solve the issue of #102679. Please don't muddy the waters.

After this patch has been accepted I'll be excited to find out how it can really help fulfil #102679.

Full steam ahead! ;-)

Cheers Daniel

#65

JohnAlbin - September 16, 2008 - 14:37

Re: catch: "Unnecessary indexes are a performance hit on insert, there's no reason to add that hit if we don't need it."

Agreed, but I don't feel this is an unnecessary index.

Since core won't be using the name column for any of its querying (just for displaying a capitalization-preserved version of username), it opens up the possibility that contrib can extend the data of that field. But actively deciding to remove the existing index means we have to fight to get it back in another issue, because (regardless if its unique or not) that column would need an index to query on it. And if the length of #102679 is any indication, getting it back will never, never happen.

And, yes, I know my argument is scope-creep. But I'm not asking for a new feature of the patch, I'm asking for you to not remove an existing index.

Also, how big is this hit for a db with a large number of users? seconds? miliseconds? http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html shows that, for a table with 320,000,000 records (everyone in the US), each index insertion would be log(320,000,000) = 8.5 slower than an empty table, but that page doesn't show actual numbers. When I tested inserting a single user into a 200,000 users table with and without a name index, both results were the same "Query OK, 1 row affected (0.01 sec)". That, of course, is nowhere near exhaustive testing, but extrapolating to a table with 320 million users means the performance hit would be somewhere less than 0.085 seconds. Unless my math is wrong, which it could be, leaving in the index seems a negligible performance "hit".

#66

Crell - September 16, 2008 - 15:04

JohnAlbin said what I was trying to say far better than I did. :-)

Let's also remember we're talking about a table that on a super-busy site like d.o has had ~300,000 inserts in 7 years. It's not like we're talking about hammering watchdog. If we lose a tenth of a second when creating a new user but avoid a full table scan for any contrib that wants to do something useful with that column, that's a huge win.

Really, removing the index is unnecessary premature optimization.

#67

catch - September 16, 2008 - 16:53
Status:needs review» needs work

I'm fine with putting the index back in. This needs re-rolling for dbtng as well.

#68

webchick - September 17, 2008 - 02:17

Holy cow, guys. ;)

Ok. My take on this patch is:

  1. The only point of this patch is to add a lowercase username column is for improved query performance. That's it. It's not for a usability improvement. It's not so we can do non-unique display names. It's not so users can have separate display names from internal usernames. It's not for enhanced security. It's not for $insert_pet_registration_feature_here. Performance! That's it. Any other related features/changes enabled by this patch should be discussed in other issues. Thanks for doing your part to keep patches to one context.
  2. It should not affect any of the existing behaviour of the registration/login system, from a user perspective. They should not see the user.username field exposed to them, and should not be required to log in with anything but their normal users.name. This was Dries's concern, and it's mine as well. The patch as it stands does not (or at least doesn't intend to; haven't tested it myself yet to make sure it works), and that is wonderful and as it should be.
  3. I'm not that fussed about index performance on inserts, to be honest. I agree that there are many use cases, particularly with Views, where pulling up a list of users sorted by users.name, so having an index on it probably makes a lot of sense. The important thing to me, again, is to NOT change the behaviour of the current code in any way.

I think that's everything that needs weighing in on here. Hope that helps.

#69

pwolanin - September 17, 2008 - 03:09

In terms of an extra and unneeded index - I think the size in memory of the index is much more of a concern than insert speed - hence I'd assert that #52 is correct. No query should be putting a condition on name after this patch, only on username. Hence we don't need the index.

#70

robertDouglass - September 17, 2008 - 09:35

I'm thrilled with the direction of this patch. Thorn in my side since 2006.

#71

catch - September 17, 2008 - 19:13

Since username and name with be identical in core, if contrib wants to try to do something funky with name, they could also add the index back via schema API. If I can find the motivation I'll re-roll with both options though if no-one beats me to it.

#72

Crell - September 17, 2008 - 22:22

admin/user/user in core allows sorting on the username. Presumably that will use "name" not "login name" in the new setup, so that when contribs do fork off the display name it will make sense. (And let's be honest, that's not the main purpose of this patch but we all know that is going to happen.) Thus, a core page will be sorting by name, not login name. Thus, we do need the index on that column.

#73

pwolanin - September 18, 2008 - 00:56

@Crell - see above - a contrib module that needs a different index can add it. Let's deal with what core needs.

#74

catch - September 18, 2008 - 01:47

Here's a re-roll to re-sync with HEAD and use named placeholders in all cases except the user_load query builder (which needs updating in a separate patch) and wildcards since I'm not sure how they work now. I'm getting ten fails with dblog.test and some in the user tests which I'll track down tomorrow. Looks like user.admin.inc was missed in pwolanin's patch to me, we should be able to use username for sorting there too really.

AttachmentSizeStatusTest resultOperations
user_lower.patch17.66 KBIdleFailed: Failed to apply patch.View details | Re-test

#75

catch - September 18, 2008 - 12:40
Status:needs work» needs review

Fixed the typo that was causing test failures and updated user.admin.inc to use the correct column for tablesort_sql. So here's a fully functional patch for review.

AttachmentSizeStatusTest resultOperations
user_lower.patch20.16 KBIdleFailed: Failed to apply patch.View details | Re-test
user_lower.patch18.82 KBIdleFailed: Failed to apply patch.View details | Re-test

#76

yoroy - September 18, 2008 - 12:43
Status:needs review» needs work

Catch polled me about what made moste sense to me in the sorting: A-Z, then a-z,or a combined Aa - Zz.
I'd strongly suggest using Aa - Zz, otherwise you basically get 2 seperate lists.

#77

catch - September 18, 2008 - 13:05
Status:needs work» needs review

Sticking back to needs review since this is the behaviour in the current patch, I think yoroy cross-posted.

#78

yoroy - September 18, 2008 - 14:04

yes, I didn't change status, must've been a crosspost.

#79

chx - September 18, 2008 - 17:42
Status:needs review» reviewed & tested by the community

This patch is a good step forward. While I am not entirely surely this is the best direction we can go, there are indeed a big number of advantages of this, namely the possibility of having a separate username (edit:), the small code footprint, the simplicity. By fulfilling this popular request, you nicely sidestepped the necessity to sync the username and LOWER(name) . For now, I think it will do. We can always do more later.

#80

webchick - September 18, 2008 - 18:17
Status:reviewed & tested by the community» needs review

So. Since the upgrade path from D6 => D7 is horribly broken right now, I /know/ none of you all tested the upgrade path. :P

Can someone please bring over their users table from a production D5/D6 site with lots of real life examples including names like "Bill O'Malley" and "SüP3RguRL27! and run user_update_7002() and report results?

#81

catch - September 18, 2008 - 19:28

hmm, I tested the upgrade path from d7 - d7 yesterday, but not 6-7. I'm also pretty sure I tested the updates before the 6-7 path got broken (last time it was RTBC), but it seems I didn't post about it and I wouldn't want to swear 100% I actually did.

#82

maartenvg - September 18, 2008 - 19:39

I did a test on a user base (about 1200 users). Everything went smooth but during the update, as was the endresult in the database itself. Although I have to admit that I have nothing as extreme as "SüP3RguRL27!, most users have numbers, there are a lot with punctuations (_.:;'^<>) and some use the occasional é and è.

Perhaps someone using a non-Latin alphabet can confirm my positive results?

EDIT: It was a user table from D5, that I copied into the D7 database (and changed the schema a bit to match that of the old D7)

#83

webchick - September 18, 2008 - 19:43

Ok, and those people with punctuation and és are able to log in okay, and are displayed properly in the user list at admin/user/user?

#84

maartenvg - September 18, 2008 - 20:35

I'm getting some conflicting results. I'll do some more rigorous testing.

#85

Damien Tournoud - September 18, 2008 - 21:09

I upgraded the whole users table from drupalfr.org (5000+) users. All went smoothly. The conversion was done properly.

But users with special characters in their names could not login... That's a different issue thou: #310447: Add back SET NAMES="utf8".

With SET NAMES back, all works as expected.

#86

xzilla - September 18, 2008 - 21:12

I'm curious if anyone has done testing to see if there is a performance regression by copying all of the usernames into a second column within the table, thereby adding additional footprint and/or i/o load to all of the other queries referencing the table? I ask because I have addressed the above problem (slow lower(name) queries) by adding a function index on lower(name) which solves this specific problem much better while not cuasing additional overhead where it isn't needed.

#87

maartenvg - September 18, 2008 - 21:37

Ok. The problems I am having are not related to this issue, but with collation and other char set problems during importing .e.d.

But when I manually create a few users with strange names, and then apply the patch, I do not get problems with displaying or editing these users.

#88

chx - September 18, 2008 - 22:30
Status:needs review» reviewed & tested by the community

Setting status based on the latest comments especially #85 .

#89

lilou - October 10, 2008 - 17:27
Status:reviewed & tested by the community» needs work

Patch #75 cannot apply against CVS/HEAD :

drupal\7.x\modules\system\system.install (Cannot apply hunk @@ 17 )
drupal\7.x\modules\user\user.module (Cannot apply hunk @@ 7 )

Back to CNW.

#90

robertDouglass - October 12, 2008 - 09:19

In system.install, why is status left out of this query?

<?php
db_query("INSERT INTO {users} (name, username, mail, created, data) VALUES(:placeholder, :placeholder, :placeholder, :time, :serial)", array(':placeholder' => 'placeholder-for-uid-1', ':time' =>  REQUEST_TIME, ':serial' => serialize(array())));
db_query("INSERT INTO {users} (name, mail, created, status, data) VALUES('%s', '%s', %d, %d, '%s')", 'placeholder-for-uid-1', 'placeholder-for-uid-1', REQUEST_TIME, 1, serialize(array()));
?>

#91

robertDouglass - October 12, 2008 - 09:33
Status:needs work» needs review

Rerolled.

AttachmentSizeStatusTest resultOperations
lower.patch18.94 KBIdleFailed: Failed to apply patch.View details | Re-test

#92

robertDouglass - October 12, 2008 - 09:55

100% tests pass: 6988 passes, 0 fails, 0 exceptions

#93

catch - October 12, 2008 - 09:57
Status:needs review» needs work

We should be using db_insert() for that query too.

#94

c960657 - November 30, 2008 - 17:39

Patch no longer applies to HEAD.

#95

David Strauss - December 7, 2008 - 12:04

Here's a re-roll. I have not tested it.

AttachmentSizeStatusTest resultOperations
remove_lower.patch16.28 KBIdleFailed: Failed to apply patch.View details | Re-test

#96

David Strauss - December 7, 2008 - 12:05
Status:needs work» needs review

#97

swentel - December 7, 2008 - 13:46
Status:needs review» needs work

We don't use t() anymore in schema descriptions, so that has to go out.

#98

David Strauss - December 7, 2008 - 21:16
Status:needs work» needs review

Re-roll to remove t() from schema description.

AttachmentSizeStatusTest resultOperations
remove_lower.patch16.28 KBIdleFailed: Failed to install HEAD.View details | Re-test

#99

David Strauss - December 7, 2008 - 21:19

Let's also fix the erroneous description of the old "name" column.

AttachmentSizeStatusTest resultOperations
remove_lower.patch16.44 KBIdleFailed: Failed to install HEAD.View details | Re-test

#100

System Message - December 7, 2008 - 21:25
Status:needs review» needs work

The last submitted patch failed testing.

#101

David Strauss - December 7, 2008 - 22:39
Status:needs work» needs review

Re-roll with non-conflicting update.

AttachmentSizeStatusTest resultOperations
remove_lower.patch16.44 KBIdleFailed: 7319 passes, 6 fails, 20 exceptionsView details | Re-test

#102

System Message - December 7, 2008 - 22:55
Status:needs review» needs work

The last submitted patch failed testing.

#103

David Strauss - December 7, 2008 - 23:24

From what I can tell, we seem to be experiencing a bug in the test framework. It doesn't make sense for user login to fail only under the picture submission test case. The exceptions result from trying to insert "u" as the line number into the test results table.

#104

c960657 - December 7, 2008 - 23:41

This is due to #328781: Fix horrible things in the error reporting. If you apply the patch from that issue, you will get better error messages when running the tests.

#105

David Strauss - December 7, 2008 - 23:45

All tests pass if we apply #328781, though we still get "Undefined index: name" exceptions in various places.

#106

stewsnooze - December 26, 2008 - 22:59

subscribe

#107

Dave Reid - January 20, 2009 - 05:22

#108

asimmonds - January 22, 2009 - 21:38
Status:needs work» needs review

Rerolled for current HEAD.

The "Undefined index: name" exceptions were from that user_save() can be called with only what needs to be updated in the $edit array. For example, in /admin/user when you block/unblock multiple users, $edit only has the status attribute set, and not the user or pass attributes.

AttachmentSizeStatusTest resultOperations
remove_lower_3.patch17.45 KBIdleFailed: 9304 passes, 0 fails, 1 exceptionView details | Re-test

#109

System Message - January 27, 2009 - 00:55
Status:needs review» needs work

The last submitted patch failed testing.

#110

asimmonds - February 7, 2009 - 10:08
Status:needs work» needs review

Rerolled for current HEAD.

AttachmentSizeStatusTest resultOperations
remove_lower_5.patch18.91 KBIdlePassed: 9740 passes, 0 fails, 0 exceptionsView details | Re-test

#111

Dave Reid - February 7, 2009 - 15:27
Status:needs review» needs work

1) These in system_install() will fail on either PostgreSQL or SQLite because of a bug in PDO that placeholders cannot be reused:

<?php
db_query("INSERT INTO {users} (name, username, mail) VALUES(:empty, :empty, :empty)", array(':empty' => ''));
db_query("INSERT INTO {users} (name, username, mail, created, data) VALUES(:placeholder, :placeholder, :placeholder, :time, :serial)", array(':placeholder' => 'placeholder-for-uid-1', ':time' =>  REQUEST_TIME, ':serial' => serialize(array())));
?>

2) Having the following code twice in user_save is dubious:

<?php
+    // Ensure $user->username is lowercase if it's the same as $user->name.
+    // We use the database to do the lowercasing to ensure correct searches
+    // in case of any discrepencies in character handling between the
+    // database and PHP.
+    // LOWER() is also slightly faster than mb_strtolower() for this task.
+    if (!empty($edit['name']) && $edit['username'] == $edit['name']) {
+     
db_query("UPDATE {users} SET username = LOWER(username) WHERE uid = :uid", array(':uid' => $account->uid));
+    }
?>

I don't see why we can't do something in the beginning of user_save() instead of having to add an extra query to every call to user_save(). Are the comments still valid even though we now require PHP 5.2 and have drupal_strtolower()? In fact, you have an added section earlier in user_save:
<?php
// If username isn't set, populate it with name.
+  if (!empty($edit['name']) && empty($edit['username'])) {
+   
$edit['username'] = $edit['name'];
+  }
?>

That should be simply:
<?php
  $edit
['username'] = drupal_strtolower($edit['name']);
?>

3) On the new test case, class UserSearchTestCase needs a phpdoc block right above it, getInfo() does not, setUp() does not, and testUserSearch() does need a phpdoc block. See the testing code standards at http://drupal.org/node/325974.

4) In addition, there are some sections in testUserSearch() that look smooshed together.

5) A few coding issues:

<?php
db_query("UPDATE {users} SET uid = uid - uid WHERE username = :empty", array(':empty' =>'')); 
?>

#112

Crell - February 7, 2009 - 20:45

To be clear, the non-reusability of placeholders in PDO is a design decision, not a bug. Some databases support reuse, some don't. Earlier versions of PDO tried to work around that limitation for those databases that didn't, and failed miserably. So in later PDO versions they ripped that out and declared placeholders non-reusable on any DB. (I spoke with Wez Furlong about that about 2 years ago. :-))

#113

catch - February 7, 2009 - 21:00

@davereid, we can't guarantee that the result of drupal_strtolower() and LOWER() will be exactly the same on all collations due to (potentially) different handling of characters in their lowercase versions. I haven't yet seen an example of where this would actually break at all, but it's one of the annoyances which makes both this and the equivalent taxonomy issue ugly. If anyone's got decent information on that subject - whether it's critical, whether we could just document the potential behaviour, that'd be handy.

#114

catch - March 27, 2009 - 19:49

Marked #83738: LOWER(name) queries perform badly; add column name_lower and index. as duplicate despite being the older issue to concentrate efforts in one area.

#115

markus_petrux - March 27, 2009 - 20:16

Minor issue about the patch in comment #110: it mixes 'Username' and 'User name' in translatable strings. Which one is correct?

#116

kbahey - March 27, 2009 - 21:27

Subscribing.

(Before dww fixes +1 subscribe)

#117

chx - April 5, 2009 - 04:04

I think the db_query("UPDATE {users} SET username = LOWER(username) WHERE uid = :uid", array(':uid' => $edit['uid'])); piece needs to go in a helper function. It can't be avoided, agreed.

#118

catch - May 5, 2009 - 11:29
Title:Performance: Add column 'username' to {user}, remove LOWER(name)» Performance: Add column 'username' to {users}, remove LOWER(name)

#119

blake_ - May 8, 2009 - 14:20

I don't think we should be storing the LOWER() value in the database at all, especially if there's a unique index on it. For any characters that aren't ASCII, each different database may handle the LOWER call differently. For example: MySQL & MSSQL use whatever collation is currently relevant to lower characters, I couldn't find out what Postgres does, SQLlite only lowers ASCII.

My view is that we're introducing here a new column, to provide a sort of "normalised name" that can be used to search for users / prevent a user creating a new account with "similar" characters. I think that the column should be named after what it actually does (something like "normalized_name", "standard_name" etc), and there should be a function (in PHP) that generates this value that then gets stored in the DB when we insert a user. To start with this function can just use PHP string lower function.

The advantages are:

  • We don't depend on an external function (LOWER) that could be implemented differently between different databases.
  • A module can strengthen the function if desired. The admin might decide that spaces should be removed, or that vowels with accents/umlauts/macrons were treated as undecorated, or that the characters I l 1 | all got transformed into one value etc. This would reduce similar looking usernames on such sites.
  • We actually fix this performance issue! We will just be using an index over values in a column, rather than computing something every row.
  • It extends to any future database engine. It is only another database column. We don't need to write triggers, and we don't need to worry about DB case sensitivity/insensitivity.

My $0.02 :) Discuss.

#120

joshmiller - August 23, 2009 - 23:47

Just read through this post and wanted to find out how it went from RTBC to needs work and not updated for some time. So here's where we are (for all of us):

Why should this go in? A stat in by catch #14 and a quote by Angie in #65

in a typical case we're comparing 1273.97 ms to 0.37 ms

The only point of this patch is to add a lowercase username column is for improved query performance. That's it.

Issue summary since last RTBC #89

  • #85 through #111 - Issues raised have been addressed
  • #111 - Dave Reid has some issues (see below for summary)
  • #112 - Crell says #111-01 issue is real and should be addressed (but, BTW it is by design) -- I added my own "reading between the lines" filter
  • #113 - catch raises a new issue for LOWER()
  • #118 - chx agrees that #111-02 is the way to go
  • #119 - blake_ points out that we could just make it a php function to avoid #113

What should the next patch include to help it be RTBC?

Ok, so that's the summary, and depending on how people feel about #119, we could potentially find another RTBC prior to 9/1 by addressing the following from #111 and #112

  1. Queries cannot include multiple placeholders
  2. We have duplicated code in user_save
  3. On the new test case, class UserSearchTestCase needs a phpdoc block right above it, getInfo() does not, setUp() does not, and testUserSearch() does need a phpdoc block.
  4. Code formatting issues in testUserSearch()
  5. There is a code problem with an SQL statement
  6. Document or fix the LOWER() issue raised by catch in #113

That's what the next patch needs. Hopefully this helps. Also, lets keep in mind Angie's guidance in #65

#121

joshmiller - August 24, 2009 - 00:10

Ignore my last comment, as we have a much better solution provided by catch in IRC:

"I think we can just convert these queries to db_select(), use LIKE which is case insenstive and mapped to ILIKE on postgresql and that'll work without the extra column."

Josh

#122

joshmiller - August 24, 2009 - 00:11
Title:Performance: Add column 'username' to {users}, remove LOWER(name)» Rework username queries to improve performance

changing title

#123

Crell - August 24, 2009 - 04:08

That particular feature of DBTNG was added with this issue in mind. I am not convinced that it is a good solution, mind you, and a lookup column would likely be faster and more flexible, but in theory it should work.

#124

kbahey - August 24, 2009 - 04:49

I am actually in favor of a solution that does not add an extra column. An extra column can introduce inconsistencies, and complicates teh schema. Since only PostgreSQL is case sensitive and the ILIKE solves that, I like that solution since it will remove a slow query for MySQL.

And it seems that we already solved it this way when catch did the user_load_multiple() patch. It already uses db_select(). So the one that caused the most trouble is already out of the way!

<?php
   
// If the conditions array is populated, add those to the query.
   
if ($conditions) {
     
// TODO D7: Using LIKE() to get a case insensitive comparison because Crell <--- COMMENT
      // and chx promise that dbtng will map it to ILIKE in postgres. <--- COMMENT
     
if (isset($conditions['name'])) {
       
$query->condition('u.name', $conditions['name'], 'LIKE');
        unset(
$conditions['name']);
      }
      if (isset(
$conditions['mail'])) {
       
$query->condition('u.mail', $conditions['mail'], 'LIKE');
        unset(
$conditions['mail']);
      }
      foreach (
$conditions as $field => $value) {
       
$query->condition('u.' . $field, $value);
      }
    }
?>

It just needs cleanup by taking out the comment above.

The others that are done by the admin (e.g. searching for a user, ...etc.) are infrequent and not a cause for performance issues, but for the sake of code consistency and DX, it is a nice to have.

There is one in comment that is checked when a comment is posted. So only applicable for sites that get very frequent comments.

#125

robertDouglass - August 24, 2009 - 07:02

Sorry. Kbahey, you've stated this many times in the history of the issue:

I am actually in favor of a solution that does not add an extra column. An extra column can introduce inconsistencies, and complicates teh schema.

But you've never really explained what these inconsistencies are? Or how the "complication" of the schema is so evil that it offsets the known performance gain?

I'm glad we've come to a possible solution with LIKE() that we think will make everyone happy. Maybe, like kbahey says, we're just a cleanup patch away from closing it. It seems that we did so without any benchmarking, so I hope that it actually does make Drupal faster.

#126

robertDouglass - August 24, 2009 - 07:02

And joshmiller, thank you for the excellent summary!

#127

kbahey - August 24, 2009 - 17:07

@Robert

It is all about data redundancy vs normalization. If you have the same data in different columns, then there are risks that contradict good database design. For example, there is the risk that at some point a new piece of code will update one place and forget the other, leading to inconsistency. Also looking at different places gives different answers. There is also the added storage overhead. This is why deriving data that can be derived is better, for consistency, than storing multiple pieces.

Having triggers and stored procedure in the database help minimize this risk, since no external code need to be modified, but we are not using any of those, and they have their own challenges (database portability, opaqueness, ...).

I understand that breaking normalization rules is necessary in some cases. Done right, they are a great help (e.g. tracker2, prev_next, ...). They are sort of external to the schema and the data is totally replaceable any time.

Since we have better options here, this is a case where an extra column is bad.

Yes, we only need cleanup for this just for DX and consistency. The LOWER that is used now is only for admin pages, which are infrequently used.

#128

pwolanin - September 2, 2009 - 13:17

Just discussed this problem with Davide Strauss - our consensus was that we should add a case insensitive comparison operator to DBTNG and push this into the driver layer.

This would allow us to optimize this for MySQL (at least), would allow this to be used also for terms, etc, where we are doing text-based lookups and would be readily available for contrib.

postgres version would initially at least just use the LOWER() = LOWER(). So performance would be the same as it is currently in core.

#129

pwolanin - September 2, 2009 - 13:43

talking with catch, he confirms what Khalid says above - essentially this is in core already - so this is really a cleanup issue to make sure it works right in SQLite3 and that we are removing LOWER() where we don't need it in core queries.

The bad part is that we are still generally doing %string% matches, which cannot be indexed.

#130

Dave Cohen - September 2, 2009 - 14:54

Do I understand correctly that with this patch username is used for authentication and searches, while name is used for display and third-party module can choose to make name non-unique?

If so, FINALLY!

#131

pwolanin - September 2, 2009 - 20:20

@Dave Cohen - that has been proposed elsewhere, and there was no consensus. This issue is just about finding ways to improve query performance.

#132

dahacouk - September 3, 2009 - 14:41

Display Name Code Sprint Saturday September 5th...

Dave Cohen and I just had a really good face to face discussion with Angie (webchick). In summary, we are going to use the Saturday code sprint to make sure that Drupal 7 will facilitate us building contributed modules to implement display names.

This may mean altering some parts of core in a very small way. If you are passionate about cleaning up "theme_username" throughout core and making it easier for contributed modules to implement display names in a clean and consistent way, then join us at DrupalCon Paris or over the wires.

Saturday September 5th 2009 10:00 to 17:00 CEST
Location: DrupalCon Paris and online elsewhere

If you use IRC then #kendra is ready and waiting for you during those times.

I'll make sure developers are kept well feed and watered. So, if you are at DrupalCon and passionate about making it easy to implement display names then get in touch.

This is pretty much our last chance for making Drupal 7 display name friendly! Let's make it happen!

Cheers Daniel

mobile: +44 7853 627 355
username: dahacouk on skype, jabber, irc...
Kendra Initiative - http://www.kendra.org.uk

#133

dahacouk - September 3, 2009 - 14:47

#134

jromine - September 4, 2009 - 22:16

subscribe

#135

dahacouk - September 5, 2009 - 10:34

OK. So, the strategy over the next few weeks (during code sludge) is to get theme_username() used consistently in core. From there contributed modules will be able to implement display names with more ease.

So, if you have time over next few weeks please help Dave Cohen work on the following issue:

#192056: User's raw login name should not be output directly

Many thanks,

Daniel

#136

catch - December 1, 2009 - 07:09
Title:Rework username queries to improve performance» Replace LOWER() with db_select() and LIKE() where possible
Category:task» bug report
Assigned to:catch» Anonymous

Now we have LIKE/ILIKE support in the database layer, it's a bug not to be using it where appropriate.

Here's a start on a conversion, hoping someone else can pick up the other places in core where this needs doing.

Wherever you see LOWER(:placeholder) this can be converted to db_select() and LIKE. The only exception is LIKE LOWER(%:placeholder%) which has its own problems we can't deal with here.

catch@catch-laptop:~/www/7$ grep -rl "LOWER(" *
modules/system/system.install
modules/user/user.module
modules/comment/comment.module
modules/taxonomy/taxonomy.module
modules/statistics/statistics.admin.inc
modules/profile/profile.admin.inc
modules/profile/profile.pages.inc
modules/profile/profile.module

Marking CNR to give test bot a go at it.

AttachmentSizeStatusTest resultOperations
lower.patch832 bytesIdlePassed on all environments.View details | Re-test

#137

catch - December 1, 2009 - 07:35
Status:needs work» needs review

#138

catch - December 1, 2009 - 08:11

Did a couple more.

TODO:
user_autocomplete()
profile_admin_settings_autocomplete()
profile_autocomplete()

AttachmentSizeStatusTest resultOperations
lower.patch1.55 KBIdlePassed on all environments.View details | Re-test

#139

catch - December 1, 2009 - 11:15

Somehow managed not to inciude the previous patch, todos remain the same.

Also user_is_blocked() should cast to bool, but keeping this to minimum necessary changes for now.

AttachmentSizeStatusTest resultOperations
lower.patch2.36 KBIdlePassed on all environments.View details | Re-test

#140

meba - December 1, 2009 - 13:04

A small review:

- Applied succesfully
- Created a new account with some uppercase characters - works
- Logged in as this user - works
- Created a new article and commented on it - works
- Created a profile field with a category, including some uppercase characters. Created second with in the same category - works
- Blocked my new user, tried to log in as the blocked user - doesn't allow login - works

#141

c960657 - December 1, 2009 - 18:18
Status:needs review» needs work

The wildcard characters % and _ need to be escaped. Otherwise confusing things happend if they appear in the user-supplied string. I suggest something like this:

-    $query->where('LOWER(category) = LOWER(:category)', array(':category' => $category));
+    $query->condition('category', addcslashes($category, '%_\\'), 'LIKE');

AFAICT SQL-92 says that there is no escape character if ESCAPE is omitted in the LIKE expression (... WHERE foo LIKE 'abc%' ESCAPE '\'), though MySQL and PostgreSQL uses \. In order to make all databases behave the same, we can make the following change to DatabaseCondition::mapConditionOperator() in database/query.inc:

@@ -1326,15 +1326,15 @@ class DatabaseCondition implements Query
       'NOT IN' => array('delimiter' => ', ', 'prefix' => ' (', 'postfix' => ')'),
       'IS NULL' => array('use_value' => FALSE),
       'IS NOT NULL' => array('use_value' => FALSE),
       // These ones are here for performance reasons.
       '=' => array(),
       '<' => array(),
       '>' => array(),
       '>=' => array(),
       '<=' => array(),
-      'LIKE' => array(),
+      'LIKE' => array('postfix' => " ESCAPE '\\''"),
     );
     if (isset($specials[$operator])) {
       $return = $specials[$operator];
     }
     else {

This pattern should be applied elsewhere in core, e.g. in various autocomplete callbacks.

Alternatively, perhaps we can make a new pseudo-operator in mapConditionOperator() that handles escaping automatically:

+    $query->condition('category', $category, 'CASE_INSENSITIVE_COMPARE');

It would make the code easier to read at a glance, though the reader may have trouble figuring out where that operator is defined. If the schema API some day supports case-sensitivity to be specified for each column individually, we cannot necessarily rely on LIKE (on MySQL the behaviour of LIKE depends on the column's collation), and then a special operator may be useful.

#142

Crell - December 1, 2009 - 21:25

There was an issue open for a year to figure out what to do with LIKE vs. ILIKE and similar mapping and no one cared about it so we just left it at "fold everything to use case-insensitive LIKE like MySQL". As it's an API change to rewrite that mapping I don't think we're able to do that at this point.

SelectQuery is not cheap, and does incur an additional PHP-space cost due to all the stack calls. I'd want to see benchmarks before we start using it everywhere.

The % question is... ugh. I'm not sure what the correct way to handle that is, since we *do* want to allow both real wildcards and % in a literal. That's probably separate from this thread, though ,which is a disturbingly long performance thread.

#143

catch - December 2, 2009 - 01:53

@Crell I'm pretty sure there's benchmarks earlier in this issue or its predecessors, db_select() takes at most 1-2ms, LOWER() can take hundreds of milliseconds or seconds afaik on large datasets.

We're already relying on LIKE/ILIKE mapping elsewhere in core, this issue is just applying it consistently. That means we probably need a spin-off for the % and _ issue which is nasty, and perhaps CASE_INSENSITIVE_COMPARE - both of these would be API additions rather than changes as such, and fixing a critical performance issue, so doesn't seem too bad at all.

 
 

Drupal is a registered trademark of Dries Buytaert.