Closed (won't fix)
Project:
Drupal core
Version:
6.x-dev
Component:
database system
Priority:
Critical
Category:
Bug report
Assigned:
Unassigned
Reporter:
Created:
8 Oct 2007 at 07:22 UTC
Updated:
1 Nov 2007 at 08:58 UTC
Jump to comment: Most recent file
Comments
Comment #1
chx commentedWe can do this if we want to.
Comment #2
hswong3i commentedSubscribing.
Sorry that since I have not much understanding about the real case, hope to clarify somethings:
LIKEand=a globally standard, or only a MySQL-specific lossy handling?Regards,
Edison
Comment #3
keith.smith commentedI haven't applied and tried the patch yet, but will do so.
As a small nitpick, I notice this section in the patch:
The code comment immediately preceding this change should probably be changed to reflect the new comparison, and...I guess...is there a possibility this could introduce a UTF-8 conversion issue?
Comment #4
webchick@hswong:
1. There are some benchmarks available in this issue: http://drupal.org/node/83738. I vastly prefer this approach to the one suggested in that issue for dealing with this (significant) performance hit.
2. No, but this function is defined in each of the database-specific include files, so Oracle or whatever would implement its own (if all else fails, just LOWER(blah)). Also, it's worth pointing out that MySQL is case-insensitive by nature. The ILIKE operator is specific to pgsql. That messed me up too, at first.
Comment #5
hswong3i commentedoic, so the performance impact is counted in dimension-level, which seems not much reason for opposing this issue :)
BTW, I would like to propose some counter idea, which just for brainstorming:
LOWERcomparison? According to the above patch, seems only user_access() which would be called for each page generation; others are for some specific cases which wouldn't call usually, e.g. user_search()Regards,
Edison
Comment #6
chx commentedI will adjust the comment if necessary. I would say the most important aspect of this patch is that it fixes the term page which is a very important thing as some sites can have millions of (freetagging) terms. Not being able to index that is a huge blow. I believe the profile pages are being fixed here as well, but that's not as widely used as. hswong3i you are again proposing a complex feature patch instead of a simple solution,
Comment #7
hswong3i commented@chx: Sorry that. "Access API" is just for brainstorming. That is another issue and please take it easy :)
BTW, seems you are trying to propose it for D6? Would this be a bit late among D6 development cycle? It change the database API once again (which is not too critical), and seems more suitable for D7?
Comment #8
robertdouglass commented/me kisses chx. Loading user objects also will benefit from this. grep user.module for LOWER for all the cases.
Comment #9
quicksketchWow, what an excellent idea! I don't know why we didn't consider this before (or maybe we did, but I missed it). Subscribe (code review coming).
Comment #10
Freso commentedhswong3i:
This won't remove anything for anyone who uses
LOWER(), nor will it change their ability to use this SQL function. It merely adds a function to make the use ofLOWER()more effective. Thus, while it is changing the API, it is changing only by adding to it, not removing or altering anything existing therein, and as such won't break any code (already) ported to D6.I'm not sure it's a bug though, seems more like a task or feature to me...
Comment #11
hswong3i commentedI am respecting about the performance boost of this patch, but seems shouldn't be the only consideration of adding additional abstraction function into our database API.
First of all, I would like to clarify my position: there is no different between mysql/pgsql/oracle/db2/mssql/etc within my concept. They are both database, a tool that serving us among different requirement, but nothing other than this. They are equal weight for me (even mysql own our existing market and most OSS field, where oracle is the leader of commercial market with no question). I would like to support for an abstraction, if it is something that need to be standardize due to compatibility concern. I did a horizontal research among different databases once before (http://edin.no-ip.com/html/?q=node/310), but seems this issue is a bit different with the behavior of standardizing RAND()/LENGTH/SUBSTRING/etc (http://drupal.org/node/167335).
Moreover, if we abstract something due to performance consideration, this should be more drawback than benefit. E.g. since case sensitive comparison is a globally standard (it shouldn't be a difficult concept: PHP and C/C++ also acting in similar behavior), this abstraction will seems to be a bit fancy for all other database, just except mysql. It is only meaningful to mysql lossy checking style, but none of performance improvement for all other databases (including pgsql). It is can't help to the real problem (LOWER() can't boost by indexing): we will need to review this problem once again, when we find this as a performance impact (yes, it is) to other databases in the coming future, which shouldn't be our hope. We may consider about solving this performance impact by some alternative solution, which should be database independence.
On the other hand, it seems not to be a suitable timing for another database API change after D6 beta1 release. It will violate the idea of code freeze, and can't help about boosting up the speed of stable release. I am giving a great hope about this issue based on its performance boost, but I will keep myself under control and waiting for this powerful feature in D7, with more discussion :)
Comment #12
merlinofchaos commentedhswong3i: With all due respect, at the time of this writing, 99% of all Drupal installations run on MySQL. A performance enhancement for MySQL that does not harm other databases is a good thing. I see no drawback here, beyond making some of the query writing a little harder. In Drupal 7, we can maybe find a better way to help out all databases with this, but it's worth a performance boost right now.
I need to test this patch tho; I'm dubious about the case insensitive nature. I've had complaints in Views about case sensitivity.
Comment #13
chx commentedhswong3i, enough. If you follow up more, I will delete it from the database, period. A vast majority of Drupal users are using mysql so we should cater for them while keeping open the door for the minority. This is exactly what this patch does.
Comment #14
chx commentedMySQL has case sensitive collations indeed but the default is case insensitive. Those who are proficient enough to set up their own non-default collation can surely copypaste the function from pgsql easily.
Comment #15
chx commentedI discussed with merlinofchaos and he did not like my lazy approach. So now I use the collation on system to determine whether the database collation is _cs or _ci.
Comment #16
chx commentedSee http://www-cgi.uni-regensburg.de/WWW_Server/Dokumentation/MySQL/Manual-4... an example of how this indeed works in MySQL 4.1. I was unable to find this on dev.mysql.com .
Comment #17
david straussI like this patch, but it does have one flaw: if {system}'s first column has a case-insensitive collation but the column used in db_stricmp() has a case-sensitive collation, the comparison logic returned by db_stricmp() will be case-sensitive.
This is acceptable as long as the collation used is database-wide. So, I'll say, "RTBC on the condition that we document the pitfalls."
Comment #18
chx commentedI documented and RTBC'd per David comment. Please note that for the exotic cases (when you set collation per table or even per column) mentioned in the patch you always needed to hack core. You are actually better off currently because you can overwrite the variable in settings.php to uniformally get rid of LOWER() or to enforce it.
Comment #19
chx commentedTypofix (and debug removed)
Comment #20
robertdouglass commentedThis patch reads well. The English text is fine, though I don't know if we really need to encourage anyone to hack core :P If they're the core hacking types, they'll hack without encouragement.
Comment #21
dries commentedI want to think about this patch some more before it gets committed. Please don't commit it yet.
Comment #22
john morahan commentedshouldn't that be
?
Comment #23
john morahan commentedThe above in patch form.
Comment #24
chx commentedExtremely nice catch, yes, that's what it shoudl be, it remaiend from the db_lower patch.
Comment #25
moshe weitzman commentedAbout the rare use case where a column is specifically collated as case sensitive: instead of hacking this function we could ask people to modify their schema definition for the relevant column and then check that here. we'll have to make up a syntax for collation. that patch could probably come later when someone has a need for it. could take years.
Comment #30
moshe weitzman commented/me hugs DrupalTestbedBot. Thanks all.
so does it unit tests or just check that the patch applies? ideally the link should show all greens if it passed unit tests.
Comment #31
dries commentedI've given this a fair amount of thought, and I think this patch is a hack. I don't like it. Sorry.
Comment #32
david straussThe best RDBMS-independent solution is to create a name_lower column that stores the username in all lowercase.
Comment #33
chx commentedDries, please reconsider. We are talking of magnitudes faster queries. Care to consider maybe the patch in the original post called db_lower?
Comment #34
catchhttp://drupal.org/node/174025 is related.
Comment #35
catchI've marked the other issue as duplicate. This is killes' last patch from that for easier review.
To emphasise how much difference this is making, quoting killes from the other issue again:
Comment #36
firebus commentedi also vote to reconsider this patch.
we have just started running drupal on www.nanowrimo.org, a medium sized site that is ALL logged in users all of them updating their user profiles all the time.
we've have two big breakthroughs in performance. the first was getting rid of all node_access modules. the second was this patch.
before this patch, we hit a wall at around 800 logged in users with page loads taking ~30 secs. after this patch, we are running slow but acceptable with around 1400 logged in users. page loads are ~1-5 secs, a little longer for those involving inserts. QPS has increased from 1.15K to 1.5K.
it's a hack, but it's essential for a high performance site.
Comment #37
chx commentedThe patch to consider is the original post. catch's patch belongs to some other issue, not here.
Comment #38
moshe weitzman commentedthis has bitten so many of us. upgrade to critical.
Comment #39
hswong3i commentedIs LOWER() slow? Yes, it is. It is proved by benchmarking and research. This is an issue, which affect performance.
Is db_lower() a complete solution for scalability? No, it isn't. A single server solution should always have its limitation, no matter how we tweak our queries as database dependent as possibility. People asking for a solution of scalability issue should always consider about database replication, or even clustering. (BTW, even database replication hack is now used within drupal.org, we are now seems it as a HACK and so left it as an unofficial supporting, until a better solution)
Should we accept db_lower() as a core feature for D6? Please answer me if this patch a HACK, or a cross database compatible solution, or at least not a case-by-case solution for MySQL? If a hack is always acceptable even it is a HACK, it seems nothing we should reject. The nature of this patch is already judged by Dries which shouldn't debate. Case is similar as database replication, shouldn't we consider this patch as similar handling: an unofficial supporting but not core feature?
Even if we ASSUME this HACK needed for a core supporting, it is still too late within D6 development cycle: almost HALF MONTH after D6 beta2!? On the other hand, this issue is proposed on 8th Oct, which is also too late within D6 Development cycle. We don't have enough time to improve this handling before it comes to be solid (and so we finally found this is a HACK by not patch: it is even case-by-case functioning under MySQL...) Time is always needed for a complete and solid solution.
Long story short, we should always obey the judgment of core committer.
Comment #40
david strauss"People asking for a solution of scalability issue should always consider about database replication, or even clustering."
I'm getting quite tired of reading plugs for commercial clustering solutions. Discussion of replication and clustering is 100% irrelevant to this patch.
Comment #41
hswong3i commented@David: Please back to my main concern about this topic: is this patch a HACK, or a cross database compatible solution, or at least not a case-by-case solution for MySQL? Please don't lost focus :)
Comment #42
chx commentedI need to hold myself hard to not exploit admin privileges here but hswong3i, if you once again spam in my issues I will ask for your ban based on spamming because this is that -- this is about fixing queries that can not be indexed and nothing to do with indexing. Could you please get out of my life?
Comment #43
merlinofchaos commentedPlease back to my main concern about this topic: is this patch a HACK, or a cross database compatible solution, or at least not a case-by-case solution for MySQL? Please don't lost focus
I don't care if it's a hack. Did you SEE the NUMBERS? It is major problem and it needs to be addressed.
We are talking about REAL SITES. Oracle is a pipe dream.
Maybe a dozen corporate users are going to use Oracle behind Drupal. Bully for them. I don't have $10K to drop on a bloated piece of software when I could spend that $10K on 3 or 4 machines and a free database solution. Which one is going to give me better performance? I'll take the hardware cluster for $500, Alex.
If holding open the door for Oracle means driving away Drupal sites because we refuse to apply performance patches that are known to work because you think they're a hack...I say Oracle is the one that can go.
This is a performance problem that we can fix; even if it's just a bandaid and we come up with a better solution in Drupal 7.
(I'll note that we had a patch that was a better solution. It created an extra field for lookups that we could index. It, too, was considered a hack). I, for one, am tired of legitimate performance enhancements being called hacks. These performance problems are going to be the primary blocker for Drupal over the next 2 years. They must be corrected.
Comment #44
david strauss"I'll note that we had a patch that was a better solution. It created an extra field for lookups that we could index. It, too, was considered a hack."
That is my preferred method for solving this problem. And yes, denormalization is not a hack.
Comment #45
chx commentedI just ran this on our term_data table:
select 1 from term_data where name like 'london%'; 0.02s
select 1 from term_data where lower(name) like lower('london%'); 2.42 sec
Also note that the first is O(log N) the second is O(N) where N is the number of terms.
Comment #46
chx commentedIf Dries says so, I will roll a patch immediately that adds a lowercase column to each affected table. Sure. Just post "nod" and I will deliver.
Comment #47
merlinofchaos commentedI would support either solution; I would be happier to support a denormalization solution. I think hswong3i would too, as all databases would benefit from the fix. I'm not sure if the existing patch would still apply or if it would be worth it to try; the patch itself is not that difficult to write.
Comment #48
david strauss@chx I don't know about Dries, but I would instantly RTBC a patch that solves this with a corresponding, lowercase-only column.
Comment #49
hswong3i commentedI am voting for an additional lower column, because:
*_csand*_ci, and the risk of using this case-by-case handling. On the other hand, no additional API are required. We only need to modify DB schema, INSERT/UPDATE, and use the lower column rather than LOWER() existing column. All of this are in query level, which is simple enough.*_cior*_cs), PgSQL (and ANY ELSE databases) will also benefit with this changes.Regards,
Edison
Comment #50
alpritt commentedhttp://drupal.org/node/83738 is the issue regarding adding an extra column.
Comment #51
hswong3i commentedSeems all of us are agreeing about handle the slow LOWER() issue by an additional column, which should belongs to another issue: http://drupal.org/node/83738. Should we roll this issue back as "won't fix" and "normal", and shift our focus to the suitable thread?
Comment #52
chx commentedOK