Comments

chx’s picture

StatusFileSize
new10.88 KB

We can do this if we want to.

hswong3i’s picture

Subscribing.

Sorry that since I have not much understanding about the real case, hope to clarify somethings:

  1. How much performance impact, based on using LOWER within MySQL?
  2. Is case insensitive comparison for LIKE and = a globally standard, or only a MySQL-specific lossy handling?

Regards,
Edison

keith.smith’s picture

Status: Needs review » Needs work

I haven't applied and tried the patch yet, but will do so.

As a small nitpick, I notice this section in the patch:

@@ -901,7 +901,7 @@ function _profile_get_fields($category, 
   }
   else {
     // Use LOWER('%s') instead of PHP's strtolower() to avoid UTF-8 conversion issues.
-    $filters[] = "LOWER(category) = LOWER('%s')";
+    $filters[] = db_stricmp('category');
     $args[] = $category;
   }

The code comment immediately preceding this change should probably be changed to reflect the new comparison, and...I guess...is there a possibility this could introduce a UTF-8 conversion issue?

webchick’s picture

@hswong:

1. There are some benchmarks available in this issue: http://drupal.org/node/83738. I vastly prefer this approach to the one suggested in that issue for dealing with this (significant) performance hit.

2. No, but this function is defined in each of the database-specific include files, so Oracle or whatever would implement its own (if all else fails, just LOWER(blah)). Also, it's worth pointing out that MySQL is case-insensitive by nature. The ILIKE operator is specific to pgsql. That messed me up too, at first.

hswong3i’s picture

oic, so the performance impact is counted in dimension-level, which seems not much reason for opposing this issue :)

BTW, I would like to propose some counter idea, which just for brainstorming:

  1. Is it such often for a LOWER comparison? According to the above patch, seems only user_access() which would be called for each page generation; others are for some specific cases which wouldn't call usually, e.g. user_search()
  2. Even for user_access(), how many percentage of query time will it occupy, among existing normal page generation? If it is currently occupy for below 2~3% of query expenses, it seems not such useful for an additional API, for such specific and limited improvement
  3. According to "access checking", I am scheduling to propose an "Access API" solution, based on http://drupal.org/node/168403 founding. It will combine with PHP regex and Drupal cache system, which should able to replace such fancy, expansive and specific SQL comparison. I am going to summarize it into a much complete solution, and so it can apply to both drupal_is_denied(), block visibility check, user_access(), etc. As an alternative solution, It should make help for more different case, and left our database abstraction layer be simpler

Regards,
Edison

chx’s picture

Status: Needs work » Needs review

I will adjust the comment if necessary. I would say the most important aspect of this patch is that it fixes the term page which is a very important thing as some sites can have millions of (freetagging) terms. Not being able to index that is a huge blow. I believe the profile pages are being fixed here as well, but that's not as widely used as. hswong3i you are again proposing a complex feature patch instead of a simple solution,

hswong3i’s picture

@chx: Sorry that. "Access API" is just for brainstorming. That is another issue and please take it easy :)

BTW, seems you are trying to propose it for D6? Would this be a bit late among D6 development cycle? It change the database API once again (which is not too critical), and seems more suitable for D7?

robertdouglass’s picture

/me kisses chx. Loading user objects also will benefit from this. grep user.module for LOWER for all the cases.

quicksketch’s picture

Wow, what an excellent idea! I don't know why we didn't consider this before (or maybe we did, but I missed it). Subscribe (code review coming).

Freso’s picture

hswong3i:
This won't remove anything for anyone who uses LOWER(), nor will it change their ability to use this SQL function. It merely adds a function to make the use of LOWER() more effective. Thus, while it is changing the API, it is changing only by adding to it, not removing or altering anything existing therein, and as such won't break any code (already) ported to D6.
I'm not sure it's a bug though, seems more like a task or feature to me...

hswong3i’s picture

I am respecting about the performance boost of this patch, but seems shouldn't be the only consideration of adding additional abstraction function into our database API.

First of all, I would like to clarify my position: there is no different between mysql/pgsql/oracle/db2/mssql/etc within my concept. They are both database, a tool that serving us among different requirement, but nothing other than this. They are equal weight for me (even mysql own our existing market and most OSS field, where oracle is the leader of commercial market with no question). I would like to support for an abstraction, if it is something that need to be standardize due to compatibility concern. I did a horizontal research among different databases once before (http://edin.no-ip.com/html/?q=node/310), but seems this issue is a bit different with the behavior of standardizing RAND()/LENGTH/SUBSTRING/etc (http://drupal.org/node/167335).

Moreover, if we abstract something due to performance consideration, this should be more drawback than benefit. E.g. since case sensitive comparison is a globally standard (it shouldn't be a difficult concept: PHP and C/C++ also acting in similar behavior), this abstraction will seems to be a bit fancy for all other database, just except mysql. It is only meaningful to mysql lossy checking style, but none of performance improvement for all other databases (including pgsql). It is can't help to the real problem (LOWER() can't boost by indexing): we will need to review this problem once again, when we find this as a performance impact (yes, it is) to other databases in the coming future, which shouldn't be our hope. We may consider about solving this performance impact by some alternative solution, which should be database independence.

On the other hand, it seems not to be a suitable timing for another database API change after D6 beta1 release. It will violate the idea of code freeze, and can't help about boosting up the speed of stable release. I am giving a great hope about this issue based on its performance boost, but I will keep myself under control and waiting for this powerful feature in D7, with more discussion :)

merlinofchaos’s picture

hswong3i: With all due respect, at the time of this writing, 99% of all Drupal installations run on MySQL. A performance enhancement for MySQL that does not harm other databases is a good thing. I see no drawback here, beyond making some of the query writing a little harder. In Drupal 7, we can maybe find a better way to help out all databases with this, but it's worth a performance boost right now.

I need to test this patch tho; I'm dubious about the case insensitive nature. I've had complaints in Views about case sensitivity.

chx’s picture

hswong3i, enough. If you follow up more, I will delete it from the database, period. A vast majority of Drupal users are using mysql so we should cater for them while keeping open the door for the minority. This is exactly what this patch does.

chx’s picture

MySQL has case sensitive collations indeed but the default is case insensitive. Those who are proficient enough to set up their own non-default collation can surely copypaste the function from pgsql easily.

chx’s picture

StatusFileSize
new11.27 KB

I discussed with merlinofchaos and he did not like my lazy approach. So now I use the collation on system to determine whether the database collation is _cs or _ci.

chx’s picture

See http://www-cgi.uni-regensburg.de/WWW_Server/Dokumentation/MySQL/Manual-4... an example of how this indeed works in MySQL 4.1. I was unable to find this on dev.mysql.com .

david strauss’s picture

I like this patch, but it does have one flaw: if {system}'s first column has a case-insensitive collation but the column used in db_stricmp() has a case-sensitive collation, the comparison logic returned by db_stricmp() will be case-sensitive.

This is acceptable as long as the collation used is database-wide. So, I'll say, "RTBC on the condition that we document the pitfalls."

chx’s picture

Status: Needs review » Reviewed & tested by the community
StatusFileSize
new11.67 KB

I documented and RTBC'd per David comment. Please note that for the exotic cases (when you set collation per table or even per column) mentioned in the patch you always needed to hack core. You are actually better off currently because you can overwrite the variable in settings.php to uniformally get rid of LOWER() or to enforce it.

chx’s picture

StatusFileSize
new11.57 KB

Typofix (and debug removed)

robertdouglass’s picture

This patch reads well. The English text is fine, though I don't know if we really need to encourage anyone to hack core :P If they're the core hacking types, they'll hack without encouragement.

dries’s picture

I want to think about this patch some more before it gets committed. Please don't commit it yet.

john morahan’s picture

+          $result = pager_query('SELECT name, uid, mail FROM {users} WHERE '. db_stricmp('name') .' LIKE '. db_stricmp("'%%%s%%'") .' OR '. db_stricmp('mail', 'LIKE', "'%%%s%%'"), 15, 0, NULL, $keys, $keys);

shouldn't that be

+          $result = pager_query('SELECT name, uid, mail FROM {users} WHERE '. db_stricmp('name', 'LIKE', "'%%%s%%'") .' OR '. db_stricmp('mail', 'LIKE', "'%%%s%%'"), 15, 0, NULL, $keys, $keys);

?

john morahan’s picture

StatusFileSize
new13.3 KB

The above in patch form.

chx’s picture

Extremely nice catch, yes, that's what it shoudl be, it remaiend from the db_lower patch.

moshe weitzman’s picture

About the rare use case where a column is specifically collated as case sensitive: instead of hacking this function we could ask people to modify their schema definition for the relevant column and then check that here. we'll have to make up a syntax for collation. that patch could probably come later when someone has a need for it. could take years.

DrupalTestbedBot tested chx's patch (http://drupal.org/files/issues/db_stricmp-181625-15.patch), the patch passed. For more information visit http://testing.drupal.org/node/108

DrupalTestbedBot tested chx's patch (http://drupal.org/files/issues/db_stricmp-181625-18.patch), the patch passed. For more information visit http://testing.drupal.org/node/110

DrupalTestbedBot tested chx's patch (http://drupal.org/files/issues/db_stricmp-181625-19.patch), the patch passed. For more information visit http://testing.drupal.org/node/111

DrupalTestbedBot tested John Morahan's patch (http://drupal.org/files/issues/db_stricmp-181625-23.patch), the patch passed. For more information visit http://testing.drupal.org/node/113

moshe weitzman’s picture

/me hugs DrupalTestbedBot. Thanks all.

so does it unit tests or just check that the patch applies? ideally the link should show all greens if it passed unit tests.

dries’s picture

Status: Reviewed & tested by the community » Closed (won't fix)

I've given this a fair amount of thought, and I think this patch is a hack. I don't like it. Sorry.

david strauss’s picture

The best RDBMS-independent solution is to create a name_lower column that stores the username in all lowercase.

chx’s picture

Status: Closed (won't fix) » Needs review

Dries, please reconsider. We are talking of magnitudes faster queries. Care to consider maybe the patch in the original post called db_lower?

catch’s picture

catch’s picture

StatusFileSize
new3.31 KB

I've marked the other issue as duplicate. This is killes' last patch from that for easier review.

To emphasise how much difference this is making, quoting killes from the other issue again:

The query

db_result(db_query_range("SELECT CASE WHEN status=1 THEN 0 ELSE 1 END FROM {access} WHERE type = '%s' AND LOWER('%s') LIKE LOWER(mask) ORDER BY status DESC", $type, $mask, 0, 1));

Is the the slowest query on drupaldb.drupal.org (after all the pager queries and the search queries have been sent to the slave).

This is probably due to the " AND LOWER('%s') LIKE LOWER(mask) " part.

firebus’s picture

i also vote to reconsider this patch.

we have just started running drupal on www.nanowrimo.org, a medium sized site that is ALL logged in users all of them updating their user profiles all the time.

we've have two big breakthroughs in performance. the first was getting rid of all node_access modules. the second was this patch.

before this patch, we hit a wall at around 800 logged in users with page loads taking ~30 secs. after this patch, we are running slow but acceptable with around 1400 logged in users. page loads are ~1-5 secs, a little longer for those involving inserts. QPS has increased from 1.15K to 1.5K.

it's a hack, but it's essential for a high performance site.

chx’s picture

The patch to consider is the original post. catch's patch belongs to some other issue, not here.

moshe weitzman’s picture

Priority: Normal » Critical

this has bitten so many of us. upgrade to critical.

hswong3i’s picture

Is LOWER() slow? Yes, it is. It is proved by benchmarking and research. This is an issue, which affect performance.

Is db_lower() a complete solution for scalability? No, it isn't. A single server solution should always have its limitation, no matter how we tweak our queries as database dependent as possibility. People asking for a solution of scalability issue should always consider about database replication, or even clustering. (BTW, even database replication hack is now used within drupal.org, we are now seems it as a HACK and so left it as an unofficial supporting, until a better solution)

Should we accept db_lower() as a core feature for D6? Please answer me if this patch a HACK, or a cross database compatible solution, or at least not a case-by-case solution for MySQL? If a hack is always acceptable even it is a HACK, it seems nothing we should reject. The nature of this patch is already judged by Dries which shouldn't debate. Case is similar as database replication, shouldn't we consider this patch as similar handling: an unofficial supporting but not core feature?

Even if we ASSUME this HACK needed for a core supporting, it is still too late within D6 development cycle: almost HALF MONTH after D6 beta2!? On the other hand, this issue is proposed on 8th Oct, which is also too late within D6 Development cycle. We don't have enough time to improve this handling before it comes to be solid (and so we finally found this is a HACK by not patch: it is even case-by-case functioning under MySQL...) Time is always needed for a complete and solid solution.

Long story short, we should always obey the judgment of core committer.

david strauss’s picture

"People asking for a solution of scalability issue should always consider about database replication, or even clustering."

I'm getting quite tired of reading plugs for commercial clustering solutions. Discussion of replication and clustering is 100% irrelevant to this patch.

hswong3i’s picture

@David: Please back to my main concern about this topic: is this patch a HACK, or a cross database compatible solution, or at least not a case-by-case solution for MySQL? Please don't lost focus :)

chx’s picture

I need to hold myself hard to not exploit admin privileges here but hswong3i, if you once again spam in my issues I will ask for your ban based on spamming because this is that -- this is about fixing queries that can not be indexed and nothing to do with indexing. Could you please get out of my life?

merlinofchaos’s picture

Please back to my main concern about this topic: is this patch a HACK, or a cross database compatible solution, or at least not a case-by-case solution for MySQL? Please don't lost focus

I don't care if it's a hack. Did you SEE the NUMBERS? It is major problem and it needs to be addressed.

We are talking about REAL SITES. Oracle is a pipe dream.

Maybe a dozen corporate users are going to use Oracle behind Drupal. Bully for them. I don't have $10K to drop on a bloated piece of software when I could spend that $10K on 3 or 4 machines and a free database solution. Which one is going to give me better performance? I'll take the hardware cluster for $500, Alex.

If holding open the door for Oracle means driving away Drupal sites because we refuse to apply performance patches that are known to work because you think they're a hack...I say Oracle is the one that can go.

This is a performance problem that we can fix; even if it's just a bandaid and we come up with a better solution in Drupal 7.

(I'll note that we had a patch that was a better solution. It created an extra field for lookups that we could index. It, too, was considered a hack). I, for one, am tired of legitimate performance enhancements being called hacks. These performance problems are going to be the primary blocker for Drupal over the next 2 years. They must be corrected.

david strauss’s picture

"I'll note that we had a patch that was a better solution. It created an extra field for lookups that we could index. It, too, was considered a hack."

That is my preferred method for solving this problem. And yes, denormalization is not a hack.

chx’s picture

I just ran this on our term_data table:

select 1 from term_data where name like 'london%'; 0.02s
select 1 from term_data where lower(name) like lower('london%'); 2.42 sec

Also note that the first is O(log N) the second is O(N) where N is the number of terms.

chx’s picture

If Dries says so, I will roll a patch immediately that adds a lowercase column to each affected table. Sure. Just post "nod" and I will deliver.

merlinofchaos’s picture

I would support either solution; I would be happier to support a denormalization solution. I think hswong3i would too, as all databases would benefit from the fix. I'm not sure if the existing patch would still apply or if it would be worth it to try; the patch itself is not that difficult to write.

david strauss’s picture

@chx I don't know about Dries, but I would instantly RTBC a patch that solves this with a corresponding, lowercase-only column.

hswong3i’s picture

I am voting for an additional lower column, because:

  1. Strict forward. Problem cause by the lose effect of indexed column, when working with LOWER() during run time. Giving an additional column with lower case content can quickly solve this problem.
  2. Simple. We don't need to understand the different between MySQL's *_cs and *_ci, and the risk of using this case-by-case handling. On the other hand, no additional API are required. We only need to modify DB schema, INSERT/UPDATE, and use the lower column rather than LOWER() existing column. All of this are in query level, which is simple enough.
  3. Reliable. It is always truth for MySQL (whatever with *_ci or *_cs), PgSQL (and ANY ELSE databases) will also benefit with this changes.
  4. Acceptable. As I remember someone mention that we did something similar for node_comment_statistics. IF an additional column for special requirement and performance boosting is acceptable for that case, we should able to employ that in here, too.

Regards,
Edison

alpritt’s picture

http://drupal.org/node/83738 is the issue regarding adding an extra column.

hswong3i’s picture

Seems all of us are agreeing about handle the slow LOWER() issue by an additional column, which should belongs to another issue: http://drupal.org/node/83738. Should we roll this issue back as "won't fix" and "normal", and shift our focus to the suitable thread?

chx’s picture

Status: Needs review » Closed (won't fix)

OK