taxonomy_autocomplete (and taxonomy_get_term_by_name) use LOWER(term_data.name) in their queries. LOWER() can't be indexed so I think we should look into having a name_lower column in term_data and query against that instead.
Attached is a proof of concept patch, which uses mb_strtolower($name, 'UTF-8') to lowercase the term names (strtolower isn't safe for multibyte encoding) on save and removes LOWER() from the autocomplete query.
Basic idea came from http://drupal.org/node/83738 and pwolanin directed me towards mb_strtolower in connection with that issue. If this is a goer, then we could apply it to other places in core.
Obviously this adds storage overhead for taxonomy. Any benefits from this approach are going to apply to sites with free tagging vocabularies, and they're most likely to have big vocabularies, so the space/speed trade off ought to balance out.
Update path should be add_column, then UPDATE term_data SET name_lower = LOWER(name) - since LOWER and mb_strtolower() ought to have the same results.
There's plenty missing from this patch, mainly interested in whether it's an idea worth pursuing at all at this stage.
Comment | File | Size | Author |
---|---|---|---|
#3 | term_lower.patch | 3.99 KB | catch |
#1 | term_lower.patch | 4 KB | catch |
term_lower.patch | 2.67 KB | catch | |
Comments
Comment #1
catchLooking more at http://drupal.org/node/83738 this ought to be fine. New patch has an upgrade path which I've tested, and a slight tidy up for the code so marking to needs review. taxonomy tests still pass as well.
If this is suitable, we could apply the same approach to user.module
Comment #2
Wim LeersVery quick look at this patch:
- spacing issues in taxonomy_update_7000()
-
// Must use mb_strtolower() to be multi-byte safe.
: one space too much before "to be"One concern: shouldn't we use the Batch API for the upgrade path? What if a site has > 1 million terms? (Playing the devil's advocate here, not sure if other upgrade paths take this into account.)
Comment #3
catchNew patch fixes whitespace issues, thanks.
On batch API, I don't know to be honest.
We have queries like this in system_update_x though:
db_query('UPDATE {term_node} t SET vid = (SELECT vid FROM {node} n WHERE t.nid = n.nid)');
$ret[] = update_sql("UPDATE {profile_fields} SET category = 'Account settings' WHERE LOWER(category) = 'account'");
Which deal with potentially much larger numbers of records, so I'd rather avoid adding in batch API for now, at least until there's major objections.
Comment #4
Wim LeersYep, true. This is one of the remaining "Drupal for large scale web sites" problems then.
Patch looks good, but I haven't got the time to test right now. I'll test this later this week unless somebody else already does so before me.
Comment #5
Damien Tournoud CreditAttribution: Damien Tournoud commentedWell. It looks like LIKE is nearly always case insensitive anyway (both on MySQL with the default utf8_general collation and on PostgreSQL running in LOCALE=C), so the LOWER is not even required here.
Please just drop the LOWER and confirm it works as expected.
Comment #6
Damien Tournoud CreditAttribution: Damien Tournoud commentedMy bad, LIKE *is* case-sensitive on PostgreSQL, so please ignore my previous suggestion.
For your information, http://drupal.org/node/51220 has all you need to write foolproof upgrade functions.
Comment #7
catchDamien, do you have a link to the relevant postgres docs for this? All I can find is:
Which suggests otherwise, but I know very little about postgres so...
Additionally, SQLite docs suggest LIKE is case_sensitive there too - although this is mentioned parenthetically as a bug. Since we're hoping to add SQLite this cycle, I'd prefer a solution which works with all three databases reliably. http://www.sqlite.org/lang_expr.html
Having said that, if I'm wrong, and we can just remove LOWER() from these queries, that'd be lovely, and easily back portable to D6, but it'd be nice to know it won't just store up problems for later first.
Will do a quick test on MySQL later, don't have postgres to test on. Back to needs review until we're sure just stripping it out will work across at least the three target databases. edit: cross-posted.
Comment #8
Damien Tournoud CreditAttribution: Damien Tournoud commented@catch, as I clarified in #6, LIKE really is case-sensitive in PostgreSQL, which is a real pain.
Comment #9
Senpai CreditAttribution: Senpai commentedUnsubscribing so I can review/test this bugger later.
Comment #10
pwolanin CreditAttribution: pwolanin commentedsubscribe
Comment #11
catchI got bored waiting for devel generate to create 1 million terms, but with 52485 terms it takes 6.85 seconds to run that query, which I think ought to be acceptable - tracker select queries can take 25-30 seconds or more on big sites so for a one off it seems doable.
Comment #12
Damien Tournoud CreditAttribution: Damien Tournoud commentedWell, there is a
drupal_strtolower()
, so independently of whether this works or not (I haven't tested it), this is a cause of CNW.Comment #13
catchSpoke to chx and Crell on irc, and they said the new database layer may allow us to use ILIKE in postgres and LIKE in mysql for queries like this - getting rid of the case sensitivity issues without requiring a new column. Obviously that'd be an ideal solution, so marking postponed (rather than won't fix, so I remember to revisit this when that's possible).
Comment #14
Damien Tournoud CreditAttribution: Damien Tournoud commented@catch: I'm unsure if PostgreSQL can use it's indexes for ILIKE queries. This will have to be verified.
Comment #15
catchNever that easy is it :(
http://archives.postgresql.org/pgsql-sql/2003-09/msg00395.php
Comment #16
Damien Tournoud CreditAttribution: Damien Tournoud commentedIt looks like you can create indexed based on an expression in PostgreSQL. For example:
This could lead to noticeable performance improvements in all current PostgreSQL deployments.
An alternative and maybe complementary approach is to build an index based on case-insensitive operators, as detailed in [1] (but for an old version of PostgreSQL). Like this:
See [2] for more details.
These techniques looks more elegant than the creation of *_lower columns.
[1] http://archives.postgresql.org/pgsql-hackers/2003-10/msg01550.php
[2] http://www.postgresql.org/docs/8.3/interactive/indexes-opclass.html
Comment #17
catchIf we can use the new DB layer to swap LIKE for ILIKE, and then add the computed index in postgres via schema API, then I agree this would be a more elegant solution (although possibly username/name in user.module is worth doing with the extra column anyway for other reasons).
However, this opens us up to maintaining a different schema between postgres and MySQL - and then we may have to introduce other special cases for SQLite or whatever gets supported in the future (or they'll just be slower). webchick already brought this up on one of the grandfather issues to this one: http://drupal.org/node/83738#comment-447137
Probably we should wait and see what happens with the DB layer, and explore an index on lower in more depth. However I'll leave this at needs work instead of postponed in the hope that others will chime in.
Comment #18
Damien Tournoud CreditAttribution: Damien Tournoud commented@catch: we really should abstract dirty implementation details as most as possible in the Schema API. Write the same queries, define the same schema, and let the Schema API and the Database Layer collaborate to do the rest.
This means we will probably need a new abstract column type "ivarchar".
Comment #19
catchRetitling now dbtng is in.
Comment #20
David StraussI'm not sure how I missed this issue, but this is not a productive change. Our autocomplete uses a "%text%" pattern for LIKE, which MySQL cannot use an index to satisfy, whether or not we pre-compute the lowercase names and index them.
In case someone feels the need to re-open this issue, I'm reducing this to priority "normal" because taxonomy autocompletion performance is most definitely not "critical."
Comment #21
catchWe're still using a straight LOWER() in taxonomy_get_term_by_name(), which is used in taxonomy_node_save() on each entered free-tagging term amongst other places in contrib.
So I think it's worth having an indexable query for that, which we can do easily with the LIKE to ILIKE mapping. But indeed, autocomplete was the main win here and that's looks un-fixable.
Comment #22
David StraussThe current scalability for autocompletion is at least as bad as for taxonomy_get_term_by_name(). That's a problem because, if taxonomy_get_term_by_name() gets a big win from this issue, it means we have a critical issue with slowness for autocompletion.
@catch You are correct. I overlooked the taxonomy_get_term_by_name() benefit, which is comparatively small to achieving an indexed autocomplete but still nontrivial.
Comment #23
catchMarked #175421: taxonomy_get_term_by_name uses "wrong" SQL as duplicate - this snippet is interesting about the LIKE indexes - http://pastie.caboo.se/96865
Comment #24
catchtaxonomy_get_term_by_name() now uses LIKE on MySQL and ILIKE on postgresql. So marking this duplicate of #343788: Taxonomy module doesn't use it's own APIs. Like David said, the autocomplete is unfixable with LIKE '%%%$string%%%'.
Comment #25
David Strauss@catch Indeed.
* Before removal of LOWER() on autocomplete: O(n)
* After removal of LOWER() on autocomplete: O(n)