taxonomy_autocomplete (and taxonomy_get_term_by_name) use LOWER(term_data.name) in their queries. LOWER() can't be indexed so I think we should look into having a name_lower column in term_data and query against that instead.

Attached is a proof of concept patch, which uses mb_strtolower($name, 'UTF-8') to lowercase the term names (strtolower isn't safe for multibyte encoding) on save and removes LOWER() from the autocomplete query.

Basic idea came from http://drupal.org/node/83738 and pwolanin directed me towards mb_strtolower in connection with that issue. If this is a goer, then we could apply it to other places in core.

Obviously this adds storage overhead for taxonomy. Any benefits from this approach are going to apply to sites with free tagging vocabularies, and they're most likely to have big vocabularies, so the space/speed trade off ought to balance out.

Update path should be add_column, then UPDATE term_data SET name_lower = LOWER(name) - since LOWER and mb_strtolower() ought to have the same results.

There's plenty missing from this patch, mainly interested in whether it's an idea worth pursuing at all at this stage.

CommentFileSizeAuthor
#3 term_lower.patch3.99 KBcatch
#1 term_lower.patch4 KBcatch
term_lower.patch2.67 KBcatch
Support from Acquia helps fund testing for Drupal Acquia logo

Comments

catch’s picture

Status: Needs work » Needs review
FileSize
4 KB

Looking more at http://drupal.org/node/83738 this ought to be fine. New patch has an upgrade path which I've tested, and a slight tidy up for the code so marking to needs review. taxonomy tests still pass as well.

If this is suitable, we could apply the same approach to user.module

Wim Leers’s picture

Status: Needs review » Needs work

Very quick look at this patch:

- spacing issues in taxonomy_update_7000()
- // Must use mb_strtolower() to be multi-byte safe.: one space too much before "to be"

One concern: shouldn't we use the Batch API for the upgrade path? What if a site has > 1 million terms? (Playing the devil's advocate here, not sure if other upgrade paths take this into account.)

catch’s picture

Status: Needs work » Needs review
FileSize
3.99 KB

New patch fixes whitespace issues, thanks.

On batch API, I don't know to be honest.

We have queries like this in system_update_x though:

db_query('UPDATE {term_node} t SET vid = (SELECT vid FROM {node} n WHERE t.nid = n.nid)');
$ret[] = update_sql("UPDATE {profile_fields} SET category = 'Account settings' WHERE LOWER(category) = 'account'");

Which deal with potentially much larger numbers of records, so I'd rather avoid adding in batch API for now, at least until there's major objections.

Wim Leers’s picture

Yep, true. This is one of the remaining "Drupal for large scale web sites" problems then.

Patch looks good, but I haven't got the time to test right now. I'll test this later this week unless somebody else already does so before me.

Damien Tournoud’s picture

Status: Needs review » Needs work

Well. It looks like LIKE is nearly always case insensitive anyway (both on MySQL with the default utf8_general collation and on PostgreSQL running in LOCALE=C), so the LOWER is not even required here.

Please just drop the LOWER and confirm it works as expected.

Damien Tournoud’s picture

My bad, LIKE *is* case-sensitive on PostgreSQL, so please ignore my previous suggestion.

For your information, http://drupal.org/node/51220 has all you need to write foolproof upgrade functions.

catch’s picture

Status: Needs work » Needs review

Damien, do you have a link to the relevant postgres docs for this? All I can find is:

The key word ILIKE can be used instead of LIKE to make the match case-insensitive according to the active locale. This is not in the SQL standard but is a PostgreSQL extension.

Which suggests otherwise, but I know very little about postgres so...

Additionally, SQLite docs suggest LIKE is case_sensitive there too - although this is mentioned parenthetically as a bug. Since we're hoping to add SQLite this cycle, I'd prefer a solution which works with all three databases reliably. http://www.sqlite.org/lang_expr.html

Having said that, if I'm wrong, and we can just remove LOWER() from these queries, that'd be lovely, and easily back portable to D6, but it'd be nice to know it won't just store up problems for later first.

Will do a quick test on MySQL later, don't have postgres to test on. Back to needs review until we're sure just stripping it out will work across at least the three target databases. edit: cross-posted.

Damien Tournoud’s picture

@catch, as I clarified in #6, LIKE really is case-sensitive in PostgreSQL, which is a real pain.

Senpai’s picture

Unsubscribing so I can review/test this bugger later.

pwolanin’s picture

subscribe

catch’s picture

I got bored waiting for devel generate to create 1 million terms, but with 52485 terms it takes 6.85 seconds to run that query, which I think ought to be acceptable - tracker select queries can take 25-30 seconds or more on big sites so for a one off it seems doable.

Damien Tournoud’s picture

Status: Needs review » Needs work

Well, there is a drupal_strtolower(), so independently of whether this works or not (I haven't tested it), this is a cause of CNW.

catch’s picture

Status: Needs work » Postponed

Spoke to chx and Crell on irc, and they said the new database layer may allow us to use ILIKE in postgres and LIKE in mysql for queries like this - getting rid of the case sensitivity issues without requiring a new column. Obviously that'd be an ideal solution, so marking postponed (rather than won't fix, so I remember to revisit this when that's possible).

Damien Tournoud’s picture

@catch: I'm unsure if PostgreSQL can use it's indexes for ILIKE queries. This will have to be verified.

catch’s picture

Status: Postponed » Needs work
Damien Tournoud’s picture

It looks like you can create indexed based on an expression in PostgreSQL. For example:

CREATE INDEX index ON table (LOWER(column));

This could lead to noticeable performance improvements in all current PostgreSQL deployments.

An alternative and maybe complementary approach is to build an index based on case-insensitive operators, as detailed in [1] (but for an old version of PostgreSQL). Like this:

CREATE INDEX index ON table (column operator_class);

See [2] for more details.

These techniques looks more elegant than the creation of *_lower columns.

[1] http://archives.postgresql.org/pgsql-hackers/2003-10/msg01550.php
[2] http://www.postgresql.org/docs/8.3/interactive/indexes-opclass.html

catch’s picture

If we can use the new DB layer to swap LIKE for ILIKE, and then add the computed index in postgres via schema API, then I agree this would be a more elegant solution (although possibly username/name in user.module is worth doing with the extra column anyway for other reasons).

However, this opens us up to maintaining a different schema between postgres and MySQL - and then we may have to introduce other special cases for SQLite or whatever gets supported in the future (or they'll just be slower). webchick already brought this up on one of the grandfather issues to this one: http://drupal.org/node/83738#comment-447137

Probably we should wait and see what happens with the DB layer, and explore an index on lower in more depth. However I'll leave this at needs work instead of postponed in the hope that others will chime in.

Damien Tournoud’s picture

@catch: we really should abstract dirty implementation details as most as possible in the Schema API. Write the same queries, define the same schema, and let the Schema API and the Database Layer collaborate to do the rest.

This means we will probably need a new abstract column type "ivarchar".

catch’s picture

Title: Add column 'term_lower' to taxonomy » Remove lower from taxonomy (use ILIKE on postgresql)
Priority: Normal » Critical

Retitling now dbtng is in.

David Strauss’s picture

Priority: Critical » Normal
Status: Needs work » Closed (won't fix)

I'm not sure how I missed this issue, but this is not a productive change. Our autocomplete uses a "%text%" pattern for LIKE, which MySQL cannot use an index to satisfy, whether or not we pre-compute the lowercase names and index them.

In case someone feels the need to re-open this issue, I'm reducing this to priority "normal" because taxonomy autocompletion performance is most definitely not "critical."

catch’s picture

Status: Closed (won't fix) » Active

We're still using a straight LOWER() in taxonomy_get_term_by_name(), which is used in taxonomy_node_save() on each entered free-tagging term amongst other places in contrib.

So I think it's worth having an indexable query for that, which we can do easily with the LIKE to ILIKE mapping. But indeed, autocomplete was the main win here and that's looks un-fixable.

David Strauss’s picture

The current scalability for autocompletion is at least as bad as for taxonomy_get_term_by_name(). That's a problem because, if taxonomy_get_term_by_name() gets a big win from this issue, it means we have a critical issue with slowness for autocompletion.

@catch You are correct. I overlooked the taxonomy_get_term_by_name() benefit, which is comparatively small to achieving an indexed autocomplete but still nontrivial.

catch’s picture

Marked #175421: taxonomy_get_term_by_name uses "wrong" SQL as duplicate - this snippet is interesting about the LIKE indexes - http://pastie.caboo.se/96865

catch’s picture

Title: Remove lower from taxonomy (use ILIKE on postgresql) » Remove lower from taxonomy autocomplete
Assigned: catch » Unassigned
Status: Active » Closed (duplicate)

taxonomy_get_term_by_name() now uses LIKE on MySQL and ILIKE on postgresql. So marking this duplicate of #343788: Taxonomy module doesn't use it's own APIs. Like David said, the autocomplete is unfixable with LIKE '%%%$string%%%'.

David Strauss’s picture

@catch Indeed.

* Before removal of LOWER() on autocomplete: O(n)
* After removal of LOWER() on autocomplete: O(n)