taxonomy/term/n queries currently cause at least two filesorts - one in taxonomy_select_nodes() the other in taxonomy_get_tids_from_nodes(), these are in D6 too.
#412518: Convert taxonomy_node_* related code to use field API + upgrade path will be filesort free once #562816: Poor performance on field_attach_query() lands, but instead, will have to load all nids attached to a term into memory then stick them in a massive IN() query due to pluggable field storage not allowing us to join on any field tables. Which is actually more performant than current queries up to about 10,000 nodes atttached to any single term, but would quickly cause scalability/max_allowed_packet issues going much beyond 20,000 (about 50-60mb for the request measured during my test runs of fetching 20,000 nids).
So, opening this issue to discuss proper denormalization of term_node. Hoping that the field patch goes in, I think we need a table like:
entity type | bundle | entity id | term id | created | status | sticky
(created and status we could store for any entity which supports it - possibly re-using for something like user term listings, otherwise they'd be null).
This table would be managed by taxonomy module, and probably kept in sync by field_attach and/or nodeapi hooks.
Major issues with doing this:
1. It's not extensible - say you want to sort by another column, you'd need to hook_schema_alter() that into the table or build your own.
2. You'll never, ever be allowed to do update node set status = 1 WHERE $some_condition - because this table would get out of sync
3. If you disabled then re-enabled taxonomy module, we'd need to rebuild node->created and similar values, which may or may not require tracker2 style indexing.
4. This is a job which should really be handled by materialized view API (and I believe is handled by materialized views on d.o already) - however that's neither in core, nor in CVS.
I don't have time to work on this and get it committable before September 1st, however it's a schema change, and last indications were that performance patches requiring big schema changes might not get in post freeze. However, assuming the field patch gets in, we're unlikely to actually break any new APIs here, although it might mean adding a couple of (new) convenience functions.
If a decision is taken that doing this is a prerequisite to #412518: Convert taxonomy_node_* related code to use field API + upgrade path, fair enough, but we'll probably need to revert all taxonomy field patches, or get a code freeze extension, if that's the case.
| Comment | File | Size | Author |
|---|---|---|---|
| #13 | taxonomy.patch | 115.7 KB | catch |
| #12 | taxonomy_denormalization.patch | 115.22 KB | catch |
| #10 | taxonomy_denormalize.patch | 114.79 KB | catch |
| #8 | taxonomy_denormalize.patch | 105.48 KB | catch |
| #7 | taxonomy_denormalize.patch | 95.64 KB | catch |
Comments
Comment #1
yched commentedHow would the combination of
- #443422: 'per field' storage engine (fields chose their storage backend)
- #552716: Impossible to join or order by on field tables without using Views (a/ force taxonomy fields to be stored in a 'local SQL' storage backend, b/ add an official API for 'local SQL' storage backends to publicize their table names, allowing business code to build the direct JOINed queries they need)
help on this ?
Comment #2
catchI think in general that'd really help, and we need to try to get these two patches in for Views to work/be possible in D7 IMO.
In the specific case of taxonomy.module - since that'd more or less get us back to the D6 queries, and the D6 queries are terrible, we might be better off embracing the constraints here and going for denormalization anyway.
Comment #3
Leeteq commentedSubscribing.
Comment #4
damien tournoud commentedWhat we really need is a way for a module to add additional "properties" to a field defined by another module. As far as I know, this is not yet possible, and lead to the somewhat ugly filefield/imagefield custom hooks in D6. This would really help that issue (ie. in this model, the node module would add its own custom properties n.created, n.sticky, etc. to the term field). How could we make that happen?
Comment #5
catchThere's no hook_field_schema_alter() - and afaik that's by design, so a proper API for this seems quite far off. Also the way the taxonomy as field patch currently works, you can have more than one field per vocabulary, which would mean a lot of data duplication. I think materialized views is the proper answer to this but no chance of getting that into core this release.
Realised we don't need status on this table, because we should simply make sure that unpublished nodes don't show up in it, but we do need tracker2 style 'last_commented_or_updated' column if it's to be used for forums, so:
entity type | bundle | entity id | term id | created | sticky | last_activityThen:
Comment #6
catchHere's an initial, working, patch - it's currently based on, and includes #412518: Convert taxonomy_node_* related code to use field API + upgrade path.
I decided to abandon an entity-agnostic table - our only use case in core is taxonomy listings and forum topic listings, so we should do the minimum to support those, and let views / materialized views (and hopefully views and materalized views integration) handle other entity types when that's actually needed. That keeps the code here very simple - I'd hope to see both views and mv API move into core in the next couple of releases, in which case we can replace both this and the recently committed tracker indexing tables with those.
Done:
schema changes
taxonomy_field_*() hooks (and taxonomy_node_delete()).
TODO:
hook_update_N()
forum_comment_*() hooks to populate the 'changed' column + reworking forum queries to use this.
I don't think we need the same kind of indexing here as tracker tables - the only issue would be if you disabled then re-enabled forum and/or taxonomy modules for a while then had people posting comments on orphaned forum topics - as soon as any topic is update it's data will get refreshed in that edge case anyway.
I'm keeping this separate from the taxonomy_node_* patch for now, but will probably merge it back in when it's ready for review.
EXPLAIN on the current taxonomy_select_nodes() query and schema:
Comment #7
catchAnd the patch.
Note if anyone has any ideas on how to get that query having no extra 'where' column, that'd be great. Initial investigations suggest it won't happen.
Comment #8
catchNew update.
Now creates two tables:
{taxonomy_index}
{forum_index}
I was originally going to try to keep it to one table, but {forum_index} probably needs to account for the tablesort query if it's going to be really useful, which means more columns than we can reasonable be expected to maintain in taxonomy.module.
Added upgrade path for everything apart from forum data.
Didn't yet convert any forum queries to use the new table - that's up next.
Marking CNR for the bot to see if I managed to break anything along the way so far.
Promoting to bug report since afaik both of these are currently handled by custom code on Drupal.org to stop it falling over.
Comment #10
catchGiving up for the night. Have forum topic listings more or less working - this should also restore shadow forum topics (untested though).
Comment #12
catchStill needs upgrade path for forums and probably some more work on optimizing the topic listing query (and maybe convert some others), but this should pass tests now.
Comment #13
catchforum_get_topics queries now look like this:
Merged this back into #511286: D7UX: Implement customizable shortcuts now that it's fully functional with upgrade path. So marking postponed on that issue dependent on what happens over there. Latest patch attached here too so we know where things were left off in case this needs re-rolling independently.
Comment #14
catchThis got in with the Field API patch.