I'm running a test on a server for a live site conversion that's supposed to happen very soon, and the regeneration process is getting stuck at the Initializing step. It's re-creating approximately 110k aliases, so it's not an incredibly huge amount. I've run it fine before on my local machine; it took a while, but it ran. On the server, however, MySQL is using 95% of only one processor, even though there are four on the box. As it sits now, it's been at the Initializing phase for over an hour. Any thoughts on anything I can do or look at to see why it's getting stuck?

Thanks.

Comments

wonder95’s picture

For what it's worth, here's what the query looks like in a profiler.

wonder95’s picture

Per a tweet from Steven Merrill, it appears that having a function (CONCAT() in this case) in the query is what causes a full table scan. The options appear to be:

  • Rewrite the query so it doesn't use a function.
  • Get the data into PHP and use it to do a count.
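As a generic illustration of why the first option matters (a minimal sketch using Python and sqlite as a stand-in for MySQL, with made-up table contents; this is not Pathauto's actual query), wrapping an indexed column in a function typically prevents the planner from seeking the index:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE url_alias (pid INTEGER PRIMARY KEY, src TEXT, dst TEXT);
    CREATE INDEX idx_src ON url_alias (src);
""")
conn.executemany(
    "INSERT INTO url_alias (src, dst) VALUES (?, ?)",
    [(f"node/{i}", f"content/page-{i}") for i in range(1000)],
)

def plan(sql):
    # Concatenate the planner's description of how the query is resolved.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Plain equality on the indexed column: the planner can seek the index
# (the plan mentions SEARCH ... USING INDEX).
plan_indexed = plan("SELECT dst FROM url_alias WHERE src = 'node/42'")
print(plan_indexed)

# Wrapping the indexed column in a function: the index no longer applies,
# so the planner falls back to scanning every row (the plan mentions SCAN).
plan_function = plan("SELECT dst FROM url_alias WHERE lower(src) = 'node/42'")
print(plan_function)
```

The same principle is what the tweet describes for MySQL: once a function has to be evaluated per row, the optimizer can no longer use the index to narrow the search.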
wonder95’s picture

Title: Stuck on initialization when bulk updating aliases » Initialization process uses query that causes full table scan
Category: support » bug
FileSize
5.99 KB

Changing title and category to match issue.

As demonstrated by the profiler screenshots here and here, the queries in taxonomy_pathauto_bulk_update_batch_process() and node_pathauto_bulk_update_batch_process() cause full table scans because they use the CONCAT() MySQL function. The attached patch offloads all of that processing to PHP using arrays. With this code, the process gets right to the updating functionality and doesn't get stuck on the initialization part.

I'm sure there are ways this could be improved in terms of memory management, so I welcome suggestions on how to improve it. I looked at ways to do it in MySQL, but because the "node/" and "taxonomy/term/" strings have to be concatenated with the tids and nids, I'm not sure there's another way to do it in the database. I do remember a query type in MSSQL and Oracle that is basically a loop through records, but I'd have to do some research on that. Even so, I think CONCAT() would still have to be used, unless you create the whole src value first and then run an individual query for each row.
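The patch's general idea, doing the anti-join ("which entities have no alias yet?") in application code instead of in SQL, can be sketched roughly like this. This is a simplified stand-in (Python with sqlite instead of the PHP/MySQL original, and toy table shapes), not the patch itself:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (nid INTEGER PRIMARY KEY);
    CREATE TABLE url_alias (pid INTEGER PRIMARY KEY, src TEXT);
""")
conn.executemany("INSERT INTO node (nid) VALUES (?)", [(i,) for i in range(1, 11)])
# Pretend nodes 1, 2, 5, and 9 already have aliases.
conn.executemany("INSERT INTO url_alias (src) VALUES (?)",
                 [(f"node/{i}",) for i in (1, 2, 5, 9)])

# Pull the existing src values into memory once, then let the application
# build the "node/<nid>" string and test membership, instead of asking the
# database to evaluate CONCAT('node/', nid) row by row.
aliased = {row[0] for row in conn.execute("SELECT src FROM url_alias")}
unaliased = [nid for (nid,) in conn.execute("SELECT nid FROM node ORDER BY nid")
             if f"node/{nid}" not in aliased]
print(unaliased)  # the nids that still need an alias
```

The trade-off is exactly the memory-management concern raised above: the set of existing aliases has to fit in memory, which at ~110k rows of short strings is usually fine but worth keeping in mind.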

Anyway, I'm open to suggestions as to how this could be improved (or why it should be thrown out).

As a side note, in looking at the different process functions (taxonomy, node, forum, user, etc.) they are similar enough that I'm thinking they could be combined into one function. I'll look at that when I get a chance.

Thanks.

Dave Reid’s picture

Yeah, unfortunately we're stuck with a crappy core table. I've been thinking about creating a url_alias_entity_index sub-module that can help perform this indexing against {url_alias} and could optionally be used by Pathauto.

sp3boy’s picture

I have a site with >100,000 aliases to rebuild, so I'm interested in this issue.

In the patch above, is it necessary to re-query url_alias every time another batch of 25 nodes is processed?

(Also, don't you need {...} around table names in your SQL text?)

jscm’s picture

+1000000 subscribe
I really need a solution to this problem.

jscm’s picture

The suggested patch works better than the original pathauto.pathauto.inc queries on my test website.

It's faster and doesn't require as many resources from the server.

jscm’s picture

I tried the patch on pathauto.pathauto.inc on my production website and got an error message with severity 3 showing a query that doesn't exist.

wonder95’s picture

Here's a completely different approach that worked for updating ~80k nodes. It involves three parts:

  1. An index table to store the nodes/users/terms to be aliased.
  2. A drush command to fill the table.
  3. Modified queries in the *_pathauto_bulk_update_batch_process() functions to use the index table.

The idea is to run the drush command to fill the index table with any combination of the three types (node, user, and term), and then run the bulk update at admin/build/path/update_bulk (after deleting the appropriate aliases at admin/build/path/delete_bulk). It can take a while to run, but it works.

I only used this process for nodes and terms, but I modified user_pathauto_bulk_update_batch_process() anyway, even though I haven't actually tested it. It should work, though.
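Under the same stand-in assumptions as before (Python/sqlite instead of PHP/MySQL, simplified schemas, and a hypothetical pathauto_index table name; the real patch's table and drush command may differ), the three parts might look like:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (nid INTEGER PRIMARY KEY);
    CREATE TABLE url_alias (pid INTEGER PRIMARY KEY, src TEXT);
    -- Part 1: the index table of entities awaiting aliases.
    CREATE TABLE pathauto_index (etype TEXT, eid INTEGER, PRIMARY KEY (etype, eid));
""")
conn.executemany("INSERT INTO node (nid) VALUES (?)", [(i,) for i in range(1, 101)])
# Pretend the odd-numbered nodes already have aliases.
conn.executemany("INSERT INTO url_alias (src) VALUES (?)",
                 [(f"node/{i}",) for i in range(1, 101, 2)])

# Part 2: fill the index table once up front (the drush command's job), so the
# expensive "which entities lack an alias" check runs a single time instead of
# on every batch.
conn.execute("""
    INSERT INTO pathauto_index (etype, eid)
    SELECT 'node', nid FROM node
    WHERE 'node/' || nid NOT IN (SELECT src FROM url_alias)
""")

# Part 3: each bulk-update batch then reads cheaply from the index table,
# e.g. 25 entities at a time.
batch = [eid for (eid,) in conn.execute(
    "SELECT eid FROM pathauto_index WHERE etype = 'node' ORDER BY eid LIMIT 25")]
print(batch)
```

The point of the design is that the per-batch query becomes a plain indexed lookup on the index table, so the batch loop never repeats the full-table-scan work that caused the original Initializing stall.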

Possible improvements:

  • Instead of a drush command, implement the index table filling as part of the generation process.
  • Integrate it all together so that it deletes existing aliases and generates URL redirects (which would have been ideal in my case, since I had to go back and write some code to create new aliases for all those nodes), similar to the option at admin/build/path/settings.
kenorb’s picture

Issue summary: View changes
Status: Active » Needs review

The last submitted patch, 3: full-table-scan-query-1240760-3.patch, failed testing.

Status: Needs review » Needs work

The last submitted patch, 9: 1240760-bulk-update-index-table.patch, failed testing.

kenorb’s picture

Status: Needs work » Closed (outdated)

Closed because Drupal 6 is no longer supported. If the issue verifiably applies to later versions, please reopen with details and update the version.