My first attempt to avoid querying the {url_alias} table for every single link during file generation (via drupal_get_path_alias() in url()) didn't work (see #477072: Optimize the generate chunk SQL query). A single SELECT FROM {xmlsitemap} LEFT JOIN {url_alias} behaved differently on MySQL and PostgreSQL and returned duplicate rows, so I gave up on doing it with a join in the database. Currently the 6.x-2.x link output code is doing this:
function xmlsitemap_generate_chunk($handle, &$status, $chunk, $language = NULL) {
  $url_options = array('absolute' => TRUE, 'language' => $language);
  ...
  while ($link = db_fetch_array($query)) {
    ...
    $link_output = '<url><loc>' . url($link['url'], $url_options) . '</loc>';
    ...
  }
}
I tried out the following code, which fetches all the aliases for the specific language up front so they can be used to bypass the drupal_get_path_alias() calls inside url():
function xmlsitemap_get_alias($src, $language) {
  static $aliases;
  static $aliases_language;
  // (Re)load the alias map on first use or when the language changes.
  if (!isset($aliases) || $aliases_language != $language->language) {
    $aliases = array();
    $aliases_language = $language->language;
    // Language-neutral aliases ('') sort first, so language-specific ones override them.
    $query = db_query("SELECT pid, src, dst FROM {url_alias} WHERE language IN ('%s', '') ORDER BY language, pid", $language->language);
    while ($alias = db_fetch_object($query)) {
      $aliases[$alias->src] = $alias->dst;
    }
  }
  // Fall back to the unaliased path when no alias exists.
  return isset($aliases[$src]) ? $aliases[$src] : $src;
}
function xmlsitemap_generate_chunk($handle, &$status, $chunk, $language = NULL) {
  // 'alias' => TRUE tells url() the path is already aliased, so it skips its own lookup.
  $url_options = array('absolute' => TRUE, 'language' => $language, 'alias' => TRUE);
  ...
  while ($link = db_fetch_array($query)) {
    ...
    $link_alias = xmlsitemap_get_alias($link['url'], $language);
    $link_output = '<url><loc>' . url($link_alias, $url_options) . '</loc>';
    ...
  }
}
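To make the override order concrete, here is a minimal standalone sketch of the same preload-then-lookup idea. It does not touch the database: the $url_alias_rows array and every sketch_* function name are made up for illustration, and the sort only imitates what ORDER BY language, pid is expected to do (language-neutral rows first, then rows for the requested language).

```php
<?php
// Hypothetical stand-in for rows of {url_alias}; all values are made up.
$url_alias_rows = array(
  array('pid' => 1, 'src' => 'node/1', 'dst' => 'about', 'language' => ''),
  array('pid' => 2, 'src' => 'node/1', 'dst' => 'a-propos', 'language' => 'fr'),
  array('pid' => 3, 'src' => 'node/2', 'dst' => 'contact', 'language' => ''),
  array('pid' => 4, 'src' => 'node/2', 'dst' => 'contact-us', 'language' => ''),
);

// Orders rows by language, then pid; '' compares lower than any language code.
function sketch_compare_alias_rows($a, $b) {
  $by_language = strcmp($a['language'], $b['language']);
  if ($by_language !== 0) {
    return $by_language;
  }
  return $a['pid'] - $b['pid'];
}

// Builds a src => dst map for one language from the in-memory rows.
function sketch_build_alias_map(array $rows, $langcode) {
  // Keep only language-neutral rows and rows for the requested language.
  $matching = array();
  foreach ($rows as $row) {
    if ($row['language'] === '' || $row['language'] === $langcode) {
      $matching[] = $row;
    }
  }
  usort($matching, 'sketch_compare_alias_rows');
  // Later rows overwrite earlier ones that share the same src.
  $map = array();
  foreach ($matching as $row) {
    $map[$row['src']] = $row['dst'];
  }
  return $map;
}

// Returns the alias for $src, or $src itself when there is none.
function sketch_lookup(array $map, $src) {
  return isset($map[$src]) ? $map[$src] : $src;
}

$fr = sketch_build_alias_map($url_alias_rows, 'fr');
$en = sketch_build_alias_map($url_alias_rows, 'en');
```

In this sketch the French map resolves node/1 to the French alias, the English map falls back to the language-neutral one, and a path with no alias comes back unchanged. When two aliases share a language, the one with the higher pid wins.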
Ok, so now the benchmarking results!
Old code with 6500 links, 2 sitemap languages, generated 50 times with ab -n 50:
Requests per second: 0.39 [#/sec] (mean)
Time per request: 2549.337 [ms] (mean)
Time per request: 2549.337 [ms] (mean, across all concurrent requests)
Transfer rate: 0.23 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0     0    0.0      0     0
Processing:  2436  2549   41.8   2554  2620
Waiting:     2432  2544   41.7   2550  2614
Total:       2436  2549   41.8   2554  2620
And the new code with xmlsitemap_get_alias() and the same conditions:
Requests per second: 0.89 [#/sec] (mean)
Time per request: 1123.020 [ms] (mean)
Time per request: 1123.020 [ms] (mean, across all concurrent requests)
Transfer rate: 0.52 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0     0    0.0      0     0
Processing:  1092  1123   26.5   1115  1212
Waiting:     1077  1118   27.3   1110  1212
Total:       1092  1123   26.5   1115  1212
With two larger queries instead of one large query plus thousands of small ones, mean generation time drops from about 2549 ms to about 1123 ms per request, a 56% reduction! I'm about to commit this to 6.x-2.x and wanted to share this with the other maintainers!
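For anyone who wants to check the percentage, the arithmetic can be reproduced from the two "Time per request" means reported by ab. This is only a small sketch of the calculation; the variable names are mine.

```php
<?php
// Mean time per request in milliseconds, copied from the two ab runs.
$old_ms = 2549.337;
$new_ms = 1123.020;

// Relative reduction in time per request, as a percentage.
$reduction_pct = ($old_ms - $new_ms) / $old_ms * 100;

// Speedup factor: how many times faster the new code is.
$speedup = $old_ms / $new_ms;

printf("Reduction: %.1f%%\n", $reduction_pct);
printf("Speedup: %.2fx\n", $speedup);
```

The reduction rounds to the 56% quoted above, which corresponds to a bit more than a 2x speedup.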
Comments
Comment #1
avpaderno
The result is very interesting, and I really appreciate that you shared it with other people.
I take it that the same approach can be applied to similar queries, where each individual result is equally likely to be used (or where that likelihood cannot be estimated).
Thanks again for sharing this interesting result.
Comment #2
dave reid
I did some individual performance testing on just this function, using a {url_alias} table with 75,000 aliases covering two different languages plus language-neutral paths. This final function ended up being the fastest, so I thought I'd share again:
Comment #3
avpaderno
I noticed that the function keeps the aliases separate for each language, but then removes the aliases for a language if the passed language differs from the one saved in the static variable. Are there any cases where the language can have two different values within the same page request?
It's just something I don't understand about the code, which works well.
Comment #4
dave reid
The way generation works in 6.x-2.x is that we run the sitemap chunks by language. For instance, if we have two languages (en and fr) and 2 sitemap chunks, the generation creates the files in the following order:
xmlsitemap-en.xml
xmlsitemap-1-en.xml
xmlsitemap-2-en.xml
xmlsitemap-fr.xml
xmlsitemap-1-fr.xml
xmlsitemap-2-fr.xml
Because of this, once we've "switched" languages to fr in xmlsitemap_get_path_alias, we've moved on and don't need any of the en-specific aliases anymore. That way it can save memory instead of keeping all language aliases.
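The memory-saving behaviour described here can be sketched roughly as follows. This is not the function from comment #2; it is a hypothetical standalone version in which sketch_get_path_alias(), the $loader callback, and the sample data are all made up, with the callback standing in for the {url_alias} query. It assumes PHP 5.3 or later for the closure.

```php
<?php
// Hypothetical per-language alias cache that keeps only one language in memory.
// $loader is an assumed callback returning a src => dst array for a language code.
function sketch_get_path_alias($src, $langcode, $loader) {
  static $aliases = array();
  static $last_language = NULL;

  if ($last_language !== $langcode) {
    // Generation has moved on to another language: drop the previous
    // language's aliases in one step to free memory.
    if ($last_language !== NULL) {
      unset($aliases[$last_language]);
    }
    $aliases[$langcode] = call_user_func($loader, $langcode);
    $last_language = $langcode;
  }

  // Fall back to the unaliased path when no alias exists.
  return isset($aliases[$langcode][$src]) ? $aliases[$langcode][$src] : $src;
}

// Made-up loader that counts how often it is called.
$load_count = 0;
$loader = function ($langcode) use (&$load_count) {
  $load_count++;
  $data = array(
    'en' => array('node/1' => 'about'),
    'fr' => array('node/1' => 'a-propos'),
  );
  return isset($data[$langcode]) ? $data[$langcode] : array();
};

// Simulate generation order: all en chunks first, then all fr chunks.
$results = array();
foreach (array('en', 'en', 'fr', 'fr') as $langcode) {
  $results[] = sketch_get_path_alias('node/1', $langcode, $loader);
}
```

Because the chunks are processed language by language, the loader runs once per language in this sketch. If calls alternated between languages, it would reload on every switch, which is why the ordering of the generated files matters.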
Comment #5
avpaderno
I just noticed that without
$aliases[$last_language]
it would not be possible to delete the aliases for a specific language all at once. Then, as you reported, the aliases are queried in language order.
The code was clearly correct; it was I who had not understood it correctly. Thanks again for the reply, and for sharing.