My first attempt at avoiding a {url_alias} query for every single link (drupal_get_path_alias() inside url()) during file generation didn't work (see #477072: Optimize the generate chunk SQL query). A plain SELECT FROM {xmlsitemap} LEFT JOIN {url_alias} produced inconsistent results between MySQL and PostgreSQL, as well as duplicate rows, so I gave up on doing it with a join in the database. Currently the 6.x-2.x link output code is doing this:

function xmlsitemap_generate_chunk($handle, &$status, $chunk, $language = NULL) {
  $url_options = array('absolute' => TRUE, 'language' => $language);
  ...
  while ($link = db_fetch_array($query)) {
    ...
    $link_output = '<url><loc>' . url($link['url'], $url_options) . '</loc>';
    ...
  }
}

I was trying out the following code that fetches all the aliases for the specific language so they can be used to bypass the drupal_get_path_alias() calls inside url():

function xmlsitemap_get_alias($src, $language) {
  static $aliases;
  static $aliases_language;

  if (!isset($aliases) || $aliases_language != $language->language) {
    $aliases = array();
    $aliases_language = $language->language;
    $query = db_query("SELECT pid, src, dst FROM {url_alias} WHERE language IN ('%s', '') ORDER BY language, pid", $language->language);
    while ($alias = db_fetch_object($query)) {
      $aliases[$alias->src] = $alias->dst;
    }
  }

  return isset($aliases[$src]) ? $aliases[$src] : $src;
}

function xmlsitemap_generate_chunk($handle, &$status, $chunk, $language = NULL) {
  $url_options = array('absolute' => TRUE, 'language' => $language, 'alias' => TRUE);
  ...
  while ($link = db_fetch_array($query)) {
    ...
    $link_alias = xmlsitemap_get_alias($link['url'], $language);
    $link_output = '<url><loc>' . url($link_alias, $url_options) . '</loc>';
    ...
  }
}
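
To illustrate why the ORDER BY language, pid in xmlsitemap_get_alias() matters, here is a minimal, standalone sketch. It assumes the empty language-neutral value sorts before any language code, so language-specific rows are written last and win. The sample rows, the usort() emulation of the ORDER BY, and the name example_build_alias_map() are all made up for illustration and are not part of the module; the closure needs PHP 5.3 or later.

```php
<?php
// Hypothetical rows standing in for a {url_alias} result set.
$rows = array(
  array('pid' => 3, 'src' => 'node/1', 'dst' => 'about-fr', 'language' => 'fr'),
  array('pid' => 1, 'src' => 'node/1', 'dst' => 'about', 'language' => ''),
  array('pid' => 2, 'src' => 'node/2', 'dst' => 'contact', 'language' => ''),
);

function example_build_alias_map(array $rows) {
  // Emulate ORDER BY language, pid: language-neutral ('') rows come first.
  usort($rows, function ($a, $b) {
    $by_language = strcmp($a['language'], $b['language']);
    return $by_language !== 0 ? $by_language : $a['pid'] - $b['pid'];
  });

  // Later rows overwrite earlier ones that share the same source path.
  $aliases = array();
  foreach ($rows as $row) {
    $aliases[$row['src']] = $row['dst'];
  }
  return $aliases;
}

$aliases = example_build_alias_map($rows);
```

With these sample rows, node/1 resolves to the fr alias and node/2 falls back to its language-neutral alias.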

Ok, so now the benchmarking results!

Old code with 6500 links, 2 sitemap languages, generated 50 times with ab -n 50:

Requests per second:    0.39 [#/sec] (mean)
Time per request:       2549.337 [ms] (mean)
Time per request:       2549.337 [ms] (mean, across all concurrent requests)
Transfer rate:          0.23 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:  2436 2549  41.8   2554    2620
Waiting:     2432 2544  41.7   2550    2614
Total:       2436 2549  41.8   2554    2620

And the new code with xmlsitemap_get_alias() and the same conditions:

Requests per second:    0.89 [#/sec] (mean)
Time per request:       1123.020 [ms] (mean)
Time per request:       1123.020 [ms] (mean, across all concurrent requests)
Transfer rate:          0.52 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:  1092 1123  26.5   1115    1212
Waiting:     1077 1118  27.3   1110    1212
Total:       1092 1123  26.5   1115    1212

With two larger queries instead of one large query plus thousands of small ones, mean generation time dropped from about 2549 ms to about 1123 ms, a 56% improvement! I'm about to commit this to 6.x-2.x and wanted to share this with the other maintainers!
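
For reference, the 56% figure can be checked against the mean times per request reported by ab above (a worked check, rounded):

```latex
\[
\frac{2549.337 - 1123.020}{2549.337} = \frac{1426.317}{2549.337} \approx 0.56,
\qquad
\frac{2549.337}{1123.020} \approx 2.27
\]
```

In other words, generation takes roughly 56% less time, or runs about 2.27 times as fast.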

Comments

avpaderno’s picture

The result is very interesting, and I really appreciate that you shared it with other people.
I take it that the same approach can be applied to similar queries, where each individual result is equally likely to be used (or where that likelihood cannot be calculated).

Thanks again for sharing this interesting result.

dave reid’s picture

I did some isolated performance testing of just this function against a {url_alias} table with 75,000 aliases across two languages plus language-neutral paths. This final version ended up being the fastest, so I thought I'd share again:

/**
 * Given an internal Drupal path, return the alias for the path.
 *
 * This is similar to drupal_get_path_alias(), but designed to fetch all
 * aliases at once so that only one database query is executed instead of
 * several or possibly thousands during sitemap generation.
 *
 * @param $path
 *   An internal Drupal path.
 * @param $language
 *   A language code to use when looking up the paths.
 * @return
 *   The alias for the path, or the original path if no alias exists.
 */
function xmlsitemap_get_path_alias($path, $language) {
  static $aliases;
  static $last_language;

  if (!isset($aliases)) {
    $aliases['all'] = array();
    $query = db_query("SELECT src, dst FROM {url_alias} WHERE language = '' ORDER BY pid");
    while ($alias = db_fetch_object($query)) {
      $aliases['all'][$alias->src] = $alias->dst;
    }
  }
  if ($language && $last_language != $language) {
    unset($aliases[$last_language]);
    $aliases[$language] = array();
    $query = db_query("SELECT src, dst FROM {url_alias} WHERE language = '%s' ORDER BY pid", $language);
    while ($alias = db_fetch_object($query)) {
      $aliases[$language][$alias->src] = $alias->dst;
    }
    $last_language = $language;
  }

  if ($language && isset($aliases[$language][$path])) {
    return $aliases[$language][$path];
  }
  elseif (isset($aliases['all'][$path])) {
    return $aliases['all'][$path];
  }
  else {
    return $path;
  }
}
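
The fallback order at the end of the function (language-specific alias, then language-neutral alias, then the unaliased path) can be tried out in isolation. This is only a sketch: the in-memory $aliases array stands in for the static cache, and example_lookup_alias() and the sample paths are hypothetical names, not part of the module.

```php
<?php
// Hypothetical alias tables, keyed like $aliases in the function above:
// 'all' holds language-neutral aliases, other keys hold one language each.
$aliases = array(
  'all' => array('node/1' => 'about', 'node/2' => 'contact'),
  'fr' => array('node/1' => 'a-propos'),
);

function example_lookup_alias(array $aliases, $path, $language) {
  // 1. Prefer an alias specific to the requested language.
  if ($language && isset($aliases[$language][$path])) {
    return $aliases[$language][$path];
  }
  // 2. Otherwise fall back to a language-neutral alias.
  if (isset($aliases['all'][$path])) {
    return $aliases['all'][$path];
  }
  // 3. No alias at all: return the internal path unchanged.
  return $path;
}

$results = array(
  example_lookup_alias($aliases, 'node/1', 'fr'),
  example_lookup_alias($aliases, 'node/2', 'fr'),
  example_lookup_alias($aliases, 'node/3', 'fr'),
  example_lookup_alias($aliases, 'node/1', 'en'),
);
```

Here the four lookups exercise each branch in turn, including a language ('en') that has no aliases of its own.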

avpaderno’s picture

I noticed that the function keeps the aliases separate for each language, but the aliases for a language are removed if the passed language differs from the one saved in the static variable. Are there any cases where the language can take two different values within the same page request?
It's just something I don't understand about the code, which works well.

dave reid’s picture

The way generation works in 6.x-2.x is we run the sitemap chunks by language. For instance, if we have two languages (en and fr) and 2 sitemap chunks, the generation creates the files in the following order:

xmlsitemap-en.xml
xmlsitemap-1-en.xml
xmlsitemap-2-en.xml
xmlsitemap-fr.xml
xmlsitemap-1-fr.xml
xmlsitemap-2-fr.xml

Because of this, once we've "switched" languages to fr in xmlsitemap_get_path_alias, we've moved on and don't need any of the en-specific aliases anymore. That way it can save memory instead of keeping all language aliases.
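
The memory-saving behaviour described above can be sketched without a database. This is a simplified illustration, not the module's code: example_load_aliases() is a hypothetical stand-in for the per-language {url_alias} query, and example_cached_languages() returns the cache keys only so the eviction is visible.

```php
<?php
// Hypothetical loader standing in for the per-language {url_alias} query.
function example_load_aliases($language) {
  $data = array(
    'en' => array('node/1' => 'about'),
    'fr' => array('node/1' => 'a-propos'),
  );
  return isset($data[$language]) ? $data[$language] : array();
}

// Returns the language keys held in the static cache after handling $language.
function example_cached_languages($language) {
  static $aliases = array();
  static $last_language = NULL;

  if ($last_language !== $language) {
    // Drop the previous language's aliases before loading the new ones.
    if ($last_language !== NULL) {
      unset($aliases[$last_language]);
    }
    $aliases[$language] = example_load_aliases($language);
    $last_language = $language;
  }
  return array_keys($aliases);
}

// Languages requested in the same en-then-fr order as the file list above.
$seen = array();
foreach (array('en', 'en', 'en', 'fr', 'fr', 'fr') as $language) {
  $seen[] = implode(',', example_cached_languages($language));
}
```

At no point does the cache hold more than one language's aliases, which is the trade-off being made: it relies on generation never returning to an earlier language within the same request.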

avpaderno’s picture

I just noticed that without $aliases[$last_language] it would not be possible to delete the aliases for a specific language in one go.
Then, as you reported, the aliases are queried in order of the language being used.

The code was clearly correct; it was I who needed to understand it correctly. Thanks again for the reply, and for sharing.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.