Hi,

My site is set up as multilingual with "path prefix" language negotiation:
www.example.com : default language (French)
www.example.com/en : English version

The crawler crawls the pages in the default language properly.
However, it tries to crawl the English pages without the prefix, which results in "page not found" errors.

Thanks for your help

Laurent

Comment  File                Size     Author
#3       boost-886450.patch  1.83 KB  mikeytown2

Comments

agence web coheractio

Is anyone else having this issue?

mikeytown2

What about pages already in the cache? Does it work correctly with those? My guess is that you're loading URLs from the alias table. The code in question:


/**
 * Get URLs from url alias table
 */
function boost_crawler_add_alias_to_table() {
  // Insert a batch of HTML URLs into the boost_crawler table.
  // $db_type must be declared global for the pgsql check below to see it.
  global $base_url, $db_type;
  if (!variable_get('boost_crawl_url_alias', FALSE)) {
    return TRUE;
  }
  $count = BOOST_CRAWL_DB_IMPORT_SIZE;

  // Determine the maximum packet size for the database in use.
  if (stristr($db_type, 'pgsql')) {
    // Set Max Packet size to 16MB if using postgreSQL.
    $max_packet = 16777216;
  }
  else {
    // Get maximum packet size for mysql
    $max_packet = db_fetch_array(db_query("SHOW VARIABLES WHERE Variable_name = 'max_allowed_packet'"));
    // default to 1/2 MB
    $max_packet = (int)$max_packet['Value'] > 524288 ? (int)$max_packet['Value'] : 524288;

    // Get bulk insert buffer size
    $insert_buffer_size = db_fetch_array(db_query("SHOW VARIABLES WHERE Variable_name = 'bulk_insert_buffer_size'"));
    // default to 1/2 MB
    $insert_buffer_size = (int)$insert_buffer_size['Value'] > 524288 ? (int)$insert_buffer_size['Value'] : 524288;

    // Set max
    $max_packet = $max_packet > $insert_buffer_size ? $insert_buffer_size : $max_packet;
  }
  $max_chunk = $max_packet/512;
  $chunks = 0;
  $loop_counter = 0;

  $total = db_result(db_query("SELECT COUNT(*) FROM {url_alias} AS ua LEFT JOIN {node} AS n ON n.nid = CAST(substring(ua.src, 6) AS UNSIGNED) WHERE n.status = 1 OR n.status IS NULL"));
  $loaded = variable_get('boost_crawler_loaded_count_alias', 0);
  if ($total > $loaded) {
    $list = db_query_range("SELECT dst FROM {url_alias} AS ua LEFT JOIN {node} AS n ON n.nid = CAST(substring(ua.src, 6) AS UNSIGNED) WHERE n.status = 1 OR n.status IS NULL", $loaded, $count);
    $data = array();
    while ($url = db_result($list)) {
      $url = $base_url . '/' . $url;
      $md5 = md5($url);
      $data[$chunks][] = $url;
      $data[$chunks][] = $md5;
      $loop_counter++;
      if ($loop_counter > $max_chunk) {
        $chunks++;
        $loop_counter = 0;
      }
    }
    foreach ($data as $values) {
      boost_db_multi_insert('boost_crawler', array('url' => "'%s'", 'hash' => "'%s'"), $values, FALSE);
    }
    variable_set('boost_crawler_loaded_count_alias', $loaded + $count);
    return FALSE;
  }
  else {
    return TRUE;
  }
}

This query has been heavily optimized so that it can load millions of aliases in a short amount of time. Looking at the code, I might have a solution...
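
For readers following the batching in the function above, here is a minimal standalone sketch of how the URL and hash pairs are grouped into chunks before each multi-row insert. It has no Drupal dependency. The helper name `boost_sketch_chunk_urls()` and its parameters are hypothetical, and the chunk limit is passed in directly instead of being derived from `max_allowed_packet` as the module does.

```php
<?php
/**
 * Group URLs into chunks of flat (url, hash) pairs, mirroring the loop in
 * boost_crawler_add_alias_to_table(). Hypothetical helper for illustration.
 *
 * @param string $base_url  Site base URL without a trailing slash.
 * @param array  $aliases   Alias paths such as 'about-us'.
 * @param int    $max_chunk Row threshold; a chunk is closed once its row
 *                          count exceeds this value.
 * @return array Chunks, each a flat list: url, md5, url, md5, ...
 */
function boost_sketch_chunk_urls($base_url, array $aliases, $max_chunk) {
  $data = array();
  $chunk = 0;
  $rows_in_chunk = 0;
  foreach ($aliases as $alias) {
    $url = $base_url . '/' . $alias;
    $data[$chunk][] = $url;
    $data[$chunk][] = md5($url);
    $rows_in_chunk++;
    // Same comparison as the original loop: with ">" a chunk holds
    // $max_chunk + 1 rows before the next one is started.
    if ($rows_in_chunk > $max_chunk) {
      $chunk++;
      $rows_in_chunk = 0;
    }
  }
  return $data;
}

// Five aliases with a threshold of 1 give chunks of 2, 2 and 1 rows.
$chunks = boost_sketch_chunk_urls('http://www.example.com', array('a', 'b', 'c', 'd', 'e'), 1);
foreach ($chunks as $i => $values) {
  // Each chunk would be handed to one multi-row INSERT.
  echo 'chunk ', $i, ': ', count($values) / 2, " rows\n";
}
```

Note that the sketch, like the quoted function, only builds `$base_url . '/' . $alias`, so it has no notion of a language prefix. That is the gap this issue is about.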

mikeytown2

Version: 6.x-1.18 » 6.x-1.x-dev
Status: Active » Needs review
File: new, 1.83 KB

agence web coheractio

Status: Needs review » Reviewed & tested by the community

Works perfectly with the patch.
Many thanks for this great module.

Just one comment: a "." is missing in $url = $base_url . '/' $row['language'] . '/' . $row['dst'];
It should be $url = $base_url . '/' . $row['language'] . '/' . $row['dst'];

Laurent
Agence Web Coheractio

mikeytown2

Status: Reviewed & tested by the community » Fixed

Committed.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

edjay

I am setting this thread back to active because I think this patch can cause errors.

Since I installed the latest dev release, my Apache access logs show the crawler looking for my nodes in the wrong place: it prefixes the path with '/fr' although my site is not multilingual.

My default language is French and it is the only language enabled in the language interface; the prefix 'fr' was set on the "admin/settings/language/edit/fr" page. I removed the prefix to see whether Boost would change the path where it looks for my nodes, but it made no difference.

In the 'url_alias' table, the entries have kept 'fr' in the 'language' column. The patch does not check whether the path must be prefixed; it only checks whether the alias language matches the active language or whether the alias language is empty.

In my case, for example, Boost has to know whether the language stored in the 'url_alias' table must be added as a prefix (no, in this case).
What's more, $row['language'] is used to rewrite the URL. There is no use of the $row['prefix'] variable, which seems to correspond better to the site settings.

Here is what I changed temporarily. I know it is not the right way, but it works with my configuration for now.

if (empty($row['language']) || $language->language != $row['language'] || empty($language->prefix)) {
  $url = $base_url . '/' . $row['dst'];
}
else {
  $url = $base_url . '/' . $language->prefix . '/' . $row['dst'];
}

I'll see if I have time to correct that.
Sorry for my English!

edjay

Status: Closed (fixed) » Active

mikeytown2

Priority: Normal » Major
Issue tags: +1.19 Release Blocker

ressa

This still happens with the latest dev version (6.x-1.x-dev 2010-Dec-21) when you enable "Crawl All URL's in the url_alias table". It prefixes the path with the language (for example '/fr') although the site is not multilingual, and can't find those pages, because they don't exist.

EDIT: I have now added the prefix under "Statically cache specific pages" excluding 'fr/*' -- perhaps a temporary fix?

ressa

Excluding 'fr/*' under "Statically cache specific pages" didn't work; the crawler still visits the 'fr/' URLs...

mladenu

I had a problem where the crawler would not crawl the entire url_alias table (only the built-in English, not my Serbian (sr) language). I solved it by changing this:

function boost_crawler_add_alias_to_table() {
  // Insert batch of html URL's into boost_crawler table
  global $base_url, $language;
  if (!variable_get('boost_crawl_url_alias', FALSE)) {
    return TRUE;
  }
  // ... (rest of the function not shown)
}

I removed "$language" from the global declaration:

function boost_crawler_add_alias_to_table() {
  // Insert batch of html URL's into boost_crawler table
  global $base_url;
  if (!variable_get('boost_crawl_url_alias', FALSE)) {
    return TRUE;
  }
  // ... (rest of the function not shown)
}

Everything works fine now and the crawler does its job...

mikeytown2

Looks like I can't use the high-performance logic together with i18n... I need to call url() instead of trying to glue the URL together in SQL.

Functions that should help: language_list('enabled'), language_default()
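
To make the idea concrete, here is a minimal sketch of choosing the prefix per alias from the site's language settings instead of from the alias row alone. It runs without Drupal: `boost_sketch_alias_url()` is a hypothetical helper, `$prefixes` stands in for the prefix of each enabled language (the kind of data language_list('enabled') exposes), and `$default_langcode` stands in for language_default()->language. It also assumes the default language is served without a prefix, as in the original report; sites configured differently would need another rule.

```php
<?php
/**
 * Build a crawl URL for one url_alias row. Hypothetical helper, no Drupal
 * dependency.
 *
 * @param string $base_url         Site base URL without a trailing slash.
 * @param array  $row              Alias row with 'dst' and 'language' keys.
 * @param array  $prefixes         Map of language code => path prefix for
 *                                 enabled languages (assumed input).
 * @param string $default_langcode Code of the site default language.
 */
function boost_sketch_alias_url($base_url, array $row, array $prefixes, $default_langcode) {
  $langcode = isset($row['language']) ? $row['language'] : '';
  $prefix = '';
  // Prefix only aliases of a non-default language that has a prefix
  // configured. Language-neutral aliases and the default language get none.
  if ($langcode !== '' && $langcode !== $default_langcode && !empty($prefixes[$langcode])) {
    $prefix = $prefixes[$langcode] . '/';
  }
  return $base_url . '/' . $prefix . $row['dst'];
}

// Example data modelled on the original report: French default, English at /en.
$prefixes = array('fr' => 'fr', 'en' => 'en');
$rows = array(
  array('dst' => 'accueil', 'language' => 'fr'),
  array('dst' => 'home', 'language' => 'en'),
  array('dst' => 'contact', 'language' => ''),
);
foreach ($rows as $row) {
  echo boost_sketch_alias_url('http://www.example.com', $row, $prefixes, 'fr'), "\n";
}
```

Inside Drupal 6 the same result would presumably come from passing a language object to url() through its 'language' option together with 'absolute', which is what the comment above suggests; that call is not exercised here.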

bohz

Same problem as #7 here.
The fix worked for me too.
Thanks a lot!

hedac

I'm on the latest dev and I still have problems with this too...
The crawler goes to /en/alias and gets a 404 error. The alias has its language set to English, but the default site language is English, so there should be no /en/ in the URL...
The alias entries without a language assigned, or in other languages, work OK.

hedac

OK, I have it working now... I changed the code from #7 to:

if (empty($row['language']) || ($language->language == $row['language'] && empty($language->prefix))) {
  $url = $base_url . '/' . $row['dst'];
}
else {
  $url = $base_url . '/' . $row['language'] . '/' . $row['dst'];
}
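
To see how this condition differs from the earlier temporary change in this thread, the two can be wrapped in functions and run against the same rows. This is a standalone sketch: the function names are hypothetical, and `$language` is a plain object standing in for Drupal's global `$language` (here a French site with no prefix configured).

```php
<?php
// Earlier variant: prefix only when the alias language matches the current
// language and that language has a prefix; the prefix value is used.
function sketch_url_variant_a($base_url, $language, array $row) {
  if (empty($row['language']) || $language->language != $row['language'] || empty($language->prefix)) {
    return $base_url . '/' . $row['dst'];
  }
  return $base_url . '/' . $language->prefix . '/' . $row['dst'];
}

// Variant from this comment: skip the prefix for language-neutral aliases
// and for the current language when it has no prefix; otherwise use the
// alias language code as the prefix.
function sketch_url_variant_b($base_url, $language, array $row) {
  if (empty($row['language']) || ($language->language == $row['language'] && empty($language->prefix))) {
    return $base_url . '/' . $row['dst'];
  }
  return $base_url . '/' . $row['language'] . '/' . $row['dst'];
}

// Stand-in for the global $language object (assumed values).
$language = (object) array('language' => 'fr', 'prefix' => '');

$rows = array(
  array('dst' => 'accueil', 'language' => 'fr'),
  array('dst' => 'home', 'language' => 'en'),
  array('dst' => 'contact', 'language' => ''),
);
foreach ($rows as $row) {
  echo sketch_url_variant_a('http://www.example.com', $language, $row), '  |  ';
  echo sketch_url_variant_b('http://www.example.com', $language, $row), "\n";
}
```

With these inputs the two variants agree on the French and language-neutral rows and differ only on the English row: the earlier variant leaves it unprefixed, while this one emits /en/home. Note that the second variant uses the language code as the prefix, so it assumes the configured prefix equals the language code.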

bgm

Status: Active » Fixed

Thanks for reviving this issue. I have reviewed and committed to 6.x-1.x the patch in #16/#7.

Status: Fixed » Closed (fixed)
Issue tags: -1.19 Release Blocker

Automatically closed -- issue fixed for 2 weeks with no activity.