In the _autotag_search_field function there is an algorithm which does the following:

For each word in the field being parsed:
1) Check whether the word, combined with the word after it, looks like any term or synonym in the selected vocabularies.
2) If it does, get the term ID of that term.
3) Using the term ID, retrieve the term name and all synonyms for that term, and see whether any of them appears anywhere in the field.
4) If one does, add the term to the list of terms with which to tag the node.

This logic has an error, which can be illustrated by the following example.

Imagine a node, with the following body text: "Bubba is one big ugly dude!"

Then imagine the vocabulary being used for autotagging contains the following term, which has one synonym:
Term: BUB
Synonyms: Big Ugly Baby

As written, the module takes the two-word phrase "Big Ugly" and searches the vocabulary to see whether any term or synonym contains it. Since the synonym "Big Ugly Baby" contains "Big Ugly", the module retrieves the full term name and all its synonyms: in this example, "BUB" and "Big Ugly Baby". It then checks whether each of these appears in the field. "Big Ugly Baby" does not match any part of the body text, but "BUB" coincides with the first three letters of "Bubba" (the comparison is case-insensitive). The module therefore tags the node with the term "BUB", which is a mistake.
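
To make the example concrete, here is a minimal sketch that reproduces the false positive without a database. The in-memory `$vocabulary` array and both function names are made up for illustration and are not part of the module. The second function shows one possible stricter check: only the name that actually matched the word pair is verified against the text.

```php
<?php
// Hypothetical in-memory stand-in for the vocabulary; the real module reads
// term names and synonyms from the database.
$vocabulary = array(
  42 => array('name' => 'BUB', 'synonyms' => array('Big Ugly Baby')),
);

// Flawed check, as described above: once a word pair hits any name of a term,
// every name of that term is tested as a raw substring of the text.
function autotag_sketch_flawed_match($text, $pair, $vocabulary) {
  $tids = array();
  $text = strtolower($text);
  $pair = strtolower($pair);
  foreach ($vocabulary as $tid => $term) {
    $names = array_map('strtolower', array_merge(array($term['name']), $term['synonyms']));
    $pair_hit = FALSE;
    foreach ($names as $name) {
      if (strpos($name, $pair) !== FALSE) {
        $pair_hit = TRUE;
      }
    }
    if (!$pair_hit) {
      continue;
    }
    foreach ($names as $name) {
      if (strpos($text, $name) !== FALSE) {
        $tids[] = $tid;
        break;
      }
    }
  }
  return $tids;
}

// Stricter check: a name counts only if it both contains the word pair and
// appears in the text itself.
function autotag_sketch_strict_match($text, $pair, $vocabulary) {
  $tids = array();
  $text = strtolower($text);
  $pair = strtolower($pair);
  foreach ($vocabulary as $tid => $term) {
    $names = array_map('strtolower', array_merge(array($term['name']), $term['synonyms']));
    foreach ($names as $name) {
      if (strpos($name, $pair) !== FALSE && strpos($text, $name) !== FALSE) {
        $tids[] = $tid;
        break;
      }
    }
  }
  return $tids;
}
```

With the body text from the example, the flawed version returns term 42 and the stricter version returns nothing. The stricter version still tags the node when the text really contains "big ugly baby".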

The following is a rewritten version of this function, which also contains proposed fixes for some other problems (proper UTF-8 handling, and stripping HTML tags out of the text before checking for terms):

function _autotag_search_field($field, $vids){
  $terms=array();
  if(is_array($field) || is_object($field)){
    // Field is an array (and likely to be an array of arrays).  Lets
    // recurse into it and check away
    foreach($field as $field_part=>$value){
      $terms = array_merge($terms,_autotag_search_field($value, $vids));
    }
  } else {
    // Only search if there is text
    if(trim($field)==''){return array();}
    // Field is raw text search it for stuff
    /**
     * Discovered that the following only seems to work in PHP 5.2, FARP!
    // Thanks to http://stackoverflow.com/questions/790596/split-a-text-into-single-words
    $words_including_small = preg_split('/[\p{P}\s\{\}\[\]\(\)]/', strtolower($field), -1, PREG_SPLIT_NO_EMPTY);
     */
    //YAQ - Strip html and php tags before searching
	$field = strip_tags($field);
    //$words_including_small = preg_split('/[\ `!"£$%^&*()_\-+={\[}\]:;@\'~#<,>.?\/|\\\]/', strtolower($field), -1, PREG_SPLIT_NO_EMPTY);
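    mb_regex_encoding( "utf-8" );
    // Split on runs of non-word characters so that "big, ugly" does not
    // produce an empty "word" between the comma and the space.
    $words_including_small = mb_split('\W+', mb_strtolower($field,'UTF-8'));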
    // lets remove the shitty small words.
    $words = array();
    $words_placeholder = array();
    foreach($words_including_small as $key => $word){
      //YAQ Mod: change minimum size from 4 to 3
      // Count characters rather than bytes so accented words are measured correctly.
      if(mb_strlen(trim($word),'UTF-8')>2){
        $words[$key] = $word;
        $words_placeholder[] = "'%s'";
      }
    }
    // Are we tagging just leaves?
    $tag_only_leaves = variable_get('autotag_only_leaves', FALSE);
    // Default to an empty clause so the variable is always defined.
    $tag_only_leaves_sql = '';
    if($tag_only_leaves){
      $tag_only_leaves_sql = ' AND t.tid NOT IN (SELECT parent FROM {term_hierarchy}) ';
    }
    
    // Because I'm sending words as an array, I've also got to do vids in the same way
    $vids_placeholder = array();
    foreach($vids as $doesntmatter){
      $vids_placeholder[] = "vid = %d";
    }
    if(count($words_placeholder) && count($vids)){
	  // To make the following SQL command easier to read, it has been spaced!
	  $yaq_query = "
          SELECT 
            t.tid, t.vid 
          FROM 
            {term_lowername} l,
            {term_data} t 
          WHERE 
            t.tid = l.tid AND 
            lowername IN (".implode(",",$words_placeholder).") AND 
            (".implode(" OR ",$vids_placeholder).") $tag_only_leaves_sql 
          UNION 
          SELECT 
            t.tid, t.vid 
          FROM 
            {term_data} t, 
            {term_synonym} s 
          WHERE 
            s.tid = t.tid AND 
            LOWER(s.name) IN (".implode(",",$words_placeholder).") AND 
            (".implode(" OR ",$vids_placeholder).") $tag_only_leaves_sql";
	  $yaq_query_params = array_merge($words,$vids,$words,$vids);
      $results = db_query($yaq_query,$yaq_query_params);
        
      while($row=db_fetch_array($results)){
        $terms[] = array('tid'=>$row['tid'], 'vid'=>$row['vid']);
      }
      $total_words = count($words);
      $sql_array = array();
      $sql_array_syn = array();
      $words_array = array();
      $words_array_syn = array();
      foreach($words as $key => $word){
        if(isset($words_including_small[$key+1])){
          // Now lets search for "pairs" and search on "like".  If any hit, check back on the result against
          // the original field
          $sql_array[] = " l.lowername LIKE '%s%%%s%%' ";
          $words_array[] = $word;
          $words_array[] = $words_including_small[$key+1];
          $sql_array_syn[] = " LOWER(s.name) LIKE '%s%%%s%%' ";
          $words_array_syn[] = $word;
          $words_array_syn[] = $words_including_small[$key+1];
        }
      }
      if($tag_only_leaves){
        $tag_only_leaves_sql = ' AND tid NOT IN (SELECT parent FROM {term_hierarchy}) ';
      }
      if(count($sql_array)){
        $results = db_query("
            SELECT 
              tda.vid, l.tid, l.lowername AS name
            FROM 
              {term_lowername} l, {term_data} tda
            WHERE 
			   l.tid = tda.tid AND 
              (".implode(" OR ",$sql_array).") $tag_only_leaves_sql 
            UNION
            SELECT 
              tdb.vid s.tid, LOWER(s.name) AS name
            FROM 
              {term_synonym} s, {term_data} tdb
            WHERE 
			   s.tid = tdb.tid AND
              (".implode(" OR ",$sql_array_syn).") $tag_only_leaves_sql",
          array_merge($words_array, $words_array_syn));
        while($row=db_fetch_array($results)){
            if(strpos(mb_strtolower($field,'UTF-8'),$row['name'])!== FALSE){
              $terms[] = array('tid'=>$row['tid'],'vid'=>$row['vid']);
            }
        }
      }
    }
  }
  return $terms;
}
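
The tag stripping and UTF-8 word splitting can also be tried on their own. The sketch below is a standalone illustration, not the module's code: the function name is hypothetical, it assumes the mbstring extension and a PCRE build with Unicode support, and it uses `preg_split()` with the `/u` modifier instead of `mb_split()`. It also reindexes the result, whereas the function above keeps the original keys so that it can look up the following word.

```php
<?php
// Hypothetical helper, not part of the module.
function autotag_sketch_tokenize($text, $min_length = 3) {
  // Put a space before every tag so that "<p>a</p><p>b</p>" does not fuse
  // into "ab" once the tags are stripped.
  $plain = strip_tags(str_replace('<', ' <', $text));
  $plain = mb_strtolower($plain, 'UTF-8');
  // Split on runs of anything that is not a Unicode letter or digit.
  $words = preg_split('/[^\p{L}\p{N}]+/u', $plain, -1, PREG_SPLIT_NO_EMPTY);
  $kept = array();
  foreach ($words as $word) {
    // mb_strlen() counts characters; strlen() would count bytes, so a
    // two-letter word such as "él" (3 bytes) would slip through.
    if (mb_strlen($word, 'UTF-8') >= $min_length) {
      $kept[] = $word;
    }
  }
  return $kept;
}
```

For the example body text wrapped in some markup, this keeps "bubba", "one", "big", "ugly" and "dude" and drops "is".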

Comments

sja1 commented:

Sorry, there's an SQL syntax error in the code I posted (a missing comma in the second SELECT of the word-pair query). Here's a new version of the function, with some additional fixes:

function _autotag_search_field($field, $vids){
  $terms=array();
  if(is_array($field) || is_object($field)){
    // Field is an array (and likely to be an array of arrays).  Lets
    // recurse into it and check away
    foreach($field as $field_part=>$value){
      $terms = array_merge($terms,_autotag_search_field($value, $vids));
    }
  } else {
    // Only search if there is text
    if(trim($field)==''){return array();}
    // Field is raw text search it for stuff
    /**
     * Discovered that the following only seems to work in PHP 5.2, FARP!
    // Thanks to http://stackoverflow.com/questions/790596/split-a-text-into-single-words
    $words_including_small = preg_split('/[\p{P}\s\{\}\[\]\(\)]/', strtolower($field), -1, PREG_SPLIT_NO_EMPTY);
     */
    //YAQ - Strip html and php tags before searching
	$field = strip_tags($field);
    //$words_including_small = preg_split('/[\ `!"£$%^&*()_\-+={\[}\]:;@\'~#<,>.?\/|\\\]/', strtolower($field), -1, PREG_SPLIT_NO_EMPTY);
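    mb_regex_encoding( "utf-8" );
    // Split on runs of non-word characters so that "big, ugly" does not
    // produce an empty "word" between the comma and the space.
    $words_including_small = mb_split('\W+', mb_strtolower($field,'UTF-8'));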
    // lets remove the shitty small words.
    $words = array();
    $words_placeholder = array();
    foreach($words_including_small as $key => $word){
      //YAQ Mod: change minimum size from 4 to 3
      // Count characters rather than bytes so accented words are measured correctly.
      if(mb_strlen(trim($word),'UTF-8')>2){
        $words[$key] = $word;
        $words_placeholder[] = "'%s'";
      }
    }
    // Are we tagging just leaves?
    $tag_only_leaves = variable_get('autotag_only_leaves', FALSE);
    // Default to an empty clause so the variable is always defined.
    $tag_only_leaves_sql = '';
    if($tag_only_leaves){
      $tag_only_leaves_sql = ' AND t.tid NOT IN (SELECT parent FROM {term_hierarchy}) ';
    }
    
    // Because I'm sending words as an array, I've also got to do vids in the same way
    $vids_placeholder = array();
    foreach($vids as $doesntmatter){
      $vids_placeholder[] = "vid = %d";
    }
    if(count($words_placeholder) && count($vids)){
	  // To make the following SQL command easier to read, it has been spaced!
	  $yaq_query = "
          SELECT 
            t.tid, t.vid 
          FROM 
            {term_lowername} l,
            {term_data} t 
          WHERE 
            t.tid = l.tid AND 
            lowername IN (".implode(",",$words_placeholder).") AND 
            (".implode(" OR ",$vids_placeholder).") $tag_only_leaves_sql 
          UNION 
          SELECT 
            t.tid, t.vid 
          FROM 
            {term_data} t, 
            {term_synonym} s 
          WHERE 
            s.tid = t.tid AND 
            LOWER(s.name) IN (".implode(",",$words_placeholder).") AND 
            (".implode(" OR ",$vids_placeholder).") $tag_only_leaves_sql";
	  $yaq_query_params = array_merge($words,$vids,$words,$vids);
      $results = db_query($yaq_query,$yaq_query_params);
        
      while($row=db_fetch_array($results)){
        $terms[] = array('tid'=>$row['tid'], 'vid'=>$row['vid']);
      }
      $total_words = count($words);
      $sql_array = array();
      $sql_array_syn = array();
      $words_array = array();
      $words_array_syn = array();
      foreach($words as $key => $word){
        if(isset($words_including_small[$key+1])){
          // Now lets search for "pairs" and search on "like".  If any hit, check back on the result against
          // the original field
          $sql_array[] = " l.lowername LIKE '%s%%%s%%' ";
          $words_array[] = $word;
          $words_array[] = $words_including_small[$key+1];
          $sql_array_syn[] = " LOWER(s.name) LIKE '%s%%%s%%' ";
          $words_array_syn[] = $word;
          $words_array_syn[] = $words_including_small[$key+1];
        }
      }
      // Default to empty clauses so both variables are always defined.
      $tag_only_leaves_sql_l = '';
      $tag_only_leaves_sql_s = '';
      if($tag_only_leaves){
        $tag_only_leaves_sql_l = ' AND l.tid NOT IN (SELECT parent FROM {term_hierarchy}) ';
        $tag_only_leaves_sql_s = ' AND s.tid NOT IN (SELECT parent FROM {term_hierarchy}) ';
      }
      // Note: the tsid column is needed because synonyms which differ only in
      // that one has an accent (gran lío) and the other does not (gran lio) can
      // be seen as identical by MySQL, depending on the collation in use. UNION
      // eliminates duplicates, so one of the two synonyms would never be
      // returned by the query. Adding the tsid column makes every synonym row
      // unique, so no row is dropped as a duplicate.
      if(count($sql_array)){
	    $yaq_query = "
            SELECT 
              tda.vid, l.tid, l.lowername AS name, 'xxx' as tsid
            FROM 
              {term_lowername} l, {term_data} tda
            WHERE 
			   l.tid = tda.tid AND 
              (".implode(" OR ",$sql_array).") $tag_only_leaves_sql_l 
            UNION
            SELECT 
              tdb.vid, s.tid, LOWER(s.name) AS name, s.tsid
            FROM 
              {term_synonym} s, {term_data} tdb
            WHERE 
			   s.tid = tdb.tid AND
              (".implode(" OR ",$sql_array_syn).") $tag_only_leaves_sql_s";
	    $yaq_query_params = array_merge($words_array, $words_array_syn);
        $results = db_query($yaq_query,$yaq_query_params);
          while($row=db_fetch_array($results)){
		     //Note: strpos won't find matches if the spaces don't match exactly. A regular expression might be better
			 //using mb_ereg_match, as it can handle UTF-8
            if(strpos(mb_strtolower($field,'UTF-8'),$row['name'])!== FALSE){
              $terms[] = array('tid'=>$row['tid'],'vid'=>$row['vid']);
            }
          }
      }
    }
  }
  return $terms;
}
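
The note in the code about strpos and spacing could be addressed along these lines. This is only a sketch under assumptions: it uses `preg_match()` with the `/u` modifier rather than `mb_ereg_match()`, the function name is made up, and it additionally requires the phrase to start and end on a word boundary, which the posted function does not do.

```php
<?php
// Hypothetical helper, not part of the module: checks whether $phrase occurs
// in $text as whole words, allowing any run of whitespace between the words.
function autotag_sketch_phrase_in_text($phrase, $text) {
  $parts = preg_split('/\s+/u', trim($phrase), -1, PREG_SPLIT_NO_EMPTY);
  if (!$parts) {
    return FALSE;
  }
  $quoted = array();
  foreach ($parts as $part) {
    // Escape regex metacharacters that may appear in a term name.
    $quoted[] = preg_quote($part, '/');
  }
  // The lookbehind and lookahead reject matches that start or end in the
  // middle of a word, so "bub" does not match inside "bubba".
  $pattern = '/(?<![\p{L}\p{N}])' . implode('\s+', $quoted) . '(?![\p{L}\p{N}])/iu';
  return preg_match($pattern, $text) === 1;
}
```

With this check, "big ugly baby" is found in text where the words are separated by several spaces or a line break, while "bub" is no longer found in "Bubba is one big ugly dude!".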

sdrycroft commented:

Version: 6.x-1.27 » 6.x-2.0
Priority: Critical » Normal

Thanks for this, Steve. Are you able to supply a patch file for it? I'd definitely like to get this fix into v2 of the module.

Simon

rkodrupal commented:

The code in #1 causes autotag 1.27 to ignore the 'disable autotag' setting for all vocabularies. I'm guessing this code was also included in autotag 2.0, which now also indiscriminately tags from all vocabularies ...
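
One possible cause, judging only from the code posted in #1: the word-pair query there has no condition on $vids, unlike the single-word query before it. A minimal sketch of a clause that could restrict it to the selected vocabularies follows. The helper name is hypothetical, and the `%d` placeholders assume the Drupal 6 `db_query()` placeholder style used in the code above.

```php
<?php
// Hypothetical helper, not part of the module: builds an extra WHERE clause
// limiting a query to the given vocabulary ids, plus the matching arguments.
function autotag_sketch_vid_filter(array $vids, $alias) {
  $vids = array_values(array_map('intval', $vids));
  if (!$vids) {
    // No vocabularies selected: make the query match nothing.
    return array(' AND 1 = 0 ', array());
  }
  $placeholders = implode(',', array_fill(0, count($vids), '%d'));
  return array(" AND $alias.vid IN ($placeholders) ", $vids);
}
```

The clause would be appended to each half of the UNION (with aliases `tda` and `tdb` respectively), and the returned ids would have to be merged into the query arguments in the same order as the placeholders appear in the SQL.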

sdrycroft commented:

Status: Active » Closed (won't fix)

I believe this is no longer an issue in 7.x, as the use of synonyms is no longer supported. Please re-open this issue if the problem still occurs.