The _autotag_search_field function implements the following algorithm:
For each word in the field being parsed:
1) Check whether the word, combined with the word after it, looks like any term or synonym in the selected vocabularies.
2) If it does, get the term id of that term.
3) Using the term id, retrieve the term name and all the synonyms for that term, and see whether any of them appears anywhere in the field.
4) If one does, add the term to the list of terms with which to tag the node.
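As a rough illustration of steps 1 and 2, here is a minimal sketch that builds the word pairs and looks them up in an in-memory array. The $vocabulary array and the sketch_* function names are hypothetical stand-ins for the module's database tables and queries, and the matching is ASCII-only for brevity; this is not the module's actual code.

```php
<?php
// Hypothetical in-memory stand-in for the vocabulary tables.
$vocabulary = array(
  7 => array('name' => 'BUB', 'synonyms' => array('Big Ugly Baby')),
  8 => array('name' => 'Red Panda', 'synonyms' => array()),
);

// Input for step 1: every word joined with the word after it.
function sketch_word_pairs($text) {
  $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
  $pairs = array();
  for ($i = 0; $i + 1 < count($words); $i++) {
    $pairs[] = $words[$i] . ' ' . $words[$i + 1];
  }
  return $pairs;
}

// Steps 1 and 2: ids of terms whose name or synonym contains one of the pairs.
function sketch_candidate_tids(array $pairs, array $vocabulary) {
  $tids = array();
  foreach ($vocabulary as $tid => $term) {
    $names = array_merge(array($term['name']), $term['synonyms']);
    foreach ($names as $name) {
      foreach ($pairs as $pair) {
        if (strpos(strtolower($name), $pair) !== FALSE) {
          $tids[$tid] = TRUE;
        }
      }
    }
  }
  return array_keys($tids);
}

$pairs = sketch_word_pairs('A red panda sat down');
$candidates = sketch_candidate_tids($pairs, $vocabulary);
```

Here $candidates ends up holding only the id of the hypothetical "Red Panda" term, because "red panda" is the only pair that occurs in a term name or synonym.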
This logic has an error, which can be illustrated by the following example.
Imagine a node, with the following body text: "Bubba is one big ugly dude!"
Then imagine the vocabulary being used for autotagging contains the following term, which has one synonym:
Term: BUB
Synonyms: Big Ugly Baby
As written, the module takes the two-word phrase "Big Ugly" and searches the vocabulary for any term or synonym that contains it. Since the synonym "Big Ugly Baby" contains "Big Ugly", the module retrieves the full term name and all of its synonyms: here, "BUB" and "Big Ugly Baby". It then checks whether each of these appears in the field. "Big Ugly Baby" does not match any part of the body text, but "BUB" coincides (case-insensitively) with the first three letters of "Bubba". The module therefore tags the node with the term "BUB", which is a mistake.
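The difference between checking every name of the term and checking only the name that matched the word pair can be sketched as follows. The sketch_names_in_text helper and the variable names are hypothetical, and the substring test only stands in for the check the module performs against the field.

```php
<?php
// Hypothetical helper: returns the lower-cased names that occur anywhere
// in the text, using a plain substring test.
function sketch_names_in_text($text, array $names) {
  $text = strtolower($text);
  $found = array();
  foreach ($names as $name) {
    $name = strtolower($name);
    if (strpos($text, $name) !== FALSE) {
      $found[] = $name;
    }
  }
  return $found;
}

$body = 'Bubba is one big ugly dude!';

// Every name of the term that the pair "big ugly" led to.
$all_names = array('BUB', 'Big Ugly Baby');
// Only the name that actually contained the pair.
$pair_matched_names = array('Big Ugly Baby');

$current = sketch_names_in_text($body, $all_names);
$proposed = sketch_names_in_text($body, $pair_matched_names);
```

With the example body, $current contains "bub" (found inside "bubba"), so the node would be tagged, while $proposed is empty, so it would not.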
The following is a rewritten version of the function, which also contains proposed fixes for some other problems (proper UTF-8 handling, and stripping HTML tags from the text before checking for terms):
function _autotag_search_field($field, $vids){
$terms=array();
if(is_array($field) || is_object($field)){
// Field is an array (and likely to be an array of arrays). Lets
// recurse into it and check away
foreach($field as $field_part=>$value){
$terms = array_merge($terms,_autotag_search_field($value, $vids));
}
} else {
// Only search if there is text
if(trim($field)==''){return array();}
// Field is raw text search it for stuff
/**
* Discovered that the following only seems to work in PHP 5.2, FARP!
// Thanks to http://stackoverflow.com/questions/790596/split-a-text-into-single-words
$words_including_small = preg_split('/[\p{P}\s\{\}\[\]\(\)]/', strtolower($field), -1, PREG_SPLIT_NO_EMPTY);
*/
//YAQ - Strip html and php tags before searching
$field = strip_tags($field);
//$words_including_small = preg_split('/[\ `!"£$%^&*()_\-+={\[}\]:;@\'~#<,>.?\/|\\\]/', strtolower($field), -1, PREG_SPLIT_NO_EMPTY);
mb_regex_encoding( "utf-8" );
$words_including_small = mb_split('\W', mb_strtolower($field,'UTF-8'));
// lets remove the shitty small words.
$words = array();
$words_placeholder = array();
foreach($words_including_small as $key => $word){
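//YAQ Mod: change minimum size from 4 to 3
// mb_strlen counts characters; strlen counts bytes and would let two-letter multibyte words through.
if(mb_strlen(trim($word),'UTF-8')>2){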
$words[$key] = $word;
$words_placeholder[] = "'%s'";
}
}
// Are we tagging just leaves?
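$tag_only_leaves = variable_get('autotag_only_leaves', FALSE);
// Default to an empty string so the queries below never interpolate an undefined variable.
$tag_only_leaves_sql = '';
if($tag_only_leaves){
$tag_only_leaves_sql = ' AND t.tid NOT IN (SELECT parent FROM {term_hierarchy}) ';
}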
// Because I'm sending words as an array, I've also got to do vids in the same way
$vids_placeholder = array();
foreach($vids as $doesntmatter){
$vids_placeholder[] = "vid = %d";
}
if(count($words_placeholder) && count($vids)){
// To make the following SQL command easier to read, it has been spaced!
$yaq_query = "
SELECT
t.tid, t.vid
FROM
{term_lowername} l,
{term_data} t
WHERE
t.tid = l.tid AND
lowername IN (".implode(",",$words_placeholder).") AND
(".implode(" OR ",$vids_placeholder).") $tag_only_leaves_sql
UNION
SELECT
t.tid, t.vid
FROM
{term_data} t,
{term_synonym} s
WHERE
s.tid = t.tid AND
LOWER(s.name) IN (".implode(",",$words_placeholder).") AND
(".implode(" OR ",$vids_placeholder).") $tag_only_leaves_sql";
$yaq_query_params = array_merge($words,$vids,$words,$vids);
$results = db_query($yaq_query,$yaq_query_params);
while($row=db_fetch_array($results)){
$terms[] = array('tid'=>$row['tid'], 'vid'=>$row['vid']);
}
$total_words = count($words);
$sql_array = array();
$sql_array_syn = array();
$words_array = array();
$words_array_syn = array();
foreach($words as $key => $word){
if(isset($words_including_small[$key+1])){
// Now lets search for "pairs" and search on "like". If any hit, check back on the result against
// the original field
$sql_array[] = " l.lowername LIKE '%s%%%s%%' ";
$words_array[] = $word;
$words_array[] = $words_including_small[$key+1];
$sql_array_syn[] = " LOWER(s.name) LIKE '%s%%%s%%' ";
$words_array_syn[] = $word;
$words_array_syn[] = $words_including_small[$key+1];
}
}
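// Both halves of the UNION alias {term_data} as t, so the leaf filter can
// refer to t.tid without an ambiguous-column error.
$tag_only_leaves_sql = $tag_only_leaves ? ' AND t.tid NOT IN (SELECT parent FROM {term_hierarchy}) ' : '';
if(count($sql_array)){
// Restrict the pair search to the selected vocabularies, as the single-word query above does.
$results = db_query("
SELECT
t.vid, l.tid, l.lowername AS name
FROM
{term_lowername} l, {term_data} t
WHERE
l.tid = t.tid AND
(".implode(" OR ",$sql_array).") AND
(".implode(" OR ",$vids_placeholder).") $tag_only_leaves_sql
UNION
SELECT
t.vid, s.tid, LOWER(s.name) AS name
FROM
{term_synonym} s, {term_data} t
WHERE
s.tid = t.tid AND
(".implode(" OR ",$sql_array_syn).") AND
(".implode(" OR ",$vids_placeholder).") $tag_only_leaves_sql",
array_merge($words_array, $vids, $words_array_syn, $vids));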
while($row=db_fetch_array($results)){
if(strpos(mb_strtolower($field,'UTF-8'),$row['name'])!== FALSE){
$terms[] = array('tid'=>$row['tid'],'vid'=>$row['vid']);
}
}
}
}
}
return $terms;
}
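The two side fixes mentioned above (stripping tags and UTF-8-aware handling) can be sketched in isolation like this. The sketch_tokenize helper is hypothetical, and it swaps in preg_split with the /u modifier instead of the mb_split call used in the function above, so it assumes a PHP build with the mbstring extension and Unicode-enabled PCRE.

```php
<?php
/**
 * Hypothetical helper: strip markup, lower-case the text as UTF-8, split on
 * anything that is not a letter or digit, and keep words of at least
 * $min_chars characters (characters, not bytes).
 */
function sketch_tokenize($field, $min_chars = 3) {
  $text = strip_tags($field);
  $text = mb_strtolower($text, 'UTF-8');
  $parts = preg_split('/[^\p{L}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
  $words = array();
  foreach ($parts as $word) {
    if (mb_strlen($word, 'UTF-8') >= $min_chars) {
      $words[] = $word;
    }
  }
  return $words;
}

$words = sketch_tokenize('<p>Élan vital, <em>déjà vu</em> in São Paulo</p>');

// Why the length check matters: "né" is two characters but three bytes.
$byte_length = strlen('né');
$char_length = mb_strlen('né', 'UTF-8');
```

In this sketch "são" is kept (three characters) while "vu" and "in" are dropped; a byte-based strlen check with the same threshold would also have kept a two-character word such as "né".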
Comments
Comment #1
sja1 commented: Sorry, there's an SQL syntax error in the code I posted. Here's a new version of the function, with some additional fixes:
Comment #2
sdrycroft commented: Thanks for this, Steve. Are you able to supply a patch file for this? I'd definitely like to get this fix into v2 of the module.
Simon
Comment #3
rkodrupal commented: The code in #1 causes autotag 1.27 to ignore the 'disable autotag' field for all vocabularies. I'm guessing this code was also included in autotag 2.0, which now also indiscriminately tags all vocabularies ...
Comment #4
sdrycroft commented: I believe this is no longer an issue in 7.x, as the use of synonyms is no longer supported. Please re-open this issue if the problem persists.