Porter-stemmer should only stem English or language-neutral content on a multi-language site

sprior - January 23, 2009 - 20:42
Project: Porter-Stemmer
Version: 6.x-2.x-dev
Component: Code
Category: bug report
Priority: normal
Assigned: Unassigned
Status: postponed
Description

From a quick look over the code, it seems that if the module is installed on a multi-language site, it will try to stem the non-English content as well as the desired content. Could a check be put in place so that it only attempts to stem content marked as en, en-us, or neutral if locale is enabled?

#1

greggles - January 25, 2009 - 22:28

Yeah, this makes sense to me. It's hard for me to write code for this b/c I don't have any multilingual sites, but I'd be happy to apply any patches someone else writes/tests.

#2

sprior - January 26, 2009 - 19:35

I'm not up to writing any patches for this yet - only up to chap 5 in your book :-)

#3

andrewsuth - February 27, 2009 - 15:57

For those who need a quick hack, I think this should work:


function porterstemmer_search_preprocess(&$text) {
  global $language;
  if (substr($language->language, 0, 2) == "en") { // current user language is English?

    // Split words from noise and remove apostrophes
    $words = preg_split('/([^a-zA-Z]+)/', str_replace("'", '', $text), -1, PREG_SPLIT_DELIM_CAPTURE);

    // Process each word
    $odd = true;
    foreach ($words as $k => $word) {
      if ($odd) {
        $words[$k] = Stem($word);
      }
      $odd = !$odd;
    }

    // Put it all back together
    return implode('', $words);
  }
}

This same idea could be applied to other porter stemmer modules to specify which language to apply the stemming to, based on the current language of the user. Just change "en" to "de" for German, "es" for Spanish, etc.

Andrew

#4

greggles - February 27, 2009 - 16:53
Status: active » needs review

This makes sense to me. Patch attached - does this work for folks?

Attachment                   Size
363336_english_only.patch    1.33 KB

#5

andrewsuth - March 4, 2009 - 10:39

I reviewed my own code and tested it extensively. I seem to have put a } in the wrong place and forgot to return a value if the current user display language was not English.

The following code works with the English porter-stemmer as well as when installed with other language stemmers - as long as you make the same alterations to the code of the other stemmers.

function porterstemmer_search_preprocess(&$text) {
  global $language;
  // Only stem if the current user display language is "en" (English)
  if (substr($language->language, 0, 2) == "en") {

    // Split words from noise and remove apostrophes
    $words = preg_split('/([^a-zA-Z]+)/', str_replace("'", '', $text), -1, PREG_SPLIT_DELIM_CAPTURE);

    // Process each word
    $odd = true;
    foreach ($words as $k => $word) {
      if ($odd) {
        $words[$k] = Stem($word);
      }
      $odd = !$odd;
    }
    // Put it all back together
    return implode('', $words); // return value if the current user display language is English
  }
  return $text; // return the input text unchanged if the current user display language is not English
}
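
For anyone wondering why the $odd toggle works: with PREG_SPLIT_DELIM_CAPTURE, preg_split() returns the words and the captured separators alternately, so the 1st, 3rd, 5th... elements are words (possibly empty strings) and the ones in between are the "noise". Below is a minimal standalone sketch of just that step. Note that example_toy_stem() and example_stem_words() are hypothetical names made up for illustration, and the toy stemmer is only a stand-in for the module's Stem(), not the Porter algorithm.

```php
<?php
// Hypothetical stand-in for the module's Stem(); NOT the Porter algorithm.
// It only strips a trailing "ing", "ed", or "s" so the example runs on its own.
function example_toy_stem($word) {
  return preg_replace('/(ing|ed|s)$/', '', $word);
}

// Splits text into alternating words and separators and stems only the words.
function example_stem_words($text) {
  // Remove apostrophes, then split on runs of non-letters, keeping the separators.
  $parts = preg_split('/([^a-zA-Z]+)/', str_replace("'", '', $text), -1, PREG_SPLIT_DELIM_CAPTURE);
  foreach ($parts as $k => $part) {
    // Even indexes hold words (possibly empty); odd indexes hold captured separators.
    if ($k % 2 == 0) {
      $parts[$k] = example_toy_stem($part);
    }
  }
  // Put it all back together with the separators untouched.
  return implode('', $parts);
}
```

Because the separators are captured rather than discarded, the text comes back with its original spacing and punctuation; only the word pieces change.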

#6

greggles - March 4, 2009 - 12:36
Status: needs review » needs work

Pasting it into the body like that makes it really hard to review/test code. Can you provide it as a patch? See http://drupal.org/patch/create

#7

andrewsuth - March 14, 2009 - 17:33
Status: needs work » needs review

Updated patch attached - review needed

Andrew

Attachment            Size
english_only.patch    944 bytes

#8

jhodgdon - July 1, 2009 - 17:18

This idea doesn't seem quite right to me.

It might work for preprocessing the key words when someone is searching, but the search_preprocess function is also called during search *indexing*, which is done in a cron job. So there is no user display language at this time, or if there is, it won't necessarily correspond to the language of the text.

What you would want to do is only apply Porter Stemmer to text that is marked as English, but since all the stemming module gets as input is the text itself, this is not possible to know. A major change in how search indexing is done for multilingual sites would be necessary to fix that -- i.e. a change in Drupal core.
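
To make the idea concrete, here is a rough sketch of what a language-aware preprocess step could look like if the language of the text were passed in alongside the text. Everything here is an assumption for illustration: the function names are hypothetical, the extra $langcode argument is not part of the current hook signature, and treating an empty string as "language neutral" is an assumption about how neutral content is marked.

```php
<?php
// Hypothetical helper: should text in this language get English stemming?
// Assumes '' (or NULL) marks language-neutral content.
function example_should_stem_english($langcode) {
  if ($langcode === NULL || $langcode === '') {
    return TRUE;
  }
  $langcode = strtolower($langcode);
  // Accept "en" itself and regional variants such as "en-us" or "en-gb".
  return $langcode === 'en' || strpos($langcode, 'en-') === 0;
}

// Hypothetical preprocess step that receives the language of the text itself,
// rather than reading the current user's display language.
// $stemmer is any callable that takes a string and returns the stemmed string.
function example_language_aware_preprocess($text, $langcode, $stemmer) {
  if (!example_should_stem_english($langcode)) {
    // Leave non-English text untouched.
    return $text;
  }
  return call_user_func($stemmer, $text);
}
```

The point of the sketch is that the decision depends on the language of the content being indexed or searched, which is exactly the piece of information the stemming module does not receive today.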

#9

jhodgdon - July 1, 2009 - 17:18

#10

jhodgdon - July 6, 2009 - 16:08

I have closed #262873: Enable different behavior of search indexer per different node languages as a duplicate.

In order for this to be fixed, there would need to be some changes in Drupal core functionality around Search. So I have filed a new issue to address that. Since it would require some additional arguments to core API functions, I think it is unlikely to be addressed in Drupal 6 (we'll probably have to wait for Drupal 7). Anyway, here is the issue:
#511594: hook_search_preprocess needs to be language-specific

#11

jhodgdon - August 4, 2009 - 00:44

If you think this issue is important, please visit #511594: hook_search_preprocess needs to be language-specific and leave a comment explaining that it is important. Otherwise, this change may not make it into Drupal 7 either -- the code freeze is coming up on September 1st.

#12

jhodgdon - August 4, 2009 - 15:34
Status: needs review » postponed

I am marking this issue Postponed until there is some action on the core Search issue. We can't really fix Porter Stemmer to be language-specific until the core Search module's hooks are language-specific.

#13

jhodgdon - September 10, 2009 - 15:34
Version: 6.x-1.x-dev » 6.x-2.x-dev

Just a note on this issue's status: No one reviewed the proposed core fix in #511594: hook_search_preprocess needs to be language-specific. So this did not get into Drupal 7 before the code freeze, and I see no way of addressing this within Porter Stemmer without the core code fix. So, this issue will probably have to be postponed at least until Drupal 8. Sigh. There was a distinct lack of interest from the Drupal community towards addressing this issue...

#14

boran - November 9, 2009 - 22:00

This is a huge disappointment; I was hoping that D7, at least, would be more multi-lang oriented.
Having started on a multi-lang forum a month ago, I have run into many issues.

Following the discussion in #511594 today (Nov. 9), I see an awareness of the issue but a lack of resources. Time has run out for D7, which is a real pity, never mind even hacking D6.

#15

jhodgdon - November 9, 2009 - 22:15

I have put the patch up there. The problem has been that no one spoke up to say the issue was important, and no one took the time to test/review the patch.

Please put a comment there on that other issue. It's possible that the patch can still be added to Drupal 7 if someone decides it is important enough.

 
 
