
Add option for Porter-stemmer to only stem English or language neutral content for a multi-language site.

Project: Porter-Stemmer
Version: 7.x-1.x-dev
Component: Code
Category: bug report
Priority: normal
Assigned: Unassigned
Status: postponed

Issue Summary

A quick look over the code suggests that if the module is installed on a multi-language site, it will try to stem the non-English content as well as the intended English content. Could a check be added so that, if Locale is enabled, it only attempts to stem content marked as en, en-us, or language neutral?

Comments

#1

Yeah, this makes sense to me. It's hard for me to write code for this b/c I don't have any multilingual sites, but I'd be happy to apply any patches someone else writes/tests.

#2

I'm not up to writing any patches for this yet - only up to chap 5 in your book :-)

#3

For those who need a quick hack, I think this should work:


function porterstemmer_search_preprocess(&$text) {
  global $language;
  if (substr($language->language, 0, 2) == "en") { // current user language is English?

    // Split words from noise and remove apostrophes
    $words = preg_split('/([^a-zA-Z]+)/', str_replace("'", '', $text), -1, PREG_SPLIT_DELIM_CAPTURE);

    // Process each word
    $odd = true;
    foreach ($words as $k => $word) {
      if ($odd) {
        $words[$k] = Stem($word);
      }
      $odd = !$odd;
    }

    // Put it all back together
    return implode('', $words);
  }
}

This same idea could be applied to other porter stemmer modules to specify which language to apply the stemming to, based on the current language of the user. Just change "en" to "de" for German, "es" for Spanish, etc.

Andrew

#4

Status: active » needs review

This makes sense to me. Patch attached - does this work for folks?

Attachment | Size | Status | Test result | Operations
363336_english_only.patch | 1.33 KB | Ignored: Check issue status. | None | None

#5

I reviewed my own code and tested it extensively. I seem to have put a } in the wrong place and forgot to return a value if the current user display language was not English.

The following code works with the English porter-stemmer as well as when installed with other language stemmers - as long as you make the same alterations to the code of the other stemmers.

function porterstemmer_search_preprocess(&$text) {
  global $language;
  // Only stem if the current user display language is "en" (English)
  if (substr($language->language, 0, 2) == "en") {

    // Split words from noise and remove apostrophes
    $words = preg_split('/([^a-zA-Z]+)/', str_replace("'", '', $text), -1, PREG_SPLIT_DELIM_CAPTURE);

    // Process each word
    $odd = true;
    foreach ($words as $k => $word) {
      if ($odd) {
        $words[$k] = Stem($word);
      }
      $odd = !$odd;
    }

    // Put it all back together and return the stemmed text
    return implode('', $words);
  }
  // Return the input text unchanged if the current user display language is not English
  return $text;
}

#6

Status: needs review » needs work

Pasting it into the body like that makes it really hard to review/test code. Can you provide it as a patch? See http://drupal.org/patch/create

#7

Status: needs work » needs review

Updated patch attached - review needed

Andrew

Attachment | Size | Status | Test result | Operations
english_only.patch | 944 bytes | Ignored: Check issue status. | None | None

#8

This idea doesn't seem quite right to me.

It might work for preprocessing the key words when someone is searching, but the search_preprocess function is also called during search *indexing*, which is done in a cron job. So there is no user display language at this time, or if there is, it won't necessarily correspond to the language of the text.

What you would want to do is only apply Porter Stemmer to text that is marked as English, but since all the stemming module gets as input is the text itself, this is not possible to know. A major change in how search indexing is done for multilingual sites would be necessary to fix that -- i.e. a change in Drupal core.
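To make the indexing problem concrete, here is a minimal sketch of what the display-language check does when cron happens to run under one language while indexing text in another. The names sketch_stem(), sketch_search_preprocess() and sketch_set_language() are hypothetical, and sketch_stem() is a crude stand-in for the module's Stem() function, not the real Porter algorithm.

```php
<?php
// Hypothetical stand-in for the module's Stem() function: strips a few
// English-looking suffixes. This is NOT the real Porter algorithm.
function sketch_stem($word) {
  return preg_replace('/(ing|ed|s)$/', '', $word);
}

// Same shape as the hack from #5: the decision depends only on the global
// display language, never on the language of $text itself.
function sketch_search_preprocess($text) {
  global $language;
  if (substr($language->language, 0, 2) != 'en') {
    return $text;
  }
  $words = preg_split('/([^a-zA-Z]+)/', str_replace("'", '', $text), -1, PREG_SPLIT_DELIM_CAPTURE);
  foreach ($words as $k => $word) {
    // Even positions hold words, odd positions hold the captured delimiters.
    if ($k % 2 == 0) {
      $words[$k] = sketch_stem($word);
    }
  }
  return implode('', $words);
}

// Simulate whatever display language happens to be active during cron.
function sketch_set_language($langcode) {
  $GLOBALS['language'] = (object) array('language' => $langcode);
}

sketch_set_language('en');
// French text is indexed with English rules.
$french_indexed = sketch_search_preprocess('les maisons');

sketch_set_language('fr');
// English text is left unstemmed.
$english_indexed = sketch_search_preprocess('walking dogs');
```

In both calls the outcome is driven by the global language rather than by the text, which is the mismatch described above.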

#9

#10

I have closed #262873: Enable different behavior of search indexer per different node languages as a duplicate.

In order for this to be fixed, there would need to be some changes in Drupal core functionality around Search. So I have filed a new issue to address that. Since it would require some additional arguments to core API functions, I think it is unlikely to be addressed in Drupal 6 (we'll probably have to wait for Drupal 7). Anyway, here is the issue:
#511594: hook_search_preprocess needs to be language-specific

#11

If you think this issue is important, please visit #511594: hook_search_preprocess needs to be language-specific and leave a comment explaining that it is important. Otherwise, this change may not make it into Drupal 7 either -- the code freeze is coming up on September 1st.

#12

Status: needs review » postponed

I am marking this issue Postponed until there is some action on the core Search issue. We can't really fix Porter Stemmer to be language-specific until the core Search module's hooks are language-specific.

#13

Version: 6.x-1.x-dev » 6.x-2.x-dev

Just a note on this issue's status: No one reviewed the proposed core fix in #511594: hook_search_preprocess needs to be language-specific. So this did not get into Drupal 7 before the code freeze, and I see no way of addressing this within Porter Stemmer without the core code fix. So, this issue will probably have to be postponed at least until Drupal 8. Sigh. There was a distinct lack of interest from the Drupal community towards addressing this issue...

#14

This is a huge disappointment. I was hoping that D7, at least, would be more multi-lang oriented.
Having started on a multi-lang forum a month ago, I have seen many issues pop up.

The discussion in #511594 as of today (Nov. 9) indicates an awareness of the issue but a lack of resources. Time has run out for D7, which is a real pity, never mind even hacking D6.

#15

I have put the patch up there. The problem has been that no one spoke up to say the issue was important, and no one took the time to test/review the patch.

Please put a comment there on that other issue. It's possible that the patch can still be added to Drupal 7 if someone decides it is important enough.

#16

Well, there wasn't any interest in making this change early enough in the Drupal 7 development cycle to make it happen.

It is now postponed to Drupal 8 or later.

#17

Subscribe

#18

Title: Porter-stemmer should only stem english or language neutral content for a multi-language site » Porter-stemmer should only stem English or language neutral content for a multi-language site

All my sites are multilingual with at least two languages (en + country specific). Most of them have nodes with mixed content (for example English words that are not translatable). Wouldn't implementing this feature here cause such mixed-language content to not get properly stemmed/indexed?

#19

If you have English words in your non-English content, then yes those words would get missed by the stemmer if this was implemented. However, if you are using the Internationalization module, any nodes whose language is set to something other than English will be omitted from an English-language search anyway.

#20

...any nodes whose language is set to something other than English will be omitted from an English-language search anyway.

Sure, but that is a bug though - not a feature. It will not always work this way. Ideally (if/when this bug is resolved), this is what will happen when one searches a term in a multilanguage site with mixed content nodes and the search term is (or can be) found in multiple nodes:

http://drupal.org/node/511594#comment-4844006

So, the bottom line is that I think all stemmers enabled in a site should stem all content (or at least try to), just in case some terms appear in mixed content nodes. They can simply skip/ignore stemming of words that have text not supported by each stemmer. For example, the English stemmer would try to stem a Greek node as best as it can: stem any occurrences of English-text words and skip the rest. If it's not done this way, then perfectly valid content will be omitted from results (see my link above for a use case scenario).

#21

...just in case I didn't make my point clear: I am proposing to change this from "postponed" to "closed - works as designed"

#22

Please don't do that. It may be fine for an English stemmer to run over Greek text, because it probably won't do anything at all (not recognizing Greek letters). But if a stemmer using English linguistic rules stems a different language that uses the Latin alphabet, it may remove things that are not linguistically removable for that language. So this issue is still very valid for other use cases.

#23

OK, then this should be optional in the module's settings so that people like you can enable it and people with use cases like mine can leave the default.

...But if a stemmer using English linguistic rules stems a different language that uses the Latin alphabet, it may remove things that are not linguistically removable for that language.

I don't fully understand this (perhaps because I don't know the internals of the module and how it stores/deletes data). Care to explain this with an example use case scenario?

#24

I don't know of specific examples where this is a problem, but the Porter Stemmer algorithm, when indexing words for searching, does a bunch of suffix-removal steps, basically removing things like -ly, -ing, -ed, -es, -s, and also longer ones like -tion etc. The idea is to get down to the linguistic root of the word, so related words can be searched by typing in the other terms (search for "walked", get text with "walking", etc.).

But in other languages, the suffixes that we remove for English may not be linguistic suffixes, so we might be equating two words that are not actually related, and should not come up in search results mixed together. That is the problem with having English stemming applied to other languages, or other language's stemming applied to English. Having search results pull up unrelated material is not good...
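As a rough illustration of the concern, here is a toy suffix stripper. It is not the Porter algorithm and not this module's code; toy_stem() and its suffix list are assumptions made up for the example. It maps "walked" and "walking" to the same stem, which is what we want in English, but it also maps the Spanish words "pared" (wall) and "pares" (pairs) to the same stem, even though they are unrelated.

```php
<?php
// Toy suffix stripper for illustration only; this is NOT the Porter algorithm.
function toy_stem($word) {
  $word = strtolower($word);
  // Try each suffix in turn; require at least three letters to remain.
  foreach (array('ing', 'ed', 'es', 'ly', 's') as $suffix) {
    $stem_length = strlen($word) - strlen($suffix);
    if ($stem_length >= 3 && substr($word, -strlen($suffix)) === $suffix) {
      return substr($word, 0, $stem_length);
    }
  }
  return $word;
}

// English: related words collapse to one stem, as intended.
$english = array(toy_stem('walked'), toy_stem('walking'));

// Spanish: unrelated words collapse to one stem, which pollutes results.
$spanish = array(toy_stem('pared'), toy_stem('pares'));
```

Whether the real algorithm conflates these particular words is a separate question; the sketch only shows the kind of false match that English suffix rules can produce on other Latin-alphabet languages.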

#25

OK then perhaps someone with more insight can shed some light as to how stemmed data is stored in the db and if it would be possible for each stemmer to store its own data separate from other stemmers? If that is possible, then we wouldn't worry about stemmed data of one language being overwritten by other languages' stemmers. Search would have to combine stemmed data from all stemmers though in that case - don't know if it already does(?).

#26

No. Text is preprocessed/stemmed, and then stored in the search module's database. Search terms are preprocessed/stemmed in the same way and compared to the search module's database. Not possible to store different stemming output separately.
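A simplified sketch of that flow, using an in-memory array instead of the search module's database tables. All function names here are hypothetical and the suffix stripping is a crude stand-in for real stemming. The point is that the index holds only the already-stemmed terms, with no record of which stemmer produced them, and the search keywords go through the same preprocessing before the lookup.

```php
<?php
// Hypothetical preprocess step: lowercase, split into words, strip a suffix.
function flow_preprocess($text) {
  $words = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
  foreach ($words as $k => $word) {
    $words[$k] = preg_replace('/(ing|ed)$/', '', $word);
  }
  return $words;
}

// Indexing: only the preprocessed terms are stored, keyed by term.
function flow_build_index(array $documents) {
  $index = array();
  foreach ($documents as $id => $text) {
    foreach (flow_preprocess($text) as $term) {
      $index[$term][$id] = TRUE;
    }
  }
  return $index;
}

// Searching: keywords are preprocessed the same way, then compared.
function flow_search(array $index, $keywords) {
  $matches = array();
  foreach (flow_preprocess($keywords) as $term) {
    if (isset($index[$term])) {
      $matches += $index[$term];
    }
  }
  $ids = array_keys($matches);
  sort($ids);
  return $ids;
}

$index = flow_build_index(array(
  1 => 'She walked home',
  2 => 'Walking is healthy',
  3 => 'A blue car',
));
$found = flow_search($index, 'walking');
```

Here both documents 1 and 2 are found for "walking" because all three texts were reduced to the term "walk" before storage or lookup.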

#27

Title: Porter-stemmer should only stem English or language neutral content for a multi-language site » Add option for Porter-stemmer to only stem English or language neutral content for a multi-language site.

...Not possible to store different stemming output separately.

Then until this is possible, optional setting it is. So, changing the issue's title to reflect the intention to make this optional in order to satisfy all the use cases we thought of and discussed so far.

#28

UI-wise, all it would take is a set of checkboxes in the module's settings to enable the stemmer for each desired language available in the site + an extra checkbox for neutral content. But we do need to wait on #511594: hook_search_preprocess needs to be language-specific for this I guess.

In #23 above when I say:

...so that people like you can enable it and people with use cases like mine can leave the default.

it seems as if I solely decided what should be the default out of the box. I didn't mean to. Perhaps I should have phrased it better, like:

...so that people can either change the way the module works or set it to work as it does now.

Anyways, the point is that either of the two behaviors can be the default and we should discuss which one it actually should be. If we agree on going the checkboxes-for-each-language way, then what we need to decide is whether those will be checked or not by default. For example, should the English language be checked by default for the Porter stemmer while the others are unchecked? Should the neutral language checkbox be checked or unchecked by default?
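To make the checkbox idea a bit more concrete, the saved setting could be as simple as a map from language code to on/off, which the stemmer consults before touching a piece of text. This is only a sketch: the function names are hypothetical, the 'und' key for neutral content is an assumption, and the defaults shown are just one of the options under discussion, not a decision.

```php
<?php
// One possible set of defaults (an assumption, still open for discussion):
// English and language-neutral content are stemmed, everything else is not.
function option_default_settings() {
  return array(
    'en' => TRUE,
    'und' => TRUE,
  );
}

// Decide whether text in $langcode should be stemmed, given the saved
// checkbox values. Languages missing from the settings count as unchecked.
function option_should_stem($langcode, array $enabled) {
  return !empty($enabled[$langcode]);
}

$settings = option_default_settings();
$stem_english = option_should_stem('en', $settings);
$stem_greek = option_should_stem('el', $settings);

// A site with mixed-language nodes could check Greek as well.
$mixed_site = array('el' => TRUE) + $settings;
$stem_greek_mixed = option_should_stem('el', $mixed_site);
```

With this shape, the "stem everything" behavior is all boxes checked, and the "English only" behavior is only the English box (plus, optionally, neutral) checked.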

#29

This whole discussion is probably worthless, because no one is working on the issue it would depend on (fixing core that is).

#30

Subscribing. Will probably have something to contribute in the not too distant future.

#31

The core issue blocking this was finally fixed in Drupal 8.x. The fix was not backported to Drupal 7 or 6. So, we should be able to fix this in the 8.x version of Porter Stemmer, but not before.
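For reference, a rough sketch of what a language-aware implementation might look like in 8.x, where hook_search_preprocess() receives a $langcode argument in addition to the text. The function names are hypothetical, langaware_stem_word() is a crude stand-in for the real stemmer, and treating 'und' (language not specified) as stemmable while leaving a missing language code alone are assumptions, not decisions made in this issue.

```php
<?php
// Hypothetical stand-in for the module's real stemming function.
function langaware_stem_word($word) {
  return preg_replace('/(ing|ed|s)$/', '', $word);
}

// Sketch of a hook_search_preprocess($text, $langcode) implementation that
// only stems English or language-neutral text.
function langaware_search_preprocess($text, $langcode = NULL) {
  $is_english = $langcode === 'en' || ($langcode !== NULL && strpos($langcode, 'en-') === 0);
  // Assumption: 'und' marks content with no specific language.
  $is_neutral = $langcode === 'und';
  if (!$is_english && !$is_neutral) {
    // Other languages (or an unknown language) are returned untouched.
    return $text;
  }

  // Split words from noise and remove apostrophes, as the module does today.
  $words = preg_split('/([^a-zA-Z]+)/', str_replace("'", '', $text), -1, PREG_SPLIT_DELIM_CAPTURE);
  foreach ($words as $k => $word) {
    // Even positions hold words, odd positions hold the captured delimiters.
    if ($k % 2 === 0) {
      $words[$k] = langaware_stem_word($word);
    }
  }
  return implode('', $words);
}

$stemmed_en = langaware_search_preprocess('walking dogs', 'en');
$untouched_es = langaware_search_preprocess('walking dogs', 'es');
```

The difference from the earlier hack is that the decision is based on the language of the text being processed, not on the display language of whoever triggered the request.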

#32

#33

Version: 6.x-2.x-dev » 7.x-1.x-dev

Yes, that is the issue. See comments above. There is not a good way to fix this until 8.x.

#34

I don't remember why I asked the question back in #32 in the first place, but I guess I actually meant to ask whether #511594: hook_search_preprocess needs to be language-specific brought any API changes and, if not, whether we could then push for a backport. Too late now?

#35

Probably a better place to ask that question would be on the other issue. My feeling is that it is a definite API/behavior change, because in D7 searching and search indexing are not language aware and with this change they are. It's fairly major.
