I am using this module with the Link Intelligence module, and I have noticed that the existing stemming algorithm produces some odd matches. For example, the term "designers" is stemmed to "design", so links for "web designers" are created to requests for "web design".

I was thinking of creating an override feature where users can define custom stem overrides, e.g. "designer" and "designers" would both be mapped to the stem "designer". I was going to add it to Link Intelligence, but it would make more sense to make it a separate module, integrate it into Porter Stemmer, or make it a submodule of Porter Stemmer.
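To make the proposal concrete, here is a minimal Python sketch of an override table consulted before the stemmer runs. Everything in it is an assumption for illustration: `toy_stem` is a crude suffix stripper standing in for the real Porter algorithm, and `make_stemmer` and the override table are hypothetical names, not part of Porter Stemmer or Link Intelligence.

```python
from typing import Callable, Dict


def toy_stem(word: str) -> str:
    """Stand-in stemmer for illustration only; NOT the Porter algorithm."""
    for suffix in ("ers", "er", "ing", "s"):
        # Keep at least three characters so short words survive intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def make_stemmer(
    overrides: Dict[str, str], fallback: Callable[[str], str]
) -> Callable[[str], str]:
    """Return a stemmer that checks the override table before stemming."""
    lowered = {word.lower(): stem for word, stem in overrides.items()}

    def stem(word: str) -> str:
        key = word.lower()
        if key in lowered:
            return lowered[key]  # user-defined stem wins
        return fallback(key)     # otherwise stem as usual

    return stem


# Hypothetical override table, as a site admin might define it.
stem = make_stemmer({"designer": "designer", "designers": "designer"}, toy_stem)

print(toy_stem("designers"))  # design   (no override)
print(stem("designers"))      # designer (override applied)
print(stem("managers"))       # manag    (no override defined for this word)
```

Only the listed words are affected, so every other word still goes through the normal stemmer.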

Is this something you would want to collaborate on? Thoughts?

Tom

Comments

jhodgdon

If you don't want full linguistic stemming (which is what Porter Stemmer does), you probably want to take the Porter Stemmer module and cut some parts out of its stemming algorithm. I think that making exceptions for individual words won't really solve the problem you are seeing. For instance, you could put in an exception saying "don't stem designers down to design", but you would still have other -er words being stemmed to their root.

I guess I also don't see why matching designers and design is a problem?

dafletcha

I understand the need here. In the content I'm working with, searching for, say, "manager" implies you want information on that role specifically, vs. information on management, managing, etc. It would be nice to be able to define words that get stemmed but that are also preserved in full, to give them a sort of precedence.

So, for example, if I can define "manager" as such a word, during indexing the word is submitted to the index in its stemmed form and in its whole form. And when searching, "manager" is searched for as both a stem and in its whole form. Because nodes that do not contain "manager" will have been indexed with only the stem ("manag"?), they will fail the search for "manager" and be ranked lower than nodes that match both the whole word and the stem. Does this make sense?
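A rough Python sketch of this dual-form idea, under stated assumptions: `toy_stem` is a stand-in suffix stripper (not the Porter algorithm), `PRESERVED` is a hypothetical admin-defined word list, and `score` simply counts matching terms rather than reproducing how Drupal's search actually ranks results.

```python
from typing import List, Set

# Hypothetical list of words to index both stemmed and whole.
PRESERVED: Set[str] = {"manager"}


def toy_stem(word: str) -> str:
    """Stand-in stemmer for illustration only; NOT the Porter algorithm."""
    for suffix in ("ement", "ing", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def expand(text: str) -> List[str]:
    """Emit the stem of every word, plus the whole word if it is preserved."""
    tokens: List[str] = []
    for word in text.lower().split():
        tokens.append(toy_stem(word))
        if word in PRESERVED:
            tokens.append(word)
    return tokens


def score(query: str, document: str) -> int:
    """Count how many distinct query terms appear in the document's index."""
    indexed = set(expand(document))
    return sum(1 for term in set(expand(query)) if term in indexed)


print(score("manager", "the manager likes"))    # 2: stem and whole word match
print(score("manager", "management training"))  # 1: only the stem matches
```

With this simplified scoring, a node that contains the whole word outranks one that only shares the stem, which is the precedence described above.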

jhodgdon

I'm not sure about having two versions of a word getting into the index. It could cause some problems with phrase searching, because the pre-processed text is what is used when searching for a phrase. So in this case, if you had a phrase like "The web site manager likes" in your text, it would go into the index as "the web site manag manager lik" (assuming manager->manag and likes->lik in stemming)... well, maybe that would be OK, because if you did a search on the phrase "manager likes", it would be pre-processed the same way.
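The phrase case can be sketched in Python as follows. This is an illustration under assumptions, not how the search module is implemented: `toy_stem` only strips "er" and "es" so that it reproduces the stems assumed in this comment (manager->manag, likes->lik), and `preprocess` and `phrase_in` are hypothetical helpers.

```python
from typing import FrozenSet


def toy_stem(word: str) -> str:
    """Stand-in stemmer matching the stems assumed above; NOT Porter."""
    for suffix in ("er", "es"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text: str, preserved: FrozenSet[str]) -> str:
    """Replace each word by its stem, adding the whole word if preserved."""
    out = []
    for word in text.lower().split():
        out.append(toy_stem(word))
        if word in preserved:
            out.append(word)
    return " ".join(out)


def phrase_in(processed_phrase: str, processed_doc: str) -> bool:
    """Phrase search as a whole-token substring match on processed text."""
    return f" {processed_phrase} " in f" {processed_doc} "


PRESERVED = frozenset({"manager"})  # hypothetical preserved-word list
doc = preprocess("The web site manager likes", PRESERVED)
print(doc)  # the web site manag manager lik

# The phrase matches when the query is pre-processed the same way...
print(phrase_in(preprocess("manager likes", PRESERVED), doc))    # True
# ...and fails if the query is stemmed without the preserved list.
print(phrase_in(preprocess("manager likes", frozenset()), doc))  # False
```

So the doubled token is harmless for phrase matching only as long as indexing and searching apply exactly the same pre-processing.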

But that would also mean that if you searched for "manager", you would really be doing an AND search requiring both "manager" and "manag" to be in the search index, which would probably screw up the relevance ratings if nothing else.

So I think we could do bypasses, but I think having it both ways is kind of a problem.

mark_fullmer

Status: Active » Closed (outdated)

Given there has been no activity on this in 11 years and Drupal 6 is end-of-life, I'm closing this as outdated.