search_excerpt() doesn't work well with stemming
| Project: | Drupal |
| Version: | 7.x-dev |
| Component: | search.module |
| Category: | bug report |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | needs review |
| Issue tags: | API change |
Background: The search_excerpt() function is used by node_search() to extract an excerpt/snippet showing where in your node the keywords you search for are found. It can also be called by other modules extending Search via hook_search(). The core search doesn't support keyword stemming (e.g. if you search for "work" in core search, you will not find nodes containing "works", "worked", and "working"). But you can add a module like http://drupal.org/project/porterstemmer to add stemming to Search, so that those searches will return results.
The issue: If you do use a stemming module, you'll find that search excerpts don't show you where your keyword is found, because the excerpt function is inflexible. It only searches for exact keyword matches, and there is no way for a module to modify this behavior.
How to resolve: I think what needs to be done is to add a hook to search_excerpt() that modules can use to override the search_excerpt function. This would allow issues like #437084: Excerpt fails to find stemmed keyword to be resolved in stemming modules.
This is an issue in 6.x. I've filed it against 7.x, however, because the search_excerpt function appears to be identical, and it's probably more likely to get noticed in the 7.x issue queue... hope that's OK.

#1
I've worked out a fix for this. In D6 search module on line 1212 add in .'.*?' so the line becomes:
if (preg_match('/'. $boundary . $key . '.*?' . $boundary .'/iu', $text, $match, PREG_OFFSET_CAPTURE, $included[$key]))And the same for line 1270 so that becomes
$text = preg_replace('/'. $boundary .'('. implode('|', $keys) .')'. '.*?' . $boundary .'/iu', '<strong>\0</strong>', $text);The first says match the key plus any more characters up to a boundary. So this find the right extract with the stemmed words. The second is the fix so the whole word is made bold.
#2
I am not sure that all stems are necessarily sub-strings of the full words they match. I don't know enough about how the stemming modules work to know this for sure, but in English, you should have "like" and "liking" being equivalent (the base word is "like", which is not a substring of "liking" -- is the "stemmed" keyword "lik" or "like"? I don't know... does this work in this case?). And what about things like "person" being the singular of "people"? I don't know whether stemming modules work for this or not, but if they do, your solution would not work in this case, I think.
Also, I am not sure that all substrings should be matched for all stems. For instance, if you search for "like", the words "likely" and "likewise" should not be matched, even though they both start with "like".
So I don't know if this is a good general solution. My guess is that for a general solution, you would need a hook that the stemming module could implement to modify the excerpt in some way that is appropriate for that particular stemming algorithm.
#3
Good point. You are right it isn't that simple. Taking your example like, liking likely the porter/stemmer says these all have the stem of like. So all nodes with any of those 3 words get a search key of like when cron runs. Type in any of those 3 words and the search key changes to like, finds the node, but then fails at finding the extract. Stemming rules that reduce say 10 possible words to one is fair enough but to go the other is likely to make one word into 50, some of which will be nonsense i.e. nationality stems to nation, but station doesn't unstem to stationality.
Perhaps the solution is to have multiple passes at the extract shrinking the keys by the end letter each time until there is match. Or I guess the stemmer could have a function that reduces the key down to the common root, so like, liking, likely would reduce to just lik. Or rather lik.*? which would be the right search key. That's probably better.
#4
OK, new mod. Add the '.*?' . to lines 1212 and 1270 as in post #1. And in addition before line 1186
$workkeys = $keys;insert this code:
foreach($keys as $k => $key){search_invoke_preprocess($key);
$keys[$k] = substr($key,0,-1);
}
what that does is take the search keys as typed. Stem them and remove the last character. Of course it would be better if there was a proper hook and stem root function but this probably works most of the time. The stem will be the smallest word but of course like -> liking loses the e, y becomes ies etc. so just removing the last stem character probably gives the correct result most of time, perhaps all, at least in english.
#5
Also: Are there languages (German?) where some stemming might use prefixes as well as suffixes? And what about irregular forms? I don't know if stemming modules work for irregular forms, but for instance, you would want to find "person" if you search for "people" in English, man/men, woman/women, etc. In all of these cases, just accounting for preprocessed-keyword-plus-suffix matching is not going to be useful.
#6
You could be right, but at least the above returns more extracts that are sensible that the current method which returns next none (only if you try in a word that is also a stem, if that is possible).
#7
I still think that having a heuristic like "look for a match of all but the last character of the stemmed keyword, and don't require the match to end on a word boundary" is not really going to solve the issue for all languages. And creating a patch that doesn't really fix the issue will probably not be accepted into the core of Drupal.
However, I do think it is possible to actually solve the problem. I think what is needed is to allow modules to do their own matching on the keys, wherein they could pre-process both the text and the key with their stemming algorithm. So you would replace this line in search_excerpt()
if (preg_match('/' . $boundary . $key . $boundary . '/iu', $text, $match, PREG_OFFSET_CAPTURE, $included[$key])) {with something like this (obviously it would need an addional } to get the loops working correctly):
foreach (module_implements('search_excerpt_match') as $module) {if( module_invoke( $module, 'search_excerpt_match', $boundary, $key, $text, $match )) {
The idea would be to allow stemming modules to see if they can find a match between the given key and the text. The first module to return TRUE would be accepted (you'd want to break out of this foreach loop), or you could do something more complex like accepting the one with the first position in the text. The return value from the hook would be TRUE/FALSE, and the $match array would be passed back by reference and give the position of the match found in the original text string, just as preg_match is currently doing with PREG_OFFSET_CAPTURE. (Though it might make sense to make the hook return something other than what preg_match would return -- this is just a concept so far.) Anyway, the Search module itself could have its own implementation of the new hook_search_excerpt_match, doing what it used to do (looking for exact keys):
search_search_excerpt_match( $boundary, $key, $text, &$match ) {return preg_match('/' . $boundary . $key . $boundary . '/iu', $text, $match, PREG_OFFSET_CAPTURE, $included[$key]);
}
This is not quite complete, because the keyword highlighting at the end of search_excerpt() would also need to be modified, so that the actual word matched would be highlighted, not the "key" (which might not be present in its exact form). Probably you'd want to save the actual word that was found in the
$keysarray, so that at the end:$text = preg_replace('/'. $boundary .'('. implode('|', $keys) .')'. $boundary .'/iu', '<strong>\0</strong>', $text);would still highlight the found words, rather than the original keywords.
Anyway, I might see if I can get this working, with patches for both Porter Stemmer and core Search. I think it's at least the start of a viable idea that would actually solve the problem.
#8
#9
Feel free to look at. I agree my solution is quick and dirty and not suitable as a patch. It just improves a bad situation, for english, to a state where the error is not so noticeable.
BTW it would be useful if your patch had a D6 as well as D7 version as I'm more interested in D6.
#10
I will definitely be developing/testing in D6! The Porter Stemmer module is not out for D7 yet, though it should not be too difficult to port, since it only implements 2 hooks.
#11
I think I have it working in Drupal 6 (have temporarily set the version of this issue to D6).
Attached:
- Patch for the core Search module in D6
- Patch for the Porter Stemmer module in D6 (patch created against the 6.x development branch of Porter Stemmer -- the patch just adds a new function to the module, so you should be able to add the function to pretty much any 6.x version of Porter Stemmer).
- Patch for the D6 docs (these are in the Contrib repository)
If people can review and test these patches for Drupal 6, and if everyone likes them, I will port the search and doc patches to Drupal 7 and submit for inclusion there. I am not sure whether they will want to patch Drupal 6 or not for this issue... not sure what the policy is there.
#12
In case someone wants to test this and isn't up to speed on applying patches, the attached zip file contains a replacement search.module (put into modules/search in your Drupal installation) and a replacement porterstemmer.module (put into sites/all/modules/porterstemmer, or wherever you have your contrib modules).
ONLY FOR DRUPAL 6.x!
#13
OK, I'm giving this a try. Seems to work well so far.
#14
One thing I just thought of: Probably the return value of the hook should be an associative array, such as
'pos' => $p,
'keyword' => $word
rather than just a simple array ($p, $word). That would make it a bit more self-documenting.
I plan to update the patches to do that, but not for a few days (holiday time, and I'm just about to leave on a backpacking trip).
#15
Here are updated D6 patches and zip file, using an associative array. Testing and comments welcome. I'll create a Drupal 7 patch once we're happy with Drupal 6's behavior.
#16
Another thought on the Porter Stemmer component of this patch group: It should verify that the keyword found actually stems to the searched key, after doing the substring match.
Here's a new patch for the Porter Stemmer module (compatible with the other patches). And a new zip file.
#17
Here's a patch for Drupal 7.
#18
The last submitted patch failed testing.
#19
This is very odd. The test that failed was "module dependency", and looking at that test, I do not see how this patch could have affected this test at all. So I am assuming there was something else that caused that test to fail. Requesting re-test.
#20
It would be great if we could get this into Drupal 7, and it would need to be before the code freeze... guess the next step would be if someone could review this patch?
#21
Related: #103548: Implement N-Gram searches. The reason the test fail to pass now, is the search_admin_validate() function is now gone. It was replaced by a proper submit() handler.
I think my solution in that issue was considerably smaller. I would consider testing that patch and see if it achieves the issue with less code changes. It was just a lil bit of regex.
#22
N-grams are NOT the same as stemming algorithms at all. Stemming algorithms are language-specific ways to linguistically reduce a word to its basic root, which is done to both the search terms and the text, and may not result in an actual sub-string of the original words.
N-grams are blind substrings.
Both have their strengths, but they are not equivalent. If you want to use stemming, then n-grams will not do the same thing.
#23
sigh u missed the point...
In the lasted patch, I had to accomplish what you are trying here, meaning highlight a full word when a part of the word was in the $keys
like so
@@ -1252,7 +1280,7 @@ function search_excerpt($keys, $text) {
}
// Locate a keyword (position $p), then locate a space in front (position
// $q) and behind it (position $s)
- if (preg_match('/' . $boundary . $key . $boundary . '/iu', $text, $match, PREG_OFFSET_CAPTURE, $included[$key])) {
+ if (preg_match('/' . $boundary .'[^' . PREG_CLASS_SEARCH_EXCLUDE . PREG_CLASS_CJK . ']*' . $key . '[^' . PREG_CLASS_SEARCH_EXCLUDE . PREG_CLASS_CJK . ']*' . $boundary . '/iu', $text, $match, PREG_OFFSET_CAPTURE, $included[$key])) {
$p = $match[0][1];
if (($q = strpos($text, ' ', max(0, $p - 60))) !== FALSE) {
$end = substr($text, $p, 80);
@@ -1310,7 +1338,7 @@ function search_excerpt($keys, $text) {
$text = (isset($newranges[0]) ? '' : '... ') . implode(' ... ', $out) . ' ...';
// Highlight keywords. Must be done at once to prevent conflicts ('strong' and '<strong>').
- $text = preg_replace('/' . $boundary . '(' . implode('|', $keys) . ')' . $boundary . '/iu', '<strong>\0</strong>', $text);
+ $text = preg_replace('/' . $boundary . '[^' . PREG_CLASS_SEARCH_EXCLUDE . PREG_CLASS_CJK . ']*' . '(' . implode('|', $keys) . ')' . '[^' . PREG_CLASS_SEARCH_EXCLUDE . PREG_CLASS_CJK . ']*' . $boundary . '/iu', '<strong>\0</strong>', $text);
return $text;
}
Sorry for not being clearer, but u should assume people are this dumb :-D. I was trying to point out I had to accomplish exactly what you have here and it did it by that Regex ^^^ which is considerably smaller then what you have here, and is utf-8 safe.
#24
The point I was making is that a "stem" as returned from a stemming algorithm is not necessarily a substring of the full word. I am not that dumb either, just unclear in my writing. :)
#25
Adding tag