search_excerpt() doesn't work well with stemming

Background: The search_excerpt() function is used by node_search() to extract an excerpt/snippet showing where in your node the keywords you search for are found. It can also be called by other modules extending Search via hook_search(). The core search doesn't support keyword stemming (e.g. if you search for "work" in core search, you will not find nodes containing "works", "worked", and "working"). But you can add a module like http://drupal.org/project/porterstemmer to add stemming to Search, so that those searches will return results.

The issue: If you do use a stemming module, you'll find that search excerpts don't show you where your keyword is found, because the excerpt function is inflexible. It only searches for exact keyword matches, and there is no way for a module to modify this behavior.

How to resolve: I think what needs to be done is to add a hook to search_excerpt() that modules can use to override the search_excerpt function. This would allow issues like #437084: Excerpt fails to find stemmed keyword to be resolved in stemming modules.

This is an issue in 6.x. I've filed it against 7.x, however, because the search_excerpt function appears to be identical, and it's probably more likely to get noticed in the 7.x issue queue... hope that's OK.

Comment	File	Size	Author
#44	493270.patch	10.72 KB	mcarbone
#43	search_excerpt_in_progress.patch	3.58 KB	mcarbone
#35	search_excerpt_simplified.patch	3.44 KB	mcarbone
#17	493270.patch	5.01 KB	jhodgdon
#16	493270_porter_v3_D6.patch	1.9 KB	jhodgdon
#16	493270_v3_D6.zip	17.39 KB	jhodgdon
#15	493270_v2_D6.zip	17.27 KB	jhodgdon
#15	493270_doc_v2_D6.patch	2.16 KB	jhodgdon
#15	493270_porter_v2_D6.patch	1.53 KB	jhodgdon
#15	493270_search_v2_D6.patch	2.82 KB	jhodgdon
#12	493270_D6.zip	17.47 KB	jhodgdon
#11	493270_search_D6.patch	2.82 KB	jhodgdon
#11	493270_porter_D6.patch	1.51 KB	jhodgdon
#11	493270_doc_D6.patch	2.08 KB	jhodgdon

Comments

Comment #1

malc_b commented 17 June 2009 at 22:18

I've worked out a fix for this. In D6 search module on line 1212 add in .'.*?' so the line becomes:

if (preg_match('/'. $boundary . $key . '.*?' . $boundary .'/iu', $text, $match, PREG_OFFSET_CAPTURE, $included[$key]))

And the same for line 1270 so that becomes

$text = preg_replace('/'. $boundary .'('. implode('|', $keys) .')'. '.*?' . $boundary .'/iu', '<strong>\0</strong>', $text);

The first change matches the key plus any further characters up to the next boundary, so it finds the right extract for stemmed words. The second makes the whole matched word bold.
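A minimal standalone sketch of what the added '.*?' does. It assumes a plain \b word boundary in place of the real $boundary expression built inside search_excerpt(), and find_key() is a made-up helper name, not a Drupal function.

```php
<?php
// Simplified stand-in for the $boundary pattern used in search_excerpt().
// Assumption: a plain \b word boundary is close enough for ASCII text.
$boundary = '\b';

// Hypothetical helper: returns the matched word and its byte offset, or NULL.
function find_key($boundary, $key, $text, $allow_suffix) {
  // With $allow_suffix, any characters may follow the key up to the boundary.
  $suffix = $allow_suffix ? '.*?' : '';
  $pattern = '/' . $boundary . preg_quote($key, '/') . $suffix . $boundary . '/iu';
  if (preg_match($pattern, $text, $match, PREG_OFFSET_CAPTURE)) {
    return array('keyword' => $match[0][0], 'pos' => $match[0][1]);
  }
  return NULL;
}

$text = 'She was working late.';
$exact = find_key($boundary, 'work', $text, FALSE);  // exact key only
$loose = find_key($boundary, 'work', $text, TRUE);   // key plus any suffix
```

Here the exact pattern finds nothing, while the loosened one picks up "working". The same loosening would also pick up unrelated words that merely start with the key.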

Comment #2

jhodgdon commented 17 June 2009 at 23:43

I am not sure that all stems are necessarily sub-strings of the full words they match. I don't know enough about how the stemming modules work to know this for sure, but in English, you should have "like" and "liking" being equivalent (the base word is "like", which is not a substring of "liking" -- is the "stemmed" keyword "lik" or "like"? I don't know... does this work in this case?). And what about things like "person" being the singular of "people"? I don't know whether stemming modules work for this or not, but if they do, your solution would not work in this case, I think.

Also, I am not sure that all substrings should be matched for all stems. For instance, if you search for "like", the words "likely" and "likewise" should not be matched, even though they both start with "like".

So I don't know if this is a good general solution. My guess is that for a general solution, you would need a hook that the stemming module could implement to modify the excerpt in some way that is appropriate for that particular stemming algorithm.

Comment #3

malc_b commented 18 June 2009 at 18:32

Good point. You are right, it isn't that simple. Taking your example of like, liking, and likely, the Porter stemmer says these all have the stem "like". So all nodes with any of those 3 words get a search key of "like" when cron runs. Type in any of those 3 words and the search key changes to "like" and finds the node, but then fails at finding the extract. Stemming rules that reduce, say, 10 possible words to one are fair enough, but going the other way is likely to turn one word into 50, some of which will be nonsense, e.g. nationality stems to nation, but station doesn't unstem to stationality.

Perhaps the solution is to make multiple passes at the extract, shrinking the keys by the end letter each time until there is a match. Or I guess the stemmer could have a function that reduces the key down to the common root, so like, liking, likely would reduce to just lik. Or rather lik.*? which would be the right search key. That's probably better.

Comment #4

malc_b commented 18 June 2009 at 21:46

OK, new mod. Add the '.*?' . to lines 1212 and 1270 as in post #1. And in addition before line 1186

$workkeys = $keys;

insert this code:

  foreach($keys as $k => $key){
    search_invoke_preprocess($key);
    $keys[$k] = substr($key,0,-1);
  }

What that does is take the search keys as typed, stem them, and remove the last character. Of course it would be better if there was a proper hook and a stem root function, but this probably works most of the time. The stem will be the shortest form of the word, but of course like -> liking loses the e, y becomes ies, etc., so just removing the last character of the stem probably gives the correct result most of the time, perhaps always, at least in English.
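A rough sketch of that heuristic outside Drupal. The stem is passed in directly because no stemming module is loaded in a standalone script (the comment above uses search_invoke_preprocess() for that step). The helper name trimmed_stem_pattern() and the \b boundary are assumptions for illustration only.

```php
<?php
// Hypothetical helper: build the excerpt pattern from an already-stemmed key
// by dropping its last character, as described above.
function trimmed_stem_pattern($stem) {
  // Remove the final character in a UTF-8 safe way (substr() counts bytes).
  $root = preg_replace('/.$/us', '', $stem);
  return '/\b' . preg_quote($root, '/') . '.*?\b/iu';
}

$pattern = trimmed_stem_pattern('like');  // the root becomes "lik"
$text = 'I am liking this, and she likely liked it too.';
preg_match_all($pattern, $text, $matches);
// $matches[0] now holds every word that starts with "lik".
```

Note that "likely" is matched along with "liking" and "liked". Whether that is right depends on whether the stemmer in use treats "likely" as a form of "like".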

Comment #5

jhodgdon commented 22 June 2009 at 21:23

Also: Are there languages (German?) where some stemming might use prefixes as well as suffixes? And what about irregular forms? I don't know if stemming modules work for irregular forms, but for instance, you would want to find "person" if you search for "people" in English, man/men, woman/women, etc. In all of these cases, just accounting for preprocessed-keyword-plus-suffix matching is not going to be useful.

Comment #6

malc_b commented 28 June 2009 at 11:00

You could be right, but at least the above returns more sensible extracts than the current method, which returns next to none (only when the word you type in is itself a stem, if that is possible).

Comment #7

jhodgdon commented 1 July 2009 at 16:49

I still think that having a heuristic like "look for a match of all but the last character of the stemmed keyword, and don't require the match to end on a word boundary" is not really going to solve the issue for all languages. And creating a patch that doesn't really fix the issue will probably not be accepted into the core of Drupal.

However, I do think it is possible to actually solve the problem. I think what is needed is to allow modules to do their own matching on the keys, wherein they could pre-process both the text and the key with their stemming algorithm. So you would replace this line in search_excerpt()

     if (preg_match('/' . $boundary . $key . $boundary . '/iu', $text, $match, PREG_OFFSET_CAPTURE, $included[$key])) {

with something like this (obviously it would need an additional } to get the loops working correctly):

   foreach (module_implements('search_excerpt_match') as $module) {
       if( module_invoke( $module, 'search_excerpt_match', $boundary, $key, $text, $match )) {

The idea would be to allow stemming modules to see if they can find a match between the given key and the text. The first module to return TRUE would be accepted (you'd want to break out of this foreach loop), or you could do something more complex like accepting the one with the first position in the text. The return value from the hook would be TRUE/FALSE, and the $match array would be passed back by reference and give the position of the match found in the original text string, just as preg_match is currently doing with PREG_OFFSET_CAPTURE. (Though it might make sense to make the hook return something other than what preg_match would return -- this is just a concept so far.) Anyway, the Search module itself could have its own implementation of the new hook_search_excerpt_match, doing what it used to do (looking for exact keys):

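function search_search_excerpt_match( $boundary, $key, $text, &$match ) {
  // Concept only: $included[$key] (the offset to search from) is local to
  // search_excerpt() and would also need to be passed to the hook.
  return preg_match('/' . $boundary . $key . $boundary . '/iu', $text, $match, PREG_OFFSET_CAPTURE, $included[$key]);
}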

This is not quite complete, because the keyword highlighting at the end of search_excerpt() would also need to be modified, so that the actual word matched would be highlighted, not the "key" (which might not be present in its exact form). Probably you'd want to save the actual word that was found in the $keys array, so that at the end:

  $text = preg_replace('/'. $boundary .'('. implode('|', $keys) .')'. $boundary .'/iu', '<strong>\0</strong>', $text);

would still highlight the found words, rather than the original keywords.

Anyway, I might see if I can get this working, with patches for both Porter Stemmer and core Search. I think it's at least the start of a viable idea that would actually solve the problem.
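The "first implementation to return TRUE wins" idea can be sketched in plain PHP without Drupal's hook system. All function names below are hypothetical, the \b boundary is a simplification of $boundary, and the toy suffix matcher only stands in for a real stemming module.

```php
<?php
// Hypothetical matchers standing in for hook_search_excerpt_match()
// implementations. Each returns TRUE on success and fills $match the way
// preg_match() does with PREG_OFFSET_CAPTURE.
function exact_excerpt_match($key, $text, &$match, $offset) {
  $pattern = '/\b' . preg_quote($key, '/') . '\b/iu';
  return preg_match($pattern, $text, $match, PREG_OFFSET_CAPTURE, $offset) === 1;
}

// Toy "stemmer" matcher: accepts the key followed by s, ed or ing.
// A real stemming module would apply its own algorithm here.
function toy_stem_excerpt_match($key, $text, &$match, $offset) {
  $pattern = '/\b' . preg_quote($key, '/') . '(?:s|ed|ing)\b/iu';
  return preg_match($pattern, $text, $match, PREG_OFFSET_CAPTURE, $offset) === 1;
}

// Ask each matcher in turn; the first one that reports a match wins.
function find_excerpt_match(array $matchers, $key, $text, &$match, $offset = 0) {
  foreach ($matchers as $matcher) {
    if ($matcher($key, $text, $match, $offset)) {
      return TRUE;
    }
  }
  $match = array();
  return FALSE;
}

$matchers = array('exact_excerpt_match', 'toy_stem_excerpt_match');
$match = array();
$found = find_excerpt_match($matchers, 'work', 'It worked in the end.', $match);
// On success, $match[0][0] is the word found and $match[0][1] its offset.
```

Saving $match[0][0] (the word actually found) rather than the original key is what would let the final highlighting step mark "worked" instead of looking for a bare "work".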

Comment #8

jhodgdon commented 1 July 2009 at 16:59

Assigned:

Unassigned

Comment #9

malc_b commented 2 July 2009 at 10:07

Feel free to look at it. I agree my solution is quick and dirty and not suitable as a patch. It just improves a bad situation, for English, to a state where the error is not so noticeable.

BTW it would be useful if your patch had a D6 as well as D7 version as I'm more interested in D6.

Comment #10

jhodgdon commented 2 July 2009 at 15:01

I will definitely be developing/testing in D6! The Porter Stemmer module is not out for D7 yet, though it should not be too difficult to port, since it only implements 2 hooks.

Comment #11

jhodgdon commented 2 July 2009 at 22:47

Version:	7.x-dev	» 6.x-dev
Status:	Active	» Needs review

Status	File	Size
new	493270_doc_D6.patch	2.08 KB
new	493270_porter_D6.patch	1.51 KB
new	493270_search_D6.patch	2.82 KB

I think I have it working in Drupal 6 (have temporarily set the version of this issue to D6).

Attached:
- Patch for the core Search module in D6
- Patch for the Porter Stemmer module in D6 (patch created against the 6.x development branch of Porter Stemmer -- the patch just adds a new function to the module, so you should be able to add the function to pretty much any 6.x version of Porter Stemmer).
- Patch for the D6 docs (these are in the Contrib repository)

If people can review and test these patches for Drupal 6, and if everyone likes them, I will port the search and doc patches to Drupal 7 and submit for inclusion there. I am not sure whether they will want to patch Drupal 6 or not for this issue... not sure what the policy is there.

Comment #12

jhodgdon commented 2 July 2009 at 22:53

Status	File	Size
new	493270_D6.zip	17.47 KB

In case someone wants to test this and isn't up to speed on applying patches, the attached zip file contains a replacement search.module (put into modules/search in your Drupal installation) and a replacement porterstemmer.module (put into sites/all/modules/porterstemmer, or wherever you have your contrib modules).

ONLY FOR DRUPAL 6.x!

Comment #13

malc_b commented 3 July 2009 at 11:42

OK, I'm giving this a try. Seems to work well so far.

Comment #14

jhodgdon commented 3 July 2009 at 15:12

One thing I just thought of: Probably the return value of the hook should be an associative array, such as
'pos' => $p,
'keyword' => $word
rather than just a simple array ($p, $word). That would make it a bit more self-documenting.

I plan to update the patches to do that, but not for a few days (holiday time, and I'm just about to leave on a backpacking trip).

Comment #15

jhodgdon commented 6 July 2009 at 17:40

Status	File	Size
new	493270_search_v2_D6.patch	2.82 KB
new	493270_porter_v2_D6.patch	1.53 KB
new	493270_doc_v2_D6.patch	2.16 KB
new	493270_v2_D6.zip	17.27 KB

Here are updated D6 patches and zip file, using an associative array. Testing and comments welcome. I'll create a Drupal 7 patch once we're happy with Drupal 6's behavior.

Comment #16

jhodgdon commented 6 July 2009 at 22:33

Status	File	Size
new	493270_v3_D6.zip	17.39 KB
new	493270_porter_v3_D6.patch	1.9 KB

Another thought on the Porter Stemmer component of this patch group: It should verify that the keyword found actually stems to the searched key, after doing the substring match.

Here's a new patch for the Porter Stemmer module (compatible with the other patches). And a new zip file.
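The verification step described above (substring match first, then confirm the candidate really stems to the key) might look roughly like this. toy_stem() is a deliberately crude stand-in and NOT the Porter algorithm, and match_verified_by_stem() is a hypothetical name, not the function in the attached patch.

```php
<?php
// Toy stemmer used only for this illustration: lowercases and strips a few
// English suffixes. It is NOT the Porter algorithm.
function toy_stem($word) {
  return preg_replace('/(?:ing|ed|s)$/', '', strtolower($word));
}

// Find the first word that starts with $key and also stems back to $key.
function match_verified_by_stem($key, $text) {
  $pattern = '/\b' . preg_quote($key, '/') . '\w*/iu';
  if (!preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE)) {
    return NULL;
  }
  foreach ($matches[0] as $candidate) {
    list($word, $pos) = $candidate;
    // Reject candidates that only share a prefix with the key.
    if (toy_stem($word) === toy_stem($key)) {
      return array('pos' => $pos, 'keyword' => $word);
    }
  }
  return NULL;
}

$result = match_verified_by_stem('walk', 'A walkway where people were walking.');
```

In this example "walkway" passes the substring test but is rejected by the stem check, so the result points at "walking" instead.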

Comment #17

jhodgdon commented 8 July 2009 at 16:45

Version:	6.x-dev	» 7.x-dev

Status	File	Size
new	493270.patch	5.01 KB

Here's a patch for Drupal 7.

Comment #18

22 July 2009 at 11:20

Status:	Needs review	» Needs work

The last submitted patch failed testing.

Comment #19

jhodgdon commented 22 July 2009 at 15:33

Status:	Needs work	» Needs review

This is very odd. The test that failed was "module dependency", and looking at that test, I do not see how this patch could have affected this test at all. So I am assuming there was something else that caused that test to fail. Requesting re-test.

Comment #20

jhodgdon commented 3 August 2009 at 17:50

Assigned:	» Unassigned

It would be great if we could get this into Drupal 7, and it would need to be before the code freeze... guess the next step would be if someone could review this patch?

Comment #21

Scott Reynolds commented 10 August 2009 at 05:28

Related: #103548: Partial Search in Drupal Core. The reason the test fails now is that the search_admin_validate() function is gone. It was replaced by a proper submit() handler.

I think my solution in that issue was considerably smaller. I would consider testing that patch to see if it resolves this issue with fewer code changes. It was just a lil bit of regex.

Comment #22

jhodgdon commented 10 August 2009 at 15:03

N-grams are NOT the same as stemming algorithms at all. Stemming algorithms are language-specific ways to linguistically reduce a word to its basic root, which is done to both the search terms and the text, and may not result in an actual sub-string of the original words.

N-grams are blind substrings.

Both have their strengths, but they are not equivalent. If you want to use stemming, then n-grams will not do the same thing.

Comment #23

Scott Reynolds commented 10 August 2009 at 18:38

sigh u missed the point...

In the latest patch, I had to accomplish what you are trying here, meaning highlight a full word when a part of the word was in the $keys

like so

@@ -1252,7 +1280,7 @@ function search_excerpt($keys, $text) {
       }
       // Locate a keyword (position $p), then locate a space in front (position
       // $q) and behind it (position $s)
-      if (preg_match('/' . $boundary . $key . $boundary . '/iu', $text, $match, PREG_OFFSET_CAPTURE, $included[$key])) {
+      if (preg_match('/' . $boundary .'[^' . PREG_CLASS_SEARCH_EXCLUDE . PREG_CLASS_CJK . ']*' . $key . '[^' . PREG_CLASS_SEARCH_EXCLUDE . PREG_CLASS_CJK . ']*' . $boundary . '/iu', $text, $match, PREG_OFFSET_CAPTURE, $included[$key])) {
         $p = $match[0][1];
         if (($q = strpos($text, ' ', max(0, $p - 60))) !== FALSE) {
           $end = substr($text, $p, 80);
@@ -1310,7 +1338,7 @@ function search_excerpt($keys, $text) {
   $text = (isset($newranges[0]) ? '' : '... ') . implode(' ... ', $out) . ' ...';
 
   // Highlight keywords. Must be done at once to prevent conflicts ('strong' and '<strong>').
-  $text = preg_replace('/' . $boundary . '(' . implode('|', $keys) . ')' . $boundary . '/iu', '<strong>\0</strong>', $text);
+  $text = preg_replace('/' . $boundary . '[^' . PREG_CLASS_SEARCH_EXCLUDE . PREG_CLASS_CJK . ']*' . '(' . implode('|', $keys) . ')' . '[^' . PREG_CLASS_SEARCH_EXCLUDE . PREG_CLASS_CJK . ']*' . $boundary . '/iu', '<strong>\0</strong>', $text);
   return $text;
 }

>N-grams are NOT the same as stemming algorithms at all. Stemming algorithms are language-specific ways to linguistically reduce a word to its basic root, which is done to both the search terms and the text, and may not result in an actual sub-string of the original words.

Sorry for not being clearer, but u should assume people are this dumb :-D. I was trying to point out I had to accomplish exactly what you have here, and I did it with that Regex ^^^, which is considerably smaller than what you have here, and is utf-8 safe.
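For readers without a Drupal checkout, the effect of the regex in that diff can be approximated as follows. This sketch assumes \w stands in for "anything not in PREG_CLASS_SEARCH_EXCLUDE or PREG_CLASS_CJK", which is a simplification, and highlight_containing_words() is a made-up name.

```php
<?php
// Highlight every whole word that contains one of the keys anywhere inside it.
// Assumption: \w approximates the negated Drupal character classes in the diff.
function highlight_containing_words(array $keys, $text) {
  $quoted = array();
  foreach ($keys as $key) {
    $quoted[] = preg_quote($key, '/');
  }
  $pattern = '/\b\w*(?:' . implode('|', $quoted) . ')\w*\b/iu';
  // \0 in the replacement is the entire matched word.
  return preg_replace($pattern, '<strong>\0</strong>', $text);
}

$html = highlight_containing_words(array('eat'), 'We were eating a repeated meal.');
```

With the key "eat" this highlights "eating" but also "repeated", which shows both the appeal of the approach (small, no new hook) and the earlier objection that a substring match is not the same thing as a stem match.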

Comment #24

jhodgdon commented 11 August 2009 at 14:26

The point I was making is that a "stem" as returned from a stemming algorithm is not necessarily a substring of the full word. I am not that dumb either, just unclear in my writing. :)

Comment #25

jhodgdon commented 20 August 2009 at 14:14

Issue tags:

Adding tag

Comment #26

cburschka commented 31 December 2009 at 12:59

Version:	7.x-dev	» 8.x-dev

If this really does involve an API change, we may need to push it back to D8 now, sadly...

Comment #27

gpk commented 10 July 2010 at 02:05

This is rather cool.

Have sort of got it working on Drupal 6.x after a bit of hacking. I'm assuming the latest code in porterstemmer_sbp_excerpt_match() from porterstemmer 6.x-2.5 is what should be used rather than what's in porterstemmer_search_excerpt_match() from #16 above?

I hit a problem with the line
if ($foundstem == $key) {
since this test will fail if there are differences in capitalisation.

Also am I right in thinking that if an exact match for a $key is found -- in the new search_excerpt() -- then any potential matches of the stemmed $key prior to this will be missed?

[Currently I'm experimenting with a custom module which implements mymodule_preprocess_search_result() to override the default snippet - this seems to provide a practical way of getting this working in 6.x]

Thanks!

Comment #28

jhodgdon commented 10 July 2010 at 14:48

Yeah, the latest code in Porter Stemmer and Search by Page can be used in combination. Search by Page invokes the hook, and Porter Stemmer implements it.

Porter Stemmer lower cases everything before it does any stemming, so maybe that takes care of the upper/lower cased issue? Not sure... And I'm not sure about your other questions... will have to do some thinking (another day).

Comment #29

gpk commented 11 July 2010 at 14:44

>Porter Stemmer lower cases everything before it does any stemming
It looks as though the lowercasing happens in porterstemmer_search_preprocess http://drupalcode.org/viewvc/drupal/contributions/modules/porterstemmer/..., rather than in porterstemmer_stem. So $foundstem (http://drupalcode.org/viewvc/drupal/contributions/modules/porterstemmer/...) can have caps in it, as can $key. Using

if (drupal_strtolower($foundstem) == drupal_strtolower($key)) {

at line 105 fixed this problem for me. I've opened an issue against porterstemmer for this.

#850950: capitalization can cause porterstemmer_sbp_excerpt_match() to miss matches

re. the other question, yes I need to do some proper tests for this. Probably another day!!!!

Comment #30

jhodgdon commented 12 July 2010 at 13:51

Good catch! Thanks, I'll take care of that over in Porter Stemmer.

Comment #31

andypost commented 10 August 2010 at 09:20

Subscribe

Comment #32

gpk commented 13 August 2010 at 09:10

Status:	Needs review	» Needs work

Thanks jhodgdon, #850950-7: capitalization can cause porterstemmer_sbp_excerpt_match() to miss matches takes care of the capitalization problem. While I was testing that I also investigated the problem I alluded to at #27 and #29 above.

A node has content "qqQqqeateat qqQqqeating qqQqqeat hello world"

When I search for "qqqqqeat" I get the following snippet:

qqQqqeateat qqQqqeating qqQqqeat hello world

(I'm using the latest code from Search by Page instead of #15/#16.)

What seems to be happening is that the exact match is taking priority, and then, provided our excerpt length is under 256 characters, and having tried any other keys, the code only looks for any *subsequent* matches against that particular key. If the node has the words in a different order "qqQqqeat qqQqqeateat qqQqqeating hello world" then I do get qqQqqeating highlighted as well i.e. the snippet is qqQqqeat qqQqqeateat qqQqqeating hello world, because having found the exact match (the "bare keyword") there is a *subsequent* valid stem match.
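The behaviour described above can be reproduced with a small model. This is only a sketch of the described logic (exact match first, stem match as a fallback, always searching forward from the previous match), not the actual Search by Page code. The toy stem rule and all function names are assumptions.

```php
<?php
// Toy stem rule for illustration: the key followed by "s", "ed" or "ing".
function stem_pattern($key) {
  return '/\b' . preg_quote($key, '/') . '(?:s|ed|ing)\b/iu';
}

function exact_pattern($key) {
  return '/\b' . preg_quote($key, '/') . '\b/iu';
}

// Collect matches the way described above: at each step try the exact
// pattern first and fall back to the stem pattern only if it fails, always
// searching forward from the end of the previous match.
function collect_matches($key, $text) {
  $found = array();
  $offset = 0;
  while (TRUE) {
    if (!preg_match(exact_pattern($key), $text, $m, PREG_OFFSET_CAPTURE, $offset)
        && !preg_match(stem_pattern($key), $text, $m, PREG_OFFSET_CAPTURE, $offset)) {
      break;
    }
    $found[] = $m[0][0];
    $offset = $m[0][1] + strlen($m[0][0]);
  }
  return $found;
}

$a = collect_matches('qqqqqeat', 'qqQqqeateat qqQqqeating qqQqqeat hello world');
$b = collect_matches('qqqqqeat', 'qqQqqeat qqQqqeateat qqQqqeating hello world');
```

In the first ordering the exact match sits after "qqQqqeating", so the search jumps past it and the stemmed form is never found. In the second ordering the exact match comes first and the stemmed form is picked up afterwards.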

Comment #33

jhodgdon commented 13 August 2010 at 14:14

Hmmm. Thinking about how the SBP function works, that would be the case, because I think it is only invoking the preprocessor module to find matches if it doesn't find exact matches, and as you noticed, that also applies to "well, I have one match, let's see if there's another one", which always looks between the position of the match it found and the end of the string.

I have filed this as an issue in Search by Page. Thanks for your investigations! I'll see what I can do in the next few days about fixing this up. I'm so glad someone is testing all of this besides me. :)
#882328: When finding excerpts, exact matches have priority over preprocessing matches

Comment #34

jhodgdon commented 15 October 2010 at 22:01

Title:	search_excerpt() doesn't work well with stemming	» search_excerpt() doesn't work well with stemming, diacritical accents, etc.
Version:	8.x-dev	» 7.x-dev

We need to reopen this for D7. The issue is broader than just stemming, it also happens with diacritics/accents. See
#916086: search_excerpt() doesn't highlight words that are matched via search_simplify()
#731298: Searches for words with diacritics/accents: word not highlighted in results
which I've marked as duplicates of this issue. They're relevant even without stemming problems. This needs to be fixed.

Comment #35

mcarbone commented 19 October 2010 at 22:38

Title:	search_excerpt() doesn't work well with stemming, diacritical accents, etc.	» search_excerpt() doesn't work well with search_simplify(), stemming, and diacritical accents
Status:	Needs work	» Needs review

Status	File	Size
new	search_excerpt_simplified.patch	3.44 KB

Well, then, I reattach here the patch I originally wrote for #916086: search_excerpt() doesn't highlight words that are matched via search_simplify() as it addresses search_excerpt not supporting matches made via search_simplify().

I'm not convinced that we should focus on it respecting stemmed matches, since that could be handled by the contributed modules themselves. Handling stemmed excerpts is a feature to me -- not handling search_simplify and diacritics excerpting is a bug.

Lastly, I'll reiterate my point from #731298: Searches for words with diacritics/accents: word not highlighted in results that the diacritics issue has nothing to do with search_excerpt, and everything to do with mysql collation, and so perhaps should be handled separately by stripping diacritics entirely in the index. jhodgdon, you seem to disagree since you closed it as a dupe -- can you explain? Or do you think we should fix that problem in this thread anyway?

Comment #36

jhodgdon commented 19 October 2010 at 23:52

Ah. Perhaps I should not have closed that other issue as a dup -- feel free to override me and reopen it. :)

Contributed modules, without a patch similar to ones attached above in previous comments, have no way to highlight matches, although they are using the API provided by the search module. So I think it's all part of the same picture.

Regarding the current patch, it needs more tests before I will believe that it works for diacritical marks. It seems currently to only be testing numbers, which are one facet of the problem.

Also, I'll need to read through this patch some more to understand what it's doing... One thing I noticed is that I think it assumes that search_simplify($key) results in exactly one word. That's not necessarily going to always be true.

Comment #37

mcarbone commented 20 October 2010 at 14:43

Title:	search_excerpt() doesn't work well with search_simplify(), stemming, and diacritical accents	» search_excerpt() doesn't work well with search_simplify(), and stemming
Status:	Needs review	» Needs work

Yep, you're right: this doesn't handle quoted keywords correctly. I'll take a stab at that in the near future, and add a test for it as well. I'm not sure other tests are needed, because it's not as if I'm just testing numbers here -- I'm testing the use of search_simplify in general (which is tested in all of its variations elsewhere). This just needs to make sure that changes made by search_simplify are still excerpted.

I see your point re: contributed modules using the search API, and I think it would likely involve adding a new API call in this patch to allow other modules to have a say about excerpting. But I'm still not convinced that this should hold up the rest of this thread, which is focused on a core bug caused by search_simplify (and nothing else in core as far as I know, if you accept my diacritics argument). I don't see why a contributed module couldn't just add a new preprocess variable to search_result.tpl.php to solve this itself. Again, I'm not against adding this functionality, but I don't see it as important as the search_simplify bug fix. But when I get to the above fix, I'll take a stab at this in case it's fairly easy to do.

I re-opened #731298: Searches for words with diacritics/accents: word not highlighted in results and removed diacritics from the subject line here.

Comment #38

jhodgdon commented 20 October 2010 at 17:02

Ideally, I'd like to see a test that tested highlighting of several different types of keywords that search_simplify would alter, in a larger chunk of text. If we had such a test, then we would be assured that future changes to the search module wouldn't break the desirable behavior of highlighting such things, even if it might not be totally necessary for this particular issue. More testing is good...

And yes, the patches above did introduce a new API to allow contrib modules to highlight their own words. I actually have that working using the contrib modules Search by Page and Porter Stemmer. If we did it via an API (which could be that one or maybe something simpler that just let contrib modules say "this is a variation on the keyword that should also be highlighted in the excerpt", similar to how your patch is working), then the search module could just implement the hook too, making the whole thing more modular.

As far as adding a variable to the TPL, that's a possible solution, but it would then require people to make a change in their theme's implementation of the TPL to print out that variable instead of the search excerpt calculated by the node module. So I don't think it's a very good solution myself.

And regarding multiple words, I wasn't actually referring to quoted keywords -- search_excerpt already ignores this and highlights each individual word anyway. What I was referring to was the possibility that search_simplify() could take a string like "abc,def" and return "abc def" -- i.e. replace punctuation with spaces, making one word into two words. And you'd then possibly want to highlight abc, def, and "abc,def" anywhere in the text?
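A small sketch of the "one word becomes two" case. toy_simplify() is a crude stand-in that only lowercases and turns runs of punctuation into spaces. The real search_simplify() does more than this and may treat some punctuation differently, so take the behaviour shown here as an assumption.

```php
<?php
// Toy stand-in for search_simplify(): lowercases and replaces every run of
// characters that are neither letters nor digits with a single space.
function toy_simplify($text) {
  $text = preg_replace('/[^\p{L}\p{N}]+/u', ' ', $text);
  return trim(strtolower($text));
}

// Hypothetical helper: list the strings an excerpt might want to highlight
// for one typed keyword, i.e. each simplified word plus the original form.
function highlight_candidates($key) {
  $words = explode(' ', toy_simplify($key));
  $words[] = $key;
  return array_values(array_unique($words));
}

$candidates = highlight_candidates('abc,def');
```

For 'abc,def' this yields abc, def, and the original 'abc,def', which is why code that assumes one simplified word per key would miss some highlights.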

Comment #39

mcarbone commented 21 October 2010 at 16:02

Title:	search_excerpt() doesn't work well with search_simplify(), and stemming	» search_excerpt() doesn't work well with search_simplify() and stemming

Well, in any case, it does turn out that quoted keywords are a problem when search_simplify gets involved. That is, if you have "one: two" in the text and search for "one two," my above patch wasn't highlighting the phrase appropriately. I've now fixed this issue, and from some tests it looks like the "abc,def" and "abc--def" situation isn't a problem. I'll add these to the testing suite.

I've looked at your patch and I agree that ideally I should combine my patch with yours. That is, search should implement search_excerpt_match to find search_simplify() related matches. However, I do think this might involve slightly tweaking your patch to allow an array of keywords to be returned, as opposed to just one, to catch the "one: two" situation mentioned above.

Assuming the merge works well, hopefully webchick/dries will be cool with adding a new API call (since it fixes a bug and it won't break any contrib functionality). If not, it wouldn't be too hard to post a version w/o it, but it would only work with search_simplify and not contributed stemmers, etc until D8.

Comment #40

jhodgdon commented 21 October 2010 at 16:06

I think that if someone searches for "one two" with quotes, the search excerpt should highlight the words one and two, and not worry about the phrase "one two" (which would be highlighted anyway). I think that's what other search engines do, and I think that's what the search excerpt function used to do (isn't it?).

Comment #41

mcarbone commented 21 October 2010 at 16:24

No, it doesn't currently do that when search_simplify is modifying one of the words, at least not in my current sandbox at HEAD.

To reproduce:

1) Create a node with body: "Word follows: this"

2) Run cron

3) Search for "follows this" and the node will be returned, but with nothing highlighted. If you search for "follows: this", however, it gets highlighted.

But as I said, I believe I've already solved this, because w/o search_simplify getting involved this wouldn't be an issue.

Comment #42

jhodgdon commented 21 October 2010 at 19:41

"w/o search_simplify getting involved this wouldn't be an issue" ... huh?

"I believe I've already solved this" -- how?

Comment #43

mcarbone commented 21 October 2010 at 21:53

Status	File	Size
new	search_excerpt_in_progress.patch	3.58 KB

Sorry, that wasn't well put. I was just stating the obvious, which is just that search_simplify() strips out punctuation and hence searching "follows" will find "follows:" -- so core is responsible for making sure that "follows this" will highlight "follows: this".

I think I've solved it in the patch I'm working on, but I still need to add more tests and integrate it with your patch by putting it inside of search_search_excerpt_match. But I've attached the latest version if you want to check it out.

Comment #44

mcarbone commented 25 October 2010 at 19:40

Status:	Needs work	» Needs review

Status	File	Size
new	493270.patch	10.72 KB

OK, I've added an implementation of the search_excerpt_match hook to find matches made via search_simplify(). I ended up having to rewrite the code I originally wrote, and I think the version I have now is much more robust. Hit me on IRC if you want to discuss the algorithm.

I've also added more tests -- five fail without the hook implementation.

Comment #45

commented 25 October 2010 at 23:20

Thanks! I'll give this a thorough review in the next day or two.

Comment #46

commented 17 November 2010 at 01:15

Status:	Needs review	» Reviewed & tested by the community

I finally got a chance to look this over carefully (sorry about the excessively long delay).

I think this patch is solid, and it has a solid test. Thanks for all the work you did and many iterations on the patches, mcarbone!

It's an API addition, so it had better get in now or we'll have to leave this bug unfixed for D7 entirely.

Comment #47

webchick commented 22 November 2010 at 07:37

Version:	7.x-dev	» 8.x-dev

Sorry. I really do think it's too late for this. :( API freeze was over a year ago.

I see in #34, where this was changed back to 7.x, that this fixes some problems, but that comment doesn't explain why this needs to happen via a new hook.

#916086: search_excerpt() doesn't highlight words that are matched via search_simplify() seems to be addressing the same issue without an API change. Is it worth looking at that again?

Comment #48

mcarbone commented 22 November 2010 at 15:09

This eventually needs to happen via a new hook in order to allow proper excerpting for 3rd party stemmers, but if we can't have that in 7.x, then I think we should just stick with solving the search_simplify issue. The patch here had some algorithmic improvements over the patch over there, so I'll work on a new version that solves the search_simplify issue without the API change, and then for 8.x we can add the new hook.
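
As a rough illustration of why a hook is needed for stemmers (a Python sketch, not Drupal code and not the API from the patch): if the excerpt code asks a list of pluggable matchers where a keyword occurs, a stemming module can add its own matcher next to the exact-match default. All names here are hypothetical, and `toy_stem()` is a deliberately crude suffix stripper, not the Porter algorithm.

```python
import re

SUFFIXES = ("ing", "ed", "s")

def toy_stem(word):
    """Crude suffix stripper for illustration only (not Porter stemming)."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def exact_match(key, text):
    """Default behavior: find the keyword only as an exact whole word."""
    m = re.search(r"\b" + re.escape(key) + r"\b", text, flags=re.IGNORECASE)
    return (m.start(), m.end()) if m else None

def stem_match(key, text):
    """What a stemming module could contribute: match words sharing the key's stem."""
    target = toy_stem(key)
    for m in re.finditer(r"\w+", text):
        if toy_stem(m.group(0)) == target:
            return (m.start(), m.end())
    return None

def find_match(key, text, matchers):
    """Ask each registered matcher in turn; the first one that finds a span wins."""
    for matcher in matchers:
        span = matcher(key, text)
        if span is not None:
            return span
    return None

text = "She worked late"
span = find_match("work", text, [exact_match, stem_match])
```

With only `exact_match` registered, "work" finds nothing in that text; with `stem_match` added, the span covering "worked" comes back and can be highlighted.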

Comment #49

commented 22 November 2010 at 16:42

OK. I'm not happy about it (this change was originally proposed well before the original API freeze but I couldn't get anyone to review it, as usual), but I understand.

mcarbone: I've reopened that other issue in order to work the no-API-change version for Drupal 7. Thanks for all your hard work on this patch...

Comment #50

mcarbone commented 24 November 2010 at 14:55

Title:	search_excerpt() doesn't work well with search_simplify() and stemming	» search_excerpt() doesn't work well with stemming
Status:	Reviewed & tested by the community	» Active

This should get a new title then so as not to be confused with the newly reopened issue.

Comment #51

Laveena commented 12 April 2012 at 05:14

Status:	Active	» Needs review

#17: 493270.patch queued for re-testing.

Comment #52

gpk commented 26 August 2012 at 21:29

Status:	Needs review	» Needs work

Just wondering what the status of this issue is. I guess the most recent patch is at #44, but now that #916086: search_excerpt() doesn't highlight words that are matched via search_simplify() has gone into 8.x (and 7.x), I guess #44 needs updating? Also, is it best to leave #731298: Searches for words with diacritics/accents: word not highlighted in results in its own issue?

Comment #53

andypost commented 1 November 2012 at 10:38

Category:	bug	» task

I suppose we should introduce a hook to alter data before indexing, and do the same for excerpt generation.

Comment #54

commented 1 November 2012 at 13:50

Category:	task	» bug

RE #52/53 - actually I think the patch on #916086: search_excerpt() doesn't highlight words that are matched via search_simplify(), which was put into both 8.x and 7.x, may have completely taken care of this issue. I haven't tested it yet -- has anyone else?

Comment #55

pwolanin commented 6 February 2013 at 22:10

Did it fix it, or is there still a bug here?

Comment #56

commented 6 February 2013 at 22:31

Status:	Needs work	» Closed (duplicate)

It should have fixed it but there was a bug in the patch. I'm working on it. Anyway this can be marked as a duplicate.
