Excerpt fails to find stemmed keyword

jhodgdon - April 17, 2009 - 20:35
Project: Porter-Stemmer
Version: 6.x-2.x-dev
Component: Code
Category: bug report
Priority: normal
Assigned: Unassigned
Status: postponed
Description

I have a test site with the Porter Stemmer module installed and working correctly, in that if I search for "come", I find the page including the phrase "here comes the sun". So far, so good (thanks for the module!).

However, the search excerpt shown in the search results doesn't show me that portion of the page. If I search for "comes", I see "here comes the sun" with the word "comes" highlighted in bold. But if I search for "come", I just see some random part of the page (or the top?), with nothing highlighted.

I'm not sure this can be fixed within Porter Stemmer, actually, but I thought I'd report it anyway.

The problem is happening within functions such as node_search($op = 'search'), which call the core function search_excerpt() to find the excerpt of the content to display. It doesn't look like there is any easy hook-based way to modify that function, but maybe the Porter Stemmer module could supply a replacement function of some sort? Not sure what the right approach would be...
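
To make the symptom concrete, here is a minimal Python sketch (not Drupal code) of a highlighter that looks only for the literal keyword as a whole word. The helper name highlight_literal is hypothetical, and the whole-word behaviour is an assumption made for illustration; it is not copied from search_excerpt().

```python
import re

def highlight_literal(text, keyword):
    """Bold whole-word, case-insensitive occurrences of the literal keyword.

    Hypothetical helper for illustration only; it knows nothing about stemming.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return pattern.sub(lambda m: "<strong>" + m.group(0) + "</strong>", text)

text = "here comes the sun"
print(highlight_literal(text, "comes"))  # the literal word is present, so it is bolded
print(highlight_literal(text, "come"))   # "come" is not a whole word here, so nothing changes
```

The index can match "come" to "comes" through the stemmer, but a display step like this one never sees the stem, so it has nothing to highlight.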

#1

jhodgdon - June 16, 2009 - 16:11

I have filed a related issue against the core Search module -- I think that would need to be fixed before a stemmer module could do anything to fix it. See #493270: search_excerpt() doesn't work well with stemming

#2

malc_b - June 17, 2009 - 22:19

I've added a comment to the above with a fix, see http://drupal.org/node/493270#comment-1714308

#3

jhodgdon - July 2, 2009 - 22:57

#4

greggles - July 3, 2009 - 15:05
Status: active » duplicate

Ok, then this can be a duplicate. Thanks, all.

#5

jhodgdon - August 4, 2009 - 00:47

If you think this issue is important, please visit #493270: search_excerpt() doesn't work well with stemming and leave a comment. Otherwise, it is possible that no one will think it is important to get into Drupal 7 (much less Drupal 6). The code freeze for Drupal 7 is coming up on September 1st.

#6

jhodgdon - August 4, 2009 - 15:41
Status: duplicate » postponed

I'm going to reopen this issue, because if the core Search issue is fixed so that the search_excerpt() function has a new hook in it, we'll also need to modify Porter Stemmer accordingly. And because others are almost certainly having the same problem, so having an open issue they can see immediately in the issue list will help them find it.

But I'll mark this issue "postponed" because we can't really do anything until there is action on that core Search issue.
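
As a rough idea of what such a hook could make possible, here is a minimal Python sketch of an excerpt function that accepts an optional stemming callback. Everything in it is an assumption for illustration: the names excerpt and toy_stem, the radius parameter, and the crude "strip a trailing s" rule are all hypothetical, and none of this is the actual core or Porter Stemmer code.

```python
import re

def excerpt(text, keyword, stem=None, radius=20):
    """Return a snippet around the first word whose stem matches the keyword's stem.

    If no stem callback is given, words are only lower-cased, which mimics
    literal matching. Falls back to the start of the text when nothing matches.
    """
    stem = stem or (lambda w: w.lower())
    target = stem(keyword)
    for m in re.finditer(r"\w+", text):
        if stem(m.group(0)) == target:
            start = max(0, m.start() - radius)
            end = min(len(text), m.end() + radius)
            return text[start:end]
    return text[: 2 * radius]

def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: lower-case and drop a trailing 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") else word

text = ("This page has a long introduction about nothing in particular. "
        "Later on, here comes the sun.")
print(excerpt(text, "come"))                 # literal matching: falls back to the top of the page
print(excerpt(text, "come", stem=toy_stem))  # stem-aware matching: snippet around "comes"
```

The point of the callback is that the excerpt code stays generic, while a stemmer module supplies the word normalisation it already uses for indexing.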

#7

cpliakas - August 14, 2009 - 03:06

Just a note, highlighting does work when using this module with the 2.0 version of Search Lucene API.

#8

jhodgdon - August 14, 2009 - 17:36

The Lucene project has its own search excerpt function, and doesn't use the core search_excerpt() function.

It looks to me as though the main difference is that search_excerpt() is looking for the keywords as complete words, whereas luceneapi_excerpt() looks for the keyword as a substring anywhere in the word. So this will *usually* find the stem output of Porter Stemmer, but it might not in every case, because sometimes the stems from Porter Stemmer may not actually be substrings of the word that is in the text.

For instance, try using Porter Stemmer to search for the words "accessory" and "accessories" in text containing one or the other, and I think you may find that Lucene doesn't highlight the match in some cases. (The Porter Stemmer stemming output for both of those is "accessori", at least in the current version of Porter Stemmer using the Porter 2 algorithm.)
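
The distinction drawn in this comment, whole-word versus substring matching of a stem, can be sketched in a few lines of Python. This only illustrates the general idea using the "accessori" stem quoted above; the helper names are hypothetical, and it does not claim to reproduce what either search_excerpt() or luceneapi_excerpt() actually does.

```python
import re

STEM = "accessori"  # stem quoted in the comment above for both words

def whole_word_match(text, term):
    """True if term appears in text as a complete word (case-insensitive)."""
    return re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE) is not None

def substring_match(text, term):
    """True if term appears anywhere in text, even inside a longer word."""
    return term.lower() in text.lower()

for text in ("a new accessory", "some new accessories"):
    print(text, "| whole word:", whole_word_match(text, STEM),
          "| substring:", substring_match(text, STEM))
```

Here the stem is a substring of "accessories" but not of "accessory", and it is a whole word in neither, which is the kind of gap the comment is worried about.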

#9

cpliakas - August 14, 2009 - 18:59

Thanks for your reply and a great module, but the point you mentioned above is actually not true (although I completely understand why you thought so). Search Lucene API highlighting is based on the position of the matched word in the document and not the substring like you suggested. By the time it does the pattern matching for highlighting, it looks for both "accessory" and "accessories" because it knows that those were the words that were matched. I just tried it out, and it worked as expected. See the attached screenshot. I guess my point is that I haven't come across a case where a matched word wasn't highlighted because it knows the position of the matched word. Maybe this technique could be applied to the core search somehow, but I am not sure if it is possible due to the SQL backend of the search. I know this isn't really the place for this discussion, so I apologize for the tangent :-). My fault. Again, thanks for the great module.
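
A minimal Python sketch of the position-based idea described in this comment: find the character spans of the words that matched, then highlight by span, not by searching for the keyword string. The stem table, the function names, and the sample sentence are all hypothetical; this is not Search Lucene API code.

```python
import re

# Hypothetical toy stem table standing in for a real stemmer.
TOY_STEMS = {"accessory": "accessori", "accessories": "accessori"}

def toy_stem(word):
    """Look up a stem in the toy table, defaulting to the lower-cased word."""
    return TOY_STEMS.get(word.lower(), word.lower())

def match_positions(text, query):
    """Return (start, end) spans of words whose stem equals the query's stem."""
    target = toy_stem(query)
    return [m.span() for m in re.finditer(r"\w+", text)
            if toy_stem(m.group(0)) == target]

def highlight(text, spans):
    """Wrap each (start, end) span in <strong> tags; spans must be in order."""
    out, last = [], 0
    for start, end in spans:
        out.append(text[last:start])
        out.append("<strong>" + text[start:end] + "</strong>")
        last = end
    out.append(text[last:])
    return "".join(out)

text = "One accessory and two accessories"
print(highlight(text, match_positions(text, "accessory")))
```

Because the highlighter works from positions, it does not matter whether the stem happens to be a substring of the word in the text.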

Attachment: accessory-match.png (24.68 KB)

#10

jhodgdon - August 14, 2009 - 18:59

OK, my mistake. I didn't delve too deeply into the code, obviously.

#11

jhodgdon - September 10, 2009 - 14:56

I don't think the Search Lucene API module actually uses stemming modules at all -- I think it does its own stemming, by the way.

#12

jhodgdon - September 10, 2009 - 15:08
Version: 6.x-1.0 » 6.x-2.x-dev

An update on the status of this issue:

Apparently, no one cared about this issue enough to review my proposed change to Drupal Core (#493270: search_excerpt() doesn't work well with stemming), which would have modified the core search_excerpt() function to allow Porter Stemmer to work correctly with search excerpts. Drupal 7 is now in a "code freeze", so it is now (I believe) too late to get this into Drupal 7, and we can only hope for Drupal 8 at this point. Until this gets into Drupal core, there is no hope of having search excerpts working correctly with core search and Porter Stemmer. There's nothing else I can really do about it at this point. I will leave this issue as "postponed" until this fix goes into Drupal core.

So, as a work-around, I have implemented a similar change in the Search by Page module (http://drupal.org/project/search_by_page), which I'm the maintainer of. This means that if you use the Search by Page module in place of the core Search module, your search excerpts will work correctly with Porter Stemmer.

This is currently only checked into the development versions of both Porter Stemmer (6.x-2.x) and Search by Page (6.x-1.x) (make sure the build date reads September 10 or later). Assuming you were previously up to date on the Porter Stemmer module (version 6.x-2.1), you shouldn't need to clear your search index or run cron when installing this new version to see it working, since this change only affects search results display, not the search index.

#13

jhodgdon - October 1, 2009 - 19:18

This is now released in Porter Stemmer 6.x-2.2 and Search by Page 6.x-1.4 -- you can use Search by Page if you want better search excerpts.

It is still not fixed in Drupal core though, so I'll leave this issue as Postponed. Indefinitely.

 
 
