Closed (duplicate)
Project:
Apache Solr Search
Version:
6.x-1.x-dev
Component:
Code
Priority:
Normal
Category:
Task
Assigned:
Unassigned
Reporter:
Created:
4 Sep 2008 at 15:03 UTC
Updated:
9 Apr 2010 at 10:29 UTC
Comments
Comment #1
janusman commented
I was just testing this yesterday.
Good news:
No need to change the current schema.xml ... a test query might be:
http://localhost:8983/solr/select?indent=on&version=2.2&q=TEST&start=0&r...
and at the end of the response a <lst name="highlighting"> section shows the highlighted parts. It handles stemming and other string transformations, so a search for "testing" would also highlight "test".
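A minimal sketch of such a request, assuming a stored field named body and the standard Solr highlighting parameters (hl, hl.fl, hl.fragsize, hl.simple.pre, hl.simple.post). The helper name example_solr_highlight_url() is made up for illustration and is not part of the module or of SolrPhpClient:

```php
<?php
// Hypothetical helper: builds a Solr select URL with highlighting enabled.
// The default field name 'body' and the fragment size are assumptions.
function example_solr_highlight_url($base_url, $keywords, $field = 'body') {
  $params = array(
    'q' => $keywords,
    'hl' => 'true',              // turn highlighting on
    'hl.fl' => $field,           // field(s) to highlight; must be stored
    'hl.fragsize' => 100,        // approximate fragment length in characters
    'hl.simple.pre' => '<em>',   // markup placed before each match
    'hl.simple.post' => '</em>', // markup placed after each match
  );
  // http_build_query() URL-encodes the values, including the tags.
  return $base_url . '/select?' . http_build_query($params, '', '&');
}
```

Calling example_solr_highlight_url('http://localhost:8983/solr', 'TEST') gives a URL that can be pasted into a browser to inspect the highlighting section of the response.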
Bad news:
Since we are sending the complete HTML, tags and all, to Solr, highlighting is also being done inside HTML tags =(
An example (Solr inserted the <em> tags around "tests" when searching for test)
This is part of a URL inside the node... eww.
Right now I'm handling highlighting inside a theming function; I wanted node teasers with highlights, and if no highlights were there, then to show the highlighted body part. For a sample see this search result.
Comment #2
brainski commented
Hi janusman
It seems that we are working on the same issues. Maybe you have seen that I have developed a spellchecker too (also supporting phrases):
http://drupal.org/node/303937
I have implemented highlighting text snippets using solr.
Step 1: Replace Solr PHP Client with attached one
The author of SolrPHPClient has released a new version that uses json_decode instead of parsing XML results. This is a performance gain.
https://issues.apache.org/jira/browse/SOLR-341
I have attached a modified version of the SolrPHPClient.
Step 2: Edit apachesolr_search.module
Step 3: Adapt apachesolr.module
Comment #3
janusman commented
Is your attached version of SolrPhpClient the same as the one in https://issues.apache.org/jira/browse/SOLR-341 ?
Your changes to the apachesolr module should be in diff format. Please see http://drupal.org/patch/create ... This way we can track what exact version is being changed and where =)
Comment #4
brainski commented
No, the attached SolrPHPClient is slightly modified to support different search request handlers.
Because I have implemented a spellchecker too, I'm not able to export this modification as "standalone". I have attached a patch in diff format that includes the spellchecker modifications. It's for the latest 6.0 version.
Comment #5
robertdouglass commented
Comment #6
brainski commented
Did someone test the code? I think the highlighting feature of Solr is very nice and we should use it. As you can see in my code, only 4 new parameters are required.
$params['hl'] = 'true';
$params['hl.fragsize']= variable_get('apachesolr_textsnippetlength', 100);
$params['hl.simple.pre'] = '';
$params['hl.simple.post'] = '';
Additionally, the fragment size can be set to a length other than 100 chars.
What can I do to help get this feature added to the dev version?
Comment #7
robertdouglass commented
@brainski: we just need the time to get to it =) I'm otherwise in total agreement on this feature.
Comment #8
janusman commented
@brainski: This might also be good to incorporate in the new SolrPhpClient in the current DEV versions... hope you can help =)
Comment #9
robertdouglass commented
@janusman: the latest SolrPhpClient is already in both D5 and D6 dev.
Comment #10
brainski commented
I'm working on this issue again. I will provide a standalone patch for the latest dev 6.0 version. The highlighting feature of Solr is highly preferable to Drupal's search_excerpt().
I'll provide a patch soon and hope for testers.
Comment #11
robertdouglass commented
Looking forward to it!
Comment #12
brainski commented
Hello!
I have some problems with the highlighting. Unfortunately, we save the whole HTML code of the body field in the Solr index.
As I found out, the highlighting feature needs a stored field and will not work with "text", which is not stored.
I know that we strip out all HTML tags with solr.HTMLStripWhitespaceTokenizerFactory. But the stored field still contains the HTML code; it is only stripped out for indexing.
As a result, the fragments contain broken HTML code, because they are based on the body field.
Does anyone have an idea how to work around this? The highlighting feature has no option to remove HTML tags.
By the way: Why do we store the full HTML code of the body in the index? Wouldn't it be better to strip out all HTML tags before we pass the whole thing to Solr?
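One possible way to do what the last question suggests is to reduce the body to plain text on the Drupal side before indexing. A rough sketch in plain PHP, under the assumption that losing all markup in the stored field is acceptable; example_strip_html_for_index() is a hypothetical name, not a function from the module or from the attached patch:

```php
<?php
// Hypothetical helper: turns an HTML body into plain text for a stored field.
function example_strip_html_for_index($html) {
  // Drop script and style elements together with their contents.
  $text = preg_replace('/<(script|style)\b[^>]*>.*?<\/\1>/is', ' ', $html);
  // Put a space in front of every tag so neighbouring elements do not fuse words.
  $text = str_replace('<', ' <', $text);
  // Remove the tags themselves, then turn entities back into characters.
  $text = strip_tags($text);
  $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
  // Collapse the runs of whitespace left behind by the removed markup.
  return trim(preg_replace('/\s+/', ' ', $text));
}
```

With plain text in the stored field, Solr has no tags or attribute values to insert highlight markup into, so the returned fragments should no longer contain broken HTML.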
Comment #13
brainski commented
Ok! I found an easy solution for the point described above.
This is a standalone integration of the Solr highlighting feature. It comes with a separate menu for configuration.
I wanted to highlight words in the search result title as well. Unfortunately, all HTML tags there are escaped:
"My title" will be displayed as
"My <strong>title</strong>" and not as "My title".
If anyone knows an easy way to include HTML code in search result titles, please let me know and I'll implement highlighting in titles too.
Comment #14
pwolanin commented
I have been looking at the highlighting here too: http://drupal.org/node/338534
Comment #15
brainski commented
It looks like you have "reinvented" the proposal I already implemented.
Comment #16
pwolanin commented
Since core search does not highlight titles, it seems like we should just skip highlighting there.
Comment #17
JacobSingh commented
Hi Brainski,
Ugh... Sorry we crossed wires here.
Let's consolidate.
I've looked at your patch, and indeed it is more or less the same. Peter and I have a couple of different ideas though, so would you please review the one at http://drupal.org/node/338534 and let us know if you think that one is good to commit, and if not, what should be incorporated.
Here are the main differences between our two patches, and my take:
1. I don't really want to create an admin form for the highlighting. The defaults probably account for 95% of cases, and as long as the variables can be overridden and this is documented, that should be good enough for the other 5%. We're trying to keep the module from getting bloated.
2. Optional. I don't think it should be optional. We don't want to be building highlights on the drupal side, there is no good reason AFAICT.
3. solrconfig.xml settings. The research Peter did was excellent, so now if there is no snippet, we show a fallback (first couple hundred chars of the body).
4. A Theme function to allow people to change the format of the snippets.
If you have any objection to committing the other patch (and crediting you as a participant), and marking this duplicate, please let me know in the next few days, as we'd like to close this. If you would like to argue any of the points mentioned, go for it, I won't commit until you feel okay about it too.
Best,
Jacob
Comment #18
brainski commented
Hi JacobSingh
I will review your patch in the next two days and give you feedback on it.
The main reason, why I implemented solr highlighting was the following:
- In the Drupal core search it is not possible to set the length of the search snippet. It's fixed to 100 chars. I needed about 300 chars, because I mainly index large pieces of text. When I searched for a solution for this in core search, I found out that a lot of users would like to specify the length of the search snippets. That's the reason why I implemented a backend admin option. I think this should be configurable somewhere.
For me it doesn't matter, if the default values are in solrconfig.xml. But I think an admin page is more the drupal way than external config files.
To Point 2:
I agree. If you activate solr, you should use also the solr highlighting.
3) I don't understand what you mean here
4) I agree
You can mention me as a participant, that would be nice. I'm not able to commit, but I try to make patches and module extensions here and there for different Drupal modules.
Maybe you could also review my implementation of the "Did you mean" functionality here:
http://drupal.org/node/303937
Comment #19
pwolanin commented
re: #3 - we can (and should) return a snippet of the body if nothing was found to highlight (which can happen if, for example, the search match is on author name or title and not anywhere in the body)
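That fallback could look roughly like the sketch below. It assumes the highlighting section of the Solr response has been decoded into a PHP array keyed by document ID and then by field name, each holding a list of fragments. The function name, the 200-character default and the ' ... ' separator are illustrative choices only, not what the patch in #338534 does:

```php
<?php
// Hypothetical helper: picks the text to show under a search result.
function example_result_snippet(array $highlighting, $doc_id, $field, $body, $length = 200) {
  // Prefer the fragments Solr returned for this document and field.
  if (!empty($highlighting[$doc_id][$field])) {
    return implode(' ... ', $highlighting[$doc_id][$field]);
  }
  // Nothing to highlight (e.g. the match was on the title or author):
  // fall back to the start of the body as plain text.
  $plain = trim(strip_tags($body));
  // Use multibyte-aware functions when available so UTF-8 text is not cut mid-character.
  $multibyte = function_exists('mb_substr');
  $size = $multibyte ? mb_strlen($plain, 'UTF-8') : strlen($plain);
  if ($size <= $length) {
    return $plain;
  }
  $start = $multibyte ? mb_substr($plain, 0, $length, 'UTF-8') : substr($plain, 0, $length);
  return $start . ' ...';
}
```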
Comment #20
JacobSingh commented
Hi Brainski,
I haven't heard back from you so I'm going to commit a slightly re-rolled http://drupal.org/node/338534#comment-1139100.
Please let me know in the next couple hours if this is an issue. I suggest that you provide a new patch based on it for your admin interface if you feel it will be of use to others and attach it to a new (closed) issue with a descriptive title, and we will link to it from the docs.
Best,
J.
Comment #21
JacobSingh commented
Committed in #338534.
Comment #22
billnbell commented
I have tried multiple combinations to get highlighting to work.
http://domain/solr/core1/select?indent=on&version=2.2&q=Testing&start=0&...
This does not add anything to body... No em or strong...
What am I missing?
Thanks.
Comment #23
billnbell commented
Also, where is the apachesolr_h1_fieldtohighlight in the admin UI ?
I cannot find the variable.
Comment #24
JacobSingh commented
There is no admin UI for it. We figured the admin screens were already overkill, and simplicity would win in this case. You can set the variable in your settings.php (or database).
To set it in settings.php, use $conf['apachesolr_....'] = 'whatever'
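For example, assuming the variable in question is apachesolr_hl_fieldtohightlight (the name used elsewhere in this thread, spelled that way) and that the field to highlight is called body, the settings.php entry might look like this. Both the name and the value should be checked against the module code before relying on them:

```php
// settings.php (sketch): override the field Solr is asked to highlight.
// The variable name is taken from this thread; 'body' is an assumed field name.
$conf['apachesolr_hl_fieldtohightlight'] = 'body';
```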
Comment #25
miruoss commented
I think it would be nice to be able to highlight the title and the body. By adding the following two lines to settings.php, solr does exactly this.
However, the module currently doesn't support this as it expects the apachesolr_hl_fieldtohightlight to only contain one field. The attached patch enables highlighting of all the fields in apachesolr_hl_fieldtohightlight.
Comment #26
fcmtuan commented
@brainski: I tried to use your code, but I still get an error. Can you show us your complete code? Thanks!