Drupal's default search snippet is only 100 characters long. The highlighting is done by the function search_excerpt().

This could be done with Solr instead, with the snippet length customizable in the backend.

I'm thinking about implementing Solr's highlighting feature.

Any hints are welcome.

Comments

janusman’s picture

I was just testing this yesterday.

Good news:

No need to change the current schema.xml ... a test query might be:

http://localhost:8983/solr/select?indent=on&version=2.2&q=TEST&start=0&r...

and at the end of the response, a <lst name="highlighting"> section shows the highlighted parts. It handles stemming and other string transformations, so a search for "testing" would also highlight "test".
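For illustration, a highlighting request and the reading of its highlighting section could look roughly like this. It is a minimal sketch, not code from the module: the helper names build_highlight_query() and extract_snippets(), the host, the field name body and the sample document ID are assumptions. Only the Solr parameters hl, hl.fl, hl.fragsize, hl.simple.pre, hl.simple.post and wt=json are standard.

```php
<?php
// Hypothetical helper: builds a select URL with highlighting switched on.
function build_highlight_query($base_url, $keys, $field = 'body', $fragsize = 100) {
  $params = [
    'q' => $keys,
    'wt' => 'json',              // ask Solr for a JSON response
    'hl' => 'true',              // enable highlighting
    'hl.fl' => $field,           // field to highlight (assumed to be stored)
    'hl.fragsize' => $fragsize,  // approximate fragment length in characters
    'hl.simple.pre' => '<em>',
    'hl.simple.post' => '</em>',
  ];
  return $base_url . '/select?' . http_build_query($params, '', '&');
}

// Hypothetical helper: returns the first fragment per document ID from the
// "highlighting" section of a JSON response.
function extract_snippets($json, $field = 'body') {
  $snippets = [];
  $data = json_decode($json, TRUE);
  if (!is_array($data) || empty($data['highlighting'])) {
    return $snippets;
  }
  foreach ($data['highlighting'] as $doc_id => $fields) {
    // Documents without a match in $field have no entry for it.
    if (!empty($fields[$field])) {
      $snippets[$doc_id] = $fields[$field][0];
    }
  }
  return $snippets;
}
```

The highlighting section is keyed by document ID, so the snippets can be matched to the result documents by ID rather than by position.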

Bad news:

Since we are sending the complete HTML, tags and all, to Solr, highlighting is being done inside HTML tags =(

An example (Solr inserted the <em> tags around "tests" when searching for test):

?id=8474321395&type=100&tit=El+libro+de+los+<em>tests</em>&aut=Nash%2C+Bruce+M" alt="El libro de los <em>tests</em>" width=80

This is part of a URL inside the node... eww.

Right now I'm handling highlighting inside a theming function; I wanted node teasers with highlights, and if the teaser had no highlights, the highlighted part of the body instead. For a sample see this search result.

brainski’s picture

Attachments: 2 new files (99.23 KB, 150.29 KB)

Hi janusman

It seems that we are working on the same issues. Maybe you have seen that I have also developed a spellchecker (which supports phrases too):
http://drupal.org/node/303937

I have implemented highlighting text snippets using solr.

Step 1: Replace the Solr PHP Client with the attached one
The author of SolrPHPClient has released a new version that uses json_decode instead of XML results. This is a performance gain.
https://issues.apache.org/jira/browse/SOLR-341

I have attached a modified version of the SolrPHPClient.

Step 2: Edit apachesolr_search.module

      try {
        $params = apachesolr_search_get_params();

        // Enable highlighting if the admin option is switched on.
        if (variable_get('apachesolr_solrhighlighting', FALSE)) {
          // Add the highlighting parameters to the search request.
          $params['hl'] = 'true';
          $params['hl.fragsize'] = variable_get('apachesolr_textsnippetlength', 100);
          $params['hl.simple.pre'] = '<strong>';
          $params['hl.simple.post'] = '</strong>';
        }

        $response = $solr->search($query->get_query(), $params['start'], $params['rows'], $params, $search_servlet);
        apachesolr_has_searched(TRUE);

        // New since json_decode: the highlighting section sits next to the
        // response section in the decoded result.
        $highlight = $response->highlighting;
        $response = $response->response;

        // [...] (code omitted in this post; the lines below run once per result $doc)

            $extra['score'] = $doc->score;

            if (variable_get('apachesolr_solrhighlighting', FALSE)) {
              // Take the next entry of the highlighting section. This relies on
              // the entries being in the same order as the result documents.
              $snippet = each($highlight);
              $snippet = $snippet['value']->text[0];
            }
            else {
              $snippet = search_excerpt($keys, $doc->body);
            }

            if (trim($snippet) == '...') {
              $snippet = '';
            }

Step 3: Adapt apachesolr.module

  $form['apachesolr_path'] = array(
    '#type' => 'textfield',
    '#title' => t('Solr path'),
    '#default_value' => variable_get('apachesolr_path', '/solr'),
    '#description' => t('Path that identifies the Solr request handler to be used. Leave this as /solr for now.'),
  );
  $form['apachesolr_solrhighlighting'] = array(
    '#type' => 'checkbox',
    '#title' => t('Enable highlighting by solr'),
    '#default_value' => variable_get('apachesolr_solrhighlighting', false),
    '#description' => t('Highlighting and text excerpt is done using solr.'),
  );

  $options = array();
  foreach (array(100, 150, 200, 250, 300, 400, 500, 600) as $option) {
    $options[$option] = $option;
  }
  $form['apachesolr_textsnippetlength'] = array(
    '#type' => 'select',
    '#title' => t('Length of text snippet in the result'),
    '#default_value' => variable_get('apachesolr_textsnippetlength', 100),
    '#options' => $options,
    '#description' => t('The number of characters the text snippet in the search result will be.'),
  );

janusman’s picture

Title: Using the highlighting features of solr instead of search_excerpt() » Implemented highlighting feature using the SolrPHPClient

Is your attached version of SolrPhpClient the same as the one in https://issues.apache.org/jira/browse/SOLR-341 ?

Your changes to the apachesolr module should be in diff format. Please see http://drupal.org/patch/create ... This way we can track what exact version is being changed and where =)

brainski’s picture

Attachment: 1 new file (894.67 KB)

No, the attached SolrPHPClient is slightly modified to support different search request handlers.

Because I have implemented a spellchecker too, I'm not able to export this modification as a standalone patch. I have attached a patch in diff format that includes the spellchecker modifications. It's for the latest 6.0 version.

robertdouglass’s picture

Category: task » feature

brainski’s picture

Category: feature » task

Did anyone test the code? I think the highlighting feature of Solr is very nice, and we should use it. As you can see in my code, only 4 new parameters are required.

$params['hl'] = 'true';
$params['hl.fragsize'] = variable_get('apachesolr_textsnippetlength', 100);
$params['hl.simple.pre'] = '<strong>';
$params['hl.simple.post'] = '</strong>';

Additionally, the fragment size can be set to a length other than 100 characters.

What can I do to help get this feature added to the dev version?

robertdouglass’s picture

@brainski: we just need the time to get to it =) I'm otherwise in total agreement on this feature.

janusman’s picture

@brainski: This might also be good to incorporate in the new SolrPhpClient in the current DEV versions... hope you can help =)

robertdouglass’s picture

@janusman: the latest SolrPhpClient is already in both D5 and D6 dev.

brainski’s picture

I'm working on this issue again. I will provide a standalone patch for the latest dev 6.0 version. The highlighting feature of Solr is highly preferable to Drupal's search_excerpt().

I'll provide a patch soon and hope for testers.

robertdouglass’s picture

Looking forward to it!

brainski’s picture

Hello!

I have some problems with the highlighting. Unfortunately, we save the whole HTML code of the body field in the Solr index.

As I found out, the highlighting feature needs a stored field and will not work with "text", which is not stored.

I know that we strip out all HTML tags with solr.HTMLStripWhitespaceTokenizerFactory. But in the stored field, the HTML code still remains; it's only stripped out for indexing.

As a result, the fragments contain broken HTML code, because they are based on the body field.

Does anyone have an idea how to work around this? The highlighting feature has no option to remove HTML tags.

By the way: why do we store the full HTML code of the body in the index? Wouldn't it be better to strip out all HTML tags before we pass the whole thing to Solr?
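One way to do what the last paragraph suggests is to reduce the body to plain text before it is sent to Solr, so that the stored field (and therefore every highlight fragment) contains no markup. This is a minimal sketch, assuming a hypothetical helper plain_text_for_index(); it is not taken from the module or from any patch in this thread.

```php
<?php
// Hypothetical helper: turns rendered node HTML into plain text for indexing.
function plain_text_for_index($html) {
  // Put a space before each tag so that words in adjacent elements do not
  // run together once the tags are removed. (Side effect: punctuation that
  // directly follows a closing tag ends up preceded by a space.)
  $text = str_replace('<', ' <', $html);
  // Drop all tags, including their attributes (URLs, alt texts, ...).
  $text = strip_tags($text);
  // Turn entities such as &amp; back into characters.
  $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
  // Collapse runs of whitespace into single spaces.
  return trim(preg_replace('/\s+/u', ' ', $text));
}
```

Because entities are decoded, the stored text can contain characters like < or & again, so snippets built from it would still need to be escaped on output.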

brainski’s picture

Attachments: 2 new files (36.19 KB, 11.25 KB)

OK! I found an easy solution for the problem described above.

This is a standalone integration of the Solr highlighting feature. It comes with a separate menu for configuration.

I also wanted to highlight words in the search result title. Unfortunately, all HTML tags there are escaped:

"My title" will be displayed as the literal text "My <strong>title</strong>" and not as "My title" with "title" in bold.

If anyone knows an easy way to include HTML code in search result titles, please let me know. I'll implement highlighting in titles too.
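One possible approach to the title problem is to escape the whole highlighted string and then restore only the highlight markers. This is a rough sketch, assuming a hypothetical helper escape_keep_highlight(); whether the theme layer accepts a pre-escaped title without escaping it a second time is a separate question that this sketch does not answer.

```php
<?php
// Hypothetical helper: escapes a highlighted string but keeps the markers
// that Solr inserted via hl.simple.pre / hl.simple.post.
function escape_keep_highlight($snippet, $pre = '<strong>', $post = '</strong>') {
  // Escape everything, including the markers.
  $escaped = htmlspecialchars($snippet, ENT_QUOTES, 'UTF-8');
  // Work out what the markers look like after escaping ...
  $escaped_pre = htmlspecialchars($pre, ENT_QUOTES, 'UTF-8');
  $escaped_post = htmlspecialchars($post, ENT_QUOTES, 'UTF-8');
  // ... and turn only those back into real tags.
  return str_replace([$escaped_pre, $escaped_post], [$pre, $post], $escaped);
}
```

A title that already contains the marker text literally would have it restored as well, so this is only safe when the markers cannot occur in the indexed text.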

pwolanin’s picture

I have been looking at the highlighting here too: http://drupal.org/node/338534

brainski’s picture

It looks like you have "reinvented" the proposal I already implemented.

pwolanin’s picture

Since core search does not highlight titles, it seems like we should just skip highlighting there.

JacobSingh’s picture

Status: Active » Postponed (maintainer needs more info)

Hi Brainski,

Ugh... Sorry we crossed wires here.

Let's consolidate.

I've looked at your patch, and indeed it is more or less the same. Peter and I have a couple of different ideas though, so would you please review the one at http://drupal.org/node/338534 and let us know if you think that one is good to commit, and if not, what should be incorporated.

Here are the main differences between our two patches, and my take:

1. Don't really want to create an admin form for the highlighting. The defaults probably account for 95% of cases, and as long as the variables can be overridden and it is documented, that should be good enough for the other 5%. We're trying to keep the module from getting bloated.

2. Making it optional. I don't think highlighting by Solr should be optional. We don't want to be building highlights on the Drupal side; there is no good reason to, AFAICT.

3. solrconfig.xml settings. The research Peter did was excellent, so now if there is no snippet, we show a fallback (first couple hundred chars of the body).

4. A Theme function to allow people to change the format of the snippets.

If you have any objection to committing the other patch (and crediting you as a participant), and marking this duplicate, please let me know in the next few days, as we'd like to close this. If you would like to argue any of the points mentioned, go for it, I won't commit until you feel okay about it too.

Best,
Jacob

brainski’s picture

Hi JacobSingh

I will review your patch in the next two days and give you feedback on it.

The main reason why I implemented Solr highlighting was the following:

- In the Drupal core search it is not possible to set the length of the search snippet. It's fixed at 100 chars. I needed about 300 chars, because I mainly index large pieces of text. When I searched for a solution for this in core search, I found out that a lot of users would like to specify the length of the search snippets. That's the reason why I implemented a backend admin option. I think this should be configurable somewhere.

For me it doesn't matter if the default values are in solrconfig.xml. But I think an admin page is more the Drupal way than external config files.

To point 2:
I agree. If you activate Solr, you should also use the Solr highlighting.

3) I don't understand what you mean here.

4) I agree.

You can mention me as a participant; that would be nice. I'm not able to commit, but I try to contribute patches and module extensions here and there for different Drupal modules.

Maybe you could also review my implementation of the "Did you mean" functionality here:
http://drupal.org/node/303937

pwolanin’s picture

re: #3 - we can (and should) return a snippet of the body if nothing was found to highlight (which can happen if, for example, the search match is on author name or title and not anywhere in the body)
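The fallback described here could be sketched as follows. This is only an illustration under stated assumptions: the helper name snippet_with_fallback(), the 200-character default and the " ... " separator are made up for the example and are not taken from the patch in #338534.

```php
<?php
// Hypothetical helper: use Solr's highlight fragments when there are any,
// otherwise fall back to the beginning of the body.
function snippet_with_fallback(array $fragments, $body, $length = 200) {
  if (!empty($fragments)) {
    return implode(' ... ', $fragments);
  }
  // No fragment (e.g. the match was on the title or author): use the start
  // of the body as plain text.
  $text = trim(strip_tags($body));
  // Take at most $length characters, ending at a word boundary.
  if (preg_match('/^.{1,' . (int) $length . '}(?=\s|$)/us', $text, $match)) {
    $cut = $match[0];
  }
  else {
    // Empty text, or a first "word" longer than $length: keep it as is.
    $cut = $text;
  }
  return $cut === $text ? $text : $cut . ' ...';
}
```

For example, snippet_with_fallback(array(), '<p>Alpha beta gamma delta</p>', 12) returns "Alpha beta ...", while a non-empty list of fragments is returned joined and otherwise unchanged.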

JacobSingh’s picture

Hi Brainski,

I haven't heard back from you so I'm going to commit a slightly re-rolled http://drupal.org/node/338534#comment-1139100.

Please let me know in the next couple hours if this is an issue. I suggest that you provide a new patch based on it for your admin interface if you feel it will be of use to others and attach it to a new (closed) issue with a descriptive title, and we will link to it from the docs.

Best,
J.

JacobSingh’s picture

Status: Postponed (maintainer needs more info) » Closed (duplicate)

committed in #338534

billnbell’s picture

I have tried multiple combinations to get highlighting to work.

http://domain/solr/core1/select?indent=on&version=2.2&q=Testing&start=0&...

This does not add anything to body... No em or strong...

What am I missing?

Thanks.

billnbell’s picture

Also, where is the apachesolr_h1_fieldtohighlight in the admin UI ?

I cannot find the variable.

JacobSingh’s picture

There is no admin UI for it. We figured the admin screens were already overkill, and simplicity would win in this case. You can set the variable in your settings.php (or database).

To set it in settings.php, use $conf['apachesolr_....'] = 'whatever'

miruoss’s picture

Attachment: 1 new file (1.57 KB)

I think it would be nice to be able to highlight the title and the body. By adding the following two lines to settings.php, solr does exactly this.

  'apachesolr_hl_active' => 'true',
  'apachesolr_hl_fieldtohightlight' => 'title,body',

However, the module currently doesn't support this as it expects the apachesolr_hl_fieldtohightlight to only contain one field. The attached patch enables highlighting of all the fields in apachesolr_hl_fieldtohightlight.
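The idea of handling several fields can be sketched like this. It is not the attached patch: the helper name collect_field_snippets() and the shape of its input (one document's entry from Solr's highlighting section, as an array of field name to fragments) are assumptions for the example.

```php
<?php
// Hypothetical helper: gathers the highlight fragments of every field named
// in a comma-separated setting such as 'title,body'.
function collect_field_snippets(array $doc_highlighting, $fields_setting) {
  $snippets = [];
  foreach (array_map('trim', explode(',', $fields_setting)) as $field) {
    // Skip empty names and fields that had no match in this document.
    if ($field !== '' && !empty($doc_highlighting[$field])) {
      $snippets[$field] = implode(' ... ', $doc_highlighting[$field]);
    }
  }
  return $snippets;
}
```

Keeping the result keyed by field name lets the theming code decide separately how to render a highlighted title and a highlighted body.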

fcmtuan’s picture

@brainski: I tried to use your code, but I still get errors. Can you show us the complete code? Thanks!