In apachesolr_search, we potentially retrieve a lot more information (e.g. CCK fields, taxonomy term names and tids, in addition to the full body, etc.) than we need to display the search results:

          $extra = array();
          foreach ($response->response->docs as $doc) {
            $extra += node_invoke_nodeapi($doc, 'search result');
            $extra['score'] = $doc->score;
            $snippet = search_excerpt($keys, $doc->body);
            if (trim($snippet) == '...') {
              $snippet = '';
            }
            $results[] = array('link' => $doc->url,
                               'type' => node_get_types('name', $doc),
                               'title' => $doc->title,
                               'user' => theme('username', $doc),
                               'date' => $doc->changed,
                               'node' => $doc,
                               'extra' => $extra,
                               'score' => $doc->score,
                               'snippet' => $snippet);
          }

          // Hook to allow modifications of the retrieved results
          foreach (module_implements('apachesolr_process_results') as $module) {
            $function = $module .'_apachesolr_process_results';
            call_user_func_array($function, array(&$results));
          }
        }

For nodes with lots of terms, fields, etc. (or potentially later with attachments?) this seems likely to cause performance problems, since we are transferring all the information twice (in the body and as a separate field). I guess this problem is reduced if we can improve the code so the responses are gzipped?

We could potentially add fl params to the search to limit which fields are returned? e.g.:

&fl=body,title,username,changed,uid,hash,url,score
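
A minimal sketch of how such a field list might be attached to the query parameters. The helper name example_build_search_params() is hypothetical and not part of apachesolr_search; only the fl parameter and the field names come from the suggestion above.

```php
<?php
// Hypothetical helper (not part of the module): builds the Solr query
// parameters with an explicit field list instead of fetching whole documents.
function example_build_search_params($keys, array $fields) {
  return array(
    'q' => $keys,
    // Solr's fl parameter takes a comma-separated list of fields to return.
    'fl' => implode(',', $fields),
  );
}

$params = example_build_search_params(
  'drupal',
  array('body', 'title', 'username', 'changed', 'uid', 'hash', 'url', 'score')
);

// http_build_query() URL-encodes the values, so the commas in fl
// appear as %2C in the final query string.
$query_string = http_build_query($params, '', '&');
echo $query_string, "\n";
```

Keeping the field list in one array would make it easy to drop body later if the snippet ends up coming from Solr highlighting instead.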

Comments

robertdouglass’s picture

Good idea.

pwolanin’s picture

Title: request only needed information - not the full doc - to reduce bandwidth/response-time » request only needed information o reduce bandwidth/response-time

I talked about this with Jacob some - and ideally we would also use the Solr highlighting feature to return a snippet, rather than the full body. That would be a big help in terms of bandwidth - returning a ~1k snippet, rather than potentially 10, 100, or 1000 kB of the body.

The Solr highlighting wasn't working, for some reason - needs investigation.

pwolanin’s picture

we can make some changes to solrconfig.xml like so:

  <requestHandler name="standard" class="solr.SearchHandler" default="true">
    <!-- default values for query parameters -->
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <str name="hl">true</str>
      <str name="hl.fl">body</str>
      <str name="hl.mergeContiguous">true</str>
      <!-- instructs Solr to return the field itself if no query terms are
           found -->
      <str name="f.body.hl.alternateField">body</str>
      <str name="f.body.hl.maxAlternateFieldLength">256</str>
      <str name="f.text.hl.fragmenter">regex</str> <!-- defined below -->
    </lst>
  </requestHandler>

and

   <!-- Configure the standard formatter -->
   <formatter name="html" class="org.apache.solr.highlight.HtmlFormatter" default="true">
    <lst name="defaults">
     <str name="hl.simple.pre"><![CDATA[<strong>]]></str>
     <str name="hl.simple.post"><![CDATA[</strong>]]></str>
    </lst>
   </formatter>
  </highlighting>
  

to highlight with strong rather than em tags

pwolanin’s picture

Title: request only needed information o reduce bandwidth/response-time » request only needed information to reduce bandwidth/response-time

looking at the highlighting results, I think we should be running strip tags before indexing the body. Otherwise Solr can return broken markup.

robertdouglass’s picture

Hmm. I thought we had used a handler that removed HTML markup at the Solr level.

pwolanin’s picture

I think we do have a handler that removes HTML for Solr - but it seems like Solr uses that HTML-stripped version for indexing, not for the results it returns. Need to double-check the Solr docs on this.

pwolanin’s picture

So, looks like we could use a default fl parameter something like fl=id,title,comment_count,type,changed,score,url,uid,name and get a Solr-generated snippet of the body back in the highlighting.

the PHP Response class doesn't specifically handle highlighting, so we'd have to grab it out of the parsed data

$response->highlighting

which is keyed by document id. So, something like:

$snippet = implode(' ... ', $response->highlighting->$id->body);
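
A rough sketch of that lookup applied to every returned document, with an empty-string fallback when Solr returns no highlight. The helper name example_snippet_for() and the mock response are made up for illustration; the only assumption about the response is the one stated above, that highlighting is keyed by document id and holds an array of body fragments.

```php
<?php
// Hypothetical helper: joins the highlighted body fragments for one document.
// Returns an empty string when Solr sent no body highlight for that id.
function example_snippet_for($response, $id) {
  if (isset($response->highlighting->{$id}->body)) {
    return implode(' ... ', $response->highlighting->{$id}->body);
  }
  return '';
}

// Mock parsed response (illustrative data, not real Solr output).
$response = json_decode('{
  "response": {"docs": [
    {"id": "abc/node/1", "title": "First"},
    {"id": "abc/node/2", "title": "Second"}
  ]},
  "highlighting": {
    "abc/node/1": {"body": ["a <strong>solr</strong> snippet", "another <strong>solr</strong> hit"]},
    "abc/node/2": {}
  }
}');

$snippets = array();
foreach ($response->response->docs as $doc) {
  // The {$id} syntax is needed because document ids are not valid
  // bare property names.
  $snippets[$doc->id] = example_snippet_for($response, $doc->id);
}
print_r($snippets);
```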

JacobSingh’s picture

Attachment: new, 22.76 KB

Attached is a rough draft of implementing highlighting.

I decided not to create an administrative form because I think it just bloats the module. I think it would be rare to want to customize the defaults, and people can do so via settings.php now. Anyway, that should be another issue IMO.

I also made a pretty substantial change which will require deleting your index!

-<field name="text" type="text" indexed="true" stored="false" termVectors="true"/>
+<field name="text" type="text" indexed="true" stored="true" termVectors="true"/>

This will double the size of everyone's index. So why am I doing this? Because we are searching on this "text" field, which is an amalgamation of all other fields, but Solr can't generate highlights from a field that isn't stored. In my tests, I found that sometimes I would get a result but no highlight, because the match was not in body. I decided that a doubled index size is not that big of a problem (disk space is very cheap), plus, as a "real" field it will probably be faster and have more flexibility.

However, solrconfig.xml specifies (as pwolanin pointed out above) an alternate field to show if there is no highlight (i.e. not a text search). This field is "body" because text may contain some gibberish at the top.

Best,
Jacob

JacobSingh’s picture

Attachment: new, 21.87 KB

Okay, at Peter's behest (and probably better judgement), I've changed this to use body, and I've tacked the comments onto the body field so they can be found. If anyone writing a 3rd party module wants the text they are indexing to show up in snippets, they also need to append it to the body.

pwolanin’s picture

Status: Active » Needs review

pwolanin’s picture

there are a lot of whitespace changes which make it a little hard to read

brainski’s picture

I implemented the highlighting feature more or less the same way some time ago:
http://drupal.org/node/303973

Maybe we should use synergies in future.

Issue:
One issue I had, which I wasn't able to solve, is highlighting in the search result title. If one adds HTML tags to the title, they are escaped. Did you solve this?

pwolanin’s picture

@Jacob:

I think we should put all the highlighting defaults into solrconfig.xml:

+        //Highlighting settings
+        $params['hl'] = 'true';
+        $params['hl.fragsize']= variable_get('apachesolr_textsnippetlength', 100);
+        $params['hl.simple.pre'] = variable_get('apachesolr_highlightpretag', '<strong>');
+        $params['hl.simple.post'] = variable_get('apachesolr_highlightposttag', '</strong>');
+        $params['hl.snippets'] = variable_get('apachesolr_numsnippets', 3);
+        $params['hl.fl'] = 'body';

Is overriding them going to be a common use case?

JacobSingh’s picture

I don't think overriding them is going to be common, but I think it is common enough that we need to provide the option to do so. Especially for our purposes where clients will not have control of their solrconfig.xml.

What is the concern with setting them at query time?

pwolanin’s picture

I think we should simply avoid sending anything in the URL that matches the defaults in solrconfig.xml, to reduce the bandwidth per search request.
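
One way this could look, as a sketch only: compare the parameters against a copy of the solrconfig.xml defaults and send just the ones that differ. The $solrconfig_defaults array and example_strip_default_params() are hypothetical; the module has no way to read solrconfig.xml itself, so the defaults would need to be kept in sync by hand.

```php
<?php
// Assumed copy of the defaults declared in solrconfig.xml (illustrative values).
$solrconfig_defaults = array(
  'hl' => 'true',
  'hl.fl' => 'body',
  'hl.simple.pre' => '<strong>',
  'hl.simple.post' => '</strong>',
);

// Hypothetical helper: drops every parameter whose key and value already
// match a default, so only real overrides travel in the request URL.
function example_strip_default_params(array $params, array $defaults) {
  // array_diff_assoc() compares both keys and (string-cast) values.
  return array_diff_assoc($params, $defaults);
}

$params = array(
  'hl' => 'true',
  'hl.fl' => 'body',
  'hl.simple.pre' => '<em>',
  'hl.simple.post' => '</em>',
  'hl.snippets' => 3,
);

$to_send = example_strip_default_params($params, $solrconfig_defaults);
print_r($to_send);
```

Here only the overridden tags and hl.snippets remain, since hl and hl.fl already match the assumed defaults.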

JacobSingh’s picture

Status: Needs review » Patch (to be ported)
Attachment: new, 661 bytes

Committed the attached in 6.x

brainski’s picture

Hi JacobSingh

I'm sitting behind a proxy and atm I'm not able to do a checkout via CVS. I will attach a patch with the admin option, as soon as I'm able to checkout the HEAD Version of the module.

pwolanin’s picture

patch in #17 is empty

pwolanin’s picture

Status: Patch (to be ported) » Needs review
Attachment: new, 4.13 KB

Patch to cleanup - mostly whitespace/style changes, except this also populates $doc->body with the snippet if that's not one of the fl fields.

Also adds nid as a fl field.

JacobSingh’s picture

Status: Needs review » Fixed

Looks good. Sorry about the NULL, I never remember that. I think it's damn ugly, but I have no other mark against it. :)

-J

pwolanin’s picture

Status: Fixed » Needs review
Attachment: new, 1.51 KB

another cleanup needed perhaps - we don't actually have a $node

pwolanin’s picture

committed this patch

robertdouglass’s picture

Status: Needs review » Patch (to be ported)

pwolanin’s picture

Version: 6.x-1.x-dev » 5.x-1.x-dev

pwolanin’s picture

Version: 5.x-1.x-dev » 6.x-1.x-dev
Status: Patch (to be ported) » Closed (fixed)