request only needed information to reduce bandwidth/response-time [#338534]

In apachesolr_search, we potentially retrieve a lot more information (e.g. CCK fields, taxonomy term names and tids, in addition to the full body, etc.) than we need to display the search results:

          $extra = array();
          foreach ($response->response->docs as $doc) {
            $extra += node_invoke_nodeapi($doc, 'search result');
            $extra['score'] = $doc->score;
            $snippet = search_excerpt($keys, $doc->body);
            if (trim($snippet) == '...') {
              $snippet = '';
            }
            $results[] = array('link' => $doc->url,
                               'type' => node_get_types('name', $doc),
                               'title' => $doc->title,
                               'user' => theme('username', $doc),
                               'date' => $doc->changed,
                               'node' => $doc,
                               'extra' => $extra,
                               'score' => $doc->score,
                               'snippet' => $snippet);
          }

          // Hook to allow modifications of the retrieved results
          foreach (module_implements('apachesolr_process_results') as $module) {
            $function = $module .'_apachesolr_process_results';
            call_user_func_array($function, array(&$results));
          }
        }

For nodes with lots of terms, fields, etc. (or potentially later with attachments?) this seems likely to cause performance problems, since we are transferring all the information twice (in the body and as a separate field). I guess this problem is reduced if we can improve the code so the responses are gzipped?

We could potentially add fl params to the search to limit which fields are returned? e.g.:

&fl=body,title,username,changed,uid,hash,url,score
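A minimal sketch of how such an fl value could be assembled on the PHP side. The function name example_build_solr_query_string() is hypothetical and not part of the module; the field list is the one suggested above.

```php
<?php
// Hypothetical helper (not part of apachesolr): builds a query string that
// asks Solr for only the listed fields via the fl parameter.
function example_build_solr_query_string($keys, array $fields) {
  $params = array(
    'q' => $keys,
    // fl is a comma-separated list of the stored fields to return.
    'fl' => implode(',', $fields),
  );
  // Use a plain '&' separator regardless of the arg_separator.output setting.
  return http_build_query($params, '', '&');
}

$fields = array('body', 'title', 'username', 'changed', 'uid', 'hash', 'url', 'score');
$query_string = example_build_solr_query_string('drupal', $fields);
echo $query_string, "\n";
```

http_build_query() URL-encodes the commas, so the value has to be decoded to get the plain list back.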

Comment	File	Size	Author
#22	node-doc-338534-22.patch	1.51 KB	pwolanin
#20	hl-cleanup-338534-20.patch	4.13 KB	pwolanin
#17	apachesolr_338534_highlighting.diff	661 bytes	JacobSingh
#10	highlighting_body.diff	21.87 KB	JacobSingh
#9	highlighting.diff	22.76 KB	JacobSingh

Comments

Comment #1

robertdouglass commented 25 November 2008 at 08:51

Good idea.

Comment #2

pwolanin commented 29 November 2008 at 16:00

Title:

request only needed information - not the full doc - to reduce bandwidth/response-time

» request only needed information o reduce bandwidth/response-time

I talked about this with Jacob some - and ideally we would also use the Solr highlighting feature to return a snippet, rather than the full body. That would be a big help in terms of bandwidth - returning a ~1k snippet, rather than potentially 10, 100, or 1000 kB of the body.

The Solr highlighting wasn't working, for some reason - needs investigation.

Comment #3

pwolanin commented 29 November 2008 at 19:00

we can make some changes to solrconfig.xml like so:

  <requestHandler name="standard" class="solr.SearchHandler" default="true">
    <!-- default values for query parameters -->
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <str name="hl">true</str>
      <str name="hl.fl">body</str>
      <str name="hl.mergeContiguous">true</str>
      <!-- instructs Solr to return the field itself if no query terms are
           found -->
      <str name="f.body.hl.alternateField">body</str>
      <str name="f.body.hl.maxAlternateFieldLength">256</str>
      <str name="f.text.hl.fragmenter">regex</str> <!-- defined below -->
    </lst>
  </requestHandler>

and

   <!-- Configure the standard formatter -->
   <formatter name="html" class="org.apache.solr.highlight.HtmlFormatter" default="true">
    <lst name="defaults">
     <str name="hl.simple.pre"><![CDATA[<strong>]]></str>
     <str name="hl.simple.post"><![CDATA[</strong>]]></str>
    </lst>
   </formatter>
  </highlighting>

to highlight with strong rather than em tags

Comment #4

pwolanin commented 29 November 2008 at 19:18

Title:

request only needed information o reduce bandwidth/response-time

» request only needed information to reduce bandwidth/response-time

looking at the highlighting results, I think we should be running strip tags before indexing the body. Otherwise Solr can return broken markup.

Comment #5

robertdouglass commented 29 November 2008 at 23:41

Hmm. I thought we had used a handler that removed HTML markup at the Solr level.

Comment #6

pwolanin commented 30 November 2008 at 02:48

I think we do have a handler that removes HTML for Solr - but it seems like Solr uses that HTML-stripped version for indexing, not for the results it returns. Need to double-check the Solr docs on this.

Comment #7

pwolanin commented 30 November 2008 at 03:06

http://markmail.org/message/qflbmxps4d5eiih2#query:solr%20HTMLStripWhite...
basically as above - stripped for indexing, not for results

The suggestion here is to strip HTML before indexing:

http://www.mail-archive.com/solr-user@lucene.apache.org/msg06465.html
http://markmail.org/message/qflbmxps4d5eiih2#query:solr%20HTMLStripWhite...
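A rough sketch of what stripping HTML on the Drupal side before indexing might look like. The function name example_strip_for_index() is hypothetical, and this is only one possible approach, not what the module does:

```php
<?php
// Hypothetical helper (not part of apachesolr): reduce HTML to plain text so
// that Solr stores, and later highlights, markup-free content.
function example_strip_for_index($html) {
  // Put a space before every tag so adjacent blocks do not run together
  // once the tags are removed ("</p><p>" would otherwise join two words).
  $text = preg_replace('/</', ' <', $html);
  $text = strip_tags($text);
  // Turn entities such as &amp; back into plain characters.
  $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
  // Collapse the whitespace left behind by the removed tags.
  return trim(preg_replace('/\s+/', ' ', $text));
}

echo example_strip_for_index('<p>Hello <em>big</em></p><p>World &amp; more</p>'), "\n";
```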

Comment #8

pwolanin commented 30 November 2008 at 04:56

So, looks like we could use a default fl parameter something like fl=id,title,comment_count,type,changed,score,url,uid,name and get a Solr-generated snippet of the body back in the highlighting.

the PHP Response class doesn't specifically handle highlighting, so we'd have to grab it out of the parsed data

$response->highlighting

which is keyed by document id. So, something like:

$snippet = implode(' ... ', $response->highlighting->$id->body);
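A slightly fuller sketch of the same idea, with a guard for documents that have no highlighting entry. The JSON below is made-up sample data standing in for a decoded Solr response, and example_snippet() is a hypothetical name:

```php
<?php
// Made-up sample data shaped like a Solr JSON response with highlighting.
$json = '{"response":{"docs":[{"id":"abc/node/1","title":"First"}]},'
  . '"highlighting":{"abc/node/1":{"body":["foo <strong>bar</strong>","baz <strong>bar</strong>"]}}}';
$response = json_decode($json);

// Hypothetical helper: join the highlighted fragments of the body field for
// one document, or return an empty string when Solr returned none.
function example_snippet($response, $id) {
  if (isset($response->highlighting->{$id}->body)) {
    return implode(' ... ', $response->highlighting->{$id}->body);
  }
  return '';
}

foreach ($response->response->docs as $doc) {
  echo $doc->title, ': ', example_snippet($response, $doc->id), "\n";
}
```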

Comment #9

JacobSingh commented 4 December 2008 at 12:54

Status	File	Size
new	highlighting.diff	22.76 KB

Attached is a rough draft of implementing highlighting.

I decided not to create an administrative form because I think it just bloats the module. I think it would be rare to want to customize the defaults, and people can do so via settings.php now. Anyway, that should be another issue IMO.

I also made a pretty substantial change which will require deleting your index!

-<field name="text" type="text" indexed="true" stored="false" termVectors="true"/>
+<field name="text" type="text" indexed="true" stored="true" termVectors="true"/>

This will 2x the size of everyone's index. So why am I doing this? Because we are searching on this "text" field, which is an amalgamation of all other fields. However, we can't generate highlights from it. In my tests, I found that sometimes I would get a result, but no highlight because the result was not found in body. I decided that a 2x index size is not that big of a problem (HDD are very cheap), plus, as a "real" field it will probably be faster and have more flexibility.

However, solrconfig.xml specifies (as pwolanin pointed out above) an alternate field to show if there is no highlight (i.e. not a text search). This field is "body" because text may contain some gibberish at the top.

Best,
Jacob

Comment #10

JacobSingh commented 4 December 2008 at 13:31

Status	File	Size
new	highlighting_body.diff	21.87 KB

Okay, at Peter's behest (and probably better judgement), I've changed this to use body, and I've tacked the comments onto the body field so they can be found. If anyone writing a 3rd party module wants the text they are indexing to show up in snippets, they also need to append it to the body.

Comment #11

pwolanin commented 4 December 2008 at 14:49

Status:

Active

» Needs review

Comment #12

pwolanin commented 4 December 2008 at 14:56

there are a lot of whitespace changes which make it a little hard to read

Comment #13

brainski commented 5 December 2008 at 11:44

I implemented the highlighting feature in more or less the same way some time ago:
http://drupal.org/node/303973

Maybe we should use synergies in future.

Issue:
One issue I had and wasn't able to solve is highlighting in the search result title. If one adds HTML tags to the title, they are escaped. Did you solve this?

Comment #14

pwolanin commented 8 December 2008 at 00:06

@Jacob:

I think we should put all the highlighting defaults into solrconfig.xml:

+        //Highlighting settings
+        $params['hl'] = 'true';
+        $params['hl.fragsize']= variable_get('apachesolr_textsnippetlength', 100);
+        $params['hl.simple.pre'] = variable_get('apachesolr_highlightpretag', '<strong>');
+        $params['hl.simple.post'] = variable_get('apachesolr_highlightposttag', '</strong>');
+        $params['hl.snippets'] = variable_get('apachesolr_numsnippets', 3);
+        $params['hl.fl'] = 'body';

Is overriding them going to be a common use case?

Comment #15

JacobSingh commented 8 December 2008 at 05:01

I don't think overriding them is going to be common, but I think it is common enough that we need to provide the option to do so. Especially for our purposes where clients will not have control of their solrconfig.xml.

What is the concern with setting them at query time?

Comment #16

pwolanin commented 8 December 2008 at 21:42

I think we should simply avoid sending in the URL anything that's going to match the defaults in solrconfig.xml, to reduce the bandwidth per search request.
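One way this could be done, as a sketch only: keep a PHP copy of the solrconfig.xml defaults and drop any parameter whose value matches. The function name example_non_default_params() is hypothetical, and the $defaults array is an assumption mirroring the config excerpt in #3:

```php
<?php
// Hypothetical helper (not part of apachesolr): return only the parameters
// whose values differ from the defaults configured on the Solr side.
function example_non_default_params(array $params, array $defaults) {
  // array_diff_assoc() keeps entries of $params whose key/value pair is not
  // present in $defaults (values are compared as strings).
  return array_diff_assoc($params, $defaults);
}

// Assumed to mirror the <lst name="defaults"> block in solrconfig.xml.
$defaults = array(
  'hl' => 'true',
  'hl.fl' => 'body',
  'hl.mergeContiguous' => 'true',
);

$params = array(
  'hl' => 'true',
  'hl.fl' => 'body',
  'hl.fragsize' => 100,
  'hl.snippets' => 3,
);

$to_send = example_non_default_params($params, $defaults);
print_r($to_send);
```

Only hl.fragsize and hl.snippets remain here, because hl and hl.fl match the assumed defaults. The catch is that the PHP copy has to be kept in sync with solrconfig.xml by hand.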

Comment #17

JacobSingh commented 11 December 2008 at 08:55

Status:

Needs review

» Patch (to be ported)

Status	File	Size
new	apachesolr_338534_highlighting.diff	661 bytes

Committed the attached in 6.x

Comment #18

brainski commented 11 December 2008 at 09:02

Hi JacobSingh

I'm sitting behind a proxy and atm I'm not able to do a checkout via CVS. I will attach a patch with the admin option, as soon as I'm able to checkout the HEAD Version of the module.

Comment #19

pwolanin commented 11 December 2008 at 13:33

patch in #17 is empty

Comment #20

pwolanin commented 11 December 2008 at 20:31

Status:

Patch (to be ported)

» Needs review

Status	File	Size
new	hl-cleanup-338534-20.patch	4.13 KB

Patch to clean up - mostly whitespace/style changes, except this also populates $doc->body with the snippet if body is not one of the fl fields.

Also adds nid as a fl field.

Comment #21

JacobSingh commented 12 December 2008 at 06:16

Status:

Needs review

» Fixed

Looks good. Sorry about the NULL, I never remember that. I think it's damn ugly, but I have no other mark against it. :)

-J

Comment #22

pwolanin commented 13 December 2008 at 18:24

Status:

Fixed

» Needs review

Status	File	Size
new	node-doc-338534-22.patch	1.51 KB

another cleanup needed perhaps - we don't actually have a $node

Comment #23

pwolanin commented 15 December 2008 at 19:48

committed this patch

Comment #24

robertdouglass commented 17 December 2008 at 04:29

Status:

Needs review

» Patch (to be ported)

Comment #25

pwolanin commented 17 December 2008 at 21:26

Version:

6.x-1.x-dev

» 5.x-1.x-dev

Comment #26

pwolanin commented 18 January 2009 at 03:12

Version:	5.x-1.x-dev	» 6.x-1.x-dev
Status:	Patch (to be ported)	» Closed (fixed)
