In apachesolr_search, we potentially retrieve a lot more information (e.g. CCK fields, taxonomy term names and tids, in addition to the full body, etc.) than we need to display the search results:
  $extra = array();
  foreach ($response->response->docs as $doc) {
    $extra += node_invoke_nodeapi($doc, 'search result');
    $extra['score'] = $doc->score;
    $snippet = search_excerpt($keys, $doc->body);
    if (trim($snippet) == '...') {
      $snippet = '';
    }
    $results[] = array('link' => $doc->url,
      'type' => node_get_types('name', $doc),
      'title' => $doc->title,
      'user' => theme('username', $doc),
      'date' => $doc->changed,
      'node' => $doc,
      'extra' => $extra,
      'score' => $doc->score,
      'snippet' => $snippet);
  }
  // Hook to allow modifications of the retrieved results
  foreach (module_implements('apachesolr_process_results') as $module) {
    $function = $module .'_apachesolr_process_results';
    call_user_func_array($function, array(&$results));
  }
}
For nodes with lots of terms, fields, etc. (or potentially later with attachments?) this seems likely to cause performance problems, since we are transferring all of that information twice (in the body and as separate fields). I guess the problem is reduced if we can improve the code so the responses are gzipped?
We could potentially add fl params to the search to limit which fields are returned? e.g.:
&fl=body,title,username,changed,uid,hash,url,score
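A minimal sketch of how such a field list might be attached to the request, assuming the query goes out as an ordinary URL query string. The helper name example_build_solr_query() is hypothetical and not part of the module:

```php
<?php
// Hypothetical helper: builds the query string for a Solr select request,
// restricting the returned fields via the "fl" parameter.
function example_build_solr_query($keys, array $fields) {
  $params = array(
    'q' => $keys,
    // Solr takes "fl" as a comma-separated list of field names.
    'fl' => implode(',', $fields),
  );
  return http_build_query($params, '', '&');
}

$fields = array('body', 'title', 'username', 'changed', 'uid', 'hash', 'url', 'score');
$query_string = example_build_solr_query('drupal search', $fields);
```

Note that http_build_query() URL-encodes the commas, which Solr decodes again on its side.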
Comments
Comment #1
robertdouglass commented
Good idea.
Comment #2
pwolanin commented
I talked about this with Jacob some, and ideally we would also use the Solr highlighting feature to return a snippet rather than the full body. That would be a big help in terms of bandwidth: returning a ~1k snippet, rather than potentially 10, 100, or 1000 kB of body.
The Solr highlighting wasn't working, for some reason - needs investigation.
Comment #3
pwolanin commented
We can make some changes to solrconfig.xml like so:
and
to highlight with strong rather than em tags
Comment #4
pwolanin commented
Looking at the highlighting results, I think we should be running strip_tags() before indexing the body. Otherwise Solr can return broken markup.
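A rough sketch of what stripping the markup before indexing could look like. The helper name is hypothetical, and the extra space handling is an assumption about what one would want, not something taken from the module:

```php
<?php
// Hypothetical helper: reduce rendered node HTML to plain text before it is
// sent to Solr, so highlighted fragments cannot contain broken markup.
function example_clean_text_for_index($html) {
  // Add a space before each tag so adjacent blocks do not run together
  // once the tags are removed ("<p>a</p><p>b</p>" would otherwise be "ab").
  $text = str_replace('<', ' <', $html);
  $text = strip_tags($text);
  // Collapse the runs of whitespace left behind.
  return trim(preg_replace('/\s+/', ' ', $text));
}

$body = '<p>Solr <em>highlighting</em></p><p>returns snippets</p>';
$clean = example_clean_text_for_index($body);
```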
Comment #5
robertdouglass commented
Hmm. I thought we had used a handler that removed HTML markup at the Solr level.
Comment #6
pwolanin commented
I think we do have a handler that removes HTML for Solr, but it seems like Solr uses that HTML-stripped version for indexing, not for the results it returns. Need to double-check the Solr docs on this.
Comment #7
pwolanin commented
http://markmail.org/message/qflbmxps4d5eiih2#query:solr%20HTMLStripWhite...
Basically as above: stripped for indexing, not for results.
The suggestion here is to strip HTML before indexing:
http://www.mail-archive.com/solr-user@lucene.apache.org/msg06465.html
http://markmail.org/message/qflbmxps4d5eiih2#query:solr%20HTMLStripWhite...
Comment #8
pwolanin commented
So, it looks like we could use a default fl parameter something like
fl=id,title,comment_count,type,changed,score,url,uid,name
and get a Solr-generated snippet of the body back in the highlighting.
The PHP Response class doesn't specifically handle highlighting, so we'd have to grab it out of the parsed data
$response->highlighting
which is keyed by document id. So, something like:
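The snippet that originally followed is not preserved here. As a stand-in, a minimal sketch of the idea, assuming id is among the fl fields and highlighting runs on body. The helper name and the sample ids are hypothetical:

```php
<?php
// Hypothetical helper: pick the highlighted snippet for each doc out of the
// parsed response. Assumes "id" is in fl and highlighting is done on "body".
function example_attach_snippets($response) {
  $results = array();
  foreach ($response->response->docs as $doc) {
    $snippet = '';
    $id = $doc->id;
    // The highlighting section is keyed by document id, then by field name;
    // each field holds a list of fragments.
    if (isset($response->highlighting->$id->body)) {
      $snippet = implode(' ... ', $response->highlighting->$id->body);
    }
    $results[] = array('title' => $doc->title, 'snippet' => $snippet);
  }
  return $results;
}

// Stand-in for the decoded JSON that Solr would return.
$response = json_decode('{
  "response": {"docs": [
    {"id": "abc/node/1", "title": "First"},
    {"id": "abc/node/2", "title": "Second"}
  ]},
  "highlighting": {
    "abc/node/1": {"body": ["a <strong>solr</strong> snippet"]},
    "abc/node/2": {}
  }
}');
$results = example_attach_snippets($response);
```

A document with no highlight entry simply gets an empty snippet here; falling back to another field would be a separate decision.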
Comment #9
JacobSingh commented
Attached is a rough draft of implementing highlighting.
I decided not to create an administrative form because I think it just bloats the module. I think it would be rare to want to customize the defaults, and people can do so via settings.php now. Anyway, that should be another issue IMO.
I also made a pretty substantial change which will require deleting your index!
This will 2x the size of everyone's index. So why am I doing this? Because we are searching on this "text" field, which is an amalgamation of all other fields. However, we can't generate highlights from it. In my tests, I found that sometimes I would get a result, but no highlight because the result was not found in body. I decided that a 2x index size is not that big of a problem (HDD are very cheap), plus, as a "real" field it will probably be faster and have more flexibility.
However, solrconfig.xml specifies (as pwolanin above pointed out) an alternate field to show if there is no highlight (i.e. not a test search). This field is "body" because text may contain some gibberish at the top.
Best,
Jacob
Comment #10
JacobSingh commented
Okay, at Peter's behest (and probably better judgement), I've changed this to use body, and I've tacked the comments onto the body field so they can be found. Anyone writing a third-party module who wants the text they are indexing to show up in snippets also needs to append it to the body.
Comment #11
pwolanin commented
Comment #12
pwolanin commented
There are a lot of whitespace changes, which make it a little hard to read.
Comment #13
brainski commented
I implemented the highlighting feature more or less the same way some time ago:
http://drupal.org/node/303973
Maybe we should use synergies in future.
Issue:
One issue I wasn't able to solve is highlighting in the search result title. If one adds HTML tags to the title, they are escaped. Did you solve this?
Comment #14
pwolanin commented
@Jacob:
I think we should put all the highlighting defaults into solrconfig.xml:
Is overriding them going to be a common use case?
Comment #15
JacobSingh commented
I don't think overriding them is going to be common, but I think it is common enough that we need to provide the option to do so, especially for our purposes, where clients will not have control of their solrconfig.xml.
What is the concern with setting them at query time?
Comment #16
pwolanin commented
I think we should simply avoid sending in the URL anything that's going to match the defaults in solrconfig.xml, to reduce the bandwidth per search request.
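One way to get that behavior is to filter the request parameters against the known defaults before building the URL. This is only a sketch: the helper name is hypothetical, and the default values listed are assumptions standing in for whatever solrconfig.xml actually declares:

```php
<?php
// Hypothetical helper: drop any request parameter whose value already
// matches the server-side default, so it is not repeated in every URL.
function example_params_to_send(array $params, array $server_defaults) {
  return array_diff_assoc($params, $server_defaults);
}

// Assumed defaults, standing in for the values set in solrconfig.xml.
$server_defaults = array(
  'hl' => 'true',
  'hl.fl' => 'body',
  'hl.simple.pre' => '<strong>',
  'hl.simple.post' => '</strong>',
);
// A site that keeps the defaults but wants a different fragment size.
$params = array(
  'hl' => 'true',
  'hl.fl' => 'body',
  'hl.fragsize' => '200',
);
$to_send = example_params_to_send($params, $server_defaults);
```

Only hl.fragsize would go out with the request in this case; a site that overrides nothing sends no highlighting parameters at all.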
Comment #17
JacobSingh commented
Committed the attached in 6.x
Comment #18
brainski commented
Hi JacobSingh
I'm sitting behind a proxy and at the moment I'm not able to do a checkout via CVS. I will attach a patch with the admin option as soon as I'm able to check out the HEAD version of the module.
Comment #19
pwolanin commented
The patch in #17 is empty.
Comment #20
pwolanin commented
Patch to clean up: mostly whitespace/style changes, except this also populates $doc->body with the snippet if body is not one of the fl fields.
Also adds nid as an fl field.
Comment #21
JacobSingh commented
Looks good. Sorry about the NULL, I never remember that. I think it's damn ugly, but I have no other mark against it. :)
-J
Comment #22
pwolanin commented
Another cleanup needed, perhaps: we don't actually have a $node.
Comment #23
pwolanin commented
Committed this patch.
Comment #24
robertdouglass commented
Comment #25
pwolanin commented
Comment #26
pwolanin commented