
apachesolr and core search treat Drupal fields (Field API, CCK) differently

Project: Apache Solr Search Integration
Version: 7.x-1.x-dev
Component: User interface
Category: bug report
Priority: normal
Assigned: Unassigned
Status: needs work

Issue Summary

There's a search configuration for CCK fields per content type, for example at /admin/content/node-type/page/display/search. It is divided into separate configurations for the search index and the search result. But unlike the core search module, apachesolr only respects the settings for the search index and ignores those for the search result.
The relevant code can be found in apachesolr.index.inc:

<?php
function apachesolr_node_to_document($nid, $namespace) {
  ...
  // Build the node body.
  $node->build_mode = NODE_BUILD_SEARCH_INDEX;
  $node = node_build_content($node, FALSE, FALSE);
  $node->body = drupal_render($node->content);
  ...
}
?>

CCK fields configured to be indexed may become part of $node->body at this point, which is then included in $document->body.

Later, the search result text snippet is built from $document->body (the first lines, or a fragment with highlighted keywords). As a result, CCK fields may be displayed even if the user disabled them for the search result.

From my point of view, the build mode should be NODE_BUILD_SEARCH_RESULT in the code above. In that case the settings for the search result are used and those for the search index become irrelevant. This is still not perfect, but might be easier for the user to understand.

Additionally, apachesolr has special treatment for CCK fields (see hook_apachesolr_cck_fields_alter). Maybe the settings for the search index should be respected there.

Comments

#1

Interesting suggestion - it's true that we are using the same field to search and build snippets.

I'm not sure if Solr will highlight a non-indexed field. Sounds like the nicest combination with CCK would be to build the node 2x and index one version and store the other for highlighting.

#2

Building the node two times seems like a good approach to me, too.

"I'm not sure if Solr will highlight a non-indexed field" (pwolanin)
I didn't find anything about that in Solr's documentation. So we should try something like this:

schema.xml:

<field name="body" type="text" indexed="true" stored="false"/>
<field name="body_result" type="text" indexed="false" stored="true"/>

It's important to keep the name body for the indexed / searchable part so as not to break existing custom searches like body:foo.

apachesolr.index.inc:

<?php
// Work on a copy so that each build mode starts from the same unbuilt node.
$node_result = drupal_clone($node);

// Searchable version: respects the "search index" display settings.
$node->build_mode = NODE_BUILD_SEARCH_INDEX;
$node = node_build_content($node, FALSE, FALSE);

$document->body = drupal_render($node->content); // Just an example; this is not the whole body.

// Snippet version: respects the "search result" display settings.
$node_result->build_mode = NODE_BUILD_SEARCH_RESULT;
$node_result = node_build_content($node_result, FALSE, FALSE);

$document->body_result = drupal_render($node_result->content); // Just an example; this is not the whole body.
?>

To turn this into a patch, some more work is necessary, especially using body_result instead of body when building the search result text snippet.
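
A minimal sketch of that remaining step, assuming the snippet is produced by Solr's standard highlighting request parameters (hl, hl.fl, hl.snippets, hl.fragsize). The helper name is hypothetical and not part of apachesolr, and where the module actually assembles its query parameters is not shown here:

```php
<?php
/**
 * Hypothetical helper (not part of apachesolr): builds Solr highlighting
 * parameters so that snippets come from the stored-only field.
 */
function example_highlight_params($snippet_field = 'body_result') {
  return array(
    'hl' => 'true',            // Enable highlighting for this request.
    'hl.fl' => $snippet_field, // Build snippets from this stored field.
    'hl.snippets' => 3,        // Maximum number of fragments per field.
    'hl.fragsize' => 100,      // Approximate fragment size in characters.
  );
}
```

Searches such as body:foo would keep running against the indexed body field; only the highlighting parameters point at body_result.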

#3

A little problematic too in that all users would have to re-index.

#4

Status: active » needs review

"A little problematic too in that all users would have to re-index."
... like always if something in schema.xml changes.

I created a first patch.

Attachment | Size | Status | Test result | Operations
body_result.patch | 6.78 KB | Ignored: Check issue status. | None | None

#5

I'm trying to think if there is some way to avoid building the node 2x. Probably not?

#6

Reworked the patch to be compatible with #503644: HTTP Status 500 - Illegal character (NULL, unicode 0) encountered.
The patch is against 6.x-1.x-dev 2009-Jul-03.

Attachment | Size | Status | Test result | Operations
body_result.patch | 6.43 KB | Ignored: Check issue status. | None | None

#7

Status: needs review » postponed (maintainer needs more info)

Discussed this with janusman - given that this is going to break existing indexes, we agreed that this change should be targeted at a forthcoming 6.x-2.x branch, where we will be adding more features and not worrying about breaking existing sites, rather than the 6.x-1.x branch, which we want to have available as stable for production (even if less exciting in terms of no new features).

I'm also still waiting for confirmation that highlighting works correctly with a non-indexed field. Even if it does work, are there serious performance costs? I have not dug into the Java code for the highlighter too much, but it takes a token stream, which suggests that Solr might have to tokenize a non-indexed field on the fly every time you want to highlight it. We really can't proceed without definitive answers to these questions.

The alternative that janusman suggested for single-site indexes is to accept the cost of doing a node load for each search result so that you have full control over the output. It should also be possible to render the node a second time, put that into a dynamic field, and use that field for highlighting via the variables that can be set.

#8

Version: 6.x-1.x-dev » 6.x-2.x-dev
Status: postponed (maintainer needs more info) » needs review

re-opening this now that there's a 2.x branch.

#9

@pwolanin, #7:
"I'm also still waiting for confirmation that highlighting works correctly with a non-indexed field."

Yes it works.

As you might know, we maintain a patched version of apachesolr 6.x-1.x to deal with non-English websites, which is available from http://drupal.cocomore.com/de/project/apachesolr
This version also contains a patch based on the one I posted at #6 (but modified to be compatible with the latest version of apachesolr 6.x-1.x).

To prove that it works have a look at these sites using our version of apachesolr:
http://www.familymanager.de/search/apachesolr_search/Samstag
http://www.jacobs-university.de/search/apachesolr_search/math

#10

Will this patch be committed? I suffer from the same problem, that hidden fields appear in the search snippet.

#11

mkalkbrenner - any chance you could reroll this for 6.2? Thanks.

#12

Status: needs review » needs work

The question about highlighting and performance was not answered - we cannot proceed without addressing that.

#13

@pwolanin #12:
Please have a look at the links I posted at #9. They prove that highlighting works.

Why do you expect performance issues?

@robertDouglass #11:
I'll have a look at 6.2. But it will take some time ...

#14

@mkalkbrenner - for Solr 1.5 at least the fast highlighter needs term vectors. I'm not sure if it affects the highlighting speed with 1.4.
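
To illustrate the term vector point: as far as I know, a field meant for the fast vector highlighter needs all three term vector attributes, and term vectors are only kept for indexed fields. A minimal schema.xml sketch under those assumptions, reusing the body field name from #2 purely as an example:

```xml
<!-- Sketch only: the attributes the fast vector highlighter expects.
     The field name is borrowed from #2 for illustration, not taken
     from the shipped schema.xml. -->
<field name="body" type="text" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
```

The request would then also have to set hl.useFastVectorHighlighter=true. A stored-only field such as body_result could not carry term vectors, so it would presumably stay on the standard highlighter, which re-analyzes the stored text at query time.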

#15

@pwolanin:

Does apachesolr 6.x-2.x depend on Solr 1.5?

#16

We don't have a dependency on Solr 1.5.

#17

Yes, I know that - I'm mostly saying I'm unclear about the algorithm used for highlighting in 1.4 and whether it uses term vectors.

#18

Is 7.x affected?

#19

Version: 6.x-2.x-dev » 6.x-1.x-dev

Flipping back to 1.x where it originated, as it is serious enough to be considered there.

#20

Version: 6.x-1.x-dev » 7.x-1.x-dev

So, where does that leave us? I think we're never likely to make this change in 6.x-1.x at this point.

For 7.x-1.x now would be the time to sort this out.

Seems like we would index but not store the "search index" version, and store but not index the "search result" version, assuming that highlighting does work correctly as Markus suggests above.
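
Roughly, and only as a sketch: the function names below are hypothetical, the field names content and results_content are borrowed from the schema sketch in #21, and $render stands for whatever renders a node in a given view mode (in Drupal 7, presumably node_view() followed by drupal_render()).

```php
<?php
/**
 * Hypothetical sketch: render one node twice and map each rendering
 * to its own Solr field.
 *
 * $render is any callable taking ($node, $view_mode) and returning HTML.
 */
function example_build_body_fields($node, $render) {
  $fields = array();
  // Searchable text: follows the "search index" display settings.
  $fields['content'] = example_plain_text(call_user_func($render, $node, 'search_index'));
  // Snippet text: follows the "search result" display settings.
  $fields['results_content'] = example_plain_text(call_user_func($render, $node, 'search_result'));
  return $fields;
}

/**
 * Replaces tags with spaces and collapses runs of whitespace.
 */
function example_plain_text($html) {
  $text = preg_replace('/<[^>]*>/', ' ', $html);
  return trim(preg_replace('/\s+/', ' ', $text));
}
```

In Drupal 7 the callable could wrap node_view($node, $view_mode) and pass the resulting build array to drupal_render(); it is left as a parameter here so the sketch does not depend on a bootstrapped site.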

There is still the somewhat substantial downside that in the typical case we'll have to send the entire content 2x instead of 1x.

One option would be to supply a copyField directive by default, but one could readily customize that away in the schema.xml?

#21

e.g.

--- a/schema.xml
+++ b/schema.xml
@@ -313,8 +313,12 @@
    <field name="label" type="text" indexed="true" stored="true" termVectors="true" omitNorms="true"/>
    <!-- The string version of the title is used for sorting -->
    <copyField source="label" dest="sort_label"/>
-   <!-- content is the default field for full text search - dump crap here -->
-   <field name="content" type="text" indexed="true" stored="true" termVectors="true"/>
+   <!-- content is the default field for full text search  -->
+   <field name="content" type="text" indexed="true" stored="false" termVectors="true"/>
+   <!-- results_content is the default field for highlighted results snippets. Comment out the
+        copyField if you are directly populating it -->
+   <field name="results_content" type="text" indexed="false" stored="true" />
+   <copyField source="content" dest="results_content"/>
    <field name="teaser" type="text" indexed="false" stored="true"/>

    <field name="path" type="string" indexed="true" stored="true"/>

#22

Title: apachesolr and core search treat CCK fields differently » apachesolr and core search treat Drupal fields (Field API, CCK) differently

Changing title to include 6 and 7
