Solr is a great solution for us and our clients, but we sometimes find the results it returns are too generous.

Specifically, with a largely out-of-the-box Solr implementation, searching for longer strings returns more results. On a (not yet live) client build, searching for "progressive" gets fewer than ten results; searching for "progressive health care" gets literally hundreds; "progressive health care in london" basically seems to return everything that mentions "London".

This is using the default schema.xml, including defaultOperator="AND"; so surely the expected behaviour is: the longer the search phrase, the fewer the results. Or is that not the case?

I've used Tomcat's catalina.out reports to generate relevant Solr URLs, and switched various query string parameters on and off, but nothing really changes the high volume of results I get for what are quite specific search strings. So I don't think the point-of-search is where the problem lies: it seems to be at the point of decomposing Drupal nodes into Solr search proxies to build the index.

I appreciate that sometimes search is better for the visitor if it's inclusive; Google does return results even if it doesn't find perfect matches. But, stopwords and stemming aside, the client does require that searches return only results where all words appear in the content. Any thoughts?

Comments

jp.stacey’s picture

As a diagnostic, here's an example of what search in Drupal puts in the catalina.out logs as the params= line. Is there anything in here that could be e.g. turning OR operators on?

{spellcheck=false&facet=true&f.im_field_health_topic.facet.mincount=1
 &facet.mincount=1&f.ds_created.facet.date.gap=%2B1YEAR
 &spellcheck.q=particular+phrase+in+london
 &qf=taxonomy_names^2.0&qf=path_alias^5.0&qf=content^40&qf=label^21.0
 &qf=tos_content_extra^1.0&qf=ts_comments^20&qf=tm_vid_3_names^200
 &facet.date=ds_created
 &f.ds_created.facet.date.start=1970-01-01T00:00:00Z/YEAR
 &f.bundle.facet.mincount=1&hl.fl=content,ts_comments
 &json.nl=map&wt=json&rows=10&fl=id,entity_id,entity_type,bundle,bundle_name,
   label,is_comment_count,ds_created,ds_changed,score,path,url,is_uid,
   tos_name,tm_node,zs_entity
 &start=0&facet.sort=count&f.bundle.facet.limit=50&q=special+phrase+in+london
 &f.ds_created.facet.date.end=2012-01-01T00:00:00Z%2B1YEAR/YEAR
 &bf=recip(ms(NOW,ds_created),3.16e-11,1,1)^150.0
 &facet.field=im_field_health_topic&facet.field=bundle
 &f.im_field_health_topic.facet.limit=50&f.ds_created.facet.limit=50}
hits=1998 status=0 QTime=14
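
As a quick cross-check of a log line like the one above, a short Python sketch can list whether the request itself carries any of the parameters that control term matching. This is not part of the module: the abridged string (facet and fl parameters left out) and the helper name match_settings are illustrative assumptions.

```python
from urllib.parse import parse_qs

# Abridged copy of the params= line logged above (facet and fl parameters omitted).
logged_params = (
    "spellcheck=false&facet=true"
    "&spellcheck.q=particular+phrase+in+london"
    "&qf=taxonomy_names^2.0&qf=path_alias^5.0&qf=content^40&qf=label^21.0"
    "&qf=tos_content_extra^1.0&qf=ts_comments^20&qf=tm_vid_3_names^200"
    "&json.nl=map&wt=json&rows=10&start=0"
    "&q=special+phrase+in+london"
)

# Parameters that influence how many query terms must match.
MATCH_PARAMS = ("q.op", "mm", "defType")


def match_settings(query_string):
    """Return the value list for each matching-related parameter, or None if absent."""
    parsed = parse_qs(query_string)
    return {name: parsed.get(name) for name in MATCH_PARAMS}


settings = match_settings(logged_params)
for name, value in settings.items():
    print(name, "->", value if value is not None else "not sent with the request")
```

For the line above, none of the three is sent with the request, so whatever operator or mm value is in effect would have to come from somewhere other than the query string, such as request handler defaults.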

(NB I also opened more or less this same query on StackOverflow, when I thought it was just a straightforward Solr query; but I'm happy to repeat anything here if anyone needs any other diagnostics.)

cpliakas’s picture

Category: bug » support

Hi jp.stacey,

Thanks for the post. According to the schema.xml documentation, it is preferable not to use or rely on the defaultOperator setting, but instead to specify the operator via local params or the request handler in solrconfig.xml. I would personally recommend the local param method since it will override the defaults in the configs and you know you have the last crack at the settings.

Just to throw a wrench in this, the Apache Solr Search Integration module does set the mm param, so you may have to set q.op=AND and mm=100%. I am not exactly sure how these two play together and would be interested to hear about people's experience with them.

With that being said, even though you get "more" results the way the module is configured by default, the mm param is preferable to work with, as the AND operator can sometimes be too restrictive and return no results. Even if more results are returned, the relevancy is affected by how close the words are in the dataset. Therefore it acts like an AND in the result weighting, but will also return results when an AND would have filtered out everything.

So, just make sure you think through the implications of using the AND operator explicitly, and the effect it will have on your end users, before making the switch. Displaying no results is generally bad, and if you can avoid it that is probably the best approach.

Hope this helps,
Chris

jp.stacey’s picture

Hi Chris,

thanks for your speedy reply.

It wasn't clear from the schema.xml included with the apachesolr module that this quite innocuous configuration option is actually deprecated. Also, I still can't work out why it isn't actually working, despite being set!

If this configuration option is problematic, would it be good to put a comment to that effect in the bundled schema.xml? Otherwise people aren't necessarily going to find the warning in Solr's own documentation until they've fiddled around for some time. I'm happy to submit a trivial patch for that if so.

The "LocalParam" method in solrconfig.xml would do just as well for us, but I don't quite see how to effect that. Do you know what we need to edit? The LocalParams documentation explains how to change query parameters in the URL string, but has no XML configuration in it.

Do I just add

  <str name="op">AND</str>

to the <lst name="defaults"> subsection for the <requestHandler> section of solrconfig.xml? Is that right?

Thanks again,
J-P

jp.stacey’s picture

(For reference, adding this line to requestHandler in solrconfig.xml and restarting Tomcat did nothing - still getting almost 90% of the site back in response to "progressive health care in london".)

jp.stacey’s picture

Oh, it looks like there are several requestHandler elements in solrconfig.xml. I'm not really sure what the others are for, as there's one called drupal that has default="true" set (which I'm assuming makes it the default).

mm was also set to "1" in this. However it looks like, unlike in the usual mathematical sense, "1" is not synonymous with "100%":

  • At least 2 of the optional clauses must match, regardless of how many clauses there are: "2"
  • At least 75% of the optional clauses must match, rounded down: "75%"
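
To make the difference between "1" and "100%" concrete, here is a minimal Python sketch of how the two mm forms quoted above translate into a required number of matching clauses. It is only an illustration of the documented rule, not Solr's own code: the function name required_matches is hypothetical, and it assumes a positive mm value that does not exceed the clause count.

```python
def required_matches(mm, optional_clauses):
    """Number of optional clauses that must match for a simple positive mm value.

    Handles only the two forms quoted above: a plain integer ("2") and a
    percentage ("75%"). Negative and conditional mm expressions are not covered.
    """
    if mm.endswith("%"):
        # Percentages are rounded down to a whole number of clauses.
        return optional_clauses * int(mm[:-1]) // 100
    # A bare integer is an absolute count, regardless of the number of clauses.
    return int(mm)


# "progressive health care in london" has five terms,
# assuming none of them is dropped as a stopword.
for mm in ("1", "2", "75%", "100%"):
    print(mm, "->", required_matches(mm, 5))
```

With five clauses, "1" only requires a single term to match, while "100%" requires all five, which is consistent with a long phrase matching most of the index when mm is "1".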

I set this section to read:

  <requestHandler name="drupal" class="solr.SearchHandler" default="true">
    <lst name="defaults">
      <!-- ... -->
      <str name="mm">100%</str>
      <str name="op">AND</str>
      <!-- ... -->

and this did improve the number of results. However, the stemmer was still returning a lot of other results. When I turned this off in schema.xml:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <!-- ... -->
        <!-- <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/> -->
        <!-- ... -->
      <analyzer type="query">
        <!-- ... -->
        <!-- <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/> -->
        <!-- ... -->

then I ended up with a very restricted search set. I appreciate turning off the stemmer means search ends up "dumber", but the alternative is basically to be deluged with results.

This at least gives us some levers to provide the client.

cpliakas’s picture

Right, let's move away from doing this in the config files, as the mm parameter is probably still conflicting(?). In a custom module, implement hook_apachesolr_query_alter() with the following statements ("mymodule" is a placeholder for your module's name):

function mymodule_apachesolr_query_alter($query) {
  $query->removeParam('mm');
  $query->replaceParam('q.op', 'AND');
}

I haven't tested this code, but it should get you going in the right direction.

Re: the comments, I am all for documenting things but this seems to me to be more on the Solr side of things. Happy to discuss further, but maybe we should discuss that in another thread. I would also point you to the Apache Solr Common Configurations initiative, which is where a lot of the thought regarding schemas is happening right now.

jp.stacey’s picture

Status: Active » Closed (fixed)

(Thanks for the help!)

cpliakas’s picture

jp.stacey,

Not a problem. Good luck!

Chris