Solr is a great solution for us and our clients, but we sometimes find the results it returns are too generous.
Specifically, with a largely out-of-the-box Solr implementation, searching for longer strings returns more results. On a (not yet live) client build, searching for "progressive" gets fewer than ten results; searching for "progressive health care" gets literally hundreds; "progressive health care in london" basically seems to return everything that mentions "London".
This is using the default schema.xml, including defaultOperator="AND"; so surely the expected behaviour is: the longer the search phrase, the fewer the results. Or is that not the case?
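To make that expectation concrete, here is a toy sketch in Python. It is not Solr, only the set logic behind AND versus OR matching; the documents and the `search` helper are invented for illustration. Requiring every term shrinks the result set as the query grows, while requiring only one term grows it:

```python
# Toy model of term matching; not Solr, just the set logic behind AND vs. OR.
# The documents below are made up for illustration.
DOCS = {
    1: "progressive health care in london",
    2: "health care reform",
    3: "london transport news",
    4: "progressive rock in london",
}


def search(query, min_match=None):
    """Return ids of docs containing at least min_match of the query terms.

    min_match=None means every term is required (AND-like behaviour);
    min_match=1 means any single term is enough (OR-like behaviour).
    """
    terms = query.lower().split()
    required = len(terms) if min_match is None else min(min_match, len(terms))
    hits = []
    for doc_id, text in sorted(DOCS.items()):
        words = set(text.split())
        matched = sum(1 for term in terms if term in words)
        if matched >= required:
            hits.append(doc_id)
    return hits
```

With all terms required, `search("progressive health care in london")` matches only the one document containing every word; with `min_match=1` the same query matches every document that mentions any of the words, which is the pattern I'm seeing.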
I've used Tomcat's catalina.out reports to generate relevant Solr URLs, and switched various query string parameters on and off, but nothing really changes the high volume of results I get for what are quite specific search strings. So I don't think the point-of-search is where the problem lies: it seems to be at the point of decomposing Drupal nodes into Solr search proxies to build the index.
I appreciate that sometimes search is better for the visitor if it's inclusive; Google does return results even if it doesn't find perfect matches. But, stopwords and stemming aside, the client does require that searches return only results where all words appear in the content. Any thoughts?
Comments
Comment #1
jp.stacey commented:
As a diagnostic, here's an example of what search in Drupal puts in the catalina.out logs as the params= line. Is there anything in here that could be, e.g., turning OR operators on?
(NB I also opened more or less this same query on StackOverflow, when I thought it was just a straightforward Solr query; but I'm happy to repeat anything here if anyone needs any other diagnostics.)
Comment #2
cpliakas commented:
Hi jp.stacey,
Thanks for the post. According to the schema.xml documentation, it is preferable not to use or rely on the defaultOperator setting but instead to specify the operator via local params or the request handler in solrconfig.xml. I would personally recommend the local param method since it will override the defaults in the configs and you know you have the last crack at the settings.
Just to throw a wrench in this, the Apache Solr Search Integration module does set the mm param, so you may have to set q.op = AND and mm = 100%. I am not exactly sure how these two play together and would be interested to hear people's experiences with them.
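As a rough sketch of what passing both parameters on a request could look like, here is a small Python helper that builds the query string. The endpoint URL and the `build_search_url` name are assumptions for illustration, and whether the handler actually honours q.op alongside mm is exactly the open question above:

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; adjust host, port and path to your install.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_search_url(keywords, q_op="AND", mm="100%"):
    """Build a select URL that passes q.op and mm explicitly."""
    params = {
        "q": keywords,
        "q.op": q_op,  # default operator for this request
        "mm": mm,      # minimum-should-match
    }
    # urlencode escapes spaces as '+' and '%' as '%25'.
    return SOLR_SELECT_URL + "?" + urlencode(params)
```

Comparing the URL this produces against the params= line in catalina.out is a quick way to check whether the values you set are the ones Solr receives.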
With that being said, even though you get "more" results the way the module is configured by default, the mm param is preferable to work with, as the AND operator can be too restrictive sometimes and not return any results. Even if more results are returned, the relevancy is affected by how close the words are in the dataset. Therefore it acts like an AND in the result weighting, but will also return results when an AND would have filtered out everything.
So, just make sure you think through the implications of using the AND operator explicitly, and the effect it will have on your end users, before making the switch. Displaying no results is generally bad, and if you can avoid it that is probably the best approach.
Hope this helps,
Chris
Comment #3
jp.stacey commented:
Hi Chris,
thanks for your speedy reply.
It wasn't clear from the schema.xml included with the apachesolr module that this quite innocuous configuration option is actually deprecated. Also, I still can't work out why it isn't actually working, despite being set!
If this configuration option is problematic, would it be good to put a comment to that effect in the bundled schema.xml? Otherwise people aren't necessarily going to find the warning in Solr's own documentation until they've fiddled around for some time. I'm happy to submit a trivial patch for that if so.
The "LocalParam" method in solrconfig.xml would do just as well for us, but I don't quite see how to effect that. Do you know what we need to edit? The LocalParams documentation explains how to change query parameters in the URL string, but has no XML configuration in it.
Do I just add
to the <lst name="defaults"> subsection for the <requestHandler> section of solrconfig.xml? Is that right?
Thanks again,
J-P
Comment #4
jp.stacey commented:
(For reference, adding this line to requestHandler in solrconfig.xml and restarting Tomcat did nothing - still getting almost 90% of the site back in response to "progressive health care in london".)
Comment #5
jp.stacey commented:
Oh, it looks like there are several requestHandler elements in solrconfig.xml. I'm not really sure what the others are for, as there's one called drupal that has default="true" set (which I'm assuming makes it the default).
mm was also set to "1" in this. However it looks like, unlike in the usual mathematical sense, "1" is not synonymous with "100%":
I set this section to read:
and this did improve the number of results. However, the stemmer was still returning a lot of other results. When I turned this off in schema.xml:
then I ended up with a very restricted search set. I appreciate turning off the stemmer means search ends up "dumber", but the alternative is basically to be deluged with results.
This at least gives us some levers to provide the client.
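For anyone else puzzled by "1" versus "100%", here is a simplified Python sketch of the mm rules as I understand them from the Solr documentation. It only covers plain integers and percentages, not conditional expressions such as `2<-25%`, and the `min_should_match` name is mine rather than anything in Solr:

```python
def min_should_match(optional_clauses, spec):
    """Return how many optional clauses must match for a simple mm spec.

    Simplified sketch: handles "N", "-N", "N%" and "-N%" only.
    """
    spec = spec.strip()
    if spec.endswith("%"):
        raw = optional_clauses * int(spec[:-1]) / 100
        calc = int(raw)  # truncates toward zero
        negative = raw < 0
    else:
        calc = int(spec)
        negative = calc < 0
    # A negative spec says how many clauses may be missing.
    result = optional_clauses + calc if negative else calc
    # Clamp to the valid range of 0..optional_clauses.
    return max(0, min(optional_clauses, result))
```

So for a five-word query, `min_should_match(5, "1")` asks for just one matching word, while `min_should_match(5, "100%")` asks for all five, which fits the difference in result counts I saw.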
Comment #6
cpliakas commented:
Right, let's move away from doing this in the config files as the mm parameter is still probably conflicting(?). In a custom module, implement hook_apachesolr_query_alter() and add the following statements:
I haven't tested this code, but it should get you going in the right direction.
Re: the comments, I am all for documenting things but this seems to me to be more on the Solr side of things. Happy to discuss further, but maybe we should discuss that in another thread. I would also point you to the Apache Solr Common Configurations initiative, which is where a lot of the thought regarding schemas is happening right now.
Comment #7
jp.stacey commented:
(Thanks for the help!)
Comment #8
cpliakas commented:
jp.stacey,
Not a problem. Good luck!
Chris