First off, thanks for this great module. It is a big improvement on core search.

Since I installed it, the accesslog, sessions, and watchdog tables have grown enormously. It appears that our on-campus indexer is hitting every single link generated by the guided search. Because every combination of facets gets its own link, the number of pages to crawl grows exponentially :)

I don't want to completely block access to our site from our on-campus robot, so my question is two-fold:

a) Can I disable the reporting of Faceted Search events to the Watchdog module?

b) Can I somehow disallow crawls along the faceted_search path (perhaps by altering robots.txt)?

(I know that (b) is not necessarily a support question for faceted_search per se, just wondering if you have a solution.)

Thanks!

Comments

David Lesieur

These are serious issues, thanks for reporting them!

a) Actually, I don't think logging searches to the watchdog is really useful... Who monitors searches that way? If anyone does find it useful, then I guess I could add a setting to enable/disable it; otherwise I think I'll simply remove the call to watchdog() from faceted_search_ui.module.

b) Yes, you could disallow faceted_search/* in robots.txt. However, a more interesting approach might be to make faceted_search_ui use robot meta tags. Perhaps we'd want to use this selectively, inserting a nofollow meta tag only when more than one facet is selected in the current search, so "top-level" search results would still be indexed. Refining a search never brings in new content, so crawlers certainly don't need to select more than one facet.

john bickar

Thanks for the quick response. We actually have been using the watchdog reports to discover search terms. That way we can add content if it's something that we hadn't thought of (e.g., a user searches for "parking"). Would it be possible to have keyword searches make a call to watchdog() and not guided searches?

I will look into the robot meta tags. Thanks for the tip!

David Lesieur

Version: 5.x-0.10 » 6.x-1.x-dev
Component: Miscellaneous » Code
Category: support » feature

Yes, it would make sense to log keyword searches only. But I wish better tools were provided to monitor searches. I guess one could edit an AWStats (or similar tool) configuration to include faceted search requests, but I've never tried it. I'd love to see not only what keywords are requested, but also what categories are popular, which ones are chosen first, the number of results for each request, how many times people refine their current search, etc. This could also help improve the classification, and perhaps the module's user interface as well.

About the robot meta tags, this will involve making changes to faceted_search_ui.module (I'd be glad to review a patch!). If you need a solution that doesn't involve programming, then robots.txt would be the way to go until robot meta tags are managed by the module.

john bickar

Hi David,

I'd be willing to try creating a patch to faceted_search_ui.module; however, I poked around in the module and didn't immediately see where those tags would be generated. Would it be by using the function drupal_set_html_head?

If you'd point me in the right direction, I can try rolling a patch.

UPDATE: Think I've got it, need to test and will post patch tomorrow. Pointers would still help, though :D

David Lesieur

Yes, drupal_set_html_head() should do it. Can't tell right now where exactly it should be called, but you might want to start looking at the faceted_search_ui_stage_* set of functions.

john bickar

Status: Active » Needs review
Attached file: new, 494 bytes

OK, this patch calls drupal_set_html_head() in faceted_search_ui_stage_select(). It is correctly placing the meta tag on the faceted_search/select page only. My reasoning is that telling the robot not to index and not to follow links from this page (the faceted search launch page) should prevent it from crawling all the generated links. We'll see if it works.

john bickar

Attached file: new, 799 bytes

Well, that patch didn't exactly work, perhaps because some of my results pages had already been indexed and were being re-visited. This attached patch places the robot meta tags on the faceted_search/select page and on faceted_search/results pages.

David Lesieur

You'll also want to have the meta tag on the faceted_search/facet page.

It may happen on some sites that faceted search is the only way to find all the content, so we'd like crawlers to still be able to find it. Thus, faceted_search_ui_stage_results() should probably deny robots only when there is more than one active facet.
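A rough sketch of that condition, for anyone following along. This is not the module's code: the helper name and its $active_facet_count argument are hypothetical, since how faceted_search_ui_stage_results() exposes the active facets is not shown in this thread. Only drupal_set_html_head() is a real Drupal 5/6 function, and a stand-in is defined so the snippet also runs outside Drupal.

```php
<?php
// Stand-in so the sketch runs outside Drupal. Inside Drupal 5/6 the real
// drupal_set_html_head() already exists and this definition is skipped.
if (!function_exists('drupal_set_html_head')) {
  function drupal_set_html_head($data = NULL) {
    static $stored = '';
    if (isset($data)) {
      $stored .= $data . "\n";
    }
    return $stored;
  }
}

// Hypothetical helper (not part of faceted_search_ui.module): deny robots
// only when the current search combines more than one facet.
function example_deny_robots_for_combined_facets($active_facet_count) {
  if ($active_facet_count > 1) {
    drupal_set_html_head('<meta name="robots" content="noindex, nofollow" />');
    return TRUE;
  }
  // Zero or one active facet: leave the page crawlable.
  return FALSE;
}
```

With something like this, single-facet results pages stay crawlable, so content that is only reachable through faceted search can still be found.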

john bickar

I don't have a faceted_search/facet page. How do you get to it?

I will try to take a look at the faceted_search_ui_stage_results() implementation.

(Our site, if you want to take a look at what I mean about faceted_search/facet.)

john bickar

Attached file: new, 1.08 KB

I have a patch for the faceted_search/facet page (attached). I cannot figure out which variable holds the facets; I quickly tried

  if (count($facet) > 1) {
    drupal_set_html_head('<meta name="robots" content="noindex, nofollow" />');
  }

within faceted_search_ui_stage_results() and that wasn't successful.

john bickar

Just an update on this issue: the patch I posted has inserted the robots meta tags correctly, but the Googlebots (both our on-campus Google Search Appliance, and Google's) are not obeying them. Campus IT has reported the issue to Google customer support, and I'm waiting to hear back.

alsears

Will this patch get applied into the Head version? If so, when?

David Lesieur

Status: Needs review » Needs work

As it stands, the patch won't be applied. It needs work because we actually want to insert the nofollow meta tag only when more than one facet is selected in the current search, so "top-level" search results can be crawled by robots.

You could achieve the same as the above patch with a Disallow: /faceted_search directive in robots.txt, but then nodes that are only available through Faceted Search won't be crawled.
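For reference, a minimal robots.txt fragment with that directive might look like the following. Only the first Disallow line comes from the discussion above. The User-agent line (applying the rule to all crawlers) and the ?q= variant (for sites without clean URLs) are assumptions to adjust for your own site.

```text
# Assumption: apply the rule to every crawler.
User-agent: *
# Blocks all faceted search pages, including top-level results.
Disallow: /faceted_search
# Assumption: only needed when clean URLs are disabled.
Disallow: /?q=faceted_search
```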

alsears

Sorry, I guess what I meant then is: at what point do you think you would be able to have a fix that inserts a nofollow meta tag only when more than one facet is selected in the search?

We have about 6 taxonomy facets, each with 10-50 terms, which would create an overwhelming number of permutations for Google to index, so at this point we have used "Disallow: /faceted_search". The problem is that the most important pages for Google to index for us would be our top level of facets. See: http://www.urbanministry.org/faceted_search/results
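To put a rough number on "overwhelming", here is a back-of-the-envelope sketch. It assumes a simplified model in which each facet contributes either no term or exactly one term, and it ignores keywords, term hierarchies, and sort options. The function names are made up for illustration, and the figures use the low end of the range above (6 facets of 10 terms each).

```php
<?php
// Hypothetical helper: number of distinct facet selections when each facet
// is either unused or set to exactly one of its terms.
function example_count_facet_combinations(array $terms_per_facet) {
  $total = 1;
  foreach ($terms_per_facet as $term_count) {
    $total *= ($term_count + 1); // +1 for "facet not selected"
  }
  return $total;
}

// Hypothetical helper: pages with exactly one facet selected.
function example_count_single_facet_pages(array $terms_per_facet) {
  return array_sum($terms_per_facet);
}

$facets = array_fill(0, 6, 10); // 6 facets, 10 terms each
$all = example_count_facet_combinations($facets);    // 11^6 = 1771561
$single = example_count_single_facet_pages($facets); // 60
$multi = $all - 1 - $single; // selections combining two or more facets

echo "all: $all, single-facet: $single, multi-facet: $multi\n";
```

Under this model, nearly all of the crawlable pages are multi-facet combinations, which is why denying robots only on those pages would remove most of the load while keeping the top-level facet pages indexable.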

David Lesieur

@cubbtech: Are you still positive about Googlebot not obeying the robots meta tags? Other sources seem to say the contrary:

http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=61050
http://googleblog.blogspot.com/2007/02/robots-exclusion-protocol.html

john bickar

Hi David,

Sorry for the delayed response. I know that the Googlebot was supposed to be obeying the noindex nofollow tags, but for some reason it was not. Now it appears that the Googlebot is not crawling our site at all, which is an issue that I'm following up on with campus IT.

David Lesieur

Version: 6.x-1.x-dev » 5.x-0.20
Status: Needs work » Fixed

I've just committed the selective addition of the robots meta tag. Robots are still allowed to visit the results pages for any single category, but not for combinations of categories. Let's hope that most robots will obey the directive.

john bickar

I can confirm that this update inserts the robots meta tag on searches that combine multiple categories, but does not insert it for single-category faceted searches.

Thanks!

Anonymous

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for two weeks with no activity.

eliosh

Version: 5.x-0.20 » 6.x-1.x-dev
Status: Closed (fixed) » Active

Sorry to reopen this issue, but in the latest dev version these entries are still being written to the watchdog table.
Are there plans to remove this logging, or to add a setting to enable or disable it?

Thanks

David Lesieur

True, the watchdog entries have not been addressed. We might want to do as suggested by cubbtech in #2 and avoid logging the guided searches.
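A minimal sketch of what "log keyword searches only" could look like, under the assumption that the keyword text is available as a plain string. The helper name and its arguments are hypothetical and not taken from faceted_search_ui.module. The watchdog() call uses the Drupal 6 signature, and a stand-in is defined so the snippet also runs outside Drupal.

```php
<?php
// Stand-ins so the sketch runs outside Drupal 6; skipped inside Drupal.
if (!defined('WATCHDOG_NOTICE')) {
  define('WATCHDOG_NOTICE', 5);
}
if (!function_exists('watchdog')) {
  function watchdog($type, $message, $variables = array(), $severity = WATCHDOG_NOTICE, $link = NULL) {
    // Record the call in memory instead of writing to the database.
    $GLOBALS['example_watchdog_log'][] = array($type, $message, $variables);
  }
}

// Hypothetical helper: log a search only when the user typed keywords.
// Guided searches (facets only, no keywords) are skipped.
function example_log_keyword_search($keywords, $link = NULL) {
  $keywords = trim((string) $keywords);
  if ($keywords === '') {
    return FALSE;
  }
  watchdog('faceted_search', '%text.', array('%text' => $keywords), WATCHDOG_NOTICE, $link);
  return TRUE;
}
```

This would keep the search-term reports (e.g. a user searching for "parking") while dropping the entries generated by crawlers walking the guided search links.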

eliosh

Perfect.
Thanks for your hard work, we all appreciate it.

mikeytown2

Title: Module makes database balloon in size » Module makes database balloon in size - avoid logging the guided searches

subscribe, it's very annoying... you might want to add in a verbose setting, I did something similar for boost.

v8powerage

There should be an option to turn watchdog logging off.

weseze

subscribing

I'd love to see a better integration with the default Drupal search for logging and reporting!

john bickar

You can turn off logging the searches completely by commenting out line 376 of faceted_search_ui.module (6.x-10.beta2) like so:


// watchdog('faceted_search', '%text.', array('%text' => $text), WATCHDOG_NOTICE, l(t('results'), $path));