Module makes database balloon in size - avoid logging the guided searches

Project: Faceted Search
Version: 6.x-1.x-dev
Component: Code
Category: feature request
Priority: normal
Assigned: Unassigned
Status: active

Issue Summary

First off, thanks for this great module. It is a big improvement on core search.

Since I installed it, I have seen a huge increase in the size of the accesslog, sessions, and watchdog tables. It appears that our on-campus indexer is hitting every single link generated by the guided search. When you multiply the number of facets by the number of facets, you get exponential growth :)

I don't want to completely block access to our site from our on-campus robot, so my question is two-fold:

a) Can I disable the reporting of Faceted Search events to the Watchdog module?

b) Can I somehow disallow crawls along the faceted_search path (perhaps by altering robots.txt)?

(I know that (b) is not necessarily a support question for faceted_search per se, just wondering if you have a solution.)

Thanks!

Comments

#1

These are serious issues, thanks for reporting them!

a) Actually, I don't think the use of the watchdog is really useful... Who monitors searches that way? If anyone finds logging searches with the watchdog useful, then I guess I could add a setting to enable/disable it, otherwise I think I'll simply remove the call to watchdog() from faceted_search_ui.module.

b) Yes, you could disallow faceted_search/* in robots.txt. However, a more interesting approach might be to make faceted_search_ui use robot meta tags. Perhaps we'd want to use this selectively, inserting a nofollow meta tag only when more than one facet is selected in the current search, so "top-level" search results would still be indexed. Refining a search never brings in new content, so crawlers certainly don't need to select more than one facet.

#2

Thanks for the quick response. We actually have been using the watchdog reports to discover search terms. That way we can add content if it's something that we hadn't thought of (e.g., a user searches for "parking"). Would it be possible to have keyword searches make a call to watchdog() and not guided searches?

I will look into the robot meta tags. Thanks for the tip!

#3

Version: 5.x-0.10 » 6.x-1.x-dev
Component: Miscellaneous » Code
Category: support request » feature request

Yes, it would make sense to log keyword searches only. But I wish better tools were provided to monitor searches. I guess one could edit their AWStats (or similar tool) configuration to include faceted search requests, but I've never tried it. I'd love to see not only what keywords are requested, but also what categories are popular, which ones are chosen first, the number of results for each request, how many times people refine their current search, etc. This could also help improve the classification, and perhaps the module's user interface as well.

About the robot meta tags, this will involve making changes to faceted_search_ui.module (I'd be glad to review a patch!). If you need a solution that doesn't involve programming, then robots.txt would be the way to go until robot meta tags are managed by the module.
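For the robots.txt route, a rough sketch of the rules might look like the following. It assumes the search lives under the faceted_search path, as in this issue, and lists both the clean-URL and the ?q= form, the way Drupal's stock robots.txt does for its own paths. Adjust it if your site uses a different base path.

```text
# Keep all crawlers out of the faceted search pages.
User-agent: *
Disallow: /faceted_search/
Disallow: /?q=faceted_search/
```

Note that this blocks every faceted search page, single-facet results included, so it is a blunt tool compared to selective meta tags.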

#4

Hi David,

I'd be willing to try creating a patch to faceted_search_ui.module; however, I poked around in the module and didn't immediately see where those tags would be generated. Would it be by using the function drupal_set_html_head?

If you'd point me in the right direction, I can try rolling a patch.

UPDATE: Think I've got it, need to test and will post patch tomorrow. Pointers would still help, though :D

#5

Yes, drupal_set_html_head() should do it. Can't tell right now where exactly it should be called, but you might want to start looking at the faceted_search_ui_stage_* set of functions.

#6

Status: active » needs review

OK, this patch calls drupal_set_html_head() in faceted_search_ui_stage_select(). It is correctly placing the meta tag on the faceted_search/select page only. My reasoning is that telling the robot not to index and not to follow links from this page (the faceted search launch page) should prevent it from crawling all the generated links. We'll see if it works.

Attachment: faceted_search_ui_robotsmeta.patch (494 bytes)

#7

Well, that patch didn't exactly work, perhaps because some of my results pages had already been indexed and were being re-visited. This attached patch places the robot meta tags on the faceted_search/select page and on faceted_search/results pages.

Attachment: faceted_search_ui_robotsmeta02.patch (799 bytes)

#8

You'll also want to have the meta tag on the faceted_search/facet page.

It may happen on some sites that faceted search is the only way to find all the content, so we'd like crawlers to still be able to find it. Thus, faceted_search_ui_stage_results() should probably deny robots only when there is more than one active facet.
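A minimal sketch of that rule, assuming the number of active facets is already known. The helper name and its $active_facet_count argument are hypothetical (this thread does not show how the module exposes the active facets). drupal_set_html_head() is the Drupal 6 call mentioned in #5, guarded here so the snippet also runs outside Drupal.

```php
<?php
/**
 * Returns the robots meta tag to emit, or '' when crawling is allowed.
 *
 * Hypothetical helper: $active_facet_count is the number of facets the
 * visitor has selected in the current search.
 */
function example_robots_meta_for_search($active_facet_count) {
  if ($active_facet_count > 1) {
    // Combined facets never reveal new content, so keep robots out.
    return '<meta name="robots" content="noindex, nofollow" />';
  }
  // Zero or one facet: leave the page open to crawlers.
  return '';
}

// Example: a search that combines two facets.
$tag = example_robots_meta_for_search(2);

// Inside Drupal 6 the tag is added to the page head; outside Drupal the
// function does not exist and nothing happens.
if ($tag !== '' && function_exists('drupal_set_html_head')) {
  drupal_set_html_head($tag);
}
```

Keeping the decision in a small function separate from the Drupal call makes it easy to check on its own; where exactly it would be called from faceted_search_ui_stage_results() is left open here.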

#9

I don't have a faceted_search/facet page. How do you get to it?

I will try to take a look at the faceted_search_ui_stage_results() implementation.

(Our site, if you want to take a look at what I mean about faceted_search/facet.)

#10

I have a patch for the faceted_search/facet page (attached). I cannot figure out which variable holds the facets; I quickly tried

  if (count($facet) > 1) {
    drupal_set_html_head('<meta name="robots" content="noindex, nofollow" />');
  }

within faceted_search_ui_stage_results() and that wasn't successful.

Attachment: faceted_search_ui_robotsmeta03.patch (1.08 KB)

#11

Just an update on this issue: the patch I posted has inserted the robots meta tags correctly, but the Googlebots (both our on-campus Google Search Appliance, and Google's) are not obeying them. Campus IT has reported the issue to Google customer support, and I'm waiting to hear back.

#12

Will this patch get applied to the HEAD version? If so, when?

#13

Status: needs review » needs work

As it stands, the patch won't be applied. It needs work because we actually want to insert the nofollow meta tag only when more than one facet is selected in the current search, so "top-level" search results can be crawled by robots.

You could achieve the same as the above patch with a Disallow: /faceted_search directive in robots.txt, but then nodes that are only available through Faceted Search won't be crawled.

#14

Sorry, I guess what I meant is: when do you think you would be able to have a fix that inserts a nofollow meta tag only when more than one facet is selected in the search?

We have about 6 taxonomy facets, each with 10-50 terms, which would create an overwhelming number of permutations for Google to index, so for now we have used "Disallow: /faceted_search". The problem is that the most important pages for Google to index for us would be our top level of facets. See: http://www.urbanministry.org/faceted_search/results
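To put a rough number on that, here is a small sketch that counts result pages under a simplifying assumption: each facet has either no term or exactly one term selected, and the order in which terms were picked does not matter. The function names and the term counts are illustrative only; real URL counts may differ if selection order or multiple terms per facet produce distinct URLs.

```php
<?php
/**
 * Counts result pages when each facet has either no term or exactly one
 * term selected (the unfiltered page is included in the count).
 *
 * $terms_per_facet is a hypothetical list of term counts, one per facet.
 */
function example_count_facet_combinations($terms_per_facet) {
  $combinations = 1;
  foreach ($terms_per_facet as $term_count) {
    // "+ 1" covers the case where this facet is left unselected.
    $combinations *= ($term_count + 1);
  }
  return $combinations;
}

/**
 * Counts pages that select a single term from a single facet.
 */
function example_count_single_facet_pages($terms_per_facet) {
  return array_sum($terms_per_facet);
}

// Six facets with 10 terms each, the low end of the range quoted above.
$facets = array(10, 10, 10, 10, 10, 10);
printf("%d combined pages, %d single-facet pages\n",
  example_count_facet_combinations($facets),
  example_count_single_facet_pages($facets));
```

Even at 10 terms per facet this comes to about 1.77 million combined pages against 60 single-facet pages, which is why limiting crawlers to single-facet results matters so much.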

#15

@cubbtech: Are you still positive about Googlebot not obeying the robots meta tags? Other sources seem to say the contrary:

http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=61050
http://googleblog.blogspot.com/2007/02/robots-exclusion-protocol.html

#16

Hi David,

Sorry for the delayed response. I know that the Googlebot was supposed to be obeying the noindex nofollow tags, but for some reason it was not. Now it appears that the Googlebot is not crawling our site at all, which is an issue that I'm following up on with campus IT.

#17

Version: 6.x-1.x-dev » 5.x-0.20
Status: needs work » fixed

I've just committed the selective addition of the robots meta tag. Robots are still allowed to visit the results pages for any single category, but not for combinations of categories. Let's hope that most robots will obey the directive.

#18

I can confirm that this update inserts the robots meta tag on searches that combine multiple categories, but does not insert it for single-category faceted searches.

Thanks!

#19

Status: fixed » closed (fixed)

Automatically closed -- issue fixed for two weeks with no activity.

#20

Version: 5.x-0.20 » 6.x-1.x-dev
Status: closed (fixed) » active

Sorry to reopen this issue, but in the latest dev version we still get those entries in the watchdog table.
Is there a plan to remove those log entries, or to add a configuration option to choose?

Thanks

#21

True, the watchdog entries have not been addressed. We might want to do as suggested by cubbtech on #2 and avoid logging the guided searches.
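One way this could look, as a sketch only: log a search when it contains keywords and skip it when the visitor merely clicked through facets. The helper name and the $keywords variable are hypothetical, since this thread does not show how the module separates keywords from facets. watchdog() is called with its Drupal 6 signature and guarded so the snippet runs outside Drupal too.

```php
<?php
/**
 * Decides whether a search should be written to the log.
 *
 * Hypothetical helper: $keywords stands for the free-text part of the
 * search, which is empty for a purely guided (facet-only) search.
 */
function example_should_log_search($keywords) {
  return trim($keywords) !== '';
}

// Example: a visitor typed "parking" into the keyword box.
$keywords = 'parking';

// watchdog() exists only inside Drupal; the guard keeps this runnable alone.
if (example_should_log_search($keywords) && function_exists('watchdog')) {
  watchdog('faceted_search', 'Keyword search: %keys', array('%keys' => $keywords), WATCHDOG_NOTICE);
}
```

Crawlers following guided-search links would then add nothing to the watchdog table, while typed keywords still show up in the reports.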

#22

Perfect.
Thanks for your hard work, we all appreciate it.

#23

Title: Module makes database balloon in size » Module makes database balloon in size - avoid logging the guided searches

Subscribing; it's very annoying... You might want to add a verbose setting; I did something similar for Boost.

#24

There should be an option to turn watchdog logging off.

#25

subscribing

I'd love to see a better integration with the default Drupal search for logging and reporting!

#26

You can turn off logging the searches completely by commenting out line 376 of faceted_search_ui.module (6.x-10.beta2) like so:

<?php
// watchdog('faceted_search', '%text.', array('%text' => $text), WATCHDOG_NOTICE, l(t('results'), $path));
?>
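If you would rather not edit the module each time, a setting could wrap that call instead. This is only a sketch: the variable name faceted_search_ui_log_searches is made up for illustration (the module does not define such a setting in this thread), and $text and $path stand in for the values the module builds. variable_get() and watchdog() are Drupal 6 functions, guarded so the snippet also runs outside Drupal.

```php
<?php
/**
 * Tells whether search logging is switched on.
 *
 * 'faceted_search_ui_log_searches' is a hypothetical variable name.
 */
function example_search_logging_enabled() {
  if (function_exists('variable_get')) {
    // Default to TRUE so existing sites keep their current behaviour.
    return (bool) variable_get('faceted_search_ui_log_searches', TRUE);
  }
  // Outside Drupal there is no variable storage; report the default.
  return TRUE;
}

// Placeholder values standing in for what the module computes.
$text = 'Example search description';
$path = 'faceted_search/results';

if (example_search_logging_enabled() && function_exists('watchdog')) {
  watchdog('faceted_search', '%text.', array('%text' => $text), WATCHDOG_NOTICE, l(t('results'), $path));
}
```

A checkbox on the module's settings form could then set that variable, so logging can be turned off without touching the code.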