First off, thanks for this great module. It is a big improvement on core search.
Since installing it, I have seen a huge increase in the size of the accesslog, sessions, and watchdog tables. It appears that our on-campus indexer is hitting every single link generated by the guided search. When you multiply the number of facets by the number of facets, you get exponential growth :)
I don't want to completely block access to our site from our on-campus robot, so my question is two-fold:
a) Can I disable the reporting of Faceted Search events to the Watchdog module?
b) Can I somehow disallow crawls along the faceted_search path (perhaps by altering robots.txt)?
(I know that (b) is not necessarily a support question for faceted_search per se, just wondering if you have a solution.)
Thanks!
| Comment | File | Size | Author |
|---|---|---|---|
| #10 | faceted_search_ui_robotsmeta03.patch | 1.08 KB | john bickar |
| #7 | faceted_search_ui_robotsmeta02.patch | 799 bytes | john bickar |
| #6 | faceted_search_ui_robotsmeta.patch | 494 bytes | john bickar |
Comments
Comment #1
David Lesieur commented: These are serious issues, thanks for reporting them!
a) Actually, I don't think the use of the watchdog is really useful... Who monitors searches that way? If anyone finds logging searches with the watchdog useful, then I guess I could add a setting to enable/disable it, otherwise I think I'll simply remove the call to watchdog() from faceted_search_ui.module.
b) Yes, you could disallow faceted_search/* in robots.txt. However, a more interesting approach might be to make faceted_search_ui use robot meta tags. Perhaps we'd want to use this selectively, inserting a nofollow meta tag only when more than one facet is selected in the current search, so "top-level" search results would still be indexed. Refining a search never brings in new content, so crawlers certainly don't need to select more than one facet.
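For anyone weighing the robots.txt route, here is a small Python sketch (not part of the module) that checks what a Disallow rule on the faceted search path would block for a compliant crawler. The rule text, the example.org host, the ExampleBot user agent, and the sample paths are all assumptions for illustration; it relies only on the standard urllib.robotparser module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; adjust the path to match your own site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /faceted_search/
"""


def is_allowed(path, user_agent="ExampleBot"):
    """Return True if a compliant crawler may fetch the given site path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    # example.org is a placeholder host; only the path matters for matching.
    return parser.can_fetch(user_agent, "http://example.org" + path)


if __name__ == "__main__":
    for path in ("/faceted_search/results/1", "/node/1"):
        print(path, is_allowed(path))
```

Note that a rule like this only helps with crawlers that honor robots.txt, and it blocks every faceted search page, including the top-level ones.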
Comment #2
john bickar commented: Thanks for the quick response. We actually have been using the watchdog reports to discover search terms. That way we can add content if it's something that we hadn't thought of (e.g., a user searches for "parking"). Would it be possible to have keyword searches make a call to watchdog() and not guided searches?
I will look into the robot meta tags. Thanks for the tip!
Comment #3
David Lesieur commented: Yes, it would make sense to log keyword searches only. But I wish better tools were provided to monitor searches. I guess one could edit his AWStats (or similar tool) configuration to include faceted search requests, but I've never tried it. I'd love to see not only what keywords are requested, but also what categories are popular, which ones are chosen first, the number of results for each request, how many times people refine their current search, etc. This could also help improve the classification, and perhaps the module's user interface as well.
About the robot meta tags, this will involve making changes to faceted_search_ui.module (I'd be glad to review a patch!). If you need a solution that doesn't involve programming, then robots.txt would be the way to go until robot meta tags are managed by the module.
Comment #4
john bickar commented: Hi David,
I'd be willing to try creating a patch to faceted_search_ui.module; however, I poked around in the module and didn't immediately see where those tags would be generated. Would it be by using the function drupal_set_html_head?
If you'd point me in the right direction, I can try rolling a patch.
UPDATE: Think I've got it, need to test and will post patch tomorrow. Pointers would still help, though :D
Comment #5
David Lesieur commented: Yes, drupal_set_html_head() should do it. I can't tell right now exactly where it should be called, but you might want to start by looking at the faceted_search_ui_stage_* set of functions.
Comment #6
john bickar commented: OK, this patch calls drupal_set_html_head() in faceted_search_ui_stage_select(). It correctly places the meta tag on the faceted_search/select page only. My reasoning is that telling the robot not to index and not to follow links from this page (the faceted search launch page) should prevent it from crawling all the generated links. We'll see if it works.
Comment #7
john bickar commented: Well, that patch didn't exactly work, perhaps because some of my results pages had already been indexed and were being re-visited. The attached patch places the robot meta tags on the faceted_search/select page and on faceted_search/results pages.
Comment #8
David Lesieur commented: You'll also want to have the meta tag on the faceted_search/facet page.
It may happen on some sites that faceted search is the only way to find all the content, so we'd like crawlers to still be able to find it. Thus, faceted_search_ui_stage_results() should probably deny robots only when there is more than one active facet.
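The rule described here can be sketched as follows. This is a language-neutral illustration written in Python rather than the module's PHP, and it is not the module's actual code: the function name, the list of active facets, and the "noindex, nofollow" value are assumptions. In the module, the returned string is the kind of markup that would be handed to drupal_set_html_head().

```python
def robots_meta_tag(active_facets):
    """Return a robots meta tag when more than one facet is active, else ''.

    Pages with zero or one active facet stay crawlable so that content
    reachable only through faceted search can still be found.
    """
    if len(active_facets) > 1:
        return '<meta name="robots" content="noindex, nofollow" />'
    return ""


if __name__ == "__main__":
    # Hypothetical facet selections for illustration.
    print(repr(robots_meta_tag(["topic:parking"])))
    print(repr(robots_meta_tag(["topic:parking", "type:page"])))
```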
Comment #9
john bickar commented: I don't have a faceted_search/facet page. How do you get to it?
I will try to take a look at the faceted_search_ui_stage_results() implementation.
(Our site, if you want to take a look at what I mean about faceted_search/facet.)
Comment #10
john bickar commented: I have a patch for the faceted_search/facet page (attached). I cannot figure out which variable holds the facets; I quickly tried
within faceted_search_ui_stage_results() and that wasn't successful.
Comment #11
john bickar commented: Just an update on this issue: the patch I posted has inserted the robots meta tags correctly, but the Googlebots (both our on-campus Google Search Appliance, and Google's) are not obeying them. Campus IT has reported the issue to Google customer support, and I'm waiting to hear back.
Comment #12
alsears commented: Will this patch get applied to the HEAD version? If so, when?
Comment #13
David Lesieur commented: As it stands, the patch won't be applied. It needs work because we actually want to insert the nofollow meta tag only when more than one facet is selected in the current search, so "top-level" search results can be crawled by robots.
You could achieve the same as the above patch with a Disallow: /faceted_search directive in robots.txt, but then nodes that are only available through Faceted Search won't be crawled.
Comment #14
alsears commented: Sorry, I guess what I meant is: at what point do you think you would be able to have a fix that inserts a nofollow meta tag only when more than one facet is selected in the search?
We have about 6 taxonomy facets, each with 10-50 terms, which would create an overwhelming number of permutations for Google to index, so at this point we have used "Disallow: /faceted_search". The problem is that the most important pages for Google to index for us would be our top level of facets. See: http://www.urbanministry.org/faceted_search/results
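To put rough numbers on that, here is a back-of-the-envelope Python sketch. It assumes a crawler can pick at most one term per facet and that the order of selection does not create distinct pages; real sites may differ on both counts, so treat the figures as an illustration only. Under those assumptions, six facets of 10 terms each give 60 single-facet pages but about 1.77 million combined selections.

```python
from math import prod


def count_result_pages(terms_per_facet):
    """Count non-empty selections with at most one term chosen per facet.

    Each facet contributes (terms + 1) choices: one of its terms, or none.
    Subtracting 1 drops the case where no facet is selected at all.
    """
    return prod(n + 1 for n in terms_per_facet) - 1


def count_single_facet_pages(terms_per_facet):
    """Count pages where exactly one term from exactly one facet is chosen."""
    return sum(terms_per_facet)


if __name__ == "__main__":
    facets = [10] * 6  # hypothetical site: 6 facets, 10 terms each
    print(count_single_facet_pages(facets), count_result_pages(facets))
```

This is why allowing robots onto single-facet pages only keeps the crawl manageable while still exposing the top level of facets.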
Comment #15
David Lesieur commented: @cubbtech: Are you still positive about Googlebot not obeying the robots meta tags? Other sources seem to say the contrary:
http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=61050
http://googleblog.blogspot.com/2007/02/robots-exclusion-protocol.html
Comment #16
john bickar commented: Hi David,
Sorry for the delayed response. I know that the Googlebot was supposed to be obeying the noindex nofollow tags, but for some reason it was not. Now it appears that the Googlebot is not crawling our site at all, which is an issue that I'm following up on with campus IT.
Comment #17
David Lesieur commented: I've just committed the selective addition of the robots meta tag. Robots are still allowed to visit the results pages for any single category, but not for combinations of categories. Let's hope that most robots will obey the directive.
Comment #18
john bickar commented: I can confirm that this update inserts the robots meta tag on searches that combine multiple categories, but does not insert it for single-category faceted searches.
Thanks!
Comment #19
Anonymous (not verified) commented: Automatically closed -- issue fixed for two weeks with no activity.
Comment #20
eliosh: Sorry to reopen this issue, but the latest dev version still writes those log entries to the watchdog table.
Is there a plan to remove those logs, or to add a setting to choose?
Thanks
Comment #21
David Lesieur commented: True, the watchdog entries have not been addressed. We might want to do as suggested by cubbtech in #2 and avoid logging the guided searches.
Comment #22
eliosh: Perfect.
Thanks for your hard work, we all appreciate it.
Comment #23
mikeytown2 commented: subscribe, it's very annoying... you might want to add in a verbose setting, I did something similar for boost.
Comment #24
v8powerage commented: There should be an option to "turn watchdog logging off".
Comment #25
weseze commented: subscribing
I'd love to see a better integration with the default Drupal search for logging and reporting!
Comment #26
john bickar commented: You can turn off logging the searches completely by commenting out line 376 of faceted_search_ui.module (6.x-10.beta2) like so: