Hello,
The "Refine Criteria" (base template) block is very useful for drill-down filtering, and marvellous for "humans".
But search bots (e.g. Googlebot) somehow find a way to discover ridiculous paths like "/taxonomy/term/10,1,5,115,8,7,28,11,22,629,6,27,15,12,13,30,379,33,34,32,31,29,371,38,719,14,112,358,37,39,35,36,368,17,9,18,16,20,1082,71,1140,644,328,40,24,19,643,332,329,331,72,514,1067,645,731,562,26,330,25,386,21,676,522,86,96". By repeatedly following the links in the block, they accumulate terms.
I know this is not directly TF's problem, but unfortunately the RC block plus bots cause it. "rel=nofollow" on the links is not a complete solution, as some bots do not hesitate to follow them anyway. There must be a way to hide this block when a certain number of filter terms is reached. Why? Here's the story.
When a bot (or an insane human, as in a DoS attack) follows a path of the above type:
If terms <=61:
The MySQL optimizer commits suicide when trying to optimize a query with ~60 joins: the process gets stuck (eating CPU) in the "statistics" phase and begins to lock subsequent queries. Then you find your sites down. The machine is a sitting duck, with MySQL eating 800% CPU (8 cores) and hundreds of locked queries. The only remedy is to cap the optimizer by adding an "optimizer_search_depth = 3" type of depressant to my.cnf. You now feel less optimized, but the sites are working anyway.
If terms >61:
Drupal raises a PHP error that says "Too many tables; MySQL can only use 61 tables in a join query: SELECT...". As bots request the same page hundreds of times, you get tons of these in the Apache and Drupal logs. And since this highly detailed error is shown even to anonymous visitors on the page in question, your server's guts are exposed too.
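For reference, the optimizer cap mentioned for the <=61 case could look roughly like this in my.cnf. This is only a sketch: placing the line under the [mysqld] section is an assumption about a typical setup, and the value 3 is simply the one used here, not a general recommendation.

```ini
# my.cnf (sketch; section placement assumed)
[mysqld]
# Limit how deep the optimizer searches when choosing a join order.
# A value of 0 lets the server choose the depth automatically.
optimizer_search_depth = 3
```

A lower value trades join-plan quality for planning time, which is what keeps many-join queries from hanging in the "statistics" state.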
I have to be fair: this problem is not particular to TF's queries. Custom taxo pages with some views have the same problem and can be exploited in the same way.
Of course, nothing can be done about an insane human who follows a path like the one above accidentally, or one attempting a DoS on the unprepared, but search bots should be dealt with somehow, as they are a chronic problem.
Any suggestions?
Comments
Comment #1
solotandem commented
Sorry about this. This is a known issue. See #216150: Excessive table joins and poor performance. I have a solution in mind but have not been able to implement it.
As a temporary solution, could you add some code to check the number of term ids in the url and either ignore the request or truncate the ids to a reasonable number?
For example, in taxonomy_filter_block() on line 224 (in the current dev release -- I recommend switching to the dev release if you are not using it), you could add:
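  // Choose a limit that suits your site; the value is deliberately left open.
  $reasonable_number_of_tids = ???;
  if (count($tids) > $reasonable_number_of_tids) {
    // Keep only the first N term ids and silently drop the rest.
    $tids = array_slice($tids, 0, $reasonable_number_of_tids);
  }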
There may be another spot in code that needs this too.
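If the term ids are still in their raw URL form at that point, the same cap could be applied while parsing. The helper below is a minimal sketch, not part of Taxonomy Filter: the name example_cap_tids() and its parameters are hypothetical. It accepts both the "," and "+" separators, since as far as I know Drupal core's taxonomy term paths treat "," as AND and "+" as OR.

```php
<?php
/**
 * Hypothetical helper: parse the term part of a taxonomy path and cap it.
 *
 * @param string $str_tids Raw term string, e.g. "10,1,5" or "10+1+5".
 * @param int $max_tids Maximum number of term ids to keep (site-specific).
 * @return array List of at most $max_tids unique integer term ids.
 */
function example_cap_tids($str_tids, $max_tids) {
  // Split on ",", "+" or a space (a "+" in a URL may arrive decoded as a space).
  $parts = preg_split('/[+, ]+/', $str_tids, -1, PREG_SPLIT_NO_EMPTY);
  $tids = array();
  foreach ($parts as $part) {
    // Keep purely numeric ids only; keying by id drops duplicates.
    if (preg_match('/^\d+$/', $part)) {
      $tids[(int) $part] = (int) $part;
    }
  }
  // Preserve the original order and truncate to the limit.
  return array_slice(array_values($tids), 0, $max_tids);
}
```

Truncating rather than rejecting keeps the page working for a visitor who arrives on an over-long path, at the cost of silently ignoring the extra terms.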
Comment #2
chawl commented
I think the number of arguments should be restrictable before they are passed to Taxonomy Filter, Views, etc., for example at the level of Drupal core or the Panels taxo page handler. Unfortunately, none of them has an option like this, so we have to take care of it ourselves. I will try to hack the code and see if I can manage something.
Yes, in fact both of your suggestions seem to be needed.
1. Only a reasonable number of arguments should be accepted as filters. This can prevent insane paths from exploiting things. But I am not sure whether "," or "+" makes a difference.
2. The Refine Criteria block should be hidden once a reasonable number of filters is already present. This will prevent bots (and humans) from seeing further pointless links. This is very important, because if bots follow pointless links, duplicate content will be a problem. Moreover, some views also fail for insane taxo paths, so a bot visit will crash the page again even if TF is OK. Thus pointless links should never be exposed.
Also, the reasonable number could be an option in the admin interface.
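To make point 2 concrete, the check could be as small as the sketch below. The function name and parameters are hypothetical and this is not how Taxonomy Filter is implemented; the limit is passed in as a parameter so that it could later come from an admin setting.

```php
<?php
/**
 * Hypothetical helper: decide what the refine block should output.
 *
 * @param array $active_tids Term ids already present in the current path.
 * @param array $candidate_links Links the block would normally render.
 * @param int $max_tids Number of active filters at which the block is hidden.
 * @return array|null The links to render, or NULL to hide the block.
 */
function example_refine_block_content(array $active_tids, array $candidate_links, $max_tids) {
  // Once the limit is reached, expose no further "refine" links at all,
  // so neither bots nor humans can extend the path any more.
  if (count($active_tids) >= $max_tids) {
    return NULL;
  }
  return $candidate_links;
}
```

With a limit of 3, a path that already carries three terms gets no block, while a path with two terms still shows the links.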
Am I on the right track by the way?
Thank you in advance.
Comment #3
solotandem commented
Read this comment regarding a significant performance improvement. Hopefully this will go a long way towards helping your situation.
Let me know if you see the same degree of performance improvement. I encourage others who read this issue to report their findings.
This does not address your suggestion to add a user limit on the number of term ids in the URL. I will leave this issue open because of that.
Comment #4
chawl commented
Please see this for the performance issue.
As the new dev version doesn't use 61+ inner joins for 61+ terms, MySQL hopefully won't go "out of order" :)
It would still be safer to be able to put some limit on the hierarchy, but it is not an emergency anymore.
Thank you.
Comment #5
chawl commented
OK, I removed the "optimizer_search_depth=3" line from my.cnf in honour of the newest dev, and waited for the peak times. Unfortunately, SQL processes began to lock in the "statistics" phase again, though not in such severe numbers as before.
I am now trying "optimizer_search_depth=0" (automatic) and have observed no locks for several hours.
Hopes are alive :)
Note that this is not a performance issue but a process lock issue; performance is OK.
Comment #6
chawl commented
Nope, back to "optimizer_search_depth=3" anyway. Performance is still perfect though.