Suggestion on how to fix inappropriate searches
| Project: | Zeitgeist |
| Version: | 5.x-1.x-dev |
| Component: | Code |
| Category: | feature request |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | closed |
According to this:
"WARNING: it is generally NOT a good idea to activate the Zeitgest latest searches block for public display, as it can be used to display inappropriate content just by searching for it."
If somebody searches for bad language, search engines or other people would pick up these words, which obviously wouldn't be good. But I came to the conclusion that this could be solved in two ways.
1. Use a bad word filter which does not display searches if they match certain criteria
2. Only show searches that matched something. If the site contains no bad language the search would return empty, and then you would not display the particular search in the latest search
Let me know what you think. And I hope you make this for 5.x
Thanks.

#1
Hi Zyles.
Thanks for your suggestion. Overall, I think the stop words idea would be valuable, but as always the devil is in the details: I don't see myself maintaining a "stop words" list, and building a centralized repository would be error-prone, because terms that are inappropriate in one language can be perfectly correct in another. Do you have any idea how to work around this?
The second idea seems much simpler and could actually be effective. Can you try to make a patch for it, and I'll review it ?
Regarding 5.0, I don't have free time for this now, and don't expect to have any for another couple of months, but it will probably happen some day... unless 6.0 appears first.
#2
Hi,
Unfortunately I do not have time to contribute at all. I am working on several other projects at the moment and the lack of time hinders me.
I can see the problem with the bad words list and the lack of effort to maintain it. And personally would not use it.
The ideal solution is the second one, for another reason as well. Displaying searches that match nothing on the website is not relevant to users or search engines at all. There is little reason to show a search for "dogs" on a site that only has content about "cats".
Perhaps the easiest way to make this possible is somehow have the module make a check if results are more than zero when a search is done. If the search returns one or more results then add the latest search. If empty do nothing.
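The check described in #2 could look roughly like the sketch below. It is only an illustration: the function name, its parameters, and the in-memory array standing in for the latest-searches table are all hypothetical and do not come from the Zeitgeist module.

```php
<?php
// Hypothetical sketch: none of these names come from the Zeitgeist module.

/**
 * Records a search term only when the search produced at least one result.
 *
 * @param string $keys         The search string entered by the visitor.
 * @param int    $result_count Number of results the search returned.
 * @param array  $log          In-memory stand-in for the latest-searches table.
 *
 * @return bool TRUE if the term was recorded, FALSE if it was skipped.
 */
function example_record_if_matched($keys, $result_count, array &$log) {
  if ($result_count < 1) {
    // Empty result set: do nothing, as suggested in the comment.
    return FALSE;
  }
  $log[] = $keys;
  return TRUE;
}

$log = array();
example_record_if_matched('cats', 3, $log);
example_record_if_matched('dogs', 0, $log);
// $log now holds only 'cats'.
```

The hard part, as the later comments explain, is not this test but obtaining the result count at the point where the search is logged.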
#3
Just to add a note. It may be in the interest of the website owner to at least be able to see non-matching searches in the admin section to see what visitors are interested in. But only publicly display the matching results.
#4
Regrettably, I do not see a clean way of implementing the second suggestion: search provides no direct way for another module to be invoked when searching something it is not defining itself: it invokes [search_name]_search directly, and offers no room for third party intervention like hook_nodeapi does for node events.
The cleanest method I can think of right now is to define a global sequence number in zeitgeist_form_alter, store it in an extended version of the ZG table, and request from the site admin to invoke a specific ZG function (say, a zeitgeist_confirm_log to be written) in the theme file to implement theme_search_page.
From there, the result set is available, and the zeitgeist_confirm_log can grab the global sequence value, and use it to update the ZG table entry for this sequence number value with the result count. It even degrades more or less gracefully, since if the confirm is not invoked, the basic ZG search logging is still performed.
At this point everything follows logically: the stats are available, and settings can be defined to choose how to theme the resulting data for the ZG blocks or pages created using _zeitgeist_stats.
The problem remains, though: this depends on the site admin modifying his theme.
Of course, if search was more extensible.... or if watchdog used some pluggable scheme, or even a simple hook_watchdog... everything would be simpler.
But as long as things remain as they are in core, I don't plan on creating this type of feature. Don't let that prevent you from trying, though: if you create it and it is usable, I'll be glad to include the feature to a future version.
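The log-then-confirm scheme from #4 can be pictured with a small sketch. Everything here is an assumption made for illustration: the ZG table is replaced by a plain array, the sequence number is passed explicitly rather than kept in a global, and the function names only echo the ones proposed in the comment.

```php
<?php
// Hypothetical sketch: an array stands in for the extended ZG table.

/** Logs a search and returns the sequence number assigned to it. */
function example_log_search($keys, array &$table) {
  $seq = count($table) + 1;
  // result_count stays NULL until the theme layer confirms the search.
  $table[$seq] = array('keys' => $keys, 'result_count' => NULL);
  return $seq;
}

/** Called from the theme once results are known; updates the logged row. */
function example_confirm_log($seq, $result_count, array &$table) {
  if (isset($table[$seq])) {
    $table[$seq]['result_count'] = $result_count;
  }
}

/** Returns the search strings confirmed to have at least one result. */
function example_public_searches(array $table) {
  $public = array();
  foreach ($table as $row) {
    if ($row['result_count'] !== NULL && $row['result_count'] > 0) {
      $public[] = $row['keys'];
    }
  }
  return $public;
}

$table = array();
$a = example_log_search('cats', $table);
$b = example_log_search('dogs', $table);
$c = example_log_search('birds', $table);
example_confirm_log($a, 3, $table);
example_confirm_log($b, 0, $table);
// 'birds' is never confirmed, so it stays logged but is not shown publicly.
$public = example_public_searches($table);
```

Unconfirmed rows remain in the table, which matches the graceful degradation described above: basic logging still happens when the theme never calls the confirm function.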
#5
I would like to mark this as "active" since this is an important feature. What is the point of having this module if it is not exposed to the public? This needs a solution to the abuse issue, just like any other user-generated functionality.
Perhaps a fix could be an integration with spam module that uses heuristics to determine likelihood of abuse. We could also match search strings for specific phrases. For example, I noticed that all the spam I got includes "http", that no legitimate searchers would use.
Either way, let's think of more solutions.
#6
The point is to inform the site admin of the latest searches. For non-admin information, this is a non-feature. The "top searches" block is probably way more relevant (and harder to spam).
As I already answered Zyles, you're welcome to try and come up with a solution. If it works well and does not require manual maintenance (like a stop list word would), it will be added to the module.
The "won't fix" just means I won't commit time to this feature myself, because I don't feel it is worthwhile and doable in a clean way: it needs either modifications to core to work cleanly OR custom theming to work less cleanly.
#7
I got an idea. Before recording the search, run the search string through strip_tags. Only spammers have incentive to put HTML into a search box - and all of them do. Simply add this as a first line into _zeitgeist_store_search:
if ($search != strip_tags($search)) return;
What do you think?
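As a standalone illustration of the check proposed in #7, here is a minimal sketch. The helper name is hypothetical; only the strip_tags() comparison comes from the comment.

```php
<?php
// Hypothetical helper: only the strip_tags() comparison comes from the comment.

/** Returns TRUE when the search string contains no HTML tags. */
function example_is_tag_free($search) {
  // strip_tags() removes HTML and PHP tags; if the string changes,
  // it contained markup.
  return $search === strip_tags($search);
}

$clean = example_is_tag_free('drupal modules');
$spam = example_is_tag_free('<a href="http://example.com">cheap pills</a>');
```

Note that a legitimate search containing a "<" character may also be rejected by this test, so it trades a few false positives for simplicity.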
#8
I'm rather doubtful about this approach:
filtering has apparently already been performed: searching for HTML code does not result in its being returned unfiltered
display does not seem appropriate: the goal with this module is to know what is actually being searched, not what we would like it to be. Sanitizing the data on display seems more appropriate
which is to enter improper search keys (think insults, hate speech, or anything illegal)
For all these reasons, it seems that such a fix:
So I don't think it should be applied.
If, however, you can build a test case demonstrating an exploit, please send it to me (do not publish it here until the problem is fixed, it might cause damage to sites using the module), and we'll see what needs to be done.
#9
Sure, this is not a solution to everything that can go wrong. But it is at least a deterrent to spammers who want to drop in links.
The test case is here, which was marked as duplicate: http://drupal.org/node/120245
That is the problem I am currently facing.
#10
Hi. I saw that post initially, but what I find surprising is that you seem to imply that the HTML is output as such?
When I try to input such HTML strings on my sites, the ZG blocks return them encoded, so the spam appears as such, not as a link to the spammer's site. Is it otherwise on yours?
#11
Well, the dumb spammers do drop in HTML, using scripts. Just like they do it to comment forms.
The samples from http://drupal.org/node/120245 are what I see show up in my "Latest searches" block - verbatim.
If this HTML gets encoded by ZG while going through forms, it needs to be decoded before strip_tags test.
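If the stored string is indeed entity-encoded, the decode-then-test step could be sketched as follows. This is an assumption about how the data might look, not a description of what ZG actually stores, and the helper name is hypothetical.

```php
<?php
// Hypothetical helper; assumes the search string may arrive entity-encoded.

/** Returns TRUE when the decoded search string contains HTML tags. */
function example_has_tags_after_decode($search) {
  // Turn &lt;a&gt; back into <a> so that strip_tags() can see the markup.
  $decoded = html_entity_decode($search, ENT_QUOTES);
  return $decoded !== strip_tags($decoded);
}

$encoded_spam = '&lt;a href=&quot;http://example.com&quot;&gt;pills&lt;/a&gt;';
$is_spam = example_has_tags_after_decode($encoded_spam);
$is_plain = example_has_tags_after_decode('drupal modules');
```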
#12
Just to make sure we understand each other: it is normal that you see this HTML, verbatim, since it is what has been searched for. The potential problem is whether this links to their site or not; when you click on the link in the block, does it link back safely to the search page within Drupal, or link to the spammer's site?
If it links back within drupal safely, this is a feature. If it links to the spammer's site, it is a bug.
Now, I understand that you could want to disable this feature. I suggest that you submit a patch adding a setting to the module that would allow admins to filter searches, typically with three choices:
It could even be combined with the option to record empty searches, to minimize settings clutter.
...and implement the setting accordingly.
#13
Yes, you got it right and the solution is correct - in terms of adding a setting. But this setting is probably orthogonal to "record empty searches", since this anti-spam setting deals with non-empty searches only.
Note however that the block shows HTML source as anchor text. The link of course points within the site (the only way ZG block would allow), but looks extremely ugly, crowds out real searches and may break CSS (if overflow is not handled right). This is still a bug, rather than a feature. What real users would search for HTML source?
I do not know how to properly make a patch, but could attach a modified file. However, I might rather leave it out there to test for a little while to make sure spammers cannot get around this fix.
#14
As you suggest, no well-meaning real user will probably search for HTML, but you probably still want to know what is submitted to your site, whether it is from well-meaning users or not (at least I do): IMO, ignoring attacks to focus on just what you would like to see is just not safe. But you can still use that new setup to ignore them if you so choose: just make sure it requires a specific decision, instead of being a default behaviour.
Regarding the patch, you can just submit the modified module, and I can roll up the patch. Or, better yet, because you must become acquainted with it if you are to work within the Drupal ecosystem, use a GUI front-end like TortoiseCVS (if you're on Windows). If you use Tortoise, just select the modified file in your CVS directory, and right-click to CVS->Create patch
#15
OK, my fix stood the test of time (so far) and I am pleased to attach the modified 4.7 module.
You can do a diff and see changes, but this is very straightforward. Note how this can be extended to add more filters later.
#16
Hi, can we get the patch finally committed?
By the way, is there anything being done for 5.1 upgrade of the whole module?
#17
Hi.
I'm not working on a 5.x version of the module at the moment, and do not expect to, with 6.0 approaching so fast.
So what I did was commit your changes as the initial version for 5.x: that way you will be able to take advantage of the release system for your work, while not impacting the stable 4.7.x version.
#18
OK, I am preparing for 5.x migration now so should test this soon enough.
I suggest still checking in a 4.7 branch... This is tested, works and should be available in CVS. Perhaps as a new module release.
#19
#20
Note that a D6 branch has now been created: I'll be putting new code in it.
Do you consider your D5 version as stable enough for a release ?
#21
I updated the whole set of files making up the module for D5, incorporating the latest changes you committed.
I'm not making a release: this is up to you when you decide a version in this branch is good enough.
#22