Suggestion on how to fix inappropriate searches

Zyles - January 30, 2007 - 15:39
Project: Zeitgeist
Version: 5.x-1.x-dev
Component: Code
Category: feature request
Priority: normal
Assigned: Unassigned
Status: closed
Description

According to this:

"WARNING: it is generally NOT a good idea to activate the Zeitgest latest searches block for public display, as it can be used to display inappropriate content just by searching for it."

If somebody searches for bad language, search engines or other visitors would pick up these words, which obviously wouldn't be good. But I have come to the conclusion that this could be solved in two ways.

1. Use a bad word filter that does not display searches matching certain criteria.

2. Only show searches that matched something. If the site contains no bad language, the search would return nothing, and that particular search would then not be displayed in the latest searches block.

Let me know what you think. And I hope you make this for 5.x

Thanks.

#1

fgm - January 30, 2007 - 20:08

Hi Zyles.

Thanks for your suggestion. Overall, I think the stop words idea would be valuable, but as always the devil is in the details: I don't see myself maintaining a "stop words" list, and building a centralized repository would be error-prone, because terms that are inappropriate in one language can be perfectly acceptable in another. Do you have any idea how to work around this?

The second idea seems much simpler and could actually be effective. Can you try to make a patch for it, and I'll review it ?

Regarding 5.0, I don't have free time for this now, and don't expect to have any for another couple of months, but it will probably happen some day... unless 6.0 appears first.

#2

Zyles - February 3, 2007 - 12:47

Hi,

Unfortunately I do not have time to contribute at all. I am working on several other projects at the moment and the lack of time hinders me.

I can see the problem with the bad words list and the effort needed to maintain it. Personally, I would not use it.

The second solution is the ideal one for another reason as well. Showing users and search engines searches for content that does not exist on the website is not relevant at all. There is little reason to show a search for "dogs" on a site that only has content about "cats".

Perhaps the easiest way to make this possible is to have the module check whether a search returned more than zero results. If the search returns one or more results, add it to the latest searches. If it returns nothing, do nothing.
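A minimal sketch of that check, written as plain PHP (7 or later) outside Drupal. The function name, the `$results` array, and the in-memory `$log` are all hypothetical stand-ins for illustration; they are not part of the Zeitgeist module or of search.module.

```php
<?php
// Hypothetical helper: $results is whatever list of hits the search produced,
// $log is an in-memory stand-in for wherever the module stores searches.
function example_record_search_if_matched(string $keys, array $results, array &$log): bool {
  if (count($results) === 0) {
    // Nothing matched: keep this search out of the latest searches list.
    return false;
  }
  $log[] = $keys;
  return true;
}

$log = [];
example_record_search_if_matched('cats', ['node/1', 'node/7'], $log);
example_record_search_if_matched('dogs', [], $log);
```

After both calls, only the search that matched content is in `$log`.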

#3

Zyles - February 3, 2007 - 12:50

Just to add a note: it may be in the interest of the website owner to at least be able to see non-matching searches in the admin section, to learn what visitors are interested in, while only the matching searches are displayed publicly.

#4

fgm - February 5, 2007 - 21:35
Version: HEAD » 4.7.x-1.x-dev
Status: active » won't fix

Regrettably, I do not see a clean way of implementing the second suggestion: search provides no direct way for another module to be invoked when the search being run is not one that module defines itself. It invokes [search_name]_search directly, and offers no room for third-party intervention the way hook_nodeapi does for node events.

The cleanest method I can think of right now is to define a global sequence number in zeitgeist_form_alter, store it in an extended version of the ZG table, and ask the site admin to invoke a specific ZG function (say, a zeitgeist_confirm_log, yet to be written) in the theme file that implements theme_search_page.

From there, the result set is available, and the zeitgeist_confirm_log can grab the global sequence value, and use it to update the ZG table entry for this sequence number value with the result count. It even degrades more or less gracefully, since if the confirm is not invoked, the basic ZG search logging is still performed.

At this point everything follows logically: the stats are available, and settings can be defined to choose how to theme the resulting data for the ZG blocks or pages created using _zeitgeist_stats.

The problem remains, though: this depends on the site admin modifying his theme.

Of course, if search was more extensible.... or if watchdog used some pluggable scheme, or even a simple hook_watchdog... everything would be simpler.

But as long as things remain as they are in core, I don't plan on creating this type of feature. Don't let that prevent you from trying, though: if you create it and it is usable, I'll be glad to include the feature in a future version.
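The sequence-number scheme described in this comment could be sketched roughly as follows. This is plain PHP (7.1 or later) with an in-memory array standing in for the extended ZG table; the class and method names are hypothetical, and the real work would be split between zeitgeist_form_alter and the proposed zeitgeist_confirm_log.

```php
<?php
// All names here are hypothetical stand-ins, not the module's real code.
final class ExampleSearchLog {
  /** @var array sequence number => ['keys' => string, 'results' => int|null] */
  private $rows = [];
  /** @var int last sequence number handed out */
  private $sequence = 0;

  // Stand-in for the form_alter step: log the search right away with an
  // unknown result count, and remember its sequence number.
  public function log(string $keys): int {
    $this->sequence++;
    $this->rows[$this->sequence] = ['keys' => $keys, 'results' => null];
    return $this->sequence;
  }

  // Stand-in for the confirm step the theme would call once the result set
  // is available: attach the result count to the latest logged search.
  public function confirmLatest(array $results): void {
    if (isset($this->rows[$this->sequence])) {
      $this->rows[$this->sequence]['results'] = count($results);
    }
  }

  // Public listing: only searches confirmed to have matched something.
  public function publicSearches(): array {
    $public = [];
    foreach ($this->rows as $row) {
      if ($row['results'] !== null && $row['results'] > 0) {
        $public[] = $row['keys'];
      }
    }
    return $public;
  }

  // Admin listing: every logged search, confirmed or not.
  public function adminSearches(): array {
    return array_column($this->rows, 'keys');
  }
}

$log = new ExampleSearchLog();
$log->log('cats');
$log->confirmLatest(['node/1']);
$log->log('dogs');
$log->confirmLatest([]);
$log->log('birds'); // the theme never confirms this one
```

The unconfirmed 'birds' search is still logged and visible to the admin, which is the graceful degradation mentioned above; it is simply never shown publicly because its result count is unknown.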

#5

dkruglyak - February 18, 2007 - 11:40
Status: won't fix » active

I would like to mark this as "active" since this is an important feature. What is the point of having this module if it is not exposed to the public? This needs a solution to the abuse issue, just like any other user-generated functionality.

Perhaps a fix could be an integration with the spam module, which uses heuristics to determine the likelihood of abuse. We could also match search strings against specific phrases. For example, I noticed that all the spam I got includes "http", which no legitimate searcher would use.

Either way, let's think of more solutions.
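As a rough illustration of the phrase-matching idea, here is a minimal PHP sketch. The function name is hypothetical, and the single "http" rule is only the example given above, not a tested spam heuristic.

```php
<?php
// Hypothetical heuristic: treat any search containing "http" (in any
// letter case) as probable link spam.
function example_looks_like_link_spam(string $keys): bool {
  return stripos($keys, 'http') !== false;
}

$is_spam = example_looks_like_link_spam('cheap pills http://spam.example');
$is_legit = !example_looks_like_link_spam('drupal themes');
```

A rule this blunt also flags a genuine search such as "http headers", so it trades some false positives for simplicity.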

#6

fgm - February 18, 2007 - 13:54

The point is to inform the site admin of the latest searches. For non-admin information, this is a non-feature. The "top searches" block is probably way more relevant (and harder to spam).

As I already answered Zyles, you're welcome to try and come up with a solution. If it works well and does not require manual maintenance (like a stop word list would), it will be added to the module.

The "won't fix" just means I won't commit time to this feature myself, because I don't feel it is worthwhile and doable in a clean way: it needs either modifications to core to work cleanly, or custom theming to work less cleanly.

#7

dkruglyak - February 26, 2007 - 13:13
Status: active » needs review

I got an idea. Before recording the search, run the search string through strip_tags. Only spammers have an incentive to put HTML into a search box, and all of them do. Simply add this as the first line of _zeitgeist_store_search:

  if ($search != strip_tags($search)) return;

What do you think?

#8

fgm - February 26, 2007 - 19:01
Status: needs review » needs work

I'm rather doubtful about this approach:

  • more than an antispam measure, such input filtering is necessary for basic safety
  • however, the data for ZG come from the FAPI processing from search.module, where this
    filtering has apparently already been performed: searching for HTML code does not result in its
    being returned unfiltered
  • the very mechanism of stripping data from the actual user searches, instead of stripping it upon
    display, does not seem appropriate: the goal of this module is to know what is actually being
    searched, not what we would like it to be. Sanitizing the data on display seems more appropriate
  • the most annoying point is that it does nothing to protect against the most typical misuse,
    which is to enter improper search keys (think insults, hate speech, or anything illegal)

For all these reasons, it seems that such a fix:

  • does not prevent safety exploits, which are already prevented by search.module
  • does not prevent search abuse

So I don't think it should be applied.

If, however, you can build a test case demonstrating an exploit, please send it to me (do not publish it here until the problem is fixed, it might cause damage to sites using the module), and we'll see what needs to be done.

#9

dkruglyak - February 26, 2007 - 22:25

Sure, this is not a solution to everything that can go wrong. But it is at least a deterrent to spammers who want to drop in links.

The test case is here, which was marked as duplicate: http://drupal.org/node/120245

That is the problem I am currently facing.

#10

fgm - February 27, 2007 - 07:09

Hi. I saw that post initially, but what I find surprising is that you seem to imply that the HTML is output as such?

When I try to input such HTML strings on my sites, the ZG blocks return them encoded, so the spam appears as text, not as a link to the spammer's site. Is it otherwise on yours?
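To illustrate what "encoded" means here, the following is a minimal PHP sketch of escaping a stored search at display time. The function name and the `/search/node/` path are assumptions for illustration only; this is not the module's actual rendering code.

```php
<?php
// Hypothetical renderer: turns a raw stored search into a link back to the
// site's own search page, with the search text shown as inert, escaped text.
function example_render_search_link(string $keys): string {
  // The URL part is percent-encoded so the keys cannot alter the link target.
  $href = '/search/node/' . rawurlencode($keys);
  // The visible part is HTML-escaped so any tags are displayed, not rendered.
  $text = htmlspecialchars($keys, ENT_QUOTES, 'UTF-8');
  return '<a href="' . $href . '">' . $text . '</a>';
}

$html = example_render_search_link('<a href="http://spam.example">pills</a>');
```

With this approach the spammer's markup shows up verbatim as anchor text, but the only real link in the output points back into the site.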

#11

dkruglyak - February 27, 2007 - 07:28

Well, the dumb spammers do drop in HTML, using scripts. Just like they do it to comment forms.

The samples from http://drupal.org/node/120245 are what I see show up in my "Latest searches" block, verbatim.

If this HTML gets encoded by ZG while going through forms, it needs to be decoded before the strip_tags test.

#12

fgm - February 27, 2007 - 09:56

Just to make sure we understand each other: it is normal that you see this HTML, verbatim, since it is what has been searched for. The potential problem is whether this links to their site or not: when you click on the link in the block, does it link back safely to the search page within Drupal, or does it link to the spammer's site?

If it links back within Drupal safely, this is a feature. If it links to the spammer's site, it is a bug.

Now, I understand that you might want to disable this feature. I suggest that you submit a patch adding a setting to the module that would allow admins to filter searches, typically with three choices:

  • record all searches as such [default]
  • record a sanitized form for searches (like strip_tags)
  • do not store searches differing from their sanitized form (strip_tags != original)

It could even be combined with the option to record empty searches, to minimize settings clutter.

...and implement the setting accordingly.
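The three choices listed above could be sketched like this in plain PHP (7.1 or later). The constant and function names are hypothetical; how the mode would be stored and exposed in the module's settings form is left out.

```php
<?php
// Hypothetical filter modes matching the three choices above.
const EXAMPLE_FILTER_NONE = 0;     // record all searches as typed (default)
const EXAMPLE_FILTER_SANITIZE = 1; // record the strip_tags() form
const EXAMPLE_FILTER_REJECT = 2;   // do not record searches containing tags

// Returns the string to store, or null when the search should not be stored.
function example_filter_search(string $keys, int $mode = EXAMPLE_FILTER_NONE): ?string {
  $stripped = strip_tags($keys);
  switch ($mode) {
    case EXAMPLE_FILTER_SANITIZE:
      return $stripped;
    case EXAMPLE_FILTER_REJECT:
      // Store only searches that strip_tags() leaves unchanged.
      return ($stripped === $keys) ? $keys : null;
    default:
      return $keys;
  }
}

$spam = '<a href="http://spam.example">pills</a>';
$as_typed = example_filter_search($spam);
$sanitized = example_filter_search($spam, EXAMPLE_FILTER_SANITIZE);
$rejected = example_filter_search($spam, EXAMPLE_FILTER_REJECT);
```

Keeping "record all searches as such" as the default means an admin has to opt in explicitly before any search data is altered or dropped.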

#13

dkruglyak - February 27, 2007 - 14:19

Yes, you got it right and the solution is correct - in terms of adding a setting. But this setting is probably orthogonal to "record empty searches", since this anti-spam setting deals with non-empty searches only.

Note, however, that the block shows the HTML source as anchor text. The link of course points within the site (the only thing the ZG block would allow), but it looks extremely ugly, crowds out real searches, and may break CSS (if overflow is not handled right). This is still a bug rather than a feature. What real user would search for HTML source?

I do not know how to properly make a patch, but could attach a modified file. However, I might rather leave it out there to test for a little while to make sure spammers cannot get around this fix.

#14

fgm - February 27, 2007 - 15:01

As you suggest, no well-meaning real user will probably search for HTML, but you probably still want to know what is submitted to your site, whether it is from well-meaning users or not (at least I do): IMO, ignoring attacks to focus on just what you would like to see is simply not safe. But you can still use that new setting to ignore them if you so choose: just make sure it requires a specific decision, instead of being the default behaviour.

Regarding the patch, you can just submit the modified module, and I can roll the patch. Or, better yet, because you will need to become acquainted with it if you are to work within the Drupal ecosystem, use a GUI front-end like TortoiseCVS (if you're on Windows). With Tortoise, just select the modified file in your CVS directory, then right-click and choose CVS -> Create patch.

#15

dkruglyak - March 14, 2007 - 02:40
Status: needs work » reviewed & tested by the community

OK, my fix has stood the test of time (so far) and I am pleased to attach the modified 4.7 module.

You can do a diff and see changes, but this is very straightforward. Note how this can be extended to add more filters later.

Attachment              Size
zeitgeist.module.txt    22.58 KB

#16

dkruglyak - May 19, 2007 - 07:23

Hi, can we finally get the patch committed?

By the way, is there anything being done for 5.1 upgrade of the whole module?

#17

fgm - May 19, 2007 - 20:26
Status: reviewed & tested by the community » fixed

Hi.

I'm not working on a 5.x version of the module at the moment, and do not expect to, with 6.0 approaching so fast.

So what I did was commit your changes as the initial version for 5.x: that way you will be able to take advantage of the release system for your work, while not impacting the stable 4.7.x version.

#18

dkruglyak - May 20, 2007 - 15:02

OK, I am preparing for 5.x migration now so should test this soon enough.

I suggest still checking in a 4.7 branch... This is tested, works and should be available in CVS. Perhaps as a new module release.

#19

Anonymous - June 4, 2007 - 21:23
Status: fixed » closed

#20

fgm - August 25, 2007 - 16:53
Version: 4.7.x-1.x-dev » 5.x-1.x-dev
Status: closed » needs review

Note that a D6 branch has now been created: I'll be putting new code in it.

Do you consider your D5 version stable enough for a release?

#21

fgm - August 26, 2007 - 20:48
Status: needs review » fixed

I updated the whole set of files making up the module for D5, incorporating the latest changes you committed.

I'm not making a release: this is up to you when you decide a version in this branch is good enough.

#22

Anonymous - September 9, 2007 - 21:43
Status: fixed » closed
 
 
