I got very little (read: NO) interest when I posted on this topic earlier with a different phrasing. So, here's my second shot at this.

I propose that there is a problem with the way program-function URLs are constructed in Drupal, one that makes Drupal a disproportionate target for trackback and comment spammers.

The problem with comment and trackback spam in Drupal is this: It's too easy to guess the URL for comments and trackbacks.

In Drupal, the link for a node has the form "/node/x", where x is the node ID. In fact, you can formulate a lot of Drupal URLs that way; for example, the URI to send a trackback to node x is "/node/x/trackback", and the URI to post a comment to x is "/node/x/comment". So you can see that it would be a trivially easy task to write a script that just walks the node table from top to bottom, trying to post comments.
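
To make that concrete, here's a minimal sketch of the kind of script I mean (the target host and ID range are made up; the paths follow the URL forms just described):

  <?php
  // Illustration only: the trivial enumeration a spam 'bot can run
  // against predictable Drupal paths. Host and ID range are invented.
  $base = 'http://example.com';
  for ($nid = 1; $nid <= 1000; $nid++) {
    foreach (array('comment', 'trackback') as $action) {
      $ch = curl_init("$base/node/$nid/$action");
      curl_setopt($ch, CURLOPT_NOBODY, TRUE);         // a HEAD request is enough to probe
      curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
      curl_exec($ch);
      $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
      curl_close($ch);
      if ($status == 200) {
        // The node exists and the path answers: queue it for a spam POST.
        echo "open target: /node/$nid/$action\n";
      }
    }
  }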

Which is pretty much what spammers do to my site: They fire up a 'bot to walk my node tree, looking for nodes that are open to comments or accepting trackbacks. I have some evidence that it's different groups of spammers trying each thing, but that hardly matters -- what does matter is that computational horsepower and network bandwidth cost these guys so little that they don't even bother to stop trying after six or seven hundred failures -- they just keep on going, like the god damned Energizer bunny. For the first sixteen days of August this year, I got well over 100,000 page views, of which over 95% were hits on my 404 error page. The "not found" URL in over 90% of those cases was some variant on a standard Drupal trackback or comment-posting URL.

One way to address this would be to use something other than a sequential integer as the node ID. This is effectively what happens with tools like Movable Type and Wordform/WordPress, because they use real words to form the path elements in their URIs -- for example, /archives/2005/07/05/wordform-metadata-for-wordpress/, which links to an article on Shelley Powers's site. Whether those real words correspond to real directories or not is kind of immaterial; the important point is that they're impractically difficult to crank out iteratively with a simple scripted 'bot. Having to discover the links would probably increase the 'bot's time per transaction by a factor of five or six. Better (from the spammer's point of view) to focus on vulnerable tools, like Drupal.

But the solution doesn't need to be that literal. What if, instead of a sequential integer, Drupal assigned a Unix timestamp value as a node ID? That would introduce an order of complexity to the node naming scheme that isn't quite as dramatic as that found in MT or WordPress, but is still much, much greater than what we've got now. Unless you post at a ridiculous frequency, it would guarantee unique node IDs. And all at little cost in human readability (since I don't see any evidence that humans address articles or taxonomy terms by ID number, anyway).
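
As a sketch of what I mean (this is the proposal, not how Drupal actually assigns node IDs; the function and callback names are invented):

  <?php
  // Proposal sketch: assign a Unix timestamp as the node ID instead of
  // the next sequential integer.
  function proposed_node_id($nid_exists) {
    $nid = time(); // e.g. 1124467200: ten digits, sparse, roughly date-ordered
    // Two nodes saved in the same second would collide, so a real
    // implementation would still need a uniqueness check.
    while ($nid_exists($nid)) {
      $nid++;
    }
    return $nid;
  }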

Comments

Tommy Sundstrom’s picture

The Pathauto module, http://drupal.org/node/17345, will give the kind of URLs that you're asking for.

escoles’s picture

... until something like this is the default behavior for the application, there will still be a strong incentive to code 'bots to hit Drupal sites by walking the node tree.

Put another way: In order for this to be anything more than a neat gimmick, everyone's gotta do it. Until then, it's just coding yourself into a corner: What do you do when pathauto doesn't get rev'd up to Drupal 5?

sepeck’s picture

pathauto works by adding a URL alias for a given link; it does not prevent the existing links from working, so the node crawl you describe will continue to work. Also, too many people use it for some developer not to update it. :)

UNIX timestamps wouldn't be a good solution for those not running their sites on a *NIX platform. I realize you threw that out there as an example, but just wanted to mention the 'why not *NIX timestamps' part.

-sp
---------
Test site, always start with a test site.
Drupal Best Practices Guide -|- Black Mountain

escoles’s picture

... could still work. I used time as an example because it's a good way to get a sufficiently unique string without randomness. It could have the added benefit of naturally sorting in order by date.

As for pathauto, I realized right as I was walking up to dinner after I posted that it wouldn't prevent the problem I'm facing. Actually, nothing really would; what I'm suggesting requires some deep changes, and I'm aware of that. I just wish people would be aware of this.

I think the design transparency of Drupal clean URLs is a beautiful thing. Unfortunately, it's too nice to live -- it's like the trusting child who always gets taken advantage of. I'm not suggesting we make the child less beautiful -- just a little less easy to manipulate.

dpangier’s picture

Bad Behavior is a "site denial" system for spambots. It examines the request headers during the page initialisation, and if the headers don't correlate with a regular webcrawling bot or a recognised browser, the request is redirected to another page.

This stops spambots from even getting to nodes to find out if comments, trackback or pingbacks are enabled, as well as blocking the URLs that allow posting.
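
For a sense of how that works, here is a much-simplified toy version of header screening in the same spirit (these particular checks are invented for the example; the real module's rule set is far larger):

  <?php
  // Toy illustration of header-sanity screening, in the spirit of Bad
  // Behavior. Not the module's actual rules.
  function looks_like_spambot() {
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    // Real browsers send a User-Agent and an Accept header.
    if ($ua == '' || !isset($_SERVER['HTTP_ACCEPT'])) {
      return TRUE;
    }
    // Mozilla-family browsers also send Accept-Encoding; crude bot
    // kits often omit or mangle it.
    if (strpos($ua, 'Mozilla') === 0 && !isset($_SERVER['HTTP_ACCEPT_ENCODING'])) {
      return TRUE;
    }
    return FALSE;
  }

  if (looks_like_spambot()) {
    header('HTTP/1.1 403 Forbidden');
    exit; // bail out before Drupal bootstraps the database
  }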

You can read more at my Bad Behavior for Drupal page.

escoles’s picture

Doesn't this add latency to the page load process?

And how difficult will it be for the spammers to just work around this? That is, they start noticing that they're blocked a lot, and realize they need to be more careful formatting their headers?

(Not that I'm not all over looking at this. I need to spec a community site and I'd rather not do it in something other than Drupal.)

dpangier’s picture

The page latency is a tiny fraction of a second on my 400MHz server, so on a modern box it should be almost invisible.

Natural selection is a much more serious issue, which I believe may ultimately end the usefulness of such a module. In the meantime, however, it is very effective.

David

nsk’s picture

I recently installed Bad Behaviour and it works fine. It saved me, and it enabled me to allow comments on my site again!

--
NSK, Founder of the Wikinerds Community. See my Drupal site.

Boris Mann’s picture

Trust me, spammers do not rely on something as simple as that. Both MT and WordPress use numbers underneath, and you can access those pages. Do some searches on mt-comment.cgi -- it's in a known location, so spammers just call it directly.

The spam.module works brilliantly, and has blocked over 40K spams for me. That, or the Bad Behavior module mentioned above (which I haven't used -- ideally, something like this would actually be rolled into spam.module directly), can solve these issues once and for all.

escoles’s picture

If they're relying on something more complex than that, why do they persist in hitting my site until I block their IP, even though they don't get a single success?

As for the effectiveness of the spam module, sure, it works. It blocked about that many for me, while I was still using it. But it also let thousands pass. And when it let them pass, it would let them pass FAST -- so fast that my site would be hammered with them before I had a chance to do anything about it.

I finally stopped using it when it became clear that the operation of the spam module, under the pressure of a spam message every second or so, was causing my webserver to hang.

laura s’s picture

The spam 2.0 module does IP blocking for repeat spammers. Even if they're using 40 IP addresses to hit you, the module will block them. You can set the tolerance for how many repeat spams are allowed from an IP address before the IP is blocked; how long it stays blocked is set by a global duration.
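
Schematically, the repeat-offender logic works something like this (a toy sketch; the names, in-memory storage, and numbers are invented, not the module's actual code):

  <?php
  // Toy sketch of threshold-based IP blocking as described above.
  $threshold = 5;      // spams tolerated from one IP before it is blocked
  $duration  = 86400;  // how long the block lasts, in seconds

  function spam_seen_from($ip, &$counts, &$blocked, $threshold, $duration) {
    $counts[$ip] = isset($counts[$ip]) ? $counts[$ip] + 1 : 1;
    if ($counts[$ip] >= $threshold) {
      $blocked[$ip] = time() + $duration; // blocked until this timestamp
    }
  }

  function ip_is_blocked($ip, $blocked) {
    return isset($blocked[$ip]) && $blocked[$ip] > time();
  }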

Bad Behavior is another module to consider. I've not tried it, and don't know if it will work with spam or spam 2.0. But it's now released for 4.6.x.

Laura
===
pingVision | scattered sunshine
Laura Scott :: design » blog » tweet

Boris Mann’s picture

It's all about percentages. It costs them nothing to keep trying, and it's not a "they" in any case -- it is a kit that automatically crawls the web, finding sites everywhere. Bits cost them essentially 0, so they just keep going.

The Bad Behavior module blocks before it even loads your site -- you might need to use it as well (although that really sucks).

FYI, spam.module 2.0 has blocked 40K spams in about a month, on a single site.

escoles’s picture

What methods *do* they use?

What methods do the spammers use to decide what pages to attack, if they don't automatically formulate URLs based on the target CMS?

And the point isn't that I need a better spam module; the point is that there is a design defect in Drupal that makes it a more attractive target for spammers. To counter this, you'll need to demonstrate to me that the spammers aren't exploiting that defect.

You seem to be changing your argument, BTW. First you say, "they don't use something as simple as" URL extrapolation based on the target CMS. (I've yet to see a 404 for mt-comment.cgi, by the way, which tells me that they know it's a Drupal site and suggests that they're targeting accordingly. But I digress.)

Now you say, 'it doesn't matter what algorithms they use, because they'll keep attacking no matter what and there's nothing you can do about it.' And then you point me to a spam control module. That ignores the fact that even when the spam module worked for me, spam attacks still took my site down. Not making all the database hits needed to load a page is great, but it still misses the root point: It would be easy to make Drupal a lot less attractive to spammers by changing how node IDs are assigned.

That "it's all about volume" was more or less precisely my point: They have close to zero cost, as I thought I established in my initial post. So they can afford to use dirt-dumb algorithms. I knew that; that was my point. But even if your cost is close to zero, there's a difference between guessing, say, three-digit node IDs and 13-digit node IDs. The probability of hitting with a node ID that's on the order of the size and complexity of, say, a timestamp, is a factor of maybe ten thousand lower than with a three-digit node ID.
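
Back-of-envelope numbers behind that factor (the node count here is an assumption, picked for illustration):

  <?php
  // Rough arithmetic for the claim above: a site with 1,000 nodes.
  $nodes = 1000;

  // Sequential IDs: essentially every guess in 1..1000 hits a real node.
  $p_sequential = 1.0;

  // Timestamp IDs scattered across one year's worth of seconds:
  $p_timestamp = $nodes / (365 * 24 * 3600); // ~ 1 in 31,536 per guess

  printf("sequential: %.4f  timestamp: %.8f\n", $p_sequential, $p_timestamp);
  // The per-guess hit rate drops by a factor of roughly 30,000 -- the
  // same order of magnitude as the "ten thousand" figure above.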

I think a decreased probability of success on the order of a factor of ten-thousand would make a difference even to spammers. Remember: They don't have real costs, but they do have opportunity costs -- after all, those zombies could be off doing something productive with their bandwidth rather than jousting at some Drupal-powered windmill all afternoon.

Boris Mann’s picture

Sorry, I'm just trying to help here. Random hashes in node IDs or anything like that is not the solution (and just isn't going to happen). But of course you're welcome to contribute some code.

Bad Behavior stops the spam before it hits your site, so you might want to try that.

Jeremy, author of the spam.module, is getting changes into the comment system to make submissions harder to automate (a Drupal key, actually). That still won't help your load problem.
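
The rough idea of such a key, as I understand it (purely illustrative, not the actual patch): embed a token derived from a server-side secret in the form, and reject any POST that doesn't echo it back.

  <?php
  // Toy version of a form key. The secret and field name are invented.
  session_start();
  $secret = 'site-private-key';
  $token  = md5($secret . session_id());

  // When rendering the form, include:
  //   <input type="hidden" name="form_token" value="$token">

  // When handling the POST:
  if (!isset($_POST['form_token']) || $_POST['form_token'] != $token) {
    header('HTTP/1.1 403 Forbidden');
    exit; // a 'bot replaying a canned POST won't carry a valid token
  }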

The other thing you can do is try and implement spam blocking at the HTTP layer with external-to-Drupal tools.

re: opportunity costs: you'd think so. The number of comments I get (all blocked) seems to suggest otherwise.

laura s’s picture

Perhaps your logic is backwards. Spammers target sites that rank well in search engines, sites that get traffic. Drupal sites achieve these things.

The spam is an indication of this success. Other sites get spammed heavily, and they deal with it. Have you noticed how many sites don't allow anonymous commenting? In fact, I'd say most sites don't. Why do you think that is? I suggest that it's because of spam.

I don't think being spammed means that Drupal is the problem. Maybe the architecture could make it harder for spammers to spam, but it won't stop them. It's very easy to write a script to work through whatever tapdance you put into the code. If you really want to stop them, try the captcha module. Ultimately a graphic challenge will stymie nearly every spammer you can think of.

Laura

escoles’s picture

... I get very few hits. Except from spammers, that is.

You guys aren't listening: That stuff works; that's not the problem. Even if the stuff works, a lot of sites at a lot of hosting providers will still end up DOS'd, or will end up with painful bandwidth bills at the end of the month. I served over 10GB last month; over 9GB of that was 404 pages, served because I don't allow anonymous comments or trackbacks.

Re-read the post: I'm getting hundreds of thousands of hits just from spammers. Those all go to 404 -- so there's no way that spamming could be driving up the "popularity" of my site for spammers. I know darn well why they don't allow anonymous comments; it's the same reason I don't. I tried the spam module, I tried adding captchas, and I still got DOS'd for my trouble.

laura s’s picture

DOS attacks can happen to any site. I don't see how a module or architecture revisions are going to change that.

I would also add that if 9GB of traffic puts you in hot water with your host provider, you might look at changing hosts. 9GB is nothing these days. Our sites get over 6GB of traffic just in search crawlers. It's the price you pay for having an optimized site.

I'm sorry you're getting hit. It's no fun. Spam 2.0 made a world of difference on our heaviest trafficked sites. It doesn't uninvent spammers, but it's been a big help.

Laura

escoles’s picture

I'm going to try this one more time, then I'm going to quit. It seems that most of y'all are knee-jerking defensive -- you're unwilling to understand that there's a problem with the design of Drupal (namely, the "uniform" [sic] node paths that allow a spammer to trivially script exploit attempts against Drupal nodes), and instead insist on seeing this as me trying to solve *my* *problem*.

Well, it's not my problem. It's everyone's problem. If the problem is not fixed at a core level, then default installs will always have this weakness, and the system will always be targeted.

Very few people responding in this thread have really paid any attention to what I'm saying.

I've said, "Drupal has a weakness that makes it attractive to spammers."

You've said, "You should try Spam.module 2.0. It's great."

I've said, "It doesn't matter if I've fixed my problem. If most other people have un-fixed Drupal installs, it's still an attractive target for spammers."

You've said, "Just block them with with Bad Behavior. Oh, and by the way, spam.module 2.0? It's great."

I've said, "I don't think you get it. You're selling a door that has a weak spot and saying 'just apply this super patch inside the door where no one can see it, and no one will get through.' But since thieves know all examples of this kind of door have this weak spot, they target it and pound on this kind of door even more. The thieves don't get in, but the door stops working."

You've said, "Dude, I'm sorry that your door stops working. Have you tried spam.module 2.0? It stops people from getting through that weak spot. Oh, and Bad Behavior's really great, too."

... And then I walk away, shaking my head in frustration at trying to communicate a "commons" issue to geeks ...

laura s’s picture

Have you submitted an issue on this? Comments are for discussion, but issues are for doing.

Laura

escoles’s picture

I wanted to understand why this isn't being addressed. I had one idea for a way to address the problem; I was hoping that instead of defensive responses I would get either an explanation of why it's not an issue (and that hasn't been given), or discussion of the problem.

BTW, what makes you think it wouldn't die an even quicker death if posted as an issue? It would simply then be dismissed. Here, at least, a discussion thread could be started.

laura s’s picture

You might consider the possibility that you are not being clear about what you assert. I am not being defensive -- I have nothing to defend here. This is not my code. Try to relax and breathe.

You say your site gets DOS attacks, and you blame Drupal. I don't see why Drupal is responsible for something that can happen to any site anywhere running any CMS or straight html.

But maybe I just don't understand what it is you think you're saying. Maybe you have the brilliant insight into a major problem with the code. I say, Great! So now what? You want to "understand"? Okay. Maybe it's not an issue because you haven't submitted the issue. You know, that's just a thought.

Laura

sepeck’s picture

... working with your host provider to block IP ranges, because only at the router and network layer can you stop traffic to your site. You cannot stop it any other way. 404 DOS attacks are not solvable by Drupal. They are not solvable by Apache.

Nothing you do that does not block all traffic from the originating IP addresses will help you with a 404 DOS. Get a firewall, and have your provider block the IP addresses and ranges needed to prevent the traffic from hitting your site. Once it hits your webserver, it's too late to do anything.

-sp
---------
Test site, always start with a test site.
Drupal Best Practices Guide -|- Black Mountain

cel4145’s picture

You are right. This problem is very, very real. I've seen this happen to a friend of mine. Spam and trackback traffic was so bad, it was DoSing the server. Even though trackbacks were disabled on the site, Drupal's page-not-found generation was eating up too much CPU.

Now, one solution I arrived at was to add a redirect in .htaccess so that trackback requests of the form /node/trackback/ were redirected away from Drupal. That let Apache generate a 500 internal error, which is much less resource-hungry than letting Drupal handle it. But this is only a small, stopgap solution.
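
A minimal sketch of that kind of rule (assuming Drupal's stock .htaccess with mod_rewrite enabled, and trackback paths of the /node/x/trackback form described at the top of the thread; the [F] flag here answers with a cheap 403 straight from Apache rather than the 500 described above, but the effect -- Apache responding before Drupal ever bootstraps -- is the same):

  # Drupal's .htaccess already turns the rewrite engine on.
  RewriteEngine on
  # Answer trackback probes directly from Apache with a bare 403,
  # instead of letting Drupal bootstrap and render its 404 page.
  RewriteRule ^node/[0-9]+/trackback - [F,L]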

laura s’s picture

It is linked on the spam module description. Jeremy did a fine job rewriting the module to block repeat spammer IPs (configurable). It also comes with a trackback_blackhole module that dumps trackback queries to an empty page (and hence no db calls to load a Drupal page).

Laura

Ian Ward’s picture

Has anyone backported trackback 4.6 HEAD, which works with spam 2.0, to work on Drupal 4.5? Anyone know if there are impossibilities here? Going to have a look now; just thought I'd ask in the meantime.

eldarin’s picture

See http://drupal.org/node/29954 and then, if you really need to, add captchas.
Further on, you could add more sniffing/analyzing when accepting POSTed form data. It depends on how thorough you want to be.

An excellent spam module would analyze how forms/trackbacks are being submitted, too. There are a lot of variables to consider, and making out the patterns by hand can be laborious. Kohonen SOM or Bayesian probability can help a lot.
;-)