Mass deletes (millions) of a given node type, HowTo?

yountod - November 9, 2009 - 20:45

I need to delete millions of nodes. I was using the Devel module to delete before generating, then simply not generating, but that crashes after a few thousand deletes. How can I blow away millions of nodes of one type while keeping all others?

What are the ramifications of direct MySQL node deletions? Many, I presume?

That's a lot of nodes

stevenc - November 14, 2009 - 15:22

If you call the normal node_delete function, you are definitely going to have issues with that many nodes.

http://api.drupal.org/api/function/node_delete

The API function is designed to delete nodes individually. There are several reasons why it is bad for large numbers of nodes:

- It loads the entire node object
- It invokes the call-back functions
- It clears the cache
- It removes the node from the Search index
- It adds a log of the event

Attempting to call this iteratively would take a lot of time and server resources.

Since you are working in batches, you can clear the cache and update the Search index manually once each batch is complete, and you can probably skip the log entries.

I think you may run into a challenge with the call-back functions, as these are what other modules use to be notified that a node was deleted.

If I were handling this problem, I would first consider how many other modules are interacting with the nodes (menu, path, comment, etc.) and to what extent. The harm in editing your MySQL tables directly is that you may create orphan data or relationships, so you would need to be careful in deleting data. Also, a module could store node data as part of a larger data set (for example, as part of a parsable string), so finding and removing or editing this data manually could be a nightmare.
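To make the orphan problem concrete, here is a minimal sketch in Python with an in-memory SQLite database. The three tables are simplified stand-ins loosely modelled on Drupal's node, revision and comment tables; the column sets and the `find_orphans` helper are assumptions for illustration, not the real schema or any Drupal API.

```python
import sqlite3


def find_orphans(conn):
    """Return (table, nid) pairs for rows whose parent node no longer exists."""
    orphans = []
    # Table names are fixed here, so building the SQL string is safe.
    for table in ("node_revisions", "comments"):
        rows = conn.execute(
            f"SELECT DISTINCT t.nid FROM {table} AS t "
            "LEFT JOIN node AS n ON n.nid = t.nid "
            "WHERE n.nid IS NULL ORDER BY t.nid"
        ).fetchall()
        orphans.extend((table, nid) for (nid,) in rows)
    return orphans


def demo():
    conn = sqlite3.connect(":memory:")
    # Simplified, hypothetical schema: one parent table, two dependent tables.
    conn.executescript("""
        CREATE TABLE node (nid INTEGER PRIMARY KEY, type TEXT);
        CREATE TABLE node_revisions (vid INTEGER PRIMARY KEY, nid INTEGER);
        CREATE TABLE comments (cid INTEGER PRIMARY KEY, nid INTEGER);
        INSERT INTO node VALUES (1, 'page'), (2, 'story'), (3, 'story');
        INSERT INTO node_revisions VALUES (10, 1), (20, 2), (30, 3);
        INSERT INTO comments VALUES (100, 2), (101, 3);
    """)
    # The "direct" delete: only the parent rows are removed.
    conn.execute("DELETE FROM node WHERE type = 'story'")
    return find_orphans(conn)


if __name__ == "__main__":
    for table, nid in demo():
        print(f"orphaned row in {table} for nid {nid}")
```

Every table that a module keys on the node ID would need the same check, which is why the number of interacting modules matters so much.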

Also, nodes can store information on the server (uploaded images and files, for example) so you'd have to be aware of cleaning up that data as well.

If direct edits are unmanageable (which is most likely), I would suggest you simply break the process down into small batches with a function that still invokes the call-backs. This will let other modules clean up their data normally. It may take a lot longer than you'd like, but this is better for maintaining the integrity of your database.
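The batching idea can be sketched roughly as follows, again in Python rather than Drupal code. Both `delete_node` and `after_batch` are hypothetical callables: the first stands in for a per-node delete that still notifies other modules, the second for the deferred work (cache clear, search index update) done once per batch.

```python
def delete_in_batches(node_ids, delete_node, after_batch, batch_size=100):
    """Delete nodes in fixed-size batches and return the number deleted.

    delete_node(nid)   -- hypothetical per-node delete that still runs the
                          call-backs other modules rely on.
    after_batch(batch) -- hypothetical hook for work deferred to the end of
                          each batch, such as clearing the cache.
    """
    deleted = 0
    for start in range(0, len(node_ids), batch_size):
        batch = node_ids[start:start + batch_size]
        for nid in batch:
            delete_node(nid)
            deleted += 1
        # Expensive housekeeping happens once per batch, not once per node.
        after_batch(batch)
    return deleted


if __name__ == "__main__":
    removed = []
    batches = []
    total = delete_in_batches(list(range(250)), removed.append, batches.append)
    print(total, "nodes deleted in", len(batches), "batches")
```

Keeping the batch size small bounds the memory and time used by any single run, at the cost of more runs overall.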

...

John_Kenney - November 14, 2009 - 15:44

not sure it'll do millions, but this works very nicely for thousands.

http://drupal.org/project/revision_deletion

Better to rebuild, I guess?

yountod - November 17, 2009 - 15:22

Thank you both for the guidance. After careful consideration I've concluded the safest way to do this is to migrate the site to an aux server and CSV import the records I'd like to retain, omitting those that are to be eliminated. I would have gone the DEVEL auto-delete route if it supported cron-based renewal of failed delete batches, but I don't think it does.

If there were any other safe way to delete tens of thousands via cron jobs (continually batching until millions were deleted) I would try it, since I can afford to wait but not to sit at the console hitting F5. But I'm not aware of any such routine.
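For illustration only, the kind of routine described here could look something like the sketch below: each scheduled pass deletes a bounded number of nodes of one type and leaves the rest for the next pass, so a failed pass is simply picked up again later. It is written in Python against an in-memory SQLite database with a simplified, assumed schema; `cron_pass` and `delete_node` are hypothetical names, not Drupal or Devel functions.

```python
import sqlite3


def delete_node(conn, nid):
    # Hypothetical stand-in for a per-node delete that cleans up related rows.
    conn.execute("DELETE FROM node_revisions WHERE nid = ?", (nid,))
    conn.execute("DELETE FROM node WHERE nid = ?", (nid,))


def cron_pass(conn, node_type, limit, delete_one=delete_node):
    """Delete up to `limit` nodes of `node_type`; return how many remain."""
    nids = [row[0] for row in conn.execute(
        "SELECT nid FROM node WHERE type = ? ORDER BY nid LIMIT ?",
        (node_type, limit))]
    for nid in nids:
        try:
            delete_one(conn, nid)
            conn.commit()      # one transaction per node
        except Exception:
            conn.rollback()    # node stays in place and is retried next pass
    (remaining,) = conn.execute(
        "SELECT COUNT(*) FROM node WHERE type = ?", (node_type,)).fetchone()
    return remaining


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE node (nid INTEGER PRIMARY KEY, type TEXT)")
    conn.execute("CREATE TABLE node_revisions (vid INTEGER PRIMARY KEY, nid INTEGER)")
    conn.executemany(
        "INSERT INTO node VALUES (?, ?)",
        [(i, "story" if i % 2 else "page") for i in range(1, 21)])
    conn.executemany(
        "INSERT INTO node_revisions VALUES (?, ?)",
        [(i, i) for i in range(1, 21)])
    conn.commit()
    history = []
    # Each loop iteration plays the role of one cron run.
    while True:
        remaining = cron_pass(conn, "story", 4)
        history.append(remaining)
        if remaining == 0:
            break
    (pages,) = conn.execute(
        "SELECT COUNT(*) FROM node WHERE type = 'page'").fetchone()
    return history, pages


if __name__ == "__main__":
    print(demo())
```

Because the remaining nodes themselves act as the work queue, no separate progress record is needed. One caveat of this simple form is that a node which fails on every pass would be retried indefinitely, so a real routine would want to log or skip repeat failures.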

 
 
