I am struggling with how to approach a data management issue. I have about 20,000 pieces of data that I need to import into Drupal so they can be listed, searched, commented on, etc. I tried importing and found that Drupal doesn't handle large numbers of nodes very well. In particular, I can't easily delete large numbers of nodes, so once the data is there, it's not going anywhere. What I did find, though, is that FeedAPI manages the RSS nodes better, and I can delete groups at a time without repercussions. So I'm debating whether I should just import my data as story nodes or set up an RSS server to stream it through FeedAPI, which would allow better categorization and management. Any thoughts?

Thanks,
Ryan

Comments

yelvington

Drupal handles the existence of large numbers of nodes quite well, but deletions are another matter.

In order to accommodate random extensions that might have been implemented in third-party modules through the hook system, every node deletion tool I've seen uses PHP to iterate through a list of nodes to be deleted, repeatedly calling node_delete(), which in turn calls node_load(), which in turn fires every possible hook_load() implementation, and pretty soon you've got open warfare between table locks and active users. The users lose, of course.
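To put a rough number on that, here is a toy model in Python. It is not Drupal code: the `delete_nodes` function, the no-op hooks, and the module counts are all made up for illustration. It only shows how the work multiplies when every node has to be loaded and deleted individually, with each module's hooks running each time.

```python
def delete_nodes(nids, load_hooks, delete_hooks):
    """Toy model of hook-driven deletion: one load pass and one delete pass per node."""
    calls = 0
    for nid in nids:
        for hook in load_hooks:      # loading the node runs every load hook
            hook(nid)
            calls += 1
        for hook in delete_hooks:    # then every module gets a chance to clean up
            hook(nid)
            calls += 1
    return calls


# Hypothetical numbers: 20,000 nodes, 10 modules with a load hook, 5 with a delete hook.
noop = lambda nid: None
total = delete_nodes(range(20000), [noop] * 10, [noop] * 5)
print(total)  # 300000
```

In a real site each of those calls may also run its own queries, which is where the lock contention would come from.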

And this is what you've discovered, right?

The problem with "lightweight" RSS aggregation is that you lose all the benefits of nodes -- extensibility through the API, third-party modules for ranking/rating/recommendations, easy Views integration, etc.

I don't really have a solution for this. We're struggling with it, too. For node types where we know exactly what we're getting into, we're likely to write custom deletion code that takes advantage of InnoDB's transactional capabilities. That's what our ubergeeks are telling me right now, anyway.
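As a rough illustration of the transactional approach mentioned above, here is a minimal Python sketch. It uses the standard-library `sqlite3` module as a stand-in for MySQL/InnoDB, and the table list and the `delete_nodes_of_type` and `make_demo_db` helpers are assumptions for illustration, not Drupal's full schema or API. It bypasses the hook system entirely, so it would only suit node types where no module needs to react to a deletion.

```python
import sqlite3

# Hypothetical, simplified stand-ins for the tables that hold per-node data.
CHILD_TABLES = ("node_revisions", "node_comment_statistics")


def delete_nodes_of_type(conn, node_type):
    """Delete every node of one type, plus its child rows, in one transaction."""
    # "with conn" commits on success and rolls back if any statement raises,
    # so the tables are never left half-deleted.
    with conn:
        for table in CHILD_TABLES:
            conn.execute(
                f"DELETE FROM {table} WHERE nid IN "
                "(SELECT nid FROM node WHERE type = ?)",
                (node_type,),
            )
        cur = conn.execute("DELETE FROM node WHERE type = ?", (node_type,))
    return cur.rowcount


def make_demo_db():
    """Build an in-memory database with a few rows for demonstration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE node (nid INTEGER PRIMARY KEY, type TEXT)")
    for table in CHILD_TABLES:
        conn.execute(f"CREATE TABLE {table} (nid INTEGER)")
    rows = [(1, "story"), (2, "story"), (3, "page")]
    conn.executemany("INSERT INTO node VALUES (?, ?)", rows)
    for table in CHILD_TABLES:
        conn.executemany(f"INSERT INTO {table} VALUES (?)", [(r[0],) for r in rows])
    conn.commit()
    return conn
```

Calling `delete_nodes_of_type(make_demo_db(), "story")` removes the two story nodes and their child rows together; if any statement fails partway, nothing is deleted.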

ryanmnly

i'm not real familiar with the hooks and php functions. but i think we are talking about the same thing. you're right, it seems to handle the large amounts of data. but come time for deletion, it's not feasible to checkmark each one and delete 20,000 nodes via web page. so i opted to go into the tables and do it in mysql manually. the data appears to be spread over 4 tables. not that it isn't manageable to do it this way, but it makes me a little sick going into all of them and deleting. too much can go wrong.

the rss aggregation i talk about is a little different. with feedapi, you set up an rss stream. then, each item streamed is automatically turned into a node. you can set the type of node you want it to turn into. thus, you actually have a tangible node to work with. ranking, rating, recommendations, comments, views, etc are all fully functional. i really like it.

so what i can do is set up a bunch of smaller feeds in manageable groups. then when i delete an rss feed, it deletes all the associated nodes. this is the best i could do for systematic mass deletion. i suppose it kinda follows this process:

data flow: mysql --> rss server --> rss feed --> node

as opposed to:

data flow: mysql --> drupal server --> node

more steps, but manageable. and with the feed you can import your custom xml format and keep whatever database structure floats your boat. what do you think?
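a minimal python sketch of the "rss server" step, assuming the rows have already been pulled out of mysql into a list of dicts. the `chunk` and `build_feed` helpers, the batch size of 500, and the example.com urls are all placeholders i made up for illustration, not anything feedapi requires. it just splits the records into small groups and renders each group as its own rss 2.0 feed, so each feed (and its nodes) can be dropped independently.

```python
import xml.etree.ElementTree as ET


def chunk(records, size):
    """Split a list of records into consecutive groups of at most `size`."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def build_feed(title, link, records):
    """Render one group of records as an RSS 2.0 document (a string)."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = f"{len(records)} items"
    for rec in records:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = rec["title"]
        ET.SubElement(item, "description").text = rec["body"]
        # A stable guid gives each item a unique identity within the feed.
        ET.SubElement(item, "guid", isPermaLink="false").text = str(rec["id"])
    return ET.tostring(rss, encoding="unicode")


# Placeholder data standing in for rows fetched from MySQL.
records = [{"id": n, "title": f"Item {n}", "body": "..."} for n in range(1, 20001)]
feeds = [
    build_feed(f"Batch {i}", f"http://example.com/feeds/{i}", group)
    for i, group in enumerate(chunk(records, 500), start=1)
]
```

with 20,000 records and 500 per feed, that gives 40 feeds to register, each small enough to delete as a unit.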