Can FeedAPI automatically delete items when they are removed from the feed?

therzog - July 30, 2008 - 16:07
Project:FeedAPI
Version:6.x-1.7-beta2
Component:Code feedapi_node
Category:feature request
Priority:normal
Assigned:Unassigned
Status:active
Description

I'm looking for a way for FeedAPI to delete items as soon as they fall off the feed, so that the stored items are exactly the same as what the feed was when last read.

We use Vocus, which lets our media officers arbitrarily pick news stories to be published on an RSS feed that we post to our site (in a block), so we need old stories to immediately drop off whenever they choose something new. On the other hand, we can't really use "Delete Items Older Than," because some stories may stay on the feed for a week or two.

The same need could arise for a Flickr group or the feed for a del.icio.us tag.

Here's a solution, but it's not pretty, so perhaps you can think of a better way in a future version. The problem is the "unique" functionality is designed to compare a new item to previously read items in the database, but not the other way around. So at a high level (above the processor), it's impossible to identify items that aren't on the current feed and delete them. That's why this is built into feedapi_aggregator.module:

1) Extend _feedapi_aggregator_unique to optionally return the matching item's iid (for internal use)
2) Extend _feedapi_aggregator_expire to pre-scan items from the feed and delete ones in the database that don't match.
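
The core of these two steps is a set difference between the item IDs stored in the database and the IDs that matched an item on the current feed. Here is a minimal sketch in plain PHP, with no Drupal calls. The function name find_missing_item_ids and the sample IDs are hypothetical and only illustrate the idea; they are not part of FeedAPI or of the patch.

```php
<?php
/**
 * Returns the stored item IDs that no longer appear on the feed.
 *
 * Hypothetical helper for illustration only; not part of FeedAPI.
 *
 * @param array $stored_ids  IDs of items currently in the database.
 * @param array $matched_ids IDs of stored items that matched an item on the feed.
 */
function find_missing_item_ids(array $stored_ids, array $matched_ids) {
  // If nothing matched, delete nothing. This mirrors the patch's
  // "if ($items_to_keep)" check.
  if (empty($matched_ids)) {
    return array();
  }
  // array_diff() preserves keys, so re-index the result.
  return array_values(array_diff($stored_ids, $matched_ids));
}

// Items 11 and 13 are still on the feed; 10 and 12 have dropped off.
$missing = find_missing_item_ids(array(10, 11, 12, 13), array(11, 13));
print_r($missing);
```

In the patch, the "matched" list is built by calling the extended unique check for each feed item, and the difference is computed in SQL with NOT IN rather than in PHP.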

--- feedapi_aggregator.module   30 Jul 2008 15:39:37 -0000      1.2
+++ feedapi_aggregator.module   30 Jul 2008 16:01:14 -0000
@@ -1,5 +1,5 @@
<?php
-// $Id: feedapi_aggregator.module,v 1.2 2008/07/30 15:39:37 cvsroot Exp $
+// $Id: feedapi_aggregator.module,v 1.1 2008/05/27 18:04:23 devseed Exp $

/**
  * @abstract This module emulates aggregator module with the feedapi framework.
@@ -194,6 +194,12 @@
         '#default_value' => 3,
         '#options' => drupal_map_assoc(array(2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)),
       );
+         $form['delete_missing'] = array(
+               '#type' => 'checkbox',
+               '#title' => t('Delete missing feed items'),
+               '#description' => t('If checked, previously read feed items will be removed when the feed is refreshed, if they are no longer in the feed (even if they haven\'t expired).'),
+               '#default_value' => 1,
+         );
       $categories_result = db_query('SELECT cid, title FROM {feedapi_aggregator_category}');
       $categories = array();
       while ($category = db_fetch_object($categories_result)) {
@@ -510,7 +516,7 @@
/**
  * Is this feed item created?
  */
-function _feedapi_aggregator_unique($feed_item, $feed_nid, $settings = array()) {
+function _feedapi_aggregator_unique($feed_item, $feed_nid, $settings = array(), $return_id=FALSE) {
   $entry = FALSE;
   if ($feed_item->options->guid) {
     $entry = db_fetch_object(db_query("SELECT iid FROM {feedapi_aggregator_item} WHERE feed_nid = %d AND guid = '%s'", $feed_nid, $feed_item->options->guid));
@@ -522,6 +528,11 @@
   else {
     $entry = db_fetch_object(db_query("SELECT iid FROM {feedapi_aggregator_item} WHERE feed_nid = %d AND title = '%s'", $feed_nid, $feed_item->title));
   }
+
+  if( $return_id ) {
+    return is_object($entry) ? $entry->iid : null;
+  }
+
   return is_object($entry) ? FALSE : TRUE;
}

@@ -552,6 +563,25 @@
       $count++;
     }
   }
+
+  $processor_settings = $settings['processors']['feedapi_aggregator'];
+  if( $processor_settings['delete_missing'] ) {
+    $items_to_keep = array();
+       foreach( $feed->items as $index => $item) {
+         if( $iid = module_invoke('feedapi_aggregator', 'feedapi_item', 'unique', $item, $feed->nid, $processor_settings, TRUE) ) {
+           $items_to_keep[] = $iid;
+         }
+       }
+
+       if( $items_to_keep ) {
+         $result = db_query('SELECT * FROM {feedapi_aggregator_item} WHERE feed_nid=%d AND iid NOT IN (%s)', $feed->nid, implode(',', $items_to_keep));
+         while( $item = db_fetch_object($result) ) {
+           $item->fiid = $item->iid;
+               feedapi_expire_item($feed, $item);
+               $count++;
+         }
+       }
+  }
   return $count;
}

I also had to change feedapi.module so that new items are read before the call to feedapi_expire, so the latter can see new items:

--- feedapi.module      27 May 2008 18:04:23 -0000      1.1
+++ feedapi.module      30 Jul 2008 16:00:30 -0000
@@ -1125,16 +1125,17 @@
   }
   $settings = feedapi_get_settings(NULL, $feed->nid);

-  // Step 1: Force processors to delete old items and determine the max. create elements.
-  $counter['expired'] = feedapi_expire($feed);
-
-  // Step 2: Get feed.
+  // Step 1: Get feed.
   $nid = $feed->nid;
   $hash_old = $feed->hash;
   $feed = _feedapi_call_parsers($feed, $feed->parsers, $feed->half_done);
   if (is_object($feed)) {
     $feed->hash = md5(serialize($feed->items));
   }
+
+  // Step 2: Force processors to delete old items and determine the max. create elements.
+  $counter['expired'] = feedapi_expire($feed);
+
   // Step 3: See, whether feed has been modified.
   if ($feed === FALSE || $hash_old == $feed->hash) {
     // Updated the checked field in any case.

#1

Jomel - October 19, 2008 - 03:13

Subscribing. This would be extremely useful, as we want to use RSS to mirror content from another site. I haven't yet had time to see if the patch applies against 1.4...

#2

doublejosh - March 12, 2009 - 21:34
Version:5.x-1.2» 5.x-1.5
Component:Code» Code feedapi_node

This would be great. I'd love a way to set a max number of feed items for a given feed. Or even just a way to set a max number to be fetched at cron or on refresh.

Possible already???

#3

Aron Novak - March 13, 2009 - 09:37

This is not possible at the moment. Unfortunately, I also don't plan to add new features to the 5.x branch. If someone steps up with a patch, I'll happily review it.

#4

doublejosh - March 16, 2009 - 19:18

Are maxes and/or throttling planned for V6?

I'm about to start on a feed aggregation project, that might enable me to spend time on this! :)

#5

quickcel - June 4, 2009 - 20:15

Subscribing

#6

quickcel - June 25, 2009 - 14:23

I've created a patch for the 6.x branch that I think provides the functionality that the original poster mentioned (thanks to therzog for some of the code).

Basically, there is a new option under the FeedAPI processor settings to delete old nodes that drop off the feed (see attachment for screenshot). If that box is checked, then whenever a feed is refreshed (manually or through cron), any items that have fallen off the feed are deleted.

It's working well for a couple of feeds I'm using at the moment, but definitely needs some testing and review.

Attachment                   Size
6-25-2009 10-11-19 AM.png    15.27 KB
feedapi.module.patch         1.61 KB
feedapi_node.module.patch    1.95 KB

#7

LouBabe - July 2, 2009 - 23:16

Noticed that the changes to feedapi.module in the last patch only apply to the node processor, which might not be used in all cases. I don't like the idea of storing feed items as nodes and am working on a lighter weight processor, and am having the same "dropped" feeds issue.

#8

LouBabe - July 3, 2009 - 17:05
Version:5.x-1.5» 6.x-1.7-beta2

This is the code I ended up writing within my own processor to make this work. It's called during the expire operation. Works well in preliminary testing.

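  if ($settings['processors']['feedapi_toolbox']['delete_missing']) {
    // Collect the GUIDs of every item currently on the feed.
    $guids_to_keep = array();
    foreach ($feed->parsers as $parser) {
      $result = module_invoke($parser, 'feedapi_feed', 'parse', $feed);
      // Skip parsers that failed to return a parsed feed.
      if (!is_object($result) || empty($result->items)) {
        continue;
      }
      foreach ($result->items as $item) {
        if (!empty($item->options->guid)) {
          $guids_to_keep[] = $item->options->guid;
        }
      }
    }
    if (!empty($guids_to_keep)) {
      // Use placeholders so that GUIDs containing quotes cannot break the query.
      $args = array_merge(array($feed->nid), $guids_to_keep);
      $result = db_query("SELECT * FROM {feedapi_toolbox_item} WHERE nid = %d AND guid NOT IN (" . db_placeholders($guids_to_keep, 'varchar') . ")", $args);
      while ($item = db_fetch_object($result)) {
        // Call back into FeedAPI to delete the item.
        feedapi_expire_item($feed, $item);
      }
    }
  }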

NOTE: Code snippet updated since original post.

#9

chrism2671 - July 16, 2009 - 09:50

Has any of this been committed to the dev branch? It seems like mirroring the RSS feed should be a relatively trivial operation...

#10

quickcel - July 16, 2009 - 12:35

@chrism2671 - I do not believe so

#11

DavidWhite - July 16, 2009 - 13:48

Subscribe

#12

alex_b - July 16, 2009 - 15:37

We've recently had a similar requirement. I'm way too paranoid to actually delete items when they are not found on the feed (what happens if the feed blips?), so I opted for unpublishing them. I wrote it as a separate processor and called it FeedAPI garbage collector :)

Here's the module. If somebody wants to run with it and maintain it on d.o. as a separate project, I'm all for it.

Attachment            Size
feedapi_gc.tar_.gz    1.34 KB
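
One way to picture the "feed blips" concern: a refresh that returns no items at all is more likely a fetch or parse failure than a feed that really emptied. A cautious cleanup can skip that run, and otherwise only mark items for unpublishing. Here is a minimal sketch in plain PHP. The function plan_cleanup, the array layout, and the skip-on-empty rule are assumptions for illustration and are not taken from the attached module.

```php
<?php
/**
 * Decides which stored items to unpublish after a refresh.
 *
 * Hypothetical helper for illustration only. $stored maps item ID => GUID;
 * $feed_guids lists the GUIDs seen on the feed in this refresh.
 */
function plan_cleanup(array $stored, array $feed_guids) {
  // An empty result most likely means the feed blipped: touch nothing.
  if (empty($feed_guids)) {
    return array();
  }
  // Flip to GUID => index so membership checks are simple isset() lookups.
  $on_feed = array_flip($feed_guids);
  $unpublish = array();
  foreach ($stored as $id => $guid) {
    if (!isset($on_feed[$guid])) {
      // Reversible: the item can be republished if it reappears.
      $unpublish[] = $id;
    }
  }
  return $unpublish;
}

$stored = array(1 => 'guid-a', 2 => 'guid-b', 3 => 'guid-c');
print_r(plan_cleanup($stored, array('guid-a', 'guid-c'))); // Item 2 is planned for unpublishing.
print_r(plan_cleanup($stored, array()));                   // Blip: nothing changes.
```

Unpublishing keeps the stored item, so a temporary gap in the feed costs nothing permanent, whereas a delete would lose the item until it is fetched again.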

#13

CheezItMan - August 8, 2009 - 09:23

I have another problem: when I hit refresh on the feed to start creating nodes from a Google Calendar, I get a blank white screen.

#14

pdcarto - August 21, 2009 - 12:33

#12 works for me. Simple and clean.

#15

philpro - November 5, 2009 - 22:32

Sweet! It works. And if I change line 45 in "feedapi_gc.module" from

<?php
  db_query('UPDATE {node} SET status = 0 WHERE nid = %d', $feed_item->nid);
?>

to

<?php
  node_delete($feed_item->nid);
?>

it deletes those offending nodes instead of unpublishing them. Thanks for this module.

 
 
