Can FeedAPI automatically delete items when they are removed from the feed?

therzog - July 30, 2008 - 16:07
Project: FeedAPI
Version: 6.x-1.7-beta2
Component: Code feedapi_node
Category: feature request
Priority: normal
Assigned: Unassigned
Status: needs work
Description

I'm looking for a way for FeedAPI to delete items as soon as they fall off the feed, so that the stored items are exactly the same as what the feed was when last read.

We use Vocus, which lets our media officers arbitrarily pick news stories to be published on an RSS feed that we post to our site (in a block), so we need old stories to immediately drop off whenever they choose something new. On the other hand, we can't really use "Delete Items Older Than," because some stories may stay on the feed for a week or two.

The same might be true of a Flickr group or the feed for a del.icio.us tag.

Here's a solution, but it's not pretty, so perhaps you can think of a better way in a future version. The problem is the "unique" functionality is designed to compare a new item to previously read items in the database, but not the other way around. So at a high level (above the processor), it's impossible to identify items that aren't on the current feed and delete them. That's why this is built into feedapi_aggregator.module:

1) Extend _feedapi_aggregator_unique to optionally return the matching fiid (for internal use)
2) Extend _feedapi_aggregator_expire to pre-scan items from the feed and delete ones in the database that don't match.

--- feedapi_aggregator.module   30 Jul 2008 15:39:37 -0000      1.2
+++ feedapi_aggregator.module   30 Jul 2008 16:01:14 -0000
@@ -1,5 +1,5 @@
<?php
-// $Id: feedapi_aggregator.module,v 1.2 2008/07/30 15:39:37 cvsroot Exp $
+// $Id: feedapi_aggregator.module,v 1.1 2008/05/27 18:04:23 devseed Exp $

/**
  * @abstract This module emulates aggregator module with the feedapi framework.
@@ -194,6 +194,12 @@
         '#default_value' => 3,
         '#options' => drupal_map_assoc(array(2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)),
       );
+         $form['delete_missing'] = array(
+               '#type' => 'checkbox',
+               '#title' => t('Delete missing feed items'),
+               '#description' => t('If checked, previously read feed items will be removed when the feed is refreshed, if they are no longer in the feed (even if they haven\'t expired).'),
+               '#default_value' => 1,
+         );
       $categories_result = db_query('SELECT cid, title FROM {feedapi_aggregator_category}');
       $categories = array();
       while ($category = db_fetch_object($categories_result)) {
@@ -510,7 +516,7 @@
/**
  * Is this feed item created?
  */
-function _feedapi_aggregator_unique($feed_item, $feed_nid, $settings = array()) {
+function _feedapi_aggregator_unique($feed_item, $feed_nid, $settings = array(), $return_id=FALSE) {
   $entry = FALSE;
   if ($feed_item->options->guid) {
     $entry = db_fetch_object(db_query("SELECT iid FROM {feedapi_aggregator_item} WHERE feed_nid = %d AND guid = '%s'", $feed_nid, $feed_item->options->guid));
@@ -522,6 +528,11 @@
   else {
     $entry = db_fetch_object(db_query("SELECT iid FROM {feedapi_aggregator_item} WHERE feed_nid = %d AND title = '%s'", $feed_nid, $feed_item->title));
   }
+
+  if( $return_id ) {
+    return is_object($entry) ? $entry->iid : null;
+  }
+
   return is_object($entry) ? FALSE : TRUE;
}

@@ -552,6 +563,25 @@
       $count++;
     }
   }
+
+  $processor_settings = $settings['processors']['feedapi_aggregator'];
+  if ($processor_settings['delete_missing']) {
+    $items_to_keep = array();
+    foreach ($feed->items as $index => $item) {
+      if ($iid = module_invoke('feedapi_aggregator', 'feedapi_item', 'unique', $item, $feed->nid, $processor_settings, TRUE)) {
+        $items_to_keep[] = $iid;
+      }
+    }
+
+    if ($items_to_keep) {
+      $result = db_query('SELECT * FROM {feedapi_aggregator_item} WHERE feed_nid=%d AND iid NOT IN (%s)', $feed->nid, implode(',', $items_to_keep));
+      while ($item = db_fetch_object($result)) {
+        $item->fiid = $item->iid;
+        feedapi_expire_item($feed, $item);
+        $count++;
+      }
+    }
+  }
   return $count;
}
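
The core of the patch above is a set difference: stored items whose counterpart is no longer on the feed. A minimal standalone sketch of that idea follows. It is not the patch itself; the function name and the ID-to-GUID array shape are assumptions made for illustration, and it matches on GUID only, whereas the patch falls back to the title when no GUID is present.

```php
<?php
/**
 * Returns the stored item IDs that no longer appear on the feed.
 *
 * Hypothetical helper for illustration only.
 * $stored maps item ID => GUID (as it might be read from the database).
 * $feed_guids lists the GUIDs present in the freshly parsed feed.
 */
function sketch_find_missing_items(array $stored, array $feed_guids) {
  // An empty parse result is more likely a failed fetch than a feed that
  // really has no items, so nothing is reported as missing in that case.
  if (empty($feed_guids)) {
    return array();
  }
  // Flip the list so each lookup is a key check rather than a scan.
  $on_feed = array_flip($feed_guids);
  $missing = array();
  foreach ($stored as $iid => $guid) {
    if (!isset($on_feed[$guid])) {
      $missing[] = $iid;
    }
  }
  return $missing;
}

$stored = array(11 => 'guid-a', 12 => 'guid-b', 13 => 'guid-c');
$missing = sketch_find_missing_items($stored, array('guid-b', 'guid-d'));
// $missing now holds the IDs 11 and 13.
```

The early return on an empty feed is a design choice of this sketch: it trades strict mirroring for protection against wiping every stored item after a bad fetch.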

I also had to change feedapi.module so that new items are read before the call to feedapi_expire, so the latter can see new items:

--- feedapi.module      27 May 2008 18:04:23 -0000      1.1
+++ feedapi.module      30 Jul 2008 16:00:30 -0000
@@ -1125,16 +1125,17 @@
   }
   $settings = feedapi_get_settings(NULL, $feed->nid);

-  // Step 1: Force processors to delete old items and determine the max. create elements.
-  $counter['expired'] = feedapi_expire($feed);
-
-  // Step 2: Get feed.
+  // Step 1: Get feed.
   $nid = $feed->nid;
   $hash_old = $feed->hash;
   $feed = _feedapi_call_parsers($feed, $feed->parsers, $feed->half_done);
   if (is_object($feed)) {
     $feed->hash = md5(serialize($feed->items));
   }
+
+  // Step 2: Force processors to delete old items and determine the max. create elements.
+  $counter['expired'] = feedapi_expire($feed);
+
   // Step 3: See, whether feed has been modified.
   if ($feed === FALSE || $hash_old == $feed->hash) {
     // Updated the checked field in any case.

#1

Jomel - October 19, 2008 - 03:13

Subscribing. This would be extremely useful, as we want to use RSS to mirror content from another site. I haven't yet had time to see if the patch applies against 1.4...

#2

doublejosh - March 12, 2009 - 21:34
Version: 5.x-1.2 » 5.x-1.5
Component: Code » Code feedapi_node

This would be great. I'd love a way to set a max number of feed items for a given feed. Or even just a way to set a max number to be fetched at cron or on refresh.

Possible already???

#3

Aron Novak - March 13, 2009 - 09:37

This is not possible at the moment. Also, unfortunately, I don't plan to add new features to the 5.x branch. If someone steps up with a patch, I will happily review it.

#4

doublejosh - March 16, 2009 - 19:18

Are maxes and/or throttling planned for V6?

I'm about to start on a feed aggregation project, that might enable me to spend time on this! :)

#5

quickcel - June 4, 2009 - 20:15

Subscribing

#6

quickcel - June 25, 2009 - 14:23

I've created a patch for the 6.x branch that I think provides the functionality that the original poster mentioned (thanks to therzog for some of the code).

Basically, there is a new option under the FeedAPI processor settings to delete old nodes that drop off the feed (see attachment for screenshot). If that box is checked, then when a feed is refreshed (manually or through cron), any items that have fallen off the feed will be deleted.

It's working well for a couple of feeds I'm using at the moment, but definitely needs some testing and review.

Attachment                  Size
6-25-2009 10-11-19 AM.png   15.27 KB
feedapi.module.patch        1.61 KB
feedapi_node.module.patch   1.95 KB

#7

LouBabe - July 2, 2009 - 23:16

Noticed that the changes to feedapi.module in the last patch only apply to the node processor, which might not be used in all cases. I don't like the idea of storing feed items as nodes and am working on a lighter weight processor, and am having the same "dropped" feeds issue.

#8

LouBabe - July 3, 2009 - 17:05
Version: 5.x-1.5 » 6.x-1.7-beta2

This is the code I ended up writing within my own processor to make this work. It's called during the expire operation. Works well in preliminary testing.

  if ($settings['processors']['feedapi_toolbox']['delete_missing']) {
    // Collect the GUIDs of every item currently on the feed.
    $guids_to_keep = array();
    foreach ($feed->parsers as $parser) {
      $result = module_invoke($parser, 'feedapi_feed', 'parse', $feed);
      // A parser may fail and not return a feed object.
      if (!is_object($result)) {
        continue;
      }
      foreach ($result->items as $item) {
        if (!empty($item->options->guid)) {
          $guids_to_keep[] = $item->options->guid;
        }
      }
    }
    if (!empty($guids_to_keep)) {
      // Pass the GUIDs as query arguments so db_query() escapes them.
      $args = array_merge(array($feed->nid), $guids_to_keep);
      $result = db_query("SELECT * FROM {feedapi_toolbox_item} WHERE nid = %d AND guid NOT IN (" . db_placeholders($guids_to_keep, 'varchar') . ")", $args);
      while ($item = db_fetch_object($result)) {
        // We callback feedapi for deleting
        feedapi_expire_item($feed, $item);
      }
    }
  }

NOTE: Code snippet updated since original post.

#9

chrism2671 - July 16, 2009 - 09:50

Has any of this been committed to the dev branch? It seems like mirroring the RSS feed should be a relatively trivial operation...

#10

quickcel - July 16, 2009 - 12:35

@chrism2671 - I do not believe so

#11

DavidWhite - July 16, 2009 - 13:48

Subscribe

#12

alex_b - July 16, 2009 - 15:37

We've recently had a similar requirement. I'm way too paranoid to actually delete items when they are not found on the feed (what happens if the feed blips?), so I opted for unpublishing them. I wrote it as a separate processor and called it FeedAPI garbage collector :)

Here's the module. If somebody wants to run with it and maintain it on d.o. as a separate project, I'm all for it.

Attachment           Size
feedapi_gc.tar_.gz   1.34 KB

#13

CheezItMan - August 8, 2009 - 09:23

I have another problem: when I hit refresh on the feed to start creating nodes from a Google Calendar, I get a blank white screen.

#14

pdcarto - August 21, 2009 - 12:33

#12 works for me. Simple and clean.

#15

philpro - November 5, 2009 - 22:32

Sweet! It works. And if I change line 45 in "feedapi_gc.module" from

<?php
  db_query('UPDATE {node} SET status = 0 WHERE nid = %d', $feed_item->nid);
?>

to

<?php
  node_delete($feed_item->nid);
?>

It deletes those offending nodes. Thanks for this module.

#16

johnmunro - November 16, 2009 - 08:11

#12 not working for me.

The 'FeedAPI Garbage Collector' appears correctly as a processor on the content type, but it fails to appear on the actual feed item. Only the 'FeedAPI Node' processor is shown, and no items are ever unpublished (or deleted with #15's code).

#17

Copyfight - November 16, 2009 - 12:12

subscribing

#18

quickcel - November 16, 2009 - 17:10

If you manually refresh the feed, will it unpublish items?

If you can manually refresh the feeds and they work OK, but they don't update with a cron run, you may want to check out this discussion: http://drupal.org/node/580508

#19

johnmunro - November 17, 2009 - 02:54
Status: active » needs review

I can now understand why no FeedAPI GC processor is shown when editing the feed item (not the content type): there is no function feedapi_gc_feedapi_settings_form. This seems OK, in that no settings are required.

Strangely however, the function feedapi_gc_feedapi_after_refresh never runs on a manual or cron refresh for me (I'm using Firebug for Drupal to check this).

...

I have now found out why. The module_feedapi_after_refresh hook only gets to run if there has been a change in the actual feed. In my testing, my test machine had run cron after some items were removed from the feed but before the gc module was added.

Making a small change in the feed kicked it into gear.
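
Judging by the feedapi.module excerpt quoted in the original post, "a change in the actual feed" is detected by comparing an md5 hash of the serialized items against the hash stored from the previous refresh. A rough standalone sketch of that check, with hypothetical function names and plain strings standing in for parsed items:

```php
<?php
// Same fingerprint the quoted refresh code computes: md5 over the
// serialized item list.
function sketch_items_hash(array $items) {
  return md5(serialize($items));
}

// When the fingerprint matches the stored one, the refresh stops early,
// which is why the after-refresh hook never fired on an unchanged feed.
function sketch_feed_changed($hash_old, array $items) {
  return $hash_old !== sketch_items_hash($items);
}

$hash_old = sketch_items_hash(array('guid-a', 'guid-b'));

// Identical items: treated as unchanged.
$unchanged = sketch_feed_changed($hash_old, array('guid-a', 'guid-b'));
// One item dropped off the feed: treated as changed.
$changed = sketch_feed_changed($hash_old, array('guid-a'));
```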

This extra module approach seems the most elegant and bug-free way to solve this issue. Maybe it can be added to the standard FeedAPI add-ons?

I suggest the following changes:
Change feedapi_gc.info from:

name = FeedAPI Garbage Collector
description = This FeedAPI processor unpublishes feed item nodes that are not present on the feed.
core = 6.x

To:

name = FeedAPI Garbage Collector
description = This FeedAPI processor unpublishes feed item nodes that are not present on the feed.
dependencies[] = feedapi
package = FeedAPI Add On
core = 6.x

AND add the following to the end of the README.txt

After installation, the feed must actually change before the garbage collector
will run.

#20

johnmunro - November 17, 2009 - 03:11

Attaching an updated version of alex_b's module for review.

Attachment           Size
feedapi_gc.tar_.gz   1.41 KB

#21

johnmunro - November 17, 2009 - 03:24
Status: needs review » needs work

Whoops, the above attachment says 'items _deleted_' rather than 'items unpublished'

For flexibility, also inserted this code at line 45 of feedapi_gc.module

      // TO DO: change the following into an admin preference
      // Uncomment the next line to delete rather than unpublish
      //node_delete($feed_item->nid);
      // Comment the next line to delete rather than unpublish

Attachment           Size
feedapi_gc.tar_.gz   1.49 KB

#22

johnmunro - November 20, 2009 - 00:14

Another issue has arisen with the message feedback to the user.

    drupal_set_message(t('!checked items checked, !removed items removed.', array('!checked' => $checked, '!removed' => $removed)));

is good and helpful unless it is a cron run, in which case the message is shown to the next anonymous visitor, confusing them. Therefore something like this code is needed:
  if (!$cron) {
    drupal_set_message(t('!checked items checked, !removed items removed.', array('!checked' => $checked, '!removed' => $removed)));
  }

However, the hook has no access to the $cron variable of the calling function, _feedapi_invoke_refresh in feedapi.module, so what to do? As I see it, there needs to be an API change to pass this variable to hook_feedapi_after_refresh.

Namely, feedapi.module line ~1278:

  foreach (module_implements('feedapi_after_refresh') as $module) {
    $func = $module .'_feedapi_after_refresh';
    $func($feed);
  }

Change to:
  foreach (module_implements('feedapi_after_refresh') as $module) {
    $func = $module .'_feedapi_after_refresh';
    $func($feed, $cron);
  }

Or can someone think of a less disruptive method? No feedback message is one way...
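
On how disruptive the change would be: PHP silently ignores extra arguments passed to user-defined functions, so existing implementations that accept only $feed should keep working if the caller starts passing $cron. The sketch below illustrates this outside Drupal. The module names oldmod and newmod are hypothetical, the module list replaces module_implements(), and messages are returned instead of going through drupal_set_message() so the result can be inspected.

```php
<?php
// Hypothetical implementation written against the current one-argument
// signature; the extra $cron argument is simply ignored.
function oldmod_feedapi_after_refresh($feed) {
  return 'oldmod saw feed ' . $feed->nid;
}

// Hypothetical implementation that opts in to the extra argument and
// stays quiet during cron runs.
function newmod_feedapi_after_refresh($feed, $cron = FALSE) {
  return $cron ? NULL : 'newmod saw feed ' . $feed->nid;
}

// Stand-in for the loop in feedapi.module. $modules replaces
// module_implements() so the sketch runs without Drupal.
function sketch_invoke_after_refresh(array $modules, $feed, $cron) {
  $messages = array();
  foreach ($modules as $module) {
    $func = $module . '_feedapi_after_refresh';
    $message = $func($feed, $cron);
    if ($message !== NULL) {
      $messages[] = $message;
    }
  }
  return $messages;
}

$feed = new stdClass();
$feed->nid = 42;
$modules = array('oldmod', 'newmod');

$manual = sketch_invoke_after_refresh($modules, $feed, FALSE);
$during_cron = sketch_invoke_after_refresh($modules, $feed, TRUE);
```

Declaring the new parameter with a default value (`$cron = FALSE`) also keeps an updated implementation callable by an older feedapi.module that still passes only $feed.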

 
 
