Can FeedAPI automatically delete items when they are removed from the feed?
| Project: | FeedAPI |
| Version: | 6.x-1.7-beta2 |
| Component: | Code feedapi_node |
| Category: | feature request |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | needs work |
I'm looking for a way for FeedAPI to delete items as soon as they fall off the feed, so that the stored items are exactly the same as what the feed was when last read.
We use Vocus, which lets our media officers arbitrarily pick news stories to be published on an RSS feed that we post to our site (in a block), so we need old stories to immediately drop off whenever they choose something new. On the other hand, we can't really use "Delete Items Older Than," because some stories may stay on the feed for a week or two.
But the same might be true of a flickr group or the feed for a del.icio.us tag.
Here's a solution, but it's not pretty, so perhaps you can think of a better way in a future version. The problem is the "unique" functionality is designed to compare a new item to previously read items in the database, but not the other way around. So at a high level (above the processor), it's impossible to identify items that aren't on the current feed and delete them. That's why this is built into feedapi_aggregator.module:
1) Extend _feedapi_aggregator_unique to optionally return the matching fiid (for internal use)
2) Extend _feedapi_aggregator_expire to pre-scan items from the feed and delete ones in the database that don't match.
--- feedapi_aggregator.module 30 Jul 2008 15:39:37 -0000 1.2
+++ feedapi_aggregator.module 30 Jul 2008 16:01:14 -0000
@@ -1,5 +1,5 @@
<?php
-// $Id: feedapi_aggregator.module,v 1.2 2008/07/30 15:39:37 cvsroot Exp $
+// $Id: feedapi_aggregator.module,v 1.1 2008/05/27 18:04:23 devseed Exp $
/**
* @abstract This module emulates aggregator module with the feedapi framework.
@@ -194,6 +194,12 @@
'#default_value' => 3,
'#options' => drupal_map_assoc(array(2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)),
);
+ $form['delete_missing'] = array(
+ '#type' => 'checkbox',
+ '#title' => t('Delete missing feed items'),
+ '#description' => t('If checked, previously read feed items will be removed when the feed is refreshed, if they are no longer in the feed (even if they haven\'t expired).'),
+ '#default_value' => 1,
+ );
$categories_result = db_query('SELECT cid, title FROM {feedapi_aggregator_category}');
$categories = array();
while ($category = db_fetch_object($categories_result)) {
@@ -510,7 +516,7 @@
/**
* Is this feed item created?
*/
-function _feedapi_aggregator_unique($feed_item, $feed_nid, $settings = array()) {
+function _feedapi_aggregator_unique($feed_item, $feed_nid, $settings = array(), $return_id=FALSE) {
$entry = FALSE;
if ($feed_item->options->guid) {
$entry = db_fetch_object(db_query("SELECT iid FROM {feedapi_aggregator_item} WHERE feed_nid = %d AND guid = '%s'", $feed_nid, $feed_item->options->guid));
@@ -522,6 +528,11 @@
else {
$entry = db_fetch_object(db_query("SELECT iid FROM {feedapi_aggregator_item} WHERE feed_nid = %d AND title = '%s'", $feed_nid, $feed_item->title));
}
+
+ if( $return_id ) {
+ return is_object($entry) ? $entry->iid : null;
+ }
+
return is_object($entry) ? FALSE : TRUE;
}
@@ -552,6 +563,25 @@
$count++;
}
}
+
+ $processor_settings = $settings['processors']['feedapi_aggregator'];
+ if( $processor_settings['delete_missing'] ) {
+ $items_to_keep = array();
+ foreach( $feed->items as $index => $item) {
+ if( $iid = module_invoke('feedapi_aggregator', 'feedapi_item', 'unique', $item, $feed->nid, $processor_settings, TRUE) ) {
+ $items_to_keep[] = $iid;
+ }
+ }
+
+ if( $items_to_keep ) {
+ $result = db_query('SELECT * FROM {feedapi_aggregator_item} WHERE feed_nid=%d AND iid NOT IN (%s)', $feed->nid, implode(',', $items_to_keep));
+ while( $item = db_fetch_object($result) ) {
+ $item->fiid = $item->iid;
+ feedapi_expire_item($feed, $item);
+ $count++;
+ }
+ }
+ }
return $count;
}
I also had to change feedapi.module so that new items are read before the call to feedapi_expire, so the latter can see new items:
--- feedapi.module 27 May 2008 18:04:23 -0000 1.1
+++ feedapi.module 30 Jul 2008 16:00:30 -0000
@@ -1125,16 +1125,17 @@
}
$settings = feedapi_get_settings(NULL, $feed->nid);
- // Step 1: Force processors to delete old items and determine the max. create elements.
- $counter['expired'] = feedapi_expire($feed);
-
- // Step 2: Get feed.
+ // Step 1: Get feed.
$nid = $feed->nid;
$hash_old = $feed->hash;
$feed = _feedapi_call_parsers($feed, $feed->parsers, $feed->half_done);
if (is_object($feed)) {
$feed->hash = md5(serialize($feed->items));
}
+
+ // Step 2: Force processors to delete old items and determine the max. create elements.
+ $counter['expired'] = feedapi_expire($feed);
+
// Step 3: See, whether feed has been modified.
if ($feed === FALSE || $hash_old == $feed->hash) {
// Updated the checked field in any case.
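The core of the approach above is a set difference: anything stored for the feed whose GUID is not in the freshly parsed feed gets expired. A minimal sketch of just that comparison, in plain PHP with no Drupal calls; the function name and the ID => GUID array shape are hypothetical, chosen only for illustration:

```php
<?php
/**
 * Return the stored item IDs whose GUID is no longer in the feed.
 *
 * @param array $stored  Map of stored item ID => GUID (hypothetical shape).
 * @param array $current List of GUIDs found in the freshly parsed feed.
 */
function example_missing_item_ids(array $stored, array $current) {
  // Mirror the patch's "if( $items_to_keep )" guard: an empty parse result
  // deletes nothing rather than everything.
  if (empty($current)) {
    return array();
  }
  // array_diff() compares values and preserves keys, so the keys of the
  // result are the IDs of items that dropped off the feed.
  return array_keys(array_diff($stored, $current));
}

$stored = array(10 => 'guid-a', 11 => 'guid-b', 12 => 'guid-c');
$current = array('guid-b', 'guid-d');
print_r(example_missing_item_ids($stored, $current));
```

With this sample data, items 10 and 12 are reported as missing, while 11 is kept because its GUID is still on the feed.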
#1
Subscribing. This would be extremely useful, as we want to use RSS to mirror content from another site. I haven't yet had time to see if the patch applies against 1.4...
#2
This would be great. I'd love a way to set a max number of feed items for a given feed. Or even just a way to set a max number to be fetched at cron or on refresh.
Possible already???
#3
This is not possible at the moment. Unfortunately, I also don't plan to add new features to the 5.x branch. If someone steps up with a patch, I'll happily review it.
#4
Are maxes and/or throttling planned for V6?
I'm about to start on a feed aggregation project, that might enable me to spend time on this! :)
#5
Subscribing
#6
I've created a patch for the 6.x branch that I think provides the functionality that the original poster mentioned (thanks to therzog for some of the code).
Basically, there is a new option under the FeedApi Processor settings to delete old nodes that drop off the feed (see attachment for screenshot). If that box is checked then when a feed is refreshed (manually or through cron) any items that have fallen off the feed will be deleted.
It's working well for a couple of feeds I'm using at the moment, but definitely needs some testing and review.
#7
Noticed that the changes to feedapi.module in the last patch only apply to the node processor, which might not be used in all cases. I don't like the idea of storing feed items as nodes and am working on a lighter weight processor, and am having the same "dropped" feeds issue.
#8
This is the code I ended up writing within my own processor to make this work. It's called during the expire operation. Works well in preliminary testing.
if ($settings['processors']['feedapi_toolbox']['delete_missing']) {
  $guids_to_keep = array();
  foreach ($feed->parsers as $parser) {
    $result = module_invoke($parser, 'feedapi_feed', 'parse', $feed);
    foreach ($result->items as $item) {
      if (!empty($item->options->guid)) {
        // Collect raw GUIDs; db_query() escapes them through the placeholders below.
        $guids_to_keep[] = $item->options->guid;
      }
    }
  }
  if (!empty($guids_to_keep)) {
    // One '%s' placeholder per GUID, so feed-supplied values are never concatenated into the SQL.
    $placeholders = db_placeholders($guids_to_keep, 'varchar');
    $args = array_merge(array($feed->nid), $guids_to_keep);
    $result = db_query("SELECT * FROM {feedapi_toolbox_item} WHERE nid = %d AND guid NOT IN ($placeholders)", $args);
    while ($item = db_fetch_object($result)) {
      // Call back into FeedAPI to delete the item.
      feedapi_expire_item($feed, $item);
    }
  }
}
NOTE: Code snippet updated since original post.
#9
Has any of this been committed to the dev branch? It seems like mirroring the RSS feed should be a relatively trivial operation...
#10
@chrism2671 - I do not believe so
#11
Subscribe
#12
We've recently had a similar requirement. I'm way too paranoid to actually delete items when they are not found on the feed (what happens if the feed blips?), so I opted for unpublishing them. I wrote it as a separate processor and called it FeedAPI garbage collector :)
Here's the module. If somebody wants to run with it and maintain it on d.o. as a separate project, I'm all for it.
#13
I have another problem: when I hit refresh on the feed to start creating nodes from a Google Calendar, I get a blank white screen.
#14
#12 works for me. Simple and clean.
#15
Sweet! It works. And if I change line 45 in "feedapi_gc.module" from
<?php
db_query('UPDATE {node} SET status = 0 WHERE nid = %d', $feed_item->nid);
?>
to
<?php
node_delete($feed_item->nid);
?>
it deletes those offending nodes. Thanks for this module.
#16
#12 not working for me.
The 'FeedAPI Garbage Collector' is appearing in the content type as a Processor correctly, but it fails to appear in the actual feed item content. Only the 'FeedAPI Node' processor is showing and no items are ever unpublished (or deleted with #15's code)
#17
subscribing
#18
If you manually refresh the feed will it unpublish items?
If you can manually refresh the feeds and they work OK, but they don't update with a cron run you may want to check out this discussion: http://drupal.org/node/580508
#19
I can now understand why there is no FeedAPI GC processor shown when editing the feed item (not the content type): there is no function feedapi_gc_feedapi_settings_form. This seems OK, in that there are no settings required.
Strangely however, the function feedapi_gc_feedapi_after_refresh never runs on a manual or cron refresh for me (I'm using Firebug for Drupal to check this).
...
I have now found out why. The module_feedapi_after_refresh only gets to run if there has been a change in the actual feed. In my testing, my test machine had run cron after some items were removed from the feed but before adding the gc module.
Making a small change in the feed kicked it into gear.
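The behaviour described here is consistent with the hash check visible in the feedapi.module diff earlier in this thread (`md5(serialize($feed->items))` compared against the stored hash). A minimal sketch of that comparison in plain PHP; the function names are hypothetical and this is not FeedAPI's actual code path:

```php
<?php
// Hash the parsed items the same way the diff above does.
function example_feed_hash(array $items) {
  return md5(serialize($items));
}

// TRUE only when the parsed items differ from what was hashed last time.
function example_feed_changed($old_hash, array $items) {
  return $old_hash !== example_feed_hash($items);
}

$items = array('guid-a', 'guid-b');
$old_hash = example_feed_hash($items);

// Same items as the last refresh: no change detected, so change-gated work is skipped.
var_dump(example_feed_changed($old_hash, $items));
// One item dropped off the feed: the hash differs and the refresh proceeds.
var_dump(example_feed_changed($old_hash, array('guid-a')));
```

If the stored hash already reflects the shortened feed, as in the test scenario above where cron ran before the module was enabled, nothing looks changed until the feed is modified again.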
This extra module approach seems the most elegant and bug free approach to solve this issue. Maybe it can be added to the feedapi standard add ons?
I suggest the following changes:
Change feedapi_gc.info from:
name = FeedAPI Garbage Collector
description = This FeedAPI processor unpublishes feed item nodes that are not present on the feed.
core = 6.x
To:
name = FeedAPI Garbage Collector
description = This FeedAPI processor unpublishes feed item nodes that are not present on the feed.
dependencies[] = feedapi
package = FeedAPI Add On
core = 6.x
AND add the following to the end of the README.txt
After installation, the feed must actually change before the garbage collector will run.
#20
Attaching update version of alex_b's module for review
#21
Whoops, the above attachment says 'items _deleted_' rather than 'items unpublished'
For flexibility, also inserted this code at line 45 of feedapi_gc.module
// TO DO: change the following into an admin preference
// Uncomment the next line to delete rather than unpublish
//node_delete($feed_item->nid);
// Comment the next line to delete rather than unpublish
#22
Another issue has arisen with the message feedback to the user.
drupal_set_message(t('!checked items checked, !removed items removed.', array('!checked' => $checked, '!removed' => $removed)));
is good and helpful unless it is a cron run, in which case the message is shown to the next anonymous visitor, confusing them. Therefore something like this code is needed:
if (!$cron) {
  drupal_set_message(t('!checked items checked, !removed items removed.', array('!checked' => $checked, '!removed' => $removed)));
}
However, the $cron variable lives in the calling function in feedapi.module, _feedapi_invoke_refresh, and the hook has no access to it, so what to do? As I see it, there needs to be an API change to pass this variable to hook_feedapi_after_refresh. Namely, at feedapi.module line ~1278:
foreach (module_implements('feedapi_after_refresh') as $module) {
  $func = $module .'_feedapi_after_refresh';
  $func($feed);
}
Change to:
foreach (module_implements('feedapi_after_refresh') as $module) {
  $func = $module .'_feedapi_after_refresh';
  $func($feed, $cron);
}
Or can someone think of a less disruptive method? Showing no feedback message is one way...
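On how disruptive the proposed change would be: PHP lets a user-defined function be called with more arguments than it declares, so existing hook implementations that only accept $feed should keep working if the caller starts passing $cron as well. A minimal sketch in plain PHP; all function names here are hypothetical stand-ins, not FeedAPI code:

```php
<?php
// Existing-style implementation: declares only $feed.
function example_old_after_refresh($feed) {
  return 'old:' . $feed->nid;
}

// Updated implementation: opts in to the extra $cron argument.
function example_new_after_refresh($feed, $cron = FALSE) {
  // Stay silent during cron so no message leaks to the next visitor.
  return $cron ? NULL : 'message for ' . $feed->nid;
}

// Stand-in for the invoking loop, always passing both arguments.
function example_invoke_after_refresh(array $callbacks, $feed, $cron) {
  $results = array();
  foreach ($callbacks as $func) {
    // The extra argument is simply ignored by callbacks that do not declare it.
    $results[$func] = $func($feed, $cron);
  }
  return $results;
}

$feed = new stdClass();
$feed->nid = 42;
$callbacks = array('example_old_after_refresh', 'example_new_after_refresh');
var_dump(example_invoke_after_refresh($callbacks, $feed, TRUE));
```

Under that assumption, only modules that want to suppress their message during cron would need to change their signature.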