Handle duplicates across feeds - X Feed Deduping
| Project: | FeedAPI |
| Version: | 5.x-1.x-dev |
| Component: | Code |
| Category: | feature request |
| Priority: | normal |
| Assigned: | alex_b |
| Status: | closed |
Currently, if duplicate articles come in from different feeds, the feedapi_node module does not recognize them as duplicates.
Note the feed_nid condition in the WHERE clause in _unique():
<?php
// Counts items with this URL, but only within the current feed (feed_nid).
$count = db_num_rows(db_query("SELECT fiid FROM {feedapi_node_item} WHERE url = '%s' AND feed_nid = %d", $feed_item->options->original_url, $feed_nid));
?>
Especially if you are aggregating feeds from search engines, you will get lots of duplicates coming in from different feeds. I recently measured roughly 10% duplicates in an aggregator I am running.
We should handle those duplicates by allowing "one feed item to many feeds" relations.
* Link the feedapi_node_item table to the feedapi table through a cross table called feedapi_node_item_feed (feed_nid, feed_item_nid) and thus
* Test for duplicates across the entire feed item base and, if the feed item already exists, link it to the current feed rather than creating a new feed item.

#1
Correction:
* Link the feedapi_node_item table to the feedapi table through a cross table called feedapi_node_item_feed (feed_nid, feed_item_nid)
* Test for duplicates across the entire feed item base and, if the feed item already exists, link it to the current feed rather than creating a new feed item.
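The two points above can be sketched outside Drupal. The following is only an illustration in Python with sqlite3, not the module's code: the cross table name and its two columns come from this issue, while the `nid` and `url` column layout of feedapi_node_item and the function names `make_schema` and `link_or_create` are assumptions made for the example.

```python
import sqlite3


def make_schema(conn):
    """Create a simplified, assumed version of the two tables."""
    conn.executescript("""
        CREATE TABLE feedapi_node_item (
            nid INTEGER PRIMARY KEY,
            url TEXT NOT NULL UNIQUE
        );
        -- Cross table: one feed item can belong to many feeds.
        CREATE TABLE feedapi_node_item_feed (
            feed_nid INTEGER NOT NULL,
            feed_item_nid INTEGER NOT NULL REFERENCES feedapi_node_item (nid),
            PRIMARY KEY (feed_nid, feed_item_nid)
        );
    """)


def link_or_create(conn, url, feed_nid):
    """Return (item nid, created flag) for the item with this URL.

    The lookup has no feed_nid condition, so it tests for duplicates
    across the entire feed item base.
    """
    row = conn.execute(
        "SELECT nid FROM feedapi_node_item WHERE url = ?", (url,)
    ).fetchone()
    if row is None:
        cur = conn.execute(
            "INSERT INTO feedapi_node_item (url) VALUES (?)", (url,)
        )
        item_nid, created = cur.lastrowid, True
    else:
        item_nid, created = row[0], False
    # Link the item to the current feed; a repeated link is ignored.
    conn.execute(
        "INSERT OR IGNORE INTO feedapi_node_item_feed (feed_nid, feed_item_nid) "
        "VALUES (?, ?)",
        (feed_nid, item_nid),
    )
    return item_nid, created
```

Calling `link_or_create` twice with the same URL and two different feed ids leaves one row in feedapi_node_item and two rows in the cross table.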
#2
This functionality has to be optional in order to cover these two use cases of feedapi:
* Community site allows members to add feeds to their own space on the site and manage their feeds and feed items. Even if there are duplicate feed items from duplicate feeds in the system, it is important to keep them as duplicates, because they belong to different users.
* Aggregate information from many different feeds into a single space on the web site, using subscriptions to keyword search engines like blogsearch.google.com or technorati.com. Here you would want to eliminate duplicates coming in from different feeds.
I suggest making this option a global feedapi setting on the feedapi settings page:
Duplicate handling
Allow duplicate items from *different* feeds.
(*) Yes
( ) No
If you allow duplicate items from different feeds, a feed item will be created if it does not exist for the feed it is coming in from. If you do not allow duplicate items from different feeds, a feed item will only be created if it does not exist on *any* feed in the system.
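The rule described above reduces to a small decision function. This is a sketch in Python rather than module code; the function name `should_create_item` and its parameters are hypothetical and exist only to make the two cases explicit.

```python
def should_create_item(feeds_with_item, feed_nid, allow_cross_feed_duplicates):
    """Decide whether an incoming feed item should be created.

    feeds_with_item: set of feed ids that already hold an item with
    the same URL (empty if the item is unknown to the system).
    """
    if allow_cross_feed_duplicates:
        # "Yes": only the feed the item is coming in from matters.
        return feed_nid not in feeds_with_item
    # "No": the item must not exist on any feed in the system.
    return len(feeds_with_item) == 0
```

With the setting at "Yes", an item already present on feed 1 is still created when it arrives on feed 2; with "No", it is not.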
#3
Patch coming
#4
Subscribing.
Thanks for your effort!
greetings,
Martijn
#5
Here is the first version of the patch.
Apply it to the latest development version (DRUPAL-5) and run update.php.
Basic functionality:
* X Feed deduping is optional. It is a feature of the FeedAPI node module, and you have to activate it on a per content type basis on the content type edit page of your feed.
* There is a new cross table (feedapi_node_item_feed) that stores feed / feed item relations.
* If a feed item appears on a feed other than its first feed, it will not be stored again, but LINKED to the new feed through the feedapi_node_item_feed table. Linking causes a load and update of the node representation of the duplicate feed item, so add-on modules can act on a found duplicate.
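The idea that linking gives other code a chance to react to a found duplicate can be illustrated with a small in-memory sketch. This is Python, not the patch itself: in the patch the reaction point is the node load and update, whereas the `FeedItemStore` class and its `on_duplicate` callback list here are hypothetical stand-ins.

```python
class FeedItemStore:
    """In-memory stand-in for feed items and their feed links."""

    def __init__(self):
        self.feeds_by_url = {}   # item URL -> set of feed ids linked to it
        self.on_duplicate = []   # callbacks run when a duplicate is linked

    def save(self, url, feed_nid):
        feeds = self.feeds_by_url.get(url)
        if feeds is None:
            # First sighting anywhere: store the item once.
            self.feeds_by_url[url] = {feed_nid}
            return "created"
        if feed_nid in feeds:
            # Already linked to this feed: nothing to do.
            return "unchanged"
        # Known item on a new feed: link instead of storing again,
        # then let interested code act on the duplicate.
        feeds.add(feed_nid)
        for callback in self.on_duplicate:
            callback(url, feed_nid, frozenset(feeds))
        return "linked"
```

A callback appended to `on_duplicate` is called only when an existing item gets linked to a new feed, not on first creation and not on a repeated save for the same feed.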
Changes:
This patch touches quite a lot of modules in the FeedAPI suite.
* FeedAPI Node: core functionality of x feed deduping. Moved storage from a closed one-step process (preparation of data and storage both in _feedapi_node_save()) to a two-step nodeapi process (preparation of data in _feedapi_node_save() and storage in nodeapi()). Updated test module.
* FeedAPI node views: use new cross table.
* FeedAPI inherit: inherit from multiple feeds.
* There are some updates of info files, ignore them.
Upgrade of existing content:
If you don't want to use x feed deduping on existing feeds and feed content types, patching the module and running update.php is fine.
However, if you would like to CONVERT your running system to support x feed deduping, use http://cvs.drupal.org/viewvc.py/drupal/contributions/sandbox/alex_b/feed... - install this module and run cron; it will gradually convert your entire system into an x-deduping enabled one. Turn off the module once the message "Ran feedapi_node_update_cron() but update already done. You can turn off feedapi_node_update module now." appears in the watchdog log.
The patch is fairly mature and we use it in a staging environment. I expect a bump or two, though. Tests and bug reports are highly appreciated.
#6
Patch re-rolled for feedapi 1.0
#7
Fix _feedapi_node_expire()
#8
Important warning: if you apply the patch and you have feedapi node views installed, please turn that module (feedapi node views) off and on again. This step is necessary because it re-creates the default views.
#9
I added PostgreSQL support and tested the patch. I cleaned it up to match our coding style and removed a MyISAM storage constraint (it did not look right to me).
#10
The underlying logic of this patch works fine; I have just checked it. It's ready to be committed.
#11
Aron, thank you for pgsql support and check on formatting.
Committed.
Feels good to have this one out of the way.
Alex
#12
Automatically closed -- issue fixed for two weeks with no activity.