Currently, if duplicate articles come in from different feeds, the feedapi_node module does not recognize them as duplicates.
Note the WHERE clause on feed_nid in _unique():
$count = db_num_rows(db_query("SELECT fiid FROM {feedapi_node_item} WHERE url = '%s' AND feed_nid = %d", $feed_item->options->original_url, $feed_nid));
Especially if you are aggregating feeds from search engines, you will get lots of duplicates coming in from different feeds. I recently measured ~10% duplicates in an aggregator I am running.
We should handle those duplicates by allowing "one feed item to many feeds" relations.
* Link the feedapi_node_item table to the feedapi table through a cross table called feedapi_node_item_feed (feed_nid, feed_item_nid) and thus
* Test for duplicates across the entire feed item base and, if a feed item already exists, link it to the current feed rather than creating a new feed item.
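
To make the difference between the current per-feed check and the proposed check across all feeds concrete, here is a minimal sketch in Python with an in-memory SQLite database. It is not the module's code: the table is a simplified stand-in for {feedapi_node_item} with only the three columns named in the query above, and the helper names `exists_in_feed` and `exists_anywhere` are made up for illustration.

```python
import sqlite3

# Simplified stand-in for {feedapi_node_item}; only the columns that
# appear in the _unique() query are modelled (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE feedapi_node_item ("
    "fiid INTEGER PRIMARY KEY, url TEXT NOT NULL, feed_nid INTEGER NOT NULL)"
)


def exists_in_feed(conn, url, feed_nid):
    """Per-feed check: mirrors the WHERE url = ... AND feed_nid = ... query."""
    count = conn.execute(
        "SELECT COUNT(*) FROM feedapi_node_item WHERE url = ? AND feed_nid = ?",
        (url, feed_nid),
    ).fetchone()[0]
    return count > 0


def exists_anywhere(conn, url):
    """Proposed check: look for the URL across the whole feed item base."""
    count = conn.execute(
        "SELECT COUNT(*) FROM feedapi_node_item WHERE url = ?",
        (url,),
    ).fetchone()[0]
    return count > 0


# The article was first stored while refreshing feed 1.
url = "http://example.com/article"
conn.execute(
    "INSERT INTO feedapi_node_item (url, feed_nid) VALUES (?, ?)", (url, 1)
)

# The same article now arrives through feed 2.
seen_by_feed_2 = exists_in_feed(conn, url, 2)
seen_globally = exists_anywhere(conn, url)
print(seen_by_feed_2, seen_globally)
```

With the per-feed check the article looks new to feed 2 and would be stored a second time; only the check without the feed_nid condition finds the existing item.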
| Comment | File | Size | Author |
|---|---|---|---|
| #9 | dedupe_2008_01_30.patch | 18.63 KB | aron novak |
| #7 | feedapi_1_0_x_dedupe.patch | 23.46 KB | alex_b |
| #6 | feedapi_1_0_x_dedupe.patch | 22.65 KB | alex_b |
| #5 | feedapi_x_dedupe.patch | 24.85 KB | alex_b |
Comments
Comment #1
alex_b commented
Correction:
* Link the feedapi_node_item table to the feedapi table through a cross table called feedapi_node_item_feed (feed_nid, feed_item_nid)
* Test for duplicates across the entire feed item base and, if a feed item already exists, link it to the current feed rather than creating a new feed item.
Comment #2
alex_b commented
This functionality has to be optional in order to cover these two use cases of feedapi:
* A community site allows members to add feeds to their own space on the site and manage their feeds and feed items. Even if there are duplicate feed items from duplicate feeds in the system, it is important to keep them as duplicates, because they belong to different users.
* A site aggregates information from many different feeds into a single space, using subscriptions to keyword search engines like blogsearch.google.com or technorati.com. Here you would want to eliminate duplicates coming in from different feeds.
I would suggest making this option a global feedapi setting on the feedapi settings page.
Comment #3
alex_b commented
Patch coming
Comment #4
summit commented
Subscribing.
Thanks for your effort!
greetings,
Martijn
Comment #5
alex_b commented
Here is the first version of the patch.
Apply it to the latest development version (DRUPAL-5) and run update.php.
Basic functionality:
* Cross-feed ("x feed") deduping is optional. It is a feature of the FeedAPI node module, and you have to activate it on a per content type basis on the content type edit page of your feed.
* There is a new cross table (feedapi_node_item_feed) that stores feed / feed item relations.
* If a feed item appears on a feed other than its first feed, it will not be stored again, but LINKED to the new feed through the feedapi_node_item_feed table. Linking causes a load and update of the node representation of the duplicate feed item, so add-on modules can act on a found duplicate.
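
The link-instead-of-store behaviour described in the list above can be sketched as follows. This is only an illustration in Python with SQLite, not the patch itself: the cross table name and its two columns come from this issue, while the reduced item table (`nid`, `url`), the function name `store_or_link` and the use of SQLite's `INSERT OR IGNORE` are assumptions made for the example. The node load and update that the patch triggers on a duplicate is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced item table (assumption): one row per distinct feed item.
conn.execute(
    "CREATE TABLE feedapi_node_item (nid INTEGER PRIMARY KEY, url TEXT UNIQUE)"
)
# Cross table from this issue: one row per feed / feed item relation.
conn.execute(
    "CREATE TABLE feedapi_node_item_feed ("
    "feed_nid INTEGER NOT NULL, feed_item_nid INTEGER NOT NULL, "
    "PRIMARY KEY (feed_nid, feed_item_nid))"
)


def store_or_link(conn, url, feed_nid):
    """Store a new item, or link an existing one to feed_nid.

    Returns (item_nid, created), where created is False for a duplicate.
    """
    row = conn.execute(
        "SELECT nid FROM feedapi_node_item WHERE url = ?", (url,)
    ).fetchone()
    if row is None:
        cur = conn.execute(
            "INSERT INTO feedapi_node_item (url) VALUES (?)", (url,)
        )
        item_nid, created = cur.lastrowid, True
    else:
        item_nid, created = row[0], False
    # Record the relation; repeated refreshes of the same feed are no-ops.
    conn.execute(
        "INSERT OR IGNORE INTO feedapi_node_item_feed (feed_nid, feed_item_nid) "
        "VALUES (?, ?)",
        (feed_nid, item_nid),
    )
    return item_nid, created


url = "http://example.com/article"
first = store_or_link(conn, url, 1)   # first sighting, on feed 1
second = store_or_link(conn, url, 2)  # same article, on feed 2
again = store_or_link(conn, url, 2)   # feed 2 refreshed again

item_count = conn.execute("SELECT COUNT(*) FROM feedapi_node_item").fetchone()[0]
link_count = conn.execute(
    "SELECT COUNT(*) FROM feedapi_node_item_feed"
).fetchone()[0]
print(first, second, item_count, link_count)
```

The item is stored once and ends up with two rows in the cross table, one per feed it appeared on.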
Changes:
This patch touches quite a lot of modules in the FeedAPI suite.
* FeedAPI Node: core functionality of x feed deduping. Moved the storage procedures from a closed one-step process (preparation of data and storage in _feedapi_node_save()) to a two-step nodeapi process (preparation of data in _feedapi_node_save() and storage in nodeapi()). Updated the test module.
* FeedAPI node views: use new cross table.
* FeedAPI inherit: inherit from multiple feeds.
* There are some updates of info files, ignore them.
Upgrade of existing content:
If you don't want to use x feed deduping on existing feeds and feed content types, patching the module and running update.php is fine.
However, if you would like to CONVERT your running system to support x feed deduping, use http://cvs.drupal.org/viewvc.py/drupal/contributions/sandbox/alex_b/feed... - install this module and run cron. It will gradually convert your entire system into an x-deduping enabled one. Turn off the module once the message "Ran feedapi_node_update_cron() but update already done. You can turn off feedapi_node_update module now." appears in the watchdog log.
The patch is fairly mature and we use it in a staging environment. I expect the odd bump, though. Testing and bug reports are highly appreciated.
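
The conversion module's code is not shown in this issue, so the following is only a guess at the general shape of a cron-driven, batch-wise conversion. It is a Python/SQLite sketch that copies legacy feed_nid values into the cross table a few rows per run. The legacy table layout, the batch size and the function name `convert_batch` are all hypothetical, and merging duplicates that already exist is not covered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical legacy layout: each item row still carries its feed_nid.
conn.execute(
    "CREATE TABLE feedapi_node_item ("
    "nid INTEGER PRIMARY KEY, url TEXT, feed_nid INTEGER NOT NULL)"
)
conn.execute(
    "CREATE TABLE feedapi_node_item_feed ("
    "feed_nid INTEGER NOT NULL, feed_item_nid INTEGER NOT NULL, "
    "PRIMARY KEY (feed_nid, feed_item_nid))"
)
conn.executemany(
    "INSERT INTO feedapi_node_item (url, feed_nid) VALUES (?, ?)",
    [("http://example.com/%d" % i, 1 + i % 2) for i in range(5)],
)


def convert_batch(conn, batch_size=2):
    """One 'cron run': link up to batch_size items that have no link yet.

    Returns the number of items converted; 0 means the update is done.
    """
    rows = conn.execute(
        "SELECT i.feed_nid, i.nid FROM feedapi_node_item i "
        "LEFT JOIN feedapi_node_item_feed l "
        "ON l.feed_item_nid = i.nid AND l.feed_nid = i.feed_nid "
        "WHERE l.feed_item_nid IS NULL ORDER BY i.nid LIMIT ?",
        (batch_size,),
    ).fetchall()
    conn.executemany(
        "INSERT INTO feedapi_node_item_feed (feed_nid, feed_item_nid) "
        "VALUES (?, ?)",
        rows,
    )
    return len(rows)


# Simulate four cron runs over five legacy items.
results = [convert_batch(conn) for _ in range(4)]
linked = conn.execute("SELECT COUNT(*) FROM feedapi_node_item_feed").fetchone()[0]
print(results, linked)
```

Working in small batches keeps each cron run short; a run that converts nothing corresponds to the "update already done" state mentioned above.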
Comment #6
alex_b commented
Patch re-rolled for feedapi 1.0
Comment #7
alex_b commented
Fix _feedapi_node_expire()
Comment #8
aron novak
Important warning: if you apply the patch and you have feedapi node views installed, please disable the feedapi node views module and enable it again. This re-creates the default views, which is necessary.
Comment #9
aron novak
I added PostgreSQL support and tested the patch. I tidied it up according to our coding style and removed a MyISAM storage constraint (it looked questionable to me).
Comment #10
aron novak
The underlying logic of this patch works fine; I have just checked it. It's ready to be committed.
Comment #11
alex_b commented
Aron, thank you for the pgsql support and the formatting check.
Committed.
Feels good to have this one out of the way.
Alex
Comment #12
Anonymous (not verified) commented
Automatically closed -- issue fixed for two weeks with no activity.
Comment #13
meghs commented
I am using the FeedAPI De-Dupe module to check for duplicate feed items across all feeds, but it is not working. Duplicates are still coming in. Are there any settings I am missing? I need serious help. The site is in production.