Currently, if duplicate articles come in from different feeds, the feedapi_node module does not recognize them as duplicates.

Note the feed_nid condition in the WHERE clause in _unique():

$count = db_num_rows(db_query("SELECT fiid FROM {feedapi_node_item} WHERE url = '%s' AND feed_nid = %d", $feed_item->options->original_url, $feed_nid));

Especially if you are aggregating feeds from search engines, you will get lots of duplicates coming in from different feeds. I recently measured ~10% duplicates in an aggregator I am running.

We should handle those duplicates by allowing "one feed item to many feeds" relations.

* Link the feedapi_node_item table to the feedapi table through a cross table called feedapi_node_item_feed (feed_nid, feed_item_nid) and thus
* Test for duplicates across the entire feed item base and, if a feed item already exists, link it to the current feed rather than creating a new feed item.
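
The two points above can be sketched in plain PHP. This is a minimal in-memory model, not FeedAPI code: the function name `sketch_save_or_link()` and both arrays are hypothetical stand-ins for the feedapi_node_item table and the proposed feedapi_node_item_feed cross table.

```php
<?php
// Hypothetical in-memory model, not the FeedAPI implementation.
// $items maps an item URL to a feed item id (stands in for feedapi_node_item).
// $links holds "feed_nid:feed_item_nid" keys (stands in for feedapi_node_item_feed).
function sketch_save_or_link(array &$items, array &$links, $url, $feed_nid) {
  if (isset($items[$url])) {
    // The item already exists somewhere in the system: reuse it.
    $item_nid = $items[$url];
    $created = FALSE;
  }
  else {
    // First sighting of this URL: create a new feed item id.
    $item_nid = count($items) + 1;
    $items[$url] = $item_nid;
    $created = TRUE;
  }
  // Record the feed / feed item relation; array keys keep each pair unique.
  $links[$feed_nid . ':' . $item_nid] = TRUE;
  return array('item_nid' => $item_nid, 'created' => $created);
}

$items = array();
$links = array();
// The same URL arrives from feed 10 and then from feed 20.
$first = sketch_save_or_link($items, $links, 'http://example.com/a', 10);
$second = sketch_save_or_link($items, $links, 'http://example.com/a', 20);
```

In this sketch the second call creates no new item; it only adds a second relation row, which is the "one feed item to many feeds" relation described above.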

Comments

alex_b:

Correction:

* Link the feedapi_node_item table to the feedapi table through a cross table called feedapi_node_item_feed (feed_nid, feed_item_nid)
* Test for duplicates across the entire feed item base and, if a feed item already exists, link it to the current feed rather than creating a new feed item.

alex_b:

This functionality has to be optional in order to cover these two use cases of feedapi:

* A community site allows members to add feeds to their own space on the site and manage their feeds and feed items. Even if there are duplicate feed items from duplicate feeds in the system, it is important to keep them as duplicates, because they belong to different users.
* A site aggregates information from many different feeds into a single space, using subscriptions to keyword search engines like blogsearch.google.com or technorati.com. Here you would want to eliminate duplicates coming in from different feeds.

I would suggest making this option a global feedapi setting on the feedapi settings page:

Duplicate handling

Allow duplicate items from *different* feeds.
(*)  Yes
( ) No

If you allow duplicate items from different feeds, a feed item will be created if it does not exist for the feed it is coming in from. If you do not allow duplicate items from different feeds, a feed item will only be created if it does not exist on *any* feed in the system.
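
A rough sketch of how such a setting could change the uniqueness test, in plain PHP. The function `sketch_item_is_new()`, its parameters, and the `$stored` rows are hypothetical; they only mirror the url and feed_nid columns used by the query in _unique().

```php
<?php
// Hypothetical sketch, not FeedAPI code.
// $stored is a list of rows with 'url' and 'feed_nid', standing in for
// the feedapi_node_item table.
function sketch_item_is_new(array $stored, $url, $feed_nid, $allow_cross_feed_duplicates) {
  foreach ($stored as $row) {
    if ($row['url'] !== $url) {
      continue;
    }
    // Same URL found. With duplicates allowed, it only counts as existing
    // when it belongs to the same feed; otherwise any feed counts.
    if (!$allow_cross_feed_duplicates || $row['feed_nid'] === $feed_nid) {
      return FALSE;
    }
  }
  return TRUE;
}

$stored = array(
  array('url' => 'http://example.com/a', 'feed_nid' => 10),
);
// The same URL now arrives on feed 20.
$with_duplicates_allowed = sketch_item_is_new($stored, 'http://example.com/a', 20, TRUE);
$without_duplicates = sketch_item_is_new($stored, 'http://example.com/a', 20, FALSE);
// The same URL arrives again on its original feed 10.
$same_feed = sketch_item_is_new($stored, 'http://example.com/a', 10, TRUE);
```

With "Yes" selected, the item from feed 20 would be treated as new; with "No", it would be recognized as already present in the system. An item repeated on its own feed is never new in either mode.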

alex_b:

Assigned: Unassigned » alex_b

Patch coming

summit:

Subscribing.
Thanks for your effort!
greetings,
Martijn

alex_b:

Title: Handle duplicates across feeds » Handle duplicates across feeds - X Feed Deduping
Status: Active » Needs review
Attachment (new): 24.85 KB

Here is the first version of the patch.

Apply it to the latest development version (DRUPAL-5) and run update.php.

Basic functionality:

* X Feed deduping is optional. It is a feature of the FeedAPI Node module, and you have to activate it per content type on the content type edit page of your feed.
* There is a new cross table (feedapi_node_item_feed) that stores feed / feed item relations.
* If a feed item appears on a feed other than its first feed, it is not stored again but LINKED to the new feed through the feedapi_node_item_feed table. Linking causes a load and update of the node representation of the duplicate feed item, so add-on modules can act on a found duplicate.
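
The last point, that linking a duplicate triggers an update other modules can react to, can be illustrated with a small plain-PHP sketch. Everything here is hypothetical: `sketch_link_duplicate()` and the `$listeners` callbacks only stand in for the node load and update step and for add-on modules; the patch itself goes through Drupal's nodeapi.

```php
<?php
// Hypothetical sketch, not code from the patch.
// $item carries the feeds it is linked to; $listeners stand in for add-on
// modules that react when the duplicate's node is updated.
function sketch_link_duplicate(array &$item, $feed_nid, array $listeners) {
  if (in_array($feed_nid, $item['feed_nids'], TRUE)) {
    // Already linked to this feed: nothing to store, nothing to announce.
    return FALSE;
  }
  // Link the existing item to the additional feed instead of storing a copy.
  $item['feed_nids'][] = $feed_nid;
  // Let every listener act on the found duplicate.
  foreach ($listeners as $listener) {
    call_user_func($listener, $item, $feed_nid);
  }
  return TRUE;
}

$item = array('nid' => 1, 'url' => 'http://example.com/a', 'feed_nids' => array(10));
$log = array();
$listeners = array(
  function ($item, $feed_nid) use (&$log) {
    $log[] = 'item ' . $item['nid'] . ' linked to feed ' . $feed_nid;
  },
);
$linked = sketch_link_duplicate($item, 20, $listeners);
$again = sketch_link_duplicate($item, 20, $listeners);
```

The second call is a no-op in this sketch, so listeners are notified once per new feed / feed item relation.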

Changes:

This patch touches quite a lot of modules in the FeedAPI suite.

* FeedAPI Node: core functionality of x feed deduping. Moved the storage procedure from a closed one-step process (preparation of data and storage in _feedapi_node_save()) to a two-step nodeapi process (preparation of data in _feedapi_node_save() and storage in nodeapi()). Updated the test module.
* FeedAPI node views: use new cross table.
* FeedAPI inherit: inherit from multiple feeds.
* There are some updates of info files, ignore them.

Upgrade of existing content:

If you don't want to use x feed deduping on existing feeds and feed content types, patching the module and running update.php is fine.

However, if you would like to CONVERT your running system to support x feed deduping, use http://cvs.drupal.org/viewvc.py/drupal/contributions/sandbox/alex_b/feed... - install this module and run cron, it will successively convert your entire system into an x-deduping enabled one. Turn off the module once the message "Ran feedapi_node_update_cron() but update already done. You can turn off feedapi_node_update module now." pops up on the watchdog log.

The patch is fairly mature and we use it in a staging environment. I still expect a bump or two, though. Tests and bug reports are highly appreciated.

alex_b:

Attachment (new): 22.65 KB

Patch re-rolled for feedapi 1.0

alex_b:

Attachment (new): 23.46 KB

Fix _feedapi_node_expire()

aron novak:

Important warning: if you apply the patch and you have FeedAPI node views installed, please turn that module off and on again. This is necessary because it re-creates the default views.

aron novak:

Attachment (new): 18.63 KB

I added PostgreSQL support and tested the patch. I cleaned it up to match our coding style and removed a MyISAM storage constraint (it did not look right to me).

aron novak:

Status: Needs review » Reviewed & tested by the community

The underlying logic of this patch works fine; I have just checked it. It's ready to be committed.

alex_b:

Status: Reviewed & tested by the community » Fixed

Aron, thank you for the pgsql support and the formatting check.

Committed.

Feels good to have this one out of the way.

Alex

Anonymous:

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for two weeks with no activity.

meghs:

Project: FeedAPI » FeedAPI De-Dupe
Version: 5.x-1.x-dev » 6.x-1.0-rc1
Component: Code » Miscellaneous
Assigned: alex_b » Unassigned
Category: feature » support
Priority: Normal » Critical
Status: Closed (fixed) » Active

I am using the FeedAPI De-Dupe module to check for duplicate feed items across all feeds, but it is not working. Duplicate items are still coming in. Are there any settings I am missing? I need serious help; the site is in production.