Handle duplicates across feeds - X Feed Deduping

alex_b - November 20, 2007 - 18:11
Project:FeedAPI
Version:5.x-1.x-dev
Component:Code
Category:feature request
Priority:normal
Assigned:alex_b
Status:closed
Description

Currently, if duplicate articles come in from different feeds, the feedapi_node module does not recognize them as duplicates.

Note the WHERE clause on feed_nid in _unique():

<?php
$count = db_num_rows(db_query("SELECT fiid FROM {feedapi_node_item} WHERE url = '%s' AND feed_nid = %d", $feed_item->options->original_url, $feed_nid));
?>

Especially if you are aggregating feeds from search engines, you will get lots of duplicates coming in from different feeds. I recently measured ~10% duplicates in an aggregator I am running.

We should handle those duplicates by allowing "one feed item to many feeds" relations.

* Link the feedapi_node_item table to the feedapi table through a cross table called feedapi_node_item_feed (feed_nid, feed_item_nid) and thus
* Test for duplicates across the entire feed item base and, if the feed item already exists, link it to the current feed rather than creating a new feed item.

#1

alex_b - November 20, 2007 - 18:12

Correction:

* Link the feedapi_node_item table to the feedapi table through a cross table called feedapi_node_item_feed (feed_nid, feed_item_nid)
* Test for duplicates across the entire feed item base and, if the feed item already exists, link it to the current feed rather than creating a new feed item.
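
To make the proposed relation concrete, here is a minimal sketch in plain PHP. It is not FeedAPI code: the two arrays are in-memory stand-ins for the feedapi_node_item table and the proposed feedapi_node_item_feed cross table, and sketch_link_item() is a made-up helper name.

```php
<?php
// Hypothetical in-memory stand-ins for the two tables; not FeedAPI code.
$feedapi_node_item = array();       // feed item nid => original URL
$feedapi_node_item_feed = array();  // rows of (feed_nid, feed_item_nid)

/**
 * Stores a feed item once and links it to every feed it appears on.
 * Returns the nid of the new or already existing feed item.
 */
function sketch_link_item(&$items, &$links, $url, $feed_nid) {
  // Look for the URL across the whole item base, not just the current feed.
  $item_nid = array_search($url, $items, TRUE);
  if ($item_nid === FALSE) {
    // Unknown URL: create a new item with the next free nid.
    $item_nid = count($items) + 1;
    $items[$item_nid] = $url;
  }
  // Record the feed / feed item relation unless it is already there.
  $row = array('feed_nid' => $feed_nid, 'feed_item_nid' => $item_nid);
  if (!in_array($row, $links)) {
    $links[] = $row;
  }
  return $item_nid;
}

// The same article arrives through two different feeds (nids 10 and 20).
$a = sketch_link_item($feedapi_node_item, $feedapi_node_item_feed, 'http://example.com/post', 10);
$b = sketch_link_item($feedapi_node_item, $feedapi_node_item_feed, 'http://example.com/post', 20);
?>
```

In this sketch the second call returns the nid of the item created by the first call, so the item base holds one item while the cross table holds two rows, one per feed.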

#2

alex_b - November 20, 2007 - 21:58

This functionality has to be optional in order to cover these two use cases of feedapi:

* A community site allows members to add feeds to their own space on the site and manage their feeds and feed items. Even if the system contains duplicate feed items from duplicate feeds, it is important to keep them as duplicates, because they belong to different users.
* A site aggregates information from many different feeds into a single space, using subscriptions to keyword search engines like blogsearch.google.com or technorati.com. Here you would want to eliminate duplicates coming in from different feeds.

I suggest making this option a global feedapi setting on the feedapi settings page:

Duplicate handling

Allow duplicate items from *different* feeds.
(*) Yes
( ) No

If you allow duplicate items from different feeds, a feed item will be created if it does not exist for the feed it is coming in from. If you do not allow duplicate items from different feeds, a feed item will only be created if it does not exist on *any* feed in the system.
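
The two modes of the setting can be sketched as a single check. This is only an illustration in plain PHP under my own assumptions: sketch_should_create() and the $allow_cross_feed_duplicates flag are hypothetical names, and the array of rows stands in for whatever the real query would return.

```php
<?php
/**
 * Decides whether an incoming feed item should be created.
 *
 * $known: hypothetical rows of array('url' => ..., 'feed_nid' => ...).
 * $allow_cross_feed_duplicates: stand-in for the proposed global setting.
 */
function sketch_should_create($known, $url, $feed_nid, $allow_cross_feed_duplicates) {
  foreach ($known as $row) {
    if ($row['url'] !== $url) {
      continue;
    }
    // "No": the URL existing on any feed blocks creation.
    // "Yes": only a match on the same feed blocks creation.
    if (!$allow_cross_feed_duplicates || $row['feed_nid'] === $feed_nid) {
      return FALSE;
    }
  }
  return TRUE;
}

// One item is already known, coming from the feed with nid 10.
$known = array(array('url' => 'http://example.com/post', 'feed_nid' => 10));

$yes_other_feed = sketch_should_create($known, 'http://example.com/post', 20, TRUE);
$no_other_feed  = sketch_should_create($known, 'http://example.com/post', 20, FALSE);
$yes_same_feed  = sketch_should_create($known, 'http://example.com/post', 10, TRUE);
?>
```

With "Yes" selected, the item is created again for feed 20 but not for feed 10. With "No" selected, it is not created for feed 20 either.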

#3

alex_b - January 16, 2008 - 02:40
Assigned to:Anonymous» alex_b

Patch coming

#4

Summit - January 16, 2008 - 15:01

Subscribing.
Thanks for your effort!
greetings,
Martijn

#5

alex_b - January 17, 2008 - 19:37
Title:Handle duplicates across feeds» Handle duplicates across feeds - X Feed Deduping
Status:active» patch (code needs review)

Here is the first version of the patch.

Apply it to the latest development version (DRUPAL-5) and run update.php.

Basic functionality:

* X feed deduping is optional. It is a feature of the FeedAPI node module, and you have to activate it per content type on the content type edit page of your feed.
* There is a new cross table (feedapi_node_item_feed) that stores feed / feed item relations.
* If a feed item appears on a feed other than its first feed, it is not stored again but LINKED to the new feed through the feedapi_node_item_feed table. Linking causes a load and update of the node representation of the duplicate feed item, so add-on modules can act on a found duplicate.
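
The "link and update" behaviour from the last point can be sketched roughly as follows. This is plain PHP, not code from the patch: sketch_save_item(), the $store array and the $on_duplicate callback are hypothetical, and the callback only stands in for the node load/update that lets add-on modules react.

```php
<?php
/**
 * Saves an incoming item; a cross-feed duplicate is linked, not stored again.
 *
 * $store: array('items' => url => item nid, 'links' => list of "feed_nid:item_nid").
 * $on_duplicate: hypothetical callback standing in for the node load/update.
 */
function sketch_save_item(&$store, $url, $feed_nid, $on_duplicate) {
  if (isset($store['items'][$url])) {
    $item_nid = $store['items'][$url];
    $link = $feed_nid . ':' . $item_nid;
    if (!in_array($link, $store['links'], TRUE)) {
      // Known item on a new feed: add the relation and notify listeners.
      $store['links'][] = $link;
      call_user_func($on_duplicate, $item_nid, $feed_nid);
    }
    return $item_nid;
  }
  // First sighting: store the item and link it to its first feed.
  $item_nid = count($store['items']) + 1;
  $store['items'][$url] = $item_nid;
  $store['links'][] = $feed_nid . ':' . $item_nid;
  return $item_nid;
}

$store = array('items' => array(), 'links' => array());
$seen = array();
$record = function ($item_nid, $feed_nid) use (&$seen) {
  // An add-on module would react here; the sketch only records the event.
  $seen[] = array($item_nid, $feed_nid);
};

sketch_save_item($store, 'http://example.com/post', 10, $record);
sketch_save_item($store, 'http://example.com/post', 20, $record);
?>
```

Here the callback fires once, for the second feed, and never for the first sighting of an item. The closure syntax needs a newer PHP than the Drupal 5 era code in this issue, so treat it as illustration only.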

Changes:

This patch touches quite a lot of modules in the FeedAPI suite.

* FeedAPI Node: core functionality of x feed deduping. Moved storage from a closed one-step process (data preparation and storage in _feedapi_node_save()) to a two-step nodeapi process (data preparation in _feedapi_node_save(), storage in nodeapi()). Updated the test module.
* FeedAPI node views: use new cross table.
* FeedAPI inherit: inherit from multiple feeds.
* There are some updates of info files, ignore them.

Upgrade of existing content:

If you don't want to use x feed deduping on existing feeds and feed content types, patching the module and running update.php is fine.

However, if you would like to CONVERT your running system to support x feed deduping, use http://cvs.drupal.org/viewvc.py/drupal/contributions/sandbox/alex_b/feed... - install this module and run cron, and it will convert your entire system step by step into an x-deduping enabled one. Turn off the module once the message "Ran feedapi_node_update_cron() but update already done. You can turn off feedapi_node_update module now." appears in the watchdog log.

The patch is fairly mature and we use it in a staging environment. I expect a bump or two, though. Tests and bug reports are highly appreciated.

Attachment                      Size
feedapi_x_dedupe.patch          24.85 KB

#6

alex_b - January 26, 2008 - 20:32

Patch re-rolled for feedapi 1.0

Attachment                      Size
feedapi_1_0_x_dedupe.patch      22.65 KB

#7

alex_b - January 29, 2008 - 22:46

Fix _feedapi_node_expire()

Attachment                      Size
feedapi_1_0_x_dedupe.patch      23.46 KB

#8

Aron Novak - January 30, 2008 - 10:06

Important warning: if you apply the patch and have feedapi node views installed, please turn the feedapi node views module off and on again. This re-creates the default views, which is necessary.

#9

Aron Novak - January 30, 2008 - 10:58

I added PostgreSQL support and tested the patch. I also cleaned it up according to our coding style and removed a MyISAM storage constraint (it did not look right to me).

Attachment                      Size
dedupe_2008_01_30.patch         18.63 KB

#10

Aron Novak - January 30, 2008 - 14:09
Status:patch (code needs review)» patch (reviewed & tested by the community)

The underlying logic of this patch works fine; I have just checked it. It's ready to be committed.

#11

alex_b - January 30, 2008 - 19:20
Status:patch (reviewed & tested by the community)» fixed

Aron, thank you for the pgsql support and the formatting check.

Committed.

Feels good to have this one out of the way.

Alex

#12

Anonymous (not verified) - February 13, 2008 - 19:27
Status:fixed» closed

Automatically closed -- issue fixed for two weeks with no activity.

 
 
