Hi,
I have found an issue with FeedAPI: feed data is duplicated even though it comes from the same source.
Here is what's happening: the data I get from the feeds is the same, but the FeedAPI node item table saves the items with the same GUID, so two nodes are created with the same GUID.
Is there a patch for this? How would you approach it?

Comments

aron novak’s picture

Status: Active » Postponed (maintainer needs more info)

Can you paste the feed URL that you use here?

panji’s picture

I've got the same problem, but my version is 6.x-1.4.

It's all caused by items being posted twice without my knowing it.

nicolash’s picture

I have the same problem. We use the "Check for duplicates on all feeds" option and it works. However, every now and again an item with the same GUID gets created again - either in the same feed or in another.

All feeds concerned also have "Update existing feed items" checked, so they shouldn't create new ones.

Aron, without looking further through the code, is the GUID the only parameter that determines whether an item already exists?

Here are the feeds in question:

* http://www.abc.net.au/news/indexes/brisbane/rss.xml
* http://www.abc.net.au/news/indexes/goldcoast/rss.xml
* http://www.abc.net.au/news/indexes/southqld/rss.xml
* http://www.abc.net.au/news/indexes/sunshine/rss.xml
* http://www.abc.net.au/news/indexes/idx-qld/rss.xml

emackn’s picture

I'm seeing this too. I updated my site with 4 new feeds (5 total) and now have quite a few duplicates.

Feeds are using both "Update existing feed items" and "check duplicates on all feeds".

One thing I noticed: the duplicates appear to be within the individual feeds. Is there a way to check for duplicates both within a feed and across all feeds?

adub’s picture

Has anybody got simultaneous cron jobs showing up in the logs?

nicolash’s picture

emackn, same here. Duplicates occur mainly within individual feeds, perhaps solely... I can't say for sure at the moment.

nicolash’s picture

Aron, could you point me towards places in the code where I could look for causes?

skizzo’s picture

I am not a programmer, so I don't really know whether the following is relevant. I just created a Feed Content Type (parser_simplepie enabled, creates nodes from feed items, FeedAPI Inherit on) and then created a Feed (Check for duplicates only within feed, Feed nodes inherit taxonomy, mapping original_url to a CCK video field). I tried that before with other feed URLs, and it worked. Today I applied it to a different feed URL and got the following warnings. If this is not relevant, please let me know and I will post it as a separate issue. Thx.

On first refresh:

 user warning: Duplicate entry '0' for key 1 query: INSERT INTO content_field_embedded_video (field_embedded_video_embed, field_embedded_video_value, field_embedded_video_provider, field_embedded_video_data, vid, nid) VALUES ('http://www.youtube.com/watch?v=1CPRgWSvg_k', '1CPRgWSvg_k', 'youtube', 'a:3:{s:25:\"video_cck_youtube_version\";i:1;s:9:\"thumbnail\";a:1:{s:3:\"url\";s:43:\"http://img.youtube.com/vi/1CPRgWSvg_k/0.jpg\";}s:5:\"flash\";a:3:{s:3:\"url\";s:32:\"http://youtube.com/v/1CPRgWSvg_k\";s:4:\"size\";s:3:\"882\";s:4:\"mime\";s:29:\"application/x-shockwave-flash\";}}', 0, 1679) in /var/www/drupaz/includes/database.mysql.inc on line 174. 

Purging the feed items:

 * user warning: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '' at line 1 query: SELECT lid FROM location_instance WHERE in /var/www/drupaz/includes/database.mysql.inc on line 174.
 * user warning: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '' at line 1 query: DELETE FROM location_instance WHERE in /var/www/drupaz/includes/database.mysql.inc on line 174.

gomez_in_the_south’s picture

Component: Code » Code feedapi_node

I was struggling with duplicate items appearing on various feeds as well. It would only occur on certain feeds and wouldn't happen immediately either, but would consistently appear after some time. After some investigation I found a solution that works for me. It's a bit difficult to explain but I'll give it my best shot.

I'm using the SimplePie parser, the FeedAPI node processor and the 6.x-1.x-dev (2009-May-09) version of FeedAPI. I am not using the option to check for duplicates across all feeds.

The problem occurs when you have two different feeds from the same source (let's call them Feed A and Feed B) that may contain the same feed items. When Feed A is refreshed, its feed items get updated. To find the feed items to update, a select is done matching either the GUID or the URL of the feed item. As there is more than one feed item with the same GUID and URL, sometimes the feed items from Feed B are updated instead and saved to Feed A, in effect transferring them to that feed. Then, when Feed B is refreshed, it can't find its existing items, so it creates them again.

The problem code seems to be in the _feedapi_node_update function. Currently it looks up the feed item to be updated with this code (around line 345):

$node = db_fetch_object(db_query("SELECT nid FROM {feedapi_node_item} WHERE guid = '%s'", $feed_item->options->guid));
or
$node = db_fetch_object(db_query("SELECT nid FROM {feedapi_node_item} WHERE url = '%s'", $feed_item->options->original_url));

This ignores the fact that two feed items with the same GUID may exist on different feeds, so sometimes the wrong feed item is returned and updated.

So, I replaced these lines with the following:

$node = db_fetch_object(db_query("SELECT ni.nid FROM {feedapi_node_item} ni INNER JOIN {feedapi_node_item_feed} f ON ni.nid = f.feed_item_nid WHERE ni.guid = '%s' AND f.feed_nid = %d", $feed_item->options->guid, $feed_nid));
and
$node = db_fetch_object(db_query("SELECT ni.nid FROM {feedapi_node_item} ni INNER JOIN {feedapi_node_item_feed} f ON ni.nid = f.feed_item_nid WHERE ni.url = '%s' AND f.feed_nid = %d", $feed_item->options->original_url, $feed_nid));
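
To make the difference between the two lookups concrete, here is a small standalone model in Python with an in-memory SQLite database. It is not Drupal code: the table and column names are taken from the queries above, but the column types, the node ids (10, 11), the feed ids (1 for Feed A, 2 for Feed B) and the helper names are made up for illustration.

```python
import sqlite3


def build_db():
    """Two item nodes share one GUID; each belongs to a different feed."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE feedapi_node_item (nid INTEGER PRIMARY KEY, guid TEXT, url TEXT);
        CREATE TABLE feedapi_node_item_feed (feed_nid INTEGER, feed_item_nid INTEGER);
        INSERT INTO feedapi_node_item VALUES (10, 'guid-1', 'http://example.com/story');
        INSERT INTO feedapi_node_item VALUES (11, 'guid-1', 'http://example.com/story');
        INSERT INTO feedapi_node_item_feed VALUES (2, 10);  -- node 10 is on Feed B
        INSERT INTO feedapi_node_item_feed VALUES (1, 11);  -- node 11 is on Feed A
    """)
    return db


def lookup_unscoped(db, guid):
    """GUID-only lookup: returns every node with that GUID, whatever its feed."""
    rows = db.execute(
        "SELECT nid FROM feedapi_node_item WHERE guid = ? ORDER BY nid", (guid,))
    return [row[0] for row in rows]


def lookup_scoped(db, guid, feed_nid):
    """Lookup restricted to one feed through the item-to-feed table."""
    rows = db.execute(
        "SELECT ni.nid FROM feedapi_node_item ni "
        "INNER JOIN feedapi_node_item_feed f ON ni.nid = f.feed_item_nid "
        "WHERE ni.guid = ? AND f.feed_nid = ? ORDER BY ni.nid",
        (guid, feed_nid))
    return [row[0] for row in rows]


db = build_db()
# Refreshing Feed A (feed_nid 1) with the GUID-only query matches both nodes,
# so taking "the first row" may pick Feed B's node 10 instead of node 11.
print(lookup_unscoped(db, "guid-1"))
# The feed-scoped query can only return Feed A's own node.
print(lookup_scoped(db, "guid-1", 1))
```

The GUID-only query has two candidate rows and no rule for choosing between them, which matches the intermittent behaviour described above; the scoped query leaves exactly one candidate per feed.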

To replicate the problem, I first altered the hash manually so that the whole feed would be processed on every refresh (by appending a rubbish value to the $old_hash variable before the comparison in feedapi.module).
Then I added two feeds that have the same articles/feed items on them. An example pair is:
http://www.moneyweb.co.za/mw/rss/mw/en/moneyweb_news.xml and http://www.moneyweb.co.za/mw/rss/mw/en/rss_daily_news.xml .
If you watch the database values carefully whilst updating the feeds manually you can see how this cross-posting of the feed items occurs.

I wasn't sure whether this thread or http://drupal.org/node/365943 was more appropriate, as they seem to be similar issues.

alex_b’s picture

Version: 5.x-1.4 » 6.x-1.x-dev

#9: This is a good find.

The proposed solution would break, though, when cross-feed deduplication is enabled. In this mode, FeedAPI looks for duplicates across all feeds and, when an item already exists, creates a new reference to the existing feed item rather than creating a new feed item for every feed that contains it.
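
A rough way to see the conflict is the following Python model of a refresh in cross-feed mode. It is only a sketch: the `refresh` helper, the dictionary and set that stand in for the two tables, and the ids are all hypothetical, not FeedAPI code.

```python
def refresh(items, links, feed_nid, guid, scoped):
    """Process one incoming feed item.

    items: dict mapping item nid -> guid (stand-in for the item table)
    links: set of (feed_nid, item_nid) pairs (stand-in for the item-to-feed table)
    scoped: if True, only look for existing items already linked to this feed
    """
    candidates = [
        nid for nid, item_guid in items.items()
        if item_guid == guid and (not scoped or (feed_nid, nid) in links)
    ]
    if candidates:
        nid = min(candidates)            # reuse the existing item
    else:
        nid = max(items, default=0) + 1  # nothing found: create a new item
        items[nid] = guid
    links.add((feed_nid, nid))           # reference the item from this feed
    return nid


# Cross-feed mode as intended: Feed 1 reuses the item that Feed 2 already created.
items, links = {10: "guid-1"}, {(2, 10)}
print(refresh(items, links, feed_nid=1, guid="guid-1", scoped=False), items)

# Feed-scoped lookup: Feed 1 finds nothing of its own and creates a duplicate.
items2, links2 = {10: "guid-1"}, {(2, 10)}
print(refresh(items2, links2, feed_nid=1, guid="guid-1", scoped=True), items2)
```

With the global lookup the item table still holds one node after the refresh and only a new reference is added; with the feed-scoped lookup a second node with the same GUID appears, which is exactly what cross-feed deduplication is meant to prevent.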

Further, it looks wasteful that we look up these items in _unique() and then again a little later in _update().

Needs to be fixed for 1.7.

aron novak’s picture

New attachment: 1.83 KB

Hm, I created a patch that at least does not break simpletest, but I'm not really happy with it.
I'm sure that the code in #9 cannot be the solution for #4 and #5 as well.

gomez_in_the_south’s picture

Aron, I think you're right. #4 mentions that he has selected 'check for duplicates across all feeds'. The solution I posted was specific to the problem I was encountering. From a cursory look, your patch should solve #9 - thanks. I'll implement and test it in the next couple of days and report back here.

alex_b’s picture

Status: Postponed (maintainer needs more info) » Needs work

What if hook_feedapi_item('unique') returns an id instead of TRUE and feedapi passes it back into feedapi_node with hook_feedapi_item('update', $feed_item, $feed_nid, $settings, $id)?

This would take care of the double lookup and the problem outlined in #9.
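
A sketch of that flow, written as a standalone Python model rather than as FeedAPI hooks: `unique()` returns True for a new item and the existing id otherwise, and the caller hands that id straight to `update()`. The class, method names and data layout are assumptions made for illustration only.

```python
class ItemStore:
    """Toy stand-in for the item storage behind a processor."""

    def __init__(self):
        self.by_guid = {}   # guid -> item id
        self.titles = {}    # item id -> title
        self.lookups = 0    # counts how often we search for an existing item
        self.next_id = 1

    def unique(self, item):
        """Return True if the item is new, otherwise the id of the existing item."""
        self.lookups += 1
        existing = self.by_guid.get(item["guid"])
        return True if existing is None else existing

    def save(self, item):
        """Create a new item and return its id."""
        nid = self.next_id
        self.next_id += 1
        self.by_guid[item["guid"]] = nid
        self.titles[nid] = item["title"]
        return nid

    def update(self, item, nid):
        """Update the item with the given id; no second lookup is needed."""
        self.titles[nid] = item["title"]
        return nid


def process(store, item):
    """Save new items; update existing ones using the id found by unique()."""
    result = store.unique(item)
    if result is True:
        return store.save(item)
    return store.update(item, result)


store = ItemStore()
first = process(store, {"guid": "guid-1", "title": "v1"})
second = process(store, {"guid": "guid-1", "title": "v2"})
print(first, second, store.titles, store.lookups)
```

Each incoming item triggers one lookup, and the update always touches the item that the uniqueness check actually found, so the two steps cannot disagree about which item is meant.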

aron novak’s picture

Do we want to change the API? If yes, I'm happy to make the changes outlined in #13. It really does make sense.

aron novak’s picture

New attachment: 3.88 KB

Following Alex's suggestion in #13, I refactored things. This means code removal: the _feedapi_node_update() function became small :)

aron novak’s picture

Status: Needs work » Needs review

gomez_in_the_south’s picture

Looks good. I reverted to the old version that was giving trouble and confirmed the duplication of the items. Then I applied the patch in #15 and duplicates no longer appeared using the same steps. In other words, it works for me.

If this patch, as well as the one in http://drupal.org/node/464342, is going to make it into the next dev release, then I'll just try that on my semi-production server and keep an eye on the behaviour.

Thanks, Aron and Alex, for your work on this module.

alex_b’s picture

Title: RE:Duplicate Data » Duplicate items when update items enabled
Status: Needs review » Needs work

Some nitpicks:

if ($unique === FALSE || is_numeric($unique)) {

I think it's fine to assume that there won't be an id 0. We could therefore simplify the code and do without a $unique variable altogether. Of course, I'm happy to be convinced otherwise; I may be overlooking something here.

_feedapi_node_update($feed_item, $feed_nid, $settings = array(), $id)

The function signature doesn't look right: an optional parameter ($settings = array()) comes before a required one ($id). We should either make $settings mandatory or move $id in front of $settings. I prefer the former, because it does not interfere with existing API implementations.

This issue reminds me of another important issue #443436: Create full API documentation :-)

aron novak’s picture

Status: Needs work » Needs review
New attachment: 3.99 KB

#18: function signature: true, good catch.
About the if condition: I like it because it immediately shows you the valid usages of 'unique'.
I made it a little bit simpler: one if condition is enough, no need to nest them.

aron novak’s picture

Status: Needs review » Fixed

"I made it a little bit simpler: one if condition is enough, no need to nest them." - this was a silly idea; it generated a serious bug. Before the commit, I reverted to the original structure.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.