Hi,
I have found an issue with FeedAPI: the feed data is duplicated, even though it is coming from the same source.
Here is what's happening: the data I am getting from the feeds is the same, but the FeedAPI node item table saves the items with the same GUID, and hence two nodes are created with the same GUID.
Is there a patch for this? How do you approach this?
| Comment | File | Size | Author |
|---|---|---|---|
| #19 | feedapi_node_duplicate_1.patch | 3.99 KB | aron novak |
| #15 | feedapi_node_duplicate.patch | 3.88 KB | aron novak |
| #11 | feedapi_node_duplicate.patch | 1.83 KB | aron novak |
Comments
Comment #1
aron novak: Can you paste here the feed URL that you use?
Comment #2
panji commented: I've got the same problem, but my version is 6.x-1.4. It's all because of double posting that happens without my knowing it.
Comment #3
nicolash commented: I have the same problem. We use the "Check for duplicates on all feeds" option and it works. However, every now and again an item with the same GUID gets created again, either in the same feed or in another.
All feeds concerned also have "Update existing feed items" checked, so they shouldn't create new ones.
Aron, without looking further through the code, is the GUID the only parameter that determines whether an item already exists?
Here are the feeds in question:
* http://www.abc.net.au/news/indexes/brisbane/rss.xml
* http://www.abc.net.au/news/indexes/goldcoast/rss.xml
* http://www.abc.net.au/news/indexes/southqld/rss.xml
* http://www.abc.net.au/news/indexes/sunshine/rss.xml
* http://www.abc.net.au/news/indexes/idx-qld/rss.xml
Comment #4
emackn commented: I'm seeing this too. I updated my site with 4 new feeds (5 total) and now have quite a few duplicates.
The feeds are using both "Update existing feed items" and "Check for duplicates on all feeds".
One thing I noticed: the duplicates look to be within the individual feeds. Is there a way to check for both?
Comment #5
adub commented: Has anybody got simultaneous cron jobs showing up in the logs?
Comment #6
nicolash commented: emackn, same here. Duplicates occur mainly within individual feeds, perhaps solely. I can't say for sure at the moment.
Comment #7
nicolash commented: Aron, could you point me towards places to look for possible causes?
Comment #8
skizzo commented: I am not a programmer, so I don't really know whether the following is relevant. I just created a Feed Content Type (parser_simplepie enabled, creates nodes from feed items, FeedAPI Inherit on) and then created a Feed (Check for duplicates only within feed, Feed nodes inherit taxonomy, mapping original_url to a CCK video field). I tried that before with other feed URLs, and it worked. Today I applied it to a different feed URL and got the following warnings. If this is not relevant, please let me know and I will post it as a separate issue. Thanks.
On first refresh:
Purging the feed items:
Comment #9
gomez_in_the_south commented: I was struggling with duplicate items appearing on various feeds as well. It would only occur on certain feeds and wouldn't happen immediately either, but would consistently appear after some time. After some investigation I found a solution that works for me. It's a bit difficult to explain, but I'll give it my best shot.
I'm using the SimplePie parser, the FeedAPI node processor and the 6.x-1.x-dev (2009-May-09) version of FeedAPI. I am not using the option to delete duplicates across all feeds.
The problem occurs when you have two different feeds from the same source (let's call them Feed A and Feed B) that may carry the same feed items. When Feed A is refreshed, its feed items get updated. To find the feed items to be updated, a select is done matching either the GUID or the URL of the feed item. As there is more than one feed item with the same GUID and URL, sometimes the feed items from Feed B are updated instead and saved to Feed A, in effect transferring them to that feed. Then, when Feed B is refreshed, it can't find the existing items, so it fetches them again.
The problem code seems to be in the _feedapi_node_update function. Currently it looks for the feed item to be updated with the code (around line 345):
ignoring the fact that two feed items with the same guid may exist on different feeds. So sometimes the wrong feed item is returned and updated.
So, I replaced these lines with the following:
To replicate the problem, I first altered the hash manually so that the feed would actually process the whole feed each time (by appending a rubbish value to the $old_hash variable before the compare in feedapi.module).
Then I added two feeds that have the same articles/feed items on them. An example of these are:
http://www.moneyweb.co.za/mw/rss/mw/en/moneyweb_news.xml and http://www.moneyweb.co.za/mw/rss/mw/en/rss_daily_news.xml .
If you watch the database values carefully whilst updating the feeds manually you can see how this cross-posting of the feed items occurs.
I wasn't sure whether this thread or http://drupal.org/node/365943 was more appropriate, as they seem to be similar issues.
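The cross-posting described in this comment can be modelled with a small sketch. It is written in Python rather than FeedAPI's PHP and is not the module's code: `FeedItem`, `find_scoped`, `find_unscoped` and `refresh` are hypothetical names. It also assumes that the duplicate check is scoped to the feed (duplicates are checked only within the feed) while the update lookup matches on GUID or URL alone.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FeedItem:
    nid: int        # id of the feed item node
    feed_nid: int   # id of the feed the item currently belongs to
    guid: str
    url: str


def find_unscoped(items: List[FeedItem], guid: str, url: str) -> Optional[FeedItem]:
    """Match on GUID or URL only, ignoring which feed owns the item."""
    for item in items:
        if item.guid == guid or item.url == url:
            return item
    return None


def find_scoped(items: List[FeedItem], feed_nid: int, guid: str, url: str) -> Optional[FeedItem]:
    """Match on GUID or URL, but only among the items of the given feed."""
    for item in items:
        if item.feed_nid == feed_nid and (item.guid == guid or item.url == url):
            return item
    return None


def refresh(items: List[FeedItem], feed_nid: int, guid: str, url: str, scoped_update: bool) -> None:
    """Process one incoming item for one feed."""
    # Duplicate check (assumed to be scoped to the feed being refreshed).
    if find_scoped(items, feed_nid, guid, url) is None:
        new_nid = max((i.nid for i in items), default=0) + 1
        items.append(FeedItem(new_nid, feed_nid, guid, url))
        return
    # Update step: the lookup here is what differs between the two variants.
    if scoped_update:
        target = find_scoped(items, feed_nid, guid, url)
    else:
        target = find_unscoped(items, guid, url)
    target.feed_nid = feed_nid  # the updated item is saved to the refreshed feed


def demo(scoped_update: bool) -> int:
    """Refresh Feed A, then Feed B, and return the number of stored items."""
    feed_a, feed_b = 1, 2
    url = "http://example.com/article"  # placeholder URL
    # Feed B's copy is stored first, so an unscoped lookup finds it first.
    items = [FeedItem(10, feed_b, "g1", url), FeedItem(11, feed_a, "g1", url)]
    refresh(items, feed_a, "g1", url, scoped_update)
    refresh(items, feed_b, "g1", url, scoped_update)
    return len(items)
```

With the unscoped update, refreshing Feed A moves Feed B's copy over to Feed A, so the next refresh of Feed B finds nothing of its own and creates a third item. With the scoped update, the item count stays at two.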
Comment #10
alex_b commented: Re #9: this is a good find.
The proposed solution would break, though, when cross-feed deduplication is enabled. In this mode, FeedAPI looks for duplicates across all feeds and creates a new reference to an existing feed item, rather than creating a new feed item for every feed in which it finds that item.
Further, it looks wasteful that we look up these items in _unique() and then again a little later in _update().
Needs to be fixed for 1.7.
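As a rough illustration of the cross-feed mode described in this comment, the sketch below keeps one stored item per GUID and records a separate reference for every feed that carries it. It is a Python model under stated assumptions, not FeedAPI's schema or code: `add_item_cross_feed`, `items` and `refs` are hypothetical names.

```python
from typing import Dict, Set, Tuple


def add_item_cross_feed(
    items: Dict[str, int],
    refs: Set[Tuple[int, int]],
    feed_nid: int,
    guid: str,
) -> int:
    """Store an item once per GUID and link it to every feed that carries it."""
    nid = items.get(guid)
    if nid is None:
        # First time this GUID is seen on any feed: create the item.
        nid = len(items) + 1
        items[guid] = nid
    # Whether the item is new or not, record that this feed references it.
    refs.add((feed_nid, nid))
    return nid
```

In this model an item has no single owning feed, which is why a lookup restricted to one feed id would miss items that another feed created first.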
Comment #11
aron novak: Hm, I created a patch that at least does not break the simpletests, but I'm not really happy with it.
I'm sure that the code in #9 cannot also be the solution for #4 and #5.
Comment #12
gomez_in_the_south commented: Aron, I think you're right. #4 mentions that he has selected "Check for duplicates on all feeds". The solution I posted was specific to the problem I was encountering. From a cursory look, your patch should solve #9. Thanks. I'll implement and test it in the next couple of days and report back here.
Comment #13
alex_b commented: What if hook_feedapi_item('unique') returns an id instead of TRUE and FeedAPI passes it back into feedapi_node with hook_feedapi_item('update', $feed_item, $feed_nid, $settings, $id)?
This would take care of the double lookup and the problem outlined in #9.
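The flow proposed here can be sketched as follows. This is a Python model, not the FeedAPI hook implementation: `item_unique`, `item_save`, `item_update` and `process_item` are hypothetical stand-ins for the 'unique', 'save' and 'update' operations. The sketch assumes that the uniqueness check returns `True` when the item is new and the id of the existing item otherwise.

```python
from typing import Dict, Union

Store = Dict[int, dict]  # item id -> stored item fields


def item_unique(store: Store, feed_nid: int, guid: str) -> Union[bool, int]:
    """Return True if the item is new, else the id of the existing item."""
    for item_id, item in store.items():
        if item["feed_nid"] == feed_nid and item["guid"] == guid:
            return item_id
    return True


def item_save(store: Store, feed_nid: int, guid: str, title: str) -> int:
    """Create a new item and return its id."""
    new_id = max(store, default=0) + 1
    store[new_id] = {"feed_nid": feed_nid, "guid": guid, "title": title}
    return new_id


def item_update(store: Store, item_id: int, title: str) -> None:
    """Update exactly the item that the uniqueness check identified."""
    store[item_id]["title"] = title


def process_item(store: Store, feed_nid: int, guid: str, title: str) -> int:
    found = item_unique(store, feed_nid, guid)
    # `is True` matters in Python: True == 1, so an equality test could
    # mistake the item with id 1 for "no existing item".
    if found is True:
        return item_save(store, feed_nid, guid, title)
    # The id from the uniqueness check is reused, so no second lookup is needed.
    item_update(store, found, title)
    return found
```

Because the update receives the id directly, it cannot pick up a same-GUID item that belongs to another feed, and the item is looked up only once per refresh.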
Comment #14
aron novak: Do we want to change the API? If yes, I'm happy to make the changes outlined in #13. It really does make sense.
Comment #15
aron novak: Following Alex's suggestion in #13, I refactored things. This means code removal! The _feedapi_node_update() function became small :)
Comment #16
aron novak
Comment #17
gomez_in_the_south commented: Looks good. I reverted to the old version that was giving trouble and confirmed the duplication of the items. Then I applied the patch in #15, and duplicates no longer appeared when following the same steps. In other words, it works for me.
If this patch and the one in http://drupal.org/node/464342 are going to make it into the next dev release, then I'll just try that on my semi-production server and keep an eye on the behaviour.
Thanks, Aron and Alex, for your work on this module.
Comment #18
alex_b commented: Some nitpicks:
I think it's fine to assume that there won't be an id 0. We could therefore simplify the code and do without a $unique variable altogether. Of course, I'm happy to be convinced otherwise; I may be overlooking something here.
The function signature doesn't look right: we should either make $settings mandatory or move $id in front of $settings. I prefer the former, because it does not interfere with existing API implementations.
This issue reminds me of another important issue #443436: Create full API documentation :-)
Comment #19
aron novak: #18: Function signature: true, good catch.
About the if condition: I like it because it immediately shows you the valid usages of 'unique'.
I made it a little bit simpler: one if condition is enough, no need to nest them.
Comment #20
aron novak: "I made it a little bit simplier, one if condition is enough, no need to nest them." This was a silly idea. It generated a serious bug, so before the commit I reverted to the original structure.