I get duplicate nodes when cron runs. Feeds needs a check for duplicate items from the same feed.

Comment  File                           Size     Author
#26      feeds_dupe_process_debug.diff  1.42 KB  ekes

Comments

alex_b’s picture

Status: Active » Postponed (maintainer needs more info)

Is at least one of your mappings checked as 'unique target'?

See:

http://skitch.com/alexbarth/nus3t/edit-feed-d6

Anonymous’s picture

I am having this problem also. I have unique GUIDs mapped like in your screenshot, but no URLs.
Even when I ran cron manually, duplicates didn't show up right away, only about 5-6 minutes later.

If you run the import manually from the actual feed instead of relying on cron, it doesn't seem to duplicate anything.

aleighs’s picture

I am having the same issue. It looks like the node creation might somehow be running twice. See cron run message: http://img.skitch.com/20100226-xedh4wdnd7fx2hkt5mwcgtg6fx.jpg

I have GUID and URL link selected as unique targets. I have also tried running it with "update existing items" both checked and unchecked with the same duplicate node results.

webwriter’s picture

Subscribe. The duplicate node issue is causing a lot of rework.

joejoejoew’s picture

I am also having this issue. I am trying to populate a custom content type from an RSS feed on another Drupal 5 site. I set the GUID as the unique field, but I get a whole new set of nodes every time I refresh my feed.

joejoejoew’s picture

I fixed this by mapping Item URL (link) to URL and Item GUID to GUID, and setting both as unique targets.

ramya.narayanan’s picture

Version: 6.x-1.0-alpha10 » 6.x-1.0-alpha12

Using version 6.x-1.0-alpha12, the nodes are created 4 times after the cron run. I have mapped GUID and item URL and checked "unique target". Do I have to check any other settings?

alex_b’s picture

#7: can you verify that the source does actually contain unique URLs and GUIDs?

You can do that by looking at the feeds_node_item table (are there URLs and GUIDs of the same value?) and by printing the parsed content in the parse() method of the processor you're using to the screen or the error log (does the parsed content contain expected values at the places that you use for URL and GUID?).

There may be problems with:

- the parser not detecting values properly
- the source not containing expected values
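One way to run the first check is a grouped query over feeds_node_item. The following is only a minimal sketch: it uses an in-memory SQLite table as a simplified stand-in for the real table, the `guid` column name is an assumption, and the sample rows are made up for illustration.

```python
import sqlite3

# Simplified stand-in for feeds_node_item; the real table has more columns.
# The guid column name is an assumption made for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE feeds_node_item "
    "(nid INTEGER, feed_nid INTEGER, url TEXT, guid TEXT)"
)
conn.executemany(
    "INSERT INTO feeds_node_item VALUES (?, ?, ?, ?)",
    [
        (101, 10, "http://example.com/a", "guid-a"),
        (102, 10, "http://example.com/b", "guid-b"),
        (103, 10, "http://example.com/a", "guid-a"),  # same item stored twice
    ],
)


def duplicate_items(conn):
    """Return (feed_nid, guid, url, copies) for items stored more than once."""
    return conn.execute(
        """
        SELECT feed_nid, guid, url, COUNT(*) AS copies
        FROM feeds_node_item
        GROUP BY feed_nid, guid, url
        HAVING COUNT(*) > 1
        ORDER BY feed_nid, guid
        """
    ).fetchall()


duplicates = duplicate_items(conn)
print(duplicates)
```

If the query returns rows, the same GUID and URL were imported more than once for one feed node. If it returns nothing but duplicate nodes exist, the parser is probably producing different URL or GUID values for the same item.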

SamRose’s picture

I fixed this by mapping Item URL (link) to URL and Item GUID to GUID, and setting both as unique targets.

This worked for me too

webavant’s picture

I tried mapping Item URL (link) to URL and Item GUID to GUID and setting both to unique target. I looked at the feeds_node_item table and saw duplicated GUIDs/links. I ran cron.php several times. The first cron run imports 15 items, the second cron run imports nothing, the third cron run imports 15 duplicate items, and subsequent cron runs do not import any more duplicates... I am not sure what will happen when there is actual new content in the feed, but I assume it will import another duplicate.

webwriter’s picture

Anyone going to clean up the link spamming in #9?

FreddieK’s picture

Anyone found a solution to this yet?

pwhite’s picture

Subscribing - having the same problem.

FreddieK’s picture

My problem was that a feed I'd added to a node and then disabled was still actively importing nodes. It wasn't until I deleted the node that the feed stopped importing.

alex_b’s picture

How did you 'disable' the feed?

FreddieK’s picture

As I recall, I had to create a clone of the feed importer and disable the original one to stop it from importing duplicates.

egm’s picture

I had this issue, too. I had both URL and GUID set as unique fields, but was still getting duplicates on cron. Earlier in the setup, I had the mapping set up in a weird way so that my Twitter items were getting imported with no titles (which is sort of right, but makes admin/content awfully difficult to navigate!). I realized that when duplicates were getting created, half had titles and half didn't, so I think half of the nodes were (somehow) being created by the old version of the feed importer. I don't understand how that could happen because I didn't create new importers; I changed the settings on the same ones and resaved them.

I deleted all the duplicate nodes to start over, and also deleted the nodes that were doing the importing (but I didn't delete the importers themselves). That seems to have cleared up the problem. But what I saw seems to differ from what Freddie describes (I think) because I didn't have two nodes, both of which were importing; I had one node that seemed to be importing under both the old and the new importer settings (weird as that sounds).

egm’s picture

Hoo boy. Still not fixed--woke up this morning to hundreds of new duplicates. I think the only thing I haven't tried yet is deleting both nodes and importers. May just delete everything and start over when I have time.

EDIT: Here's a data point. I deleted all the importers and nodes that were related to feeds. Created a new importer that is attached to the CCK content type Twitter. Created a Twitter node with the URL of a Twitter account that only has two posts (to speed up deletion when it goes haywire). Set to import on creation, but no new nodes appeared. Ran cron. It created two Story nodes with the Twitter posts. But I now have only one importer that doesn't use the Story content type so I'm really mystified as to how this could have happened. Also, I selected delete on the Twitter-type node that didn't appear to have imported anything, and it did delete the two Story nodes.

egm’s picture

Category: support » bug

I've played with this quite a bit more, including uninstalling the module and starting over, and I think I've discovered two problems that have happened more than once, although I don't yet understand what's causing them.

1. I've created feed importers that are set to create a CCK content type, but the importer only creates Story nodes. Maybe I don't understand what this is supposed to do! I create an importer for my Twitter account (simplest example) and attach it to a CCK content type called Twitter. I create a Twitter node that contains the URL of the feed I want to import. When it imports, it creates Story nodes instead of Twitter. I guess this isn't really a problem but it's not what I expected based on the documentation.

2. I have nodes that I deleted that are still importing and creating nodes. I took a gander with phpMyAdmin and identified a duplicate. One copy was created by node 1591 (a node that still exists) and one copy was created by node 1447, which I deleted before these duplicates appeared. Node 1447 still appeared in feeds_source and there were tons of items in feeds_node_item that had feed_nid 1447. I deleted those rows from feeds_source and feeds_node_item manually, but I think the fact that they were still there is a bug. I then used the GUI to delete some of the duplicate nodes to see what would happen.

This morning, they were back! Now I have two identical nodes, 2044 and 1992. Node 2044 appears in the feeds_node_item table, imported by node 1591 (as before). But node 1992 doesn't appear in the feeds_node_item table at all, so I don't know what imported it. Where did it come from?

Meanwhile, I think I'm going to use Views to filter on Owner feed nid < 1692, but that's pretty kludgey! Would like to understand what's really going on here.

(This is in alpha15, btw, not sure if I should start a new thread.)

alex_b’s picture

Please upgrade to the latest release (alpha 15) and see whether the problem persists. If you start serious debugging, use CVS HEAD.

IetC_development’s picture

Subscribing, same issue here, thx for the great module, though.

ufeg02’s picture

I think this is caused by the bug I described: http://drupal.org/node/825624

alex_b’s picture

Version: 6.x-1.0-alpha12 » 6.x-1.x-dev

#20

1)

Look at the node processor's settings - there you will find which node will be *created from feed items*. The 'attach to' setting only says which node you would like to use to piggy back a subscription on.

2)

You clearly have some rogue subscriptions somewhere. Delete them from your feeds_source table. What I'm pretty sure has happened is that you've created an importer configuration for a content type, then you've created a couple of subscriptions by creating nodes of this content type, then you've disabled the importer for this content type and deleted the nodes... That's actually a bug and I've opened an issue for this here: #827572: Orphaned subscriptions.

This entire issue *may* be a duplicate of #827572: Orphaned subscriptions. Not entirely sure.
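To find rogue subscriptions before deleting anything, one could list feeds_source rows whose feed node no longer exists. This is a hedged sketch against simplified in-memory stand-ins for the `node` and `feeds_source` tables. The node IDs come from #20; the importer id `twitter` is hypothetical, and the treatment of `feed_nid = 0` as "not attached to a node" is an assumption.

```python
import sqlite3

# Simplified stand-ins for the node and feeds_source tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE node (nid INTEGER PRIMARY KEY);
    CREATE TABLE feeds_source (id TEXT, feed_nid INTEGER);
    INSERT INTO node VALUES (1591);
    INSERT INTO feeds_source VALUES ('twitter', 1591);  -- node still exists
    INSERT INTO feeds_source VALUES ('twitter', 1447);  -- node was deleted
    """
)


def orphaned_sources(conn):
    """Return feed_nid values in feeds_source that have no matching node."""
    rows = conn.execute(
        """
        SELECT s.feed_nid
        FROM feeds_source s
        LEFT JOIN node n ON n.nid = s.feed_nid
        WHERE s.feed_nid <> 0      -- assumption: 0 means no feed node attached
          AND n.nid IS NULL
        ORDER BY s.feed_nid
        """
    )
    return [row[0] for row in rows]


orphans = orphaned_sources(conn)
print(orphans)
```

Rows reported this way are candidates for manual deletion from feeds_source, along with their items in feeds_node_item. Back up the database first.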

sagar ramgade’s picture

Hi,

I am using the alpha 15 version. I have multiple feeds that deliver the same articles or their updates, so I mapped the article ID to GUID. I noticed that I am getting duplicate nodes because the same articles are fetched from multiple sources. I am using the article ID as GUID with "update existing items" enabled.
I am using the XML parser and the feedxml parser.
Can anyone help with this issue?

Regards
Sagar

ekes’s picture

Status: new, file size: 1.42 KB

I've experienced something like this, so I'm popping this one in for additional info...

Situation: feeds with the filefield mapper, and the files are videos, so they aren't small. This has caused all sorts of fun that I'm still debugging. My working theory is that the problem occurs when the feed import isn't completed before the timeout (other debugging hints are welcome). feeds_node_item certainly contained items with the same hash but different node IDs.

It's possible I've solved it by doing something else, since I've thrown several solutions at this. But I think the improvement so far comes from adding the attached patch. It's debug code, not for production. It keeps a record of the hashes that are being worked on and checks that new items coming up for processing aren't already being worked on. Occasionally it catches one and skips it.
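The idea behind the patch can be sketched roughly as follows. This is not the patch itself, just a minimal illustration in Python with hypothetical names (`process_items`, `in_progress`). A real version would need the record of hashes to be visible to every concurrent import process, which a plain in-memory set is not.

```python
def process_items(items, in_progress=None):
    """Process items, skipping any whose hash is already being worked on.

    items: iterable of dicts with a "hash" key (hypothetical item shape).
    in_progress: set of hashes currently being processed; pass a shared
    set to carry the record over between calls.
    """
    in_progress = set() if in_progress is None else in_progress
    processed, skipped = [], []
    for item in items:
        item_hash = item["hash"]
        if item_hash in in_progress:
            # Another copy of this item is already being handled: skip it.
            skipped.append(item)
            continue
        in_progress.add(item_hash)
        processed.append(item)  # stand-in for creating or updating the node
    return processed, skipped


items = [{"hash": "h1"}, {"hash": "h2"}, {"hash": "h1"}]
processed, skipped = process_items(items)
print(len(processed), len(skipped))
```

The sketch never removes hashes from the record. Whether and when to clear them depends on how long an item counts as "being worked on".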

alex_b’s picture

ekes, your report just made it click. Some of the duplicates reported here may occur because Feeds does not have source-level locking. Duplicates due to missing source-level locking are more likely to happen if:

- Cron is running and feeds are frequently imported manually at the same time.
- Imports are heavy and slow.

#640508: Lock sources before importing/expiring/clearing them
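As a rough illustration of what source-level locking means, here is a sketch in Python. It is not how Feeds implements it. The names `import_source` and `_source_locks` are hypothetical, and since cron and manual imports normally run in separate processes, a real lock would have to live somewhere shared, such as the database, rather than in memory.

```python
import threading

# One lock per source, keyed by feed_nid (hypothetical in-memory registry).
_source_locks = {}
_registry_guard = threading.Lock()


def import_source(feed_nid, do_import):
    """Run do_import(feed_nid) unless that source is already being imported.

    Returns True if the import ran, False if it was skipped because the
    source was locked.
    """
    with _registry_guard:
        lock = _source_locks.setdefault(feed_nid, threading.Lock())
    if not lock.acquire(blocking=False):
        return False  # another import of this source is in progress
    try:
        do_import(feed_nid)
        return True
    finally:
        lock.release()


# Demo: a second import attempted while the first is still running is skipped.
nested_results = []


def slow_import(feed_nid):
    # Simulates cron firing again while this import has not finished.
    nested_results.append(import_source(feed_nid, lambda nid: None))


outer_result = import_source(10, slow_import)
print(outer_result, nested_results)
```

With a lock like this, an overlapping run skips the source instead of processing the same items a second time.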

alex_b’s picture

I'd love you guys' feedback on whether #640508-3: Lock sources before importing/expiring/clearing them fixes your problems.

michellezeedru’s picture

Before I try the Lock Source patch Alex refers to above, I just want to check whether my issue is something different, because it may be (if so, I'm happy to create a new issue). I have a feed set up to refresh as often as possible, which is every 5 minutes with my cron run, with GUID as the unique target. It works perfectly, with no duplicates. Before you start wondering if I'm crazy for refreshing the import every 5 minutes, this is to import playlists from http://www.radioactivity.fm/, where the DJs enter in the tracks they plan. Feeds imports these to display up-to-date recent tracks played on the radio station's site. I am using a source feed of "Songs played in the last hour", so there are only 15 or so nodes created or updated each cron run. So far, this has worked well for a few months now (though the site is not in production yet).

Now I'd like to add an additional feed that refreshes only once a day, using a different source feed, "Songs played in the last 24 hours". Sometimes songs are not entered by the DJ within the hour, so this will fill in any that didn't make it into the "last hour" feed.

This 2nd feed is the problem. I had to create a 2nd importer and 2nd content type so that the refresh happens only once a day (it seemed the only way). This feed, though, creates duplicates of the nodes created from the original feed. I've even tried creating another feed using my same importer/content type (but a different source feed), and this creates duplicates as well. All settings are the same between the feeds, and I confirmed that the different feeds from the source do contain unique GUIDs that match across feeds, so they shouldn't be duplicated if that is all that's being considered when determining "uniqueness" and whether to update or create.

So in addition to the GUID, is the importer looking at the feed node or feed source to determine if it should create or update? Seems to be the case for me. Any ideas to work around this? Sorry for the long explanation :)

alex_b’s picture

#29: this behavior is expected, as existing items are only checked for within a feed (see FeedsNodeProcessor::existingItemId()). You could quickly fix this issue by extending FeedsNodeProcessor and overriding existingItemId() so that its SQL queries don't filter for feed_nid.

SELECT nid FROM {feeds_node_item} WHERE feed_nid = %d AND id = '%s' AND url = '%s'

becomes

SELECT nid FROM {feeds_node_item} WHERE id = '%s' AND url = '%s'
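The effect of dropping the feed_nid filter can be seen in a small sketch. This uses an in-memory SQLite stand-in for feeds_node_item rather than Drupal itself. The function `existing_item_id` is a hypothetical Python analogue of the method, the importer id `playlist` and the sample row are made up, and the `id` column is assumed here to hold the importer's id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE feeds_node_item "
    "(nid INTEGER, id TEXT, feed_nid INTEGER, url TEXT)"
)
# One item, imported earlier by feed node 10.
conn.execute(
    "INSERT INTO feeds_node_item VALUES "
    "(500, 'playlist', 10, 'http://example.com/track/1')"
)


def existing_item_id(conn, importer_id, url, feed_nid=None):
    """Return the nid of an already imported item, or None.

    With feed_nid given, only items of that feed node are considered
    (the original query). With feed_nid=None, items from any feed node
    match (the modified query).
    """
    if feed_nid is None:
        sql = "SELECT nid FROM feeds_node_item WHERE id = ? AND url = ?"
        params = (importer_id, url)
    else:
        sql = (
            "SELECT nid FROM feeds_node_item "
            "WHERE feed_nid = ? AND id = ? AND url = ?"
        )
        params = (feed_nid, importer_id, url)
    row = conn.execute(sql, params).fetchone()
    return row[0] if row else None


url = "http://example.com/track/1"
per_feed = existing_item_id(conn, "playlist", url, feed_nid=20)
across_feeds = existing_item_id(conn, "playlist", url)
print(per_feed, across_feeds)
```

For feed node 20, the per-feed lookup finds nothing, so the item would be created again. The lookup without feed_nid finds the existing node, so the item would be updated instead. Both queries still filter on `id`, so under the assumption above, items brought in by a different importer would still not match.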

TheInspector’s picture

Version: 6.x-1.x-dev » 6.x-1.0-beta4

I have the same problem with duplicates, even when I try to manually import items. I've selected unique URL and GUID. The feed I'm importing from is a Twitter feed.

anonymous07’s picture

Same duplication issue here since upgrading to D6 and just can't solve it.

Importing various Feeds that worked perfectly under D5/FeedAPI.

Duplicates created every time no matter what settings I use.

jaarong’s picture

I have this same problem and am using the default feed importer. Is there a temporary solution for this? I saw that for several people, the locking patch didn't work. I also checked my feed sources, and the source is only listed once. Is #30 the fix?

alex_b’s picture

I just committed #906654: Fix phantom subscriptions which will fix *some* of the duplicate issues reported here.

alex_b’s picture

Status: Postponed (maintainer needs more info) » Closed (duplicate)

I am fairly certain that remaining issues will be fixed with #640508: Lock sources before importing/expiring/clearing them.

Setting this issue to duplicate.