duplicate nodes being created from feedapi runs
| Project: | FeedAPI |
| Version: | 6.x-1.5 |
| Component: | Code feedapi_node |
| Category: | bug report |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | needs work |
I have a site with about 25 feeds set up using the feedapi module with the SimplePie parser. I keep getting duplicate nodes created on successive cron runs. It doesn't happen every time, but it happens often. I also have the box checked for checking for duplicates within the feed, so that check doesn't seem to be working correctly.
The nodes are created with a pathauto alias, so if the path already exists, pathauto just appends "-0", "-1", etc. to the end.
I'm not sure if this is related but I'm seeing some strange behavior of certain feed nodes being promoted to the front page for no apparent reason. I'm also troubleshooting another issue where I keep reaching the max time when running cron so I'm not sure if that is related or not.
Thanks

#1
Can you try out if it happens without pathauto also?
#2
I can certainly try that.
After a few days of monitoring, I'm noticing a pattern: the duplicates only seem to be created on feeds where I have multiple feeds from the same source. For example, I may have an ESPN feed on "big ten" basketball and an ESPN feed on "NCAA Basketball". It appears to be these feeds that generate the duplicates.
#3
I have turned off pathauto and confirmed that I still am getting duplicate nodes on certain feeds.
#4
Ok, after further review it appears to be a malformed feed. It seems that after an article is posted, if the feed runs the next day the very same article is posted again with a new url containing the date. Based on the logic of feedapi which compares direct matches of guid and url it makes sense that a new node is created.
I wrote a local module that searches for a substring on the pathauto generated url in the hook_nodeapi insert case. If it finds a match then it issues a node_delete on the current node being saved.
This seems to have resolved my problem.
Thanks
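The local module from #4 was not posted, so the following is only a rough sketch of the kind of alias check it describes, under the assumption that a duplicate shows up as an existing alias plus a pathauto-style numeric suffix. The function name `mymodule_find_base_alias()` and the idea of passing the existing aliases in as an array are hypothetical; a real module would look the aliases up in the database.

```php
<?php
/**
 * Hypothetical helper (not part of FeedAPI or pathauto).
 *
 * Returns the existing alias that $alias appears to duplicate,
 * or NULL if no such alias is found.
 */
function mymodule_find_base_alias($alias, array $existing_aliases) {
  // Strip a trailing "-0", "-1", ... suffix of the kind pathauto appends.
  $base = preg_replace('/-\d+$/', '', $alias);
  // Only report a duplicate if a suffix was removed and the base exists.
  if ($base !== $alias && in_array($base, $existing_aliases, TRUE)) {
    return $base;
  }
  return NULL;
}
```

In a Drupal 6 module this check could be called from the hook_nodeapi insert case described in #4. Note that a title which legitimately ends in a number (for example "budget-2009") would also match if "budget" exists, so the check is a heuristic.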
#5
I solved the same problem by writing a parser module, using common_syndication_parser as a starting point. Added functionality includes:
Here's the function which follows redirects to get the "real" url and avoid duplicate articles.
<?php
function _direct_parser_realurl($url) {
  // Request headers only, following redirects along the way.
  static $curlopts = array(
    CURLOPT_AUTOREFERER => true,
    CURLOPT_COOKIESESSION => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_HEADER => true,
    CURLOPT_NOBODY => true,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_USERAGENT => 'Mozilla/4.0',
  );
  $ch = curl_init($url);
  curl_setopt_array($ch, $curlopts);
  $output = curl_exec($ch);
  curl_close($ch);
  // Each redirect hop adds a header block; keep the last Location seen.
  foreach (explode("\r\n", (string) $output) as $line) {
    // Header names are case-insensitive, so compare without regard to case.
    if (!strncasecmp($line, 'Location: ', 10)) {
      $url = trim(substr($line, 10));
    }
  }
  return $url;
}
?>
Compare the source feed with the parsed result.
(But be kind; the Coyle site is very much under construction; the theme is obviously not fleshed out.)
#6
The above strategy still occasionally produced duplicate stories when the same story was in fact published at more than one URL.
Filtering unnecessary arguments from the URL helped but did not eliminate the problem.
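The filtering code itself was not posted, so here is a minimal sketch of what stripping unneeded query arguments might look like. The function name and the default list of argument names are assumptions chosen for illustration; which arguments are safe to drop depends on the feeds involved.

```php
<?php
/**
 * Hypothetical helper: removes the named query arguments from a URL.
 * Any fragment and any user/password part are dropped as well.
 */
function mymodule_strip_url_args($url, array $drop = array('utm_source', 'utm_medium', 'utm_campaign')) {
  $parts = parse_url($url);
  // Leave unparseable URLs and URLs without a query string untouched.
  if ($parts === FALSE || empty($parts['query'])) {
    return $url;
  }
  parse_str($parts['query'], $args);
  foreach ($drop as $name) {
    unset($args[$name]);
  }
  // Rebuild the URL from the pieces that were present.
  $clean = (isset($parts['scheme']) ? $parts['scheme'] . '://' : '')
    . (isset($parts['host']) ? $parts['host'] : '')
    . (isset($parts['port']) ? ':' . $parts['port'] : '')
    . (isset($parts['path']) ? $parts['path'] : '');
  if ($args) {
    $clean .= '?' . http_build_query($args);
  }
  return $clean;
}
```

Two URLs that differ only in the dropped arguments then compare as equal, which is the effect described above.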
My latest strategy is to replace the guid with an md5 sum of the extracted text. That seems to be working for now.
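As an illustration of the md5 approach, here is a minimal sketch. The normalization steps (stripping tags, decoding entities, collapsing whitespace, lowercasing) and the function name are assumptions; the actual parser module may extract and normalize the text differently.

```php
<?php
/**
 * Hypothetical helper: derives a guid from the item text rather than
 * from the URL or guid reported by the feed.
 */
function mymodule_content_guid($html) {
  // Reduce the markup to plain text.
  $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');
  // Collapse runs of whitespace so formatting differences do not matter.
  $text = preg_replace('/\s+/', ' ', $text);
  $text = strtolower(trim($text));
  return md5($text);
}
```

Items whose text is identical after normalization get the same guid, regardless of the URL they arrived under. Any difference in the text, even a single character, produces a different hash.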
#7
A small variation:
<?php
function _direct_parser_realurl($url) {
  static $curlopts = array(
    CURLOPT_AUTOREFERER => true,
    CURLOPT_COOKIESESSION => true,
    CURLOPT_HEADER => true,
    CURLOPT_NOBODY => true,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1',
    // CURLOPT_FOLLOWLOCATION is a boolean; the redirect limit is a separate option.
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_MAXREDIRS => 10,
  );
  $ch = curl_init($url);
  curl_setopt_array($ch, $curlopts);
  $output = curl_exec($ch);
  // Ask curl for the final URL instead of parsing the headers by hand.
  $newurl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
  if (!$newurl) {
    $newurl = $url;
  }
  curl_close($ch);
  return $newurl;
}
function mymodule_feedapi_after_parse($feed) {
  for ($i = 0; $i < count($feed->items); $i++) {
    $feed->items[$i]->options->original_url = _direct_parser_realurl($feed->items[$i]->options->original_url);
  }
}
?>
Saludos,
Roberto
http://www.sobrefamosos.com
#8
Thanks for the improvement. I'm attaching my module, in case anyone is interested.
The current version depends on a single regex which is used to select applicable paragraphs from all newsfeeds. Not everyone would prefer this implementation, but it works for me.
#9
Hi,
I have this situation already a long time, see: http://drupal.org/node/251908
Could a solution for this (may be one of the above), please be inserted into feedapi 6?
Thanks a lot in advance for considering this!
Greetings,
Martijn
#10
@Summit:
Not the same problem.
Your issue has to do with pathauto generating duplicate aliases. Doesn't happen on my installation -- the second one gets a "-1" added to the name; the third gets a "-2"; etc. Dunno what is causing your problem but my hunch is that it has to do with your pathauto installation, not with feedapi.
My issue is different. I want to eliminate duplicate article summaries, even when they come from distinct feed items. To do this, I made a new parser module and rewrote its criteria for duplicate detection.
The original code compares the URL and GUID as reported by the feed.
My code compares a hash of the excerpt of the article as shown on my site.
The original is relatively quick and compliant with the RSS standard.
Mine is relatively slow but immune from lousy implementations of the standard.
I doubt that the feedapi author will choose to incorporate my ideas. But if I get a second paying client who wants a newsfeed, I might go ahead and register a new parser module. Meanwhile, you have a fairly recent snapshot of my working code (above). Feel free to use and improve it. Or pay someone to do it for you.
#11
The idea is quite good; sometimes it's useful to consider the actual content of the feed item.
However, when I checked the module source code, I saw that it could reuse the common syndication functions instead of copy-pasting them. I imagine this module as a common syndication wrapper plus the small modification you made to the guid computation.
#12
Yup. Mine is a horrible ugly hack; it Works For Me (tm) but I'd have to clean it up quite a bit before publishing as a module. Meanwhile, anybody who wants to use the ideas to make something better is certainly welcome.
#13
Anyway, I'm sure it would not take much work to turn this into a clean module.
#14
I'm experiencing the same issue, if I have two separate feeds from the same source, like "USAToday World Politics" & "USAToday World Health", any time they include the same article within both feeds, I get duplicates...
I'm processing hundreds of feeds though so this regex solution might send my cron into a timeout frenzy... any advice? Currently using the simplepie parser
hope it's ok that I changed the status... This is actually pretty critical
#15
#14:
Have you turned on "Check for duplicates on all feeds" option? If not, do that. If you did, please supply the exact feed URLs where you experience this problem.
#16
Aron, I was wrong, it's not duplicate nodes, it's duplicates of the same node within a view (since each feed is from a different taxonomy), enabling the "distinct node" feature in views2 appears to catch it.
#17
If I understand the situation correctly, this issue is fixed.
#18
Well I guess the "distinct:yes" solution caught some of them but not all.
I'm actually experiencing nodes being created both twice from the same feed (with -0 being appended on to the duplicate URL) as well as two different feeds being credited for the same node (as parents)..
Usually it's the 2nd issue, where two separate feeds that carry some identical stories get credited twice as the parent, so only one node, with two parents.. this causes two problems. Views seems to pick it up twice, even though it's technically only one node and "distinct:yes" is configured. So when I'm on a category page (example.com/taxonomy/term/3) with a views block generating all relevant nodes by argument arg(2);, I get duplicate results for any article with multiple parents.
The same issue causes a second problem, I use
<?php
if (!$node->links['feedapi_feed']['title']):
?>
Example :: (Typical node)
[links] => Array
    (
        [feedapi_feed] => Array
            (
                [title] => Feed: Reuters
                [href] => node/6449
            )
        [feedapi_original] => Array
            (
                [title] => Original article
                [href] => http://feeds.reuters.com/~r/reuters/worldNews/~3/1w3sYSzcQqo/idUSTRE55T0...
            )
    )
Example :: (Problem Node with multiple parents)
[links] => Array
    (
        [feedapi_feed_263] => Array
            (
                [title] => Feed: Newsweek
                [href] => node/263
            )
        [feedapi_feed_6526] => Array
            (
                [title] => Feed: MSNBC.com
                [href] => node/6526
            )
        [feedapi_original] => Array
            (
                [title] => Original article
                [href] => http://www.newsweek.com/id/204762
            )
    )
The two feeds that make up the problem node in this example are
http://www.newsweek.com/id/43805/output/rss
&
http://www.msnbc.msn.com/id/3032506/device/rss/rss.xml
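The two `links` dumps above show that the key is `feedapi_feed` for a node with one parent feed and `feedapi_feed_<nid>` when there are several, so a test on `$node->links['feedapi_feed']` alone misses the multi-parent case. The following is only a sketch of one way a theme might cope with both shapes; the helper name is hypothetical and it relies solely on the key names visible in the dumps.

```php
<?php
/**
 * Hypothetical helper: returns every parent-feed link from a node's
 * links array, whether the key is "feedapi_feed" or "feedapi_feed_<nid>".
 */
function mymodule_feed_links(array $links) {
  $feeds = array();
  foreach ($links as $key => $link) {
    // "feedapi_original" does not start with "feedapi_feed", so it is skipped.
    if (strpos((string) $key, 'feedapi_feed') === 0) {
      $feeds[$key] = $link;
    }
  }
  return $feeds;
}

// Sample data copied from the multi-parent dump.
$links = array(
  'feedapi_feed_263' => array('title' => 'Feed: Newsweek', 'href' => 'node/263'),
  'feedapi_feed_6526' => array('title' => 'Feed: MSNBC.com', 'href' => 'node/6526'),
  'feedapi_original' => array('title' => 'Original article', 'href' => 'http://www.newsweek.com/id/204762'),
);
$feed_links = mymodule_feed_links($links);
```

A template could then test whether `$feed_links` is empty instead of checking a single fixed key.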
So in some cases when both problems combine I get the same article three times, one with two parents, and then another with -0 at the end when one of those parents happened to randomly create the node twice, all of which show up in my view block.
I see the duplicate node (with -0 appended to the duplicate) less often than the multiple parent issue.
Let me know if you need any other details to help troubleshoot, pretty critical problem at the moment.
Best Regards
#19
I should add that I'm running the FeedAPI DEV version, updated in the last 7 days.. so if anything significant has happened related to duplicates recently let me know and I'll try another upgrade.
#20
To answer your question from #15, "check for duplicates on all feeds" is selected on every feed.
#21
I'm looking at the 5.x-1.5 code and the interesting thing to me is that feedapi_inherit_feedapi_item() always returns NULL when $op is 'unique', or for any op for that matter. The code in _feedapi_invoke_refresh() doesn't distinguish between FALSE and NULL, so in some scenarios we run both the code for a unique item and the code for a non-unique item.
Unfortunately I cannot investigate further right now.
The interesting scenario in _feedapi_invoke_refresh() seems to be this one:
A new item is processed and feedapi_node reports it as unique (save it) and then feedapi_inherit reports it to be non unique (update an item that may not yet exist everywhere?).
Of course some people may have these processors running in the reverse order.
EDIT:
I have taken another quick look and it appears that feedapi_inherit causes no harm. It reports a duplicate but does not insert anything itself, neither does it appear to trigger some other insert.
#22
5.x-1.5
_feedapi_node_unique() checks for a duplicate URL or GUID within the same feed, but _feedapi_node_update() uses the nid of the first duplicate URL or GUID from all feeds... This wrong combination of feed and feed item is then used to delete from and insert into the list of existing items...
EDIT: _feedapi_update() was a mistake, and I fixed it to the correct name: _feedapi_node_update().
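To make the mismatch concrete, here is a small in-memory model. It is not FeedAPI's code: the function, the array layout, and the sample nids are all invented for illustration. It only shows how a lookup scoped to one feed and a lookup across all feeds can return different nodes for the same item.

```php
<?php
/**
 * Hypothetical lookup: returns the nid of the first stored item that
 * matches $item by URL or GUID, optionally restricted to one feed.
 */
function mymodule_find_item(array $stored, array $item, $feed_nid = NULL) {
  foreach ($stored as $row) {
    // Skip rows from other feeds when a feed scope is given.
    if ($feed_nid !== NULL && $row['feed_nid'] !== $feed_nid) {
      continue;
    }
    if ($row['url'] === $item['url'] || $row['guid'] === $item['guid']) {
      return $row['nid'];
    }
  }
  return NULL;
}

// The same article already exists under two different feeds.
$stored = array(
  array('nid' => 101, 'feed_nid' => 1, 'url' => 'http://example.com/a', 'guid' => 'a'),
  array('nid' => 202, 'feed_nid' => 2, 'url' => 'http://example.com/a', 'guid' => 'a'),
);
$item = array('url' => 'http://example.com/a', 'guid' => 'a');

// While refreshing feed 2: the per-feed check finds node 202 ...
$scoped = mymodule_find_item($stored, $item, 2);
// ... but a lookup across all feeds returns node 101 instead.
$global = mymodule_find_item($stored, $item);
```

If the uniqueness check and the update step use different scopes, as described above, the update can end up pairing feed 2 with node 101. Using the same scope in both places would avoid that in this toy model.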
#23
I'm having similar issues with my site - both in the creation of duplicate nodes (with -0, -1, etc.) and the appearance of duplicate feeds in the admin/content/feeds menu whenever I edit and save a feed.
The big issue for me is that the feed items aren't always showing all the content - sometimes the body is missing, and sometimes it's not showing by default but when I go in and edit, it's there and all I have to do is save it to get it to show. I'm so lost, been trying to fix this for a few days now.
Setup:
Drupal 6.14
FeedAPI 6.x-1.9-beta1
FeedAPI Node 6.x-1.9-beta1
FeedAPI Mapper 6.x-2.0-alpha3
Common syndication parser 6.x-1.9-beta1
FeedAPI Inherit 6.x-1.9-beta1
FeedAPI Taxonomy Compare 6.x-1.4
#24
Follow up on #22. I know it's not against HEAD but perhaps this 5.x-1.5 patch gets someone started who has time and deeper knowledge of FeedAPI.
EDIT: #324797: Duplicate items when update items enabled http://drupal.org/node/324797#comment-1607302
All that is missing in that issue is a complete solution for 5.x.
#25
subscribing, same issue with 6.x
#26
subscribing, same issue with 6.x
with simple pie as my XML parser of choice