I installed leech in a multi-site environment and got multiple entries for all articles, up to 33.
It took me a whole day to delete hundreds of articles.
I deactivated leech because the visitor traffic to these sites also decreased dramatically. It seems that Google penalized me for adding duplicate content.

Is there a solution to prevent multiple entries?

Juergen

Comments

alex_b’s picture

Status: Active » Postponed (maintainer needs more info)

Hi unitec,

thanks for your bug report.

* Do all duplicate items contain links to their original articles?
* Do the duplicate items have the same links?
* What does the script that calls cron.php look like? (wget...)
* Any anomalies on cron? (Cron timing out or similar)
* Could you post the feed URLs the duplicates stem from?
* Did you upgrade leech from a previous version? If yes, which?
* What PHP version/MySQL version are you using?

alex_b’s picture

For reference: Previous (fixed) issue on duplicates: http://drupal.org/node/135333

unitec’s picture

Hello Alex,

* Do all duplicate items contain links to their original articles?

None of the articles has a link to the original one.

* Do the duplicate items have the same links?

See above

* What does the script that calls cron.php look like? (wget...)

I run Poormanscron. It also happened when I ran leech manually.

* Any anomalies on cron? (Cron timing out or similar)

No

* Could you post the feed URLs the duplicates stem from?

http://z.about.com/6/o/m/homerenovations_t2.xml
It happens on all feed URLs (30+).

* Did you upgrade leech from a previous version? If yes, which?

from Version 5.1.8, but the previous version had the same problem.

* What PHP version/MySQL version are you using?

PHP 4.4.6
MySQL database 4.1.22

I installed Leech 5.1.8 on another site http://www.home-improvement4u.com and followed all installation steps but with the same result.

Juergen

alex_b’s picture

Priority: Normal » Critical
Status: Postponed (maintainer needs more info) » Active

Got it: there is no "link" property. The duplicate checking needs to check for guid if the link tag is not present.

<guid isPermaLink="true">http://homerenovations.about.com/od/wallsandtrim/g/drywallgloss.htm</guid>

alex_b’s picture

Title: Multiple Nodes Were Created » Duplicates from feeds with guid tags but no link tags

unitec’s picture

But how to solve it?

alex_b’s picture

I am trying to get a patch together on our end - any time to work on this on your end?

I just saw that about.com has link tags on their feeds now - still duplicates coming in?

aron novak’s picture

alex_b:
There IS a link property:

         <link>http://homerenovations.about.com/od/wallsandtrim/g/drywallgloss.htm</link>
         <description><![CDATA[Drywall]]></description>
         <guid isPermaLink="true">http://homerenovations.about.com/od/wallsandtrim/g/drywallgloss.htm</guid>

I don't know whether the feed creator changed the feed or not.

aron novak’s picture

Sorry Alex, I hadn't read your last comment :) So it seems that the link property was added. I'll strip the link elements and test again.

aron novak’s picture

I tested the given feed in both ways (with link item stripped or not).
I could not reproduce the bug at all. The "link to guid" fallback is handled by the parser:

  if ($data['LINK']) {
    // TODO: remove this Atom hack when we have field mapping or at least specialized parsers in place
    if (count($data['LINK']) > 1) {
      $item->link = $feed->link;
      foreach ($data['LINK'] as $temp) {
        if ($temp['REL'] == 'alternate') {
          $item->link = $temp['HREF'];
        }
      }
    }
    else {
      $item->link = ($data['LINK'][0]['HREF'] ? $data['LINK'][0]['HREF'] : $data['LINK'][0]['VALUE']);
    }
  }
  elseif ($data['GUID'] && (strncmp($data['GUID'][0]['VALUE'], 'http://', 7) == 0) && $data['GUID'][0]['ISPERMALINK'] != 'false') {
    $item->link = $data['GUID'][0]['VALUE'];
  }
  else {
    $item->link = $feed->link;
  }
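
To make the fallback easier to try outside of leech, the logic above can be wrapped in a standalone function and run against an item that has only a guid tag. This is a minimal sketch, not leech code: `resolve_item_link()` is a hypothetical helper, the `isset`/`empty` guards are additions, and `http://example.com/` is a placeholder for the feed's own link.

```php
<?php
// Standalone sketch of the link/guid fallback quoted above.
// resolve_item_link() is a hypothetical helper, not a leech function.
function resolve_item_link(array $data, $feed_link) {
  if (!empty($data['LINK'])) {
    if (count($data['LINK']) > 1) {
      // Several link elements (Atom style): prefer rel="alternate",
      // otherwise fall back to the feed's own link.
      $link = $feed_link;
      foreach ($data['LINK'] as $temp) {
        if (isset($temp['REL'], $temp['HREF']) && $temp['REL'] == 'alternate') {
          $link = $temp['HREF'];
        }
      }
      return $link;
    }
    // Single link element: use the href attribute if present, else the text.
    $first = $data['LINK'][0];
    return !empty($first['HREF']) ? $first['HREF'] : $first['VALUE'];
  }
  if (!empty($data['GUID'])) {
    // No link element: accept the guid if it looks like a URL and is not
    // explicitly marked isPermaLink="false".
    $guid = $data['GUID'][0];
    $is_permalink = !isset($guid['ISPERMALINK']) || $guid['ISPERMALINK'] != 'false';
    if (strncmp($guid['VALUE'], 'http://', 7) == 0 && $is_permalink) {
      return $guid['VALUE'];
    }
  }
  // Nothing usable on the item: every such item gets the feed's link.
  return $feed_link;
}

$url = 'http://homerenovations.about.com/od/wallsandtrim/g/drywallgloss.htm';
$feed_link = 'http://example.com/'; // placeholder, not taken from the feed

// Item as originally reported: guid tag only.
$guid_only = array('GUID' => array(array('VALUE' => $url, 'ISPERMALINK' => 'true')));
// Item as it looks now: link tag plus guid tag.
$with_link = array('LINK' => array(array('VALUE' => $url))) + $guid_only;
// Item with neither tag.
$neither = array();

echo resolve_item_link($guid_only, $feed_link), "\n";
echo resolve_item_link($with_link, $feed_link), "\n";
echo resolve_item_link($neither, $feed_link), "\n";
```

Under these assumptions both the guid-only item and the item with a link tag resolve to the article URL, so a missing link tag alone would not be expected to produce duplicates.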

Alex, have you successfully reproduced this?

alex_b’s picture

No, I haven't reproduced the bug.

unitec, can you check your leech_news_item table? Are there entries with a link for every one of the duplicate items you're having?
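
One way to run that check is to count how often each link occurs among the rows. The following is only a sketch: `find_duplicate_links()` is a hypothetical helper, the rows are made-up stand-ins for records exported from leech_news_item, and the only column assumed is "link".

```php
<?php
// Hypothetical helper: count how often each link occurs in a list of rows.
// Each row mimics a record from leech_news_item; only the "link" key is assumed.
function find_duplicate_links(array $rows) {
  $counts = array();
  foreach ($rows as $row) {
    // Rows without a link are grouped under the empty string.
    $link = isset($row['link']) ? $row['link'] : '';
    $counts[$link] = isset($counts[$link]) ? $counts[$link] + 1 : 1;
  }
  // Keep only links that occur more than once.
  $duplicates = array();
  foreach ($counts as $link => $count) {
    if ($count > 1) {
      $duplicates[$link] = $count;
    }
  }
  return $duplicates;
}

// Made-up sample rows for illustration only.
$url = 'http://homerenovations.about.com/od/wallsandtrim/g/drywallgloss.htm';
$rows = array(
  array('link' => $url),
  array('link' => $url),
  array('link' => $url),
  array('link' => 'http://example.com/another-article'),
);

print_r(find_duplicate_links($rows));
```

Links that show up here with a count above one were stored several times; rows grouped under the empty string would mean the link was never written at all.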

unitec’s picture

The entries appear three times in leech_news_item.

Juergen

alex_b’s picture

So, you have 3 entries with the same "link" property?

unitec’s picture

Yes, for each link I have as many entries as nodes were created.

Juergen

nerdymark’s picture

I'd like to chime in on this too: I am also having this same problem. No links to original article, and duplicates. Oh the duplicates. If anything ever makes you want to truncate your node and node revision tables... yeah.

This happens on fresh installs and upgrades. The links are not even being added to the tables. There are other threads with similar issues, and none of the advice there has resolved my problem. I get the problem even when nothing but leech and its dependencies is running. Strange.

unitec’s picture

The problem with duplicate content only appears on some feeds.
So I deleted only these feeds and instead googled for other RSS feeds, which are working fine on my Credit Card site.
I also removed the Yahoo tag generator because I got thousands of unrelated tags, and deleting them again is very time-consuming.

unitec’s picture

I tried again and got the same error.
I looked in the leech_news_item table and found that not all records are written to the db.
There seems to be a problem with writing feed items to the db.
I run leech in a multi-site environment. Could that be causing this problem?
How can I help to solve this problem?

Juergen

unitec’s picture

Title: Duplicates from feeds with guid tags but no link tags » Problem Solved!

I solved the problem:
To keep track of which source a feed comes from, I had assigned a user name like EzineArticlesVoIP to it.
That way, when the items are mixed together on the blog, I could always see where they came from (Submitted by EzineArticlesVoIP at ...).

I changed the user name to admin and now it works fine.

Juergen
VoIP-Telephony Blog

alex_b’s picture

Title: Problem Solved! » Duplicates from feeds when manually setting user of feed items
Priority: Critical » Normal
Status: Active » Postponed (maintainer needs more info)

Please leave the title of the issue so that it accurately describes the problem.

unitec: So you had manually set the user id of the feed and that caused leech to not recognize the items as duplicates anymore?

Thanks for keeping us up to date!

unitec’s picture

Yes, exactly.
After changing the user back to admin it works fine.

Juergen

kollega’s picture

I have now tested and confirmed this bug myself too!
If you are user #1 you don't get duplicates.

I fetched RSS news with user #1 as author and then retrieved the data again. NO DUPLICATES.
Then I fetched RSS news from the same feed at almost the same time with user #N as author. I didn't get any duplicates (because the RSS news with user #1 as author still existed).
Then I deleted the fetched RSS news (with user #1 as author) and fetched RSS news from the same feed with user #N as author, and I got duplicates every time I retrieved data.

Please correct the bug as quickly as possible!
Thank you

alex_b’s picture

User #N is an existing user? (=valid user id)

And: you only have one feed and get duplicates from that one feed? Your post reads a bit as if there were two feeds, one for user #1 and one for user #N.

kollega’s picture

User #N is any other user, not user #1 (admin).
I ran the tests on one feed.
First I tested as admin, then as user #N.

aron novak’s picture

Assigned: Unassigned » aron novak
Status: Postponed (maintainer needs more info) » Fixed

Reproduced. Fixed in the DRUPAL-5 branch in CVS.
Problem: the node's author has to have the "create feed" right.
Before the fix, leech didn't check this during validation.
Now leech performs this check, so users can no longer create feeds that show the duplication behavior.
Thanks to everyone who contributed information in this thread.
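
For readers who want a picture of what such a validation check could look like, here is a rough standalone sketch. It is not the code committed to CVS: `author_has_permission()` and `validate_feed_author()` are hypothetical names, the role and permission arrays are made up, and the special case for user #1 is an assumption based on the observation earlier in the thread that admin-authored feeds did not produce duplicates. In Drupal itself, the permission lookup would normally go through user_access().

```php
<?php
// Hypothetical stand-in for a permission lookup. Roles map to permission lists.
function author_has_permission(array $author, $permission, array $role_permissions) {
  // Assumption: user #1 passes every permission check, which would explain
  // why feeds authored by admin never showed the problem.
  if ($author['uid'] == 1) {
    return TRUE;
  }
  foreach ($author['roles'] as $role) {
    if (isset($role_permissions[$role]) && in_array($permission, $role_permissions[$role])) {
      return TRUE;
    }
  }
  return FALSE;
}

// Returns a list of validation errors; an empty list means the feed may be saved.
function validate_feed_author(array $author, array $role_permissions) {
  $errors = array();
  if (!author_has_permission($author, 'create feed', $role_permissions)) {
    $errors[] = 'Author "' . $author['name'] . '" lacks the "create feed" right.';
  }
  return $errors;
}

// Made-up roles and users for illustration only.
$role_permissions = array(
  'feed editor' => array('create feed'),
  'authenticated user' => array('access content'),
);
$admin = array('uid' => 1, 'name' => 'admin', 'roles' => array('authenticated user'));
$plain = array('uid' => 7, 'name' => 'EzineArticlesVoIP', 'roles' => array('authenticated user'));
$editor = array('uid' => 8, 'name' => 'editor', 'roles' => array('authenticated user', 'feed editor'));

foreach (array($admin, $plain, $editor) as $author) {
  $errors = validate_feed_author($author, $role_permissions);
  echo $author['name'], ': ', $errors ? implode(' ', $errors) : 'ok', "\n";
}
```

In this sketch a feed assigned to an author without the right is rejected when it is saved, instead of being accepted and then producing duplicates on every refresh.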

Anonymous’s picture

Status: Fixed » Closed (fixed)