I installed leech in a multi-site environment and got multiple entries for all articles, up to 33.
I needed a whole day to delete hundreds of articles.
I deactivated leech because visitor traffic to these sites also decreased dramatically. It seems that Google penalized me for adding duplicate content.
Is there a solution to prevent multiple entries?
Juergen
Comments
Comment #1
alex_b commented
Hi unitec,
thanks for your bug report.
* Do all duplicate items contain links to their original articles?
* Do duplicate items have same links?
* What does the script that calls cron.php look like? (wget...)
* Any anomalies on cron? (Cron timing out or similar)
* Could you post the feed URLs the duplicates stem from
* Did you upgrade leech from a previous version? if yes, which?
* what PHP version/mysql version are you using?
Comment #2
alex_b commented
For reference: Previous (fixed) issue on duplicates: http://drupal.org/node/135333
Comment #3
unitec commented
Hello Alex,
* Do all duplicate items contain links to their original articles?
None of the articles has a link to the original one.
* Do duplicate items have same links?
See above
* What does the script that calls cron.php look like? (wget...)
I run Poormanscron. It also happened when I ran leech manually.
* Any anomalies on cron? (Cron timing out or similar)
No
* Could you post the feed URLs the duplicates stem from
http://z.about.com/6/o/m/homerenovations_t2.xml
It happens on all feed URLs (30+).
* Did you upgrade leech from a previous version? if yes, which?
from Version 5.1.8, but the previous version had the same problem.
* what PHP version/mysql version are you using?
PHP 4.4.6
MySQL database 4.1.22
I installed Leech 5.1.8 on another site http://www.home-improvement4u.com and followed all installation steps but with the same result.
Juergen
Comment #4
alex_b commented
Got it: there is no "link" property. The duplicate checking needs to check for guid if the link tag is not present.
<guid isPermaLink="true">http://homerenovations.about.com/od/wallsandtrim/g/drywallgloss.htm</guid>
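The fallback described above can be sketched roughly as follows. This is only an illustration in Python, not leech's actual PHP code: the helper names `item_identity` and `filter_new_items` are hypothetical, and skipping items that have neither a link nor a guid is an assumption made for the sketch.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Set


def item_identity(item_xml: str) -> Optional[str]:
    """Return the key used for duplicate detection: <link> if present, else <guid>."""
    item = ET.fromstring(item_xml)
    for tag in ("link", "guid"):
        text = item.findtext(tag)
        if text and text.strip():
            return text.strip()
    return None  # neither tag is usable; the caller decides what to do


def filter_new_items(items: List[str], seen: Set[str]) -> List[str]:
    """Keep only items whose identity key has not been stored yet."""
    fresh = []
    for item_xml in items:
        key = item_identity(item_xml)
        # Assumption for this sketch: items without any key are skipped.
        if key is None or key in seen:
            continue
        seen.add(key)
        fresh.append(item_xml)
    return fresh
```

With a check like this, an item that carries only the guid element shown above would be stored once and ignored on later cron runs.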
Comment #5
alex_b commented
Comment #6
unitec commented
But how to solve it?
Comment #7
alex_b commented
I am trying to get a patch together on our end - any time to work on this on your end?
I just saw that about.com has link tags on their feeds now - still duplicates coming in?
Comment #8
aron novak
alex_b:
There IS a link property:
I don't know whether the feed creator changed the feed or not.
Comment #9
aron novak
Sorry Alex, I hadn't read your last comment :) So it seems that the link property was added. I will strip the link elements and test again then.
Comment #10
aron novak
I tested the given feed in both ways (with link item stripped or not).
I could not reproduce the bug at all. The "link to guid" fallback is handled by the parser:
Alex, have you successfully reproduced this?
Comment #11
alex_b commented
No, I haven't reproduced the bug.
unitec, can you check your leech_news_item table? Are there entries with a link for every one of the duplicate items you're having?
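One way to run that check is a grouped count over the table. The sketch below uses Python with an in-memory SQLite stand-in so it can be run anywhere; the `nid` and `link` columns are assumptions about the leech_news_item schema, and the same SELECT would need adapting to the real MySQL table.

```python
import sqlite3


def duplicate_links(conn):
    """Return (link, copies) pairs for every link stored more than once."""
    query = (
        "SELECT link, COUNT(*) AS copies "
        "FROM leech_news_item "
        "GROUP BY link "
        "HAVING COUNT(*) > 1 "
        "ORDER BY copies DESC, link"
    )
    return conn.execute(query).fetchall()


def build_demo_db(rows):
    """Create an in-memory stand-in table; the real schema may differ."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE leech_news_item (nid INTEGER, link TEXT)")
    conn.executemany("INSERT INTO leech_news_item VALUES (?, ?)", rows)
    return conn
```

Rows with an empty link are grouped together as well, so items stored without any link would also show up in the result.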
Comment #12
unitec commented
The entries appear three times in leech_news_item.
Juergen
Comment #13
alex_b commented
So, you have 3 entries with the same "link" property?
Comment #14
unitec commented
Yes, for a given link property I have as many entries as nodes were created.
Juergen
Comment #15
nerdymark commented
I'd like to chime in on this too: I am also having this same problem. No links to the original article, and duplicates. Oh the duplicates. If anything ever makes you want to truncate your node and node revision tables... yeah.
This happens on fresh installs and upgrades. The links are not even being added to the tables. There are other threads with similar issues and none of the advice there has resolved my problem. I get the problem even when nothing but leech and its dependencies is running. Strange.
Comment #16
unitec commented
The problem with duplicate content only appears on some feeds.
So I deleted only these feeds and instead googled for other RSS feeds, which are working fine on my Credit Card site.
I also deleted the Yahoo tag generator because I got thousands of unrelated tags and it is very time-consuming to delete them again.
Comment #17
unitec commented
I tried again and got the same error.
I looked in the table leech_news_item and found that not all records are written to the database.
There seems to be a problem with writing feeds to the database.
I run leech in a multi-site environment. Could that be causing this problem?
How can I help to solve this problem?
Juergen
Comment #18
unitec commented
I solved the problem:
To see which source a feed came from, I had assigned it a user name like EzineArticlesVoIP.
That way, when the items are mixed together on the blog, I could always see where they came from (Submitted by EzineArticlesVoIP at ...).
I changed the user name to admin and now it works fine.
Juergen
VoIP-Telephony Blog
Comment #19
alex_b commented
Please leave the title of the issue so that it accurately describes the problem.
unitec: So you had manually set the user id of the feed and that caused leech to not recognize the items as duplicates anymore?
Thanks for keeping us up to date!
Comment #20
unitec commented
Yes, exactly.
After changing the user back to admin it works fine.
Juergen
Comment #21
kollega commented
I have now tested and confirmed this bug myself too!
If you are user #1 you don't get duplicates.
I fetched RSS news with user #1 as author and then retrieved the data again. NO DUPLICATES.
Then I fetched RSS news from the same feed at almost the same time with user #N. I didn't get any duplicates (because the RSS news with user #1 as author still existed).
Then I deleted the fetched RSS news (with user #1 as author), fetched RSS news from the same feed with user #N as author, and got duplicates every time I retrieved data.
Please correct this bug as quickly as possible.
Thank you
Comment #22
alex_b commented
User #N is an existing user? (=valid user id)
And: you only have one feed and get duplicates from the one feed? Your post reads a bit as if there were two feeds, one for user #1 and one for user #N.
Comment #23
kollega commented
User #N is any other user, not user #1 (admin).
I ran the tests with 1 feed.
First I tested as admin, then as user #N.
Comment #24
aron novak
Reproduced. Fixed in the DRUPAL-5 branch of the CVS.
Problem: the node's author has to have the "create feed" right.
Before the fix, leech didn't check this at validation.
Now leech performs this check, so users can no longer create feeds that show the duplication behavior.
Thanks to everyone who contributed information in the thread.
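As a rough illustration of the kind of check described here (not the committed patch, which is PHP in the DRUPAL-5 branch), the validation could look like the Python sketch below. The function name `validate_feed_author` and the `permissions_by_user` mapping are hypothetical; only the "create feed" name comes from this thread.

```python
CREATE_FEED = "create feed"  # name of the right as quoted in this thread


def validate_feed_author(author, permissions_by_user):
    """Return a list of validation errors; an empty list means the author is acceptable."""
    # Hypothetical lookup: maps a user name to the set of rights granted to it.
    granted = permissions_by_user.get(author, set())
    if CREATE_FEED not in granted:
        return [
            f'User "{author}" lacks the "{CREATE_FEED}" right and cannot be the author of a feed.'
        ]
    return []
```

For example, `validate_feed_author("EzineArticlesVoIP", {"EzineArticlesVoIP": {"access content"}})` returns one error, while an author holding "create feed" passes. Since Drupal grants user #1 every permission, a check of this kind would be consistent with the reports above that feeds authored by admin did not produce duplicates.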