Project:Leech
Version:5.x-1.x-dev
Component:leech
Category:bug report
Priority:normal
Assigned:alex_b
Status:closed (fixed)

Issue Summary

I get duplicate feed item nodes for almost all items, but only the first time leech downloads fresh feeds on a cron run (see attached screenshot).

The module works as expected (it does not add any duplicate item nodes) if I immediately run cron manually again.

Can anyone confirm or help? Thanks a lot!

Attachment        Size
Picture 1_33.png  35.38 KB

Comments

#1

Category:bug report» support request

Hmm... how does leech actually behave with duplicate feed items (items pointing to the same final URL) coming from 2 or more different RSS feeds?

Or, put the other way around: what happens if the same final news item (URL) appears in multiple leeched feeds?

I'm afraid this is probably the reason for the duplication issue... could it be? Cheers

[turning issue into support request]

#2

Category:support request» bug report

OK, re-marking as a bug, as the duplicate items belong to the *same* RSS feed and link to the *same* original article (see screenshot). Thx

Attachment        Size
Picture 1_34.png  21.57 KB

#3

Hi Marcob.

Leech identifies duplicates by looking at the feed item URLs. This usually works very well.

Are you sure you are using version 1.6 for 4.7?

Some other questions:

Could you post the original article URLs of the duplicate feed items? Take the URLs from the link column in the leech_news_item table.

Could you post the SQL structure of your leech_news_item table?

Do you get duplicate feed items only for particular items, or only for particular feeds or for all items and feeds?

#4

Have you possibly upgraded from an earlier leech version to the current one?

#5

Current meaning _your_ current version: 1.6.

#6

Alex_b
I have duplicated items too.

I noticed that this started after the end of March (see the dev versions; maybe you made changes in the code?).

I get duplicate feed items only for particular items within one leech. The duplicates may show up in one feed at first, but later in another.

(installed - Drupal 5.1, mysql 5x., php5x, apache 1.3)

#7

I think it depends on the availability of a feed. Or it is an error during the parsing of the feed items.
It is impossible to see any pattern in these errors. They are not regular and do not occur constantly, though they occur more often in certain feeds.

#8

@Alex_b about your questions at #3-4:

Yes, I'm on leech-4.7.x-1.6 (even though the version number in the module is v 1.3.2.22 of 2007/02/23).

Nope, I didn't upgrade, I installed it on a fresh Drupal 4.7.2 install.

I solved the problem by setting the download to just 1 feed at a time (not 5, as it was by default) and setting cron to run every 5 minutes (I actually have 9 different feeds, so I can download them all every 45 minutes).

Unfortunately leech now runs on a production site, so I cannot really screw things up... What I can say is that before the "fix" I was getting duplicates for almost all items (I can't swear to it, but there were so many duplicates that at least 90% were repeated, some of them 2, some of them 3 times).

Hope this is some sort of clue for you. If this does not help, I could eventually, when I find a little time, create a test site and screw things up there.

About

Could you post the original article URLs of the duplicate feed items? Take the URLs from the link column in the leech_news_item table.
Could you post the SQL structure of your leech_news_item table?

I'm not sure I understood what you mean here...

Thanks

#9

marcob,

Sorry for being unclear with "SQL structure" - I mean, could you post a dump of some rows of duplicate feed items from leech_news_item?

What PHP and MySQL versions are you using?

The bit of code that does duplicate checking is this:

<?php
  $result = db_result(db_query("SELECT COUNT(nid) FROM {leech_news_item} WHERE link = '%s' AND fid = %d", $item->link, $node->nid));
  if ($result > 0) {
    $duplicate_count++;
    if (variable_get('leech_news_verbose', FALSE) === 1) {
      watchdog('leech', t('Found duplicate. Link: %link', array('%link' => $item->link)));
    }
    continue;
  }
?>

The continue statement skips the rest of the current iteration of the enclosing loop, which would otherwise store the feed item. There could be something wrong with the select statement: a feed item link could be corrupted when stored to the database and not be recognized later, or the feed items could stem from different feeds.

Could it be that there are non-latin1 characters in the duplicate feed item's links?

#10

Alex, thanks for explaining. I've actually been playing with Drupal for some time, but I'm no coder.

I'm on MySQL 4.1.21 and PHP 4.4.4. I really have no idea about non-latin1 characters, but I can say that some feed item titles do contain "different" characters for apostrophes and others (see here, here and here for examples).

I just tried setting a higher number of updated leeches per cron run, but didn't manage to get any duplicates... maybe because it is late at night here in Italy, and there are no new feed items right now ;)

Will try again when i find some time

Thanks for considering

#11

To Alex_b on #9

Maybe add an additional check on $item->guid to the SQL statement.

Like this:

      $result = db_result(db_query("SELECT COUNT(nid) FROM {leech_news_item} WHERE link = '%s' AND fid = %d AND guid = '%s'", $item->link, $node->nid, $item->guid));

#12

Subscribing.

Still have tons of duplicate content. 4.7 AND 5.1.

#13

There is no duplicate checking on titles. Only on source URLs (that's the URLs that link back to the original article).

So if two articles have the same title and the same body but different source URLs, both will show up; they are not duplicates in the narrower sense.

I am running leech on an array of websites but I don't get any articles with identical URLs.

funana, if you have content with identical URLs, you should be getting insert errors at cron time, as the leech table declares its url field UNIQUE. Do you have such errors?

alex

#14

Alex,

the items have the same url, title and body text. There are no error messages during cron and it even says "0 item(s) added, 20 duplicate(s) found. Leech node: ..."
BUT as you said "cron"... I have a server cron AND a poorman's cron running on my sites. That's something that could cause trouble, isn't it?
I switched poorman's cron off and now let's see.

Thank you very much for your support Alex!

#15

Version:4.7.x-1.6» 5.x-1.7

Duplicated items here also.

#16

Switched off poorman's cron, still getting duplicates...

#17

multithread cron off?

If there are duplicate articles, there should also be SQL errors upon insertion of those articles, as the url field is UNIQUE. Any such SQL errors?

#18

Version:5.x-1.7» master
Assigned to:Anonymous» alex_b

I can reproduce this error now. I just got it on a 4.7 site. I am pretty sure that this problem occurs in 4.7.x and 5.x versions of leech, as the parsing/storing code didn't change a lot between the versions.

Stay tuned.

#19

It looks like some feed items get stored without an entry in leech_news_item.

As the duplicate check looks for an existing identical URL in leech_news_item, duplicates of such feed items are not identified properly.
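
To illustrate the failure mode described here, below is a minimal, self-contained sketch. It is not the module's code: the arrays stand in for the node table and {leech_news_item}, and the functions is_duplicate() and store_item() as well as the $write_index flag are hypothetical names made up for this example. It assumes only what is stated above, namely that the duplicate check looks for a matching link and feed id in {leech_news_item}.

```php
<?php
// Hypothetical in-memory stand-ins for the node table and {leech_news_item}.
$nodes = array();       // nid => link of the stored feed item node
$news_items = array();  // rows of array('nid', 'fid', 'link')

// Mirrors the check quoted in #9: is there a row with the same link and feed?
function is_duplicate($news_items, $link, $fid) {
  foreach ($news_items as $row) {
    if ($row['link'] === $link && $row['fid'] === $fid) {
      return TRUE;
    }
  }
  return FALSE;
}

// Stores a feed item unless the check flags it as a duplicate.
// $write_index = FALSE simulates a node saved without its {leech_news_item} row.
function store_item(&$nodes, &$news_items, $link, $fid, $write_index) {
  if (is_duplicate($news_items, $link, $fid)) {
    return FALSE;
  }
  $nid = count($nodes) + 1;
  $nodes[$nid] = $link;
  if ($write_index) {
    $news_items[] = array('nid' => $nid, 'fid' => $fid, 'link' => $link);
  }
  return TRUE;
}

$link = 'http://example.com/article-1';
store_item($nodes, $news_items, $link, 7, FALSE); // node saved, index row missing
store_item($nodes, $news_items, $link, 7, TRUE);  // check finds nothing: saved again
store_item($nodes, $news_items, $link, 7, TRUE);  // now correctly skipped

echo count($nodes), "\n"; // prints 2: the same article exists as two nodes
```

Once the index row exists, further copies are rejected, which would match duplicates that appear once and then stop.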

#20

Status:active» postponed (maintainer needs more info)

Do any of you who experience duplicate feed items find "Last cron run did not complete" messages in your watchdog log?

If so, do you find subsequent "added" messages for duplicate nodes in the log, typically within a period shorter than the interval at which cron is supposed to recur? Those "added" messages only show up if "verbose output to watchdog" is checked on the leech settings page.

I couldn't get to the bottom of this today. I will be off for the weekend now, but I'll work on this issue again on Monday morning.

#21

Version:master» 5.x-1.x-dev
Status:postponed (maintainer needs more info)» active

I am pretty sure that duplicate feed items occur when more than one cron process runs at a time.

On cron time, leech opens the next X feed nodes for updating, downloads the feeds, parses them, creates nodes from their feed items and marks the feed nodes updated.

Between opening the feeds for updating and marking them updated there was a huge time window of sometimes minutes. If cron started before a previous cron was finished, it basically opened some feed nodes that were already opened by the first cron run.

This problem is mitigated in the 4.7.x dev version: leech nodes are now marked updated (checked) immediately after they are opened. This shrinks the time window in which they are opened but not yet marked updated from some minutes down to some milliseconds.
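
The effect of marking feeds immediately can be sketched with a small simulation. This is only an illustration under stated assumptions, not the actual patch: pick_feeds(), simulate(), the $checked array and the $clock counter are hypothetical names, and the two overlapping cron runs are modelled as two consecutive picks made before any feed has been processed.

```php
<?php
// Hypothetical feed queue: fid => "checked" counter (lower = more overdue).
function pick_feeds(&$checked, $limit, &$clock, $mark_immediately) {
  asort($checked); // most overdue feeds first, keys (fids) preserved
  $picked = array_slice(array_keys($checked), 0, $limit);
  if ($mark_immediately) {
    foreach ($picked as $fid) {
      $checked[$fid] = ++$clock; // claim the feed before any slow download
    }
  }
  return $picked;
}

// Returns the feeds that two overlapping cron runs would both process.
function simulate($mark_immediately) {
  $checked = array(1 => 10, 2 => 20, 3 => 30, 4 => 40);
  $clock = 100;
  // Run A picks its feeds; run B starts before A has finished downloading.
  $a = pick_feeds($checked, 2, $clock, $mark_immediately);
  $b = pick_feeds($checked, 2, $clock, $mark_immediately);
  return array_values(array_intersect($a, $b));
}

echo count(simulate(FALSE)), "\n"; // prints 2: both runs open the same feeds
echo count(simulate(TRUE)), "\n";  // prints 0: the second run moves on
```

Note that in this sketch picking and marking still happen in two steps, so a very small window remains, which is consistent with the problem being described as mitigated rather than eliminated.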

Somewhat related is this issue: http://drupal.org/node/150972 - make sure that you call cron.php with wget ... -t 1 - I had lots of problems with concurrent cron.php calls caused by wget trying to pull the page up to 20 times in a row.

I need to port this patch to 5.x now.

http://cvs.drupal.org/viewcvs/drupal/contributions/modules/leech/leech.m...

#22

Status:active» fixed

Patch ported.

Check out the latest 4.7.x and 5.x dev versions of leech and see whether they resolve the duplication issue for you. Be aware that both are development versions.

http://cvs.drupal.org/viewcvs/drupal/contributions/modules/leech/leech.m...

#23

Update: check out the new devel versions if you have checked out between the last post and this one. There was a small but fatal error in revision 1.4.2.28.

Thanks, Alex

#24

Status:fixed» closed (fixed)