In the RSS 2.0 "specification", GUID is an optional element meant to "uniquely [identify] the item" within a particular feed. Drupal, however, goes a step further: if no GUID exists, Drupal treats the LINK of an item as a unique value. This is incorrect: it assumes the LINK is unique, which no version of RSS has ever guaranteed. In many non-blogging feeds (such as a weather feed, where I am unable to get the second ITEM), multiple items are emitted with the same LINK; Drupal, however, only sees one of those items, because it erroneously treats the LINK as if it were unique. Drupal should check only for GUID, and then do a direct TITLE + LINK + DESCRIPTION comparison for each item - only then should something be considered unique. (So yes, minor spelling corrections would be considered a new unique item - some readers even go so far as to show a diff between two items that meet a threshold of similarity.)
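A minimal sketch of the difference, using two hypothetical weather items that share a LINK (example data only; this is not the aggregator module's code):

```php
<?php
// Two forecast items that share a LINK but differ in DESCRIPTION.
$items = array(
  array('title' => 'Forecast', 'link' => 'http://example.com/weather', 'description' => 'Sunny'),
  array('title' => 'Forecast', 'link' => 'http://example.com/weather', 'description' => 'Rain'),
);

// Current behavior: keyed by LINK alone, the second item overwrites the first.
$by_link = array();
foreach ($items as $item) {
  $by_link[$item['link']] = $item;
}

// Proposed behavior: keyed by TITLE + LINK + DESCRIPTION, both items survive.
$by_identity = array();
foreach ($items as $item) {
  $by_identity[$item['title'] . '|' . $item['link'] . '|' . $item['description']] = $item;
}
```

With the LINK-only key, one of the two weather items silently disappears; with the composite key, both are kept.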
Comments
Comment #1
morbus iff commented: After discussion about this in IRC with dopry, we've established the following duplicate rules:
Comment #2
dopry commented: Another thought that came to mind would be to fingerprint an item with its values and store that as an md5 hash, or use some other fingerprinting mechanism, as a universally unique id regardless of what elements an item may contain.
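A minimal sketch of this fingerprinting idea (the helper function and field list are hypothetical, not part of the aggregator module):

```php
<?php
// Build a stable fingerprint from whatever elements the item has.
// Missing elements contribute an empty string, so the hash is always
// defined regardless of which elements an item contains.
// (Hypothetical helper; not existing Drupal API.)
function aggregator_item_fingerprint(array $item) {
  $parts = array();
  foreach (array('guid', 'title', 'link', 'description', 'pubdate') as $key) {
    $parts[] = isset($item[$key]) ? $item[$key] : '';
  }
  // Join with a separator so adjacent fields can't collide.
  return md5(implode("\x00", $parts));
}

// Two items differing only in description get different fingerprints.
$a = aggregator_item_fingerprint(array('title' => 'Rain', 'description' => 'Heavy'));
$b = aggregator_item_fingerprint(array('title' => 'Rain', 'description' => 'Light'));
```

The separator byte matters: without it, ('ab', 'c') and ('a', 'bc') would hash identically.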
Comment #3
magico commented: Verified. Deserves further discussion!
Comment #4
LAsan commented: Verified. Deserves further discussion!
Still a bug in CVS?
Comment #5
roychri commented: This is still an issue in D7.
It checks for GUID, then falls back to the link only, and then to the title only.
Please provide examples of inputs with expected output.
What I mean is: if someone could provide feeds (attached) with the proper items to test these conditions, I can write the patch for D7.
Comment #6
alex_b commented: I'm not sure what the best solution is here. I've run into examples before, like the weather feed, where the current approach breaks (as does the approach taken by the patch on #236237).
But: when aggregating news feeds, using only GUID leads to many, many duplicates. The same will happen with the proposed 3-stage approach (#1), as GUID is the first stage:
A pattern I've seen over and over while working on http://www.managingnews.com is that news feeds use the GUID as an indicator of whether an article has changed, while the LINK of the item stays the same. In this scenario, you would usually want to discard link duplicates or update existing items with the new data; otherwise you end up with many de facto duplicates. Typically these duplicates reflect minor typo fixes or intentional changes for spamming.
It's hard to say to what extent this scenario affects non-power users of aggregator.
Given the inconsistent nature of the beast, I think an important feature is customization of deduplication: offering contrib modules a way to override the method being used.
This is something that's going to be part of the patch over here #236237.
Further:
* In principle I like the idea of fingerprinting/hashing. We've had very good experiences with speeding up deduplication by hashing entire feeds in FeedAPI.
* We've also had good experiences with using the pub date for deduping. In a custom module I'm using [if the pub date is the same, check the title; if the title is the same, it's a duplicate] - but that's a lot of guessing right there :)
Questions:
* "Use TITLE + LINK PUBDATE if PUBDATE exists." - should this be TITLE + LINK + PUBDATE ?
* The weather feed contains GUIDs unique within the feed - you still can't aggregate all items?
Comment #7
morbus iff commented: "* The weather feed contains GUIDs unique within the feed - you still can't aggregate all items?" - no idea. This issue is two years old, and I'm no longer working on that particular site. And yes on TITLE + LINK + PUBDATE.
I'd be much happier doing our own hashing technique in core - it's plainly obvious that anything provided at the feed level is either absent (no GUID in RSS 1, Atom, etc.) or wrong (people who don't use the GUID properly, LINKs being the same, etc.). EDIT: If people want to try more spec-specific deduping in their custom modules, great, but I don't think we should either a) revert core to just using LINKs [per the GSoC aggregator] or b) keep the current broken behavior in core, which depends on an optional and poorly understood feature.
Comment #8
alex_b commented: a) - agree.
This is an interesting summary of what other aggregators do: http://www.xn--8ws00zhy3a.com/blog/2006/08/rss-dup-detection
Comment #9
morbus iff commented: Heh. My AmphetaDesk (one of the first aggregators, and the first open-source/cross-platform one) didn't do *any* dupe detection (he states, for no particularly important reason).
Comment #10
aron novak commented: Here is a possible deduping approach:
1) Check the GUID. If the GUID is in the DB => it's a duplicate. Else, go to 2.
2) If a link exists and all links are distinct within the downloaded feed, go to 3; else go to 4.
3) Check TITLE + LINK (+ PUBDATE if it exists). If it's in the DB => duplicate, else => unique.
4) Check TITLE + md5(DESCRIPTION). If it's in the DB => duplicate, else => unique.
This requires storing md5(DESCRIPTION) in the DB to speed up "full text" comparison of the items.
This method means:
- if the item is new: two SQL queries
- if the item already exists: one or two queries (depending on whether the feed uses GUID)
Please share your opinion on this!
The basic concept is: only trust GUID for deciding that an item is a duplicate. For declaring it unique, GUID is not enough.
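A rough sketch of the four steps above, with hypothetical lookup callbacks standing in for the SQL queries (none of this is existing aggregator code):

```php
<?php
// Sketch of the proposed flow. The $exists_* callbacks stand in for the
// three DB lookups described above and are purely hypothetical.
function is_duplicate(array $item, array $feed_links, $exists_guid, $exists_title_link, $exists_title_desc) {
  // 1) Trust GUID for detecting duplicates.
  if (!empty($item['guid']) && $exists_guid($item['guid'])) {
    return TRUE;
  }
  $title = isset($item['title']) ? $item['title'] : '';
  $link = isset($item['link']) ? $item['link'] : '';
  // 2) Fall back to LINK only if every link in the downloaded feed is distinct.
  if ($link !== '' && count($feed_links) === count(array_unique($feed_links))) {
    // 3) TITLE + LINK (+ PUBDATE if it exists).
    $pubdate = isset($item['pubdate']) ? $item['pubdate'] : '';
    return $exists_title_link($title, $link, $pubdate);
  }
  // 4) TITLE + md5(DESCRIPTION).
  $description = isset($item['description']) ? $item['description'] : '';
  return $exists_title_desc($title, md5($description));
}
```

Note that GUID alone can only confirm a duplicate (step 1); a unique verdict always requires falling through to step 3 or 4, matching the "only trust GUID for duplicates" concept.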
Comment #11
morbus iff commented: Why the necessity for "all links are distinct within the downloaded feed"? That moves us dangerously into particular-case scenarios. If we move to a "GUID then MD5" scenario, we've already "solved" this particular bug, as we're no longer falling back on LINK; we don't need to add logic /specifically/ for this bug anymore. I'd rather we do only two steps of logic - GUID, then MD5. Adding a third ("if links are different") could get us into trouble. As for the MD5, I suspected it would be md5(TITLE + DESCRIPTION) - not TITLE + md5(DESCRIPTION)?
Comment #12
dopry commented: Is there a reason to drop pubdate as part of the fingerprint?

```php
// Missing elements fall back to empty strings, so they simply
// become part of the signature.
$guid        = isset($item['guid']) ? $item['guid'] : '';
$title       = isset($item['title']) ? $item['title'] : '';
$description = isset($item['description']) ? $item['description'] : '';
$pubdate     = isset($item['pubdate']) ? $item['pubdate'] : '';
$fingerprint = md5($guid . $title . $description . $pubdate);
```

Doing md5($title . $description . $pubdate) would probably be usable too. We could also concatenate GUID into the hashed string: sometimes one element may be empty, but that emptiness becomes part of the signature, so missing elements are less of an imposition.
Comment #14
R.Muilwijk commented: I'm having problems with this on one of my sites. One of the things that happens is that when the TITLE + DESCRIPTION is just a little bit different and the feed does not provide a GUID, Drupal does an insert but doesn't do a delete.
Comment #15
Anonymous (not verified) commented: Using Aggregator in a Drupal 6 install. Responding here because this appears to be the main bug report:
The manual page or the module itself should carry a warning that it can create duplicates under some circumstances. The Aggregator module should not be in core until this is completely resolved. Duplicates are a severe bug. It could work, it should work, but it doesn't; as a module it is therefore seriously flawed, and should either be labeled as such (dev/beta/experimental) or not included.
And maybe this is something that cannot be fixed from the module side until the feeds providing the data have a better mechanism that makes duplicates easier for a feed reader to detect.
As for now, I lost an entire day implementing something that should work but doesn't. A reminder to read bug reports before installing a module - but I did not expect this of a core module. Hence my vote for adding a warning, or removing it from core until such time as it is fixed.
Comment #16
damien tournoud commented: @design_dolphin: you are mistaken. The issue described here is that the duplication checks implemented by the aggregator module are too strong, and can wrongly consider two items to be duplicates of one another, even if the feed owner intended them to be different items.
The aggregator module doesn't create duplicates.
Comment #17
Anonymous (not verified) commented: Thank you for replying. Not sure if you are correct, though. Can anyone confirm?
In my case it did create duplicates. I was still working out how best to write things down, as there are different posts on the forum describing the duplicate issue, when I read your post. Should those forum posts be consolidated? Happy to help out.
It turned out that the URLs in the feed items were longer than the varchar column allowed, and they were truncated when stored. Which seems odd as a cause of duplicates - aren't they all truncated at the same point, you'd think? Increasing the number of characters for the link varchar field in the aggregator_item table solved the problem. :-) I did the same for title, author, and guid, and that solved the duplicate problem in this case. So there needs to be a better way to account for longer URLs, titles, authors, and guids - with pretty URLs and long article titles, some articles get really long URLs. Also, should the module warn if any of the fields gets truncated during import, and skip the item? That would give an admin a heads-up before the site fills with dups.
Hope it helps.
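The workaround described above amounts to widening the relevant columns; a sketch in MySQL syntax (the column widths here are illustrative, not the values the commenter used, and column definitions vary by Drupal version):

```sql
-- Widen the aggregator_item columns that were truncating long values.
-- (Illustrative lengths; pick widths that fit your feeds.)
ALTER TABLE aggregator_item
  MODIFY link VARCHAR(1024) NOT NULL DEFAULT '',
  MODIFY title VARCHAR(1024) NOT NULL DEFAULT '',
  MODIFY author VARCHAR(255) NOT NULL DEFAULT '',
  MODIFY guid VARCHAR(1024) NOT NULL DEFAULT '';
```

Changing core schema by hand is lost on reinstall, so a hook_schema_alter() or update hook in a small custom module would be the more maintainable route.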
Comment #20
mfb commented: FYI, just filed a related issue: #971812: aggregator should interpret Atom entry id as equivalent to RSS item guid. This bug causes new Atom entries to update existing aggregator items rather than inserting new items, in cases where entries have unique ids but share a common link.
Comment #21
luke.stewart commented: This should probably be closed, or moved to https://www.drupal.org/project/aggregator