In the RSS 2.0 "specification", GUID is an optional element meant to "uniquely [identify] the item" within a particular feed. Drupal, however, goes a step further: if no GUID exists, Drupal treats the LINK of an item as a unique value. This is incorrect: it assumes the LINK is unique, which no version of RSS has ever guaranteed. In many non-blogging feeds (such as a weather feed, where I am unable to get the second ITEM), multiple items are emitted with the same LINK; Drupal, however, only sees one of those items, because it erroneously treats the LINK as if it were unique. Drupal should check only for GUID, and then do a direct TITLE + LINK + DESCRIPTION comparison for each item - only then should something be considered unique. (So yes, minor spelling corrections would be considered a new unique item - some readers even go so far as to show a diff between two items that meet a threshold of similarity.)
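A minimal sketch of the difference, using two hypothetical weather items that share a LINK (example data only; this is not the aggregator module's code):

```php
<?php
// Two forecast items that share a LINK but differ in DESCRIPTION.
$items = array(
  array('title' => 'Forecast', 'link' => 'http://example.com/weather', 'description' => 'Sunny'),
  array('title' => 'Forecast', 'link' => 'http://example.com/weather', 'description' => 'Rain'),
);

// Current behavior: keyed by LINK alone, the second item overwrites the first.
$by_link = array();
foreach ($items as $item) {
  $by_link[$item['link']] = $item;
}

// Proposed behavior: keyed by TITLE + LINK + DESCRIPTION, both items survive.
$by_identity = array();
foreach ($items as $item) {
  $by_identity[$item['title'] . '|' . $item['link'] . '|' . $item['description']] = $item;
}
```

With the LINK-only key, one of the two weather items silently disappears; with the composite key, both are kept.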
Comments
Comment #1
morbus iff commented: After discussion about this in IRC with dopry, we've established the following duplicate rules:
Comment #2
dopry commented: Another thought that came to mind would be to fingerprint an item with its values and store that as an md5 hash, or use some other fingerprinting mechanism, as a universally unique id regardless of what elements an item may contain.
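A minimal sketch of this fingerprinting idea (the helper function and field list are hypothetical, not part of the aggregator module):

```php
<?php
// Build a stable fingerprint from whatever elements the item has.
// Missing elements contribute an empty string, so the hash is always
// defined regardless of which elements an item contains.
// (Hypothetical helper; not existing Drupal API.)
function aggregator_item_fingerprint(array $item) {
  $parts = array();
  foreach (array('guid', 'title', 'link', 'description', 'pubdate') as $key) {
    $parts[] = isset($item[$key]) ? $item[$key] : '';
  }
  // Join with a separator so adjacent fields can't collide.
  return md5(implode("\x00", $parts));
}

// Two items differing only in description get different fingerprints.
$a = aggregator_item_fingerprint(array('title' => 'Rain', 'description' => 'Heavy'));
$b = aggregator_item_fingerprint(array('title' => 'Rain', 'description' => 'Light'));
```

The separator byte matters: without it, ('ab', 'c') and ('a', 'bc') would hash identically.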
Comment #3
magico commented: Verified. Deserves further discussion!
Comment #4
LAsan commented: Verified. Deserves further discussion!
Still a bug in CVS?
Comment #5
roychri commented: This is still an issue in D7.
It checks for GUID, then falls back to the link only, and then to the title only.
Please provide examples of inputs with expected output.
What I mean is: if someone could provide feeds (attached) with the proper items to test these conditions, I can write the patch for D7.
Comment #6
alex_b commented: I'm not sure what the best solution is here. I've run into examples before, like the weather feed, where the current approach breaks (as does the approach taken by the patch on #236237).
But: when aggregating news feeds, using only GUID leads to many, many duplicates. The same will happen with the proposed 3-stage approach (#1), as GUID is the first stage:
A pattern I've seen over and over while working on http://www.managingnews.com is that news feeds use the GUID as an indicator of whether an article has changed, while the LINK of the item stays the same. In this scenario, you would usually want to discard link duplicates or update existing items with the new data; otherwise you end up with many de facto duplicates. Typically these duplicates reflect minor typo fixes or intentional changes for spamming.
It's hard to say to what extent this scenario affects non-power users of aggregator.
Given the inconsistent nature of the beast, I think an important feature is customization of deduplication: offering contrib modules a way to override the method being used.
This is something that's going to be part of the patch over here #236237.
Further:
* In principle I like the idea of fingerprinting/hashing. We've had very good experiences with speeding up deduplication by hashing entire feeds in FeedAPI.
* We've also had good experiences with using the pub date for deduping. In a custom module I'm using [if the pub date is the same, check the title; if the title is the same, it's a duplicate] - but that's a lot of guessing right there :)
Questions:
* "Use TITLE + LINK PUBDATE if PUBDATE exists." - should this be TITLE + LINK + PUBDATE ?
* The weather feed contains GUIDs unique within the feed - you still can't aggregate all items?
Comment #7
morbus iff commented: "* The weather feed contains GUIDs unique within the feed - you still can't aggregate all items?" - no idea. This issue is two years old, and I'm no longer working on that particular site. And yes on TITLE + LINK + PUBDATE.
I'd be much happier doing our own hashing technique in core - it's plainly obvious that anything provided at the feed level is either absent (no GUID in RSS 1, Atom, etc.) or wrong (people who don't use the GUID properly, LINKs being the same, etc.). EDIT: If people want to try more spec-specific deduping in their custom modules, great, but I don't think we should either a) revert core to just using LINKs [per the GSoC aggregator] or b) keep the current broken behavior in core, which depends on an optional and poorly understood feature.
Comment #8
alex_b commented: a) - agree.
This is an interesting summary of what other aggregators do: http://www.xn--8ws00zhy3a.com/blog/2006/08/rss-dup-detection
Comment #9
morbus iff commented: Heh. My AmphetaDesk (one of the first aggregators, and the first open-source/cross-platform one) didn't do *any* dupe detection (he states, for no particularly important reason).
Comment #10
aron novak commented: Here is a possible deduping approach:
1) Check the GUID. If the GUID is in the DB => it's a duplicate. Else, go to 2.
2) If a link exists and all links are distinct within the downloaded feed, go to 3; else go to 4.
3) Check TITLE + LINK (+ PUBDATE if it exists). If it's in the DB => duplicate, else => unique.
4) Check TITLE + md5(DESCRIPTION). If it's in the DB => duplicate, else => unique.
This requires storing md5(DESCRIPTION) in the DB to speed up "full text" comparison of the items.
This method means:
- if the item is new: two SQL queries
- if the item already exists: one or two queries (depending on whether the feed uses GUID)
Please share your opinion on this!
The basic concept is: only trust GUID for deciding that an item is a duplicate. For declaring it unique, GUID is not enough.
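A rough sketch of the four steps above, with hypothetical lookup callbacks standing in for the SQL queries (none of this is existing aggregator code):

```php
<?php
// Sketch of the proposed flow. The $exists_* callbacks stand in for the
// three DB lookups described above and are purely hypothetical.
function is_duplicate(array $item, array $feed_links, $exists_guid, $exists_title_link, $exists_title_desc) {
  // 1) Trust GUID for detecting duplicates.
  if (!empty($item['guid']) && $exists_guid($item['guid'])) {
    return TRUE;
  }
  $title = isset($item['title']) ? $item['title'] : '';
  $link = isset($item['link']) ? $item['link'] : '';
  // 2) Fall back to LINK only if every link in the downloaded feed is distinct.
  if ($link !== '' && count($feed_links) === count(array_unique($feed_links))) {
    // 3) TITLE + LINK (+ PUBDATE if it exists).
    $pubdate = isset($item['pubdate']) ? $item['pubdate'] : '';
    return $exists_title_link($title, $link, $pubdate);
  }
  // 4) TITLE + md5(DESCRIPTION).
  $description = isset($item['description']) ? $item['description'] : '';
  return $exists_title_desc($title, md5($description));
}
```

Note that GUID alone can only confirm a duplicate (step 1); a unique verdict always requires falling through to step 3 or 4, matching the "only trust GUID for duplicates" concept.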
Comment #11
morbus iff commented: Why the necessity for "all links are distinct within the downloaded feed"? That moves us dangerously into particular-case scenarios. If we move to a "GUID then MD5" scenario, we've already "solved" this particular bug, as we're no longer falling back on LINK; we don't need to add logic /specifically/ for this bug anymore. I'd rather we do only two steps of logic - GUID, then MD5. Adding a third ("if links are different") could get us into trouble. As for the MD5, I suspected it would be md5(TITLE + DESCRIPTION) - not TITLE + md5(DESCRIPTION)?
Comment #12
dopry commented: Is there a reason to drop pubdate as part of the fingerprint?

```php
// Missing elements fall back to empty strings, so they simply
// become part of the signature.
$guid        = isset($item['guid']) ? $item['guid'] : '';
$title       = isset($item['title']) ? $item['title'] : '';
$description = isset($item['description']) ? $item['description'] : '';
$pubdate     = isset($item['pubdate']) ? $item['pubdate'] : '';
$fingerprint = md5($guid . $title . $description . $pubdate);
```

Doing md5($title . $description . $pubdate) would probably be usable too. We could also concatenate GUID into the hashed string: sometimes one element may be empty, but that emptiness becomes part of the signature, so missing elements are less of an imposition.
Comment #14
R.Muilwijk commented: I'm having problems with this on one of my sites. One of the things that happens is that when the TITLE + DESCRIPTION is just a little bit different and the feed does not provide a GUID, Drupal does an insert but doesn't do a delete.
Comment #15
Anonymous (not verified) commented: Using Aggregator in a Drupal 6 install. Responding here because this appears to be the main bug report:
The manual page or the module itself should carry a warning that it can create duplicates under some circumstances. The Aggregator module should not be in core until this is completely resolved. Duplicates are a severe bug. It could work, it should work, but it doesn't; as a module it is therefore seriously flawed, and should either be labeled as such (dev/beta/experimental) or not included.
And maybe this is something that cannot be fixed from the module side until the feeds providing the data have a better mechanism that makes duplicates easier for a feed reader to detect.
As for now, I lost an entire day implementing something that should work but doesn't. A reminder to read bug reports before installing a module - but I did not expect this of a core module. Hence my vote for adding a warning, or removing it from core until such time as it is fixed.
Comment #16
damien tournoud commented: @design_dolphin: you are mistaken. The issue described here is that the duplication checks implemented by the aggregator module are too strong, and can wrongly consider two items to be duplicates of one another, even if the feed owner intended them to be different items.
The aggregator module doesn't create duplicates.
Comment #17
Anonymous (not verified) commented: Thank you for replying. Not sure if you are correct, though. Can anyone confirm?
In my case it did create duplicates. I was still working out how best to write things down, as there are different posts on the forum describing the duplicate issue, when I read your post. Should those forum posts be consolidated? Happy to help out.
It turned out that the URLs in the feed items were longer than the varchar column allowed, and they were truncated when stored. Which seems odd as a cause of duplicates - aren't they all truncated at the same point, you'd think? Increasing the number of characters for the link varchar field in the aggregator_item table solved the problem. :-) I did the same for title, author, and guid, and that solved the duplicate problem in this case. So there needs to be a better way to account for longer URLs, titles, authors, and guids - with pretty URLs and long article titles, some articles get really long URLs. Also, should the module warn if any of the fields gets truncated during import, and skip the item? That would give an admin a heads-up before the site fills with dups.
Hope it helps.
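The workaround described above amounts to widening the relevant columns; a sketch in MySQL syntax (the column widths here are illustrative, not the values the commenter used, and column definitions vary by Drupal version):

```sql
-- Widen the aggregator_item columns that were truncating long values.
-- (Illustrative lengths; pick widths that fit your feeds.)
ALTER TABLE aggregator_item
  MODIFY link VARCHAR(1024) NOT NULL DEFAULT '',
  MODIFY title VARCHAR(1024) NOT NULL DEFAULT '',
  MODIFY author VARCHAR(255) NOT NULL DEFAULT '',
  MODIFY guid VARCHAR(1024) NOT NULL DEFAULT '';
```

Changing core schema by hand is lost on reinstall, so a hook_schema_alter() or update hook in a small custom module would be the more maintainable route.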
Comment #20
mfb commented: FYI, just filed a related issue: #971812: aggregator should interpret Atom entry id as equivalent to RSS item guid. This bug causes new Atom entries to update existing aggregator items rather than inserting new items, in cases where entries have unique ids but share a common link.
Comment #21
luke.stewart commented: This should probably be closed, or moved to https://www.drupal.org/project/aggregator