Mis-behaving feed [#38669]

When I use this feed, the tags are ignored and dates are lost too.
http://www.wcc.vccs.edu/services/rssify/rssify.php?url=http://docshazam....

OK, I understand the feed generator does not include dates per post. But why are the titles dropped too? What can we do in general to better support RSSify feeds?

Comments

Comment #1

ahwayakchih commented 26 November 2005 at 22:30

Assigned: Unassigned » ahwayakchih

That's because of a known problem that I'm not sure how best to fix.

If you look at that feed, you'll see that some items have the same link. Currently aggregator2 uses the guid or link field to check whether an item already exists in the database. So if two items have the same link, it treats one of them as just an update of the first one :(

I could make it compare content as well, but that would slow down aggregation so much that I'm afraid to do it (not to mention that any edit to the content and/or title would create a situation in which the next aggregation creates a "duplicated post").
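
The collision described above can be sketched in a few lines. This is only an illustration, not aggregator2's actual code: the `item_key` and `aggregate` names, the dictionary-based item shape, and the example.com links are all assumptions made for the example.

```python
def item_key(item):
    """Return the identity used to look an item up: guid if present, else link."""
    return item.get("guid") or item["link"]


def aggregate(items):
    """Store items by key; a repeated key is treated as an update, not a new item."""
    stored = {}
    for item in items:
        # A second item with the same key overwrites the first one.
        stored[item_key(item)] = item
    return stored


feed = [
    {"title": "First post", "link": "http://example.com/page"},
    {"title": "Second post", "link": "http://example.com/page"},  # same link, no guid
    {"title": "Third post", "link": "http://example.com/other"},
]

stored = aggregate(feed)
print(len(stored))  # 2: the first two items collapsed into one
```

Three items come in, but only two survive, and "First post" is silently replaced by "Second post".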

Comment #2

dkruglyak commented 27 November 2005 at 01:03

Oh my. Unfortunately not everyone generates a clean feed...

My suggestion is to add full content comparison as a configurable advanced option. It would only be turned on for problematic feeds, so it would not slow down aggregation for everyone else.
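
One way such a per-feed option might look, as a rough sketch only: the `compare_content` flag and the `item_key` helper are hypothetical names, and a digest of the title and body stands in for a full content comparison.

```python
import hashlib


def item_key(item, compare_content=False):
    """Build the lookup key for an item.

    compare_content is a hypothetical per-feed setting: when on, a digest of
    the title and body is added to the key, so items that share a link stay
    distinct.
    """
    base = item.get("guid") or item["link"]
    if not compare_content:
        return base
    text = item.get("title", "") + "\n" + item.get("body", "")
    return base + "#" + hashlib.md5(text.encode("utf-8")).hexdigest()


a = {"title": "First post", "body": "one", "link": "http://example.com/page"}
b = {"title": "Second post", "body": "two", "link": "http://example.com/page"}

same_by_default = item_key(a) == item_key(b)                    # True: keys collide
distinct_with_option = item_key(a, True) != item_key(b, True)   # True: keys differ
```

The cost mentioned in comment #1 still applies: a key that depends on content changes whenever an item's title or body changes, so an edited item would show up as a duplicate.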

For example, on our site we now have ~40 feeds and only 2 are troublesome.

Comment #3

ahwayakchih commented 27 November 2005 at 11:58

OK, I've just committed a change to CVS (you'll have to wait for the CVS package to be updated, or download aggregator2.module and aggregator2.mysql directly from CVS).

Now the author of a feed can enable auto-generated GUIDs for items that do not have a GUID. Of course it's not needed if links are unique, so I wrote a hint that the option should be used ONLY if links are not unique :).
The GUID is generated when the node is created, so if someone modifies the node later, the GUID does not change - so the edit does not create a reason for duplicates :).

There is still a very small chance that some items will be seen as the same item. That's because I used md5 to generate GUIDs, and two different contents can give the same md5. Maybe I should use sha1 instead (it can still produce duplicates, but less often than md5, though it's slower)?
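
A minimal sketch of the idea, assuming the GUID is a hash of the item's title and body (the thread does not say exactly which fields are hashed). The `generate_guid` and `create_node` names and the `auto_guid` flag are hypothetical and do not come from aggregator2.

```python
import hashlib


def generate_guid(title, body, algorithm="md5"):
    """Hash the item's content into a GUID; algorithm may be "md5" or "sha1"."""
    digest = hashlib.new(algorithm)
    digest.update((title + "\n" + body).encode("utf-8"))
    return digest.hexdigest()


def create_node(item, auto_guid=True):
    """Create a stored node, generating a GUID only when the item has none."""
    node = dict(item)
    if auto_guid and not node.get("guid"):
        node["guid"] = generate_guid(node["title"], node["body"])
    return node


node = create_node({"title": "First post", "body": "one", "link": "http://example.com/page"})
original_guid = node["guid"]

# A later local edit does not touch the GUID stored at creation time.
node["title"] = "First post (edited)"
assert node["guid"] == original_guid

print(len(generate_guid("t", "b", "md5")))   # 32 hex characters (128-bit digest)
print(len(generate_guid("t", "b", "sha1")))  # 40 hex characters (160-bit digest)
```

Because the GUID is computed once and then stored, the next aggregation can hash the incoming item and compare it with the stored value; the locally edited title never enters the comparison.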

The only 100% sure way is to compare content... and that would mean we'd have to keep a duplicate of the original content, so that editing the title/body does not change the comparison result. And it would be slooooow. Hmm... we could store the copy zipped or compressed some other way and just compare the compressed data, but that would still be slower than comparing hashes.

Comment #4

ahwayakchih commented 3 February 2006 at 10:18

Status: Active » Closed (fixed)
