Leech's feature of checking for duplicates is necessary and useful. Reading the support issues, Leech seems to check against the source URL, and if a node linking to that URL exists Leech flags a duplicate and doesn't download the new content.
This creates a problem for me, however.
1. I'm trying to Leech a node where the linked URL remains the same while the contents of the feed change. It's a weather feed: the weather website has an individual page for each city (the page itself is dynamic, changing contents but not URL every few hours), while the RSS feed contains the latest data. It would be nice if Leech could download the latest weather data, but since the linked URL doesn't change, Leech doesn't download the feed.
2. When feeds are updated (with typos revised or data added, for example) it would be nice to overwrite the original/show the new feed. At the moment this isn't done because Leech flags it as a duplicate.
Apologies, I'm no programmer, but I am a keen tester and would be happy to stress test anything someone could do.
Perhaps: When adding a feed the user is presented with the options 'display duplicates as new items' (fixing point 1 above) or 'overwrite feed items with updates' (fixing point 2 above).
Checking against the URL to establish a duplicate is good, but not perfect. Perhaps check against an MD5 hash of the feed item's title and contents? If Leech could do this it would be sweet, and a very fully featured aggregator.
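As a rough illustration of the hash idea, here is a minimal Python sketch. It is not Leech's actual code; the function name `item_fingerprint` and the sample weather strings are made up for the example.

```python
import hashlib

def item_fingerprint(title, body):
    """Return an MD5 hex digest of a feed item's title and body.

    Hypothetical helper for illustration only, not part of Leech.
    """
    # A separator that should not occur in normal text keeps
    # ("ab", "c") distinct from ("a", "bc").
    data = title.strip() + "\x1f" + body.strip()
    return hashlib.md5(data.encode("utf-8")).hexdigest()

# Same title and URL, different contents: the fingerprints differ,
# so the item could be treated as changed rather than as a duplicate.
old = item_fingerprint("London weather", "Sunny, 18C")
new = item_fingerprint("London weather", "Rain, 14C")
print(old != new)
```

Storing the fingerprint alongside the URL would let a later fetch compare the two and notice that the body changed even though the link did not.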
Best,
Alex
Comments
Comment #1
alex_b commented
Hi Alex,
Thanks for the good description of your issue.
Would it help if there was a per-feed option to overwrite duplicates? When identified as duplicate, a feed item wouldn't be skipped but it would overwrite the previously saved duplicate.
In addition to that we could do some alternative duplicate checking - say by checking the title and the description (body). But I am sceptical: What if you want to capture changes in title and body? What if title and body are the same, but the date changes?
Comment #2
AlexPenfold commented
Hi Alex,
Thanks for the quick reply.
A per-feed option would be fine. In fact, I imagine it would reduce the load on Drupal if this were implemented per-feed.
Using Bloglines there is an option to do something like 'show updated feed items as new items' or 'overwrite existing feed items with updates'. Combining all three options ('updates are ignored', 'updates are shown as new nodes', 'updates overwrite the existing node') would, I imagine, be an ideal catch-all.
In terms of what an update is... even aside from the date/time, the RSS feed may contain many other elements, perhaps a Geo tag. Yes, it would be good to look at whether the feed item was updated as a whole: perhaps MD5 the whole feed item, then when something, anything, changes, the item would get flagged as updated. This is much more of a general case than my need at the moment (I'm currently just concerned with the body changing), but it could be a good general solution in terms of update analysis.
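The three update modes combined with a whole-item hash could look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the names `whole_item_hash` and `handle_item`, the mode strings, and the in-memory `store` dictionary standing in for saved nodes. It is not how Leech works today.

```python
import hashlib
import json

def whole_item_hash(item):
    """MD5 over every field of the item (hypothetical helper)."""
    # Sorted keys make the hash independent of field order.
    canonical = json.dumps(item, sort_keys=True, ensure_ascii=False)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

def handle_item(store, item, mode):
    """Apply a per-feed update mode: 'ignore', 'new' or 'overwrite'.

    store maps a link URL to the list of items saved for that URL.
    """
    saved = store.setdefault(item["link"], [])
    if not saved:
        saved.append(item)
        return "created"
    if whole_item_hash(saved[-1]) == whole_item_hash(item):
        return "unchanged"      # a true duplicate, nothing to do
    if mode == "ignore":
        return "skipped"        # updates are ignored
    if mode == "new":
        saved.append(item)      # updates are shown as new items
        return "created"
    if mode == "overwrite":
        saved[-1] = item        # updates overwrite the existing item
        return "overwritten"
    raise ValueError("unknown mode: " + mode)

store = {}
first = {"link": "http://example.com/london", "title": "London", "body": "Sunny"}
later = {"link": "http://example.com/london", "title": "London", "body": "Rain"}
print(handle_item(store, first, "overwrite"))
print(handle_item(store, first, "overwrite"))
print(handle_item(store, later, "overwrite"))
```

Because any changed field alters the hash, a new date or Geo tag alone would count as an update under this scheme, which may or may not be wanted for a given feed.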
The option to turn on/off this duplicate checking would be good for sites that are afraid of increased server-load.
Cheers,
Alex
Comment #3
alex_b commented
One important question: does the feed you want to update use guid? A guid is an identifier of feed items that we could use as an alternative. Check out the RSS feed and see whether same-URL items have different guids when they change.
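A minimal Python sketch of what guid-based identification could mean, assuming items are available as simple dictionaries with optional `guid` and required `link` keys. The function name `dedupe_key` is hypothetical and this is not Leech code.

```python
def dedupe_key(item):
    """Pick the value used to recognise an item on later fetches.

    Prefer the guid when the feed supplies one, otherwise fall back
    to the link URL (hypothetical helper for illustration).
    """
    guid = (item.get("guid") or "").strip()
    return guid if guid else item["link"]

# Two versions of the same city page: same link, different guid.
morning = {"guid": "london-0600", "link": "http://example.com/london"}
evening = {"guid": "london-1800", "link": "http://example.com/london"}
print(dedupe_key(morning) != dedupe_key(evening))
```

If the feed issues a fresh guid for each update, the two versions get different keys and neither is treated as a duplicate. If the guid stays fixed or is missing, this check alone would not detect the change.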