feedaggregator_node_feedapi() is currently using the item's link value as a method to determine if the item is unique. This has 3 current problems:

1. The aggregator_node table does not have an index on the link column. Eventually, this will become a performance issue.
2. The link field is optional and the Atom standard assigns no meaning to it. So feed items may not even have a link field, or they may duplicate the same value. This will cause the duplicate check to fail.
3. The Atom standard says the item ID is to be unique for an item for all time for a source, so probably really should be the value used to determine uniqueness in this funciton. See the standard here http://tools.ietf.org/html/rfc4287#section-4.2.6. SimplePie provides a method to get the ID value just as simply as the link value -- simply call $item->get_id() instead of or in addition to $item->get_link().

Comments

Chris Johnson’s picture

Information on required and optional elements in an Atom feed: http://www.atomenabled.org/developers/syndication/#requiredEntryElements

I hope to get time to create a patch to use ID instead of link as the unique identifier. It would be very easy, except that the re-use of the aggregator tables means that we don't have a database column to store the ID in (table aggregator_item). Wonder if that adversely affects the performance of the aggregator module and/or RSS feeds? Will have to research. :-)

budda’s picture

I never used an item ID due to inheriting the way the Drupal core aggregator worked.

Adding a new field to the aggregator_item table should not be an issue with an update() for the module. If there's no item ID available in the dbase for an item, i'll fall back to checking the link + title are different.

i'll be honest and say i've never aggregated an ATOM feed so don't know anything about their special features.

Chris Johnson’s picture

I'm rather learning about Atom (and probably RSS) as I go on this project, too. :-)

Doesn't RSS have a GUID element that uniquely identifies the feed item in a similar manner?

I did notice that the existing {aggregator_item} table had a link field, but no ID field for GUID or Item ID. It makes sense that it would be much easier to use what was there as you did. And if the link value is available, one would certainly want to preserve it, so it truly does require modifying the table to store the ID, as well.

From what I've learned about Atom, it should be easier to handle than RSS, because of the inconsistencies that apparently exist between different versions of RSS. Or at least, so it seems from my reading. As I mentioned up top, I'm not expert on either format.

This module is really well done, however. It fits my needs almost perfectly, since it provides an API. I'm going to need to extend the feedaggregator_node processor to handle some XML within my Atom feed's body and use that data to create event nodes. But thanks to your work, that appears to be straight-forward. Many thanks!

Chris Johnson’s picture

Version: master » 4.7.x-1.x-dev

Changed the version to 4.7, as that is what I must develop with. Thanks for branching it for 4.7, too!

budda’s picture

Assigned: Unassigned » budda

Looking at my first ATOM feed (http://www.tbray.org/ongoing/ongoing.atom) would it not be suitable to use the get_id() to obtain the full story link for both ATOM and RSS?

Seems to work on my test feeds anyway. The RSS feeds i've checked have a guid element in them, which the get_id() uses.
If there's no get_id() value, feedparser could then fall back to the standard get_link() i guess?

It seems the guid is simply the full url to the story anyway - so no new database fields need to be created after all?

alaa’s picture

noooooooo try http://baheyya.blogspot.com/atom.xml

ids are not links, ids are unique identifiers, some systems use links cause each post has a page of it's own, some don't cause they support giving posts an # anchor instead of a full page.

budda’s picture

It's a shame the parser library has no defined method for generating consistent GUID's for me. As such, i've filed a feature request but will look in to generating ATOM-like GUIDs for all feed items using the methods discussed in http://diveintomark.org/archives/2004/05/28/howto-atom-id#other

bobthecow’s picture

using get_id() isn't a viable option. id's aren't required to be a link. some blogs use the url as an id, but a ton do not. it doesn't work with blogger atom feeds (arguably one of the most common feeds on the intarweb). nor does it work with drupal feeds... and it's a sad day when we can't even use the aggregator with drupal feeds...

Chris Johnson’s picture

I think there is some confusion here -- maybe it's just me.

I think we (budda and I) want to use get_id() to get the unique ID for the feed entry. For Atom feeds, get_id() returns the ID element. For RSS feeds, it returns the GUID element. It just that for some RSS feeds, the GUID is a URL. According the standards, the ID is just an arbitrary string which is unique to that feed item, for both RSS and Atom.

I don't think there is any intention of using get_id() as the link (which is returned by get_link() actually).

alaa’s picture

eh budda said

It seems the guid is simply the full url to the story anyway - so no new database fields need to be created after all?

whichs sounds like replacing links with guids, we do need a new database field there is no avoiding it.

bobthecow’s picture

when i downloaded the module (a day and a half ago) it was using get_id() to get the links. by default. i had to change it back to get_link() so it would actually do something useful.

alaa’s picture

Priority: Normal » Critical

the change from get_link to get_id makes the module useless for many (most?) feeds, changing priority to critical, module as it stands is unusable, it's been 4 days since we reported this already.

budda’s picture

There wont be any updates for a while either. I don't do this as my day job or for money, and I'm going on holiday.

alaa’s picture

understood and you deserve your vacation, for us following the issue it's trivial to patch anyway, I'm just worried about people downloading the module in it's current state.

budda’s picture

That's what you get from bleeding edge dev snapshots I guess. No warranties.

budda’s picture

Status: Active » Postponed

Fixed/old file is back in CVS now.

As an update, I'm awaiting an update to the simplepie parser engine to provide a consistent GUID for feed items. Follow my request at http://simplepie.org/support/viewtopic.php?id=677

budda’s picture

Status: Postponed » Active

Having spoken with a SimplePie developer it seems updates are going to be a long time coming due to university commitments. So, at least for now, i'm going down the road of generating my own GUID for any feed - even if an ATOM feed is providing one.

The code doesn't rely on an items title or description/body - so they can still be edited at a later date. The method is based on the post over at diveintomark.org.

function feedmanager_generate_guid($feed_url, $item_date, $item_url = '/') {
  $feed_url_info = parse_url($feed_url);
  $item_url_info = parse_url($item_url);
  return "tag:{$feed_url_info['host']}," . $item_date . ':' . str_replace('#', '/', $item_url_info['path']);
}

Are there any potential problems with replacing a feed provided GUID with our own internal one?

I'll update CVS soon, once I have an upgrade in place for any existing feed items in your site dbase.

budda’s picture

Status: Active » Fixed

CVS now has a new duplicate item system in place, and an upgrade path to generate GUIDs for existing aggregator_item nodes.

You'll need to run update.php to sort your site out. backup your database first! The upgrade worked okay for two of my sites, but I make no promises.

Chris Johnson’s picture

Thanks, Budda!

Anonymous’s picture

Status: Fixed » Closed (fixed)