(This came from a conversation at OSCMS and the explosion of contrib aggregation modules.)

To support the multiple modules that pull RSS, Atom, and other web feeds into Drupal, the following functions need to be supported in a core Aggregator API. In Drupal 5.1, all Aggregator.module functions are internal to the module and cannot be accessed by contributed modules. This results in duplicate code and introduces potential security risks.

Below is an outline of the API functions needed by external aggregation modules. These recommendations are based on the Feed handling elements of the contributed MySite module.

feed_form($settings = array('title', 'url'))
A simple form element that returns the TITLE and URL elements for adding a new feed to a Drupal site. Optionally, this function might return all the settings that an administrator has access to (including refresh rate and category settings).
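
A minimal sketch of what this could return under Drupal 5's Form API (element names, defaults, and the optional refresh setting are illustrative, not an existing implementation):

function feed_form($settings = array('title', 'url')) {
  $form = array();
  if (in_array('title', $settings)) {
    $form['title'] = array(
      '#type' => 'textfield',
      '#title' => t('Title'),
      '#maxlength' => 255,
      '#required' => TRUE,
    );
  }
  if (in_array('url', $settings)) {
    $form['url'] = array(
      '#type' => 'textfield',
      '#title' => t('URL'),
      '#maxlength' => 255,
      '#required' => TRUE,
    );
  }
  // Optionally expose administrator-only settings such as the refresh rate.
  if (in_array('refresh', $settings)) {
    $form['refresh'] = array(
      '#type' => 'select',
      '#title' => t('Update interval'),
      '#options' => drupal_map_assoc(array(900, 1800, 3600, 7200, 21600, 43200, 86400), 'format_interval'),
    );
  }
  return $form;
}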

feed_verify($url, $title)
A submit hook that will check a submitted URL to see if it returns a valid XML feed. A necessary stage in processing new feeds before saving.
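
A rough sketch of such a check (the exact validation rules are up for debate; this only tests that the URL responds and that the response looks like XML):

function feed_verify($url, $title) {
  $result = drupal_http_request($url);
  if ($result->code != 200) {
    form_set_error('url', t('The feed at %url could not be retrieved.', array('%url' => $url)));
    return FALSE;
  }
  // A deliberately loose sanity check; a real implementation would hand
  // the data to the parser and check for errors there.
  if (strpos($result->data, '<?xml') === FALSE && strpos($result->data, '<rss') === FALSE && strpos($result->data, '<feed') === FALSE) {
    form_set_error('url', t('%url does not appear to return a valid XML feed.', array('%url' => $url)));
    return FALSE;
  }
  return TRUE;
}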

feed_parse($feed)
Parses the feed into the elements defined by the aggregator API (see feed_store($feed), below).
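
One possible (PHP5-only, RSS-2.0-only) parser sketch using SimpleXML; core would more likely keep its existing expat-based parser or delegate to a pluggable engine, as discussed in the comments below:

function feed_parse($feed) {
  $xml = @simplexml_load_string($feed->data);
  if ($xml === FALSE) {
    return FALSE;
  }
  $parsed = array(
    'title' => (string) $xml->channel->title,
    'link' => (string) $xml->channel->link,
    'description' => (string) $xml->channel->description,
    'items' => array(),
  );
  foreach ($xml->channel->item as $item) {
    $parsed['items'][] = array(
      'title' => (string) $item->title,
      'link' => (string) $item->link,
      'description' => (string) $item->description,
    );
  }
  return $parsed;
}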

feed_check($url, $title)
Check a TITLE or URL for duplication against existing feeds.
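
A one-query sketch, assuming the API keeps using the {aggregator_feed} table:

function feed_check($url, $title) {
  // TRUE if a feed with the same URL or title already exists.
  return (bool) db_result(db_query("SELECT COUNT(*) FROM {aggregator_feed} WHERE url = '%s' OR title = '%s'", $url, $title));
}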

feed_update($feed)
Updates the requested feed.
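
A sketch of the conditional-GET logic this implies, reusing the stored etag and modified values so unchanged feeds are not re-downloaded (core aggregator.module does something similar internally; the feed_parse() and feed_store() calls assume the signatures proposed here):

function feed_update($feed) {
  $headers = array();
  if (!empty($feed->etag)) {
    $headers['If-None-Match'] = $feed->etag;
  }
  if (!empty($feed->modified)) {
    $headers['If-Modified-Since'] = gmdate('D, d M Y H:i:s', $feed->modified) . ' GMT';
  }
  $result = drupal_http_request($feed->url, $headers);
  if ($result->code == 304) {
    // Feed has not changed since the last check.
    return FALSE;
  }
  if ($result->code != 200) {
    return FALSE;
  }
  if ($parsed = feed_parse($result)) {
    $feed->link = $parsed['link'];
    $feed->description = $parsed['description'];
    // Header key casing may vary by server.
    $feed->etag = isset($result->headers['ETag']) ? $result->headers['ETag'] : '';
    $feed->modified = time();
    feed_store($feed);
    return TRUE;
  }
  return FALSE;
}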

feed_title($title)
A parsing routine that allows the module to grab the base title element of the feed and use it instead of a user-supplied feed title.

feed_store($feed)
A standard hook for saving feed data into the Aggregator API tables. The standard data elements might be taken from the existing {aggregator_feed} table (a minimal storage sketch follows the list):
- fid (incremented by the API)
- title
- url
- refresh
- checked
- link (derived from feed_parse())
- description (derived)
- image (derived)
- etag (derived)
- modified (derived from feed_update())
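
A minimal storage sketch against that schema (assuming the derived fields have already been filled in by feed_parse() and feed_update(); on insert, fid is left to the table's auto-increment):

function feed_store($feed) {
  if (empty($feed->fid)) {
    db_query("INSERT INTO {aggregator_feed} (title, url, refresh, checked, link, description, image, etag, modified) VALUES ('%s', '%s', %d, %d, '%s', '%s', '%s', '%s', %d)",
      $feed->title, $feed->url, $feed->refresh, time(), $feed->link, $feed->description, $feed->image, $feed->etag, $feed->modified);
  }
  else {
    db_query("UPDATE {aggregator_feed} SET title = '%s', url = '%s', refresh = %d, checked = %d, link = '%s', description = '%s', image = '%s', etag = '%s', modified = %d WHERE fid = %d",
      $feed->title, $feed->url, $feed->refresh, time(), $feed->link, $feed->description, $feed->image, $feed->etag, $feed->modified, $feed->fid);
  }
}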

Comments

alex_b’s picture

Great idea.

Together with Aron Novak, I am maintaining the Leech module. It was created by Marcin Konicki as the successor to aggregator2, which is basically dead now. We are very interested in somehow unifying efforts. Leech has its own dynamics, though, and there are a lot of people using it. I would like to be able to provide a meaningful upgrade path to some future solution. A unified API could help a lot here.

Currently, most contrib modules bring along their own RSS parser -- how would this be handled by the API? Would it provide a default parser that a contrib module could replace with its own? I am especially thinking of those who are using SimplePie:
http://drupal.org/node/118534

What are the arguments for having the aggregation api in core as opposed to having it as contrib module?

Alex

former username: lx_barth

alex_b’s picture

Forgot to mention: Aron submitted an aggregation API as an SoC project. So there will probably be real hours this summer for moving on.

former username: lx_barth

agentrickard’s picture

I talked to m3avrck about this at OSCMS (he maintains SimpleFeed, which uses SimplePie).

I think core should simply rewrite the aggregator module to allow other modules to use its functions. That just isn't possible right now.

There should be a default parser, but if it were an API function, contrib modules would be free to ignore it.

I personally don't especially care whether the final API ends up in core or contrib (though I think it should be core); I just wrote the proposal because Robert asked me to, since I had to rip Aggregator functions into the MySite module.

So I merely made notes based on what my needs were.

Ted (m3avrck) should reach out to you in the next few days to get this work rolling.

SoC would be perfect. Both Robert and I are mentors.
--
http://ken.blufftontoday.com/
http://new.savannahnow.com/user/2
Search first, ask good questions later.

dkruglyak’s picture

I suggest we think carefully about what is already available and how we can reuse / leverage / migrate whatever already exists. Here is the list of all feed-parsing modules I found by searching modules page for 'feed' and 'scrape':

- aggregation
- aggregator2
- feedfield
- feedparser
- feed_node
- leech
- lobby
- mysite
- og_aggregator
- scraper
- simplefeed
- url_profile

The API should make it (relatively) easy to update these modules to support it, and we should probably try to involve everyone who has worked on them. Finally, we need to define how "ancillary" feed management tasks are to be plugged in:

- scraping
- feed / URL validation (especially when users enter them)
- data conversion into node (CCK)
- production management (e.g. when parse jobs run, etc)
- caching if appropriate

agentrickard’s picture

This list proves the need. Personally, I want all Aggregator functions removed from the MySite module and referenced through the new API.

So we also need, in order:

- feed_scrape()
- feed_verify() in the above already covers validation.
- feed_to_node()
- feed_queue() -- to be handled by improvements to the Cron system, I think.
- feed_cache() -- although we can use cache_get() and cache_set() for that, we may need wrapper functions and a cache_feed table (see the sketch below).
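
The wrappers could be as thin as this (assuming an install routine creates the cache_feed table with Drupal's standard cache schema; names are hypothetical):

function feed_cache_set($url, $data, $expire = CACHE_TEMPORARY) {
  cache_set('feed:' . md5($url), 'cache_feed', serialize($data), $expire);
}

function feed_cache_get($url) {
  $cache = cache_get('feed:' . md5($url), 'cache_feed');
  return $cache ? unserialize($cache->data) : FALSE;
}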

I suspect that feed_to_node() is the part that might remain in individual modules, since usage may vary widely.

--
http://ken.blufftontoday.com/
http://new.savannahnow.com/user/2
Search first, ask good questions later.

agentrickard’s picture

I'd leave scraping out of the Drupal core, for legal reasons.

So I don't think it gets into the API.
--
http://ken.blufftontoday.com/
http://new.savannahnow.com/user/2
Search first, ask good questions later.

dkruglyak’s picture

I do not think scraping is a legal issue in itself.

There are many public-domain sites that have no problem with it, especially in government (e.g. the congressional profiles you mentioned at OSCMS), while plenty of feed owners strongly object to any kind of feed import / re-publishing beyond individual reading.

We might call this step something other than scraping, but the bottom line is that there needs to be a step to submit a request to a form, retrieve HTML, and then parse it into a feed. Perhaps we can call it HTML processing?

Still I think even "scraping" does not sound as bad as "leech" !!!

agentrickard’s picture

My concern was simply that an over-zealous site goes after Drupal for "enabling" the "piracy" of content.

I will yield on this point, however. A function like feed_acquire($url) or feed_import($url) can stay in the spec.

--
http://ken.blufftontoday.com/
http://new.savannahnow.com/user/2
Search first, ask good questions later.

dkruglyak’s picture

Actually, I have already seen this type of issue raised about Aggregator2, which is not even a scraper. But of course nothing would stop over-zealous people from harassing whoever they want. I think we should just have clear disclaimers in the API about using it only for legal purposes, yada, yada, yada. Purveyors of encryption technology have been dealing with such issues for decades.

I do like "acquire / import" terminology. It accurately reflects the functionality and has no negative connotations.

Caleb G2’s picture

Interesting proposal. Based on what I heard at OSCMS, I'm especially interested to see whether any PHP5-only features might be integrated with this. It does make sense that anyone who is into feed parsing is likely to have access to PHP5 -- at least a larger proportion of them than in the 'overall' Drupal community.

Backwards compatibility is good, but Drupal should not miss the demands/possibilities of the growing RSS/mashup/web-services world.

=====
HigherVisibility

HedgeMage’s picture

Sepeck and I talked at length about the core aggregator module during OSCMS as well, and I've been contemplating what to attack so I can start submitting patches once I'm mobile again (new lappy coming in about a week and a half, yay!). Is anyone else jumping in right now? If so, I'd much rather coordinate/collaborate so we're not duplicating effort all over the place.

HedgeMage the Sleepy :)

agentrickard’s picture

I think (hope) m3avrck is the lead on this.

I am currently working on a different Aggregator patch, though: http://drupal.org/node/43245

Use this thread to coordinate efforts, though, as I think all concerned should know about it by now.

--
http://ken.blufftontoday.com/
http://new.savannahnow.com/user/2
Search first, ask good questions later.

m3avrck’s picture

Ken says:

1) Let the admin select Taxonomy or Category for term storage
2) Let the admin select the parsing engine -- assuming that there is a
default engine, and that contrib modules like simplefeed or leech have
different parsing rules (ie. turn items into nodes).

1. No, using category is a bunch of redundant code that could easily be accomplished by setting up a simple vocab with taxonomy. Cut the cruft ;-)
2. Yes, selecting a parsing engine would be great. You could stick to the one in Drupal or switch to one that is faster and works with more types of feeds like SimplePie.

Boris Mann’s picture

+1 for selecting parsing engine. Let's also agree (for now) that this aggregation is about aggregating items-as-nodes. Core aggregator still needs refactoring, and MAYBE we can re-use, e.g., the central lists of feeds.

So...let's not forget about core aggregator as part of this. Especially selecting the parsing engine might be something that we could get in core....

Also, remember there is this wiki page -- https://svn.bryght.com/dev/wiki/DrupalFeedParsing -- which should probably get moved to a wiki page in the aggregator group -- http://groups.drupal.org/rss-aggregation

--
The future is Bryght.

alex_b’s picture

What about creating an architecture that allows contrib modules to register functions that override/parallel the feed processing stages in core aggregator? The core aggregator could be a fully functional minimal aggregation solution, contrib modules could provide advanced features like an alternative parser or a solution to store output to CCK nodes of your choice.

Say, the core aggregator's stages of processing a feed are (I made this list off the top of my head):

1) Receive feed URL
2) Validate URL
3) Store URL

(on cron)
4) Check feed URL's refresh status
5) Retrieve feed
6) Parse feed
7) Process feed
8) Store feed-items

9) Display results

In your contrib module's init() hook, you could register the functions you would like to override/parallel:

function myagg_init() {
  // Pass callback names as strings so PHP does not treat them as
  // undefined constants.
  aggregator_register('myagg_parser', AGGREGATOR_PARSER);
  aggregator_register('myagg_store', AGGREGATOR_STORE);
  // ...
}

function myagg_store($feed_array) {
  // e.g. store feed items as nodes
}

On an admin page you would then choose which alternative functions to use. This is necessary to avoid conflicts between contrib modules. For some stages you will have to decide on a single function (e.g. only one parser makes sense), but for other stages you will be able to choose more than one function (e.g. it could make sense to produce aggregator's output and your contrib module's output at the same time).

This architecture would be highly pluggable and would provide a granular management of functions. It would enable contributing modules to very selectively override certain stages.
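
A sketch of how the registry side might work (function and variable names here are hypothetical), storing registrations in a Drupal variable and letting a dispatcher run everything registered for a stage -- which also gives us the chaining described below:

function aggregator_register($function, $stage) {
  $registry = variable_get('aggregator_registry', array());
  $registry[$stage][$function] = $function;
  variable_set('aggregator_registry', $registry);
}

function aggregator_invoke($stage, $data) {
  $registry = variable_get('aggregator_registry', array());
  if (empty($registry[$stage])) {
    // Nothing registered: fall back to core aggregator's own behavior.
    return $data;
  }
  // Feed each function's result into the next, pipeline-style.
  foreach ($registry[$stage] as $function) {
    if (function_exists($function)) {
      $data = $function($data);
    }
  }
  return $data;
}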

And then:
We should not only allow contrib modules to replace or parallel stages in aggregator, but also provide the possibility of lining up a series of functions for a stage.

E.g. in the "process" stage you might want to auto-tag your feed item text with Yahoo's term extraction service in one function, then in the next function retrieve blog ranks from Technorati, and then in yet another function search for Flickr pictures with the same tags as the post. We would be very close to Yahoo Pipes here -- in fact, it would be a matter of the interface :)

Boris Mann’s picture

Central feed store and administration, with everything else pluggable.

Something very minimal here seems like it might be a good direction for core. Can we rip out the entire aggregator infrastructure in Drupal 6? I don't think so, but if we can get the hooks in, then we can spend the 6.x cycle making contrib modules and learning.

--
The future is Bryght.

Aron Novak’s picture

Thanks Alex for summarizing this!
My ideas are very similar to the post above.
The main goal is to unite efforts in the aggregation development area.
The API will define exact data structures to use, so parts of code can collaborate more easily. Once the API is ready, there is no need to port a nice solution from one aggregator module to another; you just say which one you want to use. In the optimal case, module developers can make a mix of the best aggregator solutions :)
IMHO, core should contain the following:
- the API + a very minimal implementation of the essential parts

agentrickard’s picture

Not to pile on, but this is exactly what I was thinking. Aggregator would be the default engine and storage mechanism. Other modules could replace pieces of the functionality.

--
http://ken.blufftontoday.com/
http://new.savannahnow.com/user/2
Search first, ask good questions later.

the greenman’s picture

Have a look at the feedfield module; I developed an API for adding and removing feeds. It probably does not do all that you need, but it is being used by one or two other modules.

It may help you get started.

dkruglyak’s picture

I am looking at this module now and included it in my list above.

Overall, we need to ensure that the API allows painless migration from what people use now (most production sites, I guess, have aggregator2 and Leech), while re-using functionality from the latest and greatest things like feedfield.

alex_b’s picture

I agree. The new architecture should facilitate the implementation of certain functionalities of contrib modules.

We need a good survey of what's out there.

Of course, the migration paths themselves will have to be provided by contributors.

agentrickard’s picture

See http://groups.drupal.org/node/3418

And some contrib modules may just go away after migration, IMO.

--
http://ken.blufftontoday.com/
http://new.savannahnow.com/user/2
Search first, ask good questions later.

alex_b’s picture

What time horizon are we looking at here? What actual capacity do we have? Is there anybody at the starting gate, ready to jump out and work on aggregator in core in the next three months? Is there anybody who needs the core aggregator issue solved in the next three months?

As already mentioned above, there is Aron Novak's Summer of Code proposal for a new Drupal aggregator architecture. I know that Aron is pretty tied up with work for university right now, so he won't get into high gear until June.

This means that whatever he proposes is likely to be too late for Drupal 6 (correct me if I am mistaken).

Is working for a Drupal 7 solution a viable option?

Alex

agentrickard’s picture

Getting it into D6 core might not happen, but I think we need to solve the Aggregator problem as quickly as possible anyway.

To be clear, the "problem" I see is the absolute explosion of modules that duplicate work. We need to concentrate effort on one or two modules that hit the functionality we need. First up is the Aggregator-API work, which may take place in Aggregator.module or in a new module.

The best roadmap may be to get a standard contrib module developed -- one that replaces 'core' Aggregator and serves as the hub for other modules. Then we could bugfix / test that module against both D5 and D6, with an eye towards core inclusion in D7.

That said, it might be possible to get these changes into core if we (and Aron) can work fairly quickly. The Aggregator-specific changes don't look like that much work to me. Maybe 30-40 hours, tops. The real work is the coordination of all the contrib modules.

I'm not ready to dive in to code, but my role here is to document what needs to be done.

--
http://ken.blufftontoday.com/
http://new.savannahnow.com/user/2
Search first, ask good questions later.

dkruglyak’s picture

... instead of trying too hard to push it into the core.

I do not think there is a compelling case for putting brand new APIs into core until they are well implemented and tested. Look at the history of CCK. It was working pretty well before a small part of it made it into core. Views are not even close yet.

Let's first get it done purely in contribs and try to consolidate under one roof what is already available.

agentrickard’s picture

I think that makes the best sense.

Let's fix the problem with a module that replaces Aggregator. Then we'll let the community decide which module goes in core.

--
http://ken.blufftontoday.com/
http://new.savannahnow.com/user/2
Search first, ask good questions later.

csevb10’s picture

Over here we've definitely been discussing many of the same ideas for where to go with the Aggregator module.
The areas I'm most concerned with improving/augmenting are how we handle the initial feed and where we save the aggregator data.

When we get the initial feed, we want to store a copy of its current state. At a later time, admin users will select elements from that stored copy of the feed to save to the db (in our case, as nodes).

On the other end, during setup, there would be an interface element in aggregator to allow admin users to match fields from the XML feed with fields from a CCK node type. In this way we can store certain elements of the XML selectively in appropriate areas and create a full node, referencing back to the original site.

I'm going to begin working on this system shortly, but like everyone else, I don't want to duplicate effort and am more than willing to contribute my manpower to helping this thing come together well and quickly (so long as I can accomplish the business goals!).

I'm going to develop this in the right way (so much as making any modifications to core can be considered the "right way"), but I need to develop it fairly rapidly. That said, as much as I can, I would like and be very willing to gear my work toward appropriate modifications to the aggregator module and isolate major modifications to another module.

--
William O'Connor
Achieve Internet + Lead Developer

Boris Mann’s picture

As far as I can tell, SimpleFeed is closest to this "clean" model. No offence, especially to Development Seed, who have dived into Leech, but Ted built SimpleFeed from the ground up. There are things to learn from all the modules, but if you're looking for rapid + clean, look at Ted's stuff with SimpleFeed.

--
The future is Bryght.

csevb10’s picture

I haven't checked out SimpleFeed, but I will now. I don't want to develop on top of it, though, simply because I want as few layers as possible between what I create and core. I might be able to modify SimpleFeed if it is the cleanest of the bunch (or simply align it with the direction of the aggregator module), but I want to see how far, and in what direction, core moves so that I can develop as little extra logic on top of the core aggregator module as possible. Thanks for the lead on SimpleFeed. I'll check it out.

--
William O'Connor
Achieve Internet + Lead Developer

csevb10’s picture

Hi Boris,
I checked out SimpleFeed, and I think it is pretty well built. If we had administrative control over SimplePie caching and pluggability (read: an API) for actions, I think SimpleFeed could work really well as a starting point if we're looking to build a replacement for the aggregator module. Right now simplefeed.module explicitly calls (at the least) the cron hook of simplefeed_item.module... which I wouldn't consider exactly optimal. If we can clean this sort of thing up so that we can specify our own pluggable modules for this behavior, I think we're well on our way. Thanks for the tip.
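
For instance, the explicit cross-module call could become a hook invocation that any module may implement (hook name hypothetical):

// In simplefeed.module: announce new items instead of calling
// simplefeed_item's cron code directly.
function simplefeed_process_items($items) {
  module_invoke_all('simplefeed_items', $items);
}

// In simplefeed_item.module -- or any other module that wants the items:
function simplefeed_item_simplefeed_items($items) {
  // Save items as nodes, queue them, etc.
}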

What does everyone else think? Is anyone else set on any other module as a good/better starting point? I think the aggregation module has some nice extras for feeds (author, node publishing options, authentication), but nothing absolutely vital to making things work right now imho.

--
William O'Connor
Achieve Internet + Lead Developer

agentrickard’s picture

William-

Some thoughts.

1) "When we get the initial feed, we want to store a copy of the current state. At a later time admin users will select elements from that stored copy of the feed to save to the db (in our case as nodes)"

I think this might not be part of the standard API. The API would download and validate the feed, then fire a hook that lets your module store this data. I could be convinced otherwise, though.

Question: Wouldn't we need to check each fetched version of the feed against the original to see if the data model has changed in any way?

2) "On the other end, during setup, there would be an interface element to aggregator to allow admin users to match fields from the xml feed with fields from a cck node type. In this way we can store certain elements of the xml selectively in appropriate areas and create a full node, referencing back to the original site. "

This is definitely where module development happens on top of a standard API. Not everyone would want this feature, but it does seem like the best way to handle feed-to-node transformations.
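
A sketch of what such a transformation might look like in Drupal 5, given a mapping saved from an admin form (the function name, the $map structure, and the assumption that each CCK field stores a single text value are all hypothetical):

function feed_to_node($item, $type, $map) {
  $node = new stdClass();
  $node->type = $type;
  $node->uid = 0; // Or attribute to the feed's owner.
  $node->title = $item['title'];
  $node->body = $item['description'];
  foreach ($map as $element => $field) {
    if (isset($item[$element])) {
      // CCK text fields in Drupal 5 store values as $node->field_foo[0]['value'].
      $node->{$field} = array(array('value' => $item[$element]));
    }
  }
  node_save($node);
  return $node->nid;
}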

--
http://ken.blufftontoday.com/
http://new.savannahnow.com/user/2
Search first, ask good questions later.

csevb10’s picture

Ken,
I agree with you. It's not only conceivable but likely that the logic to store feeds temporarily (instead of ingesting them immediately) would reside in an outside module, as would the transformation of the feed into node content. My goal is to ascertain where and how these sorts of overrides would happen, and to do my best to make sure that the modifications to the aggregator module are robust enough to handle these cases, so that I don't have to rewrite any logic. If we can ensure enough of the right hooks to make my workflow possible, then I can build the appropriate logic in a separate module and help with any modifications to the aggregator module. In essence, if we're making the aggregator API work well enough to be the core for my work, I would be more than willing to help with any part of development (in fact, I can work full time on this sort of thing if we have a direction).

--
William O'Connor
Achieve Internet + Lead Developer

Boris Mann’s picture

I'm starting to see some shared requirements here.

The "dream" of dynamic content based on structured XML data fields mapped to CCK fields is something that I've been kicking around for a while. There is perhaps SOME overlap with the concepts of the Publish / Subscribe modules here: I've tossed around some concepts with John VanDyk like using Atom as the "transport" layer, and also to start looking at the Atom Publishing Protocol.

This aligns closely to the Data API stuff as well.

Anyway, perhaps start working on a bundle of requirements on a wiki page over at groups?

--
The future is Bryght.

csevb10’s picture

I'd be available most any time to discuss these sorts of things to help expedite the process. I'll be in the office for the next 7 hrs if you want to do it today.
I'll take a look at the wiki page at groups, try to provide input about the requirements we have, and continue looking at the other resources you've pointed out.
--
William O'Connor
Achieve Internet + Lead Developer

agentrickard’s picture

In working on the MySite project, I just wrote the following piece of code. It transfers aggregator feed images and writes them to a local directory.

My thought is this has security and performance benefits. But I may have missed something....

Feedback?

/**
 * Take a feed image and save it locally.
 * We do this for added security and speed.
 *
 * @param $fid
 *   The feed id, taken from {aggregator_feed}.
 * @param $image
 *   The image markup string taken from {aggregator_feed}.
 *
 * @return
 *   The filepath string pointing to the local copy of the file, or an
 *   empty string on failure.
 */
function mysite_type_feed_image($fid, $image = NULL) {
  $path = file_directory_path() . '/mysite';
  $temp = file_directory_temp();
  $newfile = '';
  $src = '';
  $ext = '';
  // Extract and sanitize the remote image URL from the stored markup.
  if (!empty($image) && preg_match('/src="(.+?)"/', $image, $matches)) {
    $src = check_url($matches[1]);
    $ext = explode('.', $src);
    $ext = '.' . array_pop($ext);
  }
  $filename = 'aggregator-' . $fid . $ext;
  $file = file_check_location($path . '/' . $filename, $path);
  if (file_exists($file)) {
    // We already have a local copy.
    $newfile = $path . '/' . $filename;
  }
  if (empty($newfile) && !empty($src)) {
    watchdog('MySite', t('Copying icon file for Feed ID %fid.', array('%fid' => $fid)), WATCHDOG_NOTICE);
    $result = drupal_http_request($src);
    $newfile = file_save_data($result->data, $temp . '/' . $filename, FILE_EXISTS_REPLACE);
    $info = image_get_info($newfile);
    if ($info && $info['extension']) {
      if (image_get_toolkit()) {
        image_scale($newfile, $newfile, 120, 60);
      }
      if (file_move($newfile, $path, FILE_EXISTS_REPLACE)) {
        $newfile = $path . '/' . $filename;
      }
      else {
        $newfile = '';
        watchdog('MySite', t('The transfer of a MySite feed icon failed -- could not copy file -- for Feed ID %fid.', array('%fid' => $fid)), WATCHDOG_ERROR);
      }
    }
    else {
      $newfile = '';
      watchdog('MySite', t('The transfer of a MySite feed icon failed -- bad file extension -- for Feed ID %fid.', array('%fid' => $fid)), WATCHDOG_ERROR);
    }
  }
  return $newfile;
}

--
http://ken.blufftontoday.com/
http://new.savannahnow.com/user/2
Search first, ask good questions later.

Boris Mann’s picture

...make a new post at http://groups.drupal.org/rss-aggregation with this as a separate item so it can be discussed AND we get notification, rather than getting lost in some random forum thread :P

--
The future is Bryght.

agentrickard’s picture