Here is a patch for adding two hooks (one for the parser, one for the processor) to the aggregator to make it extensible.
The patch currently does not apply to the HEAD because of the changes. But now there are so many pending patches for the aggregator, so i'll wait with updating this patch. The basic idea can be spotted at this status too.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

alex_b’s picture

Great. Can you describe better what this patch proposes to introduce?

Aron Novak’s picture

Pluggable architecture at this patch contains the following:
Parser / processor concept
Parser is for transforming downloaded raw data into a data structure.
Processor is for reacting to add / update a feed item.

This patch makes aggregator a parser and a processor too, this means that the built-in capabilities (old parser and feed items as non-nodes) are available immediately.

From developer viewpoint, in a parser, you need to implement aggregator_parse() hook and in a processor, you need to implement aggregator_process(). There are some operations of these. (saving/unique, etc)

The patch does not alter the internals of aggregator, only makes possible to write a drop-in replacement for parsing or saving the feed item. (for example as node or whatever you can imagine, send out an email, etc)

Also, for configuring these, aggregator has a form, where you can overview your parsers and processors and aggregator offers easy-to-use settings handling for parsers and processors.
Basically parser and processor are such a module, where the important hooks are implemented.

Parsing:
$channel = module_invoke($parser, 'aggregator_parse', 'parse', $feed); - one step process

Processing:

$processors = variable_get('aggregator_processor', array());
foreach ($processors as $processor) {
  $channel = module_invoke($processor, 'aggregator_process', 'unique', $channel);
}
foreach ($processors as $processor) {
  $channel = module_invoke($processor, 'aggregator_process', 'save', $channel);
}

Two step, two operation.

jpetso’s picture

Short, very incomplete look over the patch while attempting to confirm my understanding of how it generally works:

  • aggregator.admin.inc: \ No newline at end of file
  • aggregator_aggregator_process() has a $feed parameter, but it's documented as @param $channel.
  • aggregator_aggregator_process() has a $feed parameter, but it's documented as @param $data. Both this one and the above are probably wrong, and underdocumented in any case - I couldn't write hook implementations based on the sparse apidox. The parameter and result values should either be documented in a very detailed way or not at all, with docs going into api.drupal.org's hook.module only. (They will need to end up there anyways. Don't know how core handles apidox duplication issues.)
  • Why is it that aggregator_is_enabled() returns a value for either of two different settings? Wouldn't a separation into aggregator_is_parser_enabled() and aggregator_is_processor_enabled() make more sense? (If it was a private function I wouldn't care, but for a public function this lack of distinction might bite back later.
  • In aggregator_cron():
    +  if (drupal_function_exists('_aggregator_light_delete_expired')) {
    +    _aggregator_light_delete_expired();
    +  }
    

    Shouldn't this be made unconditionally and only be introduced once that function exists? Or maybe there's some other reason to do it this way...

Other than that, very cool. I think a later patch might split parsers in two to separate the transport (currently in aggregator_aggregator_parse()) and the actual parser (currently in aggregator_parse_feed()) so that Aggregation's functionality of supporting feeds requiring HTTP or FTP authorization can be done without implementing separate parsers. (If there are feed parsers other than aggregator_parse_feed(), you'd need to write a hook_aggregator_parser() wrapper for each of those, which results in parser hooks = (transport methods * actual parsers) if you want the complete set of options.) But that's just an idea for a follow-up, and doesn't need to be handled in this patch.

Aron Novak’s picture

I rerolled the patch against the latest changes.
jpetso: Thank you for the review, i applied most of your suggestions.
aggregator_is_enabled() - yeah, it's questionable if it's a good choice or not, i have not altered it yet.
aggregator_cron(), as i saw, drupal core does this in a similar way really often.
Tests: 100% successful
Not tested, maybe broken (that's why needs more work): cron part and image support

alex_b’s picture

FileSize
27.85 KB

Rerolled after patch #323751 caused conflict.

alex_b’s picture

FileSize
28.13 KB

Fixed:

- typo wrong: variable_get('aggregator_processor', array); correct: variable_get('aggregator_processors', array);
- aggregator_is_enabled() ambiguous if a module implements parser and processor. Replaced with aggregator_get_enabled_parser() and aggregator_get_enabled_processors().

catch’s picture

We've gone to quite a lot of effort to remove $op from core - hook_user and hook_nodeapi, others like block and taxonomy to follow. It'd be a shame to add a new hoook with an $op in it at this stage.

alex_b’s picture

FileSize
28.92 KB

More fixes and adjustments:

- only call processors if $channel['items'] is an array
- explicitly configure aggregator parsers and processors in _install() (before admin/content/aggregator/settings did not look correct before first time saving the form)
- remove break; after return; in switch statements
- move 'unique' above 'save' in aggregator_aggregator_process() - corresponds with the order of call now

Changes don't break any tests.

Deduplication / 'unique' stage:

For those who're not familiar with the intentions behind a separate 'unique' stage in the aggregation process: The idea is to find out which items on the currently processed feed are unique before the items are being sent to the save stage. This separation of deduplication and saving allows other processors to modify the deduplication behavior of processors simply by implementing a hook_aggregator_process('unique') operation.

I had a better look at how unique handling works. I think it needs to be made more straightforward.

I see that on the 'unique' stage aggregator module flags whether an item exists already or not. If it exists, it doesn't just say 'FALSE': it stores the item's ID - which is smart, because it saves us going to the database again on the 'save' stage - but it looks a little funky as we're using a single slot as boolean or integer:

$feed['items'][$k]['unique']['aggregator'] = (!isset($entry->iid) ? TRUE : $entry->iid);

Further down the line it leads to a test whether the id of the current item is TRUE in order to find out whether we're looking at a unique item or not.

function aggregator_save_item() {
  if ($edit['title'] && $edit['iid'] === TRUE) {

At this stage I would rather expect that iid is not set or 0 if not available.

I need to think better about this, but at first glance, it would seem to me more intuitive to call the flag not 'unique' but 'unique_id' which would a) make storing the internal aggregator id more appropriate and b) make unique_id = 0 the common sense flag for 'this item doesn't exist yet'.

Thoughts?

Aron Novak’s picture

FileSize
29.12 KB

Patch is rerolled.
Changes:
- image support was broken
- cron part was broken

Aron Novak’s picture

I propose the following for unique handling:
Make a hook_aggregator_process_id() (exploded function instead of one huge aggregator_process with $op -s)
Add $feed['items'][$k]['id']['aggregator'] to the array if the item is already exists
At save stage, isset($feed['items'][$k]['id']['aggregator']) will determine if it's a dupe or not.

#7: makes sense. It will be refactored.

alex_b’s picture

Make a hook_aggregator_process_id() (exploded function instead of one huge aggregator_process with $op -s)
Add $feed['items'][$k]['id']['aggregator'] to the array if the item is already exists
At save stage, isset($feed['items'][$k]['id']['aggregator']) will determine if it's a dupe or not.

This sounds good to me.

alex_b’s picture

alex_b’s picture

Assigned: Aron Novak » alex_b

I'm working on a revised version after #7 and #10. Expect results soon.

alex_b’s picture

FileSize
46.54 KB

Rework of this patch.

#7: $op is gone.
#10: let's leave out deduplication overriding for now.

CHANGES

To summarize, these are the changes to Drupal Core in this patch:

* new hooks for parsers and processors of feeds:
** hook_aggregator_parse()
** hook_aggregator_parse_info()
** hook_aggregator_process()
** hook_aggregator_process_info()
** No hook for unique ID's at the moment.
* Refactor aggregator so that parsing and storage happen on parse and process hook callbacks.
* Break out sections that are only needed for aggregating
** Parsing functions are broken out into aggregator.parser.inc
** Processing functions are broken out into aggregator.processor.inc
** aggregator.module down to ~670 lines from ~1000

Consequential changes:

* $feed is object for passing through parsing and processing stages.
* aggregator_admin_settings() contains form for configuration of parsers and processors

Details:

* fixed a broken query db_merge('aggregator_category')->key(array('cid' => $id))
* fixed a notice in theme_aggregator_block_item()

Patch passes all tests. Reviewed by Aron Novak so far.

Documentation

* Expect documentation on new hooks soon.

alex_b’s picture

Pluggable architecture patch:

* Fixed an obsolete hook call in aggregator_form_alter()
* Moved aggregator_form_alter() to aggregator.processor.inc -it only contains processor related functionality.

Documentation;

Patch to drupal docs developer/hooks/core.php to document

** hook_aggregator_parse()
** hook_aggregator_parse_info()
** hook_aggregator_process()
** hook_aggregator_process_info()

catch’s picture

Status: Needs work » Needs review

Looks like this is ready for review to me.

alex_b’s picture

#16: Ready to review, yes.

Anonymous’s picture

Status: Needs review » Needs work

The last submitted patch failed testing.

Aron Novak’s picture

Status: Needs work » Needs review
FileSize
46.08 KB

Rerolled.
Notes for manual testing, these are only important if you patch an already enabled/installed aggregator module:
- after install the module, empty the cache (via the devel module), necessary because of new files
- go to admin/content/aggregator/settings and save the defaults

The documentation patch will be supplied separately in the proper queue, it will be linked from here also.

Aron Novak’s picture

Relevant documentation patch: #334752

chx’s picture

Status: Needs review » Needs work

This is a great patch. I like the API , it's a dream coming true! A few nitpicks:

  1. Right at the beginning you add two spaces to an empty line.
  2. A parser does not download anything. A parser, well, parses. Let's find another name.
  3. Let's use the form_id specific form_alter.
  4. You set the variable defaults in hook_install, and then use variable_get('aggregator_parser', ''). Let's instead use the sane default everywhere and do not set the default in install.
  5. The empty cases in help look strange.
  6. You do not need the one line wrapper around variable_get('aggregator_parser', but see the default above.
  7. Looking at aggregator_get_enabled_processors, what this code does, is array_filter but why would a getter do that, why you dont array_filter on set? Then this function can be removed as the previous one to be replaced by a simple variable_get.
  8. The submitted documentation mention multiple times "an array of data items on $feed->items" but does not detail what keys should/could be in there and what are the values.
  9. Still, the documentation, is aggregator_parse_info mandatory or optional? Same for the processor.
chx’s picture

Following up on a chat with Aron, one more which would solve 2. above -- let's separate out the fetcher as well, there are four (stream wrapper, drupal_http_request, curl, pecl http extension) fetchers at least.

alex_b’s picture

#22 - this would mean an additional hook hook_aggregator_fetch() ? Sounds good. I have been against fetch hooks in the past. Recent experience taught me otherwise: we wrote a CVS, an RDF and a comment parser for FeedAPI and in each occasion we had to reuse fetch functions from other parsers.

#21 - I agree with your points. Some clarifications:

5) These paths need to be removed. There is no admin/content/aggregator/aggregator/parser and no admin/content/aggregator/aggregator/processor page.

8) The format on what's on 'item' is entirely up to the parsers and processors in place. Neither does the patch change anything on that level.

* hook_aggregator_parse() and hook_aggregator_process() need to mention that the format of $feed->items is up to the implementers.
* aggregator_aggregator_parse() needs to be documented on the format of the $form->items it produces and
* aggregator_aggregator_process() documentation should contain a reference to aggregator_aggregator_parse().

9) Optional, needs to be better clarified.

Aron Novak’s picture

Status: Needs work » Needs review
FileSize
46.67 KB

#21,
1)-7): thanks for the notices and suggestions, all of them are fixed somehow.
There is a new file: aggregator.fetcher.inc.
8) is replied by alex_b at #23
9) The documentation patch should be updated also and make this question clear.

alex_b’s picture

#21: 8 and 9 are addressed on #3 of #334752: Documenting new hooks with aggregator pluggable architecture patch

I will review the latest patch on #24 on the weekend.

catch’s picture

Couple of minor things:

+  if (is_object($edit)) {
+    $edit = (array)$edit;
+  }
+

Why would $edit ever be an object? Can we fix that further up the chain?

aggregator_form_aggregator_admin_form_alter - aggregator module doesn't need to form_alter it's own forms - or if we do for some reason, then there ought to be a comment to say why.

This is just nitpicky stuff, will try to do a more substantive review over the next week or so.

chx’s picture

The format on what's on 'item' is entirely up to the parsers and processors in place. Neither does the patch change anything on that level.

that's some pluggable architecture, then :) well, this means that parser and processor must come from the same place? Or how are you going to provide interoperability...?

alex_b’s picture

#28 - (per IRC conversation) items format needs to be documented.

alex_b’s picture

Status: Needs review » Needs work

FAPI arrays that are affected by this patch should be reformatted so that each property sits in its own line.

alex_b’s picture

FileSize
29.23 KB

#29: fixed

#26:

* I still need to take a better look into this $edit variable.
* The form alter hook is documented - aggregator form alters its own form for keeping the processor part separate from the aggregator API part and for creating a reference for implementing a processor.

Further changes:

* Simplify admin/content/aggregator/settings - hide fetcher/parser/processor configuration if there is only one of each available and therefore nothing to configure.
* Move aggregator_form_aggregator_admin_form_alter() to aggregator.processor.inc - it only implements processing functionality.
* Clean up language in fetcher.

alex_b’s picture

#26: array $edit argument in aggregator_form_feed() is now a stdClass $feed.

Explanation:

Aggregator uses $feed inconsistently. A grep for '$feed[' in the original aggregator files finds 44 occurences, for '$feed->' finds 84.

For the API we decided to go with $feed as an object rather than an associative array. This entails some changes throughout the module. In the case that catch pointed out in #26, the $edit array changes to an object because the menu loader loads feeds as an object rather than an array. After this patch we should roll one that makes the usage of $feed consistent as object in the entire module.

This patch fails in OPML test in "Categories are correct." I post now cause the turkey's waiting and #30 is broken :)

Aron Novak’s picture

I fixed the bug of the previous patch. Tests are fine now.

alex_b’s picture

Status: Needs work » Needs review

Ready for review.

moshe weitzman’s picture

Status: Needs review » Reviewed & tested by the community

I took a look at the UI and I am quite happy with it. Code looks sane too. I think we are ready for final discussions. RTBC.

alex_b’s picture

Status: Reviewed & tested by the community » Needs work

I did more testing.

I wrote a feed-items-as-node processor (available here http://cvs.drupal.org/viewvc.py/drupal/contributions/sandbox/alex_b/aggr... ) and found two problems:

a) There are sometimes empty items in the $feed->items array on the processing stage. These items are caught in core aggregator in the aggregator_save_item() routine. Maybe we should filter them out before they actually go into the processing stage. I need to take a look at that.
b) Install a new processor (aggregator_node), activate it, aggregate some items, then go to admin/content/aggregator and click on 'remove items' - nothing happens. This makes the need for a remove hook obvious (or we form alter the remove link into the feed form :( ) - hook_aggregator_remove()

Adding this hook, we will stand at 7 new hooks for aggregator:

hook_aggregator_fetch()
hook_aggregator_fetch_info()
hook_aggregator_parse()
hook_aggregator_parse_info()
hook_aggregator_process()
hook_aggregator_process_info()
hook_aggregator_remove()
catch’s picture

Extra hook won't do any harm. Seems like the cases you mentioned would be a good candidate for a small module in simpletest implementing the new hooks.

edit: also marked #334752: Documenting new hooks with aggregator pluggable architecture patch as duplicate since hooks can be documented directly in core now with module.api.php

alex_b’s picture

Patch as in #32 containing api.php

alex_b’s picture

Status: Needs work » Needs review
FileSize
59.46 KB

* Added hook_aggregator_remove()
* Fixed 3 occurences of array $feed in comments where $feed was an object

There is a detail that might be worth discussing:

Expiry of feed items is not dealt on an API level, it's entirely in the processor's responsibility to implement a mechanism that deletes items after a while. I prefer this because it spares us an additional hook that does something very similar that hook_cron() does. However, the one drawback is that implementing modules can't expire feed items when clicking on 'update items' on admin/content/aggregator . They can only expire on hook_cron().

I would keep it this way, but I'd like to see other people's opinions.

Aron Novak’s picture

Status: Needs review » Reviewed & tested by the community

I tested the functionality and it works well.
Also I like the idea of let the processor to decide how to expire and so on.

Status: Reviewed & tested by the community » Needs work

The last submitted patch failed testing.

Aron Novak’s picture

Status: Needs work » Needs review
FileSize
59.81 KB

Rerolled the patch again. Small fix added: sometimes drupal_http_request gives 200/202/307 code, but data and headers are not set, a php notice appeared.

Dries’s picture

Status: Needs review » Fixed

Committed to CVS HEAD. Thanks!

alex_b’s picture

Yeay. Thanks everybody. This is really motivating for outstanding work on Aggregator. I posted a list of it here: http://groups.drupal.org/improvements-core

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for two weeks with no activity.

akahn’s picture