Support from Acquia helps fund testing for Drupal Acquia logo

Comments

alex_b’s picture

subscribe

Aron Novak’s picture

FYI: This is under construction, by 18th of July, there will be a drop-in replacement for the core aggregator.
You can follow the development here: http://groups.drupal.org/node/11378

Aron Novak’s picture

Status: Active » Needs work
FileSize
112.86 KB

Here is the first version of the patch which is against the existing aggregator, to be applied to HEAD. Ideally, patch before you install the site.
If you'd like to apply to an existing site, be aware that existing aggregator-related data will be lost; follow this procedure: enable syndication_parser module then using the devel module reinstall the aggregator module.

This will break aggregator 's tests, because of the little bit different internals of the two modules.
What is missing from this patch:

  • Tests
  • Overridable deduping
  • Add enclosure and namespace support to the parser
  • Add meaningful and helpful information and error messages

A node processor is also available from here: http://www.drupal.org/project/new_aggregator (in fact the patch is from this project), you can copy the aggregator_node module if you'd like to see this stuff with nodes.

Morbus Iff’s picture

Subscribing.

Morbus Iff’s picture

Hrm. I noticed that you stopped GUID support (for "newness" checks) entirely - this is wrong.

See http://drupal.org/node/51644.

Besides the weather example over there, there's another use case for producing feeds that set the [link] not to the permalink of the RSS item, but to the *link being discussed* in the RSS description (or title). Off the top of my head, delicious feeds do this, though I've seen others. If Person A talks about Link A, and then Person B talks about Link A (and both set the [link] to Link A), I'll never see Person B's discourse (as a new item).

mfer’s picture

Subscribing.

Aron Novak’s picture

I'm aware of that this is a problematic area.
Fortunately replacing the deduping logic is not a hard thing at all. I'll follow the discussion at http://drupal.org/node/51644 about the topic.
If there will be consensus about this, I'm ready to adopt the new way of deduping.

Morbus Iff’s picture

Incidentally, you've currently got that particular function as "private" - but I don't think it should be. Knowing that an item is "new" seems like a very useful thing in a programmatic environment - it should be accessible to modules (ie., don't make it a "I trust you won't use this" private function).

cburschka’s picture

What does this rewrite imply for #16282: OPML Import for Aggregator?

Does the new aggregator support OPML import already, or should the OPML patch be updated to fit into the rewritten aggregator?

alex_b’s picture

I'd say #16282 will need to be ported once this patch is tight. The parts of the patch that I see effected are where feeds are being created

foreach ($feeds as $feed) {
...

and the tests - haven't looked at them though.

Aron Novak’s picture

FileSize
140.81 KB

I updated the patch. This reflects all the changes at new_aggregator alpha2 and alpha3 releases.
The most important changes:

Added feed image support.
Full test suite for Aggregator.
Tests for syndication parser, testing the feed parsing capabilities.
Minor UI improvements.
Separated all Aggregator Light processor related code to aggregator.ligt.inc.
Handling node types where the name contains underscore.
unified Aggregator and Aggregator Light modules
all the features of the existing Aggregator is available (categorizing is replaced by taxonomy)
A content-type is automatically created when you enable Aggregator module.
There is a node processor.
The parser supports namespaces and enclosures.
Meaningful error and informational messages.
More per-content-type setting options (because of killed hardcoded values)

Morbus Iff’s picture

Aron - one of the things that has always plagued aggregator was the use of an aggregator-specific input filter. I'm pretty sure there's an issue about this somewhere. Was that replicated in this module, or does it create a default input filter for use with aggregated items? For example, it's currently annoyingly obtuse to disallow images in comments, but to allow images in feed items (you have to modify settings inside settings.php - there's no UI). What's the new approach's take on that?

catch’s picture

Here's the aggregator/filter module issue: #14104: Aggregator should support filter.module

Aron Novak’s picture

By default the aggregator uses the site-wide default input filter. In the top of that you can alter the input filter per-content-type.

mustafau’s picture

This patch is huge in size and is implementing multiple new functionality. In order to get real attention you need to separate this patch into multiple issues. For example; the concepts: "Feeds as nodes", "Feed processors" and "Feed parsers" are worth three separate issues.

I believe there is a consensus on the concept of "Feeds as nodes". Drupal core is also moving towards this direction (e.g. polls in core are nodes). If you submit a separate "Feeds as nodes" patch it will get committed in no time.

The patch at #16282: OPML Import for Aggregator is going to be committed. In order to re-implement OPML import for this new aggregator you need to look at the patch at #273050: Use Batch API for OPML import

Morbus Iff’s picture

@mustafau: unfortunately, you're woefully misinformed. There is no consensus to get feeds as nodes in core, and it won't be committed in "no time". Polls have always been nodes in core, all the way back to at least 4.6. In fact, the exact opposite of what you state has been leaned toward: feeds as non-nodes definitely, feeds as nodes optional. Aron's patch offers the best of both worlds (ie., both at the same time, or one or the other). It's about the only way that feeds as nodes would ever get into core.

mustafau’s picture

@Morbus Iff: When you refer to feeds as nodes you actually mean feed items as nodes right? Because Aron's patch only offers feeds as nodes. The optional part is feed items as nodes.

Morbus Iff’s picture

Yes, you're correct. My bad.

alex_b’s picture

#15 - it's hard to see how this patch can be split up. It's in part a complete rewrite of aggregator's core in order to enable functionalities like interchangeable parsers, optional feed item processors, etc. Let me know if you've got any concrete ideas how to break it apart.

I think what's missing atm is a good itemized list of what this patch changes - not only a link to the outline on g. d. o.

Thanks for heads up on #273050

#17 - correct, feeds are nodes, feed items are pretty much the same as in existing aggregator

alex_b’s picture

I did a review of the latest patch. Some UI issues:

* Per content type settings need a clean up. I'd suggest to convert the check box into an enable/disable button, add field sets for better structure and only show advanced settings for activated parsers and processors. I rolled a patch against the project here: http://drupal.org/node/290788
* aggregator/terms help text: this text doesn't work, it only makes sense for site administrators.
* For the sake of more continuity: can we keep as many files as possible on their place in the directory? E. g. aggregator.module in aggregator/ etc.
* Help text on admin/build/node-type/*/aggregator would be great.
* I'm still not sold on using "Terms" instead of "Categories" on /aggregator

... more to come.

alex_b’s picture

Title: GSoC 08 - A new aggregator for Drupal 7 » Aggregator rework: extensible API, SimpleXML parser, use taxonomy for categorization

This is an attempt to give this patch a more specific name in response to #15.

mustafau’s picture

#19: I know it is hard to separate the functionality of this patch into several patches however it is not easy to get accepted into Drupal core either. Here are some ideas on how to make it happen:

- Submit a minimalistic feeds as nodes patch (from #15). Since core aggregator treats every feed equally we only need a built-in "Feed" content type. I believe in the future we can skip per content type settings and just implement per feed settings.

- Separate current parser from aggregator and introduce SimpleXML parser as a replacement. First of all we need to resolve this issue: #72915: Move feed parser to includes/feed.inc. Once this one is fixed then we can submit a patch introducing SimpleXML parser as a replacement.

- This patch stores hash of the feed data. I submitted a patch for aggregator implementing this functionality (#291064: Store hash of feed data).

Aron Novak’s picture

FileSize
135.28 KB

Here is a newer version of the patch. It's only a quick update to allow folks to apply the patch to the latest CVS code. The OPML import part of the aggregator is currently not included in the patch (it is planned of course)

mustafau’s picture

Can we get rid of CVS Id's like following please?

-// $Id: aggregator-item.tpl.php,v 1.2 2008/05/15 21:27:32 dries Exp $
+// $Id: aggregator-item.tpl.php,v 1.3 2008/07/20 18:06:12 aronnovak Exp $
Morbus Iff’s picture

Get rid of them? They're required in .tpl.php files. They'll be rewritten to the right value when/if the patch gets committed.

mustafau’s picture

Those lines should not be visible in patches.

recidive’s picture

The functions aggregator_parse_w3cdtf() and _syndication_parser_parse_w3cdtf() apparently do the same thing. Can we get rid of one of them?

// $I:$ on syndication_parser.test should be // $Id$ and you don't need to manually change the version number on the others.

There are evidences of some rejected lines on aggregator.admin.inc.

Also there are places that needs cleanup to be more inline with Drupal coding standards.

mustafau’s picture

Line #185 of the patch at #23 is broken.

Dries’s picture

Mmm, it is not the best idea to implement 20 changes in one patch. History has learned us that it is better to submit smaller patches. I'd really recommend that we break this up in smaller chunks and that we consider one chunk at the time.

alex_b’s picture

#15, #29: I will start thinking about how we can break this patch into smaller chunks... Aron's offline for the next week, he'll be able to look at this again once he's back.

I just wrote about the patch here: http://www.developmentseed.org/blog/2008/aug/06/improved-aggregator-drup...

I hope it's helpful for understanding the proposed changes.

Jose Reyero’s picture

The overall idea looks very good to me. The implementation looks overly complex though.

I also agree this patch is way too big and attempts to get too many things at once. If we want, as it seems to be, to have a generic aggregator engine that supports multiple parsers and multiple storage forms, only that, as simple as possible would be great as a first patch (and complex enough).

I.e. the aggregator_cron() logic is way too complex, and I don't think that queries will really scale for a big number of feeds. As opossed to the old one which was a simple query -> retrieve feed -> refresh feed...
Also arrays of feeds are handled all the time through the api, which makes it look much more complex that it should be.

So, in summary I think: Focus, Simplify. Get us a simple patch that supports different parsers and node storage as a first step. Then we'll see about cron load balancing, feeds to be skpped, and more complex features later.

Btw, I've tried the patch with latest HEAD and it has some issues, couldn't really get it to work:

- Main aggregator page
user warning: Unknown column 'i.nid' in 'on clause' query: SELECT COUNT(*) FROM aggregator_item i INNER JOIN aggregator_feed f ON i.nid = f.nid INNER JOIN node n ON f.nid = n.nid in /home/workspace/drupal/modules/aggregator/aggregator.pages.inc on line 62.
- Content type settings
Parse error: syntax error, unexpected T_SL, expecting ')' in /home/workspace/drupal/modules/aggregator/aggregator.admin.inc

Aron Novak’s picture

FileSize
27.99 KB

This week I started to explode that huge patch.
Here is the first one which contains:
- time-based cron
- pluggable architecture (you can replace the processor and the parser if you'd like)

That's all. I even don't have to touch the aggregator tests because the following are unaltered: UI, data structure, db structure.
What's missing: more abstraction of items to make possible to write really different processors.

Aron Novak’s picture

Second round of exploded patches.
aggregator_new_parser.patch only replaces a new parser to the aggregator module without the pluggable architecture.
aggregator_pluggable_and_new_parser.patch replaces a new parser and also contains the pluggable architecture changes.

catch’s picture

Aron - it'd probably be easier to review each set of patches in a single issue and link from here. Also are any of these ready to review, or is it all still needs work?

Aron Novak’s picture

The real question is here: where to go?
The above patches (the exploded ones, not the huge patch) pass the automated tests and do not break important functionality. Some minor features are not perfect at the moment.
At the moment I upgrade the previous huge patch to apply without problem to the CVS HEAD and fix the problems what the reviewers mentioned above.

I try to explain below the possibilities.

At the beginning of the Summer of Code, it seemed that mostly we agree that feeds should be nodes (http://groups.drupal.org/node/11413#comment-37089 read from this comment)
If feeds = nodes, the following changes are neccessary / obvious:
- pluggable architecture (that's the point where the aggregator fails)
- complete rework of the admin interface - lot more easier
- huge changes in categorization (taxonomy is the sane choice for categorization for nodes)
- content-type based configuration possibility (you may have several types of feeds)

I assume we need pluggable architecture for aggregator, this is obvious for me. (but correct me if the situation is different :) )
Questions:
- Do we need the power of nodeapi and node CRUD for feeds too?
- Do we want to replace the parser with a SimpleXML based parser? Now it's a good time to do, because Drupal 7 requires PHP5.
- Do we want feed URL autodetection?

The most of the changes that I did at the huge patch can be made totally independently from each other. The only change that affects tons of things is feed = node.

mustafau’s picture

A built-in "Feed" content type patch for Aggregator: #293318: Convert Aggregator feeds into entities

Aron Novak’s picture

It's awesome that you made already an important patch! I really hope that this patch will be accepted.

Aron Novak’s picture

FileSize
158.82 KB

I also re-rolled the big patch. Maybe it's also good to see the whole concept in one big patch.
Unfortunately I haven't added OPML import, this patch simply removes it. This is not good at all.
But let's focus on the above small patches and especially the questions above.

keith.smith’s picture

I have some issues with numerous strings in this patch, with this one as an example:

+      $output = '<p>' . t('You can assign terms to a feed at node edit form. Aggregator uses a selected vocabulary for each content-type. It\'s possible to alter the selected content-type at the content-type edit page Feed Aggregator tab. The feed items gets the same taxonomy information like the feed nodes. For example if the feed node is assigned with term foo, all the feed items belong to that feed will appear under foo term page.') . '</p>';

Aron, I really applaud your efforts here, and want to help. ; 'tis just that a number of the strings need tweaking or follow existing Drupal guidelines, some of which, are admittedly unwritten.

I'm setting this to CNW because I intend to either reroll this or go through this string by string with suggestions for potential changes.

Aron Novak’s picture

keith.smith: I totally agree. There are some reasons that this patch needs work,
First, maybe there will be huge changes in the DB layer.
Second, opml import is missing.
Third, the strings, yes. I really appreciate if someone helps me to make these strings well-written.

alex_b’s picture

We're starting to break out issues now, SimpleXML parser for aggregator is the first: http://drupal.org/node/302245.

There should follow a separate issue for extensible architecture. Once this separate issue is up, let's close this thread.

Aron Novak’s picture

Status: Needs work » Closed (fixed)

The separate patches are out.

jpetso’s picture

This is the set of (pending) patches originating from this one, as far as I could find those:
#72915: Move feed parser to includes/feed.inc
#303930: Pluggable architecture for aggregator.module
#302936: DB TNG: Convert aggregator.module queries

If I forgot any, feel free to add them here.