This is everything I'm working on for the aggregator so far.

Item text is saved unchanged, though titles, categories, etc. are still filtered.
Item text is uses the default text filter
Item parsing recognizes enclosures, though it isn't saved at the moment
Comments in the parser callbacks to explain what's going on

This is what I need for my own purposes. I'm willing to take requests here. I remember the idea of limiting item length, like genrating a teaser for display, was mentioned. Enclosures are captured (url, file size and MIME type)...I don't know what folks want to do with that.

I also need to discuss the drupal_xml_parser_create() function. It uses the default behavior of xml_parser_create(), which is to upper case all the tag names as it parses them out (it's called case-folding in the docs). This is the wrong behavior for XML processing in general, which needs to preserve the case. I could just set the in aggregator.module but since we're doing wrappers the right thing would be to set it in drupal_xml_parser_create(). I don't know what other modules that would affect.

Comments

dries’s picture

It's not clear what you changed, or why? What problems does this fix?

Prometheus6’s picture

When you first announced the 4.7 feature set, you mentioned we were looking forward to handling more stuff with the aggregator, and you specifically mentioned enclosures. Since I need correct processing of both Atom and RSS feeds I decided to fix it and lay the groundwork for the other stuff.

First of all, it's a pretty major improvement in parsing Atom feeds. Blogger's feeds would send item content (the equivalent of $node->body) out as application/xhtml+xml...in other words, the content would be further parsed as though it were its own xml feed. The way the parser was written, any tags in such content were pretty much extracted and appended to the end of the item content and single tags like <img> and <br>, because they had no value like <p> tags, for instance, simply vanished.

Because of that, and to implement display using the default text filter, I rewrote the parser. It uses several fewer global variables and is more of a state machine than a stack processor.

The end result of the parsing is an array of data about the feed, including an element that is an array of feed items. Every element of the xml data is retained. If you have enclosures the are stored in an array as an element in the item array. I haven't changed the aggregator_item table to use the data for enclosures because I wanted to ask how best to handle that...a separate table for enclosures is a possibility too.

In fact, if we're opening it up, there's a question of what to do with data that comes in unknown tags. Podcasts, for instance, have $lt;itunes:something> tags. Most won't need them, some will. So what would be best? Store the raw text, an array like $user->data or define a processing hook?

I know I'm solving problems that don't exist yet (except the bad Atom parsing), but I can work on this stuff now.

dries’s picture

Thanks for the explanation. That seems to make sense so I'll commit this patch as soon it has been (i) reviewed and (ii) tested.

As for the unknown data, you could store it and pass it onto the theme system layer. If the aggregator module's theme layer is well-written, it should be able to visualize the data. At the same time, I don't mind extending the existing SQL tables so we can support some common fields. Adding decent support for enclosures, categories and dc:creator makes sense given Drupal emits those too. At the very least, we should be able to interpret our own RSS feeds, not?

dries’s picture

Do you know how hard it would be to detect broken content? The Drupal Planet is often broken due to unclosed tags and what not. It would be nice if we could ignore news items that are obvsiously broken.

Prometheus6’s picture

Depends on how well HTML Corrector works.

http://drupal.org/project/htmlcorrector

If you give me a couple of broken feeds, I'll test it. And I think it would be straightforward to let administrators pick a filter set independant of the node configuration.

Prometheus6’s picture

StatusFileSize
new20.69 KB

Three patches, because I'm not sure how to generate a combined patch for the module and sql files.

First the aggregator.module patch.
The feed record (aggregator_feed table) now saves copyright information from the feeds.

The item records (aggregator_item table) save category tags as a serialized array of arrays.

  • categories[index]['DOMAIN']: optional link - in Drupal it's a URL to a taxinomy page
  • categories[index]['CATEGORYNAME']: The name of the category on the remote system

The item records (aggregator_item table) save enclosures as a serialized array. The keys are

  • enclosure['URL']: the link to download the enclosure
  • enclosure['LENGTH']: enclosure file size in bytes
  • enclosure['TYPE']: enclosure MIME type

There's also an 'enclosure_retrieved' field for the sake of anyone that wants to write code to retrieve enclosures automatically.

The theme_aggregator_page_item() function has been updated to display categories and enclosures.

Prometheus6’s picture

StatusFileSize
new1.3 KB

Update to database.mysql

Prometheus6’s picture

Update to database.pgsql

This is a best guess because I don't run PostgreSQL on any of my systems.

Prometheus6’s picture

StatusFileSize
new1.41 KB

the patch would be nice...

dries’s picture

  1. We need an upgrade path (a patch against update.php to make the required changes to the database scheme).
  2. Coding style: write $attributes rather than $attrbs.
  3. Coding style: instead of $val, write $value.
  4. I don't understand what enclosure_retrieved is for. Is that like the raw data?
  5. Coding style: placement of { and }.
  6. Coding style: write && and || rather than AND and OR.
  7. I believe this patch has some security issues; raw data is emitted straight to the database/browser. Example: ... "<a href=\"{$remote_category['DOMAIN']}\" alt=\"{$remote_category['CATEGORYNAME']}\"> {$remote_category['CATEGORYNAME']}</a>";
  8. Some hardcoded strings can't be translated.
drumm’s picture

Status: Needs review » Needs work

The item records (aggregator_item table) save category tags as a serialized array of arrays.

Serialized data should rarely be stored in the database because it makes queries harder to write. For example, in this case how would I get all the items with a certain tag?

Prometheus6’s picture

StatusFileSize
new20.52 KB

Dries:
First, let me acknowledge to code style issues.

I don't understand what enclosure_retrieved is for. Is that like the raw data?

That's just a flag that can be set if you download the enclosed file so you can serve it from your own web site. There's no functionality to do that in the aggregator; I just figure someone will want to do it, probably during a cron run.

I believe this patch has some security issues; raw data is emitted straight to the database/browser.

Only the content (Atom) and/or description (RSS) tags escape being scrubbed as the first step.

     foreach ($item as $key => $value) {
       $value = decode_entities(trim($value));
       $value = preg_replace('/\Wstyle\s*=[^>]+{question mark}>/i', '>', $value);
       $value = preg_replace('/\Won[a-z]+\s*=[^>]+{question mark}>/i', '>', $value);
      if (stristr($key, 'CONTENT')) {
        $item['DESCRIPTION'] = $value;
      }
      else if ($key != 'DESCRIPTION') {
        $value = strip_tags($value, variable_get('aggregator_allowed_html_tags', 'list of allowed tags'));
      }
       $item[$key] = $value;
     }

Later, the content/description is passed through the default filter, so it should all be safe. I guess that would depend more on the filter.

How about if I create an aggregator_filter setting, so it can use any filter at all? That would be better for sites that use, say, the bbcode filter.

Some hardcoded strings can't be translated.

Like "in content"? Those are just state indicators. They are never output, and in fact the array key and value is unset at the end of processing.

Drumm:

For example, in this case how would I get all the items with a certain tag?

Okay, I was unclear. These are the categories in the feed, not the ones local to your site. I didn't anticipate that anyone would want to query them. The aggregator_category_item table is unchanged.

Prometheus6’s picture

Status: Needs work » Needs review
StatusFileSize
new4.12 KB

A patch to the database.??sql files and update.inc

drumm’s picture

I think that there might be some interesting uses for the categories in the RSS feed. I thikn we should make them queryable now so that there isn't a problem when someone thinks up somehting cool to do with them.

Prometheus6’s picture

That calls for another table...Drupal feeds are good examples for this because we export taxinomy urls in the domain attribute:

  • iid: item id
  • external_category_name
  • external_category_domain

One row per external category.

My personal vote on this is -1, though. Once you've retrieved the items to display, each of them need another query to retrieve its categories...and you have to get them for the item to be fully themeable. An extra query per external category is needed on saving items as well. There are clever things that can be done with the external categories but they all require additional programming anyway.

In fact, I actually think aggregator_category_item is pretty useless. All that ever happens is the feed categories are assigned to each item, and I really think a join could get rid of the table.

Interesting thought though; given the possible values that will be in there, a full text filter on the field holding the serialized array would be just as searchable as a seperate table. I don't know if it's a good idea in general...

Prometheus6’s picture

Status: Needs review » Closed (fixed)

No longer applies to HEAD.

boris mann’s picture

Status: Closed (fixed) » Needs work

Thanks for your work so far. Are there bits and pieces that can be used?

@ Drumm, Dries, which bits should be worked on/are likely to be committed?

chx’s picture

An issue rarely needs to be set to closed by hand. It can be fixed or by design but closed? Why you closed?

Prometheus6’s picture

chx:

Why you closed?

The last patch, which introduced format_xss(), eliminated the configurable allowed tags. Meanwhile, while waiting for HEAD to settle down a bit, I've got the use of filters (which also conflicts with using format_xss()) more extensive parsing to use enclosures correctly and manual promotion of RSS items to story nodes...in 4.6.x.

It would be a big patch to bring all that back in line for 4.7, and we don't like big patches. So I'm restarting, and making sure my patches address one issue at a time.

Boris:

Are there bits and pieces that can be used?

See above...I'll definitely post a patch for handling enclosures this week.

chx’s picture

The last patch, which introduced format_xss(), eliminated the configurable allowed tags. -- should be very easy (I guess one line) to put it back, as filter_xss has a second argument which defines the allowable tags.

Prometheus6’s picture

Okay, aggregator_settings() wasn't touched. I can patch that by tomorrow morning (stuff to do first).

Prometheus6’s picture

StatusFileSize
new20.72 KB

Not so simple, but...

The schema updates still apply.
http://drupal.org/files/issues/AGGREGATOR_4_7_DATABASE.txt

sethcohn’s picture

Bump. Status?

Prometheus6’s picture

I haven't looked at it since December...4.7 was too elusive a target for me to keep up with at that point.

I can look at this again this weekend. Looking back over the thread, searchable categories (the ones in the feed, not the ones in the aggregator) is still sticky. Maybe I can concatenate them into a single string?

catch’s picture

Version: x.y.z » 7.x-dev
Status: Needs work » Active

Changing to real version. Marking to active since the patch is waaaay to old to apply now.

mustafau’s picture

I am marking this duplicate in light of #16485: Enclosure support for Aggregator.

catch’s picture

Status: Active » Closed (duplicate)