This is everything I'm working on for the aggregator so far.
Item text is saved unchanged, though titles, categories, etc. are still filtered.
Item text is uses the default text filter
Item parsing recognizes enclosures, though it isn't saved at the moment
Comments in the parser callbacks to explain what's going on
This is what I need for my own purposes. I'm willing to take requests here. I remember the idea of limiting item length, like genrating a teaser for display, was mentioned. Enclosures are captured (url, file size and MIME type)...I don't know what folks want to do with that.
I also need to discuss the drupal_xml_parser_create() function. It uses the default behavior of xml_parser_create(), which is to upper case all the tag names as it parses them out (it's called case-folding in the docs). This is the wrong behavior for XML processing in general, which needs to preserve the case. I could just set the in aggregator.module but since we're doing wrappers the right thing would be to set it in drupal_xml_parser_create(). I don't know what other modules that would affect.
| Comment | File | Size | Author |
|---|---|---|---|
| #22 | AGGREGATOR_4_7_RC1.txt | 20.72 KB | Prometheus6 |
| #13 | AGGREGATOR_4_7_DATABASE.txt | 4.12 KB | Prometheus6 |
| #12 | AGGREGATOR_CATEGORIES_ENCLOSURES_0.txt | 20.52 KB | Prometheus6 |
| #9 | AGGREGATOR_4_7_PGSQL.txt | 1.41 KB | Prometheus6 |
| #7 | AGGREGATOR_4_7_MYSQL.txt | 1.3 KB | Prometheus6 |
Comments
Comment #1
dries commentedIt's not clear what you changed, or why? What problems does this fix?
Comment #2
Prometheus6 commentedWhen you first announced the 4.7 feature set, you mentioned we were looking forward to handling more stuff with the aggregator, and you specifically mentioned enclosures. Since I need correct processing of both Atom and RSS feeds I decided to fix it and lay the groundwork for the other stuff.
First of all, it's a pretty major improvement in parsing Atom feeds. Blogger's feeds would send item content (the equivalent of $node->body) out as application/xhtml+xml...in other words, the content would be further parsed as though it were its own xml feed. The way the parser was written, any tags in such content were pretty much extracted and appended to the end of the item content and single tags like <img> and <br>, because they had no value like <p> tags, for instance, simply vanished.
Because of that, and to implement display using the default text filter, I rewrote the parser. It uses several fewer global variables and is more of a state machine than a stack processor.
The end result of the parsing is an array of data about the feed, including an element that is an array of feed items. Every element of the xml data is retained. If you have enclosures the are stored in an array as an element in the item array. I haven't changed the aggregator_item table to use the data for enclosures because I wanted to ask how best to handle that...a separate table for enclosures is a possibility too.
In fact, if we're opening it up, there's a question of what to do with data that comes in unknown tags. Podcasts, for instance, have $lt;itunes:something> tags. Most won't need them, some will. So what would be best? Store the raw text, an array like $user->data or define a processing hook?
I know I'm solving problems that don't exist yet (except the bad Atom parsing), but I can work on this stuff now.
Comment #3
dries commentedThanks for the explanation. That seems to make sense so I'll commit this patch as soon it has been (i) reviewed and (ii) tested.
As for the unknown data, you could store it and pass it onto the theme system layer. If the aggregator module's theme layer is well-written, it should be able to visualize the data. At the same time, I don't mind extending the existing SQL tables so we can support some common fields. Adding decent support for enclosures, categories and dc:creator makes sense given Drupal emits those too. At the very least, we should be able to interpret our own RSS feeds, not?
Comment #4
dries commentedDo you know how hard it would be to detect broken content? The Drupal Planet is often broken due to unclosed tags and what not. It would be nice if we could ignore news items that are obvsiously broken.
Comment #5
Prometheus6 commentedDepends on how well HTML Corrector works.
http://drupal.org/project/htmlcorrector
If you give me a couple of broken feeds, I'll test it. And I think it would be straightforward to let administrators pick a filter set independant of the node configuration.
Comment #6
Prometheus6 commentedThree patches, because I'm not sure how to generate a combined patch for the module and sql files.
First the aggregator.module patch.
The feed record (aggregator_feed table) now saves copyright information from the feeds.
The item records (aggregator_item table) save category tags as a serialized array of arrays.
The item records (aggregator_item table) save enclosures as a serialized array. The keys are
There's also an 'enclosure_retrieved' field for the sake of anyone that wants to write code to retrieve enclosures automatically.
The theme_aggregator_page_item() function has been updated to display categories and enclosures.
Comment #7
Prometheus6 commentedUpdate to database.mysql
Comment #8
Prometheus6 commentedUpdate to database.pgsql
This is a best guess because I don't run PostgreSQL on any of my systems.
Comment #9
Prometheus6 commentedthe patch would be nice...
Comment #10
dries commented$attributesrather than$attrbs.$val, write$value.enclosure_retrievedis for. Is that like the raw data?&&and||rather thanANDandOR.... "<a href=\"{$remote_category['DOMAIN']}\" alt=\"{$remote_category['CATEGORYNAME']}\"> {$remote_category['CATEGORYNAME']}</a>";Comment #11
drummSerialized data should rarely be stored in the database because it makes queries harder to write. For example, in this case how would I get all the items with a certain tag?
Comment #12
Prometheus6 commentedDries:
First, let me acknowledge to code style issues.
That's just a flag that can be set if you download the enclosed file so you can serve it from your own web site. There's no functionality to do that in the aggregator; I just figure someone will want to do it, probably during a cron run.
Only the content (Atom) and/or description (RSS) tags escape being scrubbed as the first step.
Later, the content/description is passed through the default filter, so it should all be safe. I guess that would depend more on the filter.
How about if I create an aggregator_filter setting, so it can use any filter at all? That would be better for sites that use, say, the bbcode filter.
Like "in content"? Those are just state indicators. They are never output, and in fact the array key and value is unset at the end of processing.
Drumm:
Okay, I was unclear. These are the categories in the feed, not the ones local to your site. I didn't anticipate that anyone would want to query them. The aggregator_category_item table is unchanged.
Comment #13
Prometheus6 commentedA patch to the database.??sql files and update.inc
Comment #14
drummI think that there might be some interesting uses for the categories in the RSS feed. I thikn we should make them queryable now so that there isn't a problem when someone thinks up somehting cool to do with them.
Comment #15
Prometheus6 commentedThat calls for another table...Drupal feeds are good examples for this because we export taxinomy urls in the domain attribute:
One row per external category.
My personal vote on this is -1, though. Once you've retrieved the items to display, each of them need another query to retrieve its categories...and you have to get them for the item to be fully themeable. An extra query per external category is needed on saving items as well. There are clever things that can be done with the external categories but they all require additional programming anyway.
In fact, I actually think aggregator_category_item is pretty useless. All that ever happens is the feed categories are assigned to each item, and I really think a join could get rid of the table.
Interesting thought though; given the possible values that will be in there, a full text filter on the field holding the serialized array would be just as searchable as a seperate table. I don't know if it's a good idea in general...
Comment #16
Prometheus6 commentedNo longer applies to HEAD.
Comment #17
boris mann commentedThanks for your work so far. Are there bits and pieces that can be used?
@ Drumm, Dries, which bits should be worked on/are likely to be committed?
Comment #18
chx commentedAn issue rarely needs to be set to closed by hand. It can be fixed or by design but closed? Why you closed?
Comment #19
Prometheus6 commentedchx:
The last patch, which introduced format_xss(), eliminated the configurable allowed tags. Meanwhile, while waiting for HEAD to settle down a bit, I've got the use of filters (which also conflicts with using format_xss()) more extensive parsing to use enclosures correctly and manual promotion of RSS items to story nodes...in 4.6.x.
It would be a big patch to bring all that back in line for 4.7, and we don't like big patches. So I'm restarting, and making sure my patches address one issue at a time.
Boris:
See above...I'll definitely post a patch for handling enclosures this week.
Comment #20
chx commentedThe last patch, which introduced format_xss(), eliminated the configurable allowed tags. -- should be very easy (I guess one line) to put it back, as filter_xss has a second argument which defines the allowable tags.
Comment #21
Prometheus6 commentedOkay, aggregator_settings() wasn't touched. I can patch that by tomorrow morning (stuff to do first).
Comment #22
Prometheus6 commentedNot so simple, but...
The schema updates still apply.
http://drupal.org/files/issues/AGGREGATOR_4_7_DATABASE.txt
Comment #23
sethcohn commentedBump. Status?
Comment #24
Prometheus6 commentedI haven't looked at it since December...4.7 was too elusive a target for me to keep up with at that point.
I can look at this again this weekend. Looking back over the thread, searchable categories (the ones in the feed, not the ones in the aggregator) is still sticky. Maybe I can concatenate them into a single string?
Comment #25
catchChanging to real version. Marking to active since the patch is waaaay to old to apply now.
Comment #26
mustafau commentedI am marking this duplicate in light of #16485: Enclosure support for Aggregator.
Comment #27
catch