Rewritten parser, plus [#32817]

This is everything I'm working on for the aggregator so far.

Item text is saved unchanged, though titles, categories, etc. are still filtered.
Item text is uses the default text filter
Item parsing recognizes enclosures, though it isn't saved at the moment
Comments in the parser callbacks to explain what's going on

This is what I need for my own purposes. I'm willing to take requests here. I remember the idea of limiting item length, like genrating a teaser for display, was mentioned. Enclosures are captured (url, file size and MIME type)...I don't know what folks want to do with that.

I also need to discuss the drupal_xml_parser_create() function. It uses the default behavior of xml_parser_create(), which is to upper case all the tag names as it parses them out (it's called case-folding in the docs). This is the wrong behavior for XML processing in general, which needs to preserve the case. I could just set the in aggregator.module but since we're doing wrappers the right thing would be to set it in drupal_xml_parser_create(). I don't know what other modules that would affect.

Comment	File	Size	Author
#22	AGGREGATOR_4_7_RC1.txt	20.72 KB	Prometheus6
#13	AGGREGATOR_4_7_DATABASE.txt	4.12 KB	Prometheus6
#12	AGGREGATOR_CATEGORIES_ENCLOSURES_0.txt	20.52 KB	Prometheus6
#9	AGGREGATOR_4_7_PGSQL.txt	1.41 KB	Prometheus6
#7	AGGREGATOR_4_7_MYSQL.txt	1.3 KB	Prometheus6
#6	AGGREGATOR_CATEGORIES_ENCLOSURES.txt	20.69 KB	Prometheus6
	AGGREGATOR_NEW_PARSER.txt	12.77 KB	Prometheus6

Comments

Comment #1

dries commented 3 October 2005 at 09:26

It's not clear what you changed, or why? What problems does this fix?

Comment #2

Prometheus6 commented 3 October 2005 at 13:15

When you first announced the 4.7 feature set, you mentioned we were looking forward to handling more stuff with the aggregator, and you specifically mentioned enclosures. Since I need correct processing of both Atom and RSS feeds I decided to fix it and lay the groundwork for the other stuff.

First of all, it's a pretty major improvement in parsing Atom feeds. Blogger's feeds would send item content (the equivalent of $node->body) out as application/xhtml+xml...in other words, the content would be further parsed as though it were its own xml feed. The way the parser was written, any tags in such content were pretty much extracted and appended to the end of the item content and single tags like <img> and <br>, because they had no value like <p> tags, for instance, simply vanished.

Because of that, and to implement display using the default text filter, I rewrote the parser. It uses several fewer global variables and is more of a state machine than a stack processor.

The end result of the parsing is an array of data about the feed, including an element that is an array of feed items. Every element of the xml data is retained. If you have enclosures the are stored in an array as an element in the item array. I haven't changed the aggregator_item table to use the data for enclosures because I wanted to ask how best to handle that...a separate table for enclosures is a possibility too.

In fact, if we're opening it up, there's a question of what to do with data that comes in unknown tags. Podcasts, for instance, have $lt;itunes:something> tags. Most won't need them, some will. So what would be best? Store the raw text, an array like $user->data or define a processing hook?

I know I'm solving problems that don't exist yet (except the bad Atom parsing), but I can work on this stuff now.

Comment #3

dries commented 3 October 2005 at 14:26

Thanks for the explanation. That seems to make sense so I'll commit this patch as soon it has been (i) reviewed and (ii) tested.

As for the unknown data, you could store it and pass it onto the theme system layer. If the aggregator module's theme layer is well-written, it should be able to visualize the data. At the same time, I don't mind extending the existing SQL tables so we can support some common fields. Adding decent support for enclosures, categories and dc:creator makes sense given Drupal emits those too. At the very least, we should be able to interpret our own RSS feeds, not?

Comment #4

dries commented 4 October 2005 at 14:53

Do you know how hard it would be to detect broken content? The Drupal Planet is often broken due to unclosed tags and what not. It would be nice if we could ignore news items that are obvsiously broken.

Comment #5

Prometheus6 commented 4 October 2005 at 17:36

Depends on how well HTML Corrector works.

http://drupal.org/project/htmlcorrector

If you give me a couple of broken feeds, I'll test it. And I think it would be straightforward to let administrators pick a filter set independant of the node configuration.

Comment #6

Prometheus6 commented 16 October 2005 at 15:13

Status	File	Size
new	AGGREGATOR_CATEGORIES_ENCLOSURES.txt	20.69 KB

Three patches, because I'm not sure how to generate a combined patch for the module and sql files.

First the aggregator.module patch.
The feed record (aggregator_feed table) now saves copyright information from the feeds.

The item records (aggregator_item table) save category tags as a serialized array of arrays.

categories[index]['DOMAIN']: optional link - in Drupal it's a URL to a taxinomy page
categories[index]['CATEGORYNAME']: The name of the category on the remote system

The item records (aggregator_item table) save enclosures as a serialized array. The keys are

enclosure['URL']: the link to download the enclosure
enclosure['LENGTH']: enclosure file size in bytes
enclosure['TYPE']: enclosure MIME type

There's also an 'enclosure_retrieved' field for the sake of anyone that wants to write code to retrieve enclosures automatically.

The theme_aggregator_page_item() function has been updated to display categories and enclosures.

Comment #7

Prometheus6 commented 16 October 2005 at 15:14

Status	File	Size
new	AGGREGATOR_4_7_MYSQL.txt	1.3 KB

Update to database.mysql

Comment #8

Prometheus6 commented 16 October 2005 at 15:15

Update to database.pgsql

This is a best guess because I don't run PostgreSQL on any of my systems.

Comment #9

Prometheus6 commented 16 October 2005 at 15:16

Status	File	Size
new	AGGREGATOR_4_7_PGSQL.txt	1.41 KB

the patch would be nice...

Comment #10

dries commented 21 October 2005 at 13:28

We need an upgrade path (a patch against update.php to make the required changes to the database scheme).
Coding style: write $attributes rather than $attrbs.
Coding style: instead of $val, write $value.
I don't understand what enclosure_retrieved is for. Is that like the raw data?
Coding style: placement of { and }.
Coding style: write && and || rather than AND and OR.
I believe this patch has some security issues; raw data is emitted straight to the database/browser. Example: ... "<a href=\"{$remote_category['DOMAIN']}\" alt=\"{$remote_category['CATEGORYNAME']}\"> {$remote_category['CATEGORYNAME']}</a>";
Some hardcoded strings can't be translated.

Comment #11

drumm

he/him

NY, US

commented 22 October 2005 at 09:21

Status:

Needs review

» Needs work

The item records (aggregator_item table) save category tags as a serialized array of arrays.

Serialized data should rarely be stored in the database because it makes queries harder to write. For example, in this case how would I get all the items with a certain tag?

Comment #12

Prometheus6 commented 22 October 2005 at 14:31

Status	File	Size
new	AGGREGATOR_CATEGORIES_ENCLOSURES_0.txt	20.52 KB

Dries:
First, let me acknowledge to code style issues.

I don't understand what enclosure_retrieved is for. Is that like the raw data?

That's just a flag that can be set if you download the enclosed file so you can serve it from your own web site. There's no functionality to do that in the aggregator; I just figure someone will want to do it, probably during a cron run.

I believe this patch has some security issues; raw data is emitted straight to the database/browser.

Only the content (Atom) and/or description (RSS) tags escape being scrubbed as the first step.

     foreach ($item as $key => $value) {
       $value = decode_entities(trim($value));
       $value = preg_replace('/\Wstyle\s*=[^>]+{question mark}>/i', '>', $value);
       $value = preg_replace('/\Won[a-z]+\s*=[^>]+{question mark}>/i', '>', $value);
      if (stristr($key, 'CONTENT')) {
        $item['DESCRIPTION'] = $value;
      }
      else if ($key != 'DESCRIPTION') {
        $value = strip_tags($value, variable_get('aggregator_allowed_html_tags', 'list of allowed tags'));
      }
       $item[$key] = $value;
     }

Later, the content/description is passed through the default filter, so it should all be safe. I guess that would depend more on the filter.

How about if I create an aggregator_filter setting, so it can use any filter at all? That would be better for sites that use, say, the bbcode filter.

Some hardcoded strings can't be translated.

Like "in content"? Those are just state indicators. They are never output, and in fact the array key and value is unset at the end of processing.

Drumm:

For example, in this case how would I get all the items with a certain tag?

Okay, I was unclear. These are the categories in the feed, not the ones local to your site. I didn't anticipate that anyone would want to query them. The aggregator_category_item table is unchanged.

Comment #13

Prometheus6 commented 22 October 2005 at 15:57

Status:

Needs work

» Needs review

Status	File	Size
new	AGGREGATOR_4_7_DATABASE.txt	4.12 KB

A patch to the database.??sql files and update.inc

Comment #14

drumm

he/him

NY, US

commented 23 October 2005 at 13:34

I think that there might be some interesting uses for the categories in the RSS feed. I thikn we should make them queryable now so that there isn't a problem when someone thinks up somehting cool to do with them.

Comment #15

Prometheus6 commented 23 October 2005 at 16:09

That calls for another table...Drupal feeds are good examples for this because we export taxinomy urls in the domain attribute:

iid: item id
external_category_name
external_category_domain

One row per external category.

My personal vote on this is -1, though. Once you've retrieved the items to display, each of them need another query to retrieve its categories...and you have to get them for the item to be fully themeable. An extra query per external category is needed on saving items as well. There are clever things that can be done with the external categories but they all require additional programming anyway.

In fact, I actually think aggregator_category_item is pretty useless. All that ever happens is the feed categories are assigned to each item, and I really think a join could get rid of the table.

Interesting thought though; given the possible values that will be in there, a full text filter on the field holding the serialized array would be just as searchable as a seperate table. I don't know if it's a good idea in general...

Comment #16

Prometheus6 commented 6 December 2005 at 15:55

Status:

Needs review

» Closed (fixed)

No longer applies to HEAD.

Comment #17

boris mann commented 6 December 2005 at 16:56

Status:

Closed (fixed)

» Needs work

Thanks for your work so far. Are there bits and pieces that can be used?

@ Drumm, Dries, which bits should be worked on/are likely to be committed?

Comment #18

chx commented 6 December 2005 at 16:56

An issue rarely needs to be set to closed by hand. It can be fixed or by design but closed? Why you closed?

Comment #19

Prometheus6 commented 6 December 2005 at 19:59

chx:

Why you closed?

The last patch, which introduced format_xss(), eliminated the configurable allowed tags. Meanwhile, while waiting for HEAD to settle down a bit, I've got the use of filters (which also conflicts with using format_xss()) more extensive parsing to use enclosures correctly and manual promotion of RSS items to story nodes...in 4.6.x.

It would be a big patch to bring all that back in line for 4.7, and we don't like big patches. So I'm restarting, and making sure my patches address one issue at a time.

Boris:

Are there bits and pieces that can be used?

See above...I'll definitely post a patch for handling enclosures this week.

Comment #20

chx commented 6 December 2005 at 20:19

The last patch, which introduced format_xss(), eliminated the configurable allowed tags. -- should be very easy (I guess one line) to put it back, as filter_xss has a second argument which defines the allowable tags.

Comment #21

Prometheus6 commented 6 December 2005 at 22:27

Okay, aggregator_settings() wasn't touched. I can patch that by tomorrow morning (stuff to do first).

Comment #22

Prometheus6 commented 9 December 2005 at 01:58

Status	File	Size
new	AGGREGATOR_4_7_RC1.txt	20.72 KB

Not so simple, but...

The schema updates still apply.
http://drupal.org/files/issues/AGGREGATOR_4_7_DATABASE.txt

Comment #23

sethcohn commented 27 May 2006 at 20:23

Bump. Status?

Comment #24

Prometheus6 commented 8 June 2006 at 19:45

I haven't looked at it since December...4.7 was too elusive a target for me to keep up with at that point.

I can look at this again this weekend. Looking back over the thread, searchable categories (the ones in the feed, not the ones in the aggregator) is still sticky. Maybe I can concatenate them into a single string?

Comment #25

catch

he/him

English

commented 26 October 2007 at 15:56

Version:	x.y.z	» 7.x-dev
Status:	Needs work	» Active

Changing to real version. Marking to active since the patch is waaaay to old to apply now.

Comment #26

mustafau commented 15 August 2008 at 19:05

I am marking this duplicate in light of #16485: Enclosure support for Aggregator.

Comment #27

catch

he/him

English

commented 16 August 2008 at 01:41

Status:

Active

» Closed (duplicate)

Rewritten parser, plus

Comments

Comment #1

Comment #2

Comment #3

Comment #4

Comment #5

Comment #6

Comment #7

Comment #8

Comment #9

Comment #10

Comment #11

Comment #12

Comment #13

Comment #14

Comment #15

Comment #16

Comment #17

Comment #18

Comment #19

Comment #20

Comment #21

Comment #22

Comment #23

Comment #24

Comment #25

Comment #26

Comment #27

News items

Our community

Documentation

Drupal code base

Governance of community