I've talked to Alex about this before and he asked me to post it to the issue queue so that I could explain it better. Here goes. :-)

I would propose that Feeds needs a standard mechanism for row parsers and possibly row processors.

Currently, as I understand how the system works, an incoming feed is parsed en masse to an intermediary form in PHP, usually an array. That intermediary form is then passed en masse to a processor, which takes that data and generates nodes, users, table records, or whatever the desired end result is.

That's all well and good when the incoming format is a unified format, like RSS or CSV. However, the vast majority of feeds one is going to process will follow a very regular format: They will essentially be an array of "things to import", represented in whatever the incoming format is. The format of each item in that array will be largely identical, and processed in a loop. Similarly, what we're saving to will be largely identical.

However, not all incoming feeds will be in a single unified format. Many formats support wrapping an arbitrary other format internally. The example that comes to mind for me is Atom. Atom itself supports a number of different payload formats, including raw text, HTML, arbitrary XML, etc. However, there is no Feeds-core-supported way (that I'm aware of) to handle the Atom envelope separately from the payload; the feed is treated as a single entity. We ran into this issue when developing the Feeds Atom module. We had a rich, complex payload (a custom RDF-ish XML format for representing Entities and Fields) wrapped in Atom, but the parser and processor only get them in combined form. That means the Atom parser we wrote cannot be used without that custom XML payload, and conversely, we could not wrap that custom XML payload in a different feed format.

Instead, I propose that the "feed parser" and the "row parser" be separated. The responsibility of the feed parser is to extract each of the entries in the feed, attach any relevant metadata, and then pass the entry and its metadata along to a configured row parser. The row parser is then responsible for extracting the relevant data from just that one record into a standardized PHP format. That single-entry record is then passed off to a processor that turns that internal format into a node, user, table row, etc.
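To make the proposed split concrete, here is a minimal sketch (in Python for brevity; the real thing would be PHP plugin classes, and all names here are hypothetical, not the Feeds API). The feed parser only splits the document into entries plus metadata; a pluggable row parser turns each payload into the standardized record:

```python
# Hypothetical sketch of the feed-parser / row-parser split.
# The "Atom" parser here is faked with JSON purely to keep the
# example self-contained and runnable.

import json


class AtomFeedParser:
    """Feed-level parser: extracts entries plus metadata, nothing more."""

    def parse(self, raw_feed):
        doc = json.loads(raw_feed)
        for entry in doc["entries"]:
            yield {"payload": entry["content"], "meta": {"id": entry["id"]}}


class JsonRowParser:
    """Row-level parser: maps one payload to the standardized record form."""

    def parse_row(self, payload):
        return json.loads(payload)


def run_import(raw_feed, feed_parser, row_parser, processor):
    """Drive the pipeline: feed parser -> row parser -> processor."""
    for entry in feed_parser.parse(raw_feed):
        record = row_parser.parse_row(entry["payload"])
        processor(record, entry["meta"])
```

The point of the split is that `JsonRowParser` could be swapped for any other row parser without touching `AtomFeedParser`, and vice versa.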

This is a direct mirror of row style plugins in Views: each record in the result set is rendered separately, and then aggregated into a single output by the overall style plugin. As with Views, there are use cases where that doesn't apply and you'd still want to handle the entire result set or feed at once; those should continue to be supported. In essence, this would make Feeds a more direct mirror of Views (and I believe that comparison is apt).

There are a number of advantages to this approach:

  1. Row parsers and Feed parsers become more generic and can be interchanged more easily; it would be awesome to have a natively supported way to swap out a different payload for Atom feeds, or a way to trivially wrap NodeXML into some other envelope format.
  2. Because we're separating out each row, there's no requirement that rows have to be parsed and processed in the same request. It would be easy to simply dump the rows into a queue and then process them one at a time either in the same request if it's small enough, via batch API if it's an interactive pull, or over time with the Drupal Queue module (or D7's native queue system). I believe there's another issue open on the same subject. That would allow for handling vastly larger incoming feeds as queues can scale almost indefinitely.
  3. As with the benefits of the queue over the old way update hooks ran (where you had to keep track yourself of how many items were left to process), row parsers become easier to write, as you need only handle one (1) record at a time. Questions of synchronization and scaling are handled "elsewhere", and a plugin author needn't worry about them.
  4. Because it more closely parallels Views, it should make it easier for new developers to pick up. "Ah, it's just like Views only backwards."
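Advantage 2 can be sketched in a few lines (again illustrative Python, not the Feeds or Drupal Queue API; the function names are invented): once each row is an independent work item, parsing and processing no longer have to share a request.

```python
# Sketch of queue-backed row processing: parse now, process later,
# a few items at a time (e.g. per cron run or batch step).

from queue import Queue


def enqueue_rows(rows, queue):
    """Parsing stage: push each parsed row as its own work item."""
    for row in rows:
        queue.put(row)


def process_some(queue, processor, limit):
    """Processing stage: handle at most `limit` items, then yield control."""
    done = 0
    while done < limit and not queue.empty():
        processor(queue.get())
        done += 1
    return done
```

Because the queue holds self-contained rows, it makes no difference whether `process_some` runs immediately, via the batch API, or minutes later from cron, which is what lets this design scale to very large feeds.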

The disadvantages would be additional internal complexity, especially as the system is already written, and the need to provide some way to handle the case where processing of one item does depend on a previous item. That may need to just be handled with a non-row-using parser, which should be retained. There are also UI implications to be considered, although I firmly believe those are surmountable.

Hopefully that better explains what I'm talking about. :-) Alex, let me know if it's still unclear.

Comments

twistor’s picture

As someone who proliferates Feeds parsers, I am in full support of this. I'm thinking of some sort of object composition, where the combination of a feed parser and row parser implement a full parser interface. That should support old style and new style parsers.
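The composition twistor describes might look roughly like this (an illustrative Python sketch with invented names; in Feeds this would be PHP classes): a feed parser and a row parser are wrapped in an adapter that still satisfies the existing parse-the-whole-document interface, so old-style and new-style parsers can coexist.

```python
# Object composition: (feed parser + row parser) -> full parser.

class LineFeedParser:
    """Minimal example feed parser: one entry per line."""

    def split(self, raw_feed):
        return raw_feed.splitlines()


class UpperRowParser:
    """Minimal example row parser: normalizes each entry."""

    def parse_row(self, entry):
        return entry.strip().upper()


class ComposedParser:
    """Adapter presenting the same all-at-once result an old-style
    whole-document parser would return."""

    def __init__(self, feed_parser, row_parser):
        self.feed_parser = feed_parser
        self.row_parser = row_parser

    def parse(self, raw_feed):
        return [self.row_parser.parse_row(e)
                for e in self.feed_parser.split(raw_feed)]
```

An old-style parser implements `parse()` directly; a new-style pair is wrapped in `ComposedParser`, and the rest of the pipeline can't tell the difference.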

alex_b’s picture

Thank you for the detailed writeup.

A couple of thoughts / ideas:

A)

I think it is not practical to refactor all Parser Plugins (FeedsCSVParser, FeedsSyndicationParser...) to use row parsers: this would not just affect the Parser Plugins themselves, but also the libraries they use (ParserCSV.inc, common_syndication_parser.inc, ...). That means not just a large amount of work, but also hampered portability of our parsing libraries and a reduced ability to use third-party libraries.

B)

Mix and match sounds great when staying within XML, but it has some real problems when mixing different formats. E.g. what does "JSON rows in CSV documents" mean? How can we be sure that one format can technically contain another one? At the least, free mixing and matching of document parsers and row parsers moves us beyond what the standards define; at the worst, certain combinations will just break.

C)

So, considering A and B, I wonder whether row-level parser plugins are a good idea at the global Feeds Importer configuration level, and whether it wouldn't be better to come up with a Parser Plugin that exposes a row-level parsing API. Such a Parser Plugin could make many more assumptions about the kinds of row-level plugins it expects.
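One way to read point C (a hypothetical sketch in Python with invented names, not anything in Feeds): instead of a global row-parser slot on the importer, a single host Parser Plugin declares which kinds of row plugins it accepts, so the incompatible combinations worried about in B (e.g. "JSON rows in CSV documents") are rejected up front.

```python
# A host parser that constrains which row plugins may be attached to it.

class XmlHostParser:
    """Host parser for XML documents; only accepts row plugins that
    declare they understand XML fragments."""

    accepts = {"xml-row"}

    def __init__(self, row_plugin):
        if row_plugin.row_type not in self.accepts:
            raise ValueError(
                f"{row_plugin.row_type!r} rows cannot live in an XML document"
            )
        self.row_plugin = row_plugin


class AtomRowPlugin:
    row_type = "xml-row"


class CsvRowPlugin:
    row_type = "csv-row"
```

The compatibility check moves out of the global configuration UI and into the host parser, which is the only place with enough context to know what it can contain.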

D)

Just to throw it out there: adityakg (a GSoC '10 student) was planning to come up with a Views display plugin that would be a Feeds parser. Build a View, select the view in your Feeds Parser settings, go. I still think this is an exciting idea. I don't know what came of it.

PS:

Not strictly related, but it is worth reading up on how results are passed from the fetching to the parsing to the processing stage in Drupal 7. Basically, the FeedsBatch class is gone, replaced with a FeedsFetcherResult and a FeedsParserResult class:

http://github.com/lxbarth/Feeds/blob/master/plugins/FeedsFetcher.inc
http://github.com/lxbarth/Feeds/blob/master/plugins/FeedsParser.inc

voxpelli’s picture

I overall like this suggestion, but wanted to share some thoughts on an alternate solution, which most likely uses every possible term in the wrong way and repeats what others have already suggested, but might still be of some use:

Couldn't all of the basic parsing be done by the parser, with the "row parsers" only doing any additional parsing that's required? E.g. the XML, CSV, or JSON parser converts the entire basic structure into a PHP format, which is then sent to "row preprocessors"/"row parsers" that transform the data, making it more digestible for the processors to work with.

So a flow like: Fetcher -> Parser -> Row Preprocessors -> (Row) Processor

To parse an Atom feed you would then pair the XML parser with an Atom Row Preprocessor, and if the Atom feed happens to be extended with some other XML standard, like geo-data or perhaps Activity Streams, you could attach an additional Row Preprocessor that transforms that data into a format that the Processor can map.

The Row Preprocessor could, e.g., parse, combine, and filter the data from the Parser to make it more digestible for the Processor. By having the Parser do all of the basic parsing and the Row Preprocessor only the additional parsing, an Atom Row Preprocessor would be able to work, with little or no modification, on Atom representations in any format, XML as well as JSON.

If someone needs to import a very non-standard XML file where no standard Row Preprocessor could be used, we could even have a generic Row Preprocessor that could be configured through a UI to combine and filter the content in certain ways.
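The Fetcher -> Parser -> Row Preprocessors -> Processor flow above, with chained preprocessors, might look roughly like this (an illustrative Python sketch; all names are invented, and real preprocessors would of course do real Atom/geo mapping):

```python
# Sketch of voxpelli's flow: the parser yields generic rows, and a
# chain of row preprocessors transforms each row before the processor
# maps it.

def run_pipeline(raw_rows, preprocessors, processor):
    """Apply each preprocessor in order to every row, then process it."""
    results = []
    for row in raw_rows:
        for pre in preprocessors:  # e.g. Atom first, then geo-data
            row = pre(row)
        results.append(processor(row))
    return results


def atom_preprocessor(row):
    """Lift an Atom-ish key into the generic field name a processor maps."""
    return {"title": row.get("atom:title", ""), **row}


def geo_preprocessor(row):
    """An add-on preprocessor for a second standard, stacked on the first."""
    if "geo:point" in row:
        row["location"] = row["geo:point"]
    return row
```

Because each preprocessor only sees the generic row format, the same Atom preprocessor works no matter which parser (XML or JSON) produced the rows, which is the portability argument made above.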

With Row Preprocessors we would solve A) and B), since they would be an additional layer between the Parser and the Processor instead of a replacement for something that's already there. They work on the data from the Parser and transform it into something that the Processors like.

Regarding designing Feeds in parallel with Views: one problem with that might be that Views isn't very good at constructing trees of data, while Feeds must be able to parse trees with at least a few levels of data in them.

geek-merlin’s picture

Very interesting suggestion. This might benefit from #943344: Transformers, which is a similar proposal coming from a different problem.

dixon_’s picture

Interesting! Subscribing. Listening.

twistor’s picture

Status: Active » Closed (won't fix)

Cleaning up old issues. If someone would like to pick up this torch, go ahead.