Sometimes it is useful to change feed/XML data before mapping, for example for scraping, custom tagging, translating, or filtering.

FeedAPI had hook_feedapi_after_parse(), which was extremely useful for working with my custom data and nodes.

The current options aren't great:
- Extend an existing parser - not ideal, because the modification should be parser independent.
- Extend an existing node processor - no way to add new sources for mapping parameters without either doing the processing there or duplicating code.

Possible Features

Allow multiple parsers, where our preprocessor would essentially be a post-parser. But this doesn't seem like an elegant solution.

OR

Have a class FeedsPreProcessor that developers could extend:

class FeedsCustomPreProcessor extends FeedsPreProcessor {
  public function preprocess(FeedsParserResult $parserResult) {
    // ...
  }
}

This would allow for easy modification/customization of feed/XML data.

I've switched from FeedAPI to Feeds just recently and really like it. FeedAPI was far too clunky and your design has really simplified the process. The ability to customize imports means almost any type of data can be processed, which is very powerful, but without the ability to easily modify that data, it seems like an important part is missing.

Thanks, please let me know what you think.


Comments

alex_b’s picture

Have you thought about a hook_feeds_after_parse() ? That would correspond with hook_feeds_after_import().

funkmasterjones’s picture

I don't see an appropriate place where these should be implemented. You don't want an invoke call in all of the base or sub classes. FeedsSource->import() looks best, but what if parsing happens outside import() (like if per-node mapping is implemented)?

Even if a suitable place is found, how would one make a form for customizing? The use-case to consider would be a customizable filter. Wouldn't the best place for the customizable filter form be in the importer GUI? I don't know how a hook would accomplish this. The only current way is to extend an existing parser that filters after calling its parent parser, or to extend a processor and filter before processing. Either way, you're tied to an existing parser or processor, and the filter should be independent of both.

I think the only solution is to allow chaining of parsers/processors, or the creation of an intermediate class. Otherwise, the import pipeline is limited in what it can do.

Am I missing something here?

alex_b’s picture

FeedsSource::import() would be the place, yes.

what if parsing happens outside import() (like if per node mapping is implemented)

Parsing only happens outside of import() in Feeds' hook_nodeapi() implementation - I take it you are referring to _feeds_nodeapi_node_processor(). In fact, I am thinking of wrapping these calls into a FeedsSource::fetch() and a FeedsSource::parse() (the latter would fetch + parse). So we could solve this problem.
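A rough sketch of the wrapping described above could look as follows. This is a guess at how FeedsSource might delegate to the importer's plugins, not actual Feeds code; the property names and the hook invocation point are assumptions.

```php
<?php
// Hypothetical sketch only: FeedsSource wrappers that would give an
// after-parse hook a single invocation point.
class FeedsSource extends FeedsConfigurable {

  // Fetch the raw source via the importer's fetcher plugin.
  public function fetch() {
    return $this->importer->fetcher->fetch($this);
  }

  // Fetch, then parse; hook_feeds_after_parse() could be invoked here.
  public function parse() {
    $batch = $this->importer->parser->parse($this->fetch(), $this);
    module_invoke_all('feeds_after_parse', $this->importer, $this);
    return $batch;
  }
}
```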

The use-case to consider would be a customizable filter.

Could you give a concrete example what such a filter would do and what GUI would be required in such a case? What is the concrete functionality that you are trying to build right now?

(In general terms, I do recognize the problem you are talking about. I have implemented an auto tagger for syndication parsers here http://drupalcode.org/viewvc/drupal/contributions/modules/extractor/ - it required extending both syndication parsers that ship with Feeds.)

funkmasterjones’s picture

Thanks for your timely reply.

Yes, wrapping the calls would allow for the proper placement of additional hooks. I think hooks are good in general, but they aren't a great solution for this kind of problem, because you may want filtering/extracting for some importers and not for others.

Your extractor module's design is what I originally tried, and its main fault is having to create a separate class for each parser. The extractor shouldn't be tied to any parser, because you then duplicate code for each parser. The form for the extractor should also be separated from the simplepie/common syndication parser forms.

So this is not a bug or a difficulty I'm having, but rather a design recommendation.

(Here's my use case, but it's essentially the same as a use case for your extractor.)
Use case: customizable filter
Form located in the importer GUI between Parser and Processor
Form: "Only allow [select box of tags] that contain [textfield to hold pattern]"
(say I only want news where the title contains Iraq)
Input: select the "title" option from the select box and enter "Iraq" in the textfield
Result: after parsing, the filter module is called and removes items that don't match the user input, then processing proceeds
Benefits: parser and processor independent at both the class level and the GUI level
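To make the use case concrete, a minimal sketch of this filter as the FeedsPreProcessor plugin proposed in the issue summary might look like the following. FeedsPreProcessor does not exist in Feeds; the class name, the structure of the items array, and the hard-coded form values are all assumptions.

```php
<?php
// Hypothetical sketch: the filter from the use case above, written as the
// proposed FeedsPreProcessor plugin. All names here are illustrative.
class FeedsFilterPreProcessor extends FeedsPreProcessor {

  // These would be populated from the filter form in the importer GUI.
  protected $element = 'title';
  protected $pattern = 'Iraq';

  public function preprocess(FeedsParserResult $parserResult) {
    // Assumes $parserResult->items is an array of associative arrays.
    foreach ($parserResult->items as $i => $item) {
      // Remove items whose selected element does not contain the pattern.
      if (strpos($item[$this->element], $this->pattern) === FALSE) {
        unset($parserResult->items[$i]);
      }
    }
  }
}
```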

Now suppose I wanted to use your extractor along with my filter: then I would need to create four parsers! Suppose I wanted to add another module to the mix: the number of parser classes grows exponentially! What if I wanted to change the order in which these modules were called? Even more parsers! There are cases to extend parsers and cases to chain them; this would be the latter, no?

Solution: create a new base class for a post-parser and allow these to be chained together.
You could just allow chaining of parsers, but I think that is conceptually wrong.

Thanks, Derek

ckng’s picture

Talking about a preprocessor, I think there is a need for a postprocessor as well.
The extractor is already mentioned in #3 as an example use case.

Preprocessor and postprocessor should be interchangeable (i.e. could exist in both steps). Use cases that come to mind:
- scraping, custom tagging, translating, filtering
- feed item manipulation (url, text, etc)
- content analysis (ranking, related content)
- selectively removing existing feed items
- custom output (static caching or other deliveries)

Other methods, like using Rules/Trigger/Workflow, are possible when dealing with node-based feed items, but not for data. Even for nodes, those modules are a bit complicated/overkill as part of the process, IMO.

Instead of multiple/custom parsers or pre/postprocessors, another possible route is support for multiple processors, e.g.:

Feed node processor + Taxonomy term processor = what Extractor does
Feed node processor + Taxonomy term processor + scraping + translating + etc. processors

These are independent, as each works only on the result of the parser and none relies on another's output.

giorgio79’s picture

I was just looking for such a feature, where I could filter which feed items actually get imported from the feed.

For example, one of my feeds has an integer value, and I only want to import items whose integer value is greater than x, such as:
import if number > 5
This is similar to a Views filter, IMHO.
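Such a numeric filter could be sketched with the after-parse hook discussed in this thread. The field name 'number', the threshold, and the getItems()/removeItem() accessors (proposed in #8 and later in this thread, not stock FeedsImportBatch methods) are all assumptions.

```php
<?php
// Illustrative sketch only: drop parsed items whose numeric field is too
// small. Relies on getItems()/removeItem() accessors proposed elsewhere
// in this thread, not on methods FeedsImportBatch currently provides.
function mymodule_feeds_after_parse(FeedsImporter $importer, FeedsSource $source) {
  $items = $source->batch->getItems();
  foreach ($items as $i => $item) {
    // Only keep items where number > 5.
    if ((int) $item['number'] <= 5) {
      $source->batch->removeItem($i);
    }
  }
}
```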

How about pushing the feed first into a view, and then configure the view with filters etc in views admin with whatever someone wants, and once configured use that view for the periodic and regular import?

rjbrown99’s picture

I have created a rather extensive pre-processor, but I did it as a separate module. I'm using it for affiliate marketing where I get a CSV feed of retail items that I want to create nodes for.

My module takes the CSV feeds served by HTTP/FTP, uses concurrent curl calls to download the feeds, and then processes them.

After downloading and during processing, it breaks out each field of the CSV file and allows a per-file and per-field override. Basically, the overrides are include files that allow you to do pretty much anything, from changing case to normalizing keywords. If I'm working with the file Myretailer.csv and it finds an include file called Myretailer.inc, it loads it up and lets you do whatever you want to the data. For example, I have an array that replaces the terms they provide in the feed with taxonomy terms in my system. If they give me "Greenish" but my term is "Green", my replacement array works on that specific field and replaces it.

I am also adding new fields to the CSV file based on the parsing results. For example, if I see the keyword "Women" anywhere I add a new CSV field for it. This also matches my taxonomy in Drupal and allows me to get the items roughly right on import and even create new fields that didn't exist on the original feed.

The final step is to move the CSV to a specific place where feeds can find it. I then just use a normal feeds import to pull the data in. This also works around the issue whereby Feeds can only import CSV files directly from the local filesystem.

I have no idea if this is the kind of thing everyone was thinking of in terms of a preprocessor. I wanted it to be separate because it makes debugging easier. I can do the download in one step, parse in another step, and look at the results before they are fed to Feeds. It has taken me a while to get the parsing right with quotation marks and what not.

infojunkie’s picture

FileSize
626 bytes

Just for reference, here's a patch that worked well for me. I was looking to pre-create nodes that would be referenced by the node processor, so I implemented hook_feeds_after_parse() and called $batch->getItems() to iterate over the items. FeedsImportBatch doesn't have a magic get function, so I had to add this method manually.

twistor’s picture

My thoughts on this are that there should be an option in the configuration, between parsers and processors, to allow data manipulation. These should be chainable so that you could, for example, apply trim(), then a find/replace, then make URLs absolute. This would be incredibly useful.
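Such a chain could be sketched in plain PHP. The function name feeds_apply_modifiers() and the example base URL are made up for illustration; nothing here is part of the Feeds API.

```php
<?php
// Sketch of chainable value modifiers, applied between parsing and
// processing. The function name and base URL are illustrative only.

// Apply a list of callables to a single parsed value, in order.
function feeds_apply_modifiers(array $modifiers, $value) {
  foreach ($modifiers as $modifier) {
    $value = call_user_func($modifier, $value);
  }
  return $value;
}

// Example chain: trim whitespace, then make root-relative URLs absolute.
$modifiers = array(
  'trim',
  function ($url) {
    return preg_match('@^https?://@', $url) ? $url : 'http://example.com' . $url;
  },
);

echo feeds_apply_modifiers($modifiers, '  /node/1  ');
// Prints: http://example.com/node/1
```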

infojunkie’s picture

Concerning the decision of whether to architect the data modifications as a hook or a new family of plugins, I would opt for the hook approach. Here's why: there is a potentially infinite number of functions that one would want to apply to modify the data, based on one's own business requirements. And most of them, like the example in #9 above, are just about trimming, replacing, or otherwise modifying strings. Which means we would end up with a large number of plugins such as FeedsModTrim, FeedsModReplace, etc., which really do nothing but encapsulate single PHP functions. The demand for new such plugins would never cease. On the other hand, the hook approach gives as much or more flexibility through one simple modification in the core Feeds module. Business application developers are free to implement that hook whichever way they choose.

twistor’s picture

I agree with your logic; however, I have some reservations. Feeds is architected such that it can support an infinite number of plugins; they don't all have to be included in Feeds. I'm currently working on Feeds XPath Parser, which I'm primarily using for scraping. The question is whether end users can configure the modifications. If so, then you still end up with the same infinite-plugins problem. Assuming that everyone who uses Feeds can write a hook implementation seems odd. Basically, I'd like to be able to provide users with a GUI that they can configure. A hook could work, but then the implementations would be specific to the plugin. Allowing for modifiers in the import configuration seems, to me, to be the ideal solution. Then Feeds could support a base set, and plugins could add things that are particular to their purpose.

infojunkie’s picture

I do agree that Drupal tends toward code-less programming, and I support that wholeheartedly. I just think that for simplicity's sake, a hook would work well and could open the door for more end-user-oriented solutions.

twistor’s picture

Status: Active » Needs review

I rescind my point. I'm thinking of creating a new module using this hook that allows for the manipulations I described. Patch works quite nicely.

alex_b’s picture

Re #2:

I don't see an appropriate place where these should be implemented.

I actually think a case can be made that any hook_after_parse() implementation needs to be parser specific. Hence I would argue that invoking them on a per-parser basis is fine. This is just like how the mapping API is processor specific. Indeed, in the meantime we have an OO hook in FeedsSimplePieParser (FeedsSimplePieParser::parseExtensions()).

infojunkie’s picture

Re #14: I disagree that hook_after_parse is parser-specific. Rather, it is job-specific, because I could be using the CSV parser for two different import jobs, each of which requires different business logic (i.e., massaging input fields after they are parsed). I would not want to subclass the CSV parser every time.

alex_b’s picture

Title: Preprocessor Support for custom data mods » hook_feeds_after_parse()
FileSize
950 bytes

#15: I agree that extensions will be job specific.

My point was rather that it's very unlikely anyone would write an after-parse implementation that can manipulate any parsed data, no matter where it came from. In fact, any manipulation of what's in $batch->items requires intimate knowledge of the parser that produced it. In that light, my argument was that it would be fine to create parser-specific hooks.

Either way, here is a version of #8 that is more in line with hook_feeds_after_import(). Both hook signatures should probably be updated, but for the time being let's keep them at least consistent.

alex_b’s picture

BTW, I would be interested in what the computational cost of these hooks is.

infojunkie’s picture

Re patch #16: how can one manipulate batch items with this patch, given that this is the main point of this hook?

Re computational cost: you mean actual timing, right? Because in terms of complexity, this hook is only called once per job so there's no significant hit.

alex_b’s picture

Status: Needs review » Needs work

Sorry, yes, we would have to expose the batch object for this approach.

infojunkie’s picture

How can this issue move forward? I am relying on this new hook in most of my Feeds deployments.

alex_b’s picture

Status: Needs work » Needs review
FileSize
2.24 KB

#18 - this is actually possible with this patch:

function hook_feeds_after_parse(FeedsImporter $importer, FeedsSource $source) {
  // For example, set title of imported content:
  $source->batch->setTitle('Import number '. my_module_import_id());
}

- Added invocation of this hook to preview().
- Added documentation.

@infojunkie: If I get your RTBC, I can commit.

infojunkie’s picture

Status: Needs review » Needs work

One of the main uses for hook_feeds_after_parse() will be to adjust item values. Today, the FeedsImportBatch class does not allow its items to be read or modified except through setItems() and addItem(). What's needed is one of the following:

* A reference accessor to the items array, such as getItems() in #8 above, or
* A reference accessor to each item individually along with an iterator, or
* Making items public
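For illustration, the three options might look roughly like the following inside FeedsImportBatch. None of these exist in Feeds at the time of this comment; the by-reference signatures and the countItems() helper are assumptions.

```php
<?php
// Hypothetical sketches of the three accessor options listed above, as
// members of FeedsImportBatch (FeedsBatch.inc). Illustrative only.

// 1) A reference accessor to the whole items array (as in #8).
public function &getItems() {
  return $this->items;
}

// 2) A per-item reference accessor, paired with a count for iteration.
public function &getItem($i) {
  return $this->items[$i];
}
public function countItems() {
  return count($this->items);
}

// 3) Or simply declare the property public:
// public $items = array();
```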

alex_b’s picture

Status: Needs work » Needs review

#22: agreed - can we make this a separate issue though?

#912630: Make parsed items accessible for modification

infojunkie’s picture

Status: Needs review » Reviewed & tested by the community

Works for me. Thanks!

alex_b’s picture

Status: Reviewed & tested by the community » Fixed
kvvnn’s picture

At this point, if you want access to the items of an import, the following is necessary:

1) Add the FeedsBatch function from #8:

public function getItems() { return $this->items; }

in FeedsBatch.inc, around line 182 (in v1.15; see the patch in #8).

2) Apply the patch in #21.

3) In a custom module, use:

function yourmodule_feeds_after_parse(FeedsImporter $importer, FeedsSource $source) {
    // This uses the getItems() function from #8.
    $items = $source->batch->getItems();

    // Find out what my items are comprised of.
    dsm($items[1]);

    // Now you can loop through the items and manipulate them.
    for ($i = 0; $i < count($items); $i++) {
        // Personally, I am interested in the email address field from a CSV file I'm uploading.
        $email = $items[$i]['e-mail address'];
        // I want to see if the email address already exists in my data table.
        $duplicate = db_result(db_query("SELECT timestamp FROM {feeds_data_distribution_list} WHERE e_mail_address = '%s'", $email));
        // If it does, or the email is empty...
        if ($duplicate || !$email) {
            // ...this is a duplicate, so delete the row.
            // I created a method in FeedsBatch.inc: public function removeItem($i) { unset($this->items[$i]); }
            $source->batch->removeItem($i);
        }
    }
}

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.