Sometimes it is useful to change feed/XML data before mapping, for things such as scraping, custom tagging, translating, filtering, etc.
FeedAPI had hook_feedapi_after_parse(), which was extremely useful for working with my custom data and nodes.
The current options aren't great:
- Extend an existing parser: not great, because the preprocessing should be parser-independent.
- Extend an existing node processor: there is no way to get new sources for mapping parameters without processing the data or writing redundant code.
Possible Features
Allow multiple parsers, where our preprocessor would essentially be a post-parser. But this doesn't seem like an elegant solution.
OR
Have a class FeedsPreProcessor that developers could extend:
  class FeedsCustomPreProcessor extends FeedsPreProcessor {
    public function preprocess(FeedsParserResult $parserResult) { ... }
  }
This would allow for easy modification/customization of feed/XML data.
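To make the proposal concrete, here is a minimal, runnable sketch. Note that FeedsPreProcessor and FeedsParserResult are names proposed in this issue, not an existing Feeds API, so simplified stand-ins are defined inline:

```php
<?php
// Sketch only: FeedsPreProcessor and FeedsParserResult are names proposed
// in this issue, not an existing Feeds API. Minimal stand-ins are defined
// here so the idea is runnable on its own.
class FeedsParserResult {
  public $items;
  public function __construct(array $items) {
    $this->items = $items;
  }
}

// Proposed base class; a no-op by default.
class FeedsPreProcessor {
  public function preprocess(FeedsParserResult $parserResult) {}
}

// A developer-supplied preprocessor that drops items whose title lacks a
// keyword. The importer would call this between parsing and processing.
class FeedsCustomPreProcessor extends FeedsPreProcessor {
  public function preprocess(FeedsParserResult $parserResult) {
    foreach ($parserResult->items as $key => $item) {
      if (stripos($item['title'], 'iraq') === FALSE) {
        unset($parserResult->items[$key]);
      }
    }
  }
}
```

Because the preprocessor only touches the parsed result, it stays independent of whichever parser produced it and whichever processor consumes it.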
I've switched from FeedAPI to Feeds just recently and really like it. FeedAPI was far too clunky and your design has really simplified the process. The ability to customize imports means almost any type of data can be processed, which is very powerful, but without the ability to easily modify that data, it seems like an important part is missing.
Thanks, please let me know what you think.
Comment | File | Size | Author |
---|---|---|---|
#21 | 663860-21_hook_feeds_after_parse.patch | 2.24 KB | alex_b |
#16 | 663860-16_hook_after_parse.patch | 950 bytes | alex_b |
#8 | feeds.after-parse.patch | 626 bytes | infojunkie |
Comments
Comment #1
alex_b commented:
Have you thought about a hook_feeds_after_parse()? That would correspond with hook_feeds_after_import().
Comment #2
funkmasterjones commented:
I don't see an appropriate place where this could be implemented. You don't want an invoke call in all of the base or sub classes. FeedsSource->import() looks best, but what if parsing happens outside import() (for example, if per-node mapping is implemented)?
Even if a suitable place is found, how would one make a form for customizing? The use case to consider would be a customizable filter. Wouldn't the best place for the customizable filter form be in the importer GUI? I don't know how a hook would accomplish this. The only current way would be to extend an existing parser that filters after calling its parent parser, or to extend a processor and filter before the processor runs. Either way, you're tied to using an existing parser or processor, and the filter should be independent.
I think the only solution is to allow chaining of parsers/processors, or the creation of an intermediate class. Otherwise, the import pipeline is limited in what it can do.
Am I missing something here?
Comment #3
alex_b commented:
FeedsSource::import() would be the place, yes. Parsing only happens outside of import() in Feeds' hook_nodeapi() implementation - I take it you are referring to _feeds_nodeapi_node_processor(). In fact, I am thinking of wrapping these calls into a FeedsSource::fetch() and a FeedsSource::parse() (the latter would fetch + parse). So we could solve this problem.
Could you give a concrete example of what such a filter would do and what GUI would be required in that case? What is the concrete functionality you are trying to build right now?
(In general terms, I do recognize the problem you are talking about. I have implemented an auto tagger for syndication parsers here http://drupalcode.org/viewvc/drupal/contributions/modules/extractor/ - it required extending both syndication parsers that ship with Feeds.)
Comment #4
funkmasterjones commented:
Thanks for your timely reply.
Yes, wrapping the calls would allow for the proper placement of additional hooks. I think hooks are good in general, but they aren't a great solution for this kind of problem, because you may want filtering/extracting for some importers and not for others.
Your extractor module's design is what I originally tried, and its main fault is having to create a separate class for each parser. The extractor shouldn't be tied to any parser, because you then have duplicated code for each parser. The form for the extractor should also be separated from the SimplePie/common syndication parser forms.
So this is not a bug or difficulty I'm having, but rather a design recommendation.
(Here's my use case, but it's essentially the same as a use case for your extractor.)
Use case: customizable filter
Form located in the importer GUI, between Parser and Processor
Form: "Only allow [select box of tags] that contain [textfield to hold pattern]"
(Say I only want news where the title contains "Iraq".)
Input: select the "title" option from the select box and enter "Iraq" in the textfield
Result: after parsing, the filter module is called and removes items that don't match the user input, then processing proceeds
Benefits: parser- and processor-independent at both the class level and the GUI level
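A minimal sketch of the filter step just described, assuming parsed items are plain associative arrays; the function name and the field/pattern arguments are illustrative, standing in for what the hypothetical GUI form would supply:

```php
<?php
// Illustrative only: $field and $pattern stand in for the values the
// proposed importer-GUI form would collect.
function filter_parsed_items(array $items, $field, $pattern) {
  $kept = array();
  foreach ($items as $item) {
    // Keep only items whose chosen field contains the pattern.
    if (isset($item[$field]) && stripos($item[$field], $pattern) !== FALSE) {
      $kept[] = $item;
    }
  }
  return $kept;
}
```

Since it operates only on the parsed items array, such a filter stays independent of whichever parser or processor is configured.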
Now suppose I wanted to use your extractor along with my filter: then I would need to create 4 parsers! Suppose I wanted to add another module to the mix: the number of parser classes grows exponentially! What if I wanted to change the order in which these modules were called? Even more parsers! There are cases to extend parsers and cases to chain them; this would be the latter, no?
Solution: create a new base class for a post-parser and allow these to be chained together.
You could just allow chaining of parsers, but I think this is conceptually wrong.
Thanks, Derek
Comment #5
ckng commented:
Talking about preprocessors, I think there is a need for a postprocessor as well.
Extractor is already mentioned in #3 as an example use case.
Preprocessors and postprocessors should be interchangeable (i.e. they could exist at both steps). Use cases that come to mind:
- scraping, custom tagging, translating, filtering
- feed item manipulation (url, text, etc)
- content analysis (ranking, related content)
- selectively removing existing feed items
- custom output (static caching or other deliveries)
Other methods, like using Rules/Trigger/Workflow, are possible when dealing with node-based feed items, but not for data. Even for nodes, those modules are a bit complicated/overkill as part of the process, IMO.
Instead of multiple/custom parsers or pre/postprocessors, another possible route is support for multiple processors.
e.g.
Feed Node processor + Taxonomy term processor = what Extractor does
Feed Node processor + Taxonomy term processor + scraping + translating + etc processors
These are all independent, since each works only on the parser's result and does not rely on any other processor's output.
Comment #6
giorgio79 commented:
I was just looking for such a feature, where I could filter which feed items will actually be imported from the feed.
For example, one of my feeds has an integer value, and I only want to import items whose value is greater than x, such as:
import if number > 5
This is similar to a Views filter, IMHO.
How about pushing the feed into a view first, configuring the view with filters etc. in the Views admin however one wants, and then using that configured view for the periodic, regular import?
Comment #7
rjbrown99 commented:
I have created a rather extensive pre-processor, but I did it as a separate module. I'm using it for affiliate marketing, where I get a CSV feed of retail items that I want to create nodes for.
My module takes the CSV feeds served by HTTP/FTP, uses concurrent curl calls to download the feeds, and then processes them.
After downloading and during processing, it breaks out each field of the CSV file and allows a per-file and per-field override. Basically, the overrides are include files that allow pretty much anything you want, from changing case to normalizing keywords. If I'm working with the file Myretailer.csv and it finds an include file called Myretailer.inc, it loads it and lets you do whatever you want to the data. For example, I have an array that replaces the terms they provide in the feed with taxonomy terms in my system. If they give me "Greenish" but my term is "Green", my replacement array works on that specific field and replaces it.
I am also adding new fields to the CSV file based on the parsing results. For example, if I see the keyword "Women" anywhere I add a new CSV field for it. This also matches my taxonomy in Drupal and allows me to get the items roughly right on import and even create new fields that didn't exist on the original feed.
The final step is to move the CSV to a specific place where feeds can find it. I then just use a normal feeds import to pull the data in. This also works around the issue whereby Feeds can only import CSV files directly from the local filesystem.
I have no idea if this is the kind of thing everyone was thinking of in terms of a preprocessor. I wanted it to be separate because it makes debugging easier. I can do the download in one step, parse in another step, and look at the results before they are fed to Feeds. It has taken me a while to get the parsing right with quotation marks and what not.
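A rough illustration of the per-field override idea described above; the field names, terms, and keyword are made up for the example, not taken from the actual module:

```php
<?php
// Illustrative sketch of a per-row override: the field names, terms, and
// keyword below are invented, not taken from the module described above.
function normalize_terms(array $row, array $replacements) {
  // Replace feed-supplied terms with this site's taxonomy terms.
  foreach ($row as $field => $value) {
    if (isset($replacements[$value])) {
      $row[$field] = $replacements[$value];
    }
  }
  // Add a derived field when a keyword appears anywhere in the row,
  // mirroring the "add a new CSV field" step described above.
  if (stripos(implode(' ', $row), 'women') !== FALSE) {
    $row['audience'] = 'Women';
  }
  return $row;
}
```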
Comment #8
infojunkie commented:
Just for reference, here's a patch that worked well for me. I was looking to pre-create nodes that would be referenced by the node processor, so I implemented hook_feeds_after_parse() and called $batch->getItems() to iterate over them. FeedsImportBatch doesn't have a magic get function, so I had to add this method manually.
Comment #9
twistor commented:
My thoughts on this are that, in the configuration, there should be an option between Parsers and Processors to allow data manipulation. These should be chainable, so that you could (for example) add a trim() call, then a find/replace, then make URLs absolute. This would be incredibly useful.
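A sketch of what such a chain might look like, assuming each modifier is an ordinary PHP callable applied in order; the function name, replacement rule, and base URL are illustrative only:

```php
<?php
// Illustrative sketch of chainable modifiers: each one is a callable
// applied, in order, to every value of every parsed item.
function apply_modifiers(array $items, array $modifiers) {
  foreach ($items as &$item) {
    foreach ($item as &$value) {
      foreach ($modifiers as $modifier) {
        $value = $modifier($value);
      }
    }
    unset($value);
  }
  unset($item);
  return $items;
}

// Example chain: trim whitespace, then a find/replace, then make a
// relative URL absolute against a hypothetical base.
$modifiers = array(
  'trim',
  function ($v) { return str_replace('colour', 'color', $v); },
  function ($v) {
    return (strpos($v, '/') === 0) ? 'http://example.com' . $v : $v;
  },
);
```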
Comment #10
infojunkie commented:
Concerning the decision of whether to architect the data modifications as a hook or as a new family of plugins, I would opt for the hook approach. Here's why: there is a potentially infinite number of functions one might want to apply to modify the data, based on one's own business requirements. And most of them, like in the example in #9 above, are just about trimming, replacing, or otherwise modifying strings. This means we would end up with a large number of plugins such as FeedsModTrim, FeedsModReplace, etc., which really do nothing but encapsulate single PHP functions. The demand for new such plugins would never cease. On the other hand, the hook approach gives as much or more flexibility using one simple modification in the core Feeds module. Business application developers are free to implement that hook whichever way they choose.
Comment #11
twistor commented:
I agree with your logic; however, I have issues. Feeds is architected such that it can support an infinite number of plugins; they don't all have to be included in Feeds itself. I'm currently working on Feeds XPath Parser, which I'm primarily using for scraping. The question is whether end users can configure the modifications. If so, then you still end up with the same infinite-plugins problem. Assuming that everyone who uses Feeds can write a hook function seems odd. Basically, I'd like to be able to provide users with a GUI that they can configure. A hook could work, but then the implementations would be specific to the plugin. Allowing for modifiers in the import configuration seems, to me, the ideal solution. Then Feeds could support a base set, and plugins could add things particular to their purpose.
Comment #12
infojunkie commented:
I do agree that Drupal tends toward code-less programming, and I support that wholeheartedly. I just think that, for simplicity's sake, a hook would work well and could open the door to more end-user-oriented solutions.
Comment #13
twistor commented:
I rescind my point. I'm thinking of creating a new module using this hook that allows for the manipulations I described. The patch works quite nicely.
Comment #14
alex_b commented:
Re #2: I actually think a case can be made that any hook_after_parse() implementation needs to be parser-specific. Hence, I would argue that invoking them on a per-parser basis is fine. This is just like how the mapping API is processor-specific. Indeed, in the meantime we have an OO hook in FeedsSimplePieParser (FeedsSimplePieParser::parseExtensions()).
Comment #15
infojunkie commented:
Re #14: I disagree that hook_after_parse is parser-specific. Rather, it is job-specific, because I could be using the CSV parser for two different import jobs, each of which requires different business logic (i.e., massaging input fields after they are parsed). I would not want to subclass the CSV parser every time.
Comment #16
alex_b commented:
#15: I agree that extensions will be job-specific.
My point was rather that it's very unlikely anyone would write an after-parse implementation that can manipulate any parsed data, no matter where it came from. In fact, any manipulation of what's in $batch->items requires intimate knowledge of the parser that produced it. In that light, my argument was that it would be fine to create parser-specific hooks.
Either way, here is a version of #8 that is more in line with hook_feeds_after_import(). Both hook signatures should probably be updated, but for the time being let's keep them at least consistent.
Comment #17
alex_b commented:
BTW, I would be interested in what the computational cost of these hooks is.
Comment #18
infojunkie commented:
Re patch #16: how can one manipulate batch items with this patch, given that this is the main point of the hook?
Re computational cost: you mean actual timing, right? Because in terms of complexity, this hook is only called once per job, so there's no significant hit.
Comment #19
alex_b commented:
Sorry, yes, we would have to expose the batch object for this approach.
Comment #20
infojunkie commented:
How can this issue move forward? I am relying on this new hook in most of my Feeds deployments.
Comment #21
alex_b commented:
#18 - this is actually possible with this patch:
- Added invocation of this hook to preview().
- Added documentation.
@infojunkie: If I get your RTBC, I can commit.
Comment #22
infojunkie commented:
One of the main uses for hook_feeds_after_parse will be to adjust item values. Today, FeedsImportBatch does not allow reading or modifying its items except through setItems() and addItem(). What's needed is one of:
* A reference accessor to the items array, such as getItems() in #8 above, or
* A reference accessor to each item individually, along with an iterator, or
* Making items public
Comment #23
alex_b commented:
#22: agreed - can we make this a separate issue, though?
#912630: Make parsed items accessible for modification
Comment #24
infojunkie commented:
Works for me. Thanks!
Comment #25
alex_b commented:
Committed, thank you.
http://drupal.org/cvs?commit=422412
Comment #26
kvvnn commented:
At this point, if you want access to the items of an import, the following is necessary:
1) Add the FeedsBatch method from #8: in FeedsBatch.inc, around line 182 (in v1.15; see the patch in #8).
2) Implement the patch in #21.
3) In a custom module, use:
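The snippet for step 3 is not shown above, but a hook implementation along these lines is presumably what is meant. This is a hypothetical sketch: the exact hook signature depends on the Feeds version and the patches applied, and StubBatch merely stands in for FeedsImportBatch with the getItems() reference accessor from #8 added:

```php
<?php
// Hypothetical sketch; the real hook signature depends on the Feeds
// version and the patches from #8 and #21. StubBatch stands in for
// FeedsImportBatch with the reference accessor getItems() added.
class StubBatch {
  public $items = array();
  public function &getItems() {
    return $this->items;
  }
}

// Example hook body: trim whitespace from every string value of every
// parsed item, in place, before the processor runs.
function mymodule_feeds_after_parse($importer, $batch) {
  $items = &$batch->getItems();
  foreach ($items as &$item) {
    foreach ($item as &$value) {
      if (is_string($value)) {
        $value = trim($value);
      }
    }
    unset($value);
  }
  unset($item);
}
```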