Active
Project:
Feed Scraper
Version:
6.x-1.x-dev
Component:
Code
Priority:
Normal
Category:
Feature request
Assigned:
Unassigned
Reporter:
Created:
18 May 2009 at 23:23 UTC
Updated:
28 Jan 2010 at 18:39 UTC
Comments
Comment #1
summit commented
Subscribing, +1 for this feature!
Greetings, Martijn
Comment #2
gemini commented
I'm looking into the same feature. I'm trying to use Calais for content classification via RSS feeds, but very often only a short snippet is included (not the full content)... Calais has SemanticProxy, which scrapes all content from the feed URLs and then does its analysis, but it returns everything it can find on the page, which skews its accuracy. It would be awesome if FeedAPI Scraper could read content from the original story URL and apply an XPath expression to return just the needed portion of the original content. This could be done as one of the mapping options.
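The mapping option suggested above could work roughly like this: fetch the original story URL and keep only the subtree matched by a user-supplied XPath expression. The module itself is PHP, so this is just a language-agnostic sketch in Python; the XPath expression and the sample markup are assumptions about a typical target page.

```python
import xml.etree.ElementTree as ET

def extract_portion(html, xpath):
    """Return the serialized subtree matched by `xpath`, or '' if none."""
    root = ET.fromstring(html)
    node = root.find(xpath)
    return ET.tostring(node, encoding="unicode") if node is not None else ""

# Example: keep only the article body, not the sidebar.
page = ("<html><body><div id='sidebar'>ads</div>"
        "<div id='story'><p>Full article text.</p></div></body></html>")
print(extract_portion(page, ".//div[@id='story']"))
```

A real implementation would need an HTML-tolerant parser, since most pages are not well-formed XML.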
Comment #3
maxya123 commented
+1 I want this feature too!
Comment #4
mchelen commented
Since the HTML inside an RSS feed item can be scraped, a source webpage could be wrapped in RSS using an external script.
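One way to do what mchelen describes is a small external script that wraps a page's HTML in a single-item RSS feed, so the scraper can treat it like any other feed item. This is a minimal sketch; the URL and title are placeholders.

```python
from xml.sax.saxutils import escape

def wrap_in_rss(page_url, page_html, title="Wrapped page"):
    """Build a one-item RSS 2.0 feed whose description is the raw HTML."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<rss version="2.0"><channel>\n'
        f'<title>{escape(title)}</title>\n'
        f'<link>{escape(page_url)}</link>\n'
        '<item>\n'
        f'<title>{escape(title)}</title>\n'
        f'<link>{escape(page_url)}</link>\n'
        # The page HTML is escaped so it survives as the item description.
        f'<description>{escape(page_html)}</description>\n'
        '</item>\n'
        '</channel></rss>\n'
    )

feed = wrap_in_rss("http://example.com/story", "<p>Full article text.</p>")
print(feed)
```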
Comment #5
maxya123 commented
Could you provide a simple example?
Thanks
Comment #6
vacilando commented
+1
Comment #7
rajbangar commented
+1
Comment #8
maxya123 commented
Hi, I've tried to make a small modification to the SimplePie parser to provide the original HTML (item->options->orightml) for the scraper. I know it looks ugly, but it works :). For some reason I can't get the proper page encoding, though; I could use some help with that.
Anyway, here is what I've added to the SimplePie parser module; the whole file is attached.
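The encoding problem mentioned above usually comes from decoding the fetched bytes with the wrong charset. A common fix is to look for a charset in the HTTP Content-Type header first, then in the page's meta tag, and fall back to UTF-8. This is a generic Python sketch of that heuristic, not the attached SimplePie patch.

```python
import re

def detect_charset(content_type_header, raw_bytes):
    """Guess the page charset: HTTP header, then meta tag, then UTF-8."""
    m = re.search(r"charset=([\w-]+)", content_type_header or "")
    if m:
        return m.group(1)
    # Only scan the start of the document, where the meta tag lives.
    head = raw_bytes[:2048].decode("ascii", errors="ignore")
    m = re.search(r'charset=["\']?([\w-]+)', head, re.IGNORECASE)
    return m.group(1) if m else "utf-8"

raw = b'<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'
print(detect_charset("", raw))  # falls back to the meta tag
```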
Comment #9
gemini commented
I tried the same thing, but extracting HTML on my own server seemed pretty resource-intensive, so I turned to Yahoo Pipes. Basically, Yahoo Pipes puts all the HTML inside a description node of the RSS feed, and then I use FeedAPI Scraper to extract the necessary content from the description.
Comment #10
derekhu commented
Can you describe how to do it using Yahoo Pipes?
I am very interested in a recipe or instructions. Thanks.
Comment #11
toma commented
Subscribing, +1 for this feature!
Comment #12
gemini commented
@derekhu - Sorry, I didn't see your question right away.
Here are a few pointers on how you can do it:
1. Fetch a feed.
2. Loop through it and get the original article links, which we'll use to extract the full content.
3. Loop through the extracted links and Fetch Pages into a separate field.
That's how you get the page content. You can then assign it to a description field and get a full feed. I added a few more steps using regular expressions to extract only the pure content (without extra code like sidebars, headers, etc.).
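The Pipes steps above map onto a simple pipeline: parse the feed, collect the item links, fetch each page, and strip it down with a regular expression. In this Python sketch the `fetch` function is injected so it runs without a network, and the content regex is an assumption about the target page's markup.

```python
import re
import xml.etree.ElementTree as ET

def full_content_pipeline(feed_xml, fetch, content_re):
    """Return {link: extracted_content} for every item in the feed."""
    root = ET.fromstring(feed_xml)
    results = {}
    for item in root.iter("item"):
        link = item.findtext("link")
        page = fetch(link)                     # step 3: fetch the page
        m = re.search(content_re, page, re.S)  # strip sidebars, headers, etc.
        results[link] = m.group(1) if m else page
    return results

feed = ("<rss><channel><item>"
        "<link>http://example.com/a</link>"
        "</item></channel></rss>")
# Stand-in for the network: a dict mapping URL -> page HTML.
pages = {"http://example.com/a":
         "<div id='nav'>menu</div><div id='story'>Article body</div>"}
out = full_content_pipeline(feed, pages.get, r"<div id='story'>(.*?)</div>")
print(out)
```

Assigning each extracted value back to the item's description field then yields the "full feed" described above.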