This module is currently useful for scraping HTML inside an RSS feed item. It would be wonderful if it were extended to scrape information from HTML pages as well.

File attached in comment #8: parser_simplepie.zip (3.05 KB) by maxya123

Comments

summit’s picture

Subscribing, +1 for this feature!
greetings, Martijn

gemini’s picture

I'm looking into the same feature. I'm trying to use Calais for content classification via RSS feeds, but very often the feed only includes a short snippet rather than the full content. Calais has SemanticProxy, which scrapes all content from the feed URLs and then does its analysis, but it returns everything it can find on the page, which skews its accuracy. It would be awesome if FeedAPI Scraper could read content from the original story URL and apply an XPath expression to return just the needed portion of the original content. This could be offered as one of the mapping options.
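Roughly what such a mapping option might do, as a minimal sketch only: the function name is hypothetical and the XPath expression is assumed to come from the feed configuration; there is no error handling for invalid expressions.

  // Hypothetical sketch: fetch the original story URL and apply a
  // user-supplied XPath expression so only the wanted fragment is kept.
  function scraper_fetch_by_xpath($url, $expression) {
    $html = file_get_contents($url);
    if ($html === FALSE) {
      return '';
    }
    $doc = new DOMDocument();
    // Real-world markup is rarely valid; silence the parser warnings.
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    $result = '';
    foreach ($xpath->query($expression) as $node) {
      $result .= $doc->saveHTML($node);
    }
    return $result;
  }

  // Example: keep only the main content area of the linked page.
  // $content = scraper_fetch_by_xpath($item_url, '//div[@id="content"]');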

maxya123’s picture

+1 I want this feature too!

mchelen’s picture

Since the HTML inside an RSS feed item can already be scraped, a source web page could be wrapped in RSS using an external script.
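For what it's worth, a minimal sketch of such a wrapper script, assuming a single fixed source page; the URL and channel details are placeholders, and there is no caching or error handling.

  <?php
  // Hypothetical wrapper: fetch a source page and republish its HTML as
  // the description of a one-item RSS feed that FeedAPI can consume.
  $source_url = 'http://example.com/some-page';
  $html = file_get_contents($source_url);

  header('Content-Type: application/rss+xml; charset=utf-8');
  echo '<?xml version="1.0" encoding="utf-8"?>' . "\n";
  echo '<rss version="2.0"><channel>' . "\n";
  echo '  <title>Wrapped page</title>' . "\n";
  echo '  <link>' . $source_url . '</link>' . "\n";
  echo '  <description>One-item feed wrapping the source page</description>' . "\n";
  echo '  <item>' . "\n";
  echo '    <title>Wrapped page</title>' . "\n";
  echo '    <link>' . $source_url . '</link>' . "\n";
  // Escape the page HTML so it survives as the item description.
  echo '    <description>' . htmlspecialchars($html) . '</description>' . "\n";
  echo '  </item>' . "\n";
  echo '</channel></rss>';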

maxya123’s picture

Could you provide a simple example?
Thanks

vacilando’s picture

+1

rajbangar’s picture

+1

maxya123’s picture

File attached: parser_simplepie.zip (3.05 KB), status: new

Hi, I've tried to make a small modification to the SimplePie parser to provide the original HTML (item->options->orightml) for the scraper. I know it looks ugly, but it works :). I don't know why, but I can't get the proper page encoding; I could use some help with that.
Anyway, here is what I've added to the SimplePie parser module; the whole file is attached.

  // Initialize cURL; use a crawler user agent so sites that block
  // unknown clients still return the full page.
  $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
  $target_url = $simplepie_item->get_link();

  $ch = curl_init();
  curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
  curl_setopt($ch, CURLOPT_URL, $target_url);
  // Fail on HTTP errors instead of storing an error page.
  curl_setopt($ch, CURLOPT_FAILONERROR, true);
  // Follow redirects and send the Referer header automatically.
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
  curl_setopt($ch, CURLOPT_AUTOREFERER, true);
  // Return the body as a string instead of printing it.
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($ch, CURLOPT_TIMEOUT, 360);
  // Store the original page HTML on the item for the scraper.
  $curr_item->options->orightml = curl_exec($ch);
  curl_close($ch);
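Regarding the encoding problem, one possible direction (untested, and assuming the mbstring extension is available) is to read the charset from the response's Content-Type header and convert the body to UTF-8 before storing it, replacing the last two lines above with something like the following. Note this only catches a charset declared in the HTTP header, not one declared only in a meta tag.

  // Detect the charset reported by the server and normalize to UTF-8.
  $body = curl_exec($ch);
  $content_type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
  if ($content_type && preg_match('/charset=([^;\s"]+)/i', $content_type, $matches)) {
    $charset = strtoupper(trim($matches[1]));
    if ($charset != 'UTF-8') {
      $body = mb_convert_encoding($body, 'UTF-8', $charset);
    }
  }
  $curr_item->options->orightml = $body;
  curl_close($ch);
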
gemini’s picture

I tried the same thing, but extracting HTML on my own server seemed pretty intensive, so I turned to Yahoo Pipes. Basically, Yahoo Pipes puts all of the HTML inside the description node of the RSS feed, and then I use FeedAPI Scraper to extract the necessary content from the description.

derekhu’s picture

Can you describe how to do it using Yahoo Pipes?
I am very interested in a recipe or instructions. Thanks.

toma’s picture

Subscribing, +1 for this feature!

gemini’s picture

@derekhu - Sorry, I didn't see your question right away.

Here are a few pointers on how you can do it:
1. Fetch a feed.
2. Loop through it and get the original article links, which we'll use to extract the full content.
3. Loop through the extracted links and Fetch Pages into a separate field.

That's how you can get the page content. You can then assign it to the description field and get a full-content feed. I added a few more steps using regular expressions to extract just the pure content (without extra markup like sidebars, headers, etc.).
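If the same extraction step is ever done in PHP rather than in Pipes, it might look roughly like this; the wrapper markup below is purely a placeholder and depends entirely on the site being scraped.

  // Hypothetical sketch of the "pure content" step: keep only the
  // markup inside a site-specific wrapper element and drop the rest
  // (header, sidebars, footer).
  function extract_article_body($html) {
    if (preg_match('@<div class="article-body">(.*?)</div>@s', $html, $matches)) {
      return $matches[1];
    }
    // Fall back to the full page if the marker is not found.
    return $html;
  }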