This module is currently useful for scraping HTML inside an RSS feed item. It would be wonderful if it were extended to scrape information from HTML pages as well.

File attached in comment #8: parser_simplepie.zip (3.05 KB) by maxya123

Comments

summit’s picture

Subscribing, +1 for this feature!
greetings, Martijn

gemini’s picture

I'm looking into the same feature. I'm trying to use Calais for content classification via RSS feeds, but very often the feed only includes a short snippet rather than the full content. Calais has SemanticProxy, which scrapes all content from the feed URLs and then does its analysis, but it returns everything it can find on the page, which skews its accuracy. It would be awesome if FeedAPI Scraper could read content from the original story URL and apply an XPath expression to return just the needed portion of the original content. This could be offered as one of the mapping options.
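Roughly what such a mapping option might do, as a minimal sketch only: the function name is hypothetical and the XPath expression is assumed to come from the feed configuration; there is no error handling for invalid expressions.

  // Hypothetical sketch: fetch the original story URL and apply a
  // user-supplied XPath expression so only the wanted fragment is kept.
  function scraper_fetch_by_xpath($url, $expression) {
    $html = file_get_contents($url);
    if ($html === FALSE) {
      return '';
    }
    $doc = new DOMDocument();
    // Real-world markup is rarely valid; silence the parser warnings.
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    $result = '';
    foreach ($xpath->query($expression) as $node) {
      $result .= $doc->saveHTML($node);
    }
    return $result;
  }

  // Example: keep only the main content area of the linked page.
  // $content = scraper_fetch_by_xpath($item_url, '//div[@id="content"]');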

maxya123’s picture

+1 I want this feature too!

mchelen’s picture

Since the HTML inside an RSS feed item can already be scraped, a source web page could be wrapped in RSS using an external script.
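For what it's worth, a minimal sketch of such a wrapper script, assuming a single fixed source page; the URL and channel details are placeholders, and there is no caching or error handling.

  <?php
  // Hypothetical wrapper: fetch a source page and republish its HTML as
  // the description of a one-item RSS feed that FeedAPI can consume.
  $source_url = 'http://example.com/some-page';
  $html = file_get_contents($source_url);

  header('Content-Type: application/rss+xml; charset=utf-8');
  echo '<?xml version="1.0" encoding="utf-8"?>' . "\n";
  echo '<rss version="2.0"><channel>' . "\n";
  echo '  <title>Wrapped page</title>' . "\n";
  echo '  <link>' . $source_url . '</link>' . "\n";
  echo '  <description>One-item feed wrapping the source page</description>' . "\n";
  echo '  <item>' . "\n";
  echo '    <title>Wrapped page</title>' . "\n";
  echo '    <link>' . $source_url . '</link>' . "\n";
  // Escape the page HTML so it survives as the item description.
  echo '    <description>' . htmlspecialchars($html) . '</description>' . "\n";
  echo '  </item>' . "\n";
  echo '</channel></rss>';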

maxya123’s picture

Could you provide a simple example?
Thanks

vacilando’s picture

+1

rajbangar’s picture

+1

maxya123’s picture

File attached: parser_simplepie.zip (3.05 KB), status: new

Hi, I've tried to make a small modification to the SimplePie parser to provide the original HTML (item->options->orightml) for the scraper. I know it looks ugly, but it works :). I don't know why, but I can't get the proper page encoding; I could use some help with that.
Anyway, here is what I've added to the SimplePie parser module; the whole file is attached.

  // Initialize cURL; use a crawler user agent so sites that block
  // unknown clients still return the full page.
  $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
  $target_url = $simplepie_item->get_link();

  $ch = curl_init();
  curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
  curl_setopt($ch, CURLOPT_URL, $target_url);
  // Fail on HTTP errors instead of storing an error page.
  curl_setopt($ch, CURLOPT_FAILONERROR, true);
  // Follow redirects and send the Referer header automatically.
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
  curl_setopt($ch, CURLOPT_AUTOREFERER, true);
  // Return the body as a string instead of printing it.
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($ch, CURLOPT_TIMEOUT, 360);
  // Store the original page HTML on the item for the scraper.
  $curr_item->options->orightml = curl_exec($ch);
  curl_close($ch);
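Regarding the encoding problem, one possible direction (untested, and assuming the mbstring extension is available) is to read the charset from the response's Content-Type header and convert the body to UTF-8 before storing it, replacing the last two lines above with something like the following. Note this only catches a charset declared in the HTTP header, not one declared only in a meta tag.

  // Detect the charset reported by the server and normalize to UTF-8.
  $body = curl_exec($ch);
  $content_type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
  if ($content_type && preg_match('/charset=([^;\s"]+)/i', $content_type, $matches)) {
    $charset = strtoupper(trim($matches[1]));
    if ($charset != 'UTF-8') {
      $body = mb_convert_encoding($body, 'UTF-8', $charset);
    }
  }
  $curr_item->options->orightml = $body;
  curl_close($ch);
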
gemini’s picture

I tried the same thing, but extracting HTML on my own server seemed pretty intensive, so I turned to Yahoo Pipes. Basically, Yahoo Pipes puts all of the HTML inside the description node of the RSS feed, and then I use FeedAPI Scraper to extract the necessary content from the description.

derekhu’s picture

Can you describe how to do it using Yahoo Pipes?
I am very interested in a recipe or instructions. Thanks.

toma’s picture

Subscribing, +1 for this feature!

gemini’s picture

@derekhu - Sorry, I didn't see your question right away.

Here are a few pointers on how you can do it:
1. Fetch a feed.
2. Loop through it and get the original article links, which we'll use to extract the full content.
3. Loop through the extracted links and Fetch Pages into a separate field.

That's how you can get the page content. You can then assign it to the description field and get a full-content feed. I added a few more steps using regular expressions to extract just the pure content (without extra markup like sidebars, headers, etc.).
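If the same extraction step is ever done in PHP rather than in Pipes, it might look roughly like this; the wrapper markup below is purely a placeholder and depends entirely on the site being scraped.

  // Hypothetical sketch of the "pure content" step: keep only the
  // markup inside a site-specific wrapper element and drop the rest
  // (header, sidebars, footer).
  function extract_article_body($html) {
    if (preg_match('@<div class="article-body">(.*?)</div>@s', $html, $matches)) {
      return $matches[1];
    }
    // Fall back to the full page if the marker is not found.
    return $html;
  }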