Poor man's screen scraper [#283607]

While working on Memetracker, I've come across a problem I think would be best solved within FeedAPI. Memetracker uses FeedAPI to aggregate feed items and then identifies memes from those items. A key to identifying a meme is following the links within the feed items. If multiple feed items link to the same web page within a short period of time, it indicates that the linked-to page and the pages linking to it are all part of a rising meme.
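To make the heuristic concrete, here is a minimal sketch in Python (not Memetracker's actual code). The function name, the 24-hour window, and the threshold of three linking items are all assumptions chosen for illustration, and each feed item is assumed to arrive as an `(item_id, published, links)` tuple.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def find_rising_links(items, window=timedelta(hours=24), min_sources=3):
    """Return the URLs that at least `min_sources` feed items link to
    within a single time span of length `window`.

    `items` is an iterable of (item_id, published, links) tuples
    (a hypothetical shape; adapt it to however feed items are stored).
    """
    mentions = defaultdict(list)  # target URL -> publication times of linking items
    for _item_id, published, links in items:
        for url in set(links):  # count each item at most once per target
            mentions[url].append(published)

    rising = []
    for url, times in mentions.items():
        times.sort()
        start = 0
        # Slide a window over the sorted timestamps.
        for end in range(len(times)):
            while times[end] - times[start] > window:
                start += 1
            if end - start + 1 >= min_sources:
                rising.append(url)
                break
    return sorted(rising)
```

Both the window and the threshold would need tuning against real feeds. The sketch only shows that the test comes down to counting links per target URL inside a sliding time window.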

The problem is what to do if the linked-to page is not from a site that Memetracker is already following. Because Memetracker doesn't have a copy of the content, it needs to get it somehow.

Ideally, what I'm hoping for is a function that I pass a URL to, and that creates and returns a node from the page at that URL.

But how to do this?

I first thought of using some sort of screen scraping library which could grab the title / content from any site -- but I couldn't find a library which supported this, and writing my own seems daunting (anyone know of a good open-source library that does this sort of screen scraping? Preferably in PHP?).

Next, I thought of finding some sort of web service which does this sort of thing -- but I didn't find one, and relying on a 3rd-party service for an essential part of Memetracker seems risky.

Last I thought of an idea that should work ~80% of the time and wouldn't be too hard to implement. These are the steps the function would take:

1. Fetch the page at the URL.
2. Find the RSS/Atom feed advertised within the page (if there's no feed, return an error).
3. Download the feed and create a node from the feed item whose URL matches the URL passed into the function.
4. Return the newly created node.
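The steps above can be sketched roughly as follows. This is an illustration in Python using only the standard library, not a FeedAPI implementation. All function names are hypothetical, the feed is assumed to be advertised through a standard `<link rel="alternate">` autodiscovery tag, and the sketch returns a plain dictionary where the real function would create a node. The `fetch` parameter is injectable so the logic can be tried without network access.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen
import xml.etree.ElementTree as ET

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}
ATOM = "{http://www.w3.org/2005/Atom}"


class _FeedLinkFinder(HTMLParser):
    """Collects href values from <link rel="alternate"> feed autodiscovery tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = {k: (v or "") for k, v in attrs}
        is_feed = a.get("type", "").lower() in FEED_TYPES
        if "alternate" in a.get("rel", "").lower().split() and is_feed and a.get("href"):
            self.hrefs.append(a["href"])


def find_feed_urls(html, page_url):
    """Step 2: return absolute URLs of the feeds advertised in the page."""
    finder = _FeedLinkFinder()
    finder.feed(html)
    return [urljoin(page_url, href) for href in finder.hrefs]


def find_item(feed_xml, page_url):
    """Step 3: return the feed item whose link equals page_url, or None."""
    root = ET.fromstring(feed_xml)
    for item in root.iter("item"):  # RSS 2.0: <item><link>URL</link>
        if (item.findtext("link") or "").strip() == page_url:
            return {"title": item.findtext("title") or "", "link": page_url,
                    "body": item.findtext("description") or ""}
    for entry in root.iter(ATOM + "entry"):  # Atom: <entry><link href="URL"/>
        if page_url in [link.get("href") for link in entry.findall(ATOM + "link")]:
            body = entry.findtext(ATOM + "content") or entry.findtext(ATOM + "summary") or ""
            return {"title": entry.findtext(ATOM + "title") or "", "link": page_url,
                    "body": body}
    return None


def fetch_text(url):
    """Download a URL as text (encoding handling is deliberately simplified)."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def item_from_url(page_url, fetch=fetch_text):
    """Steps 1-4: fetch the page, find its feed, return the matching item."""
    feed_urls = find_feed_urls(fetch(page_url), page_url)
    if not feed_urls:
        raise LookupError("no RSS/Atom feed advertised on " + page_url)
    for feed_url in feed_urls:
        item = find_item(fetch(feed_url), page_url)
        if item is not None:
            return item
    return None  # a feed exists, but the page is not (or no longer) in it
```

The sketch matches URLs by exact string comparison. Some lookups would probably still fail in practice, for example when the item has already dropped out of the feed or when the feed's link differs slightly from the URL that was passed in.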

I *think* this will work. I'm writing it up here to get feedback (or, hopefully, to hear from someone who knows a better way to meet my needs). I think this would be a useful function for FeedAPI to have, as I can imagine modules other than Memetracker using it.

Any thoughts / comments / feedback? Assuming y'all think it will work, I'll start writing it in ~2 weeks, once I finish up some other development tasks.

Comments

This guy (below) had a pretty good suggestion for simple scraping using SimpleXML. I think I will make a tiny module using this method for some specific use cases.

http://www.nicklewis.org/node/962

Didn't notice the comment above, sorry!