Hello there,

I have used the Migrate module successfully to operate on a set of HTML files stored on the file system. I was wondering whether anyone here knows if this could easily be extended to screen scraping a website. Is there a particular source class I should be extending?

I was thinking it might make sense to use QueryPath or some other library to facilitate this, or should I just be using some basic cURL wizardry? If anyone has advice on how to integrate this feature into the Migrate module, or any pointers in general, it would be greatly appreciated!

Comments

mikeryan

No one's done a source plugin along those lines, as far as I know. It would definitely look different from the existing ones, where the list of items to process is statically determined - in this case, as you went through the site you would be adding new pages to a to-do list and then picking them up later. You wouldn't easily be able to get a count, so this is another use case for #1341776: Option to skip counting.
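For illustration, the to-do-list approach described above can be sketched outside of Migrate entirely. This is only a minimal, language-agnostic sketch written in Python with the standard library, not a Migrate source plugin and not anything the module provides. The names `crawl`, `LinkExtractor`, `fetch`, and `max_pages` are hypothetical. `fetch` is assumed to be any callable that takes a URL and returns the page's HTML as a string, or `None` on failure. Note that the total number of pages is not known until the queue runs dry, which is why counting up front is awkward.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Yield (url, html) pairs, adding newly discovered pages to a to-do list.

    `fetch` is a hypothetical callable: URL in, HTML string (or None) out.
    Only links on the same host as `start_url` are followed.
    """
    host = urlparse(start_url).netloc
    todo = deque([start_url])   # pages discovered but not yet processed
    seen = {start_url}          # every URL ever queued, to avoid loops
    processed = 0

    while todo and processed < max_pages:
        url = todo.popleft()
        html = fetch(url)
        if html is None:
            continue            # skip pages that could not be fetched
        processed += 1
        yield url, html

        # Scan the page for links and queue the ones not seen before.
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                todo.append(link)
```

In a real crawl, `fetch` might wrap an HTTP client; passing it in as a parameter keeps the queue logic separate from the network code and makes it easy to try out against a small in-memory dictionary of pages.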

mikeryan

Title: Web Crawler » Source plugin for crawling web site
sylus

Status: Active » Closed (works as designed)