Source plugin for crawling web site [#1343638]

Hello there,

I have used the Migrate module successfully to operate on a set of HTML files stored on the file system. I was wondering whether anyone here knows if this could also be easily extended to screen scraping a website. Is there a particular source class I should be extending?

I was thinking it might make sense to use QueryPath or some other library to facilitate this, or should I just use some basic cURL wizardry? If anyone has advice on how to integrate this feature into the Migrate module, or any pointers in general, it would be greatly appreciated!
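To make the "library versus raw fetching" question concrete, here is a rough sketch of the link-extraction half of a scraper. It is written in Python with only the standard library, not QueryPath or PHP, and the names `LinkCollector` and `extract_links` are made up for illustration. It assumes the page HTML has already been fetched by some other means.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL and drop any #fragment.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def extract_links(html, base_url):
    """Return the links found in an HTML string, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Whatever library ends up doing the parsing, the output a crawler needs from this step is the same: a list of absolute URLs that can be queued for later processing.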

Comments

Comment #1

mikeryan

he/him

English

Pittsfield, MA, USA

commented 16 November 2011 at 19:31

No one's done a source plugin along those lines, as far as I know. It would definitely look different from the existing ones, where the list of items to process is statically determined - in this case, as you went through the site you would be adding new pages to a to-do list and then picking them up later. You wouldn't easily be able to get a count, so this is another use case for #1341776: Option to skip counting.
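As a rough illustration of that to-do list idea, here is a minimal sketch in Python rather than as an actual Migrate source class. The names `crawl`, `fetch_links`, and `SITE` are hypothetical, and the in-memory dictionary stands in for real HTTP requests.

```python
from collections import deque


def crawl(start_url, fetch_links):
    """Yield each reachable URL once, discovering new ones along the way."""
    todo = deque([start_url])   # pages found but not yet processed
    seen = {start_url}          # everything ever queued, to avoid repeats
    while todo:
        url = todo.popleft()
        yield url
        # fetch_links is an assumed callable: URL in, list of linked URLs out.
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                todo.append(link)


# Hypothetical in-memory "site" standing in for real HTTP fetches.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/"],
    "/c": [],
}

pages = list(crawl("/", lambda url: SITE.get(url, [])))
```

Because the generator only learns about a page when some earlier page links to it, the total number of items is unknown until the queue runs dry, which is why an up-front count is hard to provide.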

Comment #2

mikeryan

commented 16 November 2011 at 19:33

Title: Web Crawler » Source plugin for crawling web site

Comment #3

sylus commented 26 September 2013 at 01:04

Status: Active » Closed (works as designed)
