Crawler for the list of links [#1070476]

Hello,
Is this module ready to be used?

My use case is:
- I have a feed importer that creates a node from an external URL.
- I have an external page with several elements, each of which links to a URL I want to import.
- This page has pagination links with "next" / "previous" buttons.

Can I use this module to import a feed from every element found on the page, find the "next" page URL, and keep going like that, page after page?

Comment	File	Size	Author
#4	feeds_crawler-FeedsListCrawler_and_Source_URL-1070476-4.patch	12.69 KB	dmitriy.trt

Comments

Comment #1

tekken commented 3 March 2011 at 01:34

subscribing

Comment #2

twistor commented 11 March 2011 at 22:12

Assigned:	Unassigned	» mitchell

This module handles the latter part. It does pagination, that is, it follows a "next" link. It doesn't, however, grab a set of links from a page.

Comment #3

summit commented 19 June 2011 at 11:46

Subscribing. How can I grab a set of links from a page then, please?
Greetings, Martijn

Comment #4

dmitriy.trt commented 16 March 2012 at 17:20

Title:	How to use this module ?	» Crawler for the list of links
Category:	support	» feature
Status:	Active	» Needs work

Status	File	Size
new	feeds_crawler-FeedsListCrawler_and_Source_URL-1070476-4.patch	12.69 KB

The patch implements a new fetcher class, FeedsListCrawler. It uses a listing page as its starting point and imports the pages linked from it (the XPath used to find each item link is configurable). It can parse multiple listing pages. For now, the "Next" link can be found using XPath only (the "auto" and $index pattern methods are missing; this part needs work). The class was tested on HTML pages only.
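
To make the flow above concrete, here is a rough sketch of the idea in Python. It is not the code from the patch. The function names (fetch, crawl_listing), the in-memory PAGES dictionary, the example.com URLs, and the two XPath expressions are all made up for illustration, and the pages are well-formed XHTML so that the standard library parser can read them.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical in-memory "site": URL -> well-formed XHTML listing page.
PAGES = {
    "http://example.com/list?page=1": """<html><body>
  <a class="item" href="/node/1">One</a>
  <a class="item" href="/node/2">Two</a>
  <a class="next" href="/list?page=2">next</a>
</body></html>""",
    "http://example.com/list?page=2": """<html><body>
  <a class="item" href="/node/3">Three</a>
  <a class="previous" href="/list?page=1">previous</a>
</body></html>""",
}


def fetch(url):
    """Stand-in for an HTTP request (assumption: a real fetcher would download the page)."""
    return PAGES[url]


def crawl_listing(start_url, item_xpath, next_xpath, max_pages=10):
    """Collect item URLs from a listing page and from the pages reached via "next" links."""
    item_urls = []
    seen = set()
    url = start_url
    # Stop when there is no "next" link, when a page repeats, or at max_pages.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        root = ET.fromstring(fetch(url))
        # Configurable XPath for the item links on this listing page.
        for link in root.findall(item_xpath):
            href = link.get("href")
            if href:
                item_urls.append(urljoin(url, href))
        # Configurable XPath for the "next" link; None ends the crawl.
        next_link = root.find(next_xpath)
        next_href = next_link.get("href") if next_link is not None else None
        url = urljoin(url, next_href) if next_href else None
    return item_urls


if __name__ == "__main__":
    for item_url in crawl_listing(
        "http://example.com/list?page=1",
        ".//a[@class='item']",
        ".//a[@class='next']",
    ):
        print(item_url)
```

Each URL returned would then be fetched and handed to the parser as a separate item. The seen set and max_pages are only safety limits added for this sketch, so that a "next" link pointing back to an earlier page cannot loop forever.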

The original FeedsCrawler class is refactored a bit so that it shares code with the new class and lets the new class access some of its methods.

The patch also includes a "Source URL" mapping source implementation, because otherwise it becomes quite hard to get the original URL of each item.
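
As a rough illustration of why a "Source URL" source helps: once items come from many different pages, each item has to carry the URL it was fetched from, or the mapper has nothing to read it from. The sketch below is Python, not the patch code, and FetchedItem, fetch_items, and map_item are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class FetchedItem:
    """One imported page together with the URL it came from (hypothetical structure)."""
    source_url: str
    body: str


def fetch_items(item_urls, fetch):
    """Fetch every item URL and remember where each body was downloaded from."""
    return [FetchedItem(source_url=url, body=fetch(url)) for url in item_urls]


def map_item(item, mapping):
    """Build a target record; mapping is {target field: source name}."""
    sources = {"source_url": item.source_url, "body": item.body}
    return {target: sources[source] for target, source in mapping.items()}
```

With a mapping such as {"url": "source_url", "content": "body"}, every imported record ends up with the address of the page it was built from.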

It looks like there are some problems with the periodic import interval: the job scheduler executes only one job on each cron run. I'm going to solve this problem a bit later, but I can't make any promises about the missing "Next" link extraction methods.