Feeds xPath/Crawler job

By Marko B on 7 Oct 2010 at 01:56 UTC

Hello,

I need someone who is good with XPath and Feeds. I would like to import data from another site, in effect scrape it.
You get a link like http://www.restorani.com.mk/restorani.php?kategorija=1, which shows a list of restaurants. Crawl the list and create a node for each item in it. The data would not come directly from the list but from the pages the list links to, using Feeds Crawler.

modules:
http://drupal.org/project/feeds_crawler
http://drupal.org/project/feeds
http://drupal.org/project/feeds_xpathparser

or something else you need.

Send your estimates for setting this up and writing the XPath for it.

Comments

Can you clarify the

alfaguru commented 9 October 2010 at 11:31

Can you clarify the requirement, please? The link is to a web page, not an RSS feed. It's not possible to use feed parsing nor xpath to scrape HTML pages. Perhaps I've misunderstood what you are trying to achieve?

Feeds Crawler and Feeds XPath Parser

jvizcarrondo commented 10 October 2010 at 17:44

The Feeds module is not just an RSS reader; it is a data extractor. You can add data to your Drupal site from CSV, XML and other sources, even Mailhandler. I once tried to import data from an HTML page with the XPath HTML parser and had some problems with the code and the namespace. In the end I could not get the import to work, and since it was only a test task I have not tried again. I think the job could be done well with the modules already mentioned:

Feeds Crawler: to read the links to the data
Feeds XPath Parser: to extract the data into nodes from the pages the crawler visits.
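
To make the two steps concrete, here is a minimal sketch in Python rather than Drupal code. It is not how Feeds Crawler or Feeds XPath Parser are implemented. The page contents, file names, class names and the helper names (`fetch`, `crawl_links`, `parse_item`) are all made up for illustration, and the pages are assumed to be well-formed XHTML so that the standard library's limited XPath support can read them. It also shows the namespace pitfall: XHTML declares a default namespace, so un-prefixed XPath steps match nothing.

```python
import xml.etree.ElementTree as ET

# XHTML pages declare this default namespace; XPath steps must use a prefix bound to it.
NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical stand-ins for downloaded pages, keyed by URL.
PAGES = {
    "list.html": """<html xmlns="http://www.w3.org/1999/xhtml"><body><ul>
        <li><a class="item" href="r1.html">First</a></li>
        <li><a class="item" href="r2.html">Second</a></li>
    </ul></body></html>""",
    "r1.html": """<html xmlns="http://www.w3.org/1999/xhtml"><body>
        <h1>Restaurant One</h1><p class="address">Main Street 1</p>
    </body></html>""",
    "r2.html": """<html xmlns="http://www.w3.org/1999/xhtml"><body>
        <h1>Restaurant Two</h1><p class="address">Side Street 2</p>
    </body></html>""",
}


def fetch(url):
    """Return the parsed page for a URL (here read from PAGES, not the network)."""
    return ET.fromstring(PAGES[url])


def crawl_links(list_url):
    """Step 1 (crawler): collect the detail-page links found on the list page."""
    root = fetch(list_url)
    return [a.get("href") for a in root.findall(".//x:a[@class='item']", NS)]


def parse_item(url):
    """Step 2 (XPath parser): map XPath results on a detail page to record fields."""
    root = fetch(url)
    return {
        "url": url,
        "title": root.findtext(".//x:h1", namespaces=NS),
        "address": root.findtext(".//x:p[@class='address']", namespaces=NS),
    }


# One record per link on the list page; each record would become a node.
records = [parse_item(url) for url in crawl_links("list.html")]

# The same query without the namespace prefix finds no links at all.
unprefixed = fetch("list.html").findall(".//a")
```

Real pages are rarely well-formed XML, so an actual import would need an HTML-tolerant parser in front of the XPath step.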

Perhaps it needs some code, I do not know.
Juan

Yes Juan, something like that,

Marko B commented 12 October 2010 at 18:18

Yes Juan, something like that; it can be done for sure.

Adriadrop Drupal development

Approach

Edward C. Zimmermann commented 13 October 2010 at 12:22

I'd approach this differently: rather than importing into Drupal, use Drupal only for the UI and presentation layer. The problem is that records come, change and go. I think what you want is not just import but a kind of synchronization rather than aggregation. There is nothing worse, I think, than following a guide that's out of date.
To address this, and the problem of db performance, we tend to just offload it. We abstracted the Drupal node layer so that nodes no longer have to live in an RDBMS. The result of a search or browse (which is also a search) is a pseudo-node, and it only becomes a "real" node, inserted into the db, when we have a reason, such as comments (and then we store a "snapshot", and the comments tend to relate to the state of things at that moment and not how they might have developed).
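
A rough sketch of the pseudo-node idea, in Python rather than Drupal code. This is not our actual implementation; the names `PseudoNode`, `NodeStore` and `search`, and the sample records, are invented for illustration, and plain dicts stand in for both the external index and the db.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PseudoNode:
    """A search result that looks like a node but is not stored in the db."""
    source_id: str
    title: str
    body: str


class NodeStore:
    """Stand-in for the node table: holds only nodes that had a reason to exist."""

    def __init__(self):
        self.nodes = {}      # nid -> snapshot of the record at materialization time
        self.comments = {}   # nid -> list of comment texts
        self.by_source = {}  # source_id -> nid, so a record is materialized only once
        self._next_nid = 1

    def materialize(self, pseudo):
        """Insert a snapshot of the pseudo-node if it is not stored yet; return its nid."""
        nid = self.by_source.get(pseudo.source_id)
        if nid is None:
            nid = self._next_nid
            self._next_nid += 1
            self.nodes[nid] = asdict(pseudo)
            self.comments[nid] = []
            self.by_source[pseudo.source_id] = nid
        return nid

    def add_comment(self, pseudo, text):
        """A comment is the reason to turn a pseudo-node into a real node."""
        nid = self.materialize(pseudo)
        self.comments[nid].append(text)
        return nid


def search(index, term):
    """Search the external index and return pseudo-nodes; nothing is written to the db."""
    return [
        PseudoNode(source_id, record["title"], record["body"])
        for source_id, record in index.items()
        if term.lower() in record["title"].lower()
    ]


# Hypothetical external index that lives outside the db and changes over time.
index = {
    "a": {"title": "Pizza Roma", "body": "Open daily"},
    "b": {"title": "Grill House", "body": "Closed on Mondays"},
}
store = NodeStore()

hits = search(index, "pizza")
stored_after_search = len(store.nodes)   # searching alone stores nothing

nid = store.add_comment(hits[0], "Great place")
index["a"]["body"] = "Closed for renovation"   # the source record changes later
snapshot_body = store.nodes[nid]["body"]       # the snapshot keeps the earlier state
```

The comment stays attached to the snapshot it was written against, while a fresh search keeps returning the current state of the record.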

Well i could use both cases

Marko B commented 16 October 2010 at 00:50

Well, I could use both approaches, Edward. In some cases I would just like to import, or better said scrape; in others I would like to do what you said, which is more like following a source. For example, following a page that lists movies showing in cinemas and displaying that somewhere when RSS is not available or not sufficient.

Adriadrop Drupal development

"Well i could use both cases

Edward C. Zimmermann commented 18 October 2010 at 07:08

"Well i could use both cases Edward, for some cases i would just like to import, or better say scrap,....For example, following a page with movies in cinema and displaying it somewhere if rss is not present or not enough."

Screen scraping is just the process step of grabbing content from a Web site that does not provide it by other means (RSS, some other XML format, etc.). The scrapers we use, for example for news, produce RSS so that, from the point of view of the rest of the pipeline, it is as if the site delivered RSS. For IBU News we went one step further and let server/gateway processes handle the flow from Web to RSS (actually an enhanced RSS format). This has proven quite advantageous, as our priority was (and is) to import feeds as available (among others RSS, Atom, CAP) and not to maintain screen scrapers (as Web formats and designs change, one must modify the parser rules). Because things are controlled via a db that holds the URL for each "feed", we can transparently switch over once a real feed becomes available and we no longer need to scrape. Archiving the snapshots (of what the feeds point or refer to) is something else, and there we don't really need to scrape. To detect that content is unchanged despite significant changes to its surroundings (ads, banners, etc.) we use other techniques and algorithms that don't need to be taught the structure.
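
As an illustration of the switch-over, here is a small Python sketch. It is not the IBU News code; the registry layout, the `kind` values, the function names and the example.org documents are assumptions made up for this example, and a dict stands in for both the db and the network.

```python
import xml.etree.ElementTree as ET


def items_from_rss(text):
    """Read items from a real RSS 2.0 document."""
    root = ET.fromstring(text)
    return [
        {"title": item.findtext("title"), "link": item.findtext("link")}
        for item in root.findall("./channel/item")
    ]


def items_from_html(text):
    """Scraper: produce the same item shape from a (well-formed) HTML page."""
    root = ET.fromstring(text)
    return [
        {"title": a.text, "link": a.get("href")}
        for a in root.findall(".//a[@class='story']")
    ]


# One reader per source kind; the rest of the pipeline only ever sees items.
READERS = {"rss": items_from_rss, "scrape": items_from_html}


def load(source, fetch):
    """Fetch the source's URL and read it with the reader registered for its kind."""
    return READERS[source["kind"]](fetch(source["url"]))


# Hypothetical documents standing in for what the network would return.
DOCS = {
    "http://example.org/news.html":
        '<html><body><a class="story" href="/n/1">Flood warning</a></body></html>',
    "http://example.org/news.rss":
        '<rss version="2.0"><channel><title>News</title>'
        "<item><title>Flood warning</title><link>/n/1</link></item>"
        "</channel></rss>",
}

# The "db": one row per feed, holding its URL and how to read it.
registry = {"example-news": {"kind": "scrape", "url": "http://example.org/news.html"}}
before = load(registry["example-news"], DOCS.get)

# Once the site offers a feed, only the registry row changes.
registry["example-news"] = {"kind": "rss", "url": "http://example.org/news.rss"}
after = load(registry["example-news"], DOCS.get)
```

Nothing downstream of `load` has to know whether the items came from a scraper or from a feed.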
