By Marko B on
Hello,
I need someone who is good with XPath and Feeds. I would like to import data from another site; scrape it, in fact.
You get a link like http://www.restorani.com.mk/restorani.php?kategorija=1 and you have a list of restaurants. Crawl the list and create a node for each item in it. The data wouldn't come directly from the list but from the pages the list links to, using Feeds Crawler.
Modules:
http://drupal.org/project/feeds_crawler
http://drupal.org/project/feeds
http://drupal.org/project/feeds_xpathparser
or anything else you need.
Send your estimates for setting this up and writing the XPath for it.
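To make the request concrete, here is a minimal sketch of the two-step crawl in Python rather than in Feeds itself. The markup, class names and link format are invented for illustration; the real pages will need their own XPath expressions. It uses the standard library's ElementTree, which only handles well-formed markup and a subset of XPath, so a real HTML page would probably need a more forgiving parser.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, well-formed sample markup standing in for the real pages.
LIST_PAGE = """
<html><body>
  <ul class="restaurants">
    <li><a href="restoran.php?id=1">First restaurant</a></li>
    <li><a href="restoran.php?id=2">Second restaurant</a></li>
  </ul>
</body></html>
"""

DETAIL_PAGE = """
<html><body>
  <h1>First restaurant</h1>
  <div class="address">Example Street 1</div>
  <div class="phone">000 000 000</div>
</body></html>
"""


def extract_links(list_html, base_url):
    """Step 1: collect the detail-page URLs from the list page."""
    root = ET.fromstring(list_html.strip())
    anchors = root.findall(".//ul[@class='restaurants']/li/a")
    # Relative links are resolved against the URL of the list page.
    return [urljoin(base_url, a.get("href")) for a in anchors]


def extract_fields(detail_html):
    """Step 2: pull the fields for one node out of a detail page."""
    root = ET.fromstring(detail_html.strip())
    return {
        "title": root.findtext(".//h1"),
        "address": root.findtext(".//div[@class='address']"),
        "phone": root.findtext(".//div[@class='phone']"),
    }


links = extract_links(
    LIST_PAGE, "http://www.restorani.com.mk/restorani.php?kategorija=1"
)
fields = extract_fields(DETAIL_PAGE)
```

Each dictionary returned by `extract_fields` corresponds to one node to be created; fetching the pages is left out so the sketch runs without network access.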
Comments
Can you clarify the
Can you clarify the requirement, please? The link is to a web page, not an RSS feed. It's not possible to use feed parsing or XPath to scrape HTML pages. Perhaps I've misunderstood what you are trying to achieve?
Feeds Crawler and Feeds XPath Parser
The Feeds module is not just an RSS reader; it is a data extractor. You can add data to your Drupal site from CSV, XML and other sources, even Mailhandler. I once tried to import data from an HTML page with the XPath HTML parser and had some problems with the code and the namespace. In the end I could not get the import to work, and since it was only a test task, I have not tried again. I think the job could be done well with the modules already mentioned:
Feeds Crawler: to read the links to the data
Feeds XPath Parser: to extract the data into nodes from the pages the crawler visits
It may need some custom code, I do not know.
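The namespace problem is not described in detail, so the following is only a guess at one common cause, shown with Python's standard library rather than with Feeds XPath Parser. When a page declares the XHTML namespace, an unprefixed XPath step such as `h1` matches nothing, and the query has to be namespace-qualified. The sample page and the prefix `x` are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML page that declares a default namespace.
XHTML_PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body><h1>Sample restaurant</h1></body>
</html>"""

root = ET.fromstring(XHTML_PAGE)

# An unqualified query finds nothing, because every element
# lives in the XHTML namespace.
plain = root.find(".//h1")

# Mapping a prefix (any name will do) to the namespace URI fixes it.
NS = {"x": "http://www.w3.org/1999/xhtml"}
qualified = root.find(".//x:h1", NS)
```

Here `plain` is `None`, while `qualified` is the heading element.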
Juan
Yes Juan, something like that,
Yes Juan, something like that; it can be done for sure.
Adriadrop Drupal development
Approach
I'd approach this differently: rather than trying to import into Drupal, use Drupal for the UI and presentation layer. The problem is that records come, change and go. I think you want not just an import but a kind of synchronization rather than aggregation. Nothing is worse, I think, than following a guide that's out of date.
To address this and the problem of db performance we tend to just offload it. We abstracted the Drupal node layer so that nodes are no longer tied to an RDBMS. The result of a search or browse (which is also a search) is a pseudo-node and only becomes a "real" node, inserted into the db, when we have a reason, such as comments. In that case we store a "snapshot", and the comments tend to relate to the state of things at that moment and not to how they might have developed later.
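The pseudo-node idea can be sketched roughly as follows. This is an illustration in Python, not the actual abstraction layer described above; `PseudoNode`, `NodeStore` and their methods are hypothetical names, and a dictionary stands in for the db.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PseudoNode:
    """A search result that looks like a node but is not stored yet."""
    external_id: str
    fields: Dict[str, str]
    nid: Optional[int] = None  # assigned only once the node becomes "real"


class NodeStore:
    """In-memory stand-in for the db (hypothetical)."""

    def __init__(self):
        self._next_nid = 1
        self.snapshots: Dict[int, Dict[str, str]] = {}
        self.comments: Dict[int, List[str]] = {}

    def materialize(self, node: PseudoNode) -> int:
        """Insert the node on first use, keeping a snapshot of its fields."""
        if node.nid is None:
            node.nid = self._next_nid
            self._next_nid += 1
            # Copy the fields so later changes do not alter the snapshot.
            self.snapshots[node.nid] = dict(node.fields)
            self.comments[node.nid] = []
        return node.nid

    def add_comment(self, node: PseudoNode, text: str) -> int:
        """A comment is a reason to turn a pseudo-node into a real one."""
        nid = self.materialize(node)
        self.comments[nid].append(text)
        return nid
```

Browsing never touches the store; only `add_comment` (or another reason you define) triggers the insert, and the comments stay attached to the snapshot taken at that moment.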
Well i could use both cases
Well, I could use both cases, Edward. For some cases I would just like to import, or better said scrape; for others I would like to do what you said, which is more like following. For example, following a page with the movies showing in cinemas and displaying it somewhere if RSS is not present or not enough.
Adriadrop Drupal development
"Well i could use both cases
Screen scraping is just the process step of grabbing content from a website that does not provide it by other means (RSS, some other XML format, etc.). The scrapers we use, for example for news, produce RSS, so that from the point of view of the rest of the pipeline it is as if the site delivered RSS. For IBU News we went one step further and let server/gateway processes handle the flow from Web to RSS (actually an enhanced RSS format). This has proven quite advantageous, as our priority was (and is) to import feeds where available (among others RSS, Atom, CAP) and not to maintain screen scrapers (as Web formats and designs change, one must modify the parser rules). Because things are controlled via a db that holds the URL for each "feed", we can switch over transparently once a real feed becomes available and we no longer need to scrape the content. Archiving the snapshots (of what the feeds point or refer to) is something else, and there we don't really need to scrape. To detect that content is unchanged despite significant changes to its surroundings (ads, banners, etc.) we use other techniques and algorithms that don't need to be taught the structure.
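The switch-over can be illustrated with a small sketch, assuming a registry that records for each source its URL and whether it is a real feed or a scraped page. This is not the IBU News pipeline; the names `Source`, `HANDLERS` and `load_items`, the plain RSS layout and the HTML structure are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Dict, List, Optional

Item = Dict[str, Optional[str]]


def items_from_rss(raw: str) -> List[Item]:
    """Read items from a plain RSS document."""
    root = ET.fromstring(raw)
    return [
        {"title": item.findtext("title"), "link": item.findtext("link")}
        for item in root.iter("item")
    ]


def items_from_html(raw: str) -> List[Item]:
    """Scrape a (hypothetical, well-formed) list page into the same shape."""
    root = ET.fromstring(raw)
    return [
        {"title": a.text, "link": a.get("href")}
        for a in root.findall(".//li/a")
    ]


# Both handlers return the same item shape, so the rest of the
# pipeline does not care where the items came from.
HANDLERS = {"rss": items_from_rss, "scrape": items_from_html}


@dataclass
class Source:
    """One registry row: where the content lives and how to read it."""
    url: str
    kind: str  # "rss" or "scrape"


def load_items(source: Source, raw: str) -> List[Item]:
    """Dispatch on the registry entry; raw is the already fetched body."""
    return HANDLERS[source.kind](raw)
```

When a site starts offering a feed, only the registry row changes (new `url`, `kind` set to `"rss"`); everything downstream of `load_items` stays the same.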