Some hints for using this. (XPath syntax)

dman - June 26, 2006 - 21:37
Project:Scraper
Version:HEAD
Component:Miscellaneous
Category:task
Priority:minor
Assigned:Unassigned
Status:active
Description

Early days of testing, but I'd make a couple of observations:

It would be nice to have a few example/sample profiles to play with for folk getting their head around the system.

- When creating XPaths, use double quotes, not single ones (as would be more common in XSLT)
eg
body=xpath://*[@id="bodyCopy"]
not
body=xpath://*[@id='bodyCopy']

... PHP5 $_REQUEST appears to be discarding the single-quoted stuff. I haven't seen that behaviour before, but it's annoying. Not the fault of this module AFAIK.

- When your source document is ALREADY valid, namespaced XHTML, eg it has a DOCTYPE and an xmlns in the header, it's harder for XPath to find nodes by name. This is normal XML/XSL behaviour, but a pain to work around.
A dirty solution when
page_title=xpath://head/title
fails, is to use:
page_title=xpath://*[local-name()="head"]/*[local-name()="title"]
... this is because "defaultns:head" is different from "xhtml:head". Looking up local-name() discards the namespace, but makes your paths messy to read.
... the long solution is to cast the xpath into the same namespace as the source doc, and that's fiddly.
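
The namespace mismatch is easy to reproduce outside Scraper. Below is a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree. That module implements only a subset of XPath and has no local-name(), so it illustrates the "same namespace" route rather than the local-name() workaround. The sample document and the prefix "x" are made up for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical, minimal namespaced XHTML document.
doc = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Hello</title></head><body/></html>"
)

# Unqualified names do not match elements in the default namespace.
plain = doc.find("./head/title")

# Binding a prefix to the XHTML namespace puts the path in the same
# namespace as the source document, so the lookup succeeds.
qualified = doc.find("./x:head/x:title", {"x": XHTML_NS})

print(plain)
print(qualified.text)
```

The prefix itself is arbitrary; only the namespace URI it is bound to has to match the source document.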

- Documentation. Maybe I've missed something, but I can't see which button to press to get the results imported into Drupal nodes as advertised. The multi-record algorithms look cool (I do get the idea, but a demo would be nice).

- Have you compared this functionality with import html? and .. Oh good, I see you are active over at import/export API :-B

#1

dman - June 27, 2006 - 00:26

Follow-up to my suggested example -
I got this going to summarize a user's posting history on Drupal.org. User tracking currently has no RSS feed or notification (that I've found), so I always have to visit my personal profile to find my own 'subscribed' threads. There may be other ways to do this, BUT, if I wanted to do something with the info, or stalk myself, a demo config would be like this:

URLs to Scrape:

http://drupal.org/user/33240/track

XPath to define records of data within the HTML:

//*[@id="tracker"]//tr

Instructions to extract field values from each record:

type=xpath:td[1]
link=xpath:td[2]/a/@href
title=xpath:td[2]/a
replies=xpath:td[4]
update=xpath:td[5]

Produces:

My currently tracked Drupal threads:

type, link, title, replies, update
"issue", "/node/70931", "Some hints for using this. (XPath syntax)", "0", "2 hours 27 min ago"
"forum topic", "/node/27661", "Developing a government website in Drupal", "48", "5 hours 31 min ago"
"issue", "/node/56001", "New template/upgrade changes even more broken in IE (Safari as well)", "14", "1 day 12 hours ago"
"forum topic", "/node/31656", "ProjectOpus.com -- Would love some feedback!!!", "85", "2 days 22 hours ago"
"forum topic", "/node/70335", "API docs. Drupal functions that start with an underscore -- what does this mean?", "5", "4 days 1 hour ago"
...
  ...

... No, I don't know what I'm going to do with this info yet, but it shows the XPath engine is working nicely!
I've seen some other online RSS scrapers that will take this sort of result to the next level (by re-outputting this result as a new RSS feed). All nice fun.
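
For anyone who wants to see the record/field split from the config above in isolation, here is a rough sketch in Python. It is not how Scraper itself is implemented; it assumes the standard library's xml.etree.ElementTree, which only supports a subset of XPath, so the trailing /@href is handled by hand in a hypothetical helper called extract(). The sample markup is invented and already tidied, not the real tracker page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already-tidied stand-in for the tracker table.
SAMPLE = """
<html><body><table id="tracker">
<tr><td>issue</td><td><a href="/node/70931">Some hints</a></td><td>dman</td><td>0</td><td>2 hours ago</td></tr>
<tr><td>forum topic</td><td><a href="/node/27661">Gov site</a></td><td>other</td><td>48</td><td>5 hours ago</td></tr>
</table></body></html>
""".strip()

# One path selects the records, the others are relative to each record.
RECORD_PATH = ".//*[@id='tracker']//tr"
FIELDS = {
    "type": "td[1]",
    "link": "td[2]/a/@href",
    "title": "td[2]/a",
    "replies": "td[4]",
    "update": "td[5]",
}


def extract(record, path):
    """Evaluate one field path against one record element."""
    # ElementTree has no attribute axis, so split off a trailing /@name.
    element_path, _, attr = path.partition("/@")
    node = record.find(element_path)
    if node is None:
        return ""
    if attr:
        return node.get(attr, "")
    return "".join(node.itertext()).strip()


def scrape(html):
    """Return one dict per record, keyed by field name."""
    root = ET.fromstring(html)
    return [
        {name: extract(record, path) for name, path in FIELDS.items()}
        for record in root.findall(RECORD_PATH)
    ]


rows = scrape(SAMPLE)
for row in rows:
    print(row)
```

A full XPath engine would evaluate td[2]/a/@href directly; the manual split is only there because of the library chosen for the sketch.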

.dan.

#2

dado - June 27, 2006 - 16:25

Dan,
Thanks for your great comments! Your expertise around XPath exceeds mine, so any comments are welcome & educational. Also nice to see this thing is working elsewhere. Thanks for the reality check. Good luck stalking yourself.

I checked out your Import HTML project some months ago, but it was not 4.7 compliant at the time. Plus it seems a bit over my head. Since you understand both, could you explain how Scraper and Import HTML overlap and how they differ? Any avenue for collaboration/merging the 2?

"I can't see which button to press to get the results imported into Drupal nodes as advertised" - sorry if I advertised this. I did not intend to. Currently, the avenue to importing as nodes is to export a CSV file and import it using e.g. node_import. Note that node_import is in a state of flux, and a new Import/Export API is afoot (Summer of Code project).

Great example. Will add to documentation shortly. Any other such comments are welcome!

#3

dman - June 27, 2006 - 19:22

The guts of your scraping is following the same path I took with import_html (now 4.7 BTW)

- Get html file to be scraped,
- tidy it,
- run custom XPath queries on it to produce data.
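
As a rough illustration of those three steps (not code from either module), here is a Python sketch. Both fetch() and tidy() are hypothetical stand-ins: a real pipeline would do an HTTP request and run HTML Tidy, whereas here fetch() returns canned markup and tidy() only closes bare <br> tags so that an XML parser will accept the result.

```python
import xml.etree.ElementTree as ET


def fetch(url):
    # Stand-in for an HTTP GET: returns canned, slightly sloppy HTML.
    return "<html><body><p>one<br>two</p></body></html>"


def tidy(html):
    # Stand-in for HTML Tidy: only closes bare <br> tags.
    return html.replace("<br>", "<br/>")


def scrape(url, path):
    # Step 1: get the file. Step 2: tidy it. Step 3: query it.
    root = ET.fromstring(tidy(fetch(url)))
    return ["".join(node.itertext()) for node in root.findall(path)]


result = scrape("http://example.com/", ".//p")
print(result)
```

Without the tidy step the parser would reject the unclosed <br>, which is why that step sits in the middle of the list.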

Differences include:

- I use XSL to hold the XPath stuff rather than a new database table to hold 'profiles' (yeah, it's complex)
- My selection of import files and parameters expects hierarchical sites of pages, rather than islands of data within one page
- The idea is to get the result straight into Drupal nodes. node_import CSVs are a tragic way to look at info that probably has a lot of structure already.
- Mine is for a bulk, run-once import of lots of data. Yours seems to be for synching with a remote source. (What's your example use-case scenario?)

If I could add pagination (creating multiple nodes from one source document) to my system, it would do much of what scraper does now. Just via XSL, which (as I documented) is even scarier than regular expressions.

I'll see if the import API comes out being useful, see if there's any convergence on there.

#4

dado - June 27, 2006 - 20:51

Dan,
Great questions. Thanks for exposing my shoddy documentation. As for how Import HTML works - I see, that was my sort of impression from your documentation. There are certainly advantages to your XSL approach (standards-based) but I have never been fond of XSL, and I have difficulty getting it to work.

My initial desire was to use straight up XQuery (including XPath) to mine data out of web pages. But no XQuery library exists for PHP (unless something real recently emerged). Hence my decision to create a little XPath-delivering syntax.

Typical use case:
(1) Desire to make community site about X (jobs, events, music)
(2) Identify half dozen or more sites which contain public data about X
(3) Create a scraper job for each site
(4) Periodically run each scraper job to extract the data then e.g. node_import to pull into nodes
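
Steps (3) and (4) can be sketched roughly as follows, in Python. This is an illustration of the workflow only: the job functions, site names and field names are all hypothetical, each job returns canned records instead of scraping anything, and the CSV is built in memory rather than handed to node_import.

```python
import csv
import io

FIELDS = ["source", "title", "link"]


# Hypothetical scraper jobs, one per source site, returning canned records.
def job_site_a():
    return [{"title": "Drupal meetup", "link": "/events/1"}]


def job_site_b():
    return [{"title": "PHP user group", "link": "/cal/42"}]


JOBS = {"site-a": job_site_a, "site-b": job_site_b}


def run_jobs(jobs):
    """Run every job and tag each record with the site it came from."""
    rows = []
    for source, job in jobs.items():
        for record in job():
            rows.append({"source": source, **record})
    return rows


def to_csv(rows):
    """Serialise the merged records as CSV text with every value quoted."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, quoting=csv.QUOTE_ALL, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(run_jobs(JOBS))
print(csv_text)
```

Running run_jobs() from a scheduler (cron, for instance) would cover the "periodically" part.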

Instant community site! Ha ha.

As for use case of Import HTML: is it designed for wikis or similar, semi-structured, data-rich sites?

#5

dman - June 28, 2006 - 00:11

FWIW, I'll attach an XSL that demonstrates my above example of user-post-scraping into CSV.
Run it over (for example) that URL http://drupal.org/user/33240/track and you'll get the CSV immediately. Like a stand-alone version of scraper (when using the right tools, of course). No PHP or regexps, though.
I like http://www.xmlcooktop.com/ for XSL development, if you are interested in investigating that angle.

My reasoning is that I can develop the scraper profiles in a portable way (a set of XSL templates) rather than spending time in Drupal creating config profiles.
I've also got URL-rewriting going on - which you may need to look at soon once you drag external content out of context.
I still have to figure the best way to get from my scraped data structures into node structures. It's currently following the import/export 'xml' module format, but I want to replace that with the import/export API approach.

I don't know what the story is with XQuery, as it seems you'll have to run any wild content out there through tidy anyway. As you've probably found.

The import_html was designed for legacy sites - made from plain file.html files stored in directories and stuff that need to migrate into a CMS. Tune up the scraper patterns, then run once to slurp all the old pages into Drupal nodes and menus.
More for content than data if you see the difference.

I haven't tried it yet on emulating feeds or synchronisers, but that was always foreseen.

Attachment: scan drupal user tracker.xsl.txt (967 bytes)

#6

dado - June 28, 2006 - 03:02

dman,
i agree tidy is definitely needed for any XML-oriented data extracting mechanism.

your xsl looks quite elegant now that i see it in some context. every xslt i have ever written has been an epic multiday journey. i do have a utility function that converts relative URLs to absolute URLs.
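
For reference, relative-to-absolute conversion of the kind mentioned above looks roughly like this in Python. This is not the utility function from Scraper, just a sketch using the standard library's urllib.parse.urljoin, with the tracker URL and a scraped link from earlier in this thread as inputs.

```python
from urllib.parse import urljoin

# Page the links were scraped from.
BASE_URL = "http://drupal.org/user/33240/track"


def absolutize(links, base=BASE_URL):
    """Resolve each scraped href against the URL of the source page."""
    return [urljoin(base, link) for link in links]


absolute = absolutize(["/node/70931", "track?page=1", "http://example.com/x"])
for url in absolute:
    print(url)
```

Links that are already absolute pass through unchanged, so the function can be applied to every scraped href without checking first.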

i see, Import HTML is designed for pulling legacy static sites into e.g. Drupal. Nice. seems like both of our modules serve a "legacy" purpose (but needed for ~decades to come? i see vanilla HTML being written for the foreseeable future by content-focused folks. it's up to drupallers to make that content more interesting, right?)

i would think we could find some common purpose in designing importation code for our respective mods. Not sure how out of the box this would be after the Import/Export API is finished. I could easily pump out Scraper data as XML similar to your output format.

Perhaps I should merge with Import HTML's methods in other ways. Thanks again for your constructive ideas.
dave

 
 
