I use the QueryPath module 1.x dev and the Feeds_QueryPath_parser 1.x dev module. I am new to the QueryPath library and cannot properly set up CSS selectors to extract content from an HTML page.
I am trying to parse http://blogs.yandex.ru/search.xml?text=%22%D0%B5%D0%B4%D0%B8%D0%BD%D0%B0... (comments from livejournal.com)

Using Firebug in Firefox, I identified the HTML structure of the elements I want to extract:
root repeated pattern (i.e. the context in Feeds, under Settings for QueryPath parser): div[class='b-item Ppb-c-ItemMore SearchStatistics-item'],
comment title: h3[class='title'],
comment author: ul[class='info b-hlist b-hlist-middot'] li a
journal author: ul[class='info b-hlist b-hlist-middot'] li a (with offset 1)

This setup works well with the SimpleHTMLDOM Parser module I used recently (http://drupal.org/project/simplehtmldom_parser). I understand that QueryPath is a different library and needs a different syntax (I've read http://www.ibm.com/developerworks/opensource/library/os-php-querypath/in...).

Following the manual, I tried several variants, such as:
div."b-item Ppb-c-ItemMore SearchStatistics-item" for context
.title for title
ul.'info b-hlist b-hlist-middot'>li:first for comment author
ul.'info b-hlist b-hlist-middot'>li:second for journal author
(Of course I set up the mappings, etc.)
I also tried variants without quotes, without the leading dots, without the leading div, etc. Nothing was imported.

What am I missing? (See the attached images below.)

Comments

mbutcher’s picture

If the HTML has '<div class="foo bar baz">', that means that there are three classes on that element, so it matches div.foo, div.bar, and div.baz.

Ideally, what you want is a short and concise selector, like:

div.b-item

This should work in recent versions of QueryPath:

div[class="b-item Ppb-c-ItemMore SearchStatistics-item"]

But in older versions (QueryPath 2.0 and earlier), there was a bug that prevented quoted strings from matching.

You would probably be most successful setting your root repeated pattern to:

div.Ppb-c-SearchStatistics>div.SearchStatistics-item

And then doing 'h3.title' and so on. If I were you, I would avoid any pattern that used TAG[class=CLASSES], as it is slow and error prone. Use TAG.ONE_CLASS instead. (Also, div."list of classes" is not valid CSS3, so QueryPath does not support it. div.list.of.classes might work, though.)

ul.info>li:first
ul.info>li:eq(2)

and so on.
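
To illustrate why matching a single class is more robust than matching the whole class attribute, here is a minimal sketch that uses PHP's built-in DOM extension rather than QueryPath. The sample markup is an assumption modeled on the structure described in this thread (it is not the real page), and class_token() is a hypothetical helper that builds the XPath equivalent of a CSS class selector such as div.b-item.

```php
<?php
// Hypothetical helper: XPath test for "element has this one class token",
// which is what the CSS selector TAG.ONE_CLASS means.
function class_token($name) {
  return "contains(concat(' ', normalize-space(@class), ' '), ' $name ')";
}

// Assumed sample markup, modeled on the structure described in this thread.
$html = '<div class="b-item Ppb-c-ItemMore SearchStatistics-item">'
      . '<h3 class="title">First comment</h3>'
      . '<ul class="info b-hlist b-hlist-middot">'
      . '<li><a href="#">alice</a></li><li><a href="#">bob</a></li>'
      . '</ul></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$results = array();
// Context: every div carrying the class "b-item", whatever other classes it has.
foreach ($xpath->query('//div[' . class_token('b-item') . ']') as $item) {
  // Queries below are relative to the current context node.
  $title = $xpath->query('.//h3[' . class_token('title') . ']', $item)->item(0);
  $links = $xpath->query('.//ul[' . class_token('info') . ']/li/a', $item);
  $results[] = array(
    'title'          => $title->textContent,
    'comment_author' => $links->item(0)->textContent, // first link
    'journal_author' => $links->item(1)->textContent, // second link (offset 1)
  );
}
print_r($results);
```

The same idea carries over to the Feeds settings: one class per selector for the context, then short relative selectors for each field.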

joomlerrostov’s picture

Thanks for the reply. I've tried several of the options you suggested. The best result so far is that nodes are imported but without any content (when I use the repeated pattern div.b-item).
When I change it to div.b-item>div.Ppb-c-ItemMore>div.SearchStatistics-item or div.b-item.Ppb-c-ItemMore.SearchStatistics-item, I see this error message during import:

DOMDocument::loadHTML() [domdocument.loadhtml]: Tag wbr invalid in Entity, line: 52 (Z:\home\mnews\www\sites\all\modules\querypath\QueryPath\QueryPath.php: 1872)

(and sometimes on line: 28)

I don't know the reason.

mbutcher’s picture

The error you are posting is from libxml's DOM parser, and indicates that it encountered an illegal element in the source HTML.

The <wbr> tag is not valid HTML. Does the document you are parsing have that tag?
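
As a rough illustration of where such a message comes from, the sketch below loads a small fragment containing wbr with PHP's built-in DOMDocument and collects libxml's errors instead of letting them surface as PHP warnings. It does not use QueryPath or Feeds, the markup is an assumed sample, and whether libxml reports wbr at all depends on the libxml2 version installed.

```php
<?php
// Assumed sample fragment containing the tag libxml complained about.
$html = '<div class="b-item"><h3 class="title">very<wbr>long<wbr>word</h3></div>';

// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$loaded = $doc->loadHTML($html);

// Gather whatever the parser reported (possibly nothing on newer libxml2).
$messages = array();
foreach (libxml_get_errors() as $error) {
  $messages[] = trim($error->message) . ' (line ' . $error->line . ')';
}
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The document is still usable even when errors were reported.
$title = $doc->getElementsByTagName('h3')->item(0)->textContent;
print $title . "\n";
print_r($messages);
```

The point of the sketch is that these are recoverable parser complaints: the DOM is still built, so the content can be read as long as the warnings are not treated as fatal.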

joomlerrostov’s picture

Yes. I've inspected the HTML and found a wbr tag. How can I make the parser ignore it?
I tried setting ignore_parser_warnings to TRUE in sites\all\modules\querypath\QueryPath\QueryPath.php (line 85), but it made no difference.

mbutcher’s picture

If we're still working on the same site, the easy way to work around this is to parse the document as XML.

This works for me:

$url = 'http://blogs.yandex.ru/search.xml?text=%22%D0%B5%D0%B4%D0%B8%D0%BD%D0%B0%D1%8F%20%D1%80%D0%BE%D1%81%D1%81%D0%B8%D1%8F%22&ft=comments&server=livejournal.com&holdres=mark&full=1';
print qp($url, 'div.b-item')->textImplode("\n");

Since the document declares itself to be XHTML, it can be parsed with plain old qp(), instead of htmlqp(). This avoids checking the HTML DTD, so any valid XML will parse properly.

EDIT: I did the above just using the QueryPath 2.1 library.
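
The difference between the two parsing modes can be sketched with PHP's built-in DOMDocument, without QueryPath. In the assumed fragment below, wbr is written as a self-closing element so the markup is well-formed XML; an XML parser only checks well-formedness, so an element name it has never heard of is not an error.

```php
<?php
// Assumed well-formed sample; a real XHTML page would also carry a
// doctype and namespace, omitted here to keep the sketch short.
$xhtml = '<html><body>'
       . '<div class="b-item"><h3 class="title">very<wbr/>long</h3></div>'
       . '</body></html>';

libxml_use_internal_errors(true);

$doc = new DOMDocument();
// loadXML() applies XML rules only; no HTML tag list is consulted.
$ok = $doc->loadXML($xhtml);
$errors = libxml_get_errors();
libxml_clear_errors();

$text = $doc->getElementsByTagName('h3')->item(0)->textContent;
print $text . "\n";
print count($errors) . " parser errors\n";
```

The trade-off is that XML parsing is strict: a page that is not well-formed fails to load at all, whereas the HTML parser tries to recover.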

joomlerrostov’s picture

I tested the code you provided: I put it in a file in sites\all\modules\querypath\QueryPath\ and added require_once "QueryPath.php"; before it, and it works.
But what I need is to create Drupal nodes from the imported content, and for this I use the Feeds QueryPath Parser module. Maybe (I don't know) the problem is in that module. I posted this issue to http://drupal.org/node/1269032 but there is no answer so far.
So I don't know why QueryPath parses the URL successfully from a standalone file, but produces an error when I put the same URL in the feed settings and import items.