where should I begin with the debugging:

the xml-output from sphinxsearch_xmlpipe.php
or my sphinx installation?

[ROOT@psrLAMP01:/opt/sphinx] bin/indexer --all
Sphinx 0.9.8.1-release (r1533)
Copyright (c) 2001-2008, Andrew Aksyonoff

using config file '/opt/sphinx/etc/sphinx.conf'...
indexing index 'platforms_idx-main'...
ERROR: index 'platforms_idx-main': source 'platforms_src-main': XML parse error: no element found (line=20320, pos=0, docid=978).
total 0 docs, 0 bytes
total 57.780 sec, 0.00 bytes/sec, 0.00 docs/sec
indexing index 'platforms_idx-delta'...
ERROR: index 'platforms_idx-delta': xmlpipe: expected '<document>', got 'XMLPipe for delta index failed: Could not obtain list of main i'.
total 0 docs, 0 bytes
total 0.010 sec, 0.00 bytes/sec, 0.00 docs/sec
distributed index 'platforms_idx-join' can not be directly indexed; skipping.

hmmm... i use sphinx 0.9.8.1. is that a problem?

saluti
roberto

Comments

markus_petrux’s picture

hmmm... i use sphinx 0.9.8.1. is that a problem?

AFAICT, no. 0.9.8.1 is just a bugfix release.

I had a similar report, that it may give you some hints on how to see what could be the cause:

http://drupal.org/node/320044

I would start here with the same "recipe" to try to shed some light:

1) Try invoking the XMLPipe command from the shell, redirecting the output to a file. Then open the file to see what that line looks like. Maybe there's a PHP error, the file is interrupted, or something...

2) In your report above, docid=978 is the node ID. You could create a temporary main index with an XMLPipe command where the first/last node ID is that one. If that works, then the problem is probably somewhere else.

3) Try looking at admin/logs/watchdog, filtering for sphinxsearch reports. Maybe there's something there that was missed before.
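Step 1 above can be sketched in a few lines of PHP: capture the generator output to a file, then run it through libxml to see exactly where the stream breaks, which is the same class of error the indexer printed ("no element found" at the last line). This is a minimal sketch, not part of the module; the truncated stream is simulated inline so the snippet is self-contained.

```php
<?php
// Hypothetical check on captured XMLPipe output. In practice you would first
// save the real generator output, e.g.:
//   php sphinxsearch_xmlpipe.php > /tmp/xmlpipe.xml
// and load that file instead of the inline string below.
$xml = '<?xml version="1.0"?><documents><document id="978"><title>test</title>';
libxml_use_internal_errors(TRUE);
if (simplexml_load_string($xml) === FALSE) {
  foreach (libxml_get_errors() as $error) {
    // A stream cut off mid-document (e.g. by PHP dying on memory_limit)
    // yields a "Premature end of data" error at the end of the file.
    printf("line %d: %s\n", $error->line, trim($error->message));
  }
}
```

If the last document in the file is incomplete, the byte right after it is where the PHP process died, which usually points at the node that blew the memory limit.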

Finally, please bear in mind that the Drupal 6 version of this module is not finished, and it might change significantly (battle plan). I'm not just porting the module from D5, but also adding a few more features, and many things may change. Due to lack of time, I will not be offering upgrade paths from one 6.x-1.x-dev snapshot to another.

roberto.ch’s picture

The problem is the memory.

A PHP memory_limit of 512M is enough for 978 nodes;
if I give it 1024M, it is enough for my 1468 nodes...

Memory usage is very high. I have to split the main
index. But what happens when I have 20000 nodes? That
would mean very many main indexes...

I set the parameter "Nodes per chunk" in
admin/settings/sphinxsearch to 500, but without
success.

saluti
roberto

markus_petrux’s picture

512M for the XMLPipe generator looks like a lot of memory to me.

The .htaccess file we're using here for that looks like the following, for ~14,000 nodes and ~52,000 comments:

php_value memory_limit 64M
php_value max_execution_time 3600
php_value mysql.connect_timeout 3600

Maybe you're touching the mem_limit parameter of the indexer section in your sphinx.conf file? We're using 256M, which seems to be enough for us, and we had problems when we set this value too high.

You can check the resources used by the PHP process in the watchdog log, and that's what you can adjust from your .htaccess file located in the sphinxsearch_scripts subdirectory. Not to be confused with mem_limit in sphinx.conf.
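For reference, the two settings live in different files and are easy to mix up. A sketch of the relevant fragments, using the values mentioned above (they are examples, not recommendations):

```
# sphinx.conf -- mem_limit here controls the Sphinx indexer process itself
indexer
{
    mem_limit = 256M
}
```

```
# sphinxsearch_scripts/.htaccess -- memory_limit here controls the PHP
# process that generates the XMLPipe stream
php_value memory_limit 64M
```

Raising one of them does nothing for the other, so check which process is actually running out of memory before turning either knob.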

Does that help?

nimistar’s picture

Title: XML parse error: no element found » XMLPipe for delta index failed: Could not obtain list of main indexes from Sphinx.

I want to generate the `delta` index from the source 'http://x.x.xx.x/sphinxsearch_scripts/sphinxsearch_xmlpipe.php?mode=delta', but this source needs the `searchd` daemon to be running, and I can't generate the index... If I stop the daemon, the source can't be used. Am I doing something wrong?

markus_petrux’s picture

Title: XMLPipe for delta index failed: Could not obtain list of main indexes from Sphinx. » XML parse error: no element found

@nimistar: Yours seems to be a different problem than the one reported here. Please open a separate issue.

@roberto.ch: Were you able to deal with the memory issue?

roberto.ch’s picture

No, I have not been able to solve this problem so far.
As a workaround, I split the main index into pieces of 500 nodes...

saluti
roberto

Rickdrummer’s picture

Almost one year since last comment... and almost one year since last d6 sphinx search release. Very sad :(

Just like roberto.ch said, the main problem with indexing via xmlpipe is the memory limit, and it depends on the number of modules installed in Drupal. I'm using CCK with many, many extra fields, so the indexer hits the memory ceiling at ~3700 nodes. If I disable the CCK module, with the same config, xmlpipe returns 60,000 nodes - better, but not the way it should be, because I can't turn off the CCK module.

I'll try to make a pure SQL query (just like it's done in the IPB sphinx conf) instead of xmlpipe requests. Maybe it will work in Drupal too.

sped2773’s picture

This definitely seems to be a memory problem somewhere (not necessarily with this module). When I run the indexer script to call the XML pipe, I can sit back and watch the Apache process eat away at memory until it just gives up. Our nodes have significant (large) body content, but we have been unable to get the main index to build since we passed 1000 nodes, so now the delta index is starting to get fat because the main index always fails. I have tried everything, including changing the memory limit in the .htaccess for sphinxsearch_scripts (php_value memory_limit 768M - I can't quite believe I have to set this to try to index 1500 nodes!).

I have been through the code and cannot find anything that would indicate things shouldn't work correctly. I have even gone paranoid, unsetting variables all over the place to ensure nothing gets a chance to hold onto memory, but none of this has worked. Debugging in NuSphere, it is easy to see which loop the memory is increasing on, i.e.

// Process nodes for this loop.
foreach ($nids as $nid) {
  if ($nid > $last_nid) {
    break;
  }
  $nodes_counter++;
  $xmlpipe_document = sphinxsearch_xmlpipe_document($main_index_id, $nid);
  if ($xmlpipe_document) {
    print $xmlpipe_document;
    unset($xmlpipe_document);
  }
}

but this isn't doing anything special (as you can see, I am specifically unsetting the $xmlpipe_document variable - paranoid or what!). Digging deeper into the sphinxsearch_xmlpipe_document method:

// Collect Sphinx document data.
$document = array_merge(sphinxsearch_invoke_api('sphinx_document', $node), array(
  'main_index_id' => $main_index_id,
));

// Build Sphinx document stream.
$output = '<document>'."\n";

foreach ($document as $name => $value) {
  if (isset($GLOBALS['sphinxsearch_xmlpipe_schema'][$name])) {
    if ($GLOBALS['sphinxsearch_xmlpipe_schema'][$name]['sphinx_type'] == 'text') {
      $value = '<![CDATA['. $value .']]>';
    }
    $output .= '<'. $name .'>'. $value .'</'. $name .'>'."\n";
  }
}
$output .= '</document>'."\n";

unset($document);
unset($node);
unset($name);
unset($value);

Nothing unusual here, but for some reason the memory just increases on each iteration that calls this method and never gets released. I have also tried using statics for $node, $document and $output, just so I know that multiple copies of the variables are not being kept.

I am probably going to have to break this down into separate indexes, but that is a short-term solution. The longer-term solution (I thought) was merging indexes, i.e. never build a main index; just run a delta every 5 minutes and merge it every 6 hours or so. Unfortunately I can't get the indexer to work with the --merge flag: it just doesn't merge, yet reports there were no issues.

Hope this post helps other people using Sphinx, or those considering it. Sphinx itself is a great product, but the XML pipe approach seems to have its limitations. I'm not saying it's this module; it could be further down in Drupal, but I am a bit out of ideas on how to address this. I did read a post about node_load and the reset flag, so it didn't cache the node, and thought "great, that's it", but when I looked, that had already been implemented in the latest release. Like Rickdrummer, I feel this module has been somewhat abandoned, which is a shame.

If anyone has any other ideas on how to tackle this problem, please let me know!

zeezhao’s picture

Not sure if this helps, but I can share my experience with the Drupal 5 version.

I have been able to index over 2 million nodes by having the index built in batches of 100,000. So in sphinx.conf, I have an index entry for each batch using the first_nid and last_nid variables, e.g.

....
sphinxsearch_xmlpipe.php?mode=main&id=0&first_nid=0&last_nid=100000&"
....
sphinxsearch_xmlpipe.php?mode=main&id=1&first_nid=100001&last_nid=200000&"

... and so on..
sphinxsearch_xmlpipe.php?mode=main&id=2&first_nid=200001&last_nid=300000&"

.....
sphinxsearch_xmlpipe.php?mode=main&id=23&first_nid=2300001&last_nid=2400000&"
....
sphinxsearch_scripts/sphinxsearch_xmlpipe.php?mode=delta"

[this is just showing relevant parts and not full sphinx.conf format]
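To make zeezhao's fragments concrete, here is a sketch of what one such partitioned source/index pair might look like in sphinx.conf. The names, paths and the wget fetch command are hypothetical; the real file also needs whatever other source/index settings the module's generated config includes:

```
source main0
{
    type            = xmlpipe
    xmlpipe_command = wget -q -O - "http://example.com/sphinxsearch_scripts/sphinxsearch_xmlpipe.php?mode=main&id=0&first_nid=0&last_nid=100000&"
}

index index_main0
{
    source = main0
    path   = /opt/sphinx/var/data/index_main0
}
```

Each batch gets its own source/index pair with the next first_nid/last_nid window, so every indexer run only has to hold 100,000 nodes' worth of PHP state at a time.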

This way, 512MB in PHP was enough to build each index by running:
indexer index_main0

indexer index_main1
....
indexer index_delta

I also use CCK... But again, this is Drupal 5 and sphinxsearch with Sphinx 0.9.8.

Also, your MySQL my.cnf may need to be amended to allow for more memory, etc.

markus_petrux’s picture

One thing that I have observed consuming a lot of memory when processing nodes in batches is the token module. If you have any module that uses tokens to render nodes, then that may be the source of the problem. The current version of the token module for D6 generates the values for all tokens even when just one is needed, and I think it caches them in static variables.

We're also partitioning our indexes, as in the example provided by zeezhao above.

I expect to get back to this module in a week or two. Though there's little we can do, I'm afraid, with regard to memory usage when processing nodes. It depends on the modules that are involved. :(

Taras_’s picture

I think the right way to easily fight this kind of problem is described here: http://groups.drupal.org/node/9795#comment-106974

Even if you solve all known problems, you can get a new one after installing a "not so good" new Drupal module.

dhthwy’s picture

I was only able to process ~3k nodes before croaking with this XML error due to PHP hitting a wall on its memory limit.

Investigating further, I found that the node object wasn't being freed, causing a very ugly memory leak. Somehow there must be a leftover reference to the node somewhere, either in the Drupal code or in some other module's code (I can't find where the problem might be in the Sphinx module). Telling node_load to reset the cache doesn't appear to help.

For PHP5 I was able to do this to put a plug in the memory leak:

In file sphinxsearch.xmlpipe.inc, function: sphinxsearch_xmlpipe_document

Just before return $output

--

// Loop thru the node object and destroy its members

foreach ($node as $k => $v) {
  unset($node->$k);
}

unset($node, $k, $v);

--

After doing this I had no problem processing 11k nodes. If I remember correctly, it didn't eat up any more than 150 MB of RAM.
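The effect of that loop can be demonstrated in isolation. A minimal sketch, where the node object and its circular reference are fabricated stand-ins for a loaded Drupal node with CCK data (PHP before 5.3 had no cycle-collecting garbage collector, so such cycles could keep the whole object alive):

```php
<?php
// Sketch of the plug described above: explicitly unset every member of the
// node object so internal circular references no longer pin the structure.
function xmlpipe_free_node($node) {
  foreach ($node as $k => $v) {
    unset($node->$k);
  }
}

// Hypothetical stand-in for a loaded node:
$node = new stdClass();
$node->nid = 978;
$node->title = 'Example';
$node->field_ref = $node;  // circular reference, as CCK structures can create

xmlpipe_free_node($node);
// Objects are passed by handle in PHP 5+, so the caller's object is emptied:
print count(get_object_vars($node));  // prints 0
```

With the members gone, nothing inside the object can reference it anymore, so the refcount-based collector can reclaim it even without cycle detection.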

dhthwy’s picture

I thought there was another memory leak somewhere but it turned out to be some caching done in the Devel module.

With Devel off (and with the code in the previous comment), the Sphinx Search XML script doesn't really leak any memory at all. For 11k nodes, PHP's memory_get_usage reports 13 MB. Since we're processing one node at a time, memory shouldn't really be an issue unless there's a leak.

The only 3rd party modules I've got turned on aside from Sphinx Search are Views, CCK, and Chaos Tool Suite.

Also, the memory leak described in the previous comment appears to kick in only when the CCK module is turned on. I'm using the number and text CCK types, and it seems to happen with either the Number or Text CCK module; it probably affects other types as well.

Markus, this is a very interesting module for a fantastic search engine, I hope you haven't given up on it.

markus_petrux’s picture

Status: Active » Closed (won't fix)

Cleaning up the issues queue. I'll soon post a method to invoke the XMLPipe generator from the command line. Please, follow #327816: Ability to execute XMLPipe generator from PHP CLI