I'm wondering if there's some sort of memory-usage inefficiency in the way the index statistics page is generated. I've been slowly building a search index of around 33,000 records (will be closer to 100,000 when I'm done importing all the data), and as of this morning I can no longer access the index statistics page without getting a fatal out-of-memory error.

Some data: our search index files (the contents of sites/default/files/luceneapi_node) take up 15M of disk space, and our PHP memory_limit is currently 112M. The out-of-memory error always occurs in one of the Zend_Search_Lucene files, though it's never the same one.

I know there's bound to be some extra memory overhead involved in how PHP represents the parsed data internally, but it doesn't seem to make sense that the in-memory data would take up over 7 times as much space as the same data in the filesystem, right?
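For what it's worth, here's roughly how I've been measuring the overhead; a minimal sketch that assumes the index lives at the path above and that the Zend Framework is on the include path:

<?php
// Measure how much memory Zend_Search_Lucene needs just to open the
// index and read basic stats (before running any actual searches).
require_once 'Zend/Search/Lucene.php';

$before = memory_get_usage();
$index  = Zend_Search_Lucene::open('sites/default/files/luceneapi_node');

echo 'Documents (excluding deletions): ' . $index->numDocs() . "\n";
echo 'Documents (including deletions): ' . $index->count() . "\n";
echo 'Memory used by open(): ' . round((memory_get_usage() - $before) / 1048576, 1) . "M\n";
echo 'Peak memory so far: ' . round(memory_get_peak_usage() / 1048576, 1) . "M\n";
?>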

I'm certainly open to the possibility that I'm doing something wrong here; has anyone else experienced these kinds of problems with similar index sizes?

Thanks!
Adam

Comments

cpliakas’s picture

Hi jazzslider.

Thanks for posting. Search Lucene API is geared towards smaller indexes of fewer than 10,000 nodes. For anything larger, you really should explore a distributed solution such as Apache Solr Search Integration or Acquia Search. Search Lucene API is intentionally a fully integrated solution. The advantage is that it is very easy to install and maintain; the disadvantage is that resources are consumed on page load, which inhibits scalability. The goal is to provide a viable solution for sites that require something more than the core search but where Apache Solr would add too much complexity or overhead for the scope of the project.

Going forward, Search Lucene API 3.0 will introduce "adapters" that will allow the module to interoperate with other Lucene implementations, such as Java Lucene, CLucene, and possibly database storage engines such as the one for Oracle. See the issue at #558564: Add support for Java Lucene. Early testing has shown that the Java Lucene implementation on top of Zend Server will scale to hundreds of thousands of documents (possibly millions), which is exciting.

Unfortunately, the 2.0 version is not designed to accommodate indexes of this size. I would rather have you use the best search solution for your index size than try to force Search Lucene API to fit.

Hope this clears things up,
Chris

cpliakas’s picture

Status: Active » Closed (won't fix)

Respectfully marking this as "won't fix" since it is outside the scope of what 2.0 is trying to accomplish.

Dries Arnolds’s picture

I've got a site with 8,800 nodes, and the index statistics page uses 140MB of PHP memory. Isn't that a bit excessive? I understand a fix isn't necessary for sites that are outside this project's scope (10,000+ nodes), but it seems to use a lot of memory even under that limit.

To clarify, those 8,800 nodes contain a lot of text; most are articles of 3,000 words or more.

cpliakas’s picture

Hi Pixelstyle.

You are correct, 140MB is a lot (although other contributed modules undoubtedly contribute to that number), but since Search Lucene API is an integrated solution, it cannot offload the memory or processing to an external process. In creating this project, I was well aware that ease of installation came at the cost of scalability. I am fine with that trade-off because the Apache Solr Search Integration project already has the market cornered on scalability. However, the Java Lucene adapter for Search Lucene API 3.0 will be able to handle hundreds of thousands of documents without a problem.

Furthermore, the index size is directly related to how much text is in your content. Large pieces of content like the ones you describe will lower the ceiling on the number of nodes the module can handle. If you do want to stick with Search Lucene API, you will have to configure the search result limit to something like 5000 to cap how much memory is used.
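If you'd rather set that in code than through the settings form, it would look something like the snippet below (double-check the variable name against the module's settings form; I'm writing this from memory, so treat it as a placeholder):

<?php
// Cap how many results a search will load per query. The variable name
// 'luceneapi_results_limit' is a placeholder; confirm the real key
// against the module's settings form before relying on it.
variable_set('luceneapi_results_limit', 5000);
?>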

In order to get the most out of the module, you have to limit the amount of content that is indexed. To help achieve this, I am planning on releasing a "Search Lucene Index Minify" project that will add options to reduce the size of the index and push Search Lucene API to its maximum number of nodes.
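In the meantime, you can approximate this yourself by trimming long field values before documents are written to the index. The alter hook below is only a sketch (the exact hook name and field name are placeholders and may differ from what the module actually invokes), but the Zend_Search_Lucene document calls are real:

<?php
/**
 * Sketch of an alter hook that truncates very long body text before a
 * document is added to the index. The hook name and the 'contents'
 * field name are placeholders; check the module's API for the real ones.
 */
function mymodule_luceneapi_document_alter(&$document, $node) {
  if (!in_array('contents', $document->getFieldNames())) {
    return;
  }
  $contents = $document->getFieldValue('contents');
  if (drupal_strlen($contents) > 10000) {
    // Re-adding a field with the same name replaces the previous value.
    $document->addField(
      Zend_Search_Lucene_Field::unStored('contents', drupal_substr($contents, 0, 10000))
    );
  }
}
?>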

Sorry for your troubles,
Chris

Dries Arnolds’s picture

Hi Chris,

My message wasn't meant as criticism; quite the opposite. I love the faceted search so much that I was hoping there was a way to use it with larger sites.

That last option sounds cool. Would the solution you propose enable me to use faceted search for a limited number of node types and leave out others (like forum topics, etc.)?

cpliakas’s picture

Hi Pixelstyle.

I like criticism :-). Hearing the good, bad, and ugly about Search Lucene API helps it become more useful and efficient, and I think it is very important to improve scalability in the future. Tough love leads to a better module.

In terms of faceted searches, you would have to exclude content of that type on the Search Lucene Content "General settings" tab. This would have the obvious side effect of making content of that type unavailable in regular searches, though. As a side note, I am hoping that the Apache Solr project and Search Lucene API will one day share a lot of common code, so you can use similar code, interfaces, and hooks regardless of which backend you are using. I think that would be very helpful in your situation, where the Zend Framework backend might not be able to handle the amount of content you are indexing.

Thanks again for the posts,
Chris

Dries Arnolds’s picture

Thanks, Chris.

I think I'm going to take a stab at installing Apache Solr. I'll keep using this module for the smaller websites though.

cpliakas’s picture

I think that makes sense for your site; if you were coming to me to build the application, it is what I would recommend. Apache Solr takes a little while to get running, but it is a great project and works well once you find its sweet spot. The biggest thing that I think throws people off is that you set it up according to the instructions and run cron, but nothing is indexed because there is a delay of a few minutes before the content becomes searchable. The application will tell you how long the delay is, so make sure you wait that amount of time before you start debugging.

Hope that helps, and thanks for the use cases.
~Chris