I think this module is going to be very, very useful.

I tested this on a large site with about 15,000 nodes. Cron stalled after running out of memory, even with the php.ini memory_limit raised to 128 MB.

Are there any plans to handle these situations?

Comments

Scott Reynolds’s picture

Yeah, for sites of that size the math is going to have to be done in the database layer. It's a difficult problem. Still hashing it out.

Scott Reynolds’s picture

Was thinking about this some more and was wondering why cron only gets 128 MB. I know that's fine for most PHP apps, but this bad boy does a lot. How much memory does it actually take to get this job done?

EDIT: Really, the module trades memory for query processing time. All the memory it hogs is there for speed.

Scott Reynolds’s picture

Attached file (status: new), 2.76 KB

Let me know if this works for you. It does the calculation via SQL. Some indexes may need revisiting, and there is still some stuff to evaluate here, but this is getting there. It also adds some static caching of the total term counts.

This applies only to the node tag similarity object, but I will port it to the rest.

Scott Reynolds’s picture

Status: Active » Needs review
Scott Reynolds’s picture

Status: Needs review » Needs work

Doh! Moved too fast. This won't work. Revising.

Scott Reynolds’s picture

Attached file (status: new), 2.8 KB
Scott Reynolds’s picture

Status: Needs work » Needs review
Scott Reynolds’s picture

Wanted to write this down before I forget. Need to array_chunk() the saved cache so that it doesn't get too large. Once it hits a set size, write it to the DB.
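The note above (cap the saved cache and write it out once it reaches a size) could look roughly like this. This is only a sketch in Python rather than the module's PHP, and the names `PendingSaves`, `max_size`, and the `flush` callback are made up for illustration, not taken from the module.

```python
from typing import Callable, List, Tuple

# One pending similarity row: (nid1, nid2, score).
Similarity = Tuple[int, int, float]


class PendingSaves:
    """Hypothetical write buffer: collects rows and hands them to a
    flush callback whenever max_size rows have accumulated."""

    def __init__(self, flush: Callable[[List[Similarity]], None],
                 max_size: int = 500) -> None:
        self._flush = flush
        self._max_size = max_size
        self._rows: List[Similarity] = []

    def add(self, nid1: int, nid2: int, score: float) -> None:
        self._rows.append((nid1, nid2, score))
        # Write out as soon as the buffer hits its size limit.
        if len(self._rows) >= self._max_size:
            self.flush()

    def flush(self) -> None:
        # Called on size overflow, and once more at the end of the run
        # to write whatever is left.
        if self._rows:
            self._flush(self._rows)
            self._rows = []


# Usage: collect batches in a list instead of a real database.
written: List[List[Similarity]] = []
buffer = PendingSaves(written.append, max_size=2)
for i in range(5):
    buffer.add(i + 1, i, 0.5)
buffer.flush()
```

With this shape, memory for pending rows is bounded by `max_size` no matter how many similarities a run produces.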

Scott Reynolds’s picture

So I was worried that we have this $pending_saves cache. This cache represents which similarities are going to be written to the database on PHP exit(). At its maximum size the cache will be N by N, where N is the node throttle set on the similarity create page. So say you set it to 200 nodes per calculation run: the pending_saves cache would be at most 200 x 200. Never larger.

The real memory hog was things like the taxonomy similarity, which held all the terms per nid. It was convenient to do it this way, but it takes up too much memory. This patch fixes that one class. Working on backporting similar fixes.

I feel like this approach is much nicer than writing each individual similarity to the database, which would result (in the above case) in 200 queries. This technique turns those 200 queries into one query.
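To make the "many queries into one" idea concrete, here is a rough sketch using Python's built-in sqlite3 instead of Drupal's database layer. The `similarity` table, its columns, and the `save_similarities` helper are assumptions for the example, not the module's actual schema or API.

```python
import sqlite3
from typing import Iterable, List, Tuple

# One pending similarity row: (nid1, nid2, score).
Similarity = Tuple[int, int, float]


def save_similarities(conn: sqlite3.Connection,
                      rows: Iterable[Similarity]) -> int:
    """Write all pending rows with a single multi-row INSERT statement."""
    batch: List[Similarity] = list(rows)
    if not batch:
        return 0
    # One "(?, ?, ?)" group per row. Older SQLite builds cap a statement
    # at 999 bound parameters, so very large batches need splitting.
    placeholders = ", ".join(["(?, ?, ?)"] * len(batch))
    params = [value for row in batch for value in row]
    conn.execute(
        "INSERT INTO similarity (nid1, nid2, score) VALUES " + placeholders,
        params,
    )
    conn.commit()
    return len(batch)


# Usage with an in-memory database and a hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE similarity ("
    "nid1 INTEGER, nid2 INTEGER, score REAL, PRIMARY KEY (nid1, nid2))"
)
saved = save_similarities(conn, [(2, 1, 0.91), (3, 1, 0.40), (3, 2, 0.75)])
```

The trade-off is the one described above: rows sit in memory until the write, in exchange for far fewer round trips to the database.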

Büke Beyond’s picture

I was testing this module very cautiously, suspecting the N^2 count, so I started out with the title and search index computations on a small set of nodes (<20). That is when I discovered the database was not cleaning out the matches below the threshold (the float fix in the other post).

Then I set out to try just the Title computations on the 15,000 news nodes. That is when cron started stalling, requiring manual cleanup of the cron semaphore in the database to resurrect it.

I may try bigger memory limits. The memory constraints, I believe, are mainly a guard against attacks. If Drupal/PHP/Apache offered a way to selectively give the cron request (with proper authentication, e.g. from localhost) more memory, it would be more feasible. Also, a lot of Drupal sites start out on shared hosting with limited settings.

There is also the possibility of running the algorithm directly from the PHP CLI and skipping web access.

Scott Reynolds’s picture

Status: Needs review » Needs work

/me sighs. She's still ballooning somewhere.

Scott Reynolds’s picture

15000+ nodes
Memory used at: devel_init()=1.36 MB, devel_shutdown()=15.94 MB.

4,447,124 similarities. Getting better, but that seems like too many similarities. Each node has a term, so it calculates against that, and min_sim isn't respected yet.

Scott Reynolds’s picture

Attached file (status: new), 2.83 KB

OK, 15,000 nodes, processing 500 at a time:
Memory used at: devel_init()=1.37 MB, devel_shutdown()=15.94 MB.

The page executed on my laptop in 7 minutes, which is slow, but for cron, which usually runs every 15 minutes, that's fine. And my laptop isn't running a server kernel. It's running X, playing Pandora, my IDE is open, etc.

Attached is the patch to the term similarity.

Scott Reynolds’s picture

Adding indexes to the temporary tables made it execute in 0.5 minutes.

Scott Reynolds’s picture

To do the search index, use the following MySQL commands:

CREATE TEMPORARY TABLE local_mag
  SELECT sid AS nid, SQRT(SUM(POW(score, 2))) AS mag
  FROM search_index
  WHERE type = 'node'
  GROUP BY sid;
CREATE TEMPORARY TABLE local_mag_2 SELECT * FROM local_mag;
ALTER TABLE local_mag ADD PRIMARY KEY (nid);
ALTER TABLE local_mag_2 ADD PRIMARY KEY (nid);

Then the big nasty:

SELECT
  IF(s1.sid > s2.sid, s1.sid, s2.sid) AS nid1,
  IF(s1.sid > s2.sid, s2.sid, s1.sid) AS nid2,
  (SUM(s1.score * s2.score) / (m1.mag * m2.mag)) AS sim
FROM search_index s1
JOIN search_index s2 ON s2.word = s1.word AND s2.sid <> s1.sid AND s2.type = 'node'
JOIN local_mag m1 ON m1.nid = s1.sid
JOIN local_mag_2 m2 ON m2.nid = s2.sid
WHERE s1.sid IN (*INDEXING_SUB_QUERY_WITH_LIMIT*) AND s1.type = 'node'
GROUP BY s1.sid, s2.sid
HAVING sim > MIN_SIM;
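For anyone who finds the query hard to read: it computes cosine similarity between nodes, treating each node's search index scores as a vector (the magnitude tables hold the vector lengths, the SUM is the dot product over shared words). Below is a small in-memory Python sketch of the same math. It assumes one row per (word, sid), handles all nodes at once rather than a batch, and the function name and row layout are only illustrative.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# One search index row: (word, sid, score). Assumes one row per (word, sid).
IndexRow = Tuple[str, int, float]


def cosine_similarities(index_rows: Iterable[IndexRow],
                        min_sim: float = 0.0) -> Dict[Tuple[int, int], float]:
    """Return {(nid1, nid2): sim} with nid1 > nid2, keeping sim > min_sim."""
    rows = list(index_rows)

    # Equivalent of local_mag: SQRT(SUM(POW(score, 2))) per node.
    squares: Dict[int, float] = defaultdict(float)
    by_word: Dict[str, List[Tuple[int, float]]] = defaultdict(list)
    for word, sid, score in rows:
        squares[sid] += score * score
        by_word[word].append((sid, score))
    mags = {sid: math.sqrt(total) for sid, total in squares.items()}

    # Equivalent of the self-join on word: accumulate the dot product
    # for every pair of nodes that share a word.
    dots: Dict[Tuple[int, int], float] = defaultdict(float)
    for postings in by_word.values():
        for i, (sid_a, score_a) in enumerate(postings):
            for sid_b, score_b in postings[i + 1:]:
                if sid_a == sid_b:
                    continue
                pair = (max(sid_a, sid_b), min(sid_a, sid_b))
                dots[pair] += score_a * score_b

    result: Dict[Tuple[int, int], float] = {}
    for (nid1, nid2), dot in dots.items():
        denom = mags[nid1] * mags[nid2]
        if denom == 0:
            continue  # all-zero vectors have no defined similarity
        sim = dot / denom
        if sim > min_sim:  # same role as HAVING sim > MIN_SIM
            result[(nid1, nid2)] = sim
    return result
```

The point of doing it in SQL instead is that none of these dictionaries have to live in PHP memory.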

And bingo, you have a gradually updating similarity calculation. It's clean and smooth. This is pretty exciting. Though I'm not sure what I'm going to do with the title one. It's virtually impossible for it to follow a similar pattern. It would need its own separate index, similar to the search index but n-grammed. That will be the last one to be tackled, or perhaps removed...
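As a rough idea of what an n-grammed title index might hold, here is a Python sketch that breaks titles into character n-grams and emits rows shaped like the search index (word, sid, score). Character trigrams and count-as-score are assumptions made for the example; the thread does not say which kind of n-gram or scoring the module would use.

```python
from collections import Counter
from typing import Dict, List, Tuple


def title_ngrams(title: str, n: int = 3) -> List[str]:
    """Split a title into overlapping character n-grams."""
    # Lowercase and collapse whitespace so similar titles line up.
    normalized = " ".join(title.lower().split())
    if len(normalized) < n:
        return [normalized] if normalized else []
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


def title_index_rows(titles: Dict[int, str],
                     n: int = 3) -> List[Tuple[str, int, int]]:
    """Build (ngram, nid, count) rows, one per distinct n-gram per node."""
    rows: List[Tuple[str, int, int]] = []
    for nid, title in sorted(titles.items()):
        counts = Counter(title_ngrams(title, n))
        for gram, count in sorted(counts.items()):
            rows.append((gram, nid, count))
    return rows


# Usage with two hypothetical node titles.
rows = title_index_rows({1: "Drupal news", 2: "Drupal views"})
```

Rows in this shape could, in principle, go through the same join-on-word similarity query as the search index, at the cost of maintaining a second index table.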

Flying Drupalist’s picture

I very much want to use this module, but performance issues scare the bejesus out of me. Subscribe!

Scott Reynolds’s picture

Hehe, I need to do a commit. Got a lot of it handled, I think. It makes use of temp tables and other cool stuff, so it's fast.

Flying Drupalist’s picture

Thanks, then I can't wait. :)

mrfelton’s picture

Ready to commit yet?! Subscribing.

Scott Reynolds’s picture

Attached file (status: new), 21.95 KB

OK, she's not done yet, but here's a patch for most of them.
Committed this patch. Still need to remove the title one, as it won't scale, ever...