Resource Limits [#406628]

Comment	File	Size	Author
#20	first_pass.patch	21.95 KB	Scott Reynolds
#13	node_term.patch	2.83 KB	Scott Reynolds
#6	tax_scale.patch	2.8 KB	Scott Reynolds
#3	tax_scale.patch	2.76 KB	Scott Reynolds

Comment #1

Scott Reynolds commented 18 March 2009 at 23:29

Yeah, for sites of any real size, all the math is going to have to happen in the database layer. It's a difficult problem. Still hashing it out.

Comment #2

Scott Reynolds commented 19 March 2009 at 15:50

Was thinking about this some more and wondering why cron only gets 128 MB. I know that's fine for most PHP apps, but this bad boy does a lot. How much memory does it take to get this job done?

EDIT: really the module trades memory for query processing time. All the memory hogging it does is for speed.

Comment #3

Scott Reynolds commented 19 March 2009 at 17:08

Status	File	Size
new	tax_scale.patch	2.76 KB

Let me know if this works for ya. It does the calculation via SQL. Might need to revisit some indexes. Still some stuff to evaluate here, but this is getting there. There's some static caching of the total term counts.

This applies only to the node tag similarity object, but I will port it to the rest.

Comment #4

Scott Reynolds commented 19 March 2009 at 17:08

Status: Active » Needs review

Comment #5

Scott Reynolds commented 19 March 2009 at 17:35

Status: Needs review » Needs work

Doh! Moved too fast. This won't work. Revising.

Comment #6

Scott Reynolds commented 19 March 2009 at 17:55

Status	File	Size
new	tax_scale.patch	2.8 KB

Comment #7

Scott Reynolds commented 19 March 2009 at 17:55

Status: Needs work » Needs review

Comment #8

Scott Reynolds commented 19 March 2009 at 19:05

Wanted to write this down before I forget. Need to array_chunk() the saved cache so that it doesn't get too large. Once it hits a certain size, write it to the DB.

Comment #9

Scott Reynolds commented 19 March 2009 at 23:28

So I was worried about this $pending_saves cache. It holds the similarities that are going to be written to the database when PHP exits. At max size it will be N by N, where N is the node throttle set on the similarity create page. So if you set it to 200 nodes per calculation run, the pending_saves cache would be at most 200 x 200. Never larger.

The real memory hogs were things like the taxonomy similarity, which held all the terms per nid. It was convenient to do it this way, but it takes up too much memory. This patch fixes that one class. Working on backporting similar fixes.

I feel like this approach is much nicer than writing each individual similarity to the database, which would result (in the above case) in 200 queries. This technique turns those 200 queries into one query.
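
A rough sketch of the buffering idea from comments #8 and #9, written in Python against SQLite rather than the module's PHP and MySQL, so treat it as an illustration and not the patch itself. The class name `PendingSaves`, the table `similarity_demo`, and the `max_rows` threshold are all made up for the example. Rows pile up in memory, and once the buffer hits the threshold they go out in a single multi-row INSERT.

```python
import sqlite3


class PendingSaves:
    """Buffer similarity rows and write each full batch with one INSERT.

    Hypothetical helper: the name, the table and the threshold are
    assumptions for illustration, not taken from the module.
    """

    def __init__(self, conn, max_rows=200):
        self.conn = conn
        # Flush threshold. Older SQLite builds cap bound parameters at 999,
        # so keep max_rows * 3 below that when trying this out.
        self.max_rows = max_rows
        self.rows = []
        self.flushes = 0  # number of INSERT statements actually issued

    def add(self, nid1, nid2, score):
        self.rows.append((nid1, nid2, score))
        if len(self.rows) >= self.max_rows:
            self.flush()

    def flush(self):
        if not self.rows:
            return
        # One multi-row INSERT instead of one statement per similarity.
        placeholders = ", ".join(["(?, ?, ?)"] * len(self.rows))
        params = [value for row in self.rows for value in row]
        self.conn.execute(
            "INSERT INTO similarity_demo (nid1, nid2, score) VALUES "
            + placeholders,
            params,
        )
        self.conn.commit()
        self.flushes += 1
        self.rows = []
```

The caller would still need one last flush() at shutdown to write whatever is left in the buffer, which is roughly the role the write on PHP exit plays above.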

Comment #10

Büke Beyond commented 20 March 2009 at 00:07

I was very cautiously testing this module, suspecting the N^2 count, so I started out testing the title and search index computations on a small set of nodes (<20). That is when I discovered the database was not cleaning out the lower threshold matches (the float fix in the other post).

Then I set out to try just the Title computations on the 15000 news nodes. That is when cron started stalling, requiring manual cleanup of the cron semaphore in the database to resurrect it.

I may try bigger memory limits. The memory constraints, I believe, are mainly there to guard against attacks. If Drupal/PHP/Apache offered a way to selectively give the cron request (with proper authentication, e.g. from localhost) more memory, it would be more feasible. Also, a lot of Drupal sites start out on shared hosting with limited settings.

There is also the possibility of running the algorithm internally from PHP CLI and skipping the web access.

Comment #11

Scott Reynolds commented 20 March 2009 at 00:32

Status: Needs review » Needs work

/me sighs. She's still ballooning somewhere.

Comment #12

Scott Reynolds commented 20 March 2009 at 23:54

15000+ nodes
Memory used at: devel_init()=1.36 MB, devel_shutdown()=15.94 MB.

4447124 similarities. Getting better. Seems like too many similarities. Each node has a term, so it calculates against that. And the min_sim isn't respected yet.

Comment #13

Scott Reynolds commented 23 March 2009 at 19:03

Status	File	Size
new	node_term.patch	2.83 KB

OK, 15000 nodes, processing 500 at a time.
Memory used at: devel_init()=1.37 MB, devel_shutdown()=15.94 MB.

The page executed on my laptop in 7 minutes, which is slow, but for cron, usually run every 15 minutes, that's fine. And my laptop isn't running a server kernel. It's running X, playing Pandora, my IDE is open, etc.

Attached is the patch to the term similarity.

Comment #14

Scott Reynolds commented 23 March 2009 at 19:18

Adding indexes to the temporary tables made it execute in 0.5 minutes.
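
To make the effect of an index on a temporary table a bit more concrete, here is a small sketch in Python with SQLite (not MySQL, and not the module's code). The table layout mirrors a per-node magnitude table, and the index name `idx_local_mag_nid` is invented for the example. It asks the query planner how it would look up one nid before and after the index exists.

```python
import sqlite3


def lookup_plan(conn):
    """Return SQLite's query plan for a single-nid lookup as one string."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT mag FROM local_mag WHERE nid = ?", (1,)
    ).fetchall()
    # The human-readable plan text is the last column of each row.
    return " | ".join(row[-1] for row in rows)


def demo():
    conn = sqlite3.connect(":memory:")
    # Hypothetical stand-in for a temporary per-node magnitude table.
    conn.execute("CREATE TEMPORARY TABLE local_mag (nid INTEGER, mag REAL)")
    conn.executemany(
        "INSERT INTO local_mag VALUES (?, ?)",
        [(n, float(n)) for n in range(1, 101)],
    )
    before = lookup_plan(conn)  # no index yet: the planner scans the table
    conn.execute("CREATE INDEX idx_local_mag_nid ON local_mag (nid)")
    after = lookup_plan(conn)  # the planner can now search via the index
    return before, after
```

Calling demo() returns the two plan strings. The second one should mention the index by name. MySQL's planner and its EXPLAIN output differ in detail, so this only shows the general mechanism.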

Comment #15

Scott Reynolds commented 24 March 2009 at 07:13

To do the search index similarity, run the following MySQL commands:

CREATE TEMPORARY TABLE local_mag SELECT sid AS nid, SQRT(SUM(POW(score, 2))) AS mag FROM search_index WHERE type = 'node' GROUP BY sid;
CREATE TEMPORARY TABLE local_mag_2 SELECT * FROM local_mag;
ALTER TABLE local_mag ADD PRIMARY KEY (nid);
ALTER TABLE local_mag_2 ADD PRIMARY KEY (nid);

Then the big nasty:

SELECT IF(s1.sid > s2.sid, s1.sid, s2.sid) AS nid1,
       IF(s1.sid > s2.sid, s2.sid, s1.sid) AS nid2,
       (SUM(s1.score * s2.score) / (m1.mag * m2.mag)) AS sim
FROM search_index s1
JOIN search_index s2 ON s2.word = s1.word AND s2.sid <> s1.sid AND s2.type = 'node'
JOIN local_mag m1 ON m1.nid = s1.sid
JOIN local_mag_2 m2 ON m2.nid = s2.sid
WHERE s1.sid IN (*INDEXING_SUB_QUERY_WITH_LIMIT*) AND s1.type = 'node'
GROUP BY s1.sid, s2.sid
HAVING sim > MIN_SIM;

And bingo, you have a gradually updating similarity calculation. It's clean and smooth. This is pretty exciting. Not sure yet what I'm going to do with the title one, though. It's virtually impossible for it to follow a similar pattern. It would need its own separate index, similar to the search index but n-grammed. That will be the last one to be tackled, perhaps removed...
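
For anyone who wants to check the math outside the database, the query above works out a cosine similarity: the summed products of the scores for shared words, divided by the product of the two nodes' magnitudes. Below is a minimal Python sketch of the same arithmetic. The function name and the `(word, nid, score)` row shape are assumptions for illustration, it assumes one row per (word, nid), and it compares every pair once instead of limiting one side with an indexing subquery.

```python
import math
from collections import defaultdict
from itertools import combinations


def cosine_similarities(index_rows, min_sim=0.0):
    """Pairwise cosine similarity from (word, nid, score) rows.

    Hypothetical helper mirroring the arithmetic of the SQL, not the
    module's code. Returns {(larger_nid, smaller_nid): similarity}.
    """
    vectors = defaultdict(dict)
    for word, nid, score in index_rows:
        vectors[nid][word] = score

    # Per-node magnitude, the equivalent of the local_mag table.
    mag = {
        nid: math.sqrt(sum(s * s for s in words.values()))
        for nid, words in vectors.items()
    }

    sims = {}
    for a, b in combinations(sorted(vectors), 2):
        shared = vectors[a].keys() & vectors[b].keys()
        if not shared or mag[a] == 0 or mag[b] == 0:
            continue  # no common words (or an all-zero vector): no row
        dot = sum(vectors[a][w] * vectors[b][w] for w in shared)
        sim = dot / (mag[a] * mag[b])
        if sim > min_sim:  # same cut-off as HAVING sim > MIN_SIM
            sims[(b, a)] = sim  # larger nid first, like nid1/nid2
    return sims
```

Two nodes with no words in common never show up in the result, which matches the join on word in the query.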

Comment #16

Flying Drupalist commented 21 April 2009 at 19:55

I very much want to use this module, but the performance issues scare the bejesus out of me. Subscribing!

Comment #17

Scott Reynolds commented 21 April 2009 at 20:42

Hehe, I need to do a commit. Got a lot of it handled, I think. It makes use of temp tables and other cool stuff, so it's fast.

Comment #18

Flying Drupalist commented 21 April 2009 at 21:38

Thanks, then I can't wait. :)

Comment #19

mrfelton commented 23 April 2009 at 16:15

Ready to commit yet?! Subscribing.

Comment #20

Scott Reynolds commented 23 April 2009 at 16:45

Status	File	Size
new	first_pass.patch	21.95 KB

OK, she's not done yet, but here's a patch for most of them.
Committed this patch. Still need to remove the title one, as it won't scale, ever...
