Needs work
Project:
Similarity Objects
Version:
6.x-1.x-dev
Component:
Code
Priority:
Critical
Category:
Feature request
Assigned:
Unassigned
Reporter:
Created:
18 Mar 2009 at 22:33 UTC
Updated:
23 Apr 2009 at 16:45 UTC
Comments
Comment #1
Scott Reynolds commented
Yeah, for sites of any real size we're going to have to do all the math in the database layer. It's a difficult problem. Still hashing it out.
Comment #2
Scott Reynolds commented
Was thinking about this some more and was wondering why cron only gets 128 MB. I know that's fine for most PHP apps, but this bad boy does a lot. How much memory does it actually take to get this job done?
EDIT: really, the module trades memory for query processing time. All the memory hogging it does is for speed.
Comment #3
Scott Reynolds commented
Let me know if this works for ya. It does the calculation via SQL. May need to revisit some indexes. Still some stuff to evaluate here, but this is getting there. It also adds some static caching of the total term counts.
This applies only to the node tag similarity object, but I will port it to the rest.
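For readers following along, here is a rough sketch of what "doing the calculation via SQL" with cached term counts could look like. It is not the attached patch. It is written in Python against SQLite purely so it can be run standalone, it assumes a `term_node (nid, tid)` table, and it assumes a Jaccard-style score (shared terms divided by the union of terms), which the thread does not specify. The function name `tag_similarities` is made up for illustration.

```python
import sqlite3

def tag_similarities(conn):
    """Return {(nid1, nid2): score} for every node pair sharing a term.

    Assumption: score = shared terms / union of terms (Jaccard).
    The real module may use a different measure.
    """
    # Total term count per node, fetched once and kept in memory
    # (the "static cache" of term counts).
    term_counts = dict(
        conn.execute("SELECT nid, COUNT(*) FROM term_node GROUP BY nid")
    )
    # The database does the pairing: self-join on the term id and
    # count how many terms each pair of nodes has in common.
    shared = conn.execute(
        """
        SELECT a.nid, b.nid, COUNT(*)
        FROM term_node a
        JOIN term_node b ON a.tid = b.tid AND a.nid < b.nid
        GROUP BY a.nid, b.nid
        """
    )
    result = {}
    for nid1, nid2, n_shared in shared:
        union = term_counts[nid1] + term_counts[nid2] - n_shared
        result[(nid1, nid2)] = n_shared / union
    return result
```

Only pairs with at least one shared term come back from the join, so PHP (or here, Python) never has to hold the full per-node term lists.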
Comment #4
Scott Reynolds commented
Comment #5
Scott Reynolds commented
Doh! Moved too fast. This won't work. Revising.
Comment #6
Scott Reynolds commented
Comment #7
Scott Reynolds commented
Comment #8
Scott Reynolds commented
Wanted to write this down before I forget. Need to array_chunk() the saved cache so that it doesn't get too large. Once it hits a certain size, write it to the DB.
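A minimal sketch of that idea, in Python rather than PHP so it runs standalone: a buffer that hands its rows to a writer as soon as it reaches a size limit. The class name `PendingSaves`, the `write_rows` callback, and the default limit of 1000 are all hypothetical and not taken from the module.

```python
class PendingSaves:
    """Buffer similarity rows and flush them once a size limit is hit."""

    def __init__(self, write_rows, max_rows=1000):
        # write_rows is any callable that persists a list of
        # (nid1, nid2, score) tuples, e.g. one bulk INSERT.
        self._write_rows = write_rows
        self._max_rows = max_rows
        self._rows = []

    def add(self, nid1, nid2, score):
        self._rows.append((nid1, nid2, score))
        if len(self._rows) >= self._max_rows:
            self.flush()

    def flush(self):
        # Call once more at the end of the run (e.g. on shutdown)
        # to write whatever is left in the buffer.
        if self._rows:
            self._write_rows(self._rows)
            self._rows = []
```

Memory use is then bounded by `max_rows` instead of by the number of similarities found in the run.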
Comment #9
Scott Reynolds commented
So I was worried about this $pending_saves cache. This cache holds the similarities that are going to be written to the database on PHP exit(). At its maximum size it is N by N, where N is the node throttle set on the similarity create page. So if you set it to 200 nodes per calculation run, the pending_saves cache would be at most 200 x 200 = 40,000 entries. Never larger.
The real memory hog was things like the taxonomy similarity, which held all the terms per nid. It was convenient to do it this way, but it takes up too much memory. This patch fixes that one class. Working on backporting similar fixes.
I feel like this approach is much nicer than writing each individual similarity to the database, which would result (in the above case) in 200 queries. This technique turns those 200 queries into one query.
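As an illustration of collapsing many single-row writes into one statement, here is a small sketch using a multi-row INSERT. It uses Python and SQLite only so it can be run as-is. The `similarity (nid1, nid2, score)` table and the `save_similarities` function are assumptions for the example, not the module's actual schema or API.

```python
import sqlite3

def save_similarities(conn, rows):
    """Write all (nid1, nid2, score) rows with a single INSERT statement."""
    if not rows:
        return
    # One "(?, ?, ?)" group per row, so the whole batch is one query.
    placeholders = ", ".join(["(?, ?, ?)"] * len(rows))
    # Flatten [(a, b, c), ...] into [a, b, c, ...] to match the placeholders.
    params = [value for row in rows for value in row]
    conn.execute(
        "INSERT INTO similarity (nid1, nid2, score) VALUES " + placeholders,
        params,
    )
```

Databases cap the size of a single statement (bound-parameter limits, max packet size), so very large batches would still need to be split into a few such statements.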
Comment #10
Büke Beyond commented
I was testing this module very cautiously, suspecting the N^2 count, so I started out testing the title and search index computations on a small set of nodes (<20). That is when I discovered the database was not cleaning out the matches below the lower threshold (the float fix in the other post).
Then I set out to try just the title computations on the 15000 news nodes. That is when cron started stalling, requiring manual cleanup of the cron semaphore in the database to resurrect it.
I may try bigger memory limits. The memory constraints, I believe, are mainly there to guard against attacks. If Drupal/PHP/Apache offered a way to selectively give the cron request (with proper authentication, e.g. from localhost) a larger memory limit, it would be more feasible. Also, a lot of Drupal sites start out on shared hosting with limited settings.
There is also the possibility of running the algorithm directly from the PHP CLI and skipping the web access.
Comment #11
Scott Reynolds commented
/me sighs. She's still ballooning somewhere.
Comment #12
Scott Reynolds commented
15000+ nodes
Memory used at: devel_init()=1.36 MB, devel_shutdown()=15.94 MB.
4447124 similarities. Getting better. Seems like too many similarities. Each node has a term, so it calculates against that. And the min_sim isn't respected yet.
Comment #13
Scott Reynolds commented
OK, 15000 nodes, processing 500 at a time.
Memory used at: devel_init()=1.37 MB, devel_shutdown()=15.94 MB.
The page executed on my laptop in 7 minutes, which is slow, but for cron, usually run every 15 minutes, that's fine. And my laptop isn't running a server kernel. It's running X, playing Pandora, my IDE is open, etc.
Attached is the patch to the term similarity.
Comment #14
Scott Reynolds commented
Adding indexes to the temporary tables made it execute in 0.5 minutes.
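To make the temp-table idea concrete, here is a hedged sketch of staging one batch of nodes in a temporary table, indexing the join column, and then joining against it. It is written in Python against SQLite so it can be run standalone; the actual patch targets MySQL, and the table, index, and function names below (`tmp_batch_terms`, `tmp_batch_terms_tid`, `stage_batch`, `shared_term_counts`) are invented for the example.

```python
import sqlite3

def stage_batch(conn, node_ids):
    """Copy the terms of one batch of nodes into an indexed temp table."""
    conn.execute("CREATE TEMP TABLE tmp_batch_terms (nid INTEGER, tid INTEGER)")
    marks = ", ".join("?" * len(node_ids))
    conn.execute(
        "INSERT INTO tmp_batch_terms "
        "SELECT nid, tid FROM term_node WHERE nid IN (" + marks + ")",
        list(node_ids),
    )
    # Index the column used in the join so lookups into the temp
    # table do not need a full scan.
    conn.execute(
        "CREATE INDEX temp.tmp_batch_terms_tid ON tmp_batch_terms (tid, nid)"
    )

def shared_term_counts(conn):
    """Count shared terms between each batch node and every other node."""
    return conn.execute(
        """
        SELECT t.nid, n.nid, COUNT(*)
        FROM tmp_batch_terms t
        JOIN term_node n ON n.tid = t.tid AND n.nid <> t.nid
        GROUP BY t.nid, n.nid
        ORDER BY t.nid, n.nid
        """
    ).fetchall()
```

How much an index helps depends on the engine, the table sizes, and the query plan, so the speedup reported above should not be expected to carry over unchanged.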
Comment #15
Scott Reynolds commented
To do the search index, use the following MySQL commands
Then the big nasty
And bingo, you have a gradually updating similarity calculation. It's clean and smooth. This is pretty exciting. Though I'm not sure what I'm going to do with the title one. It's virtually impossible for it to follow a similar pattern. It would need its own separate index, similar to the search index but n-grammed. That will be the last one to be tackled, or perhaps removed...
Comment #16
Flying Drupalist commented
I very much want to use this module, but the performance issues scare the bejesus out of me. Subscribe!
Comment #17
Scott Reynolds commented
Hehe, I need to do a commit. Got a lot of it handled, I think. It makes use of temp tables and cool stuff, so it's fast.
Comment #18
Flying Drupalist commented
Thanks, then I can't wait. :)
Comment #19
mrfelton commented
Ready to commit yet?! Subscribing.
Comment #20
Scott Reynolds commented
OK, she's not done yet, but here's a patch for most of them.
Committed this patch. Still need to remove the title one, as it won't scale, ever...