PHP Fatal error: Allowed memory size of 536870912 bytes exhausted (tried to allocate 162 bytes) in /vagrant/repos/apisite/includes/file.inc on line 960

I've successfully set up the API module and parsed Drupal 7 core as a test. However, when I try to parse the code of a D7 distribution with many contrib modules, cron continually runs out of memory. It never gets to the point where the files are added to the job queue for parsing. My assumption at this point is that all of the time is spent in api_scan_directories(), which needs to build the entire array of file names in a single pass.

Any ideas on how to work around the problem or improve api_scan_directories()?


Comments

jhodgdon’s picture

Status: Active » Postponed (maintainer needs more info)

I agree with your conclusion -- if it doesn't make it to the point where any jobs are added to the queue, it is most likely crashing in api_scan_directories(). Also, if you're running the same version of Drupal 6.x that I am (or something pretty close, which is quite likely), the error message points to line 960 of file.inc, which is in the file_scan_directory() function, on the line where it makes a recursive call to file_scan_directory() to scan sub-directories. And api_scan_directories() is the only place in the API module that calls file_scan_directory().

But I wonder at it running out of memory -- that seems really really unlikely -- it's just building up an array of all the files. Is it possible that it's doing an infinite recursion, due to some symbolic links in the directory tree where it's scanning? e.g. if your directory structure was:

top/dir/
top/dir/foo/
top/dir/foo/bar --> symbolic link to top/dir/

Then I think file_scan_directory could conceivably run into infinite recursion. Otherwise, I can't imagine that you would run out of memory just making a list of the files in your directories. Can you check on this?

jhodgdon’s picture

If you wanted to debug this further, you could also put a watchdog() call into your file.inc right before it does the recursive call on line 960, and see if that tells you anything.
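
Something like this, for instance (I'm assuming the Drupal 6 file.inc variable names here, so adjust to whatever your copy actually uses):

// Temporary debugging aid only: log each directory just before
// file_scan_directory() recurses into it. An infinite symlink loop would
// show up as the same path repeating over and over in the watchdog log.
watchdog('api_debug', 'file_scan_directory() recursing into %dir',
  array('%dir' => "$dir/$file"));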

rickvug’s picture

Title: Parser runs out of memory on large codebase » Parser runs out of memory on large codebases
Status: Postponed (maintainer needs more info) » Active

In the end I was able to work around the problem by deleting directories that don't need to be scanned. The primary offenders were the libraries directory as well as all .git directories. Deleting them brought the number of files down from around 11,000 to around 5,000. Adding the directories to the exclude list didn't have a noticeable effect on memory usage, since the check for exclusion is done after the file_scan_directory() call, which is where cron was stalling. Perhaps passing that function the array of directories to exclude would be more efficient? By default it may make sense to exclude .git directories as well.

Aside from all of this I wouldn't be surprised if the problem was exacerbated by the fact that I'm running a VM that is using my host file system for storage via NFS. That can't be terribly fast!

jhodgdon’s picture

Priority: Normal » Major

Hm. I wonder if there is a way for us not to scan .git directories, and to check for exclusion during the recursion of file_scan_directory()? For instance, we could do our own recursion in a loop instead of passing the $recurse argument into file_scan_directory(). It seems worth investigating, as it would save both execution time and memory.
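
A rough sketch of that manual-recursion idea, just to illustrate it (this is not the actual patch; the function name, the $excluded parameter, and the preg-style $mask are hypothetical):

function api_scan_directory_sketch($dir, $mask, array $excluded, array &$files = array()) {
  $handle = @opendir($dir);
  if (!$handle) {
    return $files;
  }
  while (($entry = readdir($handle)) !== FALSE) {
    if ($entry == '.' || $entry == '..') {
      continue;
    }
    $path = $dir . '/' . $entry;
    if (is_dir($path)) {
      // Skip hidden directories such as .git, and anything on the exclude
      // list, *before* descending -- their contents are never listed at all.
      if ($entry[0] == '.' || in_array($path, $excluded)) {
        continue;
      }
      api_scan_directory_sketch($path, $mask, $excluded, $files);
    }
    elseif (preg_match($mask, $entry)) {
      $files[] = $path;
    }
  }
  closedir($handle);
  return $files;
}

That way the exclusion check happens as we recurse, instead of building the full file list first and filtering it afterwards.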

Anyway, thanks for reporting this. It seems like it might be a problem for other people in the future. I'm upping the priority to "major" to indicate that I think we should fix this sooner than other issues left at "normal".

jhodgdon’s picture

See also #1186226: Allow scanning of code within hidden paths, which also pertains to scanning directories. Especially comment #3.

jhodgdon’s picture

Status: Active » Needs review
FileSize
16.35 KB

Here's a patch that I think should fix this issue (as much as is possible with a patch to the API module, anyway). It changes the way directories are scanned, so that excluded and hidden directories are never descended into during the recursive scan.

It also includes some tests, and also fixes
#1186226: Allow scanning of code within hidden paths
along the way.

Included here is also a fix to
#1380678: Running some tests results in error but tests pass (?!?)
I probably need to scrape that out and put it in a separate patch. Sorry!

jhodgdon’s picture

FileSize
7.88 KB

Here's the same patch without the stuff from #1380678: Running some tests results in error but tests pass (?!?), which I'll commit separately.

jhodgdon’s picture

rickvug: If you could possibly test the patch in #7 and see if it fixes your original problem (before you removed all those directories), that would be helpful!

rickvug’s picture

Status: Needs review » Active

I have some good news and bad news. The good news is that after applying the patch I was able to successfully start parsing the codebase, including hidden .git and .hg directories. A quick look at the diff shows that the approach makes more sense and should be more efficient. The bad news is that when I reverse the patch I'm also able to successfully parse the entire codebase. I didn't time how long the initial cron run took in either case; the number of jobs in the queue was the same. I don't think I'll be able to replicate my original problem. Perhaps it was down to my VM running slowly that day -- who knows.

At this point should the code be marked RTBC anyhow? It looks like a good clean up to me and I can confirm that everything still works as expected.

jhodgdon’s picture

Status: Active » Needs review

Thanks for testing... I'm happy to have independent confirmation that the patch doesn't break anything, and that it results in the same jobs being placed in the queue. :)

Definitely, some timing tests of cron would be helpful. Actually, if you have the dblog or syslog module enabled, and if you have more than one branch defined, you should be able to see the cron timing in your logs. Although maybe it won't be useful... My most recent cron run log, for instance, shows 5 "Updated --- branch ---" entries, but they're all marked at time 11:43. I guess that's not too useful -- they all took less than a minute. Note that the directory tree is read *each* cron run to see if there are any new, removed, or updated files.

Anyway, that is a good idea for testing. I will rig up some temporary timing that records the time (in milliseconds or seconds) it takes to parse each branch, log it with watchdog(), and compare with and without the patch.

Meanwhile, setting the status back to "needs review" as this needs to be tested before we can commit this patch.

jhodgdon’s picture

Here's a quick patch [NOT TO BE COMMITTED] that instruments the cron run to record how long the directory parsing takes.
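
The instrumentation amounts to something like this (a sketch only -- the exact call site, the $branch object, and its branch_name property are assumptions, not the literal patch):

// Temporary timing instrumentation, not for committing: time the
// per-branch directory scan and record it with watchdog().
$start = microtime(TRUE);
api_update_branch_files($branch);
$elapsed = microtime(TRUE) - $start;
watchdog('api_timing', 'Scanned directories for %branch in %time seconds.',
  array('%branch' => $branch->branch_name, '%time' => $elapsed));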

I have 3 files branches on my test site currently: the sample functions from the API project (small), Drupal 6.22, and Drupal 7.7. Actually, I'm going to add another branch containing a git checkout of Drupal 8 as well, because none of the others has large quantities of files in hidden directories (the Sample branch has 1 file in 1 hidden directory for testing, but I downloaded the others as zip archives, so they don't have .git directories).

So, I ran cron with and without the patch in #7, with this time-logging patch applied in both cases. Reported are the times taken to run api_update_branch_files(). I'm reporting a couple of runs on each, all run from the Status Report page from within Drupal.

Note that the very first time for the Drupal 8 branch is longer, because I just added it and the first time through, it has to add a bunch of files to the database and set up a bunch of jobs. And then the time goes down each run as the files are gradually parsed, because the database has records of increasing numbers of those files. rickvug: that might explain why you cannot reproduce this issue now... maybe? Anyway, it started out over 2 seconds, and gradually decreased. What I'm showing is after all the initial parsing had been done, which better reflects the time just to traverse the directory trees.

Anyway, here are the results...

Times to parse directories (seconds):

Branch               With patch in #7                 Without patch in #7
Sample               0.000818, 0.0020599, 0.001864    0.00079488, 0.000777
Drupal 6.22          0.0192, 0.05652, 0.0483949       0.0194, 0.01939
Drupal 7.7           0.04312, 0.074578, 0.0793688     0.04802, 0.0466599
Drupal 8 from git    0.16282, 0.050574, 0.0510        0.05228, 0.05102

I'm not seeing a lot of difference in the time to run cron here. There are lots of fluctuations, but the with/without columns don't look much different to me.

So, time didn't tell me much... how about memory? The quick way I thought of to test this was to reduce my PHP memory limit until cron fails, and thereby measure how much memory is needed with and without the patch... let's see.

Without the patch in #7, cron runs with 35M but fails with 34M... and it's the same with the patch. Looking at the log, though, it's failing sometime after all the directory parsing is done, so this patch isn't affecting memory use much on my setup. I guess I would need a huge codebase with lots of .git directories to really test it.

So... I am not sure this patch does much to help. It seems sensible, but so far I can't see much improvement; see also comment #9. Thoughts?

jhodgdon’s picture

Ah, just found PHP function memory_get_peak_usage().

Using this I find that the branch directory walking is reporting memory usage as follows:

(all values in bytes)

                  With patch                                 No patch
Branch            Peak before   Peak after   Difference      Peak before   Peak after   Difference
Sample            18200016      18216312     16296           18195160      18200736     5576
Drupal 6          18219896      18677760     457864          18205392      19380272     1174880
Drupal 7          18677760      19307976     630216          19380272      20864112     1483840
Drupal 8          19307976      19382784     74808           20864112      20947704     83592

OK, this looks like a difference. Less memory is used with the patch in #7, except in the very small Sample branch. However, that comparison might be slightly misleading, because the patch adds some more files to the Sample branch.

Unfortunately, I forgot to make a patch to show this instrumentation. It was basically a call to memory_get_peak_usage() right before and after the call to api_update_branch_files().
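
Roughly like this, in other words (a sketch -- the call site and the $branch variable are assumptions based on the description above):

// Temporary instrumentation: record peak memory before and after the
// branch directory scan and log the difference.
$before = memory_get_peak_usage();
api_update_branch_files($branch);
$after = memory_get_peak_usage();
watchdog('api_memory', 'Peak memory: @before before, @after after, difference of @diff bytes.',
  array('@before' => $before, '@after' => $after, '@diff' => $after - $before));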

tobiasb’s picture

With a 150 KB file there was no problem:
Peak memory usage was 21.67 MB [0.49 sec, 20.57 MB]

With a 165 KB file and without the patch from #7, I got timeouts and memory errors:

PHP Fatal error:  Allowed memory size of 268435456 bytes exhausted (tried to allocate 72 bytes) in ../sites/all/modules/contrib/api/parser.inc on line 695
Drush command terminated abnormally due to an unrecoverable error.                                                                       [error]
Error: Allowed memory size of 268435456 bytes exhausted (tried to allocate 72 bytes) in
.../sites/all/modules/contrib/api/parser.inc, line 695 [5.77 sec, 255.95
MB]
WD cron: Cron run exceeded the time limit and was aborted. [5.77 sec, 255.95 MB]  

The memory limit was 250M. With 350M there was no problem parsing the file:
Peak memory usage was 269.93 MB [5.64 sec, 29.73 MB]

The biggest file in Drupal 7 is drupal-6.filled.database.php (over 500 KB).

jhodgdon’s picture

Status: Needs review » Reviewed & tested by the community

This patch does not address the memory needed to parse files. This patch addresses the memory needed to traverse the directory tree (which is done in each cron run for each branch, to see if there are any new, removed, or changed files that need to be parsed).

So what you are probably seeing is that the memory used during the directory tree traversal is reduced, so that by the time a file is parsed, there is still enough memory left to parse it.

Anyway, that is another confirmation that this patch does reduce memory usage in a helpful way, thanks!

I will go ahead and mark it "reviewed and tested". I won't be committing the patch until after the next release though, because we are in the "beta" stage and I don't want to change the code much right now.

jhodgdon’s picture

Status: Reviewed & tested by the community » Fixed

The patch in #7 has been committed.
http://drupalcode.org/project/api.git/commit/f896383

jhodgdon’s picture

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

juampynr’s picture

This can happen for many reasons while parsing.

Increasing the memory limit in settings.php is an easy way to work around it.
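
For example, something like this near the bottom of settings.php (the 512M value is just an illustration; use whatever your codebase and server can accommodate):

// Raise PHP's memory limit for this site only.
ini_set('memory_limit', '512M');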

I will add a Troubleshooting page to the module's documentation. Please feel free to relocate or edit it.

jhodgdon’s picture

Thanks, that page looks OK.

juampynr’s picture

Nice, I feel proud that somebody from the Docs team reviewed it. Thanks to you too.