People keep talking about performance between json and PHP but I haven't seen a single line of data actually posted.

So, time for some actual testing.

The idea of this test is to simulate a full read of default module configuration - just reading the files into a PHP variable. We can assume that PHP variable would then need to be written somewhere but it's the same by the time it gets to that point.

Here's what I did:

Created a small (possibly too small) array.

Wrote the same array out via drupal_var_export() to 500 .php files, and via json_encode() to 500 .json files. I've attached generate_json.php and generate_php.php (as .txt files). You can put these in your Drupal 8 root and run them with drush scr - create the sites/default/config dir first.
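For anyone who wants to reproduce this without the attachments, here is a minimal sketch of what such a generator could look like. It is not the attached script: the function name, the sample array and the temp directory are assumptions, var_export() stands in for drupal_var_export(), and the PHP files are assumed to return the array.

```php
<?php
// Hypothetical stand-in for generate_json.php / generate_php.php.
// var_export() is used instead of drupal_var_export() so this runs
// without a Drupal bootstrap.
function generate_config_files(string $dir, array $config, int $count = 500): void {
  if (!is_dir($dir) && !mkdir($dir, 0777, true)) {
    throw new RuntimeException("Cannot create $dir");
  }
  for ($i = 0; $i < $count; $i++) {
    // PHP flavour: a file that returns the array when included (assumption).
    file_put_contents("$dir/config_$i.php", '<?php return ' . var_export($config, true) . ';');
    // JSON flavour: the same data as a JSON document.
    file_put_contents("$dir/config_$i.json", json_encode($config));
  }
}

// Illustrative location and data only.
$dir = sys_get_temp_dir() . '/config_bench_' . getmypid();
$config = ['site_name' => 'Example', 'cache' => true, 'items_per_page' => 10];
generate_config_files($dir, $config);
```

Point $dir at sites/default/config to match the layout described above.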

Then I made two scripts - read_json.php and read_php.php. The JSON version does $foo = json_decode(file_get_contents($file)); the PHP version just includes the file.
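The two readers can be sketched roughly as follows. This is an illustration under assumptions, not the attached scripts: the function names are made up, the PHP files are assumed to end in a return statement, and a one-file fixture is created so the snippet runs on its own.

```php
<?php
// Hypothetical reader for the JSON files: read each file, decode to an array.
function read_json_configs(string $dir): array {
  $configs = [];
  foreach (glob("$dir/*.json") ?: [] as $file) {
    $configs[basename($file, '.json')] = json_decode(file_get_contents($file), true);
  }
  return $configs;
}

// Hypothetical reader for the PHP files: include each one.
// Assumes every file ends with "return array(...);".
function read_php_configs(string $dir): array {
  $configs = [];
  foreach (glob("$dir/*.php") ?: [] as $file) {
    $configs[basename($file, '.php')] = include $file;
  }
  return $configs;
}

// Tiny fixture so the sketch is self-contained.
$dir = sys_get_temp_dir() . '/config_read_' . getmypid();
if (!is_dir($dir)) {
  mkdir($dir, 0777, true);
}
$sample = ['name' => 'demo', 'enabled' => true];
file_put_contents("$dir/a.json", json_encode($sample));
file_put_contents("$dir/a.php", '<?php return ' . var_export($sample, true) . ';');

$from_json = read_json_configs($dir);
$from_php = read_php_configs($dir);
```

Both readers end up with the same PHP structure; the difference measured in this issue is what it costs to get there.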

I then hit each script from a browser, once, profiling the whole thing with xhprof, with and without APC enabled. That adds up to three result sets instead of four, since JSON was within the margin of error with and without APC - only the relative percentages were different.

Attaching 4 PHP scripts and a bunch of xhprof screenshots.

Summary:

For all three tests, total time spent was < 20ms to read data from 500 files. Since these are defaults and supposed to be read very rarely, this is negligible (JSON came out faster in wall time, but it might be close to the margin of error with APC enabled). This probably would matter if a level 2 store were reading from files, but I wasn't trying to simulate that.

With the PHP version, peak memory was 1.6MB. Total memory used when running APC was just under 50% of the memory used by Drupal (2.8 out of 6MB).

With the JSON version, peak memory was 100KB.

Why is this?

When you include a PHP file you can never free the memory taken up by including the actual file. Without APC it takes CPU + memory just to generate the opcodes each time (this will also be true for an APC cache miss - which is likely to be the case with rarely-read defaults). Additionally, things like function definitions and constants can never be freed (I'm not sure what overhead there is in APC when there are no functions, classes or constants, but at least the test case suggests there is something).

With JSON, json_decode() uses some memory when it's working, but as soon as you leave function scope, that memory is freed up again.
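A rough way to see this effect outside xhprof is to compare memory_get_usage() before and after a function that decodes and then discards the data. The sketch below is only an illustration: the function name and the payload are invented, and the exact numbers will vary by PHP version.

```php
<?php
// Decode inside a function so the decoded structure goes out of scope
// (and is released) as soon as the function returns.
function decode_and_discard(string $json): int {
  $data = json_decode($json, true);
  return count($data);
}

// Invented payload, just large enough to show up in the numbers.
$json = json_encode(array_fill(0, 50000, 'some configuration value'));

$before = memory_get_usage();
$count = decode_and_discard($json);
$after = memory_get_usage();
$peak = memory_get_peak_usage();

printf("items decoded: %d\n", $count);
printf("usage before:  %d bytes\n", $before);
printf("usage after:   %d bytes\n", $after);
printf("peak usage:    %d bytes\n", $peak);
```

Peak usage reflects the temporary decoded array, while usage after the call should fall back to roughly where it started.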

If you don't like my test write your own and post it here. This took much less than an hour to do, just running the same test yourself should be a few minutes at most using the files provided.

I'm also uploading a screenshot of the apc.php per-directory cache report. This shows the 500 files in sites/default/config, and the fact that they're taking up 2MB of memory in APC. You can see some Drupal core modules in the list - combined, these files take up more than any individual module (it's actually about the same usage as about 8 medium-sized Drupal modules).

Now if your level 2 store is also using PHP files for storage (which is the recommendation), you can double that to 1000 files in APC.

Are 500 config files too many? Probably not. Let's say core has 50 modules, each with one config file. Then add 200 contrib modules - some, like Views, provide 20-30 files (one per default view or similar), some just provide 1. This can easily get to 500 - and since most modules have many more than 4-5 variables, the actual files could be 5, 10, 20 times as large.

Comments

catch’s picture

Attachment (status: new): 103.77 KB
pounard’s picture

Nice tests, thanks.

There are a lot of things to take into account. First of all: loading PHP files doesn't do any file I/O if you are actually using APC with apc.stat = 0, while loading JSON files will do a lot of it. This can make a huge difference if you have a slow FS (memory consumption vs. speed has always been the real fight).

Another thing: if you store a copy of the file in the cache, the JSON gets serialized (by the caching layer), so when you fetch it you unserialize the string and then decode it. If you store a serialized PHP array, you just unserialize it and can use it directly. This no longer holds if you store the JSON string directly rather than a serialized PHP string (but if I'm not mistaken, the current cache API does serialize when storing).

Testing a fetch from (cache backend|database) storage would be an interesting thing to do. If I understood the debate correctly, you don't need to re-read the real files on each hit because configuration is supposed to be cached (so it is not really read from files in production).

sun’s picture

The detail on APC memory is interesting news for me, thanks for that.

The rest is pretty much known:

http://www.phpdevblog.net/2009/11/serialize-vs-var-export-vs-json-encode...

+ follow-up http://www.phpdevblog.net/2009/11/serialize-vs-var-export-vs-json-encode...

catch’s picture

"There are a lot of things to take into account, first of all: loading PHP files doesn't do any file I/O if you are actually using APC with apc.stat = 0, while the JSON file loading will do a lot of them."

If the files aren't cached in APC (which the defaults probably won't be), or if you have to clear the stat cache to reload the files (since this is defaults and/or level 1 store), then it will have to stat the files, no?

"Another thing: if you store a copy of the file in the cache, the JSON gets serialized (by the caching layer), so when you fetch it you unserialize the string and then decode it. If you store a serialized PHP array, you just unserialize it and can use it directly. This no longer holds if you store the JSON string directly rather than a serialized PHP string (but if I'm not mistaken, the current cache API does serialize when storing)."

This is a bit different from what I'm looking at here - I would think that level 2 backends with any caching will store the native PHP structure and unserialize it, regardless of file format - it would be strange to cache the JSON string serialized and then decode the JSON.

The cache API itself does not serialize when storing - the db caching backend does, but it would be possible to, for example, write a backend that does a json_encode()/json_decode() on $data and use that for the config bin without unserialize(). The cache implementations can do whatever they like with the PHP structure they're given (as long as it comes back out OK).
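To make the idea concrete, here is a minimal sketch of a backend that stores JSON instead of serialized PHP. It deliberately does not implement Drupal's real cache interface: the class name, the set()/get() signatures and the in-memory array storage are all assumptions for illustration.

```php
<?php
// Hypothetical backend: NOT Drupal's actual cache interface.
// Data is stored as a JSON string and decoded on the way out,
// so serialize()/unserialize() are never called.
class JsonArrayCache {
  /** @var string[] JSON strings keyed by cache ID (in-memory stand-in for real storage). */
  private $store = [];

  public function set(string $cid, $data): void {
    $this->store[$cid] = json_encode($data);
  }

  public function get(string $cid) {
    if (!isset($this->store[$cid])) {
      return false;
    }
    // Decodes to arrays; objects would not survive the round trip as objects.
    return json_decode($this->store[$cid], true);
  }
}

$cache = new JsonArrayCache();
$cache->set('system.site', ['name' => 'Example', 'slogan' => '']);
$site = $cache->get('system.site');
```

The trade-off is that only data representable in JSON comes back out unchanged, which is probably fine for plain configuration arrays.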

"Testing a fetch from (cache backend|database) storage would be an interesting thing to do. If I understood the debate correctly, you don't need to re-read the real files on each hit because configuration is supposed to be cached (so it is not really read from files in production)."

Right, that's the level 2 store - this test was only for a "scan the file-based config and compile it into something useful" step - if we were going to do that at runtime I'd be throwing a fit, but that doesn't seem to be the plan.

pounard’s picture

Attachment (status: new): 1.01 KB

I did some further tests, see the file attached. No caching involved or anything.

Tested on PHP 5.2 (linux gentoo):

pounard@guinevere /var/www/d7-core $ php -f test.php 
Testing 20000 iterations, array size is 10000 items.
json_encode(): 10.8694529533
json_decode(): 80.2525508404
serialize():   63.9363901615
unserialize(): 72.3440270424

And PHP 5.3 (linux ubuntu):

pounard@blaster:~
[Fri Jun 24, 16:56] $ php -f /tmp/test.php
json_encode(): 12.629450082779
json_decode(): 97.817018985748
serialize():   71.509289979935
unserialize(): 91.29324889183

The test is biased because it tests only with ints and not strings (I actually got the test from one of sun's URLs and modified it). But it shows that unserialize() is (a bit) faster than json_decode() (really pretty much nothing).
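For reference, a benchmark of this shape can be sketched as below. This is not the attached test.php: the helper name and the iteration count are assumptions, and the iteration count is kept small so the snippet finishes quickly.

```php
<?php
// Hypothetical timing helper: runs $fn $iterations times, returns elapsed seconds.
function time_call(callable $fn, int $iterations): float {
  $start = microtime(true);
  for ($i = 0; $i < $iterations; $i++) {
    $fn();
  }
  return microtime(true) - $start;
}

// Integer-only array, like the first run above (10000 items).
$data = range(1, 10000);
$json = json_encode($data);
$ser = serialize($data);
$iterations = 100; // Far fewer than 20000, to keep this quick.

$timings = [
  'json_encode()' => time_call(function () use ($data) { json_encode($data); }, $iterations),
  'json_decode()' => time_call(function () use ($json) { json_decode($json, true); }, $iterations),
  'serialize()'   => time_call(function () use ($data) { serialize($data); }, $iterations),
  'unserialize()' => time_call(function () use ($ser) { unserialize($ser); }, $iterations),
];

foreach ($timings as $label => $seconds) {
  printf("%-15s %.6f\n", $label . ':', $seconds);
}
```

Swapping range() for an array of random strings or a nested array covers the other cases discussed in this thread.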

Configuration should be stable in a production environment, and read operations are more important than write operations IMHO. If the site rebuilds its configuration often, then we can assume that there is a serious problem behind it.

One more note: the PHP 5.2 box's hardware is much faster than the 5.3 one's, which explains why the timings are lower (better) on PHP 5.2 where we might expect the opposite.

EDIT: Did the same tests with an array filled with random strings, and got a really huge difference: unserialize() is about two times faster than json_decode(). The code remains biased because it doesn't test with a hierarchical array though.

PHP 5.2:

pounard@guinevere /var/www/d7-core $ php -f test-strings.php
Testing 20000 iterations, array size is 5000 items.
json_encode(): 47.4064810276
json_decode(): 91.0086319447
serialize():   33.496325016
unserialize(): 42.2469871044

PHP 5.3:

pounard@blaster:~
[Fri Jun 24, 17:27] $ php -f /tmp/test-strings.php
Testing 20000 iterations, array size is 5000 items.
json_encode(): 63.624093055725
json_decode(): 109.14283394814
serialize():   42.207105875015
unserialize(): 52.044392108917
sylvain lecoy’s picture

This can serve as a good base for performance tests for the final version if we implement the idea that sun submitted here: http://groups.drupal.org/node/157379#comment-528739, allowing multiple formats defined in different parsers.

catch’s picture

My main worry with configuration rebuilds is that they may be combined with lots of other rebuilds (things like the installer, or submitting the modules page for example) - operations like that can take Drupal over the edge with total memory usage.

The actual speed of reading the files like this doesn't worry me much at all (as long as it's not hundreds of ms) - as long as we're only talking about defaults and the level one store - it should be rare as pounard says.

There's one exception: people are also talking about replacing settings.php with whichever format we use. With settings.php (or boot.json or whatever) the APC argument holds very strongly, and that is going to need its own set of benchmarks. With a bit of patching it is currently possible to get the first few bits of Drupal bootstrap to not issue any syscalls at all; it would be a shame to lose that.

pounard’s picture

@#7 If I could +1 your post, I'd definitely do it.

Crell’s picture

For most of the level 1 (file) storage I agree with catch. CPU usage is not the bottleneck, as it's not a critical path. Memory usage is. On the memory usage front, JSON beats included PHP for the reasons catch explains in #0.

pounard’s picture

@#8 Crell, yes I agree. As catch said in #7, memory will be critical mainly at rebuild time, if the operation happens when a lot of other caches are being rebuilt.

I'm wondering if there's any other path for reducing memory usage while rebuilding caches globally. When cache rebuilding becomes a problem, there may be an overall problem with cache usage, which seems to be twisted.

I know that the PHP include memory cost cannot be lowered here, but if 2MB alone was a problem, nobody would use Views. I don't want to minimize this memory usage, don't get me wrong, but catch's original testing seems to assume that 500 configuration files is standard, and I really don't see when a site will have more than 40 or 50 of them. (Even that seems like too many, by the way: if you go down the path of one file per module, it is probably worth combining small ones together per package. That is better for file loading, but also better for reducing the number of cache_get(), cache_set() or level 2 database retrieval queries.)

Crell’s picture

There are not expected to be any cache_get() calls, at all. The level 2 active store is not using the cache system. I think we're all on the same page that level 2 should be pluggable, though, so very-high-traffic sites can switch to Mongo or Redis or some other data store more appropriate for key/value-blob than SQL.

There's a lot we don't know yet about D8's internal usage patterns, and I fully expect them to change radically once we start adding more changes to the system. I think the take-away here is that we need something that will scale decently on rebuild, and JSON does. That gives us flexibility later. Trying to put all memory optimization in the config format is not a good approach; rather, while rebuilding other parts of the system we should keep memory usage in mind, since we're going to be mucking with most of Drupal anyway during D8 (again).

JSON scales better than PHP does, memory-wise. That gives us more flexibility and wiggle room. That's enough for me on the performance front.

pounard’s picture

@#11 Pretty much what I said except for the part where you defend JSON. It appears that we mainly agree.

catch’s picture

I'm very skeptical that the level 2 store will be able to avoid using the cache system in core without degrading compared to current D8 performance, however that's not relevant here and we'll have to see how it shapes up.

gdd’s picture

Note that it should be "the level two store is not currently using the cache system". We haven't really discussed this at all that I recall.

catch’s picture

Attachments (status: new): 73.8 KB, 454 bytes, 495 bytes

Eval screenshot and scripts - did this last week but never got 'round to posting it.

pounard’s picture

Good point, eval() is slow. But whatever format is chosen, I hope you will never eval() PHP but include it directly. PHP only performs well when it's being included "normally".

Crell’s picture

It's not the speed that's the issue, it's the memory usage. include() for PHP has a very bad impact on memory usage.

catch’s picture

It would be useful to run these same benchmarks with both a much larger array and a nested array. I'm not working on that at the moment, but it would be good to sanity check this (for variation in both speed and memory).

damien tournoud’s picture

In my limited testing, as soon as you are working on non-trivial data structures, nothing beats serialize/unserialize [not even igbinary_serialize/igbinary_unserialize, which is interesting].

json_encode/decode being faster is only true for trivial data structures, which probably means that it has a smaller fixed cost.

Crell’s picture

How trivial a data structure are we talking about? I expect *most* configuration objects will not be too outlandish, with a few outliers that are quite complex, eg, Views. But I don't know what the balance point will be there.

catch’s picture

Relatively complex/large:

System table (although that is only partially configuration at the moment: there are things like the schema version that are computed and can't be recalculated or configured, and there is some registry/caching data like file locations and info).

Content types (split between custom storage and variables at the moment so usually bigger than the node_type table).

Fields.

Instances (particularly if the configuration object for instances ends up being for the bundle rather than separate objects).

gdd’s picture

Status: Active » Closed (fixed)

Given that we've chosen a file format I am closing this issue unless someone can come up with a good reason not to.