"there have been bugs with outputting the proper string lengths in the serialization format when not using utf-8 encodings in all your data. "
Can anyone confirm this on the latest Solr release?
Using php or phps can expose a number of unexpected security issues as mentioned by the thread above. Security is more important than performance and therefor it is wise to stick with json as it is more tested and widespread.
I start reading more information and have more questions.
Why php-solr-client is removed from 7.x? For easier deployment? update frequency? I know it's a little bit outdated, but why not working on it and make a solution for php as a whole?
Both of php-solr-client and Drupal_Apache_Solr_Service.php can't have flexible support for customised request handlers in Solr now.
Anyone has experience with http://pecl.php.net/package/solr?
I know it's depending on server side and can't be an option for Drupal now, but can it be the main solution for PHP in the future?
Explains more. There is nothing stopping you from using that but the safest option and most supported is still json. If you'd like to make this more dynamic, you can open a new issue and make patches?
I'm going to chime in on this with a couple points. I'm currently writing a PHP to SOLR proxy for some custom stuff and I've been looking into what's best in the wt= formats for performance.
As you can see phps is miles ahead of all other formats, including exec() which would get used in wt=php and json_decode which gets used in wt=json (which is what apachesolr module is currently using).
If we care about speed, unserialized is the fastest (updated benchmarks with latest versions of PHP & Solr 4.x would be appreciated).
CONCLUSION
unserialize via wt=phps is the fastest, by a long shot.
Security
Reason I believe we're not using unserialize seems to be for security reasons.
I hate to break it to everyone here, but Drupal heavily uses unserialize all over the place on user submitted content.
cache_set / cache_get happen every page request by default and I do believe they will contain user submitted input. Additionally any third party module can also call cache_set / cache_get and thus potentially have security implications (I'm looking at you apachesolr module)
So if cache_set / cache_get are using them, why can't we. Does core do something with data sanitation that I don't know about? Is cache_get a major security issue in Drupal core that no one is talking about?
Additionally the {users}.data field is specifically for serialize / unserialize additional data to be stored alone side a user. No one is talking about security implications here.
CONCLUSION
Drupal heavily uses serialize & unserialize, on what I would expect to be user input, and thus if those are vulnerable, we have larger issues. If they're not vulnerable, then we could probably use them as well.
---
So personally I just wrote this because, 1. I'd like to use what is fastest and 2. I already know Drupal heavily uses serialize / unserialize. So if someone could explain to me why cache_get is not vulnerable, but it would unserialized would be in ApacheSolr module....then I'll crawl back into my troll hole.
----
From Solarium's code (which is the PHP solr backend we're using)
/**
* Get responsewriter option
*
* Defaults to json for backwards compatibility and security.
*
* If you can fully trust the Solr responses (phps has a security risk from untrusted sources) you might consider
* setting the responsewriter to 'phps' (serialized php). This can give a performance advantage,
* especially with big resultsets.
*
* @return string
*/
public function getResponseWriter()
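The docblock above ends at the method signature. As a minimal sketch of how such an option getter typically behaves (illustrative only, not Solarium's actual implementation; the class and option names here are assumptions):

```php
<?php
// Illustrative sketch only (not Solarium's actual code): an option container
// whose response-writer getter falls back to the safe 'json' default.
class ResponseWriterOptions
{
    const WT_JSON = 'json';
    const WT_PHPS = 'phps';

    protected $options = array();

    // Mirrors the docblock above: default to json; phps is opt-in.
    public function getResponseWriter()
    {
        return isset($this->options['responsewriter'])
            ? $this->options['responsewriter']
            : self::WT_JSON;
    }

    public function setResponseWriter($wt)
    {
        $this->options['responsewriter'] = $wt;
        return $this;
    }
}

$opts = new ResponseWriterOptions();
assert($opts->getResponseWriter() === 'json');            // safe default
$opts->setResponseWriter(ResponseWriterOptions::WT_PHPS); // trusted servers only
assert($opts->getResponseWriter() === 'phps');
```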
On unserialize and security: you should not call unserialize() on user-supplied data, but it is fine to call it on data that your own code previously serialized.
So
unserialize($_POST['foo']);
is bad, while
unserialize(serialize($_POST['foo']));
is OK.
In the case of the cache, serialization is done during cache_set(), so no raw user data ever reaches unserialize().
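To make the distinction concrete, here is a small sketch of the object-injection risk: a crafted serialized string instantiates a class of the attacker's choosing with attacker-chosen property values, while the serialize-then-unserialize round trip only ever sees bytes our own code produced. The CacheWriter class is hypothetical.

```php
<?php
// Hedged sketch of PHP object injection (the CacheWriter class is made up):
// unserialize() on attacker-controlled input can instantiate arbitrary
// classes, and their __wakeup()/__destruct() hooks then run with
// attacker-chosen property values.
class CacheWriter
{
    public $file = '/tmp/cache.txt';
    public function __destruct()
    {
        // In a real application this might write to $this->file on shutdown.
    }
}

// A crafted payload picks the class and its properties:
$payload = 'O:11:"CacheWriter":1:{s:4:"file";s:11:"/etc/passwd";}';
$obj = unserialize($payload);
assert($obj instanceof CacheWriter && $obj->file === '/etc/passwd');

// The cache_set()/cache_get() pattern is different: the string fed to
// unserialize() was produced by our own serialize() call, so the attacker
// never controls the serialized bytes directly.
$input = isset($_POST['foo']) ? $_POST['foo'] : 'example';
$safe  = unserialize(serialize($input));
```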
Comments
Comment #1
pwolanin commented:
No, absolutely not. It's insecure, and I don't think it's faster than JSON.
Comment #2
shenzhuxi commented:
Removing solr-php-client, which converts JSON to objects in PHP, would definitely enhance performance, though maybe not by much for a single request.
Why is unserialize insecure?
Comment #3
shenzhuxi commented:
Here is why solr-php-client keeps using JSON:
http://code.google.com/p/solr-php-client/issues/detail?id=6
"serialized PHP is faster to parse than JSON"
"there have been bugs with outputting the proper string lengths in the serialization format when not using utf-8 encodings in all your data. "
Can anyone confirm this on the latest Solr release?
Comment #4
nick_vh commented:
Following up on a question that was asked at Drupal Science Camp in Cambridge:
https://issues.apache.org/jira/browse/SOLR-1967
Using php or phps can expose a number of unexpected security issues, as mentioned in the thread above. Security is more important than performance, and therefore it is wise to stick with JSON, as it is more tested and widespread.
Hope that answered your remaining questions?
Comment #5
shenzhuxi commented:
I started reading more and have more questions.
Why was php-solr-client removed from 7.x? For easier deployment? Update frequency? I know it's a little outdated, but why not work on it and make a solution for PHP as a whole?
Neither php-solr-client nor Drupal_Apache_Solr_Service.php currently offers flexible support for customised request handlers in Solr.
Does anyone have experience with http://pecl.php.net/package/solr?
I know it depends on the server side and can't be an option for Drupal now, but could it be the main solution for PHP in the future?
Comment #6
nick_vh commented:
This issue explains more:
#635510: Use PECL Solr client
There is nothing stopping you from using that, but the safest and best-supported option is still JSON. If you'd like to make this more dynamic, you can open a new issue and submit patches.
Comment #7
j0rd commented:
I'm going to chime in on this with a couple of points. I'm currently writing a PHP-to-Solr proxy for some custom work, and I've been looking into which of the wt= formats is best for performance.
Performance
First things first, benchmarks from last year:
http://www.raspberry.nl/2012/02/28/benchmarking-php-solr-response-data-h...
Graph:
http://www.raspberry.nl/files/solr-decode-performance.html
As you can see, phps is miles ahead of all the other formats, including eval(), which would get used for wt=php, and json_decode(), which gets used for wt=json (what the apachesolr module currently uses).
If we care about speed, unserialize is the fastest (updated benchmarks with the latest versions of PHP and Solr 4.x would be appreciated).
CONCLUSION
unserialize via wt=phps is the fastest, by a long shot.
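For anyone wanting to reproduce the comparison locally, a rough micro-benchmark of the two decode paths might look like the sketch below. Document sizes and iteration counts are arbitrary, and absolute timings will vary by PHP version and hardware.

```php
<?php
// Rough micro-benchmark sketch of the two decode paths a Solr client uses:
// json_decode() for wt=json vs unserialize() for wt=phps. Sizes are arbitrary.
$doc  = array('id' => 42, 'title' => str_repeat('solr ', 100), 'score' => 1.5);
$docs = array_fill(0, 200, $doc);

$json = json_encode($docs);   // what a wt=json response body decodes from
$phps = serialize($docs);     // what a wt=phps response body decodes from

$t = microtime(true);
for ($i = 0; $i < 20; $i++) {
    $a = json_decode($json, true);
}
$jsonTime = microtime(true) - $t;

$t = microtime(true);
for ($i = 0; $i < 20; $i++) {
    $b = unserialize($phps);
}
$phpsTime = microtime(true) - $t;

printf("json_decode: %.5fs  unserialize: %.5fs\n", $jsonTime, $phpsTime);
assert($a === $b); // both paths must produce the identical structure
```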
Security
The reason I believe we're not using unserialize seems to be security.
I hate to break it to everyone here, but Drupal heavily uses unserialize all over the place on user-submitted content.
cache_set() / cache_get() happen on every page request by default, and I do believe the cached data can contain user-submitted input. Additionally, any third-party module can also call cache_set() / cache_get() and thus potentially have security implications (I'm looking at you, apachesolr module).
So if cache_set() / cache_get() are using them, why can't we? Does core do something with data sanitation that I don't know about? Is cache_get() a major security issue in Drupal core that no one is talking about?
Additionally, the {users}.data field exists specifically to serialize / unserialize additional data stored alongside a user. No one is talking about security implications there.
CONCLUSION
Drupal heavily uses serialize & unserialize on what I would expect to be user input; thus, if those are vulnerable, we have larger issues. If they're not vulnerable, then we could probably use them as well.
---
So personally, I wrote this because: 1. I'd like to use what is fastest, and 2. I already know Drupal heavily uses serialize / unserialize. So if someone could explain to me why cache_get() is not vulnerable but unserialize would be in the ApacheSolr module... then I'll crawl back into my troll hole.
I put a support request into the core issue queue, as I'm curious to find out:
#1992518: cache_get() uses unserialize(). Is this a security concern if user submitted data is unserialize().
----
From Solarium's code (which is the PHP solr backend we're using)
I would assume most people are using their own Solr server, and thus wouldn't it be safe to trust your own server?
Comment #8
nick_vh commented:
I'm not sure if I mentioned this here already, but here is a breakdown of what happens during a page load, focused on Apache Solr:
http://drupal.org/node/1616940
As you can see, the time for the page to load was 200,000 microseconds.
The time to process the response was 15,000 microseconds, which is about 7.5% of the whole page request. I wonder if you could quickly hack something together that proves this change to be significant in terms of performance. I have heard about this quite a lot, but no one shows any data in the context of Drupal.
I suppose you could easily change the connection class and try it out?
Comment #9
j0rd commented:
From my issue report in the Drupal issue queue:
So assuming you can trust the Solr server you're dealing with (though in practice you should never fully trust anything), we should be OK to use phps.
The blog & graph I posted are from the author of Solarium. He's added PHPS support to Solarium 3.x (I believe the apachesolr module is using 2.x). Solarium 3.x should be used with Solr 4. It's a little painful to set up and requires PHP 5.3, as it uses namespaces. He's enabled support for PHPS in Solarium 3, but I believe it's turned off by default and JSON is used.
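For reference, switching the writer in Solarium 3.x should be roughly a one-line query option. The sketch below is unverified against the Solarium source; the setResponseWriter() method name is an assumption based on its option-style API, and $config stands in for your own endpoint settings.

```php
<?php
// Unverified sketch: switching a Solarium 3.x select query from the default
// json response writer to phps. Only do this against a Solr server you trust.
require 'vendor/autoload.php';

use Solarium\Client;

$client = new Client($config);     // $config holds your endpoint settings
$query  = $client->createSelect();
$query->setResponseWriter('phps'); // default remains 'json' for safety
$result = $client->select($query);
```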
I've also noticed a bug with Solr 4.x + JSON when using PseudoFields field aliases: single field values always get returned as arrays. I believe this is a bug in the JSON response writer as used by the ApacheSolr module, since it doesn't happen with PHPS. I've asked in #solr on Freenode but was unable to get a response.
Conclusion
Just wanted to post this for future people looking into the subject. The graphs show the differences in milliseconds between json_decode & unserialize. unserialize support would require Solarium 3.x, which requires PSR-0 support in PHP 5.3 and has other dependencies. While unserialize is faster, you need to be able to trust your Solr server not to hack you; if you can, the security risk is minimal (I wouldn't trust, say, Acquia Search, but I would trust my own servers). And finally, support for PHPS should be optional and default to JSON anyway.
Comment #10
shenzhuxi commented:
For the performance:
I have written a minimalistic module for searching Solr with phps: https://github.com/shenzhuxi/solr (another module, solr_ajax, is also included for those who want the best performance without PHP).
After testing, I found no obvious performance difference (from query to generated HTML) between JSON and PHPS for regular use cases (<=3 keywords, <=50 docs/page in the result, with short text summaries) on PHP 5.4, because (I think) the bottleneck is PHP rather than Solr, as Nick_vh mentioned at http://drupal.org/node/1616940 (BTW, what tools did you use for that testing?).
In a project I worked on before, PHPS did show a performance advantage over JSON, because the queries were complicated and the text of each doc in the result was bigger than in regular use cases. In addition, less code for converting formats on the PHP side was also a reason.
BTW, I'm not sure whether the performance of json_decode() and unserialize() changed in PHP 5.4.
For the security:
I agree with j0rd. The risk in the connection between Drupal and Solr is not about serialize/unserialize itself. It's not a reason to stop using PHPS.
For the code:
I think the most important reason to use PHPS is that the code can be reduced a lot, and it will be easier to maintain. That's obvious after comparing most of the ways of using Solr with PHP, such as solr-php-client, Zend, and Solarium.
Of course, that's not an advantage for the Apachesolr and Facet API modules, which are already there.
It's a bit like Apache httpd and Nginx: there is little chance to redesign the former, but it's nice to have the latter.