Please check

Having Solr return serialized PHP (wt=phps) directly should be faster, and the code should be shorter.

solr-php-client is an old project that predates Solr's support for more response formats.

Comments

pwolanin’s picture

Status: Active » Closed (won't fix)

no, absolutely not. It's insecure, and I don't think it's faster than JSON.

shenzhuxi’s picture

Removing solr-php-client, which converts JSON to objects in PHP, will definitely improve performance, though maybe not by much for a single request.

Why is unserialize insecure?

shenzhuxi’s picture

Here is why solr-php-client keeps using JSON
http://code.google.com/p/solr-php-client/issues/detail?id=6

"serialized PHP is faster to parse than JSON"

"there have been bugs with outputting the proper string lengths in the serialization format when not using utf-8 encodings in all your data. "
Can anyone confirm this on the latest Solr release?
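
For anyone wanting to check the string-length concern quoted above, here is a minimal sketch of why a wrong length breaks parsing. PHP's serialization format records string lengths in bytes, not characters. The "buggy writer" string below is hand-built for illustration and is an assumption about what such a bug would produce; it is not captured Solr output.

```php
<?php
// "café" encoded as UTF-8: 4 characters, 5 bytes.
$utf8 = "caf\xC3\xA9";

// serialize() records the byte count.
$serialized = serialize($utf8);
echo $serialized, "\n";              // s:5:"café";

// Hypothetical buggy writer: emits the character count (4) instead of the byte count (5).
$broken = 's:4:"' . $utf8 . '";';

// @ suppresses the notice/warning PHP raises when the string cannot be parsed.
$result = @unserialize($broken);
var_dump($result);                   // bool(false)
```

Running something like this against real wt=phps responses containing non-ASCII text would show whether a given Solr release is affected.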

nick_vh’s picture

As a follow-up to a question asked at Drupal Science Camp in Cambridge:

https://issues.apache.org/jira/browse/SOLR-1967

Using php or phps can expose a number of unexpected security issues, as mentioned in the thread above. Security is more important than performance, and therefore it is wise to stick with JSON, as it is more tested and widespread.

Hope that answered your remaining questions?

shenzhuxi’s picture

I started reading more about this and have more questions.

Why was solr-php-client removed from 7.x? For easier deployment? Update frequency? I know it's a little outdated, but why not work on it and build a solution for PHP as a whole?

Neither solr-php-client nor Drupal_Apache_Solr_Service.php currently offers flexible support for customised request handlers in Solr.

Does anyone have experience with http://pecl.php.net/package/solr?
I know it depends on server-side installation and can't be an option for Drupal now, but could it become the main solution for PHP in the future?

nick_vh’s picture

This issue explains more:
#635510: Use PECL Solr client

Nothing stops you from using that, but JSON is still the safest and best-supported option. If you'd like to make this more dynamic, you could open a new issue and post patches.

j0rd’s picture

I'm going to chime in on this with a couple points. I'm currently writing a PHP to SOLR proxy for some custom stuff and I've been looking into what's best in the wt= formats for performance.

Performance

First things first, benchmarks from last year:
http://www.raspberry.nl/2012/02/28/benchmarking-php-solr-response-data-h...

Graph:
http://www.raspberry.nl/files/solr-decode-performance.html

As you can see, phps is miles ahead of all other formats, including eval(), which would be used with wt=php, and json_decode(), which is used with wt=json (what the apachesolr module currently uses).

If we care about speed, unserialize is the fastest (updated benchmarks with the latest versions of PHP and Solr 4.x would be appreciated).

CONCLUSION
unserialize via wt=phps is the fastest, by a long shot.
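
To get updated numbers on a current PHP version, a rough micro-benchmark along these lines could be used. This is only a sketch: the sample payload is a made-up stand-in for a Solr response, the function names are hypothetical, and json_decode() is called with the associative flag so both decoders produce the same array structure. No timings are shown here because they depend on the machine and PHP version.

```php
<?php
// Hypothetical stand-in for a Solr response: $numDocs docs with a short text field.
function build_sample_response($numDocs)
{
    $docs = array();
    for ($i = 0; $i < $numDocs; $i++) {
        $docs[] = array(
            'id' => 'doc-' . $i,
            'label' => 'Title ' . $i,
            'teaser' => str_repeat('lorem ipsum ', 20),
        );
    }
    return array('response' => array('numFound' => $numDocs, 'start' => 0, 'docs' => $docs));
}

// Run $decoder on $payload $iterations times; return elapsed seconds.
function time_decoder($decoder, $payload, $iterations)
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        call_user_func($decoder, $payload);
    }
    return microtime(true) - $start;
}

$data = build_sample_response(50);
$json = json_encode($data);   // what wt=json would roughly look like
$phps = serialize($data);     // what wt=phps would roughly look like

$decodeJson = function ($s) { return json_decode($s, true); };
$decodePhps = function ($s) { return unserialize($s); };

printf("json_decode: %.4f s\n", time_decoder($decodeJson, $json, 1000));
printf("unserialize: %.4f s\n", time_decoder($decodePhps, $phps, 1000));
```

This only measures decoding; it says nothing about network time or the rest of the page request.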

Security

I believe the reason we're not using unserialize is security.

I hate to break it to everyone here, but Drupal heavily uses unserialize all over the place on user submitted content.

cache_set / cache_get happen on every page request by default, and I do believe they will contain user-submitted input. Additionally, any third-party module can also call cache_set / cache_get and thus potentially have security implications (I'm looking at you, apachesolr module).

So if cache_set / cache_get are using them, why can't we? Does core do something with data sanitization that I don't know about? Is cache_get a major security issue in Drupal core that no one is talking about?

Additionally, the {users}.data field exists specifically to serialize / unserialize additional data stored alongside a user. No one is talking about security implications here.

CONCLUSION

Drupal heavily uses serialize & unserialize, on what I would expect to be user input, and thus if those are vulnerable, we have larger issues. If they're not vulnerable, then we could probably use them as well.

---

So personally I just wrote this because 1. I'd like to use what is fastest, and 2. I already know Drupal heavily uses serialize / unserialize. So if someone could explain to me why cache_get is not vulnerable but unserialize would be in the ApacheSolr module... then I'll crawl back into my troll hole.

I put a support request into the core issue queue, as I'm genuinely curious to find out:
#1992518: cache_get() uses unserialize(). Is this a security concern if user submitted data is unserialize().

----
From Solarium's code (which is the PHP solr backend we're using)

/**
     * Get responsewriter option
     *
     * Defaults to json for backwards compatibility and security.
     *
     * If you can fully trust the Solr responses (phps has a security risk from untrusted sources) you might consider
     * setting the responsewriter to 'phps' (serialized php). This can give a performance advantage,
     * especially with big resultsets.
     *
     * @return string
     */
    public function getResponseWriter()

I would assume most people are using their own Solr server, and thus wouldn't it be safe to trust your own server?

nick_vh’s picture

I'm not sure if I mentioned this here already but here is a breakdown of what is happening in a page load - focussed on apache solr

http://drupal.org/node/1616940

As you can see, the page took 200,000 microseconds to load.
The time to process the response was 15,000 microseconds. This is about 13% of the whole page request. I wonder if you could quickly hack something together that proves this change is significant in terms of performance. I have heard about this quite a lot, but no one has shown any data in the context of Drupal.

I suppose you could easily change the connection class and try it out?

j0rd’s picture

From my issue report in drupal issue queue.

you should not call unserialize on user supplied data, but it is fine if it is called on previously serialized data.

So

unserialize($_POST['foo']);
is bad.

unserialize(serialize($_POST['foo']));
is ok.

In the case of cache, serialization is done during set, so no raw userdata gets to unserialize.
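
A minimal sketch of the difference described above. The TempFile class is hypothetical and stands in for any class in a codebase that has a magic method such as __destruct(). The allowed_classes option only exists in PHP 7.0 and later, so it was not available on the PHP versions discussed in this thread.

```php
<?php
// Hypothetical class standing in for any class in the codebase with a magic method.
class TempFile
{
    public $path = '/tmp/example';
    public static $log = array();

    public function __destruct()
    {
        // A real class might unlink($this->path) here; this sketch only records the call.
        self::$log[] = 'would delete ' . $this->path;
    }
}

// Attacker-controlled string, as if read straight from $_POST['foo'].
$payload = 'O:8:"TempFile":1:{s:4:"path";s:11:"/etc/passwd";}';

// Bad: instantiates TempFile with attacker-chosen properties.
$obj = unserialize($payload);
unset($obj);                        // destructor runs with the attacker's path
print_r(TempFile::$log);

// OK: the input is serialized first, so it round-trips as a plain string.
$roundTrip = unserialize(serialize($payload));
var_dump($roundTrip === $payload);  // bool(true)

// Safer on PHP 7.0+: refuse to instantiate any class.
$safe = unserialize($payload, array('allowed_classes' => false));
echo get_class($safe), "\n";        // __PHP_Incomplete_Class
```

The cache case corresponds to the round-trip line: whatever the user submitted is stored as data inside the serialized string and never parsed as serialization syntax.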

So assuming you can trust the Solr server you're dealing with (though in practice you should never trust anything), we should be OK to use phps.

The blog and graph I posted are from the author of Solarium. He's added PHPS support to Solarium 3.x (I believe the apachesolr module is using 2.x). Solarium 3.x should be used with Solr 4. It's a bit of a pain to set up and requires PHP 5.3, as it uses namespaces. PHPS support is available in Solarium 3, but I believe it's turned off by default in favor of JSON.

I've also noticed a bug with Solr 4.x + JSON when using PseudoFields field aliases: single field values always get returned as arrays. I believe this is a bug in Apache Solr's JSON response writer, as it doesn't happen with PHPS. I asked in #solr on freenode but was unable to get a response.

Conclusion

Just wanted to post this for future people looking into the subject. The graphs show differences of milliseconds between json_decode and unserialize. unserialize support would require Solarium 3.x, which in turn requires PHP 5.3 with PSR-0 support, plus other dependencies. While unserialize is faster, you need to be able to trust your Solr server not to hack you; if you can, the security risk is minimal (I wouldn't trust, say, Acquia Search, but I would trust my own servers). And finally, support for PHPS should be optional and default to JSON anyway.
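
As a sketch of what "optional, defaulting to JSON" could look like in practice: the helper below is hypothetical and is not part of apachesolr or Solarium. It assumes the caller passes the same wt value it sent to Solr, and that wt=phps output consists only of arrays and scalars, which is why refusing all classes (PHP 7.0+) should be harmless.

```php
<?php
// Hypothetical helper: decode a raw Solr response body according to the
// response writer (wt) that produced it. JSON is the default.
function decode_solr_response($body, $wt = 'json')
{
    if ($wt === 'json') {
        return json_decode($body, true);
    }
    if ($wt === 'phps') {
        // Only reached when the site owner explicitly opted in.
        // allowed_classes requires PHP 7.0+ and blocks object instantiation.
        return unserialize($body, array('allowed_classes' => false));
    }
    throw new InvalidArgumentException('Unsupported response writer: ' . $wt);
}

// Usage with a made-up response structure.
$sample = array('response' => array('numFound' => 1, 'docs' => array(array('id' => 'a'))));

$viaJson = decode_solr_response(json_encode($sample));          // default: json
$viaPhps = decode_solr_response(serialize($sample), 'phps');    // explicit opt-in

var_dump($viaJson === $viaPhps);
```

Keeping the format behind one function means the rest of the code does not care which writer was used.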

shenzhuxi’s picture

For the performance:
I have written a minimalistic module for searching Solr with phps: https://github.com/shenzhuxi/solr (another module, solr_ajax, is also included for those who want the best performance without PHP).
After testing, I found no obvious performance difference (from query to generated HTML) between JSON and PHPS for regular use cases (<=3 keywords, <=50 docs per result page, with short text summaries) on PHP 5.4, because (I think) the bottleneck is PHP rather than Solr, as Nick_vh mentioned in http://drupal.org/node/1616940 (BTW, what tools did you use for this testing?).

In a project I worked on before, PHPS did show a performance advantage over JSON, because the queries were complicated and the text of each doc in the result was larger than in regular use cases. In addition, needing less format-conversion code on the PHP side was also a reason.

BTW, I'm not sure whether the performance of json_decode() and unserialize() changed in PHP 5.4.

For the security:
I agree with j0rd. The risk in the connection between Drupal and Solr does not come from serialize/unserialize itself, so it is not a reason to avoid PHPS.

For the codes:
I think the most important reason to use PHPS is that the code can be reduced a lot and becomes easier to maintain. This is obvious after comparing the common ways of using Solr from PHP, such as solr-php-client, Zend and Solarium.
Of course, this is no advantage for the Apachesolr and Facet API modules, which already exist.
It's more like Apache httpd and Nginx: there is little chance of redesigning the former, but it's nice to have the latter.