As noted in the discussion of the workaround, the default settings.php file that ships with Drupal does not always prevent PHP from adding PHPSESSID to URLs.

Although this is really a hosting configuration issue, a workaround exists for it.

A patch is attached; it just adds the following line to settings.php:

ini_set('url_rewriter.tags', '');
Attached file: settings_0.patch (576 bytes, uploaded by kbahey)

Comments

dries’s picture

If this mechanism kicks in, your session ID isn't shared but the sessions probably won't work either. Is this a good idea?

kbahey’s picture

I am not sure which is the lesser evil: not having sessions or having session IDs in the URL.

Perhaps we can add it in the settings.php as a comment ("If you still have the sessions in the URL then try this" kind of thing).

killes@www.drop.org’s picture

I think sessions would still work. I am not convinced we should add this to settings.php, though. Maybe as an uncommented option?

bradlis7’s picture

I think sessions normally use a cookie to store the session ID, but for browsers and search engines without cookie support, "?PHPSESSID=..." is added to the URL to carry it instead. The big problem I have with PHPSESSID is that it shows up in search results. Is there any way to tell whether the visitor is a web crawler?

markus_petrux’s picture

Search engines are identifiable by their user agent string, but this information can be hijacked.

1) You have to maintain a list of the user agent strings used by countless search engines.
2) You have to maintain a list of the IPs that each search engine uses, to prevent hijacking.
3) You have to find a way to keep all this information up to date.

Ugh!

Maybe the browscap module can help with the user agent strings, but what about the IPs?

kkobashi’s picture

I don't understand the concern about hijacking/spoofing. Does it really matter if a user agent claims to be GoogleBot or something else?

Any type of user agent can visit a Drupal site. It can claim to be Mickey Mouse, Madonna or Britney Spears. Do we really care, if it hasn't logged in and been validated? There are infinitely many possible user agents. By default, access to Drupal is not based on the user agent, but on user authentication.

In the case of search engine crawlers, transparent session IDs are showing up in search listings. I know because this problem keeps showing up in my own results. And I can tell you that I have tried every suggestion to date and none have worked. My SERPs show PHPSESSID, and it's a real bitch to get those out of the index, let alone to stop the crawlers from creating them.

This PHPSESSID problem can come from anywhere: a server configuration issue, a PHP issue, or a Drupal issue. The trouble is that it is very difficult to test and nail down, because you won't see the results until the crawlers have spidered your site and made it publicly available, several weeks later.

Why can't we look at the HTTP_USER_AGENT string before opening a session? If the string contains a user agent that we identify in our list, then don't open the session. There is only one session_start() call in the entire Drupal source code, and that is in session.inc, so it is very well isolated.

Perhaps Dries can look into this and analyze the cases of doing this:

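// "google", "lycos", whatever: the list is only an illustration.
$crawlers = array('google', 'lycos');
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? strtolower($_SERVER['HTTP_USER_AGENT']) : '';
foreach ($crawlers as $crawler) {
  if (strpos($agent, $crawler) !== FALSE) {
    return;
  }
}
// else go on our merry way like we have been doing
session_start();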

markus_petrux’s picture

I believe it is not as easy as bypassing session_start(). You would also have to do something about the code that accesses the $_SESSION array. And you would have to maintain the list of user agents used by search engines: even if you don't care about those strings being used by anyone else, they may change, new crawlers may appear, etc.
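
To make the $_SESSION point concrete, here is a minimal sketch, not taken from Drupal, of what skipping the session for crawlers might look like. The helper name example_is_crawler() and the two-entry crawler list are hypothetical and only illustrative. The sketch assumes that handing later code an empty $_SESSION array is acceptable for requests that never get a session.

```php
<?php
// Hypothetical helper: TRUE if the user agent contains any listed substring
// (case-insensitive match).
function example_is_crawler($agent, $crawlers) {
  foreach ($crawlers as $crawler) {
    if (stripos($agent, $crawler) !== FALSE) {
      return TRUE;
    }
  }
  return FALSE;
}

// Illustrative list only; a real list would need ongoing maintenance.
$crawlers = array('googlebot', 'lycos');
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

if (example_is_crawler($agent, $crawlers)) {
  // No session is started, so give later code an empty array to read from.
  $_SESSION = array();
}
else {
  session_start();
}
```

In the crawler branch, anything written to $_SESSION is simply lost at the end of the request, so code that relies on session data surviving between requests would still need to be reviewed.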

Default settings.php includes the following statements:

ini_set('arg_separator.output',     '&');
ini_set('magic_quotes_runtime',     0);
ini_set('magic_quotes_sybase',      0);
ini_set('session.cache_expire',     200000);
ini_set('session.cache_limiter',    'none');
ini_set('session.cookie_lifetime',  2000000);
ini_set('session.gc_maxlifetime',   200000);
ini_set('session.save_handler',     'user');
ini_set('session.use_only_cookies', 1);
ini_set('session.use_trans_sid',    0);
ini_set('url_rewriter.tags',        '');

That is, with these settings the PHPSESSID variable is not appended to URLs, even if cookies are not supported by the user agent.

I've seen many Drupal-based sites indexed by Google that do not have the SID.
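
Since the original report is that these settings do not always take effect on every host, one way to check is to read the values back at runtime. The following is only a sketch: it assumes it is run in the same PHP environment as the site and after settings.php has been loaded. The choice of which three directives to print is mine.

```php
<?php
// Sketch: print the effective values of the settings that control
// whether PHP adds the session ID to URLs.
$names = array(
  'session.use_only_cookies',
  'session.use_trans_sid',
  'url_rewriter.tags',
);
foreach ($names as $name) {
  // ini_get() returns the current value as a string, or FALSE if the
  // directive does not exist in this PHP build.
  print $name . ' = ' . var_export(ini_get($name), TRUE) . "\n";
}
```

If a value differs from what settings.php sets, the host may not allow that directive to be changed at runtime. ini_set() itself returns FALSE when a change is refused, so its return value can be checked as well.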

magico’s picture

Version: 4.6.0 » 4.6.9
Component: other » base system

Still a problem in 4.6.9?

magico’s picture

Version: 4.6.9 » x.y.z
Status: Needs review » Active

I don't know if this issue was solved in 4.7 or HEAD. But because I think it's important, I'm pushing this to HEAD to confirm its status.

magico’s picture

Version: x.y.z » 4.6.9
Status: Active » Fixed

Never mind, it seems it was applied a long time ago!

magico’s picture

Version: 4.6.9 » 4.7.3

Fixed in 4.7.x but can be used in 4.6.9 if someone has problems.

Anonymous’s picture

Status: Fixed » Closed (fixed)