Does Drupal get indexed by Google?!!

By philipk on 21 May 2006 at 18:44 UTC

I have no idea why my site isn't being indexed :(

The site is here:
http://www.playstationteam.com

http://www.google.com/search?q=site%3Aplaystationteam.com

I've used clean URLs as well and got a few inbound links.

Although when I log in to the admin area I see lots of 'robots.txt not found.' by Anon... could this have something to do with it?

Comments

It could take some time to

potential commented 21 May 2006 at 18:59

It could take some time to get indexed by Google. Installing the Google Sitemap module may help get more of your pages indexed.

Robots.txt tells a search engine not to spider your site. Since you have page not found errors for robots.txt, that is a good thing.

I read somewhere that Google

rivena commented 21 May 2006 at 19:14

I read somewhere that Google also has some guidelines for webmasters... those may be useful to follow. :)

Anisa.

-----------------------------------------------------------
Kindness builds stronger bonds than necessity.

www.animecards.org - 16,000 card scans and counting!
-----------------------------------------------------------

"Robots.txt tells a search engine not to spider your site."
That's a bit of a simplistic explanation. You can put an empty robots.txt file in place to remove the entries in the watchdog, and your site will be indexed just as well.
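The point about an empty robots.txt can be checked with the parser in Python's standard library. This is only a sketch: the `is_allowed` helper and the sample path are made up for illustration, and real crawlers may treat edge cases differently.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, path: str, agent: str = "Googlebot") -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


EMPTY = ""                                  # an empty file: no rules at all
BLOCK_ALL = "User-agent: *\nDisallow: /\n"  # explicitly blocks every path

print(is_allowed(EMPTY, "/node/1"))      # True: nothing is disallowed
print(is_allowed(BLOCK_ALL, "/node/1"))  # False: everything is disallowed
```

So robots.txt only keeps crawlers out when it contains Disallow rules. An empty file behaves like no restriction at all.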

especially for new sites

harry slaughter

Omaha NE USA

commented 15 November 2006 at 20:15

My devbee site was brand new in March. It took 4 months for the site to be included in search results.

Now, thanks to Drupal (and probably gsitemap), I have spectacular search placement.

Drupal sites are *extremely* Google-SEO friendly.

--
Drupal tips, tricks and services
http://devbee.com/ - Effective Drupal

--
Devbee - http://devbee.net/

obviously yes

chx commented 21 May 2006 at 20:42

drupal.org is just a Drupal install just like any other. Googlebot eats quite a lot of resources around :)

and also, do you think Drupal would be popular if it somehow blocked Googlebot??
--
The news is Now Public | Drupal development: making the world better, one patch at a time.

--
Drupal development: making the world better, one patch at a time. | A bedroom without a teddy is like a face without a smile.

Drupal works wonders when it

ericatkins commented 21 May 2006 at 21:27

Drupal works wonders when it comes to getting into Google's index. Just check out drupal.org's indexing:

http://www.google.com/search?q=site:drupal.org

Over 2,300,000 results!

Google and Drupal

Z2222 commented 21 May 2006 at 23:18

The Gsitemap module is good. I'm not sure if it is necessary, but I make a URL alias from 'gsitemap' to 'sitemap.xml' since I think that is where Google normally expects the sitemap to be. Then go here and make an account so you can see if Google has any spidering errors on your site: http://www.google.com/webmasters/sitemaps/

Google often takes 6 to 8 months to include your site in the results. It looks like they have some of your pages in their index already, but aren't showing a lot of them. It looks like they are indexing both the www version and non-www version of your domain name. You could add this to your .htaccess file (somewhere under RewriteEngine On), and only use the www version of your domain name. In your post above, you used the www version for your link to your site, but the non-www version when you linked to the Google cache. They are different.

This will tell the search engines that only the www version of your domain exists -- replace the your_site with your domain name:

#Changes to www form of domain
RewriteCond %{HTTP_HOST} ^your_site.com
RewriteRule (.*) http://www.your_site.com/$1 [R=301,L]

A robots.txt file like this is handy if you want to keep the search engines out of certain places on your site. For example, you may not want them indexing both the regular version of a page and its print-friendly version. I found this somewhere on Drupal.org:

User-agent: * 
Disallow: /aggregator
Disallow: /tracker
Disallow: /comment/reply
Disallow: /node/add
Disallow: /user
Disallow: /files
Disallow: /search
Disallow: /book/print
Disallow: /admin
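To see what a rule set like this actually blocks, it can be fed to Python's standard-library robots.txt parser. This is a sketch with made-up sample paths. It also shows that rules are plain prefix matches, so `Disallow: /user` catches any path that merely starts with `/user`.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /aggregator
Disallow: /tracker
Disallow: /comment/reply
Disallow: /node/add
Disallow: /user
Disallow: /files
Disallow: /search
Disallow: /book/print
Disallow: /admin
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Hypothetical paths, chosen only to exercise the rules above.
SAMPLE_PATHS = ("/node/1", "/book/12", "/book/print/12", "/user/login", "/users-guide")

results = {path: parser.can_fetch("Googlebot", path) for path in SAMPLE_PATHS}
for path, ok in results.items():
    print(f"{path:16} {'allowed' if ok else 'blocked'}")
```

Regular content such as `/node/1` stays crawlable, while the print-friendly copy under `/book/print` is blocked. A page named `/users-guide` would be blocked too, which may or may not be what you want.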

I'll just wait and see I

philipk commented 22 May 2006 at 13:37

I'll just wait and see I guess...

One of the reasons for using Drupal was its SEO features. I just hope it's all spidered in time for the PS3 launch.

Evidence of Drupal's Search Engine Friendliness

cmsproducer commented 22 May 2006 at 14:09

We had a similar discussion in the past and have some interesting information detailing various experiences that prove the point that a well configured Drupal set up is well indexed by Google and other major search engines.

http://drupal.org/node/20033
http://www.cmsproducer.com/search-engine-optimization-seo-google-msn
http://drupal.org/node/36726

-----
iDonny - Web Content Management System Design, Development. & CRM

mine is well configured :P

philipk commented 22 May 2006 at 16:48

mine is well configured :P

SEO and Drupal

Z2222 commented 22 May 2006 at 19:41

Drupal is very search engine friendly. If you put those lines in your .htaccess file it should fix the Google spidering errors.

Good pointers

cmsproducer commented 24 May 2006 at 15:04

This is useful information to prevent the indexing of some unnecessary stuff like the links to RSS feeds as regular pages, or the comment links. Also, using a consistent domain format (with/without www) makes sure that the Google index does not think that you are posting duplicate content. (You can also use the Drupal multi-site feature to point all your domain paths to one of them.)
-----
iDonny - Web Content Management System Design, Development. & CRM

Thank you for this post

potential commented 26 May 2006 at 03:01

Thank you for this post Guitarmiami. A few dozen of my pages had been indexed in Google two months ago. Then, after I read this thread, I checked again and all had been removed! I believe this happened because both the www and non-www versions of the site were accessible, and Google determined that the sites had duplicate content. Drupal, by default, lets one access both versions of a website. Therefore, MAKE SURE you put this rewrite code into your .htaccess file or you may encounter the same problem.

#Changes to www form of domain
RewriteCond %{HTTP_HOST} ^your_site.com
RewriteRule (.*) http://www.your_site.com/$1 [R=301,L]

Different rewrite code for me

marcoBauli commented 16 October 2006 at 13:05

the code for redirection found above does not work for me. I had to insert the following instead:

 RewriteCond %{HTTP_HOST} !^www\.mysite\.com$ [NC]
 RewriteRule .* http://www.mysite.com%{REQUEST_URI} [L,R=301]

could anyone please tell if this code is good anyway for SEO purposes?

thx

redirects

Z2222 commented 16 October 2006 at 18:14

I learned a better way to do it than in my earlier post.

If you want it to redirect to the "www" version, use this:

  RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
  RewriteRule (.*) http://www.example.com/$1 [R=301,L]

If you want the "no-www" version, try this:

  RewriteCond %{HTTP_HOST} !^example\.com$ [NC]
  RewriteRule (.*) http://example.com/$1 [R=301,L]
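A rough way to see how the negated condition (`!^...$`) above behaves is to model it in Python. This is a simplified sketch, not how Apache evaluates rules internally: `redirect_target` is a hypothetical helper, and `path` is the request path without its leading slash, as the RewriteRule sees it in .htaccess.

```python
import re
from typing import Optional


def redirect_target(host: str, path: str, canonical: str) -> Optional[str]:
    """Mimic `RewriteCond %{HTTP_HOST} !^<canonical>$ [NC]` plus the RewriteRule.

    Returns the 301 target, or None when the host is already canonical.
    """
    is_canonical = re.fullmatch(re.escape(canonical), host, re.IGNORECASE) is not None
    if is_canonical:
        return None  # the negated condition fails, so there is no redirect and no loop
    return f"http://{canonical}/{path}"


# "www" variant: every other host name is sent to www.example.com
print(redirect_target("example.com", "page1", "www.example.com"))
print(redirect_target("www.example.com", "page1", "www.example.com"))

# "no-www" variant: every other host name is sent to example.com
print(redirect_target("www.example.com", "page1", "example.com"))
```

Because the condition only fires when the host is not the canonical one, a request that already uses the canonical host passes through untouched, and the requested page is kept in the redirect.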

my robots.txt

twohills commented 26 May 2006 at 04:20

User-agent: *
Crawl-Delay: 10
Disallow: /aggregator/
Disallow: /tracker/
Disallow: /comment/reply/
Disallow: /node/add/
Disallow: /taxonomy/
Disallow: /user/
Disallow: /files/
Disallow: /search/
Disallow: /book/print/
Disallow: /database/
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /sites/
Disallow: /themes/
Disallow: /admin/

User-agent * and Googlebot

cmsproducer commented 26 May 2006 at 13:09

In addition to the above robots.txt specification, I keep another set of specifications just for googlebot
User-agent: Googlebot
In my case, it's a duplicate of the generic specifications but I maintain the duplicate for googlebot just to make sure that it sees my specifications every time.

You can use the engine-specific declarations to cater for engine particularities (some engines may or may not support given features such as URLs with a query string, or the presence of a session ID in the URL, that is, if you cannot use cookies to maintain state)

-----
iDonny - Web Content Management System Design, Development. & CRM
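Keeping the Googlebot section a full duplicate matters more than it may seem: a crawler that finds a group naming it specifically generally follows only that group and ignores the `*` group. Python's standard-library parser behaves the same way, so it can illustrate the effect. The rules and the `OtherBot` name below are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: the Googlebot group is NOT a full copy of the generic one.
RULES = """\
User-agent: *
Disallow: /admin
Disallow: /search

User-agent: Googlebot
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Googlebot matches its own group, so the generic /admin rule no longer applies to it.
print(parser.can_fetch("Googlebot", "/admin"))   # True
print(parser.can_fetch("Googlebot", "/search"))  # False

# Any other crawler falls back to the generic group.
print(parser.can_fetch("OtherBot", "/admin"))    # False
```

So if you add an engine-specific section, repeat every rule you still want that engine to obey.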

robots.txt

Z2222 commented 27 May 2006 at 17:01

Drupal should already block access to the contents of directories like /database, /modules, /includes, etc. You will already get a 404 error if you try to access a URL like example.com/includes/common.inc.

If you look at the default .htaccess file (4.7.1) it includes this:

# Protect files and directories from prying eyes.
<FilesMatch "(\.(engine|inc|install|module|sh|.*sql|theme|tpl(\.php)?|xtmpl)|code-style\.pl|Entries.*|Repository|Root)$">
  Order deny,allow
  Deny from all
</FilesMatch>

Robots.txt advertises your directory structure to the world, so better not to put more in there than necessary. In this case it doesn't really matter because anyone can download Drupal and find out your directory structure, but it's something to keep in mind about robots.txt in general.

How many nodes does google know about? oh about 25million

dgtlmoon commented 24 May 2006 at 04:01

http://www.google.com.au/search?hs=V3j&hl=en&safe=active&client=firefox&...

Results 1 - 10 of about 25,000,000 for inurl:node/1..999999999 with Safesearch on. (0.34 seconds)

google knows about 25 million drupal nodes? cute :D

and considering...

rcross commented 24 May 2006 at 08:49

and considering how many people use path_auto.module, and thus don't have links to any "node/*" pages, the count is probably much much higher.

It might be fitting to use a comment from someone once I explained and showed drupal to them -
"drupal should run the internet!!"

--Ryan

Ryan Cross
James Cross Construction Services
Project Management Software

The code to only use your

StevenSokulski commented 25 August 2006 at 22:43

The code to only use your site URL with or without the 'www' (never both ways) is already in the .htaccess file. Starting around line 54 there is some info on it and the option to uncomment code to either use only with 'www' or only without 'www', so it's literally a 30 second fix.

.htaccess

Z2222 commented 27 August 2006 at 23:04

That default .htaccess code in Drupal 4.7 is not good to use. I reported it as a bug, but never heard anything more about it. I wish someone would fix it.

If you uncomment that default .htaccess code, it will rewrite something like example.com/page1 to www.example.com (the home page), when it should rewrite to www.example.com/page1.

This is the correct code to use:

# This is the better way to do it:
RewriteCond %{HTTP_HOST} ^example.com
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

This default version from 4.7.3 is not good. I highly recommend not using it because it will not redirect correctly:

  # This is NOT a good way to do it:
  # RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
  # RewriteRule .* http://www.example.com/ [L,R=301]

Thankyou so much

twohills commented 28 August 2006 at 02:28

I had shelved this until I had time to nut out the correct syntax. Once again the Drupal community contributes and saves me time. Thanks again.
Not entirely sure about your RewriteCond regex though?? Needs a "\" before the "." eh? And what does the "$1" in the RewriteRule refer back to?

$1

Z2222 commented 8 September 2006 at 00:25

The $1 refers back to the first set of parentheses in a regular expression.

In the default Drupal .htaccess you have this problem:

http://example.com/page1 redirects to http://www.example.com/ -- this is a bad redirect because it doesn't take the visitor (or search engine) to the correct page.

The $1 adds the correct page:

http://example.com/page1 redirects to http://www.example.com/page1
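The difference between the two rules can be mimicked with Python's `re` module. This is a simplified sketch: `rewrite` is a hypothetical helper that models only the `$1` backreference, and the path is given without its leading slash, as a RewriteRule sees it in .htaccess.

```python
import re


def rewrite(path: str, pattern: str, substitution: str) -> str:
    """Apply one RewriteRule-style substitution to a request path."""
    match = re.match(pattern, path)
    if match is None:
        return path  # the rule does not apply
    # $1 is whatever the first set of parentheses captured (empty if there is none).
    first_group = match.group(1) if match.re.groups >= 1 else ""
    return substitution.replace("$1", first_group)


# Default Drupal 4.7 rule: no parentheses and no $1, so the page is lost.
print(rewrite("page1", r".*", "http://www.example.com/"))
# -> http://www.example.com/

# Rule with a capture group and $1: the page is carried over.
print(rewrite("page1", r"(.*)", "http://www.example.com/$1"))
# -> http://www.example.com/page1
```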

Thanks. You are totally

StevenSokulski commented 28 August 2006 at 22:20

Thanks. You are totally right. Hadn't noticed that. So to use the non-www version I would place this in my .htaccess:

# This is the better way to do it:
RewriteCond %{HTTP_HOST} ^example.com
RewriteRule (.*) http://example.com/$1 [R=301,L]

Correct?

Shouldn't it be: # This is

alliax commented 28 August 2006 at 23:37

Shouldn't it be:

# This is the better way to do it:
RewriteCond %{HTTP_HOST} ^example\.com
RewriteRule (.*) http://example.com/$1 [R=301,L]

Backslash

Z2222 commented 8 September 2006 at 00:24

Yeah, that is probably better.
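Why the backslash helps can be shown with a quick regex test. The sketch below uses Python's `re` module, which treats these simple patterns the same way, and the host names are made up. An unescaped dot matches any character, and without a closing `$` the pattern also matches longer host names.

```python
import re

LOOSE = re.compile(r"^example.com")                      # "." matches any character, no end anchor
STRICT = re.compile(r"^example\.com$", re.IGNORECASE)    # literal dot, anchored, case-insensitive like [NC]

HOSTS = ("example.com", "exampleXcom", "example.com.evil.test", "EXAMPLE.COM")

for host in HOSTS:
    loose = bool(LOOSE.search(host))
    strict = bool(STRICT.search(host))
    print(f"{host:24} loose={loose!s:5} strict={strict}")
```

The loose pattern accepts `exampleXcom` and `example.com.evil.test` but misses `EXAMPLE.COM`, while the strict one matches only the intended host name in any letter case.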

Errors

potential commented 18 September 2006 at 03:45

These are giving me redirect errors. Here is how I have it now, redirecting to the non-www:

   RewriteCond %{HTTP_HOST} !^example\.com$ [NC]
   RewriteRule ^(.*) http://example.com/$1 [L,R=301]

Is this correct? I have been having trouble getting indexed properly, so I'd like to make sure.

Thanks.

error

Z2222 commented 22 September 2006 at 16:36

There is an error. You are telling the server to redirect http://example.com to http://example.com (same url).

Use this and you should be fine:

RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

Then test it. You should get the following redirect behavior:

http://example.com redirects to http://www.example.com
http://example.com/page1 redirects to http://www.example.com/page1

What differences would have

StevenSokulski commented 23 September 2006 at 16:37

What differences would have to be made so that instead of redirecting to www.example.com it redirects to example.com sans www?

Thanks.

Drupal SEO

Z2222 commented 23 September 2006 at 21:06

If you want it without the www, then use this:

RewriteCond %{HTTP_HOST} ^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://example.com/$1 [R=301,L]

More info here:
http://tips.webdesign10.com/drupal-seo-404-ok-and-htaccess

guide to setup drupal to be search engine friendly

jt6919 commented 8 September 2006 at 14:55

here's the guide I wrote for making your drupal site friendly for search engine crawlers:
http://www.smorgasbord.net/how_to_optimize_drupal_web_site_for_google_ya...

it certainly made my page indexing and page ranking higher.

I can't help with all of

pcoskat commented 16 October 2006 at 18:30

I can't help with all of that code stuff, but getting indexed might be helped if you're backlinked from a site that is regularly indexed by Google.

Currently I have a site that the spiders LOVE. When I launch a new site, I add a link at the old site, and the new site ends up in Google by the end of the week.

Can´t get google to index my site

abu2000 commented 10 November 2006 at 21:43

I submitted my site to Google several months ago. I got the home page indexed, but that is as far as I got. I have the gmap module and clean URLs, and have read all the suggestions on this page. I cannot get the rest of the pages indexed. I recently set up another site and in just a few weeks the whole thing was indexed, but I can´t figure out what I did differently. As a matter of fact, the site that gets indexed is almost an out-of-the-box Drupal installation; it doesn´t even have the gsite module.

I must say that it could be because the site that gets indexed has a link on a heavily visited site, but can it make that much of a difference?

My site is

http://www.100x100electronica.com.ar

the other site that indexes is

http://comunidad.demotores.com.ar

Thank you in advance for any help.

http://www.100x100electronica.com.ar

yes, links make that much of a difference!

jt6919 commented 16 November 2006 at 17:07

Yes, the links make a major difference! However, read the 10 Step Search Optimization Guide I just posted in another issue - because it exactly addresses your problem.

Also - two other guides / pages I've written that can help you are:
http://www.smorgasbord.net/how_to_optimize_drupal_web_site_for_google_ya...
and
http://www.search-optimization-school.com

Links

Z2222 commented 25 November 2006 at 01:40

Links pointing to your site from other related sites are very important.

Also, did you install the Nodewords module? Install that and then make sure that the meta description is different on every page. It looks like your page is outputting n/d for the meta description on many (all?) pages.

I will check meta descriptions

abu2000 commented 5 December 2006 at 13:42

I have the nodewords module installed. I will check the settings to see what is going on.

Thanks guitarmiami!!

http://www.100x100electronica.com.ar

I can´t get meta description to work

abu2000 commented 5 December 2006 at 14:44

I have nodewords installed and can´t get it to insert the description on every node. I have enabled "Use the teaser of the page if the meta description is not set." but I get "n/d" in the meta description. If I don´t enable it, I don´t get a description at all. Where is the teaser of the page? Is it generated automatically, or do I have to write it? Could it be that I have the category module installed and it is not compatible with nodewords?

Thanks

http://www.100x100electronica.com.ar

Spidering your site

TheWhippinpost commented 5 December 2006 at 14:55

You have in excess of 3000 links which, though not a problem specifically, might have pushed the site into an aggressive (Google) filter if the site just popped-up from nowhere... best to grow a site organically rather than just appear in an "un-natural" big bang.

In addition, there are 130 broken links.

What would really worry me, is the following link:

http:/ /www. 100x100electronica.com.ar/node (Note how I didn't make that a link!)

I expected to see this, and I wasn't "disappointed" - This is a problem for many Drupal sites... if my spider can pick it up, then Google will too.

It is a doorway into supplemental hell, ie... Duplicate content!

I don't know which of your pages linked to it but if I were you, I'd plug it as soon as possible, either by hacking the module responsible for producing it, and/or, use absolute links... oh, and the robots.txt file.

(ADDED: Ah yes, it's the breadcrumb - Click on a category page and mouse-over your "Principal" breadcrumb; there you will see:

http ://www. 100x100electronica.com.ar/node

I notice also that the "categorias" has been duplicated: http :// www. 100x100electronica.com.ar/ categorias-0 - Note the "-0" which is a classic indication.

So, if Google finds a link to the intended link, http ://www. 100x100electronica.com.ar/categorias, you'll see that on your site, there is no content at that page.

I also noticed that "Camaras digitales" has the same "-0" too.

It can get ugly when you have all this to contend with, best of luck)

Mike
------------------------------------------------------------------------------------------
A simple thanks to those that help, a price worth paying for future wealth.

Can u clarify a bit please?

twohills commented 6 December 2006 at 08:15

Hi Mike

I don't understand much of what you say :-)

Are you saying there is something wrong with having ".../node" links? What do you mean by duplicate content?

Cheers

Are you saying there is

TheWhippinpost commented 6 December 2006 at 16:05

Are you saying there is something wrong with having ".../node" links? What do you mean by duplicate content?

Hi Twohills

OK, in a nutshell, Google uses their PageRank system to determine a site's authority. As part of that, they clearly have to make a decision as to what content should rightfully be credited with that authority. One of the biggest factors that play a part in this calculation is, links... mainly inbound, but also internal too.

Google also has a problem: spammers... not just spammers but also "innocent" people who might unwittingly find your content sufficiently interesting to copy and publish in a forum/website, or wherever. Clearly, Google has to then try and make sure that your content is still correctly credited, and not the copy.

Again, part of how they do that is, links (and age).

Keep in mind that all this is calculated by computers... and computers are dumb.

The link I highlighted that has "/node" appended points to your home page... but everyone can also get to your home page via: www. 100x100electronica.com.ar - Clearly, with two different paths to the same content, we have a duplication issue: Google (or any of today's influential search engines (SE's)) has to decide which is the credible authority and which gets "kicked-to-the-side" (it doesn't want it indexed twice, after all).

Having a duplication issue with a home page is the worst-case scenario, as it sets the path into your site. This problem is compounded even further if your site has relative internal links and other duplicate paths, such as, in your case, the "/categorias-0" link I mentioned.

I said "worst-case scenario", didn't I!... well, not quite. Drupal also allows an even worse path trail which makes the problem exponentially more "dangerous" - I've raised it elsewhere, here is not the place.

All the above is why I daren't include the "/node" link to your site from here for fear that Google would crawl it.

I hope that is a little clearer.

Mike
------------------------------------------------------------------------------------------
A simple thanks to those that help, a price worth paying for future wealth.

Found the solution to /node

abu2000 commented 7 December 2006 at 13:23

Thanks a lot. I have been seeing for quite some time that Google was spidering my home/node and had no idea where it was picking it up. Damn breadcrumbs. It seems to be a problem with Drupal 4.7. I applied the patch and now it is working fine. Hope this helps:

You can find the solution here: http://drupal.org/node/78129

Thanks all for your help :-)

http://www.100x100electronica.com.ar
