Okay, I have been trying to figure this out since yesterday. I have gsitemap (the CVS version works on 4.7) and urllist installed, but I get errors on both after repeated submissions to Google.

I have clean URLs and I can see both maps fine using my browser.

I have submitted the following sitemaps to Google - (their response in italics)

Definitions by Google (online here)

    URL not allowed - Your Sitemap contains a URL that is not allowed based on the Sitemap's location.
    Network unreachable - We encountered a network error when we tried to access the page.
    5xx error - See RFC 2616 for a complete list of these status codes. Likely reasons for this error are an internal server error or a server busy error. If the server is busy, it may have returned an overloaded status to ask the Googlebot to crawl the site more slowly. In this case, we'll return again later to crawl additional pages.

Here is my robots.txt, which I have checked with Google:

User-agent: *
Crawl-Delay: 10
Disallow: /aggregator
Disallow: /tracker
Disallow: /comment/reply
Disallow: /node/add
Disallow: /user
Disallow: /files
Disallow: /search
Disallow: /book/print
Disallow: /admin/

I saw a post about .htaccess and tried this with no success (placed at the top of the file)

AddType text/xml .xml 
AddType text/xml .xs

Comments

isaacbowman.com’s picture

Now Google is giving me a different error on /gsitemap - 5xx error Network unreachable

/?q=gsitemap still gets: URL not allowed (Line 1) with URL http://www.isaacbowman.com/ This url is not allowed for a Sitemap at this location. (I get this error for every page when I submit this sitemap.)

Isaac Bowman
www.isaacbowman.com

SamL-1’s picture

but could this be related to www.isaacbowman.com vs plain isaacbowman.com ?

pulsifer’s picture

Your sitemap was submitted for http://isaacbowman.com/, but every URL in the sitemap is for http://www.isaacbowman.com/
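
For anyone who wants to check this before resubmitting, here is a minimal sketch (not something gsitemap or Google provides) that lists every <loc> whose host differs from the host the sitemap was submitted under. The function name, the sample XML, and the /node/1 URL are made up for illustration; it matches <loc> by tag name only, so it should not depend on which sitemap namespace the module emits.

```python
from urllib.parse import urlsplit
import xml.etree.ElementTree as ET


def urls_outside_host(sitemap_xml, submitted_url):
    """Return the <loc> URLs whose host differs from the submitted sitemap's host."""
    expected = urlsplit(submitted_url).netloc.lower()
    mismatched = []
    for element in ET.fromstring(sitemap_xml).iter():
        # Strip any "{namespace}" prefix so the check is namespace-agnostic.
        if element.tag.rsplit("}", 1)[-1] != "loc":
            continue
        url = (element.text or "").strip()
        if urlsplit(url).netloc.lower() != expected:
            mismatched.append(url)
    return mismatched


# Hypothetical sample: every entry uses the www host.
SAMPLE = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>http://www.isaacbowman.com/</loc></url>"
    "<url><loc>http://www.isaacbowman.com/node/1</loc></url>"
    "</urlset>"
)

# Submitted without www: every URL is flagged.
print(urls_outside_host(SAMPLE, "http://isaacbowman.com/gsitemap"))
# Submitted with www: nothing is flagged.
print(urls_outside_host(SAMPLE, "http://www.isaacbowman.com/gsitemap"))
```

An empty list means the hosts line up; anything else is a URL Google would likely report as "URL not allowed".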

isaacbowman.com’s picture

I resubmitted with www and the sitemaps were accepted just minutes ago. I had read through Google's help and did not catch this issue. I am going to see if any other errors pop up after the spiders have finished crawling the site.

Thanks!

Isaac Bowman
www.isaacbowman.com

WeRockYourWeb.com’s picture

Hi Isaac,

If you haven't already done so, I would recommend redirecting your non-www pages to your www pages, so that both versions don't get accessed and indexed, which risks losing PageRank.

Add this to your .htaccess file right after the line #RewriteBase /drupal, and replace domain.com with your domain:

#custom redirects

RewriteCond %{ENV:REDIRECT_STATUS} =200
RewriteRule ^ - [L]

# Redirect non-www to www
RewriteCond %{HTTP_HOST} !^www\..*
RewriteRule ^.*$ http://www.domain.com%{REQUEST_URI} [R=permanent,L]

#end custom redirects
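
To make it clearer what the "Redirect non-www to www" pair above does, here is a rough Python sketch of the same decision. It is only an illustration of the logic, not how Apache evaluates rules; the function name and the example hosts are hypothetical, and domain.com is the same placeholder as in the rule.

```python
import re


def non_www_redirect(host, request_uri, canonical="www.domain.com"):
    """Return the permanent-redirect target for a host, or None if no redirect applies."""
    # Mirrors: RewriteCond %{HTTP_HOST} !^www\..*
    if re.match(r"^www\..*", host):
        return None
    # Mirrors: RewriteRule ^.*$ http://www.domain.com%{REQUEST_URI} [R=permanent,L]
    return "http://" + canonical + request_uri


print(non_www_redirect("domain.com", "/node/1"))       # redirected to the www host
print(non_www_redirect("www.domain.com", "/node/1"))   # None: already on www
print(non_www_redirect("blog.domain.com", "/node/1"))  # also redirected to www
```

Note the last case: the condition matches any host that does not start with "www.", so other subdomains pointing at the same site would be redirected to www as well.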

Hope this helps,
Alex

Contract Web Development

plux-1’s picture

I've been following this thread and some other similar ones. I've made the changes to the rewrite rules in .htaccess, mod rewrite is running, and as one post suggested, I made a url alias of sitemap.xml to gsitemap because google seemed to prefer it.

I still get the error "This is not a valid URL. Please correct it and resubmit" from Google Sitemaps. The URLs in the XML are in the format /home, for example. I don't know if Google is looking for a more complete address or for /home/.

The generated sitemap is at http://www.251northriverroad.com/sitemap.xml
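
A likely cause, for what it's worth: the sitemap protocol expects each <loc> to be a full URL that starts with the protocol and includes the host, so a bare /home would not validate. The sketch below shows one way to test an entry and to build the full form. Both helper names are made up for illustration, and how the module itself should be configured to emit full URLs is not covered here.

```python
from urllib.parse import urljoin, urlsplit


def is_fully_qualified(url):
    """True when the URL carries both a protocol and a host, as a <loc> entry should."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def absolute_loc(site_base, path):
    """Resolve a site-relative path such as /home against the site's base URL."""
    return urljoin(site_base, path)


print(is_fully_qualified("/home"))
print(absolute_loc("http://www.251northriverroad.com/", "/home"))
```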

WeRockYourWeb.com’s picture

Never mind - I just answered my own question - I see that I can specify by editing a content item whether or not it appears in the sitemap :)

Thanks for an awesome module!!

Alex

Contract Web Development

ideviate’s picture

hi,
Why Disallow: /admin/ but Disallow: /files?
When do we use the trailing slash?

powered by Drupal www.universideliyiz.biz

WeRockYourWeb.com’s picture

Hi There,

You shouldn't have to disallow /admin/ - that is blocked automatically from my understanding. /files is meant for your files, kept separate from Drupal core files. You might want to read this on keeping your Drupal site tidy.

As for the trailing slash in robots.txt, I believe the difference is that /files/ will block everything within the directory /files/, whereas /files (no trailing slash) will also block the filename /files and anything else whose path begins with /files.
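
The difference between /files and /files/ can be tried out with Python's standard urllib.robotparser, which does plain prefix matching on Disallow paths. This is a small sketch; the blocked() helper and the example paths are made up, and individual crawlers may interpret rules slightly differently.

```python
from urllib.robotparser import RobotFileParser


def blocked(rule, path, agent="*"):
    """Return True if a single 'Disallow: <rule>' line blocks the given path."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", "Disallow: " + rule])
    return not parser.can_fetch(agent, path)


for rule in ("/files", "/files/"):
    for path in ("/files", "/files/logo.png", "/files.html"):
        print(rule, path, blocked(rule, path))
```

With the rule /files, all three paths are blocked; with /files/, only /files/logo.png is.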

Cheers,
Alex
----------
Contract Web Development