Are there any plans to make XML Sitemaps follow instructions laid down in the robots.txt?

For instance I have just excluded a taxonomy term - Disallow: /promotional/

It would be great if this was reflected in the sitemap, as it will otherwise confuse the Google bot.

I see that you are constantly developing this module, so keep up the good work.

Comments

avpaderno’s picture

Title: Intergrating robots.txt with XML Sitemaps » Exclude URLs that are not allowed in robots.txt

XML Sitemap already provides a way to exclude a taxonomy term from the site map; it's enough to select Not in site map in the vocabulary, or in the term form.
The same is true for nodes, or content types.

I am changing the title to avoid confusion with another request already present.

AndyW’s picture

Yeah, my throwing in taxonomy was misleading.

One of the features of xml-sitemaps.com is that it will not list files, directories and paths forbidden in robots.txt and I was wondering if something similar will be done for XML Sitemaps (Drupal)?
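To make the request concrete, here is a minimal sketch of URL-level filtering, written in Python with the standard library's `urllib.robotparser` rather than as Drupal code. The function name `filter_allowed`, the sample robots.txt content, and the example.com URLs are all assumptions made for illustration; this is not how xml-sitemaps.com or XML Sitemap actually implements anything.

```python
from urllib.robotparser import RobotFileParser


def filter_allowed(urls, robots_txt, user_agent="*"):
    """Return only the URLs that robots_txt permits for user_agent.

    Hypothetical helper: robots_txt is the raw text of the file.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Assumed sample data, mirroring the rule mentioned in the original post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /promotional/
"""

candidate_urls = [
    "http://example.com/node/1",
    "http://example.com/promotional/summer",
]

# Only URLs outside /promotional/ would be kept for the sitemap.
allowed = filter_allowed(candidate_urls, ROBOTS_TXT)
```

Something along these lines would run over the final list of URLs just before the sitemap is written.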

avpaderno’s picture

Component: xmlsitemap_term.module » Code

That is different, as in that case they work on URLs. XML Sitemap handles term, node, or user IDs that are then translated into URLs.

I think it's easier for whoever adds a directory to robots.txt to also change the priority of the node (or taxonomy term, or user) associated with that URL, so that the URL does not appear in the site map.

avpaderno’s picture

Status: Active » Closed (won't fix)

I am changing the status to won't fix as this feature will not be implemented.

LuisCypher’s picture

Sorry for dragging up an old issue. The problem with this is that items like /admin/build, /reports/ and other areas are added to the sitemap. They are in my robots.txt, but for some reason they are still getting crawled (possibly malicious bots accessing the sitemap?).

avpaderno’s picture

The problem with this is that items like /admin/build, /reports/ and other areas are added to the sitemap.

The only way this can happen is if you are using xmlsitemap_menu.module, and you have selected the navigation menu to be used to add links in the sitemap. Just deselect that menu from the XML Sitemap settings, and the problem will be solved.

Please note that you are not using an up-to-date version of the development snapshot: the code was first changed to verify that the anonymous user has access to the links added to the sitemap, and in the latest commits xmlsitemap_menu.module has been removed.

They are in my robots.txt but for some reason they are still getting crawled.

The use of a robots.txt file doesn't stop a crawler from accessing the URLs listed in the file.
The file simply lists URLs that the webmaster doesn't want crawled, or URLs that search engines and other crawlers cannot access anyway because the pages at those URLs require authentication.
There are some malicious crawlers that scan the content of robots.txt to get a list of URLs they then try to access; because the webmaster marked them as off-limits, those are the first URLs they check.

A web site does not need a sitemap.xml for crawlers to find and access particular URLs; a robots.txt file is enough.

LuisCypher’s picture

Thanks for the quick reply and for clearing up the confusion for me. I didn't know that the menu module would include links that anon users wouldn't be able to access.

avpaderno’s picture

I didn't know that the menu module would include links that anon users wouldn't be able to access.

This has been resolved in the latest commits; to tell the truth, it's not even a change I made the other day, as it dates further back.

dave reid’s picture

Version: 6.x-1.x-dev » 6.x-2.x-dev
Status: Closed (won't fix) » Postponed

Moving to 6.x-2.x for consideration. I'm thinking something like 'Review current sitemap links for robots.txt violations'. The batch process would then change the {xmlsitemap}.status to FALSE for each violating link, excluding it from the sitemap.
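As a rough illustration of that review step, the sketch below is written in Python rather than against the Drupal batch API. The `SitemapLink` class is a hypothetical stand-in for a row of the {xmlsitemap} table (only `loc` and `status` are modelled), and `review_links` is an invented name; the real feature, if it is ever written, may look quite different.

```python
from dataclasses import dataclass
from urllib.robotparser import RobotFileParser


@dataclass
class SitemapLink:
    """Hypothetical stand-in for one {xmlsitemap} row."""
    loc: str             # site-relative path, e.g. "node/1"
    status: bool = True  # True means the link is included in the sitemap


def review_links(links, robots_txt, user_agent="*"):
    """Set status to False on every link that robots_txt disallows.

    Returns the number of links that were newly excluded.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    violations = 0
    for link in links:
        path = "/" + link.loc.lstrip("/")
        if link.status and not parser.can_fetch(user_agent, path):
            link.status = False
            violations += 1
    return violations


# Assumed sample data based on the paths mentioned earlier in this issue.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /reports/
"""

links = [
    SitemapLink("node/1"),
    SitemapLink("admin/build"),
    SitemapLink("reports/status"),
]
excluded = review_links(links, ROBOTS_TXT)
```

A real batch process would work through the stored links in chunks and save each changed row; the loop above only shows the per-link check.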

not_Dries_Buytaert’s picture

Status: Postponed » Active

This feature should also work with modules that manage robots.txt, like http://drupal.org/project/robotstxt

I hope it is alright that I changed the status, since 2.x dev (http://drupal.org/node/449710) has been released.

dave reid’s picture

Version: 6.x-2.x-dev » 7.x-2.x-dev
Status: Active » Postponed

It's still not an actively-developed feature, so please leave as postponed.