Postponed
Project: XML sitemap
Version: 7.x-2.x-dev
Component: Code
Priority: Normal
Category: Feature request
Assigned: Unassigned
Reporter:
Created: 4 Feb 2009 at 09:38 UTC
Updated: 24 Sep 2010 at 13:42 UTC
Are there any plans to make XML Sitemaps follow instructions laid down in the robots.txt?
For instance I have just excluded a taxonomy term - Disallow: /promotional/
It would be great if this were reflected in the sitemap, as otherwise it will confuse the Google bot.
I see that you are constantly developing this module, so keep up the good work.
Comments
Comment #1
avpaderno commented
XML Sitemap already provides a way to exclude a taxonomy term from the site map; it's enough to select Not in site map in the vocabulary, or in the term form.
The same is true for nodes, or content types.
I am changing the title to avoid confusion with another request already present.
Comment #2
AndyW commented
Yeah, my throwing in taxonomy was misleading.
One of the features of xml-sitemaps.com is that it will not list files, directories and paths forbidden in robots.txt and I was wondering if something similar will be done for XML Sitemaps (Drupal)?
Comment #3
avpaderno commented
That is different, as in that case they work on URLs. XML Sitemap handles term, node, or user IDs that are then translated into URLs.
I think it's easier for whoever adds a directory to robots.txt to also change the priority of the node (or taxonomy term, or user) associated with that URL, so that the URL does not appear in the site map.
Comment #4
avpaderno commented
I am changing the status to won't fix as this feature will not be implemented.
Comment #5
LuisCypher commented
Sorry for dragging up an old issue. The problem with this is that items like /admin/build, /reports/ and other areas are added to the sitemap. They are in my robots.txt, but for some reason they are still getting crawled (possibly by malicious bots accessing the sitemap?)
Comment #6
avpaderno commented
The only way this can happen is if you are using xmlsitemap_menu.module and have selected the navigation menu as a source of links for the sitemap. Just deselect that menu in the XML Sitemap settings, and the problem will be solved.
Note that you are not using an up-to-date development snapshot: the code was first changed to verify that the anonymous user has access to the links added to the sitemap, and in the latest commits xmlsitemap_menu.module has been removed.
A robots.txt file doesn't stop a crawler from accessing the URLs listed in it.
The file simply lists URLs that the webmaster doesn't want crawled, or URLs that search engines and other crawlers cannot reach because the pages behind them require authentication.
Some malicious crawlers scan robots.txt for a list of URLs to try; because the webmaster marked them as off limits, those are the first URLs they check.
A web site doesn't need a sitemap.xml for crawlers to reach particular URLs; a robots.txt file is enough.
Comment #7
LuisCypher commented
Thanks for the quick reply and for clearing up the confusion for me. I didn't know that the menu module would include links that anonymous users wouldn't be able to access.
Comment #8
avpaderno commented
This has been resolved in the latest commits; to tell the truth, it's not even a change I made the other day, as it dates further back.
Comment #9
dave reid commented
Moving to 6.x-2.x for consideration. I'm thinking of something like 'Review current sitemap links for robots.txt violations'. The batch process would then set {xmlsitemap}.status to FALSE for each violating link, excluding it from the sitemap.
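As a rough illustration of that review step, here is a minimal sketch in Python (not Drupal's PHP, and not the module's actual code). It assumes each sitemap link is a plain dict with hypothetical "loc" and "status" keys standing in for rows of {xmlsitemap}; the function name review_links is likewise made up for this example. The robots.txt matching is done with the standard library's urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser


def review_links(robots_txt, links, user_agent="*"):
    """Set status to False on links that robots.txt disallows.

    robots_txt: the text content of a robots.txt file.
    links: list of dicts with hypothetical "loc" and "status" keys
           (a stand-in for rows of the {xmlsitemap} table).
    Returns the list of "loc" values whose status was changed.
    """
    parser = RobotFileParser()
    # parse() takes the file as a sequence of lines.
    parser.parse(robots_txt.splitlines())

    changed = []
    for link in links:
        # Only review links that are currently included in the sitemap.
        if link["status"] and not parser.can_fetch(user_agent, link["loc"]):
            link["status"] = False
            changed.append(link["loc"])
    return changed
```

With a robots.txt containing `Disallow: /promotional/` and `Disallow: /admin/` under `User-agent: *`, links such as /promotional/summer-sale and /admin/build would be switched off, while /node/1 would stay in the sitemap. A real batch process would also have to work from the generated URLs rather than the term, node, or user IDs the module stores.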
Comment #10
not_Dries_Buytaert commented
This feature should also work with modules such as http://drupal.org/project/robotstxt
I hope it is all right that I changed the status, since 2.x dev (http://drupal.org/node/449710) has been released.
Comment #11
dave reid commented
It's still not an actively developed feature, so please leave it as postponed.