Simple Sitemap
Introduction
Simple Sitemap provides a fast and efficient method of generating a Sitemaps.org-compliant XML sitemap. This documentation aims to assist programmers in adding or overriding URL entries; at present, this module is probably not appropriate for sites without ready access to a programmer. For more information about sitemaps in general, you should read about the Sitemaps.org protocol.
Please visit the Simple Sitemap project page for details about progress and planned features.
Goals
Simple Sitemap is built as an alternative to XML Sitemap for very large sites. It takes a very different approach to building the XML document which makes far more assumptions than XML Sitemap and allows less customization (unless you're a module developer), so it is recommended that smaller sites continue to use XML Sitemap. At present, Simple Sitemap is not a replacement for or a duplication of XML Sitemap; though the result is the same, the features available and the methods used are very different and aimed at very different sorts of Drupal-powered sites.
Unless you specifically extend this module to support other types of URLs, the only URLs that will be indexed are those with entries in the url_alias database table. The reason for this is simple: speed. This may change in the future to allow you to specify other types of URLs to index. If you need Simple Sitemap's speed but also want to index other types of URLs, you can implement hook_simplesitemap_add, described below.
History
Simple Sitemap was built by the New York Observer because XML Sitemap was too resource intensive given our tens of thousands of nodes and terms. Since performance is most important in our case, we made some subjective decisions, including the decision to use the url_alias table as the basis for the sitemap (safe for us, probably not for other sites) and the decision to run on a cron job rather than on each node operation.
How it Works
Simple Sitemap builds its XML files using only two database queries, which run on a cron job. It doesn't regenerate the sitemap after every node insert, update, or delete, so it adds no wait time to those operations (an important consideration for large sites with many staff contributing content). The first query reads all url_alias entries associated with a node, along with the node itself, taking care to select only nodes that have already been published. It derives a weight for each node from the number of times it has been read. The second query selects all entries in url_alias that do not belong to nodes. The script then breaks the URLs into collections of a size you specify, saves them directly to the filesystem, and serves them on request. It's assumed that your web server is set to compress content; module-level compression may come in the future if it's requested.
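The chunk-and-save step described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the function names example_render_urlset() and example_write_sitemaps(), the sitemap-N.xml file naming, and the $base_url parameter are all assumptions made for the example.

```php
<?php
// Hypothetical helper (not part of Simple Sitemap): render one collection
// of URL items as a Sitemaps.org <urlset> document.
function example_render_urlset(array $items, $base_url) {
  // The closing "?" . ">" is split so it can never be read as a PHP close tag.
  $xml = '<?xml version="1.0" encoding="UTF-8"?' . ">\n";
  $xml .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
  foreach ($items as $item) {
    $xml .= "  <url>\n";
    $xml .= '    <loc>' . htmlspecialchars($base_url . $item['loc'], ENT_QUOTES, 'UTF-8') . "</loc>\n";
    // Optional elements are written only when the item defines them.
    foreach (array('lastmod', 'changefreq', 'priority') as $key) {
      if (isset($item[$key])) {
        $xml .= "    <$key>" . htmlspecialchars((string) $item[$key], ENT_QUOTES, 'UTF-8') . "</$key>\n";
      }
    }
    $xml .= "  </url>\n";
  }
  return $xml . "</urlset>\n";
}

// Hypothetical helper: split the items into collections of $chunk_size
// (which must be at least 1) and write one file per collection.
// The file naming scheme is an assumption for this sketch.
function example_write_sitemaps(array $items, $chunk_size, $directory, $base_url) {
  $files = array();
  foreach (array_chunk($items, $chunk_size) as $index => $chunk) {
    $path = $directory . '/sitemap-' . $index . '.xml';
    file_put_contents($path, example_render_urlset($chunk, $base_url));
    $files[] = $path;
  }
  return $files;
}
```

Writing the files once during cron and serving them statically keeps the cost of a sitemap request independent of the number of URLs on the site.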
Extending Simple Sitemap
You can implement two hooks which Simple Sitemap provides: hook_simplesitemap_add and hook_simplesitemap_process.
hook_simplesitemap_add
This hook allows you to return URLs to Simple Sitemap for inclusion in its XML output. It takes no arguments and expects an array of arrays as its return value, or an empty array if there are no URLs to add.
<?php
function hook_simplesitemap_add() {
  return array(
    array(
      'loc' => '/interesting-content',
      'priority' => 1.0,
      'changefreq' => 'hourly',
      'lastmod' => '2007-12-25T12:00:00'
    ),
    array(
      'loc' => '/boring-content',
    ),
  );
}
?>
The first nested array contains all the possible elements; any other keys will be ignored. Note that 'loc' is the only required element. It must be a relative URL starting with /. Duplicate URLs are ignored; i.e., only the first definition of a URL is used (modules are processed alphabetically, followed by the contents of the url_alias database table). hook_simplesitemap_process() does fire on URLs added with hook_simplesitemap_add().
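The "first definition wins" rule can be pictured with the sketch below. It is only an illustration of the ordering described above, not the module's internal code: example_merge_first_wins() and its parameters are hypothetical names, and silently skipping entries that lack 'loc' is an assumption.

```php
<?php
// Hypothetical helper (not part of Simple Sitemap): merge URL items so that
// the first definition of each 'loc' is the only one kept.
// $by_module maps a module name to the items its hook returned;
// $alias_items stands in for the URLs read from url_alias.
function example_merge_first_wins(array $by_module, array $alias_items) {
  // Modules are processed alphabetically, then the url_alias entries.
  ksort($by_module);
  $sources = array_values($by_module);
  $sources[] = $alias_items;

  $merged = array();
  foreach ($sources as $items) {
    foreach ($items as $item) {
      // Assumption: entries without the required 'loc' are skipped.
      if (!isset($item['loc'])) {
        continue;
      }
      // Keep only the first item seen for a given URL.
      if (!isset($merged[$item['loc']])) {
        $merged[$item['loc']] = $item;
      }
    }
  }
  return array_values($merged);
}
```

Under this ordering, a module whose name sorts earlier overrides a later module's entry for the same URL, and any module's entry overrides the one derived from url_alias.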
hook_simplesitemap_process
This hook allows you to interrupt the processing of a URL, or to add or modify data on it. It takes three arguments: $row, $item, and $is_internal. $row is the result of the internal Simple Sitemap query, or null if the URL comes from another module. $item is an associative array of 'loc', 'priority', 'changefreq', and 'lastmod' (just as in hook_simplesitemap_add) which will be converted to XML. $is_internal is true if the URL comes from the Simple Sitemap module and false if it comes from elsewhere. You should modify $item to suit your needs (generally based on the contents of $row) and return $item. If you return false, the URL will be removed from the sitemap.
<?php
// Obviously this is just an example; starting URLs with 'private' or 'important' is a bit preposterous.
function hook_simplesitemap_process($row, $item, $is_internal) {
  if ($is_internal) {
    // Drop any internal URL whose path begins with "private".
    if (substr($row['loc'], 0, 7) === "private") {
      return false;
    }
    // Give the highest priority to paths that begin with "important".
    if (substr($row['loc'], 0, 9) === "important") {
      $item['priority'] = 1;
    }
  }
  return $item;
}
?>
It would be wise to avoid executing a database query in this hook unless you can run it once and store the results in a static variable, since this function is called for every URL element.
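A minimal sketch of that static-variable pattern follows. The module name "example", the example_load_excluded_paths() function, and the paths it returns are all hypothetical; the loader stands in for whatever single database query your module would run, and the use of $row['loc'] simply follows the example above.

```php
<?php
// Hypothetical stand-in for an expensive database query. In a real module
// this would be one query returning the paths to exclude.
function example_load_excluded_paths() {
  // Count calls so it is easy to confirm the "query" runs only once.
  $GLOBALS['example_query_count'] = isset($GLOBALS['example_query_count'])
    ? $GLOBALS['example_query_count'] + 1
    : 1;
  return array('private/reports', 'private/drafts');
}

// Implementation of hook_simplesitemap_process() for a hypothetical
// module named "example".
function example_simplesitemap_process($row, $item, $is_internal) {
  // Populated on the first call only, then reused for every later URL.
  static $excluded = NULL;
  if ($excluded === NULL) {
    // Flip the list so each lookup is a cheap isset() on an array key.
    $excluded = array_flip(example_load_excluded_paths());
  }
  if ($is_internal && isset($row['loc']) && isset($excluded[$row['loc']])) {
    // Returning FALSE removes the URL from the sitemap.
    return FALSE;
  }
  return $item;
}
```

Because the lookup table lives in a static variable, the cost of the query is paid once per sitemap build rather than once per URL.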
