Just a suggestion -- if you make the 6.x branch 6.x-2.x, that will keep the version numbers in sync for the 5.x version and the 6.x version. This might lead to fewer problems down the road.

Comment | File | Size | Author
#2 | xmlsitemap_links_table-6.x.txt | 18.5 KB | wayland76

Comments

wayland76’s picture

Title: 6.x-2.x » 6.x-0.x -> 6.x-2.x

Oops. I was making the false assumption that the 6.x port was a port of the 2.x branch. Obviously not. Can we make it so that 6.x is a port of the 2.x branch instead of the 1.x branch?

Well, then, I'll make this issue one where we can work on porting the patch from http://drupal.org/node/201644 to 6.x.

wayland76’s picture

Status: Active » Needs review
Attached file: new, 18.5 KB

Ok, I've updated that patch so it works with 6.x (and fixed a bug in the calling of custom_rewrite_url_outbound along the way).

wayland76’s picture

I've just realised there's a bunch of other stuff that you also did to 6.x. I haven't ported that at this point. Do you want me to do that, or did you want to do that?

darren oh’s picture

As soon as issue 221555 is resolved, 5.x-2.x will be stable. I would then recommend it as the base for a stable 6.x release.

wayland76’s picture

I see it's fixed now :).

Ok, so what we should do then, if I understand correctly, is:
a) Fix the various bugs on the 2.x branch
b) Take a diff between the 5.x branch and the 6.x branch
c) Port that diff to the 2.x version
d) Release a copy of the 2.x version with the diff applied as a 6.x-dev release

Is that the process?

wayland76’s picture

Status: Needs review » Active
wayland76’s picture

Status: Active » Needs review

Btw, why are the translations being dropped for the 6.x version?

wayland76’s picture

Status: Needs review » Active

How did I manage that? :). I'm hoping issue #263547: Node paths being blanked gets resolved first as well.

darren oh’s picture

I'll plan on making the first official 5.x-2.x release together with the first official 6.x release.

darren oh’s picture

Category: feature » task
darren oh’s picture

By the way, translations are not being dropped. Translations for Drupal 6 modules go in the translations sub-directory.

darren oh’s picture

Assigned: Unassigned » darren oh
lslinnet’s picture

Would it be possible for someone to create a pre-patched download of this, so that those of us who don't know how to patch using CVS and the like also have a chance to give some input?
(Or just create a new branch and describe how to download it.)

darren oh’s picture

That will happen as soon as there is something to test.

mo6’s picture

subscribe

netbjarne’s picture

also subscribing :) Looking forward to testing a beta or release candidate :)

wayland76’s picture

Title: 6.x-0.x -> 6.x-2.x » XML Sitemap 6.x-0.x -> 6.x-2.x
Mattias-J’s picture

subscribing

scottrigby’s picture

look forward to testing this as well – cheers! :) Scott

dave reid’s picture

Very willing to test once official porting begins. Subscribing.

Artem’s picture

Subscribing to get notification when XML Sitemap is ported to 6.x or at least until there is a beta. Sitemaps are not the critical element and I would risk using beta.

DanielJohnston’s picture

Subscribing

avpaderno’s picture

Title: XML Sitemap 6.x-0.x -> 6.x-2.x » 6.x-0.x -> 6.x-2.x
darren oh’s picture

Title: 6.x-0.x -> 6.x-2.x » 6.x-0.x -> 6.x-1.x

Kiam@avpnet.org is now my co-maintainer for this module. For everyone's benefit, here is the detailed development plan I shared with him.

The 6.x code needs to be ported from 5.x-2. The 5.x-2 version addresses the major problems with XML Sitemap:

Configuration: No special set-up required. If Drupal works, XML Sitemap works. Zlib not needed.
Multi-site: It is no longer possible for one site's map to overwrite another's when they share a files directory.
Multi-language: Nodes are no longer excluded when their language differs from the administration page's at the time the site map is generated.
Caching: Uses Drupal caching instead of creating cache files.

One problem remains: large sites can still exceed their memory limits when populating the links table. Official 5.x-2 and 6.x-1 releases will be made simultaneously once this problem is addressed.

superflyman’s picture

Assigned: darren oh » Unassigned
Category: task » support

subscribe

avpaderno’s picture

ATTENTION
6.x-1.x-dev is still a work in progress, and it should not be used on any production site. Install it only if you want to help find problems.

Your help is very welcome.

jerry’s picture

subscribing

avpaderno’s picture

To stop Drupal from reporting 6.x-0.x-dev as an unsupported release, I made it appear on the project page.
This should resolve the problem for those who have that version installed and keep receiving an error message from Drupal.

lars toomre’s picture

Thank you for making xmlsitemap-6.x-0.x-dev a supported release. The Drupal 6.x update function deceived me into believing that xmlsitemap-6.x-1.x-dev was also a supported release ready for production websites. That clearly has not been true!!

q0rban’s picture

What else is left to be done in getting a stable 6.x version of this module? I could potentially help out if there was a roadmap of where you're at, and what's left that needs to be done.

kmonty’s picture

Subscribing

avpaderno’s picture

The code needs to be rewritten in some places.

The approach of using a single query to get both the data and the path alias is faulty.
It doesn't work for taxonomy terms, because such a query cannot catch an alias for a path that can be forum/<id>, taxonomy/term/<id>, or something completely different used by a third-party module; the only way to get the correct alias for a taxonomy term is to use the functions provided by Drupal core.
It doesn't completely work for nodes when locale.module is enabled. In that case there can be more than one alias for the same node, one per language; the query currently used finds the first alias for the node URL, so the site map can contain links in mixed languages, like http://example.com/it/prova-per-drupal and http://example.com/la/lorem-ipsum.

xmlsitemap_node.module should also check whether the node it is adding to the site map is accessible to an anonymous user (which includes search engine crawlers). Otherwise, there would be URLs in the site map that point to Drupal content that search engines are forbidden to access; Google Webmaster Tools, e.g., would report this as an error. There are many cases where an author restricts access to a node to a limited number of Drupal roles (which may exclude anonymous users).
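To make the check above concrete, here is a minimal sketch in Python rather than Drupal code. The `Node` class, the `view_roles` field, and the role name are assumptions made up for illustration; a real site would ask whatever access-control mechanism it uses.

```python
from dataclasses import dataclass, field

ANONYMOUS = "anonymous user"  # hypothetical role name for this sketch


@dataclass
class Node:
    nid: int
    published: bool
    # Roles allowed to view the node; an assumption standing in for
    # whatever access-control module the site uses.
    view_roles: set = field(default_factory=lambda: {ANONYMOUS})


def visible_to_anonymous(node):
    """True when an anonymous visitor (or a crawler) could load the node."""
    return node.published and ANONYMOUS in node.view_roles


def sitemap_nodes(nodes):
    """Keep only the nodes that are safe to list in the site map."""
    return [n for n in nodes if visible_to_anonymous(n)]


nodes = [
    Node(1, True),              # public
    Node(2, True, {"editor"}),  # restricted to a role
    Node(3, False),             # unpublished
]
listed = [n.nid for n in sitemap_nodes(nodes)]
```

Only node 1 ends up in `listed`; the restricted and unpublished nodes are left out, so a crawler never receives a URL it cannot open.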

xmlsitemap_menu should inject links into the site map only if the current user is the anonymous user (I think this is also the case when cron tasks are being executed). As it works now, there could be menu links in the site map that point to pages the anonymous user cannot access. Which links appear depends on who was the last user to access the web site; it is more likely to happen when that user has access to the administration menu and the navigation menu is selected for inclusion in the site map.

avpaderno’s picture

Version: 7.x-2.x-dev » 6.x-1.x-dev

In my opinion, the only way to solve the problem with site maps containing a large number of URLs is to make the module able to create more than one site map. That way, whoever has a lot of nodes can create different site maps (e.g., one for every content type that has more than X nodes, and one for the other content types, which don't have too many nodes).

Changing the module to support the creation of more than one site map would also allow creating a site map for every language, as requested.

darren oh’s picture

We already have the ability to generate multiple site maps, but we still have to look up URL aliases ahead of time to get decent performance. The problem now is that we need to enable the XML Sitemap tables to be populated in multiple requests, so that the number of URL aliases looked up at once can be limited.

avpaderno’s picture

The user interface that lets the user define the content of each site map is still missing. That also means changing the way xmlsitemap.module shows its user interface for selecting the content types included in the site map.
The table where the URLs are saved needs another field that contains the ID of the site map.

I am working on that.

darren oh’s picture

I don't like the idea of splitting up site maps by content type. The purpose of splitting up site maps is to prevent them from getting too large. We need to attack the problem directly. Right now we can specify how large a site map may be. We need to add the ability to add URLs in multiple queries and to specify how many URLs may be added in one query.

Splitting site maps by language is a different issue. As far as search engines are concerned, each language will be treated as a separate site, so http://example.com/, http://example.com/en/, and http://example.com/es/ would be considered different sites.

avpaderno’s picture

Suppose there is a web site which has 10900 posts of type blog, forum, and a custom type.
If the user cannot select which content types go in a site map, the issue with the time Drupal takes to create a large site map cannot be resolved. If I can have different site maps for different content types, I can also dedicate a site map to the content type that has the most nodes, or that I think will have the most nodes.

I don't mean that the user can select just one content type per site map; I mean giving the user (the administrator of a Drupal-powered web site) the possibility to create more site maps, and to decide which content types to put into each one. He will need to split the site map only if he has problems with XML Sitemap because it is too slow to generate the full site map (or it uses too much memory); the normal user of the project modules will probably never have such a need.

Maybe I am wrong, but the ability to split the same site map into different chunks didn't help with the time it takes to generate the site map; if it did, the issue reported would already be fixed. I guess that is because the search engine requests all of the site map chunks at once.
From what I can see of Google's behavior, it reads the site map based on the highest change frequency reported in the site map; I know this because I disabled the automatic submission of the site map to the search engines, and noticed that Google kept reading the site map once per day, just because some content nodes were reported to change daily. I imagine that, had there been more site map chunks, it would have downloaded every single chunk, as it is interested in reading the full site map (and it cannot remember which chunk contained the URL of the content that changes daily).

There are also other things that can help make the generation of the site map faster, like an index on the xmlsitemap table. If the table were indexed by loc, the queries could be faster.

darren oh’s picture

There is no need to use content types for this purpose. The bottleneck is in populating the XML Sitemap tables. Currently, we use module_invoke_all('xmlsitemap_links') to populate the xmlsitemap table. We need to replace this with a function that can populate the xmlsitemap table in batches and keep track of its progress. We should also be able to avoid repopulating the entire xmlsitemap table when a single URL is updated. XML Sitemap sub-modules could also use batch processing when populating their own tables on installation.
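A rough sketch of the batched population described above, written in Python only to show the shape of the idea. The `fetch_links` callback, the `state` dictionary, and the list standing in for the xmlsitemap table are all assumptions; none of this is the module's code.

```python
def populate_in_batches(fetch_links, store, state, batch_size=100):
    """Process one batch per call; return True when everything is stored.

    fetch_links(offset, limit) -> list of links. Hypothetical callback
    standing in for a batched variant of the links hook.
    store: list acting as the xmlsitemap table.
    state: dict that persists the offset between requests.
    """
    offset = state.get("offset", 0)
    links = fetch_links(offset, batch_size)
    store.extend(links)
    state["offset"] = offset + len(links)
    # A short (or empty) batch means the source is exhausted.
    return len(links) < batch_size


# Simulate 250 links being stored over several requests.
all_links = [f"node/{i}" for i in range(250)]


def fetch(offset, limit):
    return all_links[offset:offset + limit]


table, state, calls = [], {}, 0
done = False
while not done:
    done = populate_in_batches(fetch, table, state)
    calls += 1
```

Each call touches at most `batch_size` links, and because the offset lives in `state`, the work can resume in a later request instead of starting over.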

avpaderno’s picture

My point is that if a search engine downloads the full site map just because a single URL is said to link to content that is updated daily, then when the site map is split in two, the links required to create the site map the search engine is asking for are exactly half of the total number of links.
Creating different site maps based on the content type is just an example; there could be different criteria to decide which content is used to populate a site map (as long as the criterion used is not the alphabetical order of the URLs :-) ).

Splitting a task into batches is possible using the batch functions that Drupal makes available starting from version 6; those functions don't replace the hook used by XML Sitemap, but make it possible to divide a task into multiple batches. In this case, the implementation of hook_xmlsitemap_links() would receive a parameter indicating which batch of links it should add (a progressive number could be enough).
This can help modules that must add a lot of links to the xmlsitemap table.
It doesn't resolve the problem of creating the XML data when the search engine requests the content of the site map. Even if the site map is split into different chunks, the search engine will ask for the full content of a site map, chunk after chunk; if the site map is made of only 100 links (versus a full list of, e.g., 1000 links), the time to create it is a fraction of the time required to create a single site map of 1000 links.

darren oh’s picture

The site map index gives the last modified time for each chunk. If URLs are ordered by last modification date, only one chunk will need to be downloaded.
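As an illustration of that claim, here is a small Python sketch. It assumes the crawler compares each chunk's lastmod in the index with the time of its own last visit; the function names and timestamps are made up for the example.

```python
def build_index(links, chunk_size):
    """Split (loc, lastmod) pairs into chunks ordered by lastmod.

    Returns one index entry per chunk; a chunk's lastmod is the newest
    lastmod among its links.
    """
    ordered = sorted(links, key=lambda link: link[1])
    chunks = [ordered[i:i + chunk_size]
              for i in range(0, len(ordered), chunk_size)]
    return [{"chunk": n, "lastmod": max(lastmod for _, lastmod in chunk)}
            for n, chunk in enumerate(chunks)]


def chunks_to_fetch(index, last_crawl):
    """Chunks a crawler would re-download if it trusts the index lastmod."""
    return [entry["chunk"] for entry in index if entry["lastmod"] > last_crawl]


# Six links with made-up timestamps 10..60, two links per chunk.
links = [(f"node/{i}", i * 10) for i in range(1, 7)]
last_crawl = 65  # the crawler has already seen everything up to t=65

# node/1 is edited at t=70 and moves to the end of the ordering.
links[0] = ("node/1", 70)
changed = chunks_to_fetch(build_index(links, 2), last_crawl)
```

With the links ordered by modification time, the edited URL lands in the last chunk, which is the only one whose lastmod is newer than the crawler's last visit.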

There really isn't a problem with large site maps being downloaded frequently: they are cached after being created. It is the overhead of populating the xmlsitemap table that causes problems.

Those who really want to generate multiple site maps can use Views to do so. It should not be part of the module configuration. They should be aware that the sitemaps.org protocol requires a site map to serve URLs from a single directory only.

avpaderno’s picture

For populating the xmlsitemap table, Drupal 6 offers some functions that make batch processing of data possible. If they are the solution for populating the table in a reasonable time, then the implementations of hook_xmlsitemap_links() need to be changed so that one implementation of the hook doesn't call the same hook implemented by other modules (as happens with xmlsitemap_node, which triggers the implementation of the hook in xmlsitemap_file).

avpaderno’s picture

It is also true that the table gets erased and repopulated when a module changes the value of a variable used by XML Sitemap; it would be preferable to let the modules insert only the links that really changed. This means there should be a way to record the conditions that could change the reported links, e.g., the language that was set the last time the links were generated.
The approach of picking one of the many aliases of a node from the database doesn't work well, especially if it picks a different language every time, resulting in a list like http://example.com/it/esempio, http://example.com/en/another-big-example, http://example.com/eo/esperoj when the list could have been http://example.com/en/example, http://example.com/en/another-big-example, http://example.com/en/hopes.

avpaderno’s picture

It is also possible to get language-neutral aliases, but that would mean changing a query from, e.g., LEFT JOIN {url_alias} ua ON ua.src = CONCAT('node/', CAST(n.nid AS VARCHAR)) to LEFT JOIN {url_alias} ua ON ua.src = CONCAT('node/', CAST(n.nid AS VARCHAR)) AND ua.language = '', if that is still valid SQL. The problem then is that the site map would not be populated from aliases for nodes that have no language-neutral alias.
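The effect of that extra join condition can be tried out with a throwaway SQLite database from Python. This is only a sketch: the table layout is a simplified assumption, the Drupal {table} prefix syntax is dropped, and SQLite spells concatenation as || instead of CONCAT().

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE node (nid INTEGER PRIMARY KEY);
CREATE TABLE url_alias (src TEXT, dst TEXT, language TEXT);
INSERT INTO node VALUES (1), (2);
INSERT INTO url_alias VALUES
  ('node/1', 'example', ''),    -- language-neutral alias
  ('node/1', 'esempio', 'it'),  -- Italian alias for the same node
  ('node/2', 'speranze', 'it'); -- node 2 has only an Italian alias
""")

# Join only against language-neutral aliases (language = '').
rows = db.execute("""
SELECT n.nid, ua.dst
FROM node n
LEFT JOIN url_alias ua
  ON ua.src = 'node/' || n.nid AND ua.language = ''
ORDER BY n.nid
""").fetchall()
```

Node 1 gets its language-neutral alias, while node 2 comes back with no alias at all, which is the gap described above.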

I understand the main issue is the performance of populating the xmlsitemap table, but the code cannot be changed without taking multi-language sites into consideration.
It's true that, for a search engine, content in other languages would appear in different directories (http://example.com/it, http://example.com/eo, etc.), but there is a single Drupal installation, so XML Sitemap should take this into consideration.

avpaderno’s picture

The implementations of hook_xmlsitemap_links() need to be optimized.
For example, xmlsitemap_file.module makes an inner join with the {node} table to find the data it needs; this means that for the few files attached to nodes (few compared to the total number of nodes), the whole {node} table gets checked. If the module saved more data in its own table, it would be possible to remove the inner join.

avpaderno’s picture

I found a solution to the problem of populating the xmlsitemap table. xmlsitemap.module must not empty the table; each module will remove the links it put in the table when the lastmod field of a row is lower than the change time of the object it tracks in its own database.

If a node is changed at, e.g., December 9, 1:00 PM, xmlsitemap_node.module will remove from the table all the links that are associated with nodes and whose change time is prior to that moment.
This means that any module populating the table must save in a variable the last change time associated with the Drupal objects it checks (node, taxonomy, user, etc.), and that the xmlsitemap table needs a field saying which module saved the link.
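One possible reading of this proposal, sketched in Python: each row records which module wrote it, and a module rewrites only its own rows that are missing or older than the object's change time. The dictionary layout and the function name are assumptions for illustration, not the module's schema.

```python
def refresh_links(table, module, objects):
    """Rewrite only this module's rows that are missing or stale.

    table:   dict mapping (module, object_id) -> {"loc", "lastmod"}
    objects: dict mapping object_id -> {"loc", "changed"} for this module
    Returns the ids whose rows were written.
    """
    written = []
    for object_id, obj in objects.items():
        row = table.get((module, object_id))
        if row is None or row["lastmod"] < obj["changed"]:
            table[(module, object_id)] = {"loc": obj["loc"],
                                          "lastmod": obj["changed"]}
            written.append(object_id)
    return written


table = {
    ("node", 1): {"loc": "node/1", "lastmod": 100},
    ("node", 2): {"loc": "node/2", "lastmod": 100},
    ("user", 1): {"loc": "user/1", "lastmod": 50},
}
nodes = {
    1: {"loc": "node/1", "changed": 100},  # unchanged
    2: {"loc": "node/2", "changed": 200},  # edited since last run
    3: {"loc": "node/3", "changed": 150},  # new node
}
written = refresh_links(table, "node", nodes)
```

Only the edited and the new node are written; the unchanged node and the row owned by another module are left alone, so the table never has to be emptied.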

executex’s picture

Hmm, I thought that porting from Drupal 5 to Drupal 6 was just a matter of changing the way some functions work, plus small modifications. I didn't know you had to change the way the module worked in 5.x?

So what's wrong with the Drupal 6 version as of now?

Well anyway, hopefully you guys can release a stable Drupal 6 version.

korvus’s picture

subscribing

hass’s picture

As said above, XML file generation should be moved to the batch API, as it takes a LOT of time and makes PHP run out of time. We plan to build many sites with 200.000 URL alias table entries, and this shouldn't time out while generating the XML files.

avpaderno’s picture

What I am planning to do (and am already implementing) is the following:

  • When xmlsitemap.module is first activated (or the first time after an update empties the xmlsitemap table), the module will show a warning on the status page, giving the administrator a link that starts the batch process of populating the project module tables, including the xmlsitemap table.
  • After the xmlsitemap table is written the first time, the table will only be updated, and the modules will never empty it;
    xmlsitemap_node.module will delete only the rows that refer to deleted nodes.
  • xmlsitemap_file.module will not join the node table with another table as it does now; as xmlsitemap_node.module already joins that table with its own table, xmlsitemap_file.module will use the result obtained by xmlsitemap_node.module, avoiding a repeated join on a table that contains a lot of rows.

There will be a page that allows cleaning the table containing the links reported in the site map, in case the administrator notices problems with the site map reported to the search engines.

avpaderno’s picture

The XML string representing the site map cannot be generated using the batch API, because the batch API breaks the task into subsequent HTTP requests; I am not sure search engines would support that (I remember that Google Webmaster Tools reports an error if it gets redirected too many times).

The batch API could be used to create the XML content that Drupal will then cache (thanks to the call to drupal_page_footer()). In that case, there could be a page in the XML Sitemap administration pages with a link to generate the XML content of the site map.
The only problem is that the batch API already outputs the progress bar page, so XML Sitemap could not output the XML content at the same time.

hass’s picture

When xmlsitemap.module is first activated (or the first time after the update empties the xmlsitemap table) the module will give a warning in the status page, giving to the administrator a link that will start the batch process of populating the project module tables, including the xmlsitemap table.

Build the files via cron or run this batch together with the translation import if the module is activated.

The generation of the XML string representing the site map cannot be done using the batch API, because the batch API breaks the task being done into subsequent HTTP requests; I am not sure if the search engines would be able to support such functionality (I remember that Google Webmaster Tools gives back an error in its reports, if it gets redirected too many times).

You can run subsequent batch API requests and save the result of each, e.g., in a cache table; once all are completed, run a final batch that generates the full file from the chunks generated before. This can and must all be done in the background before showing the files to the search engines.

Additionally, we should have a manual batch generation link to allow manual regeneration, and/or a clear button that allows regeneration by cron. All this regeneration should be kept in the background until ready, so there is no time lag during which the search engines may get broken or unfinished XML files.
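The background approach could look roughly like the following Python sketch: chunks are built one at a time into staging entries, and the live site map is replaced only once every chunk exists. The cache keys, the build id, and the plain dictionary used as a cache are all hypothetical.

```python
from xml.sax.saxutils import escape

LIVE_KEY = "sitemap:live"  # hypothetical cache key for the served XML
NAMESPACE = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_chunk(cache, build_id, number, locs):
    """Render one chunk of <url> entries into a staging cache entry."""
    body = "".join(f"<url><loc>{escape(loc)}</loc></url>" for loc in locs)
    cache[f"staging:{build_id}:{number}"] = body


def publish(cache, build_id, chunk_count):
    """Replace the live site map only if every staged chunk is present."""
    keys = [f"staging:{build_id}:{n}" for n in range(chunk_count)]
    if any(key not in cache for key in keys):
        return False  # still building; keep serving the old site map
    body = "".join(cache.pop(key) for key in keys)
    cache[LIVE_KEY] = f'<urlset xmlns="{NAMESPACE}">{body}</urlset>'
    return True


cache = {LIVE_KEY: "<urlset>old</urlset>"}
build_chunk(cache, "b1", 0, ["http://example.com/node/1"])
published_early = publish(cache, "b1", 2)  # chunk 1 is still missing
live_during_build = cache[LIVE_KEY]
build_chunk(cache, "b1", 1, ["http://example.com/node/2?a=1&b=2"])
published = publish(cache, "b1", 2)
```

While the build is incomplete, `publish` refuses to swap and the old site map stays in place, so a crawler arriving mid-build never sees a half-finished file.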

avpaderno’s picture

There aren't files being built. The site map is written to the standard output of a PHP script.

The batch tasks are executed, e.g., when you update the installed modules using update.php; after the page showing the available updates, you will see a page with a progress bar that normally (as the updates take little time) shows Remaining 0 of X (with X being the total number of updates). If the updates are slow, you would see, e.g., Remaining 4 of 4, Remaining 3 of 4, Remaining 2 of 4, Remaining 1 of 4, and Remaining 0 of 4, with the URL changing every time. At the end, there is a pause, and the URL changes to the final URL that shows the result of the updates (the page with the list of queries run).
Batches work if they are started by a person; every time a batch is executed, it is because a person clicked on a link that started the batch process.

avpaderno’s picture

Status: Active » Closed (duplicate)

I am marking this as a duplicate of #359104: XML sitemap 6.x-1.x-dev progress.
This report is also not up to date with the more recent changes in the development of the project modules.