We have one major remaining architechtural problem in the 6.x-2.x rewrite: 'overriding' the default status (inclusion/exclusion) from the sitemap.

The 6.x-1.x version currently lump both status and priority values into one column: priority. If priority = -2, the link is included and has default priority. If priority = -1, then the link is excluded from the sitemap. If the priority is a value from 0.0 to 1.0, the link is included and has the assigned priority value.

I wanted to keep the status and priority values as separate fields in the 6.x-2.x rewrite because then it is easy to change the default status or priority for a content type, and have those changes be immediately applied to the existing data in the xmlsitemap table with one simple UPDATE SQL. In the 6.x-2.x we currently have an boolean access column, which stores whether a link is visible or not (like if a node is private or published), a boolean status field if the link is included or excluded by default (the content type setting), a priority column which is the default priority for the item, and finally a priority_override column which if it is not NULL, overrides the priority field (COALESCE(priority_override, priority) as priority). See http://project.davereid.net/api/function/xmlsitemap_schema/xmlsitemap

But now I need to figure out how to store in the database (and expose in the interface) how to override the status value to include or exclude a specific link. I have come up with three solutions:

  1. Create a status_override column, similar to the priority_override column.
    If the value of status_override is NULL, then the value of status is used. Otherwise, the value of status_override is used.
    Pros: Consistency with the way the priority overriding works. We can easily update the default status values for content types, vocabularies, etc since status is still its own column. Easiest to understand what is going on in the database.
    Cons: In order to select all the links that are visible in the sitemap, we need to have COALESCE(status_override, status) = 1 which will not be as efficient for sites that will have large amounts of data in the table. Code is a little more complex in the interface since we can't easily give a select option element a NULL result value.
  2. Create constants for all possible status values and just use the status field.
    For example: XMLSITEMAP_STATUS_EXCLUDED = 0, XMLSITEMAP_STATUS_INCLUDED = 1, XMLSITEMAP_STATUS_EXCLUDED_OVERRIDE = 2, XMLSITEMAP_STATUS_INCLUDED_OVERRIDE = 3.
    Pros: Only one database column and easy to index.. To select included links, we only need to have WHERE status = XMLSITEMAP_STATUS_INCLUDED OR status = XMLSITEMAP_STATUS_INCLUDED_OVERRIDE.
    Cons: Not as easy to tell what is going on in the database. I don't link to use constants.
  3. Use a negative status value to represent a status override and just use the status field.
    For example, link excluded by default = 1, link included by default = 2, link overridden to be excluded = -1, link overridden to be included = -2.
    Pros: Only one database column and easy to index. Again to select the included links we just need to have WHERE status = 2 OR status = -2 or possibly ABS(status) = 2 but the latter is probably not as efficient.
    Cons: We can't use boolean 0/1 logic for status, which is not as easy to understand what is going on. Probably still need to use constants.

I'd like your input on what makes the most sense from an architectural standpoint since there's not really anything in Drupal core as an example that I can follow. If anyone has any more suggestions, let me hear 'em please!

Comments

quicksketch’s picture

we currently have an boolean access column, which stores whether a link is visible or not (like if a node is private or published), a boolean status field if the link is included or excluded by default (the content type setting)

Could you clarify the difference between the "access" and "status" columns? It seems like if either of these were FALSE the link would be excluded. What's the reason for two fields with the same purpose?

Other than that, I'd say go with your instinct. If priority_override works, so should status_override. Any idea what performance impact a NULL priority/status_override has? I imagine such rows will be in the underwhelming minority compared with the vast amount of sites that will just leave all override values NULL. I'm all for clarity of code despite being slightly more verbose, plus I hate constants too. Option 1 gets my vote based what I know.

Anonymous’s picture

I haven't looked at the 2.x code but I imagine access to mean "does the anonymous user have access to the link".

The status_override column sounds like what I would lean toward so go for that.

Dave Reid’s picture

Before I just had a 'status' column that was an all-inclusive boolean that was TRUE that say for a node, node_access() was TRUE and the content type was set to be included in the sitemap. But now if I wanted to change a content type from excluded to included, I couldn't just change that field to 1 since I would have to re-check node_access for all those records.

So I split into an access field, that stores if the 'link' is visible to the anonymous user (node_access(), $menu_item['access'], etc), and a separate status column that is a boolean for the default status of the link (content type inclusion setting). Now if I want to set a content type from excluded to included I just need to run UPDATE {xmlsitemap} SET status = 1 WHERE status = 0 and I still retain all the node_access() values.'

The current SQL to get all the visible and included links for the sitemap cached files is WHERE access = 1 AND status = 1.

coupet’s picture

Only one database column and easy to index..

Simplicity, I suggest #2

Not as easy to tell what is going on in the database.

Include additional documentation

Constants - install.inc
http://api.drupal.org/api/file/includes/install.inc/6

Dave Reid’s picture

Issue tags: +6.x-2.0-alpha blocker

Tagging

Dave Reid’s picture

After a lot of review, it seemed that solution #1 made the most sense to me. I was about to start implementing it when I had crazy idea #4:

#4: What if instead of storing values for status and priority in two different fields each, what if we make status_override and priority_override just a boolean field to say "hey, the values in the priority or status fields (respectively) are not the defaults - they're overridden!". This seems to make the most sense over the proposed solutions and doesn't duplicate data.

For example:

A node with default status (excluded) and default priority (0.5) would be:
status = 0
status_override = 0
priority = 0.5
priority_override = 0

If we overrode the same node to be included and with a priority of 0.8, then it would be:
status = 1
status_override = 1
priority = 0.8
priority_override = 1

If we need to make any mass updates for default values, we can still use one simple update query: UPDATE {xmlsitemap} SET priority = 0.8 WHERE type = 'node' AND node_type = 'story' AND priority_override = 0. Easy as that! Also, to select all the 'included' links in the sitemap it is a simple condition as well: SELECT * FROM {xmlsitemap} WHERE access = 1 AND status = 1.

The only downside to this approach is it will be a little harder to code in the interface than I currently have, but I'm sure I can do it.

What do you all think? Does this seem like a good approach?

Anonymous’s picture

I still think the idea in #2 is simplest. You only have a row in the override table if there was data overridden.

Hmm... That would also mean we could remove the priority and status columns from the xmlsitemap table and always assume the default value unless data was present in the override table. The override table would not be truncated during a rebuild, only the xmlsitemap table would be.

That means xmlsitemap table would become
id, type, loc, access, lastmod, changefreq and changecount

xmlsitemap_override would be
loc, priority and status

You already know what the SQL will be. You just COALESCE the defaults.

EDIT: The priority and status columns would be nullable to allow for setting one or the other and not both.

Dave Reid’s picture

I had discussed this more with David Strauss on IRC and I'm going to proceed with #6 since I've worked out how it will translate to interface/form code. Ideally there I would have a separate xmlsitemap_override table with id, type (the primary keys) along with status and priority. I can't however see how the separate override table would work in the code and it will take more thought and planning. For now I can use db_query_temporary() before the rebuild to save any override values and add them back after the rebuild.

Having a COALESCE with the status override is not ideal at all since we need to run a WHERE condition during the file generation. With the approach in #6 it is a simple WHERE status = 1 and it's done.

DamienMcKenna’s picture

How far along is the work on this? Please let me know if I can help, I've got a team of developers and DBAs eager to help make XMLSitemap work correctly.

Dave Reid’s picture

Status: Active » Fixed

Finished all the coding and architecture needed for the problem. Marking as fixed.

Dave Reid’s picture

Committed to CVS in http://drupal.org/cvs?commit=253560. I'm going to create a 6.x-1.0-unstable1 release so please help test the upgrade so alpha can get closer!

Status: Fixed » Closed (fixed)
Issue tags: -Needs architectural review, -6.x-2.0-alpha blocker

Automatically closed -- issue fixed for 2 weeks with no activity.