There is a very old request for the Domain Access module to support Canonical URLs to improve SEO: #414882: Include Canonical URL tag in links extra-domains.

To explain the background: once you install Domain Access, essentially *ALL* Drupal requests -- be they node pages, views, taxonomy term pages, error pages, or literally ANYTHING -- could be considered by search engines to be duplicate content of the primary domain. Often the entire page content is exactly the same, and only the layout (e.g. block positions) and style (theme) differ. This is a serious problem for people who use the Domain Access module, and one they may not be aware of.

One would think you could get around this problem through proper configuration of Domain Access, and you can certainly prevent duplication of node content by ensuring that nodes are never 'sent to all affiliates' and that each piece of content is visible from only one domain. However, this is only one of many possible use cases that Domain Access enables. What about content that is sent to all affiliates, or content that is sent to two affiliates but specifically not meant to be seen on the primary domain?

The problem is exacerbated by the fact that ANY PAGE REQUEST to a subdomain that is NOT a node is a copy of the same page on the primary domain. Module-provided pages such as Ubercart's /cart URL or the Contact module's /contact URL are (without further customization) available on all the domains. There should be an easy way to ensure users are not penalized for these types of configurations.

Sadly, the conclusion on #414882: Include Canonical URL tag in links extra-domains was that this functionality is something for a sub-module to handle, not Domain Access directly. Furthermore, since Nodewords is built to handle Canonical URLs and this module is the bridge between Nodewords and Domain Access, I propose making this module the key to Canonical URL support for Domain Access.

I've written a working proof of concept that adds Canonical URL support to domain_meta, by hooking into hook_nodewords_tag_alter. There are currently two specific cases supported:

  • Nodes that distribute content to multiple domains should have a canonical URL that points to the primary domain.
  • All non-node pages on a non-primary domain should have a canonical URL that points to the primary domain.
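To make the two rules concrete, here is a minimal sketch in plain PHP. It is not the code from the patch: the function name, the parameters, and the use of base URL strings instead of Domain Access domain arrays are all assumptions made for illustration.

```php
<?php
/**
 * Hypothetical sketch of the two rules; not the code from the patch.
 *
 * @param string $path         Drupal path, e.g. 'node/12' or 'contact'.
 * @param array  $node_domains Base URLs the node is published to (empty for
 *                             non-node pages).
 * @param string $current      Base URL of the domain serving the request.
 * @param string $primary      Base URL of the primary domain.
 *
 * @return string|null Canonical URL, or NULL to leave the existing tag alone.
 */
function example_canonical_url($path, array $node_domains, $current, $primary) {
  $is_node = (strpos($path, 'node/') === 0);
  if ($is_node) {
    // Rule 1: a node shared across several domains points at the primary
    // domain.
    if (count($node_domains) > 1) {
      return rtrim($primary, '/') . '/' . ltrim($path, '/');
    }
    return NULL;
  }
  // Rule 2: non-node pages on a non-primary domain point at the primary
  // domain.
  if ($current !== $primary) {
    return rtrim($primary, '/') . '/' . ltrim($path, '/');
  }
  return NULL;
}
```

In this sketch a node published to a single domain, and any page already served from the primary domain, returns NULL so that whatever canonical URL Nodewords already produced is left untouched.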

There may be more special cases (e.g. taxonomy and users); however, the current client's needs did not demand a solution for these at the moment, so they fall under the second rule above for non-node pages.

There are a few changes in the Nodewords queue that may affect how canonical URLs work going forward, so we need to keep an eye on #1244132: Change Canonical URL handling with option to use the url_alias, full path.

Otherwise, the patch in the first comment is a great starting point, and I have already tested it heavily.

Comments

jwilson3’s picture

Status: Active » Needs review
FileSize
8.03 KB

nonsie’s picture

I took a look at it and based on the first impression it looks good - however I want to actually apply the patch before committing it.

nonsie’s picture

Status: Needs review » Needs work

It still needs a change to hook_domainbatch(). Essentially keep it consistent.

nonsie’s picture

I had time to dig into this feature and I'm not sure I fully understand its purpose. Is this supposed to be turned on/off on a per-domain basis or on a per-DA-install basis?

At the same time I'm contemplating moving keywords/description away from Settings to a separate screen; these ought to be in hook_domainlinks()/hook_domainbatch() instead of hook_domainform(). Dunno how much coffee I had had on the day I wrote that in.

jwilson3’s picture

The canonical URL setting has to be set per installation, not per domain. Either all the sites implement the canonical URL enhancement, or none of them do. Therefore, this setting should be shown *only* on the hook_domainform, and nothing needs to go into hook_domainbatch regarding canonical URLs.

I agree that the keyword/description functionality is not optimal, and #1230764: Allow metadata for default domain is a good start, but moving keywords/descriptions out of hook_domainform should be taken up in a completely separate ticket. There is sufficient difference between the functionality of canonical URLs and per-domain keywords/descriptions that their implementations don't need to follow each other to the T.

Finally, if you are going to move keywords/description out of hook_domainform, then it makes the most sense to move the 'Enable canonical urls' checkbox out of the 'Meta tags' fieldset and into the 'SEO' section of the form. Does that make sense?

nonsie’s picture

Thanks for the explanation. Planning to roll a new patch and commit both this and #1230764: Allow metadata for default domain today.

jwilson3’s picture

@nonsie:

Would you like me to move the canonical URL checkbox to the SEO fieldset (out of the domain_meta fieldset) in hook_domainform? I can test and re-roll now if you wish.

jwilson3’s picture

Status: Needs work » Needs review
FileSize
8.27 KB

Reroll... I moved the checkbox to the 'Advanced Settings' fieldset on the domainform, just below the "Search Engine Optimization" radio button and the default source domain field.

Moving this around led me to discover that there is a submodule called domain_source, that I'm not using, that allows you to set the primary source domain on a per-node basis.

There is also a variable provided by the basic domain module called domain_default_source, which allows one to change the default domain. I'm not totally sure whether this setting should be ignored or factored in when calculating the canonical URL. My hunch is that it should be used. Looking at how the domain module uses the value, it's only used when the SEO setting to redirect nodes is enabled. However, I have the SEO redirect setting enabled, but the nodes are not redirected to the primary domain. I don't know if this is a bug in the domain module or another misconfiguration of my own. Anyway, long story short, both of these features seem very unlikely to be used, and I honestly don't have time to continue testing this at the moment, so I've added @todo comments as reminders that these need to be considered when determining which domain to use for the canonical URL.

nonsie’s picture

Status: Needs review » Needs work

Thanks for the re-roll, gets this one step closer but not quite. One more question - which version of Nodewords are you using? The versions listed as supported on the project page do not have NODEWORDS_TYPE_NODE defined.

To clarify, domain_default_source sends users viewing nodes assigned to all affiliates to this particular domain; it does not, however, change the default domain (this is a feature only in D7). For example, you have a node that is published to all affiliates and you have set the default source domain in advanced settings to example2.com (domain ID 2). This means that when clicking on a link to this node you are sent to example2.com, not the domain you were on. By default this value is your primary domain, or you could opt for it to be the same domain you were on to begin with (the "Do not change domain" option).

Your nodes are probably not getting redirected to the primary domain (despite SEO redirect being enabled) because you have it set to "Do not change domain". Let me know if this is NOT the case and I'll track it in DA issue queue.

Which brings us back to this specific issue. IMHO canonical url should be determined in the following order for nodes that are published to more than one domain (either explicitly or all affiliates):
1. If Domain Source is enabled, respect its settings (based on node).
2. If SEO rewrite is enabled (based on install-wide setting):
2.1. Node is published to all affiliates: respect SEO rewrite settings.
2.2. Node is published to multiple domains but not to all affiliates: go to #3.
3. Default to the primary/default domain if it is one of the enabled domains or the node is published to all affiliates; otherwise default to the first enabled domain.

For other paths it should use #2 if enabled or #3.
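Roughly, the order for nodes could be sketched like this in plain PHP. This is only an illustration of the lookup order, not code from any patch: the function name and all parameters are hypothetical, domain IDs are assumed to be integers, and a NULL $seo_id is assumed to cover both "SEO rewrite disabled" and "Do not change domain".

```php
<?php
/**
 * Hypothetical sketch of the lookup order above; all names are made up.
 *
 * @param int|null $source_id      Domain ID chosen by Domain Source for this
 *                                 node, or NULL if not enabled/not set.
 * @param int|null $seo_id         Domain ID the SEO rewrite setting points
 *                                 to, or NULL if it does not apply.
 * @param bool     $all_affiliates TRUE if the node goes to all affiliates.
 * @param array    $enabled_ids    Domain IDs the node is published to.
 * @param int      $default_id     ID of the primary/default domain.
 *
 * @return int Domain ID to use for the canonical URL.
 */
function example_pick_canonical_domain($source_id, $seo_id, $all_affiliates, array $enabled_ids, $default_id) {
  // 1. Domain Source wins when it has a value for the node.
  if ($source_id !== NULL) {
    return $source_id;
  }
  // 2.1. SEO rewrite only applies to nodes sent to all affiliates.
  if ($seo_id !== NULL && $all_affiliates) {
    return $seo_id;
  }
  // 3. Fall back to the default domain when it is valid for this node...
  if ($all_affiliates || in_array($default_id, $enabled_ids, TRUE)) {
    return $default_id;
  }
  // ...otherwise use the first enabled domain. An empty list is not
  // expected here; the default domain is returned as a guard.
  return $enabled_ids ? reset($enabled_ids) : $default_id;
}
```

Case 2.2 needs no branch of its own in this sketch, because a node that is not sent to all affiliates simply falls through to step 3.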

jwilson3’s picture

Using nodewords 6.x-1.x-dev (which is the one currently recommended by DamienMcKenna, the module co-maintainer). As this module has no stable release yet, I don't see a risk in going ahead and committing this. (I updated from nodewords 6.x-1.11 to 6.x-1.x-dev without issue.)

Also, in your order for determining the canonical URL for a given node, if I'm not mistaken, there is also the (optional) possibility of defining a custom canonical URL on the node edit form, which should probably take precedence over everything. Unfortunately, there is no way to tell whether an existing canonical URL value passed into the alter hook was provided by a user or by some other automatic system that interacts with Nodewords and ran before the domain_meta module's Nodewords alter hook was called. This is where the entire logic tree breaks down, because Nodewords creates a default canonical URL on its own.

I'm probably making things more complicated and thus, less likely to actually make it into a working patch, by pointing all this extra stuff out. The patch I've made does provide the basic functionality for the most common use cases, which is better than nothing at all, and goes a long way towards improved SEO.

jwilson3’s picture

"Your nodes are probably not getting redirected to the primary domain (despite SEO redirect being enabled) because you have it set to "Do not change domain". Let me know if this is NOT the case and I'll track it in DA issue queue."

You were correct. I have the SEO redirect enabled, but have the 'default source' set to not change anything. This is somewhat of a counter-intuitive or sub-optimal UX, but a topic for a completely different discussion.

nonsie’s picture

Status: Needs work » Needs review
FileSize
9.06 KB

I've re-rolled your patch with some changes from #9. It will override user defined canonical url for nodes published to all affiliates since (as you stated earlier) there's no way to determine if the canonical url nodewords passes along is a default or user defined. It could be solved if canonical url was present in $options['default'] in hook_nodewords_tags_alter().
I could use some extra eyes on this patch before I commit it.

DamienMcKenna’s picture

Subscribe. I'll take a look at the site James has it configured on and provide some feedback.

jwilson3’s picture

I think there is one too many ORs in this sentence, no?

+  // If the node is sent to all domains or more than one domain then the
+  // canonical URL should point to either the domain specified in Domain Source
+  // or for sitewide SEO rewrites or default to the primary domain.

This sentence is missing a period.

+  // Domain Source node rewrites override SEO redirects if set

I'd rather see this section without the !$domain condition...

+      if ($use_source) {
+        $domain = domain_get_node_match($options['id']);
+      }
+      // No domain is set, first domain is not the current domain
+      if (!$domain && $first_domain_id != $current_domain['domain_id']) {

... something like:

+      if ($use_source) {
+        $domain = domain_get_node_match($options['id']);
+      }
+      // No domain is set, first domain is not the current domain
+      elseif ($first_domain_id != $current_domain['domain_id']) {

Reading through the code, it makes sense, but I haven't tested this with Domain Source module enabled yet.

jwilson3’s picture

OK, whoops, I misinterpreted that last block of code. What can domain_get_node_match() return when there is no match (FALSE? NULL?)?

nonsie’s picture

The last block should always return an array for our purpose - it might return NULL for the first save operation but this does not affect the use case in this module.
And a new patch

mr.j’s picture

Subscribing. Good stuff. I am still using the hack I came up with in the thread the OP linked to for displaying one set of forums on multiple subdomains.

Surely the "SEO rewrite" option is an obsolete hack to work around a problem that canonical URLs now solve?

I have always found that the SEO rewrite can be off-putting for a user where the site has different themes or layouts between subdomains. i.e. they click a link and SEO rewrite sends them to an entirely different looking subdomain, and most links off that subdomain now point to itself, not the subdomain the user came from. For this reason I have never used it, and it seems like a good candidate for retirement if canonical URLs are handled properly.

mr.j’s picture

I am seeing a problem with this patch and the current 6.x dev version of nodewords.

Specifically on nodes that are published to multiple domains the canonical tag is prefixing the full node URL with the default domain (instead of replacing it) so I end up with something like this:

<link rel="canonical" href="http://www.mydomain.com/http://de.mydomain.com/forum/discussion/topic" />

I am not sure why url() is returning a full absolute path. My hunch is that it is a result of the domain module including settings_custom_url.inc in hook_boot which can force absolute URLs in domain_url_outbound_alter().

Similar code as in domain_meta_alter_page_canonical_url() fixes it for me:

--- Base (BASE)
+++ Locally Modified (Based On LOCAL)
@@ -188,9 +188,21 @@
   // Create the nodewords Canonical URL based on the selected $domain's url,
   // plus the local path.
   if (isset($domain)) {
-    return $domain['path'] . trim(url('node/'. $options['id']), '/');
+    $url = trim(url('node/'. $options['id']), '/');
+    
+    // Replace the current domain with the default domain in the canonical URL.
+    if (strpos($url, $current_domain['path']) === 0) {
+      return str_replace($current_domain['path'], $domain['path'], $url);
   }
+    // Otherwise (assuming someone hasn't already specified the URI by hand)
+    // prepend the default domain to the URL for relative paths. This section
+    // will need further attention depending on the implementation in
+    // http://drupal.org/node/1244132
+    elseif (strpos($url, $domain['path']) === FALSE) {
+      return $domain['path'] . $url;
 }
+  }
+}
 
 /**
  * Implements hook_domainlinks()

jwilson3’s picture

#18 may be a valid bug, but since the code to fix it is repeated from an earlier section, it should be extracted into a helper function to reduce duplicate code, and improve readability.
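Something along these lines, as a rough sketch only. The function name and signature are hypothetical and not from any posted patch. It also differs from the diff in #18 in two ways that are my own assumptions: it swaps only the leading base URL (substr() instead of str_replace()), and it strips only leading slashes so that a bare front-page URL still matches.

```php
<?php
/**
 * Hypothetical helper; the name and signature are not from any posted patch.
 *
 * @param string $url          URL as returned by url(), relative or absolute.
 * @param string $current_base Base URL of the current domain, with a
 *                             trailing slash.
 * @param string $target_base  Base URL of the canonical domain, with a
 *                             trailing slash.
 *
 * @return string Absolute URL on the canonical domain.
 */
function example_swap_domain($url, $current_base, $target_base) {
  // Drop the leading slash of relative paths such as '/node/12'.
  $url = ltrim($url, '/');
  // Absolute URL on the current domain: swap only the leading base URL.
  if (strpos($url, $current_base) === 0) {
    return $target_base . substr($url, strlen($current_base));
  }
  // Relative path: prepend the canonical domain.
  if (strpos($url, $target_base) === FALSE) {
    return $target_base . $url;
  }
  // Already points at the canonical domain.
  return $url;
}
```

Both the node and the page canonical URL functions could then call the same helper with the output of url() and the 'path' values of the current and selected domains.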

mr.j’s picture

Yep I figured someone who doesn't see the problem I had might want to eyeball it first.

Anyway here is a full patch to domain_meta.module including the use of a single function for that shared code.

jwilson3’s picture

Thanks for the patch.

From a read-through, aside from some missing punctuation and inconsistent capitalization in the function comment, it looks good. I haven't tested the code though, and probably won't have a chance to anytime in the near future. Maybe nonsie can have a look?

.jon’s picture

Thanks for working on this, I'll join by testing.

.jon’s picture

Status: Needs review » Reviewed & tested by the community

Ok, tested with patch from #20 and so far looks good.

+ module serves different meta tags to domain A (our primary or domain 0) and domain B
- the meta tags of domain A are added to the meta tags of domain B (not an issue in our case, but could be)
+ canonical urls are added correctly into headers based on the domain where content is created

Questions:

1. Is it necessary to print canonical in the headers on the original source domain? (linking to itself) Not sure if this has any downsides either.
2. When people refer to "primary domain", do they mean the domain 0, or the domain used to publish the content? We create content on domain B and publish it on domain A (0).

Anyway, will put the patched version to production and report how it goes, we have a nice example case of not getting any search engine hits on domain B (original source), but some hits on domain A. Could be both missing unique meta tags of domain B and the missing canonical urls, so this module looks like it solves both.

mr.j’s picture

We have the canonical tags on every domain, including the one the content was published to. It cannot hurt, and can be beneficial in ways you hadn't intended. For example, we use a CDN for images and CSS, and the other day we found that duplicate pages had been unintentionally added to Google's index on the CDN subdomain. If we had had the canonical tags on there in the first place, that would not have happened.

Re "primary domain", I take that to mean the domain the content was published to.

jwilson3’s picture

@23: Your scenario of the content belonging to domain B but being shared to domain A is interesting, but when I wrote this originally, I was unable to find a way to determine which domain is intended to be the "primary" domain on a per-node basis. Thus, for most simple cases, the primary domain usually ends up being the *first checked domain* in the list (on the node edit page) of valid domains where the content should be published.

The decision tree for selecting the URL to use for nodes is heavily commented in the patch in #20; look at the domain_meta_alter_node_canonical_url function.

nonsie’s picture

I agree with jwilson3 - there really isn't a way to determine which is the "primary" domain.
Unless anyone objects I am going to commit #20 and roll a new release today.

nonsie’s picture

Status: Reviewed & tested by the community » Fixed

And it's been committed. Thanks for helping out!

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.