Make print URLs robots.txt friendly [#49794]

If you run a large site (mine has around 200,000 pages), search engine traffic and the server load it causes are significant issues, because search engines revisit each page periodically. It's best to direct search engines to read only the versions of pages you want indexed. Telling them not to index the content only after they have fetched the URL wastes a lot of traffic.

For that reason, URLs for print pages, email-to-a-friend pages, and any other per-node accessory pages should be excludable as a set using robots.txt.

Currently the Print Friendly Pages module uses URLs like http://www.example.com/node/766/print, which can't be excluded by robots.txt because they lack a common prefix. By contrast, if it used a URL like http://www.example.com/print/node/766, robots.txt could exclude everything under http://www.example.com/print/

If people specify URLs using the Path module, it would be nice to have the print module use corresponding URLs like http://www.example.com/print/user/specified/path/ that could be caught by the same robots.txt exclusion.
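To illustrate the difference between the two layouts, here is a minimal Python sketch that uses the standard library's urllib.robotparser to check the URLs above against a one-rule robots.txt. The rule itself and the helper name is_crawlable are assumptions made for illustration; they are not part of the module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for the proposed layout: a single prefix rule
# covers every print page, whether node-based or aliased.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
"""


def is_crawlable(url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a generic crawler may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("*", url)


if __name__ == "__main__":
    # Current layout: the print page shares no prefix with other print
    # pages, so the rule cannot match it.
    print(is_crawlable("http://www.example.com/node/766/print"))  # True
    # Proposed layout: both forms fall under /print/ and are excluded.
    print(is_crawlable("http://www.example.com/print/node/766"))  # False
    print(is_crawlable("http://www.example.com/print/user/specified/path/"))  # False
```

With the prefix layout, the ordinary node page stays crawlable while every print variant is skipped without being fetched.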

Comment	File	Size	Author
#1	print.module.patch	1.19 KB	ngaur

Comments

Comment #1

ngaur commented 25 February 2006 at 14:38

Version: 4.6.x-1.x-dev » 4.7.x-1.x-dev

Status	File	Size
new	print.module.patch	1.19 KB

The attached patch changes the URL format to a robots.txt friendly one.

I figured that if the site admin doesn't want print URLs indexed, then they don't need the search engine to visit them at all. Hence this patch also adds rel="nofollow" to the links to the printer friendly pages where appropriate.
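The patch itself is PHP and is not reproduced here. As a rough Python sketch of the idea only: build the link with the /print/ prefix and add rel="nofollow" unless the print page is meant to be indexed. The function name print_link, the indexable flag (a stand-in for "where appropriate", which isn't spelled out here), and the link text are all hypothetical.

```python
from html import escape


def print_link(path: str, indexable: bool = False) -> str:
    """Build an HTML link to the printer-friendly version of `path`.

    `indexable` is a hypothetical switch: when False, rel="nofollow" is
    added so that crawlers are discouraged from following the link.
    """
    href = "/print/" + path.lstrip("/")
    rel = "" if indexable else ' rel="nofollow"'
    return f'<a href="{escape(href, quote=True)}"{rel}>Printer-friendly version</a>'


if __name__ == "__main__":
    print(print_link("node/766"))
    print(print_link("user/specified/path/", indexable=True))
```

The nofollow hint complements the robots.txt rule rather than replacing it: the rule stops compliant crawlers from fetching the page, while the attribute keeps them from queuing the link in the first place.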

The patch applies against the 4.7.0 version of the module. Backporting should be straightforward.