Pathologic
Pathologic is an input filter which can correct paths in links and images in your Drupal content in situations which would otherwise cause them to “break”: for example, if the URL of the site changes, or if the content is moved to a different server. Pathologic can also solve the problem of missing images and broken links in your site’s RSS feeds.
Example use cases
Here are some hypothetical situations in which Pathologic can save the day.
- You run a personal site, and the address of your site has recently changed. Perhaps you moved to a shiny new domain name, or perhaps you moved the Drupal installation from one subdirectory to another. Now all the images and internal links in your content don't work. You could go through your site node by node and update all the paths… or you could install Pathologic.
- You oversee a site which has testing and production servers at separate URLs. Copy-editors (and/or you) edit content on the testing server, and that eventually gets pushed over to the production server. When those darn editors link to other content on the site, they sometimes link to content using the test server's URL; these links break when the content is published to the production server. You could get frustrated at your editors (and/or yourself) when this happens… Or you could install Pathologic and never have to worry about it again.
- Your Drupal site has been up for a while, but you've recently discovered the Clean URLs feature and enabled it. Your links still work, but they still have that ugly "?q=" thing in them, and you have better things to do with your time than go through all your content to prettify the links. Or maybe you're going the other way; you used to have Clean URLs enabled, but you've had to disable it, and now your links are broken. Pathologic to the rescue!
- Links and/or images in your site content use relative paths (e.g., <a href="tag/food/pizza">) which work fine for people reading content on your site, but break gracelessly for people reading the content through RSS or some other sort of external feed. You could just start using absolute paths instead, but you're too set in your ways and would rather have a tool like Pathologic do it for you.
Installation
Pathologic is an input filter, so getting it installed and configured is a little bit more difficult than standard modules, but the instructions below will walk you through the process.
- Install the Pathologic module as normal. (If you’re a total Drupal newbie, these instructions for installing modules may be helpful – and welcome to the community, by the way!)
- In the Administer menu, select “Input formats” from the “Site configuration” section. A list will appear of the various input formats your site uses. Find one in the list which you want to use Pathologic with, and click the “configure” link for that format.
- On the next page, find the section labeled “Filters.” Check the box next to “Pathologic.” All other options on this page can be left alone. Click the “Save configuration” button at the bottom.
- This will take you back to the same page, with a message telling you “The input format settings have been updated.” Now, find the “Rearrange” tab at the top of the page and click it.
- This will bring you to a list of filters that this input format uses. Pathologic should be at the bottom of this list; if so, you don't have to do anything. If it is not, adjust the values in the “Weight” column so that Pathologic has the highest value. Click “Save configuration” when done.
- If you wish to use Pathologic with other input formats, go back to the second step (selecting “Input formats” in the Administer menu) and repeat the process for each one.
- Pathologic is now working on all old and new content which uses the input format(s) you added it to.
Is configuration necessary?
Depending on how you intend to use Pathologic and how the paths in your existing content are formed, further configuration may not be necessary. To help you decide whether it is necessary in your case, and to make the configuration steps easier to follow, allow me to take a moment to explain how Pathologic works.
Pathologic looks at paths that are located in href attributes of links (<a> tags), as well as the src attributes of image tags and tags for other embedded media (<img>, <embed>, etc). If you wish, you can configure Pathologic to work on src attributes but ignore href attributes, or vice versa.
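To make that scanning step concrete, here is a minimal Python sketch. It is not Pathologic's own code; the names PathCollector and collect_paths are made up for this illustration, and for simplicity it picks up href and src on any tag rather than on a fixed list of tags. The attributes argument mimics the option to work on src but ignore href, or vice versa.

```python
from html.parser import HTMLParser


class PathCollector(HTMLParser):
    """Collects (tag, attribute, value) triples for the chosen attributes."""

    def __init__(self, attributes=("href", "src")):
        super().__init__()
        self.attributes = attributes
        self.paths = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        for name, value in attrs:
            if name in self.attributes and value is not None:
                self.paths.append((tag, name, value))


def collect_paths(html, attributes=("href", "src")):
    """Return every path found in the given attributes of the HTML fragment."""
    collector = PathCollector(attributes)
    collector.feed(html)
    collector.close()
    return collector.paths


sample = '<a href="tag/food/pizza">Pizza</a> <img src="files/pizza.png">'
print(collect_paths(sample))
print(collect_paths(sample, attributes=("src",)))  # ignore href, keep src
```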
After finding a path in an attribute, Pathologic then determines if a path is “local.” It does its magic on local paths, but leaves other paths alone.
Let's assume that your Drupal site is up and running at http://example.com/drupal/. Pathologic considers a path local if:
- The path is a relative path. That is, it does not have a protocol fragment (such as http://) and does not begin with a slash. For example, tags/food/pizza will be considered a local path, but /tags/food/pizza and http://drupal.org/tags/food/pizza are not.
- The path is an absolute path that points to a resource located within your Drupal installation. Our example is located at http://example.com/drupal/, so http://example.com/drupal/tag/food/pizza is considered a local path. However, while http://example.com/not_drupal/ points to a resource on the same domain name, it points to something outside of the Drupal installation, so it is not considered local.
- The path contains only an anchor fragment, such as #pizza.
- The path is an absolute path which begins with a URI of another Drupal installation which you’ve instructed Pathologic to consider local.
Aha! That last one is where things start getting interesting. Let’s say you’ve grown tired of using http://example.com/drupal/, so you’ve moved your site over to http://example.net/. (For those interested in using Drupal in a test/production server paradigm, imagine that example.com is the test server and example.net is the production server.) If all the paths in your content are relative paths, then Pathologic will handle them perfectly, with no need for further configuration. If, however, they are absolute paths that begin with http://example.com/drupal/, then Pathologic will not consider them local by default and will ignore them. Fortunately, we can tell Pathologic to treat such paths as local and to fix them.
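The four rules above can be sketched as a small Python function. This is only an illustration of the rules as described here, not Pathologic's actual implementation; the function name is_local and its parameters are assumptions made for the example.

```python
from urllib.parse import urlsplit


def is_local(path, base_url, extra_local=()):
    """Apply the four 'local path' rules described above (illustrative only)."""
    # Rule 3: the path is only an anchor fragment, such as #pizza.
    if path.startswith("#"):
        return True
    # Rules 2 and 4: an absolute path that begins with this installation's
    # URL, or with another installation's URL we were told to treat as local.
    for prefix in (base_url, *extra_local):
        if path.startswith(prefix):
            return True
    # Rule 1: a relative path has no protocol, no host, and no leading slash.
    parts = urlsplit(path)
    return not parts.scheme and not parts.netloc and not path.startswith("/")


old_site = "http://example.com/drupal/"
new_site = "http://example.net/"
link = "http://example.com/drupal/tag/food/pizza"

print(is_local("tags/food/pizza", old_site))             # relative path
print(is_local("/tags/food/pizza", old_site))            # leading slash
print(is_local(link, new_site))                          # after the move
print(is_local(link, new_site, extra_local=[old_site]))  # old URL declared local
```

The last two calls show the situation after the move: the old absolute link is ignored until the old installation's URL is listed as an additional local path.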
Configuring Pathologic
If you've determined that configuring Pathologic may be necessary, here's how to go about it.
- In the Administer menu, select “Input formats” from the “Site configuration” section. A list will appear of the various input formats your site uses. Find one in the list which you are using Pathologic with, and click the “configure” link for that format. (Note that if you are using Pathologic with more than one input format, you will have to repeat this configuration process for each input format.)
- Click the “Configure” tab at the top of the next page.
- Find the Pathologic section on the next page – it should be near the bottom.
- Toggle the “Transform values of href attributes” and “Transform values of src attributes” check boxes as may be necessary.
- Enter the paths of other/previous Drupal installations which should be considered local in the “Additional paths to be considered local” text field. Enter one path per line. For the above example, we’d want to enter http://example.com/drupal/.
- Click the “Save configuration” button when done.
(Note for those using testing and production servers: in cases where it would be inconvenient to have separate settings on each server, it’s safe to put the path for the “current” server in the “Additional paths” field. Pathologic will simply remove it when it does its trick. In other words, both the example.com and example.net servers can have both http://example.com/drupal/ and http://example.net/ in the field.)
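To see why listing both URLs on both servers is harmless, here is a small Python sketch of the prefix-swapping idea. It is an assumption-laden illustration, not Pathologic's real code: the function name rewrite is hypothetical, and it only covers absolute paths that begin with a known installation URL.

```python
def rewrite(path, current_base, additional_paths=()):
    """Swap any known installation prefix for the current server's base URL."""
    # Try the longest prefix first so a more specific URL wins over a shorter one.
    known = sorted((current_base, *additional_paths), key=len, reverse=True)
    for prefix in known:
        if path.startswith(prefix):
            return current_base + path[len(prefix):]
    # Not a path we were told to consider local: leave it untouched.
    return path


# The same two entries are configured on both servers.
both = ["http://example.com/drupal/", "http://example.net/"]

# On the production server (example.net):
print(rewrite("http://example.com/drupal/tag/food/pizza", "http://example.net/", both))
# On the testing server (example.com/drupal):
print(rewrite("http://example.net/tag/food/pizza", "http://example.com/drupal/", both))
```

Whichever server runs the filter, the entry that matches the current server simply maps a path back onto itself, so it does no harm.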
Now sit back and enjoy the fruits of Pathologic’s labor.
WYSIWYG editor compatibility
If the site is using a WYSIWYG content editor such as FCKeditor, TinyMCE, etc., and Pathologic doesn’t seem to be doing anything, it may be because such editors often try to output paths which begin with a slash character. Such paths are usually ignored by Pathologic, because Pathologic considers them to be absolute. However, you can trick Pathologic into working with such paths by using the “Additional paths to be considered local” field. If the Drupal installation is at the root level of a web site (such as http://example.com/), simply enter a single slash into the “Additional paths” field. If it's in a subdirectory (such as http://example.com/foo/drupal/), enter the full subdirectory path, with slashes at both the beginning and end (so /foo/drupal/ in this case). See the “Configuring Pathologic” section above for more information.
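If you are unsure what to type into the field, the rule above boils down to "the path portion of the installation's URL, with a slash on each end." The following Python sketch works that out; the helper name wysiwyg_local_entry is invented for this example and is not part of Pathologic.

```python
from urllib.parse import urlsplit


def wysiwyg_local_entry(install_url):
    """Return the value to enter in the 'Additional paths' field for slash-led paths."""
    # Keep only the path portion of the URL and drop surrounding slashes.
    subdirectory = urlsplit(install_url).path.strip("/")
    # A root-level installation needs a single slash; a subdirectory
    # installation needs its path wrapped in slashes.
    return "/" + subdirectory + "/" if subdirectory else "/"


print(wysiwyg_local_entry("http://example.com/"))
print(wysiwyg_local_entry("http://example.com/foo/drupal/"))
```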
Migrating from Path Filter
Path Filter is an input filter which works similarly to Pathologic, but requires one to type a prefix of “internal:” before all internal paths they want Path Filter to function on. A downside to this is that a site’s content becomes strewn with these bits, and if Path Filter is disabled, those “internal:” prefixes are going to be spat out to web browsers that won’t know what to do with them. That’s one of the reasons I avoided using such “hints” in Pathologic.
If you are interested in migrating from Path Filter to Pathologic, be aware that Pathologic will automatically look for a prefix of “internal:” in your paths, and will ignore it if found. This means you should be able to use Pathologic as a drop-in replacement for Path Filter, with no additional configuration.
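In effect, a path such as internal:node/5 is treated as if it had been written node/5. A minimal Python sketch of that behavior follows, assuming the prefix is simply dropped before the usual handling; the function name is hypothetical and this is not Pathologic's own code.

```python
def strip_internal_prefix(path):
    """Drop a leading Path Filter style 'internal:' hint, if present."""
    prefix = "internal:"
    if path.startswith(prefix):
        return path[len(prefix):]
    return path


print(strip_internal_prefix("internal:node/5"))
print(strip_internal_prefix("tag/food/pizza"))  # paths without the hint pass through
```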
Caching issues
Drupal caches the output of input formats for speed. If circumstances change so that the paths Pathologic created are no longer correct, that cached output goes stale. See this issue and this issue for examples of this sort of problem which have come up in real-world use. Unfortunately, there’s no really good way to fix this without making Pathologic something other than a standard (and cacheable) input filter. To avoid these sorts of problems, consider these tips:
- Do not change the URL path of established nodes, particularly if you have linked to them in your site content. Decide on a good URL path when the node is created and keep it. (If changing the path is truly necessary, change the path on the node editing form as normal, then go to Administer > Site building > URL aliases and create a new path which points the “old” path to the node to avoid breaking both internal and external links.)
- Pathologic may behave unpredictably if only part of your site or some of your users are connected via an HTTPS connection; namely, some of the links will have an https:// protocol prefix and some will have an http:// one, depending on which sort of connection the user is using when the content is run through the input format. To avoid this, I suggest that HTTPS support be all-or-nothing on your site; either all connections use it, or none. Also, if your site did not previously use HTTPS connections but you’ve recently enabled this (or vice versa), flush your site’s cache so that Pathologic rebuilds paths in your content to use (or not use) the https:// prefix.
- When migrating Drupal database contents from one site to another, exclude the contents of the cache tables (basically, all tables with names which begin with “cache”). This is actually a good idea whether you're using Pathologic or not. On a Unix-like system, the following shell command strips out of a database dump the data which generally shouldn’t be migrated when moving a Drupal database from one site to another (here, the contents of the cache and watchdog tables).
sed -E -e "/^INSERT INTO \`(cache|watchdog)/d" < /path/to/dump.sql > /path/to/dump-stripped.sql
You will need to tweak the regular expression a bit if your database uses a table prefix. If you are unable to run this command or otherwise avoid migrating cache data, you should clear your site’s cache after importing the data; you can do this by going to Administer > Site configuration > Performance and clicking the “Clear cached data” button near the bottom of the page.
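For those who cannot run sed, or who want to see what the table-prefix tweak looks like, here is a rough Python equivalent of the command above. It is only a sketch: the function name and the drupal_ prefix in the example are made up, and it assumes the dump writes each INSERT statement on a line starting with INSERT INTO followed by a backtick-quoted table name, just as the sed expression does.

```python
import re


def strip_volatile_inserts(lines, table_prefix=""):
    """Drop INSERT statements for cache and watchdog tables from dump lines."""
    # Same idea as the sed expression, with an optional table prefix
    # escaped and placed in front of the table names.
    pattern = re.compile(
        r"^INSERT INTO `" + re.escape(table_prefix) + r"(cache|watchdog)"
    )
    return [line for line in lines if not pattern.match(line)]


dump = [
    "INSERT INTO `drupal_node` VALUES (1);",
    "INSERT INTO `drupal_cache_filter` VALUES ('x');",
    "INSERT INTO `drupal_watchdog` VALUES (1);",
]
# "drupal_" is a hypothetical table prefix used only for this example.
print(strip_volatile_inserts(dump, table_prefix="drupal_"))
```

To process a real file, read it line by line, pass the lines through the function, and write the result to a new file rather than overwriting the original dump.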
Questions? Suggestions? Need help?
Please open an issue on Pathologic’s issue queue or contact the author and I’ll get back to you soon. Thanks for trying Pathologic!
