We use the Markdown input filter on our site, which means that (internal) links are stored in the node database, and seen by Link checker, as for instance [click here](/node/123).
These URLs are currently not extracted.
This could be solved by sending the node's body through the input filters before extracting links.

Comments

hass’s picture

Status: Active » Postponed (maintainer needs more info)

Does it work on initial scanning?

Please provide a few content examples for the Markdown filter tests. I'm not using it and have no idea what the content needs to look like.

wimh’s picture

The syntax is the same as used on Wikipedia. Full description is here: http://daringfireball.net/projects/markdown/syntax#link
The links look like this (with an optional link title):

[an example](http://example.com/)
[an example](http://example.com/ "Title")

No problem for external links (they are found), but for internal/relative links you get this:

[click here](/node/123)
[click here](/node/123 "Title")
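
To illustrate why such links are missed, here is a minimal sketch in plain PHP, outside Drupal. It assumes an extractor that only looks for HTML anchors in the stored text, which may not match what Link checker really does. The function names sketch_extract_hrefs() and sketch_markdown_links_to_html() are hypothetical, and the single regex is only a stand-in for the real Markdown filter: it handles nothing but the inline link form shown above.

```php
<?php
// Hypothetical helper: collect the href values of HTML anchors.
function sketch_extract_hrefs($html) {
  preg_match_all('/<a\s[^>]*href="([^"]*)"/i', $html, $matches);
  return $matches[1];
}

// Hypothetical stand-in for the Markdown filter: converts only inline links
// of the form [text](url) or [text](url "Title") into HTML anchors.
function sketch_markdown_links_to_html($text) {
  return preg_replace_callback(
    '/\[([^\]]+)\]\(\s*([^\s)]+)(?:\s+"([^"]*)")?\s*\)/',
    function ($m) {
      $title = (isset($m[3]) && $m[3] !== '')
        ? ' title="' . htmlspecialchars($m[3]) . '"'
        : '';
      return '<a href="' . htmlspecialchars($m[2]) . '"' . $title . '>'
        . htmlspecialchars($m[1]) . '</a>';
    },
    $text
  );
}

$body = 'See [click here](/node/123 "Title") for details.';

// The raw source contains no <a> tag, so nothing is found.
$raw_links = sketch_extract_hrefs($body);

// After the (stand-in) filter has run, the anchor and its href are there.
$filtered_links = sketch_extract_hrefs(sketch_markdown_links_to_html($body));
```

In this sketch $raw_links stays empty while $filtered_links contains /node/123, which is the difference that filtering before extraction would make.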

hass’s picture

Since #451456: 301 auto-update could break links was committed, the full URL is no longer extracted. I see that filters aren't supported today, but we would need to write extra logic for every filter. If we don't do this, we cannot automatically update links on 301 status codes. Only running the filters doesn't work reliably.

Just think about the link module. It saves the link in the node table and not in the node content. It would be wrong to run filters on node content.

wimh’s picture

Does this mean you want to know/care about all possible input formats?

Personally, I feel it's more important to fetch all possible links that the end-user can see. If some of these are stored in the database in an impractical form for automatic updating, or not at all when they are generated on the fly, I would prefer to lay the burden on the site's admin and have them do manual updates, rather than keep the broken link visible and annoy the visitors...
But in any case, it's your module, so you choose ;-)

(BTW: I'm curious, what fraction of broken links can be fixed automatically by following the 301s? I tend to see that most are 404s and need manual intervention anyway...)

hass’s picture

Status: Postponed (maintainer needs more info) » Active

Yes, I think there is no other way.

But there shouldn't be that many filters for URLs... over time we could add more and more... I expect 10 at most... not sure if this is realistic. Maybe we can do this with a modular plugin system. The main issue with running all filters is this: if you use filters that only save a reference in your node, you would only see that a link in node 10 is broken, but the link isn't there to fix. This sounds very confusing to me. Also see #387758: Links in views may not easy to find. I'm happy about every idea for optimisations... :-)

On my site I've seen ~10 x 301 and only two 404, but I have only 1400 links on this site... so - maybe not very representative :-). I'm very interested to hear from others about their statistics...

hass’s picture

It may really be easier to run the filters with a customized version of http://api.drupal.org/api/function/check_markup. I need to investigate further. At the very least I need to add a blacklist to this function, which might be named linkchecker_check_markup(). Such a blacklist needs to exclude all filters that are not required or that only use references to links (like views).

But changing the source URL would become an unsolvable challenge in the general case. I don't want the burden of supporting 301 link updates for all possible filters on my desk...
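
The blacklist idea could look roughly like the following sketch in plain PHP. It is not the Drupal filter API: sketch_check_markup(), the filter list format and the 'views_reference' filter are all hypothetical and only show the skipping logic.

```php
<?php
// Hypothetical sketch: apply filters in order, skipping blacklisted ones.
// $filters maps a filter name to a callable that takes and returns a string.
function sketch_check_markup($text, array $filters, array $blacklist) {
  foreach ($filters as $name => $callback) {
    if (in_array($name, $blacklist, TRUE)) {
      // Skip filters that only pull in links stored elsewhere.
      continue;
    }
    $text = call_user_func($callback, $text);
  }
  return $text;
}

$filters = array(
  'line_break' => 'nl2br',
  // Hypothetical filter that expands a reference into a link.
  'views_reference' => function ($text) {
    return str_replace(
      '[view:example]',
      '<a href="http://example.com/">from view</a>',
      $text
    );
  },
);

$output = sketch_check_markup("a\n[view:example]", $filters, array('views_reference'));
```

With 'views_reference' blacklisted, the line break filter still runs, but the reference is left unexpanded, so no link is reported for a node that does not actually contain it.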

wimh’s picture

Yes, I was already running my site with a patch like this one in _linkchecker_add_node_links():

- $text_items[] = $node->body;
+ $text_items[] = check_markup($node->body, $node->format);

This works fine for extracting links. The same should probably be done in _linkchecker_add_comment_links() and _linkchecker_add_box_links().
Changing URLs in the source is indeed not possible this way. Sending the contents through the filtering could be a configuration option, which would be mutually exclusive with the option to update links.
(But I don't like things changing my contents without a manual check anyway, so I'd choose the filtering option)

hass’s picture

Yeah, as a quick fix this should work. I think you need to change the code to check_markup($node->body, $node->format, FALSE), or cron may not be able to apply the filters on nodes having more than 100 links.

I first thought of going this way too, but I have now done it a bit differently and replaced the current linkchecker filter function with this filter approach. But during my first tests this morning I found that my two monster test nodes (~1800 and ~4000 links in the body) have trouble with the filters: cron simply times out after 5 minutes. Applying the filters to the content seems to take too long. I need to investigate this in detail, and I hope there is a way to speed it up. In general it seems to work well, but I expect new cases to appear in the queue about cron failures or overloaded sites. As a note: I'm also running the filters on comments and blocks.

Overall, a few other changes are required to prevent batch API failures... not sure what else.

wimh’s picture

> But while doing my first tests this morning I found out that my two
> monster test nodes (~1800 and ~4000 links in body) have issues
> with applying the filters. Cron simply times out after 5 minutes.

But the same filter is also applied while you save the node. Didn't it time out then too? And why wasn't the cached value used (shouldn't it have been in the cache_filter table)?

hass’s picture

No, saving the nodes works, but it also takes quite some time. I removed the cache functions early on. The reason is that linkchecker either needs its own cached version (e.g. without the views reference filter and other blacklisted filters applied), or there is no real need to cache at all (the content is only scanned once). I need to re-implement it in a different way and see if that helps...
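
A separate cache for the link checker's own filtered output could be sketched as below, in plain PHP. The function sketch_cached_filter(), the 'linkchecker:' key prefix and the in-memory array are assumptions for illustration; a real module would more likely use a database cache table.

```php
<?php
// Hypothetical in-memory cache, keyed separately from the regular filter
// cache so that output produced with blacklisted filters skipped is never
// mixed up with the normally rendered output.
function sketch_cached_filter($text, $format, $callback, array &$cache) {
  // Key on the format and a hash of the text, so edited content is re-filtered.
  $cache_id = 'linkchecker:' . $format . ':' . md5($text);
  if (!isset($cache[$cache_id])) {
    $cache[$cache_id] = call_user_func($callback, $text);
  }
  return $cache[$cache_id];
}

// Stand-in for an expensive filter run; counts how often it is invoked.
$calls = 0;
$slow_filter = function ($text) use (&$calls) {
  $calls++;
  return strtoupper($text);
};

$cache = array();
$first = sketch_cached_filter('some body', 1, $slow_filter, $cache);
$second = sketch_cached_filter('some body', 1, $slow_filter, $cache);
```

The second call is served from the cache, so the slow filter runs only once for unchanged content. Whether this helps in practice depends on how often the same content is scanned.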

hass’s picture

It always times out if markdown is active... try it yourself with a node of 2 MB of HTML or more and 2000 links. Use standard filtering (line break, URL filter and HTML corrector), with markdown enabled as well. Maybe it's caused by the badly broken HTML inside this 2000-link monster test node... it's very difficult to figure out what cron does in the background... :-/ but markdown doesn't seem to be the fastest filter, from my point of view.

I get errors like these and cron always fails:

[15-Jul-2009 09:04:01] PHP Fatal error:  Maximum execution time of 240 seconds exceeded in \sites\all\modules\markdown\markdown.php on line 2075
[15-Jul-2009 09:11:51] PHP Fatal error:  Maximum execution time of 240 seconds exceeded in \sites\all\modules\markdown\markdown.php on line 2075
[15-Jul-2009 09:16:52] PHP Fatal error:  Maximum execution time of 240 seconds exceeded in \sites\all\modules\markdown\markdown.php on line 2075
[15-Jul-2009 09:35:20] PHP Fatal error:  Maximum execution time of 240 seconds exceeded in \sites\all\modules\markdown\markdown.php on line 2075
[15-Jul-2009 20:55:24] PHP Fatal error:  Maximum execution time of 240 seconds exceeded in \sites\all\modules\markdown\markdown.php on line 2075
[15-Jul-2009 21:02:06] PHP Fatal error:  Maximum execution time of 240 seconds exceeded in \sites\all\modules\markdown\markdown.php on line 2075
[15-Jul-2009 21:04:06] PHP Fatal error:  Maximum execution time of 240 seconds exceeded in \sites\all\modules\markdown\markdown.php on line 2075
[15-Jul-2009 21:28:21] PHP Fatal error:  Maximum execution time of 240 seconds exceeded in \sites\all\modules\markdown\markdown.php on line 2075
[15-Jul-2009 21:40:08] PHP Fatal error:  Maximum execution time of 240 seconds exceeded in \sites\all\modules\markdown\markdown.php on line 2075
[15-Jul-2009 22:11:11] PHP Fatal error:  Maximum execution time of 240 seconds exceeded in \sites\all\modules\markdown\markdown.php on line 2075
[15-Jul-2009 22:23:24] PHP Fatal error:  Maximum execution time of 240 seconds exceeded in \sites\all\modules\markdown\markdown.php on line 2075
[15-Jul-2009 23:44:48] PHP Fatal error:  Maximum execution time of 240 seconds exceeded in \sites\all\modules\markdown\markdown.php on line 2075
[15-Jul-2009 23:50:26] PHP Fatal error:  Maximum execution time of 240 seconds exceeded in \sites\all\modules\markdown\markdown.php on line 2075
wimh’s picture

Is this with using job queueing? Or is the markdown filtering of that one node taking >240 seconds? It's strange then that saving the node works, but doing the same filtering during cron would take so much longer...

hass’s picture

Yes, job_queue executes this task. Since I enabled markdown, viewing the node is sometimes not possible, or rendering takes minutes. I thought this would be an easy patch, but that doesn't seem to be the case :-(.

wimh’s picture

> Since I have enabled markdown - viewing the node is sometimes not possible or rendering takes minutes.

But that just means your test node is incompatible with the markdown filtering, right? Nodes that have no problems with markdown (i.e. not too complicated or invalid HTML) should also pass Linkchecker's filtering without problems?

hass’s picture

Maybe... but who can say this for sure? I will post a patch, and then you can test it yourself and we can optimize.

hass’s picture

Status: Active » Needs work
Attached file (status: new, size: 6.42 KB)

Patch attached. Still needs work for adding the blacklist and url filter and more testing.

wimh’s picture

It's running on my site now. I had it clear all link data and analyze again, and everything seems to be working.

hass’s picture

Status: Needs work » Fixed

http://drupal.org/cvs?commit=243012

After thinking back and forth and forth and back again and again and again :-) - I decided to change the logic to work the way core does. Please try the latest DEV. I would be happy to hear if all works well.

wimh’s picture

Thanks! I just installed the latest DEV. There was a problem during the database upgrade, see #532178: Database update #6209 failed. Otherwise, it seems to work fine, I'll keep you posted on how it behaves in the longer run.

hass’s picture

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.