How to stop Non-Links from being checked? [#523656]

I tried searching for this, as I'm thinking I don't have some checkbox checked correctly to stop Link Checker from checking what aren't actual links.

# # #

I run Link Checker and it reports that URLs are bad (500, 400, etc.) that aren't valid links within the node.

Example Page:

http://winetest.fermentedreviews.com/twin_fin_2005_cabernet_sauvignon_ca...

"www.twinfinswines.com" on that page comes up as "php_network_getaddresses: getaddrinfo failed: Name or service not known." Other pages on the site with non-links of www.xyz.com come up with various other error codes, based upon whatever the server sends back.

I'm guessing that every text entry of a plain www.xyz.com within each node is being checked? And I'm just getting the normal resultant errors (the checked Server doesn't want to talk to HEAD, etc.)?

Is there a way to have Link Checker only check items that are enclosed within type markup?

I currently have checked:

Scan node types for links: {All node types selected}

Scan comments for links
Scan blocks for links
Check full qualified domain names only*

Extract links in and tags
Extract links in tags

All other settings are default.

Any pointers on what to change?

Thanks,
Sam

*Thought this would exclude plain text, but it didn't.

Comments

Comment #1

Michael-IDA commented 18 July 2009 at 20:37

Hmm, I guess I don't have enough permission to edit my own posts. The above was eaten a bit by a filter, here's the entire post in a code block so things aren't truncated or missing.

I tried searching for this, as I'm thinking I don't have some checkbox checked correctly to stop Link Checker from checking what aren't actual links.

# # #

I run Link Checker and it reports that URLs are bad (500, 400, etc.) that aren't valid links within the node.

Example Page:

http://winetest.fermentedreviews.com/twin_fin_2005_cabernet_sauvignon_california.html

"www.twinfinswines.com" on that page comes up as "php_network_getaddresses: getaddrinfo failed: Name or service not known."  Other pages on the site with non-links of 'www.xyz.com' come up with various other error codes, based upon whatever the server sends back.

I'm guessing that every text entry of a plain 'www.xyz.com' within each node is being checked?  And I'm just getting the normal resultant errors (the checked Server doesn't want to talk to HEAD, etc.)?

Is there a way to have Link Checker only check items that are enclosed within <A HREF=...> type markup?

I currently have checked:

Scan node types for links: {All node types selected}

Scan comments for links
Scan blocks for links
Check full qualified domain names only*

Extract links in <a> and <area> tags
Extract links in <img> tags

All other settings are default.

Any pointers on what to change?

Thanks,
Sam

*Thought this would exclude plain text, but it didn't.

Comment #2

hass commented 19 July 2009 at 09:16

By default the module runs a URL filter to find as many links as possible... I can add an extra checkbox to disable this feature, but I think it would be better to fix these links. Otherwise you are able to disable checking of this individual link since version 2.2. Aside - why do you have links in your content that do not link to a real URL?

"php_network_getaddresses: getaddrinfo failed: Name or service not known." is a clear statement... DNS lookup for the domain name has failed.

A server error 500 often indicates a misconfigured server. Try again with the GET method. You can change this in "edit link settings".
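The HEAD versus GET advice can be sketched as a small retry routine. This is only an illustration in Python, not Link checker's actual code: `check_link`, `fetch`, and `fake_fetch` are hypothetical names, and the fetcher is passed in so the sketch runs without network access.

```python
from typing import Callable


def check_link(url: str, fetch: Callable[[str, str], int]) -> int:
    """Try HEAD first; retry once with GET if the server answers with an error."""
    status = fetch("HEAD", url)
    if status >= 400:
        # Some servers reject or mishandle HEAD, so a GET may still succeed.
        status = fetch("GET", url)
    return status


def fake_fetch(method: str, url: str) -> int:
    # Hypothetical stand-in for a server that fails on HEAD but serves GET.
    return 500 if method == "HEAD" else 200


print(check_link("http://example.com/", fake_fetch))
```

A DNS failure such as "Name or service not known" would not be helped by this retry, since the host name cannot be resolved for either method.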

Comment #3

hass commented 19 July 2009 at 09:19

Category:	support	» feature
Status:	Active	» Postponed (maintainer needs more info)

Just to note - if Check full qualified domain names only is checked only URLs like http://example.com/foo/bar are extracted. If you'd like to extract local paths like /node/1234, you need to uncheck this.
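As a rough illustration of what "fully qualified only" means, the following Python sketch keeps URLs that carry both a scheme and a host and drops local paths. The function name `is_fully_qualified` is an assumption made for this example and is not taken from the module.

```python
from urllib.parse import urlsplit


def is_fully_qualified(url: str) -> bool:
    """Return True only for URLs that have an http(s) scheme and a host."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


candidates = ["http://example.com/foo/bar", "/node/1234", "www.example.com"]
kept = [u for u in candidates if is_fully_qualified(u)]
print(kept)
```

A bare "www.example.com" fails this test on its own, but if an earlier step has already rewritten it to "http://www.example.com", it passes, so this option alone would not be expected to exclude such text.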

Comment #4

Michael-IDA commented 25 July 2009 at 22:20

Status:	Postponed (maintainer needs more info)	» Active

Hi hass,

> "if Check full qualified domain names only is checked only URLs like http://example.com/foo/bar are extracted."
I have "Check full qualified domain names only" checked and it still checks plain text "www.example.com" as a link.


> "Aside - why do you have links in your content that do not link to a real URL?"
Not up to me, that's how the users of that domain want to enter the data.


> "Otherwise you are able to disable checking of this individual link since version 2.2"
They have 300+ nodes, many with text links like "www.example.com", so marking each one individually isn't really an option.  Besides, they are one of the smallest sites I work with; the largest has 3K nodes.  Although, I'd hope the largest doesn't input text links.



On the off chance it's some other module causing interference, I've pasted all modules in use on that site below.

Thanks for looking into this,
Sam

PS: As a side note, no, I wasn't complaining that Link Checker was returning error codes for sites (that's supposed to happen), just that it was checking things it shouldn't.


#########
Drupal 6.13
Link checker 6.x-2.2

All modules enabled:
Aggregator     6.13
Blog     6.13
Blog API     6.13
Book     6.13
Color     6.13
Comment     6.13
Contact     6.13
Database logging     6.13
Help     6.13
Menu     6.13
Path     6.13
PHP filter     6.13
Ping     6.13
Profile     6.13
Search     6.13
Statistics     6.13
Taxonomy     6.13
Throttle     6.13
Tracker     6.13
Trigger     6.13
Update status     6.13
Upload     6.13

Devel     6.x-1.16

Advanced help     6.x-1.2
Automatic Nodetitles     6.x-1.1
Excerpt     6.x-1.0
Forward     6.x-1.9
Global Redirect     6.x-1.2
Job queue     6.x-3.0
Link checker     6.x-2.2
nofollowlist     6.x-1.0
Pathauto     6.x-1.1
Scheduler     6.x-1.3
Service links     6.x-1.0
Theme Settings API     6.x-1.4
Token     6.x-1.12
Token actions     6.x-1.12
TokenSTARTER     6.x-1.12

CAPTCHA     6.x-1.0-rc2

Views     6.x-2.6
Views UI     6.x-2.6

Comment #5

hass commented 25 July 2009 at 23:16

Could you please do not add CODE tags around all your postings... not that easy to read.

> "Aside - why do you have links in your content that do not link to a real URL?"
Not up to me, that's how the users of that domain want to enter the data.

Teach the authors that visitors of their site do *not* like to read DEAD links. They are looking for working weblinks. Everything else makes no sense on the internet.

Link checker was initially designed to find as many links as possible. If your users add broken / buggy links to the site, they need to disable link checking for all the broken links they do not want to fix. I'm not sure what the problem is... this is what link checker was built for!? If you do not like to see *broken* links - turn the link checker module off. Nevertheless, I would suggest telling the authors that links are references to other sites that need to exist - if not - the option is to turn off link checking *individually* for these buggy links. It doesn't matter how many nodes the site has, as you can have thousands of links in one node.

The URL filter (note: this one converts "www.example.com" into real links) runs *always* on all content. Today there is no way to turn this behaviour off. Maybe in a future version.
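To see why plain-text host names end up being checked, here is a simplified Python sketch of the two steps described above: a URL-filter-like pass wraps bare "www." names in anchors, and the extractor then collects every href. The regular expression and the names `url_filter`, `HrefCollector`, and `extract_links` are assumptions for illustration only; Drupal's real URL filter is written in PHP and is more thorough.

```python
import re
from html.parser import HTMLParser

# Simplified stand-in for a URL filter: matches bare "www." host names that
# are not already part of a longer word, path, or host.
WWW_PATTERN = re.compile(r"(?<![\w/.])(www\.[\w.-]+\.[a-z]{2,})", re.IGNORECASE)


def url_filter(text: str) -> str:
    """Wrap bare www. host names in anchor tags."""
    return WWW_PATTERN.sub(r'<a href="http://\1">\1</a>', text)


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html: str) -> list:
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


body = 'Visit www.twinfinswines.com or <a href="http://example.com/">this page</a>.'
raw = extract_links(body)                  # only links the author marked up
filtered = extract_links(url_filter(body)) # also picks up the plain-text host
print(raw)
print(filtered)
```

Without the filter pass only the marked-up link is found; with it, the plain-text host name becomes a link candidate as well, which matches the behaviour reported in this issue.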

Comment #6

hass commented 26 July 2009 at 11:25

Status:	Active	» Closed (duplicate)

#497096: Links generated by input filter

Comment #7

Michael-IDA commented 13 November 2009 at 22:02

"Could you please do not add CODE tags around all your postings... not that easy to read."

At least YOU COULD READ them. Without code tags, Drupal.org was removing content. Which I said previously, "The above was eaten a bit by a filter, here's the entire post in a code block so things aren't truncated or missing."

"Teach the authors"

You don't work in the real world do you? Nor get paid to do a job? Nor read up on Google's BS with nofollow?

http://gazebo.commonplaces.com/2009/06/a-few-words-on-relnofollow-or-ple...
"PageRank that would flow to that link simply ‘exaporates’ when you make that link nofollow."

I understand this is a "free" effort on your part, and your module is a great idea, but I'm sorry you think your opinion of how others should operate hinders its use.

Comment #8

hass commented 14 November 2009 at 02:17

Title:	How to stop Non-Llinks from being checked?	» How to stop Non-Links from being checked?

1. You complained about non-links that were being checked in the past. OK, that is no longer the case! Non-links are not extracted if the URL filter is not enabled for a format (since 2.3), and the URL filter can be globally disabled in the latest dev or the upcoming 2.4.

2. Now you talk about nofollow - what the heck has this to do with linkchecker? Linkchecker verifies whether the links on your site are broken or not. It doesn't care about any search engine stuff, and I currently have no plans to add exclusion of links with the rel nofollow attribute. It would be easy to add this in D7, but I believe it's pretty useless for what link checker is intended to do.

After you have upgraded to the latest version, press the button 'Analyze content for links' on the link checker settings page to clean up old "Non-Links" stuff.
