As requested in issue 313595, I am opening a new issue/feature request about this (that one is already too full of comments on other, unrelated issues).

The pingback-to-myself behavior works, BUT ONLY when the URL inside the anchor tag is a full URL (including the domain). Relative URLs won't trigger a pingback. In my opinion they should, as relative URLs are very common and practical. In fact, I don't see any point in not pinging myself for relative URLs.

Comments

jonathan1055

Yes, I noticed this too. It is the regex code in _pingback_extract_urls() which needs to be looked at. It only searches for strings starting with http or https, and stops at the next space. (It may do other things too.) The comment says that the expression is stolen from trackback.module, so it was not written specifically for what pingback needs. The prepared node body text contains href anchor tags, so it may be better to search for href= and use the quoted value which follows it. This would have the additional benefit that the full URL would be found. I have been working on removing URLs which only point to images from the found list, and have discovered that this function returns partial URLs when the image file name contains a space. Using the full text up to the closing quote would solve both of these problems.

My regex writing is not very sophisticated, so if anyone else would like to have a go, please do.

Jonathan

andreashaugstrup

Version: 6.x-1.x-dev » 6.x-2.x-dev

A problem is that we can't assume HTML as the input format. The Pingback module must also work if plain text or, e.g., markdown is used. So two regexps are needed: one for HTML (searching only between quotes in HTML links) and one for everything else (looking for URLs but stopping once a space has been reached).
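A rough sketch of that two-regexp idea, in case it helps the discussion. The function name example_extract_urls() is made up for illustration and is not the module's _pingback_extract_urls(); the plain-text pattern is an assumption too, and it will cut off URLs that contain parentheses.

```php
<?php
/**
 * Hypothetical helper (not the module's actual code).
 * Pass 1 reads quoted href values; pass 2 looks for bare http(s) URLs.
 */
function example_extract_urls($text) {
  $href_regex = '/href\s*=\s*(["\'])(.*?)\1/is';
  $urls = array();

  // Pass 1: HTML links. Take everything between the quotes, so relative
  // URLs and URLs containing spaces are captured whole.
  if (preg_match_all($href_regex, $text, $matches)) {
    // Drop empty href="" values.
    $urls = array_filter($matches[2], 'strlen');
  }

  // Strip the href attributes so pass 2 does not pick them up again.
  $plain = preg_replace($href_regex, '', $text);

  // Pass 2: plain text or markdown. Stop at whitespace, quotes, angle
  // brackets or a closing parenthesis, then trim trailing punctuation.
  if (preg_match_all('~\bhttps?://[^\s<>"\')]+~i', $plain, $matches)) {
    foreach ($matches[0] as $url) {
      $urls[] = rtrim($url, '.,;:!?');
    }
  }

  // Remove duplicates and reindex.
  return array_values(array_unique($urls));
}
```

With this, a body containing both an anchor tag with a relative href and a bare URL in the text would yield both targets, and the relative one would still need to be made absolute before pinging.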

Moving into 2.x-dev also.

siliconmind

I agree. Two regexes (or one extended regex) are needed to catch plain URLs AND href="" contents.
You can borrow code from the URL filter to catch plain URLs, because not all URLs will start with "http:" or "https:". The URL filter will also accept URLs that start with "www".
Still, a regex to get the href="" contents is needed... I'm not a regex guru, but maybe something like this?

<?php
// Match href = "..." or href = '...'; whitespace around = is optional.
// Group 1 is the opening quote, group 2 is the href contents.
$regex = '/href\s*=\s*(["\'])(.*?)\1/is';
?>
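To tie this back to the original request, here is a minimal sketch of how href values found with a pattern along these lines might be turned into absolute URLs, so that relative links can be pinged too. The helper name example_absolute_url() and the base URL http://example.com are assumptions for illustration only. The resolution is deliberately simplified: it does not handle ../ segments or implement full RFC 3986 rules.

```php
<?php
/**
 * Hypothetical helper: turn an href value into an absolute URL.
 * $base is assumed to be the site root, e.g. 'http://example.com'.
 */
function example_absolute_url($href, $base) {
  $href = trim($href);
  // Already absolute: starts with a scheme such as http: or https:.
  if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $href)) {
    return $href;
  }
  // Protocol-relative (//host/path): reuse the scheme of the base URL.
  if (strpos($href, '//') === 0) {
    return parse_url($base, PHP_URL_SCHEME) . ':' . $href;
  }
  // Everything else is treated as relative to the site root.
  return rtrim($base, '/') . '/' . ltrim($href, '/');
}

// Collect href values from a sample body and make each one absolute.
$body = '<a href="/node/12">earlier post</a>';
preg_match_all('/href\s*=\s*(["\'])(.*?)\1/is', $body, $matches);
$targets = array();
foreach ($matches[2] as $href) {
  $targets[] = example_absolute_url($href, 'http://example.com');
}
```

Inside Drupal itself, the core url() function with the 'absolute' option may be a better fit than a hand-rolled helper, but that has not been tried here.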