Active
Project: Pingback
Version: 6.x-2.x-dev
Component: Code
Priority: Normal
Category: Feature request
Assigned: Unassigned
Reporter:
Created: 11 Mar 2009 at 21:38 UTC
Updated: 25 Mar 2009 at 11:47 UTC
As requested in issue 313595, I am opening a new issue/feature request for this (that one is already too full of comments on other, unrelated issues).
The pingback-to-myself behavior works... BUT ONLY when the URL inside the anchor tag is a full URL (including the domain). Relative URLs won't trigger a pingback. In my opinion they should, as relative URLs are very common and practical. In fact, I don't see any point in not pinging myself for relative URLs.
Comments
Comment #1
jonathan1055 commented
Yes, I noticed this too. It is the regex code in _pingback_extract_urls() which needs to be looked at. It only searches for strings starting with http or https, and stops at the next space (it may do other things too). The comment says that the expression is stolen from trackback.module, so it was not written specifically for what pingback needs. The prepared node body text contains href anchor tags, so it may be better to search for href= and use the value in quotes which follows it. This would have the additional benefit that the full URL would be found. I have been working on removing URLs that only point to images from the found list, and have discovered that this function returns partial URLs when the image file name contains a space. Using the full text up to the closing quote would solve both of these problems.
My regex writing is not very sophisticated, so if anyone else would like to have a go, please do.
Jonathan
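For illustration only, here is a minimal sketch of the href= idea described above, written in Python rather than as a patch to the module. The names `HREF_RE` and `extract_href_urls` and the `base_url` parameter are hypothetical and are not part of pingback.module. It takes everything up to the closing quote (so file names with spaces survive) and resolves relative URLs against the page's own address, which is an assumption about how self-pingbacks for relative links could be handled.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern: captures the value of href="..." or href='...'
# up to the matching closing quote, so spaces inside the value are kept.
HREF_RE = re.compile(r'href\s*=\s*(["\'])(.*?)\1', re.IGNORECASE | re.DOTALL)


def extract_href_urls(html, base_url):
    """Return absolute http(s) URLs found in href attributes, in order."""
    urls = []
    for match in HREF_RE.finditer(html):
        raw = match.group(2).strip()
        if not raw or raw.startswith('#'):
            continue  # skip empty values and in-page anchors
        # Assumption: relative links are resolved against the node's own URL.
        absolute = urljoin(base_url, raw)
        if not absolute.lower().startswith(('http://', 'https://')):
            continue  # skip mailto:, javascript:, and similar schemes
        if absolute not in urls:
            urls.append(absolute)
    return urls
```

Called with a body containing `<a href="/node/5">` and a base of `http://example.org/blog/post`, this sketch would yield `http://example.org/node/5`, which is the kind of URL the relative-link case needs.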
Comment #2
andreashaugstrup commented
A problem is that we can't assume HTML as the input format. Pingback module must also work if plain text or, e.g., markdown is used. So two regexps are needed: one for HTML (searching only between quotes in HTML links) and one for everything else (looking for URLs but stopping once a space has been reached).
Moving into 2.x-dev also.
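A rough sketch of the two-regexp approach, in Python and purely as an illustration: `HREF_RE`, `PLAIN_URL_RE` and `extract_urls` are hypothetical names, and the rule "use href values if any are present, otherwise fall back to plain URLs" is an assumption, not something the module does today. The trailing-punctuation stripping is also an assumption, added so that markdown links and sentence-ending dots do not end up inside the URL.

```python
import re

# Hypothetical pattern for HTML input: the quoted value after href=.
HREF_RE = re.compile(r'href\s*=\s*(["\'])(.*?)\1', re.IGNORECASE | re.DOTALL)

# Hypothetical pattern for everything else: http(s) URLs up to the next
# whitespace, quote, or angle bracket.
PLAIN_URL_RE = re.compile(r'https?://[^\s<>"\']+', re.IGNORECASE)


def extract_urls(text):
    """Return href values if the text has any, else plain http(s) URLs."""
    href_values = [m.group(2) for m in HREF_RE.finditer(text)]
    if href_values:
        return href_values
    # Assumption: trailing punctuation is not part of the URL.
    return [url.rstrip('.,;:!?)') for url in PLAIN_URL_RE.findall(text)]
```

The fallback only runs when no href attribute is found, so a URL that appears both as link text and as the link target is not reported twice.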
Comment #3
siliconmind commented
I agree. Two regexes (or one extended regex) are needed to catch plain URLs AND href="" contents.
You can borrow code from the URL filter to catch plain URLs, because not all URLs will start with "http:" or "https:". The URL filter will also accept URLs that start with "www".
Still, a regex to get href="" contents is needed... I'm not a regex guru, but maybe something like this?