It appears that you can link to a specific page of a PDF document by adding a #page=nn URL fragment: if the browser is using the Acrobat Reader plugin, the PDF opens at the specified page (see http://helpx.adobe.com/acrobat/kb/link-html-pdf-page-acrobat.html, for example).
However, this breaks linkchecker: although the link works in the browser, linkchecker returns a 404 "URL fragment identifier not found in content" error.
wget --spider http://unfccc.int/resource/docs/2011/cop17/eng/09a02.pdf#page=16
says "200 OK" and "Remote file exists." but linkchecker says "404" and "URL fragment identifier not found in content".
This is technically correct if we assume the URL points to an HTML document, but in this case it's a PDF, and the fragment page=16 will never appear in the content. So the additional checking by linkchecker, and its overriding of the 200 response with a 404 (added by #1875602: Check URL fragment identifiers in content), isn't wanted here.
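As a small illustration of why wget reports success: the fragment is handled by the client and is not part of the request sent to the server, so the server only ever sees the PDF's URL. A minimal Python sketch using the standard library (the variable names are just for illustration):

```python
from urllib.parse import urldefrag

url = "http://unfccc.int/resource/docs/2011/cop17/eng/09a02.pdf#page=16"

# urldefrag() splits off the part after '#'; only request_url would be
# requested from the server, the fragment stays with the client.
request_url, fragment = urldefrag(url)

print(request_url)  # http://unfccc.int/resource/docs/2011/cop17/eng/09a02.pdf
print(fragment)     # page=16
```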
Perhaps the checking added by #1875602: Check URL fragment identifiers in content should only be used if the returned document is of type text/html?
Comment | File | Size | Author |
---|---|---|---|
#4 | Issue-2088461-by-fonant-hass-PDF-link-with-page-resu.patch | 1.35 KB | hass |
#1 | linkchecker-2088461.patch | 1.45 KB | fonant |
Comments
Comment #1
fonant commented: A quick fix is to add a content-type check on the $response and check that we have a suitable text response (as requested in the Accept request header) before we check for the fragment being present.
Simple patch attached.
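For readers who want the idea without opening the patch, here is a rough sketch of the check in Python. This is not the attached patch (linkchecker is a PHP module); the function names, the list of accepted types and the crude id/name regex are all assumptions made for illustration.

```python
import re

# Assumed set of "HTML-like" types; the real patch may accept a different list.
TEXT_TYPES = ("text/html", "application/xhtml+xml")

def media_type(content_type_header):
    """Strip parameters: 'text/html; charset=utf-8' -> 'text/html'."""
    return content_type_header.split(";", 1)[0].strip().lower()

def fragment_missing(fragment, content_type_header, body):
    """True only if the response is HTML-like and has no id/name matching the fragment."""
    if not fragment:
        return False
    if media_type(content_type_header) not in TEXT_TYPES:
        # e.g. application/pdf: '#page=16' is interpreted by the viewer,
        # so there is nothing to look for in the content.
        return False
    # Crude check for id="..." or name="..." (attribute name case-insensitive).
    pattern = r"""(?i:id|name)\s*=\s*["']%s["']""" % re.escape(fragment)
    return re.search(pattern, body) is None

print(fragment_missing("page=16", "application/pdf", "%PDF-1.4"))
print(fragment_missing("missing", "text/html", '<p id="other"></p>'))
```

With a check like this, the 200 response would only be overridden when `fragment_missing()` returns true, which can never happen for a PDF response.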
Comment #2
hass commented: Please provide a git patch.
Comment #3
hass commented: I'm wondering why this Accept has no effect:
Comment #4
hass commented: Attaching Git patch
Comment #5
hass commented: http://drupalcode.org/project/linkchecker.git/commit/82c810a
Comment #6
hass commented
Comment #7
hass commented: http://drupalcode.org/project/linkchecker.git/commit/0eaf68b
Comment #8
fonant commented: That Accept does have an effect, but it says:
"Prefer text/html or application/xhtml+xml resources (with a preference of 100%); if not, then application/xml resources (with a preference of 90%); but if they aren't available I'm happy with any resource type that matches the request URL (with a preference of 80%)."
It's a little easier to read if you insert spaces, as commas separate the options. The semicolons bind more tightly than the commas: each one attaches a relative weighting (the q value) to the single option it follows:
'text/html, application/xhtml+xml, application/xml;q=0.9, */*;q=0.8'
This is good, because we _do_ want to be able to test for the existence of non-text resources, such as images and PDF files.
The problem is merely in assuming that the response is HTML if there is a "#" in the URL, which is mostly the case, but not always: Acrobat allows the use of # to indicate a page number in a PDF.
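To make the reading of that header concrete, here is a minimal Python sketch that splits it on commas first and semicolons second. The helper name `parse_accept` is made up for this example, and it only handles the q parameter, not the full grammar of the Accept header.

```python
def parse_accept(header):
    """Split an Accept header into (media_range, q) pairs, highest q first."""
    prefs = []
    for item in header.split(","):          # commas separate the options
        parts = [p.strip() for p in item.split(";")]
        q = 1.0                             # default weight when no q is given
        for param in parts[1:]:             # semicolons attach parameters
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                q = float(value)
        prefs.append((parts[0], q))
    # sorted() is stable, so options with equal weight keep their header order
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)

print(parse_accept("text/html, application/xhtml+xml, application/xml;q=0.9, */*;q=0.8"))
# [('text/html', 1.0), ('application/xhtml+xml', 1.0), ('application/xml', 0.9), ('*/*', 0.8)]
```

A PDF response matches the final `*/*` entry, so the server is entitled to return it even though HTML is preferred.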
Comment #9
hass commented: Yeah, I know, but I expected that the server might return an error if I request foo and it can only deliver bar... however, I never tested it :-). This hopefully works now.