Hi,
I've been importing pages using Import HTML and I noticed that non-HTML files (e.g. PDFs) are imported into the proper directory (e.g. under /files) but the links to them are not changed in the HTML file. I am currently trying to import just one subdirectory (/systems/webdevelopment) within our site to ensure everything is working before I attempt more areas.
Here are some settings I have in Import HTML:
Default Input Filter: Unfiltered HTML
Extra File Storage Path: files/systems/
Import Site Prefix: systems/
Site Root on the Server: /usr/local/lib/html/systems/
Subsection to list: webdevelopment/
I have a page under /systems/webdevelopment/ that links to a file under /systems/webdevelopment/dwsettings/filename.ste on our old site. When using Import HTML, this .ste file is properly moved to /files/systems/webdevelopment/dwsettings/. However, the page it links from keeps the old URL as a link to it instead of the /files/ URL. Therefore I get a File Not Found error when I click on the link.
I tried with a normal .pdf file and had the same problem. The settings above seem to put all the html files in the proper place and move the non-html files to the proper place. It just doesn't link to them properly in <a href> tags. Images within <img> tags work properly.
Any ideas?
Rob
Comments
Comment #1
dman commented
The normal option is that "Links found within the sources will be rewritten to try and allow for the new paths".
The system does its best, but there may be some syntax or patterns of links that it misses.
You'll have to paste the exact source of the link that should be getting redirected (and isn't) so we can see what the mistake is.
The logic is contained in the transformation file rewrite_href_and_src.xsl if it needs fiddling with.
It's scary XSL, but reasonably documented.
Note that fully-qualified links
a href="http://my.site/resources/document.pdf"
are always seen as external links, and are not rewritten. Only local links of the form
a href="/resources/document.pdf"
or
a href="../resources/document.pdf"
can reliably be found. (I'll add some documentation about that to the settings screen)
If that's your problem, some extra level of preprocessing will have to happen to remove all the hard-coded references to the source server. That isn't currently done. HTML files with hard pointers to themselves (their server name & absolute link) are a pain.
We should be able to fix them up however... but it's a TODO. Easiest done with a global search & replace currently.
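A minimal sketch of that search and replace as a Python function, for anyone who would rather script it. The host name `http://your.server` is the same placeholder used in this comment, and `strip_host` is a hypothetical helper name, not part of Import HTML. It only touches `href` and `src` attributes, so a server name that appears in body text is left alone.

```python
import re

# Placeholder host, as in the comment above; replace with the real source server.
SOURCE_HOST = "http://your.server"


def strip_host(html, host=SOURCE_HOST):
    """Turn absolute links to `host` into root-relative links.

    Only href and src attribute values that start with the host are changed.
    """
    pattern = re.compile(
        r'(\b(?:href|src)\s*=\s*["\'])' + re.escape(host.rstrip("/")) + r"/",
        re.IGNORECASE,
    )
    # Keep the attribute name and opening quote, drop the host, keep the slash.
    return pattern.sub(r"\1/", html)


if __name__ == "__main__":
    sample = '<a href="http://your.server/resources/document.pdf">doc</a>'
    print(strip_host(sample))
```

The one-line substitution that follows is the blunter equivalent: it replaces the prefix everywhere in the file, not just inside link attributes.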
s|http://your.server/|/|g
Comment #2
rgraves commented
Thanks for the quick reply.
Is the "Links found within the sources will be rewritten to try and allow for the new paths" a check box that can be checked off? I don't see an option for that so I wasn't sure if it is something I could accidentally change or that's just how it works.
I was having trouble with anchors when importing files into Drupal using Import HTML (see http://drupal.org/node/158735). I implemented the solution provided on that page.
I thought this might be the problem, but when I removed that code and tried importing it again, the same issue with file paths occurred.
Attached is a sample file that I'm trying to import. The problem is with the .ste links.
Thanks
Rob
Comment #3
rgraves commented
Here is another test file I made that has a link to a normal pdf file. The same issue occurred.
Comment #4
dman commented
OK, I ran an import with debug turned up, and I see what's happening. These links are being treated like normal document links, not resource links. I recall expecting this would be a problem; it's obliquely noted in the rewrite_href_and_src.xsl file as a problem that needs extra work.
... Ah no. The real problem is that it starts with a /. Damn. I thought that was handled. I said that was handled, but it looks like it isn't. Root-relative links ARE being ignored also. I wonder why. The code was never pushed in there. Probably something that looked like it would have needed another parameter in some cases. I needed path_to_top or something...
I'm looking at it, and I think I've got it, but I recently made some other changes to the importer also - to handle CCK fields better - which need testing.
I'll check this in, but something may be unstable. Backup before trying it.
There are two more options in the settings now.
Comment #5
dman commented
I probably should have submitted those changes as a patch for now, rather than just checking it in. Meh. Too many threads to keep track of that way. It's the dev branch anyway.
And, I was in the middle of other changes to the same working copy. So much for version control.
http://drupal.org/cvs?commit=83294
Comment #6
rgraves commented
Since we're in the early stages of importing our site, I'll give your changes a try and report any errors if I find them.
Thanks for your speedy response to this issue!
Rob
Comment #7
rgraves commented
I did one test and it almost worked.
The link to /systems/webdevelopment/dwsettings/livesite_8.ste was rewritten to ../../files/systems/systems/webdevelopment/dwsettings/livesite_8.ste.
The only problem is that there are 2 systems directories in the path and there should only be one (i.e. ../../files/systems/webdevelopment/dwsettings/livesite_8.ste)
Since I'm only importing a subdirectory right now for testing, here are some settings I'm using that may help in fixing this problem:
Site Root on the Server: /my/web/root/dir/systems/
Subsection to list: webdevelopment/
Extra File Storage Path: files/systems/
Import Site Prefix: systems/
I tried removing 'systems/' from the last 2 configuration options but the module says it will load the files into the wrong places, so these appear to be the settings that will work for my situation.
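To make the doubling concrete, here is a minimal Python sketch of one way it could arise. This is an assumption for illustration only, not the module's actual XSL logic: `rewrite_resource_link` and its `strip_prefix` flag are hypothetical names. The idea is that if the Import Site Prefix is not removed from the root-relative link before the Extra File Storage Path is prepended, 'systems/' ends up in the result twice. It produces root-relative results rather than the ../../ form the importer writes.

```python
import posixpath


def rewrite_resource_link(link, site_prefix, file_storage_path, strip_prefix=True):
    """Map a root-relative link on the old site to a path under the file storage area.

    Hypothetical helper: with strip_prefix=False the site prefix is kept,
    which reproduces the doubled directory described above.
    """
    path = link.lstrip("/")
    if strip_prefix and path.startswith(site_prefix):
        # Drop the prefix that the storage path already accounts for.
        path = path[len(site_prefix):]
    return "/" + posixpath.join(file_storage_path, path)


if __name__ == "__main__":
    old_link = "/systems/webdevelopment/dwsettings/livesite_8.ste"
    # Settings from this comment.
    print(rewrite_resource_link(old_link, "systems/", "files/systems/", strip_prefix=False))
    print(rewrite_resource_link(old_link, "systems/", "files/systems/"))
```

With the settings above, the first call gives the path with two 'systems' directories and the second gives the single-'systems' path I expected.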
Rob
Comment #8
rgraves commented
Has anyone had a chance to look into this?
Comment #9
dman commented
Cleaning up the issue queue by closing stuff that is from the Drupal-5 branch and over a year old.
(Some better checks on folders and slashes have been built in to the D6 version. Probably eliminated all test errors I've been able to replicate so far)