Here are the error messages:

* Importing 1 files now
* Fetching content from 'file:///tmp/1fixed/simple.htm' now.
* Copying between identical source and destination, file:///tmp/1fixed/simple.htm files/imported/simple.htm , importing file in-place.
* Processed page to extract content. Title:''
* Import of /simple.htm did not quite validate. New page is NOT being added.

* user notice: Path 'files/imported/simple.htm' was not found. This should have been a local copy of the file being imported, but the paths may be wrong somehow. Abject failure processing /simple.htm in /var/www/html/sites/default/modules/import_html/import_html.module on line 1350.
* user warning: File 'files/imported/simple.htm' not found, cannot load XML source. in /var/www/html/sites/default/modules/import_html/coders_php_library/xml-transform.inc on line 54.
* Failed to initialize or parse XMLdoc input
* This node has no body


PHP version is 5.1.6. I checked import_html out of CVS for Drupal 4.7 a day or so ago.

So even though the error message clearly shows two different file paths, it claims that they are identical. Looking at the line that generates the message (import_html.module, line 1299), it is possible that both sides of the comparison are failing and so still compare as equal. On the other hand, the initial file was chosen from the tree display (sometimes).

This happens if I do an import and give the directory (it does see the files in the directory) and then pick the file from the list, and also if I run the demo.

And for what it is worth, here is the file:




Title of Index

Simple Link

Link to Bowl1

Comments

dman

Title: Import_html confused about paths » Import_html confused about file:/// paths when using realpath()

The file:/// syntax was always confusing.
I think your deductions about

    if(realpath($source_path) == realpath($dest_path)){
    	drupal_set_message("Copying between identical source and destination, $source_path $dest_path , importing file in-place.");
    	return TRUE;
    };

evaluating as FALSE==FALSE are probably right.
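
To illustrate the FALSE==FALSE case: realpath() returns FALSE when it cannot resolve a path, so two unresolvable paths compare as equal. This is only a minimal sketch, not the module's code. paths_are_identical() is a hypothetical helper name, the missing paths are made up for the demonstration, and sys_get_temp_dir() needs PHP 5.2.1 or later.

```php
<?php
// Hypothetical helper (not part of import_html): only report "identical"
// when both paths could actually be resolved.
function paths_are_identical($source_path, $dest_path) {
  $source_real = realpath($source_path);
  $dest_real = realpath($dest_path);
  if ($source_real === FALSE || $dest_real === FALSE) {
    // At least one path could not be resolved, so nothing can be concluded.
    return FALSE;
  }
  return $source_real === $dest_real;
}

// Two different paths that do not exist (made-up names).
$missing_a = sys_get_temp_dir() . '/no_such_source_' . uniqid();
$missing_b = sys_get_temp_dir() . '/no_such_dest_' . uniqid();

var_dump(realpath($missing_a) == realpath($missing_b));  // bool(true)
var_dump(paths_are_identical($missing_a, $missing_b));   // bool(false)
```

With a guard like this, a failed realpath() would fall through to the copy step and its more specific error, rather than being reported as an in-place import.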

It seems that sometimes the file:///path syntax works (the directory was browsable) but later, when we want to work further with it (get the realpath), it stops working.
The realpath may not be strictly necessary all the time, but I'm sure I had to put it in there for a reason once. There may be alternatives.

Did you try with just /tmp/1fixed/simple.htm? That's the way I usually do it.

.dan.

ehowland

Leaving off the file:/// also fails.

* Importing 1 files now
* Fetching content from '/tmp/1fixed/simple.htm' now.
* Copying between identical source and destination, /tmp/1fixed/simple.htm files/imported/simple.htm , importing file in-place.
* Processed page to extract content. Title:''
* Import of /simple.htm did not quite validate. New page is NOT being added.

This suggests that the error is farther upstream. I have set the permissions on both the original file and the destination to world read/write, and that has not fixed it. I also copied the file to /var/www/html/files/imported/simple.html (to see whether it was the copying that was going wrong) and that was not successful either.

ehowland

Well, I thought I should drop realpath and then maybe I would get a more informative error.

I now get:

    * Importing 1 files now
    * Fetching content from '/tmp/1fixed/simple.htm' now.
    * Local file copy failed (/tmp/1fixed/simple.htm to files/imported/simple.htm)
    * Source is
    * Dest folder isArray ( [0] => 64768 [1] => 15631027 [2] => 16895 [3] => 3 [4] => 500 [5] => 500 [6] => -1 [7] => 4096 [8] => 1166120728 [9] => 1166120728 [10] => 1166120728 [11] => -1 [12] => -1 [dev] => 64768 [ino] => 15631027 [mode] => 16895 [nlink] => 3 [uid] => 500 [gid] => 500 [rdev] => -1 [size] => 4096 [atime] => 1166120728 [mtime] => 1166120728 [ctime] => 1166120728 [blksize] => -1 [blocks] => -1 )

    * warning: copy(/tmp/1fixed/simple.htm) [function.copy]: failed to open stream: Permission denied in /var/www/html/sites/default/modules/import_html/import_html.module on line 1306.
    * warning: stat() [function.stat]: stat failed for /tmp/1fixed/simple.htm in /var/www/html/sites/default/modules/import_html/import_html.module on line 1308.

This was interesting in two ways:

1) Source is ending up blank. I don't know what to think about that.
2) It says it has a permission problem. This was not a file permission problem, since the file was world readable.

So I moved the file into the web tree (/var/www/files) but not into the imported subfolder. The error message went away, and even when I reinserted the realpath calls it still did not give me an error. So the permissions problem must be that Apache is not allowed to read files outside its root. The error message may want to reflect this possibility. In general, restricting Apache is a good thing, but not here. A new identical copy of the original file is created inside /var/www/files/imported. I must have confused myself when I tried moving files into the web tree yesterday.

I still do not see the nodes (under localhost/imported) and there are no new entries in the node table in the database, but that must be a different problem.

dman

Are you sure PHP can access that location at all?
safe_mode or open_basedir can mysteriously prevent access to things you are SURE are available.

http://nz2.php.net/manual/en/features.safe-mode.php#ini.open-basedir

It's very obscure, and has puzzled me at least once before.
PHP version? Server/phpinfo?
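
A short script can gather the answers to those questions in one place. This is a rough sketch: access_report() is a hypothetical helper, not something from import_html, and the path is the one from the report above. On PHP versions where safe_mode no longer exists, ini_get('safe_mode') simply returns FALSE.

```php
<?php
// Hypothetical helper: collect the settings and checks worth posting.
function access_report($path) {
  return array(
    'php_version'  => phpversion(),
    'open_basedir' => ini_get('open_basedir'), // empty string means no restriction
    'safe_mode'    => ini_get('safe_mode'),
    'file_exists'  => file_exists($path),
    'is_readable'  => is_readable($path),
  );
}

print_r(access_report('/tmp/1fixed/simple.htm'));
```

Run it through the web server rather than the command line, since the two can load different php.ini settings and run as different users.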

dman

Looks like open_basedir then.
Check your php.ini.

"The files must be available to the server" :-) Available includes being allowed to read them ;-)
But yeah, I can add a note to that effect.
Strange, however, that browsing and listing WAS allowed?

It looks like the debug information is telling you lots, so that's a good thing.

.dan.

dman

Title: Import_html confused about file:/// paths when using realpath() » Import_html confused about file:/// paths when open_basedir in effect
Component: Code » Documentation
Status: Active » Fixed

I've extended the docs to explain what's happening here.
The settings page will flag a warning if the server has the restrictions, and refers to the help doc which describes your work-around.

ehowland

It seems this was not the problem after all. I checked my php.ini and it turns out that open_basedir was commented out. I figured there might be something like this in the Apache config but then had to put out another fire and got back to this project today.

I did not find anything in the Apache files.

I did find your debugging system (thanks) and turned debugging up to 10 :) I cannot work out what else has changed, but now I do not get the tree I was getting before. I can, however, tell you partly where it is going wrong. It seems that I have two problems:

1. It seems that it finds all the files in the directory, and then import_html_sort_list_into_tree throws them all out at line 936: if (preg_match('|'.$regexp.'|', $rel_path)){ debug("excluding $rel_path with exclusion :{$regexp}:", 4); continue 2; } If I comment out that line, the tree comes back.

2. The html gets changed to a node and then nothing else ever happens.
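
On point 1, one way such a check can throw out every file is an empty exclusion pattern, because an empty regular expression matches any string. That is only a guess, since the configured exclusion patterns are not shown here. The sketch below uses a hypothetical filter_excluded() function and made-up file names and patterns; it is not the module's code.

```php
<?php
// Hypothetical stand-in for the exclusion step; not the module's code.
function filter_excluded(array $rel_paths, array $exclusions) {
  $kept = array();
  foreach ($rel_paths as $rel_path) {
    foreach ($exclusions as $regexp) {
      if ($regexp === '') {
        continue; // an empty pattern would match every path, so ignore it
      }
      if (preg_match('|' . $regexp . '|', $rel_path)) {
        continue 2; // excluded: move on to the next file
      }
    }
    $kept[] = $rel_path;
  }
  return $kept;
}

var_dump(preg_match('||', 'simple.htm')); // int(1): the empty pattern matches
print_r(filter_excluded(array('simple.htm', '_notes/x.htm'), array('', '^_')));
```

Here only simple.htm survives: the empty pattern is skipped and '^_' removes the _notes entry. Without the empty-string guard, both files would be dropped.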

I get this text out of the XSL transform:

TRANSLATED from messy source into a pure xhtml page to import

<?xml version="1.0" encoding="UTF-8"?>
<nodes xmlns:xhtml="http://www.w3.org/1999/xhtml"><node type="story"><title>BOWL1</title>
<teaser/>

<body>
				
<!--Imported Full Body :(-->

<a href="bowl1.htm" >Bowl1 </a></body>
<user uid="1">dman</user>
<vocabularies><vocabulary vid="1"><name>Section</name><terms><term tid="3"><name>Portfolio</name></term></terms></vocabulary></vocabularies></node>

</nodes>

And it finds one html element:

Found 1 html elements in source doc
: in _import_html_process_html_page(), line 1424 import_html.module : in import_html_import_files(), line 1164 import_html.module : in menu_execute_active_handler(), line 418 menu.inc 0.00s elapsed. (9 total)

But the next block seems not to be executed since I do not see the debug inside it.

foreach($html_elements as $html_element) {

But if I uncomment the debug at the end of _import_html_process_html_page(), there is nothing left in the node at that point:

 : in _import_html_process_html_page(), line 1468 import_html.module : in import_html_import_files(), line 1164 import_html.module : in menu_execute_active_handler(), line 418 menu.inc 0.00s elapsed. (9 total)

I don't understand why I don't see anything from within the foreach($html_elements as $html_element) list. I don't think $node = import_html_xhtml_to_node($html_element); is ever called.

ehowland

Probably not a surprise: if I print_r($html_elements) I get:

DOMNodeList Object
(
)

So in this piece of code:

    $importxml = xmldoc_plus_xsldoc($xmldoc, $xsldoc, $parameters);
    debug("Transform Successful", 2);
    debug("<h2>TRANSLATED from messy source into a pure xhtml page to import</h2><textarea rows='20' cols='80'>" . $importxml . "</textarea>", 3);
  }
  else {
    trigger_error("Failed to initialize XSLdoc", E_USER_WARNING);
  }

  if ($importxml) {
    $xmldoc = parse_in_xml_string($importxml, false);
    //
    // Allow one source document to produce multiple nodes
    // If the process has resulted in xt:document blocks, each block
    // is a new item.
    // Either there is a html element in the input ... or many of them.

    $html_elements = xml_getElementsByTagName($xmldoc,'html');

    debug("Found ".count($html_elements)." html elements in source doc",2);

It builds $importxml (see the debug dump of that variable in my last post), which is XML whose outermost container is nodes/node. Then along comes xml_getElementsByTagName($xmldoc,'html'), which looks for an html element that the document no longer has (the outermost container is node, not html).
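
The mismatch can be reproduced with PHP's own DOM classes. This sketch assumes xml_getElementsByTagName() behaves like DOMDocument::getElementsByTagName(), which the DOMNodeList in the print_r output suggests but which is not confirmed here. The XML is a trimmed copy of the dump above.

```php
<?php
// Trimmed copy of the transform output shown earlier in the thread.
$importxml = '<?xml version="1.0" encoding="UTF-8"?>'
  . '<nodes xmlns:xhtml="http://www.w3.org/1999/xhtml">'
  . '<node type="story"><title>BOWL1</title><teaser/>'
  . '<body><a href="bowl1.htm">Bowl1 </a></body></node></nodes>';

$xmldoc = new DOMDocument();
$xmldoc->loadXML($importxml);

$html_elements = $xmldoc->getElementsByTagName('html');
$node_elements = $xmldoc->getElementsByTagName('node');

// Use ->length rather than count(): before PHP 7.2, count() on a
// DOMNodeList returned 1 no matter how many items the list held.
echo "html: ", $html_elements->length, "\n"; // html: 0
echo "node: ", $node_elements->length, "\n"; // node: 1
```

If the wrapper does return the DOMNodeList directly, that count() behaviour would also explain the "Found 1 html elements" debug line appearing next to an empty list.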

Changing html to node helps until we get to this block of code:

  //
  // BODY is the thing with id=content 
  // 

   debug("<h2>XML DOM being scanned for XPATH data extraction</h2><textarea rows='20' cols='80'>".print_r(xml_toString($datadoc),1)."</textarea>",3);
  $content_element = xml_getElementById($datadoc, 'content');  

  if(!$content_element){drupal_set_message("Failed to find a body, anything with id='content' in this page");};
  $node->body = xml_toString($content_element,TRUE);

  // It's possible that our input was content-encoded (if it came from RSS or the old import-xml node)
  // If so, the entities should be unwrapped.
  // Other (HTML) imports should not require this
  $node->body = ($node->content_encoded) ? html_entity_decode($node->body):$node->body;

  debug("<h2>Got Body</h2><textarea rows='20' cols='80'>".$node->body."</textarea>",2);

So this is like before. There is something in $datadoc, but nothing in it has id=content, so by the time the Got Body debug statement comes along the node body is empty.

So at this point it dawns on me that I am not using the right XSL (I have html2drupal.xslt). Not only do I get nodes rather than html, my <div id="content"> is gone. I was OK with the XSL dumping the whole body into the node, but it seems to really, really want that content id to work.
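
For comparison, here is a rough sketch of a lookup that falls back to the whole body when nothing carries id="content". It is not how import_html does it: find_content_element() is a hypothetical helper built on PHP's DOMXPath, and the sample markup is invented.

```php
<?php
// Hypothetical helper, not part of import_html.
function find_content_element(DOMDocument $doc) {
  $xpath = new DOMXPath($doc);
  // Any element, anywhere, whose id attribute is exactly "content".
  $hits = $xpath->query("//*[@id='content']");
  if ($hits->length > 0) {
    return $hits->item(0);
  }
  // Fallback: use the first body element, if there is one.
  $bodies = $doc->getElementsByTagName('body');
  return $bodies->length > 0 ? $bodies->item(0) : NULL;
}

$doc = new DOMDocument();
$doc->loadXML('<html><body><div id="content"><p>Hi</p></div></body></html>');
$el = find_content_element($doc);
echo $doc->saveXML($el), "\n"; // <div id="content"><p>Hi</p></div>
```

An XPath attribute test is used because DOMDocument::getElementById() only finds attributes the parser knows to be of type ID, which usually requires a DTD.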

I should have downloaded the example site described in walkthough.htm.

ehowland

P.S. I still do not understand why I have to put the files in my web tree. There is no open_basedir in my .htaccess, /etc/php.ini, or anywhere else that I can see.

dman

Yeah, the html2drupal syntax hasn't been used since 4.6. It was for compatibility with the older import/export module.

I now use pure HTML with microformats (html2simplehtml.xsl).
That would explain a lot of confusion: it's looking for totally the wrong input.
I thought I removed html2drupal...

I dunno why your system is refusing access to files outside webroot if you are sure that safe mode etc. is not on. Did you try writing a bit of sample PHP to get at files yourself? Just to see...

ehowland

In order to clarify things I renamed my root html directory from /var/www/html to /var/www/html_reg. I created a new /var/www/html directory and put this file in as index.php:

<html>
<body>
<h3>before include /tmp/1fixed/fragment.htm (fails)</h3>
<?php include '/tmp/1fixed/fragment.htm' ?>

<h3>end of first include, now try /var/www/html_reg/html2import/fragment.htm (OK)!!!!</h3>
<?php include '/var/www/html_reg/html2import/fragment.htm' ?>

<h3>end of second include, now try /var/www/fragment.htm (OK)</h3>
<?php include '/var/www/fragment.htm' ?>

<h3>end of third include, now try /var/fragment.htm (fails)</h3>
<?php include '/var/fragment.htm' ?>

<h3>end all includes </h3>
</body>
</html>

As the headings in the HTML indicate, anything under /var/www works and everything (I tried) outside it does not. That is not so surprising, although I was surprised that a directory under www but not named in the Apache config files (my renamed web root, /var/www/html_reg/html2import/fragment.htm) did work.

Having said that, I still think it must be a PHP config issue. There is now no .htaccess file, since I started over without one. The error comes from PHP like this:
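
The same experiment can be run as a loop that records PHP's own error text for each path, which makes the cut-off easier to see. This is a sketch only: probe_paths() is a hypothetical helper, the paths are the four from the test page, and error_get_last() needs PHP 5.2 or later.

```php
<?php
// Hypothetical helper: try to open each path and keep PHP's own error text.
function probe_paths(array $paths) {
  $results = array();
  foreach ($paths as $path) {
    $handle = @fopen($path, 'r'); // @ hides the warning; it is fetched below
    if ($handle === FALSE) {
      $error = error_get_last();
      $results[$path] = 'FAILED: ' . ($error ? $error['message'] : 'unknown error');
    }
    else {
      fclose($handle);
      $results[$path] = 'OK';
    }
  }
  return $results;
}

print_r(probe_paths(array(
  '/tmp/1fixed/fragment.htm',
  '/var/www/html_reg/html2import/fragment.htm',
  '/var/www/fragment.htm',
  '/var/fragment.htm',
)));
```

The message distinguishes "Permission denied" from "No such file or directory", which a plain is_readable() check would not.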

Warning: include(/tmp/1fixed/fragment.htm) [function.include]: failed to open stream: Permission denied in /var/www/html/index.php on line 4

Warning: include() [function.include]: Failed opening '/tmp/1fixed/fragment.htm' for inclusion (include_path='.:/php/includes:/tmp/1fixed') in /var/www/html/index.php on line 4

It could be that Apache is blocking PHP from accessing the files, but I don't see anything in httpd.conf that addresses access at the /var/www level (although there is a lot, as you would expect, for /var/www/html). It is probably something stupid. You can see that I made the include_path explicit and added /tmp/1fixed to it in /etc/php.ini, with no change in behavior. I do not see anything relevant inside /var/httpd/config.d/php.conf either.

I will be off this issue until after the new year dealing with some deadlines.

dman

Well, you're on the right track, showing what the story is.
Keep tweaking and you'll find it, I guess.
It'll likely be php.ini, not httpd.conf, where the rule is, if you're sure it's not old-fashioned permissions.
There's no mention of the /var/www/ path anywhere you've looked, although you've seen that it seems to be the cut-off point. Hm.
Keep scanning the php.net docs on safe_mode and related bits.

Anonymous

Status: Fixed » Closed (fixed)