Which date format is required?

asb - November 10, 2008 - 20:50
Project: Import HTML
Version: 5.x-1.x-dev
Component: Documentation
Category: support request
Priority: normal
Assigned: Unassigned
Status: active
Description

Hi,

I got Import HTML somehow working (after lots of fixing in the HTML source ;). Now I'd like to start importing, but can't get the module to recognize dates:

<p id="date">2008-11-10 20:57:28 +0100</p>

This is supposed to go into the node creation date; Import HTML seems to identify this paragraph as a date, but complains that it has to be formatted correctly. However, it does not say how dates are supposed to be formatted; I tried everything that came to mind, from a UNIX timestamp to writing the date as DD-MM-YYYY or DD.MM.YYYY, and nothing worked.

Also, Import HTML output lots of errors (too many to include in this post) and refused to import anything. Amazingly, after changing the paragraph to

<p>2008-11-10 20:57:28 +0100</p>

the import continued, but of course with this paragraph embedded into the node's body.

Questions:

a) Can Import HTML modify the node's creation date?
b) What kind of date format does it expect?

Thanks & greetings, -asb

#1

dman - November 10, 2008 - 23:33

Very good question!
OK, in most ways, you are doing the right thing. Items with ids should become node attributes directly.
However, I believe we would need to set $node->created or $node->changed, not 'date'.
Meta tags <meta name="created" value="2008-11-10 20:57:28 +0100"/> should also work.

BUT I'm not 100% sure of the date format. It SHOULD be the same as is used on node edit forms, I imagine, which AFAIK is anything that PHP's strtotime() recognises. The real work is done by node_save().
I'm confused that it even identified the 'date' id field as a special case.
import_html does not have any special-case handling to ensure dates are parsed any differently from strings. It's just another field. This could/should be passed through a validator, I guess.

The module WILL get extremely verbose when anything goes wrong, but I can't guess what error you saw - whether it was from import_html itself or from core when trying to submit an illegal value.
If you want to see lots of errors, turn on the DEBUGLEVEL flag in the module! That gives a full dump of each phase of parsing.

... yeah OK, internally node.module uses $node->date to mean the "authored on" date, i.e. $node->created.

node.module:node_validate()

<?php
// Validate the "authored on" field. As of PHP 5.1.0, strtotime returns FALSE instead of -1 upon failure.
if (!empty($node->date) && strtotime($node->date) <= 0) {
  form_set_error('date', t('You have to specify a valid date.'));
}
?>

... looks like it should handle it.
One way that COULD break is if more than one 'date' element were found in the input. This internally casts it into an array of values, which may confuse matters.
Or do you have a CCK field called 'date'?
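
To check a date string outside Drupal, the same strtotime() test that node_validate() applies can be run on its own. This is only a sketch: is_valid_node_date() is a hypothetical helper, not something import_html or core provides, and the second sample string is just a deliberately bad value.

```php
<?php
// Standalone sketch, not part of import_html or Drupal core.
// is_valid_node_date() is a hypothetical helper that mirrors the check
// in node_validate(): the string must parse to a positive timestamp.
function is_valid_node_date($date_string) {
  $timestamp = strtotime($date_string);
  // strtotime() returns FALSE on failure (or -1 before PHP 5.1.0).
  return $timestamp !== FALSE && $timestamp > 0;
}

$candidates = array(
  '2008-11-10 20:57:28 +0100', // the format used in the original report
  'Not Good',                  // a string strtotime() cannot parse
);

foreach ($candidates as $candidate) {
  $verdict = is_valid_node_date($candidate) ? 'accepted' : 'rejected';
  print $candidate . ' => ' . $verdict . "\n";
}
```

If the first string is accepted here, the format itself is unlikely to be the problem, and the value is probably being altered before it reaches validation.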

#2

asb - November 12, 2008 - 15:55
Version: 5.x-1.2 » 5.x-1.x-dev

Hi Dan,

thanks a lot for this thorough explanation. In the meantime, I upgraded to 5.x-1.x-dev and did some experimenting.

(a) CCK "Link" field

Created a CCK "Link" field for hyperlinks, called field_website. Code to import:

<a href="http://www.mysite.tld/" id="website">mysite</a>

Seems not to work.

<p id="website"><a href="http://www.mysite.tld/">mysite</a></p>

Does not seem to work either.

Tried again with id="field_website", same result. Obviously I'm doing something wrong...

(b) Node "Created" date field

Changed id=date to id=created. Code to import:

<p id="created">2006-11-12 15:45:51 +0100</p>

WOW! It somehow does work now. Resulting node creation date:

Created by someone at 1. Januar 1970 - 1:33.

Not exactly what I had in mind, but the node's creation date is being touched somehow.

Also, even if the date has been kind of parsed, it's still imported in the node's body. Code fragment:

    <p id="created">
      2006-11-12 15:45:51 +0100
    </p>

Trying to get a closer look at what's happening, I stripped the input HTML to the bare bones:

<!doctype HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Test</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="Content-Language" content="de">
</head>
<body>

<p id="created">2006-11-12 15:45:51 +0100</p>

</body>

</html>

Deleted the previously imported node and started a new import. Result:

Importing 2 files now
...
Fetching content from 'files/import/D/De/index.asp' now.
Processed page to extract content. Title:'Test'
Inserting New Node.D/De/index.asp

Resulting node creation date: "Created at 12. November 2008 - 16:34..."

Next step: this time I did not delete the node and reran /build/import_html/import_site. Result:

Importing 1 files now
We already have 'D/De/index.asp' in the system as 'node/36'. Overwriting it with the new import
Fetching content from 'files/import/D/De/index.asp' now.
Processed page to extract content. Title:'Test'
Node 36 Exists, updating it.

Resulting node creation date: "Created at 1. Januar 1970 - 1:33..."

Now THAT I don't understand. Time for a coffee... ;)
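
One possible reading of the 1970 date, offered as an assumption that nothing in this thread confirms: if the raw string ends up in $node->created without going through strtotime(), PHP's integer conversion keeps only the leading digits, so the value becomes 2006 seconds after the UNIX epoch. In a Berlin time zone that is 1:33 on 1 January 1970. The sketch below only demonstrates that arithmetic; the variable names and the time zone setting are illustrative, and it does not show what import_html actually does.

```php
<?php
// Assumption for illustration: the date string is stored unparsed and is
// later treated as an integer timestamp.
$raw = '2006-11-12 15:45:51 +0100';

// An integer cast keeps only the leading numeric part of the string.
$as_integer = (int) $raw;

// Format the value the way a German-localised site in Berlin would see it
// (month name stays English here because date() is not locale-aware).
date_default_timezone_set('Europe/Berlin');
$formatted = date('j. F Y - G:i', $as_integer);

print $as_integer . "\n"; // 2006
print $formatted . "\n";  // 1. January 1970 - 1:33
```

If this is what happens, converting the value with strtotime() before it is assigned to the node would be the place to look.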

Thanks again for your module, without it I wouldn't even have thought about getting the old data into Drupal.

Greetings from Berlin, -asb

PS: A small note about the output of the import_html module during imports; on my installation, there seem to be three kinds of output:

a) System messages, written on top of the node in green;
b) Other, more verbose messages in black (admin theme: garland), e.g.:

Importing '/D/De/index.asp'
Processing import page. Full file path: 'files/legacy/D/De/index.asp' , relative path under current section: '/D/De/index.asp'
files/legacy/D/De/index.asp was not tidy enough - running tidy over it now so I can parse it.
Namespaces are html : http://www.w3.org/1999/xhtml - xsl:stylesheet : http://www.w3.org/1999/XSL/Transform Will search for body content labelled '' in the source
Absorbing 'content' from 'meta's with a 'name' from source doc (description=) had a null value
Path to save this page as is 'D/De/index.asp'

This is written on top of the node itself, before the site header starts.

c) Some kind of debugging output, even more verbose, smaller and in light gray; e.g.:

: in _import_html_import_files() line 1228 import_html.module : in ...
...
: in parse_in_xml_file(), line 101 xml-transform.inc ...

 
 
