Importing cross-referenced nodes to Drupal
I'm working on a project which will (eventually) involve replacing an entire CMS with Drupal. The first stage of this involves importing large text records which have been exported in XHTML format from the CMS.
The XHTML is messy, but can be tidied up enough to create a node (there's no relevant taxonomy or user information to worry about). The major problem is that the text is stuffed with cross-references that use a rather intimidating UID mechanism to create:
- Footnotes
- Links to specific locations within the same record
- Links to specific locations in other records
In addition to this, the export/import won't be just a one-off - some records will need to be updated continually until we can stop using the CMS. For various logistical reasons, it's not practical to move all the content to Drupal and stop using the existing CMS immediately - we can't even do it by publication, due to the cross-references between publications. Therefore, I'm envisaging a process something like:
- export record from CMS using existing system
- open a new or existing node in Drupal (of the appropriate type, with the relevant CCK fields)
- use some magic widget to import the record
- save the record as an updated version, retaining any previously added taxonomy etc
More detail on the links:
1. Footnotes
I'd like to use the http://drupal.org/project/footnotes module for this, but it depends on having the footnote text adjacent to the footnote reference, whereas in the originating content they're in different places. For example, the reference appears in the text:
...should assess what the likely level of her/his earnings are on the evidence available.<Endnote_Reference id="GUID833B9628-CA89-43F8-859B-BE0B6E283007">127</Endnote_Reference>
and the footnote itself appears separately at the bottom:
<endnote id="GUID833B9628-CA89-43F8-859B-BE0B6E283007" number="127">CH/48/2006</endnote>
I imagine that the easiest way to do this is just by converting them to internal anchors?
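As an alternative to internal anchors, here is a rough sketch of a pre-processing step that moves each endnote's text next to its reference, so the Footnotes module could be used after all. It assumes the module's inline `[fn]...[/fn]` syntax and that every ID has the `GUID...` shape shown above; the function name `inline_footnotes` is made up for illustration, and regular expressions are only a stand-in for a proper XML parser once the XHTML has been tidied.

```python
import re

# Patterns based on the two sample elements above (an assumption about the
# whole export; real data may need a proper XML parser instead).
ENDNOTE_RE = re.compile(
    r'<endnote\s+id="(?P<id>GUID[0-9A-F-]+)"[^>]*>(?P<text>.*?)</endnote>',
    re.DOTALL,
)
REFERENCE_RE = re.compile(
    r'<Endnote_Reference\s+id="(?P<id>GUID[0-9A-F-]+)">.*?</Endnote_Reference>',
    re.DOTALL,
)


def inline_footnotes(xhtml):
    """Replace each endnote reference with [fn]endnote text[/fn]."""
    # Collect the endnote text by GUID, then drop the endnote elements.
    notes = {m.group("id"): m.group("text").strip()
             for m in ENDNOTE_RE.finditer(xhtml)}
    body = ENDNOTE_RE.sub("", xhtml)

    def replace(match):
        text = notes.get(match.group("id"))
        # Leave references without a matching endnote untouched for review.
        if text is None:
            return match.group(0)
        return "[fn]" + text + "[/fn]"

    return REFERENCE_RE.sub(replace, body)
```

The original numbering (127 here) is discarded in this sketch, on the assumption that the module renumbers footnotes itself.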
2. Links to specific locations within the same record
This uses an anchor-like mechanism, e.g.:
...cannot be met by HB (see <see id="GUIDF250F758-F851-42AC-BB56-936EE219E291">reference</see>).</p>
would link to:
<destination id="GUIDF250F758-F851-42AC-BB56-936EE219E291"/><p class="bhead">Payments that cannot be met by housing benefit</p>
Again, I envisage using internal anchors unless someone has a better solution - the problem here is distinguishing the references that point to a destination within the same record from those that point to other records - see below.
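To make the internal-anchor idea concrete, here is a minimal sketch that converts only those references whose destination exists in the same record and reports the rest. The element shapes are taken from the sample above; the function name `to_internal_anchors` and the choice of regular expressions over an XML parser are illustrative assumptions.

```python
import re

# Patterns based on the sample <see> and <destination> elements above.
SEE_RE = re.compile(r'<see\s+id="(GUID[0-9A-F-]+)">(.*?)</see>', re.DOTALL)
DEST_RE = re.compile(r'<destination\s+id="(GUID[0-9A-F-]+)"\s*/>')


def to_internal_anchors(xhtml):
    """Return (converted_xhtml, ids_not_found_in_this_record)."""
    # Every destination ID present in this record.
    local_ids = set(DEST_RE.findall(xhtml))
    unresolved = []

    def link(match):
        guid, label = match.group(1), match.group(2)
        if guid in local_ids:
            return '<a href="#%s">%s</a>' % (guid, label)
        # Destination is not in this record: keep the element as-is.
        unresolved.append(guid)
        return match.group(0)

    out = SEE_RE.sub(link, xhtml)
    # Turn each destination marker into an empty named anchor.
    out = DEST_RE.sub(r'<a id="\1"></a>', out)
    return out, unresolved
```

The second return value is effectively the "validate within the record" check: anything listed there must point at another record.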
3. Links to specific locations in other records
This looks exactly like type 2, but the destination doesn't exist in the same record - it's in another record in the same system. This part is doing my head in. The best guess I've been able to come up with so far is:
- extract all of the IDs from each record into a CCK field on its node (probably index them separately in the database)
- make the link into a query which:
  - searches for the relevant ID
  - generates a URL for that record, including the anchor
  - follows the URL to the appropriate location
This would probably work for type 2 also - a bit of overkill but since I can't tell the references apart (short of validating them within the record) it's probably easier to handle them the same way.
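The lookup described above could be sketched roughly as follows. A plain dictionary stands in for the separate database index, the node paths such as `/node/2` are hypothetical, and the function names are made up for illustration. This version rewrites the links in one pass rather than querying on every click, which is a swapped-in simplification; the same index could equally back a redirect-style query.

```python
import re

# Patterns based on the sample <see> and <destination> elements.
DEST_RE = re.compile(r'<destination\s+id="(GUID[0-9A-F-]+)"\s*/>')
SEE_RE = re.compile(r'<see\s+id="(GUID[0-9A-F-]+)">(.*?)</see>', re.DOTALL)


def build_index(records):
    """Map each destination GUID to the path of the record containing it.

    `records` maps a (hypothetical) node path to that record's XHTML.
    """
    index = {}
    for node_path, xhtml in records.items():
        for guid in DEST_RE.findall(xhtml):
            index[guid] = node_path
    return index


def resolve_links(xhtml, index):
    """Rewrite <see> elements as links of the form path#GUID."""
    def link(match):
        guid, label = match.group(1), match.group(2)
        path = index.get(guid)
        if path is None:
            # Target record not imported yet: leave the reference untouched.
            return match.group(0)
        return '<a href="%s#%s">%s</a>' % (path, guid, label)

    return SEE_RE.sub(link, xhtml)
```

Because same-record and cross-record references go through the same index, there is no need to tell them apart. Since records keep being re-exported, the index would need rebuilding (or updating for the changed record) on each import, and references left unresolved could be retried once their target record arrives.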
So... any feedback/opinions/comments welcomed:
- does this make sense?
- is it workable?
- any suggestions re how to go about it?

bump...
have I come up with a challenge that can't be solved by the collective minds of drupal.org? :)