How to convert long Word doc to Drupal Book using this module

ntripcevich - October 8, 2008 - 06:09
Project:HTML2Book
Version:5.x-2.x-dev
Component:Miscellaneous
Category:feature request
Priority:normal
Assigned:tomfeinberg
Status:closed
Description

I'm just writing to thank the author(s) of this module. It saved me a lot of time!

I was able to convert my 945-page archaeology dissertation into a Drupal book with very little drudgery thanks to this module. It was structured hierarchically in MS Word 2003 (using Heading styles), and I had made a PDF out of it.

The hardest parts were actually not related to this module: bringing over the embedded graphics (EPS, JPG) as linked JPG files, and getting the bibliographical references (Endnote) to appear as hyperlinks to nodes in Drupal Biblio.

Here's the document as a 499 node Drupal Book.
http://mapaspects.org/tripcevich-phd-diss

I've pasted my notes below in case it helps others.

Best, Nico Tripcevich

Word 2003 / PDF to Drupal Book (Drupal 5.x)

This page describes how I converted a long MS Word document with embedded graphics and references to a hierarchical Book structure in the Drupal 5 CMS.

The original document was my 945-page PhD dissertation, with hundreds of Endnote references and figures, which began as an MS Word 2003 document with external (linked) graphics, primarily in EPS and JPG formats. The resulting PDF was submitted to ProQuest and is served online here. However, the PDF was not readily accessible to browsers and search engines, so I converted it to a Book structure with 499 content nodes (plus the Biblio nodes) that can be viewed here.

Summary

The principal challenges were batch-processing the following steps:

  • converting the structure to a web based hierarchy (Drupal Book)
  • converting EPS graphics to web-viewable JPG format
  • converting bibliographical references to web-viewable hyperlink format.

These were addressed by getting the document out of Word into XHTML, converting the figures to placed JPG files using Regular Expressions searches in Dreamweaver, and exporting Endnote references to Biblio module with a CiteID key as the basis for the URL Alias.

 

Adding tags to document and formatting for web.

The original document had hierarchical titles specified through styles. I had used the default Heading styles in Word (which export as <h1>, <h2>, and so on) while writing the dissertation.

  1. In Drupal 5 install the following modules: Book (core), HTML2Book, Html Corrector, and HTML Tidy as described in the HTML2Book module description.
  2. Embedded graphics export: MS Word supports many embedded graphics formats that are not converted adequately to web-viewable formats. In particular, EPS files were garbled during the output. I used a Photoshop droplet to batch-convert all placed EPS files to 144 dpi JPG files with the same filename.
    1. Using Dreamweaver I converted tags that resembled

      <IMG src="/diss_images/filename.EPS" border="0" width="565" height="865">

      to something more like the following

      <a href="sites/all/files/diss_images/filename.JPG"><IMG src="/diss_images/filename.JPG" border="1" width="500"></a>

      Note that in addition to changing the EPS to JPG and the path to a Drupal sites/all path, the file is now displayed inside a hyperlink using an IMG SRC tag that shows the image at only 500px width (scaled proportionally). If the user clicks the image they are shown a larger version. The page is more readable (no giant graphics) but larger views are available on demand. The only drawback is that the large image is loaded on every view although it's displayed smaller. There are alternatives, such as producing a medium-size version, but this is more work.
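
For readers without Dreamweaver, the same tag rewrite could be sketched with a regular expression in Python. This is only an illustration, not the procedure used above: it assumes every image tag has the attribute layout shown in the example, and it uses the JPG filename in both the link and the IMG tag. The function name link_images is made up for this sketch.

```python
import re

# Matches an <IMG> tag whose src points at an EPS file under /diss_images/.
# Assumption: src is the first attribute, as in the example tag above.
IMG_EPS = re.compile(
    r'<IMG\s+src="/diss_images/([^"]+)\.EPS"[^>]*>',
    re.IGNORECASE,
)


def link_images(html, width=500):
    """Wrap each EPS image tag in a link to the full-size JPG (hypothetical helper)."""
    def repl(match):
        name = match.group(1)  # filename without extension
        return (
            f'<a href="sites/all/files/diss_images/{name}.JPG">'
            f'<IMG src="/diss_images/{name}.JPG" border="1" width="{width}"></a>'
        )
    return IMG_EPS.sub(repl, html)


before = '<IMG src="/diss_images/filename.EPS" border="0" width="565" height="865">'
print(link_images(before))
```

Dropping the height attribute and setting only the width is what lets the browser scale the image proportionally.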

Formatting in-text citations for linking out to Biblio nodes

As of Endnote X2 this program still cannot produce hypertext links to a bibliographical view. I worked around that by modifying the In-Text citation style to include the unique ID number Endnote Ref# to use as a CiteKey in Biblio. Then I used Regular Expressions search (in Notepad++ or Dreamweaver) to convert the in-text citation to a hyperlink.

  1. The formatted Endnote tags, with the Endnote reference number included, were derived from the American Antiquity journal style and looked like this

    (<a href= "/biblio/ref_custom1">Author Year</a>)
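
A regular-expression conversion along these lines could be sketched in Python. The input format here is an assumption made purely for illustration: it supposes the modified Endnote style writes the record number after the year as "#123", for example "(Smith 1999 #412)". The actual intermediate format used in the notes above is not shown, so the pattern would need adjusting to whatever the citation style really emits.

```python
import re

# Assumed input: "Author Year #RecNumber", separated by ";" inside parentheses.
# Group 1 is the visible citation text, group 2 is the Endnote record number.
CITATION = re.compile(r'([^()#;\s][^()#;]*?\d{4}[a-z]?)\s+#(\d+)')


def link_citations(text):
    """Turn each 'Author Year #N' into a link to /biblio/ref_N (hypothetical helper)."""
    return CITATION.sub(r'<a href="/biblio/ref_\2">\1</a>', text)


print(link_citations('(Smith 1999 #412; Jones and Lee 2001 #77)'))
```

The target path /biblio/ref_N matches the URL alias pattern biblio/ref_[biblio_custom1] described in the Biblio section of these notes.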

Process for Import to Drupal

  1. Select all or part of the text in Word (I went chapter by chapter, so 50-200 pages at a time). Make sure the graphics have loaded. If they appear as an empty box with a red X, try reloading them in Word by selecting them and pressing F9, or toggle the field code display with Alt-F9 (Windows).

If you also have a PDF of the document, you can try exporting the PDF to XML, which gives you an "images" folder full of the embedded graphics.

  2. Save this selection out to an HTML file using Save As... with the "Web Page, Filtered...(HTM)" format.
  3. In Drupal 5, create the first page of a Book so the top-level hierarchy is started. Create a secondary page and make sure the PathAuto and Token-derived URL works. I used [bookpath-raw]/[title-raw]
  4. Open the HTML file saved out from Word and paste it into Drupal. There are different ways to do this, but one that worked for me was to
    • Turn off rich text editing (TinyMCE or other) and paste in straight HTML. Allow HTML Tidy to work.
    • I had to strip out other tags, including replacing "<p class = c1>" with <p>, correcting the "-" symbol in places, and turning my custom <quote> style tag from Word into <blockquote> and </blockquote>. Dreamweaver has a "Find/Replace Tags" command that makes this easy.
  5. Bring in your Book, probably removing the topmost <h1> tag. That is to say, the Book page you're creating has a title already, so the topmost tag should not reproduce it.
  6. Note that HTML Tidy (5-1.x-dev, 19 June 2007) has conflicts with CCK custom node types, so I turn it off when not in use.
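
The tag cleanup described in the steps above could also be scripted. The following Python sketch is an assumption-laden illustration rather than the Dreamweaver procedure itself: it covers only the three substitutions named in the steps (the c1 paragraph class, the custom quote tag, and the topmost h1), and the function name clean_word_html is hypothetical.

```python
import re


def clean_word_html(html):
    """Apply the three cleanups from the steps above (hypothetical helper)."""
    # <p class = c1> (with or without quotes or spaces) becomes a plain <p>.
    html = re.sub(r'<p\s+class\s*=\s*"?c1"?\s*>', '<p>', html, flags=re.IGNORECASE)
    # The custom <quote> ... </quote> pair becomes <blockquote> ... </blockquote>.
    html = re.sub(r'<(/?)quote\s*>', r'<\1blockquote>', html, flags=re.IGNORECASE)
    # Drop only the first <h1>...</h1>, since the Book page already has a title.
    html = re.sub(r'<h1\b[^>]*>.*?</h1>\s*', '', html, count=1,
                  flags=re.IGNORECASE | re.DOTALL)
    return html


sample = '<h1>Chapter 1</h1>\n<p class = c1>Text</p><quote>Cited passage</quote>'
print(clean_word_html(sample))
```

Regular expressions are adequate for these fixed patterns, but they are not a general HTML parser, so the result is still worth passing through HTML Tidy.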

Bibliographical entries: Endnote db to Biblio via XML

Exporting from Endnote x2

Settings for Export from Endnote to Biblio 5.x

    • The XML format should include the tag <rec-number>.
    • Export the file to XML using File>Export... then Save as Type... XML, Output Style: (I think any will do as long as <rec-number> is included in the XML)
    • Import to Biblio
    • Note: rec-numbers are specific to each database (but cumulative) so they aren't the best unique ID#. That's just what worked for me.
      It's probably better to make use of the digital object identifier (DOI) and the Biblio citekey functionality.

Bibliographical data into Biblio as EndNote 8+ XML

Taxonomy settings: I used a separate Drupal Taxonomy Vocabulary for Pubs so I could hide that entire Vocabulary in the Book and the Biblio using the Taxonomy Hide module (because the default Taxonomy teaser view is useless).

Used a text editor to replace <rec-number></rec-number> tags with <custom1></custom1>
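
The same replacement can be scripted instead of done by hand. A minimal Python sketch, assuming the exported file uses plain <rec-number> tags without attributes (the function name is hypothetical):

```python
import re


def rec_number_to_custom1(xml_text):
    """Rename <rec-number> elements to <custom1> (hypothetical helper)."""
    # The optional "/" is captured so opening and closing tags are both renamed.
    return re.sub(r'<(/?)rec-number>', r'<\1custom1>', xml_text)


record = '<record><rec-number>412</rec-number><titles>...</titles></record>'
print(rec_number_to_custom1(record))
```

If any record already has its Custom 1 field filled in, this would leave two custom1 elements in that record, so it is worth checking the export for existing <custom1> tags first.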

Upload the XML file to Biblio using Endnote8+ XML format.

If you have trouble with the upload timing out you may have to cut your file into smaller XML file chunks during the export from Endnote.

    • Due to memory limitations you may have to go down to batches as small as 100. You can create "Groups" in Endnote (available as of Endnote X) alphabetically, so that there is an A-B group of 80 references, a C-D group of 92 references, and so on. These can be brought from Endnote into Biblio one by one. Discussion of issue.
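
As an alternative to creating groups inside Endnote, a single large export could be cut into smaller files afterwards. The Python sketch below is a different technique from the one described above and rests on an assumption about the file layout: that the export has an <xml> root containing one <records> element whose children are the individual <record> entries. The function name split_records is made up for this illustration.

```python
import xml.etree.ElementTree as ET


def split_records(xml_text, batch_size=100):
    """Split one EndNote XML export into several smaller XML strings."""
    root = ET.fromstring(xml_text)
    records = root.find('records')  # assumed layout: <xml><records><record>...
    if records is None:
        raise ValueError('no <records> element found; layout differs from assumption')
    all_records = list(records)
    batches = []
    for start in range(0, len(all_records), batch_size):
        # Rebuild the same wrapper elements around each slice of records.
        new_root = ET.Element(root.tag)
        new_records = ET.SubElement(new_root, 'records')
        new_records.extend(all_records[start:start + batch_size])
        batches.append(ET.tostring(new_root, encoding='unicode'))
    return batches


demo = '<xml><records>' + '<record/>' * 250 + '</records></xml>'
print(len(split_records(demo)))  # number of batch files that would be written
```

Each returned string could then be written to its own file and uploaded to Biblio separately.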


Then I went into the PathAuto settings (v5.x) and changed the Node path settings for Biblio type "Pattern for all biblio paths:" to biblio/ref_[biblio_custom1], but this also required modifying the Biblio.module code to add support for showing Custom1 field in the PathAuto+Token system for URL Aliasing, as described in
http://drupal.org/node/89038#comment-869934

I used Custom1 instead of CiteKey because Custom1 is imported in the original Endnote8 XML parser (where CiteKey is not provided). The "unique ID" number used in the URL is relatively arbitrary: it is the <rec-number> in the Endnote bibliography that I used in graduate school. If these records are brought into another DB or even saved into a subset Endnote .ENL file the ID numbers change.

Note: Due to memory limitations I had to import the bibliography in batches of about 100. I created "Groups" in Endnote (available as of Endnote X) alphabetically, so there was an A-B group of 80 references, a C-D group of 92 references, and so on. I exported these groups from Endnote and brought them into Biblio one by one. This limitation might be solved in the Drupal 6 version of Biblio (actually, I didn't have this problem with a v5 version).

from http://mapaspects.org/article/converting-word-doc-and-pdf-drupal-book-we...

#1

ntripcevich - March 13, 2009 - 20:11

FYI: the site referred to in this post was just updated to Drupal 6.10, and the Book content is still there and looks good. Unfortunately this HTML2Book module isn't available for Drupal 6.x, so I'm glad that I used it when my site was still on Drupal 5.

#2

tomfeinberg - April 29, 2009 - 10:52
Title:Success converting long dissertation to Drupal Book using this module!» Nice info!
Version:5.x-1.x-dev» 5.x-2.x-dev
Assigned to:Anonymous» tomfeinberg

Thanks for sharing this information, it really helps me with my MBA Dissertation.

#3

seat - June 11, 2009 - 17:30

If you are having trouble with uploading a big file it's probably due to the memory limitations on PHP.
Try this http://drupal.org/node/344898

#4

ntripcevich - October 1, 2009 - 18:29
Title:Nice info!» How to convert long Word doc to Drupal Book using this module

I can't edit my original post, but I wanted to point out that in the section entitled
"Adding tags to document and formatting for web." the resulting HTML tags should be:
...something more like the following
<a href="sites/all/files/diss_images/filename.JPG"><IMG src="/diss_images/filename.JPG" border="1" width="500"></a>

(I changed EPS to JPG in the IMG tag.)

#5

vikasvashishth - November 11, 2009 - 07:56

Hi, is there any way in Drupal 6 to easily convert a Word file into a Drupal book? HTML2Book is only available for Drupal 5.
Thanks in advance.

 
 
