I have been asked to migrate an academic website into Drupal. The site is essentially a massive searchable text document. It is made up of thousands of HTML files which contain plain text.

What would be the best way to migrate these HTML files into Drupal? I'm trying to avoid having to create a new article for each HTML file, as this would take months!

Thanks

Dan

Comments

WorldFallz

It sort of depends on the relationships between the files, but I would definitely consider starting with every file being a node. It shouldn't take months: check out the import_html module.

Pedja Grujić

Also take a look at the Feeds module (http://drupal.org/project/feeds/); we have used it successfully in the past to import thousands of static files.

Pedja
Drupal Geek at New Target Inc.
http://www.newtarget.com

WorldFallz

I can't believe I forgot that one!

dga5000

Many thanks for the suggestions about the modules. I will have a look at both of them.

The files make up a massive textbook of quotes from 4 authors. The HTML files reflect this and are divided into 4 sections. Each individual quote (and there may be hundreds in one HTML file) has the quote itself, the author and a number for the quote. I need to make the quotes for each author searchable by text terms in the quote and by quote number.

Any tips gratefully received!

WorldFallz

Then you'll probably want to use Feeds or node_import (not import_html) and make sure each quote gets imported as an individual node (create a new content type for it) with author, text (quote), and quote number fields. Then you can create listings and a search using Views with exposed filters.

Though full text is indexed and searchable with core Drupal, I would set it up with a new content type and fields for maximum flexibility, even if you have to write a script to parse the files and output an importable format.
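
A script of that kind might look like the minimal sketch below, written in Python with only the standard library. It is not based on the actual files: it assumes (hypothetically) that every quote sits in its own <p> element and begins with its number, for example "<p>12. Quote text</p>", and that the author is known from the section the file belongs to. The names ParagraphCollector, extract_quotes and write_csv are made up for illustration, and the pattern would need adjusting to the real markup.

```python
import csv
import io
import re
from html.parser import HTMLParser

# ASSUMPTION (hypothetical layout): each quote is one <p> element whose text
# starts with the quote number, e.g. "12. Quote text" or "12) Quote text".
QUOTE_RE = re.compile(r"^\s*(\d+)[.)]?\s+(.*\S)\s*$", re.DOTALL)


class ParagraphCollector(HTMLParser):
    """Collects the text content of every <p> element in a document."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while we are outside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.paragraphs.append("".join(self._buffer))
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


def extract_quotes(html_text, author):
    """Return one dict per numbered paragraph; other paragraphs are skipped."""
    parser = ParagraphCollector()
    parser.feed(html_text)
    parser.close()
    rows = []
    for paragraph in parser.paragraphs:
        match = QUOTE_RE.match(paragraph)
        if match:
            # Collapse runs of whitespace and line breaks inside the quote.
            text = " ".join(match.group(2).split())
            rows.append(
                {"author": author, "number": int(match.group(1)), "quote": text}
            )
    return rows


def write_csv(rows, out):
    """Write the rows as CSV with a header line, one quote per line."""
    writer = csv.DictWriter(out, fieldnames=["author", "number", "quote"])
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    sample = "<h1>Section</h1><p>1. First quote.</p><p>2. Second quote.</p>"
    buffer = io.StringIO()
    write_csv(extract_quotes(sample, "Author A"), buffer)
    print(buffer.getvalue())
```

Running this over every file and concatenating the rows would give a single CSV with one line per quote, which an importer that accepts CSV could then map onto the author, quote number and quote text fields of the new content type.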

Pedja Grujić

Parsing each quote into its own node would probably be best in the long run; it would allow for a more flexible system.

I would suggest using Taxonomy to create terms for the different authors and sections, and then referencing those in your content type.


dga5000

What type of parser do you think would be most suitable for this type of content?

dga5000

Thanks for the very useful advice. I took a look at node_import and it seems the author is no longer maintaining it and views Feeds as the natural replacement for Drupal 7. I shall now spend some time working out how to batch import HTML with Feeds.