In order to build the pipeline described here, you need the Drupal Transformations module (along with its dependency, Field Tool) in addition to CSV Transformations, which we already used in HOWTO #1.
This time, we're actually going to import data into Drupal, so make sure you have a database backup before executing the pipeline, in case something goes wrong. (Transformations cannot undo data modifications once they have been performed, so you should only use it with reasonably well-tested pipelines and valid data sources.) The goal is to import all rows of the courses.csv file from HOWTO #1 into Drupal nodes; a few sample lines of that file are shown after the list below. For that purpose, you will need three things:
- A content type with appropriate fields, called "your content type" for the rest of this HOWTO. You probably want to use the course name as the title and set up CCK fields for the other CSV columns: a CCK text field for "Instructor", and CCK number fields for the rest (integer for "Units" and "Max number of attendants", decimal for "Unit length"). Consult the CCK handbook on how to create such a content type.
- A transformation pipeline that transforms the CSV file into structured records - basically the same thing that we did in HOWTO #1, only this time we'll be storing the data rows as nodes instead of just extracting the header row.
- As part of the pipeline above, we need another pipeline that is executed once for each row. This pipeline does the actual mapping of CSV to node/CCK fields, and takes care of storing the node.
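For reference, here is roughly what the courses.csv file from HOWTO #1 looks like, with the semicolon delimiter and the header row that we'll rely on below. (The data rows here are made-up examples; your real file will contain different values.)

Course Name;Instructor;Units;Unit length (hours);Max number of attendants
Introduction to Widgets;Jane Doe;4;1.5;30
Advanced Widgetry;John Smith;2;2.0;15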
I'll refer to the first of these pipelines as the "outer pipeline" and the other one as the "inner pipeline". We'll create the outer one first so that it's easier to follow the data flow; personally, I prefer to create the inner pipeline first so that I can create both in one go, but either order leads to the same end result.
So let's head to the pipeline overview page and add a new one called "Import courses from CSV: outer pipeline". This one looks very much like the pipeline from HOWTO #1, with just one small difference. Add the following operations: "Read file contents, line by line" (in "Files and directories"), "CSV text lines to list of records" (in "CSV"), and - this is what's new here - the "For-each loop" operation (in "Data flow"). As in the previous HOWTO, connect the "File path" input to a new pipeline parameter so you can provide it at execution time, and connect the "Text lines" output with the "CSV text lines" input.
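If it helps to think of the outer pipeline in code, here is a rough, hypothetical PHP sketch of what it will do once fully configured (including the fixed values we'll assign in a moment). The function names are made up for illustration; the real work is of course done by the operations you just added.

// Rough PHP equivalent of the outer pipeline (illustration only).
function import_courses_outer($file_path) {
  // "Read file contents, line by line"
  $lines = file($file_path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
  // "CSV text lines to list of records", with ';' as fixed delimiter
  // (explode() is a simplification; it ignores quoting rules)
  $records = array();
  foreach ($lines as $line) {
    $records[] = explode(';', $line);
  }
  // "Skip first line": don't treat the header row as data
  array_shift($records);
  // "For-each loop": run the inner pipeline (built below) once per record
  foreach ($records as $record) {
    import_courses_inner($record);
  }
}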
Data input widgets
We assume that courses files, should you ever import more than one, will always be in the same format: a semicolon as field delimiter, and a header row with column names. Let's assign this information as fixed values so that you don't need to provide it every time you execute the pipeline. You'll notice that the text labels of all the input slots are links; you can follow such a link to assign a fixed value to that input slot. So click on the "Delimiter" text label to assign a fixed delimiter. First up, you don't want the default value (the comma) to be used, so uncheck the "Use the default value for this input" checkbox. Then enter the actual delimiter (a semicolon) into the text field. Finally, press the "Store value" button to actually assign the value - you'll notice that the connection endpoint is now a filled circle (for "connected" - yes, fixed values also count as "connections") and has a light-blue background, which stands for "fixed value assigned".
You can do the same thing with the "Skip first line" input - follow the link of its text label, uncheck the "Use default value" checkbox, and instead check the "Skip first line" checkbox, as we don't want to import the header row as a new node. Transformations comes with input widgets for a couple of data types: texts (single-line and multi-line), numbers, and boolean values (checkboxes), plus a few widgets for types defined by a given operation, e.g. selection boxes for operations and pipelines. Developers can add more data input widgets for other data types if they like.
When an operation defines an input slot with a data type for which no input widget is available, Transformations falls back to plain PHP input. The "Column names" input of the CSV parser operation, for example, is defined as a list of text values, and Transformations does not offer a widget for that data type. If you know a bit of PHP, you can assign values for such input slots by providing a PHP expression (which will be executed immediately and its result stored as the fixed input value); otherwise, it's probably a better idea to stay away from assigning those inputs directly.
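For example, if you did want to assign the "Column names" input directly, the PHP expression would just need to evaluate to a list of text values, along the lines of the following (the exact syntax the widget expects may differ slightly):

array('Course Name', 'Instructor', 'Units', 'Unit length (hours)', 'Max number of attendants')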
For-each, part 1
So the pipeline is configured to process the CSV file and return everything from the second line to the last as the "List of records" output of the CSV parser operation. Each of these records is (as in HOWTO #1) a list of text values, and for each record we want to create a separate node. Transformations can only execute an operation once in any given pipeline, so the "For-each loop" operation solves this issue by encapsulating another operation and executing it repeatedly, once for each value in the list that it receives. Remember that pipelines can also be operations, so you can run the CSV records through any appropriate pipeline that you built before. However, we haven't yet built the inner pipeline that imports a record as a Drupal node, so we need to do that before finishing our outer pipeline. Save the outer pipeline in its current state, and go back to the pipeline overview page in order to create another pipeline.
The inner pipeline will be called "Import courses from CSV: inner pipeline" (surprise!) and has quite a few operations to add:
- "Extract child elements from structure (simple mapping)" (in "Lists and structures"),
- "Create new node object" (in "Drupal nodes"),
- "Set fields in node object" (in "Drupal nodes"), and finally,
- "Save node" (also in "Drupal nodes").
Lists, keys and element mapping
Some background on lists: In Transformations, a list always consists of elements (or "values") with a key associated to each value. The key is a string or number, and may or may not be used by an operation that's processing a list. The "Extract child elements" operation exposes list elements as output slots, each assigned a specific key, which is why it's a "mapping" operation (the key is mapped to the slot). So we can extract single values from a list and connect them upfront if we know their associated keys. The "CSV text lines to list of records" operation assigns numeric keys to the fields of each record, starting with 0 for the first field (as is usual for array structures in programming). Therefore, what the inner pipeline receives as pipeline parameter corresponds to a PHP array like this:
0 => "Course Name",
1 => "Instructor",
2 => "Units",
3 => "Unit length (hours)",
4 => "Max number of attendants",
Actually, this is exactly the list that resulted from executing the pipeline in HOWTO #1. Moreover, the "Key map" input slot of the "Extract child elements" operation expects to receive a list that looks just like the above; it will use that list's keys (0 to 4) to extract the values with the same keys from the list being processed, and it uses that list's values ("Course Name", "Instructor", etc.) to label the output slots for better readability. In other words, if you assign a list to the "Key map" input that has the same keys as the list items that it's going to process later on, the "Extract child elements" operation will let you wire up the individual elements upfront.
And that's what we want here: take one line from the CSV file (the header row) and use it as a template for all the other lines, which have the same keys (0 to 4) but different element values. Indeed, you can just take the above code block and paste it into the PHP data input widget for assigning a fixed value to the "Key map" input slot, and the operation will offer five more output slots that you can connect. As it relies solely on keys for mapping, the operation doesn't mind whether the element values are header row titles or example rows; they're just used as descriptions for the output slots. If you disconnect the "Key map" input slot, the outputs will be gone again; if you modify it, you'll get a different set of output slots.
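To make the mapping a bit more concrete, here is the key map next to a hypothetical data record, written as PHP arrays (the record values are invented for illustration):

// The key map assigned as fixed value: keys 0-4, values become slot labels.
$key_map = array(
  0 => 'Course Name',
  1 => 'Instructor',
  2 => 'Units',
  3 => 'Unit length (hours)',
  4 => 'Max number of attendants',
);
// One data record arriving at the "Item structure" input (example values):
$record = array(
  0 => 'Introduction to Widgets',
  1 => 'Jane Doe',
  2 => '4',
  3 => '1.5',
  4 => '30',
);
// Conceptually, the operation then exposes $record[0] on the output slot
// labeled "Course Name", $record[1] on "Instructor", and so on.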
Importing the result of another pipeline
Typing the above array code is cumbersome and error-prone, and we've already built a pipeline in HOWTO #1 that can extract this header row list from the original file, so instead of writing it yourself you can import the value from the "Extract first row of CSV file" pipeline from that HOWTO. Let's try that now: follow the "Assign fixed value" link (the description text) of the "Key map" input, and instead of entering the above PHP array into the text field, select the "Import value from pipeline" pseudo-tab. You can execute the "Extract first row of CSV file" pipeline from there in order to retrieve its output. Select it, and supply the same arguments that you used in HOWTO #1 (correct file name, a semicolon as delimiter, usage of the delimiter default value disabled). Note that you can select which pipeline output will be imported - it's not an issue with this pipeline as there is only a single output, but it can be a good thing for pipelines with multiple outputs. When you hit the "Import value" button, the header row of the CSV file will show up as output slots waiting to be connected.
Oh, and you also want the "Item structure" slot to be connected to a pipeline parameter, because that's where the inner pipeline is going to get its list of text values from.
Now that we can map the CSV fields to other input slots, the only thing left for the inner pipeline is to make use of them and create a new node from that data. In Transformations, this involves creating a yet-unsaved node object, modifying it so that it contains the desired values ("Set fields in node object"), and saving the node from volatile memory to the database ("Save node"). First up, assign the content type to the "Create new node object" operation, which will create the "New node object" output slot. Connect that slot to the "Existing node object" input of "Set fields", and the "Modified node object" output to the "Node object" input of the "Save node" operation. Connecting the saved node object to a pipeline output doesn't hurt, so let's do that too.
You might notice that the set of available fields in the "Set fields" operation was reduced when you connected the "Existing node object" slot; that's because connecting to an output with a fixed content type automatically sets the "Content type" field. It's mostly a usability improvement though, saving lazy people like us from entering that value by hand.
All you need to do now is to connect the CSV fields from the "Extract child elements" outputs to the appropriate node fields of the "Set fields in node object" operation. As long as the data types on both ends of a connection match (e.g. don't assign instructor names to number fields), the resulting node will be alright. Actually, there is a slight type mismatch here: the CSV fields are single values, whereas the "Set fields" operation specifies each CCK field slot as a list of values, so that it's also possible to assign multiple values to a CCK field. There's some special logic in the CCK importer so that you can connect the single-value fields to CCK's list inputs anyway - you might call this a dirty hack, and that's probably true. But hey, it works for now.
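If you're curious what those three operations amount to in plain Drupal API terms, it's roughly the sketch below. The content type and CCK field names (course, field_instructor, and so on) are just assumptions here; use whatever you defined for your content type.

// Rough equivalent of "Create new node object", "Set fields in node object"
// and "Save node" (illustration only; field names depend on your content type).
// $course_name, $instructor, $units, $unit_length and $max_attendants stand
// for the values extracted from one CSV record.
$node = new stdClass();
$node->type = 'course';                 // your content type's machine name
$node->title = $course_name;
$node->field_instructor = array(array('value' => $instructor));
$node->field_units = array(array('value' => (int) $units));
$node->field_unit_length = array(array('value' => (float) $unit_length));
$node->field_max_attendants = array(array('value' => (int) $max_attendants));
node_save($node);                       // writes the node to the database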
You can leave any node fields that you want to remain unmodified unconnected, such as the node id or the published/promoted/sticky options. It's your own responsibility to fill in required fields though, otherwise Drupal will complain or import nodes that don't adhere to their own "requiredness" settings. Well yeah, that's it! Let's save (*important*! only saved stuff gets used by other pipelines) and head back to the outer pipeline to see what we can do with our new inner pipeline now.
For-each, part 2
There's one input slot left red in our outer pipeline: the "Inner operation" input of the for-each operation. Now that the inner pipeline exists, we can specify that we want to execute it for the retrieved CSV records. When you assign "Pipeline" as the fixed value for the "Inner operation" slot, the input slots of the inner operation will appear. In other contexts you might also use any other, non-pipeline operation as the inner operation, but for now we don't need that.
The "Pipeline" operation itself has a required input named "Pipeline", which you need to fill in as well - assign the inner pipeline ("Import courses from CSV: inner pipeline") as the one that we want to have executed. Selecting our inner pipeline will, again, bring up its slots as further input slots of the for-each operation. The inner pipeline's only pipeline parameter is the "Item structure" (connected to the "Extract child elements" operation). Notice how it's still singular although we wanted to connect the "List of records" to a "List of item structures" input.
If you simply connect the "List of records" to the for-each operation's standard "Item list" input, it won't do what we want: that input only exists so that you can repeat an operation without actually passing the list values as a pipeline parameter. Use it that way if you only care about how many times the inner operation is executed.
That's not what we want here though, as we want to actually pass the list values to the "Item structure" input, one by one. The "List input" input of the for-each operation defines which of the inner operation's inputs is "the list", whereas regular arguments can be passed to the inner operation's other input slots. Obviously, there can be only one list input, and we specify it by assigning "Item structure" as the fixed value for the "List input" input. The for-each operation's original "Item list" input will then disappear, and the "Item structure" input is replaced by an "Item structure list" input. This is what you can connect to the "List of records" output - both are lists of text-value lists, so their types are compatible. And as you'll want to see what you accomplished, connect the "List of outputs" to a new pipeline output.
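In other words, with "Item structure" chosen as the list input, the for-each operation behaves roughly like this hypothetical PHP loop (reusing the made-up import_courses_inner() from the earlier sketch):

// Conceptual behaviour of the configured for-each operation (illustration only).
// $list_of_records is the list produced by the CSV parser operation.
$list_of_outputs = array();
foreach ($list_of_records as $key => $record) {
  // Each record is passed as the "Item structure" parameter of one inner
  // pipeline run; any other inner inputs keep their regular, fixed values.
  $list_of_outputs[$key] = import_courses_inner($record);
}
// The "List of outputs" output collects the results, one per record.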
Do the import!
Do it. Navigate to the "Execute" tab, enter the file path of the CSV file, and hit "Execute". The output you then see is the output of the "Save node" operation in the inner pipeline, once for each imported row/node. Have a look at the front page (or wherever you decided to list nodes of the given content type) and verify that the courses have indeed been imported.
In this HOWTO and the previous one, you've learned about pretty much all of the features that Transformations offers, and how to use them. You also stumbled across a few shortcomings regarding usability and type safety. Put those features to use if you like Transformations, and consider writing up a few recipes for other transformation pipelines if you created interesting and/or reusable ones. Consider contributing new operations and data input widgets if you're a Drupal developer, or consider tackling the above shortcomings if you're a rockstar legend.
Do correct this page if you find any errors or, more likely, incomprehensible wordings. Be bold, the documentation is a wiki! And no, I'm not done yet. More stuff to come - Transformations can also do CSV exports, as well as import/export of XML data (which is fun, actually), and whatnot. Go check it out!