I'm importing to a custom content type that has a taxonomy field with a tagged vocabulary (I hope that makes sense). I want the taxonomy terms to be able to contain ','s (i.e. commas) and I don't want multiple values. So when I'm asked for the multiple value separator I try ':' (colon) or '@' (at), neither of which appears in the terms being imported. However, the import splits on commas anyway and classifies the nodes with multiple terms.

I'm guessing this might be related to some of the other taxonomy issues... but I thought I'd report it in case this problem is new.

Comments

Robrecht Jacques’s picture

Can you enter taxonomy terms/tags containing commas on the content type edit page (e.g. content/add/story)?

Martin.Schwenke’s picture

Fantastically spotted! Thanks...

So, the answer is: yes, provided I (double-)quote them. So, I can do that in my CSV file...

There is still a very minor problem. When I import "A, B" I lose the space after the comma. That doesn't happen if I add a quoted tag on the content type edit page.

Any ideas? ;-)

Thanks again...

Martin.Schwenke’s picture

Title: Tagged vocabulary values always split into multiple values with comma » Tagged vocabulary values lose spaces after commas

I've tried escaping everything I can think of but I can't find a work-around that lets me keep the spaces - they only disappear if they're immediately after a comma. They display just fine in all the sample data views, but from the import preview onward the spaces are gone.

Martin.Schwenke’s picture

OK, it is way too late at night and I was changing the separator for the wrong field to '@' and leaving this one as a ','. So, I can now work around this...

The real problem is that the explode/trim in node_import_values() is executed even if $value is protected/delimited by double quotes:

          if ($map_count == 1 && strlen($mseparator) > 0) {
            $fieldvalues = strlen($value) > 0 ? array_map('trim', explode($mseparator, $value)) : array();
            break;
          }

The condition could also check that the 1st and last characters of $value aren't both '"' (i.e. a double-quote). I'm happy to provide a patch... but I'm happy to take advice on which of PHP's many pattern matching functions you prefer in your code... :-)
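As a rough illustration of that proposed check (a sketch only, not a patch against node_import; the function name split_unless_quoted is made up here), the split could be skipped when the whole value is wrapped in double quotes:

```php
<?php
// Hypothetical helper, not part of node_import: split $value on $mseparator
// unless the whole value is wrapped in double quotes.
function split_unless_quoted($value, $mseparator) {
  if (strlen($value) == 0) {
    return array();
  }
  // A value such as "A, B" (first and last characters are both '"') is
  // treated as one protected value and passed through untouched.
  $quoted = strlen($value) >= 2 && $value[0] == '"' && substr($value, -1) == '"';
  if ($quoted || strlen($mseparator) == 0) {
    return array($value);
  }
  // Otherwise behave like the existing explode/trim.
  return array_map('trim', explode($mseparator, $value));
}
```

Note that a value like "A","B" also starts and ends with a double quote, so a test this simple would treat it as a single value.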

Robrecht Jacques’s picture

OK, so quoting the term works on the edit page... this means that probably node_import should just quote the tags it provides.

The default multiple values separator is "||" except for free-tagging vocabularies (where it is ","). Would quoting work? Without testing: suppose you provide

    this tag has, a comma || this tag doesn't have one

*and* specify that the multiple separator is "||". At first node_import would get the right terms:

    this tag has, a comma
    this tag doesn't have one

But when node_import submits the value it will translate this to

    this tag has, a comma, this tag doesn't have one

The reason node_import does this is that a tag vocabulary expects a comma. So that is wrong. The solution is to submit

    "this tag has, a comma","this tag doesn't have one"

This is a bug and needs to be fixed.
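A minimal sketch of that "quote the tags before submitting" idea follows. The helper name implode_tags_sketch and the rule of doubling embedded double quotes are assumptions for illustration, not existing node_import code:

```php
<?php
// Hypothetical helper, not part of node_import: build the comma-separated
// string a free-tagging field expects, quoting the terms that need it.
function implode_tags_sketch(array $terms) {
  $encoded = array();
  foreach ($terms as $term) {
    // Terms containing a comma or a double quote are wrapped in double
    // quotes; embedded double quotes are doubled (an assumed escaping rule).
    if (strpos($term, ',') !== FALSE || strpos($term, '"') !== FALSE) {
      $term = '"' . str_replace('"', '""', $term) . '"';
    }
    $encoded[] = $term;
  }
  return implode(', ', $encoded);
}

// First split on the user-chosen separator, then re-join for the tag field.
$terms = array_map('trim', explode('||', "this tag has, a comma || this tag doesn't have one"));
echo implode_tags_sketch($terms), "\n";
// prints: "this tag has, a comma", this tag doesn't have one
```

This variant quotes only the terms that contain a comma or a quote; quoting every term would be an equally plausible choice.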

Another bug you apparently spotted is that somehow "A, B" value is translated to "A,B". Need to think/investigate that one a bit more.

This is unrelated to the other taxonomy bug reports, so keeping this open. Interesting exception case to keep in the SimpleTests I'm writing now...

Seems you are not from Holland/Germany if posting this was late at night even if the name is a hint towards those countries. (I'm from Belgium myself)

Robrecht Jacques’s picture

OK, maybe a reaction to:

The real problem is that the explode/trim in node_import_values() is executed even if $value is protected/delimited by double quotes:

If you provide

    this tag has, a comma || this tag doesn't have one

and the multiple separator is "||", we end up with two values:

    this tag has, a comma
    this tag doesn't have one

The fix for the bug you're seeing I've explained above: just make sure the values are quoted when you submit them (as the form element expects).

If you provide

    this tag has, a comma , this tag doesn't have one

and the multiple separator is "," (which is the default for tags), we end up with three values:

    this tag has
    a comma
    this tag doesn't have one

You propose to accept something like

    "this tag has, a comma","this tag doesn't have one"

and have it parse as two values:

    this tag has, a comma
    this tag doesn't have one

This would mean:

  • either we allow to disable parsing multiple values at all and allow users to submit the value as is in the file (some extra option),
  • or somehow the multiple separator needs to be escapable or text-delimited (just like CSV itself) ... (adds some more extra options),

The bug itself (as said before) is easily fixable... the solution in this comment (both of them, although I'd prefer the first one), would need some more work.
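To make the second option concrete, here is a sketch that treats the field like a one-line CSV record, using PHP's str_getcsv(). The function name split_delimited is hypothetical, and this is not how node_import currently behaves:

```php
<?php
// Sketch of the "text-delimited" option: a separator inside double quotes
// does not split the value. split_delimited() is a hypothetical name.
function split_delimited($value, $mseparator = ',') {
  if (strlen($value) == 0) {
    return array();
  }
  // str_getcsv() only accepts a single-character separator, so this does
  // not cover a separator such as "||".
  return array_map('trim', str_getcsv($value, $mseparator, '"', '\\'));
}

print_r(split_delimited('"this tag has, a comma",this tag doesn\'t have one'));
```

The quotes are consumed by the parser here, so they would have to be added again before the value is handed to a tag field.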

Robrecht Jacques’s picture

One additional comment: currently if you submit

    "this tag has, a comma", this tag doesn't have one

you'd also end up with three values:

    "this tag has
    a comma"
    this tag doesn't have one

It was for this case that the two options were formulated.

The bug I'd like to fix in -rc5 is that if you submit

    this tag has, a comma || this tag doesn't have one

you should end up with the correct values.

Martin.Schwenke’s picture

No, you won't get 3 values. You'll get 2 values and you will just lose the space after the comma in the first value. ;-)

So, for example, if $mseparator is set to ',' and you submit

    "this tag has, a comma", this tag doesn't have one

then you get 3 intermediate values:

    "this tag has
    a comma"
    this tag doesn't have one

However, when you then submit for preview or import, the first 2 values are combined into

    "this tag has,a comma"

(because, somehow, a (real or implied) comma gets inserted) and you retain the other value, this tag doesn't have one. So, the quotes still work even though node_import thinks the first 2 values are separate!

I don't understand enough about the actual import process to understand why this happens... and that part of the code is abstract enough that it doesn't help me much... :-) I can't figure out where 'create' methods are set and also can't see a relevant call to implode().

I think this will be really hard to fix "properly" (i.e. without introducing special cases that break other things) and I think the best way of fixing it is to document it:

Tags containing commas need to be quoted just like in the content type edit page. However, the quotes aren't recognised by node_import but are just passed through to Drupal. Therefore, you should not use comma as your multi-value separator for such fields - use a value that is not contained in any terms.

I'm guessing that people who are using node_import have enough of a clue to understand this. I used node_import for the 1st time last night and, after a bit of hacking, I managed to work out what was happening.

By the way, node_import is awesome. I really didn't believe it would work as well as it does because it is solving such a hard problem. Thanks for your work on it.

Oh, and I'm in Australia... but my parents came here from Germany... and I hope things are good in Belgium! :-)

peace & happiness,
martin

Robrecht Jacques’s picture

Aha... that's why you lose the space... because the 3 values are trimmed and then combined again, and then taxonomy interprets the result as two values (if you have put the quotes) :-)

    "A, B",C  →  ["A] [B"] [C]  →  "A,B",C

(square brackets mark the individual values after splitting; the last step happens because node_import adds the "," as tags expect it).

    A, B||C  →  [A, B] [C]  →  "A, B",C

should be fixed and documented.
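The loss of the space can be reproduced with plain PHP string functions. This is only a sketch of the round trip described above; the re-joining step with a bare comma is an assumption about what node_import does when it submits the value:

```php
<?php
$value = '"A, B",C';
$mseparator = ',';

// Step 1: the split-and-trim from node_import_values() yields three pieces.
$fieldvalues = array_map('trim', explode($mseparator, $value));
// $fieldvalues is now array('"A', 'B"', 'C'): the space before B is gone.

// Step 2 (assumed): the pieces are joined with a bare comma for the tag field.
$submitted = implode(',', $fieldvalues);

echo $submitted, "\n"; // prints "A,B",C
```

The quotes survive the round trip, which is why taxonomy still sees two tags, but the trimmed space is never restored.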

Hmm... what about A " B as a value? ... need to test that :-)

Martin.Schwenke’s picture

Do you actually know when you're importing into a tag-based vocabulary? I can't see anything in the code that recognises that... but I might not be looking in the right places. So if you add quotes do you potentially break non-tag cases?

"A, B"||C (or even "A, B"||"C") works fine so I think that requiring the quotes in the input is fine. I wouldn't try to make node_import cleverer than the content type edit form.

A " B in the content type edit page causes just a value of A to be entered... so even unexpected things happen in the content type edit form. :-)
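One way to see why a stray quote could cut a tag short is a sketch of a comma-and-quote tag parser. The function name explode_tags_sketch and the regular expression are assumptions for illustration; whether Drupal's own parser works exactly like this is not confirmed in this thread:

```php
<?php
// Hypothetical parser for comma-separated tags with optional double quotes.
// The pattern is an assumption for illustration, not a quotation of Drupal.
function explode_tags_sketch($tags) {
  // Each tag starts at the beginning of the string or after a comma and is
  // either a double-quoted run (with "" as an escaped quote) or a run of
  // characters containing neither a comma nor a double quote.
  $regexp = '%(?:^|,\ *)("(?>[^"]*)(?>""[^"]*)*"|(?:[^",]*))%x';
  preg_match_all($regexp, $tags, $matches);
  $result = array();
  foreach (array_unique($matches[1]) as $tag) {
    // Strip the surrounding quotes and undo the "" escaping.
    $tag = trim(str_replace('""', '"', preg_replace('/^"(.*)"$/', '\1', $tag)));
    if ($tag != '') {
      $result[] = $tag;
    }
  }
  return $result;
}

print_r(explode_tags_sketch('"A, B",C'));
print_r(explode_tags_sketch('A " B'));
```

With a parser of this shape, "A, B",C gives the two tags A, B and C, while A " B gives only A: the unquoted run stops at the stray quote, and nothing after it begins with a comma.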

Martin.Schwenke’s picture

Just one more comment... :-)

If you're going to quote the values before handing them to Drupal then you need to check that they're not already quoted in the input. So then you're stuck with trying to work out what the input means:

  • Is there one double quote at the beginning and end of the input?
  • Are there 2 double quotes at beginning and end? If 2, then is double quote the escape character and do they mean a literal quote?
  • ...

That's why I now think the current code is fine and the import fields really just want to allow whatever the content type input form would allow... along with a note somewhere explaining this.