Currently, Feeds uses ParserCSV defined in libraries/ParserCSV.inc which parses the CSV file itself, unnecessarily. It is a much better idea to use the native PHP function fgetcsv. It allows for configurable enclosure and escape characters as well as the delimiter (needed for issues like #1356718: The escape and enclosure characters should be configurable). According to the discussion over at #1014678: CSVParser can't parse TSV Files: double quotes still special, it's also faster.

Older versions of PHP may not cut it, but I don't think this is a huge problem. Most OS distributions are already using PHP 5.3.

Changelog:

  • 5.3.0: The escape parameter was added
  • 4.3.5: fgetcsv() is now binary safe
  • 4.3.0: The enclosure parameter was added

Comments

colan’s picture

I'll start by getting what I can from the D6 version over at #1014678: CSVParser can't parse TSV Files: double quotes still special.

mrharolda’s picture

Nice one, I'll be following this!

Feeds is becoming increasingly importent because almost all our clients have existing (non-drupal) websites already. I'll be migrating a lot of data in the future ;)

colan’s picture

Status: Active » Needs review
StatusFileSize
new6.79 KB

This took a lot longer than I thought because the original patch was so divergent from what's now in the development tree. Anyway, here's what I've got so far. I haven't even had a chance to test this myself yet, but I wanted to post something as soon as I finished coding to get some feedback.

We'll see how it runs tomorrow.

colan’s picture

Assigned: colan » Unassigned

Sorry folks. Client priorities are changing again.

colan’s picture

Title: FeedsCSVParser should use fgetcsv() » Don't roll own CSV parser; use PHP's native one

Leaving this with a more descriptive title so hopefully someone else can pick it up.

colan’s picture

Marking #1161628: Better CSV Parser as a duplicate. Yes, that one's older, but we need to go from D7 down anyway.

emackn’s picture

Status: Needs review » Needs work

I like this change. But I think it should happen in libraries/ParserCSV.inc instead of FeedsCSVParser.inc.

This doesn't break the existing tests and also allows for the existing tests to serve as a start point for more tests. YAY!

colan’s picture

Adding tag.

linuzboy’s picture

StatusFileSize
new3.79 KB

Yes, this is much more easier and applicable.

Just replace the line parser codes with the single line

$fields = str_getcsv($line, $this->delimiter, '','\\');
kthull’s picture

Forget where I found this on d.o, but just add this to the end of your settings.php file and you're good to go:

ini_set('auto_detect_line_endings', 1);
gaëlg’s picture

#9 did not work for me. Before, \" gave \ in my field, and after, it gives \" (no change). It should give ".

gaëlg’s picture

And #3 gave \

gaëlg’s picture

My problem comes from this PHP bug. So it works only if the escape character is ".
To get such a CSV file from PHPMyAdmin, you need to select "CSV for MS Excel" as file format. Then no need to patch the module, it just works.

twistor’s picture

Removing tag.

Release blockers are about broken/missing functionality. Trying to replace the most common parser right before a release is not a good idea. The current CSV parser has its issues, but it is relatively well tested. There are numerous known issues with the fgetcsv() in various versions of PHP, mostly stemming from the fact that there's no spec for CSV files.

I love the idea of replacing a large portion of our own code, with C. I'd just like to see it mature a bit.

gaele’s picture

Thank you GaëlG for the workaround (#13). You saved my day.

franz’s picture

Why is it that #9 doesn't set the enclosure when calling str_getcsv()? Why are there 2 backslashes instead of one for the escape argument? I think the whole idea is fine, but providing it can cause instability and some errors, we won't be able to fix. I think it's better to fix current parser.

http://ca.php.net/manual/en/function.str-getcsv.php

twistor’s picture

Issue summary: View changes
Status: Needs work » Closed (won't fix)

str_getcsv() has too many bugs. It's not worth it.

pdcarto’s picture

[Edit - whoops! I conflated this with https://www.drupal.org/node/2443721. Sorry!]
Changing this to "Closed (won't fix)" because the proposed solution is not acceptable ignore the reported issue, which is that there are problems parsing csv with non-enclosing double-quotes. If using str_getcsv is not an acceptable solution, then it seems to me the only other options are to either fix ParserCSV, or to say that the reported issue is not a actually problem.

mrharolda’s picture

My guess is that str_getcsv() contains less bugs than the current custom implementation.

Also I can't find what's wrong with str_getcsv() as I'm using it for years now. Might be coincidence too.

twistor’s picture