Added support of encoding conversions to the CSV Parser [#1428272]

Comment	File	Size	Author
#112	interdiff-1428272-110-112.txt	1.04 KB	MegaChriz
#112	feeds-encoding-support-csv-1428272-112.patch	16.5 KB	MegaChriz
7.x-2.x: PHP 5.3 & MySQL 5.5, D7 302 pass
#110	interdiff-1428272-108-110.txt	2.35 KB	MegaChriz
#110	feeds-encoding-support-csv-1428272-110.patch	16.47 KB	MegaChriz
7.x-2.x: PHP 5.3 & MySQL 5.5, D7 300 pass, 2 fail
#108	feeds-encoding_support_CSV-1428272-108.patch	15.89 KB	MegaChriz
7.x-2.x: PHP 5.3 & MySQL 5.5, D7 302 pass
#107	interdiff-1428272-105-107.txt	2.15 KB	MegaChriz
#107	feeds-encoding_support_CSV-1428272-107.patch	15.94 KB	MegaChriz

#105	interdiff-1428272-101-105.txt	10.48 KB	MegaChriz
#105	feeds-encoding_support_CSV-1428272-105.patch	14.64 KB	MegaChriz

#101	feeds-encoding_support_CSV-1428272-101.patch	9.04 KB	jtsnow

#97	feeds-encoding_support_CSV-1428272-97.patch	5.16 KB	Niremizov

#82	feeds-encoding_support_CSV-1428272-82.patch	5.21 KB	acouch

#52	feeds-encoding_support_CSV-1428272-52.patch	5.07 KB	13rac1

#50	feeds-encoding_support_CSV-1428272-50.patch	6.43 KB	13rac1

#45	feeds-patch.png	18.07 KB	msti
#41	feeds-encoding_support_CSV-1428272-41.patch	6.89 KB	liquidcms

#37	n4.txt	14 bytes	spuki
#35	feeds-encoding_support_CSV-1428272-35.patch	6.89 KB	Jerenus

#34	feeds-encoding_support_CSV-1428272-34.patch	9.42 KB	Jerenus

#30	price_win1251.csv_.txt	143.25 KB	Stan Turyn
#27	feeds-add-encoding-support-1428272-0.patch	6.94 KB	OnkelTem

#13	adding_encoding_conversion_support-1428272-12.patch	10.19 KB	derhasi

#6	adding_encoding_conversion_support_2.patch	9.83 KB	OnkelTem

#3	adding_encoding_conversion_support.patch	5.4 KB	OnkelTem

#1	Edit importer: Товары из бухгалтерии \| ФОР-Дубна.png	86.26 KB	OnkelTem
	adding_encoding_conversion_support.patch	6.26 KB	OnkelTem

Comment #0.0

OnkelTem CreditAttribution: OnkelTem commented 4 February 2012 at 13:50

Issue summary:

asd

Comment #0.1

OnkelTem CreditAttribution: OnkelTem commented 4 February 2012 at 13:50

Issue summary:

asd

Comment #0.2

OnkelTem CreditAttribution: OnkelTem commented 4 February 2012 at 13:51

Issue summary:

asd

Comment #0.3

OnkelTem CreditAttribution: OnkelTem commented 4 February 2012 at 16:32

Issue summary:

asd

File	Size
Edit importer: Товары из бухгалтерии \| ФОР-Дубна.png	86.26 KB

Comment #1.0

OnkelTem CreditAttribution: OnkelTem commented 4 February 2012 at 16:36

Issue summary:

asd

Comment #2

dmstru CreditAttribution: dmstru commented 5 February 2012 at 07:05

Hi!

Cool feature. I need this, so I will test it.

Guys, you are awesome!
Well done, nicely done, you clearly know your stuff!

Off to test it.

Regards, Alexey


Comment #2.0

OnkelTem CreditAttribution: OnkelTem commented 5 February 2012 at 07:58

Issue summary:

asd

Comment #2.1

OnkelTem CreditAttribution: OnkelTem commented 5 February 2012 at 08:03

Issue summary:

Note

Comment #3

OnkelTem CreditAttribution: OnkelTem commented 5 February 2012 at 12:35

File	Size
adding_encoding_conversion_support.patch	5.4 KB

Supplying a patch for the 7.x-2.x git version.
Since the commerce_feeds dev branch hasn't been updated yet to accommodate the major changes in the Feeds 2.x git branch (I get the error message: Class FeedsCommerceProductProcessor contains 2 abstract methods .... ), I had no chance to test the patch.


Comment #3.0

OnkelTem CreditAttribution: OnkelTem commented 5 February 2012 at 12:35

Issue summary:

asd

Comment #4

dmstru CreditAttribution: dmstru commented 5 February 2012 at 08:56

Version:	7.x-2.0-alpha4	» 7.x-2.x-dev
Status:	Active	» Needs review

Hi, your patch does not apply:

:41: trailing whitespace.

:43: trailing whitespace.

:121: trailing whitespace.
$defaults = $this->configDefaults();
:125: trailing whitespace.
'#collapsible' => TRUE,
error: patch failed: plugins/FeedsCSVParser.inc:17
error: plugins/FeedsCSVParser.inc: patch does not apply

I tried to apply it to the latest dev version.

ALEX


Comment #5

dmstru CreditAttribution: dmstru commented 5 February 2012 at 09:18

Please contact me via mail or Skype (awa_77).


Comment #6

OnkelTem CreditAttribution: OnkelTem commented 5 February 2012 at 23:11

Version:

7.x-2.x-dev

» 7.x-2.0-alpha4

File	Size
adding_encoding_conversion_support_2.patch	9.83 KB

Fixed patch for 7.x-2.0-alpha4

* Fixed errors in the previous (#0) patch: added the missing encoding conversion for cells that span more than one line.

* REWORKED the parser. This is a serious change. I believe the original CSV parser did things the wrong way, treating an arbitrary double quote as a delimiter. IMO, this violates the CSV rules, which state that a field must be [fully] double-quoted when it contains anything unusual. For example:

;A string with a "double quote in it;

should definitely break the parsing process. The correct variant would be:

;"A string with a ""double quote in it";

But the current implementation accepts the former and keeps appending subsequent lines to the field's value until the next double quote.
With my patch, the parser throws an exception with a corresponding message and stops parsing.
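
The quoting rule described above can be sketched as a small check. This is only an illustration: `example_csv_field_is_valid()` is a hypothetical helper, not part of Feeds or of the patch, and it inspects a single raw field rather than a whole line.

```php
<?php
/**
 * Hypothetical helper: checks one raw CSV field against the quoting rule
 * (a field containing a double quote must be fully quoted, and every
 * inner double quote must be doubled).
 */
function example_csv_field_is_valid($raw) {
  // Unquoted field: must not contain any double quote.
  if ($raw === '' || $raw[0] !== '"') {
    return strpos($raw, '"') === FALSE;
  }
  // Quoted field: needs a closing quote as its last character.
  if (strlen($raw) < 2 || substr($raw, -1) !== '"') {
    return FALSE;
  }
  // Drop the doubled quotes; any quote left over is unescaped.
  $inner = (string) substr($raw, 1, -1);
  return strpos(str_replace('""', '', $inner), '"') === FALSE;
}

// The two variants from the comment above.
var_dump(example_csv_field_is_valid('A string with a "double quote in it'));
var_dump(example_csv_field_is_valid('"A string with a ""double quote in it"'));
```

The first call reports the field as invalid and the second as valid, which matches the behaviour argued for above: reject the stray quote instead of silently consuming the following lines.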

P.S. Moving the issue back to alpha4. This is odd, I know, but I got lost in the versions.


Comment #7

polom CreditAttribution: polom commented 6 February 2012 at 14:24

Hi,

I have import problems related to encodings (my CSV files are ISO Latin-1 encoded), and since re-encoding them is not my preferred option, I looked for other solutions.
This patch looks promising, but I can't apply it. I have Feeds 7.x-2.0-alpha4, but here is what I get:

$ patch < adding_encoding_conversion_support_2.patch
patching file FeedsSource.inc
Hunk #1 FAILED at 343.
1 out of 1 hunk FAILED -- saving rejects to file FeedsSource.inc.rej
patching file ParserCSV.inc
Hunk #1 FAILED at 74.
Hunk #2 FAILED at 92.
Hunk #3 FAILED at 194.
Hunk #4 FAILED at 232.
Hunk #5 FAILED at 258.
Hunk #6 FAILED at 320.
6 out of 6 hunks FAILED -- saving rejects to file ParserCSV.inc.rej
patching file FeedsCSVParser.inc
Hunk #1 FAILED at 17.
Hunk #2 FAILED at 101.
Hunk #3 FAILED at 145.
Hunk #4 FAILED at 154.
Hunk #5 FAILED at 180.
5 out of 5 hunks FAILED -- saving rejects to file FeedsCSVParser.inc.rej

Am I missing something here?


Comment #8

OnkelTem CreditAttribution: OnkelTem commented 6 February 2012 at 14:50

Does patch -p1 < patchfile work for you?


Comment #9

emackn CreditAttribution: emackn commented 6 February 2012 at 15:25

Status:

Needs review

» Needs work


Comment #10

polom CreditAttribution: polom commented 7 February 2012 at 12:10

yes it does :

$ patch -p1 < adding_encoding_conversion_support_2.patch
patching file includes/FeedsSource.inc
patching file libraries/ParserCSV.inc
patching file plugins/FeedsCSVParser.inc

Unfortunately my iso-latin encoded CSV file could not be imported and I had to convert it.


Comment #11

OnkelTem CreditAttribution: OnkelTem commented 7 February 2012 at 12:16

@polom

Would you send me an example file?
aneganov at gmail d0t com


Comment #12

derhasi CreditAttribution: derhasi commented 8 February 2012 at 11:06

Status:

Needs work

» Needs review

The patch works for me.

Cleaned it up to match the coding standards and renamed it so that the issue number is referenced (@see).

@polom, did you set the "Source file encoding" in the "Settings for CSV Parser"?


Comment #13

derhasi CreditAttribution: derhasi commented 8 February 2012 at 11:07

File	Size
adding_encoding_conversion_support-1428272-12.patch	10.19 KB

*arg*, sorry forgot to attach the patch


Comment #14

emackn CreditAttribution: emackn commented 8 February 2012 at 18:39

Version:	7.x-2.0-alpha4	» 7.x-2.x-dev
Status:	Needs review	» Needs work

do you have a test for this?


Comment #15

derhasi CreditAttribution: derhasi commented 28 May 2012 at 11:18

OnkelTem, could you provide a test for that?


Comment #16

OnkelTem CreditAttribution: OnkelTem commented 29 May 2012 at 19:48

@derhasi, I would gladly make one if I knew how to do that :) Really, I have never created unit tests before.


Comment #17

Yuri CreditAttribution: Yuri commented 6 July 2012 at 17:27

There is a patch in the issue summary and another in #13.
Which one should I use? I applied #13 and ran update.php, but nothing has changed so far. I still get the error
SQLSTATE[HY000]: General error: 1366 Incorrect string value


Comment #18

OnkelTem CreditAttribution: OnkelTem commented 6 July 2012 at 17:35

Please provide a copy of an example data file that produces the error.


Comment #19

Yuri CreditAttribution: Yuri commented 6 July 2012 at 18:06

Category:

feature

» bug

In my case, I use the Feeds, Mailhandler and Mail Comment modules (latest dev versions on D7).
The following error messages appear on http://www.exampledomain.com/import/mailhandler_comments

Mailbox mail@exampledomain.com was checked and contained 1 messages.
Warning message SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xA0
<...' for column 'comment_body_value' at row 1
Error message SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xA0
<...' for column 'message' at row 1

The drupal error log shows:

PDOException: in field_sql_storage_field_storage_write() (line 448 of /home/gezond/public_html/modules/field/modules/field_sql_storage/field_sql_storage.module).

and the feeds log shows:

SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xA0asdf ...' for column 'message' at row 1


Comment #20

Yuri CreditAttribution: Yuri commented 6 July 2012 at 18:11

The imported data is an ordinary Gmail message like 'hello' (I have set UTF-8 for outgoing messages, according to http://support.google.com/mail/bin/answer.py?hl=en&answer=22841).


Comment #21

OnkelTem CreditAttribution: OnkelTem commented 6 July 2012 at 18:50

Do your mails come in CSV format?


Comment #22

OnkelTem CreditAttribution: OnkelTem commented 27 July 2012 at 10:07

Status:

Needs work

» Needs review


Comment #23

xaqrox CreditAttribution: xaqrox commented 2 August 2012 at 15:41

Seems to work really well for me. Don't know the test framework very well either or I'd be right on it. I found a bunch of other issues which I believe this solves, which I closed as duplicate. Hope that's not too presumptuous.
#1605628: Encoding problems while importing from CSV
#1471950: CSV Parser: Check and Convert the Encoding
#1220606: Add support for encoding conversions for any parser
#1319142: csv in ISO 8859-15/EURO


Comment #24

twistor CreditAttribution: twistor commented 8 August 2012 at 05:38

Category:

bug

» feature


Comment #25

OnkelTem CreditAttribution: OnkelTem commented 8 August 2012 at 07:12

@twistor
We probably need to split this issue, since #6 reports a serious bug in the parser.
What do you think?


Comment #26

twistor CreditAttribution: twistor commented 8 August 2012 at 19:19

Yes, these are very separate issues.


Comment #27

OnkelTem CreditAttribution: OnkelTem commented 9 August 2012 at 12:27

File	Size
feeds-add-encoding-support-1428272-0.patch	6.94 KB

Separating encoding support from the rest.


Comment #28

OnkelTem CreditAttribution: OnkelTem commented 9 August 2012 at 12:41

Moving two more changes into separate issues:
#1720658: Empty parser's results fails import
#1720724: Fix DQUOTES handling according to RFC 4180


Comment #29

twistor CreditAttribution: twistor commented 6 October 2012 at 19:07

Status:

Needs review

» Needs work

+++ b/libraries/ParserCSV.inc
@@ -324,4 +342,37 @@ class ParserCSV {
+      if (function_exists('mb_convert_encoding')) {

This should use extension_loaded(). Move this check to the top of the function, don't check twice.

+++ b/plugins/FeedsCSVParser.inc
@@ -185,6 +191,40 @@ class FeedsCSVParser extends FeedsParser {
+    $form['encoding'] = array(

This whole fieldset should be conditional on the mb library.

+++ b/plugins/FeedsCSVParser.inc
@@ -185,6 +191,40 @@ class FeedsCSVParser extends FeedsParser {
+    if (function_exists('mb_list_encodings')) {

Should use extension_loaded().

Could you please provide an example CSV file that needs this functionality?
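
A minimal sketch of the review points above, under stated assumptions: `example_fix_encoding()` and `example_encoding_options()` are hypothetical stand-ins, not the functions from the patch. The point is only the shape: one `extension_loaded('mbstring')` guard at the top of each function, and an empty option list (so the form element can be skipped) when the extension is missing.

```php
<?php
/**
 * Hypothetical stand-in for the conversion helper under review.
 */
function example_fix_encoding($data, $from_encoding, $to_encoding = 'UTF-8') {
  // Single guard at the top: without mbstring, return the data untouched.
  if (!extension_loaded('mbstring')) {
    return $data;
  }
  // Nothing to do when source and target encodings are the same.
  if (strcasecmp($from_encoding, $to_encoding) === 0) {
    return $data;
  }
  return mb_convert_encoding($data, $to_encoding, $from_encoding);
}

/**
 * Hypothetical option builder: empty when mbstring is unavailable, so the
 * caller can leave the whole encoding fieldset out of the form.
 */
function example_encoding_options() {
  if (!extension_loaded('mbstring')) {
    return array();
  }
  $names = mb_list_encodings();
  return array_combine($names, $names);
}

// "\xE9" is "é" in ISO-8859-1; print the converted bytes as hex.
echo bin2hex(example_fix_encoding("caf\xE9", 'ISO-8859-1')), "\n";
echo count(example_encoding_options()), " encodings available\n";
```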


Comment #30

Stan Turyn CreditAttribution: Stan Turyn commented 27 November 2012 at 12:06

File	Size
price_win1251.csv_.txt	143.25 KB

Hi twistor,

I'm attaching a sample CSV file that requires conversion. Any chance of seeing the patch committed to dev?


Comment #31

mtoscano CreditAttribution: mtoscano commented 7 January 2013 at 14:58

Hi,
I need encoding conversions from Spanish characters: which one is the patch to use against the current alpha7 release?
Thanks


Comment #32

Dubs CreditAttribution: Dubs commented 24 January 2013 at 17:26

Thanks so much for your time on this essential module!

Please can this be committed - this would be so useful as I feel pain at the moment every time I have to import Euro characters!


Comment #33

imclean CreditAttribution: imclean commented 25 January 2013 at 03:41

To the last 3 posters, see #29. The patch still needs work before it can be committed.


Comment #34

Jerenus CreditAttribution: Jerenus commented 24 February 2013 at 14:51

File	Size
feeds-encoding_support_CSV-1428272-34.patch	9.42 KB

Here is a patch based on all the previous ones, with some modifications to make it work.


Comment #35

Jerenus CreditAttribution: Jerenus commented 24 February 2013 at 15:54

File	Size
feeds-encoding_support_CSV-1428272-35.patch	6.89 KB

The final one.


Comment #36

pcambra CreditAttribution: pcambra commented 25 February 2013 at 09:24

Status:

Needs work

» Needs review


Comment #37

spuki CreditAttribution: spuki commented 28 February 2013 at 15:56

File	Size
n4.txt	14 bytes

It's not working for me. The file encoding is Windows-1251, but it was detected as CP936, so I just commented out the lines that detect the encoding.

      //$encode_array = array('ASCII', 'UTF-8', 'GBK', 'GB2312', 'BIG5');
      //$this->encoding = mb_detect_encoding($data, $encode_array);
			
      // Convert encoding if needed
      if ($this->from_encoding != $this->to_encoding) {
          $data = mb_convert_encoding($data, $this->to_encoding, $this->from_encoding);
      }
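
Why detection can go wrong here, as a small sketch: the same bytes are legal in several encodings, so only an explicitly stated source encoding is unambiguous. The sample bytes are chosen for this illustration and are not taken from the attached file.

```php
<?php
// "Привет" encoded as Windows-1251 (sample bytes chosen for this sketch).
$bytes = "\xCF\xF0\xE8\xE2\xE5\xF2";

// The bytes are not valid UTF-8, so they cannot be stored as they are.
var_dump(mb_check_encoding($bytes, 'UTF-8'));

// With the source encoding stated explicitly, the conversion is unambiguous.
$utf8 = mb_convert_encoding($bytes, 'UTF-8', 'Windows-1251');
var_dump($utf8 === "Привет");

// The same bytes also decode without error as ISO-8859-1, just to different
// text, which is why guessing the encoding from the bytes alone is fragile.
$wrong = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
var_dump($wrong === $utf8);
```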


Comment #38

bjcone CreditAttribution: bjcone commented 15 March 2013 at 15:27

I downloaded the patch and it seems to be working to check the encoding of individual lines. However, I ran into a case where the file I was attempting to import was encoded such that fgets() did not recognize the end of line character and returned the full contents of the file. At this point it is too late for this patch to help.

What I found to resolve this issue and enable feeds to import my ASCII-encoded file was ini_set("auto_detect_line_endings", true). See the bottom of the page at http://www.php.net/manual/en/filesystem.configuration.php#ini.auto-detec... for more information. It seems as though PHP, without this setting, does not separate lines which only end in \r (Mac Carriage Return character).

Hope this is useful to someone else as it took me a while to find.
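
As an alternative that does not rely on an ini setting, the line endings can be normalised before the data is split into lines. This is a sketch on in-memory sample data, not what Feeds does; it assumes no quoted field contains an embedded line break, since such fields would be split apart by this approach.

```php
<?php
// Sample data using bare "\r" (classic Mac) line endings.
$raw = "id,title\r1,First\r2,Second\r";

// Normalise "\r\n" first, then any remaining lone "\r", to "\n".
$normalised = str_replace(array("\r\n", "\r"), "\n", $raw);
$lines = explode("\n", rtrim($normalised, "\n"));

$rows = array();
foreach ($lines as $line) {
  // Delimiter, enclosure and escape are passed explicitly.
  $rows[] = str_getcsv($line, ',', '"', '\\');
}
print_r($rows);
```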


Comment #39

liquidcms CreditAttribution: liquidcms commented 20 March 2013 at 19:39

i have issue similar to #37. i saved a file from Excel (2007) as .csv. the file has 5 lines, 3 of which have odd non-ASCII dashes. when hitting the detect encoding line in the patch; as spuki states, those 3 lines are detected as CP936.

NOTE: commenting out those lines as described in #37 doesn't change anything; mb_detect_encoding doesn't modify the data, it simply determines the encoding. with or without the array of encoding types listed, it still determines these lines are CP936, and the part of the code which then does the conversion still runs (that check is simply there to make sure we don't re-encode data that is already UTF-8).

so the only issue (and maybe not fixable) is that to convert those dashes in CP936 to UTF-8 the dash is simply removed - this is not a great solution; but without the patch and the encoding conversion Feeds import fails on these characters.

ALSO - as noted in #1140194: SQLSTATE[HY000]: General error: 1366 Incorrect string value for a field with accents, if i open the Excel-saved CSV in Notepad and then re-save it as UTF-8, the dashes import correctly (with or without this patch; although i agree it would be nice if Feeds could handle this).


Comment #40

liquidcms CreditAttribution: liquidcms commented 20 March 2013 at 20:35

turns out losing characters when using PHP to convert from CP936 to UTF-8 is a bug in PHP 5.3 and has been fixed in PHP 5.4.0 (https://bugs.php.net/bug.php?id=60306)


Comment #41

liquidcms CreditAttribution: liquidcms commented 20 March 2013 at 20:42

File	Size
feeds-encoding_support_CSV-1428272-41.patch	6.89 KB

btw, patch in #35 does not adhere to drupal coding standards. i think the attached is somewhat closer


Comment #42

20 March 2013 at 20:45

Status:

Needs review

» Needs work

The last submitted patch, feeds-encoding_support_CSV-1428272-41.patch, failed testing.


Comment #43

Jerenus CreditAttribution: Jerenus commented 1 April 2013 at 03:47

Status:

Needs work

» Needs review


Comment #44

msti CreditAttribution: msti commented 1 April 2013 at 10:17

#35 works for me

Thanks!


Comment #45

msti CreditAttribution: msti commented 1 April 2013 at 10:19

File	Size
feeds-patch.png	18.07 KB

screenshot


Comment #46

Summit CreditAttribution: Summit commented 7 April 2013 at 16:10

Status:

Needs review

» Reviewed & tested by the community

Hi, for me too! Setting this to RTBC ok?
Greetings, Martijn


Comment #47

meSte CreditAttribution: meSte commented 10 April 2013 at 13:59

Just to be clear: is #35 the patch that will be committed?


Comment #48

twistor CreditAttribution: twistor commented 16 April 2013 at 06:42

Status:

Reviewed & tested by the community

» Needs work

As was already pointed out, the patch in #35 breaks Drupal coding standards.

Honestly the to, from, check encoding business seems too complicated. Feeds is complicated enough. Is there a valid reason why we can't just automatically detect the encoding, and use that without any user interaction?


Comment #49

Honza Pobořil CreditAttribution: Honza Pobořil commented 24 April 2013 at 10:39

twistor: Because automatic detection is sometimes unreliable. Auto detection is nice to have, but it should not be the only option.

Btw, if you also think Feeds is too complicated, see this project.


Comment #50

13rac1 CreditAttribution: 13rac1 commented 26 April 2013 at 00:51

Status:

Needs work

» Needs review

File	Size
feeds-encoding_support_CSV-1428272-50.patch	6.43 KB

Here is #35 plus the coding standard changes from #41. Should pass tests. Works great for me to import international characters. Thanks everyone.


Comment #50.0

13rac1 CreditAttribution: 13rac1 commented 26 April 2013 at 00:51

Issue summary:

asd

Comment #51

Jerenus CreditAttribution: Jerenus commented 26 April 2013 at 04:11

Have we reached a conclusion on whether user interaction is required?


Comment #52

13rac1 CreditAttribution: 13rac1 commented 27 April 2013 at 01:03

File	Size
feeds-encoding_support_CSV-1428272-52.patch	5.07 KB

I've refactored this a bit:

* UTF-8 conversion is forced to stop the "SQLSTATE[HY000]: General error: 1366 Incorrect string value" error.
* Removed the optional "Check encoding" checkbox.
* Made the encoding convert function fail if the encoding detection fails.
* Removed the encoding check function (it seems extraneous?).
* Removed the exception if fixEncoding() cannot locate mbstring. I left the notice on the import form, maybe should be in install requirements?
* Made $detected_encoding a local variable.
* Added additional comments.
* Reduced the overall amount of code changes.

It needs simpletests and possibly more refactoring to fit better within the current code architecture.
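
The "force UTF-8, fail when detection fails" behaviour listed above could look roughly like the sketch below. It is not the code from the patch: `example_convert_to_utf8()` and its candidate list are assumptions made for illustration.

```php
<?php
/**
 * Hypothetical helper: detect among the given candidates, then force UTF-8.
 */
function example_convert_to_utf8($data, array $candidates) {
  // Strict mode returns FALSE when the bytes are invalid in every candidate.
  $detected = mb_detect_encoding($data, $candidates, TRUE);
  if ($detected === FALSE) {
    throw new RuntimeException('Source encoding could not be detected.');
  }
  return mb_convert_encoding($data, 'UTF-8', $detected);
}

// "\xE9" is not valid UTF-8 but is valid ISO-8859-1, so the latter is used.
echo bin2hex(example_convert_to_utf8("caf\xE9", array('UTF-8', 'ISO-8859-1'))), "\n";

// With UTF-8 as the only candidate, detection fails and the helper throws.
try {
  example_convert_to_utf8("caf\xE9", array('UTF-8'));
}
catch (RuntimeException $e) {
  echo $e->getMessage(), "\n";
}
```

Failing loudly keeps invalid bytes from reaching the database, at the cost of aborting the import when the candidate list is too narrow.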


Comment #53

johan2 CreditAttribution: johan2 commented 5 May 2013 at 16:30

Hi, I installed the latest patch (#52). I have a CSV with terms in a column containing a special character (ë). When I import the CSV (UTF-8), the term is not recognized and the term select field stays empty. I have been struggling with this for a week. The only thing that works is to create a fake term without the special character and then either merge it with the intended taxonomy term that has the special character, or rename the term once it is imported.


Comment #54

13rac1 CreditAttribution: 13rac1 commented 5 May 2013 at 18:46

@johan2 Have you tried any other encoding options? Windows‑1252 worked for my imports.


Comment #55

johan2 CreditAttribution: johan2 commented 5 May 2013 at 19:36

I am working on a Power Mac with Excel 2001. The best option I have for saving from Excel is "Windows CSV". I checked the file with Smultron and TextMate, and it shows no problems. Text imports fine into a text field, but when it has to go into a term field with special characters it goes wrong, even when changing the "ë" to "Ã«".
I also suspected the file: is it really UTF-8? I found more issues concerning this.


Comment #56

johan2 CreditAttribution: johan2 commented 5 May 2013 at 22:06

Component:

Feeds Import

» Code

It must have something to do with the taxonomy. When I export a term to CSV from the site, the "ë" is exported as "Ã«". I know that my Excel is not the newest, but through TextEdit, Smultron or TextMate the text can be exported as UTF-8, even changing the "ë" manually. Only for the term does the issue stay unsolved; other text fields have no problems and import correctly.
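
The "ë" turning into "Ã«" is the usual signature of UTF-8 bytes being read as a single-byte encoding such as ISO-8859-1. A small sketch reproduces the effect; it only illustrates the symptom and makes no claim about where in the export or import chain it happens on this site.

```php
<?php
// "ë" (U+00EB) encoded as UTF-8 is the two bytes C3 AB.
$utf8 = "\xC3\xAB";

// Treating those two bytes as ISO-8859-1 and converting again yields "Ã«".
$double_encoded = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo $double_encoded, "\n";
echo bin2hex($double_encoded), "\n";

// Converting back the other way recovers the original bytes.
$restored = mb_convert_encoding($double_encoded, 'ISO-8859-1', 'UTF-8');
var_dump($restored === $utf8);
```

If a file already shows "Ã«" in a UTF-8 aware editor, it has most likely been converted once too often, and converting it again on import would make it worse.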


Comment #57

johan2 CreditAttribution: johan2 commented 6 May 2013 at 01:56

What also happens is that if you change the same term, replacing "ë" with "e" in both the taxonomy and the CSV file, the import can still go wrong because something stays in memory. I don't know what it is, but flushing the cache or updating aliases is not enough. Switching abruptly to another term that was in the taxonomy before does work... So once it goes wrong you are in for a lot of worry, since you are not sure what to do.


Comment #58

johan2 CreditAttribution: johan2 commented 7 May 2013 at 00:22

I have some more information about what happens. It seems that a great deal of the problem can be caused by the uploading process. When creating new terms is allowed on import (because otherwise the term stays empty), the CSV file creates a new term several times for the same column with the same term, and in that process it ends up with ? ? ? around each character. So many duplicates are created. I installed the merge module 7-1 and noticed this behaviour there. Before that I had merge 7-2dev and the replace module; I uninstalled them. The good news is that the special characters are getting through.


Comment #59

13rac1 CreditAttribution: 13rac1 commented 7 May 2013 at 04:11

But also good news is that the special characters are passing.

So, do you mean this feature patch is working correctly?


Comment #60

johan2 CreditAttribution: johan2 commented 7 May 2013 at 20:54

I did some more testing and ended up installing OpenOffice. Some of the characters in Excel didn't export correctly, which was not obvious to find out. In OpenOffice Calc (spreadsheet) I noticed that some characters had ? around them, and the same happened in the import with Feeds. Once I deleted these and exported the sheet to CSV again, the Feeds import worked. The problem was that these ? were invisible in a text editor or TextMate. Importing Excel files into OpenOffice goes through a wizard, and the software is free, so for me this is solved ;-)


Comment #61

johan2 CreditAttribution: johan2 commented 10 May 2013 at 23:18

Thanks for the great module. I just imported over 10.000 records with all kinds of data inline: title, number (mixed 1a, 1.1, 2, etc.), category, subcategory, dates, description, certificate number ... And in no more than a few minutes everything was in my views without problems. So thanks again!!


Comment #62

henkit CreditAttribution: henkit commented 21 May 2013 at 18:09

Hi,
Just installed #52, working like a charm now! :) Thanx so much!!!
Henk


Comment #63

Summit CreditAttribution: Summit commented 22 May 2013 at 13:54

Status:

Needs review

» Reviewed & tested by the community

Yep https://drupal.org/node/1428272#comment-7349918 worked!
Setting this to RTBC ok?
greetings, Martijn


Comment #64

franz CreditAttribution: franz commented 24 May 2013 at 04:35

Issue tags:

+Needs tests

Thanks for all the work!

I'd love to have a test for it though.


Comment #65

sphankin CreditAttribution: sphankin commented 3 June 2013 at 22:00

Hi,

I've been getting the SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\x92t message in my logs when feeds is trying to process a mail message from mail handler.

I've worked out that the word don't in the text is stopping it from working properly, because of the ' character.

I've tried implementing the patch as suggested above but that still doesn't seem to work.

Also, I'm not sure where to find the GUI shown in the screenshot above for changing the source file encoding.

Thanks in advance.
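
For reference, the `\x92` byte in the error above is how Windows-1252 encodes the typographic apostrophe. The sketch below assumes the mail body really is Windows-1252, which the thread does not confirm; it only shows why such a byte is rejected as UTF-8 and what a conversion would produce.

```php
<?php
// "don’t" as a Windows-1252 byte string (assumed encoding for this sketch).
$raw = "don\x92t";

// A lone 0x92 byte is not valid UTF-8.
var_dump(mb_check_encoding($raw, 'UTF-8'));

// After conversion the apostrophe becomes U+2019 (bytes E2 80 99).
$utf8 = mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');
var_dump(mb_check_encoding($utf8, 'UTF-8'));
echo bin2hex($utf8), "\n";
```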


Comment #66

msti CreditAttribution: msti commented 4 June 2013 at 16:59

@sphankin The screenshot above shows the import form located at import/name_of_importer
If you don't see the options from the screenshot, the patch was not applied properly. Here is how to apply patches: https://drupal.org/patch/apply


Comment #67

sphankin CreditAttribution: sphankin commented 5 June 2013 at 12:41

Hi @msti, thanks for the reply. I've followed the patch instructions and I still can't see anything/it isn't fixing the issue. As far as I can tell (looking through the patch) the patch is being applied properly. Would it make a difference if I'm importing an email instead of a CSV file? Thanks.


Comment #68

13rac1 CreditAttribution: 13rac1 commented 11 June 2013 at 23:16

@sphankin: this is only for the CSV parser.

I'll be able to look into writing a simpletest later this month.


Comment #69

sphankin CreditAttribution: sphankin commented 22 June 2013 at 12:46

@eosrei - thank you, that would be great!


Comment #70

sphankin CreditAttribution: sphankin commented 21 August 2013 at 19:14

@eosrei - Hi. Any development on the email parser?

Thanks,

Sam


Comment #71

13rac1 CreditAttribution: 13rac1 commented 22 August 2013 at 00:07

@sphankin: This patch is only for the CSV parser. I don't have time/need to work on the email parser. Please create a separate feature request issue for email functionality. I still hope to be able to implement the needed simpletests for this patch, but it is feature complete as is.


Comment #72

franz CreditAttribution: franz commented 29 August 2013 at 16:56

I used this patch with success on a project. I'm seriously just waiting for tests to commit it.


Comment #73

twistor CreditAttribution: twistor commented 2 September 2013 at 20:44

Status:

Reviewed & tested by the community

» Needs work

What franz said, needs tests.


Comment #74

Rosamunda CreditAttribution: Rosamunda commented 28 September 2013 at 23:54

#52 WORKED FOR ME! THANKS!!!


Comment #75

acouch CreditAttribution: acouch commented 29 September 2013 at 18:29

@franz or @twistor, could a test just involve adding international characters to one of the csv files in feeds or should it have its own test?


Comment #76

franz CreditAttribution: franz commented 7 October 2013 at 18:32

Ideally, we should have an individual assertion that verifies if the code works well with the encoding. What matters most is to make sure the tests appropriately cover the possibilities and provide an easy output in case it fails. That's my take on it at least.


Comment #77

HeathN CreditAttribution: HeathN commented 11 October 2013 at 16:49

#52 is the way to go. Thanks for this patch. What is the status on getting this into the next build?


Comment #78

acouch CreditAttribution: acouch commented 15 October 2013 at 14:55

If someone can provide a file (or files) that only works with the new conversion, and that they are comfortable having added to the project, I can write the assertion.


Comment #79

aimeerae CreditAttribution: aimeerae commented 17 October 2013 at 08:58

#52 worked for me. Thank you for the patch!


Comment #80

Summit CreditAttribution: Summit commented 17 October 2013 at 13:56

Hi,
Anyone up for the tests? Then I think this can be committed, right?
Greetings, Martijn


Comment #81

franz CreditAttribution: franz commented 17 October 2013 at 14:19

So all we need is a CSV file with non utf-8 encoding to test the feature. If someone can provide, then acouch writes the assertions and we commit.


Comment #82

acouch CreditAttribution: acouch commented 30 October 2013 at 16:08

Status:	Needs work	» Needs review

File	Size
feeds-encoding_support_CSV-1428272-82.patch	5.21 KB

I found that the patch in #52 doesn't allow the encoding settings to be changed on a per-node basis. I added

$form['encoding']['#default_value'] = isset($source_config['encoding']) ? $source_config['encoding'] : $form['encoding']['#default_value'];

to the sourceForm which fixed this. Didn't get a chance to test this, so setting to 'needs review'. Will reset if the test passes and will still wait for a file to write test encodings.

Comment #83

13rac1 CreditAttribution: 13rac1 commented 30 October 2013 at 17:27

Status:	Needs review	» Needs work

This should be "Needs Work" until tests are written. I've got three higher priority projects in front of the project needing this, so I cannot work on it.

Comment #83.0

13rac1 CreditAttribution: 13rac1 commented 30 October 2013 at 17:27

Issue summary:

View changes

Fixing screenshot

Comment #84

BrightBold CreditAttribution: BrightBold commented 20 November 2013 at 14:09

Patch in #52 solved the problem for me as well. (Sorry I didn't test #82 — I was short on time and didn't need the additional functionality it offered so I went for the earlier one with proven success). Wish I could write tests to help get this committed — it's great!

Comment #85

liquidcms CreditAttribution: liquidcms commented 10 December 2013 at 02:21

somewhat related but maybe needs a new issue:

if the column headings have non-standard characters (mine have French characters) then that column is skipped on import.

Comment #86

liquidcms CreditAttribution: liquidcms commented 10 December 2013 at 02:31

hmm..as i mentioned in #39 above:

i was getting the import to crash when it would encounter a non-english character. the patch in #52 fixed it from crashing; but it is simply removing those characters. that is certainly not a solution.

Comment #87

liquidcms CreditAttribution: liquidcms commented 10 December 2013 at 05:12

ughh!! at a bit of a loss with this... as i mentioned in #86 (and #39) i have the patch from #52 and my import no longer crashes on FR characters; but they get dropped.

I have looked into the code a bit more and this is making less sense the more i look.

taking the code from the patch i do this in a devel/php window:

$data = 'Date entrée';
$data = mb_convert_encoding($data, 'UTF-8', 'UTF-8');
echo $data;

and the result is: Date entrée

but, when i use a debugger and step through the same code in the feeds function: fixEncoding()

after running mb_convert_encoding() the characters are lost; exactly as occurs when doing the import.

my guess is this has something to do with the devel/php having some html encoding possibly in the mix that the import function doesn't have; but also confused why so many people seem to be having success with the patch in #52.

is it possibly due to using PHP 5.3 instead of PHP 5.4?

Comment #88

liquidcms CreditAttribution: liquidcms commented 10 December 2013 at 20:38

i admit i do not understand all the aspects of dealing with other character sets, but i ran these tests to show if the issue is PHP version dependent.

running a test file via php on command line (Windows)

i set this as my test file:

echo mb_convert_encoding('Bibliothèque', 'UTF-8', 'UTF-8');

results are:

C:\Program Files (x86)\nusphere\phped\php54>php -f test.php
Biblioth?que
C:\Program Files (x86)\nusphere\phped\php54>cd ../php53
C:\Program Files (x86)\nusphere\phped\php53>php -f test.php
Bibliothquec
C:\Program Files (x86)\nusphere\phped\php53>

i am sure the ? is just a display issue with the command shell; but it is clear that PHP 5.3 and PHP 5.4 act differently

i'll set up PHP 5.4 for the web server and try the CSV import to be sure.

Comment #89

liquidcms CreditAttribution: liquidcms commented 10 December 2013 at 22:14

fixed. in the fixEncoding function i replaced this line:

$data = mb_convert_encoding($data, $this->to_encoding, $detected_encoding);

with this one:

$data = utf8_encode($data);

Comment #90

13rac1 CreditAttribution: 13rac1 commented 10 December 2013 at 23:12

@liquidcms utf8_encode() only converts a string encoded in ISO-8859-1 to UTF-8. It doesn't cover any other conversions.

mb_convert_encoding() converts between any of the encodings it supports and is multi-byte string capable. See: http://www.php.net/manual/en/function.mb-convert-encoding.php

While this code may need some help, utf8_encode is not the answer. Sorry.
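
The difference can be seen in a short sketch. It assumes the mbstring extension is loaded; the sample bytes and variable names are made up for illustration and are not taken from any patch in this issue. Because utf8_encode() always treats its input as ISO-8859-1, the sketch reproduces that behaviour with an explicit ISO-8859-1 source encoding.

```php
<?php
// "Привет" encoded as Windows-1251: one byte per Cyrillic letter.
$cp1251_bytes = "\xCF\xF0\xE8\xE2\xE5\xF2";

// Treating the bytes as ISO-8859-1 (which is what utf8_encode() assumes)
// produces valid UTF-8, but with the wrong characters.
$as_latin1 = mb_convert_encoding($cp1251_bytes, 'UTF-8', 'ISO-8859-1');

// Naming the real source encoding gives the intended text.
$as_cp1251 = mb_convert_encoding($cp1251_bytes, 'UTF-8', 'Windows-1251');

echo $as_latin1, "\n";
echo $as_cp1251, "\n";
```

Only the second conversion recovers the Cyrillic text, which is why a single hard-coded source encoding cannot serve every importer.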

Comment #91

liquidcms CreditAttribution: liquidcms commented 11 December 2013 at 04:31

@eosrei thanks for the info.

i do not doubt what you are saying; but utf8_encode does work. :)

also the code in the patch: mb_detect_encoding($data, $this->from_encoding) returns utf8 for my data (not iso-8859-1)

and mb_convert_encoding certainly does not work (although, as i have said, perhaps it is only busted for < php 5.4).
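
One possible reading of #89 to #91, offered as an assumption rather than something confirmed in this issue: if the CSV data is really ISO-8859-1 or Windows-1252, a UTF-8 to UTF-8 conversion has nothing valid to work with at the accented characters, while utf8_encode() happens to match the real encoding. The sketch below uses made-up sample bytes and needs mbstring.

```php
<?php
// "Bibliothèque" as Windows-1252 bytes: the "è" is the single byte 0xE8.
$raw = "Biblioth\xE8que";

// The lone 0xE8 byte is not a valid UTF-8 sequence, so a UTF-8 to UTF-8
// conversion cannot preserve it.
$is_valid_utf8 = mb_check_encoding($raw, 'UTF-8');

// Converting from the encoding the bytes are actually in keeps the accent.
$fixed = mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');

var_dump($is_valid_utf8);
echo $fixed, "\n";
```

If this is what happened, selecting the real source encoding in the parser settings would be the more general fix than swapping in utf8_encode().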

Comment #92

祥子 CreditAttribution: 祥子 commented 12 February 2014 at 02:56

Status:	Needs work	» Needs review

82: feeds-encoding_support_CSV-1428272-82.patch queued for re-testing.

Comment #93

13rac1 CreditAttribution: 13rac1 commented 12 February 2014 at 03:08

Issue summary:	View changes
Status:	Needs review	» Needs work

Setting issue status back. No need to test until tests are written.

Comment #94

rcodina CreditAttribution: rcodina commented 4 March 2014 at 16:04

The patch in comment #82 works for me with these php versions:

5.3.10-1ubuntu3.10
5.2.17

I had no need to use "utf8_encode" instead of "mb_convert_encoding". I think "mb_convert_encoding" works really well here. However, I had to specify the exact charset of my CSV file to get it to work (in CSV parser configuration form). The "auto" option doesn't work for me.

I really think this patch must be put in the recommended release ASAP because this is a key feature for a lot of Drupal sites in Spanish, French, German, etc.

Thank you so much!!!

Comment #95

cgdrupalkwk CreditAttribution: cgdrupalkwk commented 29 March 2014 at 11:05

#82 worked great for me.

Comment #96

Niremizov CreditAttribution: Niremizov commented 25 April 2014 at 22:50

#82, #52 - the "Check encoding" checkbox is not optional. This checkbox is badly needed, because there is no guarantee that the string encoding will be determined properly by the mb_detect_encoding() function.

The problem is that mb_detect_encoding() returns the first character encoding scheme that it determines. What does this lead to?
Example: a file contains multi-language strings and uses Windows-1251 (CP1251) or any other character encoding scheme that is compatible with ASCII (same codes for English characters). If the first character in the string being processed is English, then mb_detect_encoding($data, 'Windows-1251') returns FALSE, because it detects ASCII first.

But mb_check_encoding() - works better in this case.
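
A minimal sketch of the check-then-convert approach suggested here, assuming mbstring is available. The function name convert_checked() and the sample string are hypothetical; the attached patch may do this differently.

```php
<?php
/**
 * Converts $data to UTF-8 after validating it against the declared encoding.
 *
 * Hypothetical helper for illustration only.
 */
function convert_checked($data, $from_encoding) {
  // mb_check_encoding() validates the whole string against one named
  // encoding instead of guessing among several candidates.
  if (!mb_check_encoding($data, $from_encoding)) {
    throw new InvalidArgumentException("Data is not valid $from_encoding.");
  }
  return mb_convert_encoding($data, 'UTF-8', $from_encoding);
}

// Starts with ASCII letters, followed by "цена" in Windows-1251.
$mixed = "Price: \xF6\xE5\xED\xE0";

$converted = convert_checked($mixed, 'Windows-1251');
echo $converted, "\n";

// The same bytes are rejected when declared as UTF-8.
try {
  convert_checked($mixed, 'UTF-8');
  $rejected = FALSE;
}
catch (InvalidArgumentException $e) {
  $rejected = TRUE;
}
```

Validating against the encoding the user selected avoids relying on detection order, at the cost of requiring the user to know the file's encoding.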

Comment #97

Niremizov CreditAttribution: Niremizov commented 25 April 2014 at 23:14

File	Size
feeds-encoding_support_CSV-1428272-97.patch	5.16 KB

Attached patch in addition to #96 comment.

Comment #98

14 September 2014 at 15:12

Markoz queued 82: feeds-encoding_support_CSV-1428272-82.patch for re-testing.

Comment #99

barryvdh CreditAttribution: barryvdh commented 16 April 2015 at 12:03

Patch #82 does seem to fix my import issues with the latest dev version. Any reason this isn't merged yet?

Comment #100

MegaChriz CreditAttribution: MegaChriz commented 16 April 2015 at 13:26

@Barryvdh
Yes, a test should be written for this issue. After that it is probably ready for commit. See also comment #81.

Comment #101

jtsnow CreditAttribution: jtsnow commented 21 July 2015 at 20:14

Status:	Needs work	» Needs review
Issue tags:	-Needs tests	+Needs Review

File	Size
feeds-encoding_support_CSV-1428272-101.patch	9.04 KB

Here is a patch that includes some tests.

Comment #102

acouch CreditAttribution: acouch commented 22 July 2015 at 21:24

Status:	Needs review	» Reviewed & tested by the community

Comment #103

5 October 2015 at 23:47

dev25 queued 101: feeds-encoding_support_CSV-1428272-101.patch for re-testing.

Comment #104

MegaChriz CreditAttribution: MegaChriz as a volunteer commented 15 October 2015 at 08:57

Assigned:	Unassigned	» MegaChriz
Status:	Reviewed & tested by the community	» Needs work

I found a few minor issues with this patch: coding standards and PHP notices on the CSV parser settings page when the mbstring extension is not available. I'm working on fixing these and I'm also writing a test to cover the behaviour when the mbstring extension is not available. Other than that, I think everything works OK. Great work everyone!

Comment #105

MegaChriz CreditAttribution: MegaChriz as a volunteer commented 15 October 2015 at 11:08

Assigned:	MegaChriz	» Unassigned
Status:	Needs work	» Needs review

File	Size
feeds-encoding_support_CSV-1428272-105.patch	14.64 KB

interdiff-1428272-101-105.txt	10.48 KB

In comparison with the previous patch, this patch doesn't change the base functionality, but it fixes minor things (coding standards, php notices) and adds an extra test.

Details:

For all added methods documentation for @param, @return and @throws was added (where appropriate).
The exception thrown in ParserCSV::fixEncoding() has been renamed to ParserCSVEncodingException so modules that extend the CSV parser can more easily distinguish which exception they receive.
A test case called FeedsCSVParserTestCase has been added that tests the CSV parser in the UI. It has two test methods: one to test behaviour when the mbstring extension is not loaded and one to test that import is halted when a CSV file in the wrong encoding is supplied.
A variable called "feeds_use_mbstring" is added in order to test the behaviour when the mbstring extension is not loaded. As a side effect you can use this variable to turn off the encoding feature.
PHP notices are fixed that occurred when the mbstring extension was not loaded.

I think this is ready.

Comment #106

15 October 2015 at 11:14

Status:	Needs review	» Needs work

The last submitted patch, 105: feeds-encoding_support_CSV-1428272-105.patch, failed testing.

Comment #107

MegaChriz CreditAttribution: MegaChriz as a volunteer commented 15 October 2015 at 12:28

Status:	Needs work	» Needs review

File	Size
feeds-encoding_support_CSV-1428272-107.patch	15.94 KB

interdiff-1428272-105-107.txt	2.15 KB

Apparently, items with encoding issues cannot be processed during a test without test failures, because in PHP < 5.4 that results in the following error:

htmlspecialchars(): Invalid multibyte sequence in argument

(I work with PHP 5.6 locally, so I did not get the test failure there.)

I've worked around this by emptying the list of items to process after parsing for the test that was failing (which was FeedsCSVParserTestCase::testMbstringExtensionDisabled()). The processing part is not relevant for that test anyway. The list of items will not be emptied in all other tests.

In comparison with the previous patch only test fixes are added.

Comment #108

MegaChriz CreditAttribution: MegaChriz as a volunteer commented 23 December 2015 at 17:13

File	Size
feeds-encoding_support_CSV-1428272-108.patch	15.89 KB
7.x-2.x: PHP 5.3 & MySQL 5.5, D7 302 pass

Reroll of the patch in #107.

Comment #109

twistor CreditAttribution: twistor as a volunteer commented 24 December 2015 at 07:15

Status:	Needs review	» Needs work

+++ b/libraries/ParserCSV.inc
@@ -325,4 +339,36 @@ class ParserCSV {
+  private function fixEncoding($data) {
+    if (extension_loaded('mbstring') && variable_get('feeds_use_mbstring', TRUE)) {
+      if (mb_check_encoding($data, $this->from_encoding)) {
+        // Convert encoding. The conversion is to UTF-8 by default to prevent
+        // SQL errors.
+        $data = mb_convert_encoding($data, $this->to_encoding, $this->from_encoding);
+      }
+      else {
+        throw new ParserCSVEncodingException(t('Source file is not in %encoding encoding.', array('%encoding' => $this->from_encoding)));
+      }
+    }

Why is this private?

the extension_loaded() and variable_get() calls should be moved to the constructor, so they are only checked once.

from_encoding and to_encoding should be camelCase.

to_encoding and from_encoding should be checked that they aren't the same value.

Comment #110

MegaChriz CreditAttribution: MegaChriz as a volunteer commented 29 December 2015 at 10:54

Status:	Needs work	» Needs review

File	Size
feeds-encoding-support-csv-1428272-110.patch	16.47 KB
7.x-2.x: PHP 5.3 & MySQL 5.5, D7 300 pass, 2 fail
interdiff-1428272-108-110.txt	2.35 KB

The method ParserCSV::fixEncoding() was made private by OnkelTem in #6, but I can not see a particular reason for it in that comment. I made it public now.

I also fixed the other concerns of #109.

Comment #111

29 December 2015 at 11:00

Status:	Needs review	» Needs work

The last submitted patch, 110: feeds-encoding-support-csv-1428272-110.patch, failed testing.

Comment #112

MegaChriz CreditAttribution: MegaChriz as a volunteer commented 29 December 2015 at 11:18

Status:	Needs work	» Needs review
Issue tags:	-Needs Review

File	Size
feeds-encoding-support-csv-1428272-112.patch	16.5 KB
7.x-2.x: PHP 5.3 & MySQL 5.5, D7 302 pass
interdiff-1428272-110-112.txt	1.04 KB

The encoding should be checked even if "fromEncoding" is equal to "toEncoding".
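
Taken together, the review points in #109 and the correction above could look roughly like the sketch below. This is not the committed code: the class name EncodingFixerSketch, its constructor arguments and the use of RuntimeException are assumptions made so the example runs outside Drupal (the patch itself reads the feeds_use_mbstring variable and throws ParserCSVEncodingException).

```php
<?php
class EncodingFixerSketch {
  protected $fromEncoding;
  protected $toEncoding;
  protected $useMbString;

  public function __construct($fromEncoding, $toEncoding = 'UTF-8', $allowMbString = TRUE) {
    $this->fromEncoding = $fromEncoding;
    $this->toEncoding = $toEncoding;
    // Evaluate the extension check once here rather than on every call.
    $this->useMbString = $allowMbString && extension_loaded('mbstring');
  }

  public function fixEncoding($data) {
    if (!$this->useMbString) {
      return $data;
    }
    // Always validate, even when source and target encodings are equal.
    if (!mb_check_encoding($data, $this->fromEncoding)) {
      throw new RuntimeException('Source data is not in ' . $this->fromEncoding . ' encoding.');
    }
    // Only convert when the two encodings actually differ.
    if ($this->fromEncoding !== $this->toEncoding) {
      $data = mb_convert_encoding($data, $this->toEncoding, $this->fromEncoding);
    }
    return $data;
  }
}

// "Montréal" as Windows-1252 bytes: the "é" is the single byte 0xE9.
$fixer = new EncodingFixerSketch('Windows-1252');
$result = $fixer->fixEncoding("Montr\xE9al");
echo $result, "\n";
```

Keeping the validation step outside the "encodings differ" condition means a file that claims to be UTF-8 but is not still gets rejected instead of passing through unchecked.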

Comment #113

hargobind CreditAttribution: hargobind commented 17 January 2016 at 09:57

#112 works great on my custom import. I only tested this to convert from Windows-1252 (ANSI) to UTF-8, but I don't see any reason why it would fail for any other conversions.

Comment #114

MegaChriz CreditAttribution: MegaChriz as a volunteer commented 17 January 2016 at 11:10

Status:	Needs review	» Fixed

Committed #112. Thanks all!

Comment #115

17 January 2016 at 11:10

MegaChriz committed ed8fbfd on 7.x-2.x

Issue #1428272 by OnkelTem, eosrei, jtsnow, Jerenus, liquidcms, acouch,...

Comment #116

31 January 2016 at 11:14

Status:	Fixed	» Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

Comment #117

mikeytown2 CreditAttribution: mikeytown2 commented 24 March 2016 at 00:47

If you wanted to support auto utf-8 translation of the encoding there is this https://github.com/neitanod/forceutf8

Also came up with this which seems to work for my use case.

  // Build encoding list.
  $encoding_list = array_unique(array_merge(
    mb_detect_order(),
    array(
      'UTF-8',
      'ASCII',
      'ISO-8859-1',
      'ISO-8859-2',
      'ISO-8859-3',
      'ISO-8859-4',
      'ISO-8859-5',
      'ISO-8859-6',
      'ISO-8859-7',
      'ISO-8859-8',
      'ISO-8859-9',
      'ISO-8859-10',
      'ISO-8859-13',
      'ISO-8859-14',
      'ISO-8859-15',
      'ISO-8859-16',
      'Windows-1251',
      'Windows-1252',
      'Windows-1254',
    )
  ));
  // Make sure the list contains only valid options.
  $encoding_list = array_intersect($encoding_list, mb_list_encodings());

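  // Detect the encoding. mb_detect_encoding() returns FALSE if nothing matches.
  $detected_encoding = mb_detect_encoding($input_string, $encoding_list, TRUE);

  // Convert encoding. If detection failed, keep the input as-is.
  $utf8_input_string = $detected_encoding ? mb_convert_encoding($input_string, 'UTF-8', $detected_encoding) : $input_string;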
