Subj + minor fixes to FeedsSource.inc so that processing EMPTY parser results won't break the batch operation (it was before).

Functions:
* CSV parser configuration page and Importer start screen:
- source file encoding selection
- checkbox to check source encoding
* Parser
- converts encoding of source data
- checks source encoding and if it differs - throwing an exception

Screenshot:

Requires: mbstring PHP extension

CommentFileSizeAuthor
#112 interdiff-1428272-110-112.txt1.04 KBMegaChriz
#112 feeds-encoding-support-csv-1428272-112.patch16.5 KBMegaChriz
#110 interdiff-1428272-108-110.txt2.35 KBMegaChriz
#110 feeds-encoding-support-csv-1428272-110.patch16.47 KBMegaChriz
#108 feeds-encoding_support_CSV-1428272-108.patch15.89 KBMegaChriz
#107 interdiff-1428272-105-107.txt2.15 KBMegaChriz
#107 feeds-encoding_support_CSV-1428272-107.patch15.94 KBMegaChriz
#105 interdiff-1428272-101-105.txt10.48 KBMegaChriz
#105 feeds-encoding_support_CSV-1428272-105.patch14.64 KBMegaChriz
#101 feeds-encoding_support_CSV-1428272-101.patch9.04 KBjtsnow
#97 feeds-encoding_support_CSV-1428272-97.patch5.16 KBNiremizov
#82 feeds-encoding_support_CSV-1428272-82.patch5.21 KBacouch
#52 feeds-encoding_support_CSV-1428272-52.patch5.07 KB13rac1
#50 feeds-encoding_support_CSV-1428272-50.patch6.43 KB13rac1
#45 feeds-patch.png18.07 KBmsti
#41 feeds-encoding_support_CSV-1428272-41.patch6.89 KBliquidcms
#37 n4.txt14 bytesspuki
#35 feeds-encoding_support_CSV-1428272-35.patch6.89 KBJerenus
#34 feeds-encoding_support_CSV-1428272-34.patch9.42 KBJerenus
#30 price_win1251.csv_.txt143.25 KBStan Turyn
#27 feeds-add-encoding-support-1428272-0.patch6.94 KBOnkelTem
#13 adding_encoding_conversion_support-1428272-12.patch10.19 KBderhasi
#6 adding_encoding_conversion_support_2.patch9.83 KBOnkelTem
#3 adding_encoding_conversion_support.patch5.4 KBOnkelTem
#1 Edit importer: Товары из бухгалтерии | ФОР-Дубна.png86.26 KBOnkelTem
adding_encoding_conversion_support.patch6.26 KBOnkelTem
Support from Acquia helps fund testing for Drupal Acquia logo

Comments

OnkelTem’s picture

Issue summary: View changes

asd

OnkelTem’s picture

Issue summary: View changes

asd

OnkelTem’s picture

Issue summary: View changes

asd

OnkelTem’s picture

Issue summary: View changes

asd

OnkelTem’s picture

OnkelTem’s picture

Issue summary: View changes

asd

dmstru’s picture

Hi!

Cool features. I need this - I must check it.

Rus:

Пацаны, ваще ребята!
Молодцы, четко, могёте, умеете!

Пошел тестировать.

с ув., Алексей

OnkelTem’s picture

Issue summary: View changes

asd

OnkelTem’s picture

Issue summary: View changes

Note

OnkelTem’s picture

Supplying patch to 7.x-2.x git version.
Since commerce_feeds dev branch haven't been updated to accomodate severe changes in the feeds 2.x git branch yet (getting error message: Class FeedsCommerceProductProcessor contains 2 abstract methods .... ), I had no chance to test the patch.

OnkelTem’s picture

Issue summary: View changes

asd

dmstru’s picture

Version: 7.x-2.0-alpha4 » 7.x-2.x-dev
Status: Active » Needs review

Hi your patch not work:

:41: trailing whitespace.

:43: trailing whitespace.

:121: trailing whitespace.
$defaults = $this->configDefaults();
:125: trailing whitespace.
'#collapsible' => TRUE,
error: patch failed: plugins/FeedsCSVParser.inc:17
error: plugins/FeedsCSVParser.inc: patch does not apply

I try to use it in last dev version.

ALEX

dmstru’s picture

Please contact with me via mail or skype awa_77.

OnkelTem’s picture

Version: 7.x-2.x-dev » 7.x-2.0-alpha4
FileSize
9.83 KB

Fixed patch to 7.x-2.0-alpha4

* Fixed errors in previous (#0) patch: added missed encoding conversion when a cell allocates more then one line.

* REWORKED parser. This is serious change. I believe the original CSV parser did things in a wrong way, treating an arbitrary double quote as a delimiter. IMO, this violates CSV rules, which states, that a field should be [fully] double quoted, when something uncommon happens in it. For example:

;A string with a "double quote in it;

should definitely break parsing process. The correct variant would be:

;"A string with a ""double quote in it";

But current implementation will accept the former and will feed the fields' value with subsequent lines until next double quote.
In my patch it will throw an exception with corresponding message about it, stopping parsing.

p.s. Moving the issue back to alpha4. This is odd, I know, but I lost in versions.

polom’s picture

Hi,

I have import problems related to encodings (my CSV files are is-latin 1 encoded) and as re-encoding them is not my favorite choice I tried to find solutions.
This patch seems a nice deal but I can't apply it. I have Feeds 7.x-2.0-alpha4 but here is what I get :

$ patch < adding_encoding_conversion_support_2.patch
patching file FeedsSource.inc
Hunk #1 FAILED at 343.
1 out of 1 hunk FAILED -- saving rejects to file FeedsSource.inc.rej
patching file ParserCSV.inc
Hunk #1 FAILED at 74.
Hunk #2 FAILED at 92.
Hunk #3 FAILED at 194.
Hunk #4 FAILED at 232.
Hunk #5 FAILED at 258.
Hunk #6 FAILED at 320.
6 out of 6 hunks FAILED -- saving rejects to file ParserCSV.inc.rej
patching file FeedsCSVParser.inc
Hunk #1 FAILED at 17.
Hunk #2 FAILED at 101.
Hunk #3 FAILED at 145.
Hunk #4 FAILED at 154.
Hunk #5 FAILED at 180.
5 out of 5 hunks FAILED -- saving rejects to file FeedsCSVParser.inc.rej

Am I missing a point here ?

OnkelTem’s picture

Does patch -p1 < patchfile work for you?

emackn’s picture

Status: Needs review » Needs work
polom’s picture

yes it does :

$ patch -p1 < adding_encoding_conversion_support_2.patch
patching file includes/FeedsSource.inc
patching file libraries/ParserCSV.inc
patching file plugins/FeedsCSVParser.inc

Unfortunately my iso-latin encoded CSV file could not be imported and I had to convert it.

OnkelTem’s picture

@polom

Would you send me an example file?
aneganov at gmail d0t com

derhasi’s picture

Status: Needs work » Needs review

The patch works for me.

Cleaned it up for coding styles and renamed it, so the issue is referenced. (@see).

@polom, did you set the "Source file encoding" in the "Settings for CSV Parser"?

derhasi’s picture

*arg*, sorry forgot to attach the patch

emackn’s picture

Version: 7.x-2.0-alpha4 » 7.x-2.x-dev
Status: Needs review » Needs work

do you have a test for this?

derhasi’s picture

OnkelTem,could you provide a test for that?

OnkelTem’s picture

@derhasi, I would gladly make one, if I were know how to do that :) Really, I never created unit tests before.

Yuri’s picture

there is a patch in the issue summary, and in #13.
Which one to use? I applied #13 and update.php, nothing changed so far, I still have error
SQLSTATE[HY000]: General error: 1366 Incorrect string value

OnkelTem’s picture

Please, provide a copy of example data file which produces the error.

Yuri’s picture

Category: feature » bug

In my case, I use feeds, mailhandler and mail comment modules (latest devs in d7)
The following error messages appear onhttp://www.exampledomain.com/import/mailhandler_comments

Mailbox mail@exampledomain.com was checked and contained 1 messages.
Warning message SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xA0
<...' for column 'comment_body_value' at row 1
Error message SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xA0
<...' for column 'message' at row 1

The drupal error log shows:

PDOException: in field_sql_storage_field_storage_write() (line 448 of /home/gezond/public_html/modules/field/modules/field_sql_storage/field_sql_storage.module).

and the feeds log shows:

SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xA0asdf ...' for column 'message' at row 1
Yuri’s picture

The data imported is a general Gmail message like 'hello' (of which I have set utf8 for outgoing messages, according to http://support.google.com/mail/bin/answer.py?hl=en&answer=22841

OnkelTem’s picture

Are you mails come in CSV format?

OnkelTem’s picture

Status: Needs work » Needs review
xaqrox’s picture

Seems to work really well for me. Don't know the test framework very well either or I'd be right on it. I found a bunch of other issues which I believe this solves, which I closed as duplicate. Hope that's not too presumptuous.
#1605628: Encoding problems while importing from CSV
#1471950: CSV Parser: Check and Convert the Encoding
#1220606: Add support for encoding conversions for any parser
#1319142: csv in ISO 8859-15/EURO

twistor’s picture

Category: bug » feature
OnkelTem’s picture

@twistor
We probably need to split this issue, since #6 is reporting about serious bug in parser.
What do you think?

twistor’s picture

Yes, these are very separate issues.

OnkelTem’s picture

Separating encoding support from the rest.

OnkelTem’s picture

twistor’s picture

Status: Needs review » Needs work
+++ b/libraries/ParserCSV.incundefined
@@ -324,4 +342,37 @@ class ParserCSV {
+      if (function_exists('mb_convert_encoding')) {

This should use extension_loaded(). Move this check to the top of the function, don't check twice.

+++ b/plugins/FeedsCSVParser.incundefined
@@ -185,6 +191,40 @@ class FeedsCSVParser extends FeedsParser {
+    $form['encoding'] = array(

This whole fieldset should be conditional on the mb library.

+++ b/plugins/FeedsCSVParser.incundefined
@@ -185,6 +191,40 @@ class FeedsCSVParser extends FeedsParser {
+    if (function_exists('mb_list_encodings')) {

Should use extension_loaded().

Could you please provide an example CSV file that needs this functionality?

Stan Turyn’s picture

FileSize
143.25 KB

Hi twistor,

I'm attaching a sample CSV file that requires conversion. Any chance to see the patch commited to dev?

mtoscano’s picture

Hi,
I need encoding conversions from Spanish characters: which one is the patch to use against the current alpha7 release?
Thanks

Dubs’s picture

Thanks so much for your time on this essential module!

Please can this be committed - this would be so useful as I feel pain at the moment every time I have to import Euro characters!

imclean’s picture

To the last 3 posters, see #29. The patch still needs work before it can be committed.

Jerenus’s picture

Here is the patch based on all the previous. And I made some modifications to let it work.

Jerenus’s picture

The final one.

pcambra’s picture

Status: Needs work » Needs review
spuki’s picture

FileSize
14 bytes

It's not working for me. File encoding is Windows-1251, but it was recognized as CP936. So I just commented lines with encoding checking.

      //$encode_array = array('ASCII', 'UTF-8', 'GBK', 'GB2312', 'BIG5');
      //$this->encoding = mb_detect_encoding($data, $encode_array);
			
      // Convert encoding if needed
      if ($this->from_encoding != $this->to_encoding) {
          $data = mb_convert_encoding($data, $this->to_encoding, $this->from_encoding);
      }
bjcone’s picture

I downloaded the patch and it seems to be working to check the encoding of individual lines. However, I ran into a case where the file I was attempting to import was encoded such that fgets() did not recognize the end of line character and returned the full contents of the file. At this point it is too late for this patch to help.

What I found to resolve this issue and enable feeds to import my ASCII-encoded file was ini_set("auto_detect_line_endings", true). See the bottom of the page at http://www.php.net/manual/en/filesystem.configuration.php#ini.auto-detec... for more information. It seems as though PHP, without this setting, does not separate lines which only end in \r (Mac Carriage Return character).

Hope this is useful to someone else as it took me a while to find.

liquidcms’s picture

i have issue similar to #37. i saved a file from Excel (2007) as .csv. the file has 5 lines, 3 of which have odd non-ASCII dashes. when hitting the detect encoding line in the patch; as spuki states, those 3 lines are detected as CP936.

NOTE. comment in #37 about commenting out those lines doesn't do anything; mb_detect_encoding doesn't modify the data it simply determines the encoding. with or without the array of encoding types listed it still determines these lines are CP936 and the part of the code which then does the encode still runs (that check is simply to make sure we don't re-encode data that is already UTF-8.

so the only issue (and maybe not fixable) is that to convert those dashes in CP936 to UTF-8 the dash is simply removed - this is not a great solution; but without the patch and the encoding conversion Feeds import fails on these characters.

ALSO - as noted in #1140194: SQLSTATE[HY000]: General error: 1366 Incorrect string value for a field with accents if i open the Excel saved CSV into Notepad and then re-save as UTF-8, the dashes import correctly (with or without this patch; although i agree would be nice if Feeds could handle this).

liquidcms’s picture

turns out losing characters when using PHP to convert from CP936 to UTF-8 is a bug in PHP 5.3 and has been fixed in PHP 5.4.0 (https://bugs.php.net/bug.php?id=60306)

liquidcms’s picture

btw, patch in #35 does not adhere to drupal coding standards. i think the attached is somewhat closer

Status: Needs review » Needs work

The last submitted patch, feeds-encoding_support_CSV-1428272-41.patch, failed testing.

Jerenus’s picture

Status: Needs work » Needs review
msti’s picture

#35 works for me

Thanks!

msti’s picture

FileSize
18.07 KB

screenshot

Summit’s picture

Status: Needs review » Reviewed & tested by the community

Hi, for me too! Setting this to RTBC ok?
Greetings, Martijn

meSte’s picture

Just to be clear: is #35 the patch that will be committed?

twistor’s picture

Status: Reviewed & tested by the community » Needs work

As was already pointed out, the patch in #35 breaks Drupal coding standards.

Honestly the to, from, check encoding business seems too complicated. Feeds is complicated enough. Is there a valid reason why we can't just automatically detect the encoding, and use that without any user interaction?

Honza Pobořil’s picture

twistor: Because sometimes the automatic detection is not reliable. It's nice to have auto detection, but not only this.

Btw, if you think Feeds is too complicated too, see this project.

13rac1’s picture

Status: Needs work » Needs review
FileSize
6.43 KB

Here is #35 plus the coding standard changes from #41. Should pass tests. Works great for me to import international characters. Thanks everyone.

13rac1’s picture

Issue summary: View changes

asd

Jerenus’s picture

Are we have the conclusion that the demand of user interaction?

13rac1’s picture

I've refactored this a bit:

  • UTF-8 conversion is forced to stop the "SQLSTATE[HY000]: General error: 1366 Incorrect string value" error.
  • Removed the optional "Check encoding" checkbox.
  • Made the encoding convert function fail if the encoding detection fails
  • Removed the encoding check function (it seems extraneous?)
  • Removed the exception if fixEncoding() cannot locate mbstring. I left the notice on the import form, maybe should be in install requirements?
  • Made $detected_encoding a local variable.
  • Added additional comments.
  • Reduced the overall amount of code changes.

It needs simpletests and possibly more refactoring to fit better within the current code architecture.

johan2’s picture

Hi, I installed the latest patch 52. I have a csv with terms in a column with a special character (ë). When I import the csv (utf-8) the term is not recognized. The select term field stays empty. I am struggling for a week with this. The only thing that works is to make a fake term without special character and try to change it by deleting the term and try to merge it with the intended taxonomy term with the special character or changing the term name once it is imported.

13rac1’s picture

@johan2 Have you tried any other encoding options? Windows‑1252 worked for my imports.

johan2’s picture

I am working on a power mac with excel 2001. My best setting to save from excel is -windows csv-. With Smultron and Textmate I controlled the file, and it shows no problems. A text is imported fine to a Text-field except when it has to go to a Term-field with special characters then it goes wrong even when changing the "ë" to "ë".
I suspected also the file... is it really utf-8 ? I found more issues concerning this.

johan2’s picture

Component: Feeds Import » Code

It must have something to do with the taxonomy. When I export a Term to csv from the site it will export the "ë" to "ë" . I kow that my excel is not the newest but through text-edit or smultron or textmate the text can be exported to utf-8, even changing the "ë" manually. Only for the Term the issue stays unsolved other text-fields have no problems and import correctly.

johan2’s picture

What also happens is if you adapt the same Term, changing "ë" to "e" in both taxonomy and csv-file, the import can go wrong because something stays in memory. I don't know what it is but flush cache is not enough or updating aliases. Changing abrupt to another Term that was in the taxonomy before works... So once it goes wrong you are started for a lot of worries since you are not sure what to do.

johan2’s picture

I have some more information about what happens. It seems that a great deal of the problem can be caused by the uploading process. When creating new terms is allowed (because otherwise the term stays empty) on import then the csv file creates for the same collumn with the same term several times a new term and in that process it ends up with ? ? ? around each character. So many duplicates are created. I installed the merge module 7-1 and here I noticed this behaviour. Before I had merge 7-2dev and the replace module, I desinstalled them. But also good news is that the special characters are passing.

13rac1’s picture

But also good news is that the special characters are passing.

So, do you mean this feature patch is working correctly?

johan2’s picture

I did some more testing and ended up installing Openoffice. Some of the characters in excel didn't export correctly. This was not obvious to find out. In the Openoffice Calc (spreadsheet) I noticed that some characters had ? around them. So this was also the case in the import with Feeds. Once I deleted these and exported the sheet again to csv the Feeds-import worked. The problem was that these ? were invisible in a texteditor or textmate. To import excel into Openoffice goes through a wizard and the software is free, so for me this is solved ;-)

johan2’s picture

Thanks for the great module. Just imported over 10.000 records with all kinds of data inline: title, number (mixed 1a, 1.1, 2, etc) ,category, subcategory, dates, description, certificate number ... And in not more then a few minutes everything was in my views without problems. So thanks again!!

henkit’s picture

Hi,
Just installed #52, working like a charm now! :) Thanx so much!!!
Henk

Summit’s picture

Status: Needs review » Reviewed & tested by the community

Yep https://drupal.org/node/1428272#comment-7349918 worked!
Setting this to RTBC ok?
greetings, Martijn

franz’s picture

Issue tags: +Needs tests

Thanks for all the work!

I'd love to have a test for it though.

sphankin’s picture

Hi,

I've been getting the SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\x92t message in my logs when feeds is trying to process a mail message from mail handler.

I've worked out that in the text the word don't is stopping it from working properly because of the'.

I've tried implementing the patch as suggested above but that still doesn't seem to work.

Also I'm not sure where the GUI as printscreened above is to change the source file encoding?

Thanks in advance.

msti’s picture

@sphankin The screenshot above shows the import form located at import/name_of_importer
If you dont see the options in the screenshot, the patch is not applied properly. Here is how to apply patches: https://drupal.org/patch/apply

sphankin’s picture

Hi @msti, thanks for the reply. I've followed the patch instructions and I still can't see anything/it isn't fixing the issue. As far as I can tell (looking through the patch) the patch is being applied properly. Would it make a difference if I'm importing an email instead of a CSV file? Thanks.

13rac1’s picture

@spankin: this is only for the CSV parser.

I'll be able to look into writing a simpletest later this month.

sphankin’s picture

@eosrei - thank you, that would be great!

sphankin’s picture

@eosrei - Hi. Any development on the email parser?

Thanks,

Sam

13rac1’s picture

@sphankin: This patch is only for the CSV parser. I don't have time/need to work on the email parser. Please create a separate feature request issue for email functionality. I still hope to be able to implement the needed simpletests for this patch, but it is feature complete as is.

franz’s picture

I used this patch with success on a project. I'm seriously just waiting for tests to commit it.

twistor’s picture

Status: Reviewed & tested by the community » Needs work

What franz said, needs tests.

Rosamunda’s picture

#52 WORKED FOR ME! THANKS!!!

acouch’s picture

@franz or @twistor, could a test just involve adding international characters to one of the csv files in feeds or should it have its own test?

franz’s picture

Ideally, we should have an individual assertion that verifies if the code works well with the encoding. What matters most is to make sure the tests appropriately cover the possibilities and provide an easy output in case it fails. That's my take on it at least.

HeathN’s picture

#52 is the way to go. Thanks for this patch. What is the status on getting this into the next build?

acouch’s picture

If someone can provide a file or files that works only with the new conversion that they would be comfortable being added to the project I can write the assertion.

aimeerae’s picture

#52 worked for me. Thank you for the patch!

Summit’s picture

Hi,
Anyone up for the tests? Than I think this can be committed, right?
Greetings, Martijn

franz’s picture

So all we need is a CSV file with non utf-8 encoding to test the feature. If someone can provide, then acouch writes the assertions and we commit.

acouch’s picture

Status: Needs work » Needs review
FileSize
5.21 KB

I found that the patch in #52 does't allow the encoding settings to be changed on a per-node basis. I added

$form['encoding']['#default_value'] = isset($source_config['encoding']) ? $source_config['encoding'] : $form['encoding']['#default_value'];

to the sourceForm which fixed this. Didn't get a chance to test this, so setting to 'needs review'. Will reset if the test passes and will still wait for a file to write test encodings.

13rac1’s picture

Status: Needs review » Needs work

This should be "Needs Work" until tests are written. I've got three higher priority projects in front of the project needing this, so I cannot work on it.

13rac1’s picture

Issue summary: View changes

Fixing screenshot

BrightBold’s picture

Patch in #52 solved the problem for me as well. (Sorry I didn't test #82 — I was short on time and didn't need the additional functionality it offered so I went for the earlier one with proven success). Wish I could write tests to help get this committed — it's great!

liquidcms’s picture

somewhat related but maybe needs a new issue:

if the column headings have non standard characters (mine have french characters) then that column is skipped on import.

liquidcms’s picture

hmm..as i mentioned in #39 above:

i was getting the import to crash when it would encounter a non-english character. the path in #52 fixed it from crashing; but it is simply removing those characters. that is certainly not a solution.

liquidcms’s picture

ughh!! at a bit of a loss with this... as i mentioned in #86 (and #39) i have the patch from #52 and my import no longer crashes on FR characters; but they get dropped.

I have looked in to the code a bit more and this is making less sense the more i look.

taking the code from the patch i do this in a devel/php window:

$data = 'Date entrée';
$data = mb_convert_encoding($data, 'UTF-8', 'UTF-8');
echo $data;

and the result is: Date entrée

but, when i use a debugger and step through the same code in the feeds function: fixEncoding()

after running mb_convert_string() the character are lost; exactly as occurs when doing the import.

my guess is this has something to do with the devel/php having some html encoding possibly in the mix that the import function doesn't have; but also confused why so many people seem to be having success with the patch in #52.

is it possibly due to using PHP 5.3 instead of PHP 5.4?

liquidcms’s picture

i admit i do not understand all the aspects of dealing with other character sets, but i ran these tests to show if the issue is PHP version dependent.

running a test file via php on command line (Windows)

i set this as my test file:

echo mb_convert_encoding('Bibliothèque', 'UTF-8', 'UTF-8');

results are:

C:\Program Files (x86)\nusphere\phped\php54>php -f test.php
Biblioth?que
C:\Program Files (x86)\nusphere\phped\php54>cd ../php53
C:\Program Files (x86)\nusphere\phped\php53>php -f test.php
Bibliothquec
C:\Program Files (x86)\nusphere\phped\php53>

i am sure the ? is just a display issue with the command shell; but it is clear that PHP 5.3 and PHP 5.4 act differently

i'll set up PHP 5.4 for the web server and try the CSV import to be sure.

liquidcms’s picture

fixed. in the fixEncoding function i replaced this line:

$data = mb_convert_encoding($data, $this->to_encoding, $detected_encoding);

with this one:

$data = utf8_encode($data);

13rac1’s picture

@liquidcms utf8_encode() only converts a string encoded in ISO-8859-1 to UTF-8. It doesn't cover any other conversions

mb_convert_encoding() converts between any listed encoding and is multi-byte string capable. See: http://www.php.net/manual/en/function.mb-convert-encoding.php

While this code may need some help, utf8_encode is not the answer. Sorry.

liquidcms’s picture

@eosrei thanks for the info.

i do not doubt what you are saying; but utf8_encode does work. :)

also the code in the patch: mb_detect_encoding($data, $this->from_encoding) returns utf8 for my data (not iso-8859-1)

and mb_convert_encoding certainly does not work (although, as i have said, perhaps it is only busted for < php 5.4).

祥子’s picture

Status: Needs work » Needs review
13rac1’s picture

Issue summary: View changes
Status: Needs review » Needs work

Setting issue status back. No need to test until tests are written.

rcodina’s picture

The patch in comment #82 works for me with these php versions:

5.3.10-1ubuntu3.10
5.2.17

I had no need to use "utf8_encode" instead of "mb_convert_encoding". I think "mb_convert_encoding" works really well here. However, I had to specify the exact charset of my CSV file to get it to work (in CSV parser configuration form). The "auto" option doesn't work for me.

I really think this patch must be put on recommended release ASAP because this is a key feature for a lot of Drupal sites in spanish, french, german, etc

Thank you so much!!!

cgdrupalkwk’s picture

#82 worked great for me.

Niremizov’s picture

#82 , #52 - "Check encoding" checkbox is not optional. This checkbox is badly needed, because there is not guarantee that string encoding would be determined properly with mb_detect_encoding() function.

The problem is that mb_detect_encoding() returns the first character encoding sheme that it determines. What this leads to?
Example: File contains multi language strings and uses Windows-1251 (CP1251) or any other character encoding sheme, that is compatible with ASCII (same codes for English chars). If first char in the processing string is English, then mb_detect_encoding($data, 'Windows-1251') - returns FALSE, because it detects ASCII first.

But mb_check_encoding() - works better in this case.

Niremizov’s picture

Attached patch in addition to #96 comment.

barryvdh’s picture

Patch #82 does seem to fix my import issues with the latest dev version. Any reason this isn't merged yet?

MegaChriz’s picture

@Barryvdh
Yes, a test should be written for this issue. After that it is probably ready for commit. See also comment #81.

jtsnow’s picture

Status: Needs work » Needs review
Issue tags: -Needs tests +Needs Review
FileSize
9.04 KB

Here is a patch that includes some tests.

acouch’s picture

Status: Needs review » Reviewed & tested by the community

MegaChriz’s picture

Assigned: Unassigned » MegaChriz
Status: Reviewed & tested by the community » Needs work

I found a few minor issues with this patch: coding standards and php notices on the csv parser settings page when the extension mbstring is not available. I'm working on fixing these and I also write a test to cover the behaviour when the mbstring extension is not available. Other then that, I think everything works OK. Great work everyone!

MegaChriz’s picture

Assigned: MegaChriz » Unassigned
Status: Needs work » Needs review
FileSize
14.64 KB
10.48 KB

In comparison with the previous patch, this patch doesn't change the base functionality, but it fixes minor things (coding standards, php notices) and adds an extra test.

Details:

  • For all added methods documentation for @param, @return and @throws was added (where appropriate).
  • The exception thrown in ParserCSV::fixEncoding() has been renamed to ParserCSVEncodingException so modules that extend the CSV parser can easier distinguish which exception they receive.
  • A test case called FeedsCSVParserTestCase has added that tests the CSV parser in the UI. It has two test methods: one to test behaviour when the mbstring extension is not loaded and one to test that import is halted when a CSV file in the wrong encoding is supplied.
  • A variable called "feeds_use_mbstring" is added in order to test the behaviour when the mbstring extension is not loaded. As a side effect you can use this variable to turn off the encoding feature.
  • PHP notices are fixed that occurred when the mbstring extension was not loaded.

I think this is ready.

Status: Needs review » Needs work

The last submitted patch, 105: feeds-encoding_support_CSV-1428272-105.patch, failed testing.

MegaChriz’s picture

Status: Needs work » Needs review
FileSize
15.94 KB
2.15 KB

Apparently, items with encoding issues can not be processed during a test without test failures because in < PHP 5.4 that will result into the following error:

htmlspecialchars(): Invalid multibyte sequence in argument

(I work with PHP 5.6 locally, so I did not get the test failure there.)

I've worked around this by emptying the list of items to process after parsing for the test that was failing (which was FeedsCSVParserTestCase::testMbstringExtensionDisabled()). The processing part is not relevant for that test anyway. The list of items will not be emptied in all other tests.

In comparison with the previous patch only test fixes are added.

MegaChriz’s picture

Reroll of the patch in #107.

twistor’s picture

Status: Needs review » Needs work
+++ b/libraries/ParserCSV.inc
@@ -325,4 +339,36 @@ class ParserCSV {
+  private function fixEncoding($data) {
+    if (extension_loaded('mbstring') && variable_get('feeds_use_mbstring', TRUE)) {
+      if (mb_check_encoding($data, $this->from_encoding)) {
+        // Convert encoding. The conversion is to UTF-8 by default to prevent
+        // SQL errors.
+        $data = mb_convert_encoding($data, $this->to_encoding, $this->from_encoding);
+      }
+      else {
+        throw new ParserCSVEncodingException(t('Source file is not in %encoding encoding.', array('%encoding' => $this->from_encoding)));
+      }
+    }

Why is this private?

the extension_loaded() and variable_get() calls should be moved to the constructor, so they are only checked once.

from_encoding and to encoding should be camelCase.

to_encoding and from_encoding should be checked that they aren't the same value.

MegaChriz’s picture

The method ParserCSV::fixEncoding() was made private by OnkelTem in #6, but I can not see a particular reason for it in that comment. I made it public now.

I also fixed the other concerns of #109.

Status: Needs review » Needs work

The last submitted patch, 110: feeds-encoding-support-csv-1428272-110.patch, failed testing.

MegaChriz’s picture

Status: Needs work » Needs review
Issue tags: -Needs Review
FileSize
16.5 KB
1.04 KB

The encoding should be checked even if "fromEncoding" is equal to "toEncoding".

hargobind’s picture

#112 works great on my custom import. I only tested this to convert from Windows-1252 (ANSI) to UTF-8, but I don't see any reason why it would fail for any other conversions.

MegaChriz’s picture

Status: Needs review » Fixed

Committed #112. Thanks all!

  • MegaChriz committed ed8fbfd on 7.x-2.x
    Issue #1428272 by OnkelTem, eosrei, jtsnow, Jerenus, liquidcms, acouch,...

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

mikeytown2’s picture

If you wanted to support auto utf-8 translation of the encoding there is this https://github.com/neitanod/forceutf8

Also came up with this which seems to work for my use case.

  // Build encoding list.
  $encoding_list = array_unique(array_merge(
    mb_detect_order(),
    array(
      'UTF-8',
      'ASCII',
      'ISO-8859-1',
      'ISO-8859-2',
      'ISO-8859-3',
      'ISO-8859-4',
      'ISO-8859-5',
      'ISO-8859-6',
      'ISO-8859-7',
      'ISO-8859-8',
      'ISO-8859-9',
      'ISO-8859-10',
      'ISO-8859-13',
      'ISO-8859-14',
      'ISO-8859-15',
      'ISO-8859-16',
      'Windows-1251',
      'Windows-1252',
      'Windows-1254',
    )
  ));
  // Make sure the list contains only valid options.
  $encoding_list = array_intersect($encoding_list, mb_list_encodings());

  // Convert encoding.
  $utf8_input_string = mb_convert_encoding($input_string, 'UTF-8', mb_detect_encoding($input_string, $encoding_list, TRUE));