Some of the most important fields in our webform are essay questions. We ask users to answer these questions in "text area" elements. When users type directly into the field, everything goes well. Unfortunately, most people seem to prefer to compose their answers separately in a word processor, then copy and paste into the field (allowing them to spell check, save their work, revise, etc.). This creates a very frustrating problem: their word processor is often based on a different character set than whichever one Drupal Webform is used to handling (be it Unicode, UTF-8, ASCII, or something else). Thus, commas, quotation marks, and other punctuation come through in the CSV (comma-separated values) results as incomprehensible gibberish. This forces us to spend valuable time in Excel cleaning up our data to make it presentable for the people who will be evaluating the webform submissions.

How can we avoid this? Is there any kind of control over character encoding that might work around this? I have scoured the forums for other instances of this problem, but have found no discussion. Please help. Thanks!

Comments

quicksketch’s picture

Title: text areas cause punctuation and character set mess » CSV files not UTF-8 aware
Category: support » bug

Marked http://drupal.org/node/210649 as duplicate.

It turns out the downloaded CSV files have the wrong text encoding, which is what causes all the weirdness with Excel. Strangely, I can open the file in TextEdit (on a Mac) and the file looks fine. Opening the same file with BBEdit or Excel ends up with the screwy characters.

quicksketch’s picture

Status: Active » Closed (won't fix)

I did a bit of research on this and it seems that there's nothing that Webform can do on its end. CSV inherently only supports ASCII text (without special characters). Changing the download headers from UTF-8 to ASCII (or other encodings) doesn't have any effect. So as it stands, the correction needs to happen on the desktop application's side so that the file is read in correctly.
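For context, the kind of header change referred to above amounts to something like this (a sketch only, assuming the Drupal 5/6-era drupal_set_header() call that Webform used at the time; the charset values are illustrative):

// Declaring the charset in the Content-Type header of the download.
drupal_set_header('Content-Type: text/csv; charset=utf-8');
// Swapping 'utf-8' for 'us-ascii' or another charset here makes no
// difference to how Excel interprets the downloaded file.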

Marat’s picture

Status: Closed (won't fix) » Active

This could be fixed with a different approach. I changed the extension of the downloaded file from .csv to .txt in webform_report.inc, like so: drupal_set_header("Content-Disposition: attachment; filename=". preg_replace('/\.$/', '', str_replace(' ', '_', $node->title)) .".txt");, and it works, since text files are treated as UTF-8 aware. Then I used the text import wizard in Excel to import the data into a worksheet. Maybe the design of the submissions download needs to be rearranged.
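Broken out for readability, the changed line looks roughly like the following (a sketch assuming the Drupal 5/6-era drupal_set_header() call in webform_report.inc; only the extension in the filename changes):

// Build a safe filename from the node title (spaces to underscores,
// trailing dot stripped), then send it with a .txt extension instead
// of .csv so Excel's text import wizard handles the encoding.
$filename = preg_replace('/\.$/', '', str_replace(' ', '_', $node->title)) . '.txt';
drupal_set_header('Content-Disposition: attachment; filename=' . $filename);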

Does this change work?

quicksketch’s picture

This sounds like another work-around. If you're going to manually import the CSV through the wizard anyway, you could probably just rename the file from .csv to .txt after downloading it. It should still work even after downloading the file. Webform is properly setting the headers on download; it just happens that Excel refuses to read a CSV file directly and use UTF-8 characters.

Marat’s picture

Yes ... you are right. I downloaded the .csv file and opened it using Notepad and it works. It is the same as a .txt file. "... it just happens that Excel refuses to read a CSV file directly and use UTF-8 characters" — that is exactly what happens!

So, as you mentioned before, the correction needs to happen on the desktop application's side so that the file is read in correctly.

Thanks quicksketch

quicksketch’s picture

Status: Active » Closed (won't fix)

Thanks for the confirmation, Marat. I tested this pretty extensively when I looked into it, but all on a Mac. Thanks for backing up the conclusion for Windows as well. So... let's summarize: Excel doesn't read UTF-8 CSV files :P

Note that this isn't technically a bug on Excel's part, since the CSV spec itself doesn't allow for UTF-8 characters. It's very likely, then, that this behavior will never be corrected.

Rafał Ch’s picture

> So... let's summarize: Excel doesn't read UTF-8 CSV files :P

OpenOffice does :) and you can then save as .xls

quicksketch’s picture

Title: CSV files not UTF-8 aware » CSV files not Unicode/UTF-8 aware
Status: Closed (won't fix) » Fixed

I was researching this further recently, and I found that Excel can easily import a CSV file with non-Latin characters as long as Excel itself exported it. This revealed that Excel exports in "MacRoman" encoding on Macs, and "Windows-1252" on Windows. Great. So Excel *can* import CSV files with non-Latin characters, as long as you stick to the encoding used on each platform.

However! I found some more interesting information, such as that Excel can read UTF-16LE (little-endian) files on both Macs and PCs, but all the results end up shifted into a single column. Using Excel's "Text to Columns" function easily corrects this, however.

Considering it's much harder to get the encoding correct than it is to use "Text to Columns", I've changed Webform's downloads to use UTF-16LE. I also found that Excel works *very* well if tabs are used instead of commas in the download (also in UTF-16LE), so I've changed the default delimiter to a tab instead of a comma, since it seems to work universally.
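In rough outline, the approach described here amounts to something like the following (a sketch only, not the actual Webform 2.2 patch; the function name and structure are hypothetical):

// Hypothetical sketch: build a tab-delimited export and convert it to
// UTF-16LE with a little-endian byte-order mark so Excel detects the
// encoding on both Mac and Windows.
function example_webform_results_download(array $rows) {
  $lines = array();
  foreach ($rows as $row) {
    // One submission per line, fields separated by tabs rather than commas.
    $lines[] = implode("\t", $row);
  }
  $output = implode("\r\n", $lines);
  // 0xFF 0xFE is the UTF-16LE byte-order mark that Excel looks for.
  return chr(0xFF) . chr(0xFE) . mb_convert_encoding($output, 'UTF-16LE', 'UTF-8');
}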

As a super-bonus, Excel will read these tab-delimited files just fine when given a .xls extension (though Office 2007 will complain that the file contents don't match the extension). Anyway, all these new changes will be in the 2.2 version for everyone to try out.

Anonymous’s picture

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for two weeks with no activity.