drupal/modules/node.module contains three binary strings. The use of binary strings causes problems in text editors incapable of handling such strings.

For example, I have modified this file for internal use, but when I save the file, the binary strings get converted to question marks. While someone could argue that I should get a better text editor, this issue exists for many users, not just me. Besides, the solution is simple...

PHP allows representing binary characters via hex encoding. That's what this patch does.
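
For example, here is a minimal sketch (the byte values are the ones from node.module discussed below): in a double-quoted PHP string, hex escapes produce exactly the same bytes as typing the UTF-8 character directly.

<?php
// "\xe3\x80\x82" is the UTF-8 encoding of U+3002, the ideographic
// full stop; the escaped form and the literal character are
// byte-for-byte identical.
var_dump("\xe3\x80\x82" === "。"); // bool(true)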

Thanks.

Attached file: node.module.nobinary.diff (864 bytes, danielc)

Comments

kbahey’s picture

+1 for this.

They never show up correctly for me (Linux, vim).

Steven’s picture

All Drupal code is UTF-8 encoded. locale.inc contains many more such strings, and there are some in search.module too. Keeping them as plain text is vital to keeping the code readable; using hex escapes reduces editability.

-1 on this.

Steven’s picture

PS: If your editor converts them to question marks, it means it doesn't support UTF-8 properly and only handles your local ANSI codepage. You will run into many more issues. I use Notepad2 on Windows and it works like a charm.
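
A minimal PHP sketch of that lossy conversion (assuming the mbstring extension is available; Windows-1252 stands in for whatever the local code page happens to be): any character the legacy code page cannot represent is replaced, and the original bytes are gone for good.

<?php
// U+3002 has no mapping in Windows-1252, so mbstring falls back to
// its substitute character, '?' by default.
$utf8 = "\xe3\x80\x82";
echo mb_convert_encoding($utf8, 'Windows-1252', 'UTF-8'); // prints "?"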

danielc’s picture

Steven wrote:
> Keeping them as plain-text is vital to keeping the code readable,
> using hex escapes reduces editability.

Right now, these characters don't display well in many readers. For example, looking at the web interface of the CVS repository in Mozilla (go to http://cvs.drupal.org/viewcvs/drupal/drupal/modules/node.module?annotate... then look at line 217), I see an a-tilde, a Euro symbol, and a comma. I hardly consider that "readable."

When viewing the file in vi, the binary data already shows up as its hex representation (for example, line 217 has "\xe3\x80\x82"). So changing them to escaped/encoded strings makes the code look exactly the same. The only difference is that my patch represents each byte using four ASCII characters. Now everyone can easily read and edit the values.
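
For the record, such escapes can be generated mechanically; a minimal PHP sketch (the helper name is hypothetical, not part of the patch):

<?php
// Turn every byte of a string into its \xNN escape sequence.
function to_hex_escapes($bytes) {
  $out = '';
  foreach (str_split($bytes) as $byte) {
    $out .= sprintf('\x%02x', ord($byte));
  }
  return $out;
}

echo to_hex_escapes("\xe3\x80\x82"); // prints \xe3\x80\x82 as plain ASCII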

Steven’s picture

This is the fault of the ViewCVS code, which assumes ISO-8859-1 encoding; it has nothing to do with using UTF-8. Those mails look fine in my email client (Thunderbird).

danielc wrote:
> The only difference is that my patch represents each byte using four
> ASCII characters. Now everyone can easily read and edit the values.

I disagree. If I want to edit the characters now, I simply type them in using whatever input method is appropriate. The literal bytes don't mean anything.

If I want to hex-encode a piece of UTF-8 text, I have to view the text I typed as literal bytes somehow (so I need a hex editor?) and enter the values in the code. If I later want to figure out what that text really says, I have to paste the hex values back into the hex editor, save to a text file, and open it as UTF-8. This is a waste of time.

This is like saying all code should be hard-wrapped at 80 characters because, well, that's what ancient terminals used. Sorry, I don't buy it. My computer has no problem using and displaying Unicode. There are tons of freeware Unicode fonts around, and as far as I know, most Unix tools handle it fine. As far as non-displayable characters go, there is an excellent fallback font which represents them with a small box containing the Unicode codepoint. This is much more useful than the literal bytes, as it actually means something.

Any modern OS should support fallback fonts for missing characters. It's not my problem if you decide to torture yourself with vi and friends.

chastell’s picture

It looks like drupal-devel is not gated both ways, so I’ll add my 2¢ here as well:

Steven wrote:
> Any modern OS should support fallback fonts for missing characters. It's
> not my problem if you decide to torture yourself with vi and friends.

Just to clarify – vim works perfectly in a UTF-8 environment. If one sets up a proper locale (pl_PL.UTF-8 in my case, en_US.UTF-8 in the original poster's, perhaps) and properly configures one's terminal, everything simply works and the characters show up without a problem.

If an outsider vote counts, -1 on converting the strings to hex values.

Cheers,
-- Shot (Piotr Szotkowski)

danielc’s picture

Steven, I'm confused. I hope you can set me straight, please.

Here's a URI of a graphic showing what I see on line 194 of node.module, version 1.485.2.8.
http://www.analysisandsolutions.com/drupal/node.module.png
Is this what you see? Is that what I'm supposed to be seeing? Can you please show me what you see? What are these characters? What do they represent? In what language?

You mentioned "If I want to edit the characters now, I simply type them in using whatever input method is appropriate." How do you type them in? I don't see keys with those characters on my keyboard.

Because node.module contains characters outside ISO-8859-1 (bytes 227, 128, 130, 216, 159 in decimal; e3, 80, 82, d8, 9f in hex), people editing or reading this file must have software that can handle these bytes. Drupal is an open source project. Shouldn't the files therein be easily accessible to everyone? As they stand, these characters limit participation to those with particular editors.
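
A minimal PHP sketch for locating such bytes yourself (it assumes node.module sits in the current directory):

<?php
// Report every byte outside the ASCII range, with its offset in the file.
$source = file_get_contents('node.module');
preg_match_all('/[\x80-\xff]/', $source, $matches, PREG_OFFSET_CAPTURE);
foreach ($matches[0] as $match) {
  printf("offset %d: 0x%02x\n", $match[1], ord($match[0]));
}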

I came across this issue because I'm editing node.module to do unusual things for a client. Then I made a patch file to send him. Upon inspecting the patch file, I noticed my editor, EditPlus, had morphed the characters.

Steven’s picture

Mozilla should handle UTF-8, but that particular page is messed up because the ViewCVS script is hardcoded to assume ISO-8859-1. It replaces the upper bytes with one entity per byte, which prevents the UTF-8 encoding from doing its job.
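
A hypothetical reconstruction of that bug in PHP (not the actual ViewCVS code): escaping each upper byte as its own numeric entity splits one UTF-8 character into several Latin-1-looking glyphs.

<?php
// One character, U+3002, encoded as three UTF-8 bytes.
$utf8 = "\xe3\x80\x82";
// Escape each upper byte as its own numeric character reference.
echo preg_replace_callback('/[\x80-\xff]/', function ($m) {
  return '&#' . ord($m[0]) . ';';
}, $utf8);
// Prints &#227;&#128;&#130;, which browsers typically render as three
// separate glyphs (an a-tilde, a Euro sign and a comma) instead of "。".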

As far as EditPlus goes, it's simply a buggy editor. I did a big survey to find a good editor for Windows. EditPlus is one of many which claim to support UTF-8 but really convert the UTF-8 to the local legacy 8-bit code page and back again. This means that any character not in your local code page is permanently destroyed. In other words, your editor is corrupting the files you open, and without telling you. Not exactly a desirable quality.

See http://www.acko.net/blog/text-editor for a big list of Windows text editors.

As far as entering the characters goes, the world does not end at your keyboard. Most OSes have a character map and provide other services for text entry. For example, I can enter Japanese through the Input Method Editor that comes with the OS by typing phonetically and selecting the right characters. And I wrote myself a nice app for entering various Unicode characters:
http://www.acko.net/blog/sprankle
Note that old Windows 9x programs don't support Unicode, and some use a weird method of input which does not recognize the keydown/keyup messages I send.

Anyway: the characters you ask about are the ideographic full stop and a mirrored question mark, the common sentence endings in Chinese/Japanese and Arabic respectively. In a proper editor with a proper font installed they will display fine:

http://acko.net/dumpx/utf.png

The point is that those people who need to edit them can edit them as easily as plain text. Imagine that we had a requirement that every piece of plain text had to be entered as hex-escaped bytes. You wouldn't accept that, and I don't accept your proposal for exactly the same reason. It is unnecessary, cryptic and bloated.

And even if you cannot type something yourself, you can still copy/paste it from other places. That's how I assembled most of the list in locale.inc. I can't read most of the scripts there, but that doesn't matter. The information stored in the characters is still there, and they will appear correctly to the people who know the language, as they have the right fonts.

Here's a live demo of a piece of locale.inc. On my side it looks just like the screenshot, save for the colors (screenshot is from Notepad2, not Mozilla).

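// Excerpt from locale.inc: each entry maps a two-letter language code
// to its English name and, where available, its native name.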
array(
    "ja" => array("Japanese", "日本語"),
    "jv" => array("Javanese"),
    "ka" => array("Georgian"),
    "kg" => array("Kongo"),
    "ki" => array("Kikuyu"),
    "kj" => array("Kwanyama"),
    "kk" => array("Kazakh", "Қазақ"),
    "kl" => array("Greenlandic"),
    "km" => array("Cambodian"),
    "kn" => array("Kannada", "ಕನ್ನಡ"),
    "ko" => array("Korean", "한국어"),
);

Note that Drupal, unlike the buggy ViewCVS script, leaves UTF-8 alone. This chunk of code will display fine for those with the right fonts installed.

Oh, and just to convince you, here's me running vi, looking at locale.inc, over an SSH connection. My LANG environment variable is set to en_GB.UTF-8 and PuTTY is also set to UTF-8:

http://acko.net/dumpx/putty-utf8.png

Unfortunately, PuTTY limits me to what it thinks are fixed-width fonts, preventing me from using a more extensive one, so fewer characters show up. Note that the terminal and vi are smart enough to see that the Japanese character is a wide one, so the cursor adapts to cover two columns.

In fact, the only thing vi does not do is bidirectionality (for Arabic, for example), but you don't need (or want) that in code anyway.

Unicode is here to stay, and you're going to run into it more in the future. You might as well set up your environment correctly now before you run into more corruption problems later.

danielc’s picture

Considering that node.module contains these unusual characters in only one place, changing them seems a worthwhile tradeoff.

Anyway, I'm really writing to say that EditPlus has fixed this problem as of version 2.20. It's in beta, available to registered users: http://www.editplus.com/betainfo.html