There is a website displaying my RSS newsfeed, but the German umlauts like ä, ü, ö are not displayed correctly on that page.

You can see it here:
http://www.boogiknight.de/index.php?option=com_newsfeeds&task=view&feedi...

On my own website all umlauts are displayed correctly, but how can I make sure they are also displayed correctly on other sites that use my newsfeed?

greetings from germany, holger

Techno Magazin www.technomusik.net

Comments

Steven’s picture

Drupal uses UTF-8 encoding, and does so correctly. Your XML feed validates. The problem is that the boogiknight.de site seems to assume all feeds are encoded with ISO-8859-1. This means it is not a valid XML parser (the specs require any parser to handle UTF-8) and that it will mess up any non-ASCII character.

You could set up an ISO-8859-1 encoded copy of your feed by grabbing it with a script and converting its encoding (use iconv or something) and changing the <?xml encoding="utf-8"?> prolog.
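
A minimal sketch of such a script in PHP (assuming the iconv extension and allow_url_fopen are available; the feed URL used here is only a placeholder for your own feed):

<?php
// Sketch: fetch the UTF-8 feed, convert it to ISO-8859-1 and fix the prolog.
// The feed URL is a placeholder; point it at your own feed.
$feedUrl = 'http://www.technomusik.net/node/feed';
$xml = file_get_contents($feedUrl);

// Convert the byte stream; //TRANSLIT approximates characters that have
// no ISO-8859-1 equivalent instead of failing on them.
$latin1 = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $xml);

// Rewrite the encoding declaration in the XML prolog to match.
$latin1 = preg_replace('/encoding="utf-8"/i', 'encoding="ISO-8859-1"', $latin1, 1);

// Serve the copy with a matching Content-Type header.
header('Content-Type: application/rss+xml; charset=ISO-8859-1');
echo $latin1;

Other sites could then subscribe to the URL of this script instead of the original feed.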

Changing Drupal's internal encoding is not possible.

--
If you have a problem, please search before posting a question.

holger’s picture

Thank you Steven. Would you please give me an example of how to make such a copy of my feed and how to finally get that working?
I think this problem will affect many German Drupal sites, because most German websites use ISO standards rather than UTF-8.
On my own site I would not change UTF-8 to ISO, but it would be great if other German sites could display my newsfeed correctly.

greetings from germany, holger

Techno Magazin http://www.technomusik.net

Edward C. Zimmermann@drupal.org’s picture

"This means it is not a valid XML parser (the specs require any parser to handle UTF-8) and that it will mess up any non-ASCII character."

Nope. The spec does not require XML to be encoded in UTF-8 or UTF-16, only that parsers must be able to handle XML encoded in these UTF encodings of UCS. Parsers need not understand other encodings, but there is (even though this violates the original intent) nothing wrong with specifying the ISO 8859-1 (Latin-1) encoding:

<?xml version="1.0" encoding="ISO-8859-1"?>

The default encoding, lacking a declaration (or byte order mark), is UTF-8.

My suggestion is to convert things to ASCII encoding, use the "standard" Latin-1 entity set for diacriticals in the first byte block (up to 255), and then (depending upon metadata and other declarations) use the &#nnnn; format. If there is much demand I might publish a tool I've developed (in C) for our RSS spider/Drupal integration. This makes sure that things don't get too zapped about in this software-imperfect world. And let's not talk too loudly about RSS perfection, as throwing HTML into RSS is not really what one is supposed to do (one can, but one needs to adopt namespaces and use XHTML, not HTML).
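
For illustration, a rough PHP sketch of that kind of re-encoding; it skips the named Latin-1 entities and simply emits numeric references for everything above ASCII (mb_ord() needs PHP 7.2+ or the mbstring polyfill):

<?php
// Sketch: re-encode a UTF-8 string as pure ASCII by replacing every character
// above 127 with a numeric character reference (&#nnnn;).
function to_ascii_with_refs($utf8) {
    $out = '';
    // Split the UTF-8 string into individual characters
    // (the /u modifier makes preg_split treat the input as UTF-8).
    foreach (preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        if (ord($char) < 128) {
            $out .= $char;                                // plain ASCII passes through
        } else {
            $out .= '&#' . mb_ord($char, 'UTF-8') . ';';  // e.g. ü -> &#252;
        }
    }
    return $out;
}

echo to_ascii_with_refs("Müller für Köln");
// prints: M&#252;ller f&#252;r K&#246;ln

The same effect can also be had in one call with mb_encode_numericentity(); the loop just makes the mechanics visible.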

killes@www.drop.org’s picture

Steven did not say that every feed has to be encoded in utf-8.

We are also very wary of using hacks just to work around quirks in other people's software.
--
Drupal services
My Drupal services

Edward C. Zimmermann@drupal.org’s picture

"We are also very wary to not use any hacks just to work out quirks in other people's software."

Well, there is hardly any way around that. Most of the software on the Web has quirks and doesn't adhere correctly to standards, many standards are twisted and illogical (I refer the reader to trace the morphing of XML from its roots as a lightweight normalized SGML into a Golem-like monster), and many "standards" are not even standards (relevant to Drupal is RSS, as in Winer versus O'Reilly). In indexing hundreds of millions of Web pages over the past years we have come to see some very "creative" interpretations of standards in Web pages and all kinds of other so-called standardized information encodings.

Let's look closer at Drupal and its RSS.
I see inlined links, images and, even worse, presentational markup. It's bad enough that it eked its way into HTML, morphing it from an attempt to provide simple descriptive markup into a poor presentational language, but it's clearly out of place in RSS (0.9x or 2.0, or 1.x save via namespaces). An image in the description of an RSS item, for example, should not be a bit of inline HTML but should be, via an extension, a cross reference, and it should probably include details about its relationship and not just its linkage. Drupal just takes the whole block of text, markup included, and throws it into an XML template. This is wrong, but we can live fine with it.

What I suggested was NOT a quirk or even a hack but a way (once upon a time, how these things were done) to encode multilingual content in a manner that will probably give you less trouble with other sites. Exporting your content with ü encoded as &uuml; makes a lot of sense. If they assume the base character set is Latin-1, there is no problem. If they correctly assume it's UTF-8, that's OK too.

killes@www.drop.org’s picture

We have no guarantee that we can map a site's content to iso-8859-1 or any other subset of utf-8. It is really up to the importing website to get their importer on track.
--
Drupal services
My Drupal services

Edward C. Zimmermann@drupal.org’s picture

"We have no guarantee that we can map a site's content to iso-8859-1 or any other subset of utf-8. It is really up to the importing website to get their importer on track."

Many people (especially in the Far East) will (and have) argued that it's not really possible to map their character sets into Unicode or UCS (which were not always identical); the use of overlaid diacriticals and the lack of language keys, given their potential significance to pictorial semantics, has found many detractors. In XML, however, encoding UCS into ASCII is simple: &#nnn; or &#xhhh;.

One can use a 7-bit ASCII baseset and encode documents that use the whole of the 4-byte Universal Character Set (UCS-4).
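
As a small illustration, a PHP sketch showing only the decimal and hexadecimal reference forms of one such character:

<?php
// Sketch: the same character written as a decimal and a hexadecimal reference.
$char = 'ü';                            // U+00FC LATIN SMALL LETTER U WITH DIAERESIS
$cp   = mb_ord($char, 'UTF-8');         // 252
printf("&#%d; or &#x%X;\n", $cp, $cp);  // prints: &#252; or &#xFC;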

Back in the old days we would typically do that and define in our declarations something like:

<!SGML "ISO 8879:1986"
			    CHARSET
BASESET  "ISO 646-1983//CHARSET
	  International Reference Version (IRV)//ESC 2/5 4/0"
.
.
.

ISO 646 is nothing other than the official international standards name for good old 7-bit ASCII. UTF-8 was just an encoding introduced a baker's dozen years ago to allow for a quick implementation of UCS streams in ordinary file systems without the need for wide characters. In XML we have markup and need neither it nor wide characters.

If a site or a browser can't deal with a character represented as &#nnnn;, then it most probably could not deal with it encoded in UTF-8 either...

laura s’s picture

A pragmatic solution could be to generate a more user-friendly feed with FeedBurner and offer a link to it as an alternative for people who have problems reading your feed directly and don't know how to reconfigure their aggregators. (The problem with UTF-8 characters seems to be built into some common blogging and CMS software.)

I've seen similar issues with directional quotes, which are often present in text I pull from other websites to use in a post. When they are displayed on some aggregator websites, they break into escaped characters, symbols or sometimes a ? mark. It's aggravating as all heck, but that's how their sites read the feeds.

.:| Laura • pingV |:.

Laura Scott :: design » blog » tweet