Include bartfeenstra/cldr to parse Unicode CLDR number patterns [#1531058]

Comment	File	Size	Author
#31	currency_1531058_12.patch	50.34 KB	xano
#28	currency_1531058_11.patch	12.28 KB	xano
#24	currency_1531058_10.patch	13.38 KB	xano
#18	currency_1531058_09.patch	10.87 KB	xano
#17	test._php.txt	1.51 KB	alan d.
#11	currency_1531058_08.patch	7.48 KB	xano
#7	currency_1531058_06-do-not-test.patch	3.02 KB	xano
#5	currency_1531058_05.patch	3.01 KB	xano
#4	currency_1531058_04.patch	3.45 KB	xano
#3	currency_1531058_02.patch	3.14 KB	xano
#2	currency_1531058_01.patch	2.99 KB	xano
#1	currency_1531058_00.patch	3.05 KB	xano

Comment #1

xano

he/they

English

Southampton

commented 20 November 2012 at 21:06

Assigned:	Unassigned	» xano
Status:	Active	» Needs review

Status	File	Size
new	currency_1531058_00.patch	3.05 KB

The patch depends on #1531974: Make currencies Ctools exportables. It adds Currency::format(), which accepts an amount and a format string. The format servers as the output value and the amount is rendered into it, after which some replacements are performed for 'tokens' (currency sign, code, and number) and to clean up unused characters from the format. It was deliberately not tied to the proposed CurrencyLocale system, so it serves as an API function for custom formats as well.

Log in or register to post comments

Comment #2

xano

he/they

English

Southampton

commented 20 November 2012 at 21:07

Status	File	Size
new	currency_1531058_01.patch	2.99 KB

The previous patch still contained a line of debugging code.

Log in or register to post comments

Comment #3

xano

he/they

English

Southampton

commented 20 November 2012 at 21:09

Status	File	Size
new	currency_1531058_02.patch	3.14 KB

Improved comments.

Log in or register to post comments

Comment #4

xano

he/they

English

Southampton

commented 20 November 2012 at 21:37

Status	File	Size
new	currency_1531058_04.patch	3.45 KB

Cleanup, code comment love.

Log in or register to post comments

Comment #5

xano

he/they

English

Southampton

commented 20 November 2012 at 21:38

Status	File	Size
new	currency_1531058_05.patch	3.01 KB

FML. 04 had debugging code...

Log in or register to post comments

Comment #6

20 November 2012 at 21:53

Status:

Needs review

» Needs work

The last submitted patch, currency_1531058_05.patch, failed testing.

Log in or register to post comments

Comment #7

xano

he/they

English

Southampton

commented 20 November 2012 at 22:00

Status:

Needs work

» Needs review

Status	File	Size
new	currency_1531058_06-do-not-test.patch	3.02 KB

Lalalaaaa.

Log in or register to post comments

Comment #8

xano

he/they

English

Southampton

commented 21 November 2012 at 10:50

+++ b/includes/Currency.inc
@@ -60,4 +60,68 @@ class Currency extends CurrencyBaseAbstract {
+   *     digits left, it will be suffixed by all of them.

Should be "prefixed".

Log in or register to post comments

Comment #9

21 November 2012 at 11:01

Status:

Needs review

» Needs work

The last submitted patch, currency_1531058_05.patch, failed testing.

Log in or register to post comments

Comment #10

xano

he/they

English

Southampton

commented 22 November 2012 at 21:12

We can do three things to make minor units work better:

Formats are tied to specific currencies they should work with.
Formats are tied to any currency that has a particular minor unit value: formats get a "minor unit" setting and they can only be used for currencies with the same minor unit.
Formats can be used for any currency: for this, we need to know the separation between the parts of formats for major and minor units, either dynamically, or through a format setting (such as a token, or by splitting the format in a format_major and a format_minor). This way we can render each unit away from the separator, starting at the separator (major: from the separator to the left, minor: from the separator to the right). This way any format can be used for any currency.

We might also want to remove every character between redundant hashes and not just periods or commas.

Log in or register to post comments

Comment #11

xano

he/they

English

Southampton

commented 23 November 2012 at 13:45

Status:

Needs work

» Needs review

Status	File	Size
new	currency_1531058_08.patch	7.48 KB

- Commenting issue fixed.
- Adds a SEPARATOR token to formats, which indicates the separation between the major and minor parts. This means formats will now work with every currency, but they should ideally have three replaceable characters (zeroes or hashes) on the right side of the separator to accommodate currencies with three decimals. The most flexible solution would be three hashes, since the format would then work for currencies with zero, one, two, or three decimals.
- Currency::format() now renders the major and minor parts individually, from the separator outward.
- Currency::format() no longer renders digits for which there are no more placeholders.
- Added tests.

See the test's assert messages to get an impression of possible formatting options.

Log in or register to post comments

Comment #12

xano

he/they

English

Southampton

commented 23 November 2012 at 13:47

+++ b/includes/Currency.inc
@@ -60,4 +60,135 @@ class Currency extends CurrencyBaseAbstract {
+      $output[$section] = preg_replace('/(\D*)#/', '', $output[$section]);

This should probably done for the entire rendered amount and not just for per major/minor section, because now a major format that ends in a comma or period will keep that comma or period even if there are no decimals to render.

Log in or register to post comments

Comment #13

alan d. commented 25 November 2012 at 05:13

It looks like you have lost support for bracketed negatives.

    if ($amount >= 0) {
-     $output = str_replace('-', '', $output);
+     $output = str_replace(array(')', '(', '-'), '', $output);
    }

Further review is difficult without a test setup. Hopefully somewhere in the old threads there is a list of all of the different formats parsed from http://www.unicode.org/ or http://cldr.unicode.org/index/downloads for direct download links of the latest cldr. Info on format is here: http://cldr.unicode.org/translation/number-patterns

Just on mute point that may not be applicable, is 1000000000 parsed as 1000000,000 or 1,000,000,000 using the format string #,##0?

When testing, an Arabic (1,00,00,00), English (1,000,000,000) and Indian (1,00,00,000) format should cover all of the bases for testing the major grouping separator, unless you are getting really keen and covering Japaneses & other Asian currencies.

Log in or register to post comments

Comment #14

xano

he/they

English

Southampton

commented 25 November 2012 at 11:27

Alan, thanks a lot for the review and the link to the Unicode number patterns documentation!

About the latest patch

I added support for negatives, but it's limited to a minus sign. Bracketed negatives don't really seem to exist. According to the docs there should be an entirely separate format for negatives, brackets or not.

Just on mute point that may not be applicable, is 1000000000 parsed as 1000000,000 or 1,000,000,000 using the format string #,##0?

It will be parsed as 0,000. The reason for that is that I didn't want to assume how groupings worked; I believe I saw some formats that used alternating groups of 2 and 3 digits. The idea was to force formats to have enough placeholders to accommodate the largest possible amount.

About the docs

¤ can be replaced by either the sign or the code. For site administrators to create formats, this would require an extra configuration option to switch between sign and code. However, I've seen custom formats that used both signs and codes, so we might want to keep replacing ¤ by the sign, and keep a custom token for the code.
If a single apostrophe needs to be inserted as a literal character, should it be escaped by surrounding it by two other single apostrophes? If this is true, what should happen to single apostrophes that do not have a matching second apostrophe? This sounds overly complex compared to escaping by prefixing.
You need to modify the patterns when: The negative pattern doesn't simply add a minus sign.

Does this mean adding a minus sign for negative amounts is the default behavior? If so, how should the sign be added?
Periods and commas are placeholders too. This would remove the need for a special SEPARATOR token, as introduced in the latest patch, but it introduces the requirement for replacement values to be configured.

In short, if we go with this formatting, I suggest we implement the currency general purpose format, and, at least to start with, without support for short numbers. To this, we add a currency code placeholder. However, the standard looks a bit fuzzy and open to implementation.

Log in or register to post comments

Comment #15

alan d. commented 25 November 2012 at 12:25

"Bracketed negatives don't really seem to exist"
This is a common requirement in many systems especially in the commerce sector, and is seen in many of the cldr formats. I have lost the scripts I used last year to extract unique currency formats, but from memory I pulled out about 2 or 3 formats from the XML files that used brackets instead of a minus character. I found no other minus markers when parsing the files and no formats that used brackets for anything else, but if you want to handle two format strings as per the docs, no objections from me!

"The idea was to force formats to have enough placeholders to accommodate the largest possible amount"
The original code did do this (though maybe not i18n safe) with the minimal format strings that boiled down to one of "#,##0", "#,#0" or "#,##,##0", although for fun you could add any possible combo like "#,##,###,#,#0" and it would produce "12,34,56,78,90,12,34,567,8,90". Basically just parsed the major component from the right until there were no more patterns given and the last pattern size is used for the rest. More friendly for site admins.

Re #1 - This should only be the symbol (which may be the code), so I like the idea of either manual tokens (XXX,999) or if you want to support tokens proper, token strings like [currency:iso3] or [currency:num3] could be better.

Re #2 - This is for number abbreviations from memory so can be ignored afaick

Re #3 - Like preferred the single format with - ( or ) and just str_replacing them out, but your call if you want to handle this, see first comment above :)

Re #4 - According to the docs, the full stop is the special placeholder that is replaced by the minor separator and the comma is the major separator.

Log in or register to post comments

Comment #16

xano

he/they

English

Southampton

commented 25 November 2012 at 13:42

This is a common requirement in many systems especially in the commerce sector, and is seen in many of the cldr formats. I have lost the scripts I used last year to extract unique currency formats, but from memory I pulled out about 2 or 3 formats from the XML files that used brackets instead of a minus character. I found no other minus markers when parsing the files and no formats that used brackets for anything else, but if you want to handle two format strings as per the docs, no objections from me!

I was just saying in terms of formatting, there is no such concept as bracketed negatives. They're just negatives, and the system couldn't care less how exactly they're formatted.

2) Not sure what you mean with that. Escaping is used in any context that uses characters with a special meaning, so it applies to currency formatting as well, because what if you want to display a literal comma and not use it as a decimal separator placeholder?

3) I'm not sure what you mean here either. Is there a default behavior when there is no format for negative amounts, or isn't there?

4) Right. I understood that and meant that my solution seems more flexible to me. I haven't been able to think of a situation in which the system wants to know what characters are used as separators. The unicode standard needs to know them for replacement, by my approach worked fine without replacements, so I don't see the added value.

//Edit: are you on IRC? It might be easier discussing a few things there than here in the queue :)

Log in or register to post comments

Comment #17

alan d. commented 25 November 2012 at 15:54

Status	File	Size
new	test._php.txt	1.51 KB

Format? Maybe you are talking about something completely different. In regards to formatting, open MS Excel, click a cell, select currency type, and two of the four options are for bracketed negatives. Screenshot.

#2 was in relation to:

If a single apostrophe needs to be inserted as a literal character, should it be escaped by surrounding it by two other single apostrophes? If this is true, what should happen to single apostrophes that do not have a matching second apostrophe? This sounds overly complex compared to escaping by prefixing.

Do whatever works for you!

After parsing every XML file last year, all formats were limited to just 9 characters: the currency symbol, hash, zero, comma, dot, dash, space and brackets and personally I would only support these simple strings myself, and maybe handle this via a theme function to allow users to markup elements. But your way does eliminate the need to know about separators.

#3 The "default" is to prefix the format with a dash.

#4 If it works, all good.

For fun I reparsed the XML files and here are the results:

¤#,##0.00
(¤#,##0.00)
¤ #,##0.00
#,##0.00¤
¤ #0.00
¤ #0.00-
¤ #,##0.00-
¤#0.00
¤ #,##,##0.00
#,##0.00 ¤
#,##,##0.00¤
(#,##,##0.00¤)
-#,##0.00¤
¤-#,##0.00
¤#,##,##0.00
¤ -#,##0.00
(¤ #,##0.00)
#0.00¤
(#0.00¤)
(#,##0.00 ¤)
‎¤#,##0.00
‎(¤#,##0.00)
#0.00 ¤
¤- #,##0.00

And non-specified negative formats

-¤#,##0.00
-¤ #,##0.00
-¤#0.00
-¤ #,##,##0.00
-#,##0.00 ¤
-¤#,##,##0.00
-¤ #0.00
-#0.00 ¤
-#,##,##0.00¤

Just 33 all up :)

Download the latest cldr zip and run the test.php (after renaming it) to set a full run down of these.

Per primary locale, en, es and ar were the most diverse:

ar: ¤ #0.00;¤ #0.00-;¤ #,##0.00;¤ #,##0.00-;¤#0.00
en: ¤#,##0.00;(¤#,##0.00);#,##0.00 ¤;¤ #,##,##0.00;¤ #0.00
es: #,##0.00 ¤;¤#,##0.00;¤-#,##0.00;¤ #,##0.00;¤ -#,##0.00;(¤ #,##0.00)

Log in or register to post comments

Comment #18

xano

he/they

English

Southampton

commented 26 November 2012 at 12:14

Status	File	Size
new	currency_1531058_09.patch	10.87 KB

Format? Maybe you are talking about something completely different. In regards to formatting, open MS Excel, click a cell, select currency type, and two of the four options are for bracketed negatives. Screenshot.

I know bracketing is a form of negative amount notation, but from the system's point of view, it doesn't matter how a negative is rendered. It should take care of supporting different locales, but not care about how those locales work. Just a bit of miscommunication :)

2) I opted to just go with the CLDR guidelines for consistency's sake.

3) Good, thanks! Would you have a link to a document that says that, so we can put it in our own documentation?

4) Like 2, I went with the CLDR guidelines here as well. This allows us to parse the CLDR XML files and generate format configurations based on the locale data they contains.

Per primary locale, en, es and ar were the most diverse:

ar: ¤ #0.00;¤ #0.00-;¤ #,##0.00;¤ #,##0.00-;¤#0.00
en: ¤#,##0.00;(¤#,##0.00);#,##0.00 ¤;¤ #,##,##0.00;¤ #0.00
es: #,##0.00 ¤;¤#,##0.00;¤-#,##0.00;¤ #,##0.00;¤ -#,##0.00;(¤ #,##0.00)

Why do those locales have more than two patterns? http://cldr.unicode.org/translation/number-patterns doesn't mention this.

The attached patch tries to conform to CLDR as much as possible. (Possible) differences:
- Two custom [XXX] and [999] tokens for the currency code and number.
- Every pattern is required to consist of a positive and negative fragment, separated by a semicolon. There is no default negative pattern yet.
- Every pattern fragment is required to have a period for a decimal separator.
- Every pattern is required to have a 00 decimal placeholder anywhere on the right side of the decimal separator.

Log in or register to post comments

Comment #19

alan d. commented 26 November 2012 at 12:32

3) Last paragraph in the first table:

If your language uses different formats for negative numbers than just adding "-" at the front, you can put in two patterns, separated by a semicolon. The first will be used for zero and positive values, while the second will be used for negative values.

4) Yep, this is a collective for multiple locales, en_US.xml = English USA, etc

en.xml
¤#,##0.00;(¤#,##0.00)
en_150.xml
#,##0.00 ¤
en_BE.xml
#,##0.00 ¤
en_BW.xml
¤#,##0.00
en_BZ.xml
¤#,##0.00
en_GB.xml
¤#,##0.00
en_HK.xml
¤#,##0.00;(¤#,##0.00)
en_IN.xml
¤ #,##,##0.00
en_JM.xml
¤#,##0.00
en_NA.xml
¤#,##0.00
en_PK.xml
¤ #,##,##0.00
en_SG.xml
¤#,##0.00;(¤#,##0.00)
en_TT.xml
¤#,##0.00
en_US_POSIX.xml
¤ #0.00
en_ZW.xml
¤#,##0.00

Sorry, haven't looked at the code but seems strange to have a minor placeholder "00" for currencies like the Laos Kips where 1,000,000 LAK is only 125 USD.

I wonder if the formatter shouldn't be part of the code, something like this maybe:

$formatter = new CurrencyFormatter(Currency $currency);
$formatter->format();

Which outputs the stock standard UNICODE formats, and if someone wants to fancy formats, you could extend the class to supply whatever you want. The above two lines could be wrapped into a factory method that initiates a globally defined formatter in case someone wants something special, falling back to the default CurrencyFormatter if the class isn't found.

I'm thinking about integration with external systems / third party modules, and supplying anything other than a raw formatted code would probably break the integration as these wouldn't expect HTML.

Not on IRC but skype or gtalk are options if you want to chat

Log in or register to post comments

Comment #20

alan d. commented 26 November 2012 at 12:41

D'uh thats sort of what you are doing.

regexp are expensive, string functions are much faster, again boils down to do you want to support complex strings.

-$fragments = preg_split("#(?<!');(?!')#", $pattern);
+$fragments = explode(';', $pattern);

The two pattern strings:

switch (count($fragments)) {
  case 0:
    throw new Exception(t('Missing currency format for @currency', array('@currency' => ?)));
  case 1:
    $positive = reset($fragments);
    $negative = '-' . $positive;
    break;
  default:
    list($positive, $negative) = $fragments;
}

Log in or register to post comments

Comment #21

alan d. commented 26 November 2012 at 12:53

I don't think you can use this.

$sign = (int) $amount < 0;

For example:

$tests = array(
  '1.01', 1.01, 0.1, 0.01, 0, 0.0, -0.01, -0.1, -1, -1.01
);
foreach ($tests as $amount) {
  $sign = (int) $amount < 0;
  print "$amount is " . ($sign ? 'negative' : 'positive') . "\n";
}

outputs

1.01 is positive
1.01 is positive
0.1 is positive
0.01 is positive
0 is positive
0 is positive
-0.01 is positive
-0.1 is positive
-1 is negative
-1.01 is negative

So cast the comparison result if you really need an int value:

$sign = (int) ($amount < 0);

Log in or register to post comments

Comment #22

xano

he/they

English

Southampton

commented 26 November 2012 at 13:09

Title:

Parse amount formats

» Parse Unicode CLDR number patterns

Sorry, haven't looked at the code but seems strange to have a minor placeholder "00" for currencies like the Laos Kips where 1,000,000 LAK is only 125 USD.

From http://cldr.unicode.org/translation/number-patterns (first row of the second table):

Note: the number of decimals will be set by the program to whatever is appropriate for the currency, so do not change them; keep exactly 2 decimals.

I interpreted this as keeping two zeroes, which will be replaced by however many decimals the used currency has.

I wonder if the formatter shouldn't be part of the code, something like this maybe:
$formatter = new CurrencyFormatter(Currency $currency);
$formatter->format();
Which outputs the stock standard UNICODE formats, and if someone wants to fancy formats, you could extend the class to supply whatever you want. The above two lines could be wrapped into a factory method that initiates a globally defined formatter in case someone wants something special, falling back to the default CurrencyFormatter if the class isn't found.

I'm thinking about integration with external systems / third party modules, and supplying anything other than a raw formatted code would probably break the integration as these wouldn't expect HTML.

The idea is to let every piece of code do its own thing. This issue is literally just about parsing CLDR number patterns. Locale delegation, which will try to find the correct pattern (either stock Unicode CLDR or custom/overridden) based on available locale parameters, will be added in another issue.
Also, parsing a pattern requires a currency, but the pattern itself does not require a specific currency (e.g. should work with any currency), so passing on a Currency to CurrencyFormat for other reasons than parsing a single pattern only limits CurrencyFormat IMHO.

I just sent you an email through your contact form with my Skype handle.

Log in or register to post comments

Comment #23

xano

he/they

English

Southampton

commented 26 November 2012 at 13:14

-$fragments = preg_split("#(?<!');(?!')#", $pattern);
+$fragments = explode(';', $pattern);

This won't work, because it will split on escaped semicolons as well.

I don't think you can use this.
$sign = (int) $amount < 0;

Good catch. Your example isn't quite right, because amounts will always be integers anyway, but the casting is indeed wrong.

Log in or register to post comments

Comment #24

xano

he/they

English

Southampton

commented 26 November 2012 at 17:21

Status	File	Size
new	currency_1531058_10.patch	13.38 KB

This patch adds support for default patterns for negative amounts, and fixes the type casting issue raised in #21. It also updates tests pattern formatting using every currency available in the system that has a minor unit set (and for this, it adds minor units to some currencies).

Log in or register to post comments

Comment #25

xano

he/they

English

Southampton

commented 29 November 2012 at 07:48

For some reason I didn't find this during my first search for existing code, but apparently PHP *can* parse CLDR number patterns using its NumberFormatter class. I haven't checked closely yet, but we can at least make it use custom patterns, and I believe we can even set custom replacement characters (meaning PHP will let us override the defaults it finds in the locale data it has).

We will also need to find out how common this extension is. Based on a quick search it seems most operating systems don't ship with ICU (a requirement for the Intl extension), so using the extension might limit Currency to people managing their server OS themselves.

Log in or register to post comments

Comment #26

xano

he/they

English

Southampton

commented 30 November 2012 at 11:46

A few recent findings/realizations:

Hindi uses one 3-digit group. All other groups are indeed 2 digits.
We probably want to support other digit character sets than Arabic (there are Hindi, Tamil, etc), as long as they have ten characters for 0-9.
Unlike some other parsers (of which I could only find a few after hours of searching), I'd like to continue support 'unusual' characters for theming purposes, as opposed to

Note: I think we already test patterns, to make sure that they are sane, eg no "###-hi mom-##0".

(quote from http://site.icu-project.org/design/number-parsing)
I'm thinking we might want to develop the parser code on Github for increased exposure. There are probably too few people here within the Drupal community with enough knowledge to pull this off, and a good PHP CLDR (currency) number pattern parser is usable for tons of other projects. Using Composer we can easily embed the code in Currency.module, which would deal with all the other Drupal-specific stuff we're planning to add.

Log in or register to post comments

Comment #27

xano

he/they

English

Southampton

commented 3 December 2012 at 18:50

I took the liberty of putting this on Github already: https://github.com/bartfeenstra/cldr. It's released under the MIT, so we can easily commit the code to the d.o repo.

The code is a massive rework of the patches posted here. It does not only support currencies, but every number format described by CLDR, except scientific notation (currency amounts, decimals, percentages), and it even does integers. Supporting integers actually makes testing and maintenance easier. It has PHPUnit tests and Composer integration.

Log in or register to post comments

Comment #28

xano

he/they

English

Southampton

commented 10 December 2012 at 16:20

Status	File	Size
new	currency_1531058_11.patch	12.28 KB

The attached patch imports bartfeenstra/cldr using Composer. This ensures that people using Currency don't have to install Composer themselves. To update the code (whenever bartfeenstra/cldr is updated), simply run composer update and commit the changes.