From srl at icu-project.org  Thu Aug  7 15:19:48 2014
From: srl at icu-project.org (Steven R. Loomis)
Date: Thu, 07 Aug 2014 13:19:48 -0700
Subject: CLDR bug tracker back open
Message-ID: <53E3DF64.60906@icu-project.org>

The CLDR bug tracker is back open for submissions, with an improved spam
filter.
My apologies for the inconvenience.

-- 

IBMer but all opinions are mine.
https://www.ohloh.net/accounts/srl295 // fingerprint @ https://ssl.icu-project.org/trac/wiki/Srl


From rxaviers at gmail.com  Thu Aug 21 08:20:13 2014
From: rxaviers at gmail.com (Rafael Xavier)
Date: Thu, 21 Aug 2014 10:20:13 -0300
Subject: Loose Matching questions
Message-ID: <CADdLYsrdTL1cVp013oYaV2RvJXwyf64juNE6MRQwhKCGp-JAeQ@mail.gmail.com>

Dear all,

Reading the Loose Matching section of TR35
<http://unicode.org/reports/tr35/#Loose_Matching> led me to the questions
below. I have quoted the documentation and inlined my questions (probably
newbie ones).

Thanks in advance for your help!

7.2 Loose Matching
> Loose matching ignores attributes of the strings being compared that are
> not important to matching. It involves the following steps:
>
>    - Remove "." from currency symbols and other fields used for matching,
>    and also from the input string unless:
>       - "." is in the decimal set, and
>       - its position in the input string is immediately before a decimal
>       digit
>    - Ignore all format characters: in particular, ignore the RLM and LRM
>    used to control BIDI formatting.
>
Where do I find a list of all format characters?


>    - Ignore all characters in [:Zs:] unless they occur between letters.
>    (In the heuristics below, even those between letters are ignored except to
>    delimit fields)
>
Where do I find a list of all [:Zs:] characters?


>    - Map all characters in [:Dash:] to U+002D HYPHEN-MINUS
>
Where do I find a list of all [:Dash:] characters?


>    - Use the data in the <character-fallback> element to map equivalent
>    characters (for example, curly to straight apostrophes). Other
>    apostrophe-like characters should also be treated as equivalent,
>    especially if the character actually used in a format may be
>    unavailable on some keyboards. For example:
>       - U+02BB MODIFIER LETTER TURNED COMMA (ʻ) might be typed instead as
>       U+2018 LEFT SINGLE QUOTATION MARK (‘).
>       - U+02BC MODIFIER LETTER APOSTROPHE (ʼ) might be typed instead as
>       U+2019 RIGHT SINGLE QUOTATION MARK (’), U+0027 APOSTROPHE, etc.
>       - U+05F3 HEBREW PUNCTUATION GERESH (׳) might be typed instead as
>       U+0027 APOSTROPHE.
>
The U+05F3 example aside, the other two cannot be found in
http://www.unicode.org/repos/cldr-aux/json/25/supplemental/characterFallbacks.json.
Are both of them among the "other apostrophe-like characters"? Where do I
find a complete list of the apostrophe-like characters? Do the mappings
follow an algorithm, an algebraic formula, or a lookup table?

On
http://unicode.org/reports/tr35/tr35-info.html#Supplemental_Character_Fallback_Data,
there's:

> There is more than one possible fallback: the recommended usage is that
> when a character value is not in the desired repertoire the following
> process is used, whereby the first value that is wholly in the desired
> repertoire is used.
>
>    - toNFC(value)
>    - other canonically equivalent sequences, if there are any
>    - the explicit substitutes value (in order)
>    - toNFKC(value)
>
Does this mean that when the character being looked up is not found, the
above process should be followed? Where do I find the definitions of toNFC(),
toNFKC(), canonical equivalence, and explicit substitutes?
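
One possible reading of the quoted process, as a minimal Python sketch. The
function name first_fallback, the in_repertoire predicate, and the ASCII-only
repertoire are assumptions made for illustration, not CLDR data; NFD stands
in for "other canonically equivalent sequences".

```python
import unicodedata

def first_fallback(value, substitutes, in_repertoire):
    """Return the first candidate wholly inside the repertoire, else None."""
    candidates = [
        unicodedata.normalize("NFC", value),   # toNFC(value)
        unicodedata.normalize("NFD", value),   # one canonically equivalent sequence
        *substitutes,                          # explicit substitutes, in order
        unicodedata.normalize("NFKC", value),  # toNFKC(value)
    ]
    for candidate in candidates:
        if all(in_repertoire(ch) for ch in candidate):
            return candidate
    return None

# Hypothetical repertoire: ASCII only.
def is_ascii(ch):
    return ord(ch) < 128

# The substitute for U+05F3 is taken from the TR35 example quoted earlier.
print(first_fallback("\u05f3", ["'"], is_ascii))
# No explicit substitutes here, so the search falls through to NFKC.
print(first_fallback("\ufb01", [], is_ascii))
```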


>    - Apply mappings particular to the domain (i.e., for dates or for
>    numbers, discussed in more detail below)
>
Where are these domain-specific mappings discussed?


>    - Apply case folding (possibly including language-specific mappings
>    such as Turkish i)
>
Where do I find more information about case folding?


>    - Normalize to NFKC; thus no-break space will map to space; half-width
>    katakana will map to full-width.
>
Are these two mappings (no-break space and half-width katakana) all there is
to it, or are there other NFKC normalizations that should be done? Where do I
find a complete list of what should be done? Do the mappings follow an
algorithm, an algebraic formula, or a lookup table?

> Loose matching involves (logically) applying the above transform to both
> the input text and to each of the field elements used in matching, before
> applying the specific heuristics below. For example, if the input number
> text is " - NA f. 1,000.00", then it is mapped to "-naf1,000.00" before
> processing. The currency signs are also transformed, so "NA f." is
> converted to "naf" for purposes of matching. As with other Unicode
> algorithms, this is a logical statement of the process; actual
> implementations can optimize, such as by applying the transform
> incrementally during matching.
>
"NA f." is the currency symbol for ANG (Netherlands Antillean guilder, also
known as the Netherlands Antilles florin according to Wikipedia
<http://en.wikipedia.org/wiki/Netherlands_Antillean_guilder>). The nl-CW and
nl-SX locales define the ANG symbol as "NAf."; all other locales define it
as "ANG".

Following the above recommendation (to map "NA f." to "naf"), how is an
implementation supposed to know that "naf" is ANG? Where do I find a mapping
between "naf" and ANG?
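
One way to read the quoted passage, as a minimal Python sketch: the same
transform is applied to the locale's currency symbols and to the input, and
the transformed symbol is then looked up in a table built from the locale
data. The names loose_transform and SYMBOLS, the partial dash set, and the
choice to drop every [:Zs:] character (as the quoted note on the heuristics
allows) are assumptions for illustration only.

```python
import unicodedata

# Partial, hand-picked set of dash characters (assumption, not the full [:Dash:] set).
DASHES = {"\u2010", "\u2011", "\u2012", "\u2013", "\u2014", "\u2015", "\u2212"}

def loose_transform(text, decimal_set="."):
    out = []
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        # Remove "." unless it is a decimal separator directly before a digit.
        if ch == "." and not ("." in decimal_set and nxt.isdecimal()):
            continue
        # Ignore format characters and space separators.
        if unicodedata.category(ch) in ("Cf", "Zs"):
            continue
        # Map dashes to U+002D HYPHEN-MINUS.
        out.append("-" if ch in DASHES else ch)
    # Default case folding, then NFKC, in the order the steps are listed.
    return unicodedata.normalize("NFKC", "".join(out).casefold())

# Symbol as defined in nl-CW and nl-SX, keyed by its transformed form.
SYMBOLS = {loose_transform("NAf."): "ANG"}

print(loose_transform(" - NA f. 1,000.00"))
print(SYMBOLS.get(loose_transform("NA f.")))
```

The "." test runs on the untransformed input so that the dot in "f. 1" is
judged before the following space is dropped.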


-- 
+55 (16) 98138-1582, +1 (415) 568-5854, skype: rxaviers
http://rafael.xavier.blog.br

From verdy_p at wanadoo.fr  Thu Aug 21 11:11:12 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 21 Aug 2014 18:11:12 +0200
Subject: Loose Matching questions
In-Reply-To: <CADdLYsrdTL1cVp013oYaV2RvJXwyf64juNE6MRQwhKCGp-JAeQ@mail.gmail.com>
References: <CADdLYsrdTL1cVp013oYaV2RvJXwyf64juNE6MRQwhKCGp-JAeQ@mail.gmail.com>
Message-ID: <CAGa7JC1KbirrcXHTgDy+LBEQZ-8GqEnCLK5VJXtkjkqAAw-ASQ@mail.gmail.com>

2014-08-21 15:20 GMT+02:00 Rafael Xavier <rxaviers at gmail.com>:

> Dear all,
>
> Reading the Loose Matching TR35 documentation
> <http://unicode.org/reports/tr35/#Loose_Matching> led me to the
> questions below. I have quoted the documentation and inlined the questions
> (probably newbie).
>
> Thanks in advance for your help!
>
> 7.2 Loose Matching
>> Loose matching ignores attributes of the strings being compared that are
>> not important to matching. It involves the following steps:
>>
>>    - Remove "." from currency symbols and other fields used for
>>    matching, and also from the input string unless:
>>       - "." is in the decimal set, and
>>       - its position in the input string is immediately before a decimal
>>       digit
>>    - Ignore all format characters: in particular, ignore the RLM and LRM
>>    used to control BIDI formatting.
>>
> Where do I find a list of all format characters?
>

Look at the General_Category value "Cf" (format characters) in the UCD.

>
>>    - Ignore all characters in [:Zs:] unless they occur between letters.
>>    (In the heuristics below, even those between letters are ignored except to
>>    delimit fields)
>>
> Where do I find a list of all [:Zs:] characters?
>
Same as for Cf: "Zs" (Space_Separator) is also a General_Category value.
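
As a rough illustration (not part of the UCD itself), the General_Category of
each code point can be queried with Python's standard unicodedata module. The
helper name chars_in_category is made up for this sketch, and the result
depends on the Unicode version bundled with the interpreter.

```python
import sys
import unicodedata

def chars_in_category(category):
    """Return all code points whose General_Category equals `category`."""
    return [chr(cp) for cp in range(sys.maxunicode + 1)
            if unicodedata.category(chr(cp)) == category]

format_chars = chars_in_category("Cf")      # includes LRM (U+200E) and RLM (U+200F)
space_separators = chars_in_category("Zs")  # includes SPACE and NO-BREAK SPACE

print(len(format_chars), len(space_separators))
```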

>
>>    - Map all characters in [:Dash:] to U+002D HYPHEN-MINUS
>>
> Where do I find a list of all [:Dash:] characters?
>
Dash is a binary property; look at PropList.txt in the UCD.
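
Python's unicodedata module does not expose the Dash property, so a sketch
has to read it from the UCD data file. The example below parses a
three-line excerpt in the PropList.txt line format; the excerpt is heavily
abbreviated (the real file lists more Dash ranges), and the names
parse_property and map_dashes are made up for illustration.

```python
import re

# Abbreviated excerpt in the PropList.txt line format (assumption: a real
# implementation would read the complete file instead).
PROPLIST_EXCERPT = """\
002D          ; Dash # Pd       HYPHEN-MINUS
2010..2015    ; Dash # Pd   [6] HYPHEN..HORIZONTAL BAR
2212          ; Dash # Sm       MINUS SIGN
"""

# "XXXX" or "XXXX..YYYY", then ";", then the property name.
LINE = re.compile(r"^([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s*(\w+)")

def parse_property(text, prop):
    """Collect the characters that the given text assigns to `prop`."""
    chars = set()
    for line in text.splitlines():
        m = LINE.match(line)
        if m and m.group(3) == prop:
            first = int(m.group(1), 16)
            last = int(m.group(2) or m.group(1), 16)
            chars.update(chr(cp) for cp in range(first, last + 1))
    return chars

DASHES = parse_property(PROPLIST_EXCERPT, "Dash")

def map_dashes(s):
    """Map every character in DASHES to U+002D HYPHEN-MINUS."""
    return "".join("-" if ch in DASHES else ch for ch in s)

print(map_dashes("\u22125\u20133"))
```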

>
>>    - Use the data in the <character-fallback> element to map equivalent
>>    characters (for example, curly to straight apostrophes). Other
>>    apostrophe-like characters should also be treated as equivalent,
>>    especially if the character actually used in a format may be
>>    unavailable on some keyboards. For example:
>>       - U+02BB MODIFIER LETTER TURNED COMMA (ʻ) might be typed instead
>>       as U+2018 LEFT SINGLE QUOTATION MARK (‘).
>>       - U+02BC MODIFIER LETTER APOSTROPHE (ʼ) might be typed instead as
>>       U+2019 RIGHT SINGLE QUOTATION MARK (’), U+0027 APOSTROPHE, etc.
>>       - U+05F3 HEBREW PUNCTUATION GERESH (׳) might be typed instead as
>>       U+0027 APOSTROPHE.
>>
> The U+05F3 example aside, the other two cannot be found in
> http://www.unicode.org/repos/cldr-aux/json/25/supplemental/characterFallbacks.json.
> Are both of them among the "other apostrophe-like characters"? Where do I
> find a complete list of the apostrophe-like characters? Do the mappings
> follow an algorithm, an algebraic formula, or a lookup table?
>
This rule is language-dependent. Some languages (notably those that were
converted to the Latin script from another script, where the Latin alphabet
was not enough to represent letters such as the glottal stop, and H was
already used for a breathing consonant or in digraphs) use apostrophe-like
characters as plain letters and sometimes even distinguish between a
left and a right apostrophe (in which case the straight ASCII apostrophe is a
bad fallback).

However, loose matching frequently ignores other differences as well, such as
vowel points and cantillation marks in Arabic and Hebrew. You need to know
the context in which you need "loose matching" and what users expect
from these matches.
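
Where folding apostrophe-like characters together is acceptable for the
language at hand, a plain lookup table is enough. The sketch below uses only
the code points from the TR35 examples quoted above; the table name
APOSTROPHE_LIKE and the function fold_apostrophes are made up for
illustration, and the table is neither complete nor suitable for languages
that distinguish these characters.

```python
# Hypothetical table built from the TR35 examples; not a complete list.
APOSTROPHE_LIKE = {
    0x02BB: "'",  # MODIFIER LETTER TURNED COMMA
    0x2018: "'",  # LEFT SINGLE QUOTATION MARK
    0x02BC: "'",  # MODIFIER LETTER APOSTROPHE
    0x2019: "'",  # RIGHT SINGLE QUOTATION MARK
    0x05F3: "'",  # HEBREW PUNCTUATION GERESH
}

def fold_apostrophes(text, table=APOSTROPHE_LIKE):
    """Replace each listed character with U+0027 APOSTROPHE."""
    return text.translate(table)

print(fold_apostrophes("Hawai\u02bbi"))
print(fold_apostrophes("O\u2019Neil"))
```

Passing a different table per language keeps the language-dependent choice
out of the function itself.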

> On
> http://unicode.org/reports/tr35/tr35-info.html#Supplemental_Character_Fallback_Data,
> there's:
>
>> There is more than one possible fallback: the recommended usage is that
>> when a character value is not in the desired repertoire the following
>> process is used, whereby the first value that is wholly in the desired
>> repertoire is used.
>>
>>    - toNFC(value)
>>    - other canonically equivalent sequences, if there are any
>>    - the explicit substitutes value (in order)
>>    - toNFKC(value)
>>
> Does this mean that when the character being looked up is not found, the
> above process should be followed? Where do I find the definitions of
> toNFC(), toNFKC(), canonical equivalence, and explicit substitutes?
>
>
>>    - Apply mappings particular to the domain (i.e., for dates or for
>>    numbers, discussed in more detail below)
>>
> Where are these domain-specific mappings discussed?
>
>
>>    - Apply case folding (possibly including language-specific mappings
>>    such as Turkish i)
>>
> Where do I find more information about case folding?
>

>>    - Normalize to NFKC; thus no-break space will map to space;
>>    half-width katakana will map to full-width.
>>
Everything is documented in the standard. You should first read the initial
chapters to learn the basic concepts, notably chapter 3 on conformance,
and then look at the referenced chapters.
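
NFKC is defined by the Unicode normalization algorithm (UAX #15) rather than
by a short list of mappings, so in practice it is applied through a library
call. A small illustration with Python's standard library follows; the
helper name nfkc is made up here. Note that str.casefold() performs only the
default, language-independent folding, so the Turkish i handling mentioned
in the quoted text would need extra tailoring.

```python
import unicodedata

def nfkc(text):
    """Normalize text to NFKC."""
    return unicodedata.normalize("NFKC", text)

# The two mappings named in the quoted TR35 text.
print(nfkc("\u00a0") == " ")       # no-break space compared with space
print(nfkc("\uff76") == "\u30ab")  # half-width katakana KA compared with full-width KA

# Canonical composition (NFC) and default case folding.
print(unicodedata.normalize("NFC", "e\u0301") == "\u00e9")
print("NAf".casefold())
```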



From markus.icu at gmail.com  Thu Aug 21 11:43:10 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Thu, 21 Aug 2014 09:43:10 -0700
Subject: Loose Matching questions
In-Reply-To: <CAGa7JC1KbirrcXHTgDy+LBEQZ-8GqEnCLK5VJXtkjkqAAw-ASQ@mail.gmail.com>
References: <CADdLYsrdTL1cVp013oYaV2RvJXwyf64juNE6MRQwhKCGp-JAeQ@mail.gmail.com>
 <CAGa7JC1KbirrcXHTgDy+LBEQZ-8GqEnCLK5VJXtkjkqAAw-ASQ@mail.gmail.com>
Message-ID: <CAN49p6o+YG3vdYA5ACj016Z4aj_HJH1zPOPCJin2ivDj=quTcw@mail.gmail.com>

What Philippe said :-)

In addition, you can get a snapshot of such properties via a web demo as
well. For example,
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3ACf%3A%5D&g=
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3ADash%3A%5D&g=

But please also read some of http://www.unicode.org/reports/tr44/ and
http://www.unicode.org/reports/tr18/#Categories

This might also be useful:
http://userguide.icu-project.org/strings/unicodeset and
http://userguide.icu-project.org/strings/properties

Best regards,
markus

From rxaviers at gmail.com  Fri Aug 22 13:18:00 2014
From: rxaviers at gmail.com (Rafael Xavier)
Date: Fri, 22 Aug 2014 15:18:00 -0300
Subject: Loose Matching questions
In-Reply-To: <CAN49p6o+YG3vdYA5ACj016Z4aj_HJH1zPOPCJin2ivDj=quTcw@mail.gmail.com>
References: <CADdLYsrdTL1cVp013oYaV2RvJXwyf64juNE6MRQwhKCGp-JAeQ@mail.gmail.com>
 <CAGa7JC1KbirrcXHTgDy+LBEQZ-8GqEnCLK5VJXtkjkqAAw-ASQ@mail.gmail.com>
 <CAN49p6o+YG3vdYA5ACj016Z4aj_HJH1zPOPCJin2ivDj=quTcw@mail.gmail.com>
Message-ID: <CADdLYsobW4vYCmouBTUe75v9MrBMc_LPhua=Q6TZkP1Nh-wuxw@mail.gmail.com>

Philippe and Markus, thanks a lot for the prompt answers. I will read all of
that.


On Thu, Aug 21, 2014 at 1:43 PM, Markus Scherer <markus.icu at gmail.com>
wrote:

> What Philippe said :-)
>
> In addition, you can get a snapshot of such properties via a web demo as
> well. For example,
> http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3ACf%3A%5D&g=
> http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3ADash%3A%5D&g=
>
> But please also read some of http://www.unicode.org/reports/tr44/ and
> http://www.unicode.org/reports/tr18/#Categories
>
> This might also be useful:
> http://userguide.icu-project.org/strings/unicodeset and
> http://userguide.icu-project.org/strings/properties
>
> Best regards,
> markus
>


-- 
+55 (16) 98138-1582, +1 (415) 568-5854, skype: rxaviers
http://rafael.xavier.blog.br

From zeppieri at gmail.com  Fri Aug 29 00:09:17 2014
From: zeppieri at gmail.com (Jon Zeppieri)
Date: Fri, 29 Aug 2014 01:09:17 -0400
Subject: Terminology: "modified Julian day"
Message-ID: <CAKfDxxxJHzyv7YAv2y5639yLT4YFOiVHgXzteEgBQ9O-xu_mrA@mail.gmail.com>

At [http://www.unicode.org/reports/tr35/tr35-35/tr35-dates.html#Date_Field_Symbol_Table],
the documentation for the 'g' date pattern reads:

"Modified Julian day. This is different from the conventional Julian
day number in two regards. First, it demarcates days at local zone
midnight, rather than noon GMT. Second, it is a local number; that is,
it depends on the local time zone. It can be thought of as a single
number that encompasses all the date-related fields."

Based on this (and the accompanying example "2451334"), it looks like
this is different from the usual definition of "modified Julian day,"
which is (JD - 2400000.5). What's described in the documentation
sounds more like a Julian day number. (Though some people use "Julian
day number" to denote the day-of-year.)
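
To make the difference concrete, here is a small Python sketch comparing the
two conventions for a calendar date. The helper names are made up for this
example, and the date is treated as a plain local calendar date with no
time-zone handling. Under these assumptions the example value 2451334 falls
on 4 June 1999, while the conventional modified Julian day for that date is
51333.

```python
from datetime import date

# Julian day number of a date = proleptic Gregorian ordinal + 1721425
# (for example, 1970-01-01 has ordinal 719163 and Julian day number 2440588).
JDN_OFFSET = 1721425

def julian_day_number(d):
    """Integer Julian day number assigned to the calendar date d."""
    return d.toordinal() + JDN_OFFSET

def modified_julian_day(d):
    """Conventional MJD at 00:00 of date d.

    MJD = JD - 2400000.5, and at midnight JD = JDN - 0.5,
    so MJD = JDN - 2400001.
    """
    return julian_day_number(d) - 2400001

d = date(1999, 6, 4)
print(julian_day_number(d), modified_julian_day(d))
```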

I'd like to verify that the 'g' pattern is not supposed to produce (JD
- 2400000.5). If that's the case, I suggest that the documentation
change the terminology to avoid confusion with standard usage of
"modified Julian day".

-Jon