From srl at icu-project.org Thu Aug 7 15:19:48 2014
From: srl at icu-project.org (Steven R. Loomis)
Date: Thu, 07 Aug 2014 13:19:48 -0700
Subject: CLDR bug tracker back open
Message-ID: <53E3DF64.60906@icu-project.org>

The CLDR bug tracker is back open for submissions, with an improved spam filter. My apologies for the inconvenience.

--
IBMer but all opinions are mine.
https://www.ohloh.net/accounts/srl295 // fingerprint @ https://ssl.icu-project.org/trac/wiki/Srl

From rxaviers at gmail.com Thu Aug 21 08:20:13 2014
From: rxaviers at gmail.com (Rafael Xavier)
Date: Thu, 21 Aug 2014 10:20:13 -0300
Subject: Loose Matching questions
Message-ID:

Dear all,

Reading the Loose Matching TR35 documentation led me to the questions below. I have quoted the documentation and inlined the questions (probably newbie).

Thanks in advance for your help!

7.2 Loose Matching

> Loose matching ignores attributes of the strings being compared that are
> not important to matching. It involves the following steps:
>
>    - Remove "." from currency symbols and other fields used for matching,
>      and also from the input string unless:
>       - "." is in the decimal set, and
>       - its position in the input string is immediately before a decimal
>         digit
>    - Ignore all format characters: in particular, ignore the RLM and LRM
>      used to control BIDI formatting.

Where do I find a list of all format characters?

>    - Ignore all characters in [:Zs:] unless they occur between letters.
>      (In the heuristics below, even those between letters are ignored except to
>      delimit fields)

Where do I find a list of all [:Zs:] characters?

>    - Map all characters in [:Dash:] to U+002D HYPHEN-MINUS

Where do I find a list of all [:Dash:] characters?

>    - Use the data in the <character-fallback> element to map equivalent characters (for
>      example, curly to straight apostrophes). Other apostrophe-like characters
>      should also be treated as equivalent, especially if the character actually
>      used in a format may be unavailable on some keyboards.
>      For example:
>       - U+02BB MODIFIER LETTER TURNED COMMA (ʻ) might be typed instead as
>         U+2018 LEFT SINGLE QUOTATION MARK (‘).
>       - U+02BC MODIFIER LETTER APOSTROPHE (ʼ) might be typed instead as
>         U+2019 RIGHT SINGLE QUOTATION MARK (’), U+0027 APOSTROPHE, etc.
>       - U+05F3 HEBREW PUNCTUATION GERESH (׳) might be typed instead as
>         U+0027 APOSTROPHE.

Except for the U+05F3 example, the other two cannot be found in http://www.unicode.org/repos/cldr-aux/json/25/supplemental/characterFallbacks.json. Are both among the "other apostrophe-like characters"? Where do I find a complete list of the apostrophe-like characters? Do mappings follow an algorithm, algebraic formula or lookup table?

On http://unicode.org/reports/tr35/tr35-info.html#Supplemental_Character_Fallback_Data, there's:

> There is more than one possible fallback: the recommended usage is that
> when a character value is not in the desired repertoire the following
> process is used, whereby the first value that is wholly in the desired
> repertoire is used.
>
>    - toNFC(value)
>    - other canonically equivalent sequences, if there are any
>    - the explicit substitutes value (in order)
>    - toNFKC(value)

Does it mean that when the character being looked up is not found, the above process should be followed? Where do I find the definition of toNFC(), toNFKC(), canonical equivalence and explicit substitutes?

>    - Apply mappings particular to the domain (i.e., for dates or for
>      numbers, discussed in more detail below)

Where?

>    - Apply case folding (possibly including language-specific mappings
>      such as Turkish i)

Where do I find more information about it?

>    - Normalize to NFKC; thus no-break space will map to space; half-width
>      katakana will map to full-width.

Are both mappings (no-break space and half-width katakana) all it's about, or are there any other NFKC normalizations that should be done? Where do I find a complete list of what should be done?
Do mappings follow an algorithm, algebraic formula or lookup table?

> Loose matching involves (logically) applying the above transform to both
> the input text and to each of the field elements used in matching, before
> applying the specific heuristics below. For example, if the input number
> text is " - NA f. 1,000.00", then it is mapped to "-naf1,000.00" before
> processing. The currency signs are also transformed, so "NA f." is
> converted to "naf" for purposes of matching. As with other Unicode
> algorithms, this is a logical statement of the process; actual
> implementations can optimize, such as by applying the transform
> incrementally during matching.

"NA f." is the currency symbol for ANG (Netherlands Antillean guilder, aka Netherlands Antilles Florin according to Wikipedia). nl-CW and nl-SX define the ANG symbol as "NAf."; all other locales define it as "ANG". Following the above recommendation (to map "NA f." into "naf"), how is an implementation supposed to know that "naf" is ANG? Where do I find a mapping between "naf" and ANG?

--
+55 (16) 98138-1582, +1 (415) 568-5854, skype: rxaviers
http://rafael.xavier.blog.br

From verdy_p at wanadoo.fr Thu Aug 21 11:11:12 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 21 Aug 2014 18:11:12 +0200
Subject: Loose Matching questions
In-Reply-To:
References:
Message-ID:

2014-08-21 15:20 GMT+02:00 Rafael Xavier:

> Dear all,
>
> Reading the Loose Matching TR35 documentation led me to the
> questions below. I have quoted the documentation and inlined the questions
> (probably newbie).
>
> Thanks in advance for your help!
>
> 7.2 Loose Matching
>
>> Loose matching ignores attributes of the strings being compared that are
>> not important to matching. It involves the following steps:
>>
>>    - Remove "." from currency symbols and other fields used for
>>      matching, and also from the input string unless:
>>       - "."
>>         is in the decimal set, and
>>       - its position in the input string is immediately before a decimal
>>         digit
>>    - Ignore all format characters: in particular, ignore the RLM and LRM
>>      used to control BIDI formatting.
>
> Where do I find a list of all format characters?

Look at the general property "Cf" (format controls) in the UCD.

>>    - Ignore all characters in [:Zs:] unless they occur between letters.
>>      (In the heuristics below, even those between letters are ignored except to
>>      delimit fields)
>
> Where do I find a list of all [:Zs:] characters?

Like Cf.

>>    - Map all characters in [:Dash:] to U+002D HYPHEN-MINUS
>
> Where do I find a list of all [:Dash:] characters?

Look at "Derived properties."

>>    - Use the data in the <character-fallback> element to map equivalent characters (for
>>      example, curly to straight apostrophes). Other apostrophe-like characters
>>      should also be treated as equivalent, especially if the character actually
>>      used in a format may be unavailable on some keyboards. For example:
>>       - U+02BB MODIFIER LETTER TURNED COMMA (ʻ) might be typed instead
>>         as U+2018 LEFT SINGLE QUOTATION MARK (‘).
>>       - U+02BC MODIFIER LETTER APOSTROPHE (ʼ) might be typed instead as
>>         U+2019 RIGHT SINGLE QUOTATION MARK (’), U+0027 APOSTROPHE, etc.
>>       - U+05F3 HEBREW PUNCTUATION GERESH (׳) might be typed instead as
>>         U+0027 APOSTROPHE.
>
> Except for the U+05F3 example, the other two cannot be found in
> http://www.unicode.org/repos/cldr-aux/json/25/supplemental/characterFallbacks.json.
> Are both among the "other apostrophe-like characters"? Where do I find a complete
> list of the apostrophe-like characters? Do mappings follow an algorithm,
> algebraic formula or lookup table?

This rule is language dependent.
Some languages (notably those that were converted to the Latin script from another script, when the Latin alphabet was not enough to represent letters similar to the glottal stop, when H was already used for a breathing consonant or for digraphs) use apostrophe-like characters as plain letters and sometimes even make distinctions between a left and a right apostrophe (in that case the straight ASCII apostrophe is a bad fallback). However, loose matching frequently ignores other differences such as vowel points and cantillation marks in Arabic and Hebrew. You need to know for which context you need "loose matching" and what users expect from these matches.

> On
> http://unicode.org/reports/tr35/tr35-info.html#Supplemental_Character_Fallback_Data,
> there's:
>
>> There is more than one possible fallback: the recommended usage is that
>> when a character value is not in the desired repertoire the following
>> process is used, whereby the first value that is wholly in the desired
>> repertoire is used.
>>
>>    - toNFC(value)
>>    - other canonically equivalent sequences, if there are any
>>    - the explicit substitutes value (in order)
>>    - toNFKC(value)
>
> Does it mean that when the character being looked up is not found, the
> above process should be followed? Where do I find the definition of
> toNFC(), toNFKC(), canonical equivalence and explicit substitutes?
>
>>    - Apply mappings particular to the domain (i.e., for dates or for
>>      numbers, discussed in more detail below)
>
> Where?
>
>>    - Apply case folding (possibly including language-specific mappings
>>      such as Turkish i)
>
> Where do I find more information about it?
>
>>    - Normalize to NFKC; thus no-break space will map to space;
>>      half-width katakana will map to full-width.

All is documented in the standard. You should first read the initial chapters to learn the basic concepts, notably chapter 3 about conformance, and then look at the referenced chapters.
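
For readers following the thread, the generic steps quoted above (dropping "." outside a decimal position, ignoring format characters and [:Zs:], mapping dashes, case folding, NFKC) can be sketched in a few lines of Python. This is only an illustration, not the TR35 or ICU implementation. The function name `loose_normalize` and its `decimal_set` parameter are invented for this example; general category Pd plus U+2212 is used as a rough stand-in for [:Dash:], since Python's `unicodedata` module does not expose that property; and the locale-specific parts (character fallbacks, domain mappings, Turkish i) are left out.

```python
import unicodedata


def loose_normalize(text, decimal_set="."):
    """Approximate the generic loose-matching transform (illustrative only)."""
    kept = []
    for i, ch in enumerate(text):
        category = unicodedata.category(ch)
        # Ignore format characters (RLM, LRM, ...) and space separators.
        if category in ("Cf", "Zs"):
            continue
        # Assumption: category Pd plus U+2212 MINUS SIGN approximates [:Dash:].
        if category == "Pd" or ch == "\u2212":
            kept.append("-")
            continue
        if ch == ".":
            nxt = text[i + 1:i + 2]  # empty string at end of input
            is_decimal_point = (
                ch in decimal_set
                and nxt != ""
                and unicodedata.category(nxt) == "Nd"
            )
            # Drop "." unless it sits immediately before a decimal digit.
            if not is_decimal_point:
                continue
        kept.append(ch)
    # Case folding, then NFKC (half-width katakana becomes full-width, etc.).
    return unicodedata.normalize("NFKC", "".join(kept).casefold())


print(loose_normalize(" - NA f. 1,000.00"))  # -naf1,000.00
print(loose_normalize("NA f."))              # naf
```

Because the same function is applied to the input and to each candidate symbol, "NA f." and "NAf." both become "naf" and compare equal. Getting from there to ANG still requires a lookup over the locale's normalized currency symbols.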
From markus.icu at gmail.com Thu Aug 21 11:43:10 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Thu, 21 Aug 2014 09:43:10 -0700
Subject: Loose Matching questions
In-Reply-To:
References:
Message-ID:

What Philippe said :-)

In addition, you can get a snapshot of such properties via a web demo as well. For example,
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3ACf%3A%5D&g=
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3ADash%3A%5D&g=

But please also read some of http://www.unicode.org/reports/tr44/ and
http://www.unicode.org/reports/tr18/#Categories

This might also be useful:
http://userguide.icu-project.org/strings/unicodeset and
http://userguide.icu-project.org/strings/properties

Best regards,
markus

From rxaviers at gmail.com Fri Aug 22 13:18:00 2014
From: rxaviers at gmail.com (Rafael Xavier)
Date: Fri, 22 Aug 2014 15:18:00 -0300
Subject: Loose Matching questions
In-Reply-To:
References:
Message-ID:

Philippe and Markus, thanks a lot for the prompt answer. Will read all that...

On Thu, Aug 21, 2014 at 1:43 PM, Markus Scherer wrote:

> What Philippe said :-)
>
> In addition, you can get a snapshot of such properties via a web demo as
> well. For example,
> http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3ACf%3A%5D&g=
> http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3ADash%3A%5D&g=
>
> But please also read some of http://www.unicode.org/reports/tr44/ and
> http://www.unicode.org/reports/tr18/#Categories
>
> This might also be useful:
> http://userguide.icu-project.org/strings/unicodeset and
> http://userguide.icu-project.org/strings/properties
>
> Best regards,
> markus

--
+55 (16) 98138-1582, +1 (415) 568-5854, skype: rxaviers
http://rafael.xavier.blog.br

From zeppieri at gmail.com Fri Aug 29 00:09:17 2014
From: zeppieri at gmail.com (Jon Zeppieri)
Date: Fri, 29 Aug 2014 01:09:17 -0400
Subject: Terminology: "modified Julian day"
Message-ID:

At [http://www.unicode.org/reports/tr35/tr35-35/tr35-dates.html#Date_Field_Symbol_Table], the documentation for the 'g' date pattern reads:

"Modified Julian day. This is different from the conventional Julian day number in two regards. First, it demarcates days at local zone midnight, rather than noon GMT. Second, it is a local number; that is, it depends on the local time zone. It can be thought of as a single number that encompasses all the date-related fields."

Based on this (and the accompanying example "2451334"), it looks like this is different from the usual definition of "modified Julian day," which is (JD - 2400000.5). What's described in the documentation sounds more like a Julian day number. (Though some people use "Julian day number" to denote the day-of-year.)

I'd like to verify that the 'g' pattern is not supposed to produce (JD - 2400000.5). If that's the case, I suggest that the documentation change the terminology to avoid confusion with standard usage of "modified Julian day".

-Jon
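
To make the distinction in this message concrete, here is a small Python sketch that computes both quantities for a civil date. It assumes the proleptic Gregorian calendar as implemented by `datetime.date` and ignores time zones entirely. The helper names and constants are invented for this example and are not taken from ICU or CLDR; the offsets follow from JD 2451545.0 being noon on 2000-01-01.

```python
from datetime import date

# date.toordinal() counts days with 0001-01-01 as day 1
# (proleptic Gregorian calendar).
JDN_OFFSET = 1721425  # ordinal -> Julian day number of that civil day
MJD_OFFSET = 678576   # ordinal -> JD - 2400000.5, evaluated at midnight


def julian_day_number(d):
    """Julian day number of the civil day (the JD value at noon)."""
    return d.toordinal() + JDN_OFFSET


def modified_julian_day(d):
    """Conventional modified Julian day: JD - 2400000.5 at 00:00."""
    return d.toordinal() - MJD_OFFSET


d = date(1999, 6, 4)
print(julian_day_number(d))    # 2451334
print(modified_julian_day(d))  # 51333
```

Under these assumptions, the example value 2451334 from the documentation is the Julian day number of 4 June 1999, while the conventional modified Julian day for that date is 51333, which agrees with the reading above that 'g' is not (JD - 2400000.5).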