Menu

Tree [b959bd] default / joe / util /
 History

Read Only access


File Date Author Commit
 Blocks-8-0-0.txt 2015-09-06 Joe Allen Joe Allen [477f0b] Move utilities into joe/util. Update builtins.c
 CaseFolding-8-0-0.txt 2015-09-06 Joe Allen Joe Allen [477f0b] Move utilities into joe/util. Update builtins.c
 EastAsianWidth-8-0-0.txt 2015-09-06 Joe Allen Joe Allen [477f0b] Move utilities into joe/util. Update builtins.c
 Makefile.am 2015-09-06 Joe Allen Joe Allen [477f0b] Move utilities into joe/util. Update builtins.c
 README 2015-09-06 Joe Allen Joe Allen [477f0b] Move utilities into joe/util. Update builtins.c
 UnicodeData-8-0-0.txt 2015-09-06 Joe Allen Joe Allen [477f0b] Move utilities into joe/util. Update builtins.c
 stringify.c 2015-09-06 Joe Allen Joe Allen [477f0b] Move utilities into joe/util. Update builtins.c
 termidx.c 2015-09-06 Joe Allen Joe Allen [477f0b] Move utilities into joe/util. Update builtins.c
 uniproc.c 2015-11-16 Joseph Allen Joseph Allen [7af398] Fix warns

Read Me

# Utility programs:

## Stringify

This converts rc and .jsf files into a C file.  Use it to build the builtins.c file with:

	./stringify ../../rc/joerc ../../rc/ftyperc ../../syntax/c.jsf >../builtins.c

## Uniproc

This generates JOE's unicode database file unicat.c from Blocks.txt CaseFolding.txt EastAsianWidth.txt UnicodeData.txt:

	./uniproc Blocks-8-0-0.txt UnicodeData-8-0-0.txt CaseFolding-8-0-0.txt EastAsianWidth-8-0-0.txt >../unicat.c

## Termidx

This creates an index for a termcap file.

	./termidx </etc/termcap >/etc/termcap.idx


## Unicode notes...

from DerivedCoreProperties.txt:
see: http://unicode.org/reports/tr31/
and: http://unicode.org/reports/tr31/tr31-1.html#Pattern_Syntax

 ID_Continue: \c
    ID_Start + Mn + Mc + Nd + Pc + Other_ID_Continue
    -Pattern_Syntax?
    -Pattern_White_Space?
    Allow 200c and 200d

 ID_Start: \i
    Lu + Ll + Lt + Lm + Lo + Nl
    + Other_ID_Start
    - Pattern_Syntax
    - Pattern_White_Space
      (some languages allow \p{Sk})

 Other_ID_Start: (because they were recategorized, but kept for backwards compatibility)
    2118, 212e, 309b, 309c

 Other_ID_Continue:
    1369, 00b7, 0387, 19da

Pattern_white_space: [9-d] [20] [85] [a0] [2000-200a] [200e-200f] [2028]
[2029] [202f] [205f] [3000]

Pattern_syntax: [21-2f] [3a-40] [5b-60] [7b-7e] [a1-a7] [a9] [ab-ac] [ae]
[b0-b1] [b6-b7] [bb] [bf] [d7] [f7] [2010-2027] [2030-205e] [2190-2bff]
[3001-3003] [3008-3020] [3030] [fd3e-fd3f] [fe45-fe46]



Fields in UnicodeData.txt, see:

ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html

 1 code value
 2 character name
 3 general category
 4 canonical combining classes
 5 bidirectional category
 6 character decomposition mapping: indicates wide
 7 decimal digit value
 8 digit value
 9 numeric value
 10 mirrored
 11 unicode 1.0 name
 12 10646 comment field
 13 uppercase mapping
 14 lowercase mapping
 15 titlecase mapping


iswcntrl hierarchy in unicode:

cntrl: \p{C} (but maybe should include \p{Zl}, \p{Zp}

print: "any printable character including space" \p{L} \p{M} \p{S} \p{N} \p{P} \p{Z}

  graph: "any printable character except space" \p{L} \p{M} \p{S} \p{N} \p{P}

    alpha \p{L} \p{M}
      lower \p{Ll}
      upper \p{Lu}

    digit: 0 - 9 \p{Nd} (but then we need digit_value())

    xdigit: 0-9, a-f, A-F (maybe \p{Nd}- but we should have hex_value())

    alnum: \p{L} \p{M} \p{N}

    punct: printable, but not space or alnum

  space: \f \n \r \t \v (how about \p{Z}?)

    blank: space or tab

- note that tab is classified as a control character
- U+2028 line separator (like HTML <br>)
- U+2029 paragraph separator (like HTML <p>)
- U+0085 next line (used in EBCDIC as 0x15?)
- Maybe \p{Pc} where we use _


Unicode categories:

'-' means I see it in the UnicodeData.txt file

    \p{L} or \p{Letter}: any kind of letter from any language.
-        \p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
-        \p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
-        \p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
        \p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
-        \p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
-        \p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants. 
    \p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
-        \p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
-        \p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
-        \p{Me} or \p{Enclosing_Mark}: a character that encloses the character is is combined with (circle, square, keycap, etc.). 
    \p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
-        \p{Zs} or \p{Space_Separator}: a whitespace character that is invisible, but does take up space.
-        \p{Zl} or \p{Line_Separator}: line separator character U+2028.
-        \p{Zp} or \p{Paragraph_Separator}: paragraph separator character U+2029. 
    \p{S} or \p{Symbol}: math symbols, currency signs, dingbats, box-drawing characters, etc.
-        \p{Sm} or \p{Math_Symbol}: any mathematical symbol.
-        \p{Sc} or \p{Currency_Symbol}: any currency sign.
-        \p{Sk} or \p{Modifier_Symbol}: a combining character (mark) as a full character on its own.
-        \p{So} or \p{Other_Symbol}: various symbols that are not math symbols, currency signs, or combining characters. 
    \p{N} or \p{Number}: any kind of numeric character in any script.
-       \p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any script except ideographic scripts.
-        \p{Nl} or \p{Letter_Number}: a number that looks like a letter, such as a Roman numeral.
-        \p{No} or \p{Other_Number}: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts). 
    \p{P} or \p{Punctuation}: any kind of punctuation character.
-        \p{Pd} or \p{Dash_Punctuation}: any kind of hyphen or dash.
-        \p{Ps} or \p{Open_Punctuation}: any kind of opening bracket.
-        \p{Pe} or \p{Close_Punctuation}: any kind of closing bracket.
-        \p{Pi} or \p{Initial_Punctuation}: any kind of opening quote.
-        \p{Pf} or \p{Final_Punctuation}: any kind of closing quote.
-        \p{Pc} or \p{Connector_Punctuation}: a punctuation character such as an underscore that connects words.
-        \p{Po} or \p{Other_Punctuation}: any kind of punctuation character that is not a dash, bracket, quote or connector. 
    \p{C} or \p{Other}: invisible control characters and unused code points.
-        \p{Cc} or \p{Control}: an ASCII 0x00–0x1F or Latin-1 0x80–0x9F control character.
-        \p{Cf} or \p{Format}: invisible formatting indicator.
-        \p{Co} or \p{Private_Use}: any code point reserved for private use.
-        \p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16 encoding.
        \p{Cn} or \p{Unassigned}: any code point to which no character has been assigned. 

Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.