mirror of git://sourceware.org/git/glibc.git
= `Default_Ignorable_Code_Point`s should have width 0 =

Unicode specifies (https://www.unicode.org/faq/unsup_char.html#3) that
characters with the `Default_Ignorable_Code_Point` property

> should be rendered as completely invisible (and non advancing, i.e. “zero width”),
> if not explicitly supported in rendering.

Hence, `wcwidth()` should give them all a width of 0, with two exceptions:

- the soft hyphen (U+00AD SOFT HYPHEN) is assigned width 1 by longstanding precedent
- U+115F HANGUL CHOSEONG FILLER needs a carveout due to the unique behavior
  of the conjoining Korean jamo characters. One composed Hangul "syllable block"
  like 퓛 is made up of two to three individual component characters, or "jamo".
  These are all assigned an `East_Asian_Width` of `Wide` by Unicode, which would
  normally mean they would all be assigned width 2 by glibc; a combination of
  (leading choseong jamo) + (medial jungseong jamo) + (trailing jongseong jamo)
  would then have width 2 + 2 + 2 = 6. However, glibc (and other wcwidth
  implementations) special-cases jungseong and jongseong, assigning them all
  width 0, to ensure that the complete block has width 2 + 0 + 0 = 2 as it
  should. U+115F is meant for use in syllable blocks that are intentionally
  missing a leading jamo; it must be assigned a width of 2 even though it has
  no visible display to ensure that the complete block has width 2.

However, `wcwidth()` currently (before this patch) incorrectly assigns
non-zero width to U+3164 HANGUL FILLER and U+FFA0 HALFWIDTH HANGUL FILLER;
this commit fixes that.

Unicode spec references:

- Hangul: §3.12 https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf#G24646
  and §18.6 https://www.unicode.org/versions/Unicode15.0.0/ch18.pdf#G31028
- `Default_Ignorable_Code_Point`: §5.21 https://www.unicode.org/versions/Unicode15.0.0/ch05.pdf#G40095.
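The width rules described above can be sketched as follows. This is only an illustration, not glibc's implementation: `sketch_width` and the constant names are hypothetical, the `Default_Ignorable_Code_Point` set is a small hand-picked sample rather than the full property, and every code point the sketch does not model falls back to width 1.

```python
import unicodedata

SOFT_HYPHEN = 0x00AD
CHOSEONG_FILLER = 0x115F

# Hand-picked sample of Default_Ignorable_Code_Point characters (NOT the full list).
DEFAULT_IGNORABLE_SAMPLE = {0x00AD, 0x115F, 0x1160, 0x200B, 0x3164, 0xFFA0}

# Conjoining jamo ranges as listed in HangulSyllableType.txt (types L, V, T).
LEADING_JAMO = [(0x1100, 0x115F), (0xA960, 0xA97C)]
MEDIAL_JAMO = [(0x1160, 0x11A7), (0xD7B0, 0xD7C6)]
TRAILING_JAMO = [(0x11A8, 0x11FF), (0xD7CB, 0xD7FB)]


def in_ranges(cp, ranges):
    """Return True if cp lies inside any inclusive (first, last) range."""
    return any(first <= cp <= last for first, last in ranges)


def sketch_width(cp):
    """Column width under the rules in the commit message (illustrative only)."""
    if cp == SOFT_HYPHEN:
        return 1  # exception 1: width 1 by longstanding precedent
    if cp == CHOSEONG_FILLER:
        return 2  # exception 2: stands in for a missing leading jamo
    if cp in DEFAULT_IGNORABLE_SAMPLE:
        return 0  # default-ignorable, so invisible and non-advancing
    if in_ranges(cp, MEDIAL_JAMO) or in_ranges(cp, TRAILING_JAMO):
        return 0  # special-cased so a full syllable block has width 2
    if in_ranges(cp, LEADING_JAMO):
        return 2  # East_Asian_Width = Wide
    return 1  # simplification: everything else is not modelled here


def block_width(text):
    """Sum of the per-code-point widths of a string."""
    return sum(sketch_width(ord(ch)) for ch in text)


# NFD splits the precomposed syllable into leading + medial + trailing jamo.
jamo = unicodedata.normalize("NFD", "\uD4DB")  # 퓛
full_block = block_width(jamo)                          # 2 + 0 + 0
filler_block = block_width("\u115F" + jamo[1:])         # filler replaces the leading jamo
```

Both `full_block` and `filler_block` come out as 2, which is the invariant the U+115F carveout is meant to preserve; U+3164 and U+FFA0 get width 0 because nothing exempts them from the default-ignorable rule.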
= Non-`Default_Ignorable_Code_Point` format controls should be visible =

The Unicode Standard, §5.21 - Characters Ignored for Display
(https://www.unicode.org/versions/Unicode15.0.0/ch05.pdf#G40095)
says the following:

> A small number of format characters (General_Category = Cf)
> are also not given the Default_Ignorable_Code_Point property.
> This may surprise implementers, who often assume
> that all format characters are generally ignored in fallback display.
> The exact list of these exceptional format characters
> can be found in the Unicode Character Database.
> There are, however, three important sets of such format characters to note:
>
> - prepended concatenation marks
> - interlinear annotation characters
> - Egyptian hieroglyph format controls
>
> The prepended concatenation marks always have a visible display.
> See “Prepended Concatenation Marks” in [*Section 23.2, Layout Controls*](https://www.unicode.org/versions/Unicode15.1.0/ch23.pdf#M9.35858.HeadingBreak.132.Layout.Controls)
> for more discussion of the use and display of these signs.
>
> The other two notable sets of format characters that exceptionally are not ignored
> in fallback display consist of the interlinear annotation characters,
> U+FFF9 INTERLINEAR ANNOTATION ANCHOR through
> U+FFFB INTERLINEAR ANNOTATION TERMINATOR,
> and the Egyptian hieroglyph format controls,
> U+13430 EGYPTIAN HIEROGLYPH VERTICAL JOINER through
> U+1343F EGYPTIAN HIEROGLYPH END WALLED ENCLOSURE.
> These characters should have a visible glyph display for fallback rendering,
> because if they are not displayed,
> it is too easy to misread the resulting displayed text.
> See “Annotation Characters” in [*Section 23.8, Specials*](https://www.unicode.org/versions/Unicode15.1.0/ch23.pdf#M9.21335.Heading.133.Specials),
> as well as [*Section 11.4, Egyptian Hieroglyphs*](https://www.unicode.org/versions/Unicode15.1.0/ch11.pdf#M9.73291.Heading.1418.Egyptian.Hieroglyphs)
> for more discussion of the use and display of these characters.

glibc currently correctly assigns non-zero width to the prepended
concatenation marks, but it incorrectly gives zero width to the interlinear
annotation characters (which a generic terminal cannot interpret) and the
Egyptian hieroglyph format controls (which are not widely supported in
rendering implementations at present). This commit fixes both of these
issues as well.

= Derive Hangul syllable type from Unicode data =

Previously, the jungseong and jongseong jamo ranges were hard-coded into
the script. With this commit, they are instead parsed from the
HangulSyllableType.txt data file published by Unicode. This does not affect
the end result.

Signed-off-by: Jules Bertholet <julesbertholet@quoi.xyz> |
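As a rough illustration of deriving the jamo ranges from data instead of hard-coding them, the sketch below parses lines in the `HangulSyllableType.txt` format (`first..last ; type # comment`). The function name `parse_hangul_syllable_types` and the abbreviated sample input are assumptions made for this example; the script changed by this commit may be structured quite differently.

```python
def parse_hangul_syllable_types(lines):
    """Map each syllable type (L, V, T, ...) to a list of inclusive ranges.

    Hypothetical helper: accepts any iterable of text lines in the
    HangulSyllableType.txt format and ignores comments and blank lines.
    """
    ranges = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop trailing comment
        if not line:
            continue
        cp_field, type_field = (f.strip() for f in line.split(";", 1))
        start, _, end = cp_field.partition("..")
        first = int(start, 16)
        last = int(end, 16) if end else first  # single code point entry
        ranges.setdefault(type_field, []).append((first, last))
    return ranges


# Abbreviated sample in the file's format (not the complete data file).
SAMPLE = """\
# leading consonants
1100..115F    ; L # Lo
A960..A97C    ; L # Lo

# medial vowels
1160..11A7    ; V # Lo
D7B0..D7C6    ; V # Lo

# trailing consonants
11A8..11FF    ; T # Lo
D7CB..D7FB    ; T # Lo

AC00          ; LV # Lo
"""

parsed = parse_hangul_syllable_types(SAMPLE.splitlines())
# Types V (jungseong) and T (jongseong) are the ones given width 0.
zero_width_jamo = parsed["V"] + parsed["T"]
```

With the real data file as input, `zero_width_jamo` would play the role of the previously hard-coded jungseong and jongseong ranges, which is why the change does not affect the end result.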
||
|---|---|---|
| .. | ||
| ANSI_X3.4-1968 | ||
| ANSI_X3.110-1983 | ||
| ARMSCII-8 | ||
| ASMO_449 | ||
| BIG5 | ||
| BIG5-HKSCS | ||
| BRF | ||
| BS_4730 | ||
| BS_VIEWDATA | ||
| CP737 | ||
| CP770 | ||
| CP771 | ||
| CP772 | ||
| CP773 | ||
| CP774 | ||
| CP775 | ||
| CP949 | ||
| CP1125 | ||
| CP1250 | ||
| CP1251 | ||
| CP1252 | ||
| CP1253 | ||
| CP1254 | ||
| CP1255 | ||
| CP1256 | ||
| CP1257 | ||
| CP1258 | ||
| CP10007 | ||
| CSA_Z243.4-1985-1 | ||
| CSA_Z243.4-1985-2 | ||
| CSA_Z243.4-1985-GR | ||
| CSN_369103 | ||
| CWI | ||
| DEC-MCS | ||
| DIN_66003 | ||
| DS_2089 | ||
| EBCDIC-AT-DE | ||
| EBCDIC-AT-DE-A | ||
| EBCDIC-CA-FR | ||
| EBCDIC-DK-NO | ||
| EBCDIC-DK-NO-A | ||
| EBCDIC-ES | ||
| EBCDIC-ES-A | ||
| EBCDIC-ES-S | ||
| EBCDIC-FI-SE | ||
| EBCDIC-FI-SE-A | ||
| EBCDIC-FR | ||
| EBCDIC-IS-FRISS | ||
| EBCDIC-IT | ||
| EBCDIC-PT | ||
| EBCDIC-UK | ||
| EBCDIC-US | ||
| ECMA-CYRILLIC | ||
| ES | ||
| ES2 | ||
| EUC-JISX0213 | ||
| EUC-JP | ||
| EUC-JP-MS | ||
| EUC-KR | ||
| EUC-TW | ||
| GB2312 | ||
| GB18030 | ||
| GBK | ||
| GB_1988-80 | ||
| GEORGIAN-ACADEMY | ||
| GEORGIAN-PS | ||
| GOST_19768-74 | ||
| GREEK-CCITT | ||
| GREEK7 | ||
| GREEK7-OLD | ||
| HP-GREEK8 | ||
| HP-ROMAN8 | ||
| HP-ROMAN9 | ||
| HP-THAI8 | ||
| HP-TURKISH8 | ||
| IBM037 | ||
| IBM038 | ||
| IBM256 | ||
| IBM273 | ||
| IBM274 | ||
| IBM275 | ||
| IBM277 | ||
| IBM278 | ||
| IBM280 | ||
| IBM281 | ||
| IBM284 | ||
| IBM285 | ||
| IBM290 | ||
| IBM297 | ||
| IBM420 | ||
| IBM423 | ||
| IBM424 | ||
| IBM437 | ||
| IBM500 | ||
| IBM850 | ||
| IBM851 | ||
| IBM852 | ||
| IBM855 | ||
| IBM856 | ||
| IBM857 | ||
| IBM858 | ||
| IBM860 | ||
| IBM861 | ||
| IBM862 | ||
| IBM863 | ||
| IBM864 | ||
| IBM865 | ||
| IBM866 | ||
| IBM866NAV | ||
| IBM868 | ||
| IBM869 | ||
| IBM870 | ||
| IBM871 | ||
| IBM874 | ||
| IBM875 | ||
| IBM880 | ||
| IBM891 | ||
| IBM903 | ||
| IBM904 | ||
| IBM905 | ||
| IBM918 | ||
| IBM922 | ||
| IBM1004 | ||
| IBM1026 | ||
| IBM1047 | ||
| IBM1124 | ||
| IBM1129 | ||
| IBM1132 | ||
| IBM1133 | ||
| IBM1160 | ||
| IBM1161 | ||
| IBM1162 | ||
| IBM1163 | ||
| IBM1164 | ||
| IEC_P27-1 | ||
| INIS | ||
| INIS-8 | ||
| INIS-CYRILLIC | ||
| INVARIANT | ||
| ISIRI-3342 | ||
| ISO-8859-1 | ||
| ISO-8859-2 | ||
| ISO-8859-3 | ||
| ISO-8859-4 | ||
| ISO-8859-5 | ||
| ISO-8859-6 | ||
| ISO-8859-7 | ||
| ISO-8859-8 | ||
| ISO-8859-9 | ||
| ISO-8859-9E | ||
| ISO-8859-10 | ||
| ISO-8859-11 | ||
| ISO-8859-13 | ||
| ISO-8859-14 | ||
| ISO-8859-15 | ||
| ISO-8859-16 | ||
| ISO-IR-90 | ||
| ISO-IR-197 | ||
| ISO-IR-209 | ||
| ISO_646.BASIC | ||
| ISO_646.IRV | ||
| ISO_2033-1983 | ||
| ISO_5427 | ||
| ISO_5427-EXT | ||
| ISO_5428 | ||
| ISO_6937 | ||
| ISO_6937-2-25 | ||
| ISO_6937-2-ADD | ||
| ISO_8859-1,GL | ||
| ISO_8859-SUPP | ||
| ISO_10367-BOX | ||
| ISO_10646 | ||
| ISO_11548-1 | ||
| IT | ||
| JIS_C6220-1969-JP | ||
| JIS_C6220-1969-RO | ||
| JIS_C6229-1984-A | ||
| JIS_C6229-1984-B | ||
| JIS_C6229-1984-B-ADD | ||
| JIS_C6229-1984-HAND | ||
| JIS_C6229-1984-HAND-ADD | ||
| JIS_C6229-1984-KANA | ||
| JIS_X0201 | ||
| JOHAB | ||
| JUS_I.B1.002 | ||
| JUS_I.B1.003-MAC | ||
| JUS_I.B1.003-SERB | ||
| KOI-8 | ||
| KOI8-R | ||
| KOI8-RU | ||
| KOI8-T | ||
| KOI8-U | ||
| KSC5636 | ||
| LATIN-GREEK | ||
| LATIN-GREEK-1 | ||
| MAC-CENTRALEUROPE | ||
| MAC-CYRILLIC | ||
| MAC-IS | ||
| MAC-SAMI | ||
| MAC-UK | ||
| MACINTOSH | ||
| MIK | ||
| MSZ_7795.3 | ||
| NATS-DANO | ||
| NATS-DANO-ADD | ||
| NATS-SEFI | ||
| NATS-SEFI-ADD | ||
| NC_NC00-10 | ||
| NEXTSTEP | ||
| NF_Z_62-010 | ||
| NF_Z_62-010_1973 | ||
| NS_4551-1 | ||
| NS_4551-2 | ||
| PT | ||
| PT2 | ||
| PT154 | ||
| RK1048 | ||
| SAMI | ||
| SAMI-WS2 | ||
| SEN_850200_B | ||
| SEN_850200_C | ||
| SHIFT_JIS | ||
| SHIFT_JISX0213 | ||
| T.61-7BIT | ||
| T.61-8BIT | ||
| T.101-G2 | ||
| TCVN5712-1 | ||
| TIS-620 | ||
| TSCII | ||
| UTF-8 | ||
| VIDEOTEX-SUPPL | ||
| VISCII | ||
| WINDOWS-31J | ||