glibc/localedata/charmaps
Jules Bertholet 25c9c3789e localedata: Fix several issues with the set of characters considered 0-width [BZ #31370]
= `Default_Ignorable_Code_Point`s should have width 0 =

Unicode specifies (https://www.unicode.org/faq/unsup_char.html#3) that characters
with the `Default_Ignorable_Code_Point` property

> should be rendered as completely invisible (and non advancing, i.e. “zero width”),
if not explicitly supported in rendering.

Hence, `wcwidth()` should give them all a width of 0, with two exceptions:

- the soft hyphen (U+00AD SOFT HYPHEN) is assigned width 1 by longstanding precedent
- U+115F HANGUL CHOSEONG FILLER needs a carveout
  due to the unique behavior of the conjoining Korean jamo characters.
  One composed Hangul "syllable block" like 퓛
  is made up of two to three individual component characters, or "jamo".
  These are all assigned an `East_Asian_Width` of `Wide`
  by Unicode, which would normally mean they would all be assigned
  width 2 by glibc; a combination of (leading choseong jamo) +
  (medial jungseong jamo) + (trailing jongseong jamo) would then have width 2 + 2 + 2 = 6.
  However, glibc (and other wcwidth implementations) special-cases jungseong and jongseong,
  assigning them all width 0,
  to ensure that the complete block has width 2 + 0 + 0 = 2 as it should.
  U+115F is meant for use in syllable blocks
  that are intentionally missing a leading jamo;
  it must be assigned a width of 2 even though it has no visible display
  to ensure that the complete block has width 2.

However, `wcwidth()` currently (before this patch)
incorrectly assigns non-zero width to
U+3164 HANGUL FILLER and U+FFA0 HALFWIDTH HANGUL FILLER;
this commit fixes that.

Unicode spec references:
- Hangul:  §3.12 https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf#G24646 and
  §18.6 https://www.unicode.org/versions/Unicode15.0.0/ch18.pdf#G31028
- `Default_Ignorable_Code_Point`: §5.21 https://www.unicode.org/versions/Unicode15.0.0/ch05.pdf#G40095.

= Non-`Default_Ignorable_Code_Point` format controls should be visible =

The Unicode Standard, §5.21 - Characters Ignored for Display
(https://www.unicode.org/versions/Unicode15.0.0/ch05.pdf#G40095)
says the following:

> A small number of format characters (General_Category = Cf )
> are also not given the Default_Ignorable_Code_Point property.
> This may surprise implementers, who often assume
> that all format characters are generally ignored in fallback display.
> The exact list of these exceptional format characters
> can be found in the Unicode Character Database.
> There are, however, three important sets of such format characters to note:
>
> - prepended concatenation marks
> - interlinear annotation characters
> - Egyptian hieroglyph format controls
>
> The prepended concatenation marks always have a visible display.
> See “Prepended Concatenation Marks” in [*Section 23.2, Layout Controls*](https://www.unicode.org/versions/Unicode15.1.0/ch23.pdf#M9.35858.HeadingBreak.132.Layout.Controls)
> for more discussion of the use and display of these signs.
>
> The other two notable sets of format characters that exceptionally are not ignored
> in fallback display consist of the interlinear annotation characters,
> U+FFF9 INTERLINEAR ANNOTATION ANCHOR through
> U+FFFB INTERLINEAR ANNOTATION TERMINATOR,
> and the Egyptian hieroglyph format controls,
> U+13430 EGYPTIAN HIEROGLYPH VERTICAL JOINER through
> U+1343F EGYPTIAN HIEROGLYPH END WALLED ENCLOSURE.
> These characters should have a visible glyph display for fallback rendering,
> because if they are not displayed,
> it is too easy to misread the resulting displayed text.
> See “Annotation Characters” in [*Section 23.8, Specials*](https://www.unicode.org/versions/Unicode15.1.0/ch23.pdf#M9.21335.Heading.133.Specials),
> as well as [*Section 11.4, Egyptian Hieroglyphs*](https://www.unicode.org/versions/Unicode15.1.0/ch11.pdf#M9.73291.Heading.1418.Egyptian.Hieroglyphs)
> for more discussion of the use and display of these characters.

glibc currently correctly assigns non-zero width to the prepended concatenation marks,
but it incorrectly gives zero width to the interlinear annotation characters
(which a generic terminal cannot interpret)
and the Egyptian hieroglyph format controls
(which are not widely supported in rendering implementations at present).
This commit fixes both these issues as well.

= Derive Hangul syllable type from Unicode data =

Previosuly, the jungseong and jongseong jamo ranges
were hard-coded into the script. With this commit, they are instead parsed
from the HangulSyllableType.txt data file published by Unicode.
This does not affect the end result.

Signed-off-by: Jules Bertholet <julesbertholet@quoi.xyz>
2024-05-15 14:31:06 +02:00
..
ANSI_X3.4-1968
ANSI_X3.110-1983
ARMSCII-8
ASMO_449
BIG5
BIG5-HKSCS
BRF
BS_4730
BS_VIEWDATA
CP737
CP770
CP771
CP772
CP773
CP774
CP775
CP949
CP1125
CP1250
CP1251
CP1252
CP1253
CP1254 Bug 21399: Fix CP1254 comment for U+00EC 2017-04-19 08:10:35 -04:00
CP1255
CP1256
CP1257
CP1258
CP10007 localedata: change M$ to Microsoft 2016-08-10 00:49:14 +08:00
CSA_Z243.4-1985-1
CSA_Z243.4-1985-2
CSA_Z243.4-1985-GR
CSN_369103
CWI
DEC-MCS
DIN_66003
DS_2089
EBCDIC-AT-DE
EBCDIC-AT-DE-A
EBCDIC-CA-FR
EBCDIC-DK-NO
EBCDIC-DK-NO-A
EBCDIC-ES
EBCDIC-ES-A
EBCDIC-ES-S
EBCDIC-FI-SE
EBCDIC-FI-SE-A
EBCDIC-FR
EBCDIC-IS-FRISS
EBCDIC-IT
EBCDIC-PT
EBCDIC-UK
EBCDIC-US
ECMA-CYRILLIC
ES
ES2
EUC-JISX0213
EUC-JP
EUC-JP-MS
EUC-KR
EUC-TW
GB2312
GB18030 add GB18030-2022 charmap and test the entire GB18030 charmap [BZ #30243] 2023-08-29 19:02:30 +02:00
GBK localedata: GBK: add mapping for 0x80->Euro sign [BZ #20864] 2016-11-26 17:20:22 -05:00
GB_1988-80
GEORGIAN-ACADEMY
GEORGIAN-PS
GOST_19768-74
GREEK-CCITT
GREEK7
GREEK7-OLD
HP-GREEK8
HP-ROMAN8
HP-ROMAN9
HP-THAI8
HP-TURKISH8
IBM037
IBM038
IBM256 localedata: Use U+00AF MACRON in more EBCDIC charsets [BZ #27882] 2021-05-18 07:21:45 +02:00
IBM273 localedata: Make IBM273 compatible with ISO-8859-1 [BZ #23290] 2018-06-14 22:34:10 +02:00
IBM274
IBM275
IBM277 localedata: Use U+00AF MACRON in more EBCDIC charsets [BZ #27882] 2021-05-18 07:21:45 +02:00
IBM278 localedata: Use U+00AF MACRON in more EBCDIC charsets [BZ #27882] 2021-05-18 07:21:45 +02:00
IBM280 localedata: Use U+00AF MACRON in more EBCDIC charsets [BZ #27882] 2021-05-18 07:21:45 +02:00
IBM281
IBM284 localedata: Use U+00AF MACRON in more EBCDIC charsets [BZ #27882] 2021-05-18 07:21:45 +02:00
IBM285
IBM290
IBM297 localedata: Use U+00AF MACRON in more EBCDIC charsets [BZ #27882] 2021-05-18 07:21:45 +02:00
IBM420
IBM423
IBM424 localedata: Use U+00AF MACRON in more EBCDIC charsets [BZ #27882] 2021-05-18 07:21:45 +02:00
IBM437
IBM500
IBM850
IBM851
IBM852
IBM855
IBM856
IBM857
IBM858 Add new codepage charmaps/IBM858 [BZ #21084] 2017-09-14 15:50:57 +02:00
IBM860
IBM861
IBM862
IBM863
IBM864
IBM865
IBM866
IBM866NAV
IBM868
IBM869
IBM870
IBM871
IBM874
IBM875 charmaps: IBM875: fix mapping of iota/upsilon variants [BZ #18453] 2016-05-07 19:55:55 -04:00
IBM880
IBM891
IBM903
IBM904
IBM905
IBM918
IBM922
IBM1004
IBM1026
IBM1047
IBM1124
IBM1129
IBM1132
IBM1133
IBM1160
IBM1161
IBM1162
IBM1163
IBM1164
IEC_P27-1
INIS
INIS-8
INIS-CYRILLIC
INVARIANT
ISIRI-3342
ISO-8859-1
ISO-8859-2
ISO-8859-3
ISO-8859-4
ISO-8859-5
ISO-8859-6
ISO-8859-7
ISO-8859-8
ISO-8859-9
ISO-8859-9E
ISO-8859-10
ISO-8859-11
ISO-8859-13
ISO-8859-14
ISO-8859-15
ISO-8859-16
ISO-IR-90
ISO-IR-197
ISO-IR-209
ISO_646.BASIC
ISO_646.IRV
ISO_2033-1983
ISO_5427
ISO_5427-EXT
ISO_5428
ISO_6937
ISO_6937-2-25
ISO_6937-2-ADD
ISO_8859-1,GL
ISO_8859-SUPP
ISO_10367-BOX
ISO_10646
ISO_11548-1
IT
JIS_C6220-1969-JP
JIS_C6220-1969-RO
JIS_C6229-1984-A
JIS_C6229-1984-B
JIS_C6229-1984-B-ADD
JIS_C6229-1984-HAND
JIS_C6229-1984-HAND-ADD
JIS_C6229-1984-KANA
JIS_X0201
JOHAB
JUS_I.B1.002
JUS_I.B1.003-MAC
JUS_I.B1.003-SERB
KOI-8
KOI8-R
KOI8-RU
KOI8-T
KOI8-U
KSC5636
LATIN-GREEK
LATIN-GREEK-1
MAC-CENTRALEUROPE
MAC-CYRILLIC
MAC-IS
MAC-SAMI
MAC-UK
MACINTOSH
MIK
MSZ_7795.3
NATS-DANO
NATS-DANO-ADD
NATS-SEFI
NATS-SEFI-ADD
NC_NC00-10
NEXTSTEP
NF_Z_62-010
NF_Z_62-010_1973
NS_4551-1
NS_4551-2
PT
PT2
PT154
RK1048
SAMI
SAMI-WS2
SEN_850200_B
SEN_850200_C
SHIFT_JIS
SHIFT_JISX0213
T.61-7BIT
T.61-8BIT
T.101-G2
TCVN5712-1
TIS-620
TSCII
UTF-8 localedata: Fix several issues with the set of characters considered 0-width [BZ #31370] 2024-05-15 14:31:06 +02:00
VIDEOTEX-SUPPL
VISCII
WINDOWS-31J