Commit Graph

109 Commits

Author SHA1 Message Date
Marc Mutz 2c3c479c20 util/unicode: readEastAsianWidth(): remove simplified() call
All users of the split()ed value handle intervening whitespace
already:

- fields[0] is piped through parseHexRange(), which does

- fields[1] has trimmed() called on it before lookup and all
  idnaStatusMap values are space-free (cf. initIdnaStatusMap())

As a consequence, we can accept the line by reference to const
QByteArray now.

Amends 838a7a01f3.

Pick-to: 6.10 6.9 6.8 6.5
Change-Id: I53247332c624a192fcaca6009a3f20cb8c65786a
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
2025-09-05 08:42:20 +02:00
Marc Mutz 079b97b7ac util/unicode: readArabicShaping(): remove replace() call
All users of the split()ed value handle intervening whitespace
already:

- fields[0] is piped through parseHexRange(), which does

- fields[1] has trimmed() called on it before lookup and all
  age_map values are space-free (cf. initAgeMap())

As a consequence, we can accept the line by reference to const
QByteArray now.

Amends the start of the public history.

Pick-to: 6.10 6.9 6.8 6.5
Change-Id: I1d371af33bc0b1c1b2bf28bbd3cbaf6820f8b4e8
Reviewed-by: Ahmad Samir <a.samirh78@gmail.com>
2025-09-05 08:42:14 +02:00
Marc Mutz 0738a0dd5e util/unicode: remove replace('_', "") from readScripts()
For some reason, the code stored the official Unicode script tags
without their intervening underscores, removing underscores from the
input before attempting to match, which works, as long as Unicode
stays consistent in spelling properties "Like_This".

Relying on that is brittle, though, seeing as a tag without intervening
underscore (SignWriting) already slipped into the database, potentially
matching a sought Sign_Writing. It's highly unlikely that Unicode will
start to use property names that differ only by their use of underscore,
but why risk it, and why confuse readers of code by using a different
sought string, compared to what's in the files?

Fix by storing the tags unaltered and leaving the underscores in the
input alone, too.

Amends the start of the public history.

Pick-to: 6.10 6.9 6.8 6.5
Change-Id: I5870a35812cb3fc0b28888cb09e9f42661684a26
Reviewed-by: Mårten Nordheim <marten.nordheim@qt.io>
2025-09-05 06:42:07 +00:00
Marc Mutz 587adc2217 util/unicode: use new parseHexList() in readIdnaMappingTable()
This one is straight-forward.

As a drive-by, use QString::append(QStringView) instead of iterating
over the result of QChar::fromUcs4(). It may not be faster (in fact, I
expect it to be slower), but it's much nicer to read, and this tool
doesn't need to be optimized.

Since every field is now clearly handled by functions that can handle
extra whitespace (the values looked up unsurprisingly are space-free,
too), we can drop the simplified() call and take the QByteArray 'line'
by cref.

Amends 2afe1a3c19.

Pick-to: 6.10 6.9 6.8 6.5
Change-Id: Id8900367d774ec4a6dccb89f6be73984caac2701
Reviewed-by: Ahmad Samir <a.samirh78@gmail.com>
2025-09-03 12:31:50 +02:00
Marc Mutz 5d357dfea9 util/unicode: assert no mirrored pairs exist outside BMP
QTextEngine implicitly assumes this (it's looking up mirrored
characters in UTF-16 space, without first decoding surrogates).

Add a comment there, too.

Amends 7f504283ef.

Fixes: QTBUG-139456
Pick-to: 6.10 6.9 6.8 6.5
Change-Id: Ie79b33907e71cc455434127c1752898c40b128f9
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
2025-09-02 21:31:16 +00:00
Marc Mutz c4d8e58f1d util/unicode: error out with a meaningful message if Qt is too old
The Qt version against which this tool is built need not be the same
as the Qt version for which this tool generates code. The advantage is
that we can use the latest Qt features in this tool without having to
worry about compat with older Qt versions. We might also use C++20
here in the future.

Instead of greeting prospective users of the tool with random compile
errors, check the Qt version and #error out with a descriptive message
instead.

Pick-to: 6.10 6.9 6.8 6.5
Change-Id: I2a153ee4eb6ca1a1ea7ece39c9872f3f6d746fcd
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
Reviewed-by: Ahmad Samir <a.samirh78@gmail.com>
2025-09-02 23:31:16 +02:00
Marc Mutz 6e526bf92c util/unicode: Extract Method parseHexRange()
Wrapping parseHexList(), which gets extended to support
QLatin1StringView separators, add parseHexRange() and use it around
the code to parse HHHHH[..HHHHH] hex ranges.

Amends the start of the public history.

Pick-to: 6.10 6.9 6.8 6.5
Change-Id: I0372e5c239642988f0e920d95108657e276b19dd
Reviewed-by: Ahmad Samir <a.samirh78@gmail.com>
2025-09-02 23:31:15 +02:00
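
A rough sketch of the HHHHH[..HHHHH] range parsing described in 6e526bf92c, written in plain standard C++ (C++17) rather than against the Qt APIs the tool really uses. The names CodePointRange, parseHexField() and parseHexRangeSketch() are made up for illustration, and rejecting a reversed range is an assumption of this sketch, not something the commit states:

```cpp
#include <cassert>
#include <charconv>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical result type: an inclusive range of code points.
struct CodePointRange {
    std::uint32_t first;
    std::uint32_t last;
};

// Parse one hex number, ignoring surrounding spaces and tabs.
inline std::optional<std::uint32_t> parseHexField(std::string_view s)
{
    const auto begin = s.find_first_not_of(" \t");
    if (begin == std::string_view::npos)
        return std::nullopt;
    s = s.substr(begin, s.find_last_not_of(" \t") - begin + 1);
    std::uint32_t value = 0;
    const char *end = s.data() + s.size();
    const auto [ptr, ec] = std::from_chars(s.data(), end, value, 16);
    if (ec != std::errc{} || ptr != end)
        return std::nullopt;
    return value;
}

// Parse "HHHH" or "HHHH..HHHH"; a single value yields a one-element range.
inline std::optional<CodePointRange> parseHexRangeSketch(std::string_view field)
{
    const auto dots = field.find("..");
    const auto first = parseHexField(field.substr(0, dots));
    const auto last = dots == std::string_view::npos
            ? first
            : parseHexField(field.substr(dots + 2));
    if (!first || !last || *first > *last)
        return std::nullopt;
    return CodePointRange{*first, *last};
}
```

Because each half is trimmed before it is parsed, callers of such a helper would not need to strip whitespace from the field first, which is the property the later "remove simplified()/replace() call" commits rely on.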
Marc Mutz 714969e8a1 util/unicode: Extract Method parseHexList() from readSpecialCasing()
This function, too, is duplicated all over the place, so centralize
it. Also use modern parsing techniques, like QStringTokenizer instead
of split() and QVarLengthArray instead of QList, that make the
function much more efficient.

Use it in readCaseFolding(), too, which is straight-forward.

There are many more potential users of this function; I'll port
them one by one in follow-up patches, though most are reading hex
ranges, so I'll add a function for that next.

Amends the start of the public history.

Pick-to: 6.10 6.9 6.8 6.5
Change-Id: I545a22d65a3baeaa850a7d658dcf466d2284b0fa
Reviewed-by: Ahmad Samir <a.samirh78@gmail.com>
2025-09-02 23:31:15 +02:00
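
To illustrate the idea behind 714969e8a1 (tokenize in place instead of split()ting into temporary containers), here is a small sketch in plain standard C++ (C++17). It swaps in std::string_view and std::vector for the QStringTokenizer and QVarLengthArray the commit mentions; parseHexListSketch() is a hypothetical name, and returning an empty list on malformed input is a simplification of this sketch:

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Parse a whitespace-separated list of hex code points, e.g. "0046 0066".
// The input is walked as views, so no temporary substrings are allocated.
// Returns an empty vector if any token is malformed.
inline std::vector<std::uint32_t> parseHexListSketch(std::string_view field)
{
    std::vector<std::uint32_t> result;
    std::size_t pos = 0;
    while ((pos = field.find_first_not_of(" \t", pos)) != std::string_view::npos) {
        std::size_t end = field.find_first_of(" \t", pos);
        if (end == std::string_view::npos)
            end = field.size();
        const std::string_view token = field.substr(pos, end - pos);
        std::uint32_t value = 0;
        const char *last = token.data() + token.size();
        const auto [ptr, ec] = std::from_chars(token.data(), last, value, 16);
        if (ec != std::errc{} || ptr != last)
            return {};
        result.push_back(value);
        pos = end; // continue after this token
    }
    return result;
}
```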
Marc Mutz 7602054ac2 util/unicode: Extract Method parseHex()
This code is used all over the place, so put it into a convenience
function where we can arguably put in a bit of effort to optimize
things, to wit: use QByteArrayView, and provide a better error message
than inline code could afford. E.g. the LastValidCodePoint check was
previously only in readUnicodeData().

For use in readUnicodeData(), add lineNo tracking.

In readBidiLine(), we can now drop the replace(' ', ""), because
parseHex() already trims. As a consequence, the lambda can take the
QByteArray by cref now.

There are many more places where this could be used, but they
represent higher-level constructs for which I'll add higher-level
helper functions.

Amends the start of the public history.

Pick-to: 6.10 6.9 6.8 6.5
Change-Id: Ic8f59f6509da1a0deb47a46cfaf160abb20c067e
Reviewed-by: Ahmad Samir <a.samirh78@gmail.com>
2025-09-02 21:31:15 +00:00
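
A minimal sketch of what a helper like the parseHex() of 7602054ac2 might do: trim the field, parse it as hex, and reject values above the last valid code point. This is plain standard C++ (C++17), not the tool's code; parseHexSketch() and trimmedView() are hypothetical names, and the real function reports errors with a message and line number instead of returning an empty optional:

```cpp
#include <cassert>
#include <charconv>
#include <cstdint>
#include <optional>
#include <string_view>

// Highest valid Unicode code point.
constexpr std::uint32_t LastValidCodePoint = 0x10FFFF;

// Strip leading and trailing spaces and tabs without copying.
inline std::string_view trimmedView(std::string_view s)
{
    const auto first = s.find_first_not_of(" \t");
    if (first == std::string_view::npos)
        return {};
    return s.substr(first, s.find_last_not_of(" \t") - first + 1);
}

// Parse one hex code point, tolerating surrounding whitespace.
// Returns std::nullopt on malformed input or out-of-range values.
inline std::optional<std::uint32_t> parseHexSketch(std::string_view field)
{
    field = trimmedView(field);
    if (field.empty())
        return std::nullopt;
    std::uint32_t value = 0;
    const char *end = field.data() + field.size();
    const auto [ptr, ec] = std::from_chars(field.data(), end, value, 16);
    if (ec != std::errc{} || ptr != end || value > LastValidCodePoint)
        return std::nullopt;
    return value;
}
```

Centralizing the range check this way is what lets every caller benefit from it, rather than only the one reader that happened to have it inline.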
Marc Mutz 98878bd873 util/unicode: port appendToSpecialCaseMap() from QList to QSpan
This makes the function independent of the actual input container
being passed, allowing us to port from QList to QVarLengthArray
step-by-step.

In readUnicodeData(), this allows passing the single-element case
using initializer_list instead of a temporary QList.

Amends the start of the public history (but, to be fair, QSpan wasn't
available then).

As with readLineInto() in a previous patch, QSpan is not available in
Qt 6.5, but that's not a problem because this tool doesn't need to
compile with the Qt version it is generating code for, so we can use
Qt-latest.

Pick-to: 6.10 6.9 6.8 6.5
Change-Id: I10039af9d5b82a3d23fec451bf051a868db4c343
Reviewed-by: Ahmad Samir <a.samirh78@gmail.com>
2025-09-02 23:31:15 +02:00
Marc Mutz 44da6f996a util/unicode: Extract Method readUnicodeFile()
There's about a dozen files this program reads, and in each of these
cases, the code to read the file line-by-line, remove comments (or
just LF) and trim the line before further handling is duplicated. It's
also very inefficient, we have better APIs these days (readLineInto(),
rvalue *this overloads, truncate() instead of = left(), ...). Besides,
as Mårten pointed out in review, trimmed() already removes the LF, so
we don't need to do it manually.

So Extract Method readUnicodeFile() that does that, coroutine-style
(but with function object for now), from all the readX() functions
(except readUnicodeData() itself, which is using nested readLine()s).

Also maintain a line number for later improving the error messages.

Remove some isEmpty() checks in the lambdas that, after the
refactoring, can never be true (because removing whitespace from a
trimmed() string cannot make the string empty, ditto with
simplified()).

The extracted function could even pre-split the line along `;`, but
for that, I would port each lambda to use QByteArrayView / qTokenizer
first.

Picking to all active branches, because a) this is a tool and b) we
continue to update the Unicode tables in all active branches, so the
tool to do so should not differ, unless the target branch requires it
(changed data structures, e.g.). Note that readLineInto() is not in
6.5, but the tool is not required to be built against the Qt version
it is building tables for, so we can use the latest Qt features here.

Pick-to: 6.10 6.9 6.8 6.5
Change-Id: I3b699f213c98baa45bc8bbdb7ae2ac985d893798
Reviewed-by: Mårten Nordheim <marten.nordheim@qt.io>
Reviewed-by: Ahmad Samir <a.samirh78@gmail.com>
2025-08-29 02:03:40 +02:00
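
The shape of the extracted reader in 44da6f996a (read line by line, drop comments, trim, hand the rest to a function object together with a line number) could look roughly like the following. This is a sketch in plain standard C++ (C++17) over std::istream, not the tool's QFile-based code; readUnicodeFileSketch() and collectLines() are hypothetical names, and treating '#' as the comment marker for every file is an assumption of the sketch:

```cpp
#include <cassert>
#include <functional>
#include <istream>
#include <sstream>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Calls handler(line, lineNo) for every line that is non-empty after
// stripping '#' comments and surrounding whitespace.
// Returns the number of lines read.
inline int readUnicodeFileSketch(std::istream &in,
        const std::function<void(std::string_view, int)> &handler)
{
    std::string line; // reused for every line, like readLineInto()
    int lineNo = 0;
    while (std::getline(in, line)) {
        ++lineNo;
        if (const auto hash = line.find('#'); hash != std::string::npos)
            line.resize(hash); // truncate in place instead of copying
        const std::string_view view = line;
        const auto first = view.find_first_not_of(" \t\r\n");
        if (first == std::string_view::npos)
            continue; // nothing left on this line
        const auto last = view.find_last_not_of(" \t\r\n");
        handler(view.substr(first, last - first + 1), lineNo);
    }
    return lineNo;
}

// Usage example: collect (lineNo, content) pairs from an in-memory text.
inline std::vector<std::pair<int, std::string>> collectLines(const std::string &text)
{
    std::istringstream in(text);
    std::vector<std::pair<int, std::string>> out;
    readUnicodeFileSketch(in, [&out](std::string_view l, int n) {
        out.emplace_back(n, std::string(l));
    });
    return out;
}
```

Since the handler only ever sees trimmed, non-empty lines, per-reader isEmpty() checks become unnecessary, which matches the cleanup the commit describes.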
Marc Mutz 3bf21f41a2 util/unicode: fix format error
Says GCC:

  main.cpp:3232:33: warning: format ‘%zu’ expects argument of type ‘size_t’, but argument 3 has type ‘long long unsigned int’ [-Wformat=]
   3232 |     qDebug("    memory usage: %zu bytes", specialCaseMap.size() * sizeof(unsigned short));
        |                               ~~^         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        |                                 |                               |
        |                                 long unsigned int               long long unsigned int
        |                               %llu

Fix by using %llu and an explicit cast to qulonglong (for the case of
32-bit platforms), as usual.

Pick-to: 6.10 6.9 6.8 6.5
Change-Id: Idd8c83f05880ad5e12311829d8375baaec376ac6
Reviewed-by: Ahmad Samir <a.samirh78@gmail.com>
2025-08-29 00:03:40 +00:00
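
The fix in 3bf21f41a2 follows a common pattern: cast the value to a type whose printf specifier is the same everywhere. A small sketch in plain standard C++ (qulonglong is Qt's typedef for unsigned long long, which is used directly here; memoryUsageLine() is a hypothetical helper, and snprintf stands in for qDebug()):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Format a byte count portably: the explicit cast makes the argument
// match %llu regardless of how wide size_t is on the platform.
inline std::string memoryUsageLine(std::size_t entries)
{
    char buf[64];
    std::snprintf(buf, sizeof buf, "    memory usage: %llu bytes",
                  static_cast<unsigned long long>(entries * sizeof(unsigned short)));
    return buf;
}
```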
Marc Mutz 4dd7f5bc68 util/unicode: fix pointless qPrintable(QByteArray)
QByteArray doesn't need qPrintable(), we can just pass its
data()/constData() to %s.

Also, don't output the "destroyed" field values (replace("..", '.')),
but the original ones.

Amends 2afe1a3c19 and
838a7a01f3.

Pick-to: 6.10 6.9 6.8 6.5
Change-Id: I5eb819f74075c6d6aa8989b30615c7955a60155c
Reviewed-by: Ahmad Samir <a.samirh78@gmail.com>
Reviewed-by: Mårten Nordheim <marten.nordheim@qt.io>
2025-08-28 11:58:34 +02:00
Marc Mutz 1b7418e494 QUnicodeTools: fix attribute location on properties() functions
The Q_DECL_CONST_FUNCTION needs to be on the declaration to have any
effect on callers, but it was only on the (out-of-line) definition.

Amends 2fe90a61bd.

As a drive-by, also remove the export macros from the definitions;
they, too, are only needed on the declaration.

Pick-to: 6.10 6.9 6.8 6.5
Change-Id: Id69b58c50440b8b835f7be7ba873927d07b11219
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
2025-08-27 02:39:27 +02:00
Marc Mutz 756bb4bffa util/unicode: make buildSuperstring() take input by rvalue ref
The function's docs state that the function may destroy the input, so
let the signature reflect that.

Amends ca1eeb23fa.

Pick-to: 6.10 6.9 6.8 6.5
Change-Id: I3668887c97893e7114827819d8aaef7a0b3528ce
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
2025-08-27 02:39:27 +02:00
Thiago Macieira 038d127fe5 QChar::isSpace: optimize by lowering the upper limit check
Of all the Category categories, separators are the only ones that currently
have assigned codepoints exclusively in the BMP. This allows us to lower
the maximum check from the LastValidCodepoint to a category-specific
one. This will also cause the compiler to dead-code eliminate the check
inside of qGetProperty and emit only the BMP check of the property
tables:

    if (ucs4 < 0x11000)
        return uc_properties + uc_property_trie[uc_property_trie[ucs4 >> 5] + (ucs4 & 0x1f)];

Pick-to: 6.10
Change-Id: I31eda5d79cc2c3560d90fffd74a546d1e7cda7bb
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
2025-08-19 18:04:55 -07:00
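
The two-stage lookup quoted in 038d127fe5 indexes a single array twice: the early entries hold block offsets, and the blocks themselves follow in the same array. The toy version below, in plain standard C++, builds such an array for a made-up predicate over a tiny range and then looks values up behind a single upper-bound check. Everything here (Limit, toyIsSpace(), buildTrieSketch(), lookupSketch()) is hypothetical, and the sketch stores the property value directly instead of an index into a properties table as the generated Qt tables do:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr unsigned BlockShift = 5;               // 32 code points per block
constexpr unsigned BlockSize = 1u << BlockShift;
constexpr char32_t Limit = 0x100;                // toy upper bound

// Toy predicate standing in for a real Unicode property.
inline bool toyIsSpace(char32_t c)
{
    return c == 0x20 || c == 0xA0 || (c >= 0x09 && c <= 0x0D);
}

// Build one array: the first Limit / BlockSize entries are block offsets,
// the blocks follow. Identical blocks are stored only once.
inline std::vector<std::uint16_t> buildTrieSketch(bool (*property)(char32_t))
{
    const unsigned blockCount = Limit >> BlockShift;
    std::vector<std::uint16_t> trie(blockCount);
    for (unsigned b = 0; b < blockCount; ++b) {
        std::vector<std::uint16_t> block(BlockSize);
        for (unsigned i = 0; i < BlockSize; ++i)
            block[i] = property(char32_t(b * BlockSize + i)) ? 1 : 0;
        // Look for an identical block that is already stored.
        std::size_t offset = blockCount;
        for (; offset + BlockSize <= trie.size(); offset += BlockSize) {
            if (std::equal(block.begin(), block.end(), trie.begin() + offset))
                break;
        }
        if (offset + BlockSize > trie.size()) // not found: append it
            trie.insert(trie.end(), block.begin(), block.end());
        trie[b] = static_cast<std::uint16_t>(offset);
    }
    return trie;
}

// One range check, then two dependent loads from the same array.
inline bool lookupSketch(const std::vector<std::uint16_t> &trie, char32_t ucs4)
{
    if (ucs4 >= Limit)
        return false;
    return trie[trie[ucs4 >> BlockShift] + (ucs4 & (BlockSize - 1))] != 0;
}
```

Lowering the bound that callers check against, as the commit does for separators, means only this first, cheaper path can ever be reached.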
Mårten Nordheim 85899ff181 Update UCD to Unicode 16.0.0
They added some new scripts.

There were a few changes to the line break algorithm,
most notably there are more rules that require more context than before.
While not major, there was some shuffling and additions to our
implementation to match the new rules.

IDNA test data now disallows the trailing dot/empty root label,
technically to be toggled off by an option that controls a few things,
but we don't have options. For test-data they changed the format a
little - "" is used to mean empty string, while a blank segment is
null/no string, update the parser to read this.

[ChangeLog][Third-Party Code] Updated the Unicode Character Database to
UCD revision 34/Unicode 16.

Fixes: QTBUG-132902
Task-number: QTBUG-132851
Pick-to: 6.9 6.8 6.5
Change-Id: I4569703659f6fd0f20943110a03301c1cf8cc1ed
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
2025-02-10 18:36:55 +01:00
Mårten Nordheim 62685375a2 Unicode tool: use unsigned values for the bitfields
On MSVC the values stored end up as negative.

Task-number: QTBUG-132902
Pick-to: 6.9 6.8 6.5
Change-Id: I963c57c34479041911c1364a1100d04998bdfaed
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
2025-02-04 15:08:09 +01:00
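
The commit message for 62685375a2 is terse, so the following is only a guess at the mechanism: if a bit-field has a signed underlying type, any value that sets its top bit reads back as negative. A minimal illustration in plain standard C++ (the struct and function names are hypothetical, and the three-bit width is arbitrary):

```cpp
#include <cassert>

// Same width, different signedness of the underlying type.
struct SignedFields   { signed int   value : 3; };
struct UnsignedFields { unsigned int value : 3; };

// Store v in a signed three-bit field and read it back.
inline int storeSigned(int v)
{
    SignedFields f{};
    f.value = v;
    return f.value;
}

// Store v in an unsigned three-bit field and read it back.
inline int storeUnsigned(int v)
{
    UnsignedFields f{};
    f.value = static_cast<unsigned int>(v);
    return static_cast<int>(f.value);
}
```

With two's-complement representation, storing 5 (binary 101) in the signed field reads back as -3, while the unsigned field keeps 5. Small values such as 3 survive in both.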
Mårten Nordheim f7d0366207 Unicode tool: print values using portable types
ssize_t is not universal; fails to compile in Windows.

Task-number: QTBUG-132902
Pick-to: 6.9
Change-Id: I4b8f45cba32202329ac085c7caa0a8c19a11c621
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
2025-02-04 15:08:09 +01:00
Mårten Nordheim e99d5c6268 Unicode tool: handle required QFile::open return
Fixing warnings/errors about QFile::open() return value not being
checked, and print the name of the file and the error message that
occurred.

Task-number: QTBUG-132902
Pick-to: 6.9
Change-Id: I099b300b5fd4563334fa547ffa365ec3f68e08cf
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
2025-02-04 15:08:09 +01:00
Eskil Abrahamsen Blomfeldt b614aa10b9 Extract emoji data from Unicode files
Expand unicode data to include information needed to
parse emoji sequences. This is a pre-requisite for
automatically preferring color fonts for emojis.

As a drive-by, this also fixes a double space in the
output of the uc_properties array.

Task-number: QTBUG-111801
Change-Id: Icd993803c87c69ed278c7724377028f3706d0272
Reviewed-by: Eirik Aavitsland <eirik.aavitsland@qt.io>
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
2024-08-06 10:00:08 +02:00
Edward Welbourne d5e40b5e58 Revise UCD-generated data files' SPDX headers
The existing data comes under Unicode-DFS-2016 but future updates
shall come under Unicode-3.0, so update the existing headers with the
former and the generator script with the latter. Leave a note in the
attribution file about this transitional state and how to resolve it.

Replaced UNICODE_LICENSE.txt from src/corelib/text/ with
LICENSES/Unicode-DFS-2016.txt, as fetched using reuse download.
This doesn't look like a rename but only actually adds some irrelevant
lines about where on the Unicode website the upstream files (to which
we do not apply this license) come from and changes some spacing.

Pick-to: 6.7 6.5
Fixes: QTBUG-121653
Change-Id: I50c9f4badc77a9aa402af946561aff58ae9e3e7a
Reviewed-by: Volker Hilsheimer <volker.hilsheimer@qt.io>
Reviewed-by: Kai Köhne <kai.koehne@qt.io>
2024-04-22 15:22:12 +00:00
Ievgenii Meshcheriakov 1f73d4b87c Unicode line breaking: Implement rules LB15a and LB15b
The new rules were added in Unicode 15.1 (TR #14, revision 51).

The rules read:

    LB15a: (sot | BK | CR | LF | NL | OP | QU | GL | SP | ZW)
           [\p{Pi}&QU] SP* ×
    LB15b: × [\p{Pf}&QU] (SP | GL | WJ | CL | QU | CP | EX
           | IS | SY | BK | CR | LF | NL | ZW | eot)

Add two new line breaking classes LineBreak_QU_Pi and _QU_Pf to
represent quotation characters with context that matches left
side of LB15a and right side of LB15b respectively. This way
it is still possible to use the line breaking classes table.

Also add a comment about the original source of the line
break table.

Task-number: QTBUG-121529
Change-Id: Ib35f400e39e76819cd1c3299691f7b040ea37178
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
Reviewed-by: Mårten Nordheim <marten.nordheim@qt.io>
2024-02-08 17:43:58 +01:00
Ievgenii Meshcheriakov bfd09ec38c unicode: Import version 15.1 (UCD version 32)
Add enumerator for the new Unicode version to QChar::UnicodeVersion.

Remap new line breaking classes to their Unicode 15.0 values:
* AK, AP and AS to AL,
* VI and VF to CM.
These are classes for new line breaking support for Indic scripts
that require more work.

Blacklist failing tests for now:
* tst_QUrlUts46::idnaTestV2
* tst_QTextBoundaryFinder::lineBoundariesDefault
* tst_QTextBoundaryFinder::graphemeBoundariesDefault

Regenerate the source files.

Task-number: QTBUG-121529
Change-Id: I869cc9fbaa53765d8ae6265c22cdbef9f19d05bf
Reviewed-by: Mårten Nordheim <marten.nordheim@qt.io>
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
2024-02-08 16:43:58 +00:00
Ievgenii Meshcheriakov 1e7f1e5b73 Update Unicode data version string
This amends c4e550703c. The data version
update was just forgotten when updating to Unicode 15.0.

Pick-to: 6.5 6.6 6.7
Change-Id: Ibb3e9cb81e9bbcb5d4aaf4e4df6231485531c128
Reviewed-by: Mårten Nordheim <marten.nordheim@qt.io>
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
2024-01-25 17:37:48 +00:00
Ievgenii Meshcheriakov c4e550703c Update UCD to Revision 30
This corresponds to Unicode version 15.0.0.

Added the following scripts:

    * Kawi
    * Nag Mundari

Full support of these scripts requires harfbuzz version 5.2.0,
this version adds support for Unicode 15.0:

    https://github.com/harfbuzz/harfbuzz/releases/tag/5.2.0

Fixes: QTBUG-106810
Change-Id: Ib06c526e49b0f01ef9f21123bcf875c6b19f2601
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
2022-10-11 14:10:59 +00:00
Yuhang Zhao 34c21d0407 Core: make Unicode Database constexpr
Task-number: QTBUG-100485
Pick-to: 6.3 6.2
Change-Id: I41480a34b14fd86a68a5c10b7e0f3d250e785d0f
Reviewed-by: Marc Mutz <marc.mutz@qt.io>
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
2022-05-26 12:42:36 +08:00
Ievgenii Meshcheriakov 838a7a01f3 Unicode: Extract EastAsianWidth property
This property is needed to properly implement the line breaking
algorithm from UAX #14.

Task-number: QTBUG-97537
Pick-to: 6.3
Change-Id: Ia83cc553c9ef19fae33560721630849d2a95af84
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
2022-05-24 23:07:43 +02:00
Ievgenii Meshcheriakov 1a26719c54 Unicode: Remove obsolete word break classes
Remove E_Base, Glue_After_Zwj, E_Base_GAZ, and E_Modifier obsoleted by
UTS #29, version 33 (Unicode 11.0.0).

Task-number: QTBUG-97537
Pick-to: 6.2 6.3
Change-Id: If5dc36ae17cd8746bbe81b73bbcc0863181e4a7a
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
2022-05-24 23:07:42 +02:00
Lucie Gérard 05fc3aef53 Use SPDX license identifiers
Replace the current license disclaimer in files by
a SPDX-License-Identifier.
Files that have to be modified by hand are modified.
License files are organized under LICENSES directory.

Task-number: QTBUG-67283
Change-Id: Id880c92784c40f3bbde861c0d93f58151c18b9f1
Reviewed-by: Qt CI Bot <qt_ci_bot@qt-project.org>
Reviewed-by: Lars Knoll <lars.knoll@qt.io>
Reviewed-by: Jörg Bornemann <joerg.bornemann@qt.io>
2022-05-16 16:37:38 +02:00
Ievgenii Meshcheriakov 826fc8c9bd Update UCD to Revision 28
This corresponds to Unicode version 14.0.0.

Added the following scripts:

    * CyproMinoan
    * OldUyghur
    * Tangsa
    * Toto
    * Vithkuqi

Full support of these scripts requires harfbuzz version 3.0.0,
this version adds support for Unicode 14.0:

    https://github.com/harfbuzz/harfbuzz/releases/tag/3.0.0

With this release 10 test cases in tst_qurluts46 were fixed, one
additional test case is failing in tst_qtextboundaryfinder and
is commented out. In total 62 line break test cases and 44 word
break test cases are failing.

A comment in src/corelib/text/qt_attribution.json was updated to
include the URL of the page containing UCD version number.

Fixes: QTBUG-94359
Change-Id: Iefc9ff13f3df279f91cbdb1246d56f75b20ecb35
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
2021-10-18 16:45:10 +00:00
Ievgenii Meshcheriakov 8be862687d unicode: Fix typo s/supersting/superstring/
Thanks to Konstantin Ritt for spotting it.

Change-Id: Ia3b5c4103b315cdb690fcd8b42239f000acdbef0
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
2021-09-15 21:58:12 +02:00
Ievgenii Meshcheriakov 5f3f073126 unicode: Build IDNA map superstring greedily
This slightly reduces memory required for mapping tables:

    uncompressed size: 1146 characters
    consolidated size: 904 characters
    memory usage: 47856 bytes

Task-number: QTBUG-85323
Change-Id: Ic960789e433e80acf1a4e36791533a1c55a735c8
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
2021-09-03 14:43:16 +02:00
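
The IDNA commits in this part of the log store longer mapping values as offset and length pairs pointing into one shared superstring. The sketch below shows one simple greedy way to build such a superstring in plain standard C++ (C++17); it is not claimed to be the heuristic the tool uses. Superstring, Slice and buildSuperstringSketch() are hypothetical names. The input is taken by rvalue reference because the function reorders it, in the spirit of the later buildSuperstring() signature change:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Where a value lives inside the shared superstring.
struct Slice {
    std::size_t offset;
    std::size_t length;
};

struct Superstring {
    std::u16string text;
    std::map<std::u16string, Slice> slices;
};

// Greedy heuristic: longest values first; reuse an existing occurrence if
// there is one, otherwise overlap with the current tail as far as possible.
inline Superstring buildSuperstringSketch(std::vector<std::u16string> &&values)
{
    std::sort(values.begin(), values.end(),
              [](const std::u16string &a, const std::u16string &b) {
                  return a.size() != b.size() ? a.size() > b.size() : a < b;
              });
    Superstring result;
    for (const std::u16string &v : values) {
        std::size_t pos = result.text.find(v);
        if (pos == std::u16string::npos) {
            // Longest suffix of the text that is also a prefix of v.
            std::size_t overlap = std::min(result.text.size(), v.size());
            while (overlap > 0
                   && result.text.compare(result.text.size() - overlap, overlap,
                                          v, 0, overlap) != 0) {
                --overlap;
            }
            pos = result.text.size() - overlap;
            result.text.append(v, overlap, std::u16string::npos);
        }
        result.slices[v] = Slice{pos, v.size()};
    }
    return result;
}
```

A lookup then only needs text.substr(offset, length), so the table entry for a long mapping shrinks to two small integers.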
Ievgenii Meshcheriakov 2d78b71fc6 unicode: Pack 2 QChar's into one entry for IDNA mapping
Store up to 2 QChar's for mapping values inside the mapping
table itself. This reduces the size of the superstring for
other mapping values.

results:
    uncompressed size: 1146 characters
    consolidated size: 1001 characters
    memory usage: 48050 bytes

Task-number: QTBUG-85323
Change-Id: I922a6d2037551d0532ddae1a032ec1a9890f40a7
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
2021-09-03 14:43:16 +02:00
Ievgenii Meshcheriakov ca1eeb23fa unicode: More compact IDNA mapping tables implementation
This implementation stores mappings that are 1 QChar long
inside the mapping tables; the longer mappings are stored as
offset and length pairs pointing into the common superstring
of all such mapping values.

Size comparison with the existing implementation follows.

old:
    max mapping length: 6
    memory usage: 103608 bytes

new:
    uncompressed size: 3250 characters
    consolidated size: 2367 characters
    memory usage: 50782 bytes

Task-number: QTBUG-85323
Change-Id: I9f2e32438dd463457e0fcd783136bb17145e27a8
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
2021-09-03 14:43:16 +02:00
Ievgenii Meshcheriakov 2afe1a3c19 unicode: Generate tables for IDNA/UTS #46
Update the Unicode data processing tool to generate properties
and mapping tables needed to implement UTS #46
(https://unicode.org/reports/tr46/). The implementation extends
the standard to allow usage of underscores in URLs. This is done
for compatibility with DNS-SD and SMB protocols.

The data file needed to generate the new properties was taken from
https://www.unicode.org/Public/idna/13.0.0/IdnaMappingTable.txt

Task-number: QTBUG-85323
Change-Id: I2c303bf8a08aefb18a7491fb9b55385563bfa219
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
2021-08-26 16:55:05 +02:00
Giuseppe D'Angelo a794c5e287 Unicode: fix the extended grapheme cluster algorithm
UAX #29 in Unicode 11 changed the EGC algorithm to its current form.
Although Qt has upgraded the Unicode tables all the way up to
Unicode 13, the algorithm has never been adapted; in other words,
it has been working by chance for years. Luckily, MOST
of the cases were dealt with correctly, but emoji handling
actually manages to break it.

This commit:

* Adds parsing of emoji-data.txt into the unicode table generator.
  That is necessary to extract the Extended_Pictographic property,
  which is used by the EGC algorithm.

* Regenerates the tables.

* Removes some obsoleted grapheme cluster break properties, and
  adds the ones added in the meanwhile.

* Rewrites the EGC algorithm according to Unicode 13. This is
  done by greatly simplifying the lookup table. Some rules (GB11,
  GB12, GB13) can't be done by the table alone so some hand-rolled
  code is necessary in that case.

* Thanks to these fixes, the complete upstream GraphemeBreakTest
  now passes. Remove the "edited" version that ignored some rows
  (because they were failing).

Change-Id: Iaa07cb2e6d0ab9deac28397f46d9af189d2edf8b
Pick-to: 6.1 6.0 5.15
Fixes: QTBUG-92822
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
2021-04-16 20:31:39 +02:00
Edward Welbourne 78cf89c07d Use checked string iteration in case conversions
The Unicode table code can only be safely called on valid code-points.
So code that calls it must only pass it valid Unicode data. The string
iterator's Unchecked methods only provide this guarantee
when the string being iterated is guaranteed to be valid UTF-16; while
client code should only use QString, QStringView and friends on valid
UTF-16 data, we have no way to be sure they have respected that.

So take the few extra cycles to actually check validity in the course
of iterating strings, when the resulting code-points are to be passed
to the Unicode table look-ups. Add tests that case mapping doesn't
access Unicode tables out of range (it'll trigger the new assertion).
Added some comments to qchar.h that helped me understand surrogates.

Change-Id: Iec2c3106bf1a875bdaa1d622f6cf94d7007e281e
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
2020-08-29 18:15:27 +02:00
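
What "checked" iteration buys in 78cf89c07d can be sketched as a UTF-16 decoder that validates surrogate pairs before producing a code point, so that nothing out of range ever reaches a table lookup. This is plain standard C++ (C++17), not Qt's string iterator; decodeCheckedSketch() is a hypothetical name, and substituting U+FFFD for unpaired surrogates is the choice made in this sketch:

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

constexpr char32_t ReplacementCharacter = 0xFFFD;

inline bool isHighSurrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
inline bool isLowSurrogate(char16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

// Decode UTF-16 into code points, replacing unpaired surrogates so that
// every returned value is a valid, non-surrogate code point.
inline std::vector<char32_t> decodeCheckedSketch(std::u16string_view s)
{
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t c = s[i];
        if (isHighSurrogate(c)) {
            if (i + 1 < s.size() && isLowSurrogate(s[i + 1])) {
                const char16_t low = s[++i]; // consume the pair
                out.push_back(0x10000 + ((char32_t(c) - 0xD800) << 10)
                              + (char32_t(low) - 0xDC00));
            } else {
                out.push_back(ReplacementCharacter); // lone high surrogate
            }
        } else if (isLowSurrogate(c)) {
            out.push_back(ReplacementCharacter);     // lone low surrogate
        } else {
            out.push_back(c);
        }
    }
    return out;
}
```

An unchecked decoder would skip both validity tests and combine whatever two units it finds, which is only safe when the input is already known to be well-formed UTF-16.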
Edward Welbourne 1fb35832df Simplify initialization of UnicodeData and PropertyFlags structs
Initialize values where they're declared, where possible.

Change-Id: Ib6bf33b27b19c76f406f78bc8a1bd9729bd8f2cd
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
2020-08-28 21:27:51 +02:00
Edward Welbourne a111dd26b1 Document the indexing used in the Unicode tables
Make clear why we don't need to assert against out-of-bounds accesses
in the generated code, provided the code point is within its bound.
(Using one table's early entries as indices into later in the same
table at which to look up indices into another table made it a little
hard to work out what was going on, especially as nothing told me
about the early / late distinction. Record what I discovered, to save
the next person to stumble into this some confusion.)

Change-Id: I8e5771a7f3d70c1911aeae1b0cabe5c47bc7e9c7
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
2020-08-20 09:02:00 +02:00
Edward Welbourne ca034e4e50 Inline two macros in the unicode tables
They were only used by one function each, in unicodetables.cpp, so
don't need to be macros.

Change-Id: I3e7f9f661568862d0a0d265bb8f657a8e0782b13
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
2020-08-12 13:47:56 +02:00
Edward Welbourne f12ddfaecc Tidy up unicode table generation
Eliminate some needless parentheses, tidy up some spacing and
indentation and split some long lines.  Change first += after
declaration to initializer.

Change-Id: I05ff2a6337b7ed14e0a2dc9c03fc784c92b63515
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
2020-08-05 10:02:11 +02:00
Edward Welbourne e3e6d58cad Use %zd for size-type formatting in unicode table generator
Qt6 makes sizes qsizetype; and one of these was already sizeof()-sized.
While qsizetype might not be ssize_t, it's at least no bigger, so we
can safely use its format specifier, with a suitable cast.

Change-Id: I433f654f6b139d74b4d5358b804b44ab1f0ada15
Reviewed-by: Lars Knoll <lars.knoll@qt.io>
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
2020-08-04 13:28:34 +02:00
Edward Welbourne e536fc7975 Fix deprecation warnings (s/hex/Qt::hex/gw) in unicode table generator
Removed three warnings, rather than fixing them, as Konstantin Ritt
tells me they've been redundant since Unicode 6 or so.

Change-Id: I4507e852bceb08a0252c77a8b383aceac212aad9
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
2020-08-04 13:28:34 +02:00
Edward Welbourne b69092b13e Fix compilation error in unicode table generator
Don't include a QString::number() in a sum of QByteArray and C strings.

Change-Id: I7544e835fcf5625b1fe1ee2055a48600200daafd
Reviewed-by: Lars Knoll <lars.knoll@qt.io>
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
Reviewed-by: Mårten Nordheim <marten.nordheim@qt.io>
2020-08-04 13:28:34 +02:00
Jarek Kobus a7f9d5a7fa Use QList instead of QVector in util
Task-number: QTBUG-84469
Change-Id: I077fb5c32456d438a457c1f73852313ea2ea9ae5
Reviewed-by: Friedemann Kleint <Friedemann.Kleint@qt.io>
2020-07-07 20:34:48 +02:00
Giuseppe D'Angelo 3e1d03b1ea Port Q_STATIC_ASSERT(_X) to static_assert
There is no reason to keep using our macro now that we have C++17.
The macro itself is left in for the time being, as well as its
detection logic, because it's needed for C code (not everything
supports C11 yet).  A few more cleanups will arrive in the next few
patches.

Note that this is a mere search/replace; some places were using
double braces to work around the presence of commas in a macro, no
attempt has been made to fix those.

tst_qglobal had just some minor changes to keep testing the macro.

Change-Id: I1c1c397d9f3e63db3338842bf350c9069ea57639
Reviewed-by: Lars Knoll <lars.knoll@qt.io>
2020-06-19 19:38:23 +02:00
Marc Mutz 19e7c0d2b5 QChar/QString: centralize case folding in qchar.cpp
There are (at least) two implementations of the low-level case-folding
algorithm, one of which (for QChar::toLower()) seems to be wrong (it
doesn't deal with special cases which expand to more than one code
point).

The algorithm hidden in QString and entangled with the QString
detaching code makes reusing the code much harder.

At the same time, the dependency of the algorithm on the unicode
tables makes exposing a non-allocating result type in the public API
hard. std::u16string would be an alternative if we can assure that all
implementations use SSO with at least four characters.

So, for the time being, leave this as internal API for use in an
upcoming QStringView::toLower() as well as case-insensitive hashing.

Change-Id: Iabb2611846f6176776aa20e634f44d8464f3305c
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
2020-05-09 06:25:05 +00:00
Marc Mutz 7b04e0012b QUnicodeTables: port to charNN_t
This makes existing calls passing uint or ushort ambiguous, so
fix all the callers. There do not appear to be callers outside
QtBase. In fact, the ...BreakClass() functions appear to be
utterly unused.

Change-Id: I1c2251920beba48d4909650bc1d501375c6a3ecf
Reviewed-by: Lars Knoll <lars.knoll@qt.io>
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
2020-04-27 13:08:41 +02:00
Marc Mutz 20cdf807b1 QChar: port low-level functions from uint/ushort to char32/16_t
Now that the standard gives us proper types for UTF-16 and UTF-32
characters, use them. Will eventually make the code much easier to
read than today, where uint could be an index as well as a char32_t.

It also ensures that the result of e.g. QChar::highSurrogate() can
still be implicitly converted to a QChar now that the
QChar(non-character-integral-types) ctors are being made explicit.

[ChangeLog][QtCore][QChar] All low-level functions
(e.g. highSurrogate()) now take and return char16_t instead of ushort
and char32_t instead of uint.

Change-Id: I9cd8ebf6fb998fe1075dae96c7c4484a057f0b91
Reviewed-by: Lars Knoll <lars.knoll@qt.io>
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
2020-04-24 12:45:53 +02:00