Amends 772b62c91e.
The amended commit removed the definition from both the generator and
the generated code, but only removed the declaration from the
generated code, not the generator.
Pick-to: 6.10.0 6.10
Change-Id: I2f41aad9777a8c27f80edb9b7ef7a97e1871ffbb
Reviewed-by: Ahmad Samir <a.samirh78@gmail.com>
... in all the lambdas, which now receive a span<QByteArrayView>
instead.
Of course, this isn't a 1:1 move, but ports to QStringTokenizer and
QVarLengthArray as a drive-by.
Since everything is now QByteArrayView, we're now implicitly depending
on QHash heterogeneous lookup, a Qt 6.8 feature, so mention that in
the QT_VERSION check comment (just FYI; we've already required 6.9,
anyway).
Pick-to: 6.10 6.9 6.8 6.5
Change-Id: Ide6bcce5e1cd28c42f0091b5bcefb89d6278b6a9
Reviewed-by: Ahmad Samir <a.samirh78@gmail.com>
... instead of QByteArrays.
Define a new qPrintableView(x) macro that expands to x.size(),
x.data(), suitably cast so "%.*s" will accept it, then apply it to all
strings printed in qFatal()s that might become views soon.
As a drive-by, improve the (touched) messages by mentioning the line
number and using "" to delimit variable output.
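A minimal sketch of the idea in standard C++, assuming std::string_view in place of the Qt view types and snprintf() in place of qFatal(). PRINTABLE_VIEW and formatField() are hypothetical names chosen for illustration, not the tool's actual code:

```cpp
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical stand-in for the macro described above: expands to the two
// arguments that "%.*s" consumes, an int precision followed by a char pointer.
#define PRINTABLE_VIEW(x) static_cast<int>((x).size()), (x).data()

// Formats a message that mentions the line number and delimits the
// variable output with double quotes.
std::string formatField(std::string_view field, int lineNo)
{
    char buf[128];
    std::snprintf(buf, sizeof buf, "unexpected field \"%.*s\" in line %d",
                  PRINTABLE_VIEW(field), lineNo);
    return buf;
}
```

Because the precision bounds how many bytes are read, this also works for views that are not NUL-terminated, such as a substring of a longer line.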
Pick-to: 6.10 6.9 6.8 6.5
Change-Id: I82434f6c8522a84daf18367a8ab5cafb74453f1c
Reviewed-by: Ahmad Samir <a.samirh78@gmail.com>
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
Clang was already doing it, but GCC (at least in LTO mode) wasn't and
was repeatedly calling qGetProp(). This has the benefit that, in most
cases, the input character whose property we seek is UTF-16, so dead
code-elimination removes the extra branch - this can happen when QString
functions go through the QChar front-end, like QChar::isSpace() or
isSymbol(), which route through the char32_t overload.
This forced inlining allows us to remove the UCS2 overloads of qGetProp()
and properties(), because the same const-propagation will apply to all
but one of the places where UTF-16 code units were being compared. The
16-bit qGetProp() was only used in qstring.cpp's convertCase_helper(),
whose 16-bit overload was only used in foldCase(). The one exception to
this is qtextengine.cpp's QBidiAlgorithm::resolveN0():
const QUnicodeTables::Properties *p = QUnicodeTables::properties(char16_t{text[pos].unicode()});
This will now call the full UTF-32 overload.
Pick-to: 6.10
Change-Id: Ifa4f2d77475877f26be2fffd9a987ff994dc8ef1
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
Two functions indexed into the split() fields without first checking
whether there are sufficiently many fields to index into.
Add the missing Q_ASSERT()s. Yes, a qFatal() with line-number would be
nicer, but all other functions do it like this, so I'm asking for
forgiveness that I do it here, too.
Amends the start of the public history.
Pick-to: 6.10 6.9 6.8 6.5
Change-Id: Ief054796f2e058a331037540bc2a633ec7f64f2c
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
Some users of the split()ed value handled intervening whitespace
already:
- the first field is piped through parseHexRange(), which does
- the second field was missing the trimmed() call before lookup.
Added. All looked-up values are space-free (cf. resp. init*()
functions), so that's enough, too.
As a consequence, we can accept the lines by reference to const
QByteArray now and, now that all lambdas have the same signature,
change readUnicodeFile() from a template to a regular function
taking qxp::function_ref callbacks.
Amends a794c5e287 (readEmojiData())
and the start of the public history (rest).
Pick-to: 6.10 6.9 6.8 6.5
Change-Id: I442855a183552aa90d24810023793e6464b18162
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
Some users of the split()ed value handled intervening whitespace
already:
- fields[0..1] are piped through parseHex(), which does
- fields[2] is unused
- fields[3] needs to be trimmed, so do that.
As a consequence, we can accept the line by reference to const
QByteArray now.
Amends the start of the public history.
Pick-to: 6.10 6.9 6.8 6.5
Change-Id: I60371820cd143b980c81a1077d9c3e34528f1830
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
All users of the split()ed value handle intervening whitespace
already:
- fields[0] is piped through parseHexRange(), which does
- fields[1] has trimmed() called on it before lookup and all
idnaStatusMap values are space-free (cf. initIdnaStatusMap())
As a consequence, we can accept the line by reference to const
QByteArray now.
Amends 838a7a01f3.
Pick-to: 6.10 6.9 6.8 6.5
Change-Id: I53247332c624a192fcaca6009a3f20cb8c65786a
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
All users of the split()ed value handle intervening whitespace
already:
- fields[0] is piped through parseHexRange(), which does
- fields[1] has trimmed() called on it before lookup and all
age_map values are space-free (cf. initAgeMap())
As a consequence, we can accept the line by reference to const
QByteArray now.
Amends the start of the public history.
Pick-to: 6.10 6.9 6.8 6.5
Change-Id: I1d371af33bc0b1c1b2bf28bbd3cbaf6820f8b4e8
Reviewed-by: Ahmad Samir <a.samirh78@gmail.com>
For some reason, the code stored the official Unicode script tags
without their intervening underscores, removing underscores from the
input before attempting to match, which works, as long as Unicode
stays consistent in spelling properties "Like_This".
Relying on that is brittle, though, seeing as a tag without intervening
underscore (SignWriting) already slipped into the database, potentially
matching a sought Sign_Writing. It's highly unlikely that Unicode will
start to use property names that differ only by their use of underscore,
but why risk it, and why confuse readers of code by using a different
sought string, compared to what's in the files?
Fix by storing the tags unaltered and leaving the underscores in the
input alone, too.
Amends the start of the public history.
Pick-to: 6.10 6.9 6.8 6.5
Change-Id: I5870a35812cb3fc0b28888cb09e9f42661684a26
Reviewed-by: Mårten Nordheim <marten.nordheim@qt.io>
This one is straight-forward.
As a drive-by, use QString::append(QStringView) instead of iterating
over the result of QChar::fromUcs4(). It may not be faster, in fact, I
expect it to be slower, but it's much nicer to read, and this tool
doesn't need to be optimized.
Since every field is now clearly handled by functions that can handle
extra whitespace (the values looked up unsurprisingly are space-free,
too), we can drop the simplified() call and take the QByteArray 'line'
by cref.
Amends 2afe1a3c19.
Pick-to: 6.10 6.9 6.8 6.5
Change-Id: Id8900367d774ec4a6dccb89f6be73984caac2701
Reviewed-by: Ahmad Samir <a.samirh78@gmail.com>
The Qt version against which this tool is built need not be the same
as the Qt version for which this tool generates code. The advantage is
that we can use the latest Qt features in this tool without having to
worry about compat with older Qt versions. We might also use C++20
here in the future.
Instead of greeting prospective users of the tool with random compile
errors, check the Qt version and #error out with a descriptive message
instead.
Pick-to: 6.10 6.9 6.8 6.5
Change-Id: I2a153ee4eb6ca1a1ea7ece39c9872f3f6d746fcd
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
Reviewed-by: Ahmad Samir <a.samirh78@gmail.com>
Wrapping parseHexList(), which gets extended to support
QLatin1StringView separators, add parseHexRange() and use it around
the code to parse HHHHH[..HHHHH] hex ranges.
Amends the start of the public history.
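A rough sketch of parsing the HHHHH[..HHHHH] format in standard C++, assuming std::string_view and std::from_chars in place of the Qt types. The names parseHexSketch() and parseHexRangeSketch(), and the choice to report errors via std::optional, are illustrative assumptions and not the tool's actual signatures:

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>
#include <utility>

// Parse one hexadecimal code point, ignoring surrounding spaces and tabs.
// Returns nullopt for malformed input or values above U+10FFFF.
std::optional<char32_t> parseHexSketch(std::string_view s)
{
    const auto first = s.find_first_not_of(" \t");
    if (first == std::string_view::npos)
        return std::nullopt; // empty or whitespace-only field
    const auto last = s.find_last_not_of(" \t");
    s = s.substr(first, last - first + 1);

    unsigned long value = 0;
    const char *end = s.data() + s.size();
    const auto [ptr, ec] = std::from_chars(s.data(), end, value, 16);
    if (ec != std::errc{} || ptr != end || value > 0x10FFFF)
        return std::nullopt;
    return static_cast<char32_t>(value);
}

// Parse "HHHH" or "HHHH..HHHH" into an inclusive (first, last) pair.
// A single value yields a range of length one.
std::optional<std::pair<char32_t, char32_t>> parseHexRangeSketch(std::string_view s)
{
    const auto sep = s.find("..");
    if (sep == std::string_view::npos) {
        const auto single = parseHexSketch(s);
        if (!single)
            return std::nullopt;
        return std::pair<char32_t, char32_t>{*single, *single};
    }
    const auto first = parseHexSketch(s.substr(0, sep));
    const auto last = parseHexSketch(s.substr(sep + 2));
    if (!first || !last || *first > *last)
        return std::nullopt;
    return std::pair<char32_t, char32_t>{*first, *last};
}
```

Since the single-value parser trims its input, callers of the range parser need not strip whitespace from the field first.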
Pick-to: 6.10 6.9 6.8 6.5
Change-Id: I0372e5c239642988f0e920d95108657e276b19dd
Reviewed-by: Ahmad Samir <a.samirh78@gmail.com>
This function, too, is duplicated all over the place, so centralize
it. Also use modern parsing techniques, like QStringTokenizer instead
of split() and QVarLengthArray instead of QList, that make the
function much more efficient.
Use it in readCaseFolding(), too, which is straight-forward.
There are many more potential users of this function; I'll port
them one by one in follow-up patches, though most are reading hex
ranges, so I'll add a function for that next.
Amends the start of the public history.
Pick-to: 6.10 6.9 6.8 6.5
Change-Id: I545a22d65a3baeaa850a7d658dcf466d2284b0fa
Reviewed-by: Ahmad Samir <a.samirh78@gmail.com>
This code is used all over the place, so put it into a convenience
function where we can arguably put in a bit of effort to optimize
things, to wit: use QByteArrayView, and provide a better error message
than inline code could afford. E.g. the LastValidCodePoint check was
previously only in readUnicodeData().
For use in readUnicodeData(), add lineNo tracking.
In readBidiLine(), we can now drop the replace(' ', ""), because
parseHex() already trims. As a consequence, the lambda can take the
QByteArray by cref now.
There are many more places where this could be used, but they
represent higher-level constructs for which I'll add higher-level
helper functions.
Amends the start of the public history.
Pick-to: 6.10 6.9 6.8 6.5
Change-Id: Ic8f59f6509da1a0deb47a46cfaf160abb20c067e
Reviewed-by: Ahmad Samir <a.samirh78@gmail.com>
This makes the function independent of the actual input container
being passed, allowing us to port from QList to QVarLengthArray
step-by-step.
In readUnicodeData(), this allows passing the single-element case
using initializer_list instead of a temporary QList.
Amends the start of the public history (but, to be fair, QSpan wasn't
available then).
As with readLineInto() in a previous patch, QSpan is not available in
Qt 6.5, but that's not a problem because this tool doesn't need to
compile with the Qt version it is generating code for, so we can use
Qt-latest.
Pick-to: 6.10 6.9 6.8 6.5
Change-Id: I10039af9d5b82a3d23fec451bf051a868db4c343
Reviewed-by: Ahmad Samir <a.samirh78@gmail.com>
There are about a dozen files this program reads, and in each of these
cases, the code to read the file line-by-line, remove comments (or
just LF) and trim the line before further handling is duplicated. It's
also very inefficient; we have better APIs these days (readLineInto(),
rvalue *this overloads, truncate() instead of = left(), ...). Besides,
as Mårten pointed out in review, trimmed() already removes the LF, so
we don't need to do it manually.
So Extract Method readUnicodeFile() that does that, coroutine-style
(but with function object for now), from all the readX() functions
(except readUnicodeData() itself, which is using nested readLine()s).
Also maintain a line number for later improving the error messages.
Remove some isEmpty() checks in the lambdas that, after the
refactoring, can never be true (because removing whitespace from a
trimmed() string cannot make the string empty, ditto with
simplified()).
The extracted function could even pre-split the line along `;`, but
for that, I would port each lambda to use QByteArrayView / qTokenizer
first.
Picking to all active branches, because a) this is a tool and b) we
continue to update the Unicode tables in all active branches, so the
tool to do so should not differ, unless the target branch requires it
(changed data structures, e.g.). Note that readLineInto() is not in
6.5, but the tool is not required to be built against the Qt version
it is building tables for, so we can use the latest Qt features here.
Pick-to: 6.10 6.9 6.8 6.5
Change-Id: I3b699f213c98baa45bc8bbdb7ae2ac985d893798
Reviewed-by: Mårten Nordheim <marten.nordheim@qt.io>
Reviewed-by: Ahmad Samir <a.samirh78@gmail.com>
Says GCC:
main.cpp:3232:33: warning: format ‘%zu’ expects argument of type ‘size_t’, but argument 3 has type ‘long long unsigned int’ [-Wformat=]
3232 | qDebug(" memory usage: %zu bytes", specialCaseMap.size() * sizeof(unsigned short));
| ~~^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| | |
| long unsigned int long long unsigned int
| %llu
Fix by using %llu and an explicit cast to qulonglong (for the case of
32-bit platforms), as usual.
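A minimal sketch of the same pattern in standard C++, assuming snprintf() in place of qDebug() and unsigned long long in place of qulonglong. memoryUsageLine() is a hypothetical helper used only for illustration:

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// The product has type size_t, whose width differs between platforms, so it
// is cast explicitly to the type that "%llu" expects.
std::string memoryUsageLine(std::size_t entries)
{
    char buf[64];
    std::snprintf(buf, sizeof buf, " memory usage: %llu bytes",
                  static_cast<unsigned long long>(entries * sizeof(unsigned short)));
    return buf;
}
```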
Pick-to: 6.10 6.9 6.8 6.5
Change-Id: Idd8c83f05880ad5e12311829d8375baaec376ac6
Reviewed-by: Ahmad Samir <a.samirh78@gmail.com>
QByteArray doesn't need qPrintable(), we can just pass its
data()/constData() to %s.
Also, don't output the "destroyed" field values (replace("..", '.')),
but the original ones.
Amends 2afe1a3c19 and
838a7a01f3.
Pick-to: 6.10 6.9 6.8 6.5
Change-Id: I5eb819f74075c6d6aa8989b30615c7955a60155c
Reviewed-by: Ahmad Samir <a.samirh78@gmail.com>
Reviewed-by: Mårten Nordheim <marten.nordheim@qt.io>
The Q_DECL_CONST_FUNCTION needs to be on the declaration to have any
effect on callers, but it was only on the (out-of-line) definition.
Amends 2fe90a61bd.
As a drive-by, also remove the export macros from the definitions;
they, too, are only needed on the declaration.
Pick-to: 6.10 6.9 6.8 6.5
Change-Id: Id69b58c50440b8b835f7be7ba873927d07b11219
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
The function's docs state that the function may destroy the input, so
let the signature reflect that.
Amends ca1eeb23fa.
Pick-to: 6.10 6.9 6.8 6.5
Change-Id: I3668887c97893e7114827819d8aaef7a0b3528ce
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
Of all the Category categories, separators are the only ones that currently
have assigned codepoints exclusively in the BMP. This allows us to lower
the maximum check from the LastValidCodepoint to a category-specific
one. This will also cause the compiler to dead-code eliminate the check
inside of qGetProperty and emit only the BMP check of the property
tables:
if (ucs4 < 0x11000)
return uc_properties + uc_property_trie[uc_property_trie[ucs4 >> 5] + (ucs4 & 0x1f)];
Pick-to: 6.10
Change-Id: I31eda5d79cc2c3560d90fffd74a546d1e7cda7bb
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
They added some new scripts.
There were a few changes to the line break algorithm;
most notably, there are more rules that require more context than before.
While not major, there was some shuffling and additions to our
implementation to match the new rules.
IDNA test data now disallows the trailing dot/empty root label,
technically to be toggled off by an option that controls a few things,
but we don't have options. For test-data they changed the format a
little: "" is used to mean empty string, while a blank segment is
null/no string. Update the parser to read this.
[ChangeLog][Third-Party Code] Updated the Unicode Character Database to
UCD revision 34/Unicode 16.
Fixes: QTBUG-132902
Task-number: QTBUG-132851
Pick-to: 6.9 6.8 6.5
Change-Id: I4569703659f6fd0f20943110a03301c1cf8cc1ed
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
On MSVC the values stored end up as negative.
Task-number: QTBUG-132902
Pick-to: 6.9 6.8 6.5
Change-Id: I963c57c34479041911c1364a1100d04998bdfaed
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
ssize_t is not universal; fails to compile in Windows.
Task-number: QTBUG-132902
Pick-to: 6.9
Change-Id: I4b8f45cba32202329ac085c7caa0a8c19a11c621
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
Fix warnings/errors about QFile::open()'s return value not being
checked, and print the name of the file and the error message when
opening fails.
Task-number: QTBUG-132902
Pick-to: 6.9
Change-Id: I099b300b5fd4563334fa547ffa365ec3f68e08cf
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
Those files are read by reuse to complement or override the copyright
and licensing information found in the files themselves.
The use of REUSE.toml files was introduced in REUSE version 3.1.0a1.
This reuse version is compatible with reuse specification
version 3.2 [1].
With this commit's files,
* The SPDX document generated by reuse spdx conforms to SPDX 2.3,
* The reuse lint command reports that the Qt project is reuse compliant.
[1]: https://reuse.software/spec-3.2/
Task-number: QTBUG-124453
Task-number: QTBUG-125211
Pick-to: 6.8
Change-Id: I01023e862607777a5e710669ccd28bbf56091097
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
Reviewed-by: Joerg Bornemann <joerg.bornemann@qt.io>
Expand unicode data to include information needed to
parse emoji sequences. This is a pre-requisite for
automatically preferring color fonts for emojis.
As a drive-by, this also fixes a double space in the
output of the uc_properties array.
Task-number: QTBUG-111801
Change-Id: Icd993803c87c69ed278c7724377028f3706d0272
Reviewed-by: Eirik Aavitsland <eirik.aavitsland@qt.io>
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
The existing data comes under Unicode-DFS-2016 but future updates
shall come under Unicode-3.0, so update the existing headers with the
former and the generator script with the latter. Leave a note in the
attribution file about this transitional state and how to resolve it.
Replaced UNICODE_LICENSE.txt from src/corelib/text/ with
LICENSES/Unicode-DFS-2016.txt, as fetched using reuse download.
This doesn't look like a rename, but it actually only adds some irrelevant
lines about where on the Unicode website the upstream files (to which
we do not apply this license) come from and changes some spacing.
Pick-to: 6.7 6.5
Fixes: QTBUG-121653
Change-Id: I50c9f4badc77a9aa402af946561aff58ae9e3e7a
Reviewed-by: Volker Hilsheimer <volker.hilsheimer@qt.io>
Reviewed-by: Kai Köhne <kai.koehne@qt.io>
The new rules were added in Unicode 15.1 (TR #14, revision 51).
The rules read:
LB15a: (sot | BK | CR | LF | NL | OP | QU | GL | SP | ZW)
[\p{Pi}&QU] SP* ×
LB15b: × [\p{Pf}&QU] (SP | GL | WJ | CL | QU | CP | EX
| IS | SY | BK | CR | LF | NL | ZW | eot)
Add two new line breaking classes LineBreak_QU_Pi and _QU_Pf to
represent quotation characters with context that matches left
side of LB15a and right side of LB15b respectively. This way
it is still possible to use the line breaking classes table.
Also add a comment about the original source of the line
break table.
Task-number: QTBUG-121529
Change-Id: Ib35f400e39e76819cd1c3299691f7b040ea37178
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
Reviewed-by: Mårten Nordheim <marten.nordheim@qt.io>
Add enumerator for the new Unicode version to QChar::UnicodeVersion.
Remap new line breaking classes to their Unicode 15.0 values:
* AK, AP and AS to AL,
* VI and VF to CM.
These are classes for new line breaking support for Indic scripts
that require more work.
Blacklist failing tests for now:
* tst_QUrlUts46::idnaTestV2
* tst_QTextBoundaryFinder::lineBoundariesDefault
* tst_QTextBoundaryFinder::graphemeBoundariesDefault
Regenerate the source files.
Task-number: QTBUG-121529
Change-Id: I869cc9fbaa53765d8ae6265c22cdbef9f19d05bf
Reviewed-by: Mårten Nordheim <marten.nordheim@qt.io>
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
This amends c4e550703c. The data version
update was just forgotten when updating to Unicode 15.0.
Pick-to: 6.5 6.6 6.7
Change-Id: Ibb3e9cb81e9bbcb5d4aaf4e4df6231485531c128
Reviewed-by: Mårten Nordheim <marten.nordheim@qt.io>
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
This corresponds to Unicode version 15.0.0.
Added the following scripts:
* Kawi
* Nag Mundari
Full support of these scripts requires harfbuzz version 5.2.0,
this version adds support for Unicode 15.0:
https://github.com/harfbuzz/harfbuzz/releases/tag/5.2.0
Fixes: QTBUG-106810
Change-Id: Ib06c526e49b0f01ef9f21123bcf875c6b19f2601
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
This property is needed to properly implement the line breaking
algorithm from UAX #14.
Task-number: QTBUG-97537
Pick-to: 6.3
Change-Id: Ia83cc553c9ef19fae33560721630849d2a95af84
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
Replace the current license disclaimer in files by
a SPDX-License-Identifier.
Files that have to be modified by hand are modified.
License files are organized under LICENSES directory.
Task-number: QTBUG-67283
Change-Id: Id880c92784c40f3bbde861c0d93f58151c18b9f1
Reviewed-by: Qt CI Bot <qt_ci_bot@qt-project.org>
Reviewed-by: Lars Knoll <lars.knoll@qt.io>
Reviewed-by: Jörg Bornemann <joerg.bornemann@qt.io>
This corresponds to Unicode version 14.0.0.
Added the following scripts:
* CyproMinoan
* OldUyghur
* Tangsa
* Toto
* Vithkuqi
Full support of these scripts requires harfbuzz version 3.0.0,
this version adds support for Unicode 14.0:
https://github.com/harfbuzz/harfbuzz/releases/tag/3.0.0
With this release 10 test cases in tst_qurluts46 were fixed, one
additional test case is failing in tst_qtextboundaryfinder and
is commented out. In total 62 line break test cases and 44 word
break test cases are failing.
A comment in src/corelib/text/qt_attribution.json was updated to
include the URL of the page containing UCD version number.
Fixes: QTBUG-94359
Change-Id: Iefc9ff13f3df279f91cbdb1246d56f75b20ecb35
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
Add a script that downloads UCD data for a given Unicode version,
unpacks it, and copies the used files to appropriate locations inside
the Qt source code.
Also update the README and use an HTTPS link for the UCD data file.
FTP links are no longer supported by some browsers.
Task-number: QTBUG-94359
Change-Id: I2aa70a588f675e411fa6b3ce5b4444a7c07ed707
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
Store up to 2 QChars for mapping values inside the mapping
table itself. This reduces the size of the superstring for
other mapping values.
results:
uncompressed size: 1146 characters
consolidated size: 1001 characters
memory usage: 48050 bytes
Task-number: QTBUG-85323
Change-Id: I922a6d2037551d0532ddae1a032ec1a9890f40a7
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
This implementation stores mappings that are 1 QChar long
inside the mapping tables; the longer mappings are stored as
offset and length pairs pointing into the common superstring
of all such mapping values.
Size comparison with the existing implementation follows.
old:
max mapping length: 6
memory usage: 103608 bytes
new:
uncompressed size: 3250 characters
consolidated size: 2367 characters
memory usage: 50782 bytes
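The storage scheme can be illustrated with a small standard C++ sketch. MappingRef and MappingStore are hypothetical names, the two-field layout is an assumption, and the consolidation here is a naive substring reuse rather than whatever superstring construction the generator actually performs:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// length == 1: 'value' is the mapped character itself (stored inline).
// otherwise:   'value' is an offset into the shared superstring.
struct MappingRef {
    std::uint16_t length;
    std::uint16_t value;
};

class MappingStore {
public:
    MappingRef add(std::u16string_view mapping)
    {
        if (mapping.size() == 1)
            return {std::uint16_t{1}, static_cast<std::uint16_t>(mapping[0])};
        // Reuse an existing occurrence if the superstring already contains it.
        auto pos = m_super.find(mapping);
        if (pos == std::u16string::npos) {
            pos = m_super.size();
            m_super.append(mapping);
        }
        return {static_cast<std::uint16_t>(mapping.size()),
                static_cast<std::uint16_t>(pos)};
    }

    std::u16string resolve(MappingRef ref) const
    {
        if (ref.length == 1)
            return std::u16string(1, static_cast<char16_t>(ref.value));
        return m_super.substr(ref.value, ref.length);
    }

    std::size_t superstringSize() const { return m_super.size(); }

private:
    std::u16string m_super; // concatenation of all multi-character mappings
};
```

Single-character mappings cost no superstring space at all, and repeated longer mappings share one copy, which is the effect the size comparison above measures.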
Task-number: QTBUG-85323
Change-Id: I9f2e32438dd463457e0fcd783136bb17145e27a8
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
The new test is called tst_qurluts46. It verifies QUrl::{to,from}Ace()
functionality using the data from IdnaTestV2.txt supplied by Unicode.
The file was downloaded from
https://www.unicode.org/Public/idna/13.0.0/IdnaTestV2.txt
Task-Id: QTBUG-85371
Change-Id: I4c6a4942ef6018dafc90cb84ef73f6b2614566d7
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
Update the Unicode data processing tool to generate properties
and mapping tables needed to implement UTS #46
(https://unicode.org/reports/tr46/). The implementation extends
the standard to allow usage of underscores in URLs. This is done
for compatibility with DNS-SD and SMB protocols.
The data file needed to generate the new properties was taken from
https://www.unicode.org/Public/idna/13.0.0/IdnaMappingTable.txt
Task-number: QTBUG-85323
Change-Id: I2c303bf8a08aefb18a7491fb9b55385563bfa219
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
UAX #29 in Unicode 11 changed the EGC algorithm to its current form.
Although Qt has upgraded the Unicode tables all the way up to
Unicode 13, the algorithm has never been adapted; in other words,
it has been working by chance for years. Luckily, MOST
of the cases were dealt with correctly, but emoji handling
actually manages to break it.
This commit:
* Adds parsing of emoji-data.txt into the unicode table generator.
That is necessary to extract the Extended_Pictographic property,
which is used by the EGC algorithm.
* Regenerates the tables.
* Removes some obsoleted grapheme cluster break properties, and
adds the ones added in the meanwhile.
* Rewrites the EGC algorithm according to Unicode 13. This is
done by greatly simplifying the lookup table. Some rules (GB11,
GB12, GB13) can't be done by the table alone so some hand-rolled
code is necessary in that case.
* Thanks to these fixes, the complete upstream GraphemeBreakTest
now passes. Remove the "edited" version that ignored some rows
(because they were failing).
Change-Id: Iaa07cb2e6d0ab9deac28397f46d9af189d2edf8b
Pick-to: 6.1 6.0 5.15
Fixes: QTBUG-92822
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>
Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>
If one "accidentally" uses a release build of the unicode tool,
the asserts within it won't fire. Enable them in all cases.
Change-Id: I9d63641dc6d6d2e5805b61b36f8c28e624b25e12
Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>
The Unicode table code can only be safely called on valid code-points.
So code that calls it must only pass it valid Unicode data. The string
iterator's Unchecked methods only provide this guarantee
when the string being iterated is guaranteed to be valid UTF-16; while
client code should only use QString, QStringView and friends on valid
UTF-16 data, we have no way to be sure they have respected that.
So take the few extra cycles to actually check validity in the course
of iterating strings, when the resulting code-points are to be passed
to the Unicode table look-ups. Add tests that case mapping doesn't
access Unicode tables out of range (it'll trigger the new assertion).
Added some comments to qchar.h that helped me understand surrogates.
Change-Id: Iec2c3106bf1a875bdaa1d622f6cf94d7007e281e
Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>