qtbase

Commit Graph

Author	SHA1	Message	Date
Marc Mutz	2c3c479c20	util/unicode: readEastAsianWidth(): remove simplified() call All users of the split()ed value handle intervening whitespace already: - fields[0] is piped through parseHexRange(), which does - fields[1] has trimmed() called on it before lookup and all idnaStatusMap values are space-free (cf. initIdnaStatusMap()) As a consequence, we can accept the line by reference to const QByteArray now. Amends `838a7a01f3`. Pick-to: 6.10 6.9 6.8 6.5 Change-Id: I53247332c624a192fcaca6009a3f20cb8c65786a Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>	2025-09-05 08:42:20 +02:00
Marc Mutz	079b97b7ac	util/unicode: readArabicShaping(): remove replace() call All users of the split()ed value handle intervening whitespace already: - fields[0] is piped through parseHexRange(), which does - fields[1] has trimmed() called on it before lookup and all age_map values are space-free (cf. initAgeMap()) As a consequence, we can accept the line by reference to const QByteArray now. Amends the start of the public history. Pick-to: 6.10 6.9 6.8 6.5 Change-Id: I1d371af33bc0b1c1b2bf28bbd3cbaf6820f8b4e8 Reviewed-by: Ahmad Samir <a.samirh78@gmail.com>	2025-09-05 08:42:14 +02:00
Marc Mutz	0738a0dd5e	util/unicode: remove replace('_', "") from readScripts() For some reason, the code stored the official Unicode script tags without their intervening underscores, removing underscores from the input before attempting to match, which works, as long as Unicode stays consistent in spelling properties "Like_This". Relying on that is brittle, though, seeing as a tag without intervening underscore (SignWriting) already slipped into the database, potentially matching a sought Sign_Writing. It's highly unlikely that Unicode will start to use property names that differ only by their use of underscore, but why risk it, and why confuse readers of code by using a different sought string, compared to what's in the files? Fix by storing the tags unaltered and leaving the underscores in the input alone, too. Amends the start of the public history. Pick-to: 6.10 6.9 6.8 6.5 Change-Id: I5870a35812cb3fc0b28888cb09e9f42661684a26 Reviewed-by: Mårten Nordheim <marten.nordheim@qt.io>	2025-09-05 06:42:07 +00:00
Marc Mutz	587adc2217	util/unicode: use new parseHexList() in readIdnaMappingTable() This one is straight-forward. As a drive-by, use QString::append(QStringView) instead of iterating over the result of QChar::fromUcs4(). It may not be faster, in fact, I expect it to be slower, but it's much nicer to read, and this tool doesn't need to be optimized. Since every field is now clearly handled by functions that can handle extra whitespace (the values looked up unsurprisingly are space-free, too), we can drop the simplified() call and take the QByteArray 'line' by cref. Amends `2afe1a3c19`. Pick-to: 6.10 6.9 6.8 6.5 Change-Id: Id8900367d774ec4a6dccb89f6be73984caac2701 Reviewed-by: Ahmad Samir <a.samirh78@gmail.com>	2025-09-03 12:31:50 +02:00
Marc Mutz	5d357dfea9	util/unicode: assert no mirrored pairs exist outside BMP QTextEngine implicitly assumes this (it's looking up mirrored characters in UTF-16 space, without first decoding surrogates). Add a comment there, too. Amends `7f504283ef`. Fixes: QTBUG-139456 Pick-to: 6.10 6.9 6.8 6.5 Change-Id: Ie79b33907e71cc455434127c1752898c40b128f9 Reviewed-by: Edward Welbourne <edward.welbourne@qt.io> Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>	2025-09-02 21:31:16 +00:00
Marc Mutz	c4d8e58f1d	util/unicode: error out with a meaningful message if Qt is too old The Qt version against which this tool is built need not be the same as the Qt version for which this tool generates code. The advantage is that we can use the latest Qt features in this tool without having to worry about compat with older Qt versions. We might also use C++20 here in the future. Instead of greeting prospective users of the tool with random compile errors, check the Qt version and #error out with a descriptive message instead. Pick-to: 6.10 6.9 6.8 6.5 Change-Id: I2a153ee4eb6ca1a1ea7ece39c9872f3f6d746fcd Reviewed-by: Thiago Macieira <thiago.macieira@intel.com> Reviewed-by: Ahmad Samir <a.samirh78@gmail.com>	2025-09-02 23:31:16 +02:00
Marc Mutz	6e526bf92c	util/unicode: Extract Method parseHexRange() Wrapping parseHexList(), which gets extended to support QLatin1StringView separators, add parseHexRange() and use it around the code to parse HHHHH[..HHHHH] hex ranges. Amends the start of the public history. Pick-to: 6.10 6.9 6.8 6.5 Change-Id: I0372e5c239642988f0e920d95108657e276b19dd Reviewed-by: Ahmad Samir <a.samirh78@gmail.com>	2025-09-02 23:31:15 +02:00
Marc Mutz	714969e8a1	util/unicode: Extract Method parseHexList() from readSpecialCasing() This function, too, is duplicated all over the place, so centralize it. Also use modern parsing techniques, like QStringTokenizer instead of split() and QVarLengthArray instead of QList, that make the function much more efficient. Use it in readCaseFolding(), too, which is straight-forward. There are many more potential users of this function; I'll port them one by one in follow-up patches, though most are reading hex ranges, so I'll add a function for that next. Amends the start of the public history. Pick-to: 6.10 6.9 6.8 6.5 Change-Id: I545a22d65a3baeaa850a7d658dcf466d2284b0fa Reviewed-by: Ahmad Samir <a.samirh78@gmail.com>	2025-09-02 23:31:15 +02:00
Marc Mutz	7602054ac2	util/unicode: Extract Method parseHex() This code is used all over the place, so put it into a convenience function where we can arguably put in a bit of effort to optimize things, to wit: use QByteArrayView, and provide a better error message than inline code could afford. E.g. the LastValidCodePoint check was previously only in readUnicodeData(). For use in readUnicodeData(), add lineNo tracking. In readBidiLine(), we can now drop the replace(' ', ""), because parseHex() already trims. As a consequence, the lambda can take the QByteArray by cref now. There are many more places where this could be used, but they represent higher-level constructs for which I'll add higher-level helper functions. Amends the start of the public history. Pick-to: 6.10 6.9 6.8 6.5 Change-Id: Ic8f59f6509da1a0deb47a46cfaf160abb20c067e Reviewed-by: Ahmad Samir <a.samirh78@gmail.com>	2025-09-02 21:31:15 +00:00
Marc Mutz	98878bd873	util/unicode: port appendToSpecialCaseMap() from QList to QSpan This makes the function independent of the actual input container being passed, allowing us to port from QList to QVarLengthArray step-by-step. In readUnicodeData(), this allows to pass the single-element case using initializer_list instead of a temporary QList. Amends the start of the public history (but, to be fair, QSpan wasn't available then). As with readLineInto() in a previous patch, QSpan is not available in Qt 6.5, but that's not a problem because this tool doesn't need to compile with the Qt version it is generating code for, so we can use Qt-latest. Pick-to: 6.10 6.9 6.8 6.5 Change-Id: I10039af9d5b82a3d23fec451bf051a868db4c343 Reviewed-by: Ahmad Samir <a.samirh78@gmail.com>	2025-09-02 23:31:15 +02:00
Marc Mutz	44da6f996a	util/unicode: Extract Method readUnicodeFile() There's about a dozen files this program reads, and in each of these cases, the code to read the file line-by-line, remove comments (or just LF) and trim the line before further handling is duplicated. It's also very inefficient, we have better APIs these days (readLineInto(), rvalue *this overloads, truncate() instead of = left(), ...). Besides, as Mårten pointed out in review, trimmed() already removes the LF, so we don't need to do it manually. So Extract Method readUnicodeFile() that does that, coroutine-style (but with function object for now), from all the readX() functions (except readUnicodeData() itself, which is using nested readLine()s). Also maintain a line number for later improving the error messages. Remove some isEmpty() checks in the lambdas that, after the refactoring, can never be true (because removing whitespace from a trimmed() string cannot make the string empty, ditto with simplified()). The extracted function could even pre-split the line along `;`, but for that, I would port each lambda to use QByteArrayView / qTokenizer first. Picking to all active branches, because a) this is a tool and b) we continue to update the Unicode tables in all active branches, so the tool to do so should not differ, unless the target branch requires it (changed data structures, e.g.). Note that readLineInto() is not in 6.5, but the tool is not required to be built against the Qt version it is building tables for, so we can use the latest Qt features here. Pick-to: 6.10 6.9 6.8 6.5 Change-Id: I3b699f213c98baa45bc8bbdb7ae2ac985d893798 Reviewed-by: Mårten Nordheim <marten.nordheim@qt.io> Reviewed-by: Ahmad Samir <a.samirh78@gmail.com>	2025-08-29 02:03:40 +02:00
Marc Mutz	3bf21f41a2	util/unicode: fix format error Says GCC: main.cpp:3232:33: warning: format ‘%zu’ expects argument of type ‘size_t’, but argument 3 has type ‘long long unsigned int’ [-Wformat=] 3232 \| qDebug(" memory usage: %zu bytes", specialCaseMap.size() * sizeof(unsigned short)); \| ~~^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ \| \| \| \| long unsigned int long long unsigned int \| %llu Fix by using %llu and an explicit cast to qulonglong (for the case of 32-bit platforms), as usual. Pick-to: 6.10 6.9 6.8 6.5 Change-Id: Idd8c83f05880ad5e12311829d8375baaec376ac6 Reviewed-by: Ahmad Samir <a.samirh78@gmail.com>	2025-08-29 00:03:40 +00:00
Marc Mutz	4dd7f5bc68	util/unicode: fix pointless qPrintable(QByteArray) QByteArray doesn't need qPrintable(), we can just pass its data()/constData() to %s. Also, don't output the "destroyed" field values (replace("..", '.')), but the original ones. Amends `2afe1a3c19` and `838a7a01f3`. Pick-to: 6.10 6.9 6.8 6.5 Change-Id: I5eb819f74075c6d6aa8989b30615c7955a60155c Reviewed-by: Ahmad Samir <a.samirh78@gmail.com> Reviewed-by: Mårten Nordheim <marten.nordheim@qt.io>	2025-08-28 11:58:34 +02:00
Marc Mutz	1b7418e494	QUnicodeTools: fix attribute location on properties() functions The Q_DECL_CONST_FUNCTION needs to be on the declaration to have any effect on callers, but it was only on the (out-of-line) definition. Amends `2fe90a61bd`. As a drive-by, also remove the export macros from the definitions; they, too, are only needed on the declaration. Pick-to: 6.10 6.9 6.8 6.5 Change-Id: Id69b58c50440b8b835f7be7ba873927d07b11219 Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>	2025-08-27 02:39:27 +02:00
Marc Mutz	756bb4bffa	util/unicode: make buildSuperstring() take input by rvalue ref The function's docs state that the function may destroy the input, so let the signature reflect that. Amends `ca1eeb23fa`. Pick-to: 6.10 6.9 6.8 6.5 Change-Id: I3668887c97893e7114827819d8aaef7a0b3528ce Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>	2025-08-27 02:39:27 +02:00
Thiago Macieira	038d127fe5	QChar::isSpace: optimize by lowering the upper limit check Of all the Category categories, separators are the only to currently have assigned codepoints exclusively in the BMP. This allows us to lower the maximum check from the LastValidCodepoint to category-specific one. This will also cause the compiler to dead-code eliminate the check inside of qGetProperty and emit only the BMP check of the property tables: if (ucs4 < 0x11000) return uc_properties + uc_property_trie[uc_property_trie[ucs4 >> 5] + (ucs4 & 0x1f)]; Pick-to: 6.10 Change-Id: I31eda5d79cc2c3560d90fffd74a546d1e7cda7bb Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>	2025-08-19 18:04:55 -07:00
Mårten Nordheim	85899ff181	Update UCD to Unicode 16.0.0 They added some new scripts. There were a few changes to the line break algorithm, most notably there are more rules that require more context than before. While not major, there was some shuffling and additions to our implementation to match the new rules. IDNA test data now disallows the trailing dot/empty root label, technically to be toggled off by an option that controls a few things, but we don't have options. For test-data they changed the format a little - "" is used to mean empty string, while a blank segment is null/no string, update the parser to read this. [ChangeLog][Third-Party Code] Updated the Unicode Character Database to UCD revision 34/Unicode 16. Fixes: QTBUG-132902 Task-number: QTBUG-132851 Pick-to: 6.9 6.8 6.5 Change-Id: I4569703659f6fd0f20943110a03301c1cf8cc1ed Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>	2025-02-10 18:36:55 +01:00
Mårten Nordheim	62685375a2	Unicode tool: use unsigned values for the bitfields On MSVC the values stored end up as negative. Task-number: QTBUG-132902 Pick-to: 6.9 6.8 6.5 Change-Id: I963c57c34479041911c1364a1100d04998bdfaed Reviewed-by: Edward Welbourne <edward.welbourne@qt.io> Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>	2025-02-04 15:08:09 +01:00
Mårten Nordheim	f7d0366207	Unicode tool: print values using portable types ssize_t is not universal; fails to compile in Windows. Task-number: QTBUG-132902 Pick-to: 6.9 Change-Id: I4b8f45cba32202329ac085c7caa0a8c19a11c621 Reviewed-by: Edward Welbourne <edward.welbourne@qt.io> Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>	2025-02-04 15:08:09 +01:00
Mårten Nordheim	e99d5c6268	Unicode tool: handle required QFile::open return Fixing warnings/errors about QFile::open() return value not being checked, and print the name of the file and the error message that occurred. Task-number: QTBUG-132902 Pick-to: 6.9 Change-Id: I099b300b5fd4563334fa547ffa365ec3f68e08cf Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>	2025-02-04 15:08:09 +01:00
Eskil Abrahamsen Blomfeldt	b614aa10b9	Extract emoji data from Unicode files Expand unicode data to include information needed to parse emoji sequences. This is a pre-requisite for automatically preferring color fonts for emojis. As a drive-by, this also fixes a double space in the output of the uc_properties array. Task-number: QTBUG-111801 Change-Id: Icd993803c87c69ed278c7724377028f3706d0272 Reviewed-by: Eirik Aavitsland <eirik.aavitsland@qt.io> Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>	2024-08-06 10:00:08 +02:00
Edward Welbourne	d5e40b5e58	Revise UCD-generated data files' SPDX headers The existing data comes under Unicode-DFS-2016 but future updates shall come under Unicode-3.0, so update the existing headers with the former and the generator script with the latter. Leave a note in the attribution file about this transitional state and how to resolve it. Replaced UNICODE_LICENSE.txt from src/corelib/text/ with LICENSES/Unicode-DFS-2016.txt, as fetched using reuse download. This doesn't look like a rename but only actually adds some irrelevant lines about where on the Unicode website the upstream files (to which we do not apply this license) come from and changes some spacing. Pick-to: 6.7 6.5 Fixes: QTBUG-121653 Change-Id: I50c9f4badc77a9aa402af946561aff58ae9e3e7a Reviewed-by: Volker Hilsheimer <volker.hilsheimer@qt.io> Reviewed-by: Kai Köhne <kai.koehne@qt.io>	2024-04-22 15:22:12 +00:00
Ievgenii Meshcheriakov	1f73d4b87c	Unicode line breaking: Implement rules LB15a and LB15b The new rules were added in Unicode 15.1 (TR #14, revision 51). The rules read: LB15a: (sot \| BK \| CR \| LF \| NL \| OP \| QU \| GL \| SP \| ZW) [\p{Pi}&QU] SP* × LB15b: × [\p{Pf}&QU] (SP \| GL \| WJ \| CL \| QU \| CP \| EX \| IS \| SY \| BK \| CR \| LF \| NL \| ZW \| eot) Add two new line breaking classes LineBreak_QU_Pi and _QU_Pf to represent quotation characters with context that matches left side of LB15a and right side of LB15b respectively. This way it is still possible to use the line breaking classes table. Also add a comment about the original source of the line break table. Task-number: QTBUG-121529 Change-Id: Ib35f400e39e76819cd1c3299691f7b040ea37178 Reviewed-by: Edward Welbourne <edward.welbourne@qt.io> Reviewed-by: Mårten Nordheim <marten.nordheim@qt.io>	2024-02-08 17:43:58 +01:00
Ievgenii Meshcheriakov	bfd09ec38c	unicode: Import version 15.1 (UCD version 32) Add enumerator for the new Unicode version to QChar::UnicodeVersion. Remap new line breaking classes to their Unicode 15.0 values: * AK, AP and AS to AL, * VI and VF to CM. These are classes for new line breaking support for Indic scripts that require more work. Blacklist failing tests for now: * tst_QUrlUts46::idnaTestV2 * tst_QTextBoundaryFinder::lineBoundariesDefault * tst_QTextBoundaryFinder::graphemeBoundariesDefault Regenerate the source files. Task-number: QTBUG-121529 Change-Id: I869cc9fbaa53765d8ae6265c22cdbef9f19d05bf Reviewed-by: Mårten Nordheim <marten.nordheim@qt.io> Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>	2024-02-08 16:43:58 +00:00
Ievgenii Meshcheriakov	1e7f1e5b73	Update Unicode data version string This amends `c4e550703c`. The data version update was just forgotten when updating to Unicode 15.0. Pick-to: 6.5 6.6 6.7 Change-Id: Ibb3e9cb81e9bbcb5d4aaf4e4df6231485531c128 Reviewed-by: Mårten Nordheim <marten.nordheim@qt.io> Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>	2024-01-25 17:37:48 +00:00
Ievgenii Meshcheriakov	c4e550703c	Update UCD to Revision 30 This corresponds to Unicode version 15.0.0. Added the following scripts: * Kawi * Nag Mundari Full support of these scripts requires harfbuzz version 5.2.0, this version adds support for Unicode 15.0: https://github.com/harfbuzz/harfbuzz/releases/tag/5.2.0 Fixes: QTBUG-106810 Change-Id: Ib06c526e49b0f01ef9f21123bcf875c6b19f2601 Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>	2022-10-11 14:10:59 +00:00
Yuhang Zhao	34c21d0407	Core: make Unicode Database constexpr Task-number: QTBUG-100485 Pick-to: 6.3 6.2 Change-Id: I41480a34b14fd86a68a5c10b7e0f3d250e785d0f Reviewed-by: Marc Mutz <marc.mutz@qt.io> Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>	2022-05-26 12:42:36 +08:00
Ievgenii Meshcheriakov	838a7a01f3	Unicode: Extract EastAsianWidth property This property is needed to properly implement the line breaking algorithm from UAX #14. Task-number: QTBUG-97537 Pick-to: 6.3 Change-Id: Ia83cc553c9ef19fae33560721630849d2a95af84 Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>	2022-05-24 23:07:43 +02:00
Ievgenii Meshcheriakov	1a26719c54	Unicode: Remove obsolete word break classes Remove E_Base, Glue_After_Zwj, E_Base_GAZ, and E_Modifier obsoleted by UTS #29, version 33 (Unicode 11.0.0). Task-number: QTBUG-97537 Pick-to: 6.2 6.3 Change-Id: If5dc36ae17cd8746bbe81b73bbcc0863181e4a7a Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>	2022-05-24 23:07:42 +02:00
Lucie Gérard	05fc3aef53	Use SPDX license identifiers Replace the current license disclaimer in files by a SPDX-License-Identifier. Files that have to be modified by hand are modified. License files are organized under LICENSES directory. Task-number: QTBUG-67283 Change-Id: Id880c92784c40f3bbde861c0d93f58151c18b9f1 Reviewed-by: Qt CI Bot <qt_ci_bot@qt-project.org> Reviewed-by: Lars Knoll <lars.knoll@qt.io> Reviewed-by: Jörg Bornemann <joerg.bornemann@qt.io>	2022-05-16 16:37:38 +02:00
Ievgenii Meshcheriakov	826fc8c9bd	Update UCD to Revision 28 This corresponds to Unicode version 14.0.0. Added the following scripts: * CyproMinoan * OldUyghur * Tangsa * Toto * Vithkuqi Full support of these scripts requires harfbuzz version 3.0.0, this version adds support for Unicode 14.0: https://github.com/harfbuzz/harfbuzz/releases/tag/3.0.0 With this release 10 test cases in tst_qurluts46 were fixed, one additional test case is failing in tst_qtextboundaryfinder and is commented out. In total 62 line break test cases and 44 word break test cases are failing. A comment in src/corelib/text/qt_attribution.json was updated to include the URL of the page containing UCD version number. Fixes: QTBUG-94359 Change-Id: Iefc9ff13f3df279f91cbdb1246d56f75b20ecb35 Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>	2021-10-18 16:45:10 +00:00
Ievgenii Meshcheriakov	8be862687d	unicode: Fix typo s/supersting/superstring/ Thanks to Konstantin Ritt for spotting it. Change-Id: Ia3b5c4103b315cdb690fcd8b42239f000acdbef0 Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>	2021-09-15 21:58:12 +02:00
Ievgenii Meshcheriakov	5f3f073126	unicode: Build IDNA map superstring greedily This slightly reduces memory required for mapping tables: uncompressed size: 1146 characters consolidated size: 904 characters memory usage: 47856 bytes Task-number: QTBUG-85323 Change-Id: Ic960789e433e80acf1a4e36791533a1c55a735c8 Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>	2021-09-03 14:43:16 +02:00
Ievgenii Meshcheriakov	2d78b71fc6	unicode: Pack 2 QChar's into one entry for IDNA mapping Store up to 2 QChar's for mapping values inside the mapping table itself. This reduces the size of the superstring for other mapping values. results: uncompressed size: 1146 characters consolidated size: 1001 characters memory usage: 48050 bytes Task-number: QTBUG-85323 Change-Id: I922a6d2037551d0532ddae1a032ec1a9890f40a7 Reviewed-by: Thiago Macieira <thiago.macieira@intel.com> Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>	2021-09-03 14:43:16 +02:00
Ievgenii Meshcheriakov	ca1eeb23fa	unicode: More compact IDNA mapping tables implementation This implementation stores mapping that are 1 QChar long inside the mapping tables, the longer mappings are stored as an offset and length pairs pointing into the common superstring of all such mapping values. Size comparison with the existing implementation follows. old: max mapping length: 6 memory usage: 103608 bytes new: uncompressed size: 3250 characters consolidated size: 2367 characters memory usage: 50782 bytes Task-number: QTBUG-85323 Change-Id: I9f2e32438dd463457e0fcd783136bb17145e27a8 Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>	2021-09-03 14:43:16 +02:00
Ievgenii Meshcheriakov	2afe1a3c19	unicode: Generate tables for IDNA/UTS #46 Update the Unicode data processing tool to generate properties and mapping tables needed to implement UTS #46 (https://unicode.org/reports/tr46/). The implementation extends the standard to allow usage of underscores in URLs. This is done for compatibility with DNS-SD and SMB protocols. The data file needed to generate the new properties was taken from https://www.unicode.org/Public/idna/13.0.0/IdnaMappingTable.txt Task-number: QTBUG-85323 Change-Id: I2c303bf8a08aefb18a7491fb9b55385563bfa219 Reviewed-by: Edward Welbourne <edward.welbourne@qt.io>	2021-08-26 16:55:05 +02:00
Giuseppe D'Angelo	a794c5e287	Unicode: fix the extended grapheme cluster algorithm UAX #29 in Unicode 11 changed the EGC algorithm to its current form. Although Qt has upgraded the Unicode tables all the way up to Unicode 13, the algorithm has never been adapted; in other words, it has been working by chance for years. Luckily, MOST of the cases were dealt with correctly, but emoji handling actually manages to break it. This commit: * Adds parsing of emoji-data.txt into the unicode table generator. That is necessary to extract the Extended_Pictographic property, which is used by the EGC algorithm. * Regenerates the tables. * Removes some obsoleted grapheme cluster break properties, and adds the ones added in the meanwhile. * Rewrites the EGC algorithm according to Unicode 13. This is done by simplifying a lot the lookup table. Some rules (GB11, GB12, GB13) can't be done by the table alone so some hand-rolled code is necessary in that case. * Thanks to these fixes, the complete upstream GraphemeBreakTest now passes. Remove the "edited" version that ignored some rows (because they were failing). Change-Id: Iaa07cb2e6d0ab9deac28397f46d9af189d2edf8b Pick-to: 6.1 6.0 5.15 Fixes: QTBUG-92822 Reviewed-by: Thiago Macieira <thiago.macieira@intel.com> Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>	2021-04-16 20:31:39 +02:00
Edward Welbourne	78cf89c07d	Use checked string iteration in case conversions The Unicode table code can only be safely called on valid code-points. So code that calls it must only pass it valid Unicode data. The string iterator's Unchecked methods only provide this guarantee when the string being iterated is guaranteed to be valid UTF-16; while client code should only use QString, QStringView and friends on valid UTF-16 data, we have no way to be sure they have respected that. So take the few extra cycles to actually check validity in the course of iterating strings, when the resulting code-points are to be passed to the Unicode table look-ups. Add tests that case mapping doesn't access Unicode tables out of range (it'll trigger the new assertion). Added some comments to qchar.h that helped me understand surrogates. Change-Id: Iec2c3106bf1a875bdaa1d622f6cf94d7007e281e Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>	2020-08-29 18:15:27 +02:00
Edward Welbourne	1fb35832df	Simplify initialization of UnicodeData and PropertyFlags structs Initialize values where they're declared, where possible. Change-Id: Ib6bf33b27b19c76f406f78bc8a1bd9729bd8f2cd Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>	2020-08-28 21:27:51 +02:00
Edward Welbourne	a111dd26b1	Document the indexing used in the Unicode tables Make clear why we don't need to assert against out-of-bounds accesses in the generated code, provided the code point is within its bound. (Using one table's early entries as indices into later in the same table at which to look up indices into another table made it a little hard to work out what was going on, especially as nothing told me about the early / late distinction. Record what I discovered, to save the next person to stumble into this some confusion.) Change-Id: I8e5771a7f3d70c1911aeae1b0cabe5c47bc7e9c7 Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com> Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>	2020-08-20 09:02:00 +02:00
Edward Welbourne	ca034e4e50	Inline two macros in the unicode tables They were only used by one function each, in unicodetables.cpp, so don't need to be macros. Change-Id: I3e7f9f661568862d0a0d265bb8f657a8e0782b13 Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>	2020-08-12 13:47:56 +02:00
Edward Welbourne	f12ddfaecc	Tidy up unicode table generation Eliminate some needless parentheses, tidy up some spacing and indentation and split some long lines. Change first += after declaration to initializer. Change-Id: I05ff2a6337b7ed14e0a2dc9c03fc784c92b63515 Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com> Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>	2020-08-05 10:02:11 +02:00
Edward Welbourne	e3e6d58cad	Use %zd for size-type formatting in unicode table generator Qt6 makes sizes qsizetype; and one of these was already sizeof()-sized. While qsizetype might not be ssize_t, it's at least no bigger, so we can safely use its format specifier, with a suitable cast. Change-Id: I433f654f6b139d74b4d5358b804b44ab1f0ada15 Reviewed-by: Lars Knoll <lars.knoll@qt.io> Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>	2020-08-04 13:28:34 +02:00
Edward Welbourne	e536fc7975	Fix deprecation warnings (s/hex/Qt::hex/gw) in unicode table generator Removed three warnings, rather than fixing them, as Konstantin Ritt tells me they've been redundant since Unicode 6 or so. Change-Id: I4507e852bceb08a0252c77a8b383aceac212aad9 Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>	2020-08-04 13:28:34 +02:00
Edward Welbourne	b69092b13e	Fix compilation error in unicode table generator Don't include a QString::number() in a sum of QByteArray and C strings. Change-Id: I7544e835fcf5625b1fe1ee2055a48600200daafd Reviewed-by: Lars Knoll <lars.knoll@qt.io> Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com> Reviewed-by: Mårten Nordheim <marten.nordheim@qt.io>	2020-08-04 13:28:34 +02:00
Jarek Kobus	a7f9d5a7fa	Use QList instead of QVector in util Task-number: QTBUG-84469 Change-Id: I077fb5c32456d438a457c1f73852313ea2ea9ae5 Reviewed-by: Friedemann Kleint <Friedemann.Kleint@qt.io>	2020-07-07 20:34:48 +02:00
Giuseppe D'Angelo	3e1d03b1ea	Port Q_STATIC_ASSERT(_X) to static_assert There is no reason to keep using our macro now that we have C++17. The macro itself is left in for the time being, as well as its detection logic, because it's needed for C code (not everything supports C11 yet). A few more cleanups will arrive in the next few patches. Note that this is a mere search/replace; some places were using double braces to work around the presence of commas in a macro, no attempt has been made to fix those. tst_qglobal had just some minor changes to keep testing the macro. Change-Id: I1c1c397d9f3e63db3338842bf350c9069ea57639 Reviewed-by: Lars Knoll <lars.knoll@qt.io>	2020-06-19 19:38:23 +02:00
Marc Mutz	19e7c0d2b5	QChar/QString: centralize case folding in qchar.cpp There are (at least) two implementations of the low-level case-folding algorithm, one of which (for QChar::toLower()) seems to be wrong (it doesn't deal with special cases which expand to more than one code point). The algorithm hidden in QString and entangled with the QString detaching code makes reusing the code much harder. At the same time, the dependency of the algorithm on the unicode tables makes exposing a non-allocating result type in the public API hard. std::u16string would be an alternative if we can assure that all implementations use SSO with at least four characters. So, for the time being, leave this as internal API for use in an upcoming QStringView::toLower() as well as case-insensitive hashing. Change-Id: Iabb2611846f6176776aa20e634f44d8464f3305c Reviewed-by: Thiago Macieira <thiago.macieira@intel.com>	2020-05-09 06:25:05 +00:00
Marc Mutz	7b04e0012b	QUnicodeTables: port to charNN_t This makes existing calls passing uint or ushort ambiguous, so fix all the callers. There do not appear to be callers outside QtBase. In fact, the ...BreakClass() functions appear to be utterly unused. Change-Id: I1c2251920beba48d4909650bc1d501375c6a3ecf Reviewed-by: Lars Knoll <lars.knoll@qt.io> Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>	2020-04-27 13:08:41 +02:00
Marc Mutz	20cdf807b1	QChar: port low-level functions from uint/ushort to char32/16_t Now that the standard gives us proper types for UTF-16 and UTF-32 characters, use them. Will eventually make the code much easier to read than today, where uint could be an index as well as a char32_t. It also ensures that the result of e.g. QChar::highSurrogate() can still be implicitly converted to a QChar now that the QChar(non-character-integral-types) ctors are being made explicit. [ChangeLog][QtCore][QChar] All low-level functions (e.g. highSurrogate()) now take and return char16_t instead of ushort and char32_t instead of uint. Change-Id: I9cd8ebf6fb998fe1075dae96c7c4484a057f0b91 Reviewed-by: Lars Knoll <lars.knoll@qt.io> Reviewed-by: Konstantin Ritt <ritt.ks@gmail.com>	2020-04-24 12:45:53 +02:00


109 Commits