qtbase/util/locale_database/cldr2qlocalexml.py

103 lines
4.1 KiB
Python
Raw Normal View History

#!/usr/bin/env python3
# Copyright (C) 2021 The Qt Company Ltd.
# SPDX-License-Identifier: LicenseRef-Qt-Commercial OR GPL-3.0-only WITH Qt-GPL-exception-1.0
"""Convert CLDR data to QLocaleXML
The CLDR data can be downloaded as a zip-file from CLDR_, which has a
sub-directory for each version; you need the ``core.zip`` file for
your version of choice (typically the latest), which you should then
unpack. Alternatively, you can clone the git repo from github_, which
has a tag for each release and a maint/maint-$ver branch for each
major version. Either way, the CLDR top-level directory should have a
subdirectory called common/ which contains (among other things)
subdirectories main/ and supplemental/.
This script has had updates to cope up to v44.1; for later versions,
we may need adaptations. Pass the path of the CLDR top-level directory
to this script as its first command-line argument. Pass the name of
the file in which to write output as the second argument; either omit
it or use '-' to select the standard output. This file is the input
needed by ``./qlocalexml2cpp.py``
When you update the CLDR data, be sure to also update
Rework cldr2qlocalexml.py's reading of CLDR data Move the code out to a CldrReader class in cldr.py, expand CldrAccess with facilities that needs, expand ldml.py to include support for more features, finally making xpathlite.py redundant. This initial commit aims, though, to be bug-for-bug compatible with xpathlite in its reading of the CLDR data. It turns out we've been using draftier data than we were aware of (which might not be a bad thing). The xpathlite code appeared to check for draft attributes, but these only appear on leaf nodes and most data were fetched by finding a parent and then scanning its children without the draft check; only am/pm data was actually being excluded based on draft values. (We allowed contributed, for am/pm, in addition to approved, which is all the xpathlite code allows otherwise.) There are also some less equivocal bugs; I'll deal with these in later commits. Simplified number-system data look-ups; the old get_number_in_system() was taking care of old LDML versions' placement of the number system attribute; this is no longer needed. (It was also being used for a currency value to which it was not appropriate, which is now handled separately; this is one of the bugs mentioned above.) Ditched a fall-back to nativeZeroDigit, which no longer exists in CLDR. Change the command-line to take the root of the CLDR data tree, rather than its common/main/ sub-directory. Support naming the file to which to write output, as a second command-line argument, instead of always writing to stdout (which remains the default) and leaving whoever runs the script to redirect stdout. Support (internally for now, while adding TODOs to give main() more command-line options) separating the stderr output into its more and less interesting parts; for now, continue producing both, but suppress the least interesting entirely. Task-number: QTBUG-81344 Change-Id: Ie611b47403a9452b51feaeeaaa0fbc8f7e84dc71 Reviewed-by: Cristian Maureira-Fredes <cristian.maureira-fredes@qt.io>
2020-02-27 12:58:58 +00:00
src/corelib/text/qt_attribution.json's entry for unicode-cldr. Check
this script's output for unknown language, territory or script messages;
if any can be resolved, use their entry in common/main/en.xml to
append new entries to enumdata.py's lists and update documentation in
src/corelib/text/qlocale.qdoc, adding the new entries in alphabetic
order.
Both of the scripts mentioned support --help to tell you how to use
them.
.. _CLDR: https://unicode.org/Public/cldr/
.. _github: https://github.com/unicode-org/cldr
"""
from pathlib import Path
import argparse
Rework cldr2qlocalexml.py's reading of CLDR data Move the code out to a CldrReader class in cldr.py, expand CldrAccess with facilities that needs, expand ldml.py to include support for more features, finally making xpathlite.py redundant. This initial commit aims, though, to be bug-for-bug compatible with xpathlite in its reading of the CLDR data. It turns out we've been using draftier data than we were aware of (which might not be a bad thing). The xpathlite code appeared to check for draft attributes, but these only appear on leaf nodes and most data were fetched by finding a parent and then scanning its children without the draft check; only am/pm data was actually being excluded based on draft values. (We allowed contributed, for am/pm, in addition to approved, which is all the xpathlite code allows otherwise.) There are also some less equivocal bugs; I'll deal with these in later commits. Simplified number-system data look-ups; the old get_number_in_system() was taking care of old LDML versions' placement of the number system attribute; this is no longer needed. (It was also being used for a currency value to which it was not appropriate, which is now handled separately; this is one of the bugs mentioned above.) Ditched a fall-back to nativeZeroDigit, which no longer exists in CLDR. Change the command-line to take the root of the CLDR data tree, rather than its common/main/ sub-directory. Support naming the file to which to write output, as a second command-line argument, instead of always writing to stdout (which remains the default) and leaving whoever runs the script to redirect stdout. Support (internally for now, while adding TODOs to give main() more command-line options) separating the stderr output into its more and less interesting parts; for now, continue producing both, but suppress the least interesting entirely. Task-number: QTBUG-81344 Change-Id: Ie611b47403a9452b51feaeeaaa0fbc8f7e84dc71 Reviewed-by: Cristian Maureira-Fredes <cristian.maureira-fredes@qt.io>
2020-02-27 12:58:58 +00:00
from cldr import CldrReader
from qlocalexml import QLocaleXmlWriter
def main(argv, out, err):
"""Generate a QLocaleXML file from CLDR data.
Takes sys.argv, sys.stdout, sys.stderr (or equivalents) as
arguments. In argv[1:], it expects the root of the CLDR data
directory as first parameter and the name of the file in which to
save QLocaleXML data as second parameter. It accepts a --calendars
option to select which calendars to support (all available by
default)."""
all_calendars = ['gregorian', 'persian', 'islamic']
parser = argparse.ArgumentParser(
prog=Path(argv[0]).name,
description='Generate QLocaleXML from CLDR data.',
formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('cldr_path', help='path to the root of the CLDR tree')
parser.add_argument('out_file', help='output XML file name',
nargs='?', metavar='out-file.xml')
parser.add_argument('--calendars', help='select calendars to emit data for',
nargs='+', metavar='CALENDAR',
choices=all_calendars, default=all_calendars)
args = parser.parse_args(argv[1:])
root = Path(args.cldr_path)
root_xml_path = 'common/main/root.xml'
if not root.joinpath(root_xml_path).exists():
parser.error('First argument is the root of the CLDR tree: '
f'found no {root_xml_path} under {root}')
xml = args.out_file
Rework cldr2qlocalexml.py's reading of CLDR data Move the code out to a CldrReader class in cldr.py, expand CldrAccess with facilities that needs, expand ldml.py to include support for more features, finally making xpathlite.py redundant. This initial commit aims, though, to be bug-for-bug compatible with xpathlite in its reading of the CLDR data. It turns out we've been using draftier data than we were aware of (which might not be a bad thing). The xpathlite code appeared to check for draft attributes, but these only appear on leaf nodes and most data were fetched by finding a parent and then scanning its children without the draft check; only am/pm data was actually being excluded based on draft values. (We allowed contributed, for am/pm, in addition to approved, which is all the xpathlite code allows otherwise.) There are also some less equivocal bugs; I'll deal with these in later commits. Simplified number-system data look-ups; the old get_number_in_system() was taking care of old LDML versions' placement of the number system attribute; this is no longer needed. (It was also being used for a currency value to which it was not appropriate, which is now handled separately; this is one of the bugs mentioned above.) Ditched a fall-back to nativeZeroDigit, which no longer exists in CLDR. Change the command-line to take the root of the CLDR data tree, rather than its common/main/ sub-directory. Support naming the file to which to write output, as a second command-line argument, instead of always writing to stdout (which remains the default) and leaving whoever runs the script to redirect stdout. Support (internally for now, while adding TODOs to give main() more command-line options) separating the stderr output into its more and less interesting parts; for now, continue producing both, but suppress the least interesting entirely. Task-number: QTBUG-81344 Change-Id: Ie611b47403a9452b51feaeeaaa0fbc8f7e84dc71 Reviewed-by: Cristian Maureira-Fredes <cristian.maureira-fredes@qt.io>
2020-02-27 12:58:58 +00:00
if not xml or xml == '-':
emit = out
elif not xml.endswith('.xml'):
parser.error(f'Please use a .xml extension on your output file name, not {xml}')
else:
try:
Rework cldr2qlocalexml.py's reading of CLDR data Move the code out to a CldrReader class in cldr.py, expand CldrAccess with facilities that needs, expand ldml.py to include support for more features, finally making xpathlite.py redundant. This initial commit aims, though, to be bug-for-bug compatible with xpathlite in its reading of the CLDR data. It turns out we've been using draftier data than we were aware of (which might not be a bad thing). The xpathlite code appeared to check for draft attributes, but these only appear on leaf nodes and most data were fetched by finding a parent and then scanning its children without the draft check; only am/pm data was actually being excluded based on draft values. (We allowed contributed, for am/pm, in addition to approved, which is all the xpathlite code allows otherwise.) There are also some less equivocal bugs; I'll deal with these in later commits. Simplified number-system data look-ups; the old get_number_in_system() was taking care of old LDML versions' placement of the number system attribute; this is no longer needed. (It was also being used for a currency value to which it was not appropriate, which is now handled separately; this is one of the bugs mentioned above.) Ditched a fall-back to nativeZeroDigit, which no longer exists in CLDR. Change the command-line to take the root of the CLDR data tree, rather than its common/main/ sub-directory. Support naming the file to which to write output, as a second command-line argument, instead of always writing to stdout (which remains the default) and leaving whoever runs the script to redirect stdout. Support (internally for now, while adding TODOs to give main() more command-line options) separating the stderr output into its more and less interesting parts; for now, continue producing both, but suppress the least interesting entirely. Task-number: QTBUG-81344 Change-Id: Ie611b47403a9452b51feaeeaaa0fbc8f7e84dc71 Reviewed-by: Cristian Maureira-Fredes <cristian.maureira-fredes@qt.io>
2020-02-27 12:58:58 +00:00
emit = open(xml, 'w')
except IOError as e:
parser.error(f'Failed to open "{xml}" to write output to it')
Rework cldr2qlocalexml.py's reading of CLDR data Move the code out to a CldrReader class in cldr.py, expand CldrAccess with facilities that needs, expand ldml.py to include support for more features, finally making xpathlite.py redundant. This initial commit aims, though, to be bug-for-bug compatible with xpathlite in its reading of the CLDR data. It turns out we've been using draftier data than we were aware of (which might not be a bad thing). The xpathlite code appeared to check for draft attributes, but these only appear on leaf nodes and most data were fetched by finding a parent and then scanning its children without the draft check; only am/pm data was actually being excluded based on draft values. (We allowed contributed, for am/pm, in addition to approved, which is all the xpathlite code allows otherwise.) There are also some less equivocal bugs; I'll deal with these in later commits. Simplified number-system data look-ups; the old get_number_in_system() was taking care of old LDML versions' placement of the number system attribute; this is no longer needed. (It was also being used for a currency value to which it was not appropriate, which is now handled separately; this is one of the bugs mentioned above.) Ditched a fall-back to nativeZeroDigit, which no longer exists in CLDR. Change the command-line to take the root of the CLDR data tree, rather than its common/main/ sub-directory. Support naming the file to which to write output, as a second command-line argument, instead of always writing to stdout (which remains the default) and leaving whoever runs the script to redirect stdout. Support (internally for now, while adding TODOs to give main() more command-line options) separating the stderr output into its more and less interesting parts; for now, continue producing both, but suppress the least interesting entirely. Task-number: QTBUG-81344 Change-Id: Ie611b47403a9452b51feaeeaaa0fbc8f7e84dc71 Reviewed-by: Cristian Maureira-Fredes <cristian.maureira-fredes@qt.io>
2020-02-27 12:58:58 +00:00
# TODO - command line options to tune choice of grumble and whitter:
reader = CldrReader(root, err.write, err.write)
writer = QLocaleXmlWriter(emit.write)
Rework cldr2qlocalexml.py's reading of CLDR data Move the code out to a CldrReader class in cldr.py, expand CldrAccess with facilities that needs, expand ldml.py to include support for more features, finally making xpathlite.py redundant. This initial commit aims, though, to be bug-for-bug compatible with xpathlite in its reading of the CLDR data. It turns out we've been using draftier data than we were aware of (which might not be a bad thing). The xpathlite code appeared to check for draft attributes, but these only appear on leaf nodes and most data were fetched by finding a parent and then scanning its children without the draft check; only am/pm data was actually being excluded based on draft values. (We allowed contributed, for am/pm, in addition to approved, which is all the xpathlite code allows otherwise.) There are also some less equivocal bugs; I'll deal with these in later commits. Simplified number-system data look-ups; the old get_number_in_system() was taking care of old LDML versions' placement of the number system attribute; this is no longer needed. (It was also being used for a currency value to which it was not appropriate, which is now handled separately; this is one of the bugs mentioned above.) Ditched a fall-back to nativeZeroDigit, which no longer exists in CLDR. Change the command-line to take the root of the CLDR data tree, rather than its common/main/ sub-directory. Support naming the file to which to write output, as a second command-line argument, instead of always writing to stdout (which remains the default) and leaving whoever runs the script to redirect stdout. Support (internally for now, while adding TODOs to give main() more command-line options) separating the stderr output into its more and less interesting parts; for now, continue producing both, but suppress the least interesting entirely. Task-number: QTBUG-81344 Change-Id: Ie611b47403a9452b51feaeeaaa0fbc8f7e84dc71 Reviewed-by: Cristian Maureira-Fredes <cristian.maureira-fredes@qt.io>
2020-02-27 12:58:58 +00:00
writer.version(reader.root.cldrVersion)
writer.enumData(reader.root.englishNaming)
Rework cldr2qlocalexml.py's reading of CLDR data Move the code out to a CldrReader class in cldr.py, expand CldrAccess with facilities that needs, expand ldml.py to include support for more features, finally making xpathlite.py redundant. This initial commit aims, though, to be bug-for-bug compatible with xpathlite in its reading of the CLDR data. It turns out we've been using draftier data than we were aware of (which might not be a bad thing). The xpathlite code appeared to check for draft attributes, but these only appear on leaf nodes and most data were fetched by finding a parent and then scanning its children without the draft check; only am/pm data was actually being excluded based on draft values. (We allowed contributed, for am/pm, in addition to approved, which is all the xpathlite code allows otherwise.) There are also some less equivocal bugs; I'll deal with these in later commits. Simplified number-system data look-ups; the old get_number_in_system() was taking care of old LDML versions' placement of the number system attribute; this is no longer needed. (It was also being used for a currency value to which it was not appropriate, which is now handled separately; this is one of the bugs mentioned above.) Ditched a fall-back to nativeZeroDigit, which no longer exists in CLDR. Change the command-line to take the root of the CLDR data tree, rather than its common/main/ sub-directory. Support naming the file to which to write output, as a second command-line argument, instead of always writing to stdout (which remains the default) and leaving whoever runs the script to redirect stdout. Support (internally for now, while adding TODOs to give main() more command-line options) separating the stderr output into its more and less interesting parts; for now, continue producing both, but suppress the least interesting entirely. Task-number: QTBUG-81344 Change-Id: Ie611b47403a9452b51feaeeaaa0fbc8f7e84dc71 Reviewed-by: Cristian Maureira-Fredes <cristian.maureira-fredes@qt.io>
2020-02-27 12:58:58 +00:00
writer.likelySubTags(reader.likelySubTags())
writer.zoneData(*reader.zoneData()) # Locale-independent zone data.
en_US = tuple(id for id, name in reader.root.codesToIdName('en', '', 'US'))
writer.locales(reader.readLocales(args.calendars), args.calendars, en_US)
writer.close(err.write)
return 0
if __name__ == '__main__':
import sys
sys.exit(main(sys.argv, sys.stdout, sys.stderr))