LTRU to XML

The LTRU directory contains AWK tools to convert an IANA language subtag registry, as specified in RFC 4646, to XML. On Windows, see gawk-win.htm for how to install gawk. The gawk-win.zip (1 MB) archive contains a gawk.exe (GNU AWK 3.1.6), credits to Volker Kiefel.

The 4645bis files are based on version 05 (May 2008) of Internet Draft 4645bis. The 4645bisA.awk script simply extracted the proto-registry as is (US-ASCII with NCRs, i.e. numeric character references) from the draft, resulting in 4645bisA.txt. Usage:

gawk -f 4645bisA.awk draft-0N > 4645bisA.txt

Similarly, 4645bis.awk extracted a UTF-8 proto-registry, 4645bis5.txt, as specified in Internet Draft 4646bis. This AWK script can also be used to convert RFC 4646 US-ASCII registries to UTF-8; its main part is an NCR decoder, UNCR. Usage:

gawk -f 4645bis.awk draft-0N > 4645bisN.txt
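
For readers without gawk at hand, the idea behind an NCR decoder can be sketched in Python. This is not the UNCR code from 4645bis.awk. It assumes that the registry uses hexadecimal references of the form &#xHHHH; as RFC 4646 prescribes, and the names decode_ncrs and convert_file are made up for this illustration.

```python
import re

# Hexadecimal numeric character references such as &#xE5;
# (assumption: decimal references like &#229; do not occur in the registry).
NCR = re.compile(r"&#x([0-9A-Fa-f]+);")


def decode_ncrs(text):
    """Replace each hexadecimal NCR in text with the character it names."""
    return NCR.sub(lambda m: chr(int(m.group(1), 16)), text)


def convert_file(src, dst):
    """Read a US-ASCII registry from src and write a UTF-8 copy to dst."""
    with open(src, encoding="ascii") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(decode_ncrs(line))
```

Calling convert_file("4645bisA.txt", "out.txt") would then roughly mirror the gawk usage above, with out.txt as an arbitrary output name.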

ltru2xml.awk is the script that transforms IANA registries or proto-registries to XML. At the moment it expects US-ASCII input and produces standalone text/xml documents; an example is 4645bisA.xml (1 MB). Usage:

gawk -f ltru2xml.awk 4645bisA.txt > 4645bisA.xml

AWK happily eats any octets, not only valid UTF-8, so ltru2xml.awk can also handle UTF-8 input, but without any plausibility checks. The encoding then has to be adjusted manually: replace US-ASCII with UTF-8 (s/US-ASCII/UTF-8/) in the first line of the XML output, and tell your Web server or the W3C validator what it is supposed to be, UTF-8 text/xml or application/xml. An example is 4645bis5.xml (1 MB).
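
The manual adjustment, plus the plausibility check that AWK does not do, could look roughly like this in Python. It is only a sketch: it assumes that the first output line is an XML declaration containing the string US-ASCII, and the function names are hypothetical.

```python
def is_valid_utf8(data):
    """Return True if the octets in data form well-formed UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def declare_utf8(xml_text):
    """Change US-ASCII to UTF-8 in the first line only.

    Assumption: the first line is the XML declaration; any later
    occurrence of US-ASCII in the document is left untouched.
    """
    first, newline, rest = xml_text.partition("\n")
    return first.replace("US-ASCII", "UTF-8", 1) + newline + rest
```

Restricting the replacement to the first line keeps any literal "US-ASCII" in a description or comment intact.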

A 200y-mm copy of the IANA registry is lstrymm.txt, converted to lstrymm.xml with ltru2xml.awk. The alternative ltru2xmlalt.awk creates a slightly different XML version: multiple descriptions or comments are squeezed into a single <description> or <comment> element, separated by <alt />, roughly the same (ugly) idea as <br /> in XHTML.
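
The squeezing idea can be illustrated with a few lines of Python. The exact output of ltru2xmlalt.awk is not shown here, so treat this as a guess at the general shape. The function name squeeze is made up, and the values are assumed to be plain text with NCRs already decoded.

```python
from xml.sax.saxutils import escape


def squeeze(element, values):
    """Join several values into one element, separated by <alt />.

    Assumption: values are plain text; escape() would turn the "&"
    of an undecoded NCR into "&amp;".
    """
    body = "<alt />".join(escape(value) for value in values)
    return "<{0}>{1}</{0}>".format(element, body)
```

For example, squeeze("description", ["Bokmal", "Norwegian Bokmal"]) yields a single <description> element with one <alt /> between the two texts.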

XML import into Excel might be simpler with the alternative format, but some experiments with old Excel versions (2002 and 2003) did not confirm this theory; YMMV. For other formats check out langtag.net.

The main purpose of ltru2xml.awk is to check subtag references as ID and IDREF with the W3C validator based on a DTD. The AWK script creates standalone XML files without an external DTD.
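
A similar reference check can be approximated directly on the text registry, without XML or a validator. The following Python sketch is not what ltru2xml.awk or the DTD does. It assumes the RFC 4646 record format (records separated by %% lines, folded continuation lines starting with white space), and it treats every hyphen-separated part of a Prefix or Preferred-Value as a reference to a Subtag. Ranges such as qaa..qtz are not expanded. All function names are hypothetical.

```python
def parse_records(text):
    """Split a registry into records, each a list of (field, value) pairs."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "%%":            # record separator
            if current:
                records.append(current)
            current = []
        elif line[:1] in (" ", "\t") and current:
            # Folded line: append it to the value of the previous field.
            name, value = current[-1]
            current[-1] = (name, value + " " + line.strip())
        elif ":" in line:
            name, _, value = line.partition(":")
            current.append((name.strip(), value.strip()))
    if current:
        records.append(current)
    return records


def dangling_references(records):
    """List (field, value, part) where part names no defined Subtag."""
    defined = {value.lower()
               for record in records
               for name, value in record if name == "Subtag"}
    missing = []
    for record in records:
        for name, value in record:
            if name in ("Prefix", "Preferred-Value"):
                for part in value.split("-"):
                    if part.lower() not in defined:
                        missing.append((name, value, part))
    return missing
```

An empty result from dangling_references(parse_records(text)) means every referenced subtag is defined somewhere in the same file, which is roughly what the ID and IDREF check establishes for the XML version.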

See also langtag.net, the LTRU directory here, and the IETF LTRU pages.


Last update: 13 Nov 2008 03:00 by F.Ellermann