Last Updated: 5 Oct 2004
Copyright (©) 2004, Innodata Isogen
Document Language: en
The Innodata Isogen Internationalization (I18N) Support Library is a collection of Java classes that provide fundamental services to document processors for localizing and internationalizing the rendered form of XML documents.
The services provided include:
Language-specific comparators for doing language- and locale-appropriate lexical sorting of strings (for example, with the xsl:sort instruction through Saxon). The generic "getComparator" functions can be bound to any implementation of the Java Comparator interface. The default Comparator implementation is that provided by the ICU4J package (http://oss.software.ibm.com/icu4j/).
The core functions (I18nService) are processor independent and can be bound to any specific processor through a relatively thin binding layer, as demonstrated by the provided Saxoni18nService class. For example, the I18nService can be bound to Epic Editor through its Java API, to other Java-based XSLT processors, to Java-based user interfaces, or to DOM-based XML processors.
The I18N Support Library uses two configuration files, one for static text and one for index configuration. Both are XML documents. As far as the core library is concerned, these files can be anywhere. However, the Saxon extension class requires that the files be in specific locations relative to the root of the "i18n home" directory (which is set using the "com.innodata.i18n.home" Java system property).
For the Saxon extensions, the configuration files must be in the following directories:
This restriction is a side effect of the fact that there's no direct way to pass parameters to the Saxon extension library (except through Java system properties set on the Java command line). If more flexibility is needed, it would be possible to define additional system properties for specifying the exact locations of these configuration files.
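As a rough illustration of how a binding layer might locate a configuration file from the system property, here is a minimal sketch. The class name, the method name, and the file name "config/static_text.xml" are hypothetical and are not part of the library; only the property name comes from the text above.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class I18nHomeSketch {
    // Property name as given in the text above.
    static final String HOME_PROPERTY = "com.innodata.i18n.home";

    // Resolves a path relative to the "i18n home" directory.
    // Hypothetical helper: the real library may locate its files differently.
    static Path resolveConfigFile(String relativePath) {
        String home = System.getProperty(HOME_PROPERTY);
        if (home == null) {
            throw new IllegalStateException(
                "System property " + HOME_PROPERTY + " is not set");
        }
        return Paths.get(home).resolve(relativePath).normalize();
    }

    public static void main(String[] args) {
        // Fall back to the current directory so the sketch runs without -D.
        if (System.getProperty(HOME_PROPERTY) == null) {
            System.setProperty(HOME_PROPERTY, ".");
        }
        // "config/static_text.xml" is an assumed file name for illustration.
        System.out.println(resolveConfigFile("config/static_text.xml"));
    }
}
```

The property would normally be supplied on the Java command line with -Dcom.innodata.i18n.home=... rather than set in code.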
The static text database document consists of two main parts: the "contexts" and the "attribute maps". The contexts are primarily intended to map element types to their text before and, if needed, text after. However, the contexts can include entries with arbitrary string keys, for example, for strings that have no associated element type. The attribute maps map values of enumerated attributes to specific strings.
The static text database configuration vocabulary is bound to the XML namespace URI "http://www.innodata-isogen.com/vocabularies/i18n_support/static_text_database".
The <contexts_common> element contains the context entries, consisting of one or more <context> elements.
Each <context> element has a <lookup_key>, which contains the string by which the context is looked up. This can be anything, but values that are the same as element type names can be accessed using the getGeneratedText functions that take an element as one of their arguments. By convention, non-element-type name keys are prefixed with "#" to ensure that they do not conflict with any element type names (XML names cannot start with "#").
Following <lookup_key> is one <text_before> and one <text_after> element. Each of these is either empty or has a <default_item> element and zero or more <item> elements.
The <default_item> element defines the default value to be used when there is no item for a specific language. This can either be a useful value, or a string like "{toc not translated}" which will provide a clear visual indicator of a missing translation.
Each <item> element provides the translation for a single language, specified using the xml:lang= attribute.
A typical context element is:
<context>
  <lookup_key>#full_stop</lookup_key>
  <text_before>
    <default_item>.</default_item>
    <item xml:lang="zh-CN">。</item>
    <item xml:lang="zh-HK">。</item>
    <item xml:lang="zh-TW">。</item>
  </text_before>
  <text_after/>
</context>
This example defines the character to use for full stop (period) in various languages. This might be used in constructing cross reference strings, for example.
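The lookup behavior described for <item> and <default_item> can be sketched in Java as follows. This is an illustration only, not the library's implementation: the class and method names are hypothetical, the sample omits the vocabulary namespace for brevity, and exact matching on xml:lang is an assumption (the library may match language tags more loosely).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class StaticTextLookupSketch {
    // A cut-down copy of the #full_stop context shown above.
    static final String SAMPLE =
        "<context>"
        + "<lookup_key>#full_stop</lookup_key>"
        + "<text_before>"
        + "<default_item>.</default_item>"
        + "<item xml:lang=\"zh-CN\">\u3002</item>"
        + "<item xml:lang=\"zh-TW\">\u3002</item>"
        + "</text_before>"
        + "<text_after/>"
        + "</context>";

    // Returns the text_before string for a language, falling back to
    // default_item, or to the empty string when text_before is empty.
    static String textBefore(String contextXml, String lang) {
        try {
            // The default factory is not namespace aware, so the attribute
            // can be read by its qualified name "xml:lang".
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(contextXml)));
            NodeList befores = doc.getElementsByTagName("text_before");
            if (befores.getLength() == 0) {
                return "";
            }
            Element before = (Element) befores.item(0);
            NodeList items = before.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                if (lang.equals(item.getAttribute("xml:lang"))) {
                    return item.getTextContent();
                }
            }
            NodeList defaults = before.getElementsByTagName("default_item");
            return defaults.getLength() > 0
                ? defaults.item(0).getTextContent() : "";
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot read context", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(textBefore(SAMPLE, "zh-CN"));
        System.out.println(textBefore(SAMPLE, "fr"));
    }
}
```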
The back-of-the-book (botb) index rules configuration file lets you define the alphabetic groups for each language, as well as defining the collation (sorting) rules for the language, if necessary. Grouping rules can be defined by enumerating each character or character sequence for each group or, for languages with lots of characters, such as ideographic languages, you can define groups by specifying the first member of each group (and the last member of the last group).
The back-of-the-book index configuration vocabulary is bound to the XML namespace URI "http://www.innodata-isogen.com/vocabularies/i18n_support/botb_index_config".
The element types involved are:
botb_index_rules
metadata
index_config
national_language
description
collation_spec
sort_method
group_definitions
Contains the group definitions (term_group). Each group must have at least a group key. If the sort method is "group by members", it must also contain an explicit list of group member characters (group_members). Groups can also have a group label that is different from the group key and, if necessary, a group sort key that is different from either the group label or group key. If only the group key is specified, it is also used as the group label and group sort key. If a group label is specified, it is used as the group sort key if no explicit sort key is defined. Note that any character that does not sort into one of the defined groups will be grouped into the "Symbol/Numeric" group (group key "#NUMERIC").
term_group
group_key
group_label
group_sort_key
group_members
Contains char_or_seq elements to enumerate the characters within the group. The group_members element should not be used if the sort method is "sort between keys", except for the last group, which must specify the last_member element to indicate the last member of the last group.
char_or_seq
For languages like English, each char_or_seq element would contain one character, one for each lowercase and uppercase letter. For languages like Spanish, where two or more characters are treated as a single character for sorting and grouping, you would specify multiple characters within a single char_or_seq element, e.g. <char_or_seq>ch</char_or_seq>.
last_member
Within group_members, identifies the last member of the last group for indexes that use the "sort between keys" sort method (e.g., the ideographic languages).
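The defaulting rules described above for the group label and group sort key can be sketched as follows. The class is hypothetical and only models the three values; it is not the library's own data structure.

```java
class TermGroupSketch {
    final String key;
    final String label;
    final String sortKey;

    // groupLabel and groupSortKey may be null when the configuration
    // omits them; groupKey is always required.
    TermGroupSketch(String groupKey, String groupLabel, String groupSortKey) {
        if (groupKey == null || groupKey.isEmpty()) {
            throw new IllegalArgumentException("A group key is required");
        }
        this.key = groupKey;
        // No label: the key doubles as the label.
        this.label = groupLabel != null ? groupLabel : groupKey;
        // No sort key: the label (which may itself be the key) is used.
        this.sortKey = groupSortKey != null ? groupSortKey : this.label;
    }

    public static void main(String[] args) {
        TermGroupSketch keyOnly = new TermGroupSketch("A", null, null);
        System.out.println(keyOnly.label + " " + keyOnly.sortKey);
    }
}
```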
The sample index configuration document provides examples of index configurations for alphabetic, syllabic (Korean), and ideographic languages, showing how to configure each type of language. The configurations for these languages are discussed in more detail below.
NOTE: The index configuration mechanism has been implemented to use a single XML document instance to hold the configurations for all the languages needed. If you find it convenient to put each language's configuration in a separate file, you can use normal XML external parsed entities to do this. While it hasn't been done, it would not be difficult to implement an XInclude-style inclusion mechanism if there is a strong requirement for it.
The English index configuration is the simplest configuration, as
it requires nothing more than a set of groups, each consisting of two single-character
char_or_seq
elements, one for the lowercase form of a letter, one for
the uppercase form. There is no special collation specification or sorting
method. The English index configuration must always be present and is used
as the fallback configuration for any language for which no explicit configuration
is found and for grouping and sorting English words (the current code base
assumes that words not in the document's base national language will be in
English--that is, the current code base does not provide for a Chinese
document that contains Spanish words that need to sort according to the Spanish
index rules).
The English index configuration can be used as the base for any other Latin-based language: just copy the index_config element, change the national language value, and adjust the groups as necessary.
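The "group by members" lookup for English can be sketched as below. The class is hypothetical; it builds the member table in code rather than reading it from the configuration document, and it assumes that the group key for each letter is its uppercase form.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class GroupByMembersSketch {
    // Group key of the "Symbol/Numeric" group, as named in the text.
    static final String SYMBOL_GROUP = "#NUMERIC";

    // Maps each member (lowercase and uppercase letter) to its group key.
    static Map<String, String> englishGroups() {
        Map<String, String> memberToGroup = new LinkedHashMap<>();
        for (char c = 'a'; c <= 'z'; c++) {
            String groupKey = String.valueOf(Character.toUpperCase(c));
            memberToGroup.put(String.valueOf(c), groupKey);
            memberToGroup.put(groupKey, groupKey);
        }
        return memberToGroup;
    }

    // Looks up the group for a term by its first character; anything
    // that is not a listed member falls into the symbol group.
    static String groupKeyFor(String term, Map<String, String> memberToGroup) {
        if (term.isEmpty()) {
            return SYMBOL_GROUP;
        }
        String first = term.substring(0, term.offsetByCodePoints(0, 1));
        return memberToGroup.getOrDefault(first, SYMBOL_GROUP);
    }

    public static void main(String[] args) {
        Map<String, String> groups = englishGroups();
        System.out.println(groupKeyFor("apple", groups));
        System.out.println(groupKeyFor("42nd Street", groups));
    }
}
```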
The
Spanish index configuration demonstrates using char_or_seq
to
define a group as having a multi-character sequence as a member. In Spanish,
"ch" is treated as a single character for the purposes of grouping
and sorting, so the Spanish configuration differs from the English in having
this additional entry:
<term_group>
  <group_key>CH</group_key>
  <group_members>
    <char_or_seq>ch</char_or_seq>
    <char_or_seq>CH</char_or_seq>
  </group_members>
</term_group>
Note that it is not necessary to define all the possible case combinations of the character group (e.g., "Ch", "cH"), just the all lowercase and all uppercase versions.
For grouping and sorting, this definition causes all words starting with "ch" to be grouped together and sorted after all words starting with "c" followed by any character other than "h".
Note also that this treatment of "ch" must be defined in the Java collation rules for the language. In the case of Spanish (and all or most other European and East European languages), the appropriate collation rules are provided by the standard Java distribution.
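To show what treating "ch" as a single collation unit means in Java, the following sketch builds a java.text.RuleBasedCollator from a tiny hand-written rule string. The rule string is an illustration only and covers just a few lowercase letters; it is not the collation that the Java distribution or this library uses for Spanish.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Arrays;

class ChContractionSketch {
    // Illustrative rules: "ch" is a contraction ordered between c and d.
    static final String RULES = "< a < b < c < ch < d";

    static RuleBasedCollator collator() {
        try {
            return new RuleBasedCollator(RULES);
        } catch (ParseException e) {
            throw new IllegalStateException("Invalid collation rules", e);
        }
    }

    // Returns a sorted copy of the given words using the rules above.
    static String[] sorted(String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy, collator());
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sorted("dama", "chico", "cosa")));
    }
}
```

With these rules, a word starting with "ch" compares greater than a word starting with "c" plus another letter, and less than a word starting with "d".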
The Simplified Chinese index configuration demonstrates several features. Simplified Chinese, as an ideographic language, uses at least 40,000 characters, grouped and sorted alphabetically according to their Pin-Yin transliteration. For example, the character for "horse" is transliterated as "ma" (ignoring tone indicators) in Pin-Yin. Thus, words starting with this character will be grouped under "M" and sorted before any character that transliterates as "mi".
Because of the large number of characters
it would be impractical (but not impossible) and inefficient to enumerate
the members of each group. Instead, Chinese (and all the other ideographic
languages) use the "sort between keys" sort strategy, as indicated
by the <sort_between_keys>
element within the <sort_method>
element.
In addition, the editorial style for Simplified Chinese is that English words sort before Chinese words, so that the English word "math" would sort before all Chinese characters within the "M" group. This is indicated by the <sort_english_before> element within <sort_method>. In most non-Latin languages English words are sorted after the words in the main language, so that is the default.
Each group has a group key, which is the first Chinese character within that group, and a group label, which is the Latin character label for that group ("A", "B", "C", etc.). Because the group key is used as the group sort key by default, there is no need to specify a separate group sort key.
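The effect of placing English words before or after main-language words can be sketched with a wrapping comparator. This is an assumption-laden illustration: the class is hypothetical, and "English word" is approximated as "term whose first character is in the Latin script", which is not necessarily how the library decides.

```java
import java.util.Comparator;

class EnglishPlacementSketch {
    // Approximation: a term counts as English if it starts with a
    // Latin-script character.
    static boolean startsWithLatin(String term) {
        return !term.isEmpty()
            && Character.UnicodeScript.of(term.codePointAt(0))
                == Character.UnicodeScript.LATIN;
    }

    // Wraps a base comparator so that Latin-script terms sort before
    // (englishBefore = true) or after (false) all other terms.
    static Comparator<String> withEnglishPlacement(
            boolean englishBefore, Comparator<String> base) {
        return (a, b) -> {
            boolean latinA = startsWithLatin(a);
            boolean latinB = startsWithLatin(b);
            if (latinA != latinB) {
                return (latinA == englishBefore) ? -1 : 1;
            }
            return base.compare(a, b);
        };
    }

    public static void main(String[] args) {
        Comparator<String> before =
            withEnglishPlacement(true, Comparator.naturalOrder());
        // U+9A6C is the Simplified Chinese character for "horse".
        System.out.println(before.compare("math", "\u9A6C") < 0);
    }
}
```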
Each group has a <group_members> element, but it is empty for all but the last group. For the last group, the <group_members> element contains a <last_member> element that contains the last Chinese character member of the last group. Without this specification, any characters that are defined as sorting after the ideographs would also be sorted into the last group.
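The "sort between keys" strategy can be sketched as a floor lookup over the first member of each group, bounded by the last member of the last group. The class below is hypothetical, and the usage shown in main substitutes plain string order and Latin letters for a real Pin-Yin collator and Chinese characters, purely to keep the example readable.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;

class SortBetweenKeysSketch {
    static final String SYMBOL_GROUP = "#NUMERIC";

    private final Comparator<String> order;
    private final TreeMap<String, String> labelByFirstMember;
    private final String lastMember;

    // order: the collation to use; lastMember: last member of the last group.
    SortBetweenKeysSketch(Comparator<String> order, String lastMember) {
        this.order = order;
        this.labelByFirstMember = new TreeMap<>(order);
        this.lastMember = lastMember;
    }

    // Registers a group by its first member (the group key).
    void addGroup(String firstMember, String label) {
        labelByFirstMember.put(firstMember, label);
    }

    // A term belongs to the group whose first member is the greatest key
    // not after the term's first character, unless that character sorts
    // after the last member or before the first group.
    String groupFor(String term) {
        if (term.isEmpty()) {
            return SYMBOL_GROUP;
        }
        String first = term.substring(0, term.offsetByCodePoints(0, 1));
        if (order.compare(first, lastMember) > 0) {
            return SYMBOL_GROUP;
        }
        Map.Entry<String, String> entry = labelByFirstMember.floorEntry(first);
        return entry == null ? SYMBOL_GROUP : entry.getValue();
    }

    public static void main(String[] args) {
        SortBetweenKeysSketch groups =
            new SortBetweenKeysSketch(Comparator.naturalOrder(), "w");
        groups.addGroup("d", "G1");
        groups.addGroup("m", "G2");
        System.out.println(groups.groupFor("horse"));
        System.out.println(groups.groupFor("zebra"));
    }
}
```

Without the last-member check, every character sorting after the final group key would land in the final group, which is the problem the <last_member> element addresses.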
Finally, the built-in Java collation
rules for Simplified Chinese in Java 1.3 and 1.4 are not correct. Therefore,
custom collation rules are used, as specified with the <java_collation_spec>
element within the <collation_spec>
element. The
<java_collation_spec>
element contains an <include_collation_spec>
element, whose content is a path to a file containing a Java RuleBasedCollator
collation rule specification. If this path is relative, it is relative to
the location of the index configuration document.
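Loading a collation specification from a file referenced by the configuration document might look like the sketch below. The class and method names are hypothetical, and reading the file as UTF-8 is an assumption; the text above does not state the encoding the library expects.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.text.ParseException;
import java.text.RuleBasedCollator;

class CollationSpecLoaderSketch {
    // Resolves specPath against the directory holding the configuration
    // document (absolute paths are used as-is) and builds a collator.
    static RuleBasedCollator load(Path configDocument, String specPath) {
        Path resolved = configDocument.toAbsolutePath().getParent()
            .resolve(specPath).normalize();
        try {
            // Assumption: the rule file is UTF-8 encoded.
            String rules = new String(
                Files.readAllBytes(resolved), StandardCharsets.UTF_8).trim();
            return new RuleBasedCollator(rules);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Bad rules in " + resolved, e);
        }
    }

    // Writes a tiny rule file to a temporary directory and loads it.
    static RuleBasedCollator demo() {
        try {
            Path dir = Files.createTempDirectory("i18n-sketch");
            Files.write(dir.resolve("rules.txt"),
                "< c < b < a".getBytes(StandardCharsets.UTF_8));
            // The configuration document itself does not need to exist here.
            return load(dir.resolve("index_config.xml"), "rules.txt");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The custom rules reverse the usual order of a, b, and c.
        System.out.println(demo().compare("c", "a") < 0);
    }
}
```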
The Simplified Chinese collation rules provided with the I18N Support package were created using the Unicode database, which provides the Pin-Yin transliteration for most characters (the "unihan.txt" file, available from the Unicode Consortium Web site). However, there is no single authority for transliterations, so different readers or authorities may result in different collation rules. The most precise collation rules require the use of an agreed-upon and authoritative Simplified Chinese dictionary and would require significant human effort to develop and verify.
Traditional Chinese indexes are sorted and grouped by character stroke count and then by radical (the base graphical element within a character). The group labels are the characters for "one-stroke characters", "two-stroke characters", and so on. Thus, where for Simplified Chinese the group label and sort key are the same, here the group label and sort key are different. The sort key is the same as the group key so there is no need to specify a separate group sort key.
To install the I18N Support library, simply unpack the package, creating the subdirectories. The i18n_support.jar file includes a manifest that automatically adds the third-party libraries in the lib/ directory to the Java class path. As long as the relative relationship is maintained, you do not need to set or extend the Java CLASSPATH environment variable or command-line parameter to include the third-party jars, only the i18n_support.jar itself.
The configuration files can be in any location, although the Saxon extension class (Saxoni18nService) expects them to be in config/ below the root of the distribution (the com.innodata.xml.i18nhome Java system property). If you change the organization of the configuration files you must update the Java source to reflect those changes.
To use the Saxon extensions you must declare a namespace prefix for the extension functions and bind it to the Saxon binding class (com.isogen.saxoni18n.Saxoni18nService in this example), e.g.:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:isoi18n="java:com.isogen.saxoni18n.Saxoni18nService" >
You can then use the static methods defined in the Saxoni18nService class as XSLT extension functions, e.g.:
<xsl:value-of select="isoi18n:getGeneratedTextForKeyBefore('#toc', $currentLang)"/>
See the Java API docs for the details of the extension functions provided.