
Description of Hyphenologist

Created 19/4/97. Last modified 14/11/97. URL http://www.hyphenologist.co.uk/descript.html

Introduction

Hyphenologist is a program to perform hyphenation in many human languages: Afrikaans, American, Anglo Saxon, Malay, Basque, Bulgarian, Catalan, Croatian, Czech, Danish, Dutch (Flemish), English (phonetic), English (Traditional), Esperanto, Estonian, Finnish, French (French Canadian, Walloon), German (Swiss German), New German, Greek (Classical), Greek (demotic), Hungarian, Icelandic, Indonesian, Irish (Erse, Gaelic, Scottish Gaelic), Italian, Latin, Latvian, Lithuanian, Maltese, Norwegian (bokmal, landsmal, nynorsk, samnorsk), Polish, Portuguese (Brazilian), Romanian (Moldavian), Russian, Serbian, Slovak, Slovenian, Spanish (Papiamento etc.), Swahili, Swedish, Turkish, Ukrainian and Welsh. Several other languages are under development. Should you require any other language we will be pleased to discuss any requirements.

Hyphenologist is a new adaptation of a long-established family of word-splitting algorithms. It is now implemented in ANSI "C". It includes some “intelligence” to give the implementor a sensible choice between hyphen positions.

Hyphenologist handles algorithmically the linguistic problem of where to place hyphens in a word, by using prefixes, suffixes and infixes in the language involved. Where a language uses "words within words" as in German, Hyphenologist searches for common internal words.

Hyphenologist is "data driven" from rule bases that are specially developed for each language taking into account "custom and practice". We find however that "custom and practice", dictionaries and published methods are substantially different from each other and often internally inconsistent. Hyphenologist does not therefore reproduce any specific system of hyphenation. A facility to construct a personal exception file that overrides the existing rules is included. Where a language has unique problems, special code is included.

Hyphenologist is normally supplied as source code for ease of incorporation into your programs. Any combination of languages will compile immediately. Dummy main()s and files of hyphenated words are supplied for testing purposes.

Why does Hyphenologist use an algorithm, without an exception list on disk?

When word splitting by computer began, some 25 years ago, the first language split was naturally American English, one of the most difficult languages. At that time the main memory allowed for an algorithm was tiny, and therefore the algorithms were all terrible. A long exception list was required for reasonable performance. Lists of correctly hyphenated American words were developed. These lists are reasonably well developed and are now often used without the algorithm. Such a system works well until it meets words like ‘ayatollahlike’, ‘midtermitis’, ‘Reaganometrics’ and many more, which often produce loose lines. The situation has now changed: memory is cheap and available.

Hyphenologist is a very powerful algorithm, and for American English is based on some 10,000 parts of words. When tested against 140,000 words it correctly splits over 99% of them, and 100% of words which are in common use. An internal, amendable exception list is included, but contains at most 1000 words.

For the Romance languages (Italian, Spanish etc.), hyphenation methods are simple and can be described on a single sheet of paper. Algorithms are small, fast, and very accurate. The dictionary method is a waste of resources.

The Germans as a matter of routine invent perfectly correct German words by adding one word to another, so no dictionary can ever be created. If attempted, it rapidly becomes impossibly large, out of date, or does not cover the words required by the user. Algorithms are the only possible solution. Dutch, Norwegian, Swedish etc. have similar problems.

General

Hyphenologist is a collection of some 100 tools and rule bases that are used on a mix and match basis for the various languages. These are selected via the “C” preprocessor. The rule bases (prefixes, suffixes etc.) are contained in a separate file for each language with any special tools for that language.

Hyphenologist normally suggests one hyphen position for every three or four letters in a word, so for long words the calling routine is given a choice of hyphenation positions. This choice is facilitated by a “goodness factor” or weight given to each hyphen. A file of functions, choice.c, is supplied; these functions may be used on a “mix and match” basis to facilitate the choice.
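As a minimal sketch of how a calling routine might use these weights, the fragment below picks the strongest break that still fits on the line. The function name best_break and its max_chars parameter are assumptions made for illustration and are not taken from choice.c. The weight array format (digits ‘1’ to ‘9’, with ‘.’ for unused positions) is the one described under “The interface”.

```c
/* Hypothetical helper, NOT part of choice.c.
   hyarr holds one character per letter of the word: '1'..'9' means a
   hyphen may follow that letter, with a higher digit being a better break.
   Returns the index of the letter the chosen hyphen follows, or -1 if no
   break leaves at most max_chars letters before the hyphen. */
int best_break(const char *hyarr, int max_chars)
{
    int best = -1;
    char best_weight = '0';
    int i;

    for (i = 0; hyarr[i] != '\0' && i < max_chars; i++) {
        char w = hyarr[i];
        if (w >= '1' && w <= '9' && w >= best_weight) {
            best = i;            /* on equal weight, prefer the later break */
            best_weight = w;
        }
    }
    return best;
}
```

This sketch ignores the width of the hyphen itself and the special ‘c’ and ‘d’ codes; a real caller would have to account for both.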

Multi language work

Hyphenologist will compile to hyphenate from 1 to 43 languages, in more than 1,000,000,000,000 different ways (actually 2 to the power 46). After compilation, languages are selected by changing a global variable, langnum. To ensure that the code is testable in reasonable time, the code for performing this is isolated into four files that consist only of a switch statement for each tool. For multi language compilations, the data for each language may be swapped in from disk as required. Alternatively, for virtual systems where the swapping is done by the operating system, the data for all languages may be compiled into a single executable file.
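The pattern of one switch statement per tool might look roughly like the sketch below. Only the global variable langnum comes from the text; the macro names, the language numbers, the tool name min_fragment and the values it returns are all hypothetical placeholders.

```c
/* Illustrative only: every name except langnum is an assumption. */
#define USE_ENGLISH 1            /* compile-time choice of languages */
#define USE_GERMAN  1

enum { LANG_ENGLISH, LANG_GERMAN };

int langnum = LANG_ENGLISH;      /* run-time choice of language */

#if USE_ENGLISH
static int english_min_fragment(void) { return 2; }   /* placeholder value */
#endif
#if USE_GERMAN
static int german_min_fragment(void) { return 3; }    /* placeholder value */
#endif

/* One wrapper per tool, containing nothing but a switch on langnum. */
int min_fragment(void)
{
    switch (langnum) {
#if USE_ENGLISH
    case LANG_ENGLISH: return english_min_fragment();
#endif
#if USE_GERMAN
    case LANG_GERMAN:  return german_min_fragment();
#endif
    default:           return -1;   /* language not compiled in */
    }
}
```

Keeping the dispatch in wrappers of this shape means each language's tools can be tested on their own, which appears to be the motivation given above.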

Speed

Hyphenologist takes about 0.5 ms per word hyphenated on a 300 MHz Pentium II CPU. We limit the maximum time to 1 ms by adjusting data structures. Timings on modern cached CPUs are difficult, as they are dependent on configuration.

Size

Hyphenologist is shipped with some 6000 lines of common code, some 11000 lines of switch statements for multi language work, plus some extra code for English/American. However, after stripping out comments, debugging code, preprocessor statements, unused code and duplicate versions of code, a single language uses some 650 to 1500 lines of common code. Each language also uses some 300 to 3500 lines of data statements from the language files. Hyphen.h contains some 50 lines of #defines per language.

The code of Hyphenologist is held entirely in main memory. Language data may be swapped into main memory as required. When incorporated into an existing program, a single language module of Hyphenologist requires approximately 10 to 100 kilobytes of main memory, depending on the model, language and compiler.

Portability

Hyphenologist is portable over all character sets by using a special internal character set. Where a language uses accented characters or a non-Roman character set, all characters are transliterated into ‘a’ through ‘z’ and ‘A’ through ‘Z’ by a look-up table. Where required by the language, other characters such as the apostrophe are allowed in words. Accented and upper case characters are normally transliterated into the lower case version of the character. Other characters such as the German Eszett are transliterated into appropriate upper case characters. Cyrillic and Greek require special transliteration. This is described at length in comments at the beginning of each language.c file.

With the advent of Unicode, several ligatures are becoming more common, notably ‘ij’ in Dutch and ‘dz’, ‘lj’ and ‘nj’ in Croatian. Hyphenologist correctly hyphenates these languages whether the text uses the ligature or two separate characters.

How long will it take to incorporate Hyphenologist into our program?

The record is only 1 1/2 hours for a 2 language implementation; several customers have reported one day. It would be better to allow two to three man days, in order that you may read the Manual and complete the testing.

C Standards

The code is written in a highly portable style of C, strictly conforming to the ANSI standard. For those without an ANSI Compiler a #define KR allows the use of a compiler to the K&R standard.
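A common way to support both standards from one source file is sketched below. Only the macro name KR comes from the text; the PROTO macro and the count_letters function are hypothetical, and the actual Hyphenologist source may be organised differently.

```c
/* #define KR */   /* define this when compiling with a K&R compiler */

#ifdef KR
#define PROTO(args) ()       /* K&R: declarations carry no parameter list */
#else
#define PROTO(args) args     /* ANSI: keep the full prototype */
#endif

short count_letters PROTO((char word[]));

#ifdef KR
short count_letters(word)
char word[];
#else
short count_letters(char word[])
#endif
{
    short n = 0;

    while (word[n] != '\0')
        n++;
    return n;
}
```

The double parentheses in the declaration let a single macro argument carry the whole parameter list.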

Unusual Hyphenations

Old German, Dutch, Swedish and Hungarian have highly individual hyphenation conventions. These grew up in the days of hand typesetting and hot metal typesetting, when the compositors handled the vast majority of type only once and adding, deleting, or changing a character was not difficult. With a computer these conventions are very difficult to implement, as in all cases the original form of the word must be reinstated when the hyphen is removed on rejustification.

Hyphenologist handles these with two special functions. The first function uses a ‘_’ in the language.c file and the exception files and a ‘d’ in hyarr. The second function uses a ‘=’ in the language.c file and the exception files and a ‘c’ in hyarr. A description of how to implement these is given in comments at the front of each language.c file.

The interface

The interface of Hyphenologist with your program is simple.

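    short hyphen (inword, hyarr, lcword)
    char inword[];   /* in:  the word to be hyphenated */
    char hyarr[];    /* out: hyphen weights, one position per letter */
    char lcword[];   /* workspace: lower case version of inword */
    {
        /* body as supplied in the Hyphenologist source */
    }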

On entry, inword must contain the null-terminated word to be hyphenated. lcword is used as workspace by Hyphenologist to create a lower case version of inword. On return, hyarr contains a null-terminated string of the suggested hyphens. Each hyphen is represented by a character between ‘1’ and ‘9’ indicating its “goodness”, placed at the position of the letter immediately BEFORE the suggested break. Weight ‘9’ is a word end, ‘6’ is a morpheme end, ‘3’ is a syllable end, and ‘1’ and ‘2’ mean “Do not use unless you have to”; ‘c’ and ‘d’ are used for unusual hyphenations. Unused positions in hyarr are packed with ‘.’.

Hyphenologist returns the number of hyphens planted, or HYFAIL (-1) if an internal failure occurred, CHAR_FAIL (-2) if an invalid character was passed to Hyphenologist, or FILE_FAIL (-3) if a problem occurred with a file containing the language data.
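A caller might interpret the return value along the following lines. The three codes and their values are those listed above; the helper hyphen_status and its message strings are assumptions for illustration, and in a real build the #defines would come from Hyphenologist's own header rather than being repeated.

```c
/* Values as stated in the text; normally supplied by the product header. */
#define HYFAIL    (-1)
#define CHAR_FAIL (-2)
#define FILE_FAIL (-3)

/* Hypothetical helper: describe the value returned by hyphen(). */
const char *hyphen_status(short rc)
{
    switch (rc) {
    case HYFAIL:    return "internal failure";
    case CHAR_FAIL: return "invalid character in word";
    case FILE_FAIL: return "problem with language data file";
    default:
        /* zero or more hyphens planted counts as success */
        return rc >= 0 ? "ok" : "unknown error";
    }
}
```

On any negative value the simplest policy is probably to set the word without hyphens.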

As an example:

    inword autosuggestion
    hyarr  .3.9..3..6....
    lcword autosuggestion
                 3  9   3   6
    hyphenated au-to-sug-ges-tion
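The way the hyphenated form follows from inword and hyarr can be sketched as below. The function show_hyphens and its min_weight parameter are hypothetical and not part of Hyphenologist; the sketch assumes, as in the example, that hyarr has one position per letter of inword, and it ignores the special ‘c’ and ‘d’ codes.

```c
#include <string.h>

/* Hypothetical helper: copy inword to out, inserting '-' after every
   letter whose hyarr entry is a digit of at least min_weight.
   out must have room for 2 * strlen(inword) + 1 characters. */
void show_hyphens(const char *inword, const char *hyarr,
                  char min_weight, char *out)
{
    size_t n = strlen(inword);
    size_t i, o = 0;
    int more = 1;                 /* 0 once the end of hyarr is reached */

    for (i = 0; i < n; i++) {
        out[o++] = inword[i];
        if (more && hyarr[i] == '\0')
            more = 0;             /* hyarr shorter than inword: stop reading it */
        if (more && i + 1 < n &&
            hyarr[i] >= min_weight && hyarr[i] <= '9')
            out[o++] = '-';       /* weight sits at the letter before the break */
    }
    out[o] = '\0';
}
```

With the data above and min_weight set to ‘1’ this gives au-to-sug-ges-tion; raising min_weight to ‘6’ keeps only the stronger breaks and gives auto-sugges-tion.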