Technical Notes

Formatting HTML

AH Formatter V6.4 can format HTML designed for the Web (except HTML that uses frames). However, few HTML documents give a good result without some adjustment. The reasons are as follows:

For example, if the HTML can be printed from a Web browser without overflowing the right-hand side of the page, then formatting with AH Formatter V6.4 will produce a reasonable result. However, in order to achieve a better result, the HTML must be designed both for the browser and for printing. The CSS for printing may be precisely defined using rules such as:

@media print { ... }
@page { ... }

Moreover, the CSS implementations of current Web browsers differ considerably. If the HTML contains syntax errors because it was written for a particular browser, or if it uses incorrect CSS, a good result is unlikely.

Many (X)HTML documents on the Web use only generic fonts. (This is desirable considering the characteristics of the Web.) In the AH Formatter V6.4 GUI on Windows, the per-script font settings in the Option Setting File always apply, so suitable fonts are used. However, this applies only to the GUI and only on Windows. When using the Command-line Interface, set appropriate <script-font> values in the Option Setting File and specify that file.

CAUTION: Since AH Formatter V6.4 formats the document for print media, @media screen is ignored even when viewing documents on-screen with the GUI.
CAUTION: HTML saved from the web browser
Many web browsers can save the current (X)HTML document. However, XHTML saved this way may not be correct XHTML; formatting it with AH Formatter V6.4 then produces an error and fails. If this happens, specify HTML as the formatter type. In addition, the saved (X)HTML may contain unexpected white space within Japanese text, which reduces the quality of the formatted text.

Cascading Order of CSS

The CSS2 Specification defines the cascading order of CSS as follows, in ascending order of precedence.

  1. user agent declarations
  2. user normal declarations
  3. author normal declarations
  4. author important declarations
  5. user important declarations
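
The ordering can be illustrated with a minimal Python sketch. It ranks declarations by origin and importance only; specificity and source order are ignored, and the dictionary keys (origin, important, value) are hypothetical names chosen for this illustration, not part of AH Formatter.

```python
# Origin/importance pairs in ascending order of precedence, as in the list above.
PRECEDENCE = [
    ("user-agent", False),
    ("user", False),
    ("author", False),
    ("author", True),
    ("user", True),
]

def winning_declaration(declarations):
    """Return the declaration that wins by origin and importance alone.

    Each declaration is a dict with the hypothetical keys 'origin',
    'important' and 'value'. Ties are broken in favor of the later one.
    """
    def rank(decl):
        return PRECEDENCE.index((decl["origin"], decl["important"]))

    best = None
    for decl in declarations:
        if best is None or rank(decl) >= rank(best):
            best = decl
    return best

decls = [
    {"origin": "user-agent", "important": False, "value": "black"},
    {"origin": "author", "important": True, "value": "red"},
    {"origin": "user", "important": True, "value": "blue"},
    {"origin": "author", "important": False, "value": "green"},
]
print(winning_declaration(decls)["value"])  # blue
```

Without the user important declaration, the author important declaration (red) would win.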

AH Formatter V6.4 corresponds to these as follows.

Default CSS for HTML

The default CSS for HTML is used as the first stylesheet (user agent declarations) when formatting (X)HTML. This is html.css, placed in the directory indicated by the environment variable AHF64_DEFAULT_HTML_CSS. (When html.css does not exist, all elements are formatted as inline.)

This stylesheet was created based on how web browsers display HTML, the styles specified by CSS, and so on. However, depending on the environment, some settings may not display well, and preferences also differ. Users should adjust the default CSS to suit their own environment. Some examples are shown below.

Detection of Formatting Type

When formatting starts with the formatting type set to automatic detection, the type is determined by the following procedure.

  1. When a MIME type is specified, AH Formatter V6.4 follows it. That is, text/html is detected as HTML and application/xhtml+xml is detected as XHTML.
  2. When auto-formatter-type="html" is specified in the Option Setting File and the extension of the input document is known, AH Formatter V6.4 follows the extension. That is, an HTML extension such as .htm or .html is detected as HTML, and an XHTML extension such as .xht or .xhtml is detected as XHTML.
  3. When there is no XML declaration and the DOCTYPE is for HTML, the document is detected as HTML.
  4. When auto-formatter-type="xhtml" is specified in the Option Setting File and the namespace is for XHTML, the document is detected as XHTML.
  5. When there is no XML declaration, no namespace exists, and the root element is <HTML> (case-insensitive), the document is detected as HTML.
  6. When a stylesheet that is CSS rather than XSLT is specified (inside the document or externally), the document is detected as XML+CSS.
  7. When the namespace is for XSL-FO, the document is detected as XSL-FO.
  8. Anything else is detected as XML+CSS.
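
The procedure above can be read as a chain of checks evaluated in order. The following Python sketch mirrors the eight steps; it is an illustration only, not the actual implementation. The function name and its parameters are hypothetical, the caller is assumed to have already extracted the facts (MIME type, extension, XML declaration, and so on) from the input, and lower-casing the extension is an assumption of the sketch.

```python
HTML_EXTENSIONS = {".htm", ".html"}
XHTML_EXTENSIONS = {".xht", ".xhtml"}
XHTML_NS = "http://www.w3.org/1999/xhtml"
XSL_FO_NS = "http://www.w3.org/1999/XSL/Format"

def detect_formatter_type(mime=None, extension=None, auto_formatter_type=None,
                          has_xml_declaration=False, doctype_is_html=False,
                          root_namespace=None, root_name=None,
                          has_css_stylesheet=False):
    """Hypothetical helper: apply steps 1 to 8 in order, return the type name."""
    # Step 1: an explicit MIME type wins (only the two listed types are modeled).
    if mime == "text/html":
        return "HTML"
    if mime == "application/xhtml+xml":
        return "XHTML"
    # Step 2: auto-formatter-type="html" and a known extension.
    if auto_formatter_type == "html" and extension is not None:
        ext = extension.lower()  # assumption: extensions compare case-insensitively
        if ext in HTML_EXTENSIONS:
            return "HTML"
        if ext in XHTML_EXTENSIONS:
            return "XHTML"
    # Step 3: no XML declaration and an HTML DOCTYPE.
    if not has_xml_declaration and doctype_is_html:
        return "HTML"
    # Step 4: auto-formatter-type="xhtml" and the XHTML namespace.
    if auto_formatter_type == "xhtml" and root_namespace == XHTML_NS:
        return "XHTML"
    # Step 5: no XML declaration, no namespace, root element named html.
    if (not has_xml_declaration and root_namespace is None
            and root_name is not None and root_name.lower() == "html"):
        return "HTML"
    # Step 6: a CSS stylesheet is attached.
    if has_css_stylesheet:
        return "XML+CSS"
    # Step 7: the XSL-FO namespace.
    if root_namespace == XSL_FO_NS:
        return "XSL-FO"
    # Step 8: everything else.
    return "XML+CSS"

print(detect_formatter_type(auto_formatter_type="html", extension=".xhtml"))  # XHTML
```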

HTML formatting does not require the document to be XML, but every other formatting type requires the document to be well-formed XML.

Difference in Formatting with AH Formatter V6.3

There are some differences in formatting between AH Formatter V6.4 and AH Formatter V6.3 as listed below.

Difference in Formatting with AH Formatter V6.2

There are some differences in formatting between AH Formatter V6.3 and AH Formatter V6.2 as listed below.

Difference in Formatting with AH Formatter V6.1

There are some differences in formatting between AH Formatter V6.2 and AH Formatter V6.1 as listed below.

Difference in Formatting with AH Formatter V6.0

There are some differences in formatting between AH Formatter V6.1 and AH Formatter V6.0 as listed below.

Difference in Formatting with AH Formatter V5

There are some differences in formatting between AH Formatter V6.0 and AH Formatter V5 as listed below.

Difference in Formatting with XSL Formatter V4

There are some differences in formatting between AH Formatter V5 and XSL Formatter V4 as listed below.

Incompatibility of XSL1.0 and XSL1.1

Some incompatible changes from XSL1.0 are made to XSL1.1.

Shorthand

Because shorthand properties in XSL inherit their definition from CSS, their values are evaluated as in CSS. That is,

margin="0pt -10pt"

is evaluated as two values, not as one formula. When the property is not a shorthand, however, the same string is evaluated as one formula. For example, the following is one formula.

margin-left="0pt -10pt"

AH Formatter V6.4 processes such ambiguous expressions in shorthands as follows.

In FO, a formula used in a shorthand can be made unambiguous by, for example, enclosing it in parentheses.
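
The difference between "two values" and "one formula" can be sketched in Python. The hypothetical helper below splits a shorthand value on white space that lies outside parentheses, which is one way to picture why a parenthesized formula stays a single value. It is only an illustration of the distinction, not the parser used by AH Formatter.

```python
def split_shorthand(value):
    """Split a shorthand value on white space that is not inside parentheses."""
    parts, current, depth = [], [], 0
    for ch in value:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch.isspace() and depth == 0:
            # Top-level white space ends the current value.
            if current:
                parts.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts

print(split_shorthand("0pt -10pt"))        # ['0pt', '-10pt']
print(split_shorthand("(0pt -10pt) 5pt"))  # ['(0pt -10pt)', '5pt']
```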

In CSS, when calc() is written as calc(10pt-5pt), the - is evaluated as an operator. This is because the CSS3 specification does not state whether - must be separated from <length-unit> in calc(). Syntactically, <length-unit> may be followed directly by -.

Property Value Syntax

This section briefly explains part of the property value syntax used in the XSL/CSS Extensions. The notation conforms to that of CSS. See also Value Definition Syntax for more details.

Table Auto Layout

A table (fo:table) accepts table-layout="fixed" or table-layout="auto". The former specifies a fixed layout with fixed column widths; the latter specifies an automatic layout in which the column widths are calculated automatically. When the value is omitted, the default is table-layout="auto". In the XSL specification, the automatic layout is implementation-dependent. This section explains how AH Formatter V6.4 implements it.

An automatic layout can take a lot of time for calculating the width of columns. Please specify table-layout="fixed" if high-speed formatting is desired.

In AH Formatter V6.4, how a table is processed depends on both table-layout and whether a width is specified on fo:table. When the widths of all columns are specified, the table is treated as table-layout="fixed" even if table-layout="auto" is specified. Moreover, according to the XSL specification, proportional-column-width() may be specified only with table-layout="fixed". In AH Formatter V6.4, when columns with proportional-column-width() are mixed with columns that have no width specification, the latter are treated as if column-width="proportional-column-width(1)" were specified, and the table is processed as table-layout="fixed". That is, in this case every column has a width specification.
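
The rules in the preceding paragraph can be summarized in a short Python sketch. The function and its arguments are hypothetical; column widths are given as strings, with None standing for a column without a width specification.

```python
def resolve_table_layout(table_layout, column_widths):
    """Return (effective table-layout, effective column widths).

    column_widths: list of width strings, None where no width is specified.
    """
    widths = list(column_widths)
    has_unspecified = any(w is None for w in widths)
    has_proportional = any(
        w is not None and w.startswith("proportional-column-width(")
        for w in widths
    )
    if has_proportional and has_unspecified:
        # Unspecified columns are treated as proportional-column-width(1),
        # and the table is processed as fixed.
        widths = ["proportional-column-width(1)" if w is None else w
                  for w in widths]
        return "fixed", widths
    if widths and not has_unspecified:
        # All column widths are specified: fixed, even if auto was requested.
        return "fixed", widths
    return table_layout, widths

print(resolve_table_layout("auto", ["proportional-column-width(2)", None]))
# ('fixed', ['proportional-column-width(2)', 'proportional-column-width(1)'])
```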

table-layout | Width of fo:table | Processing Method
fixed | Yes | The width is divided equally and assigned to the columns whose width is not specified. When the content exceeds the width, it overflows.
fixed | No | The table width becomes 100%. The width is divided equally and assigned to the columns whose width is not specified. When the content exceeds the width, it overflows.
auto | Yes | The content of the columns whose width is not specified is measured and the widths are assigned accordingly. When the table still exceeds its specified width even with the minimum width for each column, the table expands to the exceeding width.
auto | No | The content of the columns whose width is not specified is measured and the widths are assigned accordingly. When the table does not reach 100% even with the maximum width for each column, that width becomes the table width. When the table exceeds 100% even with the minimum width for each column, that width becomes the table width. Otherwise, the table width becomes 100%.

When table-layout="auto" is specified, the content of the columns whose width is not specified is examined. Examining every row would give better column widths, but takes too long for a large table. By default, AH Formatter V6.4 examines at most 100 rows to determine the column widths. This number of rows can be changed with table-auto-layout-limit in the Option Setting File.
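
The effect of the row limit can be pictured with the following Python sketch. It is a simplification under two stated assumptions: the examined rows are taken to be the first rows of the table, and the character count of a cell stands in for its measured width. Neither detail is confirmed by this document.

```python
def estimate_column_widths(rows, row_limit=100):
    """Estimate column widths from at most row_limit rows (assumed: the first ones)."""
    widths = []
    for row in rows[:row_limit]:
        for i, cell in enumerate(row):
            if i >= len(widths):
                widths.append(0)
            # len(cell) is a stand-in for the real measured content width.
            widths[i] = max(widths[i], len(cell))
    return widths

rows = [["a", "bb"]] * 100 + [["wide cell", "bb"]]
print(estimate_column_widths(rows))                 # [1, 2]
print(estimate_column_widths(rows, row_limit=200))  # [9, 2]
```

With the default limit the wide cell in row 101 is never seen, so the first column is sized from the earlier rows only; raising the limit trades formatting time for a better estimate.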

When table-layout="fixed" is specified, since the contents of the column are not investigated, the processing speed is always high.

URI

In the XSL specification, <uri-specification> is a string inside url() that conforms to the IRI specification (RFC3987). For convenience, this document refers to an IRI as a URI. The schemes that can actually be specified in AH Formatter V6.4 are as follows:

When a bare string is specified without url() and it does not match any other allowed value, it is treated as a URI. For example, the following two are equivalent.

<fo:external-graphic src="url('http://localhost/image.png')"/>
<fo:external-graphic src="http://localhost/image.png"/>

Moreover, it's possible to specify the relative URI without specifying the scheme name.

<fo:external-graphic src="url('image.png')"/>
<fo:external-graphic src="image.png"/>

For the user's convenience, AH Formatter V6.4 also accepts a file name on the local file system in place of a URI. In general, however, URIs and local file names are not compatible. For example, a URI may not contain white space, whereas a local file name may. Likewise, % may appear literally in a local file name, so the string foo%20bar.png identifies different resources depending on whether it is evaluated as a URI or as a local file name.

AH Formatter V6.4 solves this problem as follows:

The relative URI is combined with base-uri and transformed into the absolute URI. All local file names are transformed into a file scheme at this time. For example, in the Windows environment, when base-uri is C:\dir\, it is transformed as follows:

Specified string | Transformed URI
foobar.png | file:///C:/dir/foobar.png
url('foobar.png') | file:///C:/dir/foobar.png
url('url(foobar.png)') | file:///C:/dir/url(foobar.png)
subdir\foobar.png | file:///C:/dir/subdir/foobar.png
url('subdir\foobar.png') | file:///C:/dir/subdir%5Cfoobar.png
url('subdir/foobar.png') | file:///C:/dir/subdir/foobar.png
foo bar.png | file:///C:/dir/foo%20bar.png
url('foo bar.png') | file:///C:/dir/foo%20bar.png
foo%20bar.png | file:///C:/dir/foo%2520bar.png
url('foo%20bar.png') | file:///C:/dir/foo%20bar.png
foo%%20bar.png | file:///C:/dir/foo%25%2520bar.png
url('foo%%20bar.png') | file:///C:/dir/foo%25%2520bar.png
foo#bar.png | file:///C:/dir/foo#bar.png
url('foo#bar.png') | file:///C:/dir/foo#bar.png
foo%23bar.png | file:///C:/dir/foo%2523bar.png
url('foo%23bar.png') | file:///C:/dir/foo%23bar.png
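
The transformations listed above can be reproduced with a small Python sketch. The rules in it are inferred from the listed results and are not a documented algorithm: a bare string is read as a local file name (every % is escaped and \ becomes /), a string in url() is read as a URI (valid %XX escapes are kept), and a stray % inside url() causes every % to be escaped. The function name is hypothetical, and only relative names against the example base-uri are handled.

```python
import re
from urllib.parse import quote

BASE = "file:///C:/dir/"  # base-uri C:\dir\ from the example above

def to_absolute_uri(value, base=BASE):
    """Hypothetical helper reproducing the listed transformations."""
    match = re.fullmatch(r"url\('(.*)'\)", value)
    if match:
        # Inside url(): the string is read as a URI reference.
        inner = match.group(1)
        if re.search(r"%(?![0-9A-Fa-f]{2})", inner):
            # A % that does not start a %XX escape: escape every %
            # (inferred from the foo%%20bar.png row).
            path = quote(inner, safe="/#()")
        else:
            # Existing %XX escapes are kept as they are.
            path = quote(inner, safe="/#()%")
    else:
        # Bare string: read as a local file name.
        path = quote(value.replace("\\", "/"), safe="/#()")
    return base + path

print(to_absolute_uri("foo%20bar.png"))         # file:///C:/dir/foo%2520bar.png
print(to_absolute_uri("url('foo%20bar.png')"))  # file:///C:/dir/foo%20bar.png
```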

A local file name cannot be written directly into url(). For example:

url('C:\My Document\foobar.png')

The string above will not work as expected. Specify a local file name without wrapping it in url().

# is the fragment separator. In file:///C:/dir/foo#bar.png, the resource actually accessed is file:///C:/dir/foo. To access a resource named foo#bar.png, specify url('foo%23bar.png').

A UNC (Universal Naming Convention) name in Windows, for example \\host\My Document\foobar.png, is transformed into file://host/My%20Document/foobar.png. Also, //host/My Document/foobar.png is transformed into http://host/My%20Document/foobar.png when base-uri is http:. (The same applies to https:.) On non-Windows platforms, file://host/... is not supported.

Please refer to Graphics for the data scheme and the jar scheme.

When accessing HTTP or HTTPS via a proxy in a non-Windows environment, the proxy address must be specified with an environment variable.

When a root certificate is needed in a non-Windows environment, the directory of the root certificate must be specified with an environment variable. (V6.3MR2)

Unicode

AH Formatter V6.4 supports Unicode 7.0. Characters added in later versions may not be handled correctly. In addition, characters of unsupported scripts cannot be handled correctly (☞ Scripts and Languages). The following characters are also not supported:

U+2066 is treated as U+202D, U+2067 as U+202E, and U+2069 as U+202C.
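
As an illustration, the substitution stated above can be written as a character mapping in Python. The helper name is hypothetical; the sketch only shows the mapping itself, not how AH Formatter applies it during bidirectional processing.

```python
# Substitutions as stated above (isolate controls mapped to older controls).
ISOLATE_SUBSTITUTES = {
    0x2066: 0x202D,  # LEFT-TO-RIGHT ISOLATE  -> LEFT-TO-RIGHT OVERRIDE
    0x2067: 0x202E,  # RIGHT-TO-LEFT ISOLATE  -> RIGHT-TO-LEFT OVERRIDE
    0x2069: 0x202C,  # POP DIRECTIONAL ISOLATE -> POP DIRECTIONAL FORMATTING
}

def substitute_isolates(text):
    """Replace the three isolate controls with their stated substitutes."""
    return text.translate(ISOLATE_SUBSTITUTES)

converted = substitute_isolates("\u2066abc\u2069")
print([f"U+{ord(ch):04X}" for ch in converted])
# ['U+202D', 'U+0061', 'U+0062', 'U+0063', 'U+202C']
```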

Line Breaking

AH Formatter V6.4 performs line breaking according to UAX#14: Line Breaking Properties. In some cases the processing differs from UAX#14.

Hyphenation

This section explains the behavior of the page (or column) break when hyphenation-keep="page" (or "column") is specified. Suppose there is the following sentence with hyphenation-keep="page" specified.

xxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxx
xxxxxxxxxxx abc-
def xxxxxxx ghi-
jkl mnopqr.

When the page break occurs at the last line, ghi is pushed to the next page, with the following result:

xxxxxxxxxxxxxxxx
xxxxxxxxxxx abc-
def xxxxxxx
---------------- page break
ghijkl mnopqr.

When widows="2" is specified, one more line is pushed to the next page, with the following result:

xxxxxxxxxxxxxxxx
xxxxxxxxxxx abc-
---------------- page break
def xxxxxxx ghi-
jkl mnopqr.

But this contradicts hyphenation-keep="page", because the page now ends with a hyphenated line. AH Formatter V6.4 cannot push abc alone, so one more whole line is pushed to the next page.

xxxxxxxxxxxxxxxx
---------------- page break
xxxxxxxxxxx abc-
def xxxxxxx ghi-
jkl mnopqr.

When the preceding line also ends with a hyphen, lines are pushed one after another in this way. It is therefore better to use hyphenation-keep together with hyphenation-ladder-count.

In a different case, the number of lines may increase when ghi is pushed, as follows:

xxxxxxxxxxxxxxxx
xxxxxxxxxxx xxxx
xxx xxxxxxx
---------------- page break
ghijkl xxxx mno-
pqr.

When widows="3" is specified, one more line is pushed. This time the number of lines may decrease, as follows:

xxxxxxxxxxxxxxxx
xxxxxxxxxxx xxxx
---------------- page break
xxx xxxxxxx ghi-
jkl xxxx mnopqr.

AH Formatter V6.4 cannot resolve the widows="3" violation caused by this side effect. This is a limitation of AH Formatter V6.4. widows="2" never causes such a situation.

Variation Sequence

AH Formatter V6.4 supports Unicode Variation Sequences. When an OpenType font supports Variation Sequences (cmap Format 14), they are processed appropriately. For example, Variation Sequences can be written as follows.

葛&#xE0100;城市 | 葛城市
葛&#xE0101;飾区 | 葛飾区

Even for a CID font that does not support Variation Sequences, the CID is selected according to the following IVD (UTS#37: Ideographic Variation Database).

&#xE0100; and similar selectors are disregarded when the font does not support Variation Sequences, when there is no corresponding variant glyph, or when the specified Variation Sequence is out of range. Consequently, even with the same settings, the displayed glyph may differ depending on which Variation Sequences the font supports.
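
In terms of code points, a Variation Sequence is a base character followed by a variation selector. The Python sketch below shows the sequence from the example and what "disregarded" amounts to: the selector is dropped and the base character is used as is. The helper name is hypothetical, and the sketch does not model font lookup.

```python
# Variation Selectors Supplement: VS17 (U+E0100) to VS256 (U+E01EF).
VS_FIRST, VS_LAST = 0xE0100, 0xE01EF

def strip_ideographic_variation_selectors(text):
    """Drop variation selectors, as when a font cannot honor them."""
    return "".join(ch for ch in text if not VS_FIRST <= ord(ch) <= VS_LAST)

sequence = "葛\U000E0100城市"  # same as 葛&#xE0100;城市 in the markup
print(len(sequence))                                         # 4
print(strip_ideographic_variation_selectors(sequence))       # 葛城市
print(len(strip_ideographic_variation_selectors(sequence)))  # 3
```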

CAUTION: Variation Sequences other than Ideographic are not supported.

Font Selection

Fonts in FO or CSS are specified with the font-family property. Several candidate fonts may be enumerated, as in font-family="'Courier New', serif", or font-family may not be specified at all. AH Formatter V6.4 determines which font to apply to a character string as follows.

  1. The character string in the area is divided into runs of the same script, using the script information that Unicode defines for each character and the language or script specified in FO or CSS, and the script of each run is determined. This determination is complicated because, for example, Unicode contains characters for which it is ambiguous whether they are full width or not, and the language cannot be determined from a string consisting only of kanji.

  2. When font-selection-mode="6" is specified in the Option Setting File, each character of the string is checked, in order, against the font-family list specified in FO or CSS to see whether a font has its glyph, and the first font found with the glyph is adopted. Otherwise, each character is checked to see whether a font in the font-family list has its glyph and whether the font supports the Unicode range or script, in order, and the first supporting font found is adopted. When no font-family is specified, the generic font family set as the standard font family is assumed to be specified.

In XSL or CSS, the following five generic font families can be used: serif, sans-serif, cursive, fantasy, and monospace.

AH Formatter V6.4 holds, for every script, the information on which actual font corresponds to each of these. A default generic font that does not belong to any script can also be defined. These can be specified in the Font Setting page of the Option Setting dialog in the GUI, or with <script-font> in the Option Setting File.

  1. When a generic font is specified for the script of the target character string, it is checked to see whether it supports the character string.

  2. When no generic font is specified for that script, the default generic font is checked.

  3. When auto-fallback-font="true" is specified in the Option Setting File and none of the fonts specified in font-family supports the target character string, the following fallback processing is performed.

    1. The font specified as the fallback for the corresponding script is checked.
    2. The font specified as the fallback of the standard generic font is checked.
    3. If still no font supports the target character string, the following fonts are checked in order.
      • Windows version
        1. Lucida Sans Unicode
        2. Microsoft Sans Serif
        3. IPAGothic
        4. Code2000
        5. MS PGothic
        6. Arial Unicode MS
      • Non-Windows version
        1. Helvetica
        2. IPAGothic
        3. Code2000

  4. If no font supporting the target character string is found even then, it is an error.
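
The lookup order described above can be sketched in Python as follows. This is a simplified model, not the actual implementation: support is decided per script from a hypothetical coverage table instead of per glyph, and all names and data (select_font, coverage, the dictionaries) are invented for the illustration.

```python
GENERIC_FAMILIES = {"serif", "sans-serif", "cursive", "fantasy", "monospace"}

def select_font(script, font_family, script_fonts, default_fonts,
                fallbacks, last_resort, coverage, auto_fallback=True):
    """Return the first candidate font whose toy coverage includes script."""
    candidates = []
    for name in font_family:
        if name in GENERIC_FAMILIES:
            # Generic font for the script first, else the default generic font.
            resolved = (script_fonts.get(script, {}).get(name)
                        or default_fonts.get(name))
            if resolved:
                candidates.append(resolved)
        else:
            candidates.append(name)
    if auto_fallback:
        candidates += fallbacks.get(script, [])     # fallback for the script
        candidates += fallbacks.get("default", [])  # fallback of the standard generic font
        candidates += last_resort                   # built-in list, in order
    for font in candidates:
        if script in coverage.get(font, set()):
            return font
    raise LookupError(f"no font supports script {script!r}")

# Hypothetical coverage data, for illustration only.
coverage = {
    "Courier New": {"Latn"},
    "SimSun": {"Latn", "Hans"},
    "Times New Roman": {"Latn"},
    "Arial Unicode MS": {"Latn", "Hans", "Thai"},
}
script_fonts = {"Hans": {"serif": "SimSun"}}
default_fonts = {"serif": "Times New Roman"}
last_resort = ["Lucida Sans Unicode", "Microsoft Sans Serif", "IPAGothic",
               "Code2000", "MS PGothic", "Arial Unicode MS"]
family = ["Courier New", "serif"]

print(select_font("Hans", family, script_fonts, default_fonts, {}, last_resort, coverage))
# SimSun
print(select_font("Thai", family, script_fonts, default_fonts, {}, last_resort, coverage))
# Arial Unicode MS
```

With auto_fallback=False, the Thai lookup in this toy data ends in the error case of step 4.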

The settings in the Option Setting dialog are reflected in the Option Setting File. For example, they are written like

<script-font script="Hans" serif="SimSun" sans-serif="SimHei" monospace="SimSun"/>

Since cursive is not specified here, the cursive of the default generic font is used for Hans. When <script-font script="Hans"/> itself is not specified, as immediately after installation, the default group is assumed. The Windows version sets up the following default group. Scripts not listed here are not set up, and an entry is not set up when the font does not actually exist.

Script | serif | sans-serif | cursive | fantasy | monospace
Standard | Times New Roman | Arial | Segoe Script or Comic Sans MS or Monotype Corsiva | Impact | Courier New
Jpan | MS Mincho | MS Gothic | MS Mincho or MS Gothic | MS Mincho or MS Gothic | MS Gothic or MS Mincho
Hans | SimSun or MS Song | SimHei or MS Hei or MS Song | SimSun or MS Song | SimSun or MS Song | SimHei or MS Hei or MS Song
Hant | MingLiU
Hang | Batang or BatangChe | Gulim or BatangChe | Batang or BatangChe | Batang or BatangChe | BatangChe
Arab | Arabic Typesetting
Hebr | FrankRuehl
Deva | Mangal
Beng | Vrinda
Guru | Raavi
Gujr | Shruti
Taml | Latha
Telu | Gautami
Knda | Tunga
Mlym | Kartika
Sinh | Iskoola Pota
Thai | Angsana New
Khmr | DaunPenh
Laoo | DokChampa
Mymr | Myanmar Text

The following default group is set up with the Macintosh version.

Script | serif | sans-serif | cursive | fantasy | monospace
Standard | Times or Times New Roman | Helvetica or Arial | Monaco or Chalkboard | Monaco or Chalkboard | Courier
Jpan | HiraMinPro W3 | HiraKakuPro W3 | HiraMaruPro W3 or HiraKakuPro W3 | HiraMaruPro W3 or HiraKakuPro W3 | HiraKakuPro W3
Hans | STXihei | STSong | STXihei | STXihei | STSong
Hant | LiHeiPro | LiSongPro | LiHeiPro | LiHeiPro | LiSongPro
Hang | AppleMyungjo | AppleGothic | AppleMyungjo | AppleMyungjo | AppleGothic
Arab | Geeza Pro
Hebr | NewPeninimMT
Deva | DevanagariMT
Thai | Thonburi

The following default group is set up with the other UNIX version.

Script | serif | sans-serif | cursive | fantasy | monospace
Standard | Times | Helvetica | Times | Times | Courier

Upright Rendering of Text in Vertical Writing Mode

There are basically three orientations of text in Japanese or Chinese documents, as follows:

[Figure: sample text in horizontal writing, and in vertical writing as SVO and as MVO]

The orientation of text in vertical writing mode is expressed with U or R. U is a character displayed upright on the paper; R is a character rotated 90 degrees clockwise on the paper. The text orientation in vertical writing mode is then as follows:

Which characters should be upright and which should be rotated 90 degrees is discussed in UTR#50: Unicode Vertical Text Layout. The current tr50-11.html describes only MVO (Mixed Vertical Orientation), but an earlier version (tr50-6.html) also described SVO (Stacked Vertical Orientation). AH Formatter V6.4 implements axf:text-orientation="mixed" in accordance with MVO and axf:text-orientation="upright" in accordance with SVO, in both cases with some modifications to the data (☞ tr50-x.Orientation.txt). This data can be modified arbitrarily in the Option Setting File. See also UTR50.

Usually, a font that supports vertical writing mode has glyphs for vertical writing for some characters, because some characters cannot be used in vertical writing simply by rotating the glyph for horizontal writing mode. These include small kana, punctuation, and the long vowel mark. In vertical writing mode, if a character has a glyph for vertical writing, that glyph is used.

The orientation of text (U or R) is decided and expressed relative to the orientation of the glyph for horizontal writing mode. However, some glyphs for vertical writing mode differ from those for horizontal writing mode. The example below shows the glyphs of U+3083, U+FF08, and U+2190. U+FF08 and U+2190 have different orientations in vertical and horizontal writing mode.

[Figure: glyphs of U+3083, U+FF08, and U+2190 for horizontal writing and for vertical writing]

Although "brackets are R" as mentioned above, they actually have to be displayed as U using the glyph for vertical writing mode. That is, there is a tacit assumption that the glyph for vertical writing mode is designed with a different orientation from the glyph for horizontal writing mode. Whether a font has a glyph for vertical writing mode, and whether its orientation is the same as that for horizontal writing mode, depends on the font. The difference between fonts is particularly noticeable in the orientation of symbols such as arrows. Since there is no way to know in which orientation a glyph was designed, this problem cannot be solved in general. Therefore, AH Formatter V6.4 controls the orientation of characters in line with the major implementations.

Formatting Large Document

For example, when formatting a simple FO without <fo:page-number-citation> and outputting PDF, AH Formatter V6.4 discards each page once it has been formatted and output. However huge the document is, it can therefore be processed without consuming more memory than one page requires (except when formatting from the GUI). However, if a page refers to a later page with <fo:page-number-citation>, the page number of the referenced page is not known until that page is actually formatted. For that reason, when a page containing an unresolved <fo:page-number-citation> appears, AH Formatter V6.4 suspends output and keeps the intermediate formatting results in memory. When the document has a table of contents at the start, nothing is output until all the page numbers that appear in the table of contents are resolved. This limits the number of pages that can be formatted, and formatting a large document may become impossible because of the large amount of memory consumed.

To solve this problem, AH Formatter V6.4 can process the document with 2-pass formatting. In the first pass, formatting is performed only to resolve <fo:page-number-citation>, and all the required page number information is collected. In the second pass, formatting starts again from the first page. Since every <fo:page-number-citation> is already resolved at this point, AH Formatter V6.4 can output the document while discarding the pages already formatted. Although the formatting time increases, most of the memory otherwise used for formatting is not consumed, so large documents can be formatted. However, this has no effect on the memory needed for output.
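
The idea of 2-pass formatting can be illustrated with a toy model in Python. It is a sketch under simplifying assumptions, not the formatter's algorithm: the document is a list of hypothetical blocks with an id, a height in lines, and an optional cited id; blocks never split across pages; and resolving a citation does not change the layout.

```python
def paginate(blocks, lines_per_page):
    """Yield (page_number, block) in document order; a block never splits."""
    page, used = 1, 0
    for block in blocks:
        if used and used + block["lines"] > lines_per_page:
            page, used = page + 1, 0
        used += block["lines"]
        yield page, block

def two_pass_format(blocks, lines_per_page):
    """Yield (page_number, text) with every citation resolved."""
    # Pass 1: lay out only to learn the page number of every id.
    page_of = {block["id"]: page
               for page, block in paginate(blocks, lines_per_page)}
    # Pass 2: lay out again; each citation resolves immediately, so each
    # result can be emitted (and forgotten) as soon as it is produced.
    for page, block in paginate(blocks, lines_per_page):
        cited = block.get("cites")
        suffix = f" (see page {page_of[cited]})" if cited else ""
        yield page, block["id"] + suffix

blocks = [
    {"id": "toc", "lines": 2, "cites": "ch2"},  # forward reference
    {"id": "ch1", "lines": 3},
    {"id": "ch2", "lines": 3},
]
print(list(two_pass_format(blocks, lines_per_page=5)))
# [(1, 'toc (see page 2)'), (1, 'ch1'), (2, 'ch2')]
```

Only the small table of page numbers is kept between the passes, which reflects the trade described above: the layout work is done twice, but formatted pages never have to be held back.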

The following shows how to perform 2-pass formatting.

CAUTION: 2-pass formatting is not available with CSS formatting.
CAUTION: 2-pass formatting is not available from the GUI.
CAUTION: 2-pass formatting is not available with AH Formatter V6.4 Lite.

Temporary File

AH Formatter V6.4 does not create temporary work files unless it is unavoidable. It creates temporary work files in the following cases.