Convert docx file to HTML


Our proprietary analysis program analyzes docx files created in Word and converts them into HTML5- or XHTML 1.0-compliant HTML that is much simpler than Word's standard HTML output and free of extraneous tags.

 


 

 


 

HTML source code output from Word's standard features

For standard Word functions

Word has a standard function that converts a document to HTML when it is saved. However, so that the layout and styling can be reproduced and the file re-edited in Word, it writes a large number of layout and style specifications directly into the "style" attributes of the tags for text, images, and so on. This generally makes the output unsuitable for publishing on the Web and difficult to customize or modify.

In some cases the output also lacks a proper HTML structure: the page reproduces the Word layout to some extent in a Web browser, but it uses tags that do not match the structure of the content.

 

 


HTML source code output from HTML on Word

Convert with HTML on Word

"HTML on Word" analyzes the contents of the docx file, minimizes layout and style information, and converts the structure of the text in the Word document into an appropriate HTML structure. Because there are no extra layout or style specifications, the generated HTML is simple and easy to customize or modify.
Layout and style can be specified separately using CSS, which makes it easy to keep a web page's HTML structure separate from its design.

 

 

Convert Word's style to HTML tags


Styles, paragraphs, etc. specified in Word are analyzed and converted into equivalent HTML tags for output.
The table below lists some of the tags to be converted. For detailed conversion specifications, please refer to "Conversion Specifications" in the online manual.

 

Main converted styles and tags

| Word's style | HTML tag |
| --- | --- |
| Body text | <body>-</body> |
| Heading 1 to 6 (Outline level 1 to 6) | <h1>-<h6> (Note: For HTML5, a <section> is output for each heading.) |
| Heading 7 to 9 (Outline level 7 to 9) | <p class="l7">-<p class="l9"> |
| Paragraph (normal) | <p>-</p> |
| Bullets | <ul><li>-</li></ul> |
| Paragraphs with numbering | <ol><li>-</li></ol> |
| Image | <img src="Path of output image"> |
| Table | <table><tbody><tr><td>-</td></tr></tbody></table> |
| Table style option: Title row | <thead><tr><td>-</td></tr></thead> |
| Table style option: First column | <tr><th>-</th><td>-</td>-</tr> |
| Table cells | <td>-</td> |
| Hyperlink | <a href="URL">-</a> |
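The heading rows of the table above can be read as a simple rule: outline levels 1 to 6 become heading tags, and levels 7 to 9 become paragraphs with a class. The following is a minimal sketch of that rule only. The function name `tag_for_heading` is hypothetical, and this is not the converter's actual implementation.

```python
def tag_for_heading(outline_level: int) -> str:
    """Return the opening tag listed in the table for a Word outline level.

    Hypothetical helper for illustration; it only mirrors the table rows
    "Heading 1 to 6" and "Heading 7 to 9".
    """
    if not 1 <= outline_level <= 9:
        raise ValueError("outline level must be between 1 and 9")
    if outline_level <= 6:
        # Heading 1 to 6 map to <h1> through <h6>.
        return f"<h{outline_level}>"
    # Heading 7 to 9 map to <p class="l7"> through <p class="l9">.
    return f'<p class="l{outline_level}">'


if __name__ == "__main__":
    for level in (1, 6, 7, 9):
        print(level, tag_for_heading(level))
```

Because levels 7 to 9 are distinguished only by class, they can be styled from CSS with selectors such as `p.l7`.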

 

Other HTML elements and tags to be converted/outputted

  • Output HTML version: HTML5 or XHTML 1.0

  • Header/Meta information: head, title, meta, link, style, script

  • Paragraph numbering and ordered lists: Normal paragraph or ol class="Numbering type", li

  • Paragraph style name (optional): class="Style name"

  • Image and shape formats: JPEG or PNG for images, SVG for line art

  • Layout options: Specify layout options by class

  • Position to output the figure with “With Text Wrapping” specified: Behind the anchored block

  • Formula: SVG by default, optional MathML or OMath output

  • Inline elements: strong, sub, sup, ruby, rp, rt, with optional italic, underline, and strikethrough output

  • Text color: Optionally, style color

  • Links and cross-references: External links, cross-references, links from auto-generated ToC to main text headings

  • Paragraph text alignment: class attribute

  • Endnote: Anchor to endnote symbol and link to footnote output at the end of the document

 

Convert Word's ToC to Web page ToC


The "Table of Contents" that can be automatically created in Word is converted into text links that can be used like a table of contents on a Web page.
Text links generated for each heading (outline level) make it easy to navigate to the desired heading.
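To illustrate the idea of a link-based table of contents, here is a minimal sketch that turns a list of headings into in-page text links. The function name `build_toc`, the `toc1`/`toc2` class names, and the anchor ids are all assumptions made for this example. The markup that "HTML on Word" actually generates may differ.

```python
from html import escape


def build_toc(headings):
    """Build a flat list of in-page links from (level, text, anchor_id) tuples.

    Hypothetical illustration: the class names and list markup are assumed,
    not taken from the converter's real output.
    """
    lines = ["<ul>"]
    for level, text, anchor_id in headings:
        href = escape(f"#{anchor_id}", quote=True)
        lines.append(
            f'<li class="toc{level}"><a href="{href}">{escape(text)}</a></li>'
        )
    lines.append("</ul>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_toc([(1, "Introduction", "intro"), (2, "Setup & usage", "setup")]))
```

Each link targets a heading whose `id` matches the anchor, so clicking an entry jumps to that heading in the browser.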

 


 

 

Easy to create with add-in


If Microsoft Word is installed on your computer, you can add the Word add-in when you install "HTML on Word". The add-in lets you output an HTML file directly from the Word document you are editing.

This can be done at any time while editing, so you can easily preview your creation.

 


 

Options

The add-in allows the following options to be specified during conversion.


 

Use specified CSS

You can specify a CSS file to be linked from the HTML. The specified CSS file is saved in the same folder as the HTML, so you can check the web page with the styles applied.

Line break with block tag

For the output HTML source code (strings to be written), you can specify whether or not to break lines after the closing tag of a block tag (h, p, table, etc.). The default setting is no line breaks. This can reduce HTML size by not outputting line feeds, but it also reduces readability because all code is generated on a single line.
By specifying the option, you can add line feed codes and output HTML with highly readable source code.

 


Output HTML source code [left: without line feed code, right: with line feed code]
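The effect of the line break option can be approximated on any HTML string. The sketch below inserts a line feed after the closing tag of a few block elements. The set of tags in the pattern is an assumption (the text names only "h, p, table, etc."), and the function `add_line_breaks` is hypothetical rather than part of the product.

```python
import re

# Closing tags treated as "block" tags in this sketch (an assumed subset).
BLOCK_CLOSE = re.compile(r"(</(?:h[1-6]|p|table|ul|ol|section)>)")


def add_line_breaks(html: str) -> str:
    """Insert a line feed after each closing block tag."""
    return BLOCK_CLOSE.sub(r"\1\n", html)


if __name__ == "__main__":
    single_line = "<h1>Title</h1><p>First paragraph.</p><p>Second paragraph.</p>"
    print(add_line_breaks(single_line))
```

The single-line form is slightly smaller, while the form with line feeds is easier to read and compare in a text editor.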

 

In addition, once you specify the destination folder, clicking the "Convert to HTML" button is all it takes to produce an HTML file and display it in the associated program.
If a file with the extension "html" is associated with a Web browser, the conversion results can be viewed immediately in that Web browser.

 

Supports command line


"Word2HTML", the program at the core of "HTML on Word" that analyzes docx files and converts them into HTML, can be run directly from the command line (the Windows command prompt or any program that can execute commands).

When executed from the command line, various options can be specified to output HTML with more detailed conversion settings.

 


In addition, by saving the conversion settings in a file, you can easily convert with the same settings by specifying the settings file at runtime from the command line.

Note: The settings file can also be used as the default settings for conversions from the add-in by saving it in a predetermined directory.

How to use command

To run a conversion from the command line, start the Windows command prompt and enter the command-line program "Word2HTML.exe", followed by the location and name of the input file (docx), the location and name of the output file (html), and any options you want to use. Then press Enter to execute.
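Because the converter is an ordinary command-line program, the same invocation can be scripted. The following is a minimal sketch in Python that assembles the argument list from the default install location and the option names (-xhtml, -endl, -css, -settings) given in this document. The helper names `build_command` and `convert` are hypothetical, placing options after the two file names simply follows the examples in this document, and nothing is known here about the program's exit codes.

```python
import subprocess

# Default install location quoted in this document; adjust for your system.
WORD2HTML = r"C:\Program Files\Antenna House\xhw12\Word2HTML.exe"


def build_command(docx, html, *, xhtml=False, endl=False,
                  css_files=(), settings_file=None):
    """Return the argument list for one Word2HTML run (nothing is executed)."""
    cmd = [WORD2HTML, docx, html]
    if settings_file:
        cmd += ["-settings", settings_file]
    if xhtml:
        cmd.append("-xhtml")
    if endl:
        cmd.append("-endl")
    for css in css_files:
        cmd += ["-css", css]  # one -css pair per stylesheet
    return cmd


def convert(docx, html, **options):
    """Run the converter and return its exit code (requires Windows and the product)."""
    return subprocess.run(build_command(docx, html, **options)).returncode


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works anywhere.
    print(build_command(r"c:\document\manual.docx", r"c:\out\index.html",
                        xhtml=True, endl=True,
                        css_files=[r"c:\document\sample.css"]))
```

Passing the arguments as a list lets `subprocess` handle the quoting of the path that contains spaces, which is why no manual quotation marks appear in the code.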

 

Example (The following three items are required):

Command program location and name

The default folder is: C:\Program Files\Antenna House\xhw12\Word2HTML.exe

Location and name of the original docx file to be converted

e.g., c:\document\manual.docx

Location and name of the converted html file

e.g., c:\out\index.html

 

Actual command

"C:\Program Files\Antenna House\xhw12\Word2HTML.exe" c:\document\manual.docx c:\out\index.html

 

Example with XHTML 1.0, with line feed code, and CSS file (c:\document\sample.css) options in addition to the above:

Actual command

"C:\Program Files\Antenna House\xhw12\Word2HTML.exe" c:\document\manual.docx c:\out\index.html -xhtml -endl -css c:\document\sample.css

 

Command and option list

Parameter

Description

<input-file>

(Required) Specify the input file name.

<output-file>

(Required) Specify the output file name.

-clrsettings

When this option is specified, option settings already loaded from the default settings file or elsewhere are cleared.

-settings <settings-file>

Reads the conversion option setting file specified in <settings-file>.

-xhtml

By default, tags are output in HTML syntax. If -xhtml is specified, tags are output in XML syntax.

-viewport <content>

Outputs a meta tag of the following format to <head>.

<meta name="viewport" content="Content specified in 'content'">

-endl

Outputs a line break at the end of the block tag.

-emptyP

By default, blank lines (lines containing only a line break) in Word are ignored when outputting HTML. When this option is specified, one empty <p></p> element is output for each blank line.

-nonrefiid

While editing in Word, a lot of IDs that are not internally referenced may be created. By default, this converter scans IDs that are not internally referenced and deletes them when outputting HTML. Unreferenced IDs will not be deleted when this option is specified.

-imgwidth

Outputs the width of the image.

-hstrong

Ignores the emphasis specified in the heading style.

-embedimg

When this option is not specified (default), images are output to the image folder.

When this option is specified, the images are embedded in the body HTML with a data URL.

-(x|o)math

Specifies the output format for formulas edited in the Word formula editor. The following four output formats can be specified:

Unspecified: Output formulas to <img> tags as files in svg file format.

-math: Output formulas to <img> tags as files in MathML format.

-xmath: Output formulas in MathML format markup.

-omath: Output formulas in Word's own Office Math format.

-throughimg

Outputs the image in its original format inserted into Word.

-pstyle

Outputs the style name of the paragraph by setting it as the value of the class attribute.

-citation

Outputs the value of tag in the Citation field by setting it as the value of the href attribute of the <a> tag.

-textcolor

Outputs the color specified for the text as <span style="color:color value">.

-italic n|t|s

Specifies the output method when italics are specified for text:

-italic n: Do not output. (default)

-italic t: Output as <i> tag.

-italic s: Output as <span style="font-style:italic;">.

-underline n|t|s

Specifies the output method when underline is specified for text:

-underline n: Do not output. (default)

-underline t: Output as <u> tag.

-underline s: Output as <span style="text-decoration-line:underline;">.

-linethrough n|t|s

Specifies the output method when strikethrough is specified for text:

-linethrough n: Do not output. (default)

-linethrough t: Outputs as <del> tag.

-linethrough s: Outputs as <span style="text-decoration-line: line-through;">.

-encoding <encoding>

When you want to specify a character code (encoding method) other than Unicode's UTF-8 for HTML files, specify the encoding method with this parameter.

-encoding Shift_JIS: Output in Shift-JIS (see Note 1)

-encoding UTF-16: Unicode's UTF-16 encoding

Note 1: Because Shift-JIS defines fewer characters than Unicode, Unicode characters that cannot be represented in Shift-JIS are output as &#xcharacter_number; (where character_number is a hexadecimal number). Note that the legacy vendor-specific characters that Microsoft added to JIS X0208 (e.g., ①, ②) are treated as Shift-JIS characters.

-defstyle

When this option is specified, the <style> element (element specifying the default CSS style) in <head> is not output.

-spaceindent

When this option is specified, any indentation at the beginning of a paragraph (one level or more) is converted to a single full-width space.

-outputbr

Instead of enclosing a paragraph in a <p> tag, outputs a <br> tag at the end of the paragraph. This option is not valid when the -xhtml parameter is specified.

-fileimages

Name the folder that stores image files as "destination_file_name.images".

-css cssfile

Links the CSS file. Place the CSS file in a folder on Windows and specify its path. An error will occur if the specified CSS file does not exist. You can optionally specify “media”.

Outputs a link tag of the following format in <head>.

<link href="xxx.css" rel="stylesheet" type="text/css" media="print">

The specified CSS file is copied to the HTML output destination folder.

You can specify multiple pairs of -css and CSS files.

-js javascript-path

Place the script tag in <head> and specify the path (URL) of the JavaScript file in its src attribute. No error will occur even if the specified JavaScript path does not exist.

-savesettings <settings-file>

Saves the conversion option values specified on the command line to the file named in <settings-file>.

-savedefault

Writes the conversion option values specified on the command line to the default settings file (def-settings.xml).
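The -embedimg entry in the list above refers to embedding images with a data URL. As a general illustration of that technique, the sketch below builds a base64 data URL from raw image bytes and wraps it in an <img> tag. The function names `to_data_url` and `img_tag` are hypothetical, and the exact attributes the converter writes are not specified here.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str) -> str:
    """Encode raw bytes as a base64 data URL for the given MIME type."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def img_tag(image_bytes: bytes, mime_type: str) -> str:
    """Return an <img> tag whose src carries the image itself."""
    return f'<img src="{to_data_url(image_bytes, mime_type)}">'


if __name__ == "__main__":
    # Placeholder bytes stand in for real PNG data in this example.
    print(img_tag(b"abc", "image/png"))
```

Embedding keeps the page in a single file at the cost of a larger HTML document, since base64 adds roughly one third to the size of each image.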

For detailed specifications, please refer to "Command-line version" in the online manual.
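The behavior described in Note 1 for -encoding Shift_JIS can be sketched in a few lines: characters that Shift-JIS can represent are encoded directly, and all others are replaced with hexadecimal character references. This is an illustration under assumptions, not the converter's code. It uses Python's "cp932" codec, which is Microsoft's extended Shift-JIS and therefore includes characters such as ①, and the uppercase hexadecimal digits are a choice made for this example.

```python
def encode_shift_jis(text: str) -> bytes:
    """Encode text as Shift-JIS (cp932), using &#x...; for unsupported characters."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("cp932")
        except UnicodeEncodeError:
            # Fall back to a hexadecimal numeric character reference.
            out += f"&#x{ord(ch):X};".encode("ascii")
    return bytes(out)


if __name__ == "__main__":
    sample = "日本語 \u2460 \U0001F600"
    print(encode_shift_jis(sample))
```

Python's built-in `xmlcharrefreplace` error handler does something similar but emits decimal references, which is why the fallback is written out by hand here.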