How can program developers convert Microsoft Office files to PDF or images?

There is a great need to view Word, Excel, and PowerPoint files created with Microsoft Office in a browser. Also, if you are a developer, it is only natural to want to implement this functionality on your own. For example, the following question was asked on StackOverflow on March 6, 2017.

 

"I have a request to create a (Office) document viewer in my web application (C#.Net) and I don't want to use any third party tools for this. Easy visualization of files as images or PDFs or web pages. Can you convert it to a common format that can be used?" ( Taken from StackOverflow question [1] )

 

Similar questions are frequently posted. This can be easily achieved using tools such as Office Server Document Converter (formerly Server Based Converter) ( [2] ), but here we will try to sort out how to develop it on our own without using tools.

How to view Office files in a browser


To view an Office file in a browser, you must first convert the Office file to a format that can be viewed in the browser. There are three possible ways to do this:

  1. Convert the Office file to HTML to be viewed as a webpage
  2. Convert the Office file to PDF to be viewed within a browser
  3. Convert the Office file to an image such as a raster image or SVG

With the first method, the browser renders the converted HTML as an ordinary web page. With the second, the browser displays the converted PDF; many current browsers, including Chrome and Microsoft Edge, can display PDFs directly. With the third, the Office file is converted to a raster image or SVG, and the browser displays that image.

 

Figure: Flowchart of how to display content from MS Office files by converting Office files to PDF, HTML, or images

Easy if you can use Microsoft Office


All three options are relatively easy to implement if you can use Microsoft Office for the conversion process.

Microsoft Word and Microsoft Excel have a "Save As" function in the "File" menu, where you can select the file type. File types include "Web format (HTML)" and "PDF". Images can't be created directly from Word or Excel, but it's easy to produce them by converting the PDF with another tool.

In Microsoft PowerPoint, you can choose between "PDF" and "Image (PNG, JPEG, TIFF, etc.)" formats. PowerPoint cannot be saved as HTML. A tool is also provided to automate Microsoft Office's "Save As".

 


 

What to do when you don't use Microsoft Office


Imagine a developer doing the same thing on their own without using Microsoft Office or third-party tools. The question quoted above was answered by a Microsoft engineer outlining how to do it. Let's take a closer look here.

Even without third-party tools, the degree of difficulty varies greatly depending on whether you can use open source software and on how accurately you want to reproduce the layout that Microsoft Office produces when it prints the file. The work involves the following steps:

  1. Familiarize yourself with Office File Format

    To convert an Office file, you have to read the Office file first. If you write the reader yourself, you need to be familiar with the Microsoft Office document formats.

    Until Office 2003, the document format of Microsoft Office was a binary format. Starting with Office 2007, it is specified using XML. The Office file format defined by XML is called Office Open XML (OOXML) ( [3] ). The OOXML format is an international standard and open to the public, so anyone can understand the file format if they study it. 

    However, if you read the files using an open source library such as Apache POI ( [4] ), you may not need to master the Office file format.
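
As a first step, a reader has to tell the two generations of formats apart. The following is a minimal sketch, assuming the file content is already in memory: binary Office files use the compound file container, which starts with a fixed 8-byte signature, while OOXML files are ZIP packages that contain a part named [Content_Types].xml. The function name sniff_office_format is hypothetical. Password-protected OOXML files are also stored in the compound file container, so this sketch would report them as "binary".

```python
import io
import zipfile

# Signature at the start of compound files (.doc, .xls, .ppt).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Local file header signature at the start of a non-empty ZIP archive.
ZIP_MAGIC = b"PK\x03\x04"


def sniff_office_format(data: bytes) -> str:
    """Classify raw bytes as 'binary', 'ooxml', or 'unknown'."""
    if data.startswith(OLE2_MAGIC):
        return "binary"
    if data.startswith(ZIP_MAGIC):
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as package:
                # Every OOXML package lists its parts in [Content_Types].xml.
                if "[Content_Types].xml" in package.namelist():
                    return "ooxml"
        except zipfile.BadZipFile:
            pass
    return "unknown"
```

A caller could use the result to decide whether to go down the ZIP and XML route or to hand the file to a binary-format reader.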


  2. Create a program to read Office files

    Reading files in OOXML format is relatively easy to do on your own. You just need to change the file extension to ".zip" and use a tool to extract the file by unzipping it. Once you have unzipped the file, you will see that the internal contents are in XML format. Therefore, you can use a program that can parse XML in order to extract the contents of the document file.

    On the other hand, reading binary files from Office 2003 and older versions is not as easy. If you use Microsoft Office, you can read binary files in newer versions of Office without worrying about it.

    However, if you do it yourself, you will need to develop a dedicated program to read the binary format. Writing a program to decode and read the binary data is hard work. In the past, you had to create your own from scratch, but now it's more realistic to use Apache POI.

    Another method is to use OpenOffice to read binary files created in Microsoft Office. There is a limit to what such a tool can read, so if you need anything beyond that limit, you will need to create your own program.
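
The unzip-and-parse approach for OOXML files described in this step can be sketched with Python's standard library. This is a minimal sketch, not a complete reader: it assumes a Word file whose main part is word/document.xml, and it only collects paragraph text, ignoring styles, tables, and images. Text inside nested structures such as text boxes may be reported more than once. The names read_docx_paragraphs and make_sample_docx are hypothetical; the latter builds a stand-in package so the example runs without a real file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of WordprocessingML elements such as w:p (paragraph) and w:t (text).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_docx_paragraphs(source) -> list:
    """Return the text of every paragraph found in word/document.xml.

    `source` is a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # Text is split across runs; join every w:t inside the paragraph.
        pieces = (node.text or "" for node in paragraph.iter(f"{{{W_NS}}}t"))
        paragraphs.append("".join(pieces))
    return paragraphs


def make_sample_docx() -> bytes:
    """Build a stand-in package holding only the part this sketch reads.

    A real .docx contains more parts (content types, relationships, styles).
    """
    document_xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", document_xml)
    return buffer.getvalue()


if __name__ == "__main__":
    # Prints ['Hello world', 'Second paragraph']
    print(read_docx_paragraphs(io.BytesIO(make_sample_docx())))
```

Because zipfile opens the package directly, renaming the file to ".zip" is only needed when you inspect it by hand.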

  3. Familiarize yourself with the destination file format

    Next, choose the destination file format: PDF, HTML or Image. You need to be thoroughly familiar with the format of the destination file in order to export the file yourself.

    PDF is defined by the international standard ISO 32000-1:2008 ( [5] ). If you would rather not learn the PDF format in detail, you can use a library that provides an API for generating PDFs. If you don't need advanced PDF features (PDF/A, PDF/X, Accessible PDF (PDF/UA)), open-source PDF export libraries are also available.

    Image file formats such as PNG and JPEG are publicly available and anyone willing to put in the time and effort can become familiar with them.
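
To give a sense of what becoming familiar with a destination format involves, here is a minimal sketch of a PNG writer that uses only Python's standard library. It assumes the simplest case: 8-bit grayscale pixels, no scanline filtering, and a single IDAT chunk. The function names png_chunk and encode_gray_png are hypothetical.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_chunk(kind: bytes, payload: bytes) -> bytes:
    """Build one chunk: length, type, data, then CRC-32 over type + data."""
    crc = zlib.crc32(kind + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + kind + payload + struct.pack(">I", crc)


def encode_gray_png(rows: list) -> bytes:
    """Encode rows of 8-bit grayscale pixels (one byte per pixel) as PNG."""
    height = len(rows)
    width = len(rows[0])
    # IHDR: width, height, bit depth 8, colour type 0 (grayscale),
    # compression method 0, filter method 0, no interlacing.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    # Each scanline starts with a filter-type byte; 0 means "no filter".
    raw = b"".join(b"\x00" + bytes(row) for row in rows)
    return (
        PNG_SIGNATURE
        + png_chunk(b"IHDR", ihdr)
        + png_chunk(b"IDAT", zlib.compress(raw))
        + png_chunk(b"IEND", b"")
    )


if __name__ == "__main__":
    # A 2x2 checkerboard: black/white on the first row, white/black on the second.
    with open("checker.png", "wb") as out:
        out.write(encode_gray_png([b"\x00\xff", b"\xff\x00"]))
```

Colour, transparency, and filtering add more cases, which is where the time and effort mentioned above go.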

  4. Convert the Office file to the destination file format


    a) Convert the Office file to PDF

    When converting an Office file to PDF, you need to create a program (called a renderer or formatter) that lays out the content of the Office file on pages in the same way as when the file is printed on paper.

    Renderers vary widely in complexity. For example, if all you need to do is extract text from an Office file and arrange the text on a page of a certain size, creating a renderer is relatively easy. Going further, if you want the PDF to have the same layout as a printout from Microsoft Office, developing the renderer becomes hard.

    For example, you have to set the heading text and the body text to the same sizes as in the original Office file. Beyond that, you'll need to find out how Microsoft Office creates print layouts and program a compatible emulation of it.

    Some office suites offer Microsoft Office compatible functions, such as Apache OpenOffice ( [6] ) and LibreOffice ( [7] ). LibreOffice is said to be more compatible with Microsoft Office than OpenOffice, but its reproduction of print layouts is still not fully accurate ( [8] ).

    Antenna House's Office Server Document Converter (formerly Server Based Converter) ( [2] ) has much higher fidelity than OpenOffice and LibreOffice, but it is not 100% compatible with Microsoft Office print layouts.
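
The easy end of the renderer spectrum, placing extracted text on a page of a fixed size, can be sketched as follows. This is a minimal sketch under several assumptions: ASCII text only, a single A4 page (595 x 842 points), the built-in Helvetica font, and no line wrapping or pagination. The function names escape_pdf_text and text_to_pdf are hypothetical. It makes no attempt to match Microsoft Office's layout.

```python
def escape_pdf_text(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", r"\\").replace("(", r"\(").replace(")", r"\)")


def text_to_pdf(lines, width=595, height=842, font_size=12, margin=72) -> bytes:
    """Place one text line under another on a single page and return PDF bytes."""
    leading = round(font_size * 1.2)  # distance between baselines
    ops = [
        "BT",
        f"/F1 {font_size} Tf",
        f"{leading} TL",
        f"{margin} {height - margin - font_size} Td",
    ]
    for line in lines:
        # Tj shows the string; T* moves to the start of the next line.
        ops.append(f"({escape_pdf_text(line)}) Tj T*")
    ops.append("ET")
    stream = "\n".join(ops).encode("ascii", "replace")

    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        (
            f"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 {width} {height}] "
            "/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>"
        ).encode("ascii"),
        b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset recorded in the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_position = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_position
    return bytes(out)


if __name__ == "__main__":
    with open("out.pdf", "wb") as handle:
        handle.write(text_to_pdf(["Extracted heading", "Extracted body text"]))
```

Everything that makes a faithful renderer hard (fonts, line breaking, tables, page breaks, floating objects) is absent here, which is why this end of the range is the easy one.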



    b) Convert Office files to HTML

    For example, when you convert a Word file to HTML, simply enclosing the text of each paragraph in paragraph tags (p tags) is easy. Adding heading tags (h1 to h6) is also relatively easy if the person who created the Word document used heading styles, because paragraphs with heading styles can be mapped directly to HTML heading tags.

    However, if the Word document does not have heading styles set, it becomes difficult to identify headings. For example, you may need to write a program that infers heading ranks (h1 to h6) from the size of the paragraph text. Mapping Word documents to HTML tags presents different challenges from PDF output.

    In HTML, the Microsoft Office layout is mainly expressed with CSS. Therefore, just as with PDF, how closely you want to reproduce the layout greatly affects the difficulty of development.
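
The two cases described above, mapping heading styles directly and falling back to text size when no styles are set, can be sketched as one function. This is a minimal sketch: the name paragraph_to_html, the 11-point default body size, and the size thresholds are all assumptions chosen for illustration, and the style names match only English-style identifiers such as "Heading1" or "heading 1".

```python
import html
import re

# Matches style names such as "Heading1" or "heading 2".
HEADING_STYLE = re.compile(r"heading\s*([1-6])", re.IGNORECASE)


def paragraph_to_html(text, style=None, font_size_pt=None, body_size_pt=11.0):
    """Wrap one paragraph in <h1>..<h6> or <p>.

    A heading style wins if present; otherwise the font size relative to
    the assumed body size is used as a rough guess.
    """
    escaped = html.escape(text)
    match = HEADING_STYLE.fullmatch(style or "")
    if match:
        level = int(match.group(1))
    elif font_size_pt is not None and font_size_pt >= body_size_pt * 1.3:
        ratio = font_size_pt / body_size_pt
        # Hypothetical thresholds: larger text is treated as a higher rank.
        level = 1 if ratio >= 2.0 else 2 if ratio >= 1.5 else 3
    else:
        return f"<p>{escaped}</p>"
    return f"<h{level}>{escaped}</h{level}>"


if __name__ == "__main__":
    print(paragraph_to_html("Overview", style="Heading1"))
    print(paragraph_to_html("Unstyled title", font_size_pt=24))
    print(paragraph_to_html("Body text with <angle brackets>"))
```

The size-based branch is only a heuristic; a document that uses large text for emphasis rather than for headings would be tagged incorrectly.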


    c) Convert Office file to image

    If you want to create an image file directly from an Office file, you can create a renderer similar to the one used for PDF pages and have it output an image file. Another option is to convert the PDF into an image. Alternatively, you may be able to create HTML and then render it as an image in a headless browser.
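
The last two options rely on external programs rather than your own renderer. The following is a minimal sketch that only builds the command lines, assuming Poppler's pdftoppm and a Chrome or Chromium binary are installed. The executable names passed as defaults are assumptions and differ between platforms, and the function names are hypothetical.

```python
import subprocess


def pdf_to_png_command(pdf_path, output_prefix, dpi=150, tool="pdftoppm"):
    """Command that rasterizes each PDF page to a PNG file (Poppler)."""
    return [tool, "-png", "-r", str(dpi), pdf_path, output_prefix]


def html_to_png_command(html_url, png_path, width=1280, height=1024,
                        browser="chrome"):
    """Command that captures a page as an image with a headless browser."""
    return [
        browser,
        "--headless",
        f"--screenshot={png_path}",
        f"--window-size={width},{height}",
        html_url,
    ]


def run(command):
    """Run a prepared command and raise if the tool reports an error."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Printing instead of running keeps the example free of side effects.
    print(pdf_to_png_command("report.pdf", "page"))
    print(html_to_png_command("file:///tmp/report.html", "report.png"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.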

 

 

 

Summary


If it is just a matter of extracting text and embedded images from Office files and displaying them in a browser, developing the conversion on your own is easy. On the other hand, it takes a lot of effort to create a program that reproduces the print layout of Microsoft Office.

 

Reference material