Antenna House PDFXML Conversion Library

Convert PDF to XML!

Unlock the Power of XML

Antenna House PDFXML Conversion Library allows you to convert PDF to XML and unlock the content from your legacy PDFs. If you want to reuse content from old PDFs, you no longer need to retype or go through the trouble of reconstructing your documents’ content from the PDF binary format.

Antenna House PDFXML is designed for organizations that need to convert large volumes of PDFs into XML, HTML5, XSL-FO, DocBook, or other file formats. The Antenna House PDFXML Conversion Library extracts text, tables, and images from PDFs and converts them to an XML format that we call “AHPDFXML”. The data can then be transformed to any desired output by applying XSLT stylesheets. Benefits and uses for XML include:

  • Content Reusability
  • Improved Searchability
  • Good for Accessibility
  • Promotes Interoperability and Data Integration
  • Platform Independent
  • Vendor Independent

New Enhancements

  • EMF has been added as an output option for image files
  • Vertically written pages are now detected from vertical writing symbols when CID fonts that mix vertical and horizontal writing are used.
  • Detection of horizontally written numbers on vertically written pages as page numbers has been improved.
  • The analysis of merged cells both vertically and horizontally has been improved.

To ensure the product meets your requirements, we encourage all potential customers to contact us first for a free trial of the software.

The Antenna House PDFXML Conversion Library is a C/C++ library (a command-line program is also included) that generates a richly structured XML document from a PDF by using Antenna House’s PDF Analyzer Technology.

How it works:

  1. Loads the information for each page from the PDF
  2. Extracts vertical and horizontal lines from line drawings
  3. Analyzes the tables
  4. Creates the text in the table cells
  5. Creates the text lines of the body
  6. Creates paragraphs from lines
  7. Creates the area information from paragraphs
  8. Creates sections (columns)
  9. Outputs the information for each page to AHPDFXML
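
To give a feel for step 6, here is a minimal sketch of how positioned text lines might be grouped into paragraphs. It is an illustration only, not Antenna House's actual analysis: the `TextLine` class, the `group_into_paragraphs` function, and the `gap_factor` threshold are all hypothetical names and values chosen for this example, and the heuristic (start a new paragraph when the vertical gap is large relative to the line height) is a simple stand-in.

```python
from dataclasses import dataclass


@dataclass
class TextLine:
    """A line of body text with its position on the page (hypothetical model)."""
    text: str
    top: float     # distance of the line's top edge from the top of the page, in points
    height: float  # height of the line, in points


def group_into_paragraphs(lines, gap_factor=0.6):
    """Group lines into paragraphs.

    Assumption for this sketch: a new paragraph starts whenever the vertical
    gap to the previous line exceeds gap_factor times that line's height.
    """
    paragraphs = []
    current = []
    previous = None
    for line in sorted(lines, key=lambda item: item.top):
        if previous is not None:
            gap = line.top - (previous.top + previous.height)
            if gap > gap_factor * previous.height:
                # Large gap: close the current paragraph and start a new one.
                paragraphs.append(" ".join(current))
                current = []
        current.append(line.text)
        previous = line
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


lines = [
    TextLine("Antenna House PDFXML converts", top=100.0, height=12.0),
    TextLine("PDF content to XML.", top=114.0, height=12.0),
    TextLine("The output can be transformed", top=146.0, height=12.0),
    TextLine("with XSLT.", top=160.0, height=12.0),
]
paragraphs = group_into_paragraphs(lines)
for paragraph in paragraphs:
    print(paragraph)
```

In this sample, the 2-point gaps keep lines together and the 20-point gap starts a second paragraph. A real analyzer would also need to consider indentation, font changes, and columns.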

The XML format output by this conversion library is called Antenna House PDFXML format. It is a verbose format, defined by Antenna House, that represents the content of a PDF in an intermediate XML structure. It is created by converting the contents of a PDF into XML expressions for text, tables, and images.

Antenna House PDFXML consists of multiple files:

  • Catalog File (input file for stylesheets) – manages the AHPDFXML files
  • Document File – stores the main body of the PDF document structure
  • Style File – defines the style applied to the respective elements of a document
  • External Files – images and graphics output as JPEG, PNG, BMP, SVG, etc.

The resulting XML can then be transformed with XSLT to any format that represents the document structure, such as XSL-FO, DocBook, HTML5, or simply plain text. With Antenna House PDFXML, you now have the means to take advantage of PDF content in a wide range of environments. Transforming PDF content to XML makes it much easier to reuse, transform, manipulate, and search the data. By applying different XSLT stylesheets, you can process the data flexibly to suit how it will be used.
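
As a rough illustration of why structured XML is easier to reuse than PDF, the sketch below turns a small XML fragment into plain text. Two caveats apply. First, the element names (`document`, `page`, `paragraph`, `table`, `row`, `cell`) are invented for this example and are not the real AHPDFXML schema. Second, Python's standard library has no XSLT processor, so this sketch walks the tree with `xml.etree.ElementTree` as a stand-in for the XSLT transformation described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified XML. These element names are invented for
# illustration and are NOT the real AHPDFXML schema.
SAMPLE = """
<document>
  <page number="1">
    <paragraph>Unlock the Power of XML</paragraph>
    <table>
      <row><cell>Format</cell><cell>Use</cell></row>
      <row><cell>HTML5</cell><cell>Web</cell></row>
    </table>
  </page>
</document>
"""


def to_plain_text(xml_source):
    """Flatten paragraphs and table rows into lines of plain text."""
    root = ET.fromstring(xml_source.strip())
    out = []
    for page in root.iter("page"):
        for node in page:
            if node.tag == "paragraph":
                out.append((node.text or "").strip())
            elif node.tag == "table":
                # One output line per row, cells separated by tabs.
                for row in node.iter("row"):
                    cells = [(cell.text or "").strip() for cell in row.iter("cell")]
                    out.append("\t".join(cells))
    return "\n".join(out)


text = to_plain_text(SAMPLE)
print(text)
```

In a real workflow, the same kind of mapping would typically be written once as an XSLT stylesheet and applied to the catalog file, so that a different stylesheet can produce HTML5, DocBook, or XSL-FO from the same source.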

Supported PDFs

  • PDF 1.3 to 1.7
  • PDFs compliant with ISO 32000-1:2008
  • PDFs created with Antenna House software

Limitations

  • The conversion of annotation data is not supported.
  • The conversion of data in form fields is not supported.
  • Gradients and patterns in line drawings are not supported.
  • When a character code cannot be converted to Unicode, the characters may be output incorrectly.

Windows

  • Windows Server 2016
  • Windows Server 2012 R2
  • Windows Server 2008 R2
  • Windows Server 2008 (32bit/64bit)
  • Windows 10 (32bit/64bit)
  • Windows 8.1 (32bit/64bit)
  • Windows 7 (32bit/64bit)

Linux 64bit (built with GCC 4.8)

  • Linux Red Hat Enterprise series
  • Linux CentOS series
  • Linux Fedora series
  • Requires the runtime libraries libc.so.6 (glibc-2.17) and libstdc++.so.6 (libstdc++.so.6.0.19)

Antenna House PDFXML Conversion Library V2.0 Price

  • Production Server + Development License: $10,000
  • Annual Maintenance: $2,000

Contact Us