AHPDFXML Conversion Library V2.0

Convert PDFs to XML!

Unlock the Power of XML

AHPDFXML Conversion Library V2.0 unlocks the content of your legacy PDFs. To reuse content from old PDFs, you no longer need to retype it or go through the trouble of reconstructing your documents’ content from the PDF binary format.

AHPDFXML is designed for organizations that need to convert large volumes of PDFs into XML, HTML5, XSL-FO, DocBook, or other file formats. The AHPDFXML Conversion Library extracts text, tables, and images from PDFs and converts them to an XML format which we call “AHPDFXML”. The data can then be transformed to any desired output by applying XSLT stylesheets. Benefits and uses of XML include:

  • Content Reusability
  • Improved Searchability
  • Good for Accessibility
  • Promotes Interoperability and Data Integration
  • Platform Independent
  • Vendor Independent

To make sure the product meets your requirements, we encourage all potential customers to contact us for a free trial of the software first.



The AHPDFXML Conversion Library is a C/C++ library (a command-line program is also included) that generates a richly structured XML document from PDFs by using Antenna House’s PDF Analyzer Technology.

How it works:

  1. Loads the information for each page from the PDF
  2. Extracts vertical and horizontal lines from line drawings
  3. Analyzes the tables
  4. Creates text in the table cells
  5. Creates text lines of the body
  6. Creates paragraphs from lines
  7. Creates the area information from paragraphs
  8. Creates sections (columns)
  9. Outputs the information for each page to AHPDFXML
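
As an illustration of steps 5 and 6 only, the following is a minimal Python sketch of how positioned text fragments could be grouped into lines and then into paragraphs. It is not Antenna House's PDF Analyzer Technology: the `Fragment` model, the function names, and the `y_tol` and `max_gap` thresholds are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fragment:
    """A piece of text with its page position (hypothetical model)."""
    text: str
    x: float  # left edge
    y: float  # baseline, measured downward from the top of the page


def build_lines(fragments: List[Fragment], y_tol: float = 2.0) -> List[List[Fragment]]:
    """Step 5 (sketch): group fragments whose baselines nearly coincide."""
    lines: List[List[Fragment]] = []
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        if lines and abs(lines[-1][0].y - frag.y) <= y_tol:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Restore left-to-right reading order within each line.
    for line in lines:
        line.sort(key=lambda f: f.x)
    return lines


def build_paragraphs(lines: List[List[Fragment]], max_gap: float = 18.0) -> List[str]:
    """Step 6 (sketch): start a new paragraph when the vertical gap is large."""
    paragraphs: List[List[str]] = []
    prev_y = None
    for line in lines:
        text = " ".join(f.text for f in line)
        y = line[0].y
        if prev_y is None or y - prev_y > max_gap:
            paragraphs.append([text])
        else:
            paragraphs[-1].append(text)
        prev_y = y
    return [" ".join(p) for p in paragraphs]


# Made-up sample data: two lines close together, then a larger gap.
fragments = [
    Fragment("world", 60, 100), Fragment("Hello", 10, 100.5),
    Fragment("again.", 10, 114),
    Fragment("New paragraph.", 10, 150),
]
paragraphs = build_paragraphs(build_lines(fragments))
print(paragraphs)
```

A real analyzer would also have to account for fonts, indentation, columns, and table regions; the fixed thresholds here only make the idea concrete.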


The XML format output by this conversion library is called the AHPDFXML format. It is a verbose format defined by Antenna House that represents the content of a PDF in an intermediate XML structure. It is created by converting the contents of a PDF into XML expressions for text, tables, and images.

AHPDFXML consists of multiple files:

  • Catalog File (input file for stylesheets) – manages the AHPDFXML files
  • Document File – stores the main body of a PDF document configuration
  • Style File – defines the style applied to the respective elements of a document
  • External Files – outputs JPEG, PNG, BMP, SVG, etc.

See AHPDFXML Schema Documentation for more detail.

The resulting XML can then be transformed with XSLT to any format that represents the document structure, such as XSL-FO, DocBook, HTML5, or simply text. With AHPDFXML, you now have the means to take advantage of PDF content in a wide range of environments. Transforming PDF content to XML makes it much easier to reuse, transform, manipulate, and search the data. Applying an XSLT stylesheet gives you more flexibility in processing the data, depending on how it will be used.
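
To make the idea of transforming an intermediate XML into another format concrete, here is a small Python sketch that renders an XML document as plain text. Two caveats: it uses Python's standard `xml.etree.ElementTree` module instead of XSLT, and the element names (`document`, `page`, `para`, `table`, `row`, `cell`) are hypothetical stand-ins, not the real AHPDFXML schema. Consult the AHPDFXML Schema Documentation for the actual structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for converter output; NOT the real AHPDFXML schema.
SAMPLE = """\
<document>
  <page number="1">
    <para>Quarterly report</para>
    <table>
      <row><cell>Region</cell><cell>Sales</cell></row>
      <row><cell>North</cell><cell>120</cell></row>
    </table>
  </page>
</document>
"""


def to_plain_text(xml_source: str) -> str:
    """Render paragraphs as lines and table rows as tab-separated lines."""
    root = ET.fromstring(xml_source)
    out = []
    for page in root.iter("page"):
        for node in page:  # direct children, in document order
            if node.tag == "para":
                out.append((node.text or "").strip())
            elif node.tag == "table":
                for row in node.iter("row"):
                    cells = [(c.text or "").strip() for c in row.iter("cell")]
                    out.append("\t".join(cells))
    return "\n".join(out)


text = to_plain_text(SAMPLE)
print(text)
```

The same traversal could be written as an XSLT stylesheet with one template per element, which is the approach the library is designed around.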

Supported PDFs:

  • PDF 1.3 to 1.7
  • PDFs compliant with ISO 32000-1:2008
  • PDFs created with Antenna House software

Limitations:

  • The conversion of annotation data is not supported.
  • The conversion of data in form fields is not supported.
  • Gradients and patterns in line drawings are not supported.
  • When a character code cannot be mapped to Unicode, the characters may be output incorrectly.


Windows

  • Windows Server 2016
  • Windows Server 2012 R2
  • Windows Server 2008 R2
  • Windows Server 2008 (32-bit/64-bit)
  • Windows 10 (32-bit/64-bit)
  • Windows 8.1 (32-bit/64-bit)
  • Windows 7 (32-bit/64-bit)

Linux 64-bit (built with GCC 4.8)

  • Red Hat Enterprise Linux series
  • CentOS series
  • Fedora series
  • Requires the runtime libraries libc.so.6 (glibc-2.17) and libstdc++.so.6 (libstdc++.so.6.0.19)
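
As a convenience, the glibc requirement can be checked on a target host before installation. The sketch below uses Python's standard `platform.libc_ver()` and assumes that the listed glibc version (2.17) is a minimum and not an exact match; the helper names `parse_version` and `glibc_is_sufficient` are made up for this example, and the libstdc++ requirement is not checked.

```python
import platform
from typing import Tuple

# Taken from the requirement list above; ASSUMED here to be a minimum version.
REQUIRED_GLIBC = (2, 17)


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a string such as '2.17' into a comparable tuple (2, 17)."""
    return tuple(int(part) for part in version.split(".") if part.isdigit())


def glibc_is_sufficient(version: str,
                        required: Tuple[int, ...] = REQUIRED_GLIBC) -> bool:
    """Return True when the given glibc version meets the assumed minimum."""
    parsed = parse_version(version)
    return bool(parsed) and parsed >= required


# libc_ver() reports the C library the Python executable is linked against;
# it returns empty strings when glibc cannot be detected.
lib, version = platform.libc_ver()
if lib == "glibc":
    status = "ok" if glibc_is_sufficient(version) else "too old"
    print(f"glibc {version}: {status}")
else:
    print("glibc not detected; this check applies to Linux hosts only")
```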


AHPDFXML Conversion Library V2.0 Price

  • Production Server + Development License: $10,000
  • Annual Maintenance: $2,000