AHPDFXML Conversion Library

Convert PDFs to XML!


Unlock the Power of XML

If you need to reuse content from old PDFs, there’s no need to retype or reconstruct your documents anymore. With AHPDFXML Conversion Library, you can extract text, figures, or images from a PDF and convert them to an XML format. Converting to XML makes your data better organized and easier to analyze than unstructured data such as a PDF document. Benefits and uses for XML include:

  • Content Reusability
  • Improved Searchability
  • Good for Accessibility
  • Promotes Interoperability and Data Integration
  • Basis for Interconversion Among Document Formats

If you are interested, contact info@antennahouse.com.


Various Uses

There are several reasons why you'd want to convert PDF to XML:

  • Modifying legacy PDF content
  • Reusing legacy content
  • Increasing content accessibility
  • Outputting old content into new formats (eBook, DocBook, MS Word, etc.)

Reuse PDF content anywhere

This library generates the document structure by using Antenna House’s PDF analysis technology and outputs XML data that is suitable for reuse. This allows you to convert PDF content to Office document formats on any operating system or non-PC device. It provides a means to take advantage of PDF content in a wide range of environments.

Easily handles PDF content

Transforming PDF content to XML makes the data easy to reuse, transform, manipulate, and search. Applying an XSLT stylesheet gives you the flexibility to process the data differently depending on how it will be used, so you can bring the full benefits of XML to your PDF data.

Extract text from any part of a PDF

You can easily extract the text from an arbitrary range of any PDF document by using the text elements in the AHPDFXML output.
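
As a rough illustration of range-based text extraction, the following Python sketch reads converted XML and collects the text from a chosen range of pages. The element and attribute names (`document`, `page`, `number`, `text`) are hypothetical stand-ins, not the actual AHPDFXML schema, and the sample data is invented for the example. It uses only the Python standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: these element and attribute names are
# illustrative assumptions, NOT the real AHPDFXML schema.
SAMPLE = """\
<document>
  <page number="1">
    <text>Annual Report</text>
    <text>Prepared by the finance team</text>
  </page>
  <page number="2">
    <text>Revenue grew in every region.</text>
  </page>
  <page number="3">
    <text>Appendix</text>
  </page>
</document>
"""


def extract_text(xml_source: str, first_page: int, last_page: int) -> list:
    """Return the text of each <text> element on pages first_page..last_page."""
    root = ET.fromstring(xml_source)
    lines = []
    for page in root.iter("page"):
        number = int(page.get("number", "0"))
        if first_page <= number <= last_page:
            for text in page.iter("text"):
                # itertext() also gathers text nested in child elements.
                lines.append("".join(text.itertext()).strip())
    return lines


if __name__ == "__main__":
    for line in extract_text(SAMPLE, 1, 2):
        print(line)
```

With real output from the library, the same approach should apply once the tag names are replaced with those defined by the AHPDFXML format.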

Quickly generates structured content from a PDF

The XML format output by this conversion library is called the AHPDFXML format. It is created by converting the contents of a PDF into XML expressions for text, tables, and images. By using the text frame elements (line, paragraph, and column), the table frame element, and the image frame element, you can convert the content to any format that represents document structure, such as DocBook, HTML5, or XSL-FO.
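
To give a sense of how frame elements might map to a structured format, here is a minimal Python sketch that turns text, table, and image frames into HTML5 fragments. The tag names (`text-frame`, `paragraph`, `table-frame`, `row`, `cell`, `image-frame`, `src`) are assumptions made for illustration and do not reflect the actual AHPDFXML element names. In practice an XSLT stylesheet could do the same job.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical sample data: tag and attribute names are illustrative
# assumptions, NOT the real AHPDFXML schema.
SAMPLE = """\
<document>
  <text-frame>
    <paragraph>Quarterly results</paragraph>
    <paragraph>Sales &amp; support both improved.</paragraph>
  </text-frame>
  <table-frame>
    <row><cell>Region</cell><cell>Revenue</cell></row>
    <row><cell>East</cell><cell>120</cell></row>
  </table-frame>
  <image-frame src="chart.png"/>
</document>
"""


def element_text(element) -> str:
    """Collect and HTML-escape all text inside an element."""
    return escape("".join(element.itertext()).strip())


def to_html(xml_source: str) -> str:
    """Convert each top-level frame into an HTML5 fragment, in document order."""
    root = ET.fromstring(xml_source)
    parts = []
    for frame in root:
        if frame.tag == "text-frame":
            for para in frame.iter("paragraph"):
                parts.append("<p>" + element_text(para) + "</p>")
        elif frame.tag == "table-frame":
            rows = []
            for row in frame.iter("row"):
                cells = "".join(
                    "<td>" + element_text(cell) + "</td>"
                    for cell in row.iter("cell")
                )
                rows.append("<tr>" + cells + "</tr>")
            parts.append("<table>" + "".join(rows) + "</table>")
        elif frame.tag == "image-frame":
            src = escape(frame.get("src", ""), quote=True)
            parts.append('<img src="' + src + '" alt="">')
    return "\n".join(parts)


if __name__ == "__main__":
    print(to_html(SAMPLE))
```

Because paragraphs, tables, and images arrive as separate elements, each one can be mapped to its counterpart in the target format without guessing at layout.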

Sample of structure extraction

PDF Support

  • PDF 1.3 to 1.7
  • PDFs that conform to ISO 32000-1
  • PDFs created with Antenna House products

Limitations

To ensure the product meets user requirements, we encourage all potential customers to first try a free trial of the software by contacting us at info@antennahouse.com.

  • Conversion of annotation data is not supported
  • Conversion of form field data is not supported
  • Only the cover page of the Adobe Acrobat 9 portfolio format is converted
  • Color gradient data is not supported
  • Embedded fonts with no mapping information cannot be converted into text
  • Only the main PDF file of the PDF 1.7 package format is converted

System Support

  • Windows Server 2012/2012 R2 (64bit)
  • Windows Server 2008 R2 (64bit)
  • Windows Server 2008 (32bit/64bit)
  • Windows 8.1 (32bit/64bit)
  • Windows 8 (32bit/64bit)
  • Windows 7 (32bit/64bit)
  • Linux (32bit/64bit), built with GCC 4.1
