
Accurately Ingesting PDFs into the Aisera Platform

Overview

The Aisera Platform can convert PDFs into HTML during parsing. This conversion allows for more accurate data extraction from PDFs, especially when the ingested content contains tables or images.

Enabling the Document Converter

To convert the documents during parsing, you must enable the Document Converter in the Data Source Configurations.

  1. Navigate to Settings > Data Sources in the Aisera Admin UI.

  2. Select an existing Data Source or configure a new Data Source.

  3. Select the Pencil icon in the top right of the Data Source Details screen to begin editing the Data Source.

  4. In the Edit Data Source window, select Ingestion Configuration.

  5. Enable the Document Converter configuration.

  6. Select OK to save the changes.

Once the Document Converter is enabled, documents are converted to HTML for better parsing and data extraction, and the following settings become available.

Accurate Table Parse

By default, tables are parsed in fast mode. Enabling Accurate mode provides more accurate table parsing and cell content extraction. This is useful when documents contain complex tables that need precise extraction.

If you need accurate parsing only for specific PDFs, enter the PDF names, without file extensions, as a comma-separated list in the PDF Names input box.
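When many PDFs need accurate parsing, the PDF Names value can be assembled with a short script instead of by hand. The following is a minimal sketch in Python, not part of the Aisera Platform: the function name pdf_names_field and the sample file names are hypothetical, and it assumes that a plain comma with no surrounding spaces is an acceptable separator and that no PDF name itself contains a comma.

```python
from pathlib import PurePath


def pdf_names_field(filenames):
    """Build a comma-separated list of PDF names without file extensions.

    Hypothetical helper: the separator (a bare comma) is an assumption,
    so verify the result against what the PDF Names input box accepts.
    """
    names = []
    for filename in filenames:
        path = PurePath(filename)
        # Skip anything that is not a PDF (case-insensitive check).
        if path.suffix.lower() != ".pdf":
            continue
        # path.stem is the file name without directory or extension.
        if path.stem not in names:
            names.append(path.stem)
    return ",".join(names)


# Sample file names are made up for illustration.
print(pdf_names_field(["Q3 Pricing.pdf", "reports/network-specs.PDF", "notes.txt"]))
# prints: Q3 Pricing,network-specs
```

The printed string can then be pasted into the PDF Names input box.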

Bypass Cache

Activating this setting ensures that PDFs are converted anew for each request, bypassing cached versions. This benefits frequently updated documents but may slow access, because every retrieval triggers a new conversion.

Force OCR

By default, Optical Character Recognition (OCR) is enabled for PDFs containing images. Enable this option to apply OCR to all documents during conversion. This feature benefits documents without embedded text, such as images or scanned files.
