Parser

The Settings > Configuration > Parser window lets you set parser parameters for your tenant. These settings apply to every bot you create in your Aisera tenant.

Tenant Configuration for Parser

Ignore Unsupported Documents

This feature is Enabled by default.

This feature determines how the Aisera Platform behaves when it ingests unsupported file types. When this setting is enabled, unsupported documents are skipped. When it is disabled, the platform attempts to parse all document types.

Supported formats:

  • PDF

  • HTML

  • DOC/DOCX

  • PPTX

  • MD

  • TXT

  • EML

  • MSG

  • CSV
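The skip behavior can be pictured with a minimal sketch. The function name, the mapping of each listed format to a file extension, and the case-insensitive comparison are assumptions made for illustration; they are not the platform's implementation.

```python
from pathlib import Path

# Extensions derived from the supported-formats list above.
# The format-to-extension mapping is an assumption for this sketch.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".html", ".doc", ".docx", ".pptx",
    ".md", ".txt", ".eml", ".msg", ".csv",
}


def select_documents(filenames, ignore_unsupported=True):
    """Return the files that would be handed to the parser (hypothetical helper)."""
    if not ignore_unsupported:
        # Setting disabled: the platform attempts to parse every file.
        return list(filenames)
    # Setting enabled: files with unsupported extensions are skipped.
    return [
        name for name in filenames
        if Path(name).suffix.lower() in SUPPORTED_EXTENSIONS
    ]


files = ["guide.pdf", "notes.TXT", "diagram.vsdx"]
print(select_documents(files))
print(select_documents(files, ignore_unsupported=False))
```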

Filter out non-ASCII characters

This feature is Off by default.

When enabled, any non-ASCII characters will be replaced with spaces. This feature is only applicable to documents using western character sets.
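As a rough illustration of the replacement described above, the following sketch swaps every character outside the 7-bit ASCII range for a single space. The function name is hypothetical, and the platform may handle edge cases differently.

```python
def filter_non_ascii(text: str) -> str:
    """Replace each non-ASCII character with one space (illustrative only)."""
    return "".join(ch if ord(ch) < 128 else " " for ch in text)


# Accented letters are blanked out; plain ASCII letters are untouched.
print(filter_non_ascii("Crème brûlée"))
```

Because every non-ASCII character becomes a space, text written entirely in a non-Latin script would be reduced to whitespace, which is why the setting only makes sense for western character sets.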

is Private?

This feature is Off by default.

Use Templates?

Detect sections using font properties

This feature is Off by default.

When enabled, the parser uses font size and weight to automatically detect section boundaries and assign hierarchy levels during HTML parsing. Content whose font size is at or above the 75th percentile of all fonts in the document is promoted to a section title, with bold text receiving additional weight. Larger fonts receive higher positions in the hierarchy: level 1 for the largest, level 2 for the next largest, and so on.

This is useful for documents that lack explicit structural markup and instead rely on visual font styling to indicate sections. It applies to standard HTML content and to simple text-based formats such as TXT, Markdown, EML, and MSG that are processed through HTMLParser. Converted documents that contain font cluster metadata, such as DOCX, use header-based section detection instead.
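The percentile rule can be sketched as follows. This is only an approximation under stated assumptions: the percentile is computed over the distinct font sizes using the nearest-rank method, the extra weight given to bold text is left out because its exact effect is not specified here, and the `Line` type and function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    font_size: float


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile; the platform's exact method is an assumption."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def detect_titles(lines, pct=75):
    """Return (text, level) pairs for lines promoted to section titles."""
    distinct_sizes = {line.font_size for line in lines}
    threshold = nearest_rank_percentile(distinct_sizes, pct)
    # Largest qualifying size becomes level 1, the next level 2, and so on.
    title_sizes = sorted((s for s in distinct_sizes if s >= threshold), reverse=True)
    level_of = {size: index + 1 for index, size in enumerate(title_sizes)}
    return [(l.text, level_of[l.font_size]) for l in lines if l.font_size >= threshold]


document = [
    Line("Install Guide", 24),
    Line("Requirements", 16),
    Line("You need a supported operating system.", 11),
    Line("Steps", 16),
    Line("Run the installer.", 11),
    Line("Footnote text", 9),
]
print(detect_titles(document))
```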

Add contents section

This feature is Off by default.

By default, when a document has two or more sections, the first section includes an inline list of other top-level section titles. This gives readers a brief overview of the document's structure within the first section itself.

When this feature is enabled, that overview is instead presented as its own standalone section at the beginning of the document. The section is titled with the document's title, or "Document Contents" if no title is available, and contains a bulleted list of the document's top-level section titles.

This applies to all supported document formats.
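A minimal sketch of the enabled behavior follows. The data shapes and function name are hypothetical, and it assumes the same "two or more sections" condition applies when the feature is on.

```python
def add_contents_section(doc_title, sections):
    """Prepend a standalone contents section (illustrative only).

    sections is a list of (title, body) tuples for the top-level sections.
    """
    if len(sections) < 2:
        # Assumption: a single-section document gets no overview.
        return list(sections)
    bullets = "\n".join(f"• {title}" for title, _ in sections)
    # Fall back to "Document Contents" when the document has no title.
    contents = (doc_title or "Document Contents", bullets)
    return [contents] + list(sections)


sections = [("Install", "Run the installer."), ("Configure", "Edit the settings.")]
for title, body in add_contents_section(None, sections):
    print(title, "->", body.replace("\n", " | "))
```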

Merge Sections with Similar Subjects

This feature is Enabled by default.

When enabled, the parser combines consecutive sections whose headings share a nearly identical prefix, where the prefix is the text before a colon or hyphen delimiter. For example, if a document contains two consecutive sections both titled "Troubleshooting:", they will be merged into one.

Only consecutive sections are compared. If two sections with similar headings are separated by an unrelated section, they will not be merged. Sections without headings are not eligible for merging.

When sections are merged, the first section's heading is preserved. The subsequent section's heading and content are added to the first section. No content is removed.

This applies to all supported document formats.
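The merge rules above can be sketched as follows. How "nearly identical" is measured is not specified, so the sketch assumes a `difflib.SequenceMatcher` ratio with a hypothetical cutoff of 0.9; the function names and data shapes are also illustrative.

```python
import re
from difflib import SequenceMatcher


def heading_prefix(heading):
    """Text before the first colon or hyphen, or None if there is no delimiter."""
    match = re.match(r"\s*(.+?)\s*[:\-]", heading or "")
    return match.group(1).lower() if match else None


def merge_similar(sections, min_ratio=0.9):
    """Merge consecutive (heading, body) sections with near-identical prefixes."""
    merged = []
    for heading, body in sections:
        if merged and heading:
            prev_heading, prev_body = merged[-1]
            a, b = heading_prefix(prev_heading), heading_prefix(heading)
            if a and b and SequenceMatcher(None, a, b).ratio() >= min_ratio:
                # Keep the first heading; append the later heading and content.
                merged[-1] = (prev_heading, f"{prev_body}\n{heading}\n{body}")
                continue
        merged.append((heading, body))
    return merged


sections = [
    ("Troubleshooting: Login", "Reset your password."),
    ("Troubleshooting: Network", "Check the cable."),
    ("Overview", "General information."),
    ("Troubleshooting: Printing", "Restart the spooler."),
]
for heading, body in merge_similar(sections):
    print(repr(heading), repr(body))
```

In this example the first two sections are merged, while the last one is left alone because an unrelated section separates it from the others.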

Use Predefined section titles

This field allows you to specify a list of words, sentences, or regular expressions that should be recognized as section titles. When a line of text matches a provided pattern, it becomes a section header. This is useful for documents that do not have defined header elements.

You can define the recognized phrases using a comma-separated list:

For more explicit definitions of recognized phrases, you may use JSON:

There is also a companion configuration, Ignore section titles, which accepts a comma-separated list of headings that should not be treated as section titles during parsing.

Show sections as images

This feature is Off by default.

When enabled, each page of a PDF document will become its own section, ignoring any section boundaries within the page. These sections are rendered and displayed as page images rather than as parsed text.

This feature is not compatible with RAG. Sections created in this mode are not processed through the standard content pipeline and will not be returned by AI search.

The primary use of this feature is to serve specific pages through Workflows or Hyperflows.

Ignore section titles

This field accepts a comma-separated list of words or phrases. When a line of text in a document exactly matches one of the provided values, that line will be excluded from the parsed output entirely. The matching is case-sensitive.

For example, setting this to Disclaimer, Copyright Notice will remove any standalone lines that read exactly "Disclaimer" or "Copyright Notice" from the parsed document. Lines that only partially match, such as "Disclaimer: This document is provided as-is", will not be affected.

This is the companion to Use predefined section titles. While that configuration promotes matching text to section headings, this configuration removes matching text from the output. If the same text appears in both configurations, it will be ignored rather than promoted.
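The interaction between the two configurations can be sketched as below. This is an approximation: it assumes predefined titles are treated as regular expressions that must match the whole line, and the function name and sample values are hypothetical. The ignore rule follows the exact, case-sensitive matching described above and takes precedence.

```python
import re


def classify_lines(lines, predefined_titles, ignored_titles):
    """Label each kept line as a 'title' or 'text' (illustrative only)."""
    ignored = set(ignored_titles)
    patterns = [re.compile(p) for p in predefined_titles]
    result = []
    for line in lines:
        if line in ignored:
            # Exact, case-sensitive match: the line is dropped entirely,
            # even if it also appears in the predefined titles.
            continue
        if any(p.fullmatch(line) for p in patterns):
            result.append(("title", line))
        else:
            result.append(("text", line))
    return result


# Comma-separated field values, split the way a user would enter them.
ignored = [v.strip() for v in "Disclaimer, Copyright Notice".split(",")]
predefined = ["Overview", r"Step \d+", "Disclaimer"]
lines = [
    "Overview",
    "Disclaimer",
    "Disclaimer: This document is provided as-is",
    "disclaimer",
    "Step 1",
    "Plug in the device.",
]
for kind, text in classify_lines(lines, predefined, ignored):
    print(kind, "|", text)
```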

Pdf Names

This field allows you to specify which PDF files should be parsed using Microsoft Form Recognizer. This is useful for scanned documents, documents containing complex layouts, or documents containing forms. It also provides more accurate extraction of information from tables.

This requires that Microsoft Form Recognizer is enabled.

You can define the documents to be parsed with Microsoft Form Recognizer using a comma-separated list of file names without the file extension. You may also use the following values:

  • "all" - All PDFs will use Form Recognizer. This can be expensive!

  • "ocr" - PDFs containing only images will use Form Recognizer.

The following is an example of an input for this field:

This field is also used to define what documents will be affected by enabling the Document Converter configuration.
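How a Pdf Names value might be interpreted can be sketched as follows. The function name, the file names, and the idea that the special values can be mixed with file names are assumptions for illustration only.

```python
from pathlib import Path


def uses_form_recognizer(filename, pdf_names, image_only=False):
    """Decide whether a PDF is routed to Form Recognizer (illustrative only)."""
    if Path(filename).suffix.lower() != ".pdf":
        return False
    values = {v.strip() for v in pdf_names.split(",") if v.strip()}
    if "all" in values:
        return True  # every PDF qualifies, which can be expensive
    if "ocr" in values and image_only:
        return True  # only PDFs that contain nothing but images
    # Otherwise compare the file name without its extension.
    return Path(filename).stem in values


print(uses_form_recognizer("invoice.pdf", "invoice, contract"))
print(uses_form_recognizer("manual.pdf", "invoice, contract"))
print(uses_form_recognizer("scan.pdf", "ocr", image_only=True))
```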

Copy Images During Parsing

This feature is Enabled by default.

When enabled, the parser will copy images from the source document to the Aisera server storage. This enables the Aisera platform to load and serve images faster, and reduces dependency on external image sources.

Renamed HTML Tags

This field allows you to define HTML tags and their respective transformations before parsing to change how content is interpreted. This is useful for mapping custom HTML tags to standard semantic tags.

This field accepts input in the form of tag=<original>,replacetag=<replacement>.

For multiple replacements, separate the replacements with semicolons.

You may also apply custom classes to tags using class=class-name.

The following is an example of a valid input for this field:
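The exact grammar is not spelled out beyond the form described above, so the following is only a sketch of how such a value could be parsed and applied. The rule string, the tag names `note` and `heading`, the helper functions, and the assumption that the class is added to the replacement tag are all hypothetical. The regex-based rewrite is a simplification and not a full HTML parser.

```python
import re


def parse_rename_rules(value):
    """Split 'tag=a,replacetag=b,class=c;tag=...' into a list of dicts."""
    rules = []
    for chunk in value.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        rules.append(dict(part.strip().split("=", 1) for part in chunk.split(",")))
    return rules


def apply_rules(html, rules):
    """Rename tags before parsing; a simplification, not a full HTML parser."""
    for rule in rules:
        old, new = rule["tag"], rule["replacetag"]
        attrs = f' class="{rule["class"]}"' if "class" in rule else ""
        # Opening tags, keeping any attributes that were already present.
        html = re.sub(
            rf"<{re.escape(old)}(\s[^>]*)?>",
            lambda m: f"<{new}{attrs}{m.group(1) or ''}>",
            html,
        )
        # Closing tags.
        html = re.sub(rf"</{re.escape(old)}\s*>", f"</{new}>", html)
    return html


rules = parse_rename_rules("tag=note,replacetag=div,class=callout;tag=heading,replacetag=h2")
print(apply_rules("<heading>Setup</heading><note>Back up first.</note>", rules))
```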

Microsoft Form Recognizer

This feature is Off by default.

This enables the use of Microsoft Form Recognizer, which provides advanced Optical Character Recognition (OCR), table extraction, form field detection, and layout analysis. This is useful for complex PDFs, scanned documents, forms, documents containing complicated layouts, or documents requiring high-accuracy parsing.

Documents analyzed using Microsoft Form Recognizer are sent to Microsoft cloud services for processing.

Use the PDF Names field to specify which PDFs to process.

Enable images in Form Recognizer

Document Converter

This feature is Off by default.

This feature enables higher-quality PDF conversions using a more advanced PDF conversion service. The service automatically splits large PDFs into smaller parts and supports caching, higher-accuracy table parsing, and OCR.

Related features are:

Accurate Table Parse

This feature is Off by default.

This feature requires Document Converter to be enabled.

This enables more accurate table parsing and cell content extraction. This is useful when documents contain complex tables that need precise extraction.

If this is enabled and PDF Names is empty, accurate table parsing is applied to all PDFs.

If this is enabled and PDF Names is specified, only the specified PDFs are processed using accurate table parsing.

Bypass Cache

This feature is Off by default.

This feature requires Document Converter to be enabled.

Activating this setting ensures PDFs undergo new conversions for each request, bypassing cached versions. This benefits frequently updated documents but may slow access due to the conversion process for every retrieval.

Force OCR

This feature is Off by default.

This feature requires Document Converter to be enabled.

Enable this option to apply Optical Character Recognition (OCR) to all documents during conversion. This feature benefits documents without embedded text, such as images or scanned files. Alternatively, you can force OCR on specific files only by listing them in the PDF Names field.

When this is disabled, the system instead checks whether a PDF is image-only. If the PDF contains only images, OCR is applied; otherwise, OCR is skipped.
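The decision described above can be summarized in a short sketch. The function and parameter names are hypothetical, and how the platform detects an image-only PDF is not covered here.

```python
def should_apply_ocr(force_ocr, image_only, listed_in_pdf_names=False):
    """Decide whether OCR runs for one PDF (illustrative only)."""
    if force_ocr or listed_in_pdf_names:
        # Forced globally, or forced for this file through PDF Names.
        return True
    # Otherwise OCR runs only when the PDF contains nothing but images.
    return image_only


print(should_apply_ocr(force_ocr=False, image_only=True))
print(should_apply_ocr(force_ocr=False, image_only=False))
print(should_apply_ocr(force_ocr=True, image_only=False))
```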

HTML Parameters

PDF Parameters

Docx Parameters

This option lets you control how Microsoft Word styles are converted into HTML elements during document parsing. This is useful for ensuring that your document's structure remains consistent and that data is parsed properly.

The field accepts input in the following format:

  • Word Style Name: This is the name of the style applied to the text in the Word document.

  • HTML Tag: The HTML tag you would like the text to be converted to.

  • Mode: If this is fresh, a new HTML element will be created for each paragraph of text with this style. Otherwise, if two consecutive paragraphs have the same style applied, they will be appended to each other in the same HTML element.
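The difference between the two modes can be illustrated with a small sketch. The dictionary below is not the field's input syntax; it is a hypothetical in-memory representation, and the mode name "append", the default mapping for unlisted styles, and joining appended text with a space are assumptions.

```python
def convert_paragraphs(paragraphs, style_map):
    """Convert (style, text) paragraphs to HTML (illustrative only).

    style_map maps a Word style name to an (html_tag, mode) pair.
    """
    elements = []  # each entry is [tag, text]
    prev_style = None
    for style, text in paragraphs:
        tag, mode = style_map.get(style, ("p", "fresh"))
        if mode != "fresh" and style == prev_style and elements:
            # Same style as the previous paragraph: extend the open element.
            elements[-1][1] += " " + text
        else:
            # "fresh" mode, or a change of style: start a new element.
            elements.append([tag, text])
        prev_style = style
    return "".join(f"<{tag}>{text}</{tag}>" for tag, text in elements)


style_map = {
    "Title": ("h1", "fresh"),
    "Subtitle": ("h2", "fresh"),
    "Normal": ("p", "append"),
}
paragraphs = [
    ("Title", "User Guide"),
    ("Subtitle", "Setup"),
    ("Normal", "Unpack the device."),
    ("Normal", "Plug it in."),
    ("Subtitle", "Usage"),
]
print(convert_paragraphs(paragraphs, style_map))
```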

For example, to convert every Title to a new H1, every Subtitle to a new H2, and append each consecutive paragraph styled as Normal, you would input the following:

Signify FAQ

This feature is Enabled by default.

When enabled, content ending with "?" will be detected as questions for the FAQ format. These questions will be separated into sections and the answers will be the content of each section.

The Parser will compare detected questions with existing sections to avoid the creation of duplicate sections.
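The question detection can be sketched roughly as follows. The function name is hypothetical, the duplicate comparison against existing sections is left out because its details are not described here, and the handling of text before the first question is an assumption.

```python
def split_faq(lines):
    """Group lines into (question, answer) sections (illustrative only)."""
    sections = []
    current = None
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if text.endswith("?"):
            # A line ending with "?" starts a new FAQ section.
            current = {"question": text, "answer": []}
            sections.append(current)
        elif current is not None:
            # Everything up to the next question is the answer.
            current["answer"].append(text)
        # Assumption: text before the first question is ignored in this sketch.
    return [(s["question"], " ".join(s["answer"])) for s in sections]


lines = [
    "How do I reset my password?",
    "Open Settings and choose Reset.",
    "",
    "Is there a mobile app?",
    "Yes, for iOS and Android.",
    "It is free.",
]
for question, answer in split_faq(lines):
    print(question, "->", answer)
```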

Skip KB Similarity matching

This feature is Off by default.

During parsing, the Aisera Platform will detect documents and sections containing similar content. When this feature is enabled, the similarity detection will be bypassed. This feature is useful if you want to parse all content regardless of similarity. It can also be used to speed up the process of parsing at the expense of duplicate detection.

See KB Similarity Score Threshold for more information on configuring the KB Similarity Matching service.

KB Similarity Score Threshold

This requires Skip KB Similarity Matching to be Off.

This field sets the similarity score for the KB Similarity matching service. Documents with a score above this are considered duplicates. Input percentages as decimals; for example, 50% is 0.5. Lowering this value will make duplicate matching more aggressive, and may result in more false positive matches.
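The effect of lowering the threshold can be seen in a small sketch. The scores and document names below are invented for illustration, and the function is not the platform's similarity service.

```python
def find_duplicates(scores, threshold):
    """Return the documents whose similarity score exceeds the threshold."""
    return [name for name, score in scores.items() if score > threshold]


# Hypothetical similarity scores, expressed as decimals (0.85 means 85%).
scores = {"faq-v1": 0.62, "faq-v2": 0.85, "faq-copy": 0.97}

print(find_duplicates(scores, 0.9))  # stricter: only near-identical content
print(find_duplicates(scores, 0.6))  # more aggressive: more matches flagged
```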

Image Removal Contour Threshold

This field controls the removal of background and low-content images. Removal is based on Canny edge detection, and the value corresponds to the number of contours detected in the image. Images with fewer contours than the specified threshold will be deleted. A lower threshold results in fewer images being removed, whereas a higher threshold results in more images being removed.

Set this field to 0 to disable the removal of background and low content images.

While there is no upper bound, a practical range for this field is 0 to 1000.
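The threshold rule can be sketched as a simple decision on a contour count that has already been computed. The edge detection step itself is not shown, and the function name and the sample counts are hypothetical.

```python
def keep_image(contour_count, threshold):
    """Return True if the image survives the filter (illustrative only)."""
    if threshold == 0:
        return True  # 0 disables removal entirely
    # Images with fewer contours than the threshold are deleted.
    return contour_count >= threshold


# Hypothetical contour counts for three images.
contours = {"spacer.png": 3, "logo.png": 40, "diagram.png": 600}

for threshold in (0, 10, 100):
    kept = [name for name, count in contours.items() if keep_image(count, threshold)]
    print(threshold, kept)
```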

Use origin URL for files download

This feature is Off by default.

When this feature is enabled, the original source URL will be used for file downloads instead of the S3 storage. Files will be displayed and downloaded from their original location. Use this feature when users need to access files directly from their original source, especially if the URLs have specific access controls or are more convenient for users.

Append parent title to filename

This feature is Off by default.

This feature adds the name of the source knowledge base to the front of the document filename, producing filenames of the form <Parent Knowledge Base>-<Filename>. This is useful for organizing filenames by their source. For instance, if a document is titled "User Guide" under the "Product Documentation" knowledge base, the filename will be "Product Documentation-User Guide".

Enable Translation Service

Enable PowerPoint Photo Mode

This feature is Off by default.

When enabled, the parser converts each slide of a PowerPoint presentation to an image and embeds it in HTML, preserving the exact visual appearance of the slides. This is useful for presentations that rely heavily on visuals, or where visual fidelity is more important than searchable text.
