Parser
The Settings > Configuration > Parser window allows you to set parameters for your tenant. These are settings that apply to any bot you create in your Aisera tenant.

Ignore Unsupported Documents
This feature is Enabled by default.
This feature determines the Aisera Platform's behavior when ingesting unsupported file types. When this feature is enabled, unsupported documents are skipped. When it is disabled, the platform attempts to parse all document types.
Supported formats:
PDF
HTML
DOC/DOCX
PPTX
MD
TXT
EML
MSG
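The skip-or-parse decision described above can be sketched in Python as follows. This is only an illustration: the function name is hypothetical, and mapping the supported formats to file extensions is an assumption (the platform may identify file types some other way).

```python
from pathlib import Path

# Extensions inferred from the supported-formats list (an assumption:
# the platform may detect file types differently, e.g. by content type).
SUPPORTED_EXTENSIONS = {
    ".pdf", ".html", ".doc", ".docx", ".pptx", ".md", ".txt", ".eml", ".msg",
}


def should_attempt_parse(filename: str, ignore_unsupported: bool = True) -> bool:
    """Return True if the file would be handed to the parser."""
    if not ignore_unsupported:
        # Setting disabled: every document type is attempted.
        return True
    # Setting enabled: only supported types are parsed, the rest are skipped.
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```

With the default setting, `should_attempt_parse("sheet.xlsx")` returns `False`, while passing `ignore_unsupported=False` makes the same call return `True`.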
Filter out non-ASCII characters
This feature is Off by default.
When enabled, any non-ASCII characters will be replaced with spaces. This feature is only applicable to documents using western character sets.
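A minimal Python sketch of this replacement is shown below. The function name is hypothetical, and the sketch assumes that "non-ASCII" means any character with a code point above 127.

```python
def filter_non_ascii(text: str) -> str:
    """Replace every character outside the ASCII range with a single space."""
    # Assumption: a character is non-ASCII when its code point is 128 or higher.
    return "".join(ch if ord(ch) < 128 else " " for ch in text)
```

Because each removed character becomes a space, the text keeps its original length, which also shows why the setting is a poor fit for documents written mostly in non-western scripts: nearly all of their content would be blanked out.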
is Private?
This feature is Off by default.
Use Templates?
Detect sections using font properties
This feature is Off by default.
When enabled, the parser will use font size and weight to automatically detect and create document sections. This is useful for documents, such as PDFs, that lack explicit section markers (for example, HTML headers) and instead use font styling to indicate sections.
Add contents section
This feature is Off by default.
If this feature is enabled, a document containing multiple sections will have a table of contents added to the beginning of the document.
Merge Sections with Similar Subjects
This feature is Enabled by default.
When enabled, the parser will automatically detect sections with similar headings and combine them. This helps to reduce redundancy in the knowledge base.
Use Predefined section titles
This field allows you to specify a list of words, sentences, or regular expressions that should be recognized as section titles. When a line of text matches a provided pattern it will become a section header. This is useful for documents that do not have defined header elements.
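As a rough illustration of the matching described above, the following Python sketch treats each entry as a regular expression and flags the lines that match. The function name, the sample patterns in the usage note, and the choice of full-line, case-sensitive matching are assumptions made for illustration; the parser's actual matching rules may differ.

```python
import re


def find_section_titles(lines, patterns):
    """Return the lines that fully match at least one of the given patterns."""
    compiled = [re.compile(p) for p in patterns]
    return [
        line
        for line in lines
        # Assumption: a line counts as a title only if the whole line matches.
        if any(rx.fullmatch(line.strip()) for rx in compiled)
    ]
```

For example, with the hypothetical patterns `["Overview", r"Chapter \d+"]`, the lines `Overview` and `Chapter 2` would become section headers, while `See Chapter 2 for details.` would remain body text.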
You can define the recognized phrases using a comma-separated list:
For more explicit definitions of recognized phrases, you may use JSON:
Show sections as images
This feature is Off by default.
When this feature is enabled, each page of the parsed document will become a separate section, ignoring any section boundaries within the pages. The sections will be displayed as images rather than as text. This feature is not compatible with RAG. The primary use of this feature is to serve specific pages through Workflows or Hyperflows.
Ignore section titles
This field allows you to specify a list of words and sentences that should not be treated as section titles. The phrases input here will be ignored even if they match the section title patterns.
You can define the set of ignored phrases using a comma-separated list:
Pdf Names
This field allows you to specify which PDF files should be parsed using Microsoft Form Recognizer. This is useful for scanned documents, documents containing complex layouts, or documents containing forms. It also provides more accurate extraction of information from tables.
This requires that Microsoft Form Recognizer is enabled.
Using Microsoft Form Recognizer may incur extra charges.
You can define the documents to be parsed using Microsoft Form Recognizer using a comma-separated list of file names without the file extension. You may also use the following values:
"all" - All PDFs will use Form Recognizer. This can be expensive!
"ocr" - PDFs containing only images will use Form Recognizer.
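The selection rules above can be sketched in Python as follows. The function name and the `image_only` flag are hypothetical, and the sketch assumes that the special values are entered without quotation marks, are case-sensitive, and may appear alongside file names in the same list; none of this is confirmed by the product.

```python
from pathlib import Path


def uses_form_recognizer(filename: str, pdf_names: str, image_only: bool) -> bool:
    """Decide whether a PDF is routed to Form Recognizer for a given field value."""
    entries = {entry.strip() for entry in pdf_names.split(",") if entry.strip()}
    if "all" in entries:
        # Every PDF is processed (potentially expensive).
        return True
    if "ocr" in entries and image_only:
        # Only PDFs that contain nothing but images are processed.
        return True
    # Otherwise the file must be listed by name, without its extension.
    return Path(filename).stem in entries
```

Under these assumptions, a field value of `invoice, contract` routes `invoice.pdf` to Form Recognizer but not `notes.pdf`.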
The following is an example of an input for this field:
This field is also used to define what documents will be affected by enabling the Document Converter configuration.
Copy Images During Parsing
This feature is Enabled by default.
When enabled, the parser will copy images from the source document to the Aisera server storage. This enables the Aisera platform to load and serve images faster, and reduces dependency on external image sources.
Renamed HTML Tags
This field allows you to define HTML tags and their respective transformations before parsing to change how content is interpreted. This is useful for mapping custom HTML tags to standard semantic tags.
This field accepts input in the form of tag=<original>,replacetag=<replacement>.
For multiple replacements, separate the replacements with semicolons.
You may also apply custom classes to tags using class=class-name.
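To make the structure of this input easier to picture, here is a hedged Python sketch that splits a value into individual replacement rules. The function name and the sample tag names are hypothetical, and the sketch assumes that class=class-name is written as one more comma-separated pair inside a rule, which the text above does not spell out.

```python
def parse_tag_rules(value: str) -> list:
    """Split a field value into one dict of key=value pairs per replacement rule."""
    rules = []
    for rule in value.split(";"):  # semicolons separate replacements
        rule = rule.strip()
        if not rule:
            continue
        # Assumption: each rule is a comma-separated list of key=value pairs.
        pairs = dict(part.strip().split("=", 1) for part in rule.split(","))
        rules.append(pairs)
    return rules
```

Under these assumptions, the hypothetical value `tag=summary,replacetag=h2;tag=note,replacetag=div,class=callout` yields two rules: one mapping `summary` to `h2`, and one mapping `note` to `div` with the class `callout`.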
The following is an example of a valid input for this field:
Microsoft Form Recognizer
This feature is Off by default.
This enables the use of Microsoft Form Recognizer. This provides advanced Optical Character Recognition (OCR), table extraction, form field detection, and layout analysis. This is useful for complex PDFs, scanned documents, forms, documents containing complicated layouts, or documents requiring high-accuracy parsing.
Documents analyzed using Microsoft Form Recognizer are sent to Microsoft cloud services for processing.
Using Microsoft Form Recognizer will incur extra costs.
Use the PDF Names field to specify which PDFs to process.
Enable images in Form Recognizer
Document Converter
This feature is Off by default.
This feature enables higher quality PDF conversions using a more advanced PDF conversion service. The service automatically splits large PDFs into smaller parts and supports caching, higher-accuracy table parsing, and OCR.
Related features are:
Accurate Table Parse
This feature is Off by default.
This feature requires Document Converter to be enabled.
This enables more accurate table parsing and cell content extraction. This is useful for when documents contain complex tables that need precise extraction.
If this is enabled and PDF Names is empty, this will be applied to all PDFs.
If this is enabled and PDF Names is specified, only the specified PDFs will be processed using the accurate table parse.
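The two rules above amount to a small decision function, sketched here in Python. The function and parameter names are hypothetical; the sketch assumes Document Converter is already enabled and that PDF Names has been split into a list of file names without extensions.

```python
from pathlib import Path


def uses_accurate_table_parse(filename: str, enabled: bool, pdf_names: list) -> bool:
    """Return True if a PDF would be processed with Accurate Table Parse."""
    if not enabled:
        return False
    if not pdf_names:
        # Empty PDF Names: the setting applies to all PDFs.
        return True
    # Non-empty PDF Names: only the listed PDFs are processed.
    return Path(filename).stem in pdf_names
```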
Bypass Cache
This feature is Off by default.
This feature requires Document Converter to be enabled.
Activating this setting ensures PDFs undergo new conversions for each request, bypassing cached versions. This benefits frequently updated documents but may slow access due to the conversion process for every retrieval.
Force OCR
This feature is Off by default.
This feature requires Document Converter to be enabled.
Enable this option to apply Optical Character Recognition (OCR) to all documents during conversion. This feature benefits documents without embedded text, like images or scanned files. Alternatively, you can use the PDF Names field to list the specific files on which OCR should be forced.
When this is disabled the system will instead check if a PDF is image only. If the PDF contains only images, OCR will be applied, otherwise OCR will be skipped.
HTML Parameters
PDF Parameters
Docx Parameters
This option lets you control how Microsoft Word styles are converted into HTML elements during document parsing. This is useful for ensuring that your document's structure remains consistent and that data is parsed properly.
The field accepts input in the following format:
Word Style Name: This is the name of the style applied to the text in the Word document.
HTML Tag: The HTML tag you would like the text to be converted to.
Mode: If this is fresh, a new HTML element will be created for each paragraph of text with this style. Otherwise, if two consecutive paragraphs have the same style applied, they will be appended to each other in the same HTML element.
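The effect of Mode can be sketched in Python as follows. This is not the field's input syntax: the mapping is expressed as a Python dictionary purely for illustration, the function name is hypothetical, the mode name `append` is a placeholder for any value other than `fresh`, and joining appended paragraphs with a space is an assumption.

```python
from html import escape


def render_paragraphs(paragraphs, style_map):
    """Convert (style_name, text) pairs to HTML strings.

    style_map maps a Word style name to an (html_tag, mode) pair.
    """
    elements = []  # each item is [style_name, html_tag, text]
    for style, text in paragraphs:
        tag, mode = style_map[style]
        if mode != "fresh" and elements and elements[-1][0] == style:
            # Same style as the previous paragraph: append to the open element.
            elements[-1][2] += " " + text
        else:
            # "fresh" mode, or a change of style: start a new element.
            elements.append([style, tag, text])
    return [f"<{tag}>{escape(text)}</{tag}>" for _, tag, text in elements]


# Hypothetical mapping: Title and Subtitle always start a new element,
# while consecutive Normal paragraphs are merged into one <p>.
STYLE_MAP = {
    "Title": ("h1", "fresh"),
    "Subtitle": ("h2", "fresh"),
    "Normal": ("p", "append"),
}
```

With this mapping, two consecutive Normal paragraphs end up in a single `<p>` element, whereas two consecutive Title paragraphs produce two separate `<h1>` elements.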
For example, to convert every Title to a new H1, every Subtitle to a new H2, and append each consecutive paragraph styled as Normal, you would input the following:
Signify FAQ
This feature is Enabled by default.
When enabled, content ending with "?" will be detected as questions for the FAQ format. These questions will be separated into sections and the answers will be the content of each section.
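A simplified Python sketch of this question-and-answer split is shown below. The function name is hypothetical, and the sketch assumes that detection works line by line and that any text before the first question is discarded; the parser's real behavior may be more nuanced.

```python
def split_faq(lines):
    """Group lines into (question, answer) pairs.

    A line ending in "?" starts a new section; the lines that follow it
    become that section's answer.
    """
    sections = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.endswith("?"):
            sections.append([line, []])
        elif sections:
            # Assumption: text before the first question is dropped.
            sections[-1][1].append(line)
    return [(question, " ".join(answer)) for question, answer in sections]
```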
Skip KB Similarity matching
This feature is Off by default.
During parsing, the Aisera Platform will detect documents and sections containing similar content. When this feature is enabled, the similarity detection will be bypassed. This feature is useful if you want to parse all content regardless of similarity. It can also be used to speed up the process of parsing at the expense of duplicate detection.
See KB Similarity Score Threshold for more information on configuring the KB Similarity Matching service.
KB Similarity Score Threshold
This requires Skip KB Similarity Matching to be Off.
This field sets the similarity score for the KB Similarity matching service. Documents with a score above this are considered duplicates. Input percentages as decimals; for example, 50% is 0.5. Lowering this value will make duplicate matching more aggressive, and may result in more false positive matches.
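The following Python sketch illustrates how the threshold affects which documents are flagged. The function names, document names, and scores are hypothetical; how the platform actually computes similarity scores is not covered here.

```python
def percent_to_threshold(percent: float) -> float:
    """Convert a percentage such as 50 to the decimal form the field expects."""
    return percent / 100


def find_duplicates(scores: dict, threshold: float) -> list:
    """Return the names of documents whose similarity score is above the threshold."""
    return [name for name, score in scores.items() if score > threshold]


# Hypothetical similarity scores for three documents.
scores = {"doc-a": 0.45, "doc-b": 0.72, "doc-c": 0.91}
```

With a threshold of 0.8, only `doc-c` is treated as a duplicate. Lowering the threshold to 0.5 also flags `doc-b`, which shows how a lower value makes matching more aggressive.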
Image Removal Contour Threshold
This field controls the removal of background and low content images. This is done using Canny Edge detection, and the value correlates to the number of contours detected in the image. Images with fewer contours than the specified threshold will be deleted. A lower threshold results in fewer images being removed, whereas a higher threshold results in more images being removed.
Set this field to 0 to disable the removal of background and low content images.
While there is no upper bound, a practical range for this field is 0 to 1000.
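The threshold logic can be sketched in Python as shown below. The sketch assumes the contour count for each image has already been computed elsewhere (for example, by an edge detector); the function name, file names, and counts are hypothetical.

```python
def images_to_remove(contour_counts: dict, threshold: int) -> list:
    """Return the images that would be deleted for a given threshold."""
    if threshold == 0:
        # A threshold of 0 disables image removal entirely.
        return []
    # Images with fewer contours than the threshold are removed.
    return [name for name, count in contour_counts.items() if count < threshold]


# Hypothetical contour counts: a plain background banner and a detailed diagram.
counts = {"banner.png": 3, "diagram.png": 250}
```

With a threshold of 10, only `banner.png` is removed. Raising the threshold to 500 removes both images, which illustrates why a higher value results in more images being deleted.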
Use origin URL for files download
This feature is Off by default.
When this feature is enabled, the original source URL will be used for file downloads instead of the S3 storage. Files will be displayed and downloaded from their original location. Use this feature when users need to access files directly from their original source, especially if the URLs have specific access controls or are more convenient for users.
Append parent title to filename
This feature is Off by default.
This feature appends the name of the source knowledge base to the document filename. This results in filenames with the form of <Parent Knowledge Base>-<Filename>. This is useful for organizing filenames by their source. For instance, if a document is titled "User Guide" under the "Product Documentation" knowledge base, the filename will be "Product Documentation-User Guide".
Enable Translation Service
Enable PowerPoint Photo Mode
This feature is Off by default.
When enabled, the parser will parse PowerPoint presentations as images. Each slide will be converted to an image and embedded in HTML. This will preserve the exact visual appearance of slides as images. This is useful for presentations using many visuals, or presentations where visual fidelity is more important than searchable text.
