Parser
Controls how the Aisera platform ingests, parses, and processes documents from connected data sources.
The Settings > Configuration > Parser section sets tenant-level defaults that apply across all bots and data sources. You can override these settings at the data source level: if a data source requires specific parsing behavior, its own configuration takes priority over the values set here.
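The override precedence described above can be pictured as a simple merge. This is a minimal sketch only: the setting keys and the function name effective_settings are hypothetical, since the platform's actual configuration schema is not shown here.

```python
# Hypothetical setting keys for illustration; not the platform's real schema.
TENANT_DEFAULTS = {
    "ignore_unsupported_documents": True,
    "filter_non_ascii": False,
    "kb_similarity_score_threshold": 0.95,
}


def effective_settings(tenant_defaults, data_source_overrides):
    """Return the settings a data source would use.

    Values set on the data source take priority; anything it does not
    set falls back to the tenant-level default.
    """
    merged = dict(tenant_defaults)
    merged.update(data_source_overrides)
    return merged


settings = effective_settings(TENANT_DEFAULTS, {"filter_non_ascii": True})
```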
General behavior
Controls how the parser handles documents during ingestion, including unsupported format handling, character filtering, privacy tagging, and knowledge base similarity detection.
Ignore Unsupported Documents
Type
Checkbox
Default
Enabled
When enabled, the parser skips documents in unsupported formats during ingestion without extracting or indexing any content. When disabled, the parser extracts content from unsupported documents as unformatted plain text for indexing. Disable this setting when you need maximum coverage and can tolerate lower-quality plain-text extraction for unsupported formats.
Natively supported formats are unaffected by this setting:
PDF
HTML
Word
PPTX
TXT
MD
EML
MSG
Filter out non-ASCII characters
Type
Checkbox
Default
Disabled
When enabled, the parser replaces any character outside the standard ASCII range of 0–127, including accented letters, typographic quotes, emoji, and non-Latin script characters, with a space before parsing and indexing the text. When disabled, the parser preserves all characters as they appear in the source content. Enable this only when knowledge sources contain plain ASCII text, such as unaccented English, and stray non-ASCII characters are causing indexing or search quality issues.
Do not enable this setting if any knowledge content includes non-Latin languages such as Japanese, Arabic, Chinese, or Korean. The parser replaces every such character with a space, which effectively erases that text. This setting applies only to inline text content in knowledge source messages, not to file attachments such as PDFs, Word files, or crawled HTML.
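To see why this setting is destructive for non-English content, the filter can be approximated as follows. This is a sketch under the assumption that each non-ASCII character becomes exactly one space; the function name filter_non_ascii is illustrative, not a platform API.

```python
def filter_non_ascii(text: str) -> str:
    """Replace every character outside the ASCII range 0-127 with a space."""
    return "".join(ch if ord(ch) < 128 else " " for ch in text)


accented = filter_non_ascii("café")       # the accented letter is lost
japanese = filter_non_ascii("日本語 FAQ")  # the Japanese text is erased
```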
is Private?
Type
Checkbox
Default
Disabled
When enabled, Aisera tags every article surfaced during agent search as private. In the Agent Assist widget, private articles display a lock icon with a Private label next to the article title, and the widget separates results into distinct lists: private articles appear in the internal content section while public articles appear in the external content section. When disabled, Aisera treats articles as public and displays them in standard results without any lock indicator. Enable this when your knowledge base contains internal content intended only for agent reference, such as runbooks, escalation guides, or agent-only procedures.
This setting is a visual and organizational marker only; it does not restrict access to content or exclude it from search results. You can configure connectors to apply different actions depending on whether an agent applies a private or public article, enabling workflows that distinguish between internal-use and external-use content.
Use Templates?
Type
Checkbox
Default
Disabled
When enabled, the parser first attempts to match each incoming document against templates you have configured. If the parser finds a matching template, it parses the document according to that template's defined sections. If no template matches or the template produces no content, parsing falls back automatically to standard HTML parsing. Enable this when your data source contains structured documents with consistent layouts, such as policy documents, HR articles, or product specifications, and you have templates configured to match them.
Configure your templates before enabling this setting. Enabling Use Templates? without any configured templates has no effect; the parser falls back to standard HTML parsing for all documents.
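The fallback order described above can be sketched as follows. The callables here are hypothetical placeholders for template matching and HTML parsing, and the sketch assumes that only the first matching template is tried.

```python
def parse_document(html, templates, parse_html):
    """Try configured templates first, then fall back to HTML parsing.

    templates: list of (matches, parse_with_template) callables (hypothetical).
    parse_html: callable standing in for standard HTML parsing.
    """
    for matches, parse_with_template in templates:
        if matches(html):
            sections = parse_with_template(html)
            if sections:
                return sections
            break  # assumption: a matched template with no content ends the search
    return parse_html(html)


def fallback(html):
    return ["html fallback"]


policy_template = (lambda h: "policy" in h, lambda h: ["Policy section"])
empty_template = (lambda h: True, lambda h: [])
```

With no templates configured, every document takes the fallback path, which is why enabling the setting alone has no effect.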
Signify FAQ
Type
Checkbox
Default
Enabled
When enabled, the parser detects question-formatted text within parsed documents and promotes each qualifying question into a standalone section with the question as the section heading. If an answer follows the question, it becomes that section's content. Content qualifies as a question when the full text ends with ? and every sentence begins with a question word. The parser also compares detected questions with existing sections to avoid creating duplicates. Disable this when documents contain interrogative language that should not be treated as FAQ structure, such as troubleshooting guides or policy documents where questions are rhetorical or part of flowing prose.
In practice, the setting is designed for single-sentence questions. Multi-sentence content rarely qualifies, because any sentence that does not open with a question word disqualifies the entire block.
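The qualification rule can be illustrated with a short sketch. The list of question words and the sentence-splitting regular expression are assumptions made for this example; the parser's actual word list and tokenization are not documented here.

```python
import re

# Assumed question words; the parser's real list is not documented here.
QUESTION_WORDS = {
    "what", "why", "how", "when", "where", "who", "which",
    "can", "do", "does", "is", "are", "should",
}


def is_faq_question(text: str) -> bool:
    """Return True when the text ends with '?' and every sentence
    opens with a question word."""
    text = text.strip()
    if not text.endswith("?"):
        return False
    # Naive sentence split on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    for sentence in sentences:
        words = re.findall(r"[A-Za-z']+", sentence)
        if not words or words[0].lower() not in QUESTION_WORDS:
            return False
    return True
```

Under these assumptions, "My laptop is slow. How do I fix it?" does not qualify, because its first sentence does not open with a question word.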
Skip KB Similarity Matching
Type
Checkbox
Default
Disabled
When enabled, the parser skips the knowledge base similarity service during document ingestion and uses only exact hash matches to track section changes. When disabled, the parser sends sections that do not match exactly to the KB similarity service, which uses a configurable score threshold to identify near-duplicate or lightly edited content, such as reformatted paragraphs or minor edits, across re-ingestions. Enable this to reduce ingestion processing time when near-duplicate detection is not needed, such as when documents are fully replaced on each ingestion rather than incrementally updated.
See also: KB Similarity Score Threshold
KB Similarity Score Threshold
Type
Decimal
Default
0.95
Sets the minimum similarity score required for the KB similarity service to treat two document sections as a match during re-ingestion. Accepted values range from 0.9 to 1.0. Higher values require closer textual similarity before the parser declares a match; lower values match sections with more significant differences. Increase the threshold when the parser incorrectly identifies distinct sections as updates to existing content. Decrease it when the parser does not recognize lightly edited sections as updates and instead creates duplicate entries.
See also: Skip KB Similarity Matching
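The interaction between exact hash matching, Skip KB Similarity Matching, and the score threshold can be sketched as a two-stage lookup. This is an illustration only: difflib.SequenceMatcher stands in for the KB similarity service, whose scoring method is not described here, and the function names are hypothetical.

```python
import hashlib
from difflib import SequenceMatcher


def section_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def match_section(new_text, existing, threshold=0.95, skip_similarity=False):
    """Return the index of the existing section that new_text updates, or None."""
    if not 0.9 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.9 and 1.0")
    # Stage 1: exact hash match, always performed.
    new_hash = section_hash(new_text)
    for index, old_text in enumerate(existing):
        if section_hash(old_text) == new_hash:
            return index
    if skip_similarity:
        return None
    # Stage 2: near-duplicate match (stand-in scorer, not the real service).
    best_index, best_score = None, 0.0
    for index, old_text in enumerate(existing):
        score = SequenceMatcher(None, old_text, new_text).ratio()
        if score >= threshold and score > best_score:
            best_index, best_score = index, score
    return best_index


original = "Restart the router and wait two minutes before reconnecting to the network."
edited = original[:-1] + "!"  # a one-character edit
```

In this sketch, the lightly edited section is recognized as an update only when similarity matching is active; with skip_similarity=True it would be treated as new content.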
Image Removal Contour Threshold
Type
Text field (integers)
Default
400
Controls how aggressively the parser removes background images from parsed documents. The parser evaluates images using Canny edge detection, counting contour segments in each image; it strips images with a count at or below the threshold and preserves images above it. The parser keeps high-contour images such as diagrams and charts and removes low-contour images such as decorative banners and background gradients. Set to 0 to disable background image removal entirely. While there is no upper bound, a practical range is 0 to 1000. Increase the threshold if background images are appearing in parsed content; decrease it if the parser is incorrectly stripping useful diagrams.
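The threshold comparison, including the boundary case and the special meaning of 0, can be sketched as below. The contour count is assumed to come from an upstream edge-detection step that is not shown, and should_strip_image is a hypothetical name.

```python
def should_strip_image(contour_count: int, threshold: int = 400) -> bool:
    """Decide whether an image is treated as background and removed."""
    if threshold == 0:
        return False  # 0 disables background image removal entirely
    # Counts at or below the threshold are stripped; higher counts are kept.
    return contour_count <= threshold
```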
Section detection
Controls how the parser identifies and creates sections within documents, including font-based heading detection, table of contents generation, and section title management.
Detect sections using font properties
Type
Checkbox
Default
Disabled
When enabled, the parser analyzes font size and boldness across a document to identify section headings. The parser promotes text with larger or bolder fonts than the document average to section subjects, with relative font sizes determining heading hierarchy. This applies to natively parsed HTML and simple text-based formats processed through the HTML parser, including TXT, Markdown, EML, and MSG. It does not apply to converted formats such as Word or PowerPoint, which use header-based section detection instead, or to PDF documents. Enable this for documents that rely on visual formatting to communicate structure rather than proper heading markup.
Enabling this on documents with inconsistent font usage may produce unexpected section splits. Documents that already use semantic heading tags are unaffected; the parser detects their hierarchy regardless of this setting.
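One way to picture font-based detection is the sketch below. It assumes text arrives as spans with a font size and a bold flag, promotes any span that is larger than the average size or bold, and ranks heading levels by font size. The Span type and these exact rules are illustrative assumptions, not the parser's documented algorithm.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Span:
    text: str
    font_size: float
    bold: bool = False


def detect_headings(spans):
    """Return (text, level) pairs for spans that stand out from the average.

    Level 1 is the largest font among the promoted spans.
    """
    if not spans:
        return []
    average = mean(s.font_size for s in spans)
    candidates = [s for s in spans if s.font_size > average or s.bold]
    sizes = sorted({s.font_size for s in candidates}, reverse=True)
    return [(s.text, sizes.index(s.font_size) + 1) for s in candidates]


spans = [
    Span("Install", 18), Span("body a", 10), Span("body b", 10),
    Span("Steps", 14), Span("body c", 10), Span("body d", 10),
]
headings = detect_headings(spans)
```

A document that bolds ordinary body text would produce extra headings under these rules, which mirrors the warning about inconsistent font usage.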
Add contents section
Type
Checkbox
Default
Disabled
When enabled, the parser creates a table of contents as a standalone section at the beginning of the document. The contents section contains a bulleted list of all top-level section titles, including the first. When disabled, the parser instead appends a similar overview inline to the first section, listing all sections after the first.
The contents section title uses the document title with " (Contents)" appended if the document title matches the first section's title, or "Document Contents" if no document title is available.
Enable this when your knowledge base contains long, multi-section documents and you want users to see a structural overview when a document appears in search results.
Merge Sections with Similar Subjects
Type
Checkbox
Default
Enabled
When enabled, the parser merges consecutive sections whose headings share a nearly identical prefix before a colon or hyphen delimiter. For example, two consecutive sections titled "Step 1:" and "Step 2:" are merged into one, as are two consecutive sections both titled "Troubleshooting:". The parser preserves the first section's heading and appends the subsequent section's heading and content to it. No content is removed.
Only consecutive sections are compared; sections with similar headings separated by an unrelated section are not merged. Sections without headings are not eligible. This setting applies to all supported document formats.
Sections that share a prefix but cover different subtopics may be merged unintentionally. Disable this setting when parallel headings cover different topics and should remain separate.
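The merge rule can be sketched as follows. The similarity measure (difflib.SequenceMatcher), the 0.8 cutoff, and the requirement that both headings contain a delimiter are assumptions chosen so that the "Step 1:" and "Step 2:" example merges; the parser's actual comparison is not documented here.

```python
import re
from difflib import SequenceMatcher

DELIMITER = re.compile(r"[:\-]")
PREFIX_SIMILARITY = 0.8  # assumed cutoff for "nearly identical"


def heading_prefix(heading):
    """Return the lowercased text before the first colon or hyphen, or None."""
    if not heading:
        return None
    match = DELIMITER.search(heading)
    if not match:
        return None
    return heading[:match.start()].strip().lower()


def merge_similar_sections(sections):
    """Merge consecutive (heading, content) pairs with similar heading prefixes."""
    merged = []
    previous_prefix = None
    for heading, content in sections:
        prefix = heading_prefix(heading)
        similar = (
            prefix and previous_prefix
            and SequenceMatcher(None, previous_prefix, prefix).ratio() >= PREFIX_SIMILARITY
        )
        if merged and similar:
            # Keep the first heading; append the later heading and content.
            first_heading, first_content = merged[-1]
            merged[-1] = (first_heading, f"{first_content}\n{heading}\n{content}")
        else:
            merged.append((heading, content))
        previous_prefix = prefix
    return merged


result = merge_similar_sections([
    ("Step 1: Unplug", "a"),
    ("Step 2: Wait", "b"),
    ("Overview", "c"),
    ("Step 3: Plug in", "d"),
])
```

In this example "Step 3" stays separate because an unrelated section sits between it and the earlier steps.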
Use Predefined section titles
Type
Text field
Default
Empty
Defines a list of text patterns that the parser promotes to section headings, regardless of how the source document formats them. When configured, the parser scans document text for lines matching the specified patterns and wraps each match with an HTML heading tag, creating a section boundary in the parsed output. Matching is case-insensitive and each entry supports regular expressions. Use this when source documents contain consistent section labels that lack heading markup, such as plain-text knowledge base exports or documents authored without heading styles.
Two input formats are accepted:
All matches become H1 headings.
Matches are assigned to the specified heading level.
Entering {} activates a built-in default set of titles including Summary, Introduction, Description, Issues, Causes, Solutions, Instructions, Environment, and Modification History.
See also: Ignore section titles
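The promotion behavior can be sketched as below. The dictionary of patterns is a Python illustration only and is not the syntax of the settings field; whole-line matching is also an assumption.

```python
import re
from html import escape

# Illustrative pattern-to-level mapping; NOT the settings field's input syntax.
PATTERNS = {r"summary": 1, r"known issues?": 2}


def promote_titles(lines, patterns):
    """Wrap lines that match a pattern in an HTML heading tag."""
    compiled = [(re.compile(p, re.IGNORECASE), level) for p, level in patterns.items()]
    output = []
    for line in lines:
        stripped = line.strip()
        for regex, level in compiled:
            if regex.fullmatch(stripped):  # assumption: the whole line must match
                output.append(f"<h{level}>{escape(stripped)}</h{level}>")
                break
        else:
            output.append(line)
    return output


promoted = promote_titles(
    ["SUMMARY", "The device reboots.", "Known Issue", "None."], PATTERNS
)
```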
Ignore section titles
Type
Text field
Default
Empty
Accepts a comma-separated list of exact text strings. When processing PDFs through Form Recognizer, the parser omits any paragraph whose full text exactly matches an entry in this list; it does not appear as a heading, body text, or section. For example, setting this to Disclaimer, Copyright Notice removes any standalone paragraph that reads exactly "Disclaimer" or "Copyright Notice"; a paragraph reading "Disclaimer: This document is provided as-is" is unaffected. Use this to remove recurring boilerplate from Form Recognizer-processed PDFs, such as repeated headers, footers, copyright notices, or document metadata lines.
Matching is case-sensitive and exact; this field does not support partial matches or regular expressions. If the same text appears in both Use Predefined section titles and Ignore section titles, Aisera ignores the text rather than promoting it.
See also: Use Predefined section titles
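The exact-match rule can be illustrated with a short sketch. Trimming whitespace around each comma-separated entry is an assumption inferred from the "Disclaimer, Copyright Notice" example; the function names are hypothetical.

```python
def parse_ignore_list(raw: str) -> set:
    """Split the comma-separated field into exact strings."""
    return {entry.strip() for entry in raw.split(",") if entry.strip()}


def drop_ignored(paragraphs, raw_ignore_list):
    """Omit paragraphs whose full text exactly matches an ignored entry."""
    ignored = parse_ignore_list(raw_ignore_list)
    return [p for p in paragraphs if p not in ignored]


kept = drop_ignored(
    [
        "Disclaimer",
        "Disclaimer: This document is provided as-is",
        "disclaimer",
        "Copyright Notice",
        "Body text",
    ],
    "Disclaimer, Copyright Notice",
)
```

The lowercase "disclaimer" paragraph survives because matching is case-sensitive.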
PDF processing
Controls PDF-specific parsing behavior, including page rendering, Form Recognizer routing, and image extraction.
Show sections as images
Type
Checkbox
Default
Disabled
When enabled, the parser renders each page of a PDF document as an image and creates one section per page in the parsed output. The parser ignores text-based headings and section boundaries within a page. The primary use of this setting is to serve specific PDF pages through Workflows or Hyperflows where visual fidelity matters more than searchable text.
This setting is not compatible with retrieval-augmented generation. Aisera does not process sections created in this mode through the standard content pipeline, and AI search does not return them. This setting has no effect on non-PDF formats.
Pdf Names
Type
Text field
Default
Empty
Accepts a comma-separated list of PDF filenames, without the .pdf extension, to route through Microsoft Form Recognizer for advanced parsing. Aisera sends only PDFs whose filenames match an entry in this list to Form Recognizer. Two special values are accepted:
all: Routes every PDF in the data source through Form Recognizer.
ocr: Routes only image-only PDFs through Form Recognizer.
When Accurate Table Parse is also enabled, only the files listed here receive accurate table parsing. Leaving this field empty disables Form Recognizer routing regardless of other settings.
Example:
Form Recognizer is a paid Azure service. Routing additional documents through it increases processing costs. Review this field carefully before saving.
See also: Accurate Table Parse, Force OCR, Microsoft Form Recognizer
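The routing decision for Pdf Names can be sketched as follows. How the special values combine with explicit filenames, and whether filename comparison is case-sensitive, are assumptions in this sketch; the filenames are invented for illustration.

```python
from pathlib import PurePosixPath


def routes_to_form_recognizer(filename, pdf_names, is_image_only=False):
    """Return True when a PDF would be sent to Form Recognizer."""
    entries = [e.strip() for e in pdf_names.split(",") if e.strip()]
    if not entries:
        return False  # an empty field disables routing
    if "all" in entries:
        return True
    if "ocr" in entries and is_image_only:
        return True
    # Entries are filenames without the .pdf extension.
    return PurePosixPath(filename).stem in entries
```

Because each routed file incurs a paid call, listing specific names rather than all keeps costs predictable.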
Copy Images During Parsing
Type
Checkbox
Default
Enabled
When enabled, the parser extracts images from documents, including embedded and externally linked images, and stores a copy on the Aisera platform. Parsed content references Aisera's copy instead of the original source. When disabled, images in parsed content reference their original source. If that source requires authentication or uses expiring URLs, images may fail to load when Aisera presents content to users. Disable this setting only if your organization has requirements against retaining document images on Aisera's infrastructure, or if the data source contains documents with a very high volume of images where storage overhead is a concern.
File handling
Controls how the parser manages file references and metadata after parsing, including download link behavior, title formatting, translation, and PowerPoint rendering.
Use origin URL for files download
Type
Checkbox
Default
Disabled
Controls which URL the platform uses as the download link for parsed documents. When disabled, document download links in search results and knowledge base articles point to a cached copy of the file on the Aisera platform. When enabled, those links instead point to the file's original source address, such as the Confluence, SharePoint, or website URL it was crawled from. Enable this when you want end users to open the live, authoritative version of a document at its source rather than a cached copy, or when users should land on the source system for access control reasons.
The original source URL must be recorded at crawl time for this setting to take effect. If the connector did not record an origin URL for a document, this setting has no effect for that document.
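The link selection, including the case where no origin URL was recorded, can be sketched as below. The function name and example URLs are hypothetical.

```python
def download_url(cached_url, origin_url, use_origin_url):
    """Pick the download link for a parsed document."""
    if use_origin_url and origin_url:
        return origin_url
    # Setting disabled, or no origin URL recorded at crawl time.
    return cached_url


cached = "https://cache.example/doc.pdf"    # placeholder URL
origin = "https://source.example/doc.pdf"   # placeholder URL
```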
Append parent title to filename
Type
Checkbox
Default
Disabled
When enabled, the parser prepends the parent knowledge base title to each document's title during parsing, producing a combined title in the format {parentTitle}-{documentTitle}. For example, a document titled "User Guide" under the "Product Documentation" knowledge base becomes "Product Documentation-User Guide". This applies to both PDF and HTML documents, including section-level titles within PDFs. If the parser finds no parent title for a document, the title remains unchanged. Enable this when documents from different knowledge bases or folders share similar names and the title needs to reflect source context, such as when crawling multi-section portals where individual article names are ambiguous without their parent category.
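The title format can be expressed as a one-line rule. The function name combined_title is illustrative.

```python
def combined_title(document_title, parent_title=None):
    """Prepend the parent title in the {parentTitle}-{documentTitle} format."""
    if not parent_title:
        return document_title  # no parent title found: title is unchanged
    return f"{parent_title}-{document_title}"
```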
Enable Translation Service
Type
Checkbox
Default
Disabled
When enabled, Aisera translates document titles, section subjects, and body content to English before indexing. Enable this when your knowledge base contains documents in non-English languages and your end users primarily search in English, such as when source systems contain multilingual content from Confluence spaces, file shares, or web crawls.
Translation errors do not block the commit pipeline. If a translation fails, Aisera indexes the document without translated content. If the detected language is already English, Aisera marks the document as successfully translated without making a translation call.
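The failure and skip behavior can be sketched as below. The translate callable, the document shape, and the translation_status field are hypothetical stand-ins; the platform's actual translation interface is not shown here.

```python
def prepare_for_indexing(document, detected_language, translate):
    """Translate a document's fields to English without blocking indexing.

    document: dict of text fields (hypothetical shape).
    translate: callable taking and returning a string (hypothetical).
    """
    if detected_language == "en":
        # Already English: marked successful, no translation call is made.
        return {**document, "translation_status": "success"}
    try:
        translated = {key: translate(value) for key, value in document.items()}
    except Exception:
        # A failed translation does not block indexing of the original content.
        return {**document, "translation_status": "failed"}
    return {**translated, "translation_status": "success"}


calls = []


def fake_translate(text):
    calls.append(text)
    return text.upper()  # placeholder for a real translation


def failing_translate(text):
    raise RuntimeError("translation service unavailable")
```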
Enable PowerPoint Photo Mode
Type
Checkbox
Default
Disabled
Controls how the parser converts PowerPoint presentations. When disabled, Aisera processes each slide individually, extracting text, tables, and images into structured, searchable HTML. When enabled, the parser renders each slide as an image. The parser does not extract slide text in this mode; users cannot search slides by content. Enable this when slides contain complex visual layouts or design-heavy content where visual fidelity matters more than text search.
Keep this disabled when slides contain instructional or reference text that users need to search.