# Parser

The **Settings > Configuration > Parser** window lets you set parser parameters for your tenant. These settings apply to every bot you create in your Aisera tenant.

## General behavior

### Ignore Unsupported Documents

| **Type**    | Checkbox |
| ----------- | -------- |
| **Default** | Enabled  |

Determines how the platform handles unsupported file types during ingestion. When enabled, the parser skips unsupported documents. When disabled, the platform attempts to parse all document types.

**Supported formats**:

* PDF
* HTML
* DOC/DOCX
* PPTX
* MD
* TXT
* EML
* MSG
* CSV

### Filter out non-ASCII characters

| **Type**    | Checkbox |
| ----------- | -------- |
| **Default** | Disabled |

Replaces non-ASCII characters with spaces during parsing. Applies only to documents that use Western character sets.
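
The replacement described above can be pictured with a minimal Python sketch. This is an illustration only, not the platform's implementation: the function name `filter_non_ascii` is hypothetical, and it assumes each non-ASCII character is replaced by exactly one space.

```python
def filter_non_ascii(text: str) -> str:
    """Replace every character outside the ASCII range with a single space.

    Assumption: one space per replaced character, so the length is preserved.
    """
    return "".join(ch if ord(ch) < 128 else " " for ch in text)


cleaned = filter_non_ascii("Café price: 5€")
print(repr(cleaned))
```

Because accented letters are replaced rather than transliterated, words such as "Café" lose a character, which is one reason the setting is disabled by default.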

### is Private?

| **Type**    | Checkbox |
| ----------- | -------- |
| **Default** | Disabled |

No description available.

### Use Templates?

| **Type**    | Checkbox |
| ----------- | -------- |
| **Default** | Disabled |

No description available.

### Signify FAQ

| **Type**    | Checkbox |
| ----------- | -------- |
| **Default** | Enabled  |

Treats content ending with "?" as a question and formats it as an FAQ section, using the content that follows as the answer. The parser compares detected questions with existing sections to avoid creating duplicates.
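
As a rough illustration of this behavior, the following Python sketch pairs each line ending in "?" with the text that follows it. The function name `extract_faq`, the case-insensitive duplicate check, and the choice to drop the answer of a duplicate question are all assumptions made for the example; the platform's actual matching rules may differ.

```python
def extract_faq(lines, existing_titles=()):
    """Pair each line ending in '?' with the lines that follow it.

    Assumptions: duplicates are detected by case-insensitive comparison
    against existing section titles, and a duplicate question is skipped
    together with its answer.
    """
    seen = {title.strip().lower() for title in existing_titles}
    faqs = []
    current = None  # [question, answer_lines] while an answer is being collected
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.endswith("?"):
            if line.lower() in seen:
                current = None  # already present as a section: skip it
            else:
                seen.add(line.lower())
                current = [line, []]
                faqs.append(current)
        elif current is not None:
            current[1].append(line)
    return [(question, " ".join(answer)) for question, answer in faqs]


document = [
    "How do I reset my password?",
    "Open Settings.",
    "Click Reset.",
    "Is SSO supported?",
    "Yes.",
]
print(extract_faq(document, existing_titles=["Is SSO supported?"]))
```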

### Image Removal Contour Threshold

| **Type**    | Text field (integers) |
| ----------- | --------------------- |
| **Default** | `400`                 |

Controls the removal of background and low-content images using [Canny Edge detection](https://en.wikipedia.org/wiki/Canny_edge_detector). The value corresponds to the number of contours detected in an image. Images with fewer contours than the specified threshold are removed. A lower threshold results in fewer images being removed; a higher threshold results in more images being removed.

Set this field to `0` to disable the removal of background and low-content images. While there is no upper bound, a practical range for this field is 0 to 1000.
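
The threshold rule can be summarized in a short Python sketch. It only models the decision, not the edge detection itself: the function name `should_remove_image` is hypothetical, and the contour count is assumed to have been computed elsewhere.

```python
def should_remove_image(contour_count: int, threshold: int = 400) -> bool:
    """Decide whether an image counts as background or low-content.

    A threshold of 0 disables removal. Otherwise an image is removed when
    it has strictly fewer contours than the threshold.
    """
    if threshold <= 0:
        return False
    return contour_count < threshold


# Hypothetical contour counts for three images in one document.
contours = {"logo.png": 35, "diagram.png": 400, "screenshot.png": 1250}
kept = [name for name, count in contours.items() if not should_remove_image(count)]
print(kept)
```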

***

## Section detection

### Detect sections using font properties

| **Type**    | Checkbox |
| ----------- | -------- |
| **Default** | Disabled |

Uses font size and weight to automatically detect section boundaries and assign hierarchy levels during HTML parsing. Content with a font size at or above the 75th percentile of all fonts in the document is promoted to a section title, with bold text receiving additional weight. Larger fonts receive higher hierarchy levels: level 1 for the largest, level 2 for the next largest, and so forth.

This is useful for documents that lack explicit structural markup and instead rely on visual font styling to indicate sections. This applies to standard HTML content and simple text-based formats such as TXT, Markdown, EML, and MSG processed through HTMLParser. Converted documents that contain font cluster metadata, such as DOCX, use header-based section detection instead.
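
The percentile rule can be sketched in Python as follows. This is a simplified illustration under several assumptions: the function name `detect_titles` is hypothetical, the percentile is computed over text spans using the standard library's inclusive method, and the extra weight given to bold text is left out because its exact form is not specified here.

```python
from statistics import quantiles


def detect_titles(spans):
    """Return (text, level) pairs for spans promoted to section titles.

    spans: list of (text, font_size) tuples in document order.
    Assumption: bold weighting is ignored in this sketch.
    """
    sizes = [size for _, size in spans]
    if len(sizes) < 2:
        return []
    # Third quartile cut point, i.e. the 75th percentile of the font sizes.
    p75 = quantiles(sizes, n=4, method="inclusive")[2]
    # Distinct title sizes, largest first: rank 1 becomes level 1, and so on.
    title_sizes = sorted({s for s in sizes if s >= p75}, reverse=True)
    level_of = {size: rank for rank, size in enumerate(title_sizes, start=1)}
    return [(text, level_of[size]) for text, size in spans if size in level_of]


spans = [
    ("Installation Guide", 24),
    ("Overview", 18),
    ("This guide covers setup.", 12),
    ("Install the package.", 12),
    ("Restart the service.", 12),
    ("Check the logs.", 12),
    ("Open the console.", 12),
    ("Contact support if needed.", 12),
]
print(detect_titles(spans))
```

In this sample the 75th percentile falls between the body size and the two larger sizes, so only the two larger spans are promoted.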

### Add contents section

| **Type**    | Checkbox |
| ----------- | -------- |
| **Default** | Disabled |

By default, when a document has two or more sections, the first section includes an inline list of other top-level section titles. This gives readers a brief overview of the document's structure within the first section itself.

When enabled, that overview is instead presented as its own standalone section at the beginning of the document. The section is titled with the document's title, or "Document Contents" if no title is available, and contains a bulleted list of the document's top-level section titles.

This applies to all supported document formats.
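
A minimal Python sketch of how such a standalone section could be assembled is shown below. The function name `build_contents_section`, the `* ` bullet marker, and the requirement of at least two sections are assumptions for the example.

```python
def build_contents_section(section_titles, document_title=None):
    """Return (title, body) for a standalone contents section.

    Returns None when the document has fewer than two sections
    (assumed to mirror the default overview behavior).
    """
    if len(section_titles) < 2:
        return None
    title = document_title or "Document Contents"
    body = "\n".join(f"* {name}" for name in section_titles)
    return title, body


title, body = build_contents_section(["Install", "Configure", "Troubleshoot"])
print(title)
print(body)
```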

### Merge Sections with Similar Subjects

| **Type**    | Checkbox |
| ----------- | -------- |
| **Default** | Enabled  |

Combines consecutive sections whose headings share a nearly identical prefix before a colon or hyphen delimiter. For example, if a document contains two consecutive sections both titled **"Troubleshooting:"**, they will be merged into one.

Only consecutive sections are compared. If two sections with similar headings are separated by an unrelated section, they will not be merged. Sections without headings are not eligible for merging.

When sections are merged, the first section's heading is preserved. The subsequent section's heading and content are added to the first section. No content is removed.

{% hint style="warning" %}
Sections that share a prefix but cover different subtopics may be unintentionally merged. Disable this setting if that occurs.
{% endhint %}

This applies to all supported document formats.
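
The merge rule can be approximated with the following Python sketch. Several details are assumptions made for illustration: the function names are hypothetical, a hyphen counts as a delimiter only when surrounded by spaces, and "nearly identical" is modeled as a `difflib` similarity ratio of at least 0.9.

```python
import re
from difflib import SequenceMatcher


def heading_prefix(heading):
    """Return the lowercased text before the first ':' or ' - ', or None."""
    if not heading:
        return None
    match = re.match(r"\s*(.+?)\s*(?::|\s-\s)", heading)
    return match.group(1).lower() if match else None


def merge_similar(sections, min_ratio=0.9):
    """Merge consecutive (heading, content) pairs with similar heading prefixes.

    The first heading is preserved; the later heading and content are
    appended to the first section, so no content is removed.
    """
    merged = []
    for heading, content in sections:
        prefix = heading_prefix(heading)
        if merged and prefix:
            prev_heading, prev_content = merged[-1]
            prev_prefix = heading_prefix(prev_heading)
            if prev_prefix and SequenceMatcher(None, prefix, prev_prefix).ratio() >= min_ratio:
                merged[-1] = (prev_heading, f"{prev_content}\n{heading}\n{content}")
                continue
        merged.append((heading, content))
    return merged


sections = [
    ("Troubleshooting: Network", "Check the cable."),
    ("Troubleshooting: Login", "Reset the password."),
    ("FAQ", "Common questions."),
    ("Troubleshooting: Other", "Contact support."),
]
for heading, content in merge_similar(sections):
    print(heading, "|", content.replace("\n", " / "))
```

The last section is not merged because an unrelated section separates it from the first two, matching the consecutive-only rule.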

### Use Predefined section titles

| **Type**    | Text field |
| ----------- | ---------- |
| **Default** | Empty      |

Specifies a list of words, sentences, or regular expressions that should be recognized as section titles. When a line of text matches a provided pattern, it becomes a section header. This is useful for documents that do not have defined header elements.

You can define the recognized phrases using a comma-separated list:

```
"Summary, Issue, Cause, Resolution"
```

For more explicit definitions you may use JSON:

```json
{
  "h1": [
    "Summary",
    "Issue"
  ],
  "h2": [
    "Details"
  ]
}
```
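
To make the two input forms concrete, here is a hedged Python sketch of how a value for this field might be interpreted. The function names `parse_title_patterns` and `match_title` are hypothetical, and treating every entry as a regular expression that must match the whole line is an assumption; the platform's matching rules may differ.

```python
import json
import re


def parse_title_patterns(value):
    """Return (level, pattern) pairs; level is None for the comma-separated form."""
    try:
        parsed = json.loads(value)
    except json.JSONDecodeError:
        parsed = value  # not JSON: treat as a plain comma-separated list
    if isinstance(parsed, dict):
        return [(level, p) for level, patterns in parsed.items() for p in patterns]
    return [(None, p.strip()) for p in str(parsed).split(",") if p.strip()]


def match_title(line, patterns):
    """Return (True, level) when the whole line matches a pattern, else (False, None)."""
    for level, pattern in patterns:
        if re.fullmatch(pattern, line.strip()):
            return True, level
    return False, None


patterns = parse_title_patterns('{"h1": ["Summary", "Issue"], "h2": ["Details"]}')
print(match_title("Details", patterns))
print(match_title("Detailed steps", patterns))
```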

Companion to: [Ignore section titles](#ignore-section-titles)

### Ignore section titles

| **Type**    | Text field |
| ----------- | ---------- |
| **Default** | Empty      |

Accepts a comma-separated list of words or phrases. When a line of text in a document exactly matches one of the provided values, that line will be excluded from the parsed output entirely. **The matching is case-sensitive.**

For example, setting this to `Disclaimer, Copyright Notice` will remove any standalone lines that read exactly "Disclaimer" or "Copyright Notice" from the parsed document. Lines that only partially match, such as "Disclaimer: This document is provided as-is", will not be affected.

While Use Predefined section titles promotes matching text to section headings, this configuration removes matching text from the output. If the same text appears in both configurations, it will be ignored rather than promoted.
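
The exact, case-sensitive matching can be illustrated with a small Python sketch. The function name `drop_ignored_lines` is hypothetical, and trimming surrounding whitespace before comparing is an assumption.

```python
def drop_ignored_lines(lines, ignore_value):
    """Remove lines that exactly match one of the comma-separated values.

    Matching is case-sensitive; partial matches are kept.
    """
    ignored = {item.strip() for item in ignore_value.split(",") if item.strip()}
    return [line for line in lines if line.strip() not in ignored]


lines = [
    "Disclaimer",
    "Disclaimer: This document is provided as-is",
    "disclaimer",
    "Copyright Notice",
    "Installation steps",
]
print(drop_ignored_lines(lines, "Disclaimer, Copyright Notice"))
```

Only the standalone "Disclaimer" and "Copyright Notice" lines are removed; the lowercase and partial variants remain.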

Companion to: [Use Predefined section titles](#use-predefined-section-titles)

***

## PDF processing

### Show sections as images

| **Type**    | Checkbox |
| ----------- | -------- |
| **Default** | Disabled |

Converts each page of a PDF document into its own section, rendered as a page image rather than parsed text, ignoring any section boundaries within the page.

This feature is not compatible with RAG. Sections created in this mode are not processed through the standard content pipeline and will not be returned by AI search.

The primary use of this feature is to serve specific pages through Workflows or Hyperflows.

### Pdf Names

| **Type**    | Text field |
| ----------- | ---------- |
| **Default** | Empty      |

Specifies which PDF files should be parsed using [Microsoft Form Recognizer](https://azure.microsoft.com/en-us/products/ai-foundry/tools/document-intelligence). This is useful for scanned documents, documents containing complex layouts, or documents containing forms. It also provides more accurate extraction of information from tables.

Enter a comma-separated list of file names *without the file extension*. You may also use the following values:

* `"all"` — All PDFs will use Form Recognizer.
* `"ocr"` — PDFs containing only images will use Form Recognizer.

The following is an example of a valid input for this field:

```
"invoice, receipt, ocr"
```
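
The selection logic described above can be sketched in Python. This models only the decision of whether a PDF is routed to Form Recognizer; the function name `uses_form_recognizer`, the `image_only` flag, and case-sensitive name comparison are assumptions for the example.

```python
from pathlib import PurePath


def uses_form_recognizer(filename, pdf_names, image_only=False):
    """Decide whether a PDF should be parsed with Form Recognizer.

    pdf_names: the field value, a comma-separated list of file names
    without extensions, optionally including "all" or "ocr".
    image_only: True when the PDF contains only images (assumed to be
    detected elsewhere).
    """
    names = {n.strip() for n in pdf_names.strip('"').split(",") if n.strip()}
    if "all" in names:
        return True
    if "ocr" in names and image_only:
        return True
    return PurePath(filename).stem in names


setting = "invoice, receipt, ocr"
print(uses_form_recognizer("invoice.pdf", setting))
print(uses_form_recognizer("manual.pdf", setting))
print(uses_form_recognizer("scan.pdf", setting, image_only=True))
```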

{% hint style="danger" %}
Using Microsoft Form Recognizer may incur extra charges. Using `"all"` will apply Form Recognizer to every PDF and can be expensive.
{% endhint %}

{% hint style="info" %}
This field also controls which documents are processed by the Document Converter and Microsoft Form Recognizer features. See [Document Converter](https://docs.aisera.com/aisera-platform/tenant-setup/aisera-platform-configuration/tenant-configuration-settings/parser/document-converter) and [Microsoft Form Recognizer](https://docs.aisera.com/aisera-platform/tenant-setup/aisera-platform-configuration/tenant-configuration-settings/parser/microsoft-form-recognizer) for details.
{% endhint %}

### Copy Images During Parsing

| **Type**    | Checkbox |
| ----------- | -------- |
| **Default** | Enabled  |

Copies images from the source document to Aisera server storage during parsing, enabling faster image loading and reducing dependency on external image sources.

***

## File handling

### Use origin URL for files download

| **Type**    | Checkbox |
| ----------- | -------- |
| **Default** | Disabled |

Uses the original source URL for file downloads instead of S3 storage. Files are displayed and downloaded from their original location. Use this when users need to access files directly from the original source, especially if the URLs have specific access controls or are more convenient for users.

### Append parent title to filename

| **Type**    | Checkbox |
| ----------- | -------- |
| **Default** | Disabled |

Prefixes the document filename with the name of the source knowledge base, producing filenames of the form `<Parent Knowledge Base>-<Filename>`. This is useful for organizing filenames by their source. For instance, if a document is titled "User Guide" under the "Product Documentation" knowledge base, the filename will be "Product Documentation-User Guide".

### Enable Translation Service

| **Type**    | Checkbox |
| ----------- | -------- |
| **Default** | Disabled |

No description available.

### Enable PowerPoint Photo Mode

| **Type**    | Checkbox |
| ----------- | -------- |
| **Default** | Disabled |

Parses PowerPoint presentations as images, converting each slide to an image embedded in HTML. This preserves the exact visual appearance of slides. This is useful for presentations using many visuals, or presentations where visual fidelity is more important than searchable text.
