# Format-specific parameters

### Renamed HTML Tags

| **Type**    | Text field |
| ----------- | ---------- |
| **Default** | Empty      |

Accepts semicolon-separated instructions that remap HTML tags to different element types or exclude elements by CSS class from parsed output. Enter instructions in one of two forms:

**Tag renaming:** `tag=X,replacetag=Y` instructs the parser to treat tag `X` as if it were tag `Y`. For example, `tag=span,replacetag=h2` causes the parser to interpret every `<span>` element as a heading.

**Class exclusion:** a plain CSS class name causes the parser to skip every HTML element whose `class` attribute exactly matches the entry, including all of its children.

Both forms can be combined in a single value:

```
tag=span,replacetag=p;tag=button,replacetag=h3;navpage-header;page-footer
```

All matching is case-insensitive.
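To make the two instruction forms concrete, the following Python sketch splits a value into tag renames and class exclusions. It only illustrates the syntax described above and is not the parser's implementation. The names `parse_renamed_html_tags` and `RenamedTagsConfig` are hypothetical, and the handling of stray whitespace and empty segments is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class RenamedTagsConfig:
    # Maps a source tag to the tag it should be treated as.
    renames: dict[str, str] = field(default_factory=dict)
    # CSS class names whose elements (and their children) are excluded.
    excluded_classes: list[str] = field(default_factory=list)


def parse_renamed_html_tags(value: str) -> RenamedTagsConfig:
    """Split a Renamed HTML Tags value into renames and class exclusions."""
    config = RenamedTagsConfig()
    for raw_entry in value.split(";"):
        # Lowercase because the documentation says matching is case-insensitive.
        entry = raw_entry.strip().lower()
        if not entry:
            continue  # assumption: empty segments (trailing ';') are harmless
        if "=" not in entry:
            # Class exclusion form: a plain CSS class name.
            config.excluded_classes.append(entry)
            continue
        # Tag renaming form: tag=X,replacetag=Y
        pairs = {}
        for part in entry.split(","):
            key, sep, val = part.partition("=")
            if not sep:
                raise ValueError(f"expected key=value in {raw_entry!r}")
            pairs[key.strip()] = val.strip()
        if set(pairs) != {"tag", "replacetag"}:
            raise ValueError(f"expected tag=X,replacetag=Y, got {raw_entry!r}")
        config.renames[pairs["tag"]] = pairs["replacetag"]
    return config


example = "tag=span,replacetag=p;tag=button,replacetag=h3;navpage-header;page-footer"
parsed = parse_renamed_html_tags(example)
print(parsed.renames)
print(parsed.excluded_classes)
```

Running a check like this before saving the setting can catch a missing `replacetag` or a comma used where a semicolon was intended.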

### HTML Parameters

| **Type**    | Text field (JSON) |
| ----------- | ----------------- |
| **Default** | Empty             |

Accepts a JSON object that fine-tunes HTML document parsing behavior. Configure one or more of the following keys:

<table><thead><tr><th width="169">Key</th><th width="101">Type</th><th width="99">Default</th><th>Description</th></tr></thead><tbody><tr><td><code>excludedHeaders</code></td><td>Array</td><td>-</td><td>Header tags to exclude from heading detection when parsing PDFs and Office documents converted to HTML. Has no effect on native HTML documents. Example: <code>["h4","h5","h6"]</code></td></tr><tr><td><code>excludedPages</code></td><td>Array</td><td>-</td><td>1-based page numbers to skip entirely during parsing. Example: <code>[1,3]</code></td></tr><tr><td><code>headerFrequency</code></td><td>Integer</td><td><code>10</code></td><td>Minimum number of times a heading subject must appear in PDF-converted HTML before the parser promotes it to a section header. Minimum value: <code>1</code></td></tr><tr><td><code>faqDeduplication</code></td><td>Boolean</td><td>-</td><td>When <code>true</code>, screens early sections before applying FAQ detection. If a section fails the FAQ content check, it is preserved as regular content and screening ends, preventing content such as a table of contents from being misclassified as FAQ material.</td></tr><tr><td><code>version</code></td><td>Integer</td><td>-</td><td>Set to <code>3</code> or higher to enable experimental HTML flattening that pre-processes document structure before section detection.</td></tr></tbody></table>

Example:

```json
{"excludedHeaders": ["h4","h5","h6"], "headerFrequency": 5, "faqDeduplication": true}
```

Use `excludedHeaders` or `excludedPages` when minor heading levels or specific pages such as cover or index pages create unwanted section splits. Use `headerFrequency` when PDFs converted to HTML produce noisy section detection due to frequently repeated minor headings.

{% hint style="info" %}
Invalid or blank JSON falls back to defaults for all sub-parameters.
{% endhint %}
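Because invalid JSON silently falls back to defaults, it can help to check a value before pasting it into the field. The Python sketch below does that against the key table above. `check_html_parameters` is a hypothetical helper, not part of the platform, and flagging unknown keys is an assumption made to catch typos; the parser's actual treatment of unknown keys is not documented here.

```python
import json

KNOWN_KEYS = {"excludedHeaders", "excludedPages", "headerFrequency",
              "faqDeduplication", "version"}


def _is_int(value) -> bool:
    # bool is a subclass of int in Python, so rule it out explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def check_html_parameters(raw: str) -> list[str]:
    """Return a list of problems found in an HTML Parameters value."""
    if not raw.strip():
        return []  # blank is allowed; defaults apply
    try:
        params = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON, so defaults apply to every key: {exc.msg}"]
    if not isinstance(params, dict):
        return ["top-level value must be a JSON object"]

    # Assumption: an unknown key is most likely a typo worth reporting.
    problems = [f"unknown key: {key}" for key in sorted(params.keys() - KNOWN_KEYS)]

    headers = params.get("excludedHeaders", [])
    if not isinstance(headers, list) or not all(isinstance(h, str) for h in headers):
        problems.append("excludedHeaders must be an array of tag names")

    pages = params.get("excludedPages", [])
    if not isinstance(pages, list) or not all(_is_int(p) and p >= 1 for p in pages):
        problems.append("excludedPages must be an array of 1-based page numbers")

    frequency = params.get("headerFrequency", 10)  # documented default
    if not _is_int(frequency) or frequency < 1:
        problems.append("headerFrequency must be an integer of at least 1")

    if "faqDeduplication" in params and not isinstance(params["faqDeduplication"], bool):
        problems.append("faqDeduplication must be true or false")

    if "version" in params and not _is_int(params["version"]):
        problems.append("version must be an integer")

    return problems


value = '{"excludedHeaders": ["h4","h5","h6"], "headerFrequency": 5, "faqDeduplication": true}'
print(check_html_parameters(value))
```

An empty list means the value is well-formed JSON and matches the documented types; it does not guarantee that the chosen values suit a particular document.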

### PDF Parameters

| **Type**    | Text field (JSON) |
| ----------- | ----------------- |
| **Default** | Empty             |

Accepts a JSON object that customizes PDF parsing behavior. Several keys require `enhancedPhotoMode: true` to take effect.

Top-level keys:

<table><thead><tr><th width="180">Key</th><th width="86">Type</th><th width="98">Default</th><th>Description</th></tr></thead><tbody><tr><td><code>imageResolution</code></td><td>Integer</td><td><code>144</code></td><td>Resolution in dots per inch at which the parser renders PDF pages as images. Range: <code>72</code>–<code>360</code>. Higher values produce sharper images at the cost of larger file sizes.</td></tr><tr><td><code>enhancedPhotoMode</code></td><td>Boolean</td><td>-</td><td>When <code>true</code>, uses font analysis to identify document sections. Also acts as a fallback parsing mode when the PDF converter is unavailable.</td></tr><tr><td><code>subjects</code></td><td>Array</td><td>-</td><td>Replaces the default subject keywords used to identify section headings (<code>chapter</code>, <code>section</code>, <code>article</code>, <code>part</code>, <code>paragraph</code>, <code>articolo</code>). Requires <code>enhancedPhotoMode: true</code>.</td></tr><tr><td><code>annexes</code></td><td>Array</td><td>-</td><td>Replaces the default appendix keywords used to detect appendix sections (<code>appendix</code>, <code>annex</code>, <code>annexe</code>, <code>appex</code>, <code>załącznik</code>, <code>attachment</code>, <code>allegato</code>). Requires <code>enhancedPhotoMode: true</code>.</td></tr><tr><td><code>pdfConfiguration</code></td><td>Object</td><td>-</td><td>Nested object for advanced document segmentation. Accepts <code>links</code>, <code>pdfSegments</code>, and <code>strategy</code> keys. Requires <code>enhancedPhotoMode: true</code> for <code>pdfSegments</code> and <code>strategy</code>.</td></tr></tbody></table>

The `pdfConfiguration` key accepts a nested object with the following keys:

<table><thead><tr><th width="142">Key</th><th>Description</th></tr></thead><tbody><tr><td><code>links</code></td><td>Represents a PDF as a single linked section instead of parsed text. Each entry requires a <code>pdfName</code> value (a filename or <code>"all"</code>) and optionally accepts <code>title</code>, <code>linkText</code>, and <code>text</code>.</td></tr><tr><td><code>pdfSegments</code></td><td>Manual section definitions for specific PDFs. Each entry requires a <code>pdfName</code> (exact filename) and a <code>sections</code> array. Each section has <code>subject</code>, <code>sentence</code>, and <code>pageNo</code> (1-based, as a string). If <code>pdfSegments</code> is non-empty, <code>strategy</code> is ignored. Requires <code>enhancedPhotoMode: true</code>.</td></tr><tr><td><code>strategy</code></td><td>Page range strategies for specific PDFs. Each entry requires a <code>pdfName</code> (supports <code>"all"</code>) and a <code>strategies</code> array. Each strategy has <code>strategyName</code> (<code>"PRESENTATION"</code> or <code>"BOLD_FONTS"</code>), <code>startPage</code>, and <code>endPage</code> (both strings). Ignored if <code>pdfSegments</code> is non-empty. Requires <code>enhancedPhotoMode: true</code>.</td></tr></tbody></table>

Example:

```json
{
  "imageResolution": 200,
  "enhancedPhotoMode": true,
  "pdfConfiguration": {
    "pdfSegments": [
      {
        "pdfName": "policy-document",
        "sections": [
          {"subject": "Introduction", "sentence": "This policy applies", "pageNo": "1"},
          {"subject": "Scope", "sentence": "The following teams", "pageNo": "2"}
        ]
      }
    ]
  }
}
```

Use `imageResolution` to adjust rendered page image quality when sections appear as images. Enable `enhancedPhotoMode` for PDFs with consistent structural patterns where font-based section detection improves results. Use `pdfConfiguration.links` to surface a PDF as a downloadable link rather than extracted text.

See also: [Show sections as images](/aisera-platform/tenant-setup/aisera-platform-configuration/tenant-configuration-settings/parser.md#show-sections-as-images)
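The dependencies between PDF keys are easy to miss: several keys need `enhancedPhotoMode: true`, page numbers are strings, and `strategy` is ignored when `pdfSegments` is non-empty. The Python sketch below checks a parsed value against those documented rules. `check_pdf_parameters` is a hypothetical helper rather than a platform API, and the page values in the sample `strategy` entry are illustrative only.

```python
import json

STRATEGY_NAMES = {"PRESENTATION", "BOLD_FONTS"}


def check_pdf_parameters(params: dict) -> list[str]:
    """Return warnings for a parsed PDF Parameters object."""
    problems = []

    resolution = params.get("imageResolution", 144)  # documented default
    if (isinstance(resolution, bool) or not isinstance(resolution, int)
            or not 72 <= resolution <= 360):
        problems.append("imageResolution must be an integer from 72 to 360")

    pdf_config = params.get("pdfConfiguration", {})
    segments = pdf_config.get("pdfSegments", [])
    strategy = pdf_config.get("strategy", [])

    # Keys documented as requiring enhancedPhotoMode: true.
    dependent = [key for key in ("subjects", "annexes") if key in params]
    if segments:
        dependent.append("pdfConfiguration.pdfSegments")
    if strategy:
        dependent.append("pdfConfiguration.strategy")
    if dependent and params.get("enhancedPhotoMode") is not True:
        problems.append("enhancedPhotoMode must be true for: " + ", ".join(dependent))

    if segments and strategy:
        problems.append("strategy is ignored because pdfSegments is non-empty")

    for segment in segments:
        for section in segment.get("sections", []):
            if not isinstance(section.get("pageNo"), str):
                problems.append(f"pageNo must be a string in {segment.get('pdfName')}")

    for entry in strategy:
        for item in entry.get("strategies", []):
            if item.get("strategyName") not in STRATEGY_NAMES:
                problems.append(f"unknown strategyName: {item.get('strategyName')}")
            for page_key in ("startPage", "endPage"):
                if not isinstance(item.get(page_key), str):
                    problems.append(f"{page_key} must be a string")
    return problems


raw = """
{
  "enhancedPhotoMode": true,
  "pdfConfiguration": {
    "strategy": [
      {
        "pdfName": "all",
        "strategies": [
          {"strategyName": "BOLD_FONTS", "startPage": "1", "endPage": "10"}
        ]
      }
    ]
  }
}
"""
print(check_pdf_parameters(json.loads(raw)))
```

The sample value also shows the shape of a `strategy` entry, which the earlier example does not cover.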

### Docx Parameters

| **Type**    | Text field |
| ----------- | ---------- |
| **Default** | Empty      |

Defines custom Word style-to-HTML heading mappings used when converting `.docx` and `.doc` files to HTML. Enter a semicolon-separated list of entries in the following format:

```
StyleName,HTMLTag:mode
```

* **StyleName:** The name of the Word paragraph style to map, such as `Heading 1` or `Title`. Also accepts `bold` to map all bold-formatted paragraphs, and `underline` to map all underlined paragraphs.
* **HTMLTag:** The HTML heading tag to map the style to, such as `h1` or `h2`.
* **mode:** Optional. Set to `fresh` to create a new HTML element for each matching paragraph. Without `:fresh`, consecutive paragraphs sharing the same style are appended into a single element.

Example:

```
Heading 1,h1:fresh;Heading 2,h2:fresh;bold,h3
```

When not configured, the parser maps Word's built-in "Title" style to `<h1>` and "Subtitle" to `<h2>`, both using `:fresh` behavior by default. If the converted HTML contains no h1–h3 heading tags, the parser automatically promotes paragraphs consisting entirely of bold text to `<h1>` regardless of this setting. Configure this when Word documents use custom style names the parser does not recognize as headings by default, causing documents to parse as a single unstructured block.
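As an illustration of the `StyleName,HTMLTag:mode` format, the Python sketch below breaks a value into individual mappings. It is not the converter's implementation. `parse_docx_parameters` and `StyleMapping` are hypothetical names, and splitting each entry on its last comma is an assumption made so that a style name containing a comma would stay intact.

```python
from typing import NamedTuple


class StyleMapping(NamedTuple):
    style_name: str   # Word paragraph style, or "bold" / "underline"
    html_tag: str     # target heading tag, such as "h1"
    fresh: bool       # True when the entry ends with ":fresh"


def parse_docx_parameters(value: str) -> list[StyleMapping]:
    """Split a Docx Parameters value into style-to-heading mappings."""
    mappings = []
    for raw_entry in value.split(";"):
        entry = raw_entry.strip()
        if not entry:
            continue  # assumption: empty segments (trailing ';') are harmless
        # Assumption: the last comma separates the style name from the tag.
        style_name, sep, target = entry.rpartition(",")
        if not sep or not style_name.strip():
            raise ValueError(f"expected StyleName,HTMLTag[:mode], got {raw_entry!r}")
        html_tag, _, mode = target.partition(":")
        html_tag, mode = html_tag.strip(), mode.strip()
        if not html_tag:
            raise ValueError(f"missing HTML tag in {raw_entry!r}")
        if mode not in ("", "fresh"):
            raise ValueError(f"unsupported mode {mode!r} in {raw_entry!r}")
        mappings.append(StyleMapping(style_name.strip(), html_tag, mode == "fresh"))
    return mappings


for mapping in parse_docx_parameters("Heading 1,h1:fresh;Heading 2,h2:fresh;bold,h3"):
    print(mapping)
```

In the sample value, `Heading 1` and `Heading 2` each start a new element per paragraph, while consecutive bold paragraphs are merged into one `h3` because the `bold` entry omits `:fresh`.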

