# Aisera LLM Benchmarking

Aisera's [Large Language Model](https://aisera.com/blog/large-language-models-llms/) (LLM) benchmarking process evaluates the performance and capabilities of our AI-driven service automation solutions. This section outlines our benchmarking methodology, the key performance metrics used, and a comparative analysis against industry standards.

Our benchmarking process involves systematic testing across several tasks: Answer Detection Classification, Ambiguous Actionable Classification, Casual Gibberish Classification, and Intent Extraction. By employing standardized benchmarks and rigorous evaluation techniques, we ensure that our LLMs meet high standards of performance and reliability.

This section provides detailed insights into the specific metrics used in our [LLM evaluations](https://aisera.com/blog/llm-evaluation/), the results of these tests, and how Aisera's LLMs compare to other leading models in the industry. The goal is to transparently showcase the strengths and capabilities of Aisera's LLMs in delivering efficient and accurate AI-driven solutions.

## Tasks

The datasets used for the task evaluations were sampled from our Retrieval Augmented Generation ([RAG](https://aisera.com/blog/retrieval-augmented-generation-rag/)) system in financial and legal & compliance domains.

### Answer Detection Classification

* **Goal:** Determine whether the user's question is explicitly answered in the document.
* **Dataset:** 1000 data points of production-level query and document chunks.

### Ambiguous Actionable Classification

* **Goal:** Determine whether a user query makes a clear request that can be directly answered. Ambiguous queries trigger the system to request clarification from the user.
* **Dataset:** 1000 data points of production-level queries.

### Casual Gibberish Classification

* **Goal:** Determine whether a user query is actually relevant to the company the agent is serving. This includes safeguarding against attacks, filtering out casual chat, and identifying potentially harmful content.
* **Dataset:** 700 data points of production-level queries and 300 safety-related queries.

### Intent Extraction

* **Goal:** Identify and summarize the intents and search queries inside the original user query. This extracted intent is then used as part of the RAG indexing.
* **Dataset:** 1000 data points of production-level queries.

## Metrics

For classification tasks, we use the following metrics:

* **Recall:** The proportion of actual positive cases correctly identified.
  * **Formula:** True Positives / (True Positives + False Negatives)
* **Precision:** The proportion of predicted positive cases that are actually positive.
  * **Formula:** True Positives / (True Positives + False Positives)
* **F1 Score:** The harmonic mean of precision and recall, providing a single score that balances both metrics.
  * **Formula:** 2 \* (Precision \* Recall) / (Precision + Recall)
* **Accuracy:** The proportion of correct predictions (both true positives and true negatives) among the total number of cases examined.
  * **Formula:** (True Positives + True Negatives) / Total Predictions
* **False Negative Rate:** The proportion of actual positive cases incorrectly identified as negative.
  * **Formula:** False Negatives / (True Positives + False Negatives)
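The five formulas above follow directly from confusion-matrix counts. As a minimal illustration, the sketch below computes all of them from hypothetical true/false positive and negative counts (the counts are illustrative only, not benchmark data):

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the classification metrics defined above from
    confusion-matrix counts (true/false positives and negatives)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "recall": recall,
        "precision": precision,
        # Harmonic mean of precision and recall
        "f1": 2 * precision * recall / (precision + recall),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "false_negative_rate": fn / (tp + fn),
    }

# Hypothetical counts for illustration:
metrics = classification_metrics(tp=90, fp=10, tn=85, fn=15)
print(round(metrics["recall"], 3))  # 0.857
```

Note that the false negative rate is simply one minus recall, which is why the two columns in the binary-classification tables below mirror each other.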

For text generation tasks (such as summarization, translation, or question answering), we use the following metric:

* **Judge Score:** This metric compares the quality of a generated response to a ground truth response. The process is as follows:
  1. An intelligent model, such as GPT-4, evaluates both the generated response and the ground truth response.
  2. The LLM assigns a score between 0 and 100 to each response, based on its assessment of the response's quality.
  3. The final Judge Score is the ratio of the generated response's score to the ground truth response's score, averaged across the dataset. A score above 1.0 means the judge rated the generated responses higher than the ground truth on average.
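The aggregation in step 3 can be sketched as below. The `judge_score` helper and the score pairs are hypothetical stand-ins, not part of Aisera's actual evaluation pipeline; obtaining the per-response scores from the judge model is assumed to happen upstream.

```python
def judge_score(pairs):
    """Average the ratio of generated-response score to ground-truth score.

    `pairs` is a list of (generated_score, ground_truth_score) tuples,
    each score in [0, 100] as assigned by the judge model.
    """
    ratios = [gen / gt for gen, gt in pairs]
    return sum(ratios) / len(ratios)

# Hypothetical judge scores: a result > 1.0 means the generated responses
# were rated higher than the ground truth on average.
print(judge_score([(90, 85), (70, 80), (95, 90)]))
```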

## Performance Results

The results of the benchmark testing are shown in the following tables.

### Ambiguous Actionable Classification

<table><thead><tr><th width="154">Model</th><th width="91">Recall</th><th width="108">Precision</th><th width="111">F1 Score</th><th width="120">Accuracy</th><th>False Negative Rate</th></tr></thead><tbody><tr><td>Aisera 70B</td><td>0.896</td><td>0.922</td><td>0.909</td><td>0.850</td><td>0.104</td></tr><tr><td>Llama 3 70B</td><td><strong>0.966</strong></td><td>0.898</td><td><strong>0.931</strong></td><td><strong>0.880</strong></td><td><strong>0.034</strong></td></tr><tr><td>Llama 3 8B</td><td>0.647</td><td>0.961</td><td>0.774</td><td>0.690</td><td>0.353</td></tr><tr><td>GPT-4</td><td>0.872</td><td>0.973</td><td>0.920</td><td>0.873</td><td>0.128</td></tr><tr><td>GPT-4 Turbo</td><td>0.834</td><td><strong>0.997</strong></td><td>0.897</td><td>0.840</td><td>0.166</td></tr></tbody></table>

### Answer Detection Classification

<table><thead><tr><th width="181">Model</th><th width="82">Recall</th><th width="140">Precision</th><th width="150">F1 Score</th><th width="93">Accuracy</th><th>False Negative Rate</th></tr></thead><tbody><tr><td>Aisera Finetuned 8B</td><td><strong>0.938</strong></td><td>0.946</td><td><strong>0.942</strong></td><td><strong>0.942</strong></td><td><strong>0.062</strong></td></tr><tr><td>Llama 3 70B</td><td>0.932</td><td>0.891</td><td>0.911</td><td>0.909</td><td>0.068</td></tr><tr><td>Llama 3 8B</td><td>0.912</td><td>0.774</td><td>0.838</td><td>0.823</td><td>0.088</td></tr><tr><td>GPT-4</td><td>0.880</td><td><strong>0.969</strong></td><td>0.922</td><td>0.926</td><td>0.120</td></tr></tbody></table>

### Casual Gibberish Classification

<table><thead><tr><th width="155">Model</th><th width="112">Recall</th><th width="116">Precision</th><th width="109">F1 Score</th><th width="126">Accuracy</th><th>False Negative Rate</th></tr></thead><tbody><tr><td>Aisera 70B</td><td>0.93</td><td>0.9396</td><td>0.923</td><td>0.93</td><td><strong>0</strong></td></tr><tr><td>Llama 3 70B</td><td>0.92</td><td>0.9327</td><td>0.9249</td><td>0.92</td><td>0.02</td></tr><tr><td>Llama 3 8B</td><td>0.78</td><td>0.8835</td><td>0.8188</td><td>0.78</td><td>0.09</td></tr><tr><td>GPT-4</td><td><strong>0.96</strong></td><td><strong>0.9623</strong></td><td><strong>0.9549</strong></td><td><strong>0.96</strong></td><td><strong>0</strong></td></tr><tr><td>GPT-4 Turbo</td><td>0.9</td><td>0.918</td><td>0.9036</td><td>0.9</td><td>0.02</td></tr></tbody></table>

### Intent Extraction

<table><thead><tr><th width="179">Model</th><th width="145">Judge Score</th></tr></thead><tbody><tr><td>Aisera 70B</td><td>0.9626</td></tr><tr><td>Llama 3 70B</td><td>0.9571</td></tr><tr><td>Llama 3 8B</td><td>0.8986</td></tr><tr><td>GPT-4</td><td>1.0968</td></tr><tr><td>GPT-4 Turbo</td><td><strong>1.1134</strong></td></tr></tbody></table>

## Efficiency Results

The following tables show results for important latency and throughput metrics, such as the time per request, the time until the first token is generated, and the number of tokens generated per second. These metrics are reported for three prominent task regimes with varying input and output lengths.
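These three metrics can be derived from a single timed pass over a streaming response. As a rough sketch, the helper below times an arbitrary token iterator; the `fake_stream` generator is a purely illustrative stand-in for a real streaming model endpoint.

```python
import time

def measure_stream(token_stream):
    """Measure time per request, time to first token, and tokens per
    second for an iterable that yields tokens as they are generated."""
    start = time.perf_counter()
    first_token_time = None
    n_tokens = 0
    for _ in token_stream:
        if first_token_time is None:
            # Latency until the model emits its first token
            first_token_time = time.perf_counter() - start
        n_tokens += 1
    total = time.perf_counter() - start
    return {
        "time_per_request_s": total,
        "time_to_first_token_s": first_token_time,
        "tokens_per_second": n_tokens / total,
    }

# Simulated stream (hypothetical): 20 tokens, ~5 ms apart.
def fake_stream(n=20, delay=0.005):
    for _ in range(n):
        time.sleep(delay)
        yield "tok"

stats = measure_stream(fake_stream())
```

In a real benchmark, the same measurement would be repeated at each parallelism level, with concurrent requests issued against the endpoint and the per-request statistics averaged.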

### Short Input & Long Output

e.g. Casual Gibberish Classification

<table><thead><tr><th>Provider</th><th width="111" data-type="number">Parallelism</th><th data-type="number">Time per Request (s)</th><th data-type="number">Time to First Token (s)</th><th data-type="number">Tokens per Second</th></tr></thead><tbody><tr><td>Aisera 70B</td><td>1</td><td>0.69</td><td>0.36</td><td>33.7</td></tr><tr><td></td><td>4</td><td>0.92</td><td>0.49</td><td>25.6</td></tr><tr><td></td><td>16</td><td>1.59</td><td>0.89</td><td>15.8</td></tr><tr><td>Azure AI GPT-4</td><td>1</td><td>0.98</td><td>0.49</td><td>24.9</td></tr><tr><td></td><td>4</td><td>1.28</td><td>0.55</td><td>14.7</td></tr><tr><td></td><td>16</td><td>4.85</td><td>0.81</td><td>2.9</td></tr></tbody></table>

### Long Input & Short Output

e.g. Document Relevance Checking

<table><thead><tr><th width="127">Provider</th><th width="128" data-type="number">Parallelism</th><th>Time per Request (s)</th><th data-type="number">Time to First Token (s)</th><th data-type="number">Tokens per Second</th></tr></thead><tbody><tr><td>Aisera 70B</td><td>1</td><td>0.56</td><td>0.35</td><td>33.2</td></tr><tr><td></td><td>4</td><td>0.69</td><td>0.44</td><td>27</td></tr><tr><td></td><td>16</td><td>1.08</td><td>0.72</td><td>17.6</td></tr><tr><td>Azure AI GPT-4</td><td>1</td><td>0.78</td><td>0.48</td><td>31.4</td></tr><tr><td></td><td>4</td><td>1.49</td><td>0.64</td><td>9.3</td></tr><tr><td></td><td>16</td><td>2.41</td><td>0.91</td><td>5.9</td></tr></tbody></table>

### Long Input & Long Output

e.g. Retrieval Augmented Generation

<table><thead><tr><th width="129">Provider</th><th data-type="number">Parallelism</th><th data-type="number">Time per Request (s)</th><th data-type="number">Time to First Token (s)</th><th data-type="number">Tokens per Second</th></tr></thead><tbody><tr><td>Aisera 70B</td><td>1</td><td>0.945</td><td>0.411</td><td>36.3</td></tr><tr><td></td><td>4</td><td>1.27</td><td>0.511</td><td>27.4</td></tr><tr><td></td><td>16</td><td>2.75</td><td>1.159</td><td>13.8</td></tr><tr><td>Azure AI GPT-4</td><td>1</td><td>1.39</td><td>0.51</td><td>22.68</td></tr><tr><td></td><td>4</td><td>3.89</td><td>0.72</td><td>6.99</td></tr><tr><td></td><td>16</td><td>20.84</td><td>8.12</td><td>1.46</td></tr></tbody></table>

## Guardrails and Security

We focus heavily on red-teaming and guardrailing efforts, which in some cases involve tradeoffs between security and performance. The table below compares detection rates (in percent) for various attack categories, demonstrating the security strength of Aisera models by domain.

Aisera offers more than 28 domain-specific LLMs, each tailored to meet the unique needs of various industries and applications. Future evaluations will encompass additional domains.

<table><thead><tr><th width="94">Provider</th><th>Domain</th><th data-type="number">Code Injection</th><th data-type="number">Toxicity/Sexism</th><th data-type="number">Role-based Attack</th><th data-type="number">Instruction Jailbreaking</th></tr></thead><tbody><tr><td>Aisera 70B</td><td>Finance</td><td>99.5</td><td>96.9</td><td>85.5</td><td>93.1</td></tr><tr><td></td><td>Legal &#x26; Compliance</td><td>98.1</td><td>99.2</td><td>75</td><td>89.4</td></tr><tr><td>AzureAI GPT-4</td><td>Finance</td><td>97.3</td><td>96</td><td>72</td><td>91.8</td></tr><tr><td></td><td>Legal &#x26; Compliance</td><td>98.9</td><td>97.1</td><td>78</td><td>95.5</td></tr></tbody></table>

## Overall Ratings

We use the following overall rating scales:

* Cost: 1 (lowest cost) to 5 (highest cost)
* Latency: 1 (highest latency) to 5 (lowest latency)
* Accuracy: 1 (low accuracy) to 5 (very high accuracy)
* Security: 1 (very vulnerable to attacks) to 5 (nearly zero complaints due to attacks)
* Stability: 1 (highly variable responses) to 5 (always returns the same answer)

<table><thead><tr><th>Provider</th><th data-type="number">Cost</th><th data-type="number">Latency</th><th data-type="number">Accuracy</th><th data-type="number">Security</th><th data-type="number">Stability</th></tr></thead><tbody><tr><td>Aisera 70B</td><td>3</td><td>4</td><td>4</td><td>5</td><td>5</td></tr><tr><td>Azure AI GPT-4</td><td>5</td><td>2</td><td>4</td><td>4</td><td>4</td></tr></tbody></table>
