Aisera LLM Benchmarking

Aisera's Large Language Model (LLM) benchmarking process is designed to evaluate the performance and capabilities of our AI-driven service automation solutions. This section outlines our methodology for benchmarking, the key performance metrics used, and the comparative analysis with industry standards.

Our benchmarking process involves systematic testing across several tasks, including Answer Detection Classification, Ambiguous Actionable Classification, Casual Gibberish Classification, and Intent Extraction. By employing standardized benchmarks and rigorous evaluation techniques, we ensure that our LLMs meet high standards of performance and reliability.

This section provides detailed insights into the specific metrics used in our evaluations, the results of these tests, and how Aisera's LLMs compare to other leading models in the industry. The goal is to transparently showcase the strengths and capabilities of Aisera's LLMs in delivering efficient and accurate AI-driven solutions.

Tasks

The datasets used for the task evaluations were sampled from our Retrieval Augmented Generation (RAG) system in the finance and legal & compliance domains.

Answer Detection Classification

  • Goal: Determine whether the user's question is explicitly answered in the document.

  • Dataset: 1000 data points of production-level queries and document chunks.

Ambiguous Actionable Classification

  • Goal: Determine whether a user query contains a clear request that can be directly answered. Ambiguous queries trigger the system to request clarification from the user.

  • Dataset: 1000 data points of production-level queries.

Casual Gibberish Classification

  • Goal: Determine whether a user query is actually relevant to the company the agent is serving. This includes safeguarding against attacks, filtering out casual chat, and identifying potentially harmful content.

  • Dataset: 700 data points of production-level queries and 300 safety-related queries.
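
The three classification tasks above share the same shape: the model assigns a label to each query (or query and document chunk), and that label drives downstream routing. The sketch below is illustrative only; the `UserQuery` type, the task names, and the `classify` interface are assumptions for this example, not Aisera's production pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class UserQuery:
    text: str
    document_chunk: Optional[str] = None  # present for Answer Detection Classification

# Hypothetical classifier interface: given a task name and a query, return the
# positive-class decision. In practice this would wrap an LLM prompt per task.
Classifier = Callable[[str, UserQuery], bool]

def route(query: UserQuery, classify: Classifier) -> str:
    """Illustrative routing built from the task definitions above."""
    # Off-topic, casual, or potentially harmful input is filtered out first.
    if classify("casual_gibberish", query):
        return "deflect"
    # Queries without a clear, actionable request trigger a clarification turn.
    if not classify("actionable", query):
        return "ask_clarification"
    # For a retrieved document chunk, check whether the answer is explicitly present.
    if query.document_chunk is not None and not classify("answer_detection", query):
        return "answer_not_in_document"
    return "answer"
```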

Intent Extraction

  • Goal: Identify and summarize the intents and search queries inside the original user query. This extracted intent is then used as part of the RAG indexing.

  • Dataset: 1000 data points of production-level queries.
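
Intent Extraction produces structured output rather than a binary label. The sketch below shows the kind of output such a step might return; the `ExtractedIntent` fields and example values are assumptions for illustration, not Aisera's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ExtractedIntent:
    """Hypothetical container for the output of the Intent Extraction task."""
    intents: List[str] = field(default_factory=list)         # summarized user intents
    search_queries: List[str] = field(default_factory=list)  # queries fed into RAG retrieval

# Example: a single user message may carry multiple intents and search queries.
example = ExtractedIntent(
    intents=["reset expense-report password", "check reimbursement status"],
    search_queries=["expense report password reset", "reimbursement status lookup"],
)
```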

Metrics

For classification tasks, we use the following metrics (a short computation sketch follows the list):

  • Recall: The proportion of actual positive cases correctly identified.

    • Formula: True Positives / (True Positives + False Negatives)

  • Precision: The proportion of predicted positive cases that are actually positive.

    • Formula: True Positives / (True Positives + False Positives)

  • F1 Score: The harmonic mean of precision and recall, providing a single score that balances both metrics.

    • Formula: 2 * (Precision * Recall) / (Precision + Recall)

  • Accuracy: The proportion of correct predictions (both true positives and true negatives) among the total number of cases examined.

    • Formula: (True Positives + True Negatives) / Total Predictions

  • False Negative Rate: The proportion of actual positive cases incorrectly identified as negative.

    • Formula: False Negatives / (True Positives + False Negatives)
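
As a concrete reference, the formulas above can be computed directly from confusion-matrix counts. A minimal sketch, with purely illustrative counts that are not drawn from the benchmark datasets:

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute the metrics defined above from confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    false_negative_rate = fn / (tp + fn)
    return {
        "recall": recall,
        "precision": precision,
        "f1": f1,
        "accuracy": accuracy,
        "false_negative_rate": false_negative_rate,
    }

# Example with illustrative counts:
print(classification_metrics(tp=90, tn=80, fp=10, fn=20))
# {'recall': 0.818..., 'precision': 0.9, 'f1': 0.857..., 'accuracy': 0.85, 'false_negative_rate': 0.181...}
```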

For text generation tasks (such as summarization, translation, or question answering), we use the following metric:

  • Judge Score: This metric compares the quality of a generated response to a ground truth response. The process is as follows:

    1. An intelligent model, such as GPT-4, evaluates both the generated response and the ground truth response.

    2. The LLM assigns a score between 0 and 100 to each response, based on its assessment of the response's quality.

    3. The final Judge Score is the generated response's score divided by the ground truth response's score, with this ratio averaged across the dataset (see the sketch below).
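
A minimal sketch of this computation, assuming a hypothetical `judge()` callable that returns the evaluator model's 0-100 quality score for a single response:

```python
from statistics import mean
from typing import Callable, List, Tuple

def judge_score(
    pairs: List[Tuple[str, str]],    # (generated_response, ground_truth_response)
    judge: Callable[[str], float],   # hypothetical: 0-100 quality score from the evaluator LLM
) -> float:
    """Average ratio of the generated response's score to the ground truth's score."""
    ratios = []
    for generated, ground_truth in pairs:
        ratios.append(judge(generated) / judge(ground_truth))
    return mean(ratios)
```

Under this reading, a Judge Score above 1 means the evaluator rated the generated responses higher than the ground truth responses on average.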

Performance Results

The results of the benchmark testing are shown in the following tables.

Ambiguous Actionable Classification

| Model | Recall | Precision | F1 Score | Accuracy | False Negative Rate |
| --- | --- | --- | --- | --- | --- |
| Aisera 70B | 0.896 | 0.922 | 0.909 | 0.850 | 0.104 |
| Llama 3 70B | 0.966 | 0.898 | 0.931 | 0.880 | 0.034 |
| Llama 3 8B | 0.647 | 0.961 | 0.774 | 0.690 | 0.353 |
| GPT-4 | 0.872 | 0.973 | 0.920 | 0.873 | 0.128 |
| GPT-4 Turbo | 0.834 | 0.997 | 0.897 | 0.840 | 0.166 |

Answer Detection Classification

| Model | Recall | Precision | F1 Score | Accuracy | False Negative Rate |
| --- | --- | --- | --- | --- | --- |
| Aisera Finetuned 8B | 0.938 | 0.946 | 0.942 | 0.942 | 0.062 |
| Llama 3 70B | 0.932 | 0.891 | 0.911 | 0.909 | 0.068 |
| Llama 3 8B | 0.912 | 0.774 | 0.838 | 0.823 | 0.088 |
| GPT-4 | 0.880 | 0.969 | 0.922 | 0.926 | 0.120 |

Casual Gibberish Classification

| Model | Recall | Precision | F1 Score | Accuracy | False Negative Rate |
| --- | --- | --- | --- | --- | --- |
| Aisera 70B | 0.93 | 0.9396 | 0.923 | 0.93 | 0.00 |
| Llama 3 70B | 0.92 | 0.9327 | 0.9249 | 0.92 | 0.02 |
| Llama 3 8B | 0.78 | 0.8835 | 0.8188 | 0.78 | 0.09 |
| GPT-4 | 0.96 | 0.9623 | 0.9549 | 0.96 | 0.00 |
| GPT-4 Turbo | 0.90 | 0.918 | 0.9036 | 0.90 | 0.02 |

Intent Extraction

| Model | Judge Score |
| --- | --- |
| Aisera 70B | 0.9626 |
| Llama 3 70B | 0.9571 |
| Llama 3 8B | 0.8986 |
| GPT-4 | 1.0968 |
| GPT-4 Turbo | 1.1134 |

Efficiency Results

The following tables show results for important latency and throughput metrics, such as the time per request, the time until the first token is generated, and the number of tokens generated per second. These metrics are reported for three prominent task regimes with varying input and output lengths.
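
For reference, time per request, time to first token, and tokens per second can be measured per request as in the sketch below. It assumes a hypothetical `stream_completion()` generator that yields output tokens from the model endpoint, not any specific provider SDK; parallelism in the tables is read here as the number of such requests issued concurrently.

```python
import time
from typing import Callable, Iterable

def measure_request(stream_completion: Callable[[str], Iterable[str]], prompt: str) -> dict:
    """Measure per-request latency and throughput for a streaming completion.

    `stream_completion` is a hypothetical generator that yields output tokens
    as they are produced by the model endpoint.
    """
    start = time.perf_counter()
    time_to_first_token = None
    token_count = 0

    for _token in stream_completion(prompt):
        if time_to_first_token is None:
            time_to_first_token = time.perf_counter() - start
        token_count += 1

    total_time = time.perf_counter() - start
    return {
        "time_per_request_s": total_time,
        "time_to_first_token_s": time_to_first_token,
        "tokens_per_second": token_count / total_time if total_time > 0 else 0.0,
    }
```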

Short Input & Long Output

e.g. Casual Gibberish Classification

| Provider | Parallelism | Time per Request (s) | Time to First Token (s) | Tokens per Second |
| --- | --- | --- | --- | --- |
| Aisera 70B | 1 | 0.69 | 0.36 | 33.7 |
| Aisera 70B | 4 | 0.92 | 0.49 | 25.6 |
| Aisera 70B | 16 | 1.59 | 0.89 | 15.8 |
| Azure AI GPT-4 | 1 | 0.98 | 0.49 | 24.9 |
| Azure AI GPT-4 | 4 | 1.28 | 0.55 | 14.7 |
| Azure AI GPT-4 | 16 | 4.85 | 0.81 | 2.9 |

Long Input & Short Output

e.g. Document Relevance Checking

| Provider | Parallelism | Time per Request (s) | Time to First Token (s) | Tokens per Second |
| --- | --- | --- | --- | --- |
| Aisera 70B | 1 | 0.56 | 0.35 | 33.2 |
| Aisera 70B | 4 | 0.69 | 0.44 | 27.0 |
| Aisera 70B | 16 | 1.08 | 0.72 | 17.6 |
| Azure AI GPT-4 | 1 | 0.78 | 0.48 | 31.4 |
| Azure AI GPT-4 | 4 | 1.49 | 0.64 | 9.3 |
| Azure AI GPT-4 | 16 | 2.41 | 0.91 | 5.9 |

Long Input & Long Output

e.g. Retrieval Augmented Generation

| Provider | Parallelism | Time per Request (s) | Time to First Token (s) | Tokens per Second |
| --- | --- | --- | --- | --- |
| Aisera 70B | 1 | 0.945 | 0.411 | 36.3 |
| Aisera 70B | 4 | 1.27 | 0.511 | 27.4 |
| Aisera 70B | 16 | 2.75 | 1.159 | 13.8 |
| Azure AI GPT-4 | 1 | 1.39 | 0.51 | 22.68 |
| Azure AI GPT-4 | 4 | 3.89 | 0.72 | 6.99 |
| Azure AI GPT-4 | 16 | 20.84 | 8.12 | 1.46 |

Guardrails and Security

We invest heavily in red-teaming and guardrailing efforts, which in some cases result in tradeoffs between security and performance. The table below compares detection rates (in percentages) for various risks, demonstrating the security strength of Aisera models by domain.

Aisera offers more than 28 domain-specific LLMs, each tailored to the needs of specific industries and applications. Future evaluations will cover additional domains.
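
Detection rate is read here as the percentage of attack prompts in a category that the model flags or refuses. A minimal tally sketch, assuming a hypothetical `is_detected()` predicate over the model's response:

```python
from typing import Callable, Dict, List

def detection_rates(
    attack_prompts: Dict[str, List[str]],  # risk category -> list of red-team prompts
    is_detected: Callable[[str], bool],    # hypothetical: True if the model flags/refuses the prompt
) -> Dict[str, float]:
    """Percentage of attack prompts detected per risk category."""
    return {
        category: 100.0 * sum(is_detected(prompt) for prompt in prompts) / len(prompts)
        for category, prompts in attack_prompts.items()
    }
```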

| Provider | Domain | Code Injection | Toxicity/Sexism | Role-based Attack | Instruction Jailbreaking |
| --- | --- | --- | --- | --- | --- |
| Aisera 70B | Finance | 99.5 | 96.9 | 85.5 | 93.1 |
| Aisera 70B | Legal & Compliance | 98.1 | 99.2 | 75.0 | 89.4 |
| Azure AI GPT-4 | Finance | 97.3 | 96.0 | 72.0 | 91.8 |
| Azure AI GPT-4 | Legal & Compliance | 98.9 | 97.1 | 78.0 | 95.5 |

Overall Ratings

We use the following overall rating scales:

  • Cost: 1 (lowest cost) to 5 (highest cost)

  • Latency: 1 (high latency) to 5 (lowest latency)

  • Accuracy: 1 (low accuracy) to 5 (very high accuracy)

  • Security: 1 (very vulnerable to attacks) to 5 (nearly zero successful attacks)

  • Stability: 1 (highly variable responses) to 5 (always returns the same answer)

| Provider | Cost | Latency | Accuracy | Security | Stability |
| --- | --- | --- | --- | --- | --- |
| Aisera 70B | 3 | 4 | 4 | 5 | 5 |
| Azure AI GPT-4 | 5 | 2 | 4 | 4 | 4 |
