Aisera LLM Benchmarking
Aisera's Large Language Model (LLM) benchmarking process is designed to evaluate the performance and capabilities of our AI-driven service automation solutions. This section outlines our methodology for benchmarking, the key performance metrics used, and the comparative analysis with industry standards.
Our benchmarking process involves systematic testing across various parameters, including Answer Detection Classification, Ambiguous Actionable Classification, Casual Gibberish Classification, and Intent Extraction. By employing standardized benchmarks and rigorous evaluation techniques, we ensure that our LLMs meet high standards of performance and reliability.
This section provides detailed insights into the specific metrics used in our evaluations, the results of these tests, and how Aisera's LLMs compare to other leading models in the industry. The goal is to transparently showcase the strengths and capabilities of Aisera's LLMs in delivering efficient and accurate AI-driven solutions.
Tasks
The datasets used for the task evaluations were sampled from our Retrieval Augmented Generation (RAG) system in the financial and legal & compliance domains.
Answer Detection Classification
Goal: Determine if the user's question is explicitly answered in the document or not.
Dataset: 1000 data points of production-level queries and document chunks.
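For illustration only, the sketch below shows how a binary task like Answer Detection Classification can be framed as an LLM prompt over a query and document chunk; the other classification tasks are framed analogously. The prompt wording, labels, and `call_llm` helper are assumptions for this example, not Aisera's production prompts.

```python
# Hypothetical framing of Answer Detection Classification as an LLM call.
# `call_llm` stands in for whatever chat-completion client is used; it is not a real API here.

ANSWER_DETECTION_PROMPT = """\
Question: {question}

Document chunk:
{chunk}

Is the question explicitly answered by the document chunk above?
Reply with exactly one label: ANSWERED or NOT_ANSWERED."""


def detect_answer(question: str, chunk: str, call_llm) -> bool:
    """Return True if the model labels the chunk as explicitly answering the question."""
    prompt = ANSWER_DETECTION_PROMPT.format(question=question, chunk=chunk)
    return call_llm(prompt).strip().upper() == "ANSWERED"
```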
Ambiguous Actionable Classification
Goal: Determine if a user query makes a clear request that can be answered directly. Ambiguous queries trigger the system to request clarification from the user.
Dataset: 1000 data points of production-level queries.
Casual Gibberish Classification
Goal: Determine if a user query is relevant to the company the agent is serving. This includes safeguarding against attacks, filtering out casual chat, and identifying potentially harmful content.
Dataset: 700 data points of production-level queries and 300 safety-related queries.
Intent Extraction
Goal: Identify and summarize the intents and search queries within the original user query. The extracted intents are then used as part of RAG indexing.
Dataset: 1000 data points of production-level queries.
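Similarly, here is a minimal sketch of how intent extraction could be framed as a generation task; the prompt text and the `call_llm` helper are illustrative assumptions (reused from the sketch above), not Aisera's pipeline.

```python
# Hypothetical intent-extraction prompt; the output lines can then feed RAG indexing or retrieval.
INTENT_EXTRACTION_PROMPT = """\
Extract the user's intents from the message below and rewrite each one as a short,
self-contained search query. Return one query per line.

Message: {message}"""


def extract_intents(message: str, call_llm) -> list[str]:
    """Return the intents/search queries extracted from a user message."""
    raw = call_llm(INTENT_EXTRACTION_PROMPT.format(message=message))
    return [line.lstrip("- ").strip() for line in raw.splitlines() if line.strip()]
```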
Metrics
For classification tasks, we use the following metrics:
Recall: The proportion of actual positive cases correctly identified.
Formula: True Positives / (True Positives + False Negatives)
Precision: The proportion of predicted positive cases that are actually positive.
Formula: True Positives / (True Positives + False Positives)
F1 Score: The harmonic mean of precision and recall, providing a single score that balances both metrics.
Formula: 2 * (Precision * Recall) / (Precision + Recall)
Accuracy: The proportion of correct predictions (both true positives and true negatives) among the total number of cases examined.
Formula: (True Positives + True Negatives) / Total Predictions
False Negative Rate: The proportion of actual positive cases incorrectly identified as negative.
Formula: False Negatives / (True Positives + False Negatives)
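As a quick reference, the sketch below computes these metrics from confusion-matrix counts in Python; the counts in the example are hypothetical and not taken from the benchmark runs.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the classification metrics defined above from confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    false_negative_rate = fn / (tp + fn)  # equivalently, 1 - recall
    return {
        "recall": round(recall, 3),
        "precision": round(precision, 3),
        "f1": round(f1, 3),
        "accuracy": round(accuracy, 3),
        "false_negative_rate": round(false_negative_rate, 3),
    }


# Hypothetical counts for a 1000-point binary classification run.
print(classification_metrics(tp=450, fp=40, tn=460, fn=50))
# {'recall': 0.9, 'precision': 0.918, 'f1': 0.909, 'accuracy': 0.91, 'false_negative_rate': 0.1}
```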
For text generation tasks (such as summarization, translation, or question answering), we use the following metric:
Judge Score: This metric compares the quality of a generated response to a ground truth response. The process is as follows:
An intelligent model, such as GPT-4, evaluates both the generated response and the ground truth response.
The LLM assigns a score between 0 and 100 to each response, based on its assessment of the response's quality.
The final Judge Score is the generated response's score divided by the ground truth response's score, averaged across the dataset.
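The following is a minimal sketch of the Judge Score computation, assuming a hypothetical `judge(question, response)` helper that returns the 0–100 quality score from an LLM judge such as GPT-4; the helper and the dataset format are illustrative, not Aisera's implementation.

```python
from statistics import mean
from typing import Callable


def judge_score(
    dataset: list[dict],                 # each item: {"question", "generated", "ground_truth"}
    judge: Callable[[str, str], float],  # hypothetical LLM-judge call returning a 0-100 score
) -> float:
    """Average, over the dataset, of the generated response's judge score divided by the
    ground truth response's judge score."""
    ratios = []
    for item in dataset:
        generated_score = judge(item["question"], item["generated"])
        ground_truth_score = judge(item["question"], item["ground_truth"])
        if ground_truth_score > 0:
            ratios.append(generated_score / ground_truth_score)
    return mean(ratios)
```

A Judge Score above 1.0 simply means the judge scored the generated responses higher than the ground truth responses on average, as seen for GPT-4 and GPT-4 Turbo in the Intent Extraction results below.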
Performance Results
The results of the benchmark testing are shown in the following tables.
Ambiguous Actionable Classification
| Model | Recall | Precision | F1 Score | Accuracy | False Negative Rate |
| --- | --- | --- | --- | --- | --- |
| Aisera 70B | 0.896 | 0.922 | 0.909 | 0.850 | 0.104 |
| Llama 3 70B | 0.966 | 0.898 | 0.931 | 0.880 | 0.034 |
| Llama 3 8B | 0.647 | 0.961 | 0.774 | 0.690 | 0.353 |
| GPT-4 | 0.872 | 0.973 | 0.920 | 0.873 | 0.128 |
| GPT-4 Turbo | 0.834 | 0.997 | 0.897 | 0.840 | 0.166 |
Answer Detection Classification
| Model | Recall | Precision | F1 Score | Accuracy | False Negative Rate |
| --- | --- | --- | --- | --- | --- |
| Aisera Finetuned 8B | 0.938 | 0.946 | 0.942 | 0.942 | 0.062 |
| Llama 3 70B | 0.932 | 0.891 | 0.911 | 0.909 | 0.068 |
| Llama 3 8B | 0.912 | 0.774 | 0.838 | 0.823 | 0.088 |
| GPT-4 | 0.88 | 0.969 | 0.922 | 0.926 | 0.12 |
Casual Gibberish Classification
| Model | Recall | Precision | F1 Score | Accuracy | False Negative Rate |
| --- | --- | --- | --- | --- | --- |
| Aisera 70B | 0.93 | 0.9396 | 0.923 | 0.93 | 0 |
| Llama 3 70B | 0.92 | 0.9327 | 0.9249 | 0.92 | 0.02 |
| Llama 3 8B | 0.78 | 0.8835 | 0.8188 | 0.78 | 0.09 |
| GPT-4 | 0.96 | 0.9623 | 0.9549 | 0.96 | 0 |
| GPT-4 Turbo | 0.9 | 0.918 | 0.9036 | 0.9 | 0.02 |
Intent Extraction
| Model | Judge Score |
| --- | --- |
| Aisera 70B | 0.9626 |
| Llama 3 70B | 0.9571 |
| Llama 3 8B | 0.8986 |
| GPT-4 | 1.0968 |
| GPT-4 Turbo | 1.1134 |
Efficiency Results
The following tables show results for important latency and throughput metrics, such as the time per request, the time until the first token is generated, and the number of tokens generated per second. These metrics are reported for three prominent task regimes with varying input and output lengths.
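The following is a minimal sketch of how these metrics can be measured for a single streamed request, assuming a hypothetical `stream_tokens(prompt)` generator that yields output tokens as they arrive; this is illustrative and not the harness used to produce the numbers below.

```python
import time


def measure_request(prompt: str, stream_tokens) -> dict:
    """Measure time to first token, total request time, and output tokens per second
    for one streamed generation. `stream_tokens` is a hypothetical streaming client."""
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in stream_tokens(prompt):  # tokens arrive as the model streams them
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_tokens += 1
    total = time.perf_counter() - start
    return {
        "time_to_first_token_s": (first_token_at - start) if first_token_at else None,
        "time_per_request_s": total,
        "tokens_per_second": n_tokens / total if total > 0 else 0.0,
    }
```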
Short Input & Long Output
e.g. Casual Gibberish Classification
| Model | Time to First Token (s) | Time per Request (s) | Tokens per Second |
| --- | --- | --- | --- |
| Aisera 70B | | | |
| Azure AI GPT-4 | | | |
Long Input & Short Output
e.g. Document Relevance Checking
| Model | Time to First Token (s) | Time per Request (s) | Tokens per Second |
| --- | --- | --- | --- |
| Aisera 70B | 0.56 | 0.69 | 1.08 |
| Azure AI GPT-4 | 0.78 | 1.49 | 2.41 |
Long Input & Long Output
e.g. Retrieval Augmented Generation
| Model | Time to First Token (s) | Time per Request (s) | Tokens per Second |
| --- | --- | --- | --- |
| Aisera 70B | | | |
| Azure AI GPT-4 | | | |
Guardrails and Security
We focus heavily on red-teaming and guardrailing efforts, which in some cases result in tradeoffs between security and performance. The table below compares detection rates (in percentages) for various risks, demonstrating the security strength of Aisera models by domain.
Aisera offers more than 28 domain-specific LLMs, each tailored to meet the unique needs of various industries and applications. Future evaluations will encompass additional domains.
| Model | Domain | Detection Rate (%) |
| --- | --- | --- |
| Aisera 70B | Finance | |
| Aisera 70B | Legal & Compliance | |
| Azure AI GPT-4 | Finance | |
| Azure AI GPT-4 | Legal & Compliance | |
Overall Ratings
We use the following overall rating scales:
Cost: 1 (lowest cost) to 5 (highest cost)
Latency: 1 (highest latency) to 5 (lowest latency)
Accuracy: 1 (low accuracy) to 5 (very high accuracy)
Security: 1 (very vulnerable to attacks) to 5 (nearly zero incidents from attacks)
Stability: 1 (highly variable responses) to 5 (always returns the same answer)
| Model | Cost | Latency | Accuracy | Security | Stability |
| --- | --- | --- | --- | --- | --- |
| Aisera 70B | | | | | |
| Azure AI GPT-4 | | | | | |