Aisera LLM Benchmarking
Tasks
Answer Detection Classification
Ambiguous Actionable Classification
Casual Gibberish Classification
Intent Extraction
Metrics
Performance Results
Ambiguous Actionable Classification
Model  Recall  Precision  F1 Score  Accuracy  False Negative Rate
Answer Detection Classification
Model  Recall  Precision  F1 Score  Accuracy  False Negative Rate
Casual Gibberish Classification
Model  Recall  Precision  F1 Score  Accuracy  False Negative Rate
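The page does not state how the five classification metrics are computed. As a minimal sketch, assuming the standard binary-classification definitions based on true/false positives and negatives, they could be derived as follows. The names `BinaryMetrics` and `binary_metrics` are hypothetical and are not taken from the benchmark itself.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class BinaryMetrics:
    recall: float
    precision: float
    f1: float
    accuracy: float
    false_negative_rate: float


def _safe_div(numerator: float, denominator: float) -> float:
    # Return 0.0 instead of raising when a denominator is empty.
    return numerator / denominator if denominator else 0.0


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> BinaryMetrics:
    """Compute the metrics named in the tables from 0/1 labels (assumed definitions)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    recall = _safe_div(tp, tp + fn)
    precision = _safe_div(tp, tp + fp)
    f1 = _safe_div(2 * precision * recall, precision + recall)
    accuracy = _safe_div(tp + tn, tp + tn + fp + fn)
    # False negative rate is the complement of recall: FN / (FN + TP).
    false_negative_rate = _safe_div(fn, fn + tp)

    return BinaryMetrics(recall, precision, f1, accuracy, false_negative_rate)
```

Under these definitions, the false negative rate and recall always sum to 1 whenever at least one positive example exists, so the two columns carry the same information from opposite directions.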
Intent Extraction
Model  Judge Score
Efficiency Results
Short Input & Long Output
Provider     Parallelism  Time per Request (s)  Time to First Token (s)  Tokens per Second
(not shown)  1            0.69                  0.36                     33.7
(not shown)  4            0.92                  0.49                     25.6
(not shown)  16           1.59                  0.89                     15.8

(not shown)  1            0.98                  0.49                     24.9
(not shown)  4            1.28                  0.55                     14.7
(not shown)  16           4.85                  0.81                     2.9
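To read how performance changes with load, it can help to express each row relative to the parallelism-1 baseline. The sketch below does this for the first group of three rows in the table above, using only the values reported there. The function name `relative_to_baseline` and the ratio names are illustrative choices, not part of the benchmark.

```python
# Rows copied from the first group of the Short Input & Long Output table:
# (parallelism, time per request in s, time to first token in s, tokens per second)
ROWS = [
    (1, 0.69, 0.36, 33.7),
    (4, 0.92, 0.49, 25.6),
    (16, 1.59, 0.89, 15.8),
]


def relative_to_baseline(rows):
    """Express latency and throughput of each row as a ratio of the first row."""
    _, base_time, _, base_tps = rows[0]
    result = {}
    for parallelism, time_per_request, _, tps in rows:
        result[parallelism] = {
            # > 1.0 means each request takes longer than at the baseline.
            "latency_ratio": time_per_request / base_time,
            # < 1.0 means fewer tokens per second than at the baseline.
            "throughput_ratio": tps / base_tps,
        }
    return result


if __name__ == "__main__":
    for parallelism, ratios in relative_to_baseline(ROWS).items():
        print(
            f"parallelism={parallelism:>2}  "
            f"latency x{ratios['latency_ratio']:.2f}  "
            f"throughput x{ratios['throughput_ratio']:.2f}"
        )
```

The same function can be applied to the second group of rows, or to the rows of the other scenarios, to compare how steeply each one degrades as parallelism grows.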
Long Input & Short Output
Provider
Parallelism
Time per Request (s)
Time to First Token (s)
Tokens per Second
1
0.35
33.2
4
0.44
27
16
0.72
17.6
1
0.48
31.4
4
0.64
9.3
16
0.91
5.9
Long Input & Long Output
Provider     Parallelism  Time per Request (s)  Time to First Token (s)  Tokens per Second
(not shown)  1            0.945                 0.411                    36.3
(not shown)  4            1.27                  0.511                    27.4
(not shown)  16           2.75                  1.159                    13.8

(not shown)  1            1.39                  0.51                     22.68
(not shown)  4            3.89                  0.72                     6.99
(not shown)  16           20.84                 8.12                     1.46
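The efficiency tables report three quantities per request, but the page does not describe how they were measured. The sketch below shows one common way to derive them from the arrival times of streamed tokens. It rests on assumptions: that time per request runs from request start to the last token, that time to first token runs from request start to the first token, and that tokens per second divides the token count by the full request time. The benchmark may define any of these differently, and `summarize_request` is a hypothetical helper.

```python
from typing import Dict, Sequence


def summarize_request(request_start: float, token_times: Sequence[float]) -> Dict[str, float]:
    """Derive per-request latency figures from token arrival timestamps (seconds).

    request_start: clock reading when the request was sent.
    token_times:   clock readings, in arrival order, one per streamed token.
    """
    if not token_times:
        raise ValueError("at least one token timestamp is required")

    time_per_request = token_times[-1] - request_start
    time_to_first_token = token_times[0] - request_start
    if time_per_request <= 0:
        raise ValueError("token timestamps must come after request_start")

    # Assumed definition: tokens generated divided by total request time.
    tokens_per_second = len(token_times) / time_per_request

    return {
        "time_per_request_s": time_per_request,
        "time_to_first_token_s": time_to_first_token,
        "tokens_per_second": tokens_per_second,
    }
```

In practice the timestamps would come from a monotonic clock such as `time.perf_counter()`, read once before sending the request and once for each received token.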
Guardrails and Security
Provider     Domain       Code Injection  Toxicity/Sexism  Role-based Attack  Instruction Jailbreaking
(not shown)  (not shown)  99.5            96.9             85.5               93.1
(not shown)  (not shown)  98.1            99.2             75                 89.4
(not shown)  (not shown)  97.3            96               72                 91.8
(not shown)  (not shown)  98.9            97.1             78                 95.5
Overall Ratings
Provider     Cost  Latency  Accuracy  Security  Stability
(not shown)  3     4        4         5         5
(not shown)  5     2        4         4         4
