Evaluation
Similar to LLMs, creating prompt optimizers is easy but evaluating them is not. Evaluating a prompt optimizer amounts to evaluating the LLM twice on the same prompts and task: once before optimization and once after, then comparing the results.
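The sketch below outlines this before/after protocol. It is a minimal illustration, not the project's actual harness: `ask_llm` and `is_correct` are hypothetical, caller-supplied hooks for the model call and the per-sample grading logic, and `tiktoken` is used only to count prompt tokens.

```python
# Minimal sketch of the before/after protocol (not the project's actual harness).
# `ask_llm` and `is_correct` are caller-supplied hooks: the model call and the
# per-sample grading logic. tiktoken is used only to count prompt tokens.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def evaluate(samples, ask_llm, is_correct, optimize=None):
    """Return (accuracy, total prompt tokens) for one pass over the samples."""
    correct, tokens = 0, 0
    for sample in samples:
        prompt = sample["prompt"]
        if optimize is not None:
            prompt = optimize(prompt)        # apply the prompt optimizer
        tokens += len(enc.encode(prompt))    # prompt tokens actually sent
        correct += int(is_correct(ask_llm(prompt), sample))
    return correct / len(samples), tokens

def compare(samples, ask_llm, is_correct, optimize):
    """Run the same samples with and without optimization and report the delta."""
    acc_before, tok_before = evaluate(samples, ask_llm, is_correct)
    acc_after, tok_after = evaluate(samples, ask_llm, is_correct, optimize)
    tokens_reduced = 1 - tok_after / tok_before   # the "% Tokens Reduced" column
    return {
        "accuracy_before": acc_before,
        "accuracy_after": acc_after,
        "tokens_reduced": tokens_reduced,
        # Prompt-token spend drops roughly in proportion to tokens_reduced,
        # i.e. about 100 * tokens_reduced dollars saved per $100 of prompt cost.
        "usd_saved_per_100": 100 * tokens_reduced,
    }
```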
OpenAI Evals
OpenAI Evals is a framework for evaluating Large Language Models (LLMs). It offers a range of evaluation challenges that can be used to measure the quality of optimizations.
LogiQA
LogiQA is the dataset from "LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning". We use the first 100 samples of the LogiQA eval to generate the following results:
| Name | % Tokens Reduced | LogiQA Accuracy | USD Saved Per $100 |
|---|---|---|---|
| Default | 0.0 | 0.32 | 0.0 |
| Entropy_Optim_p_0.05 | 0.06 | 0.3 | 6.35 |
| Entropy_Optim_p_0.1 | 0.11 | 0.28 | 11.19 |
| Entropy_Optim_p_0.25 | 0.26 | 0.22 | 26.47 |
| Entropy_Optim_p_0.5 | 0.5 | 0.08 | 49.65 |
| SynonymReplace_Optim_p_1.0 | 0.01 | 0.33 | 1.06 |
| Lemmatizer_Optim | 0.01 | 0.33 | 1.01 |
| Stemmer_Optim | -0.06 | 0.09 | -5.91 |
| NameReplace_Optim | 0.01 | 0.34 | 1.13 |
| Punctuation_Optim | 0.13 | 0.35 | 12.81 |
| Autocorrect_Optim | 0.01 | 0.3 | 1.14 |
| Pulp_Optim_p_0.05 | 0.05 | 0.31 | 5.49 |
| Pulp_Optim_p_0.1 | 0.1 | 0.25 | 9.52 |
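For instance, the Entropy_Optim_p_0.1 row corresponds to running the comparison above with the entropy optimizer at p = 0.1 on the first 100 LogiQA samples. The usage sketch below reuses `compare` from the earlier snippet; it assumes `EntropyOptim` is exposed under `prompt_optimizer.poptim` and that calling the optimizer on a prompt string yields the compressed prompt (the exact return type may differ), and `logiqa_samples`, `ask_llm`, and `is_correct` are hypothetical placeholders for the loaded eval data and the model/grading hooks.

```python
# Illustrative only: assumes EntropyOptim lives in prompt_optimizer.poptim and
# that calling it on a prompt string yields the compressed prompt text.
from prompt_optimizer.poptim import EntropyOptim

entropy_optim = EntropyOptim(p=0.1)  # p controls the fraction of low-information tokens dropped

def optimize(prompt: str) -> str:
    result = entropy_optim(prompt)
    return getattr(result, "content", result)  # unwrap if a result object is returned

# First 100 LogiQA samples, as in the table above; placeholders as noted.
report = compare(logiqa_samples[:100], ask_llm, is_correct, optimize)
```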