Evaluation

As with LLMs, creating optimizers is easy but evaluating them is not. Evaluating a prompt optimizer works the same way as evaluating an LLM: run the same evaluation on the same prompts and task before and after optimization, then compare accuracy and token cost.
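
As a concrete illustration, the before/after comparison can be wrapped in a small harness like the sketch below. The `optimize`, `run_eval`, and `count_tokens` callables are hypothetical placeholders for an optimizer, an LLM-backed grader, and a tokenizer; they are not the API of any particular library.

```python
from typing import Callable, Sequence

def compare_before_after(
    prompts: Sequence[str],
    optimize: Callable[[str], str],      # hypothetical: prompt -> optimized prompt
    run_eval: Callable[[str], bool],     # hypothetical: True if the LLM answers correctly
    count_tokens: Callable[[str], int],  # hypothetical: prompt -> token count
) -> dict:
    """Run the same eval on original and optimized prompts and report the deltas."""
    optimized = [optimize(p) for p in prompts]
    n = len(prompts)
    accuracy_before = sum(run_eval(p) for p in prompts) / n
    accuracy_after = sum(run_eval(p) for p in optimized) / n
    tokens_before = sum(count_tokens(p) for p in prompts)
    tokens_after = sum(count_tokens(p) for p in optimized)
    fraction_reduced = 1 - tokens_after / tokens_before
    return {
        "accuracy_before": accuracy_before,
        "accuracy_after": accuracy_after,
        "tokens_reduced": fraction_reduced,
        # assumes API cost scales linearly with prompt tokens
        "usd_saved_per_100": 100 * fraction_reduced,
    }
```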

OpenAI Evals

OpenAI Evals is a framework for evaluating large language models (LLMs). It offers a range of evaluation challenges that can be used to measure the quality of optimizations.

LogiQA

LogiQA ("A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning") is a logical-reasoning benchmark. We use its first 100 samples for the LogiQA eval to generate the following results:

| Name | Tokens Reduced (fraction) | LogiQA Accuracy | USD Saved Per $100 |
|----------------------------|-------|------|-------|
| Default                    | 0.0   | 0.32 | 0.0   |
| Entropy_Optim_p_0.05       | 0.06  | 0.3  | 6.35  |
| Entropy_Optim_p_0.1        | 0.11  | 0.28 | 11.19 |
| Entropy_Optim_p_0.25       | 0.26  | 0.22 | 26.47 |
| Entropy_Optim_p_0.5        | 0.5   | 0.08 | 49.65 |
| SynonymReplace_Optim_p_1.0 | 0.01  | 0.33 | 1.06  |
| Lemmatizer_Optim           | 0.01  | 0.33 | 1.01  |
| Stemmer_Optim              | -0.06 | 0.09 | -5.91 |
| NameReplace_Optim          | 0.01  | 0.34 | 1.13  |
| Punctuation_Optim          | 0.13  | 0.35 | 12.81 |
| Autocorrect_Optim          | 0.01  | 0.3  | 1.14  |
| Pulp_Optim_p_0.05          | 0.05  | 0.31 | 5.49  |
| Pulp_Optim_p_0.1           | 0.1   | 0.25 | 9.52  |
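
Reading a row: Entropy_Optim_p_0.25, for example, removes roughly a quarter of the prompt tokens, drops LogiQA accuracy from 0.32 to 0.22, and saves about $26 of every $100 spent on prompts. The savings column appears to track the unrounded token-reduction fraction scaled to $100, under the assumption that API cost is linear in prompt tokens. A minimal sketch of that calculation is shown below; it uses the real `tiktoken` tokenizer, and the two prompts are made-up stand-ins rather than actual LogiQA samples.

```python
# Minimal sketch: estimate dollar savings from prompt-token reduction.
# Assumptions: API cost is linear in prompt tokens, cl100k_base is a
# reasonable proxy tokenizer, and the prompts below are illustrative only.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

original_prompt = (
    "Passage: All metals conduct electricity. Copper is a metal. "
    "Question: Which conclusion follows? Options: A) Copper conducts electricity "
    "B) Copper is an insulator C) No conclusion follows D) Metals are copper"
)
optimized_prompt = (
    "Metals conduct electricity. Copper is metal. Which follows? "
    "A) Copper conducts electricity B) insulator C) no conclusion D) metals are copper"
)

tokens_before = len(enc.encode(original_prompt))
tokens_after = len(enc.encode(optimized_prompt))

fraction_reduced = 1 - tokens_after / tokens_before
usd_saved_per_100 = 100 * fraction_reduced  # e.g. 0.26 reduced -> ~$26 per $100

print(f"tokens: {tokens_before} -> {tokens_after}")
print(f"tokens reduced: {fraction_reduced:.2%}")
print(f"USD saved per $100 of prompt spend: ${usd_saved_per_100:.2f}")
```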