Evaluation

As with LLMs, creating optimizers is easy but evaluating them is not. Evaluating a prompt optimizer works the same way as evaluating an LLM: run the same evaluation on the same prompts and task before and after optimization, then compare accuracy and token cost.
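
As a concrete illustration, the before/after comparison can be wrapped in a small harness like the sketch below. The `optimize`, `run_eval`, and `count_tokens` callables are hypothetical placeholders for an optimizer, an LLM-backed grader, and a tokenizer; they are not the API of any particular library.

```python
from typing import Callable, Sequence

def compare_before_after(
    prompts: Sequence[str],
    optimize: Callable[[str], str],      # hypothetical: prompt -> optimized prompt
    run_eval: Callable[[str], bool],     # hypothetical: True if the LLM answers correctly
    count_tokens: Callable[[str], int],  # hypothetical: prompt -> token count
) -> dict:
    """Run the same eval on original and optimized prompts and report the deltas."""
    optimized = [optimize(p) for p in prompts]
    n = len(prompts)
    accuracy_before = sum(run_eval(p) for p in prompts) / n
    accuracy_after = sum(run_eval(p) for p in optimized) / n
    tokens_before = sum(count_tokens(p) for p in prompts)
    tokens_after = sum(count_tokens(p) for p in optimized)
    fraction_reduced = 1 - tokens_after / tokens_before
    return {
        "accuracy_before": accuracy_before,
        "accuracy_after": accuracy_after,
        "tokens_reduced": fraction_reduced,
        # assumes API cost scales linearly with prompt tokens
        "usd_saved_per_100": 100 * fraction_reduced,
    }
```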

OpenAI Evals

OpenAI Evals is a framework for evaluating large language models (LLMs). It offers a range of evaluation challenges that can be used to measure the quality of optimizations.

LogiQA

LogiQA ("A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning") is a logical-reasoning benchmark. We use its first 100 samples for the LogiQA eval to generate the following results:

| Name | Tokens Reduced (fraction) | LogiQA Accuracy | USD Saved Per $100 |
|----------------------------|-------|------|-------|
| Default                    | 0.0   | 0.32 | 0.0   |
| Entropy_Optim_p_0.05       | 0.06  | 0.3  | 6.35  |
| Entropy_Optim_p_0.1        | 0.11  | 0.28 | 11.19 |
| Entropy_Optim_p_0.25       | 0.26  | 0.22 | 26.47 |
| Entropy_Optim_p_0.5        | 0.5   | 0.08 | 49.65 |
| SynonymReplace_Optim_p_1.0 | 0.01  | 0.33 | 1.06  |
| Lemmatizer_Optim           | 0.01  | 0.33 | 1.01  |
| Stemmer_Optim              | -0.06 | 0.09 | -5.91 |
| NameReplace_Optim          | 0.01  | 0.34 | 1.13  |
| Punctuation_Optim          | 0.13  | 0.35 | 12.81 |
| Autocorrect_Optim          | 0.01  | 0.3  | 1.14  |
| Pulp_Optim_p_0.05          | 0.05  | 0.31 | 5.49  |
| Pulp_Optim_p_0.1           | 0.1   | 0.25 | 9.52  |
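
Reading a row: Entropy_Optim_p_0.25, for example, removes roughly a quarter of the prompt tokens, drops LogiQA accuracy from 0.32 to 0.22, and saves about $26 of every $100 spent on prompts. The savings column appears to track the unrounded token-reduction fraction scaled to $100, under the assumption that API cost is linear in prompt tokens. A minimal sketch of that calculation is shown below; it uses the real `tiktoken` tokenizer, and the two prompts are made-up stand-ins rather than actual LogiQA samples.

```python
# Minimal sketch: estimate dollar savings from prompt-token reduction.
# Assumptions: API cost is linear in prompt tokens, cl100k_base is a
# reasonable proxy tokenizer, and the prompts below are illustrative only.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

original_prompt = (
    "Passage: All metals conduct electricity. Copper is a metal. "
    "Question: Which conclusion follows? Options: A) Copper conducts electricity "
    "B) Copper is an insulator C) No conclusion follows D) Metals are copper"
)
optimized_prompt = (
    "Metals conduct electricity. Copper is metal. Which follows? "
    "A) Copper conducts electricity B) insulator C) no conclusion D) metals are copper"
)

tokens_before = len(enc.encode(original_prompt))
tokens_after = len(enc.encode(optimized_prompt))

fraction_reduced = 1 - tokens_after / tokens_before
usd_saved_per_100 = 100 * fraction_reduced  # e.g. 0.26 reduced -> ~$26 per $100

print(f"tokens: {tokens_before} -> {tokens_after}")
print(f"tokens reduced: {fraction_reduced:.2%}")
print(f"USD saved per $100 of prompt spend: ${usd_saved_per_100:.2f}")
```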