NIAN needs to evaluate the answers to the questions it asks. As usual with LLMs, evaluation is difficult. NIAN uses 5 LLMs to evaluate the responses, and pass/fail is determined by majority vote. The evaluation prompts use single-shot or few-shot examples to improve the evaluators.
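To make the voting concrete, here is a minimal sketch of a majority vote over five evaluators. The judge callables are placeholders for real LLM calls with an evaluation prompt, not NIAN's actual API.

```python
# Minimal sketch: pass/fail by majority vote across five evaluator judges.
from collections import Counter
from typing import Callable, List


def majority_vote(question: str, answer: str,
                  evaluators: List[Callable[[str, str], bool]]) -> bool:
    """Each evaluator returns True (pass) or False (fail); the majority wins."""
    votes = Counter(judge(question, answer) for judge in evaluators)
    return votes[True] > votes[False]


if __name__ == "__main__":
    # Stand-in judges that always pass; real judges would call an LLM with a
    # single-shot or few-shot evaluation prompt.
    stub_judges = [lambda q, a: True] * 5
    print(majority_vote("What does the limerick say?", "Some answer", stub_judges))
```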
NIAN includes a tool, answers, that identifies every unique answer to a question generated by the LLMs and adds it to a pass or fail list based on the majority vote. To get a variety of answers, I used GPT-3.5, GPT-4, Haiku and Mistral 7B to generate them. For questions where the evaluators were failing answers that should have passed, I examined the answers and used a few of the passing ones as few-shot examples of good answers.
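As a rough sketch of what a tool like answers might do, the following groups unique answers into pass and fail lists by majority vote. The AnswerRecord shape is an assumption for illustration, not NIAN's actual data model.

```python
# Sketch: split unique answers per question into pass/fail buckets.
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class AnswerRecord:
    question: str
    answer: str
    votes: List[bool]  # one vote per evaluator


def split_answers(records: List[AnswerRecord]) -> Tuple[Dict[str, List[str]],
                                                        Dict[str, List[str]]]:
    passed: Dict[str, List[str]] = {}
    failed: Dict[str, List[str]] = {}
    for rec in records:
        # Majority of evaluator votes decides which bucket the answer lands in.
        target = passed if sum(rec.votes) > len(rec.votes) / 2 else failed
        bucket = target.setdefault(rec.question, [])
        if rec.answer not in bucket:  # keep only unique answers
            bucket.append(rec.answer)
    return passed, failed
```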
To speed up testing of the few-shot prompt, I created a tool, revaluate, that reevaluates every generated answer from a prior test. It makes a big difference to iteration time: it can reevaluate 1200 answers in about 2 minutes.
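Reevaluating 1200 stored answers in about 2 minutes implies the evaluator calls run concurrently. Here is a hedged sketch of that idea, with reevaluate_one standing in for the real evaluator calls rather than NIAN's actual code.

```python
# Sketch: re-run stored answers through the evaluators using a thread pool.
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def reevaluate_one(record: Dict[str, str]) -> bool:
    # Placeholder: in practice this would send record["answer"] to the
    # evaluator models and return the majority verdict.
    return len(record["answer"]) > 0


def reevaluate_all(records: List[Dict[str, str]], workers: int = 20) -> List[bool]:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(reevaluate_one, records))


if __name__ == "__main__":
    stored = [{"answer": f"answer {i}"} for i in range(1200)]
    verdicts = reevaluate_all(stored)
    print(sum(verdicts), "of", len(verdicts), "passed")
```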
Another useful tool for improving the evaluators is dissent. It reports the number of times an LLM disagrees with the majority vote. I found that GPT-3.5 dissented very often (15-45% of the time) and that it was wrong when it dissented. I removed GPT-3.5 from the evaluator list and doubled up on Mistral 8x22 models. I would like to find another model to broaden the perspectives of the evaluators.
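A dissent report boils down to comparing each evaluator's vote with the majority verdict on every answer. A small sketch, assuming votes are stored per model; this is not the dissent tool's actual interface.

```python
# Sketch: per-model dissent rate against the majority verdict.
from typing import Dict, List


def dissent_rates(votes_by_model: Dict[str, List[bool]]) -> Dict[str, float]:
    """votes_by_model maps an evaluator name to its per-answer votes."""
    models = list(votes_by_model)
    n_answers = len(next(iter(votes_by_model.values())))
    dissents = {m: 0 for m in models}
    for i in range(n_answers):
        votes = [votes_by_model[m][i] for m in models]
        majority = sum(votes) > len(votes) / 2
        for m in models:
            if votes_by_model[m][i] != majority:
                dissents[m] += 1
    return {m: count / n_answers for m, count in dissents.items()}
```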
NIAN gets very different results for different questions. That is why it is important to ask at least 5 different questions. To ensure that NIAN isn't just testing an LLM's ability to answer the question itself, a tool called vet prompts many models with only the limerick and question from full_questions.json. It repeats each question 5 times, and I planned to discard any question that didn't get a perfect score. All questions passed, so I didn't need to remove any.
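A sketch of that vetting loop, assuming full_questions.json holds objects with limerick and question fields and with ask_model standing in for a real LLM call plus evaluation; both are assumptions for illustration, not NIAN's actual interfaces.

```python
# Sketch: ask each question 5 times with only its limerick as context and
# reject any question that misses a perfect score.
import json
from typing import List


def ask_model(limerick: str, question: str) -> bool:
    # Placeholder for a real LLM call followed by majority-vote evaluation.
    return True


def vet_questions(path: str = "full_questions.json", repeats: int = 5) -> List[str]:
    with open(path) as f:
        questions = json.load(f)
    rejected = []
    for q in questions:
        score = sum(ask_model(q["limerick"], q["question"]) for _ in range(repeats))
        if score < repeats:  # anything short of a perfect score
            rejected.append(q["question"])
    return rejected
```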