Quality Benchmark

This test runs through a small subset of the MMLU dataset, prompts the language model with each question, and tallies how many of its single-letter answers are correct.
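The scoring loop described above can be sketched roughly as follows. This is a minimal illustration, not the page's actual implementation: the question format (a dict with `question`, `choices`, and a letter `answer`), the `ask_model` callable, the prompt wording, and the rule that an invalid response still counts toward the total are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

LETTERS = "ABCD"  # assumption: four-option multiple choice, as in MMLU


@dataclass
class Tally:
    correct: int = 0
    total: int = 0
    invalid: int = 0

    @property
    def score_percent(self) -> float:
        # Guard against division by zero before any question has run.
        return 100.0 * self.correct / self.total if self.total else 0.0


def format_prompt(item: dict) -> str:
    """Build a prompt asking for a single-letter answer (hypothetical wording)."""
    options = "\n".join(f"{l}. {c}" for l, c in zip(LETTERS, item["choices"]))
    return f"{item['question']}\n{options}\nAnswer with a single letter."


def extract_letter(response: str) -> Optional[str]:
    """Return A-D if the whole response is one letter (optionally wrapped
    in parentheses or followed by '.' or ':'), otherwise None."""
    match = re.fullmatch(r"\s*\(?([A-Da-d])\)?[.:]?\s*", response)
    return match.group(1).upper() if match else None


def run_benchmark(items: Iterable[dict], ask_model: Callable[[str], str]) -> Tally:
    tally = Tally()
    for item in items:
        letter = extract_letter(ask_model(format_prompt(item)))
        tally.total += 1  # assumption: invalid responses still count as asked
        if letter is None:
            tally.invalid += 1
        elif letter == item["answer"]:
            tally.correct += 1
    return tally
```

Under these assumptions, `Tally` maps onto the counters shown on the page: `correct`, `total`, `score_percent`, and `invalid`. A stricter or looser `extract_letter` would shift responses between the invalid and incorrect buckets without changing the score.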

Correct Answers

Total Questions

Score (%)

Invalid Responses

Progress: 0 questions completed

Test Log