The 🤗 Open LLM Leaderboard - Tracking, Ranking and Evaluating LLMs and Chatbots

Category Artificial Intelligence

tldr #

The 🤗 Open LLM Leaderboard is the first leaderboard to track the performance of LLMs and chatbots with models submitted by the community. It evaluates models on 4 key benchmarks and only accepts 🤗 Transformers models with weights on the Hub. The leading open source models are llama-65b and MetaIX/GPT4-X-Alpasta-30b. Separately, MosaicML has released MPT-7B, the newest model in its Foundation Series, trained from scratch on 1T tokens of text and code in 9.5 days; it is open source and available for commercial use. Three finetuned models have been released in addition to the base MPT-7B.


content #

The 🤗 Open LLM Leaderboard aims to track, rank and evaluate LLMs and chatbots as they are released. It evaluates models on 4 key benchmarks from the Eleuther AI Language Model Evaluation Harness, a unified framework for testing generative language models on a large number of different evaluation tasks. A key advantage of this leaderboard is that anyone from the community can submit a model for automated evaluation on the 🤗 GPU cluster, as long as it is a 🤗 Transformers model with weights on the Hub. The leaderboard also supports evaluation of models with delta-weights for non-commercially licensed models, such as LLaMA.
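Delta-weights let a finetuned model be shared without redistributing the base model's licensed weights: the published checkpoint stores only the parameter-wise difference, and the full model is recovered by adding the deltas back onto the base weights. A minimal sketch of that recovery step, using toy Python lists in place of real weight tensors (the helper name `apply_delta` and all values are illustrative, not part of the leaderboard's actual tooling):

```python
def apply_delta(base_sd, delta_sd):
    """Recover finetuned weights from a base state dict plus published deltas.

    full[k] = base[k] + delta[k] for every parameter k.
    Toy lists stand in for real weight tensors here.
    """
    if base_sd.keys() != delta_sd.keys():
        raise ValueError("base and delta checkpoints must have matching parameters")
    return {
        name: [b + d for b, d in zip(base_sd[name], delta_sd[name])]
        for name in base_sd
    }

# Toy "checkpoints": a base model and the deltas produced by finetuning it.
base = {"layer.weight": [1.0, 2.0, 3.0], "layer.bias": [0.5]}
delta = {"layer.weight": [0.1, -0.2, 0.0], "layer.bias": [-0.5]}

full = apply_delta(base, delta)
# full["layer.weight"] → [1.1, 1.8, 3.0]
```

Only the `delta` dictionary needs to be published; anyone holding the base weights under their own license can reconstruct the finetuned model locally.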

The 🤗 Open LLM Leaderboard is the first leaderboard to track the performance of LLMs and chatbots with models submitted by the community.

Evaluation is performed against 4 popular benchmarks:

- AI2 Reasoning Challenge (25-shot) – a set of grade-school science questions.
- HellaSwag (10-shot) – a test of commonsense inference that is easy for humans (~95%) but challenging for SOTA models.
- MMLU (5-shot) – a test of a text model's multitask accuracy, covering 57 tasks including elementary mathematics, US history, computer science, law, and more.
- TruthfulQA (0-shot) – a benchmark measuring whether a language model is truthful in generating answers to questions.

These benchmarks were chosen because they test reasoning and general knowledge across a wide range of fields in 0-shot and few-shot settings.
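The n-shot numbers above mean each test question is preceded by n solved examples in the prompt (0-shot means the question is asked cold). A minimal sketch of how such a few-shot prompt is assembled; the format and example questions are illustrative, and the real harness uses task-specific templates:

```python
def build_few_shot_prompt(examples, question, n_shot):
    """Concatenate n solved Q/A pairs before the test question (few-shot)."""
    parts = [f"Question: {q}\nAnswer: {a}\n" for q, a in examples[:n_shot]]
    parts.append(f"Question: {question}\nAnswer:")
    return "\n".join(parts)

# Illustrative demonstration pairs, not drawn from any real benchmark.
demos = [
    ("What is 2 + 2?", "4"),
    ("What color is the sky on a clear day?", "Blue"),
]
prompt = build_few_shot_prompt(demos, "How many legs does a spider have?", n_shot=2)
# The model is then scored on its completion after the final "Answer:".
```

With `n_shot=0` the same function produces a 0-shot prompt containing only the test question, matching the TruthfulQA setting above.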

The leading open source models are llama-65b and MetaIX/GPT4-X-Alpasta-30b.

In a separate release, MosaicML has announced MPT-7B, the latest entry in its Foundation Series. MPT-7B is a transformer trained from scratch on 1T tokens of text and code. It is open source, available for commercial use, and matches the quality of LLaMA-7B. MPT-7B was trained on the MosaicML platform in 9.5 days with zero human intervention at a cost of ~$200k. Users can train, finetune, and deploy their own private MPT models, either starting from one of MosaicML's checkpoints or training from scratch. MosaicML is also releasing three finetuned models in addition to the base MPT-7B: MPT-7B-Instruct, MPT-7B-Chat, and MPT-7B-StoryWriter-65k+, the last of which uses a context length of 65k tokens.
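Taking the announcement's ~$200k and 1T-token figures at face value, the training cost works out to a simple back-of-envelope rate per token. A quick check (all numbers come from the reported figures above; the derived rates are arithmetic, not additional claims):

```python
total_cost_usd = 200_000          # reported training cost (~$200k)
tokens = 1_000_000_000_000        # 1T tokens of text and code
days = 9.5                        # reported wall-clock training time

cost_per_billion_tokens = total_cost_usd / (tokens / 1_000_000_000)
tokens_per_day = tokens / days

print(f"${cost_per_billion_tokens:.2f} per billion tokens")  # $200.00
print(f"{tokens_per_day:.3e} tokens/day")                    # ~1.053e+11
```

In other words, roughly $200 per billion training tokens and on the order of 10^11 tokens consumed per day of training.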

The leaderboard aims to track, rank and evaluate LLMs and chatbots using 40+ publicly shared evaluation tasks within a unified framework.
