Evaluation suite for large-scale language models.

Overview

LM Evaluation Test Suite

This repo contains code for running the evaluations and reproducing the results from the Jurassic-1 Technical Paper (see blog post). The tasks can currently be run through both the AI21 Studio API and OpenAI's GPT-3 API.

Citation

Please use the following bibtex entry:

@techreport{J1WhitePaper,
  author = {Lieber, Opher and Sharir, Or and Lenz, Barak and Shoham, Yoav},
  title = {Jurassic-1: Technical Details And Evaluation},
  institution = {AI21 Labs},
  year = 2021,
  month = aug,
}

Installation

git clone https://github.com/AI21Labs/lm-evaluation.git
cd lm-evaluation
pip install -e .

Usage

The entry point for running the evaluations is lm_evaluation/run_eval.py, which receives a list of tasks and models to run.

The models argument should be in the form "provider/model_name", where provider can be "ai21" or "openai" and model_name is one of the provider's supported models.

When running through one of the API models, set your API key(s) using the environment variables AI21_STUDIO_API_KEY and OPENAI_API_KEY. Make sure to consider the costs and quota limits of the models you are running beforehand.
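For example (the key values below are placeholders):

```shell
# Set whichever provider keys you plan to use before running the evaluations
export AI21_STUDIO_API_KEY="your-ai21-key"
export OPENAI_API_KEY="your-openai-key"
```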

Examples:

# Evaluate hellaswag and winogrande on j1-large
python -m lm_evaluation.run_eval --tasks hellaswag winogrande --models ai21/j1-large

# Evaluate all multiple-choice tasks on j1-jumbo
python -m lm_evaluation.run_eval --tasks all_mc --models ai21/j1-jumbo

# Evaluate all docprob tasks on curie and j1-large
python -m lm_evaluation.run_eval --tasks all_docprobs --models ai21/j1-large openai/curie

Datasets

The repo currently supports the zero-shot multiple-choice and document-probability datasets reported in the Jurassic-1 Technical Paper.

Multiple Choice

Multiple choice datasets are formatted as described in the GPT3 paper, and the default reported evaluation metrics are those described there.
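In the GPT-3 paper's zero-shot setup, the model's answer is the choice with the highest log-likelihood given the context, optionally length-normalized. A minimal sketch of that selection rule, assuming a hypothetical `logprob(context, continuation)` scoring function (the repo's actual scoring code may differ):

```python
from typing import Callable, List

def pick_choice(context: str, choices: List[str],
                logprob: Callable[[str, str], float],
                normalize: bool = True) -> int:
    """Return the index of the choice with the highest log-likelihood.

    `logprob` is assumed to return the sum of token log-probs of the
    continuation given the context. When `normalize` is True, scores are
    divided by character length, as done for some tasks in the GPT-3 paper.
    """
    scores = []
    for choice in choices:
        lp = logprob(context, choice)
        scores.append(lp / len(choice) if normalize else lp)
    return max(range(len(scores)), key=scores.__getitem__)
```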

All our formatted datasets except StoryCloze are publicly available and referenced in lm_evaluation/tasks_config.py. StoryCloze needs to be manually downloaded and formatted, and its location should be configured through the environment variable STORYCLOZE_TEST_PATH.

Document Probabilities

Document-probability tasks include documents from 19 data sources, including C4 and datasets from The Pile.

Each document is pre-split at sentence boundaries into sub-documents of up to 1024 GPT tokens each. This ensures all models see the same inputs/contexts regardless of tokenization, and supports evaluation of models limited to sequence lengths of 1024.

Each of the 19 tasks has ~4MB of total text data.
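The pre-splitting step can be sketched as greedy sentence packing up to a token budget. A hypothetical sketch, assuming a `count_tokens` function and naive punctuation-based sentence splitting (the repo's actual preprocessing may differ):

```python
import re
from typing import Callable, List

def split_document(text: str, count_tokens: Callable[[str], int],
                   max_tokens: int = 1024) -> List[str]:
    """Greedily pack sentences into sub-documents of at most max_tokens tokens.

    A single sentence longer than the budget is kept whole rather than split
    mid-sentence, since splits happen only at sentence boundaries.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sent in sentences:
        candidate = (current + " " + sent).strip()
        if current and count_tokens(candidate) > max_tokens:
            chunks.append(current)  # budget exceeded: close the current chunk
            current = sent
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```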

Additional Configuration

Results Folder

By default, all results are saved to the folder 'results', and rerunning the same tasks will load the existing results. The results folder can be changed using the environment variable LM_EVALUATION_RESULTS_DIR.
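For example (the path below is a placeholder):

```shell
# Redirect evaluation results to a custom folder
export LM_EVALUATION_RESULTS_DIR=/path/to/my-results
```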

Comments
  • What subset of The Pile is the model evaluated on?

    Hi,

    I downloaded the OpenSubtitles data from https://storage.googleapis.com/ai21-public-data/lm_evaluation/datasets/doc_probs/max_seq_len_1024-4096KB/open_subtitles.jsonl, but I am unsure which part of The Pile it belongs to. For example, I cannot find the instance below in either the validation or test set downloaded from https://mystic.the-eye.eu/public/AI/pile/. Would you please describe where you got the source data and how you processed it? Thanks!

    {'text': '"12 months after it happened, shock and bewilderment continue to surround the strange events that occurred in a remote old country house last summer, where a man is said to have literally vanished into thin air." "Well-known within the International Church of Spiritualism was the revered medium and psychic Mr Jacques Futrelle, who, on June 21st last year, elected to stage an unusual experiment at his home in Berkshire, the bizarre and sprawling mansion known as Metropolis." "Among the specially invited guests that balmy midsummer\'s evening was the Harvard geneticist and outspoken critic of paranormal doctrine," "Eli Mencken, seen here with Futrelle\'s wife, Theodora, and child." "As the night wore on, discussion turned to a curious rumour concerning an old attic room at the top of the house, where a madman had been kept under lock and key by his titled relatives." "And where, it was said, the lunatic\'s ghost still stalked his former prison, feeding on the souls of non-believers." "The challenge thrown down by their host was for the arch-sceptic" "Mencken, if he dared, to spend a night alone in the room." "And so, at ten minutes to midnight, after careful inspection by independent witnesses, the door was closed, and secured with four heavy-duty padlocks supplied by the guests." "To ensure no single person could assist in his escape, each retained their key as, separately, they retired for the night." "What they found the next day sent a thrill of terror through them all." "Though the door and the locks had clearly not been tampered with, and there was no other conceivable way out of the room," "Mencken was gone!" "On a chair nearby lay the clothes he had removed before going to bed, while on the bed itself was found nothing but a gruesome sticky secretion." "Of the eminent scientist\'s body there was no trace." "One year later, no rational explanation has been advanced for what happened that night." 
"Nor, can it be assumed, will a solution ever be found to this dark, impenetrable mystery." "Welcome back." "Before the break we promised you something a little bit off the radar." "I think our next guest certainly falls into that category!" "Someone whose powers of deduction, and truly phenomenal flair for solving seemingly impossible puzzles are beyond cool." "One might almost say, "magical"." "The seriously interesting Joey Ross." "How you doing here today?" "I\'m doing splendidly, Marcia, how are you?" "It\'s a wicked website you\'ve got here, it truly is, checkreality.co.uk - well worth a visit, folks." "So what\'s the deal with it?" "Basically, people write in to you, about weird things that have happened to them, that they can\'t explain." "And you explain them." "I know, I\'ve become this Agony Aunt of The Abnormal or something, and completely by accident as it happens, cos it all just started off as a common or garden blog." "Three years ago, right, mate of mine had this really bizarre experience where she\'d come home one day and found her fella in bed with the woman next door." "Totally loses it, doesn\'t she?" "Lifts up the duvet and starts jabbing this lighted cigarette in her foot." "By all accounts really took some skin off." "Storms out the room with all her clothes, chucks them on the front lawn." "Five minutes later, this woman\'s coming down the stairs, half naked, but amazingly, her foot has now completely healed up!" "Not a blister, or a burn-mark anywhere!" "Except, it didn\'t take much figuring did it?" "What she hadn\'t considered, he\'d actually got two women in that bed, and the other one\'d done a runner out the back!" "So now you get reports sent to you, from all over the world." "Like, about strange apparitions and premonitions." "And you just apply your brain to the problem, with this like, amazing power of lateral thinking that you seem to possess and sort it all out." "It\'s just, I\'ve always had this brilliant intuition." 
"It\'s something I was born with - like, an amazing instinct for making the logical short-cuts, that get straight to the root of something." "OK, you\'ve selected a card in your mind only, and written it down." "I want you now to fold the piece of paper in half please." "Then half again, and hold it high up in the air." "I\'m now assimilating that card, removing it from the piece of paper." "It\'s now left your hand." "Unfold the paper for me, would you?" "What can you see?"',
     'doc_idx': 0,
     'segment_idx': 0}
    
    opened by YianZhang 4
  • Integration with lm-eval-harness

    I think it would be great to have AI21 API integration in the EleutherAI lm-eval-harness (https://github.com/EleutherAI/lm-evaluation-harness), which would allow evaluating the AI21 API on far more tasks.

    opened by leogao2 1