# LMdiff
Qualitative comparison of large language models.
Demo & Paper: http://lmdiff.net
LMdiff is an MIT-IBM Watson AI Lab collaboration between:
Hendrik Strobelt (IBM, MIT), Benjamin Hoover (IBM, GeorgiaTech), Arvind Satyanarayan (MIT), and Sebastian Gehrmann (HarvardNLP, Google).
## Setting up / Quick start
From the root directory, install the Conda dependencies:

```bash
conda env create -f environment.yml
conda activate LMdiff
pip install -e .
```
Run the backend in development mode, deploying the default models and configurations:

```bash
uvicorn backend.server:app --reload
```

Check the output for the right port (something like http://localhost:8000) and open it in a browser.
### Rebuild frontend
This is optional, because we have a compiled version checked into this repo.
```bash
cd client
npm install
npm run build:backend
cd ..
```
## Using your own models
To use your own models:
1. **Create a `TextDataset` of phrases to analyze**

   You can create the dataset file in several ways:
   **From a text file**

   If you have already collected all the phrases you want into a text file separated by newlines, simply run:

   ```bash
   python scripts/make_dataset.py path/to/my_dataset.txt my_dataset -o folder/i/want/to/save/in
   ```
   **From a Python object (list of strings)**

   Want to work entirely within Python?

   ```python
   from analysis.create_dataset import create_text_dataset_from_object

   my_collection = ["Phrase 1", "My second phrase"]
   create_text_dataset_from_object(my_collection, "easy-first-dataset", "human_created", "folder/i/want/to/save/in")
   ```
   **From [Huggingface Datasets](https://huggingface.co/docs/datasets/)**

   The dataset can be created from one of Huggingface's provided datasets with:

   ```python
   from analysis.create_dataset import create_text_dataset_from_hf_datasets
   import datasets
   import path_fixes as pf

   glue_mrpc = datasets.load_dataset("glue", "mrpc", split="train")
   name = "glue_mrpc_train"

   def ds2str(glue):
       """(e.g.,) Turn the first 50 sentences of the dataset into sentence information"""
       sentences = glue['sentence1'][:50]
       return "\n".join(sentences)

   create_text_dataset_from_hf_datasets(glue_mrpc, name, ds2str, ds_type="human_created", outfpath=pf.DATASETS)
   ```
   The dataset is a simple `.txt` file, with a new phrase on every line and a bit of required metadata header at the top. E.g.,

   ```
   ---
   checksum: 92247a369d5da32a44497be822d4a90879807a8751f5db3ff1926adbeca7ba28
   name: dataset-dummy
   type: human_created
   ---
   This is sentence 1, please analyze this.
   Every line is a new phrase to pass to the model.
   I can keep adding phrases, so long as they are short enough to pass to the model. They don't even need to be one sentence long.
   ```
   The required fields in the header:

   - `checksum` :: A unique identifier for the state of that file. It can be calculated however you wish (one possible scheme is sketched below this list), but it should change if anything at all changes in the contents below (e.g., two phrases are transposed, a new phrase is added, or a period is added after a sentence)
   - `name` :: The name of the dataset
   - `type` :: Either `human_created` or `machine_generated` if you want to compare on a dataset that was spit out by another model (see the generation sketch after the warnings below)
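   For illustration, here is one possible checksum scheme. This is a minimal sketch, not the repo's required method; any function works as long as the value changes whenever the contents change:

   ```python
   import hashlib

   def phrase_checksum(phrases: list[str]) -> str:
       """Hash the dataset contents so that any edit (a reorder, an added
       phrase, or even a changed period) produces a different checksum.
       SHA-256 is one arbitrary but convenient choice."""
       return hashlib.sha256("\n".join(phrases).encode("utf-8")).hexdigest()

   print(phrase_checksum(["This is sentence 1, please analyze this.",
                          "Every line is a new phrase to pass to the model."]))
   ```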
   Each line in the contents is a new phrase to compare in the language model. A few warnings:

   - Make sure the phrases are short enough that they can be passed to the model given your memory constraints
   - The dataset is fully loaded into memory to serve to the front end, so avoid creating a text file that is too large to fit in memory
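   For the `machine_generated` case, here is a sketch of how such a dataset might be produced, assuming Huggingface's `transformers` generation pipeline and the `create_text_dataset_from_object` helper shown above (the prompt, dataset name, and output folder are illustrative):

   ```python
   from transformers import pipeline
   from analysis.create_dataset import create_text_dataset_from_object

   # Sample a few continuations from GPT-2, then save them as a
   # machine_generated dataset using the helper shown above.
   generator = pipeline("text-generation", model="gpt2")
   outputs = generator("The weather today", max_length=20,
                       num_return_sequences=5, do_sample=True)
   phrases = [out["generated_text"].replace("\n", " ") for out in outputs]

   create_text_dataset_from_object(phrases, "gpt2-generated", "machine_generated", "folder/i/want/to/save/in")
   ```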
2. **Choose two comparable models**

   Two models are comparable if they:

   - Have the exact same tokenization scheme
   - Have the exact same vocabulary
   This allows us to do tokenwise comparisons between the two models (a quick way to verify comparability is sketched after the examples below). For example, this could be:
   - A pretrained model and a finetuned version of it (e.g., `distilbert-base-uncased` and `distilbert-base-uncased-finetuned-sst-2-english`)
   - A distilled version mimicking the original model (e.g., `bert-base-cased` and `distilbert-base-cased`)
   - Different sizes of the same model architecture (e.g., `gpt2` and `gpt2-large`)
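   If you are unsure whether two models qualify, here is a quick sanity check, assuming your models are Huggingface checkpoints like those in the examples above:

   ```python
   from transformers import AutoTokenizer

   tok_a = AutoTokenizer.from_pretrained("gpt2")
   tok_b = AutoTokenizer.from_pretrained("gpt2-large")

   # Comparable models must map every token string to the same id.
   assert tok_a.get_vocab() == tok_b.get_vocab(), "Vocabularies differ"
   print(f"Shared vocabulary of {len(tok_a.get_vocab())} tokens")
   ```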
3. **Preprocess the models on the chosen dataset**

   ```bash
   python scripts/preprocess.py all gpt2-medium distilgpt2 data/datasets/glue_mrpc_1+2.csv --output-dir data/sample/gpt2-glue-comparisons
   ```
4. **Start the app**

   ```bash
   python backend/server/main.py --config data/sample/gpt2-glue-comparisons
   ```
   Note that if you use a different tokenization scheme than the default `gpt`, you will need to tell the frontend how to visualize the tokens. For example, for a `bert`-based tokenization scheme:

   ```bash
   python backend/server/main.py --config data/sample/bert-glue-comparisons -t bert
   ```
## Architecture
## (Admin) Getting the Data
Models and datasets for the deployed app are stored in the cloud and require a private `.dvc/config` file.

With the correct config,

```bash
dvc pull
```

will populate the data directories correctly for the deployed version.
## Testing
```bash
make test
```

or

```bash
python -m pytest tests
```

All tests are stored in `tests`.
## Frontend
We like `pnpm`, but `npm` works just as well. We also like Vite for its rapid hot module reloading and pleasant dev experience. This repository uses Vue as a reactive framework.
From the root directory:

```bash
cd client
pnpm install --save-dev
pnpm run dev
```
If you want to hit the backend routes, make sure to also run the `uvicorn backend.server:app` command from the project root.
### For production (serve with Vite)
```bash
pnpm run serve
```
### For production (serve with this repo's FastAPI server)
```bash
cd client
pnpm run build:backend
cd ..
uvicorn backend.server:app
```

Or the `gunicorn` command from above.
All artifacts are stored in the `client/dist` directory with the appropriate basepath.
### For production (serve with external tooling like NGINX)
```bash
pnpm run build
```
All artifacts are stored in the `client/dist` directory.
## Notes
- Check the available endpoints by visiting `<localhost>:<port>/docs` (see the sketch below for a programmatic alternative)
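Since the backend is a FastAPI app, the machine-readable schema is also exposed. A minimal sketch, assuming FastAPI's default routes and the dev server port from the quick start:

```python
import requests

# FastAPI serves its OpenAPI schema at /openapi.json by default.
schema = requests.get("http://localhost:8000/openapi.json").json()
print(sorted(schema["paths"]))  # every available endpoint path
```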