magiCARP: Contrastive Authoring+Reviewing Pretraining

Overview

Welcome to the magiCARP API, the test bed used by EleutherAI for performing text/text bi-encoder experiments.

CARP, or contrastive authorship+reviewing pairings, was first outlined in Cut the CARP: Fishing for zero-shot story evaluation.

CARP presents a scalable method for performing zero-shot evaluation of stories and other media. Current CARP efforts at EleutherAI are primarily focused on controllable code generation. This repository will be updated with more experiments over the coming months as we try varying CARP architectures.
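
For readers new to the setup, here is a minimal sketch of what a text/text bi-encoder contrastive objective looks like. The names, shapes, and symmetric cross-entropy loss below are illustrative assumptions, not the repository's actual implementation.

```python
# Minimal sketch of a CARP-style bi-encoder contrastive loss.
# All names here are illustrative assumptions, not magiCARP's API.
import torch
import torch.nn.functional as F

def contrastive_logits(passage_embs: torch.Tensor,
                       review_embs: torch.Tensor,
                       logit_scale: torch.Tensor) -> torch.Tensor:
    """Scaled cosine-similarity logits between N passages and their N paired reviews."""
    p = F.normalize(passage_embs, dim=-1)
    r = F.normalize(review_embs, dim=-1)
    return logit_scale.exp() * p @ r.t()

def symmetric_contrastive_loss(passage_embs, review_embs, logit_scale):
    # The i-th passage should match the i-th review and vice versa.
    logits = contrastive_logits(passage_embs, review_embs, logit_scale)
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```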

To train a model, run poetry run python -m carp.pytorch.train --data_path="carp/dataset" --config_path ./base_config.yml

Finetuning via COOP and preference learning coming soon.

Comments
  • Merge Visualization Code into Main

    Overview of changes:

    • Created vis folder
    • Added a PCA visualization of embedded random dataset samples (see the sketch below)
    • Extended the above with spherical coordinates
    • Copied carp_cloob config (maybe unnecessary?)
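
    As referenced in the change list, here is a rough sketch of a PCA visualization of embedded samples; scikit-learn and matplotlib are assumptions here and may not match what the vis folder actually uses.

```python
# Rough sketch of a PCA plot of embedded dataset samples.
# Library choices (scikit-learn, matplotlib) are assumptions, not the vis folder's code.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

def plot_embeddings_pca(embeddings: np.ndarray, out_path: str = "pca.png") -> None:
    # Project the (num_samples, dim) embedding matrix down to 2D and scatter it.
    coords = PCA(n_components=2).fit_transform(embeddings)
    plt.scatter(coords[:, 0], coords[:, 1], s=4)
    plt.title("PCA of embedded dataset samples")
    plt.savefig(out_path)
```
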
    opened by shahbuland 4
  • Experimentation with MLM pretraining

    We're experimenting with using an MLM objective to improve data efficiency.

    This PR will include:

    • A new data pipeline API for collating and processing input data.
    • A new learning rate scheduler API for custom LR schedulers that are aware of model state.
    • CARP MLM, which optimizes the MLM objective for a fixed number of epochs before switching to a different CARP objective.
    • MixedMLMEncoder, which will start as an MLM encoder and switch to some other encoder after a fixed number of epochs (see the sketch below).
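
    A rough sketch of the epoch-based objective switch described above; the class and method names are illustrative and not the PR's actual API.

```python
# Illustrative sketch of switching from an MLM objective to a contrastive
# CARP objective after a fixed number of epochs. All names are assumptions.
class MixedObjectiveSchedule:
    """Selects the training objective based on the current epoch."""

    def __init__(self, mlm_epochs: int):
        self.mlm_epochs = mlm_epochs

    def objective(self, epoch: int) -> str:
        return "mlm" if epoch < self.mlm_epochs else "contrastive"

def training_step(model, batch, epoch, schedule, mlm_loss_fn, carp_loss_fn):
    # Early epochs: masked-token prediction; later epochs: contrastive loss.
    if schedule.objective(epoch) == "mlm":
        return mlm_loss_fn(model, batch)
    return carp_loss_fn(model, batch)
```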

    opened by LouisCastricato 4
  • Deepspeed and multiGPU support

    We need to be able to split the contrastive parallel batch between GPUs for the multiGPU setting rather than using DeepSpeed's naive data parallelism.

    In particular, using vanilla CARP as reference,

    The passage encoder contrastive parallel

    https://github.com/EleutherAI/magiCARP/blob/f4f880f7ede3420226d3e41267c1c41d48264a2e/carp/pytorch/model/architectures/carp.py#L61

    and the review encoder contrastive parallel

    https://github.com/EleutherAI/magiCARP/blob/f4f880f7ede3420226d3e41267c1c41d48264a2e/carp/pytorch/model/architectures/carp.py#L73

    can be split between GPUs by sending each subset of the microbatches to its own GPU.

    Similarly, the review and passage encoders do not need to be computed serially, as long as we gather in parallel (see the sketch below).
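
    A hedged sketch of that idea: each rank encodes only its own shard of the microbatches, then the embeddings are gathered so the full contrastive logit matrix can be formed. Function and variable names are illustrative, not the repository's API.

```python
# Sketch of splitting contrastive microbatches across GPUs with torch.distributed.
# Names here are assumptions for illustration only.
import torch
import torch.distributed as dist

def encode_local_shard(encoder, microbatches, rank: int, world_size: int) -> torch.Tensor:
    # Each GPU runs the encoder over its own slice of the microbatch list.
    local = microbatches[rank::world_size]
    return torch.cat([encoder(mb) for mb in local], dim=0)

def gather_embeddings(local_embs: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
    # Collect every rank's embeddings into one tensor for the contrastive loss.
    # (Assumes every rank produced a shard of the same shape.)
    gathered = [torch.zeros_like(local_embs) for _ in range(world_size)]
    dist.all_gather(gathered, local_embs)
    # all_gather does not carry gradients for remote shards, so substitute the
    # local tensor back in to preserve this rank's autograd path.
    gathered[rank] = local_embs
    return torch.cat(gathered, dim=0)
```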

    opened by LouisCastricato 3
  • Using a GPT model fine-tuned with a contrastive loss

    Currently we cannot scale CLOOB past declutr-base for unknown reasons. Our current hypothesis is that we need to pretrain or fine-tune a language model using a contrastive loss. Kharr and I are exploring SimCTG as a method to fine-tune a base model before training a CARP CLOOB model.

    opened by LouisCastricato 2
  • create makefile for styling and style all the code

    I added a makefile for code styling and static analysis. You can style the code by running the 'style' target in the makefile. It would be great to style code before pushing ;)

    opened by hyunwoongko 1
  • Ctx fix

    Two fixes

    Context Length Fix:

    • Any trainer inheriting from BaseTrainer was calling create_tok, a method that truncates all strings to the last n_ctx (generally 512) characters, as opposed to the last n_ctx tokens (see the sketch below)
    • BaseEncoder was fixed previously to implicitly truncate when required
    • create_tok is now just lowering data quality by cutting all input strings short
    • This bug likely affected many training results
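
    A small illustration of the character-versus-token truncation issue; the tokenizer name and n_ctx value below are examples, not the repository's configuration.

```python
# Character vs. token truncation, illustrated with a Hugging Face tokenizer.
# "roberta-base" and n_ctx=512 are example choices, not magiCARP's settings.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")
n_ctx = 512
text = "a long passage about a character in a story " * 200

# Buggy behaviour: keep only the last n_ctx *characters* before tokenizing,
# which discards far more text than the model's context requires.
char_truncated = tok(text[-n_ctx:], return_tensors="pt")

# Intended behaviour: let the tokenizer truncate to at most n_ctx *tokens*.
token_truncated = tok(text, truncation=True, max_length=n_ctx, return_tensors="pt")
```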

    Vicreg Trainer:

    • A previous KeyError bug was patched in BaseTrainer but not in the other trainers
    • Changed a key in the vicreg trainer to reflect the patch to BaseTrainer
    opened by shahbuland 0
  • Update cosine_sim implementation

    The implementation had an absolute value, so the cosine similarity would always be positive; removed that.

    Also, I don't know why normalize=False is the default; I think normalize=True is a better default. If the inputs are not normalized, you need them to be, and if they are already unit vectors, normalization is a no-op, so normalize=True seems like the better default to me.
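
    A minimal sketch of a cosine_sim along the lines discussed above (no absolute value, normalize=True as the default); it mirrors the intent of the change rather than the repository's exact implementation.

```python
# Sketch of a signed cosine similarity with normalization on by default.
import torch
import torch.nn.functional as F

def cosine_sim(x: torch.Tensor, y: torch.Tensor, normalize: bool = True) -> torch.Tensor:
    """Pairwise cosine similarity between the rows of x and the rows of y."""
    if normalize:
        # A no-op (up to numerical precision) if the rows are already unit vectors.
        x = F.normalize(x, dim=-1)
        y = F.normalize(y, dim=-1)
    # Signed similarity in [-1, 1]; taking abs() here would hide anti-correlation.
    return x @ y.t()
```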

    opened by tmabraham 0
  • Fixed issue with context length using characters instead of tokens.

    In several scripts, string length was being used instead of the number of tokens for context truncation. I have not checked the entire codebase for more instances of this issue (they may exist), but I have resolved the cases within the example scripts. Additionally, the tokenizer in the BaseEncoder has been updated to truncate automatically to the model context length. A one-off bug in the data pipeline, which seems to be the result of old code being left in, has also been resolved.

    opened by shahbuland 0
  • Add encoding utilities

    Added example scripts that can be used to generate passage or review encodings en masse, along with indices of which dataset items were encoded. This is useful for many things, so I think it should be incorporated into the main branch.
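
    A hedged sketch of what such a utility might look like: batch-encode a list of texts while recording which dataset indices were encoded. Function and file names are illustrative, not the added scripts.

```python
# Sketch of en-masse encoding with index tracking. Names are assumptions.
import torch

def encode_dataset(encoder, texts, batch_size: int = 64):
    embeddings, indices = [], []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        with torch.no_grad():
            embeddings.append(encoder(batch).cpu())
        indices.extend(range(start, start + len(batch)))
    return torch.cat(embeddings, dim=0), torch.tensor(indices)

# Example usage (hypothetical names):
# embs, idx = encode_dataset(review_encoder, reviews)
# torch.save({"embeddings": embs, "indices": idx}, "review_encodings.pt")
```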

    opened by shahbuland 0
  • Adds a shared encoder for passage and critiques

    Useful for uni-encoder experiments.

    Current results are inconclusive. Merging this into main so that we can use it for a multitude of future experiments (see the sketch below).

    See a declutr-base run here https://wandb.ai/eleutherai/magicarp-carp_pytorch_training/runs/1z41hm0i?workspace=user-louiscastricato
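
    A rough sketch of the shared-encoder (uni-encoder) idea: a single underlying model encodes both passages and critiques. The class and method names are illustrative only.

```python
# Sketch of a uni-encoder: one backbone reused for both sides of the pair.
import torch.nn as nn

class SharedEncoder(nn.Module):
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone  # a single text encoder shared by both sides

    def encode_passages(self, tokens):
        return self.backbone(tokens)

    def encode_reviews(self, tokens):
        # Identical weights; the contrastive loss still treats the two outputs
        # as separate views of the passage/critique pair.
        return self.backbone(tokens)
```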

    opened by LouisCastricato 0
  • Add additional evaluation metrics for representation learning

    Namely:

    1. Linear head classification task on embeddings, reporting test accuracy from the trained linear head.
    2. WandB-ready plot of UMAP'd embeddings to visualize the latent space (see the sketch below).
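
    A hedged sketch of the two metrics listed above; the library choices (scikit-learn, umap-learn, wandb) are assumptions and may not match the PR's actual code.

```python
# Sketch of (1) a linear probe on frozen embeddings and (2) a WandB UMAP plot.
import numpy as np
import umap
import wandb
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def linear_probe_accuracy(embeddings: np.ndarray, labels: np.ndarray) -> float:
    # Train a linear head on frozen embeddings and report held-out accuracy.
    x_tr, x_te, y_tr, y_te = train_test_split(embeddings, labels, test_size=0.2)
    clf = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    return clf.score(x_te, y_te)

def log_umap_plot(embeddings: np.ndarray, labels: np.ndarray) -> None:
    # Project embeddings to 2D with UMAP and log a scatter plot to WandB.
    coords = umap.UMAP(n_components=2).fit_transform(embeddings)
    table = wandb.Table(
        data=[[float(x), float(y), int(l)] for (x, y), l in zip(coords, labels)],
        columns=["x", "y", "label"],
    )
    wandb.log({"umap_embeddings": wandb.plot.scatter(table, "x", "y")})
```
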
    opened by shahbuland 1
  • Added default setting for review encoding

    I made the error of not setting a default value for "force_fresh". This meant that running the script without the "FRESH" command line argument resulted in no value being set, causing a runtime error. There was effectively no way to have a non-fresh run unless a redundant argument was passed.
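
    An illustrative sketch of the fix: give the flag a default so the script runs when it is omitted. The argparse usage and argument names below mirror the description but are assumptions about the script.

```python
# Sketch: a default value means omitting the flag no longer leaves it unset.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--fresh", dest="force_fresh", action="store_true", default=False,
                    help="force a fresh run instead of reusing cached review encodings")
args = parser.parse_args()
print(args.force_fresh)  # False when the flag is omitted, True when it is passed
```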

    opened by shahbuland 0
  • nan gradients

    Not sure what the etiology is here, but I figured we should document it and maybe centralize the repair discussion here, since it's apparently a known issue. I think with CARPFilip we first observed this when gradient checkpointing was turned off; prior to that, biases had feasible gradient distributions, but weight gradients were already reporting as collapsed to zero (or, more likely, had exploded and the NaNs were interpreted as zero by wandb or the gradient checkpointing mechanism).

    opened by dmarx 0
  • move `--type` arg from command line to config

    Alternatively, we could move everything to config files and use the command line only for overriding config entries. Requiring that certain arguments be provided via the command line is confusing, at least to me.
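
    A hedged sketch of the suggested pattern: everything lives in the config file and the command line only overrides individual entries. The --override syntax and config keys below are illustrative, not an existing interface in the repository.

```python
# Sketch of config-first argument handling with command-line overrides.
import argparse
import yaml

parser = argparse.ArgumentParser()
parser.add_argument("--config_path", required=True)
parser.add_argument("--override", nargs="*", default=[],
                    help="key=value pairs that replace entries from the config file")
args = parser.parse_args()

with open(args.config_path) as f:
    config = yaml.safe_load(f)

for pair in args.override:
    key, value = pair.split("=", 1)
    config[key] = yaml.safe_load(value)  # reuse YAML parsing for ints/bools/strings

# e.g. python -m carp.pytorch.train --config_path base_config.yml \
#          --override type=carp_cloob epochs=10
```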

    opened by dmarx 1
Owner
EleutherAI

Related projects

• TorchX, a library containing standard DSLs for authoring and running PyTorch related components for an E2E production ML pipeline (193 stars, Dec 22, 2022)
• CLASP - Contrastive Language-Aminoacid Sequence Pretraining (Michael Pieler, 133 stars, Dec 29, 2022)
• DetCon: code for the ICCV 2021 paper "Efficient Visual Pretraining with Contrastive Detection" (DeepMind, 56 stars, Nov 13, 2022)
• SoCo: Aligning Pretraining for Detection via Object-Level Contrastive Learning (NeurIPS 2021 Spotlight) (Yue Gao, 139 stars, Dec 14, 2022)
• Re-implementation of the Noise Contrastive Estimation algorithm for PyTorch, following "Noise-contrastive estimation: A new estimation principle for unnormalized statistical models" (Gutmann and Hyvarinen, AISTATS 2010) (Denis Emelin, 42 stars, Nov 24, 2022)
• Saeed Lotfi (28 stars, Dec 12, 2022)
• TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks (Humam Alwassel, 83 stars, Dec 21, 2022)
• Official PyTorch implementation of the paper "ImageNet-21K Pretraining for the Masses" (2021) (574 stars, Jan 2, 2023)
• SapBERT: Self-alignment pretraining for BERT (NAACL and ACL 2021) (Cambridge Language Technology Lab, 104 stars, Dec 7, 2022)
• When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset of 53,000+ Legal Holdings (RegLab, 39 stars, Jan 7, 2023)
• Pretraining Representations for Data-Efficient Reinforcement Learning (Mila, 40 stars, Dec 11, 2022)
• ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information (ACL 2021) (413 stars, Dec 1, 2022)
• DETReg: Unsupervised Pretraining with Region Priors for Object Detection (Amir Bar, 283 stars, Dec 27, 2022)
• Code for generating a single image pretraining dataset, based on "A critical analysis of self-supervision, or what we can learn from a single image" (Yuki M. Asano, 12 stars, Dec 19, 2022)
• Frustratingly Simple Pretraining Alternatives to Masked Language Modeling (EMNLP 2021) (Atsuki Yamaguchi, 31 stars, Nov 18, 2022)
• Does Pretraining for Summarization Require Knowledge Transfer? Pretraining summarization models using a corpus of nonsense (Approximately Correct Machine Intelligence (ACMI) Lab, 12 stars, Dec 19, 2022)
• The PASS dataset: pretrained models and how to get the data. PASS: Pictures without humAns for Self-Supervised Pretraining (Yuki M. Asano, 249 stars, Dec 22, 2022)
• XLM: PyTorch original implementation of Cross-lingual Language Model Pretraining (Facebook Research, 2.7k stars, Dec 27, 2022)
• PSLA: Improving Audio Tagging with Pretraining, Sampling, Labeling, and Aggregation (TASLP paper) (Yuan Gong, 84 stars, Dec 27, 2022)