A transformer that does not hog your GPU memory
This is an early, still-in-development codebase: if you want a stable and documented hivemind codebase, look at CALM or dalle-hivemind.
- Testing for correctness:
PYTHONPATH=. pytest ./tests
- Memory benchmarks:
notebooks/performance_and_memory.ipynb
Readme under construction
LeanTransformer implements a specific version of transformer with two goals in mind:
- using as little GPU memory as possible
- stable training for very large models
The core philosophy of LeanTransformer is to replace torch.autograd with grad students. Automatic differentiation is great if you want to test ideas quickly, less so if a single training run can cost over $4 million (or >1000 years in grad school).
Related work: GSO
Our implementation partially replaces automatic differentiation with Grad Student Optimization (GSO) - a biologically inspired black-box optimization algorithm. In the past, GSO has seen widespread adoption thanks to its strong theoretical foundations and unparalleled cost efficiency (Chom et al). Previous work successfully applied GSO for hyperparameter tuning and natural language generation. To the best of our knowledge, ours is the first work to successfully apply distributed fault-tolerant GSO to optimizing the memory footprint of transformers. We summarize our findings below:
Memory saving features:
- [default] manual memory-efficient differentiation for feedforward layers
- [option] gradient checkpointing (Griewank et al, Chen et al, 2016); see the sketch after this list
- [option] reversible layers using ClashLuke's revlib, based on (Gomez et al, 2017, Kitaev et al, 2020)
- [option] PixelFly block-sparse layers that significantly reduce the number of parameters (Chen et al, 2021)
- [option] customizable parameter sharing (Radford et al, 2019, Xue et al, 2021)
- [option] CPU-offloaded 8-bit LAMB (Dettmers et al, 2021)
- A pinch of magic that we'll explain eventually (hopefully)
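To make the gradient checkpointing option concrete, here is a minimal sketch of the underlying idea using plain torch.utils.checkpoint; the layer, shapes, and names below are illustrative and are not LeanTransformer's actual API:

```python
import torch
from torch.utils.checkpoint import checkpoint

# A stand-in feedforward block (hypothetical, not a LeanTransformer module).
layer = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)
x = torch.randn(2, 512, 1024, requires_grad=True)

# Regular forward: every intermediate activation is kept for the backward pass.
y_regular = layer(x)

# Checkpointed forward: only the inputs are kept; intermediate activations are
# recomputed during backward, trading extra compute for lower peak memory.
y_checkpointed = checkpoint(layer, x, use_reentrant=False)
y_checkpointed.sum().backward()
```

Checkpointing roughly adds one extra forward pass through the wrapped block in exchange for not storing its intermediate activations.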
Other features:
- [default] Pre-normalization: a more stable layer order used in GPT2 (as opposed to the original transformer)
- [option] Sandwich Norm, as proposed in (Ding et al, 2021)
- [option] Maintaining FP32 residuals in mixed precision training, learned from discussions with Samyam and Jeff from DeepSpeed
- [option] Rotary Position Embeddings, proposed by Su et al and popularized by EleutherAI
- [option] Gated activations (e.g. GeGLU) (Shazeer et al, 2020), based on (Dauphin et al, 2016); see the sketch after this list
- [option] Sequence length warmup aka Curriculum Learning (Li et al, 2021)
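For the gated activation option, here is a minimal GeGLU feedforward sketch in the spirit of (Shazeer et al, 2020); the class name and sizes are made up for illustration and do not reflect LeanTransformer's actual modules:

```python
import torch
import torch.nn.functional as F

class GeGLUFeedForward(torch.nn.Module):
    """Minimal GeGLU feedforward: out = W_out(GELU(x W) * (x V))."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        # One matmul produces both the gate and the value projections.
        self.dense_in = torch.nn.Linear(hidden_size, 2 * intermediate_size)
        self.dense_out = torch.nn.Linear(intermediate_size, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, value = self.dense_in(x).chunk(2, dim=-1)
        return self.dense_out(F.gelu(gate) * value)

# Usage: drop-in replacement for a Linear -> GELU -> Linear feedforward block.
ffn = GeGLUFeedForward(hidden_size=1024, intermediate_size=4096)
out = ffn(torch.randn(2, 16, 1024))
```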
Not implemented:
- In reversible mode, one can further save memory by computing backward in chunks (see the sanity check after this list):
  - a few tokens at a time for feedforward layers, since grad(concat(mlp(x1), mlp(x2))) = concat(grad(mlp(x1)), grad(mlp(x2)))
  - a few heads at a time for self-attention, since grad(head1 + head2) = grad(head1) + grad(head2), where head1 and head2 are attention outputs after linear projection
- Attention could be computed in O(sqrt(n)) memory (Rabe et al, 2021)
- No sparse or linear attention: they are great for very long sequences. However, for large models, attention is not a bottleneck in typical NLP and vision tasks (tested GPT-3 up to length 4096).
- Per-block grad scaling as described in (Ramesh et al, 2021) - we rely on Sandwich Norm to maintain stability up to 96 layers (did not test more). However, it would be nice to have per-block scaling to avoid the need for an extra LayerNorm.
- Something else that we missed - please find us on discord.
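As a sanity check of the chunked-backward identity for feedforward layers mentioned above, the snippet below (illustrative, not code from this repository) verifies that back-propagating a token-wise MLP a few rows at a time accumulates the same input gradients as a single full backward pass:

```python
import torch

torch.manual_seed(0)
mlp = torch.nn.Sequential(
    torch.nn.Linear(64, 256), torch.nn.GELU(), torch.nn.Linear(256, 64)
)
x = torch.randn(8, 64)

# Full backward: gradient of sum(mlp(x)) w.r.t. x, computed in one pass.
x_full = x.clone().requires_grad_(True)
mlp(x_full).sum().backward()

# Chunked backward: process a few tokens at a time; each backward call
# accumulates its chunk's gradient into x_chunked.grad.
x_chunked = x.clone().requires_grad_(True)
for chunk in x_chunked.split(2, dim=0):
    mlp(chunk).sum().backward()

# grad(concat(mlp(x1), mlp(x2))) == concat(grad(mlp(x1)), grad(mlp(x2)))
assert torch.allclose(x_full.grad, x_chunked.grad, atol=1e-6)
```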
A day will come when we explain all these modifications and provide instructions on how to tune them. But it is not this day! Until then, we'll happily answer any questions on our discord.
Running the code
[under construction] - use the instructions from the CALM readme
Acknowledgements:
- Most of the architecture and stability optimizations were learned through the BigScience research workshop
- The YSDA community helped us survive the early, messy versions of this code
- NeuroPark trained the first practical model (SahajBERT-XL, SoTA in Bengali, details here)
- TODO DALLE community: at least mention the demo, maybe we end up training something even cooler
- TODO NCAI community: ask them how best to acknowledge them
- TODO Hugging Face: ask them how best to acknowledge them
- TODO Personal: stas00, samyam, jared, more? (this does not include co-authors: Tim, Lucile, Quentin, Denis, Gennady, etc.; also, this does not include hivemind contributors)