A simple, unofficial implementation of MAE (Masked Autoencoders Are Scalable Vision Learners) using pytorch-lightning.
Currently implements training on CUB and StanfordCars, but is easily extensible to any other image dataset.
Setup
# Clone the repository
git clone https://github.com/catalys1/mae-pytorch.git
cd mae-pytorch
# Install required libraries (inside a virtual environment preferably)
pip install -r requirements.txt
# Set up .env for path to data
echo "DATADIR=/path/to/data" > .env
Usage
MAE training
Training options are provided through configuration files, handled by LightningCLI. See configs/ for examples.
Train an MAE model on the CUB dataset:
python train.py fit --config=configs/mae.yaml --config=configs/data/cub_mae.yaml
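The `fit` subcommand and the stacked `--config` files are standard LightningCLI behavior: the YAML files name the model and data classes and their options. A minimal sketch of such an entry point (an assumption, not the repo's verbatim train.py):

```python
# Hypothetical train.py sketch: LightningCLI parses the --config files and the
# `fit` subcommand, instantiating the model/datamodule named in the YAML.
from pytorch_lightning.cli import LightningCLI  # older PL versions: pytorch_lightning.utilities.cli

if __name__ == "__main__":
    LightningCLI()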
Using multiple GPUs:
python train.py fit --config=configs/mae.yaml --config=configs/data/cub_mae.yaml --config=configs/multigpu.yaml
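The multi-GPU config presumably just overrides trainer settings. Expressed as the roughly equivalent Trainer arguments (assumed values, not a copy of configs/multigpu.yaml):

```python
# Hypothetical Trainer settings that a multi-GPU config might map to.
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,        # number of GPUs on the machine
    strategy="ddp",   # DistributedDataParallel across those devices
)
```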
Fine-tuning
Not yet implemented.
Implementation
The default model uses ViT-Base for the encoder and a small ViT (depth=4, width=192) for the decoder. This decoder is smaller than the one used in the paper.
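For reference, a ViT of that size can be instantiated directly from timm; this is only a sketch of the stated dimensions, not necessarily how the repo builds its decoder:

```python
# Sketch of a depth=4, width=192 ViT via timm (num_heads=3 is an assumption: 192 / 64).
from timm.models.vision_transformer import VisionTransformer

decoder = VisionTransformer(
    img_size=224,
    patch_size=16,
    embed_dim=192,   # width
    depth=4,         # number of transformer blocks
    num_heads=3,
    num_classes=0,   # no classification head; returns token features
)
```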
Dependencies
- Configuration and training are handled entirely by pytorch-lightning.
- The MAE model uses the VisionTransformer from timm.
- Interface to FGVC datasets through fgvcdata.
- Configurable environment variables through python-dotenv.
Results
Reconstructions of CUB validation-set images after training with the following command:
python train.py fit --config=configs/mae.yaml --config=configs/data/cub_mae.yaml --config=configs/multigpu.yaml