Are Convolutional Neural Networks or Transformers more like human vision?
This repository contains the code and fine-tuned models for popular Convolutional Neural Networks (CNNs) and the recently proposed Vision Transformer (ViT), fine-tuned on the augmented ImageNet dataset, along with the shape/texture bias tests run on the Stylized ImageNet dataset.
This work compares CNNs and the ViT against humans using error consistency, going beyond traditional metrics. These tests show that the recently proposed self-attention-based Transformer models make more human-like errors than traditional CNNs.
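For readers unfamiliar with the shape/texture bias test, here is a minimal sketch of how a shape-bias score is commonly computed on cue-conflict images, where each image has one label for its shape and a different one for its texture. It assumes the usual definition: the fraction of shape decisions among all decisions that match either the shape or the texture label. The function name `shape_bias` and its arguments are hypothetical and are not taken from this repository's code.

```python
import math
from typing import Sequence


def shape_bias(
    predictions: Sequence[str],
    shape_labels: Sequence[str],
    texture_labels: Sequence[str],
) -> float:
    """Fraction of shape decisions among shape-or-texture decisions.

    Assumes every trial is a cue-conflict image, i.e. its shape label
    differs from its texture label (an assumption of this sketch).
    """
    if not (len(predictions) == len(shape_labels) == len(texture_labels)):
        raise ValueError("all inputs must have the same length")

    shape_hits = 0
    texture_hits = 0
    for pred, shape, texture in zip(predictions, shape_labels, texture_labels):
        if pred == shape:
            shape_hits += 1
        elif pred == texture:
            texture_hits += 1
        # Predictions matching neither cue are ignored.

    decided = shape_hits + texture_hits
    if decided == 0:
        return math.nan  # undefined when no prediction matches either cue
    return shape_hits / decided


# Toy data (made up for illustration only).
preds = ["cat", "elephant", "cat", "bird"]
shapes = ["cat", "cat", "cat", "dog"]
textures = ["elephant", "elephant", "dog", "cat"]
print(shape_bias(preds, shapes, textures))
```

A score near 1 indicates mostly shape-based decisions, and a score near 0 indicates mostly texture-based decisions.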
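As an illustration of the error-consistency idea, the sketch below computes Cohen's kappa over per-trial correctness for two observers (for example, a model and a human). It assumes that error consistency is measured this way, which is a common definition; the function name `error_consistency` and the toy inputs are hypothetical and are not taken from this repository's code.

```python
import math
from typing import Sequence


def error_consistency(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> float:
    """Cohen's kappa on trial-by-trial correctness of two observers.

    correct_a[i] and correct_b[i] say whether each observer classified
    trial i correctly. Both observers must have seen the same trials.
    """
    n = len(correct_a)
    if n == 0 or n != len(correct_b):
        raise ValueError("inputs must be non-empty and of equal length")

    acc_a = sum(correct_a) / n
    acc_b = sum(correct_b) / n

    # Observed consistency: both right or both wrong on the same trial.
    c_obs = sum(a == b for a, b in zip(correct_a, correct_b)) / n
    # Consistency expected by chance from the two accuracies alone.
    c_exp = acc_a * acc_b + (1 - acc_a) * (1 - acc_b)

    if c_exp == 1.0:
        return math.nan  # kappa is undefined when chance agreement is total
    return (c_obs - c_exp) / (1 - c_exp)


# Toy data (made up for illustration only).
model = [True, True, False, False]
human = [True, False, True, False]
print(error_consistency(model, human))
```

A value of 0 means the two observers agree on errors no more than their accuracies predict, positive values mean they tend to err on the same trials, and negative values mean they tend to err on different trials.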
Colab
You can run the tests on the results directly in Google Colaboratory, without installing anything on your local machine. Click "Open in Colab" below:
Developer
Shikhar Tuli. For any questions, comments or suggestions, please reach me at [email protected].
Cite this work
If you use our experimental results or fine-tuned models, please cite:
@article{tuli2021cogsci,
title={Are Convolutional Neural Networks or Transformers more like human vision?},
author={Shikhar Tuli and Ishita Dasgupta and Erin Grant and Thomas L. Griffiths},
year={2021},
eprint={2105.07197},
archivePrefix={arXiv},
primaryClass={cs.CV}
}