Overview

tf-metal-experiments

TensorFlow Metal Backend on Apple Silicon Experiments (just for fun)

Setup

This is tested on M1-series Apple Silicon SoCs only.

TensorFlow 2.x

  1. Follow the official instructions from Apple here
  2. Test that your Metal GPU is working by running tf.config.list_physical_devices("GPU"); you should see 1 GPU present (it is not named). Later, when you actually use the GPU, a more informative printout will say Metal device set to: Apple M1 Max or similar. See the quick check after this list.
  3. Now you should be ready to run any TF code that doesn't require external libraries.
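As a quick check that the plugin works end to end, the following minimal sketch (assuming the packages from step 1 are installed) lists the GPU and runs a small op on it:

import tensorflow as tf

# Should print one (unnamed) Metal GPU on Apple Silicon
print(tf.config.list_physical_devices("GPU"))

# Actually running an op on the GPU triggers the more informative
# "Metal device set to: ..." printout mentioned above
with tf.device("/GPU:0"):
    x = tf.random.normal((1024, 1024))
    y = tf.matmul(x, x)
print(float(tf.reduce_sum(y)))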

HuggingFace Transformers library

If you want to play around with Transformer models (on the TF Metal backend, of course), you will need to install the HuggingFace Transformers library.

  1. Install the regex library (I don't know why it has to be like this, but yeah): python3 -m pip install --upgrade regex --no-use-pep517. You might need to run xcode-select --install if the above command doesn't work.
  2. pip install transformers ipywidgets
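With both installed, a quick smoke test of a Transformer model on the Metal backend might look like the following (a sketch; the DistilBERT checkpoint name is just an example):

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModel

# Example checkpoint; any TF-compatible model should work the same way
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = TFAutoModel.from_pretrained("distilbert-base-uncased")

# TF places the forward pass on the Metal GPU automatically
inputs = tokenizer(["Hello from the M1 GPU!"], return_tensors="tf")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, 768)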

Experiments and Benchmarks

After some trial and error, here are some initial benchmarks for what should be approximately the best capability of the M1 Max. For all the cases here, increasing the batch size does not seem to increase throughput.

Power draw also doesn't seem to exceed 40W. Power draw from the GPU (averaged over 1 second) can be measured with sudo powermetrics --samplers gpu_power -i1000 -n1.
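To capture that reading from Python instead of the terminal, a thin subprocess wrapper around the same command works (a sketch; it requires sudo and simply prints the raw powermetrics output):

import subprocess

# Same command as above: one sample averaged over 1000 ms
result = subprocess.run(
    ["sudo", "powermetrics", "--samplers", "gpu_power", "-i1000", "-n1"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)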

| Model | GPU | BatchSize | Throughput | Power | Memory |
| --- | --- | --- | --- | --- | --- |
| ResNet50 | M1 Max 32c | 64 | 135 img/sec | 40W | 13 GB |
| MobileNetV2 | M1 Max 32c | 128 | 352 img/sec | 37W | 15 GB |
| DistilBERT | M1 Max 32c | 64 | 120 seq/sec | 35W | 9 GB |
| BERTLarge | M1 Max 32c | 32 | 18 seq/sec | 36W | 14 GB |

The benchmark scripts used are included in this repo.
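The scripts themselves are the source of truth; purely as an illustration of the measurement approach, a throughput benchmark in the same spirit might look like this (a sketch on synthetic data, not the repo's exact code):

import time
import numpy as np
import tensorflow as tf

batch_size = 64
model = tf.keras.applications.ResNet50(weights=None)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Synthetic data keeps input-pipeline overhead out of the measurement
x = np.random.rand(batch_size * 10, 224, 224, 3).astype("float32")
y = np.random.randint(0, 1000, size=(batch_size * 10,))

model.fit(x, y, batch_size=batch_size, epochs=1, verbose=0)  # warm-up
start = time.time()
model.fit(x, y, batch_size=batch_size, epochs=3, verbose=0)
print(f"{3 * len(x) / (time.time() - start):.1f} img/sec")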

Reference Benchmarks from RTX 3090

| Model | GPU | BatchSize | Throughput | Power |
| --- | --- | --- | --- | --- |
| ResNet50 | 3090 | 64 | 957 img/sec | 300W |
| MobileNetV2 | 3090 | 128 | 1927 img/sec | 310W |
| DistilBERT | 3090 | 64 | 1040 seq/sec | 310W |
| BERTLarge | 3090 | 32 | 164 seq/sec | 320W |

For the 3090, the same script is used, but with additional optimizations added that leverage hardware (Tensor Cores) and software (the XLA compiler) not present/working on the M1. This corresponds to the following code segment being added:

import tensorflow as tf
from tensorflow.keras import mixed_precision

# Compile models with XLA for kernel fusion
tf.config.optimizer.set_jit(True)
# Compute in FP16 on Tensor Cores while keeping variables in FP32
policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_global_policy(policy)
physical_devices = tf.config.list_physical_devices('GPU')

Also note that the 3090 is likely to perform better at larger batch sizes.

Measuring Achievable TFLOPS

We can use TF to write a matrix multiplication benchmark to estimate the maximum compute performance we can get out of an M1 Max. It seems we can get around ~8 TFLOPS for large enough problem (GEMM) sizes.

The plot can be generated using tflops_sweep.py.

Note that FP64 and FP16 performance appears to be non-existent (the code automatically falls back to the CPU if FP64 or FP16 is specified as the data type).
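For reference, the core of such a sweep boils down to timing square matmuls and converting to TFLOPS (a minimal sketch, not the exact tflops_sweep.py code):

import time
import tensorflow as tf

def gemm_tflops(n=4096, iters=10, dtype=tf.float32):
    a = tf.random.normal((n, n), dtype=dtype)
    b = tf.random.normal((n, n), dtype=dtype)
    _ = tf.matmul(a, b).numpy()  # warm-up / kernel compilation
    start = time.time()
    for _ in range(iters):
        c = tf.matmul(a, b)
    _ = c.numpy()  # block until the queued GPU work finishes
    elapsed = (time.time() - start) / iters
    return 2 * n**3 / elapsed / 1e12  # one n x n GEMM is ~2*n^3 FLOPs

print(f"FP32 GEMM: ~{gemm_tflops():.1f} TFLOPS")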

Comments
  • adding A100, other configs comparison for fun


    Summary tables (more details below and in comments):

    | Model | M1 7c | M1 32c | A100 (-) | V100 (-) | P100 (-) | T4 (-) | K80 (-) | Q P5000 (-) | Q M4000 (-) | Q RTX 4000 (-) |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | RN50 | 10 | 135 | 611 | 347 | 211 | 134 | F | 131 | F | F |
    | MNV2 | 23 | 352 | 269 | 187 | 125 | 193 | 94 | 181 | F | F |
    | DBERT | 15 | 120 | 761 | 187 | 149 | 94 | 47 | 109 | 39 | 129 |
    | BERTL | 1 | 18 | 136 | 31 | 16 | 15 | 4 | 17 | F | F |

    | Model | M1 7c | M1 32c | A100 (+) | V100 (+) | P100 (+) | T4 (+) | K80 (+) | Q P5000 (+) | Q M4000 (+) | Q RTX 4000 (+) |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | RN50 | 10 | 135 | 1147 | v100 | 252 | na | na | na | na | na |
    | MNV2 | 23 | 352 | 1870 | v100 | 128 | na | na | na | na | na |
    | DBERT | 15 | 120 | 1909 | v100 | 209 | na | na | na | na | na |
    | BERTL | 1 | 18 | 309 | v100 | 23 | na | na | na | na | na |






    Really good and useful work here --- thank you! I can potentially fill in the remaining values if you think this would be of interest. Note: I simply copy-pasted the content of your .py files and ran them in a notebook. Also note: accuracy will improve by using float32; as the OP indicates, the M1 can only use float32, hence the comparison should be without optimization, IMO. The RTX 3090 and A100 are somewhat similar in terms of benchmarks like these, FWIW.


    M1 on MBA (7-core, 8GB RAM). It completely froze my laptop! Even the trackpad stopped responding... first time I've ever noticed this kind of slowdown/lag on this MBA, but results are incoming! Obviously it's taking forever...

    | Model | GPU | BatchSize | Throughput | Power | Memory |
    | --- | --- | --- | --- | --- | --- |
    | ResNet50 | M1 7c | 64 | 10.3 img/sec | ? | ? |
    | MobileNetV2 | M1 7c | 128 | 22.7 img/sec | ? | ? |
    | DistilBERT | M1 7c | 64 | 15.2 seq/sec | ? | ? |
    | BERTLarge | M1 7c | 32 | 0.6 seq/sec | ? | ? |


    with optimization:

    | Model | GPU | BatchSize | Throughput | Power | Memory |
    | --- | --- | --- | --- | --- | --- |
    | ResNet50 | A100 40GB | 64 | 1147.4 img/sec | ? | ? |
    | MobileNetV2 | A100 40GB | 128 | 1869.7 img/sec | ? | ? |
    | DistilBERT | A100 40GB | 64 | 1909.3 seq/sec | ? | ? |
    | BERTLarge | A100 40GB | 32 | 309.3 seq/sec | ? | ? |


    without optimization:

    | Model | GPU | BatchSize | Throughput | Power | Memory |
    | --- | --- | --- | --- | --- | --- |
    | ResNet50 | A100 40GB | 64 | 610.8 img/sec | ? | ? |
    | MobileNetV2 | A100 40GB | 128 | 269.4 img/sec | ? | ? |
    | DistilBERT | A100 40GB | 64 | 761.1 seq/sec | ? | ? |
    | BERTLarge | A100 40GB | 32 | 135.5 seq/sec | ? | ? |


    | Model | GPU | BatchSize | Throughput | Power | Memory |
    | --- | --- | --- | --- | --- | --- |
    | ResNet50 | M1 Max 32c | 64 | 135 img/sec | 40W | 13 GB |
    | MobileNetV2 | M1 Max 32c | 128 | 352 img/sec | 37W | 15 GB |
    | DistilBERT | M1 Max 32c | 64 | 120 seq/sec | 35W | 9 GB |
    | BERTLarge | M1 Max 32c | 32 | 18 seq/sec | 36W | 14 GB |


    | Model | GPU | BatchSize | Throughput | Power |
    | --- | --- | --- | --- | --- |
    | ResNet50 | 3090 | 64 | 957 img/sec | 300W |
    | MobileNetV2 | 3090 | 128 | 1927 img/sec | 310W |
    | DistilBERT | 3090 | 64 | 1040 seq/sec | 310W |
    | BERTLarge | 3090 | 32 | 164 seq/sec | 320W |


    opened by ngam 30
  • M1 Max Thermal Throttling


    Hi @tlkh

    Thank you so much for creating and running these benchmarks. I would be interested in whether thermal throttling affects training speed (ms/step) after a few minutes.

    Could you share your training logs when running e.g. bm_rn50.py for benchmark_epochs = 20?

    opened by yadamonk 9
  • smaller batch sizes


    Thank you! Because of unified memory, I wonder if TensorFlow training on the M1 Max would be less badly impacted by smaller batch sizes than training on the RTX 3090... I would love to see comparisons of ResNet and MobileNet at batch size 16 on the M1 Max and the RTX 3090.

    opened by invoxiaehu 3
  • [ImgBot] Optimize images


    Beep boop. Your images are optimized!

    Your image file size has been reduced by 24% 🎉

    Details

    | File | Before | After | Percent reduction |
    |:--|:--|:--|:--|
    | /gpu_tflops_plot.jpg | 45.44kb | 34.56kb | 23.95% |



    ~Imgbot - Part of Optimole family

    opened by imgbot[bot] 0
  • Configure WhiteSource Bolt for GitHub


    Welcome to WhiteSource Bolt for GitHub! This is an onboarding PR to help you understand and configure settings before WhiteSource starts scanning your repository for security vulnerabilities.

    :vertical_traffic_light: WhiteSource Bolt for GitHub will start scanning your repository only once you merge this Pull Request. To disable WhiteSource Bolt for GitHub, simply close this Pull Request.


    What to Expect

    This PR contains a '.whitesource' configuration file which can be customized to your needs. If no changes were applied to this file, WhiteSource Bolt for GitHub will use the default configuration.

    Before merging this PR, make sure the Issues tab is enabled. Once you merge this PR, WhiteSource Bolt for GitHub will scan your repository and create a GitHub Issue for every vulnerability detected.

    If you do not want a GitHub Issue to be created for each detected vulnerability, you can edit the '.whitesource' file and set the 'minSeverityLevel' parameter to 'NONE'.


    :question: Got questions? Check out WhiteSource Bolt for GitHub docs. If you need any further assistance then you can also request help here.

    opened by mend-bolt-for-github[bot] 0
  • Adding pytorch benchmark?


    PyTorch just released early support for the Metal Performance Shaders (MPS) device; see the official blog. I think this issue might be off-topic, as the project name is literally tf-metal-experiments, but it would be nice to have baseline comparisons between TensorFlow and PyTorch performance.

    But note that (at least as of right now) a noticeable performance gain seems to require an M1 Ultra chip and a large enough batch size, and a lot of operations are not supported (see https://github.com/pytorch/pytorch/issues/77764)
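
    For anyone trying it, selecting the MPS device is a one-liner (a minimal sketch):

    import torch

    # Falls back to CPU on machines without a Metal-capable GPU
    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
    x = torch.randn(64, 128, device=device)
    print((x @ x.T).sum().item())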

    opened by thipokKub 0
  • Disable Eager Execution


    I can’t find in the code whether eager mode is disabled. It is enabled by default in TF2.

    This can make a huge difference in performance as eager mode is more of a debug mode, and is generally not used for performance benchmarking.

    Is eager mode disabled in these benchmarks? If not, I would be curious how the results change with it on as I have noted at times >2x differences in performance between the modes.
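
    For context, graph mode can be forced globally via the compat API (a minimal sketch, not taken from this repo's scripts); note that Keras model.fit already traces its train step into a tf.function graph, so fit-based benchmarks are typically less eager-sensitive than hand-written loops:

    import tensorflow as tf

    # Must be called before any ops run
    tf.compat.v1.disable_eager_execution()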

    opened by my-other-github-account 0
  • Issues from "Big Sur" to "Monterey"

    Hi all,

    I just updated my M1 to macOS Monterey, and my TensorFlow broke (problems with memory allocation and malloc).

    Then I reinstalled tensorflow-metal, and now it only trains on the GPU.

    In my experiments I got better results training on small batches, and so it was a lot faster for me to train in mode 'any' (I think it uses the CPU and Neural Engine). This was the code I used:

    from tensorflow.python.compiler.mlcompute import mlcompute
    mlcompute.set_mlc_device(device_name='any')

    Now with tf-metal, how is it possible to train on the CPU and/or Neural Engine?

    Thanks in advance

    opened by leoeduardo69 0
  • 2x Faster Inferencing


    This is something that I suspect will be of interest to people who land on this repo:

    https://github.com/octoml/Apple-M1-BERT

    People were able to get a 2x speed improvement over TF-Metal by using TVM to accelerate models. The downside is that TVM doesn't (yet) support model training.

    opened by TortoiseHam 0
  • add wandb support


    This PR adds wandb logging for the training script 🚀 With this tool, we can compare runs on the dashboard and get super nice results. It also automatically logs Apple system metrics like GPU power, CPU power, etc. I have also been benchmarking the processor recently!
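
    Typical usage is only a few lines (a sketch; the project name and logged values are hypothetical):

    import wandb

    wandb.init(project="tf-metal-experiments")  # hypothetical project name
    wandb.log({"throughput_img_per_sec": 135.0, "gpu_power_w": 40.0})  # example values
    wandb.finish()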


    And check the model run here

    opened by tcapelle 3
Owner
Timothy Liu
Deep Learning stuff and Open Source Enthusiast @OpenSUTD
Repo for my Tensorflow/Keras CV experiments. Mostly revolving around the Danbooru20xx dataset

SW-CV-ModelZoo Repo for my Tensorflow/Keras CV experiments. Mostly revolving around the Danbooru20xx dataset Framework: TF/Keras 2.7 Training SQLite D

null 20 Dec 27, 2022
Cascaded Pyramid Network (CPN) based on Keras (Tensorflow backend)

ML2 Takehome Project Reimplementing the paper: Cascaded Pyramid Network for Multi-Person Pose Estimation Dataset The model uses the COCO dataset which

Vo Van Tu 1 Nov 22, 2021
A Keras implementation of YOLOv4 (Tensorflow backend)

keras-yolo4 Please use the more complete version: https://github.com/miemie2013/Keras-YOLOv4 Please visit here for more complete model: https://github.com/miemie2013/Keras-YOLOv

null 384 Nov 29, 2022
OHLC Average Prediction of Apple Inc. Using LSTM Recurrent Neural Network

Stock Price Prediction of Apple Inc. Using Recurrent Neural Network OHLC Average Prediction of Apple Inc. Using LSTM Recurrent Neural Network Dataset:

Nouroz Rahman 410 Jan 5, 2023
Research shows Google collects 20x more data from Android than Apple collects from iOS. Block this non-consensual telemetry using pihole blocklists.

pihole-antitelemetry Research shows Google collects 20x more data from Android than Apple collects from iOS. Block both using these pihole lists. Proj

Adrian Edwards 290 Jan 9, 2023
Unofficial PyTorch implementation of Attention Free Transformer (AFT) layers by Apple Inc.

aft-pytorch Unofficial PyTorch implementation of Attention Free Transformer's layers by Zhai, et al. [abs, pdf] from Apple Inc. Installation You can i

Rishabh Anand 184 Dec 12, 2022
Convert Apple NeuralHash model for CSAM Detection to ONNX.

Apple NeuralHash is a perceptual hashing method for images based on neural networks. It can tolerate image resize and compression.

Asuhariet Ygvar 1.5k Dec 31, 2022
Applicator Kit for Modo allow you to apply Apple ARKit Face Tracking data from your iPhone or iPad to your characters in Modo.

Applicator Kit for Modo Applicator Kit for Modo allow you to apply Apple ARKit Face Tracking data from your iPhone or iPad with a TrueDepth camera to

Andrew Buttigieg 3 Aug 24, 2021
Demonstrates iterative FGSM on Apple's NeuralHash model.

apple-neuralhash-attack Demonstrates iterative FGSM on Apple's NeuralHash model. TL;DR: It is possible to apply noise to CSAM images and make them loo

Lim Swee Kiat 11 Jun 23, 2022
Fast, flexible and fun neural networks.

Brainstorm Discontinuation Notice Brainstorm is no longer being maintained, so we recommend using one of the many other, available frameworks, such as

IDSIA 1.3k Nov 21, 2022
Learning based AI for playing multi-round Koi-Koi hanafuda card games. Have fun.

Koi-Koi AI Learning based AI for playing multi-round Koi-Koi hanafuda card games. Platform Python PyTorch PySimpleGUI (for the interface playing vs AI

Sanghai Guan 10 Nov 20, 2022
piSTAR Lab is a modular platform built to make AI experimentation accessible and fun. (pistar.ai)

piSTAR Lab WARNING: This is an early release. Overview piSTAR Lab is a modular deep reinforcement learning platform built to make AI experimentation a

piSTAR Lab 0 Aug 1, 2022
Computer vision - fun segmentation experience using classic and deep tools :)

Computer_Vision_Segmentation_Fun Segmentation of Images and Video. Tools: pytorch Models: Classic model - GrabCut Deep model - Deeplabv3_resnet101 Flo

Mor Ventura 1 Dec 18, 2021
Convex optimization for fun and profit.

CFMM Optimal Routing This repository contains the code needed to generate the figures used in the paper Optimal Routing for Constant Function Market M

Guillermo Angeris 183 Dec 29, 2022
A small fun project using python OpenCV, mediapipe, and pydirectinput

Here I tried a small fun project using python OpenCV, mediapipe, and pydirectinput. Here we can control moves car game when yellow color come to right box (press key 'd') left box (press key 'a') left hand when thumb finger open (press key 'w') right hand when thumb finger open (press key 's') This can be improved later by: Improving press left and right to make them More realistic. Fixing some bugs in hand tracking.

Sameh Elisha 3 Nov 17, 2022
One-line your code easily but still with the fun of doing so!

One-liner-iser One-line your code easily but still with the fun of doing so! Have YOU ever wanted to write one-line Python code, but don't have the sa

null 5 May 4, 2022
Deploy tensorflow graphs for fast evaluation and export to tensorflow-less environments running numpy.

Deploy tensorflow graphs for fast evaluation and export to tensorflow-less environments running numpy. Now with tensorflow 1.0 support. Evaluation usa

Marcel R. 349 Aug 6, 2022