# tf-metal-experiments

TensorFlow Metal Backend on Apple Silicon Experiments (just for fun)
## Setup
This is tested on M1-series Apple Silicon SoCs only.
### TensorFlow 2.x
- Follow the official instructions from Apple here
- Test that your Metal GPU is working by running `tf.config.list_physical_devices("GPU")`; you should see 1 GPU present (it is not named). Later, when you actually use the GPU, there will be a more informative printout that says `Metal device set to: Apple M1 Max` or similar (see the snippet below).
- Now you should be ready to run any TF code that doesn't require external libraries.
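As a quick sanity check, here is a minimal sketch (assuming a working `tensorflow-metal` install) that lists the GPU and runs one op on it:

```python
import tensorflow as tf

# Expect one unnamed GPU entry, e.g. PhysicalDevice(name='/physical_device:GPU:0', ...)
print(tf.config.list_physical_devices("GPU"))

# Running any op on the GPU triggers the more informative
# "Metal device set to: ..." printout mentioned above.
with tf.device("/GPU:0"):
    x = tf.random.normal((4, 4))
    print(tf.linalg.matmul(x, x))
```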
### HuggingFace Transformers library
If you want to play around with Transformer models (with TF Metal backend of course), you will need to install the HuggingFace Transformers library.
- Install the `regex` library (I don't know why it has to be like this, but yeah): `python3 -m pip install --upgrade regex --no-use-pep517`. You might need to do `xcode-select --install` if the above command doesn't work.
- `pip install transformers ipywidgets`
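Once installed, a minimal sketch to check that a Transformers model runs on the Metal backend (the DistilBERT checkpoint here is just an example; any TF-supported checkpoint should work):

```python
from transformers import AutoTokenizer, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = TFAutoModel.from_pretrained("distilbert-base-uncased")

# The first forward pass should print "Metal device set to: ..."
inputs = tokenizer("Hello from Apple Silicon!", return_tensors="tf")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, 768)
```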
## Experiments and Benchmarks
After some trial and error, here are some initial benchmarks of approximately the best capability of the M1 Max. For all the cases here, increasing the batch size does not seem to increase throughput.

Power draw also doesn't seem to exceed 40W. Power draw from the GPU (averaged over 1 second) can be measured with `sudo powermetrics --samplers gpu_power -i1000 -n1`.
Model | GPU | BatchSize | Throughput | Power | Memory |
---|---|---|---|---|---|
ResNet50 | M1 Max 32c | 64 | 135 img/sec | 40W | 13 GB |
MobileNetV2 | M1 Max 32c | 128 | 352 img/sec | 37W | 15 GB |
DistilBERT | M1 Max 32c | 64 | 120 seq/sec | 35W | 9 GB |
BERTLarge | M1 Max 32c | 32 | 18 seq/sec | 36W | 14 GB |
The benchmark scripts used are included in this repo.
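For reference, a rough sketch of how such a throughput number can be measured; the actual scripts in this repo may differ in detail:

```python
import time
import tensorflow as tf

# Synthetic training benchmark: 20 batches of random data through ResNet50.
batch_size = 64
model = tf.keras.applications.ResNet50(weights=None)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

ds = tf.data.Dataset.from_tensors(
    (tf.random.normal((batch_size, 224, 224, 3)),
     tf.random.uniform((batch_size,), maxval=1000, dtype=tf.int64))
).repeat(20)

model.fit(ds, epochs=1, verbose=0)  # warm-up, includes graph tracing

start = time.time()
model.fit(ds, epochs=1, verbose=0)
elapsed = time.time() - start
print(f"{20 * batch_size / elapsed:.1f} img/sec")
```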
## Reference Benchmarks from RTX 3090
Model | GPU | BatchSize | Throughput | Power |
---|---|---|---|---|
ResNet50 | 3090 | 64 | 957 img/sec | 300W |
MobileNetV2 | 3090 | 128 | 1927 img/sec | 310W |
DistilBERT | 3090 | 64 | 1040 seq/sec | 310W |
BERTLarge | 3090 | 32 | 164 seq/sec | 320W |
For the 3090, the same scripts are used, but with additional optimizations that leverage hardware (Tensor Cores) and software (the XLA compiler) that are not present or not working on the M1. This corresponds to the following code segment being added:
```python
import tensorflow as tf
from tensorflow.keras import mixed_precision

# Enable XLA JIT compilation
tf.config.optimizer.set_jit(True)

# Use FP16 compute (Tensor Cores) while keeping FP32 variables
policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_global_policy(policy)

physical_devices = tf.config.list_physical_devices('GPU')
```
Also note that the 3090 is likely to perform better at larger batch sizes.
## Measuring Achievable TFLOPS
We can use TF to write a matrix multiplication benchmark to estimate the maximum compute performance we can get out of an M1 Max. It seems we can get around ~8 TFLOPS for large enough problem (GEMM) sizes.
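A minimal sketch of the idea for a single FP32 matmul (`tflops_sweep.py` additionally sweeps over matrix sizes to produce the plot):

```python
import time
import tensorflow as tf

n = 8192  # large enough that compute, not dispatch overhead, dominates
a = tf.random.normal((n, n))
b = tf.random.normal((n, n))

# Warm-up: the first call includes kernel setup and device placement
tf.linalg.matmul(a, b)

iters = 10
start = time.time()
for _ in range(iters):
    c = tf.linalg.matmul(a, b)
_ = c.numpy()  # block until the (in-order) GPU stream has finished
elapsed = time.time() - start

# An n x n by n x n matmul takes ~2*n^3 FLOPs (one multiply + one add per term)
tflops = 2 * n**3 * iters / elapsed / 1e12
print(f"~{tflops:.2f} TFLOPS at n={n}")
```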
The plot can be generated using `tflops_sweep.py`.
Note that FP64 and FP16 performance appears to be non-existent (the code automatically falls back to the CPU if FP64 or FP16 is specified as the data type).