I am getting an "Invalid JPEG data or crop window" error, but I have double-checked that the images in my TFRecords are JPEGs. What possible reasons could cause this error?
The code I use to check the image format in the TFRecords:
from io import BytesIO
import tensorflow as tf
from PIL import Image
from tqdm import tqdm

for tfrecord in tqdm(tfrecord_files):
    for example in tqdm(tf.python_io.tf_record_iterator(tfrecord)):
        data = tf.train.Example.FromString(example)
        encoded_jpg = data.features.feature['image/encoded'].bytes_list.value[0]
        img = Image.open(BytesIO(encoded_jpg))  # PIL inspects the image header
        assert img.format == 'JPEG'
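For completeness, a stricter check would be to feed every encoded image through TensorFlow's own decoder, since that is the op behind the failing DecodeJpeg node and PIL only inspects the header (a truncated or corrupted JPEG body could still pass the check above). A minimal sketch, assuming TF 1.x and the same tfrecord_files list as above:

import tensorflow as tf
from tqdm import tqdm

# Decode every record with tf.image.decode_jpeg, the same decoder the
# input pipeline uses, and report records that it rejects.
encoded_ph = tf.placeholder(tf.string)
decoded = tf.image.decode_jpeg(encoded_ph, channels=3)

with tf.Session() as sess:
    for tfrecord in tqdm(tfrecord_files):
        for example in tf.python_io.tf_record_iterator(tfrecord):
            data = tf.train.Example.FromString(example)
            encoded_jpg = data.features.feature['image/encoded'].bytes_list.value[0]
            try:
                sess.run(decoded, feed_dict={encoded_ph: encoded_jpg})
            except tf.errors.InvalidArgumentError as e:
                print('Bad record in %s: %s' % (tfrecord, e))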
The log from when I hit the error:
E0719 23:46:18.549607 139925925385984 error_handling.py:70] Error recorded frominfeed: From /job:worker/replica:0/task:0:
Invalid JPEG data or crop window, data size 36864
[[{{node parser/case/cond/else/_20/cond_jpeg/then/_0/DecodeJpeg}}]]
[[input_pipeline_task0/while/IteratorGetNext_1]]
E0719 23:46:18.572818 139925916993280 error_handling.py:70] Error recorded fromoutfeed: From /job:worker/replica:0/task:0:
Bad hardware status: 0x1
[[node OutfeedDequeueTuple_4 (defined at /home/panfeng/projects/tpu/models/official/mask_rcnn/distributed_executer.py:115) ]]
Original stack trace for u'OutfeedDequeueTuple_4':
  File "tpu/models/official/mask_rcnn/mask_rcnn_main.py", line 156, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python2.7/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python2.7/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "tpu/models/official/mask_rcnn/mask_rcnn_main.py", line 151, in main
    run_executer(params, train_input_fn, eval_input_fn)
  File "tpu/models/official/mask_rcnn/mask_rcnn_main.py", line 99, in run_executer
    executer.train(train_input_fn, FLAGS.eval_after_training, eval_input_fn)
  File "/home/panfeng/projects/tpu/models/official/mask_rcnn/distributed_executer.py", line 115, in train
    input_fn=train_input_fn, max_steps=self._model_params.total_steps)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2721, in train
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 362, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1154, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1184, in _train_model_default
    features, labels, ModeKeys.TRAIN, self.config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2560, in _call_model_fn
    config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1142, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2870, in _model_fn
    host_ops = host_call.create_tpu_hostcall()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1943, in create_tpu_hostcall
    device_ordinal=ordinal_id)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_tpu_ops.py", line 3190, in outfeed_dequeue_tuple
    device_ordinal=device_ordinal, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()
E0719 23:46:19.930372 139927321310656 error_handling.py:70] Error recorded fromtraining_loop: From /job:worker/replica:0/task:0:
9 root error(s) found.
(0) Cancelled: Node was closed
(1) Cancelled: Node was closed
(2) Cancelled: Node was closed
(3) Cancelled: Node was closed
(4) Cancelled: Node was closed
(5) Cancelled: Node was closed
(6) Cancelled: Node was closed
(7) Cancelled: Node was closed
(8) Invalid argument: Gradient for resnet50/batch_normalization_32/beta:0 is NaN : Tensor had NaN values
[[node CheckNumerics_98 (defined at /home/panfeng/projects/tpu/models/official/mask_rcnn/distributed_executer.py:115) ]]
0 successful operations.
0 derived errors ignored.