TensorFlow ROCm port

ROCm Software Platform

Last update: Jan 9, 2023

Overview

TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries, and community resources that lets researchers push the state-of-the-art in ML and developers easily build and deploy ML-powered applications.

TensorFlow was originally developed by researchers and engineers working on the Google Brain team within Google's Machine Intelligence Research organization to conduct machine learning and deep neural networks research. The system is general enough to be applicable in a wide variety of other domains, as well.

TensorFlow provides stable Python and C++ APIs, as well as APIs for other languages that come without a backward-compatibility guarantee.

Keep up-to-date with release announcements and security updates by subscribing to the TensorFlow announcements mailing list. See all the mailing lists.

TensorFlow ROCm port

Please follow the instructions here to set up your ROCm stack. A Docker container, rocm/tensorflow:latest (https://hub.docker.com/r/rocm/tensorflow/), is readily available:

alias drun='sudo docker run \
      -it \
      --network=host \
      --device=/dev/kfd \
      --device=/dev/dri \
      --ipc=host \
      --shm-size 16G \
      --group-add video \
      --cap-add=SYS_PTRACE \
      --security-opt seccomp=unconfined \
      -v $HOME/dockerx:/dockerx'

drun rocm/tensorflow:latest
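
The run command above passes the /dev/kfd and /dev/dri device nodes into the container. As a minimal pre-flight sketch (the helper name `missing_device_nodes` is hypothetical and not part of ROCm or Docker), a host-side check that those nodes exist before starting the container might look like this:

```python
import os

# Device nodes that the docker run alias above passes through with --device.
ROCM_DEVICE_NODES = ("/dev/kfd", "/dev/dri")


def missing_device_nodes(paths=ROCM_DEVICE_NODES):
    """Return the subset of `paths` that does not exist on this host."""
    return [p for p in paths if not os.path.exists(p)]


if __name__ == "__main__":
    missing = missing_device_nodes()
    if missing:
        print("Missing device nodes:", ", ".join(missing))
    else:
        print("All expected device nodes are present.")
```

An empty result only says the nodes exist; it does not check group membership or driver health.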

We maintain tensorflow-rocm wheel packages on PyPI. To install the tensorflow-rocm package using pip:

# Install some ROCm dependencies
sudo apt install rocm-libs rccl

# Pip3 install the whl package from PyPI
pip3 install --user tensorflow-rocm --upgrade
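
After the pip step, it can be handy to confirm which version (if any) ended up in the active environment. The following is a small sketch using only the standard library (Python 3.8 or newer); the helper name `installed_version` is made up for illustration:

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("tensorflow-rocm:", installed_version("tensorflow-rocm"))
```

This only inspects package metadata; it does not import TensorFlow or touch the GPU.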

For details on the TensorFlow ROCm port, please take a look at the ROCm-specific README file.

Install

See the TensorFlow install guide for the pip package, to enable GPU support, use a Docker container, and build from source.

To install the current release, which includes support for CUDA-enabled GPU cards (Ubuntu and Windows):

$ pip install tensorflow

A smaller CPU-only package is also available:

$ pip install tensorflow-cpu

To update TensorFlow to the latest version, add --upgrade flag to the above commands.

Nightly binaries are available for testing using the tf-nightly and tf-nightly-cpu packages on PyPI.

Try your first TensorFlow program

$ python

>>> import tensorflow as tf
>>> tf.add(1, 2).numpy()
3
>>> hello = tf.constant('Hello, TensorFlow!')
>>> hello.numpy()
b'Hello, TensorFlow!'

For more examples, see the TensorFlow tutorials.

Contribution guidelines

If you want to contribute to TensorFlow, be sure to review the contribution guidelines. This project adheres to TensorFlow's code of conduct. By participating, you are expected to uphold this code.

We use GitHub issues for tracking requests and bugs. Please see TensorFlow Discuss for general questions and discussion, and direct specific questions to Stack Overflow.

The TensorFlow project strives to abide by generally accepted best practices in open-source software development.

Continuous build status

Official Builds

Build Type	Status	Artifacts
Linux CPU		PyPI
Linux GPU		PyPI
Linux XLA		TBA
macOS		PyPI
Windows CPU		PyPI
Windows GPU		PyPI
Android
Raspberry Pi 0 and 1		Py3
Raspberry Pi 2 and 3		Py3
Libtensorflow MacOS CPU		Nightly GCS Official GCS
Libtensorflow Linux CPU		Nightly GCS Official GCS
Libtensorflow Linux GPU		Nightly GCS Official GCS
Libtensorflow Windows CPU		Nightly GCS Official GCS
Libtensorflow Windows GPU		Nightly GCS Official GCS

Community Supported Builds

Build Type	Status	Artifacts
Linux AMD ROCm GPU Nightly		Nightly
Linux AMD ROCm GPU Stable Release		Release 1.15 / 2.x
Linux s390x Nightly		Nightly
Linux s390x CPU Stable Release		Release
Linux ppc64le CPU Nightly		Nightly
Linux ppc64le CPU Stable Release		Release 1.15 / 2.x
Linux ppc64le GPU Nightly		Nightly
Linux ppc64le GPU Stable Release		Release 1.15 / 2.x
Linux aarch64 CPU Nightly (Linaro)		Nightly
Linux aarch64 CPU Stable Release (Linaro)		Release 1.x & 2.x
Linux aarch64 CPU Nightly (OpenLab) Python 3.6		Nightly
Linux aarch64 CPU Stable Release (OpenLab)		Release 1.15 / 2.x
Linux CPU with Intel oneAPI Deep Neural Network Library (oneDNN) Nightly		Nightly
Linux CPU with Intel oneAPI Deep Neural Network Library (oneDNN) Stable Release		Release 1.15 / 2.x
Red Hat® Enterprise Linux® 7.6 CPU & GPU Python 2.7, 3.6		1.13.1 PyPI

Community Supported Containers

Container Type	Status	Artifacts
TensorFlow aarch64 Neoverse-N1 CPU Stable (Linaro) Debian	Static	Release 2.3

Resources

Learn more about the TensorFlow community and how to contribute.

License

Apache License 2.0

Comments

Seemingly random shape error during gradient calculation

Edit: an important point I forgot to mention: I did not encounter this issue with the CUDA backend.

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Mint 19.1
Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
TensorFlow installed from (source or binary): binary (pypi)
TensorFlow version (use command below): v1.12.0-871-gf480b4a 1.12.0
Python version: 3.6.7
Bazel version (if compiling from source):
GCC/Compiler version (if compiling from source):
ROCm/MIOpen version: Rocm: 2.1.96, MiOpen: 1.7.1 (both installed through apt)
GPU model and memory: Radeon VII, 16GB (gfx906)

You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"

Describe the current behavior

After training a model for a variable number of epochs, the program throws an exception because of incompatible shapes during gradient calculation for a tile op inside a tf.while_loop. The exception occurs inside the _TileGrad method, which interleaves the multiples and the input shape of the original tile op by stacking, transposing and reshaping. From the behaviour I could observe by printing the input tensors and intermediate steps in _TileGrad, something seems to go wrong during the interleaving. The interleaved shape at times ends up as nonsense like [949434578 -1198049073 1 16 1 25], while something like [50 1 1 21 1 25] would be expected.

The output of the transpose at one of these exceptions was:

 [[1036548730 1061580315]
 [-1110934980 -1085778476]
 [-1085903306 1061705196]]

resulting in the following interleaved shape: [1036548730 1061580315 -1110934980 -1085778476 -1085903306 1061705196]
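
To make the interleaving step concrete, here is a pure-Python sketch of the stack, transpose and reshape sequence described above. It is not the TensorFlow implementation; the function names are hypothetical, and the multiples [50, 1, 1] and input shape [1, 21, 25] are assumed values chosen to reproduce the expected shape quoted in this report.

```python
def interleave_split_shape(multiples, input_shape):
    """Interleave tile multiples with input dims: [m0, d0, m1, d1, ...]."""
    if len(multiples) != len(input_shape):
        raise ValueError("multiples and input_shape must have the same length")
    # "stack" gives two rows; "transpose" pairs them up; "reshape" flattens.
    stacked = [list(multiples), list(input_shape)]
    transposed = [list(pair) for pair in zip(*stacked)]
    return [value for pair in transposed for value in pair]


def is_valid_shape(shape):
    """A reshape target needs non-negative sizes (the -1 wildcard is ignored here)."""
    return all(size >= 0 for size in shape)


# Assumed inputs: batch of 50 tiled over a [1, 21, 25] tensor.
expected = interleave_split_shape([50, 1, 1], [1, 21, 25])
print(expected)  # [50, 1, 1, 21, 1, 25]

# Interleaved shape copied from the failing run reported above.
observed = [1036548730, 1061580315, -1110934980, -1085778476, -1085903306, 1061705196]
print(is_valid_shape(expected), is_valid_shape(observed))  # True False
```

The negative entries in the observed shape are what the Reshape op rejects with "Size 2 must be non-negative".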

I wasn't able to find the related stack output or input shapes, so I can't tell if the shape error is caused by something further upstream. My reply to this issue includes an example with parallel_iterations=1, including all the steps.
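
When shape values come out this implausible, one quick inspection step is to look at the same 32 bits under a different type. The sketch below is an inspection aid added for illustration and does not establish a root cause; it reinterprets the int32 values quoted above as IEEE 754 float32 using only the standard library.

```python
import struct


def int32_bits_as_float32(value):
    """Reinterpret the 32 bits of a signed int32 as an IEEE 754 float32."""
    return struct.unpack("<f", struct.pack("<i", value))[0]


# Values copied from the transpose output quoted in the report.
observed = [1036548730, 1061580315, -1110934980, -1085778476, -1085903306, 1061705196]

for value in observed:
    print(value, "->", int32_bits_as_float32(value))
```

For these six inputs every decoded value has a magnitude below 1. Whether that means the shape tensor was read from, or overwritten with, floating-point data is only a hypothesis to check.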

A full stacktrace can be found at the bottom of this issue.

The error is somewhat hard to reproduce and seems to happen at random. I don't believe it is directly related to tf.while_loop, as the exception never occurred in an RNN layer.

Describe the expected behavior

No InvalidArgumentError during gradient calculation.

Code to reproduce the issue

I ran this code for about 25 minutes before the exception happened. It might not be the minimal code required to reproduce the error, but since it's not reliably reproducible I can't narrow it down easily.

import tensorflow as tf
import numpy as np

def loop_cond_dist(i, _l, hs, __ow, _dist):
    return tf.less(i, tf.shape(hs)[1])


def loop_body_dist(i, l, hs, out_weights, dist_lookup):
    dists = tf.nn.embedding_lookup(dist_lookup, tf.clip_by_value(tf.range(1, limit=tf.shape(hs)[1] - i + 1), 0, 50))
    dists = tf.expand_dims(dists, axis=0)
    dists = tf.tile(dists, [tf.shape(hs)[0], 1, 1]) #Error seems to happen in gradients for this op
    cur = tf.einsum('ijk,kl -> ijl', dists, out_weights, name="out_mul")
    pre_pad = tf.zeros([tf.shape(l)[0], tf.shape(l)[1] - tf.reduce_sum(tf.range(tf.shape(hs)[1] - i + 1)), 2])
    post_pad = tf.zeros([tf.shape(l)[0], tf.reduce_sum(tf.range(tf.shape(hs)[1] - i)), 2])
    cur = tf.concat([pre_pad, cur, post_pad], axis=1)
    i += 1
    return i, tf.add(l, cur), hs, out_weights, dist_lookup

def build():
    dist_lookup = tf.get_variable('distance_embeds', dtype=tf.float32, shape=[51, 25])
    hs = tf.placeholder(dtype=tf.float32, shape=[None, None, 50])
    out_weights = tf.get_variable('out_weights', dtype=tf.float32, shape=[25, 2])
    logits = tf.zeros([50, tf.cast(((tf.shape(hs)[1] * tf.shape(hs)[1]) - tf.shape(hs)[1]) / 2, dtype=tf.float32), 2])
    loop_vars = [1, logits, hs, out_weights, dist_lookup]
    logits = tf.while_loop(loop_cond_dist, loop_body_dist, loop_vars, name='clause_logits')[1]

    targets = tf.placeholder(tf.int32)

    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=targets, logits=logits)
    train = tf.train.AdamOptimizer(0.005).minimize(loss)
    return train, targets, hs

if __name__ == "__main__":
    with tf.Session() as sess:
        train, y, hs = build()
        sess.run([tf.global_variables_initializer()])
        while True:
            timesteps = np.random.randint(low=1, high=150)
            targets = np.random.randint(low=0, high=2, size=[50, int((timesteps*timesteps-timesteps)/2)])
            rand_hs = np.random.rand(50, timesteps, 50)
            _ = sess.run([train], {y: targets, hs: rand_hs})
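
The padding arithmetic in loop_body_dist is easy to get wrong, so here is a plain-Python sketch that mirrors only the shape bookkeeping of the script above (no TensorFlow involved). The helper names are hypothetical. It checks that, for every iteration, pre_pad, the segment width and post_pad add up to the second dimension of logits.

```python
def segment_layout(timesteps):
    """Yield (pre_pad, width, post_pad) for loop iterations i = 1 .. timesteps - 1."""
    total = (timesteps * timesteps - timesteps) // 2  # second dimension of `logits`
    for i in range(1, timesteps):
        width = timesteps - i  # length of range(1, timesteps - i + 1)
        pre_pad = total - sum(range(timesteps - i + 1))
        post_pad = sum(range(timesteps - i))
        yield pre_pad, width, post_pad


def check_layout(timesteps):
    """Return True if every iteration's pieces add up to the logits width."""
    total = (timesteps * timesteps - timesteps) // 2
    return all(
        pre >= 0 and post >= 0 and pre + width + post == total
        for pre, width, post in segment_layout(timesteps)
    )


print(list(segment_layout(4)))  # [(0, 3, 3), (3, 2, 1), (5, 1, 0)]
print(all(check_layout(n) for n in range(1, 150)))  # True
```

Since the sizes are consistent for every timestep count the script can draw, the bookkeeping in the forward pass does not by itself explain the bad shape seen in the gradient.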

Provide a reproducible test case that is the bare minimum necessary to generate the problem.

Other info / logs

--------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1333     try:
-> 1334       return fn(*args)
   1335     except errors.OpError as e:

~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
   1318       return self._call_tf_sessionrun(
-> 1319           options, feed_dict, fetch_list, target_list, run_metadata)
   1320 

~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
   1406         self._session, options, feed_dict, fetch_list, target_list,
-> 1407         run_metadata)
   1408 

InvalidArgumentError: Size 2 must be non-negative, not -1110934980
	 [[{{node gradients/clause_logits/Tile_grad/Reshape_1}} = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/clause_logits/out_mul/Reshape_grad/Reshape, gradients/clause_logits/Tile_grad/Reshape)]]
	 [[{{node gradients/clause_logits/Tile_grad/Identity/_59}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_401_gradients/clause_logits/Tile_grad/Identity", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopgradients/clause_logits/Tile_grad/StringFormat/_1)]]

During handling of the above exception, another exception occurred:

InvalidArgumentError                      Traceback (most recent call last)
~/.cargo/toponn/python/bug.py in <module>
     45             targets = np.random.randint(low=0, high=2, size=[50, int((timesteps*timesteps-timesteps)/2)])
     46             rand_hs = np.random.rand(50, timesteps, 50)
---> 47             _ = sess.run([train], {y: targets, hs: rand_hs})
     48 

~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
    927     try:
    928       result = self._run(None, fetches, feed_dict, options_ptr,
--> 929                          run_metadata_ptr)
    930       if run_metadata:
    931         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
   1150     if final_fetches or final_targets or (handle and feed_dict_tensor):
   1151       results = self._do_run(handle, final_targets, final_fetches,
-> 1152                              feed_dict_tensor, options, run_metadata)
   1153     else:
   1154       results = []

~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
   1326     if handle is None:
   1327       return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1328                            run_metadata)
   1329     else:
   1330       return self._do_call(_prun_fn, handle, feeds, fetches)

~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
   1346           pass
   1347       message = error_interpolation.interpolate(message, self._graph)
-> 1348       raise type(e)(node_def, op, message)
   1349 
   1350   def _extend_graph(self):

InvalidArgumentError: Size 2 must be non-negative, not -1110934980
	 [[node gradients/clause_logits/Tile_grad/Reshape_1 (defined at /home/seb/.cargo/toponn/python/bug.py:34)  = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/clause_logits/out_mul/Reshape_grad/Reshape, gradients/clause_logits/Tile_grad/Reshape)]]
	 [[{{node gradients/clause_logits/Tile_grad/Identity/_59}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_401_gradients/clause_logits/Tile_grad/Identity", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopgradients/clause_logits/Tile_grad/StringFormat/_1)]]

Caused by op 'gradients/clause_logits/Tile_grad/Reshape_1', defined at:
  File "/home/seb/.pyenv/versions/3.6.7/bin/ipython", line 10, in <module>
    sys.exit(start_ipython())
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/__init__.py", line 125, in start_ipython
    return launch_new_instance(argv=argv, **kwargs)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/traitlets/config/application.py", line 657, in launch_instance
    app.initialize(argv)
  File "</home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/decorator.py:decorator-gen-112>", line 2, in initialize
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/traitlets/config/application.py", line 87, in catch_config_error
    return method(app, *args, **kwargs)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/terminal/ipapp.py", line 323, in initialize
    self.init_code()
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/core/shellapp.py", line 288, in init_code
    self._run_cmd_line_code()
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/core/shellapp.py", line 408, in _run_cmd_line_code
    self._exec_file(fname, shell_futures=True)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/core/shellapp.py", line 340, in _exec_file
    raise_exceptions=True)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2683, in safe_execfile
    self.compile if shell_futures else None)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/utils/py3compat.py", line 188, in execfile
    exec(compiler(f.read(), fname, 'exec'), glob, loc)

  File "/home/seb/.cargo/toponn/python/bug.py", line 39, in <module>
    train, y, hs = build()
  File "/home/seb/.cargo/toponn/python/bug.py", line 34, in build
    train = tf.train.AdamOptimizer(0.005).minimize(loss)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 400, in minimize
    grad_loss=grad_loss)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 519, in compute_gradients
    colocate_gradients_with_ops=colocate_gradients_with_ops)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 674, in gradients
    unconnected_gradients)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 864, in _GradientsHelper
    lambda: grad_fn(op, *out_grads))
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 409, in _MaybeCompile
    return grad_fn()  # Exit early
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 864, in <lambda>
    lambda: grad_fn(op, *out_grads))
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/array_grad.py", line 599, in _TileGrad
    input_grad = math_ops.reduce_sum(array_ops.reshape(grad, split_shape), axes)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 6482, in reshape
    "Reshape", tensor=tensor, shape=shape, name=name)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

...which was originally created as op 'clause_logits/Tile', defined at:
  File "/home/seb/.pyenv/versions/3.6.7/bin/ipython", line 10, in <module>
    sys.exit(start_ipython())
[elided 10 identical lines from previous traceback]
  File "/home/seb/.cargo/toponn/python/bug.py", line 39, in <module>
    train, y, hs = build()
  File "/home/seb/.cargo/toponn/python/bug.py", line 29, in build
    logits = tf.while_loop(loop_cond_dist, loop_body_dist, loop_vars, name='clause_logits', parallel_iterations=250)[1]
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3295, in while_loop
    return_same_structure)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3007, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2942, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "/home/seb/.cargo/toponn/python/bug.py", line 13, in loop_body_dist
    dists = tf.tile(dists, [tf.shape(hs)[0], 1, 1])
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 8805, in tile
    "Tile", input=input, multiples=multiples, name=name)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Size 2 must be non-negative, not -1110934980
	 [[node gradients/clause_logits/Tile_grad/Reshape_1 (defined at /home/seb/.cargo/toponn/python/bug.py:34)  = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/clause_logits/out_mul/Reshape_grad/Reshape, gradients/clause_logits/Tile_grad/Reshape)]]
	 [[{{node gradients/clause_logits/Tile_grad/Identity/_59}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_401_gradients/clause_logits/Tile_grad/Identity", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopgradients/clause_logits/Tile_grad/StringFormat/_1)]]

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

bug

opened by sebpuetz 130

Tensorflow 2.0 AMD support

I would be curious whether TensorFlow 2.0 works with the AMD Radeon VII.

Also, if it does, are there any benchmark comparisons with the 2080 Ti on some standard network, to see whether we should invest in Radeon VII clusters?

opened by Cvikli 58

Memory access fault by GPU node-1 (Agent handle: 0x2e0dbf0) on address 0x6dccc0000. Reason: Page not present or supervisor privilege.

Hello guys,

I am having an issue running ROCm TensorFlow, with details as follows:

System information

Have I written custom code: No. I am trying to run these Keras/TensorFlow projects:

Keras Mask RCNN: https://github.com/matterport/Mask_RCNN
Keras SSD: https://github.com/pierluigiferrari/ssd_keras

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04.1 LTS

TensorFlow installed from whl package : pip3 install --user tensorflow-rocm

TensorFlow version (use command below): 1.12

Python version: 3.6.7

ROCM version : 2.0

CPU Memory: 16GB

GPU model and memory: Radeon RX 580 8 GB, recognized as:

name: Ellesmere [Radeon RX 470/480]
AMDGPU ISA: gfx803
memoryClockRate (GHz) 1.34
pciBusID 0000:01:00.0
Total memory: 8.00GiB
Free memory: 7.75GiB

Describe the current behavior

Epoch 1/30
2019-01-29 22:25:46.392668: I tensorflow/core/kernels/conv_grad_input_ops.cc:1023] running auto-tune for Backward-Data
2019-01-29 22:25:46.446704: I tensorflow/core/kernels/conv_grad_filter_ops.cc:975] running auto-tune for Backward-Filter
Memory access fault by GPU node-1 (Agent handle: 0x2e0dbf0) on address 0x6dccc0000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)

Describe the expected behavior

Training runs normally until epoch 30/30.

Code to reproduce the issue

Keras Mask RCNN:

python3 platno.py train --dataset=/home/path/to/dataset --weights=coco

This always fails with a core dump and the message above.

Keras SSD:

python3 ssd300_training.py runs normally after lowering the batch size from 32 to 8.

python3 ssd7_training.py dumps core even with the batch size lowered to 1.

Other info / logs

I have tried setting some environment variables for debugging, but still get the error:

HSA_ENABLE_SDMA=0
HSA_ENABLE_INTERRUPT=0
HSA_SVM_GUARD_PAGES=0
HSA_DISABLE_CACHE=1
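
If the HSA_* variables listed in this report are set from Python rather than from the shell, they have to be in place before the GPU runtime is loaded, which in practice means before TensorFlow is imported (an assumption about when the runtime reads them). A minimal sketch, with a hypothetical helper name, and with no claim that these settings resolve the fault:

```python
import os

# Settings quoted in the report above; whether they help is not established.
HSA_DEBUG_SETTINGS = {
    "HSA_ENABLE_SDMA": "0",
    "HSA_ENABLE_INTERRUPT": "0",
    "HSA_SVM_GUARD_PAGES": "0",
    "HSA_DISABLE_CACHE": "1",
}


def apply_settings(settings, env=os.environ):
    """Copy `settings` into `env` and return the names whose value changed."""
    changed = [name for name, value in settings.items() if env.get(name) != value]
    env.update(settings)
    return changed


if __name__ == "__main__":
    print("Changed:", apply_settings(HSA_DEBUG_SETTINGS))
    # Import TensorFlow only after this point so the runtime sees the values.
```

Passing a plain dict as `env` makes it possible to try the helper without modifying the real process environment.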

Please advise on how to resolve this problem.

Thanks and regards

bug gfx803

opened by fendiwira 44

errors in pin-in-place path in HCC unpinned copy engine

Using the latest develop-upstream branch and the latest benchmarks master. Running the tf_cnn_benchmarks.py code like so:

python tf_cnn_benchmarks.py --num_gpus=4 --batch_size=64 --model=resnet50 --variable_update=parameter_server --local_parameter_device=cpu

This eventually produces the following message during warm-up:

terminate called after throwing an instance of 'Kalmar::runtime_exception'
  what():  HCC unpinned copy engine error
Aborted (core dumped)

If you set --local_parameter_device=gpu instead, the problem doesn't manifest.

However, the problem happens again even with --local_parameter_device=gpu during distributed training. Running 1 worker and 1 server like so:

# worker
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 --batch_size=64 --model=resnet50 --variable_update=distributed_replicated --ps_hosts=prj47-rack-05:50000 --worker_hosts=prj47-rack-02:50001 --job_name=worker --task_index=0 --server_protocol=grpc

# ps
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 --batch_size=64 --model=resnet50 --variable_update=distributed_replicated --ps_hosts=prj47-rack-05:50000 --worker_hosts=prj47-rack-02:50001 --job_name=ps --task_index=0 --server_protocol=grpc
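
The worker and ps invocations differ only in --job_name. A small sketch that generates both command lines from one shared flag table makes that explicit; the helper name is hypothetical, the flag values are copied from the commands above, and the flag order is assumed not to matter to the benchmark script. It needs Python 3.8 or newer for shlex.join.

```python
import shlex

# Flags shared by the worker and ps commands quoted above.
COMMON_FLAGS = {
    "local_parameter_device": "gpu",
    "num_gpus": "4",
    "batch_size": "64",
    "model": "resnet50",
    "variable_update": "distributed_replicated",
    "ps_hosts": "prj47-rack-05:50000",
    "worker_hosts": "prj47-rack-02:50001",
    "task_index": "0",
    "server_protocol": "grpc",
}


def benchmark_command(job_name, flags=COMMON_FLAGS):
    """Build the argv list for one job; only --job_name differs between jobs."""
    argv = ["python", "tf_cnn_benchmarks.py"]
    argv += [f"--{name}={value}" for name, value in flags.items()]
    argv.append(f"--job_name={job_name}")
    return argv


for job in ("worker", "ps"):
    print(shlex.join(benchmark_command(job)))
```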

At least with the distributed training, my guess is that tensors are moving from GPU to CPU prior to being packed into protobufs and shipped via grpc. Not sure why this is also happening during warm-up except that I specified the parameter device to be CPU, forcing a device to host copy for storing the params.

misc system info

c++ (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609

lscpu

AMD EPYC 7551 32-Core Processor

uname -a

Linux prj47-rack-02 4.13.0-43-generic #48~16.04.1-Ubuntu SMP Thu May 17 12:56:46 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

LD_LIBRARY_PATH /home/jdaily/openmpi-3.1.0-install/lib
DYLD_LIBRARY_PATH is unset

rocm-clang-ocl/Ubuntu 16.04,now 0.3.0-c1b678e amd64 [installed,automatic]
rocm-dev/Ubuntu 16.04,now 1.8.151 amd64 [installed]
rocm-device-libs/Ubuntu 16.04,now 0.0.1 amd64 [installed]
rocm-dkms/Ubuntu 16.04,now 1.8.151 amd64 [installed]
rocm-libs/Ubuntu 16.04,now 1.8.151 amd64 [installed]
rocm-opencl/Ubuntu 16.04,now 1.2.0-2018053053 amd64 [installed]
rocm-opencl-dev/Ubuntu 16.04,now 1.2.0-2018053053 amd64 [installed]
rocm-profiler/Ubuntu 16.04,now 5.4.6797 amd64 [installed]
rocm-smi/Ubuntu 16.04,now 1.0.0-42-g0ae1c36 amd64 [installed,automatic]
rocm-utils/Ubuntu 16.04,now 1.8.151 amd64 [installed]
rocminfo/now 1.0.7 amd64 [installed,local]

opened by jeffdaily 39

Crash when performing inference

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): I suppose. I'm using a software package that uses tensorflow-gpu under the hood, but I manually installed tensorflow-rocm into their environment.

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04

TensorFlow installed from (source or binary): from pypi binary

TensorFlow version (use command below): 1.14.1 (Problem also occurs on 1.14.0, I can't test it on 1.13.x)

Python version: 3.6.2

ROCm/MIOpen version:

miopen-hip/Ubuntu 16.04,now 2.0.1.7405-rocm-rel-2.7-22-4e39a83 amd64 [installed]
  AMD's DNN Library
miopen-opencl/Ubuntu 16.04 2.0.1.7405-rocm-rel-2.7-22-4e39a83 amd64
  AMD's DNN Library
miopengemm/Ubuntu 16.04,now 1.1.6.645-rocm-rel-2.7-22-6275a87 amd64 [installed]
  A tool for generating OpenCL matrix multiplication (GEMM) kernels

GPU model and memory: Vega 7 16GB

Describe the current behavior

I don't know the entire lingo, as I'm new to all of this and I didn't implement any of the tensorflow stuff.

So I used a software package that uses tensorflow-gpu to perform deep learning. My colleagues have generated a few networks, and these work on their machines and on others that have an NVIDIA card.

When I use those trained networks for inference on my computer with tensorflow-rocm, it crashes the whole machine: it reboots itself.

The networks are saved in h5 format. I haven't tried just generating a new network and training a new network.

Describe the expected behavior

For it to not crash my whole computer. At least only crash python.

Other info / logs

It's been a while since I installed ROCm, so I don't remember how I did it, but is it normal that my MIOpen packages are labeled Ubuntu 16.04 when I'm on Ubuntu 18.04?

The Vega is also driving my desktop environment.

miopen

opened by thejinx0r 30

Unable to find a suitable algorithm for doing forward convolution

Hi, I get a weird error about Unable to find a suitable algorithm for doing forward convolution when I run the session. From what I understand, there is a kernel compiled with -DLOCAL_MEM_SIZE=19008 that is not something coming from my code. Even with a batch size of 1 I get the same error.

ml_1  | 2018-08-23 21:03:11.045474: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1451] Found device 0 with properties:
ml_1  | name: Device 687f
ml_1  | AMDGPU ISA: gfx900
ml_1  | memoryClockRate (GHz) 1.63
ml_1  | pciBusID 0000:0c:00.0
ml_1  | Total memory: 7.98GiB
ml_1  | Free memory: 7.73GiB
ml_1  | 2018-08-23 21:03:11.045489: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1562] Adding visible gpu devices: 0
ml_1  | 2018-08-23 21:03:11.045503: I tensorflow/core/common_runtime/gpu/gpu_device.cc:989] Device interconnect StreamExecutor with strength 1 edge matrix:
ml_1  | 2018-08-23 21:03:11.045510: I tensorflow/core/common_runtime/gpu/gpu_device.cc:995]      0
ml_1  | 2018-08-23 21:03:11.045516: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1008] 0:   N
ml_1  | 2018-08-23 21:03:11.045547: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1124] Created TensorFlow device (/device:GPU:0 with 7524 MB memory) -> physical GPU (device: 0, name: Device 687f, pci bus id: 0000:0c:00.0)
ml_1  | 2018-08-23 21:03:26.581328: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1562] Adding visible gpu devices: 0
ml_1  | 2018-08-23 21:03:26.581382: I tensorflow/core/common_runtime/gpu/gpu_device.cc:989] Device interconnect StreamExecutor with strength 1 edge matrix:
ml_1  | 2018-08-23 21:03:26.581396: I tensorflow/core/common_runtime/gpu/gpu_device.cc:995]      0
ml_1  | 2018-08-23 21:03:26.581407: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1008] 0:   N
ml_1  | 2018-08-23 21:03:26.581440: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1124] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7524 MB memory) -> physical GPU (device: 0, name: Device 687f, pci bus id: 0000:0c:00.0)
ml_1  | 2018-08-23 21:04:20.430885: I tensorflow/core/kernels/conv_grad_input_ops.cc:1007] running auto-tune for Backward-Data
ml_1  | 2018-08-23 21:04:20.495395: I tensorflow/core/kernels/conv_grad_input_ops.cc:1007] running auto-tune for Backward-Data
ml_1  | 2018-08-23 21:04:20.557689: I tensorflow/core/kernels/conv_grad_input_ops.cc:1007] running auto-tune for Backward-Data
ml_1  | error: local memory limit exceeded (76032) in Im2Col
ml_1  | MIOpen Error: /data/repo/MIOpen/src/tmp_dir.cpp:18: Can't execute cd /tmp/miopen-MIOpenUtilKernels.cl-faa6-605d-295b-fc2e; /opt/rocm/bin/clang-ocl  -DNUM_CH_PER_WG=1 -DNUM_IM_BLKS_X=3 -DNUM_IM_BLKS=9 -DLOCAL_MEM_SIZE=19008 -DSTRIDE_GT_1=1 -DTILE_SZ_X=32 -DTILE_SZ_Y=8 -DUSE_IM_OFF_GUARD=1 -DMIOPEN_USE_FP16=0 -DMIOPEN_USE_FP32=1 -mcpu=gfx900 -Wno-everything MIOpenUtilKernels.cl -o /tmp/miopen-MIOpenUtilKernels.cl-faa6-605d-295b-fc2e/MIOpenUtilKernels.cl.o
ml_1  | 2018-08-23 21:04:20.879002: F tensorflow/stream_executor/rocm/rocm_dnn.cc:1803] Check failed: status == miopenStatusSuccess (7 vs. 0)Unable to find a suitable algorithm for doing forward convolution
ml_1  | [I 21:04:21.291 NotebookApp] KernelRestarter: restarting kernel (1/5), keep random ports
ml_1  | WARNING:root:kernel 0d0fea33-23e8-4e97-8fa9-0bda0c19ea6f restarted
ml_1  | [I 21:04:37.435 NotebookApp] Saving file at /Road Segmentation.ipynb
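
The compiler error and the kernel options in the log above are consistent with each other if each local-memory element is a 4-byte float (the log shows -DMIOPEN_USE_FP32=1). A rough arithmetic sketch follows; the 64 KiB per-work-group limit used for comparison is an assumption and is not stated anywhere in the log.

```python
BYTES_PER_FLOAT32 = 4
ASSUMED_LOCAL_MEM_LIMIT = 64 * 1024  # assumption: 64 KiB of local memory per work-group


def local_mem_bytes(num_elements, bytes_per_element=BYTES_PER_FLOAT32):
    """Bytes of local memory needed for `num_elements` elements."""
    return num_elements * bytes_per_element


requested = local_mem_bytes(19008)  # -DLOCAL_MEM_SIZE=19008 from the log
print(requested)  # 76032, the figure in "local memory limit exceeded (76032)"
print(requested > ASSUMED_LOCAL_MEM_LIMIT)  # True under the assumed limit
```

This suggests the failure comes from the Im2Col kernel's local buffer size rather than from the batch size, which would fit the observation that a batch size of 1 does not help.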

miopen

opened by Sumenia 30

Dramatic difference in perf between 1080ti and VEGA FE

Hey there,

I'm trialing some code to benchmark my VEGA vs a colleague's 1080ti.

I've noticed some very peculiar differences in time to fit per epoch, I'm guessing I'm messing up in some way.

For one epoch on AMD 2990WX ~400 seconds

For one epoch on 1080ti < 100s

For one epoch on VEGA FE > 40 minutes
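
To put the three timings on one scale, here is a small sketch that converts them to seconds and computes the slowdown. The GPU figures are bounds ("< 100s", "> 40 minutes"), so the ratios are lower bounds rather than measurements; the variable names are only illustrative:

```python
# Epoch times quoted above, in seconds. The GPU figures are bounds, not exact values.
cpu_2990wx_s = 400        # "~400 seconds"
gtx_1080ti_s = 100        # upper bound: "< 100s"
vega_fe_s = 40 * 60       # lower bound: "> 40 minutes"

# The bounds point in opposite directions, so these ratios are lower bounds.
slowdown_vs_1080ti = vega_fe_s / gtx_1080ti_s
slowdown_vs_cpu = vega_fe_s / cpu_2990wx_s
print(f"Vega FE vs 1080 Ti: at least {slowdown_vs_1080ti:.0f}x slower")
print(f"Vega FE vs 2990WX:  about {slowdown_vs_cpu:.0f}x slower or more")
```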
enhancement question

opened by PhilipDeegan 28
Integrate rocPRIM 0.3.1 milestone
A couple of notes:

rocPRIM is still marked experimental

in this PR, most of the reduction kernels, l2loss, and softmax are converted from cub to rocPRIM

complex types are not supported out of the box - hence no complex reduction yet

some cub kernels are not yet converted (where, topk, ...) due to issues w/ rocPRIM and/or the TF interface to cub. There will be follow-up PRs for these.

all rocPRIM kernels in this PR are marked P (for in Progress) in the documentation until we have confirmed it works and the patch is accepted, then I'll mark done
opened by iotamudelta 28
Fix for finding RCCL that works on 5.1 and 5.2

With the various ROCm libs moving around, one fix got added that wasn't backward compatible with ROCm 5.1. This fixes the TF2.7 build on 5.1 and works on 5.2 (assuming we're finally settled on lib and include locations).

opened by jayfurmanek 27
low performance ?

Taking these benchmarks as a reference, Inception v3 runs far slower on the Vega 56 than on the Nvidia 1080.

I'm a bit disappointed with the performance of my cards. Are these results normal?

python3.5 benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_gpus=2 --model resnet50 --batch_size 64

--> total images/sec: 192.75

python3.5 benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_gpus=2 --model inception3 --batch_size 64

--> total images/sec: 92.29

CPU: AMD Threadripper 1900X
GPU 1: AMD Vega 56
GPU 2: AMD Vega 56
Memory: 32 GB DDR4

opened by Sumenia 27

Building (and using) libtensorflow.so

System information

OS Platform and Distribution: Linux Mint 19.1
TensorFlow installed from (source or binary): Source
TensorFlow version: 1.12
Python version: 3.6.7
Installed using virtualenv? pip? conda?: pyenv
Bazel version (if compiling from source): 0.16.0, 0.19.2 and 0.21.0
GCC/Compiler version (if compiling from source): 7.3.0
ROCm version: 2.1
GPU model and memory: Radeon VII, 16GB

Describe the problem

I want to use TensorFlow from Rust; to do so I need to build the libtensorflow.so shared library. Compilation goes through on r1.12, but when trying to execute the graph I get a runtime exception (see the "Any other info / logs" section).

I don't encounter any issues with TensorFlow in Python; running a graph and training a model works like a charm there, although that build was installed from PyPI rather than compiled from source.

Provide the exact sequence of commands / steps that you executed before running into the problem

Install bazel 0.19.2 as recommended in #304
git clone -b r1.12-rocm git@github.com:ROCmSoftwarePlatform/tensorflow-upstream
cd tensorflow-upstream
./configure (answer n to everything except ROCm support)
bazel build --config=opt --config=rocm --action_env=HIP_PLATFORM=hcc tensorflow:libtensorflow.so
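
A quick way to confirm that the library produced by the build command above can be loaded at all is to call the C API function `TF_Version` through `ctypes`. This is a minimal sketch: the helper name `tf_c_api_version` is hypothetical, and the path `bazel-bin/tensorflow/libtensorflow.so` is an assumed output location that may differ in your checkout.

```python
import ctypes
import os

def tf_c_api_version(lib_path):
    """Return the TF_Version() string of a libtensorflow shared library, or None if the file is missing."""
    if not os.path.isfile(lib_path):
        return None
    lib = ctypes.CDLL(lib_path)
    lib.TF_Version.restype = ctypes.c_char_p  # C signature: const char* TF_Version(void)
    return lib.TF_Version().decode()

# Assumed output location of the bazel target above; adjust to your checkout.
print(tf_c_api_version("bazel-bin/tensorflow/libtensorflow.so"))
```

This only checks that the shared object loads and exports the C API. It does not launch any GPU kernels, so it is not expected to reproduce the "Missing metadata for __global__ function" error shown in the logs.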

Any other info / logs

Runtime exception:

2019-02-09 11:14:33.291267: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1531] Found device 0 with properties: 
name: Device 66af
AMDGPU ISA: gfx906
memoryClockRate (GHz) 1.802
pciBusID 0000:28:00.0
Total memory: 15.98GiB
Free memory: 15.73GiB
2019-02-09 11:14:33.291334: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1642] Adding visible gpu devices: 0
2019-02-09 11:14:33.291371: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-09 11:14:33.291383: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1059]      0 
2019-02-09 11:14:33.291391: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1072] 0:   N 
2019-02-09 11:14:33.291489: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1189] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15306 MB memory) -> physical GPU (device: 0, name: Device 66af, pci bus id: 0000:28:00.0)
terminate called after throwing an instance of 'std::runtime_error'
  what():  Missing metadata for __global__ function: _ZN10tensorflow7functor28FillPhiloxRandomKernelLaunchINS_6random19UniformDistributionINS2_12PhiloxRandomEfEEEEvS4_PNT_17ResultElementTypeExS6_
[1]    11952 abort (core dumped)  LD_PRELOAD="/home/seb/.libtf/libtensorflow.so"

hipconfig

HIP version  : 1.5.19025

== hipconfig
HIP_PATH     : /opt/rocm/hip
HIP_PLATFORM : hcc
CPP_CONFIG   :  -D__HIP_PLATFORM_HCC__=   -I/opt/rocm/hip/include -I/opt/rocm/hcc/include

== hcc
HSA_PATH     : /opt/rocm/hsa
HCC_HOME     : /opt/rocm/hcc
HCC clang version 8.0.0 (ssh://gerritgit/compute/ec/hcc-tot/clang 683c680a6bff215baa3bd9d3099ba1a43e24cf2e) (ssh://gerritgit/lightning/ec/llvm 6e349ce344586b4254654aea8f34444a13aedb67) (based on HCC 1.3.19045-fea3e2b-683c680-6e349ce )
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm/hcc/bin
LLVM (http://llvm.org/):
  LLVM version 8.0.0svn
  Optimized build.
  Default target: x86_64-unknown-linux-gnu
  Host CPU: znver1

  Registered Targets:
    amdgcn - AMD GCN GPUs
    r600   - AMD GPUs HD2XXX-HD6XXX
    x86    - 32-bit X86: Pentium-Pro and above
    x86-64 - 64-bit X86: EM64T and AMD64
HCC-cxxflags :  -hc -std=c++amp -I/opt/rocm/hcc/include
HCC-ldflags  :  -hc -std=c++amp -L/opt/rocm/hcc/lib -Wl,--rpath=/opt/rocm/hcc/lib -ldl -lm -lpthread -lhc_am -Wl,--whole-archive -lmcwamp -Wl,--no-whole-archive

=== Environment Variables
PATH=/opt/rocm/hcc/bin:/opt/rocm/hip/bin:/home/seb/.pyenv/shims:/home/seb/.cargo/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/opt/rocm/bin:/opt/rocm/profiler/bin:/opt/rocm/opencl/bin/x86_64:/home/seb/.pyenv/bin
LD_LIBRARY_PATH=/opt/rocm/opencl/lib
HIP_PATH=/opt/rocm/hip
HCC_HOME=/opt/rocm/hcc

== Linux Kernel
Hostname     : seb-desktop
Linux seb-desktop 4.15.0-45-generic #48-Ubuntu SMP Tue Jan 29 16:28:13 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
No LSB modules are available.
Distributor ID:	LinuxMint
Description:	Linux Mint 19.1 Tessa
Release:	19.1
Codename:	tessa

hcc --version

HCC clang version 8.0.0 (ssh://gerritgit/compute/ec/hcc-tot/clang 683c680a6bff215baa3bd9d3099ba1a43e24cf2e) (ssh://gerritgit/lightning/ec/llvm 6e349ce344586b4254654aea8f34444a13aedb67) (based on HCC 1.3.19045-fea3e2b-683c680-6e349ce )
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm/hcc/bin

rocminfo

=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (number of timestamp)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 7 2700X Eight-Core Processor
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0                                  
  Queue Min Size:          0                                  
  Queue Max Size:          0                                  
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768KB                            
  Chip ID:                 0                                  
  Cacheline Size:          64                                 
  Max Clock Frequency (MHz):3700                               
  BDFID:                   0                                  
  Compute Unit:            16                                 
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    49448920KB                         
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Acessible by all:        TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    49448920KB                         
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Acessible by all:        TRUE                               
  ISA Info:                
    N/A                      
*******                  
Agent 2                  
*******                  
  Name:                    gfx906                             
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128                                
  Queue Min Size:          4096                               
  Queue Max Size:          131072                             
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16KB                               
  Chip ID:                 26287                              
  Cacheline Size:          64                                 
  Max Clock Frequency (MHz):1802                               
  BDFID:                   10240                              
  Compute Unit:            60                                 
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      FALSE                              
  Wavefront Size:          64                                 
  Workgroup Max Size:      1024                               
  Workgroup Max Size Per Dimension:
    Dim[0]:                  67109888                           
    Dim[1]:                  671089664                          
    Dim[2]:                  0                                  
  Grid Max Size:           4294967295                         
  Waves Per CU:            40                                 
  Max Work-item Per CU:    2560                               
  Grid Max Size per Dimension:
    Dim[0]:                  4294967295                         
    Dim[1]:                  4294967295                         
    Dim[2]:                  4294967295                         
  Max number Of fbarriers Per Workgroup:32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16760832KB                         
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Acessible by all:        FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64KB                               
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Acessible by all:        FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx906          
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Dimension: 
        Dim[0]:                  67109888                           
        Dim[1]:                  1024                               
        Dim[2]:                  16777217                           
      Workgroup Max Size:      1024                               
      Grid Max Dimension:      
        x                        4294967295                         
        y                        4294967295                         
        z                        4294967295                         
      Grid Max Size:           4294967295                         
      FBarrier Max Size:       32                                 
*** Done ***
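
Several numeric fields in the rocminfo output for Agent 2 can be cross-checked against the TensorFlow device log earlier in this report. A minimal sketch, assuming the BDFID uses the standard PCI packing (bus, device, function) and PCI domain 0000:

```python
# Values copied from the rocminfo output for Agent 2 (gfx906).
chip_id = 26287
bdfid = 10240
pool1_size_kb = 16760832

# Chip ID in hex, to compare with "name: Device 66af" in the TensorFlow log.
chip_hex = format(chip_id, "04x")

# Assumed standard PCI packing: bus in bits 15..8, device in bits 7..3, function in bits 2..0.
bus, device, function = bdfid >> 8, (bdfid >> 3) & 0x1F, bdfid & 0x7
pci_addr = f"0000:{bus:02x}:{device:02x}.{function}"  # domain 0000 is assumed

# Pool size in GiB, to compare with "Total memory: 15.98GiB".
pool1_gib = pool1_size_kb / (1024 * 1024)

print(chip_hex, pci_addr, f"{pool1_gib:.2f}GiB")
```

All three values agree with the log (Device 66af, pci bus id 0000:28:00.0, 15.98GiB), which suggests the runtime and TensorFlow see the same device.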

enhancement

opened by sebpuetz 26

Memory access fault by GPU node-2 (Agent handle: 0x38b6960) on address 0x1000. Reason: Page not present or supervisor privilege.

Issue Type

Bug

Source

source

Tensorflow Version

tensorflow-rocm 2.2

Custom Code

Yes

OS Platform and Distribution

Ubuntu 20.04

Mobile device

No response

Python version

3.8

Bazel version

No response

GCC/Compiler version

No response

CUDA/cuDNN version

ROCm v3.5

GPU model and memory

2 x RX 480 4 GB

Current Behaviour?

When switching my LSTM layers from the 'relu' activation function to 'tanh', I get the following error: `Memory access fault by GPU node-2 (Agent handle: 0x38b6960) on address 0x1000. Reason: Page not present or supervisor privilege.`

When the error doesn't occur (i.e. when the program works), these warnings are printed at startup:
WARNING:tensorflow:Layer lstm will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU
WARNING:tensorflow:Layer lstm_1 will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU
WARNING:tensorflow:Layer lstm_2 will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU
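
The warnings above follow from the layer arguments. The Keras LSTM documentation lists the conditions under which the fused (cuDNN-style) kernel is used, and `activation='relu'` fails the first one. The sketch below encodes only the argument-level conditions; the helper name is hypothetical, and the documented list also covers input masking, which is not modelled here.

```python
def meets_fused_lstm_criteria(activation="tanh", recurrent_activation="sigmoid",
                              recurrent_dropout=0.0, unroll=False, use_bias=True):
    """Check the layer arguments that Keras documents as required for the fused LSTM kernel."""
    return (activation == "tanh"
            and recurrent_activation == "sigmoid"
            and recurrent_dropout == 0
            and not unroll
            and use_bias)

print(meets_fused_lstm_criteria(activation="relu"))  # layers that print the warning
print(meets_fused_lstm_criteria(activation="tanh"))  # layers after switching to tanh
```

If that reading is right, switching to 'tanh' moves the layers from the generic GPU kernel onto the fused kernel path, which may be where the memory access fault originates. That is an inference from the warnings, not something the report confirms.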

Standalone code to reproduce the issue

import os
import random
import time

import numpy as np
import tensorflow as tf
from tqdm import tqdm
from collections import deque

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'


print(tf.config.experimental.list_physical_devices("GPU"))
mirrored_strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(tf.distribute.experimental.CollectiveCommunication.RING)

window_size = 5
episodes = 20
batch_size = 32
NAME = f"Blackstonev1-LSTM-32x64x64-{int(time.time())}"
tensorboard = tf.keras.callbacks.TensorBoard(log_dir=os.path.join("logs", NAME))

class AIAgent:
    def __init__(self, state_size, action_space=3, model_name=NAME):  # Stay, Buy, Sell
        self.state_size = state_size
        self.action_space = action_space
        self.memory = deque(maxlen=2000)
        self.inventory = []
        self.margin_inventory = []
        self.model_name = model_name

        self.gamma = 0.95
        self.epsilon = 1.0
        self.epsilon_final = 0.05
        self.epsilon_decay = 0.995

        self.model = self.model_builder()

    def model_builder(self):
        with mirrored_strategy.scope():
            model = tf.keras.models.Sequential()

            model.add(tf.keras.Input(shape=(window_size, 2)))

            model.add(tf.keras.layers.LSTM(units=32, activation='relu', return_sequences=True))
            model.add(tf.keras.layers.LSTM(units=64, activation='relu', return_sequences=True))
            model.add(tf.keras.layers.LSTM(units=64, activation='relu', return_sequences=False))
            model.add(tf.keras.layers.Dense(units=self.action_space, activation='linear'))
            model.compile(loss='mse', optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))

        return model

    def trade(self, state):
        rdm = random.random()
        if rdm <= self.epsilon:
            rdm_act = random.randrange(self.action_space)
            print(f"random: {rdm_act}")
            return rdm_act

        actions = self.model.predict(state)
        argmax = np.argmax(actions[0])
        print(f'model: {argmax}')
        return argmax

    def batch_train(self, batch_size):
        batch = []
        for i in range(len(self.memory) - batch_size + 1, len(self.memory)):
            batch.append(self.memory[i])

        for state, action, reward, next_state, done in batch:
            reward = reward

            if not done:
                reward = reward + self.gamma * np.amax(self.model.predict(next_state)[0])

            target = self.model.predict(state)
            target[0][action] = reward

            self.model.fit(state, target, epochs=1, verbose=0, callbacks=[tensorboard])

        if self.epsilon > self.epsilon_final:
            self.epsilon *= self.epsilon_decay


def state_creator(data, timestep, window_size):
    starting_id = timestep - window_size + 1

    if starting_id >= 0:
        windowed_data = data[starting_id:timestep + 1]
    else:
        windowed_data = - starting_id * [data[0]] + list(data[0:timestep + 1])

    state = windowed_data

    return np.array([state])


def main(batch_size, window_size, episodes):
    data = load_data(stock_name) # Replace with your own input here
    data_samples = len(data) - 1
    agent = AIAgent(window_size)
    agent.model.summary()


    for episode in range(1, episodes + 1):
        print("Episode: {}/{}".format(episode, episodes))
        state = state_creator(data, 0, window_size)

        total_profit = 0
        agent.inventory = []

        for t in tqdm(range(data_samples)):
            action = agent.trade(state)

            next_state = state_creator(data, t + 1, window_size)
            reward = 0

            if action == 1:
                # Do that
                continue
            elif action == 2:
                # Do that
                continue

            elif action == 0:
                # Do that
                continue

            if t == data_samples - 1:
                done = True
            else:
                done = False

            agent.memory.append((state, action, reward, next_state, done))
            state = next_state

            if len(agent.memory) > batch_size:
                agent.batch_train(batch_size)

        agent.model.save(f"{agent.model_name}_{episode}.h5")

Relevant log output

No response

opened by hugo-mrc 0

7900 XTX Fails to Run

Issue Type

Bug

Tensorflow Version

Tensorflow-rocm v2.11.0-3797-gfe65ef3bbcf 2.11.0

rocm Version

5.4.1

Custom Code

Yes

OS Platform and Distribution

Archlinux: Kernel 6.1.1

Python version

3.10

GPU model and memory

7900 XTX 24GB

Current Behaviour?

I am not entirely sure whether this is an upstream (ROCm) issue or one with tensorflow-rocm specifically, so I am reporting it to both repos. A toy example refuses to run and dumps core. I would have expected it to train successfully.

Standalone code to reproduce the issue

import tensorflow as tf
import numpy as np

features = np.random.randn(10000,25)
targets = np.random.randn(10000)

model = tf.keras.Sequential([
     tf.keras.layers.Dense(1)
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss=tf.keras.losses.MeanSquaredError())

model.fit(x=features, y=targets)

Relevant log output

[jaap@Jaap-Desktop code]$ pipenv run python testNN.py
2022-12-24 11:18:37.178811: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
python: /build/hsa-rocr/src/ROCR-Runtime-rocm-5.4.1/src/core/runtime/amd_gpu_agent.cpp:339: void rocr::AMD::GpuAgent::AssembleShader(const char*, AssembleTarget, void*&, size_t&) const: Assertion `code_buf != NULL && "Code buffer allocation failed"' failed.

opened by Mushoz 0

I'm not sure if ROCm or the GPU are working properly based on two console outputs

Issue Type

Support

Source

binary

Tensorflow Version

2.11.0

Custom Code

Yes

OS Platform and Distribution

Kubuntu 20.04

Mobile device

No response

Python version

3.7

Bazel version

No response

GCC/Compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current Behaviour?

ROCm Fusion seems to be enabled, but the GPU doesn't appear in tf.config.list_physical_devices('GPU'). This seems a bit contradictory to me.

Standalone code to reproduce the issue

import tensorflow as tf
import numpy as np

tensor = tf.constant(np.random.rand(117120,1))

Relevant log output

2022-12-21 09:16:33.582542: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1990] Ignoring visible gpu device (device: 0, name: AMD Radeon RX 6600 XT, pci bus id: 0000:0c:00.0) with AMDGPU version : gfx1032. The supported AMDGPU versions are gfx1030, gfx900, gfx906, gfx908, gfx90a.
2022-12-21 09:16:35.013263: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
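
The first log line already states why no GPU is listed: the device reports gfx1032, which is not in the list of architectures this build supports. A minimal sketch that extracts both pieces from such a line and compares them; the function name is hypothetical and the regular expression is written only against the message format shown above:

```python
import re

# Log message copied from the output above (timestamp and source location omitted).
log_line = ("Ignoring visible gpu device (device: 0, name: AMD Radeon RX 6600 XT, "
            "pci bus id: 0000:0c:00.0) with AMDGPU version : gfx1032. "
            "The supported AMDGPU versions are gfx1030, gfx900, gfx906, gfx908, gfx90a.")

def parse_ignored_device(line):
    """Return (device_arch, supported_archs) from an 'Ignoring visible gpu device' line, or None."""
    m = re.search(r"AMDGPU version : (gfx\w+)\. The supported AMDGPU versions are (.+?)\.$", line)
    if m is None:
        return None
    return m.group(1), [arch.strip() for arch in m.group(2).split(",")]

arch, supported = parse_ignored_device(log_line)
print(arch, arch in supported)
```

The "ROCm Fusion is enabled" message appears to come from a graph optimization pass rather than from device registration, so on its own it likely says nothing about whether a GPU was accepted.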

opened by tvandraren 1

Unable to use profiler

Issue Type

Bug

Source

source

Tensorflow Version

2.11.0

Custom Code

Yes

OS Platform and Distribution

Archlinux kernel: 6.0.12

Mobile device

No response

Python version

3.9

Bazel version

No response

GCC/Compiler version

No response

CUDA/cuDNN version

ROCM 5.4.0

GPU model and memory

6900XT

Current Behaviour?

I am unable to run the profiler without tensorflow-rocm crashing.

Standalone code to reproduce the issue

This example triggers the issue:

import tensorflow as tf
import numpy as np

tboard_callback = tf.keras.callbacks.TensorBoard(log_dir = '../logs/',
                                                 histogram_freq = 1,
                                                 profile_batch = '500,520')

features = np.random.randn(10000,25)
targets = np.random.randn(10000)

model = tf.keras.Sequential([
     tf.keras.layers.Dense(1)
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss=tf.keras.losses.MeanSquaredError())

model.fit(x=features, y=targets, callbacks=[tboard_callback])

Relevant log output

Fatal Python error: Aborted


Main thread:
Current thread 0x00007f3889bec740 (most recent call first):
  File "/home/jaap/.local/share/virtualenvs/code-NonYUw1A/lib/python3.10/site-packages/tensorflow/python/profiler/profiler_v2.py", line 117 in start
  File "/home/jaap/.local/share/virtualenvs/code-NonYUw1A/lib/python3.10/site-packages/keras/callbacks.py", line 2882 in _start_profiler
  File "/home/jaap/.local/share/virtualenvs/code-NonYUw1A/lib/python3.10/site-packages/keras/callbacks.py", line 2672 in _init_profile_batch
  File "/home/jaap/.local/share/virtualenvs/code-NonYUw1A/lib/python3.10/site-packages/keras/callbacks.py", line 2421 in __init__
  File "/home/jaap/Dropbox/Projects/Google_Trends_Analysis/code/testNN.py", line 12 in <module>
  File "/home/jaap/.local/share/virtualenvs/code-NonYUw1A/lib/python3.10/site-packages/spyder_kernels/py3compat.py", line 356 in compat_exec
  File "/home/jaap/.local/share/virtualenvs/code-NonYUw1A/lib/python3.10/site-packages/spyder_kernels/customize/spydercustomize.py", line 469 in exec_code
  File "/home/jaap/.local/share/virtualenvs/code-NonYUw1A/lib/python3.10/site-packages/spyder_kernels/customize/spydercustomize.py", line 611 in _exec_file
  File "/home/jaap/.local/share/virtualenvs/code-NonYUw1A/lib/python3.10/site-packages/spyder_kernels/customize/spydercustomize.py", line 524 in runfile
  File "/tmp/ipykernel_99678/3578018583.py", line 1 in <cell line: 1>


Restarting kernel...
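
The traceback above shows the abort happening inside the TensorBoard callback constructor (`__init__` → `_init_profile_batch` → `_start_profiler`), before any training step runs. As a separate observation, the profiling window in the repro would not be reached anyway. A small sketch of the arithmetic, assuming the Keras defaults of `batch_size=32` and `epochs=1` for `model.fit` and assuming `profile_batch` counts batches from the start of training:

```python
import math

num_samples = 10000                      # rows in `features` in the repro
batch_size = 32                          # Keras default when model.fit gets no batch_size
epochs = 1                               # Keras default when model.fit gets no epochs
profile_start, profile_stop = 500, 520   # profile_batch = '500,520'

total_batches = math.ceil(num_samples / batch_size) * epochs
print(total_batches)
print(total_batches >= profile_start)    # would training ever reach the profiling window?
```

Because the crash occurs at construction time, lowering the window (for example to batches that do occur within the run) would probably not avoid it, but it would be needed to capture a profile once the crash is resolved.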

opened by Mushoz 0

rocWMMA support?

Issue Type

Feature Request

Source

binary

Tensorflow Version

tf 2.10.0.530

Custom Code

OS Platform and Distribution

Linux Ubuntu 20.04

Mobile device

N/A

Python version

3.8

Bazel version

N/A

GCC/Compiler version

N/A

CUDA/cuDNN version

N/A

GPU model and memory

N/A

Current Behaviour?

With the impending release of GFX11 GPUs, which support WMMA instructions, the ROCm TensorFlow stack does not yet appear to support those instructions. I'm a bit concerned, as the main competition already has TensorFloat-32 support for their GPUs in their TensorFlow stack. If tensorflow-rocm is to remain relevant and competitive, I believe support for WMMA instructions should be integrated into the ROCm TensorFlow stack ASAP. And if the ROCm + proprietary amdgpu stack can _also_ start fully supporting GFX11 GPUs within a few months of launch, that would be a huge incentive for me to purchase an RX 7000 series GPU with at least 12 GB VRAM and rocWMMA support ;)