Edit: An important point I forgot to mention: I did not encounter this issue with the CUDA backend.
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Mint 19.1
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
- TensorFlow installed from (source or binary): binary (pypi)
- TensorFlow version (use command below): v1.12.0-871-gf480b4a 1.12.0
- Python version: 3.6.7
- Bazel version (if compiling from source):
- GCC/Compiler version (if compiling from source):
- ROCm/MIOpen version: Rocm: 2.1.96, MiOpen: 1.7.1 (both installed through apt)
- GPU model and memory: Radeon VII, 16GB (gfx906)
Describe the current behavior
After training a model for a variable number of epochs, the program throws an exception because of incompatible shapes during gradient calculation for a tile op inside a tf.while_loop. The exception occurs inside the _TileGrad function, which interleaves the multiples and the input shape of the original tile op by stacking, transposing and reshaping. From what I could see by printing the input tensors and intermediate steps in _TileGrad, something goes wrong during the interleaving. The interleaved shape at times ends up as nonsense like [949434578 -1198049073 1 16 1 25], while something like [50 1 1 21 1 25] would be expected.
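To make the interleaving step concrete, here is a minimal pure-Python sketch of what stacking, transposing and flattening the multiples and the input shape should produce. The helper names `interleave` and `is_valid_split_shape` are made up for illustration and are not part of TensorFlow; the input shape [1, 21, 25] is an assumption chosen to match the expected shape quoted above.

```python
def interleave(multiples, input_shape):
    """Interleave tile multiples with input dims (stack -> transpose -> flatten)."""
    if len(multiples) != len(input_shape):
        raise ValueError("multiples and input_shape must have the same rank")
    stacked = [list(multiples), list(input_shape)]       # shape [2, rank]
    transposed = [list(pair) for pair in zip(*stacked)]  # shape [rank, 2]
    return [dim for pair in transposed for dim in pair]  # shape [2 * rank]


def is_valid_split_shape(split_shape, grad_size):
    """A usable reshape target has no negative dims and matches the gradient size."""
    product = 1
    for dim in split_shape:
        if dim < 0:
            return False
        product *= dim
    return product == grad_size


multiples = [50, 1, 1]     # [batch, 1, 1], as passed to tf.tile in the repro code
input_shape = [1, 21, 25]  # assumed shape of dists before tiling
split_shape = interleave(multiples, input_shape)
print(split_shape)         # [50, 1, 1, 21, 1, 25]

grad_size = 50 * 21 * 25   # number of elements in the incoming gradient
print(is_valid_split_shape(split_shape, grad_size))  # True

# One of the garbage shapes from this report fails the same check.
garbage = [949434578, -1198049073, 1, 16, 1, 25]
print(is_valid_split_shape(garbage, 50 * 16 * 25))   # False
```

In the failing runs only the leading entries are garbage while the trailing ones look plausible, which is what the check above would flag before the reshape is attempted.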
The output of the transpose at one of these exceptions was:
[[1036548730 1061580315]
[-1110934980 -1085778476]
[-1085903306 1061705196]]
resulting in the following interleaved shape:
[1036548730 1061580315 -1110934980 -1085778476 -1085903306 1061705196]
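One observation that might help narrow this down: if the bits of these int32 values are reinterpreted as float32, each one decodes to a finite number with magnitude below 1. The sketch below does that reinterpretation with the standard `struct` module; the helper name `int32_bits_as_float32` is made up for illustration. This is only an observation about the printed numbers and does not establish where the data comes from.

```python
import struct


def int32_bits_as_float32(value):
    """Reinterpret the 32 bits of a signed int32 as an IEEE 754 float32."""
    return struct.unpack("<f", struct.pack("<i", value))[0]


# Values copied from the interleaved shape printed above.
reported = [1036548730, 1061580315, -1110934980,
            -1085778476, -1085903306, 1061705196]

as_floats = [int32_bits_as_float32(v) for v in reported]
for raw, decoded in zip(reported, as_floats):
    print(f"{raw:>12d} -> {decoded:.6f}")

# Every decoded value is a finite float with magnitude below 1.
print(all(abs(f) < 1.0 for f in as_floats))  # True
```

If that pattern holds for other failing runs as well, it could hint that the int32 shape tensor is being read from memory that holds float data, but that is speculation on my part.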
I wasn't able to find the related stack output or input shapes, so I can't tell whether the shape error is caused by something further upstream. My reply to this issue includes an example with parallel_iterations=1, including all the steps.
A full stacktrace can be found at the bottom of this issue.
The error is hard to reproduce and seems to happen at random. I don't believe it is directly related to tf.while_loop, as the exception never occurred in an RNN layer.
Describe the expected behavior
No InvalidArgumentError during gradient calculation.
Code to reproduce the issue
I ran this code for about 25 minutes before the exception happened. It might not be the minimal code required to reproduce the error, but since it is not reliably reproducible I can't narrow it down easily.
import tensorflow as tf
import numpy as np


def loop_cond_dist(i, _l, hs, __ow, _dist):
    return tf.less(i, tf.shape(hs)[1])


def loop_body_dist(i, l, hs, out_weights, dist_lookup):
    dists = tf.nn.embedding_lookup(dist_lookup, tf.clip_by_value(tf.range(1, limit=tf.shape(hs)[1] - i + 1), 0, 50))
    dists = tf.expand_dims(dists, axis=0)
    dists = tf.tile(dists, [tf.shape(hs)[0], 1, 1])  # Error seems to happen in gradients for this op
    cur = tf.einsum('ijk,kl -> ijl', dists, out_weights, name="out_mul")
    pre_pad = tf.zeros([tf.shape(l)[0], tf.shape(l)[1] - tf.reduce_sum(tf.range(tf.shape(hs)[1] - i + 1)), 2])
    post_pad = tf.zeros([tf.shape(l)[0], tf.reduce_sum(tf.range(tf.shape(hs)[1] - i)), 2])
    cur = tf.concat([pre_pad, cur, post_pad], axis=1)
    i += 1
    return i, tf.add(l, cur), hs, out_weights, dist_lookup


def build():
    dist_lookup = tf.get_variable('distance_embeds', dtype=tf.float32, shape=[51, 25])
    hs = tf.placeholder(dtype=tf.float32, shape=[None, None, 50])
    out_weights = tf.get_variable('out_weights', dtype=tf.float32, shape=[25, 2])
    logits = tf.zeros([50, tf.cast(((tf.shape(hs)[1] * tf.shape(hs)[1]) - tf.shape(hs)[1]) / 2, dtype=tf.float32), 2])
    loop_vars = [1, logits, hs, out_weights, dist_lookup]
    logits = tf.while_loop(loop_cond_dist, loop_body_dist, loop_vars, name='clause_logits')[1]
    targets = tf.placeholder(tf.int32)
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=targets, logits=logits)
    train = tf.train.AdamOptimizer(0.005).minimize(loss)
    return train, targets, hs


if __name__ == "__main__":
    with tf.Session() as sess:
        train, y, hs = build()
        sess.run([tf.global_variables_initializer()])
        while True:
            timesteps = np.random.randint(low=1, high=150)
            targets = np.random.randint(low=0, high=2, size=[50, int((timesteps*timesteps-timesteps)/2)])
            rand_hs = np.random.rand(50, timesteps, 50)
            _ = sess.run([train], {y: targets, hs: rand_hs})
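As a sanity check on the forward pass, the padding arithmetic in loop_body_dist can be replayed in plain Python. The sketch below assumes the loop starts at i = 1 and runs while i < timesteps, as in the code above; the helper names `segment_lengths` and `check_padding` are made up for illustration. It confirms that pre_pad, cur and post_pad always add up to the logits length, so the forward shapes themselves are consistent.

```python
def segment_lengths(total, timesteps, i):
    """Lengths along axis 1 of pre_pad, cur and post_pad for loop iteration i."""
    remaining = timesteps - i
    cur_len = len(range(1, remaining + 1))       # rows fetched by embedding_lookup
    pre_pad = total - sum(range(remaining + 1))  # zeros placed before cur
    post_pad = sum(range(remaining))             # zeros placed after cur
    return pre_pad, cur_len, post_pad


def check_padding(timesteps):
    """Verify that every iteration yields a tensor as long as the logits."""
    total = (timesteps * timesteps - timesteps) // 2
    for i in range(1, timesteps):                # loop runs while i < timesteps
        pre_pad, cur_len, post_pad = segment_lengths(total, timesteps, i)
        assert pre_pad >= 0 and post_pad >= 0
        assert pre_pad + cur_len + post_pad == total
    return total


print(check_padding(5))  # 10

# The repro samples timesteps from [1, 150); all of them pass the check.
for t in range(1, 150):
    check_padding(t)
```

Since the lengths line up for every sampled value of timesteps, the invalid shape does not appear to originate from the padding logic in the loop body.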
Other info / logs
--------------------------------------------------------------------------
InvalidArgumentError Traceback (most recent call last)
~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
1333 try:
-> 1334 return fn(*args)
1335 except errors.OpError as e:
~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
1318 return self._call_tf_sessionrun(
-> 1319 options, feed_dict, fetch_list, target_list, run_metadata)
1320
~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
1406 self._session, options, feed_dict, fetch_list, target_list,
-> 1407 run_metadata)
1408
InvalidArgumentError: Size 2 must be non-negative, not -1110934980
[[{{node gradients/clause_logits/Tile_grad/Reshape_1}} = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/clause_logits/out_mul/Reshape_grad/Reshape, gradients/clause_logits/Tile_grad/Reshape)]]
[[{{node gradients/clause_logits/Tile_grad/Identity/_59}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_401_gradients/clause_logits/Tile_grad/Identity", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopgradients/clause_logits/Tile_grad/StringFormat/_1)]]
During handling of the above exception, another exception occurred:
InvalidArgumentError Traceback (most recent call last)
~/.cargo/toponn/python/bug.py in <module>
45 targets = np.random.randint(low=0, high=2, size=[50, int((timesteps*timesteps-timesteps)/2)])
46 rand_hs = np.random.rand(50, timesteps, 50)
---> 47 _ = sess.run([train], {y: targets, hs: rand_hs})
48
~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
927 try:
928 result = self._run(None, fetches, feed_dict, options_ptr,
--> 929 run_metadata_ptr)
930 if run_metadata:
931 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)
~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
1150 if final_fetches or final_targets or (handle and feed_dict_tensor):
1151 results = self._do_run(handle, final_targets, final_fetches,
-> 1152 feed_dict_tensor, options, run_metadata)
1153 else:
1154 results = []
~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
1326 if handle is None:
1327 return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1328 run_metadata)
1329 else:
1330 return self._do_call(_prun_fn, handle, feeds, fetches)
~/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
1346 pass
1347 message = error_interpolation.interpolate(message, self._graph)
-> 1348 raise type(e)(node_def, op, message)
1349
1350 def _extend_graph(self):
InvalidArgumentError: Size 2 must be non-negative, not -1110934980
[[node gradients/clause_logits/Tile_grad/Reshape_1 (defined at /home/seb/.cargo/toponn/python/bug.py:34) = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/clause_logits/out_mul/Reshape_grad/Reshape, gradients/clause_logits/Tile_grad/Reshape)]]
[[{{node gradients/clause_logits/Tile_grad/Identity/_59}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_401_gradients/clause_logits/Tile_grad/Identity", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopgradients/clause_logits/Tile_grad/StringFormat/_1)]]
Caused by op 'gradients/clause_logits/Tile_grad/Reshape_1', defined at:
File "/home/seb/.pyenv/versions/3.6.7/bin/ipython", line 10, in <module>
sys.exit(start_ipython())
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/__init__.py", line 125, in start_ipython
return launch_new_instance(argv=argv, **kwargs)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/traitlets/config/application.py", line 657, in launch_instance
app.initialize(argv)
File "</home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/decorator.py:decorator-gen-112>", line 2, in initialize
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/traitlets/config/application.py", line 87, in catch_config_error
return method(app, *args, **kwargs)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/terminal/ipapp.py", line 323, in initialize
self.init_code()
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/core/shellapp.py", line 288, in init_code
self._run_cmd_line_code()
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/core/shellapp.py", line 408, in _run_cmd_line_code
self._exec_file(fname, shell_futures=True)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/core/shellapp.py", line 340, in _exec_file
raise_exceptions=True)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2683, in safe_execfile
self.compile if shell_futures else None)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/IPython/utils/py3compat.py", line 188, in execfile
exec(compiler(f.read(), fname, 'exec'), glob, loc)
File "/home/seb/.cargo/toponn/python/bug.py", line 39, in <module>
train, y, hs = build()
File "/home/seb/.cargo/toponn/python/bug.py", line 34, in build
train = tf.train.AdamOptimizer(0.005).minimize(loss)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 400, in minimize
grad_loss=grad_loss)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 519, in compute_gradients
colocate_gradients_with_ops=colocate_gradients_with_ops)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 674, in gradients
unconnected_gradients)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 864, in _GradientsHelper
lambda: grad_fn(op, *out_grads))
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 409, in _MaybeCompile
return grad_fn() # Exit early
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 864, in <lambda>
lambda: grad_fn(op, *out_grads))
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/array_grad.py", line 599, in _TileGrad
input_grad = math_ops.reduce_sum(array_ops.reshape(grad, split_shape), axes)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 6482, in reshape
"Reshape", tensor=tensor, shape=shape, name=name)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()
...which was originally created as op 'clause_logits/Tile', defined at:
File "/home/seb/.pyenv/versions/3.6.7/bin/ipython", line 10, in <module>
sys.exit(start_ipython())
[elided 10 identical lines from previous traceback]
File "/home/seb/.cargo/toponn/python/bug.py", line 39, in <module>
train, y, hs = build()
File "/home/seb/.cargo/toponn/python/bug.py", line 29, in build
logits = tf.while_loop(loop_cond_dist, loop_body_dist, loop_vars, name='clause_logits', parallel_iterations=250)[1]
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3295, in while_loop
return_same_structure)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3007, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2942, in _BuildLoop
body_result = body(*packed_vars_for_body)
File "/home/seb/.cargo/toponn/python/bug.py", line 13, in loop_body_dist
dists = tf.tile(dists, [tf.shape(hs)[0], 1, 1])
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 8805, in tile
"Tile", input=input, multiples=multiples, name=name)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/home/seb/.pyenv/versions/3.6.7/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()
InvalidArgumentError (see above for traceback): Size 2 must be non-negative, not -1110934980
[[node gradients/clause_logits/Tile_grad/Reshape_1 (defined at /home/seb/.cargo/toponn/python/bug.py:34) = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/clause_logits/out_mul/Reshape_grad/Reshape, gradients/clause_logits/Tile_grad/Reshape)]]
[[{{node gradients/clause_logits/Tile_grad/Identity/_59}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_401_gradients/clause_logits/Tile_grad/Identity", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopgradients/clause_logits/Tile_grad/StringFormat/_1)]]