My Env:
TensorFlow: 1.3
CUDA: 8.0
cuDNN: 6.0
I notice an update for distributed_all_reduce so I want to have a try. But I'm not sure what value should controller_host takes...
My args are:
--variable_update=distributed_all_reduce
--all_reduce_spec=pscpu:32k:xring
and I start 3 processes with args:
FIRST:
--job_name=worker
--worker_hosts=127.0.0.1:50001,127.0.0.1:50002
--task_index=0
SECONDE:
--job_name=worker
--worker_hosts=127.0.0.1:50001,127.0.0.1:50002
--task_index=1
THIRD:
--job_name=controller
--controller_host=??
--task_index=0
When I put 127.0.0.1:50000 or 127.0.0.1:50001 on controller_host, I got:
TensorFlow: 1.3
Model: resnet50
Mode: training
SingleSess: True
Batch size: 128 global
64 per device
Devices: ['job:worker/task0/gpu:0', 'job:worker/task1/gpu:0']
Data format: NCHW
Optimizer: sgd
Variables: distributed_all_reduce
AllReduce: pscpu:32k:xring
Sync: True
==========
Generating model
WARNING:tensorflow:From /home/zzy/workspace/benchmarks/scripts/tf_cnn_benchmarks/preprocessing.py:486: __init__ (from tensorflow.contrib.data.python.ops.readers) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.TFRecordDataset`.
WARNING:tensorflow:From /home/zzy/workspace/benchmarks/scripts/tf_cnn_benchmarks/preprocessing.py:487: range (from tensorflow.contrib.data.python.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.range()`.
WARNING:tensorflow:From /home/zzy/workspace/benchmarks/scripts/tf_cnn_benchmarks/preprocessing.py:489: zip (from tensorflow.contrib.data.python.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.zip()`.
2017-10-10 14:03:34.183287: E tensorflow/core/common_runtime/session.cc:69] Not found: No session factory registered for the given session options: {target: "127.0.0.1:50001" config: intra_op_parallelism_threads: 1 gpu_options { force_gpu_compatible: true } allow_soft_placement: true} Registered factories are {DIRECT_SESSION, GRPC_SESSION}.
Traceback (most recent call last):
File "/home/zzy/workspace/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 46, in <module>
tf.app.run()
File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/home/zzy/workspace/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 42, in main
bench.run()
File "/home/zzy/workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 892, in run
return self._benchmark_cnn()
File "/home/zzy/workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1068, in _benchmark_cnn
start_standard_services=start_standard_services) as sess:
File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/contextlib.py", line 17, in __enter__
return self.gen.next()
File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 964, in managed_session
self.stop(close_summary_writer=close_summary_writer)
File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 792, in stop
stop_grace_period_secs=self._stop_grace_secs)
File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 953, in managed_session
start_standard_services=start_standard_services)
File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 708, in prepare_or_wait_for_session
init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 273, in prepare_session
config=config)
File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 178, in _restore_checkpoint
sess = session.Session(self._target, graph=self._graph, config=config)
File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1482, in __init__
super(Session, self).__init__(target, graph, config=config)
File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 622, in __init__
self._session = tf_session.TF_NewDeprecatedSession(opts, status)
File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: No session factory registered for the given session options: {target: "127.0.0.1:50001" config: intra_op_parallelism_threads: 1 gpu_options { force_gpu_compatible: true } allow_soft_placement: true} Registered factories are {DIRECT_SESSION, GRPC_SESSION}.