I was trying to execute the example program atari_ppo.py on the following machine:
- Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
- 32 GB RAM
- GTX 1080 with 8 GB VRAM
- Ubuntu 16.04
- CUDA 10.2
I edited my configuration file conf_ppo.yaml to reduce resource usage:
m_server_name: "m_server"
m_server_addr: "127.0.0.1:4411"
r_server_name: "r_server"
r_server_addr: "127.0.0.1:4412"
c_server_name: "c_server"
c_server_addr: "127.0.0.1:4413"
train_device: "cuda:0"
infer_device: "cuda:0"
timeout: 180
env: "PongNoFrameskip-v4"
max_episode_steps: 2700
num_train_rollouts: 1
num_train_workers: 1
num_eval_rollouts: 1
num_eval_workers: 1
replay_buffer_size: 1024
prefetch: 2
batch_size: 32
lr: 3e-4
push_every_n_steps: 50
num_epochs: 1000
steps_per_epoch: 3000
num_eval_episodes: 20
train_seed: 123
eval_seed: 456
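For reference, I launched the example with Hydra command-line overrides on top of this file (the override list below matches the "Error executing job with overrides" line in the log; the exact script path depends on your checkout):

```shell
# Run the PPO example, overriding config keys from the command line.
# Hydra merges these on top of conf_ppo.yaml.
python atari_ppo.py env=PongNoFrameskip-v4 num_epochs=20
```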
Here is what I got:
[2022-01-18 18:34:54,797][root][INFO] - {'m_server_name': 'm_server', 'm_server_addr': '127.0.0.1:4411', 'r_server_name': 'r_server', 'r_server_addr': '127.0.0.1:4412', 'c_server_name': 'c_server', 'c_server_addr': '127.0.0.1:4413', 'train_device': 'cuda:0', 'infer_device': 'cuda:0', 'env': 'PongNoFrameskip-v4', 'max_episode_steps': 2700, 'num_train_rollouts': 1, 'num_train_workers': 1, 'num_eval_rollouts': 1, 'num_eval_workers': 1, 'replay_buffer_size': 1024, 'prefetch': 2, 'batch_size': 8, 'lr': 0.0003, 'push_every_n_steps': 100, 'num_epochs': 20, 'steps_per_epoch': 300, 'num_eval_episodes': 20, 'train_seed': 123, 'eval_seed': 456}
[2022-01-18 18:35:08,193][root][INFO] - Warming up replay buffer: [ 0 / 1024 ]
[2022-01-18 18:35:09,194][root][INFO] - Warming up replay buffer: [ 0 / 1024 ]
[2022-01-18 18:35:10,196][root][INFO] - Warming up replay buffer: [ 0 / 1024 ]
[2022-01-18 18:35:11,198][root][INFO] - Warming up replay buffer: [ 0 / 1024 ]
[2022-01-18 18:35:12,220][root][INFO] - Warming up replay buffer: [ 0 / 1024 ]
[2022-01-18 18:35:13,222][root][INFO] - Warming up replay buffer: [ 894 / 1024 ]
[2022-01-18 18:35:14,228][root][INFO] - Warming up replay buffer: [ 894 / 1024 ]
[2022-01-18 18:35:15,229][root][INFO] - Warming up replay buffer: [ 894 / 1024 ]
[2022-01-18 18:35:16,231][root][INFO] - Warming up replay buffer: [ 1024 / 1024 ]
Exception in callback handle_task_exception(<Task finishe...) timed out')>) at /media/research/ml2558/rlmeta/rlmeta/utils/asycio_utils.py:11
handle: <Handle handle_task_exception(<Task finishe...) timed out')>) at /media/research/ml2558/rlmeta/rlmeta/utils/asycio_utils.py:11>
Traceback (most recent call last):
File "/home/ml2558/miniconda3/lib/python3.9/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/media/research/ml2558/rlmeta/rlmeta/utils/asycio_utils.py", line 17, in handle_task_exception
raise e
File "/media/research/ml2558/rlmeta/rlmeta/utils/asycio_utils.py", line 13, in handle_task_exception
task.result()
File "/media/research/ml2558/rlmeta/rlmeta/core/loop.py", line 161, in _run_loop
stats = await self._run_episode(env, agent, index)
File "/media/research/ml2558/rlmeta/rlmeta/core/loop.py", line 182, in _run_episode
action = await agent.async_act(timestep)
File "/media/research/ml2558/rlmeta/rlmeta/agents/ppo/ppo_agent.py", line 78, in async_act
action, logpi, v = await self.model.async_act(
RuntimeError: Call (m_server::act) timed out
Error executing job with overrides: ['env=PongNoFrameskip-v4', 'num_epochs=20']
Traceback (most recent call last):
File "/media/research/ml2558/rlmeta/examples/atari/ppo/atari_ppo.py", line 96, in main
stats = agent.train(cfg.steps_per_epoch)
File "/media/research/ml2558/rlmeta/rlmeta/agents/ppo/ppo_agent.py", line 139, in train
self.model.push()
File "/media/research/ml2558/rlmeta/rlmeta/core/model.py", line 69, in push
self.client.sync(self.server_name, "push", state_dict)
RuntimeError: Call (m_server::<unknown>) timed out
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
I tried increasing the timeout, but I get the same error. Any hint on how to resolve this?
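For completeness, this is how I tried raising the timeout. I am assuming the `timeout` key in conf_ppo.yaml is the deadline used by the `Call (m_server::act)` RPC, so I overrode it the same way as the other keys (the value 600 is just an example):

```shell
# Raise the RPC timeout via a Hydra override instead of editing the YAML;
# HYDRA_FULL_ERROR=1 prints the complete stack trace as the log suggests.
HYDRA_FULL_ERROR=1 python atari_ppo.py env=PongNoFrameskip-v4 num_epochs=20 timeout=600
```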