I have installed the exact package versions listed in the README.md file. (As an aside, NumPy 0.15.0 does not exist; I suspect this is a typo and the README should read NumPy 1.15.0.) I have also downloaded the data from Google Drive, unzipped it with 7z, and then untarred the resulting file.
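For reference, the versions actually installed in the conda environment can be double-checked with a small snippet like this (just a sanity check to compare against the README, not part of the repo):

```python
# Print the versions actually installed in the pytorch_p36 environment,
# to compare against the versions listed in README.md.
import numpy
import sklearn
import torch

print("numpy:       ", numpy.__version__)
print("scikit-learn:", sklearn.__version__)
print("torch:       ", torch.__version__)
```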
I have modified the train.sh file to point to the downloaded data. When running the script on an AWS EC2 instance (p2.xlarge), I receive the following error:
(pytorch_p36) [ec2-user@ip-xxx-xx-xx-xxx PPGNet]$ ./train.sh
Loading weights for net_encoder @ ckpt/backbone/encoder_epoch_20.pth
Loading weights for net_decoder @ ckpt/backbone/decoder_epoch_20.pth
start training epoch: 0
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
File "main.py", line 521, in <module>
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
fire.Fire(LSDTrainer)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fire/core.py", line 127, in Fire
component_trace = _Fire(component, args, context, name)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fire/core.py", line 366, in _Fire
component, remaining_args)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fire/core.py", line 542, in _CallCallable
result = fn(*varargs, **kwargs)
File "main.py", line 358, in train
self._train_epoch()
File "main.py", line 209, in _train_epoch
for i, batch in enumerate(data_loader):
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 582, in __next__
return self._process_next_batch(batch)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
raise batch.exc_type(batch.exc_msg)
TypeError: Traceback (most recent call last):
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in <listcomp>
samples = collate_fn([dataset[i] for i in batch_indices])
File "/home/ec2-user/ray/PPGNet/data/sist_line.py", line 24, in __getitem__
lg = LineGraph().load(os.path.join(self.data_root, self.img[item][:-4] + ".lg"))
File "/home/ec2-user/ray/PPGNet/data/line_graph.py", line 29, in load
data = pickle.load(f)
File "sklearn/neighbors/binary_tree.pxi", line 1166, in sklearn.neighbors.kd_tree.BinaryTree.__setstate__
File "stringsource", line 653, in View.MemoryView.memoryview_cwrapper
File "stringsource", line 348, in View.MemoryView.memoryview.__cinit__
TypeError: a bytes-like object is required, not 'code'
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
pid, sts = os.waitpid(self.pid, flag)
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 63, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 11499) is killed by signal: Segmentation fault.
Forum posts about similar errors suggest setting num_workers=0. When I tried that, I simply get a segmentation fault with no traceback at all:
(pytorch_p36) [ec2-user@ip-xxx-xx-xx-xxx PPGNet]$ ./train.sh
Loading weights for net_encoder @ ckpt/backbone/encoder_epoch_20.pth
Loading weights for net_decoder @ ckpt/backbone/decoder_epoch_20.pth
start training epoch: 0
./train.sh: line 13: 11528 Segmentation fault python main.py --exp-name line_weighted_wo_focal_junc --backbone resnet50 --backbone-kwargs '{"encoder_weights": "ckpt/backbone/encoder_epoch_20.pth", "decoder_weights": "ckpt/backbone/decoder_epoch_20.pth"}' --dim-embedding 256 --junction-pooling-threshold 0.2 --junc-pooling-size 64 --attention-sigma 1.5 --block-inference-size 128 --data-root ./indoorDist --junc-sigma 3 --batch-size 16 --gpus 0,1,2,3 --num-workers 0 --resume-epoch latest --is-train-junc True --is-train-adj True --vis-junc-th 0.1 --vis-line-th 0.1 - train --end-epoch 9 --solver SGD --lr 0.2 --weight-decay 5e-4 --lambda-heatmap 1. --lambda-adj 5. - train --end-epoch 15 --solver SGD --lr 0.02 --weight-decay 5e-4 --lambda-heatmap 1. --lambda-adj 10. - train --end-epoch 30 --solver SGD --lr 0.002 --weight-decay 5e-4 --lambda-heatmap 1. --lambda-adj 10. - end
I also suspected this might be due to the batch size, even though a p2.xlarge uses K80 GPUs and should theoretically handle the training as per the README.md instructions (which call for a GPU with at least 24 GB of RAM). However, I've tried reducing the batch size to 1 and still get the same error.
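For completeness, here is a quick way to check which GPUs and how much memory PyTorch actually sees on the instance (just a sanity check run in the same conda environment, not part of the repo):

```python
# Sanity check: list the GPUs visible to PyTorch on this instance and their memory.
import torch

print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB")
```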
The above errors, produced by the original training script as-is, suggest that something is wrong with the line graph (.lg) files, but I'm not familiar with how these files are structured, so I can't dig any further. Any help in sorting this out would be appreciated.
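In case it helps narrow things down: the traceback dies at pickle.load in data/line_graph.py and then inside sklearn.neighbors.kd_tree.BinaryTree.__setstate__, so the .lg files appear to contain pickled scikit-learn KDTree objects. A minimal way to reproduce this outside the DataLoader might be something like the sketch below (the file path is just a placeholder for any .lg file under --data-root; I'm not assuming anything beyond what the traceback shows):

```python
# Try to unpickle a single .lg file directly, bypassing train.sh and the DataLoader.
# "path/to/any_file.lg" is a placeholder for one of the files under --data-root.
import pickle

with open("path/to/any_file.lg", "rb") as f:
    data = pickle.load(f)  # the traceback above fails on this call, inside
                           # sklearn.neighbors.kd_tree.BinaryTree.__setstate__

print(type(data))
```

If this fails the same way in isolation, that would point at a scikit-learn version mismatch between the environment that created the .lg files and the one described in the README, rather than at the training code itself.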