- [ ] Issue is labeled using the label menu on the right side.
Environment
- Python version: (python -V) 3.7.7
- deepgnn-ge Version: (python -m pip show deepgnn-ge) 0.1.55.1
- deepgnn-torch Version: (python -m pip show deepgnn-torch) 0.1.55.1
- deepgnn-tf Version: (python -m pip show deepgnn-tf) not installed
- OS: (Windows, Linux, ...) Windows 10 Enterprise
Issue Details
- What you did - code sample or commands run
I installed deepgnn-torch via pip in a virtual environment, cloned the DeepGNN repository, changed into examples/pytorch/gat/, and ran bash run.sh.
I expected the training script to run without issues.
Instead, training fails as soon as the first epoch starts, with this error:
File "c:\users\myid\appdata\local\programs\python\python37\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'CDLL.__init__.<locals>._FuncPtr'
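For context, the failure may not be specific to the GAT example. The trace ends in popen_spawn_win32.py, so the DataLoader worker is started with the spawn method, which pickles the process object and everything it references. The sketch below is only my guess at the mechanism and is not DeepGNN code: `ctypes.pythonapi` stands in for whatever native library the graph client loads, and `EagerClient`, `LazyClient`, and `is_picklable` are hypothetical names made up for illustration.

```python
import ctypes
import pickle


class EagerClient:
    """Hypothetical object that keeps a native library handle as an attribute."""

    def __init__(self):
        # ctypes.pythonapi is a PyDLL (a CDLL subclass). It is used here only
        # as a stand-in for a real native library loaded through ctypes.
        self.lib = ctypes.pythonapi


class LazyClient:
    """Hypothetical variant that drops the handle before pickling."""

    def __init__(self):
        self._lib = None

    @property
    def lib(self):
        # Load the library on first use, so each process opens its own handle.
        if self._lib is None:
            self._lib = ctypes.pythonapi
        return self._lib

    def __getstate__(self):
        # Copy the instance state and blank out the handle so it is never
        # sent to a child process. The original object is left untouched.
        state = self.__dict__.copy()
        state["_lib"] = None
        return state


def is_picklable(obj):
    """Return True if pickle.dumps(obj) succeeds, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        # The exception type differs across Python versions
        # (AttributeError on 3.7), so any failure is treated as "not picklable".
        return False
```

As far as I understand, `is_picklable(EagerClient())` returns False because the library object carries the locally defined `_FuncPtr` class named in the error, while a `LazyClient` defined at module level pickles fine and reloads the library on first access to `lib` in the child process. If the same pattern applies here, a possible workaround would be to run the data loader without worker processes on Windows, but I have not verified that.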
Full stack trace:
$ bash run.sh
+ DEVICE=cpu
++ dirname run.sh
+ DIR_NAME=.
+ GRAPH=/tmp/cora
+ python -m deepgnn.graph_engine.data.citation --data_dir /tmp/cora
c:\users\myid\appdata\local\programs\python\python37\lib\runpy.py:125: RuntimeWarning: 'deepgnn.graph_engine.data.citation' found in sys.modules after import of package 'deepgnn.graph_engine.data', but prior to execution of 'deepgnn.graph_engine.data.citation'; this may result in unpredictable behaviour
warn(RuntimeWarning(msg))
[2022-09-09 14:46:04,150] {convert.py:100} INFO - worker 0 try to generate partition: 0 - 1
[2022-09-09 14:46:04,151] {_adl_reader.py:124} INFO - [1,0] Input files: ['C:/Users/myid/AppData/Local/Temp/cora\\graph.json']
[2022-09-09 14:46:04,782] {dispatcher.py:143} INFO - record processed: 1000
[2022-09-09 14:46:05,257] {dispatcher.py:143} INFO - record processed: 2000
[2022-09-09 14:46:05,657] {local.py:44} INFO - Graph data path: C:/Users/myid/AppData/Local/Temp/cora. Partitions [0]. Storage type 0. Config path . Stream False.
[2022-09-09 14:46:05,707] {local.py:52} INFO - Loaded snark graph. Node counts: [140, 500, 1000, 1068]. Edge counts: [10556]
graph data: C:/Users/myid/AppData/Local/Temp/cora
+ MODEL_DIR=/tmp/model_fix
+ rm -rf /tmp/model_fix
+ [[ cpu == \g\p\u ]]
+ python ./main.py --data_dir /tmp/cora --mode train --seed 123 --backend snark --graph_type local --converter skip --batch_size 140 --learning_rate 0.005 --num_epochs 180 --sample_file /tmp/cora/train.nodes --node_type 0 --model_dir /tmp/model_fix --metric_dir /tmp/model_fix --save_path /tmp/model_fix --eval_file /tmp/cora/test.nodes --eval_during_train_by_steps 1 --feature_idx 0 --feature_dim 1433 --label_idx 1 --label_dim 1 --head_num 8,1 --num_classes 7 --neighbor_edge_types 0 --attn_drop 0.6 --ffd_drop 0.6 --log_by_steps 1 --use_per_step_metrics
[2022-09-09 14:46:08,646] {factory.py:38} INFO - GE_OMP_NUM_THREADS=1
[2022-09-09 14:46:08,647] {factory.py:38} INFO - apex_opt_level=O2
[2022-09-09 14:46:08,647] {factory.py:38} INFO - attn_drop=0.6
[2022-09-09 14:46:08,647] {factory.py:38} INFO - backend=snark
[2022-09-09 14:46:08,647] {factory.py:38} INFO - batch_size=140
[2022-09-09 14:46:08,647] {factory.py:38} INFO - client_rank=None
[2022-09-09 14:46:08,647] {factory.py:38} INFO - clip_grad=False
[2022-09-09 14:46:08,647] {factory.py:38} INFO - config_path=
[2022-09-09 14:46:08,647] {factory.py:38} INFO - converter=skip
[2022-09-09 14:46:08,647] {factory.py:38} INFO - data_dir=C:/Users/myid/AppData/Local/Temp/cora
[2022-09-09 14:46:08,647] {factory.py:38} INFO - data_parallel_num=2
[2022-09-09 14:46:08,647] {factory.py:38} INFO - dim=256
[2022-09-09 14:46:08,647] {factory.py:38} INFO - disable_ib=False
[2022-09-09 14:46:08,647] {factory.py:38} INFO - enable_adl_uploader=False
[2022-09-09 14:46:08,647] {factory.py:38} INFO - enable_ssl=False
[2022-09-09 14:46:08,647] {factory.py:38} INFO - eval_during_train_by_steps=1
[2022-09-09 14:46:08,647] {factory.py:38} INFO - eval_file=C:/Users/myid/AppData/Local/Temp/cora/test.nodes
[2022-09-09 14:46:08,647] {factory.py:38} INFO - fanouts=[10, 10]
[2022-09-09 14:46:08,647] {factory.py:38} INFO - featenc_config=None
[2022-09-09 14:46:08,647] {factory.py:38} INFO - feature_dim=1433
[2022-09-09 14:46:08,648] {factory.py:38} INFO - feature_idx=0
[2022-09-09 14:46:08,648] {factory.py:38} INFO - feature_type=float
[2022-09-09 14:46:08,648] {factory.py:38} INFO - ffd_drop=0.6
[2022-09-09 14:46:08,648] {factory.py:38} INFO - fp16=amp
[2022-09-09 14:46:08,648] {factory.py:38} INFO - ge_start_timeout=30
[2022-09-09 14:46:08,648] {factory.py:38} INFO - gpu=False
[2022-09-09 14:46:08,648] {factory.py:38} INFO - grad_max_norm=1.0
[2022-09-09 14:46:08,648] {factory.py:38} INFO - graph_type=local
[2022-09-09 14:46:08,648] {factory.py:38} INFO - head_num=[8, 1]
[2022-09-09 14:46:08,648] {factory.py:38} INFO - hidden_dim=8
[2022-09-09 14:46:08,648] {factory.py:38} INFO - job_id=aa812d6f
[2022-09-09 14:46:08,648] {factory.py:38} INFO - l2_coef=0.0005
[2022-09-09 14:46:08,648] {factory.py:38} INFO - label_dim=1
[2022-09-09 14:46:08,648] {factory.py:38} INFO - label_idx=1
[2022-09-09 14:46:08,648] {factory.py:38} INFO - learning_rate=0.005
[2022-09-09 14:46:08,648] {factory.py:38} INFO - local_rank=0
[2022-09-09 14:46:08,648] {factory.py:38} INFO - log_by_steps=1
[2022-09-09 14:46:08,648] {factory.py:38} INFO - max_id=None
[2022-09-09 14:46:08,648] {factory.py:38} INFO - max_samples=0
[2022-09-09 14:46:08,648] {factory.py:38} INFO - max_saved_ckpts=0
[2022-09-09 14:46:08,648] {factory.py:38} INFO - meta_dir=
[2022-09-09 14:46:08,648] {factory.py:38} INFO - metric_dir=C:/Users/myid/AppData/Local/Temp/model_fix
[2022-09-09 14:46:08,649] {factory.py:38} INFO - mode=train
[2022-09-09 14:46:08,649] {factory.py:38} INFO - model_args=
[2022-09-09 14:46:08,649] {factory.py:38} INFO - model_dir=C:/Users/myid/AppData/Local/Temp/model_fix
[2022-09-09 14:46:08,649] {factory.py:38} INFO - neighbor_count=10
[2022-09-09 14:46:08,649] {factory.py:38} INFO - neighbor_edge_types=[0]
[2022-09-09 14:46:08,649] {factory.py:38} INFO - node_type=0
[2022-09-09 14:46:08,649] {factory.py:38} INFO - num_classes=7
[2022-09-09 14:46:08,649] {factory.py:38} INFO - num_epochs=180
[2022-09-09 14:46:08,649] {factory.py:38} INFO - num_ge=0
[2022-09-09 14:46:08,649] {factory.py:38} INFO - num_negs=5
[2022-09-09 14:46:08,649] {factory.py:38} INFO - num_parallel=2
[2022-09-09 14:46:08,649] {factory.py:38} INFO - partitions=[0]
[2022-09-09 14:46:08,649] {factory.py:38} INFO - prefetch_factor=2
[2022-09-09 14:46:08,649] {factory.py:38} INFO - prefetch_size=16
[2022-09-09 14:46:08,649] {factory.py:38} INFO - sample_file=C:/Users/myid/AppData/Local/Temp/cora/train.nodes
[2022-09-09 14:46:08,649] {factory.py:38} INFO - save_ckpt_by_epochs=1
[2022-09-09 14:46:08,649] {factory.py:38} INFO - save_ckpt_by_steps=0
[2022-09-09 14:46:08,649] {factory.py:38} INFO - save_path=C:/Users/myid/AppData/Local/Temp/model_fix
[2022-09-09 14:46:08,649] {factory.py:38} INFO - seed=123
[2022-09-09 14:46:08,649] {factory.py:38} INFO - server_idx=None
[2022-09-09 14:46:08,649] {factory.py:38} INFO - servers=
[2022-09-09 14:46:08,649] {factory.py:38} INFO - skip_ge_start=False
[2022-09-09 14:46:08,649] {factory.py:38} INFO - sort_ckpt_by_mtime=False
[2022-09-09 14:46:08,649] {factory.py:38} INFO - ssl_cert=
[2022-09-09 14:46:08,650] {factory.py:38} INFO - storage_type=0
[2022-09-09 14:46:08,650] {factory.py:38} INFO - strategy=RandomWithoutReplacement
[2022-09-09 14:46:08,650] {factory.py:38} INFO - stream=False
[2022-09-09 14:46:08,650] {factory.py:38} INFO - sync_dir=
[2022-09-09 14:46:08,650] {factory.py:38} INFO - trainer=base
[2022-09-09 14:46:08,650] {factory.py:38} INFO - uploader_process_num=1
[2022-09-09 14:46:08,650] {factory.py:38} INFO - uploader_store_name=
[2022-09-09 14:46:08,650] {factory.py:38} INFO - uploader_threads_num=12
[2022-09-09 14:46:08,650] {factory.py:38} INFO - use_per_step_metrics=True
[2022-09-09 14:46:08,650] {factory.py:38} INFO - user_name=10.0.0.200
[2022-09-09 14:46:08,650] {factory.py:38} INFO - warmup=0.0002
[2022-09-09 14:46:08,654] {local.py:44} INFO - Graph data path: C:/Users/myid/AppData/Local/Temp/cora. Partitions [0]. Storage type 0. Config path . Stream False.
[2022-09-09 14:46:08,666] {local.py:52} INFO - Loaded snark graph. Node counts: [140, 500, 1000, 1068]. Edge counts: [10556]
[2022-09-09 14:46:08,666] {main.py:37} INFO - Creating GAT model with seed:123.
[2022-09-09 14:46:08,668] {base_model.py:39} INFO - [BaseModel] feature_type: FeatureType.FLOAT, feature_idx:0, feature_dim:0.
[2022-09-09 14:46:08,672] {trainer.py:472} INFO - [1,0] Max steps per epoch:-1
[2022-09-09 14:46:08,672] {utils.py:107} INFO - 0, input_layer.att_head-0.bias: torch.Size([8]), cpu
[2022-09-09 14:46:08,672] {utils.py:107} INFO - 1, input_layer.att_head-0.w.weight: torch.Size([8, 1433]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 2, input_layer.att_head-0.attn_l.weight: torch.Size([1, 8]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 3, input_layer.att_head-0.attn_l.bias: torch.Size([1]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 4, input_layer.att_head-0.attn_r.weight: torch.Size([1, 8]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 5, input_layer.att_head-0.attn_r.bias: torch.Size([1]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 6, input_layer.att_head-1.bias: torch.Size([8]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 7, input_layer.att_head-1.w.weight: torch.Size([8, 1433]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 8, input_layer.att_head-1.attn_l.weight: torch.Size([1, 8]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 9, input_layer.att_head-1.attn_l.bias: torch.Size([1]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 10, input_layer.att_head-1.attn_r.weight: torch.Size([1, 8]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 11, input_layer.att_head-1.attn_r.bias: torch.Size([1]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 12, input_layer.att_head-2.bias: torch.Size([8]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 13, input_layer.att_head-2.w.weight: torch.Size([8, 1433]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 14, input_layer.att_head-2.attn_l.weight: torch.Size([1, 8]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 15, input_layer.att_head-2.attn_l.bias: torch.Size([1]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 16, input_layer.att_head-2.attn_r.weight: torch.Size([1, 8]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 17, input_layer.att_head-2.attn_r.bias: torch.Size([1]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 18, input_layer.att_head-3.bias: torch.Size([8]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 19, input_layer.att_head-3.w.weight: torch.Size([8, 1433]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 20, input_layer.att_head-3.attn_l.weight: torch.Size([1, 8]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 21, input_layer.att_head-3.attn_l.bias: torch.Size([1]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 22, input_layer.att_head-3.attn_r.weight: torch.Size([1, 8]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 23, input_layer.att_head-3.attn_r.bias: torch.Size([1]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 24, input_layer.att_head-4.bias: torch.Size([8]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 25, input_layer.att_head-4.w.weight: torch.Size([8, 1433]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 26, input_layer.att_head-4.attn_l.weight: torch.Size([1, 8]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 27, input_layer.att_head-4.attn_l.bias: torch.Size([1]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 28, input_layer.att_head-4.attn_r.weight: torch.Size([1, 8]), cpu
[2022-09-09 14:46:08,673] {utils.py:107} INFO - 29, input_layer.att_head-4.attn_r.bias: torch.Size([1]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 30, input_layer.att_head-5.bias: torch.Size([8]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 31, input_layer.att_head-5.w.weight: torch.Size([8, 1433]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 32, input_layer.att_head-5.attn_l.weight: torch.Size([1, 8]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 33, input_layer.att_head-5.attn_l.bias: torch.Size([1]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 34, input_layer.att_head-5.attn_r.weight: torch.Size([1, 8]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 35, input_layer.att_head-5.attn_r.bias: torch.Size([1]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 36, input_layer.att_head-6.bias: torch.Size([8]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 37, input_layer.att_head-6.w.weight: torch.Size([8, 1433]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 38, input_layer.att_head-6.attn_l.weight: torch.Size([1, 8]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 39, input_layer.att_head-6.attn_l.bias: torch.Size([1]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 40, input_layer.att_head-6.attn_r.weight: torch.Size([1, 8]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 41, input_layer.att_head-6.attn_r.bias: torch.Size([1]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 42, input_layer.att_head-7.bias: torch.Size([8]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 43, input_layer.att_head-7.w.weight: torch.Size([8, 1433]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 44, input_layer.att_head-7.attn_l.weight: torch.Size([1, 8]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 45, input_layer.att_head-7.attn_l.bias: torch.Size([1]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 46, input_layer.att_head-7.attn_r.weight: torch.Size([1, 8]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 47, input_layer.att_head-7.attn_r.bias: torch.Size([1]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 48, out_layer.att_head-0.bias: torch.Size([7]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 49, out_layer.att_head-0.w.weight: torch.Size([7, 64]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 50, out_layer.att_head-0.attn_l.weight: torch.Size([1, 7]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 51, out_layer.att_head-0.attn_l.bias: torch.Size([1]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 52, out_layer.att_head-0.attn_r.weight: torch.Size([1, 7]), cpu
[2022-09-09 14:46:08,674] {utils.py:107} INFO - 53, out_layer.att_head-0.attn_r.bias: torch.Size([1]), cpu
[2022-09-09 14:46:08,675] {utils.py:116} INFO - parameter count: 92391
[2022-09-09 14:46:08,675] {logging_utils.py:84} INFO - Training worker started. Model: GAT.
Traceback (most recent call last):
File "./main.py", line 126, in <module>
_main()
File "./main.py", line 121, in _main
init_args_fn=init_args,
File "C:\Users\myid\Downloads\DeepGNN\venv\lib\site-packages\deepgnn\pytorch\training\factory.py", line 134, in run_dist
eval_dataloader_for_training,
File "C:\Users\myid\Downloads\DeepGNN\venv\lib\site-packages\deepgnn\pytorch\training\trainer.py", line 100, in run
self._train(model)
File "C:\Users\myid\Downloads\DeepGNN\venv\lib\site-packages\deepgnn\pytorch\training\trainer.py", line 171, in _train
self._train_one_epoch(model, epoch)
File "C:\Users\myid\Downloads\DeepGNN\venv\lib\site-packages\deepgnn\pytorch\training\trainer.py", line 174, in _train_one_epoch
for i, data in enumerate(self.dataset):
File "C:\Users\myid\Downloads\DeepGNN\venv\lib\site-packages\torch\utils\data\dataloader.py", line 444, in __iter__
return self._get_iterator()
File "C:\Users\myid\Downloads\DeepGNN\venv\lib\site-packages\torch\utils\data\dataloader.py", line 390, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "C:\Users\myid\Downloads\DeepGNN\venv\lib\site-packages\torch\utils\data\dataloader.py", line 1077, in __init__
w.start()
File "c:\users\myid\appdata\local\programs\python\python37\lib\multiprocessing\process.py", line 112, in start
self._popen = self._Popen(self)
File "c:\users\myid\appdata\local\programs\python\python37\lib\multiprocessing\context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "c:\users\myid\appdata\local\programs\python\python37\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "c:\users\myid\appdata\local\programs\python\python37\lib\multiprocessing\popen_spawn_win32.py", line 89, in __init__
reduction.dump(process_obj, to_child)
File "c:\users\myid\appdata\local\programs\python\python37\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'CDLL.__init__.<locals>._FuncPtr'