Prerequisite
Task
I'm using the official example scripts/configs for the officially supported tasks/models/datasets.
Branch
master branch https://github.com/open-mmlab/mmediting
Environment
sys.platform: linux
Python: 3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0]
CUDA available: True
GPU 0: NVIDIA GeForce RTX 2080 Ti
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.2, V11.2.152
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.10.2
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.3
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
- CuDNN 8.2
- Magma 2.5.2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
TorchVision: 0.11.3
OpenCV: 4.5.4
MMCV: 1.5.0
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.3
MMEditing: 0.16.0+7b3a8bd
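(The environment dump above does not include the PyAV version, which seems relevant to this error; a quick way to capture it, as a sketch, since av.__version__ is PyAV's standard version attribute:)

import av
print(av.__version__)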
Reproduces the problem - code sample
I just ran training again, without modifying any code.
Reproduces the problem - command or script
./tools/dist_train.sh ./configs/restorers/real_basicvsr/realbasicvsr_wogan_c64b20_2x30x8_lr1e-4_300k_reds.py 1
Reproduces the problem - error message
File "./tools/train.py", line 169, in <module>
main()
File "./tools/train.py", line 165, in main
meta=meta)
File "/home/gihwan/mmedit/mmedit/apis/train.py", line 104, in train_model
meta=meta)
File "/home/gihwan/mmedit/mmedit/apis/train.py", line 241, in _dist_train
runner.run(data_loaders, cfg.workflow, cfg.total_iters)
File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 134, in run
iter_runner(iter_loaders[i], **kwargs)
File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 59, in train
data_batch = next(data_loader)
File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 32, in __next__
data = next(self.iter_loader)
File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
data = self._next_data()
File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
return self._process_data(data)
File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
data.reraise()
File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/_utils.py", line 434, in reraise
raise exception
av.codec.codec.UnknownCodecError: Caught UnknownCodecError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
data = fetcher.fetch(index)
File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/gihwan/mmedit/mmedit/datasets/dataset_wrappers.py", line 31, in __getitem__
return self.dataset[idx % self._ori_len]
File "/home/gihwan/mmedit/mmedit/datasets/base_sr_dataset.py", line 52, in __getitem__
return self.pipeline(results)
File "/home/gihwan/mmedit/mmedit/datasets/pipelines/compose.py", line 42, in __call__
data = t(data)
File "/home/gihwan/mmedit/mmedit/datasets/pipelines/random_degradations.py", line 547, in __call__
results = degradation(results)
File "/home/gihwan/mmedit/mmedit/datasets/pipelines/random_degradations.py", line 465, in __call__
results[key] = self._apply_random_compression(results[key])
File "/home/gihwan/mmedit/mmedit/datasets/pipelines/random_degradations.py", line 434, in _apply_random_compression
stream = container.add_stream(codec, rate=1)
File "av/container/output.pyx", line 64, in av.container.output.OutputContainer.add_stream
File "av/codec/codec.pyx", line 184, in av.codec.codec.Codec.__cinit__
File "av/codec/codec.pyx", line 193, in av.codec.codec.Codec._init
av.codec.codec.UnknownCodecError: libx264
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 10741) of binary: /home/gihwan/anaconda3/envs/openmmlab2/bin/python
Traceback (most recent call last):
File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
)(*cmd_args)
File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/gihwan/anaconda3/envs/openmmlab2/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
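For reference, the traceback reduces to the PyAV call below. This is a minimal sketch based on the frames above (the in-memory buffer and the 'mp4' format are my assumptions about what random_degradations.py passes in); in my environment it raises the same UnknownCodecError:

import io
import av

buf = io.BytesIO()
# Open an in-memory output container, as the degradation pipeline does.
container = av.open(buf, mode='w', format='mp4')
# This is the call at random_degradations.py line 434 in the traceback;
# it raises av.codec.codec.UnknownCodecError when libx264 is unavailable.
stream = container.add_stream('libx264', rate=1)
container.close()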
Additional information
I'm trying to train RealBasicVSR to check whether it trains in my environment.
A similar issue has been reported before, but that issue has not been resolved yet.
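To narrow it down, I would check whether the PyAV/FFmpeg build in this environment exposes an H.264 encoder at all. A diagnostic sketch (codecs_available is PyAV's set of known codec names, and Codec(name, 'w') looks up an encoder; the rest is my assumption):

import av

# List every available codec name containing '264'.
print(sorted(name for name in av.codec.codecs_available if '264' in name))

try:
    av.codec.Codec('libx264', 'w')  # mode 'w' looks up the encoder
    print('libx264 encoder found')
except av.codec.codec.UnknownCodecError:
    print('libx264 encoder missing from this PyAV/FFmpeg build')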
kind/bug