Multi-process dataloader error

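Classification and detection training in Models both fail when multiprocessing is used:

  1. For classification, setting the dataloader's workers to 0 avoids the error, but the dataloader then becomes a bottleneck.
  2. For detection, the entire training run is multi-process, so the only option is to set ngpu = 1 and train on a single GPU.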

Process Process-1:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/root/megengine/Models/official/vision/detection/tools/train.py", line 85, in worker
    train_loader = iter(loader["train"])
  File "/usr/local/lib/python3.7/site-packages/megengine/data/dataloader.py", line 122, in __iter__
    return _ParallelDataLoaderIter(self)
  File "/usr/local/lib/python3.7/site-packages/megengine/data/dataloader.py", line 216, in __init__
    worker.start()
  File "/usr/local/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/usr/local/lib/python3.7/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/usr/local/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/usr/local/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/local/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/usr/local/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/usr/local/lib/python3.7/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle weakref objects
Exception ignored in: <function _ParallelDataLoaderIter.__del__ at 0x7f1411cf39d8>
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/megengine/data/dataloader.py", line 544, in __del__
    if self.__initialized:
AttributeError: '_ParallelDataLoaderIter' object has no attribute '_ParallelDataLoaderIter__initialized'
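
The second error in the log (the AttributeError raised from the destructor) looks like a side effect of the first one rather than a separate problem: if the constructor raises before the private flag is assigned, the destructor still runs on the half-built object and cannot find the name-mangled attribute. The sketch below illustrates that mechanism with a hypothetical class, `LoaderIter`; it is not MegEngine's code, and the defensive `getattr` is only one possible way to make a destructor tolerate a failed constructor.

```python
class LoaderIter:
    """Hypothetical stand-in for an iterator whose constructor can fail midway."""

    def __init__(self, fail=False):
        if fail:
            # Simulates an exception raised while starting worker processes,
            # before the flag below is ever assigned.
            raise TypeError("simulated failure while starting workers")
        # Inside the class body this name is mangled to _LoaderIter__initialized.
        self.__initialized = True

    def is_initialized(self):
        # String literals are not mangled, so the mangled name is spelled out.
        # The default value keeps this safe on a half-built object.
        return getattr(self, "_LoaderIter__initialized", False)

    def __del__(self):
        if self.is_initialized():
            pass  # real cleanup (stopping workers, closing queues) would go here


if __name__ == "__main__":
    try:
        LoaderIter(fail=True)
    except TypeError as exc:
        # Only the original error surfaces; the destructor stays quiet.
        print("constructor failed:", exc)
```

With a guard like this, only the original exception would be reported, which makes the log easier to read.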

Environment:
Ubuntu 16.04
Python 3.7
CUDA 10.0
cuDNN 7.6

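Hello, I'll answer your question in two parts.

  1. First, the detection code can still avoid the error by setting num_workers=0.
  2. For the error itself, first try removing the mp.set_start_method("spawn") statement from the code.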
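
When experimenting with the start method as suggested above, it can help to check which method is actually in effect. The following is a minimal sketch using only the standard `multiprocessing` module; the helper name `pick_context` is made up for illustration, and whether MegEngine's dataloader honors a context chosen this way is an assumption that would need to be verified.

```python
import multiprocessing as mp


def pick_context(preferred="fork"):
    """Return a multiprocessing context that uses `preferred` when available.

    Falls back to the platform default when the requested start method is
    not offered on this system (for example "fork" on Windows).
    """
    if preferred in mp.get_all_start_methods():
        return mp.get_context(preferred)
    return mp.get_context()


if __name__ == "__main__":
    # Start methods offered on this platform.
    print("available:", mp.get_all_start_methods())
    # Global setting; None means nothing has fixed it yet.
    print("global:", mp.get_start_method(allow_none=True))
    # A context object carries its own start method and leaves the global
    # setting untouched, unlike mp.set_start_method().
    ctx = pick_context("fork")
    print("chosen:", ctx.get_start_method())
```

Under "fork" the child inherits the parent's memory and the process object is not pickled, whereas "spawn" pickles it, which is the step that fails in the traceback.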

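Thanks for the reply. I had already tried both of your suggestions:

  1. Classification can use num_workers=0, but the dataloader then becomes a bottleneck. In detection, the workers are tied to ngpu, so only one GPU can be used.
  2. Removing mp.set_start_method("spawn") had no effect.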

I also looked into the problem, and it appears to be a Python 3.7 bug (https://bugs.python.org/issue34034). After switching to Python 3.6, detection training works without errors.
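
For readers unfamiliar with this failure mode, the sketch below shows the general mechanism behind a "can't pickle weakref objects" error: anything handed to a spawned process must be pickled, and pickling fails as soon as the object graph contains a weak reference. This is not a reproduction of the linked Python issue or of MegEngine's internals; the `Sampler` class and the state dictionaries are hypothetical.

```python
import pickle
import weakref


class Sampler:
    """Hypothetical object that something else holds a weak reference to."""


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


if __name__ == "__main__":
    sampler = Sampler()
    # Plain data pickles without trouble.
    plain_state = {"batch_size": 32}
    # The same state plus a weak reference cannot be pickled.
    state_with_ref = {"batch_size": 32, "owner": weakref.ref(sampler)}
    print("plain state picklable:", can_pickle(plain_state))
    print("state with weakref picklable:", can_pickle(state_with_ref))
```

A check like `can_pickle` can be applied to a dataset or sampler object before training starts, to narrow down which attribute carries the weak reference.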

I also found a typo in the code:


https://github.com/MegEngine/MegEngine/blob/b938b1cf3cd50cf46a746d416824bd0d3e70060a/python_module/megengine/data/dataloader.py#L170

After switching to Python 3.6, the multi-process dataloader for classification works as well.

Hello, we will fix this typo as soon as possible. Thank you very much for the feedback!