megengine.core._imperative_rt.core2.AsyncError: An async error is reported, raised at BatchNorm2d in the model's forward function

To help get your issue resolved quickly, we recommend following the template below:

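【megengine.core._imperative_rt.core2.AsyncError: An async error is reported, raised at BatchNorm2d in the model's forward function】
(Describe your problem concisely and precisely, e.g. "int8 model, features extracted multiple times, visible error")
【Version and environment information】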

  • MegEngine version: 1.9.1
  • CPU model: __
  • GPU model: NVIDIA 2080Ti
  • OS: Ubuntu 18.04, 64-bit, Brain++ environment
  • Python version: 3.6.13

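【Model information】

  • Algorithm: (please provide the algorithm source code; briefly describe any special implementation)
  • Performance comparison: (current speed vs. previous speed, input shapes, etc.)
  • Model file location: (please provide the location of the model file)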

【Load_and_run LOG】

  • Please provide the Load_and_run log that reproduces the issue

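【If reporting an error, please provide the following reproduction information】

  • Steps to reproduce: ported the PyTorch code to MegEngine; the error is raised at BatchNorm2d in the model's forward
  • Log: The above exception was the direct cause of the following exception: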

Traceback (most recent call last):
  File "/home/dingyikang/anaconda3/envs/kdmvs-megengin/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/dingyikang/anaconda3/envs/kdmvs-megengin/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/dingyikang/anaconda3/envs/kdmvs-megengin/lib/python3.6/site-packages/megengine/distributed/launcher.py", line 58, in _run_wrapped
    ret = func(*args, **kwargs)
  File "/data/UnsupMVS/KD-MVS-release-megengine/train_unsup.py", line 239, in main
    train(model, model_loss, optimizer, gm, TrainImgLoader, TestImgLoader, start_epoch, logger, args)
  File "/data/UnsupMVS/KD-MVS-release-megengine/train_unsup.py", line 85, in train
    loss, scalar_outputs, image_outputs = train_sample(model, model_loss, optimizer, gm, sample, args)
  File "/data/UnsupMVS/KD-MVS-release-megengine/train_unsup.py", line 127, in train_sample
    outputs = model(sample_cuda["imgs"], sample_cuda["proj_matrices"], sample_cuda["depth_values"])
  File "/home/dingyikang/anaconda3/envs/kdmvs-megengin/lib/python3.6/site-packages/megengine/module/module.py", line 149, in __call__
    outputs = self.forward(*inputs, **kwargs)
  File "/data/UnsupMVS/KD-MVS-release-megengine/models/cas_mvsnet.py", line 368, in forward
    var_reg=self.var_regression if self.share_cr else self.var_regression[stage_idx])
  File "/home/dingyikang/anaconda3/envs/kdmvs-megengin/lib/python3.6/site-packages/megengine/module/module.py", line 149, in __call__
    outputs = self.forward(*inputs, **kwargs)
  File "/data/UnsupMVS/KD-MVS-release-megengine/models/cas_mvsnet.py", line 220, in forward
    log_var = var_reg(ref_feature)
  File "/home/dingyikang/anaconda3/envs/kdmvs-megengin/lib/python3.6/site-packages/megengine/module/module.py", line 149, in __call__
    outputs = self.forward(*inputs, **kwargs)
  File "/data/UnsupMVS/KD-MVS-release-megengine/models/cas_mvsnet.py", line 192, in forward
    x = self.conv0(x)
  File "/home/dingyikang/anaconda3/envs/kdmvs-megengin/lib/python3.6/site-packages/megengine/module/module.py", line 149, in __call__
    outputs = self.forward(*inputs, **kwargs)
  File "/data/UnsupMVS/KD-MVS-release-megengine/models/module.py", line 222, in forward
    return F.relu(self.bn(self.conv(x)))
  File "/home/dingyikang/anaconda3/envs/kdmvs-megengin/lib/python3.6/site-packages/megengine/module/module.py", line 149, in __call__
    outputs = self.forward(*inputs, **kwargs)
  File "/home/dingyikang/anaconda3/envs/kdmvs-megengin/lib/python3.6/site-packages/megengine/module/batchnorm.py", line 77, in forward
    self._check_input_ndim(inp)
  File "/home/dingyikang/anaconda3/envs/kdmvs-megengin/lib/python3.6/site-packages/megengine/module/batchnorm.py", line 325, in _check_input_ndim
    if len(inp.shape) != 4:
  File "/home/dingyikang/anaconda3/envs/kdmvs-megengin/lib/python3.6/site-packages/megengine/tensor.py", line 112, in shape
    shape = super().shape
megengine.core._imperative_rt.core2.AsyncError: An async error is reported. See above for the actual cause. Hint: This is where it is reported, not where it happened. You may call `megengine.config.async_level = 0` to get better error reporting.

  • Key code snippet:

# Assumed import aliases; the original snippet uses `nn` and `F` without showing them.
import megengine.functional as F
import megengine.module as nn


class UncertaintyNet(nn.Module):
    def __init__(self, in_channels):
        super(UncertaintyNet, self).__init__()
        self.inplanes = in_channels
        self.conv0 = ConvBnReLU(self.inplanes, 4 * self.inplanes, 3, 1, 1)
        self.conv1 = ConvBnReLU(4 * self.inplanes, 8 * self.inplanes, 3, 1, 1)
        self.conv2 = ConvBnReLU(8 * self.inplanes, self.inplanes, 3, 1, 1)
        self.var = nn.Conv2d(self.inplanes, 1, 3, 1, 1)

    def forward(self, x):
        x = self.conv0(x)
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.var(x)
        # Drop the single-channel axis: (N, 1, H, W) -> (N, H, W)
        x = F.squeeze(x, axis=1)
        return x


class ConvBnReLU(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, pad=1):
        super(ConvBnReLU, self).__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, stride=stride, padding=pad, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)

    def forward(self, x):
        return F.relu(self.bn(self.conv(x)))

The error is raised at self.bn.
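
The hint at the end of the log suggests setting `megengine.config.async_level = 0` so that the error is reported closer to the operation that actually failed. A minimal sketch of doing that before the first forward pass is below. The helper name `set_sync_error_reporting` is hypothetical, and the import is guarded only so that the snippet also runs on a machine without MegEngine installed.

```python
import importlib


def set_sync_error_reporting():
    """Try to apply the hint from the log: megengine.config.async_level = 0.

    Returns True if MegEngine could be imported and the level was set,
    False if MegEngine is not installed in this environment.
    (Hypothetical helper, not part of MegEngine.)
    """
    try:
        mge = importlib.import_module("megengine")
    except ImportError:
        return False
    # Per the hint in the error message, level 0 gives better error
    # reporting: the failure should surface at the op that caused it
    # instead of at a later sync point such as reading `.shape`.
    mge.config.async_level = 0
    return True


if __name__ == "__main__":
    print("async_level set to 0:", set_sync_error_reporting())
```

Since the traceback shows training being started through `megengine/distributed/launcher.py`, where each worker is a separate process, it is probably safest to call the helper inside the function passed to the launcher (here `main`) rather than only in the parent process.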

The traceback shows the error being reported when the shape is read. Are you running the model with trace-module or with trace? Could you also share the training code?
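
Why the error shows up at a shape read rather than at the failing operator can be illustrated with a small analogy using only the Python standard library. This is not MegEngine's mechanism, just a sketch of the same pattern: work is submitted asynchronously, and a failure only becomes visible at the next point where the caller waits for a result. The names `failing_op` and `run_deferred` are made up for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def failing_op():
    # Stand-in for an operator that fails on the worker side.
    raise ValueError("bad input to an earlier op")


def run_deferred():
    """Submit a failing task and report where its error becomes visible."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        # submit() returns immediately; no exception is raised here.
        future = pool.submit(failing_op)
        try:
            # Sync point: the stored exception is re-raised here,
            # analogous to reading `.shape` on an async tensor.
            future.result()
        except ValueError as exc:
            return f"reported at result(): {exc}"
    return "no error"


if __name__ == "__main__":
    print(run_deferred())
```

In the log above, `inp.shape` inside `_check_input_ndim` plays the role of `future.result()`: it is the first place that has to wait for earlier work, so the actual cause is likely an operation before BatchNorm2d.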

A colleague has already contacted me about this. Thanks!

