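To help get your issue resolved quickly, please follow the template below:
【Title】
Using a self-written dataset loader, training a semantic segmentation model intermittently fails with CUDA error 700.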
RuntimeError: cuda error 700: an illegal memory access was encountered (cudaEventSynchronize(m_cuda_event) at …/…/…/…/…/…/src/core/impl/comp_node/cuda/comp_node.cpp:host_wait_cv:755)
backtrace:
/opt/conda/lib/python3.6/site-packages/megengine/core/lib/libmegengine_shared.so(_ZN3mgb9CudaErrorC1ERKSs+0x54) [0x7f5a5702d164]
/opt/conda/lib/python3.6/site-packages/megengine/core/lib/libmegengine_shared.so(_ZN3mgb14_on_cuda_errorEPKc9cudaErrorS1_S1_i+0x6a) [0x7f5a5700c4ba]
/opt/conda/lib/python3.6/site-packages/megengine/core/lib/libmegengine_shared.so(+0x2357bf2) [0x7f5a56fe0bf2]
/opt/conda/lib/python3.6/site-packages/megengine/core/_imperative_rt.cpython-36m-x86_64-linux-gnu.so(+0x3d23eb) [0x7f5ab0e343eb]
/opt/conda/lib/python3.6/site-packages/megengine/core/_imperative_rt.cpython-36m-x86_64-linux-gnu.so(+0x34e17f) [0x7f5ab0db017f]
/opt/conda/lib/python3.6/site-packages/megengine/core/_imperative_rt.cpython-36m-x86_64-linux-gnu.so(+0x305175) [0x7f5ab0d67175]
/opt/conda/lib/python3.6/site-packages/megengine/core/_imperative_rt.cpython-36m-x86_64-linux-gnu.so(+0x2f76b2) [0x7f5ab0d596b2]
/opt/conda/lib/python3.6/site-packages/megengine/core/_imperative_rt.cpython-36m-x86_64-linux-gnu.so(+0x2e75b3) [0x7f5ab0d495b3]
/opt/conda/lib/python3.6/site-packages/megengine/core/_imperative_rt.cpython-36m-x86_64-linux-gnu.so(+0x2f2ceb) [0x7f5ab0d54ceb]
/opt/conda/lib/python3.6/site-packages/megengine/core/_imperative_rt.cpython-36m-x86_64-linux-gnu.so(+0x2f5b93) [0x7f5ab0d57b93]
(last_err=700(an illegal memory access was encountered) device=0 mem_free=0.000MiB mem_tot=0.000MiB)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "train.py", line 131, in <module>
    loss = train_func(mge.tensor(inputs), mge.tensor(labels))
  File "train.py", line 30, in train_func
    pred, label, ignore_label=model.cfg.ignore_label
  File "train.py", line 24, in cross_entropy
    return F.loss.cross_entropy(pred[mask], label[mask], axis)
  File "/opt/conda/lib/python3.6/site-packages/megengine/functional/loss.py", line 27, in reduced_loss_fn
    loss = loss_fn(*args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/megengine/functional/loss.py", line 191, in cross_entropy
    n0 = pred.ndim
  File "/opt/conda/lib/python3.6/site-packages/megengine/core/tensor/array_method.py", line 351, in ndim
    shape = self._tuple_shape
  File "/opt/conda/lib/python3.6/site-packages/megengine/tensor.py", line 111, in _tuple_shape
    return super().shape
megengine.core._imperative_rt.core2.AsyncError: An async error is reported. See above for the actual cause. Hint: This is where it is reported, not where it happened. You may call `megengine.config.async_level = 0` to get better error reporting.
Error in atexit._run_exitfuncs:
RuntimeError: cuda error 700: an illegal memory access was encountered (cudaEventSynchronize(m_cuda_event) at …/…/…/…/…/…/src/core/impl/comp_node/cuda/comp_node.cpp:host_wait_cv:755)
backtrace:
/opt/conda/lib/python3.6/site-packages/megengine/core/lib/libmegengine_shared.so(_ZN3mgb9CudaErrorC1ERKSs+0x54) [0x7f5a5702d164]
/opt/conda/lib/python3.6/site-packages/megengine/core/lib/libmegengine_shared.so(_ZN3mgb14_on_cuda_errorEPKc9cudaErrorS1_S1_i+0x6a) [0x7f5a5700c4ba]
/opt/conda/lib/python3.6/site-packages/megengine/core/lib/libmegengine_shared.so(+0x2357bf2) [0x7f5a56fe0bf2]
/opt/conda/lib/python3.6/site-packages/megengine/core/_imperative_rt.cpython-36m-x86_64-linux-gnu.so(+0x3d23eb) [0x7f5ab0e343eb]
/opt/conda/lib/python3.6/site-packages/megengine/core/_imperative_rt.cpython-36m-x86_64-linux-gnu.so(+0x34e17f) [0x7f5ab0db017f]
/opt/conda/lib/python3.6/site-packages/megengine/core/_imperative_rt.cpython-36m-x86_64-linux-gnu.so(+0x305175) [0x7f5ab0d67175]
/opt/conda/lib/python3.6/site-packages/megengine/core/_imperative_rt.cpython-36m-x86_64-linux-gnu.so(+0x2f76b2) [0x7f5ab0d596b2]
/opt/conda/lib/python3.6/site-packages/megengine/core/_imperative_rt.cpython-36m-x86_64-linux-gnu.so(+0x2e75b3) [0x7f5ab0d495b3]
/opt/conda/lib/python3.6/site-packages/megengine/core/_imperative_rt.cpython-36m-x86_64-linux-gnu.so(+0x2f2ceb) [0x7f5ab0d54ceb]
/opt/conda/lib/python3.6/site-packages/megengine/core/_imperative_rt.cpython-36m-x86_64-linux-gnu.so(+0x2f5b93) [0x7f5ab0d57b93]
(last_err=700(an illegal memory access was encountered) device=0 mem_free=0.000MiB mem_tot=0.000MiB)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/megengine/__init__.py", line 104, in _run_exit_handlers
    handler()
megengine.core._imperative_rt.core2.AsyncError: An async error is reported. See above for the actual cause. Hint: This is where it is reported, not where it happened. You may call `megengine.config.async_level = 0` to get better error reporting.
terminate called after throwing an instance of 'mgb::AssertionError'
  what():  assertion `m_valid_handle.count(handle)' failed at …/…/…/…/…/…/imperative/src/impl/interpreter/interpreter_impl.cpp:219: void mgb::imperative::interpreter::intl::ChannelImpl::del_impl(mgb::imperative::interpreter::intl::Handle)
extra message: invalid handle: 0x55692be9f550
backtrace:
/opt/conda/lib/python3.6/site-packages/megengine/core/lib/libmegengine_shared.so(_ZN3mgb13MegBrainErrorC1ERKSs+0x4a) [0x7f5a56fcae1a]
/opt/conda/lib/python3.6/site-packages/megengine/core/lib/libmegengine_shared.so(_ZN3mgb15__assert_fail__EPKciS1_S1_S1_z+0x10f) [0x7f5a56fcb26f]
/opt/conda/lib/python3.6/site-packages/megengine/core/_imperative_rt.cpython-36m-x86_64-linux-gnu.so(+0x2df67b) [0x7f5ab0d4167b]
/opt/conda/lib/python3.6/site-packages/megengine/core/_imperative_rt.cpython-36m-x86_64-linux-gnu.so(+0x2df6fd) [0x7f5ab0d416fd]
/opt/conda/lib/python3.6/site-packages/megengine/core/_imperative_rt.cpython-36m-x86_64-linux-gnu.so(+0x437c5c) [0x7f5ab0e99c5c]
/opt/conda/lib/python3.6/site-packages/megengine/core/_imperative_rt.cpython-36m-x86_64-linux-gnu.so(+0x438a83) [0x7f5ab0e9aa83]
/opt/conda/lib/python3.6/site-packages/megengine/core/_imperative_rt.cpython-36m-x86_64-linux-gnu.so(+0x43ff76) [0x7f5ab0ea1f76]
/opt/conda/lib/python3.6/site-packages/megengine/core/_imperative_rt.cpython-36m-x86_64-linux-gnu.so(+0x440430) [0x7f5ab0ea2430]
/opt/conda/lib/python3.6/site-packages/megengine/core/_imperative_rt.cpython-36m-x86_64-linux-gnu.so(+0x16d076) [0x7f5ab0bcf076]
/opt/conda/lib/python3.6/site-packages/megengine/core/_imperative_rt.cpython-36m-x86_64-linux-gnu.so(+0x16d115) [0x7f5ab0bcf115]
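The hint in the log says the error is reported at a different place from where it happened, which is why the traceback ends at `pred.ndim` rather than at the operation that actually failed. The snippet below is only a pure-Python analogy of that behaviour, using a thread pool as a stand-in for an asynchronous execution queue. It does not use MegEngine, and `faulty_kernel` is a hypothetical name chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def faulty_kernel(index, buffer):
    # Stand-in for asynchronously executed work: fails on an out-of-range index.
    return buffer[index]


buffer = [10, 20, 30]
with ThreadPoolExecutor(max_workers=1) as pool:
    # Submitting the work does not raise, even though the index is invalid.
    future = pool.submit(faulty_kernel, 7, buffer)
    submitted_ok = True
    try:
        # The failure only surfaces when the result is waited on,
        # which may be far away from the line that queued the work.
        future.result()
        reported = None
    except IndexError as exc:
        reported = type(exc).__name__

print(submitted_ok, reported)
```

Forcing synchronous execution, as the hint suggests with `async_level = 0`, should make the reported location coincide with the failing operation.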
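【Version and Environment Information】
- MegEngine version: 1.11.1
- CPU model: (if running on CPU, please provide the CPU model)
- GPU model: Tesla T4
- OS: Ubuntu 22.04.1
- Python version: 3.6.13
【Model Information】
- Algorithm: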
import os

basedir = os.path.abspath(os.path.dirname(__file__))

# Copy the pretrained ResNet-101 weights into MegEngine's local cache directory.
pretrained_path = '/root/.cache/megengine/serialized'
if not os.path.exists(pretrained_path):
    os.makedirs(pretrained_path)
os.system('cp ' + os.path.join(basedir, 'pretrain', 'bf05a2_resnet101_fbaug_77944_b7932921.pkl') + ' ' + pretrained_path + '/')

# If this is not set, illegal GPU memory errors occur intermittently.
os.environ["CUDA_VISIBLE_DEVICES"] = '0'

from megengine.data import DataLoader, Infinite, RandomSampler, dataset
from megengine.data import transform as T
from megengine.optimizer import SGD
from megengine.autodiff import GradManager
import megengine.functional as F
import megengine as mge
import megengine.distributed as dist
from Dataset import SegDataset
from parse import Params

mge.device.set_prealloc_config(1024, 1024, 256 * 1024 * 1024, 4.0)

from official.vision.segmentation.tools.utils import AverageMeter, get_config_info, import_from_file

current_network = import_from_file('official/vision/segmentation/configs/deeplabv3plus_res101_cityscapes_768size.py')
model = current_network.Net(current_network.Cfg())


def cross_entropy(pred, label, axis=1, ignore_label=255):
    # Drop pixels marked as ignore_label, then compute the loss on the rest.
    mask = label != ignore_label
    pred = pred.transpose(0, 2, 3, 1)  # NCHW -> NHWC so the mask selects per-pixel logits
    return F.loss.cross_entropy(pred[mask], label[mask], axis)


def train_func(data, label):
    with gm:
        pred = model(data)
        loss = cross_entropy(
            pred, label, ignore_label=model.cfg.ignore_label
        )
        gm.backward(loss)
    opt.step().clear_grad()
    return loss


def adjust_learning_rate(optimizer, epoch, step, tot_step, cfg):
    # Polynomial decay; the backbone group uses 0.1x the head learning rate.
    max_iter = cfg.max_epoch * tot_step
    cur_iter = epoch * tot_step + step
    cur_lr = cfg.learning_rate * (1 - cur_iter / (max_iter + 1)) ** 0.9
    optimizer.param_groups[0]["lr"] = cur_lr * 0.1
    optimizer.param_groups[1]["lr"] = cur_lr


if __name__ == '__main__':
    pie_param = Params()
    train_dataset = SegDataset(pie_param, 'train')
    train_sampler = RandomSampler(train_dataset, 4, drop_last=True)
    train_dataloader = DataLoader(
        train_dataset,
        sampler=train_sampler,
        transform=T.Compose(
            transforms=[
                T.RandomHorizontalFlip(0.5),
                T.RandomResize(scale_range=(0.5, 2)),
                T.RandomCrop(
                    output_size=(pie_param.load_size[0], pie_param.load_size[1]),
                    padding_value=[0, 0, 0],
                    padding_maskvalue=255,
                ),
                # T.Normalize(mean=[103.530, 116.280, 123.675], std=[57.375, 57.120, 58.395]),
                T.ToMode(),
            ]
        ),
        num_workers=0,
    )
    # train_dataloader = iter(train_dataloader)

    from official.vision.segmentation.tools.utils import AverageMeter, get_config_info, import_from_file

    # Rebuild the model with the config overridden by the user parameters.
    current_network = import_from_file(
        'official/vision/segmentation/configs/deeplabv3plus_res101_cityscapes_768size.py')
    networkcfg = current_network.Cfg()
    networkcfg.batch_size = pie_param.batch_size
    networkcfg.learning_rate = pie_param.learning_rate
    networkcfg.max_epoch = pie_param.epochs
    networkcfg.num_classes = pie_param.num_cls
    model = current_network.Net(networkcfg)

    # Separate backbone and head parameters so they can use different learning rates.
    backbone_params = []
    head_params = []
    for name, param in model.named_parameters():
        if "backbone" in name:
            backbone_params.append(param)
        else:
            head_params.append(param)

    opt = SGD(
        [
            {
                "params": backbone_params,
                "lr": model.cfg.learning_rate * dist.get_world_size() * 0.1,
            },
            {"params": head_params},
        ],
        lr=model.cfg.learning_rate * dist.get_world_size(),
        momentum=model.cfg.momentum,
        weight_decay=model.cfg.weight_decay,
    )

    gm = GradManager()
    if dist.get_world_size() > 1:
        gm.attach(
            model.parameters(),
            callbacks=[dist.make_allreduce_cb("mean", dist.WORLD)]
        )
    else:
        gm.attach(model.parameters())

    if pie_param.resume_path != '':
        pretrained = mge.load(pie_param.resume_path)
        cur_epoch = pretrained["epoch"] + 1
        model.load_state_dict(pretrained["state_dict"])
        opt.load_state_dict(pretrained["opt"])
        # if dist.get_rank() == 0:
        #     logger.info("load success: epoch %d", cur_epoch)

    tot_step = len(train_dataloader)
    # train_dataloader_ = train_dataloader
    print(pie_param.epochs)
    for epoch in range(pie_param.epochs):
        # print(123)
        print(epoch + 1, ':', pie_param.epochs)
        train_dataloader_ = iter(train_dataloader)
        # for ind, batch_data in enumerate(train_dataloader_):
        for ind in range(tot_step):
            adjust_learning_rate(opt, epoch, ind, tot_step, model.cfg)
            # data_tik = time.time()
            inputs, labels = next(train_dataloader_)
            # labels = np.squeeze(labels, axis=1).astype(np.int32)
            # data_tok = time.time()
            # tik = time.time()
            loss = train_func(mge.tensor(inputs), mge.tensor(labels))
            print(loss, type(loss))
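
One thing that may be worth ruling out, since the failure is reported around `pred[mask]` and `label[mask]`: a label value outside `[0, num_classes)` that is not `ignore_label` could make the loss index past the class dimension on the GPU. This is an assumption about the cause, not something the log confirms. Below is a minimal NumPy sketch of a check that could be run on each batch before `mge.tensor(labels)`; `find_invalid_labels` and the demo array are hypothetical, introduced only for illustration.

```python
import numpy as np


def find_invalid_labels(labels, num_classes, ignore_label=255):
    """Return the sorted unique values that are neither a valid class id
    in [0, num_classes) nor equal to ignore_label."""
    labels = np.asarray(labels)
    valid = ((labels >= 0) & (labels < num_classes)) | (labels == ignore_label)
    return np.unique(labels[~valid]).tolist()


# Hypothetical 2x3 label map for a 19-class setup; 200 and -1 are out of range.
demo = np.array([[0, 18, 255], [200, -1, 3]], dtype=np.int32)
bad_values = find_invalid_labels(demo, num_classes=19)
print(bad_values)
```

If the returned list is ever non-empty for a real batch, the dataset's label encoding (rather than the framework) would be the first thing to look at.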
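【Load_and_run LOG】
- Please provide the Load_and_run log that reproduces the issue
【If reporting an error, please provide the following reproduction information】
- Reproduction steps: (please provide the method and steps to reproduce)
- Log output: (please provide the complete log and error messages)
- Key code snippet: (please provide the key code snippet to help trace the problem)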