[AI-CAMP Session 3] Group 1: Model-Training Problems

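This week, each student should post 1-2 project or model-training problems that you have actually run into and that are not covered in the cookbook. An answer is not required, though having one is better. Submit them on the forum.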

cookbook:https://codimd.iap.wh-a.brainpp.cn/s/SklQ_GJnw

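MegStudio login problem: after a successful login it logs itself out, and logging in again just logs out again… Leaving it alone for N minutes may make it work again.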


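Training retinanet on the inflatable arch dataset often blows up partway through training; even with R's original code there is a 50% chance that it does not converge.
A: Think carefully about why it converges 50% of the time and fails the other 50%. If it failed to converge 100% of the time, that would be a code problem; 50% clearly points to parameters such as data processing.
Q: Update: using R's reference solution directly, training blew up on every one of several runs; with my own dataset and slightly modified code, it blows up some of the time. My guess is that it is most likely related to the dataset, less likely to the code, and even less likely to the model's characteristics, since R also said that retinanet is sometimes prone to blowing up…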
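
One way to narrow down where a run starts to blow up is to stop at the first non-finite loss and to cap the gradient norm on every step. The following is only a minimal sketch in plain Python, not the code from R's reference solution and not a MegEngine API: the names `check_loss` and `clip_by_global_norm` are hypothetical, and gradients are assumed to be given as plain lists of floats.

```python
import math


def check_loss(step, loss):
    """Stop at the first step whose loss is NaN or inf (hypothetical helper)."""
    if not math.isfinite(loss):
        raise FloatingPointError("loss became %r at step %d" % (loss, step))


def global_norm(grads):
    """L2 norm over all gradient values, treated as one long vector."""
    return math.sqrt(sum(g * g for vec in grads for g in vec))


def clip_by_global_norm(grads, max_norm):
    """Scale every gradient by one shared factor so the global norm is at most max_norm.

    Returns the (possibly scaled) gradients and the norm measured before clipping.
    """
    norm = global_norm(grads)
    if not math.isfinite(norm):
        raise FloatingPointError("non-finite gradient norm: %r" % norm)
    scale = min(1.0, max_norm / (norm + 1e-12))
    return [[g * scale for g in vec] for vec in grads], norm
```

Logging the returned pre-clipping norm next to the loss on each step shows whether the blow-up is sudden (often a bad sample or box) or a gradual drift (often learning rate or warm-up); which one applies here is not known from the thread.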

Q: When testing after training the model, the following error appears:

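A: It turned out that the ids written in the json did not match the ids in the image file names. The conversion script fills in 0~n by default, which can be trained on but cannot be tested. If the ids are all set to the ones in the image file names, testing works, but training stays stuck at the dataset-processing step. The current workaround is to change that spot in the script separately when generating train.json and test.json; the root fix is still being investigated.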
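
A quick consistency check on the generated json can catch this kind of mismatch before training or testing. This is a minimal sketch under assumptions the thread does not confirm: that the annotation file is COCO-style (an `images` list with `id` and `file_name`, and an `annotations` list with `image_id`), and that the trailing digits of each file name are meant to be the image id. The function name `check_ids` is hypothetical.

```python
import os
import re


def check_ids(coco):
    """Return a list of human-readable problems found in a COCO-style dict."""
    problems = []
    seen = set()
    for img in coco.get("images", []):
        img_id = img["id"]
        if img_id in seen:
            problems.append("duplicate image id %r" % (img_id,))
        seen.add(img_id)
        # Assumption: the file name stem ends in digits that should equal the id.
        stem = os.path.splitext(os.path.basename(img["file_name"]))[0]
        match = re.search(r"(\d+)$", stem)
        if match and int(match.group(1)) != img_id:
            problems.append(
                "image id %r does not match file name %r" % (img_id, img["file_name"])
            )
    for ann in coco.get("annotations", []):
        if ann["image_id"] not in seen:
            problems.append(
                "annotation refers to unknown image id %r" % (ann["image_id"],)
            )
    return problems
```

To use it, load each file with `json.load` and pass the resulting dict, for example once for train.json and once for test.json; an empty list means none of these three checks fired.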

Q: After running the test, the results are very poor, but the loss looks fine:

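A: It is unlikely that the model failed to converge or overfit, and a problem with the test set data is also not very likely. I suspect a code problem and am still investigating.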

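Q: Debugging retinanet gives an error whose cause is unknown. While it runs, the training progress is not visible, so I cannot tell whether it is running correctly or has already failed. The error log is only shown when I end it with ctrl+c.
Platform: brain++
How it is run: ./run.sh in a terminal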

Error log
Process Process-3:
  File "tools/train.py", line 283, in <module>
    main()
  File "tools/train.py", line 85, in main
    p.join()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 124, in join
    res = self._popen.wait(timeout)
  File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 50, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/megengine/distributed/server.py", line 196, in connect
    if self.proxy.connect():
  File "/usr/lib/python3.6/xmlrpc/client.py", line 1112, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib/python3.6/xmlrpc/client.py", line 1452, in __request
    verbose=self.__verbose
  File "/usr/lib/python3.6/xmlrpc/client.py", line 1154, in request
    return self.single_request(host, handler, request_body, verbose)
  File "/usr/lib/python3.6/xmlrpc/client.py", line 1166, in single_request
    http_conn = self.send_request(host, handler, request_body, verbose)
  File "/usr/lib/python3.6/xmlrpc/client.py", line 1279, in send_request
    self.send_content(connection, request_body)
  File "/usr/lib/python3.6/xmlrpc/client.py", line 1309, in send_content
    connection.endheaders(request_body)
  File "/usr/lib/python3.6/http/client.py", line 1249, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1036, in _send_output
    self.send(msg)
  File "/usr/lib/python3.6/http/client.py", line 974, in send
    self.connect()
  File "/usr/lib/python3.6/http/client.py", line 946, in connect
    (self.host,self.port), self.timeout, self.source_address)
  File "/usr/lib/python3.6/socket.py", line 724, in create_connection
    raise err
  File "/usr/lib/python3.6/socket.py", line 713, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "tools/train.py", line 97, in worker
    device=rank,
  File "/usr/local/lib/python3.6/dist-packages/megengine/distributed/group.py", line 126, in init_process_group
    _sd.client = Client(master_ip, port)
  File "/usr/local/lib/python3.6/dist-packages/megengine/distributed/server.py", line 187, in __init__
    self.connect()
  File "/usr/local/lib/python3.6/dist-packages/megengine/distributed/server.py", line 199, in connect
    time.sleep(1)
KeyboardInterrupt
Process Process-2:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/megengine/distributed/server.py", line 196, in connect
    if self.proxy.connect():
  File "/usr/lib/python3.6/xmlrpc/client.py", line 1112, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib/python3.6/xmlrpc/client.py", line 1452, in __request
    verbose=self.__verbose
  File "/usr/lib/python3.6/xmlrpc/client.py", line 1154, in request
    return self.single_request(host, handler, request_body, verbose)
  File "/usr/lib/python3.6/xmlrpc/client.py", line 1166, in single_request
    http_conn = self.send_request(host, handler, request_body, verbose)
  File "/usr/lib/python3.6/xmlrpc/client.py", line 1279, in send_request
    self.send_content(connection, request_body)
  File "/usr/lib/python3.6/xmlrpc/client.py", line 1309, in send_content
    connection.endheaders(request_body)
  File "/usr/lib/python3.6/http/client.py", line 1249, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1036, in _send_output
    self.send(msg)
  File "/usr/lib/python3.6/http/client.py", line 974, in send
    self.connect()
  File "/usr/lib/python3.6/http/client.py", line 946, in connect
    (self.host,self.port), self.timeout, self.source_address)
  File "/usr/lib/python3.6/socket.py", line 724, in create_connection
    raise err
  File "/usr/lib/python3.6/socket.py", line 713, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "tools/train.py", line 97, in worker
    device=rank,
  File "/usr/local/lib/python3.6/dist-packages/megengine/distributed/group.py", line 126, in init_process_group
    _sd.client = Client(master_ip, port)
  File "/usr/local/lib/python3.6/dist-packages/megengine/distributed/server.py", line 187, in __init__
    self.connect()
  File "/usr/local/lib/python3.6/dist-packages/megengine/distributed/server.py", line 199, in connect
    time.sleep(1)
KeyboardInterrupt
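
In the log above, both worker processes were interrupted inside `Client.connect` in megengine/distributed/server.py, sleeping between attempts after a `ConnectionRefusedError`, so the run was apparently waiting for the distributed server rather than training. A small check like the one below can tell whether anything is listening on that address. It is only a sketch using the standard `socket` module: the function name `can_connect` is hypothetical, and the host and port in the example are placeholders, since the real `master_ip` and `port` come from tools/train.py, which is not shown here.

```python
import socket


def can_connect(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder address; substitute the values that train.py actually uses.
    print(can_connect("localhost", 23456))
```

If this returns False while the workers are waiting, the server side never started or is bound to a different port, which would point away from the training code itself.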

##########################20210703 update###########################

Downgrading megengine from 1.3.1 to 1.2.0 resolved the problem above.

I ran into this too. It got better after I updated the test set, although the score is still low.


Same problem here. May I ask where to change it? Also, is there a better way to write and debug code? Debugging on brainpp is too laborious.

##########################20210704 update###########################
Trained once more and it was fine again…