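This week, each student should post 1-2 project or model-training problems that you actually ran into and that are not covered in the cookbook. An answer is not required, though having one is better. Submit them on the forum.
[AI-CAMP Cohort 3] Group 1: Model-Training Problems
MegStudio login problem: after a successful login it logs itself out, and logging in again just logs out again… Leaving it alone for N minutes may fix it.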
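When training retinanet on the inflatable arch dataset, the loss often blows up partway through training. Even with R's original code there is roughly a 50% chance that it does not converge.
A: Think carefully about why it converges 50% of the time and fails the other 50%. If it failed 100% of the time, that would be a code problem; 50% clearly points to parameters such as data processing.
Q: Update: using R's reference solution directly, the loss exploded in each of several training runs; with my own dataset and lightly modified code, it explodes with some probability. My blind guess is that it is most likely related to the dataset, less likely to the code, and even less likely to the characteristics of the model, since R also said that retinanet is sometimes prone to blowing up…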
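Since the runs above diverge only some of the time, it may help to record exactly when the loss goes bad so that runs can be compared. The following is a minimal sketch, not part of R's code: `check_loss`, `window` and `blowup_factor` are hypothetical names, and the thresholds are arbitrary assumptions. It does not fix divergence; it only flags a non-finite loss or a sudden jump relative to the recent average.

```python
import math


def check_loss(step, loss_value, history, window=20, blowup_factor=10.0):
    """Return a reason string if the loss looks diverged, else None.

    `history` is a plain list of previous healthy loss values; it is only
    appended to when the current value passes both checks.
    """
    # NaN or inf means the run has already blown up.
    if not math.isfinite(loss_value):
        return f"step {step}: loss is {loss_value}"
    # Compare against the mean of the last `window` healthy values.
    if len(history) >= window:
        recent = history[-window:]
        mean = sum(recent) / window
        if mean > 0 and loss_value > blowup_factor * mean:
            return (f"step {step}: loss {loss_value:.4g} is more than "
                    f"{blowup_factor}x the recent mean {mean:.4g}")
    history.append(loss_value)
    return None
```

In a training loop, one would call `check_loss(step, float(loss), history)` after each iteration and stop (or dump the current batch for inspection) when it returns a message. Knowing the step and the batch at which the loss jumps is one way to tell a data problem from a code problem.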
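Q: When testing after training the model, the following error appears:
A: It turned out that the ids written in the json did not match the ids in the image file names. The conversion script fills them in as 0~n by default, which allows training but not testing. If they are all filled in with the ids from the image file names, testing works, but training then hangs at the dataset-processing stage?? The current workaround is to generate train.json and test.json separately, changing that spot in the script for each one. The root-cause fix is still under investigation.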
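A quick way to see this kind of mismatch before training or testing is to compare each image id in the json with the number in its file name. The sketch below assumes a COCO-style layout (`images` entries with `id` and `file_name`, `annotations` entries with `id` and `image_id`) and assumes the last run of digits in the file name is the intended id; both are assumptions about the dataset, and `find_id_mismatches` and `load_and_check` are hypothetical helper names.

```python
import json
import os
import re


def find_id_mismatches(coco):
    """Return (mismatched_images, orphan_annotation_ids) for a COCO-style dict."""
    mismatched = []
    for img in coco.get("images", []):
        stem = os.path.splitext(os.path.basename(img["file_name"]))[0]
        digits = re.findall(r"\d+", stem)
        # Assumption: the last run of digits in the file name is the intended id.
        name_id = int(digits[-1]) if digits else None
        if name_id != img["id"]:
            mismatched.append((img["file_name"], img["id"], name_id))
    # Annotations that point at an image id that does not exist in "images".
    known_ids = {img["id"] for img in coco.get("images", [])}
    orphans = [a["id"] for a in coco.get("annotations", [])
               if a["image_id"] not in known_ids]
    return mismatched, orphans


def load_and_check(path):
    """Load a json file from `path` and run the check on it."""
    with open(path, "r", encoding="utf-8") as f:
        return find_id_mismatches(json.load(f))
```

Running `load_and_check("train.json")` and `load_and_check("test.json")` would list the images whose json id differs from the file-name id, plus any annotations that reference a missing image. It only reports the inconsistency; it does not explain why training hangs when the file-name ids are used.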
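Q: Debugging retinanet fails for an unknown reason. While it runs, no training progress is shown, so I cannot tell whether it is proceeding correctly or has already gone wrong. The error log is only displayed when I stop it with ctrl+c.
Platform: brain++
How it was run: ./run.sh in a terminal
Error log: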
Process Process-3:
File "tools/train.py", line 283, in <module>
main()
File "tools/train.py", line 85, in main
p.join()
File "/usr/lib/python3.6/multiprocessing/process.py", line 124, in join
res = self._popen.wait(timeout)
File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 50, in wait
return self.poll(os.WNOHANG if timeout == 0.0 else 0)
File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/megengine/distributed/server.py", line 196, in connect
if self.proxy.connect():
File "/usr/lib/python3.6/xmlrpc/client.py", line 1112, in __call__
return self.__send(self.__name, args)
File "/usr/lib/python3.6/xmlrpc/client.py", line 1452, in __request
verbose=self.__verbose
File "/usr/lib/python3.6/xmlrpc/client.py", line 1154, in request
return self.single_request(host, handler, request_body, verbose)
File "/usr/lib/python3.6/xmlrpc/client.py", line 1166, in single_request
http_conn = self.send_request(host, handler, request_body, verbose)
File "/usr/lib/python3.6/xmlrpc/client.py", line 1279, in send_request
self.send_content(connection, request_body)
File "/usr/lib/python3.6/xmlrpc/client.py", line 1309, in send_content
connection.endheaders(request_body)
File "/usr/lib/python3.6/http/client.py", line 1249, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1036, in _send_output
self.send(msg)
File "/usr/lib/python3.6/http/client.py", line 974, in send
self.connect()
File "/usr/lib/python3.6/http/client.py", line 946, in connect
(self.host,self.port), self.timeout, self.source_address)
File "/usr/lib/python3.6/socket.py", line 724, in create_connection
raise err
File "/usr/lib/python3.6/socket.py", line 713, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "tools/train.py", line 97, in worker
device=rank,
File "/usr/local/lib/python3.6/dist-packages/megengine/distributed/group.py", line 126, in init_process_group
_sd.client = Client(master_ip, port)
File "/usr/local/lib/python3.6/dist-packages/megengine/distributed/server.py", line 187, in __init__
self.connect()
File "/usr/local/lib/python3.6/dist-packages/megengine/distributed/server.py", line 199, in connect
time.sleep(1)
KeyboardInterrupt
Process Process-2:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/megengine/distributed/server.py", line 196, in connect
if self.proxy.connect():
File "/usr/lib/python3.6/xmlrpc/client.py", line 1112, in __call__
return self.__send(self.__name, args)
File "/usr/lib/python3.6/xmlrpc/client.py", line 1452, in __request
verbose=self.__verbose
File "/usr/lib/python3.6/xmlrpc/client.py", line 1154, in request
return self.single_request(host, handler, request_body, verbose)
File "/usr/lib/python3.6/xmlrpc/client.py", line 1166, in single_request
http_conn = self.send_request(host, handler, request_body, verbose)
File "/usr/lib/python3.6/xmlrpc/client.py", line 1279, in send_request
self.send_content(connection, request_body)
File "/usr/lib/python3.6/xmlrpc/client.py", line 1309, in send_content
connection.endheaders(request_body)
File "/usr/lib/python3.6/http/client.py", line 1249, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1036, in _send_output
self.send(msg)
File "/usr/lib/python3.6/http/client.py", line 974, in send
self.connect()
File "/usr/lib/python3.6/http/client.py", line 946, in connect
(self.host,self.port), self.timeout, self.source_address)
File "/usr/lib/python3.6/socket.py", line 724, in create_connection
raise err
File "/usr/lib/python3.6/socket.py", line 713, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "tools/train.py", line 97, in worker
device=rank,
File "/usr/local/lib/python3.6/dist-packages/megengine/distributed/group.py", line 126, in init_process_group
_sd.client = Client(master_ip, port)
File "/usr/local/lib/python3.6/dist-packages/megengine/distributed/server.py", line 187, in __init__
self.connect()
File "/usr/local/lib/python3.6/dist-packages/megengine/distributed/server.py", line 199, in connect
time.sleep(1)
KeyboardInterrupt
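Reading the log above, each worker appears to be stuck inside `Client.connect()`, sleeping for one second and retrying after `ConnectionRefusedError`, which suggests nothing was accepting connections at the master address when the workers started. A minimal sketch for checking this from a second terminal while `./run.sh` seems to hang is given below. It uses only the Python standard library; `can_connect` is a hypothetical helper, and the host and port must be whatever `tools/train.py` passes to `init_process_group` (not shown in this thread).

```python
import socket


def can_connect(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection raises OSError (including ConnectionRefusedError
        # and timeouts) when nothing is reachable at that address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If `can_connect(master_ip, port)` keeps returning `False` while the workers are waiting, the problem is on the server side of the distributed setup rather than in the model or the data.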
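##########################Update 20210703###########################
Downgrading megengine from 1.3.1 to 1.2.0 solved the problem above.
I ran into this too. It went away after I updated the test set, although the score is still low.
##########################Update 20210704###########################
It worked again after training one more time…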