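Because PyPI limits the size of a project's wheel packages, Windows users need to use the following pip command to download and install from the MegEngine website:
pip3 install megengine -f https://megengine.org.cn/whl/mge.html
The list of available .whl files is at: https://megengine.org.cn/whl/mge.html
Users on other systems can download and install normally without the -f https://megengine.org.cn/whl/mge.html option.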
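
The platform rule above can be sketched as a small helper that builds the pip command for the current system. This is a minimal illustration, not part of MegEngine: the function name `build_install_command` and the constant name are hypothetical, and the only facts taken from the note are the package name and the wheel index URL.

```python
import platform
import sys

# URL taken from the installation note; the constant name is an assumption.
MEGENGINE_WHEEL_INDEX = "https://megengine.org.cn/whl/mge.html"


def build_install_command(system=None):
    """Return the pip command (as an argument list) for the given system.

    `system` defaults to platform.system(), which is "Windows" on Windows.
    """
    system = system or platform.system()
    cmd = [sys.executable, "-m", "pip", "install", "megengine"]
    if system == "Windows":
        # Windows wheels are fetched from the MegEngine website,
        # so pip needs the extra -f (find-links) location.
        cmd += ["-f", MEGENGINE_WHEEL_INDEX]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(build_install_command()))
```

Passing the resulting list to `subprocess.run` would perform the installation; the sketch only prints it.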
Bug Fixes
- Fix errors reported by ASAN
- Fix copying across compute nodes on Cambricon
- Fix the out-of-memory error caused by profiling
- Fix device memory not being reclaimed correctly on Cambricon
- Fix the out-of-memory error on GPU 0 during distributed training caused by incorrectly set CUDA environment variables
- Fix tensor split
- Reduce the memory usage of ARM testcase
- Reduce the memory usage of Fastrun
- Fix the issue where the batch size specified when dumping an Atlas model exceeds the model's maximum batch size
- Fix MLIR failing to handle different shapes correctly
- Fix the dangling pointer that occurs when MLIR executes on CUDA
- Fix weight pre-processing to handle ConvBias without bias correctly
- Fix garbled logs caused by a second crash while printing the error stack
New Features
- Perform a full sync when Python exits
- Add sub-packages to MegEngine
- Print a warning message when the pooling window size is smaller than the padding size
- Add Atlas Stub to support dumping Atlas models on x86 platforms
- Add memory forwarding to JITExecutor operator
- Allow load_and_run to print results to stdout/stderr, not just to files
- Add the EasyQuant quantization method
- Support tensor swap-in/swap-out and recomputation
- Support in-place add_update in Optimizer
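
To illustrate why the pooling warning listed above is useful, here is a minimal 1-D sketch in plain Python. It is not MegEngine's implementation, and the function name `all_padding_windows` is hypothetical. It lists the output windows that cover only padding, which is what happens at the borders when the window size is smaller than the padding size.

```python
def all_padding_windows(n, k, p, s=1):
    """Return indices of pooling windows that contain no real input element.

    n: input length, k: window size, p: padding on each side, s: stride.
    """
    padded_len = n + 2 * p
    num_windows = (padded_len - k) // s + 1
    # Real input occupies padded positions [p, p + n).
    real_start, real_end = p, p + n
    result = []
    for i in range(num_windows):
        start, end = i * s, i * s + k  # window covers [start, end)
        if end <= real_start or start >= real_end:
            result.append(i)
    return result


if __name__ == "__main__":
    # Window (2) smaller than padding (3): some border windows see only padding.
    print(all_padding_windows(n=4, k=2, p=3))
    # Window (3) larger than padding (1): every window touches real input.
    print(all_padding_windows(n=4, k=3, p=1))
```

Outputs computed from such windows depend only on the padding value, so a warning is a reasonable signal that the configuration is probably unintended.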
Performance Optimization
- Add pre-processing fusion optimization for common video detection networks
- Add fusion of DimShuffle and Reformat with ConvBias
- Add fusion of WarpPerspective with DimShuffle
- Improve performance by moving the tensor, gradient, and trace implementations from Python to C++
- Modify the gradient rules of some operators to reduce device memory usage
- Optimize the performance and memory usage of QAT and TQT quantized training
- Adjust the CUDA chanwise Convolution algorithm selection strategy
- Optimize the performance of NCHW32 pooling operator
- Optimize the performance of CallbackCaller operator
- Optimize CUDA IO communication
Breaking Changes
- None