MegEngine v1.5.0 Release Notes

huahua404-MegEngine · 2021-08-04 02:10

Compatibility violation

Removed unnecessary implicit type conversions to resolve performance and GPU memory issues.
- Impact: functional APIs no longer accept NumPy arrays as input; convert them to MegEngine Tensor first.

Highlights

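DTR upgrade
- In the static graph mode of trace, the DTR algorithm can now be used to reduce the peak GPU memory of the computation graph. Compared with Sublinear, the maximum ResNet 50 batch size increases from 350 to 450 on a single GPU and from 300 to 450 on eight GPUs.
- In dynamic graph mode, DTR can be enabled without a threshold; users no longer need to specify eviction_threshold.
Mixed-precision training is supported.
Added support for higher-order derivatives (experimental).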

Bug Fixes

Python API

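Fixed the computation of the nvof output shape.
The launcher now checks whether CUDA has been initialized before forking threads.
Fixed expand_dims not handling scalars.
Fixed F.topk being unusable with kth_only=True.
Tensors for model parameters created from a state dict no longer use the cache, which prevents incorrect results caused by in-place modification of the parameters.
megengine.random.RNG: fixed an error reported at program exit when an RNG is defined as a global variable.
megengine.random.seed: fixed resetting of the random seed; calling seed() with the same value now produces the same random numbers every time, consistent with NumPy.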

CUDA

Fixed the TensorRT runtime to support int8 nchw4 input, which can reduce GPU memory usage.
Fixed an error when getting the algorithm for cuDNN ConvolutionBackwardData.

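Tools

Fixed ASan not working when enabled through CMake on Windows.
Fixed toposort so that it returns oprs in definition order.
Fixed an error when computing the std of parameters for quantized models, fixed a type error in parameter counting when the pooling kernel size is 2-D, and added support for returning dims in the statistics.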

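General Components

Disabled static memory statistics in TEE mode to keep the TEE environment secure.
Fixed incorrect results from the x86 matmul operator when the output tensor is not contiguous.
Fixed a compatibility issue in the serialization of oss models.
Fixed the device type when dumping a model.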

New Features

Python API

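Added the lsq operator.
Removed the user-specified threshold from DTR.
Added the _has_inf opr.
Distributed training: added the user_pop function, which releases resources after the user retrieves a custom key-value pair.
Deprecated get_device_count_by_fork.
Added single-machine allreduce through CPU shared memory. Setting backend="auto" in the launcher enables CPU shared memory on GPUs that do not support P2P communication.
Added unfold.
Added silu and gelu.
interpolate now supports the nearest and bicubic modes for inputs with 1 or 3 channels.
Added ops implementing random operators such as gamma, beta, poisson, and permutation.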

ARM

Added an optimization for a K1x1S1 first-layer convolution under the nchw44 layout.

CUDA

CUDA topK supports the FP16 data type.

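General Components

Fixed precision jitter across multiple batches; fast-run gains an option to ignore batch size.
Changed the CUDA JIT configuration interface.
Added collection of memory usage information for the computation graph.
Collective communication now supports uint8.
trace, PowC, and elemwise operators now support empty inputs and outputs.
Added gradient backpropagation for bn in inference mode.
Added serialization of metadata.
Added RelayoutEmitter to help Tensor handle complex layout transformations.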

Improvements

ARM

Optimized ARM pooling and multithreading performance.

CUDA

Refactored the generation logic of cutlass-related kernels.
Refactored the CUDA relayout format kernels.
CUDA topK supports the fp16 data type.

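General Components

The Pooling operator supports fast-run algorithm search.
Refactored the profiler and added support for trace.
Group convolution supports the new version of fast-run.
Optimized x86 pooling performance.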

Compatibility violation

Remove unnecessary implicit type conversions to address performance and GPU memory issues.
- Impact: functional APIs no longer accept NumPy arrays as input; convert them to MegEngine Tensor first.

Highlights

DTR
- Support DTR memory optimization for static graphs under trace mode. Compared with Sublinear, the maximum batch size for training ResNet 50 increases from 350 to 450 on 1 GPU, and from 300 to 450 on 8 GPUs.
- In dynamic graph mode, DTR can be used without specifying a memory eviction threshold.
Add mixed-precision training.
Support higher-order differentiation (experimental).
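
As a minimal sketch of the threshold-free usage in dynamic graph mode, assuming the `megengine.dtr` module interface that shipped with DTR in earlier releases: the commented-out line shows the older style of choosing a threshold first, and the `"5GB"` value there is only an example.

```python
import megengine as mge

# Before v1.5.0 a threshold had to be chosen first, for example:
# mge.dtr.eviction_threshold = "5GB"

# From v1.5.0 on, DTR can be enabled in dynamic graph mode without one.
mge.dtr.enable()
```

The call belongs before the model and training loop are constructed, so that tensors created afterwards are managed by DTR.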

Bug Fixes

Python API

Fix nvof output shape computation.
Check whether CUDA has been initialized before forking threads in launcher.
Fix expand_dims for scalar.
Fix F.topk being unusable with kth_only=True.
Tensors for model parameters created from a state dict no longer use the cache, which prevents incorrect results caused by in-place modification of the parameters.
megengine.random.RNG: Fix the error reported at program exit when an RNG is defined as a global variable.
megengine.random.seed: Fix resetting of the random seed; the same seed value now yields the same random numbers every time, consistent with NumPy.
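
To make the `kth_only` item above concrete, here is a pure-Python sketch of the semantics the parameter name suggests: with `kth_only=True` only the k-th ranked element is returned instead of all k. `topk_reference` is a hypothetical helper for 1-D lists, not MegEngine's implementation, and the ascending default is an assumption.

```python
def topk_reference(values, k, descending=False, kth_only=False):
    """Return (top values, their indices) for a 1-D list.

    Hypothetical reference helper, not MegEngine code.
    """
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and len(values)")
    # Indices ordered by value; sorted() is stable, so ties keep input order.
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)
    picked = order[:k]
    if kth_only:
        # Keep only the k-th element of the ranking.
        picked = picked[-1:]
    return [values[i] for i in picked], picked


scores = [3.0, 1.0, 2.0, 5.0]
print(topk_reference(scores, 2))                 # the two smallest values
print(topk_reference(scores, 2, kth_only=True))  # only the 2nd smallest
```

A reference like this is handy for checking framework output on small inputs after upgrading.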

CUDA

Fix the TensorRT runtime to support int8 input in nchw4 format, which can reduce memory usage.
Fix an error when getting the algorithm for cuDNN ConvolutionBackwardData.

Tools

Fix ASan not working on Windows when building with CMake.
Fix toposort so that oprs are returned in definition order.
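Fix module stats errors: computing the std of parameters for quantized models, and a type error in parameter counting when the pooling kernel size is 2-D; stats can also return dims.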

General Components

Turn off the static memory statistics function in TEE to ensure the safety of the TEE environment.
Fix incorrect results of the x86 matmul operator when the output tensor is not contiguous.
Fix a compatibility issue in oss model serialization.
Fix a device type error with const when dumping a model.

New Features

Python API

Add lsq opr.
Remove eviction threshold in DTR.
Add _has_inf opr.
Add the user_pop function for distributed training, which releases resources after a user-defined key-value pair has been retrieved.
Deprecate get_device_count_by_fork.
Enable allreduce through CPU shared memory on a single machine. Setting backend="auto" in launcher enables it on GPUs that do not support P2P communication.
Add unfold.
Add silu and gelu.
interpolate supports nearest and bicubic modes for inputs with 1 or 3 channels.
Add random ops including gamma, beta, poisson, and permutation.
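
For reference, the standard definitions of the two new activations are SiLU(x) = x · sigmoid(x) and GELU(x) = x · Φ(x), where Φ is the standard normal CDF. The sketch below evaluates them on Python floats with only the standard library. It illustrates the math and makes no claim about how MegEngine implements them (for example, whether an exact or an approximate GELU is used); the helper names are chosen for illustration.

```python
import math


def silu(x: float) -> float:
    """SiLU (swish): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


for v in (-1.0, 0.0, 1.0):
    print(v, silu(v), gelu(v))
```

Both functions pass through zero and approach the identity for large positive inputs, which is why they are often used as smooth alternatives to ReLU.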

ARM

Add an optimization for first-layer convolution with K1x1S1 under the nchw44 layout.

CUDA

The CUDA topK operator supports the fp16 data type.

General Components

Fix precision jitter across multiple batches, and add an option in fast-run to ignore batch size.
Modify the CUDA JIT configuration interface.
Add recording of memory usage information for the computation graph.
Enable uint8 for collective communication.
Add support for empty inputs and outputs in trace, PowC, and elemwise operators.
Add gradient backpropagation for bn in inference mode.
Add support for serializing metadata.
Add RelayoutEmitter to facilitate complex layout transformations of tensors.

Improvements

ARM

Optimize ARM pooling and multithread performance.

CUDA

Refactor the generation logic of cutlass-related kernels.
Refactor CUDA relayout format related kernels.
The CUDA topK operator supports the fp16 data type.

General Components

The pooling operator supports fast-run.
Refactor the profiler function and add support for MegEngine trace.
The group Convolution operator supports the new version of fast-run.
Add an algorithm for x86 max pooling with W13S1 under NCHW88.

Appendix

GitHub source code: https://github.com/MegEngine/MegEngine/
MegEngine official documentation - Getting Started: https://megengine.org.cn/doc/stable/zh/getting-started/index.html
MegStudio: https://studio.brainpp.com/