MegEngine v1.8.1 Release Notes

huahua404-MegEngine · 2022-02-15 02:25

MegEngine

Notice

Starting with the next release, MegEngine v1.9, Python 3.5 will no longer be supported. Please prepare in advance.

HighLight

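megengine.functional.topk adds a "descending" argument that defines the sort order. In this release it defaults to False, which keeps the ascending (smallest-first) order, and a warning is issued if the argument is not specified. In v1.12 the default will change to True to match the usual meaning of top-K, that is, selecting the K largest elements of a 2-D matrix.
MegEngine supports on-device training; see here for usage.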

Bug fixes

Python API

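Fix incorrect results from megengine.functional.floor_div for integer inputs with opposite signs.
megengine.functional.broadcast_to now accepts None, meaning that the corresponding dimension does not need to be broadcast, in order to support automatic inference of -1 shapes.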
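The floor_div fix above concerns integers with opposite signs, which is where floor division and C-style truncating division disagree. The release note does not say what the incorrect result was, so the contrast below is only a plain-Python illustration of the expected semantics; `trunc_div` and `floor_div` are hypothetical helpers, not MegEngine functions.

```python
def trunc_div(a, b):
    """C-style integer division: the quotient is rounded toward zero."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def floor_div(a, b):
    """Floor division: the quotient is rounded toward negative infinity."""
    return a // b


# The two agree when the signs match and differ when they do not
# (unless the division is exact).
for a, b in [(7, 2), (-7, -2), (-7, 2), (7, -2), (-6, 2)]:
    print(a, b, "floor:", floor_div(a, b), "trunc:", trunc_div(a, b))
```

A floor_div operator is expected to match the `floor` column for every sign combination.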

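Release process

Fix graph surgery failing when a TM model serialized with MegEngine v1.7 is loaded with MegEngine v1.8.
TracedModule bug fixes:
- Fix ops from third-party backends failing to serialize.
- Fix Input-type exprs not being bound to top_graph.
- Fix the expr insertion position being computed incorrectly when a ModuleNode is used as an input during graph surgery.
- Fix TracedModule being unable to run models from v1.7 and earlier that contain ones or zeros after loading them.
- Fix TracedModule recursing too deeply in some cases.
- Fix TracedModule being unable to trace repeatedly.
- Fix TracedModule failing to recognize pad correctly.
- Improve TracedModule's error messages for invalid inputs.
Fix an error raised when global graph optimization and fastrun are both enabled and the only algorithm selected is naive.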

CUDA

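Check earlier whether an input Tensor is too large and improve the error message, instead of printing the raw cuDNN error.
Fix incorrect output inference when tensorrt changes shape.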

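Common components

Fix the ConvBias operator of the MegDNN fallback being unavailable.
Fix fastrun not working for matmul and pooling in a model after graph optimization.
Fix MegEngine failing to compile in low-memory environments (8G).
Fix extra memory being used when a large numpy array is converted to a tensor, or a large tensor is converted to a numpy array.
Add checking and reporting of asynchronous errors on compute devices.
Fix indexing operations not being traceable when the ndim of the tensor is unknown.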

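Peripheral tools

Fix data passed on the load and run command line not being parsed.
Fix incorrect dumping of the qint4 and bool data types in io dump.
Fix megengine.utils.module_stats being unusable because the required libraries were not imported.
Fix a build error in load and run when compiling with CUDA.
Remove the dump_with_testcase tool.
Fix load and run not recognizing models serialized with flatbuffer.
Fix module_stats, the tool for parameter and computation statistics, failing to produce statistics when inputs is a dict.
Fix incorrect weight statistics reported by load and run with the --get-static-mem-info option.
Fix an argument parsing error in load_and_run for options of the form --input "ratio:1.0".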

New Features

Python API

Add the megengine.functional.diag operator.

Release process

TracedModule supports renaming a Node during graph surgery.
TracedModule provides an enable_expr_checker switch that performs additional checks during tracing.

ARM

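Optimize the implementation of some math computations on Arm for a slight performance improvement.
The ARM backend supports the rnn_cell/lstm_cell/lstm operators.
Add multithreading support for some elemwise cases, to support performance optimization of some models in the TS project.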

Third-party hardware

Add support for Cambricon MLU270.
TensorRT Runtime Opr supports dynamic-shape models and can automatically select the closest "IOptimizationProfile" for the input shape.

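Common components

CPU supports running int4 models.
megengine.functional.nn.remap supports differentiation with the float16 dtype.
Optimize typecvt performance for non-contiguous inputs.
Add on-device training support; see here for more details.
On Windows, load_and_run adds support for dynamically linking MegEngine.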

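Peripheral tools

Add a cmake formatting tool that formats cmake-related files when run.
Add a JIT build tool for Custom Op; documentation is still to come.
Support building Android whl packages.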

Improvements

Python API

Reduce the elemwise overhead of the megengine.random.RNG.uniform API when low=0 and high=1; single-operator performance improves by about 75%.

CUDA

Improve megengine.functional.nn.softmax on CUDA platforms when axis is a scalar; performance improves by about 200% to 450%.
Improve megengine.functional.nn.dropout on CUDA platforms, by about 650%.
Improve megengine.functional.nn.layer_norm on CUDA platforms, by about 540%.

ARM

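When a tensor is converted from int16/uint16 to float and the converted data then goes through Mul/ADD operations, the operations are fused into a single ElemwiseMultiType. Validated on model 369 of project 010, this is about 20 times faster (23512.8 us → 1208 us).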

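Common components

Improve dynamic AMP performance by about 1%, validated on multiple models.
Reduce jit.trace time on CPU. Validated with VGG16 at bs 256, jit.trace drops from about 4 minutes to 2 minutes.
Fix model execution being too slow on CPU; validated with VGG16 at bs 10, it drops from 10 minutes to about 6 s.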

MegEngine Lite

Bug fixes

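Fix the incorrect device id setting of TensorBatchCollector in lite.
The to_numpy method of an empty tensor in Lite now includes the data type information of the output Tensor.
Fix inference failing for some models when the user customizes the model's output space.
Fix the device configuration interface of MegEngine Lite so that it only sets the device type of xpu devices to the user-specified device type.
Fix the MegEngine Lite python interface producing no error log when a batch id in TensorBatchCollector is wrong.
Fix an error when "record level 2" is enabled in MegEngine Lite.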

New Features

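Add Cambricon support in lite.
MegEngineLite adds an interface named get_data_by_share. By calling it, users can obtain a numpy object of a lite tensor with zero copy.
Add cv classification and detection examples.
Add global graph optimization support.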

MegEngine

Notice

Support for Python 3.5 will be dropped starting with MegEngine v1.9.

HighLight

megengine.functional.topk will default to descending order in v1.12. Please specify the “descending” argument during the transition period.
MegEngine supports on-device training; see here for details.
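To make the effect of the "descending" argument concrete, here is a minimal plain-Python sketch of the two orderings. `topk_reference` is a hypothetical helper that works on a Python list; it only mirrors the documented behaviour (ascending when descending=False, largest-first when descending=True) and is not MegEngine's implementation.

```python
def topk_reference(values, k, descending=False):
    """Return (selected_values, indices) for the k smallest or k largest items."""
    if not 0 < k <= len(values):
        raise ValueError("k must be between 1 and len(values)")
    # Sort the indices by value; reverse=True puts the largest values first.
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)[:k]
    return [values[i] for i in order], order


scores = [0.2, 0.9, 0.1, 0.7]
print(topk_reference(scores, 2))                   # the two smallest scores
print(topk_reference(scores, 2, descending=True))  # the two largest scores
```

Passing the argument explicitly, as in the second call, keeps results stable across the default change planned for v1.12.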

Bug fixes

Python API

Correct behavior of megengine.functional.floor_div for integers with opposite sign.
Allow passing None to megengine.functional.broadcast_to, meaning the corresponding axis should not be broadcast.
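One way to read the None entry is as "keep whatever size the input already has on this axis". The sketch below resolves a target shape under that reading using ordinary right-aligned broadcasting rules. `resolve_broadcast_shape` is a hypothetical helper, and rejecting None on a newly added leading axis is an assumption, not documented MegEngine behaviour.

```python
def resolve_broadcast_shape(input_shape, target_shape):
    """Replace each None in target_shape with the matching input dimension."""
    if len(target_shape) < len(input_shape):
        raise ValueError("target shape has fewer dimensions than the input")
    offset = len(target_shape) - len(input_shape)  # shapes are right-aligned
    resolved = []
    for axis, want in enumerate(target_shape):
        have = input_shape[axis - offset] if axis >= offset else None
        if want is None:
            if have is None:
                # Assumption: None needs an existing input axis to copy from.
                raise ValueError("None has no input dimension to keep")
            resolved.append(have)
        elif have is None or have == 1 or have == want:
            resolved.append(want)
        else:
            raise ValueError(f"cannot broadcast dimension {have} to {want}")
    return tuple(resolved)


print(resolve_broadcast_shape((1, 3), (4, None)))  # second axis is left as-is
```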

Release process

Fix graph surgery failing when a TracedModule model serialized with MegEngine v1.7 is loaded with MegEngine v1.8.
TracedModule bug fixes:
- Fix ops in third-party backends such as tensorrt failing to serialize.
- Fix Input-type exprs not being bound to top_graph.
- Fix incorrect calculation of the expr insertion position when a ModuleNode is used as an input during graph surgery.
- Fix models from v1.7 and earlier that contain ones or zeros failing to run after loading.
- Fix a recursion too deep issue when copying a traced module.
- Fix an error that prevents a traced module from tracing a module more than once.
- Fix traced module not recognizing pad.
- Improve the error message for illegal inputs fed into a traced module.
Fix an error raised when global graph optimization and fastrun are enabled at the same time and the only algorithm selected is naive.

CUDA

Check earlier whether an input Tensor is too large and improve the error message, instead of printing the raw cuDNN error.
Fixed output derivation error when tensorrt changed shape.

Common components

Fix the problem that the ConvBias operator of MegDNN fallback is not available.
matmul, pooling operators support fastrun, which will lead to better inference performance for C++ models.
Fix MegEngine build failures in low-memory environments (8G).
Reduce memory consumption when a large numpy array is converted to a tensor or a large tensor is converted to a numpy array.
Add out-of-bound access check for some operators.
Fix the problem that the indexing operation cannot be traced when the ndim of the tensor is unknown.

Peripheral tools

Fixed the problem that the data entered in the load and run command line could not be parsed.
Fix qint4 and bool data type dump errors in io dump.
Fix the problem that megengine.utils.module_stats cannot be used without import related libraries.
Fix load and run build error when build with CUDA.
Remove dump_with_testcase tool.
Fix the problem that load and run cannot recognize the serialized model with flatbuffer.
Fix a bug in megengine.tools.network_visualize when inputs is an instance of dict.
Fix incorrect statistics reported when using --get-static-mem-info.
Fix a parsing error in load_and_run for options such as --input "ratio:1.0".

New Features

Python API

Add megengine.functional.diag operator.
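The release note does not describe the operator's semantics. Assuming it follows the familiar numpy.diag convention (a 1-D input builds a matrix, a 2-D input extracts a diagonal, with an offset k), the behaviour can be sketched in plain Python as follows. `diag_reference` is a hypothetical helper on nested lists, not the MegEngine operator.

```python
def diag_reference(data, k=0):
    """numpy.diag-style behaviour on Python lists (assumed convention)."""
    if data and isinstance(data[0], (list, tuple)):
        # 2-D input: extract the k-th diagonal (k > 0 is above the main one).
        cols = len(data[0])
        return [row[r + k] for r, row in enumerate(data) if 0 <= r + k < cols]
    # 1-D input: build a square matrix with the values on the k-th diagonal.
    n = len(data) + abs(k)
    out = [[0] * n for _ in range(n)]
    for i, value in enumerate(data):
        r, c = (i, i + k) if k >= 0 else (i - k, i)
        out[r][c] = value
    return out


print(diag_reference([1, 2, 3]))              # build a 3x3 diagonal matrix
print(diag_reference([[1, 2], [3, 4]], k=1))  # extract the first superdiagonal
```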

Release process

TracedModule supports renaming a Node during graph surgery.
Add an enable_expr_checker switch for traced module, which adds more checks during tracing.

ARM

Optimize the implementation of some mathematical calculations on Arm for a slight performance improvement.
Add arm rnn_cell/lstm_cell/lstm operator.
Support multithreading for some ternary elemwise cases on Arm.

Third-party hardware

Added support for Cambricon MLU270.
Support dynamic-shape models in TensorRT Runtime Opr and automatically select the closest IOptimizationProfile according to the input shape.
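The note does not define "closest". A TensorRT optimization profile carries min, opt and max shapes, so one plausible reading is: among the profiles whose range contains the input shape, pick the one whose opt shape is nearest. The sketch below implements that reading only; the `Profile` class, the `pick_profile` helper and the L1 distance are all assumptions, not MegEngine's or TensorRT's actual selection logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Hypothetical stand-in for the min/opt/max shapes of one profile."""
    min_shape: tuple
    opt_shape: tuple
    max_shape: tuple


def pick_profile(profiles, shape):
    """Return the index of the covering profile whose opt shape is nearest."""
    def covers(p):
        return len(p.min_shape) == len(shape) and all(
            lo <= s <= hi for lo, s, hi in zip(p.min_shape, shape, p.max_shape)
        )

    def distance(p):
        # Assumed metric: L1 distance between the opt shape and the input shape.
        return sum(abs(o - s) for o, s in zip(p.opt_shape, shape))

    candidates = [i for i, p in enumerate(profiles) if covers(p)]
    if not candidates:
        raise ValueError(f"no profile covers input shape {shape}")
    return min(candidates, key=lambda i: distance(profiles[i]))


profiles = [
    Profile((1, 3, 224, 224), (1, 3, 224, 224), (4, 3, 224, 224)),
    Profile((1, 3, 224, 224), (8, 3, 224, 224), (16, 3, 224, 224)),
]
print(pick_profile(profiles, (2, 3, 224, 224)))  # small batch
print(pick_profile(profiles, (6, 3, 224, 224)))  # batch outside the first profile's range
```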

Common components

CPU supports running int4 model.
Support backward computation for float16 dtype in remap.
Optimize the performance of typecvt in non-continuous situations.
Add training based on the cpp interface; see here for more details.
On Windows, load_and_run now supports dynamically linking megengine.

Peripheral tools

Added a cmake formatting tool: cmakeformat.py.
Add the JIT builder for Custom Op.
Support building Python wheels for Android (termux env).

Improvements

Python API

Add fastpath when low=0 and high=1 for megengine.random.RNG.uniform to improve performance.
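The idea behind such a fast path can be shown with scalars: a uniform sample on [low, high) is usually produced by scaling and shifting a base sample from [0, 1), and both elementwise steps are redundant when low=0 and high=1. The sketch below uses Python's random module; `uniform_sample` is a hypothetical helper, not the MegEngine implementation, which operates on tensors.

```python
import random


def uniform_sample(low=0.0, high=1.0, rng=random):
    """Draw one sample from [low, high), skipping the rescale when it is a no-op."""
    u = rng.random()  # base sample in [0, 1)
    if low == 0.0 and high == 1.0:
        return u  # fast path: no multiply or add needed
    return low + (high - low) * u  # general path: scale, then shift


print(uniform_sample())          # takes the fast path
print(uniform_sample(2.0, 5.0))  # takes the general path
```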

CUDA

Improve performance of softmax when axis is scalar on CUDA platforms, by 200% - 450%.
Enhance performance of dropout on CUDA platforms by up to 650%.
Enhance performance of layer_norm on CUDA platforms, by up to 540%.

ARM

Add an operator fusion case for TypeCvt and Elemwise: a pass fuses a TypeCvt (uint16 to float) operator and an Elemwise operator (Mul/ADD) into one ElemwiseMultiType operator, with a corresponding kernel developed for aarch64.
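The benefit of this kind of fusion is that the converted float data is never written out as a separate intermediate. A plain-Python sketch of the unfused and fused forms for the Mul case follows; both function names and the `scale` parameter are hypothetical, and the real pass works on graph operators and aarch64 kernels, not Python lists.

```python
def typecvt_then_mul(raw_u16, scale):
    """Unfused: a TypeCvt pass followed by a separate Elemwise Mul pass."""
    as_float = [float(x) for x in raw_u16]  # intermediate buffer is materialized
    return [x * scale for x in as_float]


def fused_typecvt_mul(raw_u16, scale):
    """Fused: convert and multiply each element in a single pass."""
    return [float(x) * scale for x in raw_u16]


samples = [0, 1, 1024, 65535]  # values in the uint16 range
print(typecvt_then_mul(samples, 0.5) == fused_typecvt_mul(samples, 0.5))
```

Both forms compute the same values; the fused one reads the input once and allocates no intermediate list.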

Common components

Improve dynamic AMP performance by about 1%, validated on multiple models.
Optimize the placement order of algorithms in matrixmul under the x86 platform in dnn to improve the dump time of jit.trace(bs256 VGG16, 4min -> 2min).
Fix the problem that the model speed on CPU is too slow (bs10 VGG16,10min -> 6s).

MegEngine Lite

Bug fixes

Fix the device ID setting error of TensorBatchCollector in lite.
Include data type information when calling the to_numpy method on an empty tensor.
Fix the problem that some model inferences fail when users customize the output space of the model.
Fix device type configuration for megengine lite. Now only the devices of which the device type is unspecified will be modified.
Add warning for megengine lite python interface, when error of batch indexes occurs in the TensorBatchCollector.
Fix runtime error when record level of megengine lite is 2.

New Features

Add interface for cambricon models in lite.
Add a new interface named get_data_by_share to the megenginelite tensor module. It returns a zero-copy numpy object containing the data of a lite tensor object.
Add classification and detection examples in lite.
Add megenginelite Python & c/c++ global graph optimization interface.

Appendix

GitHub source code: https://github.com/MegEngine/MegEngine/
MegEngine official documentation, getting started: https://megengine.org.cn/doc/stable/zh/getting-started/index.html
MegStudio：https://studio.brainpp.com/