• [Hot Activity] MindSpore operator usage experience sharing event
    Have you run into problems while using MindSpore? How did you solve them? Would you share them with us? Problems from both operator usage and operator development are welcome.
    Prize preview
    Event period: October 8 – October 23
    How to participate: share a MindSpore operator-related problem you hit in daily use and how you solved it, or a record of a successful operator development. A qualifying write-up must contain all four parts: environment configuration, problem description, root-cause analysis, and solution. Format reference: cid:link_0
    Publish: post your write-up in the "Operator Usage" board of the forum.
    Reply with the link: comment under this event thread with a link to your write-up and a contact email.
    Notes:
    - No spamming; spam posts will be deleted and treated as invalid entries.
    - Entries must be original; plagiarism counts as invalid participation.
    - Do not use others' screenshots or content; violations count as invalid participation.
    - Entries must follow the required sharing format, otherwise they count as invalid.
    - The MindSpore team reserves the right of final interpretation.
  • [Data Processing] Data conversion error when iterating over data for training
    As shown in the figure, the data conversion at line 274 fails during model training. The error is shown below; the dict keys are of type str. I have been debugging for a long time without success and urgently need help from the experts here. Thanks!
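    One pattern that commonly triggers this kind of conversion error (an assumption here, since the post's code isn't shown) is a generator or map function returning a Python dict: MindSpore converts each dataset column to a Tensor, so a source should yield numpy arrays as a tuple of named columns instead. A minimal sketch:

        import numpy as np
        import mindspore.dataset as ds

        # Instead of yielding {"data": ..., "label": ...} (a dict with str keys),
        # yield a tuple of numpy arrays and name the columns explicitly.
        def generator():
            for i in range(4):
                data = np.random.randn(3, 32, 32).astype(np.float32)
                label = np.array(i % 2, dtype=np.int32)
                yield data, label

        dataset = ds.GeneratorDataset(generator, column_names=["data", "label"])
        for row in dataset.create_dict_iterator(output_numpy=True):
            print(row["data"].shape, row["label"])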
  • [Execution Issue] Memory exhaustion when using ValAccMonitor
    AI framework: mindspore
    Model: resnet18
    Scenario: image classification
    Data volume: about 20,000 images; train/validation split: [0.9, 0.1]
    Problem: when training with the ValAccMonitor module, the machine freezes (even the mouse lags), and checking memory shows it is full. With other callbacks such as LossMonitor everything runs fine. There is no error output, only: Process finished with exit code 137 (interrupted by signal 9: SIGKILL). My machine has 32 GB of RAM. Could it be that the code is fine and the memory is simply too small? Has anyone hit the same problem?
    ----- Update -----
    It does look like insufficient memory rather than a code problem: after shrinking the dataset, training runs normally. I really didn't expect it to be this memory-hungry.
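    Since shrinking the data fixed it, two knobs that bound host-RAM usage may be worth trying; a sketch under the post's [0.9, 0.1] split (the path and sizes are illustrative):

        import mindspore.dataset as ds

        # A smaller prefetch queue means fewer decoded batches buffered in RAM at once.
        ds.config.set_prefetch_size(2)

        dataset = ds.ImageFolderDataset("./images", shuffle=False)
        train_ds, val_ds = dataset.split([0.9, 0.1])

        # Capping the validation set bounds what the eval pass can consume.
        val_ds = val_ds.take(512)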
  • [Data Processing] mindspore-gpu model run warns that the dataset has a dynamic shape
    The model runs in PyNative (dynamic graph) mode but fails in static graph mode. Warning message:
    [WARNING] DEVICE(99493,7faff314e640,python3):2022-10-06-14:15:02.536.690 [mindspore/ccsrc/plugin/device/gpu/hal/device/gpu_data_queue.cc:91] Push] Detected that dataset is dynamic shape, it is suggested to call network.set_inputs() to configure dynamic dims of input data before running the network
    The code of the custom dataset has been uploaded as an attachment. Also, I am not sure whether some data augmentation operations change the shape. For example, the following augmentation is passed in via the mindspore.dataset.GeneratorDataset.map interface:

        class TwoNoiseTransform(object):
            """Create two crops of the same image"""

            def __init__(self, transform):
                self.transform = transform

            def __call__(self, x):
                return [self.transform(x), self.transform(x)]
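    Following the warning's suggestion, a minimal sketch of network.set_inputs (the placeholder network and shapes are assumptions, since the real ones aren't shown): declare the varying dimension as None before the first run in graph mode.

        import mindspore as ms
        from mindspore import nn

        net = nn.Dense(10, 2)  # placeholder network for illustration
        # None marks the dynamic dimension; the other dims keep concrete sizes.
        dyn_x = ms.Tensor(shape=[None, 10], dtype=ms.float32)
        net.set_inputs(dyn_x)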
  • [API Usage] Question about using LossMonitor
    Where can I find the detailed reference manual for LossMonitor? I see it is imported from mindvision, separate from mindspore, and the mindspore docs only contain a few scattered usage examples; I would like something more detailed. Also, when I monitor the network with LossMonitor:

        model.train(num_epochs, dataset_train, callbacks=[ckpoint, LossMonitor(0.005, 50)])

    the lr column in the training output always shows the 0.005 I passed in, whereas when I do not pass lr it seems to show some default value. I train with AdamWeightDecay and never observe the lr changing dynamically. Why is that?
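    For the lr to actually vary, the optimizer itself needs a schedule; the value the monitor prints appears to just echo what is passed in. A sketch (the network and step counts are hypothetical, and nn.cosine_decay_lr is only one possible schedule):

        import mindspore.nn as nn

        net = nn.Dense(10, 2)                 # placeholder network
        steps_per_epoch, num_epochs = 50, 10  # hypothetical values
        # One learning-rate value per step; the optimizer then truly decays lr.
        lr_schedule = nn.cosine_decay_lr(min_lr=1e-5, max_lr=0.005,
                                         total_step=steps_per_epoch * num_epochs,
                                         step_per_epoch=steps_per_epoch,
                                         decay_epoch=num_epochs)
        optimizer = nn.AdamWeightDecay(net.trainable_params(), learning_rate=lr_schedule)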
  • [Installation] Error when using mindconverter
    After installing MindConverter, running it reports a dependency error:
    [ERROR] MI(18360:2520,MainProcess):2022-10-05-23:52:36.919.594 [MINDCONVERTER] [RuntimeIntegrityError] code: 1000003, msg: mindspore(>=1.2.0), onnx(>=1.8.0), onnxruntime(>=1.5.2) and onnxoptimizer(>=0.1.2) are required when using graph based scripts converter or ONNX conversion.
    But my installation meets all of these requirements.
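    A quick sanity check is to confirm the four packages are importable, and new enough, in the same interpreter that launches mindconverter (version mismatches across conda environments are a common cause; this snippet is only a diagnostic sketch):

        import importlib

        # Run this with the exact python that mindconverter uses.
        for name in ("mindspore", "onnx", "onnxruntime", "onnxoptimizer"):
            try:
                mod = importlib.import_module(name)
                print(name, getattr(mod, "__version__", "unknown"))
            except ImportError as err:
                print(name, "NOT importable:", err)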
  • [Operator Compilation] ReduceSum operator performance problem
    【Function module】
    MindSpore GPU 1.8.1, Ubuntu 22.04, GPU 1080Ti
    【Steps & symptom】
    The mindspore model trains noticeably slower than the pytorch model. Specifically, the loss function uses this operator several times, and I migrated it according to the API mapping.

        # pytorch
        log_prob = logits - torch.log(exp_logits.sum(1, keepdim=True))
        mean_log_prob_pos = (mask * log_prob).sum(1) / mask.sum(1)

        # mindspore
        log_prob = logits - ops.log(ops.ReduceSum(keep_dims=True)(exp_logits, 1) + 1e-8)
        op_sum = ops.ReduceSum()
        mean_log_prob_pos = op_sum(mask * log_prob, 1) / op_sum(mask, 1)

    【Screenshots】
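    One common source of per-step overhead in migrated code (an assumption about this post, since the full loss Cell isn't shown) is constructing primitive objects such as ops.ReduceSum inside the forward pass on every call; creating them once in __init__ avoids that. A sketch:

        import mindspore.nn as nn
        import mindspore.ops as ops

        class ContrastiveLogProb(nn.Cell):
            """Fragment of the loss above, with operators created once."""

            def __init__(self):
                super().__init__()
                self.reduce_sum_keep = ops.ReduceSum(keep_dims=True)
                self.reduce_sum = ops.ReduceSum()
                self.log = ops.Log()

            def construct(self, logits, exp_logits, mask):
                log_prob = logits - self.log(self.reduce_sum_keep(exp_logits, 1) + 1e-8)
                return self.reduce_sum(mask * log_prob, 1) / self.reduce_sum(mask, 1)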
  • [Operator Compilation] tile operator performance problem
    【Function module】
    MindSpore GPU 1.8.1, Ubuntu 22.04, GPU 1080Ti
    【Steps & symptom】
    The mindspore model trains noticeably slower than the pytorch model. Specifically, the loss function uses this operator once, and I migrated it according to the API mapping.

        # pytorch
        mask = mask.repeat(anchor_count, contrast_count)

        # mindspore
        mask = ms.numpy.tile(mask, (anchor_count, contrast_count))

    【Screenshots】
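    As with ReduceSum above, one thing worth trying (a sketch, not a confirmed fix for this model) is calling the ops.Tile primitive directly and holding it as a long-lived object, rather than going through the ms.numpy wrapper on every step:

        import numpy as np
        import mindspore as ms
        import mindspore.ops as ops

        tile = ops.Tile()  # create once, e.g. in the loss Cell's __init__
        mask = ms.Tensor(np.ones((2, 2), np.float32))
        anchor_count, contrast_count = 2, 3  # hypothetical values
        mask = tile(mask, (anchor_count, contrast_count))  # shape (4, 6)
        print(mask.shape)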
  • [Help Wanted] Error when converting with the atc tool from Ascend-tool-kit
    Problem description
    When converting a model on ModelArts in the environment below, atc fails: numpy cannot be found.
    Command executed:

        atc --framework=5 --model=PoseEstNet_export.onnx --output=PoseEstNet_export --input_format=NCHW --input_shape="image:32,3,256,256" --log=debug --soc_version=Ascend310

    Error output (the identical "ModuleNotFoundError: No module named 'numpy'" traceback is raised five times, from te_fusion/fusion_util.py, te/__init__.py, te_fusion/compile_task_manager.py, tbe/__init__.py, and te/__init__.py again; one instance shown):

        (MindSpore) [ma-user FileTransfer]$ atc --framework=5 --model=PoseEstNet_export.onnx --output=PoseEstNet_export --input_format=NCHW --input_shape="image:32,3,256,256" --log=debug --soc_version=Ascend310
        ATC start working now, please wait for a moment.
        Traceback (most recent call last):
          File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/te_fusion/fusion_util.py", line 30, in
            from tbe.common.utils import shape_util
          File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/__init__.py", line 43, in
            import tvm
          File "/usr/local/Ascend/ascend-toolkit/5.1.RC1.1/python/site-packages/tbe/tvm/__init__.py", line 28, in
            from . import tensor
          File "/usr/local/Ascend/ascend-toolkit/5.1.RC1.1/python/site-packages/tbe/tvm/tensor.py", line 20, in
            from ._ffi.node import NodeBase, NodeGeneric, register_node, convert_to_node
          File "/usr/local/Ascend/ascend-toolkit/5.1.RC1.1/python/site-packages/tbe/tvm/_ffi/node.py", line 24, in
            from .object import Object, register_object, _set_class_node
          File "/usr/local/Ascend/ascend-toolkit/5.1.RC1.1/python/site-packages/tbe/tvm/_ffi/object.py", line 23, in
            from .base import _FFI_MODE, _RUNTIME_ONLY, check_call, _LIB, c_str
          File "/usr/local/Ascend/ascend-toolkit/5.1.RC1.1/python/site-packages/tbe/tvm/_ffi/base.py", line 25, in
            import numpy as np
        ModuleNotFoundError: No module named 'numpy'
        ATC run failed, Please check the detail log, Try 'atc --help' for more information
        E40001: Failed to import the Python module: [fusion_manager[Success] fusion_util[Failure] cce_policy[Failure] cce_conf[Failure] compile_task_manager[Failure] auto_tune_manager[Failure]]
        [GraphOpt][InitializeInner][InitTbeFunc] Failed to init tbe.[FUNC:InitializeInner][FILE:tbe_op_store_adapter.cc][LINE:1338]
        [SubGraphOpt][PreCompileOp][InitAdapter] InitializeAdapter adapter [tbe_op_adapter] failed! Ret [4294967295][FUNC:InitializeAdapter][FILE:op_store_adapter_manager.cc][LINE:67]
        [SubGraphOpt][PreCompileOp][Init] Initialize op store adapter failed, OpsStoreName[tbe-custom].[FUNC:Initialize][FILE:op_store_adapter_manager.cc][LINE:114]
        [FusionMngr][Init] Op store adapter manager init failed.[FUNC:Initialize][FILE:fusion_manager.cc][LINE:326]
        PluginManager InvokeAll failed.[FUNC:Initialize][FILE:ops_kernel_manager.cc][LINE:99]
        OpsManager initialize failed.[FUNC:InnerInitialize][FILE:gelib.cc][LINE:167]
        GELib::InnerInitialize failed.[FUNC:Initialize][FILE:gelib.cc][LINE:119]

    What I have tried:
    - Running pip install numpy directly, which only reports: Requirement already satisfied: numpy in /home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages (1.21.2).
    - Leaving the conda environment and installing numpy again, with the same message.
    - Exporting the environment to point at a python that already has numpy, but the atc command still fails the same way.
    After many attempts I cannot solve this, so I am asking the experts here for help 😭
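    Since numpy is clearly present in the conda environment, one unconfirmed hypothesis is that atc resolves a different python than the active conda one; a small diagnostic sketch to check which interpreter is found first and whether numpy imports there:

        import subprocess

        # Show the python3 that PATH resolves to, then try importing numpy with it.
        python_path = subprocess.check_output(["which", "python3"]).decode().strip()
        print("python3 on PATH:", python_path)
        subprocess.run([python_path, "-c", "import numpy; print(numpy.__version__)"])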
  • [Help Wanted] RuntimeError: Exception thrown from PyFunc. map operation: [PyFunc] failed.
    Where did this go wrong? The error appears on its own about halfway through training.
  • [Execution Issue] Warning about duplicate values when collecting data with Summary (MindInsight)
    I do not quite understand what this warning means. Why would there be duplicate values? The loss changes every step:
    [WARNING] ME(15264:9768,MainProcess):2022-10-05-00:08:28.727.958 [mindspore\train\summary\summary_record.py:290] For "SummaryRecord.add_value", 'loss/scalar' has duplicate values. Only the newest one will be recorded.
    The relevant recording code:

        summary_collect_frequency = 2
        with SummaryRecord('./summary_dir', network=train_net) as summary_record:
            for epoch in range(epochs):
                step = 0
                for columns in train_ds.create_dict_iterator():
                    data = shapeChange(columns['data'])
                    target = columns['target']
                    labels = target[:, 0]
                    current_step = epoch * ds_train.get_dataset_size() + step
                    loss, _ = train_net(data, labels)
                    losses.update(loss, labels.shape[0])
                    if current_step % summary_collect_frequency == 0:
                        summary_record.add_value('scalar', 'loss', loss)
                        summary_record.record(current_step)
                    print(f"Epoch: [{epoch} / {opt.epochs}], "
                          f"step: [{step} / {steps}], "
                          f"loss: {loss}")
                    step = step + 1
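    For reference, a minimal sketch of the documented add_value/record pairing; the warning text suggests (this is an inference, not confirmed from the post) that it fires when add_value is called again for the same tag before record() has flushed the previous value:

        import mindspore as ms
        from mindspore.train.summary import SummaryRecord

        with SummaryRecord('./summary_dir') as rec:
            for step in range(1, 11):
                loss = ms.Tensor(1.0 / step, ms.float32)
                rec.add_value('scalar', 'loss', loss)  # exactly once per tag per step
                rec.record(step)                       # flush before the next add_value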
  • [Execution Issue] Static graph mode uses too much GPU memory
    I am migrating a Pytorch model to mindspore. Training on GPU in static graph mode, the memory consumption looks excessive. Specifically, when training the Pytorch model on a Tesla T4 (about 15 GB of memory) I can run batch_size 128 and it appears to use only about 7 GB. Training the mindspore model on a 1080Ti (about 11 GB), batch_size 64 barely fits, and at the transition from the first epoch to the second it runs out of memory (it says 17 GB are needed). Is this normal, or is something wrong with my network implementation?

        Epoch: [0 / 200], step: [9 / 125], loss: 0.8117148
        Epoch: [0 / 200], step: [19 / 125], loss: 0.7338444
        Epoch: [0 / 200], step: [29 / 125], loss: 0.7048834
        Epoch: [0 / 200], step: [39 / 125], loss: 0.6817198
        Epoch: [0 / 200], step: [49 / 125], loss: 0.6602388
        Epoch: [0 / 200], step: [59 / 125], loss: 0.6611078
        Epoch: [0 / 200], step: [69 / 125], loss: 0.636853
        Epoch: [0 / 200], step: [79 / 125], loss: 0.6674745
        Epoch: [0 / 200], step: [89 / 125], loss: 0.6669285
        Epoch: [0 / 200], step: [99 / 125], loss: 0.6442152
        Epoch: [0 / 200], step: [109 / 125], loss: 0.6637263
        [WARNING] PRE_ACT(30686,7fb5d97fe640,python3):2022-10-04-21:29:11.824.373 [mindspore/ccsrc/common/mem_reuse/mem_dynamic_allocator.cc:254] CalMemBlockAllocSize] Memory not enough: current free memory size[353107968] is smaller than required size[1704591360].
        [WARNING] PRE_ACT(30686,7fb5d97fe640,python3):2022-10-04-21:29:11.824.426 [mindspore/ccsrc/common/mem_reuse/mem_dynamic_allocator.cc:535] DumpDynamicMemPoolDebugInfo] Start dump dynamic memory pool debug info.
        [WARNING] PRE_ACT(30686,7fb5d97fe640,python3):2022-10-04-21:29:11.824.437 [mindspore/ccsrc/common/mem_reuse/mem_dynamic_allocator.cc:494] operator()] Common mem all mem_block info: counts[4].
        [WARNING] PRE_ACT(30686,7fb5d97fe640,python3):2022-10-04-21:29:11.824.450 [mindspore/ccsrc/common/mem_reuse/mem_dynamic_allocator.cc:498] operator()] MemBlock info: number[0] mem_buf_counts[3] base_address[0x7fb1f8000000] block_size[4294967296].
        [WARNING] PRE_ACT(30686,7fb5d97fe640,python3):2022-10-04-21:29:11.824.469 [mindspore/ccsrc/common/mem_reuse/mem_dynamic_allocator.cc:498] operator()] MemBlock info: number[1] mem_buf_counts[2] base_address[0x7fb2f8000000] block_size[2147483648].
        [WARNING] PRE_ACT(30686,7fb5d97fe640,python3):2022-10-04-21:29:11.824.479 [mindspore/ccsrc/common/mem_reuse/mem_dynamic_allocator.cc:498] operator()] MemBlock info: number[2] mem_buf_counts[3] base_address[0x7fb378000000] block_size[2147483648].
        [WARNING] PRE_ACT(30686,7fb5d97fe640,python3):2022-10-04-21:29:11.825.421 [mindspore/ccsrc/common/mem_reuse/mem_dynamic_allocator.cc:498] operator()] MemBlock info: number[3] mem_buf_counts[689] base_address[0x7fb53a000000] block_size[1073741824].
        [WARNING] PRE_ACT(30686,7fb5d97fe640,python3):2022-10-04-21:29:11.825.504 [mindspore/ccsrc/common/mem_reuse/mem_dynamic_allocator.cc:515] operator()] Common mem all idle mem_buf info: counts[23].
        [WARNING] PRE_ACT(30686,7fb5d97fe640,python3):2022-10-04-21:29:11.825.525 [mindspore/ccsrc/common/mem_reuse/mem_dynamic_allocator.cc:525] operator()] Common mem total allocated memory[9663676416], used memory[5494418432], idle memory[4169257984].
        [WARNING] PRE_ACT(30686,7fb5d97fe640,python3):2022-10-04-21:29:11.825.536 [mindspore/ccsrc/common/mem_reuse/mem_dynamic_allocator.cc:494] operator()] Persistent mem all mem_block info: counts[1].
        [WARNING] PRE_ACT(30686,7fb5d97fe640,python3):2022-10-04-21:29:11.835.403 [mindspore/ccsrc/common/mem_reuse/mem_dynamic_allocator.cc:498] operator()] MemBlock info: number[0] mem_buf_counts[21790] base_address[0x7fb4f6000000] block_size[1073741824].
        [WARNING] PRE_ACT(30686,7fb5d97fe640,python3):2022-10-04-21:29:11.837.188 [mindspore/ccsrc/common/mem_reuse/mem_dynamic_allocator.cc:515] operator()] Persistent mem all idle mem_buf info: counts[1].
        [WARNING] PRE_ACT(30686,7fb5d97fe640,python3):2022-10-04-21:29:11.837.200 [mindspore/ccsrc/common/mem_reuse/mem_dynamic_allocator.cc:525] operator()] Persistent mem total allocated memory[1073741824], used memory[197615104], idle memory[876126720].
        [WARNING] PRE_ACT(30686,7fb5d97fe640,python3):2022-10-04-21:29:11.837.207 [mindspore/ccsrc/common/mem_reuse/mem_dynamic_allocator.cc:538] DumpDynamicMemPoolDebugInfo] Finish dump dynamic memory pool debug info.
        Traceback (most recent call last):
          File "train.py", line 141, in
            train(train_ds, train_net, opt)
          File "train.py", line 91, in train
            loss,_ = train_net(data, labels)
          File "/home/ydbd/anaconda3/envs/mindspore/lib/python3.8/site-packages/mindspore/nn/cell.py", line 578, in __call__
            out = self.compile_and_run(*args)
          File "/home/ydbd/anaconda3/envs/mindspore/lib/python3.8/site-packages/mindspore/nn/cell.py", line 988, in compile_and_run
            return _cell_graph_executor(self, *new_inputs, phase=self.phase)
          File "/home/ydbd/anaconda3/envs/mindspore/lib/python3.8/site-packages/mindspore/common/api.py", line 1202, in __call__
            return self.run(obj, *args, phase=phase)
          File "/home/ydbd/anaconda3/envs/mindspore/lib/python3.8/site-packages/mindspore/common/api.py", line 1239, in run
            return self._exec_pip(obj, *args, phase=phase_real)
          File "/home/ydbd/anaconda3/envs/mindspore/lib/python3.8/site-packages/mindspore/common/api.py", line 98, in wrapper
            results = fn(*arg, **kwargs)
          File "/home/ydbd/anaconda3/envs/mindspore/lib/python3.8/site-packages/mindspore/common/api.py", line 1221, in _exec_pip
            return self._graph_executor(args, phase)
        RuntimeError: Device(id:0) memory isn't enough and alloc failed, kernel name: Gradients/Default/network-MyWithLossCell/backbone-ResGCN/main_stream-CellList/0-ResGCN_Module/gradMul/ReduceSum-op436942, alloc size: 1704591360B.
        ----------------------------------------------------
        - C++ Call Stack: (For framework developers)
        ----------------------------------------------------
        mindspore/ccsrc/runtime/graph_scheduler/graph_scheduler.cc:628 Run
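    Two standard knobs that sometimes reduce static-graph GPU memory pressure, as a sketch (whether they help here depends on the unshown network; the placeholder layer and the 10GB cap are illustrative):

        import mindspore as ms
        from mindspore import nn

        # Cap the memory pool so the allocator leaves headroom on the device.
        ms.set_context(mode=ms.GRAPH_MODE, device_target="GPU", max_device_memory="10GB")

        net = nn.Conv2d(3, 64, 3)  # placeholder for a memory-heavy block
        # Recompute activations in backward instead of storing them all.
        net.recompute()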
  • [ModelArts昇...] [ModelArts training] Dataset processing times out when training the slowfast model
    While developing Slowfast operator adaptation on Huawei Cloud ModelArts, the following problem occurs:

        [10/04 14:43:43][INFO] start copy.py: 299: ============== Starting Training ==============
        [10/04 14:43:43][INFO] start copy.py: 301: total_epoch=20, steps_per_epoch=101
        [WARNING] MD(178,fffba4ff91e0,python):2022-10-04-14:44:30.306.953 [mindspore/ccsrc/minddata/dataset/engine/datasetops/device_queue_op.cc:725] DetectPerBatchTime] Bad performance attention, it takes more than 25 seconds to fetch a batch of data from dataset pipeline, which might result `GetNext` timeout problem. You may test dataset processing performance(with creating dataset iterator) and optimize it.
        [ERROR] MD(178,ffff60c791e0,python):2022-10-04-14:45:25.453.944 [mindspore/ccsrc/minddata/dataset/util/task.cc:67] operator()] Task: GeneratorOp(ID:3) - thread(281472305435104) is terminated with err msg: Exception thrown from PyFunc. Exception: Generator worker process timeout.
        At: /home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/dataset/engine/datasets.py(3841): process
        Line of code : 195
        File : /home/jenkins/agent-working-dir/workspace/Compile_Ascend_ARM_CentOS@2/mindspore/mindspore/ccsrc/minddata/dataset/engine/datasetops/source/generator_op.cc
        [ERROR] MD(178,ffff60c791e0,python):2022-10-04-14:45:25.454.325 [mindspore/ccsrc/minddata/dataset/util/task_manager.cc:217] InterruptMaster] Task is terminated with err msg(more detail in info level log):Exception thrown from PyFunc. Exception: Generator worker process timeout.
        At: /home/ma-user/anaconda/lib/python3.7/site-packages/mindspore/dataset/engine/datasets.py(3841): process
        Line of code : 195
        File : /home/jenkins/agent-working-dir/workspace/Compile_Ascend_ARM_CentOS@2/mindspore/mindspore/ccsrc/minddata/dataset/engine/datasetops/source/generator_op.cc
        [WARNING] CORE(178,ffffaff20170,python):2022-10-04-14:48:20.618.138 [mindspore/core/ir/anf_extends.cc:65] fullname_with_scope] Input 0 of cnode is not a value node, its type is CNode.

    As shown, it reports a dataset-processing timeout, and after waiting a while the job is marked failed. The same dataset runs without problems on the OpenI (启智) platform, which runs MindSpore 1.7, whereas Huawei Cloud offers at most 1.5. Could this version gap be the cause? If so, is there any workaround on MindSpore 1.5?
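    A sketch of the usual mitigations for "Generator worker process timeout" (the real Slowfast dataset class isn't shown, so MySource and the parameter values are illustrative): give the generator more workers to keep the device queue fed, or switch from worker processes to threads so the per-process timeout cannot trigger.

        import numpy as np
        import mindspore.dataset as ds

        class MySource:
            """Hypothetical stand-in for the Slowfast dataset."""
            def __getitem__(self, index):
                return np.ones((8, 3, 32, 32), np.float32), np.array(0, np.int32)
            def __len__(self):
                return 100

        dataset = ds.GeneratorDataset(
            MySource(), column_names=["video", "label"],
            num_parallel_workers=8,        # more workers to keep the queue fed
            python_multiprocessing=False)  # threads avoid the worker-process timeout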
  • [Execution Issue] 【ModelArts training】 Slowfast hits dataset problems when training on ModelArts
    While developing the Slowfast operator on the ModelArts platform, a dataset-processing problem appears, with exactly the same warning/error log as in the previous post (batch fetch slower than 25 seconds with a possible `GetNext` timeout, then "Exception thrown from PyFunc. Exception: Generator worker process timeout"). As you can see, it reports a timeout while processing the dataset, yet the same dataset runs fine on the OpenI (启智) platform. OpenI runs MindSpore 1.7, while Huawei Cloud ModelArts uses MindSpore 1.5.1. Could this version difference be the cause? Are there any other workarounds?
  • [Installation Experience] Zhang Xiaobai takes you through CodeGeeX, an open-source multilingual code-generation model
    Open xihe.mindspore.cn and the scrolling banner shows the following. What is this thing? Zhang Xiaobai, fingers itching, clicked "Try it now". Is this AI writing code for me? Clicked the introduction. Let's try it with C++, which Zhang Xiaobai knows a little. Zhang Xiaobai typed a few words, then kept clicking the Generate Code button in the lower right, and the system wrote the code for me piece by piece. It actually produced a working snippet that finds a character in a string... Switching to Java: it generated that too... Has MindSpore become this smart? Next I plan to write a "delete all program in world" program... MindSpore, can you do it? LOL... Wow, it is writing it... It looks like it intends to include every header file... Fortunately each click of Generate Code only produces one segment... Will it ever actually finish this code? (To be continued)