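Kernel 5.4.214, using bcache to accelerate an HDD: I have run into a problem with the cache reclaim (replacement) policy. How do I set the reclaim policy?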
【Functional module】 hinic
【Steps & symptoms】
1. Write a large amount of data.
2. The NIC stops receiving data and all packets are dropped.
【Screenshots】
【Logs】 (optional, upload log content or an attachment)
net_hinic: Fault event report received, func_id: 0
net_hinic: fault type: 0 [chip]
net_hinic: fault val[0]: 0x00710203
net_hinic: fault val[1]: 0x05801550
net_hinic: fault val[2]: 0x4807881b
net_hinic: fault val[3]: 0x05800001
net_hinic: err_level: 2 [flr]
net_hinic: flr func_id: 1
net_hinic: node_id(12),err_csr_addr(0x5801550),err_csr_val(0x4807881b),err_level(0x2),err_type(0x71)
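When reading the fault report above, it can help to line the summary line up against the raw `fault val[...]` words. The sketch below only parses the text of this particular log; the helper names are hypothetical, and it makes no claim about how the hinic driver actually lays out its fault registers.

```python
import re

# Values copied verbatim from the log in the post.
summary = "node_id(12),err_csr_addr(0x5801550),err_csr_val(0x4807881b),err_level(0x2),err_type(0x71)"
fault_val = [0x00710203, 0x05801550, 0x4807881b, 0x05800001]

# Parse "name(value)" pairs; int(value, 0) accepts both decimal and 0x-prefixed hex.
fields = {name: int(value, 0) for name, value in re.findall(r"(\w+)\((\w+)\)", summary)}

# Observation for this log only: the CSR address and value in the summary
# repeat fault val[1] and fault val[2].
addr_matches = fields["err_csr_addr"] == fault_val[1]
value_matches = fields["err_csr_val"] == fault_val[2]
print(fields)
print(addr_matches, value_matches)
```

Quoting the decoded fields together with the raw values makes it easier for a maintainer to look up the failing CSR.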
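【Functional module】 Graph kernel fusion, verified on GPU (NVIDIA RTX 3080)
【Steps & symptoms】
1. Reference (based on MindSpore 0.5.0), link 1: https://gitee.com/mindspore/course/tree/master/06_distributed/graph_kernel
2. Reference (based on MindSpore 1.0.0), link 2: https://bbs.huaweicloud.com/forum/forum.php?mod=viewthread&tid=78817&page=1&replytype=1
3. Following the tutorial and another user's post, I enabled graph kernel fusion and experimented with basic composite operators and custom composite operators. The IR and .dot files generated by MindSpore 1.7.0 are named differently from the mindir files of earlier versions, and the files shown in the examples (hwopt_d_fuse_basic_opt_end_graph_0.dot / hwopt_d_composite_opt_end_graph_0.dot) were not generated. I could not find IR/dot files like the post-fusion ones marked with a red box in the figure below. The IR files generated by MindSpore 1.0.0 are shown in the figure below (taken from the original poster of link 2). Is there an overall flowchart of graph kernel fusion, detailed down to the stage that generates each IR, that I could refer to? Many thanks!
【Screenshots】 Among the mindir files I generated with MindSpore 1.7.0, I could not find the .dot files produced after graph kernel fusion. Any guidance would be appreciated. Thanks.
【Logs】 (optional, upload log content or an attachment)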
1 Error description

1.1 System environment

Hardware Environment (Ascend/GPU/CPU): GPU
Software Environment:
– MindSpore version (source or binary): 1.5.2
– Python version (e.g., Python 3.7.5): 3.7.6
– OS platform and distribution (e.g., Linux Ubuntu 16.04): Ubuntu 4.15.0-74-generic
– GCC/Compiler version (if compiled from source):

1.2 Basic information

1.2.1 Script

The training script builds a single-operator BatchNorm network and normalizes a Tensor. The script is as follows:

    import numpy as np
    import mindspore
    import mindspore.nn as nn
    import mindspore.ops as ops
    from mindspore import Tensor

    class Net(nn.Cell):
        def __init__(self):
            super(Net, self).__init__()
            self.batch_norm = ops.BatchNorm()

        def construct(self, input_x, scale, bias, mean, variance):
            output = self.batch_norm(input_x, scale, bias, mean, variance)
            return output

    net = Net()
    # All five inputs are float16, which triggers the error below.
    input_x = Tensor(np.ones([2, 2]), mindspore.float16)
    scale = Tensor(np.ones([2]), mindspore.float16)
    bias = Tensor(np.ones([2]), mindspore.float16)
    mean = Tensor(np.ones([2]), mindspore.float16)
    variance = Tensor(np.ones([2]), mindspore.float16)
    output = net(input_x, scale, bias, mean, variance)
    print(output)

1.2.2 Error

The error message is as follows:

    Traceback (most recent call last):
      File "116945.py", line 22, in <module>
        output = net(input_x, scale, bias, mean, variance)
      File "/data2/llj/mindspores/r1.5/build/package/mindspore/nn/cell.py", line 407, in __call__
        out = self.compile_and_run(*inputs)
      File "/data2/llj/mindspores/r1.5/build/package/mindspore/nn/cell.py", line 734, in compile_and_run
        self.compile(*inputs)
      File "/data2/llj/mindspores/r1.5/build/package/mindspore/nn/cell.py", line 721, in compile
        _cell_graph_executor.compile(self, *inputs, phase=self.phase, auto_parallel_mode=self._auto_parallel_mode)
      File "/data2/llj/mindspores/r1.5/build/package/mindspore/common/api.py", line 551, in compile
        result = self._graph_executor.compile(obj, args_list, phase, use_vm, self.queue_name)
    TypeError: mindspore/ccsrc/runtime/device/gpu/kernel_info_setter.cc:355 PrintUnsupportedTypeException] Select GPU kernel op[BatchNorm] fail! Incompatible data type!
    The supported data types are in[float32 float32 float32 float32 float32], out[float32 float32 float32 float32 float32]; in[float16 float32 float32 float32 float32], out[float16 float32 float32 float32 float32]; , but get in [float16 float16 float16 float16 float16 ] out [float16 float16 float16 float16 float16 ]

Cause analysis

The TypeError says "Select GPU kernel op[BatchNorm] fail! Incompatible data type!", followed by the supported combinations (in[float32 float32 float32 float32 float32] or in[float16 float32 float32 float32 float32]) and the combination actually received (in [float16 float16 float16 float16 float16 ]). In other words, the GPU backend does not support the dtype combination of the current inputs. The supported combinations are: all inputs float32, or input_x float16 with the remaining inputs float32. The script passes float16 for every input, hence the error.

2 Solution

Given the cause above, the fix is to keep input_x as float16 and change the other four inputs to float32:

    import numpy as np
    import mindspore
    import mindspore.nn as nn
    import mindspore.ops as ops
    from mindspore import Tensor

    class Net(nn.Cell):
        def __init__(self):
            super(Net, self).__init__()
            self.batch_norm = ops.BatchNorm()

        def construct(self, input_x, scale, bias, mean, variance):
            output = self.batch_norm(input_x, scale, bias, mean, variance)
            return output

    net = Net()
    input_x = Tensor(np.ones([2, 2]), mindspore.float16)
    # scale, bias, mean and variance must be float32 on GPU.
    scale = Tensor(np.ones([2]), mindspore.float32)
    bias = Tensor(np.ones([2]), mindspore.float32)
    mean = Tensor(np.ones([2]), mindspore.float32)
    variance = Tensor(np.ones([2]), mindspore.float32)

    output = net(input_x, scale, bias, mean, variance)
    print(output)

The script now runs successfully, with the following output:

    output: (Tensor(shape=[2, 2], dtype=Float16, value=
    [[ 1.0000e+00, 1.0000e+00],
     [ 1.0000e+00, 1.0000e+00]]), Tensor(shape=[2], dtype=Float32, value= [ 0.00000000e+00, 0.00000000e+00]), Tensor(shape=[2], dtype=Float32, value= [ 0.00000000e+00, 0.00000000e+00]), Tensor(shape=[2], dtype=Float32, value= [ 0.00000000e+00, 0.00000000e+00]), Tensor(shape=[2], dtype=Float32, value= [ 0.00000000e+00, 0.00000000e+00]))

3 Summary

Steps for locating this kind of error:
1. Find the line of user code that raised the error: output = net(input_x, scale, bias, mean, variance).
2. Use the keywords in the error log to narrow down the problem: "The supported data types are in[float32 float32 float32 float32 float32], out[float32 float32 float32 float32 float32]; in[float16 float32 float32 float32 float32], out[float16 float32 float32 float32 float32]; , but get in [float16 float16 float16 float16 float16 ] out [float16 float16 float16 float16 float16 ]".
3. Pay particular attention to whether variables are defined and initialized correctly.

4 References

4.1 BatchNorm operator API
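The dtype rule stated in the error message can be checked before calling the operator. The following is a minimal sketch in plain Python, not MindSpore code: `check_batchnorm_dtypes` is a hypothetical helper, and the table of supported combinations is copied from the error message above rather than taken from any API.

```python
# Supported (input_x, scale, bias, mean, variance) dtype combinations on GPU,
# copied from the "The supported data types are in[...]" part of the error message.
SUPPORTED_INPUT_DTYPES = [
    ("float32", "float32", "float32", "float32", "float32"),
    ("float16", "float32", "float32", "float32", "float32"),
]

def check_batchnorm_dtypes(input_x, scale, bias, mean, variance):
    """Hypothetical helper: True if the dtype names form a supported combination."""
    return (input_x, scale, bias, mean, variance) in SUPPORTED_INPUT_DTYPES

# The failing script passed float16 for every input.
all_fp16 = ["float16"] * 5
failing = check_batchnorm_dtypes(*all_fp16)

# The fixed script keeps input_x as float16 and uses float32 for the rest.
fixed = check_batchnorm_dtypes("float16", "float32", "float32", "float32", "float32")

print(failing, fixed)
```

A check of this kind is only as good as its table, so the error message of the installed MindSpore version remains the authoritative list.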
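【Functional module】 Network backpropagation: an error is raised while computing gradients
【Steps & symptoms】
【Screenshots】 The statement that raises the error:
【Logs】 (optional, upload log content or an attachment) Error message:
In the code shown in the screenshot, the tensors in the *args passed to the failing statement have at most 4 dimensions; none of them is the 5-dimensional input that the error message mentions. How should I handle this error? I look forward to your reply!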
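【Logs】 (optional, upload log content or an attachment)
【Software and hardware】 Python 3.9.5, MindSpore 1.8.0.dev, CUDA 11.1, RTX 2080 Ti
Hello, I got the error above during backpropagation while training my model. How can I resolve it?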
[Help request - Atlas200] 【Atlas200 product】 Custom baseboard, system hangs at end Kernel panic - not syncing:
Posted by yd_290219945 on 2022-06-11 22:40:32

【Functional module】 Custom baseboard, RC mode, TF card
【Steps & symptoms】
1. The system boots from the TF card and hangs at: end Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
2. Please see the attachment for the complete log.
【Screenshots】
[ 0.084093] smp: Bringing up secondary CPUs ...
[ 0.089163] Detected VIPT I-cache on CPU1
[ 0.089180] GICv3: CPU1: found redistributor 80100 region 0:0x0000000109140000
[ 0.089189] GICv3: CPU1: using allocated LPI pending table @0x00000009f6ce0000
[ 0.089206] CPU1: Booted secondary processor 0x0000080100 [0x411fd050]
[ 0.089783] Detected VIPT I-cache on CPU2
[ 0.089799] GICv3: CPU2: found redistributor 80200 region 0:0x0000000109180000
[ 0.089808] GICv3: CPU2: using allocated LPI pending table @0x00000009f6cf0000
[ 0.089825] CPU2: Booted secondary processor 0x0000080200 [0x411fd050]
[ 0.090407] Detected VIPT I-cache on CPU3
[ 0.090422] GICv3: CPU3: found redistributor 80300 region 0:0x00000001091c0000
[ 0.090431] GICv3: CPU3: using allocated LPI pending table @0x00000009f6d80000
[ 0.090449] CPU3: Booted secondary processor 0x0000080300 [0x411fd050]
[ 0.091034] Detected VIPT I-cache on CPU4
[ 0.091050] GICv3: CPU4: found redistributor 80400 region 0:0x0000000109200000
[ 0.091059] GICv3: CPU4: using allocated LPI pending table @0x00000009f6d90000
[ 0.091077] CPU4: Booted secondary processor 0x0000080400 [0x411fd050]
[ 0.091658] Detected VIPT I-cache on CPU5
[ 0.091675] GICv3: CPU5: found redistributor 80500 region 0:0x0000000109240000
[ 0.091684] GICv3: CPU5: using allocated LPI pending table @0x00000009f6da0000
[ 0.091702] CPU5: Booted secondary processor 0x0000080500 [0x411fd050]
[ 0.092281] Detected VIPT I-cache on CPU6
[ 0.092298] GICv3: CPU6: found redistributor 80600 region 0:0x0000000109280000
[ 0.092307] GICv3: CPU6: using allocated LPI pending table @0x00000009f6db0000
[ 0.092325] CPU6: Booted secondary processor 0x0000080600 [0x411fd050]
[ 0.092739] Detected VIPT I-cache on CPU7
[ 0.092756] GICv3: CPU7: found redistributor 80700 region 0:0x00000001092c0000
[ 0.092765] GICv3: CPU7: using allocated LPI pending table @0x00000009f6dc0000
[ 0.092783] CPU7: Booted secondary processor 0x0000080700 [0x411fd050]
[ 0.092845] smp: Brought up 1 node, 8 CPUs
[ 0.273678] SMP: Total of 8 processors activated.
[ 0.278432] CPU features: detected: Privileged Access Never
[ 0.284063] CPU features: detected: User Access Override
[ 0.289430] CPU features: detected: 32-bit EL0 Support
[ 0.294621] CPU features: detected: RAS Extension Support
[ 0.300077] CPU features: detected: Data cache clean to the PoU not required for I/D coherence
[ 0.308784] CPU features: detected: CRC32 instructions
[ 0.314326] CPU: All CPU(s) started at EL2
[ 0.318489] alternatives: patching kernel code
[ 0.324487] devtmpfs: initialized
[ 0.331093] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[ 0.340968] futex hash table entries: 2048 (order: 5, 131072 bytes)
[ 0.352748] pinctrl core: initialized pinctrl subsystem
[ 0.358127] DMI not present or invalid.
[ 0.362129] NET: Registered protocol family 16
[ 0.366899] audit: initializing netlink subsys (disabled)
[ 0.372463] audit: type=2000 audit(0.244:1): state=initialized audit_enabled=0 res=1
[ 0.380316] cpuidle: using governor menu
[ 0.384465] hw-breakpoint: found 6 breakpoint and 4 watchpoint registers.
[ 0.391956] DMA: preallocated 256 KiB pool for atomic allocations
[ 0.398174] Serial: AMBA PL011 UART driver
[ 0.408355] init lowpm table success.
[ 0.415895] HISI SPMI probe
[ 0.418824] spmi hisilicon-hisi-pmic-spmi-pmic: device hisilicon-hisi-pmic-spmi-pmic registered
[ 0.428350] 10cf80000.uart: ttyAMA0 at MMIO 0x10cf80000 (irq = 86, base_baud = 0) is a SBSA
[ 0.436815] console [ttyAMA0] enabled
[ 0.436815] console [ttyAMA0] enabled
[ 0.444162] bootconsole [pl11] disabled
[ 0.444162] bootconsole [pl11] disabled
[ 0.452138] 130930000.uart1: ttyAMA1 at MMIO 0x130930000 (irq = 89, base_baud = 0) is a PL011 rev3
[ 0.465307] HugeTLB: unsupported default_hugepagesz 2097152.

Error message:
Failing code:
Description: The two lines before the line highlighted in red both reach the failing code and run through it without any problem. When execution gets to the highlighted line, however, the call to get_time_embdding_A reaches the same failing code and raises the error shown in the error message. What is going on here, and how can it be fixed? Thank you for reading and replying!
# yum list | grep kernel
kernel.aarch64 5.10.109-200.el7 @updates
kernel-core.aarch64 5.10.109-200.el7 @updates
The output above is from the old server. On the new server, I cannot find kernel version 5.10.109 in the updates repository.
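【Functional module】 【Ascend310 product】 【NPU migration】 pytorch
【Steps & symptoms】
1. Several consecutive runs failed partway through with: ACL stream synchronize failed, error code:507018
【Screenshots】
【Logs】 (optional, upload log content or an attachment)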
【Functional module】
【Steps & symptoms】
1. Running zy_api.py, whose code was migrated with the tfplugin automatic code migration tool, fails with the following error:
tensorflow.python.framework.errors_impl.InternalError: GeOp3_0GEOP::::DoRunAsync Failed
Error Message is :
E39999: Inner Error!
E39999 Aicpu kernel execute failed, device_id=0, stream_id=8, task_id=1, flip_num=0, fault so_name=, fault kernel_name=, fault op_name=, extend_info=[FUNC:GetError][FILE:stream.cc][LINE:739]
rtStreamSynchronize execute failed, reason=[aicpu exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:45]
Call rtStreamSynchronize(stream) fail, ret: 0x7BC8A[FUNC:KernelLaunchEx][FILE:model_manager.cc][LINE:141]
Task index:0 init failed, ret:507018.[FUNC:InitTaskInfo][FILE:davinci_model.cc][LINE:3270]
【Screenshots】
【Logs】 (optional, upload log content or an attachment)
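The log above reports the return code twice, once in hexadecimal (0x7BC8A) and once in decimal (507018). A quick conversion, sketched below with illustrative variable names, shows that both refer to the same code, so searching documentation or forums for either form should lead to the same entry.

```python
# Values copied from the log lines in the post.
ret_hex = "0x7BC8A"   # "Call rtStreamSynchronize(stream) fail, ret: 0x7BC8A"
ret_dec = 507018      # "Task index:0 init failed, ret:507018."

# int(..., 16) parses the hexadecimal form; hex() goes the other way.
same_code = int(ret_hex, 16) == ret_dec
print(int(ret_hex, 16), hex(ret_dec), same_code)
```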
【Functional module】 Data type problem when running RNN_attention with MindSpore 1.2 on GPU
【Steps & symptoms】
1. I wrote an RNN_attention model with MindSpore. After clearing the errors inside the model itself, a data type error is raised at run time.
2. The model code and some of the parameters are as follows:

    parser.add_argument('--num_layers', type=int, default=1, )  # number of LSTM layers
    parser.add_argument('--hidden_size', type=int, default=128, )  # LSTM hidden size
    parser.add_argument('--embedding_pretrained', default=None, )  # pretrained embeddings
    parser.add_argument('--embed', type=int, default=300)  # text embedding dimension
    parser.add_argument('--hidden_size2', type=int, default=64, )

    import mindspore
    import mindspore.nn as nn
    from mindspore import dtype as mstype
    from mindspore import Tensor
    from mindspore.ops import operations as P
    import mindspore.ops as ops
    import numpy as np

    class RNN_attent(nn.Cell):
        def __init__(self, config):
            super(RNN_attent, self).__init__()
            self.embedding = nn.Embedding(config.n_vocab, config.embed)
            self.lstm = nn.LSTM(config.embed, config.hidden_size, config.num_layers,
                                bidirectional=True, batch_first=True, dropout=config.dropout)
            self.softmax = nn.Softmax()
            self.tanh1 = nn.Tanh()
            # self.u = nn.Parameter(torch.Tensor(config.hidden_size * 2, config.hidden_size * 2))
            self.w = mindspore.Parameter(Tensor(np.zeros(config.hidden_size * 2)))
            self.fc1 = nn.Dense(config.hidden_size * 2, config.hidden_size2)
            self.fc = nn.Dense(config.hidden_size2, config.num_classes)
            self.relu = nn.ReLU()
            self.unsqueeze = ops.ExpandDims()
            self.num_directions = 2
            self.hidden_size = config.hidden_size
            self.num_layers = config.num_layers
            self.batch_size = config.batch_size

        def construct(self, x):
            embed = self.embedding(x)  # [batch_size, seq_len, embedding] = [128, 32, 300]
            h_0 = Tensor(np.zeros([self.num_directions * self.num_layers, self.batch_size, self.hidden_size]).astype(np.float32))
            c_0 = Tensor(np.zeros([self.num_directions * self.num_layers, self.batch_size, self.hidden_size]).astype(np.float32))
            hx_0 = (h_0, c_0)
            H, _ = self.lstm(embed, hx_0)
            M = self.tanh1(H)  # [128, 32, 256]
            alpha = self.softmax(ops.matmul(M, self.w))  # [128, 32]
            alpha = self.unsqueeze(alpha, -1)  # [128, 32, 1]
            out = H * alpha  # [128, 32, 256]
            out = ops.reduce_sum(out, 1)  # [128, 256]
            out = self.relu(out)
            out = self.fc1(out)  # [128, 64]
            out = self.fc(out)  # [128, num_classes]
            return out

    args.n_vocab = vocab_size
    args.dropout = 0.1
    model = RNN_attent(config = args)

【Screenshots】 Full error message:
【Logs】 (optional, upload log content or an attachment)
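The full error message is not included in the post, so the cause of the data type error cannot be confirmed from the text. One thing worth checking, offered here only as an assumption, is that `np.zeros(...)` without an explicit dtype produces float64, which is how `self.w` is built, while `h_0` and `c_0` are explicitly cast to float32. The NumPy-only sketch below reproduces the attention shapes from the code comments and shows the dtype difference; it is a stand-in for the MindSpore operators, not the model itself, and the softmax step is omitted because it does not change shapes.

```python
import numpy as np

# Sizes taken from the post: batch 128, sequence length 32, hidden_size 128.
batch_size, seq_len, hidden_size = 128, 32, 128

# Stand-in for the bidirectional LSTM output H: [batch, seq_len, 2 * hidden_size].
H = np.zeros((batch_size, seq_len, 2 * hidden_size), dtype=np.float32)

# Built the same way as self.w in the post: no dtype given, so NumPy uses float64.
w = np.zeros(hidden_size * 2)
print(w.dtype)  # float64

M = np.tanh(H)                       # [128, 32, 256]
scores = np.matmul(M, w)             # [128, 32]
alpha = np.expand_dims(scores, -1)   # [128, 32, 1]
out = (H * alpha).sum(axis=1)        # [128, 256]
print(scores.shape, alpha.shape, out.shape)

# Possible adjustment to try: give the weight an explicit float32 dtype.
w32 = np.zeros(hidden_size * 2, dtype=np.float32)
print(w32.dtype)  # float32
```

If the error message does name a float64 input, creating the parameter from `np.zeros(config.hidden_size * 2).astype(np.float32)` would be the corresponding change in the model code.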
