• [API Usage] I trained a .mindir model on a CPU and also generated a .ckpt file. How do I run inference with it directly on a CPU?
    While working through the experiment at https://www.mindspore.cn/news/newschildren?id=354, I have two questions:
    1. How can the generated .mindir training model be used directly on a CPU? The link only shows converting it to a .ms file for use on Android. Can it be used directly on a Windows 10 CPU system, and is there a tutorial for that?
    2. After training, the model in this experiment can only classify two categories. Which script do I modify to add more categories?
  • [Help Wanted] MatMul operator in a Transformer
    I am converting a Transformer model that contains a MatMul operator (from ONNX) with inputs of shape 400*19*128 and 128*128. When converting the model with MindStudio, this operator fails to convert with a message saying that a (400) and b (128) must be equal. Does anyone know a good way to solve or work around this? It seems the MatMul in an .om model does not support broadcasting.
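    A common workaround, sketched here in NumPy under the assumption that the surrounding graph allows inserting Reshape nodes around the MatMul, is to collapse the leading batch dimensions before the multiply and restore them afterwards, so the operator only ever sees a plain 2-D x 2-D MatMul and no broadcasting is needed:

```python
import numpy as np

# Shapes from the post: a is 400x19x128, b is 128x128.
a = np.random.rand(400, 19, 128).astype(np.float32)
b = np.random.rand(128, 128).astype(np.float32)

# Collapse the leading batch dims: (400, 19, 128) -> (7600, 128).
a2d = a.reshape(-1, a.shape[-1])

# Plain 2-D MatMul: (7600, 128) @ (128, 128) -> (7600, 128).
c2d = a2d @ b

# Restore the original batch dims: (7600, 128) -> (400, 19, 128).
c = c2d.reshape(a.shape[0], a.shape[1], -1)

# Same result as the broadcasted 3-D matmul.
assert np.allclose(c, a @ b, atol=1e-4)
```

    In ONNX terms this means adding a Reshape before the MatMul and another Reshape after it, which most converters handle without complaint.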
  • [MindX SDK] Helmet detection (HelmetIdentification)
    The HelmetIdentification sample application in the MindX SDK requires mxVision 2.0.4, but the version I downloaded is 3.0.RC2. Its script main-env.sh sets an environment variable:
    export ${MX_SDK_HOME}/lib:${MX_SDK_HOME}/opensource/lib:${MX_SDK_HOME}/opensource/lib64:${install_path}/acllib/lib64:/usr/local/Ascend/driver/lib64:${MX_SDK_HOME}/include:${MX_SDK_HOME}/python
    One of these directories is ${MX_SDK_HOME}/opensource/lib64, but the mxVision package I downloaded has no lib64 directory under opensource. What could be the reason?
  • [Debugging] Wrong dataset path passed in: training a WGAN model on the LSUN dataset
    1. System environment
    Hardware (Ascend/GPU/CPU): Ascend
    Software: MindSpore version 1.8.0; execution mode: PyNative (PYNATIVE_MODE); Python version 3.7.6; OS platform: Linux
    2. Error report
    2.1 Problem description
    After the launch script starts successfully, no ckpt file appears in the output directory, and the run finishes in only a few seconds.
    2.2 Error message
    2.3 Script code
    python data.py export ../bridge_train_lmdb --out_dir /cache/data/bridge --flat
    python train.py --dataset lsun --dataroot /cache/data/bridge/ --noBN 0
    3. Root cause
    --dataroot must point at the directory above the category directory (bedroom, bridge, and so on). For example, if the data is unpacked to /cache/data/bedroom, pass /cache/data, not /cache/data/bedroom. The dataset path was passed incorrectly.
    4. Solution
    Change the launch command to:
    python train.py --dataset lsun --dataroot /cache/data/ --noBN 0
  • [Debugging] Error report: when num_workers is set to 2 or more, new Python processes keep being created
    1. System environment
    Hardware (Ascend/GPU/CPU): Ascend
    Software: MindSpore version 1.8.0; execution mode: GRAPH_MODE; Python version 3.7.6; OS platform: Linux
    2. Error report
    2.1 Problem description
    With num_workers set to 8, Python processes accumulate as epochs go by. For example, if one epoch takes 70 minutes, then every 70 minutes another batch of Python processes appears while the old ones are never closed. Memory consumption therefore keeps growing until it blows up mid-run. The following happens on 4 cards (the card count makes no difference).
    2.2 Error message
    Out of memory.
    2.3 Script code

```python
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
import os
import time

import mindspore
import mindspore.nn as nn
from mindspore import save_checkpoint, context, load_checkpoint, load_param_into_net
from mindspore.communication.management import init, get_rank
from mindspore.context import ParallelMode
from mindspore.nn.dynamic_lr import piecewise_constant_lr

from trainonestepgen import TrainOnestepGen
from src.datasets.dataset import RBPNDataset, create_train_dataset
from src.loss.generatorloss import GeneratorLoss
from src.model.rbpn import Net as RBPN
from src.util.config import get_args
from src.util.utils import save_losses, init_weights

args = get_args()
mindspore.set_seed(args.seed)
epoch_loss = []
best_avgpsnr = 0
eval_mean_psnr = []
save_loss_path = 'results/genloss/'
if not os.path.exists(save_loss_path):
    os.makedirs(save_loss_path)


def train(trainoneStep, trainds):
    """train the generator

    Args:
        trainoneStep(Cell): the network of
        trainds(dataset): train datasets
    """
    trainoneStep.set_train()
    trainoneStep.set_grad()
    steps = trainds.get_dataset_size()
    for epoch in range(args.start_iter, args.nEpochs + 1):
        e_loss = 0
        t0 = time.time()
        for iteration, batch in enumerate(trainds.create_dict_iterator(), 1):
            x = batch['input_image']
            target = batch['target_image']
            neighbor_tensor = batch['neighbor_image']
            flow_tensor = batch['flow_image']
            loss = trainoneStep(target, x, neighbor_tensor, flow_tensor)
            e_loss += loss.asnumpy()
            print('Epoch[{}]({}/{}): loss: {:.4f}'.format(epoch, iteration, steps, loss.asnumpy()))
        t1 = time.time()
        mean = e_loss / steps
        epoch_loss.append(mean)
        print("Epoch {} Complete: Avg. Loss: {:.4f}|| Time: {} min {}s.".format(
            epoch, mean, int((t1 - t0) / 60), int(int(t1 - t0) % 60)))
        save_ckpt = os.path.join(args.save_folder, '{}_{}.ckpt'.format(epoch, args.model_type))
        save_checkpoint(trainoneStep.network, save_ckpt)
    name = os.path.join(save_loss_path, args.valDataset + '_' + args.model_type)
    save_losses(epoch_loss, None, name)


if __name__ == '__main__':
    # distribute
    # parallel environment setting
    context.set_context(mode=context.GRAPH_MODE, device_target=args.device_target)
    if args.run_distribute:
        print("distribute")
        device_id = int(os.getenv("DEVICE_ID"))
        device_num = args.device_num
        context.set_context(device_id=device_id)
        init()
        context.reset_auto_parallel_context()
        context.set_auto_parallel_context(parallel_mode=ParallelMode.DATA_PARALLEL,
                                          gradients_mean=True,
                                          device_num=device_num)
        rank = get_rank()
    else:
        device_id = args.device_id
        context.set_context(device_id=device_id)

    # get dataset
    train_dataset = RBPNDataset(args.data_dir, args.nFrames, args.upscale_factor,
                                args.data_augmentation, args.file_list,
                                args.other_dataset, args.patch_size, args.future_frame)
    train_ds = create_train_dataset(train_dataset, args)
    train_loader = train_ds.create_dict_iterator()
    train_steps = train_ds.get_dataset_size()
    print('===>Building model ', args.model_type)
    model = RBPN(num_channels=3, base_filter=256, feat=64, num_stages=3,
                 n_resblock=5, nFrames=args.nFrames, scale_factor=args.upscale_factor)
    init_weights(model, 'KaimingNormal', 0.02)
    print('====>start training')
    if args.pretrained:
        ckpt = os.path.join(args.save_folder, args.pretrained_sr)
        print('=====> load params into generator')
        params = load_checkpoint(ckpt)
        load_param_into_net(model, params)
        print('=====> finish load generator')
    lossNetwork = GeneratorLoss(model)
    milestone = [int(args.nEpochs / 2) * train_steps, args.nEpochs * train_steps]
    learning_rates = [args.lr, args.lr / 10.0]
    lr = piecewise_constant_lr(milestone, learning_rates)
    optimizer = nn.Adam(model.trainable_params(), lr, loss_scale=args.sens)
    trainonestepNet = TrainOnestepGen(lossNetwork, optimizer, sens=args.sens)
    train(trainonestepNet, train_ds)
    print(train_dataset)
```

    3. Root cause
    Regarding where train_loader = train_ds.create_dict_iterator() is created, my understanding is:
    1. Creating the iterator directly inside the `for epoch in range(...):` loop works, but num_parallel_workers must then be set to 1, otherwise Python processes keep being created.
    2. Creating the iterator once outside the `for epoch in range(...):` loop (the usual way) and reusing train_loader inside the loop allows num_parallel_workers >= 2.
    4. Solution
    Modify the code to follow the standard flow from the official website: cid:link_0. After the change it runs normally.
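    The difference between the two placements can be demonstrated with a toy stand-in for the dataset (the DummyDataset class and its counter are made up for illustration; in MindSpore, each create_dict_iterator call on a pipeline with num_parallel_workers > 1 brings up its own pool of worker processes):

```python
class DummyDataset:
    """Toy stand-in: counts how many iterator worker pools get created."""

    def __init__(self):
        self.iterators_created = 0

    def create_dict_iterator(self):
        self.iterators_created += 1  # in MindSpore: spawns worker processes
        return [{"input_image": i} for i in range(3)]


n_epochs = 5

# Anti-pattern: a fresh iterator (and worker pool) per epoch.
ds = DummyDataset()
for _ in range(n_epochs):
    for batch in ds.create_dict_iterator():
        pass
print(ds.iterators_created)  # 5: grows with every epoch

# Recommended: create the iterator once, outside the epoch loop, and reuse it.
ds = DummyDataset()
loader = ds.create_dict_iterator()
for _ in range(n_epochs):
    for batch in loader:
        pass
print(ds.iterators_created)  # 1
```

    In the script above, the fix is exactly this: iterate over the already-created train_loader inside train() instead of calling trainds.create_dict_iterator() once per epoch.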
  • [Debugging] Error report: YOLOv3_darknet53 image decoding fails: [Decode] failed. Decode: image decode failed
    1. System environment
    Hardware (Ascend/GPU/CPU): ModelArts
    Software: MindSpore version 1.5.1; execution mode: PyNative (PYNATIVE_MODE); Python version 3.7.6; OS platform: Linux
    2. Problem
    2.1 Problem description
    YOLOv3_darknet53 image decoding fails.
    2.2 Error message
    2.3 Script code

```python
# Copyright 2020-2022 Huawei Technologies Co., Ltd
# Licensed under the Apache License, Version 2.0 (full header omitted; identical to the one above).
# ============================================================================
"""YoloV3 train."""
import os
import time
import datetime

import mindspore as ms
import mindspore.nn as nn
import mindspore.communication as comm

from src.yolo import YOLOV3DarkNet53, YoloWithLossCell
from src.logger import get_logger
from src.util import AverageMeter, get_param_groups, cpu_affinity
from src.lr_scheduler import get_lr
from src.yolo_dataset import create_yolo_dataset
from src.initializer import default_recurisive_init, load_yolov3_params
from src.util import keep_loss_fp32
from model_utils.config import config

# only useful for huawei cloud modelarts.
from model_utils.moxing_adapter import moxing_wrapper, modelarts_pre_process

ms.set_seed(1)


def conver_training_shape(args):
    training_shape = [int(args.training_shape), int(args.training_shape)]
    return training_shape


def set_graph_kernel_context():
    if ms.get_context("device_target") == "GPU":
        ms.set_context(enable_graph_kernel=True)
        ms.set_context(graph_kernel_flags="--enable_parallel_fusion "
                                          "--enable_trans_op_optimize "
                                          "--disable_cluster_ops=ReduceMax,Reshape "
                                          "--enable_expand_ops=Conv2D")


def network_init(args):
    device_id = int(os.getenv('DEVICE_ID', '0'))
    ms.set_context(mode=ms.GRAPH_MODE, device_target=args.device_target,
                   save_graphs=False, device_id=device_id)
    set_graph_kernel_context()
    # Set mempool block size for improving memory utilization, which will not take effect in GRAPH_MODE
    if ms.get_context("mode") == ms.PYNATIVE_MODE:
        ms.set_context(mempool_block_size="31GB")
    # Since the default max memory pool available size on ascend is 30GB,
    # which does not meet the requirements and needs to be adjusted larger.
    if ms.get_context("device_target") == "Ascend":
        ms.set_context(max_device_memory="31GB")
    profiler = None
    if args.need_profiler:
        profiling_dir = os.path.join("profiling",
                                     datetime.datetime.now().strftime('%Y-%m-%d_time_%H_%M_%S'))
        profiler = ms.profiler.Profiler(output_path=profiling_dir)

    # init distributed
    if args.is_distributed:
        comm.init()
        args.rank = comm.get_rank()
        args.group_size = comm.get_group_size()

    if args.device_target == "GPU" and args.bind_cpu:
        cpu_affinity(args.rank, min(args.group_size, args.device_num))

    # select for master rank save ckpt or all rank save, compatible for model parallel
    args.rank_save_ckpt_flag = 0
    if args.is_save_on_master:
        if args.rank == 0:
            args.rank_save_ckpt_flag = 1
    else:
        args.rank_save_ckpt_flag = 1

    # logger
    args.outputs_dir = os.path.join(args.ckpt_path,
                                    datetime.datetime.now().strftime('%Y-%m-%d_time_%H_%M_%S'))
    args.logger = get_logger(args.outputs_dir, args.rank)
    args.logger.save_args(args)
    return profiler


def parallel_init(args):
    ms.reset_auto_parallel_context()
    parallel_mode = ms.ParallelMode.STAND_ALONE
    degree = 1
    if args.is_distributed:
        parallel_mode = ms.ParallelMode.DATA_PARALLEL
        degree = comm.get_group_size()
    ms.set_auto_parallel_context(parallel_mode=parallel_mode, gradients_mean=True, device_num=degree)


@moxing_wrapper(pre_process=modelarts_pre_process)
def run_train():
    """Train function."""
    if config.lr_scheduler == 'cosine_annealing' and config.max_epoch > config.T_max:
        config.T_max = config.max_epoch
    config.lr_epochs = list(map(int, config.lr_epochs.split(',')))
    config.data_root = os.path.join(config.data_dir, 'train2014')
    config.annFile = os.path.join(config.data_dir, 'annotations/instances_train2014.json')
    profiler = network_init(config)

    loss_meter = AverageMeter('loss')
    parallel_init(config)

    network = YOLOV3DarkNet53(is_training=True)
    # default is kaiming-normal
    default_recurisive_init(network)
    load_yolov3_params(config, network)

    network = YoloWithLossCell(network)
    config.logger.info('finish get network')

    if config.training_shape:
        config.multi_scale = [conver_training_shape(config)]

    ds = create_yolo_dataset(image_dir=config.data_root, anno_path=config.annFile,
                             is_training=True, batch_size=config.per_batch_size,
                             device_num=config.group_size, rank=config.rank, config=config)
    config.logger.info('Finish loading dataset')

    config.steps_per_epoch = ds.get_dataset_size()
    lr = get_lr(config)
    opt = nn.Momentum(params=get_param_groups(network), momentum=config.momentum,
                      learning_rate=ms.Tensor(lr), weight_decay=config.weight_decay,
                      loss_scale=config.loss_scale)
    is_gpu = ms.get_context("device_target") == "GPU"
    if is_gpu:
        loss_scale_value = 1.0
        loss_scale = ms.FixedLossScaleManager(loss_scale_value, drop_overflow_update=False)
        network = ms.build_train_network(network, optimizer=opt, loss_scale_manager=loss_scale,
                                         level="O2", keep_batchnorm_fp32=False)
        keep_loss_fp32(network)
    else:
        network = nn.TrainOneStepCell(network, opt, sens=config.loss_scale)
        network.set_train()

    t_end = time.time()
    data_loader = ds.create_dict_iterator(output_numpy=True)
    first_step = True
    stop_profiler = False

    for epoch_idx in range(config.max_epoch):
        for step_idx, data in enumerate(data_loader):
            images = data["image"]
            input_shape = images.shape[2:4]
            config.logger.info('iter[{}], shape{}'.format(step_idx, input_shape[0]))
            images = ms.Tensor.from_numpy(images)
            batch_y_true_0 = ms.Tensor.from_numpy(data['bbox1'])
            batch_y_true_1 = ms.Tensor.from_numpy(data['bbox2'])
            batch_y_true_2 = ms.Tensor.from_numpy(data['bbox3'])
            batch_gt_box0 = ms.Tensor.from_numpy(data['gt_box1'])
            batch_gt_box1 = ms.Tensor.from_numpy(data['gt_box2'])
            batch_gt_box2 = ms.Tensor.from_numpy(data['gt_box3'])
            loss = network(images, batch_y_true_0, batch_y_true_1, batch_y_true_2,
                           batch_gt_box0, batch_gt_box1, batch_gt_box2)
            loss_meter.update(loss.asnumpy())

            # it is used for loss, performance output per config.log_interval steps.
            if (epoch_idx * config.steps_per_epoch + step_idx) % config.log_interval == 0:
                time_used = time.time() - t_end
                if first_step:
                    fps = config.per_batch_size * config.group_size / time_used
                    per_step_time = time_used * 1000
                    first_step = False
                else:
                    fps = config.per_batch_size * config.log_interval * config.group_size / time_used
                    per_step_time = time_used / config.log_interval * 1000
                config.logger.info('epoch[{}], iter[{}], {}, fps:{:.2f} imgs/sec, '
                                   'lr:{}, per step time: {}ms'.format(
                                       epoch_idx + 1, step_idx + 1, loss_meter, fps,
                                       lr[step_idx], per_step_time))
                t_end = time.time()
                loss_meter.reset()
            if config.need_profiler:
                if epoch_idx * config.steps_per_epoch + step_idx == 10:
                    profiler.analyse()
                    stop_profiler = True
                    break
        if config.rank_save_ckpt_flag:
            ckpt_path = os.path.join(config.outputs_dir, 'ckpt_' + str(config.rank))
            if not os.path.exists(ckpt_path):
                os.makedirs(ckpt_path, exist_ok=True)
            ckpt_name = os.path.join(ckpt_path,
                                     "yolov3_{}_{}.ckpt".format(epoch_idx + 1, config.steps_per_epoch))
            ms.save_checkpoint(network, ckpt_name)
        if stop_profiler:
            break
    config.logger.info('==========end training===============')


if __name__ == "__main__":
    run_train()
```

    3. Root cause
    Let me record the troubleshooting process. The user trained on a private dataset, so it was impossible to tell whether the code or the dataset was at fault. First, the same network was run on the public COCO dataset; no error occurred, which ruled out a code problem. Then get_batch_size(), get_class_indexing(), get_col_names(), get_dataset_size() and get_repeat_count() were used to check whether the dataset was loaded correctly; it was, which ruled out problems with the images themselves. By chance it then turned out that the json file describing the image labels in the dataset was corrupted: one image listed in it could not be found. That concluded the investigation.
    4. Solution
    Switched to a dataset (non-public) whose json file is correct.
    5. Lessons learned
    When troubleshooting dataset problems, do not only check the directory structure and the images; the json file that stores the image metadata matters just as much.
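    Following the lesson above, a small standard-library check (the function name is made up for illustration) that compares the file_name entries of a COCO-style instances json against the images actually present on disk would have caught this before training started:

```python
import json
import os


def find_missing_images(ann_file, image_dir):
    """Return file_names listed in a COCO-style json but absent from image_dir."""
    with open(ann_file, "r") as f:
        ann = json.load(f)  # also fails loudly if the json itself is corrupt
    listed = [img["file_name"] for img in ann.get("images", [])]
    on_disk = set(os.listdir(image_dir))
    return [name for name in listed if name not in on_disk]
```

    Running this on data_root and annFile before launching a run turns a mid-training decode failure into an explicit list of dangling json entries.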
  • [Help Wanted] Can the bearpi-hm-micro take an external camera module?
    Could anyone tell me whether the bearpi-hm-micro can work with an external camera module, and whether it could then be used for deep learning?
  • [Technical Article] NPU utilization stays at 0 when running inference on Ascend 910
    I built a LeNet5 network with MindSpore and trained it on the MNIST dataset, running on the ModelArts platform with an Ascend 910 from the public resource pool. During training the NPU utilization is around 10%, but during inference it stays at 0. Is this normal? Is there a separate switch that enables the NPU, or is building the network with MindSpore enough?
  • [Execution Issue] Training aborts at epoch 23 and cannot continue
    Error message:
    RuntimeError: For 'GetNext', get data timeout. Queue name: 9d263f4a-46dc-11ed-b99c-0242ac110002
    ----------------------------------------------------
    - C++ Call Stack: (For framework developers)
    ----------------------------------------------------
    mindspore/ccsrc/plugin/device/gpu/kernel/data/dataset_iterator_kernel.cc:135 ReadDevice
    I cannot tell which part of my code is at fault. Since the error is in dataset_iterator_kernel, is the problem in data loading? The function, adapted from the official example, is as follows:

```python
def create_dataset(dataset_path, spilt, repeat_num=1, batch_size=32, num_classes=6):
    """
    Create a train or eval dataset.

    Args:
        dataset_path (str): The path of dataset.
        spilt (str): Whether dataset is used for train or eval.
        repeat_num (int): The repeat times of dataset. Default: 1.
        batch_size (int): The batch size of dataset. Default: 32.

    Returns:
        Dataset.
    """
    if spilt == 'train':
        dataset_path = os.path.join(dataset_path, 'train')
        do_shuffle = True
    elif spilt == 'val':
        dataset_path = os.path.join(dataset_path, 'val')
        do_shuffle = False
    else:
        dataset_path = os.path.join(dataset_path, 'test')
        do_shuffle = False
    if device_num == 1:
        ds = das.ImageFolderDataset(dataset_path, num_parallel_workers=8, shuffle=do_shuffle, decode=True,
                                    class_indexing={'scab': 0, 'healthy': 1, 'frog_eye_leaf_spot': 2,
                                                    'rust': 3, 'complex': 4, 'powdery_mildew': 5})
    else:
        ds = das.ImageFolderDataset(dataset_path, num_parallel_workers=8, shuffle=do_shuffle,
                                    num_shards=device_num, shard_id=device_id, decode=True,
                                    class_indexing={'scab': 0, 'healthy': 1, 'frog_eye_leaf_spot': 2,
                                                    'rust': 3, 'complex': 4, 'powdery_mildew': 5})
    resize_height = 224
    resize_width = 224
    buffer_size = 100
    rescale = 1.0 / 255.0
    shift = 0.0
    # define map operations
    random_crop_op = C.RandomCrop((32, 32), (4, 4, 4, 4))
    random_horizontal_flip_op = C.RandomHorizontalFlip(device_id / (device_id + 1))
    resize_op = C.Resize((resize_height, resize_width))
    rescale_op = C.Rescale(rescale, shift)
    normalize_op = C.Normalize([0.4914, 0.4822, 0.4465], [0.2023, 0.1994, 0.2010])
    change_swap_op = C.HWC2CHW()
    trans = []
    if spilt == 'train':
        trans += [random_crop_op, random_horizontal_flip_op]
    type_op = C2.TypeCast(mstype.float32)
    trans += [resize_op, rescale_op, normalize_op, change_swap_op, type_op]
    Onehot_op = C2.OneHot(num_classes)
    type_cast_op = C2.TypeCast(mstype.int32)
    ds = ds.map(operations=type_cast_op, input_columns="label", num_parallel_workers=8)
    ds = ds.map(operations=trans, input_columns="image", num_parallel_workers=8)
    # apply batch operations
    ds = ds.batch(batch_size, drop_remainder=True)
    # apply dataset repeat operation
    ds = ds.repeat(repeat_num)
    return ds
```
  • [Inference] SDK inference with the 5.0.2 image runs fine on a 5.1.RC1 environment, but reports error 507011 on a 5.0.4 environment
    SDK inference with the 5.0.2 image works on the 5.1.RC1 environment but fails with error 507011 on the 5.0.4 environment.
    OS: Ubuntu 18.04
    800-9000 server (910A), CANN 5.1.RC1
    800-9000 server (910B), CANN 5.0.4
    SDK 5.0.2; image version CANN 5.0.2
    The .om model was downloaded directly from ModelZoo, at the following link: cid:link_0
  • [Help Wanted] Error when saving the model while training YOLOv4 on ModelArts
    [mindspore/train/serialization.py:189] Failed to save the checkpoint file /cache/train/outputs/2022-09-16_time_10_09_50/ckpt_0/0-1_154.ckpt.
  • [Execution Issue] MindSpore framework, MaskRCNN model, GPU training: loss becomes NaN
    When training a MaskRCNN model on GPU with the MindSpore framework, the loss becomes NaN.
  • [Technical Article] ImageNet2012 data preprocessing
    Purpose
    One-click extraction and labeling of the ImageNet2012 dataset.
    Introduction
    A shell script preprocesses the ImageNet2012 dataset: running it extracts the training and validation sets and fixes the problem that the validation set has no labels.
    Dataset: ImageNet2012, 1000 classes of 224*224 color images
    Training set: 1,281,167 images
    Validation set: 50,000 images
    Applicable scenarios for the downloaded dataset
    ImageNet2012: ILSVRC2012_img_train.tar still contains tar files after extraction; ILSVRC2012_img_val.tar has no labels after extraction.
    Usage
    Directory layout:
    ILSVRC2012
    ├── ILSVRC2012_img_val.tar
    ├── ILSVRC2012_img_train.tar
    ├── ImageNet2012Preprocess/
    ├── train/
    └── val/
    Via git:
    git clone cid:link_1.git
    bash ImageNet2012Preprocess/ImageNet2012Preprocess.sh
    Via zip download:
    Visit cid:link_1, download the zip package, then:
    unzip ImageNet2012Preprocess-master.zip
    bash ImageNet2012Preprocess-master/ImageNet2012Preprocess.sh
    Run in the background:
    nohup bash ImageNet2012Preprocess/ImageNet2012Preprocess.sh &
    or
    nohup bash ImageNet2012Preprocess-master/ImageNet2012Preprocess.sh &
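    The "tar inside tar" step for the training set can be sketched with the Python standard library alone (paths and the function name are illustrative; the repository's shell script does the equivalent with tar in a loop):

```python
import os
import tarfile


def extract_train_tars(train_tar, out_dir):
    """Unpack ILSVRC2012_img_train.tar, then unpack each per-class
    nXXXXXXXX.tar into its own class directory and delete the inner tar."""
    os.makedirs(out_dir, exist_ok=True)
    with tarfile.open(train_tar) as outer:
        outer.extractall(out_dir)  # leaves 1000 per-class .tar files in out_dir
    for name in os.listdir(out_dir):
        if not name.endswith(".tar"):
            continue
        class_dir = os.path.join(out_dir, name[:-4])  # e.g. train/n01440764/
        os.makedirs(class_dir, exist_ok=True)
        inner_path = os.path.join(out_dir, name)
        with tarfile.open(inner_path) as inner:
            inner.extractall(class_dir)
        os.remove(inner_path)  # reclaim disk space as we go
```

    The resulting train/nXXXXXXXX/*.JPEG layout is exactly what ImageFolder-style dataset loaders expect, with the WordNet synset id serving as the class label.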
  • [Help Wanted] YOLOv4 single-frame inference takes much longer than YOLOv3
    YOLOv3 takes about 10 ms to infer one frame, while YOLOv4 takes about 40 ms, several times longer. How can the single-frame inference time of YOLOv4 be reduced? Both models tested are .om offline models. Is there any way to fix this?
  • [MindX SDK] SDK inference error
    An error occurs when running inference with MindX SDK 2.0.4 mxManufacture.