• [Execution Issue] [MindSpore][Model Training] The deeplabv3+ model from Model_Zoo fails with a mixed-precision error when run on GPU
    Environment: Ubuntu 18.04, Python 3.7.5, MindSpore 1.3.0 (GPU), CUDA 10.1. Training runs normally on CPU; on GPU it fails with the error below:

[ERROR] DEVICE(19755,7f0b04de2740,python3.7):2021-09-22-13:58:04.484.439 [mindspore/ccsrc/runtime/device/gpu/kernel_info_setter.cc:118] SelectAkgKernel] Not find op[SoftmaxCrossEntropyWithLogits] in akg
[ERROR] DEVICE(19755,7f0b04de2740,python3.7):2021-09-22-13:58:04.484.496 [mindspore/ccsrc/runtime/device/gpu/kernel_info_setter.cc:347] PrintUnsupportedTypeException] Select GPU kernel op[SoftmaxCrossEntropyWithLogits] fail! Incompatible data type! The supported data types are in[float32 float32], out[float32 float32]; but get in [float16 float16] out [float16 float16]
Traceback (most recent call last):
  File "/home/hb2020/zy/Project/RemoteProject/Deeplabv3plus_mindspore/train.py", line 221, in <module>
    train()
  File "/home/hb2020/zy/Project/RemoteProject/Deeplabv3plus_mindspore/train.py", line 205, in train
    model.train(args.train_epochs, dataset, callbacks=cbs)
  File "/home/hb2020/anaconda3/envs/MindSp/lib/python3.7/site-packages/mindspore/train/model.py", line 649, in train
    sink_size=sink_size)
  File "/home/hb2020/anaconda3/envs/MindSp/lib/python3.7/site-packages/mindspore/train/model.py", line 439, in _train
    self._train_dataset_sink_process(epoch, train_dataset, list_callback, cb_params, sink_size)
  File "/home/hb2020/anaconda3/envs/MindSp/lib/python3.7/site-packages/mindspore/train/model.py", line 499, in _train_dataset_sink_process
    outputs = self._train_network(*inputs)
  File "/home/hb2020/anaconda3/envs/MindSp/lib/python3.7/site-packages/mindspore/nn/cell.py", line 386, in __call__
    out = self.compile_and_run(*inputs)
  File "/home/hb2020/anaconda3/envs/MindSp/lib/python3.7/site-packages/mindspore/nn/cell.py", line 644, in compile_and_run
    self.compile(*inputs)
  File "/home/hb2020/anaconda3/envs/MindSp/lib/python3.7/site-packages/mindspore/nn/cell.py", line 631, in compile
    _executor.compile(self, *inputs, phase=self.phase, auto_parallel_mode=self._auto_parallel_mode)
  File "/home/hb2020/anaconda3/envs/MindSp/lib/python3.7/site-packages/mindspore/common/api.py", line 531, in compile
    result = self._executor.compile(obj, args_list, phase, use_vm, self.queue_name)
TypeError: mindspore/ccsrc/runtime/device/gpu/kernel_info_setter.cc:347 PrintUnsupportedTypeException] Select GPU kernel op[SoftmaxCrossEntropyWithLogits] fail! Incompatible data type! The supported data types are in[float32 float32], out[float32 float32]; but get in [float16 float16] out [float16 float16]
[WARNING] MD(19755,7f0b04de2740,python3.7):2021-09-22-13:58:04.572.570 [mindspore/ccsrc/minddata/dataset/engine/datasetops/device_queue_op.cc:73] ~DeviceQueueOp] preprocess_batch: 0; batch_queue: 0; push_start_time: ; push_end_time: .
Process finished with exit code 1

    The training script (train.py):

import os
import argparse
import ast
import time

from mindspore import context
from mindspore.train.model import Model
from mindspore.context import ParallelMode
import mindspore.nn as nn
from mindspore.train.callback import ModelCheckpoint, CheckpointConfig
from mindspore.train.callback import LossMonitor, TimeMonitor
from mindspore.train.serialization import load_checkpoint, load_param_into_net
from mindspore.communication.management import init, get_rank, get_group_size
from mindspore.train.loss_scale_manager import FixedLossScaleManager
from mindspore.common import set_seed

from src import dataset as data_generator
from src import loss, learning_rates
from src.deeplab_v3plus import DeepLabV3Plus

set_seed(1)


class BuildTrainNetwork(nn.Cell):
    def __init__(self, network, criterion):
        super(BuildTrainNetwork, self).__init__()
        self.network = network
        self.criterion = criterion

    def construct(self, input_data, label):
        output = self.network(input_data)
        net_loss = self.criterion(output, label)
        return net_loss


def parse_args():
    parser = argparse.ArgumentParser('MindSpore DeepLabV3+ training')

    # Ascend or CPU
    parser.add_argument('--train_dir', type=str, default='./save_record',
                        help='where training log and CKPTs saved')

    # dataset
    parser.add_argument('--data_file', type=str, default='./Data/Mindrecords/VOC2012/voc_train_mindspore',
                        help='path and Name of one MindRecord file')
    parser.add_argument('--batch_size', type=int, default=32, help='batch size')
    parser.add_argument('--crop_size', type=int, default=513, help='crop size')
    parser.add_argument('--image_mean', type=list, default=[103.53, 116.28, 123.675], help='image mean')
    parser.add_argument('--image_std', type=list, default=[57.375, 57.120, 58.395], help='image std')
    parser.add_argument('--min_scale', type=float, default=0.5, help='minimum scale of data augmentation')
    parser.add_argument('--max_scale', type=float, default=2.0, help='maximum scale of data augmentation')
    parser.add_argument('--ignore_label', type=int, default=255, help='ignore label')
    parser.add_argument('--num_classes', type=int, default=21, help='number of classes')

    # optimizer
    parser.add_argument('--train_epochs', type=int, default=1, help='epoch, default=300')
    parser.add_argument('--lr_type', type=str, default='cos', help='type of learning rate')
    parser.add_argument('--base_lr', type=float, default=0.08, help='base learning rate')
    parser.add_argument('--lr_decay_step', type=int, default=40000, help='learning rate decay step')
    parser.add_argument('--lr_decay_rate', type=float, default=0.1, help='learning rate decay rate')
    parser.add_argument('--loss_scale', type=float, default=3072.0, help='loss scale')

    # model
    parser.add_argument('--model', type=str, default='DeepLabV3plus_s16', help='select model')
    parser.add_argument('--freeze_bn', action='store_true', help='freeze bn')
    parser.add_argument('--ckpt_pre_trained', type=str, default='./src/Pretrained/resnet.ckpt',
                        help='PreTrained model')

    # train
    parser.add_argument('--device_target', type=str, default='GPU', choices=['Ascend', 'CPU', 'GPU'],
                        help='device where the code will be implemented. (Default: Ascend)')
    parser.add_argument('--is_distributed', action='store_true', help='distributed training')
    parser.add_argument('--rank', type=int, default=0, help='local rank of distributed')
    parser.add_argument('--group_size', type=int, default=1, help='world size of distributed')
    parser.add_argument('--save_steps', type=int, default=110, help='steps interval for saving')
    parser.add_argument('--keep_checkpoint_max', type=int, default=200, help='max checkpoint for saving')

    # ModelArts
    parser.add_argument('--modelArts_mode', type=ast.literal_eval, default=False,
                        help='train on modelarts or not, default is False')
    parser.add_argument('--train_url', type=str, default='', help='where training log and CKPTs saved')
    parser.add_argument('--data_url', type=str, default='', help='the directory path of saved file')
    parser.add_argument('--dataset_filename', type=str, default='', help='Name of the MindRecord file')
    parser.add_argument('--pretrainedmodel_filename', type=str, default='',
                        help='Name of the pretraining model file')

    args, _ = parser.parse_known_args()
    return args


def train():
    args = parse_args()
    if args.device_target == "CPU":
        context.set_context(mode=context.GRAPH_MODE, save_graphs=False, device_target="CPU")
    elif args.device_target == "GPU":
        context.set_context(mode=context.GRAPH_MODE, device_target="GPU",
                            enable_graph_kernel=True, save_graphs=True)
    else:
        context.set_context(mode=context.GRAPH_MODE, enable_auto_mixed_precision=True,
                            save_graphs=False, device_target="Ascend",
                            device_id=int(os.getenv('DEVICE_ID')))

    # init multicards training
    if args.modelArts_mode:
        import moxing as mox
        local_data_url = '/cache/data'
        local_train_url = '/cache/ckpt'
        device_id = int(os.getenv('DEVICE_ID'))
        device_num = int(os.getenv('RANK_SIZE'))
        if device_num > 1:
            init()
            args.rank = get_rank()
            args.group_size = get_group_size()
            parallel_mode = ParallelMode.DATA_PARALLEL
            context.set_auto_parallel_context(parallel_mode=parallel_mode, gradients_mean=True,
                                              device_num=args.group_size)
            local_data_url = os.path.join(local_data_url, str(device_id))
        # download dataset from obs to cache
        mox.file.copy_parallel(src_url=args.data_url, dst_url=local_data_url)
        data_file = local_data_url + '/' + args.dataset_filename
        ckpt_file = local_data_url + '/' + args.pretrainedmodel_filename
        train_dir = local_train_url
    else:
        if args.is_distributed:
            init()
            args.rank = get_rank()
            args.group_size = get_group_size()
            parallel_mode = ParallelMode.DATA_PARALLEL
            context.set_auto_parallel_context(parallel_mode=parallel_mode, gradients_mean=True,
                                              device_num=args.group_size)
        data_file = args.data_file
        ckpt_file = args.ckpt_pre_trained
        train_dir = args.train_dir

    # dataset
    dataset = data_generator.SegDataset(image_mean=args.image_mean,
                                        image_std=args.image_std,
                                        data_file=data_file,
                                        batch_size=args.batch_size,
                                        crop_size=args.crop_size,
                                        max_scale=args.max_scale,
                                        min_scale=args.min_scale,
                                        ignore_label=args.ignore_label,
                                        num_classes=args.num_classes,
                                        num_readers=2,
                                        num_parallel_calls=4,
                                        shard_id=args.rank,
                                        shard_num=args.group_size)
    dataset = dataset.get_dataset(repeat=1)
    # data: dtype=float32  label: dtype=uint8
    # for item in dataset.create_dict_iterator(output_numpy=True):
    #     print(item['data'].dtype)

    # network
    if args.model == 'DeepLabV3plus_s16':
        network = DeepLabV3Plus('train', args.num_classes, 16, args.freeze_bn)
    elif args.model == 'DeepLabV3plus_s8':
        network = DeepLabV3Plus('train', args.num_classes, 8, args.freeze_bn)
    else:
        raise NotImplementedError('model [{:s}] not recognized'.format(args.model))

    # loss
    loss_ = loss.SoftmaxCrossEntropyLoss(args.num_classes, args.ignore_label)
    loss_.add_flags_recursive(fp32=True)
    train_net = BuildTrainNetwork(network, loss_)

    # load pretrained model
    if args.ckpt_pre_trained or args.pretrainedmodel_filename:
        param_dict = load_checkpoint(ckpt_file)
        load_param_into_net(train_net, param_dict)

    # optimizer
    iters_per_epoch = dataset.get_dataset_size()
    print("iters_per_epoch = ", iters_per_epoch)
    total_train_steps = iters_per_epoch * args.train_epochs
    if args.lr_type == 'cos':
        lr_iter = learning_rates.cosine_lr(args.base_lr, total_train_steps, total_train_steps)
    elif args.lr_type == 'poly':
        lr_iter = learning_rates.poly_lr(args.base_lr, total_train_steps, total_train_steps,
                                         end_lr=0.0, power=0.9)
    elif args.lr_type == 'exp':
        lr_iter = learning_rates.exponential_lr(args.base_lr, args.lr_decay_step, args.lr_decay_rate,
                                                total_train_steps, staircase=True)
    else:
        raise ValueError('unknown learning rate type')
    opt = nn.Momentum(params=train_net.trainable_params(), learning_rate=lr_iter, momentum=0.9,
                      weight_decay=0.0001, loss_scale=args.loss_scale)

    # loss scale
    manager_loss_scale = FixedLossScaleManager(args.loss_scale, drop_overflow_update=False)
    # amp_level = "O0" if args.device_target == "CPU" else "O3"  # GPU == O2
    if args.device_target == "CPU":
        amp_level = "O0"
    elif args.device_target == "GPU":
        amp_level = "O2"
    else:
        amp_level = "O3"
    model = Model(train_net, optimizer=opt, amp_level=amp_level, loss_scale_manager=manager_loss_scale)

    # callback for saving ckpts
    time_cb = TimeMonitor(data_size=iters_per_epoch)
    loss_cb = LossMonitor()
    cbs = [time_cb, loss_cb]
    if args.rank == 0:
        config_ck = CheckpointConfig(save_checkpoint_steps=args.save_steps,
                                     keep_checkpoint_max=args.keep_checkpoint_max)
        ckpoint_cb = ModelCheckpoint(prefix=args.model, directory=train_dir, config=config_ck)
        cbs.append(ckpoint_cb)

    model.train(args.train_epochs, dataset, callbacks=cbs)

    if args.modelArts_mode:
        # copy train result from cache to obs
        if args.rank == 0:
            mox.file.copy_parallel(src_url=local_train_url, dst_url=args.train_url)


if __name__ == '__main__':
    train()
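    A minimal sketch of two workarounds commonly suggested for this class of error, stated under assumptions (args and loss come from the script above; not verified on this exact model): akg has no float16 kernel for SoftmaxCrossEntropyWithLogits, so (1) disable graph kernel fusion so kernel selection falls back to the native GPU kernels, and (2) pin the loss cell to float32 so the op sees supported dtypes even under amp_level "O2"/"O3".

from mindspore import context
from mindspore import dtype as mstype

# (1) let kernel selection skip akg on GPU
context.set_context(mode=context.GRAPH_MODE, device_target="GPU",
                    enable_graph_kernel=False, save_graphs=False)

# (2) keep the whole loss cell in fp32 (same intent as add_flags_recursive(fp32=True))
loss_ = loss.SoftmaxCrossEntropyLoss(args.num_classes, args.ignore_label)
loss_.to_float(mstype.float32)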
  • [Other] Viewing a device's hardware information
    Introduction: We often want to inspect the machine we are working on, especially during training, when GPU memory usage and utilization are worth watching. Everyone is familiar with the tool for that: open a terminal and run nvidia-smi. But that command only shows a one-off snapshot; it cannot keep refreshing in real time. For continuous monitoring, use another command instead:

watch -n 0.1 nvidia-smi

    That, however, is not the main point. The main point is seeing CPU information, and that is easy with the s-tui tool, which installs in one step:

pip install s-tui

    Running it is just as simple: execute the s-tui command. Press the down-arrow key to see the specifications of every core. The detail is impressive. On the free trial flavor I was using, the processor is an Intel Xeon E5-2690 v4 @ 2.60GHz with 8 cores, a server-grade part with stable performance whose frequency holds steady at 2.60GHz. The purple section shows each core's frequency (essentially constant), and the green section shows utilization. Not bad: this processor is better than the ones in most laptops and desktops, at least in core count, so data loading during training can use multiple threads, which greatly speeds up reading data and saves training time.
  • [Other] Reinforcement learning: basic model and principles
    Reinforcement learning grew out of theories of animal learning and parameter-perturbation adaptive control. Its basic principle: if some action policy of the Agent draws a positive reward (reinforcement signal) from the environment, the Agent's tendency to produce that policy later is strengthened. The Agent's goal is to find, for every discrete state, the optimal policy that maximizes the expected sum of discounted rewards.
    Reinforcement learning treats learning as a process of trial and evaluation. The Agent applies an action to the environment; on receiving the action, the environment's state changes and a reinforcement signal (reward or punishment) is produced and fed back to the Agent; the Agent then chooses the next action based on the reinforcement signal and the environment's current state, choosing so as to raise the probability of receiving positive reinforcement (reward). The chosen action affects not only the immediate reinforcement value but also the environment's state at the next moment, and hence the final reinforcement value.
    Reinforcement learning differs from supervised learning in connectionist learning mainly in the teacher signal: the reinforcement signal provided by the environment is an evaluation of how good the produced action was (usually a scalar signal), not an instruction telling the Agent how to produce the correct action. Because the external environment supplies so little information, the Agent must learn from its own experience. In this way the Agent acquires knowledge in an act-then-evaluate environment and improves its plan of action to suit the environment. The learning goal of a reinforcement learning system is to adjust its parameters dynamically so as to maximize the reinforcement signal. If the gradient ∂r/∂A were known, supervised learning algorithms could be used directly; but since the reinforcement signal r has no explicit functional relationship with the action A produced by the Agent, the gradient ∂r/∂A is unavailable. A reinforcement learning system therefore needs some kind of stochastic unit, with which the Agent searches the space of possible actions and discovers the correct action.
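    The trial-and-evaluation loop above, with a stochastic unit standing in for the unavailable gradient ∂r/∂A, can be made concrete with a small sketch. This is a generic tabular Q-learning example on a toy environment invented here for illustration; it is not from the original post.

import random

class ChainEnv:
    """Toy 5-state chain: actions move left/right; reward 1 only at the right end."""
    actions = (-1, +1)

    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):
        self.s = min(max(self.s + a, 0), 4)
        done = self.s == 4
        return self.s, (1.0 if done else 0.0), done

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    q = {}  # (state, action) -> estimated discounted return
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # stochastic unit: explore a random action with probability eps,
            # otherwise exploit the best currently known action (ties broken at random)
            if random.random() < eps:
                a = random.choice(env.actions)
            else:
                best = max(q.get((s, act), 0.0) for act in env.actions)
                a = random.choice([act for act in env.actions
                                   if q.get((s, act), 0.0) == best])
            s_next, r, done = env.step(a)  # environment feeds back the reinforcement r
            best_next = max(q.get((s_next, act), 0.0) for act in env.actions)
            old = q.get((s, a), 0.0)
            # nudge Q(s, a) toward r + gamma * max_a' Q(s', a')
            q[(s, a)] = old + alpha * (r + gamma * best_next - old)
            s = s_next
    return q

print(q_learning(ChainEnv()))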
  • [Execution Issue] Error during training
    [Module] The first round of training runs normally, but from the second round on an error is raised, as in the screenshot below; it never failed before. I tried loading part of ResNet-50's pretrained weights and modified the layout of the initial conv1; the loading code is shown below, and I am not sure whether this change is the cause. [Steps & Symptom] 1. 2. [Screenshots] [Logs] (optional; attach log content or files)
  • [Training Management] [ModelArts] Error partway through training.
    Multi-card training (8 cards, and 2 cards) on the ModelArts cloud platform fails after running for a while. Could the cause be insufficient memory?
  • [Activity Experience] "Challenge the Advanced Tutorials: Activity 1"
    Activity 1: adversarial example generation
    1. Import the libraries needed for model training
    2. Train the LeNet network

context.set_context(mode=context.GRAPH_MODE, device_target='CPU')

class LeNet5(nn.Cell):
    def __init__(self, num_class=10, num_channel=1):
        super(LeNet5, self).__init__()
        self.conv1 = nn.Conv2d(num_channel, 6, 5, pad_mode='valid')
        self.conv2 = nn.Conv2d(6, 16, 5, pad_mode='valid')
        self.fc1 = nn.Dense(16 * 5 * 5, 120, weight_init=Normal(0.02))
        self.fc2 = nn.Dense(120, 84, weight_init=Normal(0.02))
        self.fc3 = nn.Dense(84, num_class, weight_init=Normal(0.02))
        self.relu = nn.ReLU()
        self.max_pool2d = nn.MaxPool2d(kernel_size=2, stride=2)
        self.flatten = nn.Flatten()

    def construct(self, x):
        x = self.conv1(x)
        x = self.relu(x)
        x = self.max_pool2d(x)
        x = self.conv2(x)
        x = self.relu(x)
        x = self.max_pool2d(x)
        x = self.flatten(x)
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.relu(x)
        x = self.fc3(x)
        return x

net = LeNet5()

def create_dataset(data_path, batch_size=1, num_parallel_workers=1):
    # define the dataset
    mnist_ds = ds.MnistDataset(data_path)
    resize_height, resize_width = 32, 32
    rescale = 1.0 / 255.0
    shift = 0.0
    rescale_nml = 1 / 0.3081
    shift_nml = -1 * 0.1307 / 0.3081

    # define the map operations
    resize_op = CV.Resize((resize_height, resize_width), interpolation=Inter.LINEAR)
    rescale_nml_op = CV.Rescale(rescale_nml, shift_nml)
    rescale_op = CV.Rescale(rescale, shift)
    hwc2chw_op = CV.HWC2CHW()
    type_cast_op = C.TypeCast(mstype.int32)

    # apply the map operations to the dataset
    mnist_ds = mnist_ds.map(operations=type_cast_op, input_columns="label", num_parallel_workers=num_parallel_workers)
    mnist_ds = mnist_ds.map(operations=resize_op, input_columns="image", num_parallel_workers=num_parallel_workers)
    mnist_ds = mnist_ds.map(operations=rescale_op, input_columns="image", num_parallel_workers=num_parallel_workers)
    mnist_ds = mnist_ds.map(operations=rescale_nml_op, input_columns="image", num_parallel_workers=num_parallel_workers)
    mnist_ds = mnist_ds.map(operations=hwc2chw_op, input_columns="image", num_parallel_workers=num_parallel_workers)

    # shuffle and batch
    buffer_size = 10000
    mnist_ds = mnist_ds.shuffle(buffer_size=buffer_size)
    mnist_ds = mnist_ds.batch(batch_size, drop_remainder=True)
    return mnist_ds

net_loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')
net_opt = nn.Momentum(net.trainable_params(), learning_rate=0.01, momentum=0.9)
config_ck = CheckpointConfig(save_checkpoint_steps=1875, keep_checkpoint_max=10)
ckpoint = ModelCheckpoint(prefix="checkpoint_lenet", config=config_ck)

def test_net(model, data_path):
    ds_eval = create_dataset(os.path.join(data_path, "test"))
    acc = model.eval(ds_eval, dataset_sink_mode=False)
    print("{}".format(acc))

def train_net(model, epoch_size, data_path, ckpoint_cb, sink_mode):
    ds_train = create_dataset(os.path.join(data_path, "train"), 32)
    model.train(epoch_size, ds_train, callbacks=[ckpoint_cb, LossMonitor(125)], dataset_sink_mode=sink_mode)

train_epoch = 2
mnist_path = "E:/mnist_dataset"
model = Model(net, net_loss, net_opt, metrics={"Accuracy": nn.Accuracy()})
train_net(model, train_epoch, mnist_path, ckpoint, False)
test_net(model, mnist_path)

    Training process: (screenshot)
    3. Load the selected model: (screenshot)
    4. Implement FGSM:

class WithLossCell(nn.Cell):
    """Wrap the network together with the loss function."""
    def __init__(self, network, loss_fn):
        super(WithLossCell, self).__init__()
        self._network = network
        self._loss_fn = loss_fn

    def construct(self, data, label):
        out = self._network(data)
        return self._loss_fn(out, label)

class GradWrapWithLoss(nn.Cell):
    """Compute the backward gradient through the loss."""
    def __init__(self, network):
        super(GradWrapWithLoss, self).__init__()
        self._grad_all = ops.composite.GradOperation(get_all=True, sens_param=False)
        self._network = network

    def construct(self, inputs, labels):
        gout = self._grad_all(self._network)(inputs, labels)
        return gout[0]

class FastGradientSignMethod:
    """Implement the FGSM attack."""
    def __init__(self, network, eps=0.07, loss_fn=None):
        # initialize attributes
        self._network = network
        self._eps = eps
        with_loss_cell = WithLossCell(self._network, loss_fn)
        self._grad_all = GradWrapWithLoss(with_loss_cell)
        self._grad_all.set_train()

    def _gradient(self, inputs, labels):
        # compute the gradient
        out_grad = self._grad_all(inputs, labels)
        gradient = out_grad.asnumpy()
        gradient = np.sign(gradient)
        return gradient

    def generate(self, inputs, labels):
        # run FGSM
        inputs_tensor = Tensor(inputs)
        labels_tensor = Tensor(labels)
        gradient = self._gradient(inputs_tensor, labels_tensor)
        # build the perturbation
        perturbation = self._eps * gradient
        # generate the perturbed images
        adv_x = inputs + perturbation
        return adv_x

    def batch_generate(self, inputs, labels, batch_size=32):
        # process the dataset in batches
        arr_x = inputs
        arr_y = labels
        len_x = len(inputs)
        batches = int(len_x / batch_size)
        rest = len_x - batches * batch_size
        res = []
        for i in range(batches):
            x_batch = arr_x[i * batch_size: (i + 1) * batch_size]
            y_batch = arr_y[i * batch_size: (i + 1) * batch_size]
            adv_x = self.generate(x_batch, y_batch)
            res.append(adv_x)
        adv_x = np.concatenate(res, axis=0)
        return adv_x

images = []
labels = []
test_images = []
test_labels = []
predict_labels = []
ds_test = create_dataset(os.path.join(mnist_path, "test"), batch_size=32).create_dict_iterator(output_numpy=True)
for data in ds_test:
    images = data['image'].astype(np.float32)
    labels = data['label']
    test_images.append(images)
    test_labels.append(labels)
    pred_labels = np.argmax(model.predict(Tensor(images)).asnumpy(), axis=1)
    predict_labels.append(pred_labels)

test_images = np.concatenate(test_images)
predict_labels = np.concatenate(predict_labels)
true_labels = np.concatenate(test_labels)

    5. Run the attack (see the sketch after this post)
    Email: mycode_mindspore@163.com
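    Step 5 of this submission stops at the heading; here is a minimal sketch of the attack run, mirroring the next submission below (the eps value is illustrative, and all names come from the code above):

fgsm = FastGradientSignMethod(net, eps=0.5, loss_fn=net_loss)
advs = fgsm.batch_generate(test_images, true_labels, batch_size=32)
adv_predicts = np.argmax(model.predict(Tensor(advs)).asnumpy(), axis=1)
print(np.mean(np.equal(adv_predicts, true_labels)))  # accuracy on adversarial inputs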
  • [Activity Experience] Challenge the Advanced Tutorials: Activity 1
    Adversarial example generation
    1. Import the libraries needed for model training

import os
import numpy as np
from mindspore import Tensor, context, Model, load_checkpoint, load_param_into_net
import mindspore.nn as nn
import mindspore.ops as ops
from mindspore.common.initializer import Normal
from mindspore.train.callback import LossMonitor, ModelCheckpoint, CheckpointConfig
import mindspore.dataset as ds
import mindspore.dataset.transforms.c_transforms as C
import mindspore.dataset.vision.c_transforms as CV
from mindspore.dataset.vision import Inter
from mindspore import dtype as mstype

    2. Train a LeNet network that meets the accuracy bar

context.set_context(mode=context.GRAPH_MODE, device_target='CPU')

class LeNet5(nn.Cell):
    def __init__(self, num_class=10, num_channel=1):
        super(LeNet5, self).__init__()
        self.conv1 = nn.Conv2d(num_channel, 6, 5, pad_mode='valid')
        self.conv2 = nn.Conv2d(6, 16, 5, pad_mode='valid')
        self.fc1 = nn.Dense(16 * 5 * 5, 120, weight_init=Normal(0.02))
        self.fc2 = nn.Dense(120, 84, weight_init=Normal(0.02))
        self.fc3 = nn.Dense(84, num_class, weight_init=Normal(0.02))
        self.relu = nn.ReLU()
        self.max_pool2d = nn.MaxPool2d(kernel_size=2, stride=2)
        self.flatten = nn.Flatten()

    def construct(self, x):
        x = self.conv1(x)
        x = self.relu(x)
        x = self.max_pool2d(x)
        x = self.conv2(x)
        x = self.relu(x)
        x = self.max_pool2d(x)
        x = self.flatten(x)
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.relu(x)
        x = self.fc3(x)
        return x

net = LeNet5()

def create_dataset(data_path, batch_size=1, num_parallel_workers=1):
    # define the dataset
    mnist_ds = ds.MnistDataset(data_path)
    resize_height, resize_width = 32, 32
    rescale = 1.0 / 255.0
    shift = 0.0
    rescale_nml = 1 / 0.3081
    shift_nml = -1 * 0.1307 / 0.3081

    # define the map operations
    resize_op = CV.Resize((resize_height, resize_width), interpolation=Inter.LINEAR)
    rescale_nml_op = CV.Rescale(rescale_nml, shift_nml)
    rescale_op = CV.Rescale(rescale, shift)
    hwc2chw_op = CV.HWC2CHW()
    type_cast_op = C.TypeCast(mstype.int32)

    # apply the map operations to the dataset
    mnist_ds = mnist_ds.map(operations=type_cast_op, input_columns="label", num_parallel_workers=num_parallel_workers)
    mnist_ds = mnist_ds.map(operations=resize_op, input_columns="image", num_parallel_workers=num_parallel_workers)
    mnist_ds = mnist_ds.map(operations=rescale_op, input_columns="image", num_parallel_workers=num_parallel_workers)
    mnist_ds = mnist_ds.map(operations=rescale_nml_op, input_columns="image", num_parallel_workers=num_parallel_workers)
    mnist_ds = mnist_ds.map(operations=hwc2chw_op, input_columns="image", num_parallel_workers=num_parallel_workers)

    # shuffle and batch
    buffer_size = 10000
    mnist_ds = mnist_ds.shuffle(buffer_size=buffer_size)
    mnist_ds = mnist_ds.batch(batch_size, drop_remainder=True)
    return mnist_ds

net_loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')
net_opt = nn.Momentum(net.trainable_params(), learning_rate=0.01, momentum=0.9)
config_ck = CheckpointConfig(save_checkpoint_steps=1875, keep_checkpoint_max=10)
ckpoint = ModelCheckpoint(prefix="checkpoint_lenet", config=config_ck)

def test_net(model, data_path):
    ds_eval = create_dataset(os.path.join(data_path, "test"))
    acc = model.eval(ds_eval, dataset_sink_mode=False)
    print("{}".format(acc))

def train_net(model, epoch_size, data_path, ckpoint_cb, sink_mode):
    ds_train = create_dataset(os.path.join(data_path, "train"), 32)
    model.train(epoch_size, ds_train, callbacks=[ckpoint_cb, LossMonitor(125)], dataset_sink_mode=sink_mode)

train_epoch = 1
mnist_path = "C:/Users/DIVA_Score/Desktop/mnist_dataset"
model = Model(net, net_loss, net_opt, metrics={"Accuracy": nn.Accuracy()})
train_net(model, train_epoch, mnist_path, ckpoint, False)
'''
epoch: 1 step: 125, loss is 2.309873
epoch: 1 step: 250, loss is 2.30484
epoch: 1 step: 375, loss is 2.302457
epoch: 1 step: 500, loss is 2.2895803
epoch: 1 step: 625, loss is 2.306251
epoch: 1 step: 750, loss is 2.3196976
epoch: 1 step: 875, loss is 2.3023741
epoch: 1 step: 1000, loss is 2.3087487
epoch: 1 step: 1125, loss is 2.3092415
epoch: 1 step: 1250, loss is 0.821931
epoch: 1 step: 1375, loss is 0.22458856
epoch: 1 step: 1500, loss is 0.23729604
epoch: 1 step: 1625, loss is 0.2950006
epoch: 1 step: 1750, loss is 0.27920684
epoch: 1 step: 1875, loss is 0.34254763
'''
test_net(model, mnist_path)  # {'Accuracy': 0.9513}

param_dict = load_checkpoint("checkpoint_lenet-1_1875.ckpt")
load_param_into_net(net, param_dict)

    Implement FGSM

class WithLossCell(nn.Cell):
    """Wrap the network together with the loss function."""
    def __init__(self, network, loss_fn):
        super(WithLossCell, self).__init__()
        self._network = network
        self._loss_fn = loss_fn

    def construct(self, data, label):
        out = self._network(data)
        return self._loss_fn(out, label)

class GradWrapWithLoss(nn.Cell):
    """Compute the backward gradient through the loss."""
    def __init__(self, network):
        super(GradWrapWithLoss, self).__init__()
        self._grad_all = ops.composite.GradOperation(get_all=True, sens_param=False)
        self._network = network

    def construct(self, inputs, labels):
        gout = self._grad_all(self._network)(inputs, labels)
        return gout[0]

class FastGradientSignMethod:
    """Implement the FGSM attack."""
    def __init__(self, network, eps=0.07, loss_fn=None):
        # initialize attributes
        self._network = network
        self._eps = eps
        with_loss_cell = WithLossCell(self._network, loss_fn)
        self._grad_all = GradWrapWithLoss(with_loss_cell)
        self._grad_all.set_train()

    def _gradient(self, inputs, labels):
        # compute the gradient
        out_grad = self._grad_all(inputs, labels)
        gradient = out_grad.asnumpy()
        gradient = np.sign(gradient)
        return gradient

    def generate(self, inputs, labels):
        # run FGSM
        inputs_tensor = Tensor(inputs)
        labels_tensor = Tensor(labels)
        gradient = self._gradient(inputs_tensor, labels_tensor)
        # build the perturbation
        perturbation = self._eps * gradient
        # generate the perturbed images
        adv_x = inputs + perturbation
        return adv_x

    def batch_generate(self, inputs, labels, batch_size=32):
        # process the dataset in batches
        arr_x = inputs
        arr_y = labels
        len_x = len(inputs)
        batches = int(len_x / batch_size)
        rest = len_x - batches * batch_size
        res = []
        for i in range(batches):
            x_batch = arr_x[i * batch_size: (i + 1) * batch_size]
            y_batch = arr_y[i * batch_size: (i + 1) * batch_size]
            adv_x = self.generate(x_batch, y_batch)
            res.append(adv_x)
        adv_x = np.concatenate(res, axis=0)
        return adv_x

images = []
labels = []
test_images = []
test_labels = []
predict_labels = []
ds_test = create_dataset(os.path.join(mnist_path, "test"), batch_size=32).create_dict_iterator(output_numpy=True)
for data in ds_test:
    images = data['image'].astype(np.float32)
    labels = data['label']
    test_images.append(images)
    test_labels.append(labels)
    pred_labels = np.argmax(model.predict(Tensor(images)).asnumpy(), axis=1)
    predict_labels.append(pred_labels)

test_images = np.concatenate(test_images)
predict_labels = np.concatenate(predict_labels)
true_labels = np.concatenate(test_labels)

    Run the attack

fgsm = FastGradientSignMethod(net, eps=0.0, loss_fn=net_loss)
advs = fgsm.batch_generate(test_images, true_labels, batch_size=32)
adv_predicts = model.predict(Tensor(advs)).asnumpy()
adv_predicts = np.argmax(adv_predicts, axis=1)
accuracy = np.mean(np.equal(adv_predicts, true_labels))
print(accuracy)

# set eps to 0.7 and try the attack:
fgsm = FastGradientSignMethod(net, eps=0.7, loss_fn=net_loss)
advs = fgsm.batch_generate(test_images, true_labels, batch_size=32)
adv_predicts = model.predict(Tensor(advs)).asnumpy()
adv_predicts = np.argmax(adv_predicts, axis=1)
accuracy = np.mean(np.equal(adv_predicts, true_labels))
print(accuracy)  # 0.2675280448717949

import matplotlib.pyplot as plt
adv_examples = np.transpose(advs[:10], [0, 2, 3, 1])
ori_examples = np.transpose(test_images[:10], [0, 2, 3, 1])
plt.figure()
for i in range(10):
    plt.subplot(2, 10, i + 1)
    plt.imshow(ori_examples[i])
    plt.subplot(2, 10, i + 11)
    plt.imshow(adv_examples[i])

    Screenshots for attack coefficients 0, 0.2, 0.4, 0.5, 0.7, 0.8, and 1 follow.
    Email: diva_score@163.com
  • [Execution Issue] Multi-scale training
    [Module]
    [Steps & Symptom]
    1. Multi-scale training is fairly easy to implement in PyTorch.
    2. In MindSpore, the tensor shape generally has to be known in advance, i.e. under GRAPH mode a Tensor's shape is fixed.
    So, does a MindSpore model have to be trained multiple times (one scale per run) to achieve the multi-scale effect? For the same model, single-scale versus multi-scale training affects the final accuracy of the PyTorch version; will the MindSpore version be affected the same way?
    [Screenshots] [Logs] (optional; attach log content or files)
  • [Data Processing] [Urgent][mindspore] Get data timeout
    [Module] 8x V100 + mindspore1.3-gpu
    1. With randomly initialized weights, finetuning on a ~2 GB MindRecord dataset works: compilation takes about 30 minutes, then training starts normally.
    2. With a pretrained model loaded, finetuning on the same ~2 GB MindRecord dataset fails: compilation plus model loading takes over an hour, and when training starts it reports [dataset_iterator_kernel.cc:108] ReadDevice] Get data timeout. (After setting ds.config.set_callback_timeout to 2000 the same error is still raised; full_batch is used.)
    How can the data-loading timeout be extended? How should this problem be handled? Does model loading take so long that the callback's data read times out?
    [Steps & Symptom]

[ERROR] KERNEL(105835,7f098edb5700,python):2021-09-19-02:20:31.300.627 [mindspore/ccsrc/backend/kernel_compiler/gpu/data/dataset_iterator_kernel.cc:108] ReadDevice] Get data timeout
[WARNING] MD(105835,7f09c6402200,python):2021-09-19-02:20:32.346.905 [mindspore/ccsrc/minddata/dataset/engine/datasetops/device_queue_op.cc:73] ~DeviceQueueOp] preprocess_batch: 0; batch_queue: 0; push_start_time: ; push_end_time: .
Traceback (most recent call last):
  File "/home/XXX/scripts/../finetune_pangu_no_pipeline.py", line 441, in <module>
    run_train(opt)
  File "/home/XXXscripts/../finetune_pangu_no_pipeline.py", line 409, in run_train
    model.train(actual_epoch_num, ds, callbacks=callback, sink_size=callback_size, dataset_sink_mode=True)

    [Screenshots] [Logs] (optional; attach log content or files)
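    For reference, the dataset-config call the post says was tried looks like this (set_callback_timeout is an existing mindspore.dataset API; per the report it did not resolve the device-queue timeout in this case):

import mindspore.dataset as ds

# timeout for dataset callback waits, in seconds
ds.config.set_callback_timeout(2000)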
  • [Data Processing] [Urgent][MINDSPORE1.3] PANGU FINETUNE
    [Module] Environment: mindspore1.3-gpu. The requirement is to load a model on 8 cards and continue incremental training.
    1. Incremental training without loading the model works: after about half an hour of compilation, training starts.
    2. Incremental training with the model loaded fails: compilation plus loading takes over an hour, and the moment training starts it reports [dataset_iterator_kernel.cc:108] ReadDevice] Get data timeout.
    How can the data-loading timeout be extended?
    [Steps & Symptom]

[ERROR] KERNEL(105835,7f098edb5700,python):2021-09-19-02:20:31.300.627 [mindspore/ccsrc/backend/kernel_compiler/gpu/data/dataset_iterator_kernel.cc:108] ReadDevice] Get data timeout
[WARNING] MD(105835,7f09c6402200,python):2021-09-19-02:20:32.346.905 [mindspore/ccsrc/minddata/dataset/engine/datasetops/device_queue_op.cc:73] ~DeviceQueueOp] preprocess_batch: 0; batch_queue: 0; push_start_time: ; push_end_time: .
Traceback (most recent call last):
  File "/home/XXX/scripts/../finetune_pangu_no_pipeline.py", line 441, in <module>
    run_train(opt)
  File "/home/XXXscripts/../finetune_pangu_no_pipeline.py", line 409, in run_train
    model.train(actual_epoch_num, ds, callbacks=callback, sink_size=callback_size, dataset_sink_mode=True)

    [Screenshots] [Logs] (optional; attach log content or files)
  • [Operator Compilation] Reading pretrained weights into resnet50 after changing its parameters
    [Module] My model requires changing the shape of the initial conv1 from (64, 3, h, w) to (64, 15, h, w), while still reading ResNet-50's pretrained parameters, so I tried widening conv1 along the input-channel dimension, as below. The problem: the pretrained checkpoint's entries are all Parameter tensors, and after my widening operation the layer's weight became a plain Tensor, so the load_param_into_net function can no longer read it in and raises an error saying the value must be a Parameter. I don't know how to resolve this. [Steps & Symptom] 1. 2. [Screenshots] [Logs] (optional; attach log content or files)
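    A sketch of one possible fix, under assumptions (the checkpoint key name "conv1.weight", the path, and the 5x channel tiling are illustrative; net stands for the already-modified 15-channel network): widen the weight in NumPy, then wrap the result back into a Parameter before loading, since load_param_into_net accepts only Parameter values.

import numpy as np
import mindspore as ms
from mindspore import Parameter, Tensor
from mindspore.train.serialization import load_checkpoint, load_param_into_net

param_dict = load_checkpoint("resnet50.ckpt")      # path is illustrative
w = param_dict["conv1.weight"].asnumpy()           # shape (64, 3, h, w)
w15 = np.concatenate([w] * 5, axis=1)              # tile channels -> (64, 15, h, w)
# Wrap back into a Parameter: load_param_into_net rejects a plain Tensor.
param_dict["conv1.weight"] = Parameter(Tensor(w15, ms.float32), name="conv1.weight")
load_param_into_net(net, param_dict)               # net: the modified network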
  • [Activity Experience] "Challenge the Advanced Tutorials: Activity 1": notes from the adversarial example generation experiment
    Let's do "Adversarial Example Generation" this time. Opening https://www.mindspore.cn/tutorials/zh-CN/r1.3/intermediate/image_and_video/adversarial_example_generation.html, the tutorial appears to run on CPU and to support MindSpore 1.3, and Zhang Xiaobai had just set up exactly that environment while doing teacher Xiao Mi's homework: https://bbs.huaweicloud.com/forum/forum.php?mod=redirect&goto=findpost&ptid=154274&pid=1334908&fromuid=70062. So the plan is simple: download the notebook, copy it into the D:\ipynb directory, and run it in JupyterLab.
    First came the old familiar error, but verification still passed, so that hurdle seemed cleared (it later turned out it was not). Downloading the dataset then failed: wget did not exist, and neither did the target folder. Having no time to install wget, Zhang Xiaobai did two things: created the directories by hand, and opened each wget URL in the browser to download the four dataset files, then placed the downloaded files into the corresponding directories (training set and test set). For attack preparation, device_target was changed to CPU. Running the LeNet training scripts in order then reported that ds is undefined, presumably because some earlier module import had failed. It felt like the current MindSpore 1.3 would not work. An expert said today that MindSpore 1.4 fixes this bug, so the fix is to reinstall with 1.4. On the MindSpore release page https://mindspore.cn/versions, the 1.4.1 CPU download link is https://ms-release.obs.cn-north-4.myhuaweicloud.com/1.4.1/MindSpore/cpu/x86_64/mindspore-1.4.1-cp37-cp37m-win_amd64.whl, so just swap that file name into the MindSpore 1.3 install command:

pip install https://ms-release.obs.cn-north-4.myhuaweicloud.com/1.4.1/MindSpore/cpu/x86_64/mindspore-1.4.1-cp37-cp37m-win_amd64.whl --trusted-host ms-release.obs.cn-north-4.myhuaweicloud.com -i https://pypi.tuna.tsinghua.edu.cn/simple

    That appeared to fail and needed the --user flag added; with it, installation succeeded. Reopen JupyterLab with jupyter lab --no-browser, copy the link into the browser, and this time the MindSpore verification raised no error. Everything the expert said was right. Next, redo every step except the dataset download. Attack preparation again, then run the scripts in order up to training the LeNet network. This time one epoch trained successfully, with accuracy 0.9644, and the checkpoint file was generated. Following the document, run the scripts that implement the FGSM attack, then run the attack itself. The effect at eps=0 is shown in the screenshots; setting eps to 0.5 brings the accuracy down to 0.469250. To see the actual pictures, matplotlib is needed:

!pip install matplotlib -i https://pypi.tuna.tsinghua.edu.cn/simple

    Rerunning the script shows the generated attack images with noise injected, which completes the challenge homework. The modified ipynb file is attached at the end (anyone who wants to experiment with the scripts can rename ipynb.txt to .ipynb and drop it into the JupyterLab directory to run it).
    Email: zhanghui_china2020@163.com
    (The end. Thanks for reading.)
  • [Distributed] [MindSpore1.3] Distributed training error
    [Module] [Steps & Symptom] Two machines with one card each; distributed training fails. The device_ip configuration matches the actual setup. Details are in the attachment. [Screenshots] [Logs] (optional; attach log content or files)
  • [Other] R-CNN principles
    Analysis
    Traditional object detection methods are mostly built on top of image classification. In general, one can enumerate with exhaustive search all region boxes in which objects might appear, extract features from these boxes and classify them with an image recognition method, and, once all successfully classified regions are obtained, output the results through non-maximum suppression (NMS).
    R-CNN, short for Region-CNN, can be called the first algorithm to successfully apply deep learning to object detection. R-CNN follows the same traditional detection pipeline of four steps: extract candidate boxes, extract features from each box, classify, and apply non-maximum suppression. The difference is that in the feature extraction step, the traditional hand-crafted features are replaced by features extracted with a deep convolutional network (CNN).

    Basic steps
    1) For the original image, first use Selective Search to find regions that may contain objects. Selective Search proposes likely object regions heuristically, which saves some computation compared with exhaustive search.
    2) Feed the candidate regions into a CNN to extract features. A CNN usually accepts an image of fixed size, while the regions produced by Selective Search vary in size; R-CNN handles this by warping each region to a uniform size before extracting features with the CNN.
    3) Classify the extracted features with support vector machines (SVM), and finally output the results through non-maximum suppression (a runnable sketch of NMS follows this post).

    R-CNN training
    1) Train a CNN on a classification training set; the R-CNN paper uses AlexNet trained on ImageNet.
    2) Fine-tune the trained CNN on the object detection dataset.
    3) Use Selective Search to generate candidate regions, extract features for all of them with the fine-tuned CNN, and store the extracted features.
    4) Train SVM classifiers on the stored features.

    Computation cost
    R-CNN's weakness is its huge computation cost. Later researchers therefore proposed Fast R-CNN and Faster R-CNN, which mitigate this weakness to a degree: they are not only much faster but also more accurate. Before introducing Fast R-CNN and Faster R-CNN, SPPNet and its principle need to be introduced first.
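    Since non-maximum suppression closes both the traditional pipeline and R-CNN's, here is a short runnable sketch of the greedy NMS step (plain NumPy, boxes given as x1, y1, x2, y2; an illustration, not code from the original post):

import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression. boxes: (N, 4) array of x1, y1, x2, y2."""
    order = scores.argsort()[::-1]        # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]                      # highest-scoring remaining box
        keep.append(i)
        # intersection of box i with every other remaining box
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = ((boxes[order[1:], 2] - boxes[order[1:], 0]) *
                 (boxes[order[1:], 3] - boxes[order[1:], 1]))
        iou = inter / (area_i + areas - inter)
        # drop boxes that overlap box i too much; keep the rest for the next round
        order = order[1:][iou <= iou_thresh]
    return keep

# two near-duplicate boxes and one separate box -> keeps indices 0 and 2
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
print(nms(boxes, np.array([0.9, 0.8, 0.7])))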
  • [Distributed] [MindSpore] Multi-machine multi-card distributed training
    [Module] [Steps & Symptom]
    1. For multi-machine multi-card distributed parallel training, are only clusters of 8*n cards supported?
    2. If one machine has 2 cards and another machine has 1 card, is distributed training supported in that configuration?
    [Screenshots] [Logs] (optional; attach log content or files)