• 风行的作品
  • 使用强化学习AlphaZero算法训练中国象棋AI
    使用强化学习AlphaZero算法训练中国象棋AI案例目标通过本案例的学习和课后作业的练习:了解强化学习AlphaZero算法;利用AlphaZero算法进行一次中国象棋AI训练;你也可以将本案例相关的 ipynb 学习笔记分享到 AI Gallery Notebook 版块获得成长值,分享方法请查看此文档。案例内容介绍AlphaZero是一种强化学习算法,近期利用AlphaZero训练出的AI以绝对的优势战胜了多名围棋以及国际象棋冠军。AlphaZero创新点在于,它能够在不依赖于外部先验知识(也称专家知识),仅仅了解游戏规则的情况下,在棋盘类游戏中获得超越人类的表现。本次案例将详细的介绍AlphaZero算法核心原理,包括神经网络构建、MCTS搜索、自博弈训练,以代码的形式加深对算法的理解,算法详情亦可见论文《Mastering the game of Go without human knowledge》。同时本案例提供中国象棋强化学习环境,利用AlphaZero进行一次中国象棋训练,最后可视化象棋AI自博弈对局。由于训练一个强力的中国象棋AI需要大量的训练时间和资源,本案例偏重于算法理解,在运行过程中简化了训练过程,减少了自博弈次数和搜索次数。如果想要完整地训练一个中国象棋AlphaZero AI,可在AI Gallery中订阅《CChess中国象棋》算法,并在ModelArts中进行训练。注意事项本案例运行环境为 TensorFlow-1.13.1,且建议使用 GPU 运行,请查看《ModelAtrs JupyterLab 硬件规格使用指南》了解切换硬件规格的方法;如果您是第一次使用 JupyterLab,请查看《ModelAtrs JupyterLab使用指导》了解使用方法;如果您在使用 JupyterLab 过程中碰到报错,请参考《ModelAtrs JupyterLab常见问题解决办法》尝试解决问题;请逐步运行下面的每一个代码块;实验步骤程序初始化构建神经网络实现MCTS实现自博弈过程进行训练参数配置开始自博弈训练模型更新可视化对局1. 程序初始化第1步:安装基础依赖要确保所有依赖都安装成功后,再执行之后的代码。如果某些模块因为网络原因导致安装失败,直接重试一次即可。!pip install tornado==6.1.0!pip install tflearn==0.3.2!pip install tqdm!pip install urllib3==1.22!pip install threadpool==1.3.2!pip install xmltodict==0.12.0!pip install requests!pip install pandas==0.19.2!pip install numpy==1.14.5!pip install scipy==1.1.0!pip install matplotlib==2.0.0!pip install nest_asyncio!pip install gast==0.2.2第2步: 下载依赖包import osimport moxing as moxif not os.path.exists('cchess_training'): mox.file.copy("obs://modelarts-labs-bj4/course/modelarts/reinforcement_learning/cchess_gameplay/cchess_training/cchess_training.zip", "cchess_training.zip") os.system('unzip cchess_training.zip')第3步:导入相关的库%matplotlib notebook%matplotlib autoimport osimport sysimport loggingimport subprocessimport copyimport randomimport jsonimport asyncioimport timeimport numpy as npimport tensorflow as tffrom multiprocessing import Processfrom cchess_training.cchess_zero import board_visualizerfrom cchess_training.gameplays import players, gameplayfrom cchess_training.config import conffrom cchess_training.common.board import create_uci_labelsfrom cchess_training.cchess_training_model_update import model_updatefrom cchess_training.cchess_zero.gameboard import GameBoardfrom cchess_training.cchess_zero import cbffrom cchess_training.utils import get_latest_weight_pathimport nest_asyncionest_asyncio.apply()os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)logging.basicConfig(level=logging.INFO, format="[%(asctime)s] [%(levelname)s] [%(message)s]", datefmt='%Y-%m-%d %H:%M:%S' )2.构建神经网络这里基于Resnet实现了AlphaZero中的神经网络,神经网络输入为当前象棋棋面转化得到的0-1图,大小为[10, 9, 14],[10, 9]表示象棋棋盘大小,[14]每一个plane对应一类棋子,我方7类(兵、炮、车、马、相、仕、将),敌方7类,共14个plane。经过Resnet提取特征后分为两个分支,一个是价值分支,输出当前棋面价值,另一个是策略头,输出神经网络计算得到的动作对应概率。# resnetdef res_block(inputx, name, training, block_num=2, filters=256, kernel_size=(3, 3)): net = inputx for i in range(block_num): net = tf.layers.conv2d( net, filters=filters, kernel_size=kernel_size, activation=None, name="{}_res_conv{}".format(name, i), padding='same' ) net = tf.layers.batch_normalization(net, training=training, name="{}_res_bn{}".format(name, i)) if i == block_num - 1: net = net + inputx net = tf.nn.elu(net, name="{}_res_elu{}".format(name, i)) return netdef conv_block(inputx, name, training, block_num=1, filters=2, kernel_size=(1, 1)): net = inputx for i in range(block_num): net = tf.layers.conv2d( net, filters=filters, kernel_size=kernel_size, activation=None, name="{}_convblock_conv{}".format(name, i), padding='same' ) net = tf.layers.batch_normalization(net, training=training, 
name="{}_convblock_bn{}".format(name, i)) net = tf.nn.elu(net, name="{}_convblock_elu{}".format(name, i)) # net shape [None,10,9,2] netshape = net.get_shape().as_list() net = tf.reshape(net, shape=(-1, netshape[1] * netshape[2] * netshape[3])) net = tf.layers.dense(net, 10 * 9, name="{}_dense".format(name)) net = tf.nn.elu(net, name="{}_elu".format(name)) return netdef res_net_board(inputx, name, training, filters=256, num_res_layers=4): net = inputx net = tf.layers.conv2d( net, filters=filters, kernel_size=(3, 3), activation=None, name="{}_res_convb".format(name), padding='same' ) net = tf.layers.batch_normalization(net, training=training, name="{}_res_bnb".format(name)) net = tf.nn.elu(net, name="{}_res_elub".format(name)) for i in range(num_res_layers): net = res_block(net, name="{}_layer_{}".format(name, i + 1), training=training, filters=filters) return netdef get_scatter(name): with tf.variable_scope("Test"): ph = tf.placeholder(tf.float32, name=name) op = tf.summary.scalar(name, ph) return ph, opdef average_gradients(tower_grads): """Calculate the average gradient for each shared variable across all towers. Note that this function provides a synchronization point across all towers. Args: tower_grads: List of lists of (gradient, variable) tuples. The outer list is over individual gradients. The inner list is over the gradient calculation for each tower. Returns: List of pairs of (gradient, variable) where the gradient has been averaged across all towers. """ average_grads = [] for grad_and_vars in zip(*tower_grads): # Note that each grad_and_vars looks like the following: # ((grad0_gpu0, var0_gpu0), ... , (grad0_gpuN, var0_gpuN)) grads = [] for g, _ in grad_and_vars: # Add 0 dimension to the gradients to represent the tower. expanded_g = tf.expand_dims(g, 0) # Append on a 'tower' dimension which we will average over below. grads.append(expanded_g) # Average over the 'tower' dimension. grad = tf.concat(grads, 0) grad = tf.reduce_mean(grad, 0) # Keep in mind that the Variables are redundant because they are shared # across towers. So .. we will just return the first tower's pointer to # the Variable. 
v = grad_and_vars[0][1] grad_and_var = (grad, v) average_grads.append(grad_and_var) return average_gradsdef add_grad_to_list(opt, train_param, loss, tower_grad): grads = opt.compute_gradients(loss, var_list=train_param) grads = [i[0] for i in grads] tower_grad.append(zip(grads, train_param))def get_op_mul(tower_gradients, optimizer, gs): grads = average_gradients(tower_gradients) train_op = optimizer.apply_gradients(grads, gs) return train_opdef reduce_mean(x): return tf.reduce_mean(x)def merge(x): return tf.concat(x, axis=0)def get_model_resnet( model_name, labels, gpu_core=[0], batch_size=512, num_res_layers=4, filters=256, extra=False, extrav2=False): tf.reset_default_graph() graph = tf.Graph() with graph.as_default(): x_input = tf.placeholder(tf.float32, [None, 10, 9, 14]) nextmove = tf.placeholder(tf.float32, [None, len(labels)]) score = tf.placeholder(tf.float32, [None, 1]) training = tf.placeholder(tf.bool, name='training_mode') learning_rate = tf.placeholder(tf.float32) global_step = tf.train.get_or_create_global_step() optimizer_policy = tf.train.MomentumOptimizer(learning_rate=learning_rate, momentum=0.9) optimizer_value = tf.train.MomentumOptimizer(learning_rate=learning_rate, momentum=0.9) optimizer_multitarg = tf.train.MomentumOptimizer(learning_rate=learning_rate, momentum=0.9) tower_gradients_policy, tower_gradients_value, tower_gradients_multitarg = [], [], [] net_softmax_collection = [] value_head_collection = [] multitarget_loss_collection = [] value_loss_collection = [] policy_loss_collection = [] accuracy_select_collection = [] with tf.variable_scope(tf.get_variable_scope()) as vscope: for ind, one_core in enumerate(gpu_core): if one_core is not None: devicestr = "/gpu:{}".format(one_core) if one_core is not None else "" else: devicestr = '/cpu:0' with tf.device(devicestr): body = res_net_board( x_input[ind * (batch_size // len(gpu_core)):(ind + 1) * (batch_size // len(gpu_core))], "selectnet", training=training, filters=filters, num_res_layers=num_res_layers ) with tf.variable_scope("policy_head"): policy_head = tf.layers.conv2d(body, 2, 1, padding='SAME') policy_head = tf.contrib.layers.batch_norm( policy_head, center=False, epsilon=1e-5, fused=True, is_training=training, activation_fn=tf.nn.relu ) policy_head = tf.reshape(policy_head, [-1, 9 * 10 * 2]) policy_head = tf.contrib.layers.fully_connected(policy_head, len(labels), activation_fn=None) # 价值头 with tf.variable_scope("value_head"): value_head = tf.layers.conv2d(body, 1, 1, padding='SAME') value_head = tf.contrib.layers.batch_norm( value_head, center=False, epsilon=1e-5, fused=True, is_training=training, activation_fn=tf.nn.relu ) value_head = tf.reshape(value_head, [-1, 9 * 10 * 1]) value_head = tf.contrib.layers.fully_connected(value_head, 256, activation_fn=tf.nn.relu) value_head = tf.contrib.layers.fully_connected(value_head, 1, activation_fn=tf.nn.tanh) value_head_collection.append(value_head) net_unsoftmax = policy_head with tf.variable_scope("Loss"): policy_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits( labels=nextmove[ind * (batch_size // len(gpu_core)): (ind + 1) * (batch_size // len(gpu_core))], logits=net_unsoftmax)) value_loss = tf.losses.mean_squared_error( labels=score[ind * (batch_size // len(gpu_core)):(ind + 1) * (batch_size // len(gpu_core))], predictions=value_head) value_loss = tf.reduce_mean(value_loss) regularizer = tf.contrib.layers.l2_regularizer(scale=1e-5) regular_variables = tf.trainable_variables() l2_loss = tf.contrib.layers.apply_regularization(regularizer, 
regular_variables) multitarget_loss = value_loss + policy_loss + l2_loss multitarget_loss_collection.append(multitarget_loss) value_loss_collection.append(value_loss) policy_loss_collection.append(policy_loss) net_softmax = tf.nn.softmax(net_unsoftmax) net_softmax_collection.append(net_softmax) correct_prediction = tf.equal(tf.argmax(nextmove, 1), tf.argmax(net_softmax, 1)) with tf.variable_scope("Accuracy"): accuracy_select = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) accuracy_select_collection.append(accuracy_select) tf.get_variable_scope().reuse_variables() trainable_params = tf.trainable_variables() tp_policy = [i for i in trainable_params if ('value_head' not in i.name)] tp_value = [i for i in trainable_params if ('policy_head' not in i.name)] add_grad_to_list(optimizer_policy, tp_policy, policy_loss, tower_gradients_policy) add_grad_to_list(optimizer_value, tp_value, value_loss, tower_gradients_value) add_grad_to_list(optimizer_multitarg, trainable_params, multitarget_loss, tower_gradients_multitarg) update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS) with tf.control_dependencies(update_ops): train_op_policy = get_op_mul(tower_gradients_policy, optimizer_policy, global_step) train_op_value = get_op_mul(tower_gradients_value, optimizer_value, global_step) train_op_multitarg = get_op_mul(tower_gradients_multitarg, optimizer_multitarg, global_step) net_softmax = merge(net_softmax_collection) value_head = merge(value_head_collection) multitarget_loss = reduce_mean(multitarget_loss_collection) value_loss = reduce_mean(value_loss_collection) policy_loss = reduce_mean(policy_loss_collection) accuracy_select = reduce_mean(accuracy_select_collection) with graph.as_default(): config = tf.ConfigProto() config.gpu_options.allow_growth = True config.allow_soft_placement = True sess = tf.Session(config=config) if model_name is not None: with graph.as_default(): saver = tf.train.Saver(var_list=tf.global_variables()) saver.restore(sess, model_name) else: with graph.as_default(): sess.run(tf.global_variables_initializer()) return (sess, graph), ((x_input, training), (net_softmax, value_head))3.实现MCTSAlphaZero利用MCTS来自博弈生成棋局,MCTS搜索原理简述如下:每次模拟通过选择具有最大行动价值Q的边加上取决于所存储的先验概率P和该边的访问计数N(每次访问都被增加一次)的上限置信区间U来遍历树;展开叶子节点,通过神经网络来评估局面s,向量P的值存储在叶子结点扩展的边上;更新行动价值Q等于在该行动下的子树中的所有评估值V的均值;一旦MCTS搜索完成,返回局面s下的落子概率π。def softmax(x): probs = np.exp(x - np.max(x)) probs /= np.sum(probs) return probsclass TreeNode(object): """A node in the MCTS tree. Each node keeps track of its own value Q, prior probability P, and its visit-count-adjusted prior score u. """ def __init__(self, parent, prior_p, noise=False): self._parent = parent self._children = {} # a map from action to TreeNode self._n_visits = 0 self._Q = 0 self._u = 0 self._P = prior_p self.virtual_loss = 0 self.noise = noise def expand(self, action_priors): """Expand tree by creating new children. action_priors: a list of tuples of actions and their prior probability according to the policy function. 
""" # dirichlet noise should be applied when every select action if False and self.noise is True and self._parent is None: noise_d = np.random.dirichlet([0.3] * len(action_priors)) for (action, prob), one_noise in zip(action_priors, noise_d): if action not in self._children: prob = (1 - 0.25) * prob + 0.25 * one_noise self._children[action] = TreeNode(self, prob, noise=self.noise) else: for action, prob in action_priors: if action not in self._children: self._children[action] = TreeNode(self, prob) def select(self, c_puct): """Select action among children that gives maximum action value Q plus bonus u(P). Return: A tuple of (action, next_node) """ if self.noise is False: return max(self._children.items(), key=lambda act_node: act_node[1].get_value(c_puct)) elif self.noise is True and self._parent is not None: return max(self._children.items(), key=lambda act_node: act_node[1].get_value(c_puct)) else: noise_d = np.random.dirichlet([0.3] * len(self._children)) return max(list(zip(noise_d, self._children.items())), key=lambda act_node: act_node[1][1].get_value(c_puct, noise_p=act_node[0]))[1] def update(self, leaf_value): """Update node values from leaf evaluation. leaf_value: the value of subtree evaluation from the current player's perspective. """ # Count visit. self._n_visits += 1 # Update Q, a running average of values for all visits. self._Q += 1.0 * (leaf_value - self._Q) / self._n_visits def update_recursive(self, leaf_value): """Like a call to update(), but applied recursively for all ancestors. """ # If it is not root, this node's parent should be updated first. if self._parent: self._parent.update_recursive(-leaf_value) self.update(leaf_value) def get_value(self, c_puct, noise_p=None): """Calculate and return the value for this node. It is a combination of leaf evaluations Q, and this node's prior adjusted for its visit count, u. c_puct: a number in (0, inf) controlling the relative impact of value Q, and prior probability P, on this node's score. """ if noise_p is None: self._u = (c_puct * self._P * np.sqrt(self._parent._n_visits) / (1 + self._n_visits)) return self._Q + self._u + self.virtual_loss else: self._u = (c_puct * (self._P * 0.75 + noise_p * 0.25) * np.sqrt(self._parent._n_visits) / (1 + self._n_visits)) return self._Q + self._u + self.virtual_loss def is_leaf(self): """Check if leaf node (i.e. no nodes below this have been expanded).""" return self._children == {} def is_root(self): return self._parent is Noneclass MCTS(object): """An implementation of Monte Carlo Tree Search.""" def __init__( self, policy_value_fn, c_puct=5, n_playout=10000, search_threads=32, virtual_loss=3, policy_loop_arg=False, dnoise=False, play=False ): """ policy_value_fn: a function that takes in a board state and outputs a list of (action, probability) tuples and also a score in [-1, 1] (i.e. the expected value of the end game score from the current player's perspective) for the current player. c_puct: a number in (0, inf) that controls how quickly exploration converges to the maximum-value policy. A higher value means relying on the prior more. 
""" self._root = TreeNode(None, 1.0, noise=dnoise) self._policy = policy_value_fn self._c_puct = c_puct self._n_playout = n_playout self.virtual_loss = virtual_loss self.loop = asyncio.get_event_loop() self.policy_loop_arg = policy_loop_arg self.sem = asyncio.Semaphore(search_threads) self.now_expanding = set() self.select_time = 0 self.policy_time = 0 self.update_time = 0 self.num_proceed = 0 self.dnoise = dnoise self.play = play async def _playout(self, state): """Run a single playout from the root to the leaf, getting a value at the leaf and propagating it back through its parents. State is modified in-place, so a copy must be provided. """ with await self.sem: node = self._root road = [] while 1: while node in self.now_expanding: await asyncio.sleep(1e-4) start = time.time() if node.is_leaf(): break # Greedily select next move. action, node = node.select(self._c_puct) road.append(node) node.virtual_loss -= self.virtual_loss state.do_move(action) self.select_time += (time.time() - start) # at leave node if long check or long catch then cut off the node if state.should_cutoff() and not self.play: # cut off node for one_node in road: one_node.virtual_loss += self.virtual_loss # now at this time, we do not update the entire tree branch, the accuracy loss is supposed to be small # set virtual loss to -inf so that other threads would not # visit the same node again(so the node is cut off) node.virtual_loss = - np.inf self.update_time += (time.time() - start) # however the proceed number still goes up 1 self.num_proceed += 1 return start = time.time() self.now_expanding.add(node) # Evaluate the leaf using a network which outputs a list of # (action, probability) tuples p and also a score v in [-1, 1] # for the current player if self.policy_loop_arg is False: action_probs, leaf_value = await self._policy(state) else: action_probs, leaf_value = await self._policy(state, self.loop) self.policy_time += (time.time() - start) start = time.time() # Check for end of game. end, winner = state.game_end() if not end: node.expand(action_probs) else: # for end state,return the "true" leaf_value if winner == -1: # tie leaf_value = 0.0 else: leaf_value = ( 1.0 if winner == state.get_current_player() else -1.0 ) # Update value and visit count of nodes in this traversal. for one_node in road: one_node.virtual_loss += self.virtual_loss node.update_recursive(-leaf_value) self.now_expanding.remove(node) # node.update_recursive(leaf_value) self.update_time += (time.time() - start) self.num_proceed += 1 def get_move_probs(self, state, temp=1e-3, predict_workers=[], can_apply_dnoise=False, verbose=False, infer_mode=False): """Run all playouts sequentially and return the available actions and their corresponding probabilities. 
state: the current game state temp: temperature parameter in (0, 1] controls the level of exploration """ if can_apply_dnoise is False: self._root.noise = False coroutine_list = [] for n in range(self._n_playout): state_copy = copy.deepcopy(state) coroutine_list.append(self._playout(state_copy)) coroutine_list += predict_workers self.loop.run_until_complete(asyncio.gather(*coroutine_list)) # calc the move probabilities based on visit counts at the root node act_visits = [(act, node._n_visits) for act, node in self._root._children.items()] acts, visits = zip(*act_visits) act_probs = softmax(1.0 / temp * np.log(np.array(visits) + 1e-10)) if infer_mode: info = [(act, node._n_visits, node._Q, node._P) for act, node in self._root._children.items()] if infer_mode: return acts, act_probs, info else: return acts, act_probs def update_with_move(self, last_move, allow_legacy=True): """Step forward in the tree, keeping everything we already know about the subtree. """ self.num_proceed = 0 if last_move in self._root._children and allow_legacy: self._root = self._root._children[last_move] self._root._parent = None else: self._root = TreeNode(None, 1.0, noise=self.dnoise) def __str__(self): return "MCTS"4.实现自博弈过程实现自博弈训练,基于同一个神经网络初始化对弈双方棋手,对弈过程中双方棋手每下一步前均采用MCTS搜索最优下子策略,每次自博弈一局结束后保存棋局。# Self-playclass Game(object): def __init__(self, white, black, verbose=True): self.white = white self.black = black self.verbose = verbose self.gamestate = gameplay.GameState() def play_till_end(self): winner = 'peace' moves = [] peace_round = 0 remain_piece = gameplay.countpiece(self.gamestate.statestr) while True: start_time = time.time() if self.gamestate.move_number % 2 == 0: player_name = 'w' player = self.white else: player_name = 'b' player = self.black move, score = player.make_move(self.gamestate) if move is None: winner = 'b' if player_name == 'w' else 'w' break moves.append(move) total_time = time.time() - start_time logging.info('move {} {} play {} use {:.2f}s'.format( self.gamestate.move_number, player_name, move, total_time,)) game_end, winner_p = self.gamestate.game_end() if game_end: winner = winner_p break remain_piece_round = gameplay.countpiece(self.gamestate.statestr) if remain_piece_round < remain_piece: remain_piece = remain_piece_round peace_round = 0 else: peace_round += 1 if peace_round > conf.non_cap_draw_round: winner = 'peace' break return winner, movesclass NetworkPlayGame(Game): def __init__(self, network_w, network_b, **xargs): whiteplayer = players.NetworkPlayer('w', network_w, **xargs) blackplayer = players.NetworkPlayer('b', network_b, **xargs) super(NetworkPlayGame, self).__init__(whiteplayer, blackplayer)class ContinousNetworkPlayGames(object): def __init__( self, network_w=None, network_b=None, white_name='net', black_name='net', random_switch=True, recoard_game=True, recoard_dir='data/distributed/', play_times=np.inf, distributed_dir='data/prepare_weight', **xargs ): self.network_w = network_w self.network_b = network_b self.white_name = white_name self.black_name = black_name self.random_switch = random_switch self.play_times = play_times self.recoard_game = recoard_game self.recoard_dir = recoard_dir self.xargs = xargs # self.distributed_server = distributed_server self.distributed_dir = distributed_dir def begin_of_game(self): pass def end_of_game(self, cbf_name, moves, cbfile, training_dt, epoch): pass def play(self, data_url=None, epoch=0): num = 0 while num < self.play_times: time_one_game_start = time.time() num += 1 self.begin_of_game(epoch) if self.random_switch and 
random.random() < 0.5: self.network_w, self.network_b = self.network_b, self.network_w self.white_name, self.black_name = self.black_name, self.white_name network_play_game = NetworkPlayGame(self.network_w, self.network_b, **self.xargs) winner, moves = network_play_game.play_till_end() stamp = time.strftime('%Y-%m-%d_%H-%M-%S', time.localtime(time.time())) date = time.strftime('%Y-%m-%d', time.localtime(time.time())) cbfile = cbf.CBF( black=self.black_name, red=self.white_name, date=date, site='北京', name='noname', datemodify=date, redteam=self.white_name, blackteam=self.black_name, round='第一轮' ) cbfile.receive_moves(moves) randstamp = random.randint(0, 1000) cbffilename = '{}_{}_mcts-mcts_{}-{}_{}.cbf'.format( stamp, randstamp, self.white_name, self.black_name, winner) if not os.path.exists(self.recoard_dir): os.makedirs(self.recoard_dir) cbf_name = os.path.join(self.recoard_dir, cbffilename) cbfile.dump(cbf_name) training_dt = time.time() - time_one_game_start self.end_of_game(cbffilename, moves, cbfile, training_dt, epoch)class DistributedSelfPlayGames(ContinousNetworkPlayGames): def __init__(self, gpu_num=0, auto_update=True, mode='train', **kwargs): self.gpu_num = gpu_num self.auto_update = auto_update self.model_name_in_use = None # for tracking latest weight self.mode = mode super(DistributedSelfPlayGames, self).__init__(**kwargs) def begin_of_game(self, epoch): """ when self playing, init network player using the latest weights """ if not self.auto_update: return latest_model_name = get_latest_weight_path() logging.info('------------------ restoring model {}'.format(latest_model_name)) model_path = os.path.join(self.distributed_dir, latest_model_name) if self.network_w is None or self.network_b is None: network = get_model_resnet( model_path, create_uci_labels(), gpu_core=[self.gpu_num], filters=conf.network_filters, num_res_layers=conf.network_layers ) self.network_w = network self.network_b = network self.model_name_in_use = model_path else: if model_path != self.model_name_in_use: (sess, graph), ((X, training), (net_softmax, value_head)) = self.network_w with graph.as_default(): saver = tf.train.Saver(var_list=tf.global_variables()) saver.restore(sess, model_path) self.model_name_in_use = model_path def end_of_game(self, cbf_name, moves, cbfile, training_dt, epoch): played_games = len(os.listdir(conf.distributed_datadir)) if self.mode == 'train': logging.info('------------------ epoch {}: trained {} games, this game used {}s'.format( epoch, played_games, round(training_dt, 6), )) else: logging.info('------------------ infer {} games, this game used {}s'.format( played_games, round(training_dt, 6), )) def self_play_gpu(gpu_num=0, play_times=np.inf, mode='train', n_playout=50, save_dir=conf.distributed_datadir): logging.info('------------------ self play start') cn = DistributedSelfPlayGames( gpu_num=gpu_num, n_playout=n_playout, recoard_dir=save_dir, c_puct=conf.c_puct, distributed_dir=conf.distributed_server_weight_dir, dnoise=True, is_selfplay=True, play_times=play_times, mode=mode, ) cn.play(epoch=0) logging.info('------------------ self play done') 5.进行训练参数配置配置一次训练过程中自博弈次数、训练结束后采用训练出的模型进行推理局数、训练batch_size。为简化训练过程参数均较小。config = { "n_playout": 100, # MCTS搜索次数,推荐(10-1600),数字越小程序运行越快,数字越大算法搜索准确度越高 "self_play_games": 2, # 自博弈对局数, 推荐(5-10000),注意数字较大时可能会超过资源免费体验时长 "infer_games": 1, # 推理对局数 "gpu_num": 0, # 使用的GPU卡号}6.开始自博弈训练,结束后更新模型运行过程中会打印出双方下棋动作self_play_gpu(config['gpu_num'], config['self_play_games'], mode='train', n_playout=config['n_playout'])# model 
updatemodel_update(gpu_num=config['gpu_num'])7.可视化对局(等待第六步运行结束后再运行此步)在此将加载模型进行博弈一次,可视化对局过程,最后显示对弈结束时的棋面self_play_gpu(config['gpu_num'], config['infer_games'], mode='infer', n_playout=config['n_playout'], save_dir='./infer_res')加载对局文件显示双方所有动作,动作为棋盘上起点坐标至终点坐标,具体坐标定义见后面的棋盘。%reload_ext autoreload%autoreload 2%matplotlib inlinefrom matplotlib import pyplot as pltfrom cchess_training.cchess_zero.gameboard import *from PIL import Imageimport imageiogameplay_path = './infer_res'while not os.path.exists(gameplay_path) or len(os.listdir(gameplay_path)) == 0: time.sleep(5) logging.info('第6步未运行结束,建议停止运行,重新逐步运行')gameplays = os.listdir(gameplay_path)fullpath = '{}/{}'.format(gameplay_path, random.choice(gameplays))moves = cbf.cbf2move(fullpath)fname = fullpath.split('/')[-1]print(moves)['b2e2', 'h7h5', 'b0c2', 'b7e7', 'h0g2', 'h5i5', 'h2i2', 'a9a7', 'i0i1', 'i5g5', 'c2e1', 'h9i7', 'c3c4', 'e6e5', 'a0b0', 'e7g7', 'g3g4', 'c6c5', 'i1h1', 'i7g8', 'e2e5', 'i9i8', 'g2f4', 'b9c7', 'g4g5', 'c7b9', 'f4e6', 'a7e7', 'e6g7', 'e7e5', 'b0b9', 'c5c4', 'i2e2', 'c4d4', 'e2e5', 'i8i7', 'b9c9', 'i7g7', 'h1h8', 'i6i5', 'a3a4', 'g7a7', 'i3i4', 'i5i4', 'g5g6', 'i4i3', 'h8g8', 'i3h3', 'g8f8', 'a7a8', 'f8f9', 'e9f9', 'g6f6', 'a6a5', 'a4a5', 'd4e4', 'e3e4', 'g9e7', 'g0e2', 'a8g8', 'c9d9', 'g8g1']可视化对弈过程import cv2from IPython.display import clear_output, Image, displaystate = gameplay.GameState()statestr = 'RNBAKABNR/9/1C5C1/P1P1P1P1P/9/9/p1p1p1p1p/1c5c1/9/rnbakabnr'for move in moves: clear_output(wait=True) statestr = GameBoard.sim_do_action(move, statestr) img = board_visualizer.get_board_img(statestr) img_show = cv2.cvtColor(img, cv2.COLOR_RGBA2BGR) display(Image(data=cv2.imencode('.jpg', img_show)[1])) time.sleep(0.5)显示终局棋面plt.figure(figsize=(8,8))plt.imshow(board_visualizer.get_board_img(statestr))至此,本案例结束,如果想要完整地训练一个中国象棋AlphaZero AI,可在AI Gallery中订阅《CChess中国象棋》算法,并在ModelArts中进行训练。8. 作业请你调整步骤5中的训练参数,重新训练一个模型,使它在游戏中获得更好的表现
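针对上面的作业,下面给出一个示意性的调参示例(其中的具体取值只是假设的参考值,并非标准答案,请结合可用算力与免费体验时长自行调整;假设第1~4步中的代码已按顺序运行):
config = {
    "n_playout": 400,       # 示例取值:增大MCTS搜索次数(原为100),搜索更准确但每步耗时更长
    "self_play_games": 10,  # 示例取值:增加自博弈对局数(原为2),生成更多训练数据
    "infer_games": 1,       # 推理对局数保持不变
    "gpu_num": 0,           # 使用的GPU卡号
}
self_play_gpu(config['gpu_num'], config['self_play_games'], mode='train', n_playout=config['n_playout'])
model_update(gpu_num=config['gpu_num'])
一般而言,调大 n_playout 与 self_play_games 能带来更强的棋力,但训练时间会随之近似线性增长,请留意运行时长。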
  • [技术干货] 使用强化学习AlphaZero算法训练五子棋AI
    案例目标通过本案例的学习和课后作业的练习:了解强化学习AlphaZero算法利用AlphaZero算法进行一次五子棋AI训练你也可以将本案例相关的 ipynb 学习笔记分享到 AI Gallery Notebook 版块获得成长值,分享方法请查看此文档。案例内容介绍AlphaZero是一种强化学习算法,近期利用AlphaZero训练出的AI以绝对的优势战胜了多名围棋以及国际象棋冠军。AlphaZero创新点在于,它能够在不依赖于外部先验知识即专家知识、仅仅了解游戏规则的情况下,在棋盘类游戏中获得超越人类的表现。本次案例将详细的介绍AlphaZero算法核心原理,包括神经网络构建、MCTS搜索、自博弈训练,以代码的形式加深算法理解,算法详情亦可见论文《Mastering the game of Go without human knowledge》。同时本案例提供五子棋强化学习环境,利用AlphaZero进行一次五子棋训练。最后可视化五子棋AI自博弈对局。由于在标准棋盘下训练一个强力的五子棋AI需要大量的训练时间和资源,本案例将棋盘缩小到了6x6x4,且在运行过程中简化了训练过程,减少了自博弈次数和搜索次数。如果想要完整地训练一个五子棋AlphaZero AI,可在AI Gallery中订阅《Gomoku-训练五子棋小游戏》算法并在ModelArts中进行训练。源码参考GitHub开源项目AlphaZero_Gomoku注意事项本案例运行环境为 Pytorch-1.0.0,且需使用 GPU 运行,请查看《ModelAtrs JupyterLab 硬件规格使用指南》了解切换硬件规格的方法;如果您是第一次使用 JupyterLab,请查看《ModelAtrs JupyterLab使用指导》了解使用方法;如果您在使用 JupyterLab 过程中碰到报错,请参考《ModelAtrs JupyterLab常见问题解决办法》尝试解决问题。建议逐步运行实验目录程序初始化进行训练参数配置构建环境构建神经网络实现MCTS实现自博弈过程训练主函数开始自博弈训练,并保存模型AI对战1.程序初始化第1步:安装基础依赖要确保所有依赖都安装成功后,再执行之后的代码。如果某些模块因为网络原因导致安装失败,直接重试一次即可。!pip install gym第2.进行训练参数配置为简化训练过程,涉及到影响训练时长的参数都设置的较小,且棋盘大小也减小为6x6,棋子连线降低为4。步:导入相关的库import osimport copyimport randomimport timefrom operator import itemgetterfrom collections import defaultdict, dequeimport numpy as npimport torchimport torch.nn as nnimport torch.nn.functional as Ffrom torch.autograd import Variableimport torch.optim as optimimport gymfrom gym.spaces import Box, Discreteimport matplotlib.pyplot as pltfrom IPython import display2.进行训练参数配置为简化训练过程,涉及到影响训练时长的参数都设置的较小,且棋盘大小也减小为6x6,棋子连线降低为4。board_width = 6 # 棋盘宽board_height = 6 # 棋盘高n_in_row = 4 # 胜利需要连成线棋子c_puct = 5 # 决定探索程度n_playout = 100 # 每步模拟次数learn_rate = 0.002 # 学习率lr_multiplier = 1.0 # 基于KL的自适应学习率调整temperature = 1.0 # 温度参数noise_eps = 0.75 # 噪声参数dirichlet_alpha = 0.3 # dirichlet系数buffer_size = 5000 # buffer大小train_batch_size = 128 # batchsize大小update_epochs = 5 # 多少个epoch更新一次kl_coeff = 0.02 # kl系数checkpoint_freq = 20 # 模型保存频率mcts_infer = 200 # 纯mcts推理时间restore_model = None # 是否加载预训练模型game_batch_num=40 # 训练步数model_path="." 
# 模型保存路径3.构建环境五子棋的环境是按照标准gym环境构建的,棋盘宽x高,先在横线、直线或斜对角线上形成n子连线的玩家获胜。状态空间为[4,棋盘宽,棋盘高],四个维度分别为当前视角下的位置,对手位置,上次位置以及轮次。class GomokuEnv(gym.Env): def __init__(self, start_player=0): self.start_player = start_player self.action_space = Discrete((board_width * board_height)) self.observation_space = Box(0, 1, shape=(4, board_width, board_height)) self.reward = 0 self.info = {} self.players = [1, 2] # player1 and player2 def step(self, action): self.states[action] = self.current_player if action in self.availables: self.availables.remove(action) self.last_move = action done, winner = self.game_end() reward = 0 if done: if winner == self.current_player: reward = 1 else: reward = -1 self.current_player = ( self.players[0] if self.current_player == self.players[1] else self.players[1] ) # update state obs = self.current_state() return obs, reward, done, self.info def reset(self): if board_width < n_in_row or board_height < n_in_row: raise Exception('board width and height can not be ' 'less than {}'.format(n_in_row)) self.current_player = self.players[self.start_player] # start player # keep available moves in a list self.availables = list(range(board_width * board_height)) self.states = {} self.last_move = -1 return self.current_state() def render(self, mode='human', start_player=0): width = board_width height = board_height p1, p2 = self.players print() for x in range(width): print("{0:8}".format(x), end='') print('\r\n') for i in range(height - 1, -1, -1): print("{0:4d}".format(i), end='') for j in range(width): loc = i * width + j p = self.states.get(loc, -1) if p == p1: print('B'.center(8), end='') elif p == p2: print('W'.center(8), end='') else: print('_'.center(8), end='') print('\r\n\r\n') def has_a_winner(self): states = self.states moved = list(set(range(board_width * board_height)) - set(self.availables)) if len(moved) < n_in_row * 2 - 1: return False, -1 for m in moved: h = m // board_width w = m % board_width player = states[m] if (w in range(board_width - n_in_row + 1) and len(set(states.get(i, -1) for i in range(m, m + n_in_row))) == 1): return True, player if (h in range(board_height - n_in_row + 1) and len(set(states.get(i, -1) for i in range(m, m + n_in_row * board_width, board_width))) == 1): return True, player if (w in range(board_width - n_in_row + 1) and h in range(board_height - n_in_row + 1) and len(set( states.get(i, -1) for i in range(m, m + n_in_row * (board_width + 1), board_width + 1))) == 1): return True, player if (w in range(n_in_row - 1, board_width) and h in range(board_height - n_in_row + 1) and len(set( states.get(i, -1) for i in range(m, m + n_in_row * (board_width - 1), board_width - 1))) == 1): return True, player return False, -1 def game_end(self): """Check whether the game is ended or not""" win, winner = self.has_a_winner() if win: # print("winner is player{}".format(winner)) return True, winner elif not len(self.availables): return True, -1 return False, -1 def current_state(self): """return the board state from the perspective of the current player. 
state shape: 4*width*height """ square_state = np.zeros((4, board_width, board_height)) if self.states: moves, players = np.array(list(zip(*self.states.items()))) move_curr = moves[players == self.current_player] move_oppo = moves[players != self.current_player] square_state[0][move_curr // board_width, move_curr % board_height] = 1.0 square_state[1][move_oppo // board_width, move_oppo % board_height] = 1.0 # indicate the last move location square_state[2][self.last_move // board_width, self.last_move % board_height] = 1.0 if len(self.states) % 2 == 0: square_state[3][:, :] = 1.0 # indicate the colour to play return square_state[:, ::-1, :] def start_play(self, player1, player2, start_player=0): """start a game between two players""" if start_player not in (0, 1): raise Exception('start_player should be either 0 (player1 first) ' 'or 1 (player2 first)') self.reset() p1, p2 = self.players player1.set_player_ind(p1) player2.set_player_ind(p2) players = {p1: player1, p2: player2} while True: player_in_turn = players[self.current_player] move = player_in_turn.get_action(self) self.step(move) end, winner = self.game_end() if end: return winner def start_self_play(self, player): """ start a self-play game using a MCTS player, reuse the search tree, and store the self-play data: (state, mcts_probs, z) for training """ self.reset() states, mcts_probs, current_players = [], [], [] while True: move, move_probs = player.get_action(self, return_prob=1) # store the data states.append(self.current_state()) mcts_probs.append(move_probs) current_players.append(self.current_player) # perform a move self.step(move) end, winner = self.game_end() if end: # winner from the perspective of the current player of each state winners_z = np.zeros(len(current_players)) if winner != -1: winners_z[np.array(current_players) == winner] = 1.0 winners_z[np.array(current_players) != winner] = -1.0 # reset MCTS root node player.reset_player() return winner, zip(states, mcts_probs, winners_z) def location_to_move(self, location): if (len(location) != 2): return -1 h = location[0] w = location[1] move = h * board_width + w if (move not in range(board_width * board_width)): return -1 return move def move_to_location(self, move): """ 3*3 board's moves like: 6 7 8 3 4 5 0 1 2 and move 5's location is (1,2) """ h = move // board_width w = move % board_width return [h, w]4.构建神经网络网络结构较为简单,backbone部分是三层卷积神经网络,提取特征后分为两个分支。一个是价值分支,输出当前棋面价值。另一个是决策分支,输出神经网络计算得到的动作对应概率。class Net(nn.Module): """policy-value network module""" def __init__(self): super(Net, self).__init__() # common layers self.conv1 = nn.Conv2d(4, 32, kernel_size=3, padding=1) self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1) self.conv3 = nn.Conv2d(64, 128, kernel_size=3, padding=1) # action policy layers self.act_conv1 = nn.Conv2d(128, 4, kernel_size=1) self.act_fc1 = nn.Linear(4 * board_width * board_height, board_width * board_height) # state value layers self.val_conv1 = nn.Conv2d(128, 2, kernel_size=1) self.val_fc1 = nn.Linear(2 * board_width * board_height, 64) self.val_fc2 = nn.Linear(64, 1) def forward(self, state_input): # common layers x = F.relu(self.conv1(state_input)) x = F.relu(self.conv2(x)) x = F.relu(self.conv3(x)) # action policy layers x_act = F.relu(self.act_conv1(x)) x_act = x_act.view(-1, 4 * board_width * board_height) x_act = F.log_softmax(self.act_fc1(x_act)) # state value layers x_val = F.relu(self.val_conv1(x)) x_val = x_val.view(-1, 2 * board_width * board_height) x_val = F.relu(self.val_fc1(x_val)) x_val = F.tanh(self.val_fc2(x_val)) 
return x_act, x_valclass PolicyValueNet: """policy-value network """ def __init__(self, model_file=None): if torch.cuda.is_available(): self.device = torch.device("cuda") else: self.device = torch.device("cpu") self.l2_const = 1e-4 # coef of l2 penalty # the policy value net module self.policy_value_net = Net().to(self.device) self.optimizer = optim.Adam(self.policy_value_net.parameters(), weight_decay=self.l2_const) if model_file: net_params = torch.load(model_file) self.policy_value_net.load_state_dict(net_params) def policy_value(self, state_batch): """ input: a batch of states output: a batch of action probabilities and state values """ state_batch = Variable(torch.FloatTensor(state_batch).to(self.device)) log_act_probs, value = self.policy_value_net(state_batch) act_probs = np.exp(log_act_probs.data.cpu().numpy()) return act_probs, value.data.cpu().numpy() def policy_value_fn(self, board): """ input: board output: a list of (action, probability) tuples for each available action and the score of the board state """ legal_positions = board.availables current_state = np.ascontiguousarray(board.current_state().reshape( -1, 4, board_width, board_height)) log_act_probs, value = self.policy_value_net( Variable(torch.from_numpy(current_state)).to(self.device).float()) act_probs = np.exp(log_act_probs.data.cpu().numpy().flatten()) act_probs = zip(legal_positions, act_probs[legal_positions]) value = value.data[0][0] return act_probs, value def train_step(self, state_batch, mcts_probs, winner_batch, lr): """perform a training step""" # wrap in Variable state_batch = Variable(torch.FloatTensor(state_batch).to(self.device)) mcts_probs = Variable(torch.FloatTensor(mcts_probs).to(self.device)) winner_batch = Variable(torch.FloatTensor(winner_batch).to(self.device)) # zero the parameter gradients self.optimizer.zero_grad() # set learning rate for param_group in self.optimizer.param_groups: param_group['lr'] = lr # forward log_act_probs, value = self.policy_value_net(state_batch) # define the loss = (z - v)^2 - pi^T * log(p) + c||theta||^2 # Note: the L2 penalty is incorporated in optimizer value_loss = F.mse_loss(value.view(-1), winner_batch) policy_loss = -torch.mean(torch.sum(mcts_probs * log_act_probs, 1)) loss = value_loss + policy_loss # backward and optimize loss.backward() self.optimizer.step() # calc policy entropy, for monitoring only entropy = -torch.mean( torch.sum(torch.exp(log_act_probs) * log_act_probs, 1) ) # return loss.data, entropy.data # for pytorch version >= 0.5 please use the following line instead. 
return loss.item(), entropy.item() def get_policy_param(self): net_params = self.policy_value_net.state_dict() return net_params def save_model(self, model_file): """ save model params to file """ net_params = self.get_policy_param() # get model params torch.save(net_params, model_file)5.实现MCTS¶AlphaZero利用MCTS来自博弈生成棋局,MCTS搜索原理简述如下:每次模拟通过选择具有最大行动价值Q的边加上取决于所存储的先验概率P和该边的访问计数N(每次访问都被增加一次)的上限置信区间U来遍历树。展开叶子节点,通过神经网络来评估局面s;向量P的值存储在叶子结点扩展的边上。更新行动价值Q等于在该行动下的子树中的所有评估值V的均值。一旦MCTS搜索完成,返回局面s下的落子概率π。def softmax(x): probs = np.exp(x - np.max(x)) probs /= np.sum(probs) return probsdef rollout_policy_fn(board): """a coarse, fast version of policy_fn used in the rollout phase.""" # rollout randomly action_probs = np.random.rand(len(board.availables)) return zip(board.availables, action_probs)def policy_value_fn(board): """a function that takes in a state and outputs a list of (action, probability) tuples and a score for the state""" # return uniform probabilities and 0 score for pure MCTS action_probs = np.ones(len(board.availables)) / len(board.availables) return zip(board.availables, action_probs), 0class TreeNode: """A node in the MCTS tree. Each node keeps track of its own value Q, prior probability P, and its visit-count-adjusted prior score u. """ def __init__(self, parent, prior_p): self._parent = parent self._children = {} # a map from action to TreeNode self._n_visits = 0 self._Q = 0 self._u = 0 self._P = prior_p def expand(self, action_priors): """Expand tree by creating new children. action_priors: a list of tuples of actions and their prior probability according to the policy function. """ for action, prob in action_priors: if action not in self._children: self._children[action] = TreeNode(self, prob) def select(self, c_puct): """Select action among children that gives maximum action value Q plus bonus u(P). Return: A tuple of (action, next_node) """ return max(self._children.items(), key=lambda act_node: act_node[1].get_value(c_puct)) def update(self, leaf_value): """Update node values from leaf evaluation. leaf_value: the value of subtree evaluation from the current player's perspective. """ # Count visit. self._n_visits += 1 # Update Q, a running average of values for all visits. self._Q += 1.0 * (leaf_value - self._Q) / self._n_visits def update_recursive(self, leaf_value): """Like a call to update(), but applied recursively for all ancestors. """ # If it is not root, this node's parent should be updated first. if self._parent: self._parent.update_recursive(-leaf_value) self.update(leaf_value) def get_value(self, c_puct): """Calculate and return the value for this node. It is a combination of leaf evaluations Q, and this node's prior adjusted for its visit count, u. c_puct: a number in (0, inf) controlling the relative impact of value Q, and prior probability P, on this node's score. """ self._u = (c_puct * self._P * np.sqrt(self._parent._n_visits) / (1 + self._n_visits)) return self._Q + self._u def is_leaf(self): """Check if leaf node (i.e. no nodes below this have been expanded).""" return self._children == {} def is_root(self): return self._parent is Noneclass MCTS: """An implementation of Monte Carlo Tree Search.""" def __init__(self, policy_value_fn, c_puct=5): """ policy_value_fn: a function that takes in a board state and outputs a list of (action, probability) tuples and also a score in [-1, 1] (i.e. the expected value of the end game score from the current player's perspective) for the current player. 
c_puct: a number in (0, inf) that controls how quickly exploration converges to the maximum-value policy. A higher value means relying on the prior more. """ self._root = TreeNode(None, 1.0) self._policy = policy_value_fn self._c_puct = c_puct def _playout(self, state): """Run a single playout from the root to the leaf, getting a value at the leaf and propagating it back through its parents. State is modified in-place, so a copy must be provided. """ node = self._root while (1): if node.is_leaf(): break # Greedily select next move. action, node = node.select(self._c_puct) state.step(action) # Evaluate the leaf using a network which outputs a list of # (action, probability) tuples p and also a score v in [-1, 1] # for the current player. action_probs, leaf_value = self._policy(state) # Check for end of game. end, winner = state.game_end() if not end: node.expand(action_probs) else: # for end state, return the true leaf_value if winner == -1: # tie leaf_value = 0.0 else: leaf_value = ( 1.0 if winner == state.current_player else -1.0 ) # Update value and visit count of nodes in this traversal. node.update_recursive(-leaf_value) def _playout_p(self, state): """Run a single playout from the root to the leaf, getting a value at the leaf and propagating it back through its parents. State is modified in-place, so a copy must be provided. """ node = self._root while (1): if node.is_leaf(): break # Greedily select next move. action, node = node.select(self._c_puct) state.step(action) action_probs, _ = self._policy(state) # Check for end of game end, winner = state.game_end() if not end: node.expand(action_probs) # Evaluate the leaf node by random rollout leaf_value = self._evaluate_rollout(state) # Update value and visit count of nodes in this traversal. node.update_recursive(-leaf_value) def _evaluate_rollout(self, env, limit=1000): """Use the rollout policy to play until the end of the game, returning +1 if the current player wins, -1 if the opponent wins, and 0 if it is a tie. """ player = env.current_player for i in range(limit): end, winner = env.game_end() if end: break action_probs = rollout_policy_fn(env) max_action = max(action_probs, key=itemgetter(1))[0] env.step(max_action) else: # If no break from the loop, issue a warning. print("WARNING: rollout reached move limit") if winner == -1: # tie return 0 else: return 1 if winner == player else -1 def get_move_probs(self, state, temp=1e-3): """Run all playouts sequentially and return the available actions and their corresponding probabilities. state: the current game state temp: temperature parameter in (0, 1] controls the level of exploration """ for n in range(n_playout): state_copy = copy.deepcopy(state) self._playout(state_copy) # calc the move probabilities based on visit counts at the root node act_visits = [(act, node._n_visits) for act, node in self._root._children.items()] acts, visits = zip(*act_visits) act_probs = softmax(1.0 / temp * np.log(np.array(visits) + 1e-10)) return acts, act_probs def get_move(self, state): """Runs all playouts sequentially and returns the most visited action. state: the current game state Return: the selected action """ for n in range(n_playout): state_copy = copy.deepcopy(state) self._playout_p(state_copy) return max(self._root._children.items(), key=lambda act_node: act_node[1]._n_visits)[0] def update_with_move(self, last_move): """Step forward in the tree, keeping everything we already know about the subtree. 
""" if last_move in self._root._children: self._root = self._root._children[last_move] self._root._parent = None else: self._root = TreeNode(None, 1.0) def __str__(self): return "MCTS"6.实现自博弈过程实现自博弈训练,此处博弈双方分别为基于MCTS的神经网络和纯MCTS,对弈过程中,前者基于神经网络和MCTS获取最优下子策略,而后者则仅根据MCTS搜索下子策略。保存对局数据class MCTS_Pure: """AI player based on MCTS""" def __init__(self): self.mcts = MCTS(policy_value_fn, c_puct) def set_player_ind(self, p): self.player = p def reset_player(self): self.mcts.update_with_move(-1) def get_action(self, board): sensible_moves = board.availables if len(sensible_moves) > 0: move = self.mcts.get_move(board) self.mcts.update_with_move(-1) return move else: print("WARNING: the board is full") def __str__(self): return "MCTS {}".format(self.player)class MCTSPlayer(MCTS_Pure): """AI player based on MCTS""" def __init__(self, policy_value_function, is_selfplay=0): super(MCTS_Pure, self).__init__() self.mcts = MCTS(policy_value_function, c_puct) self._is_selfplay = is_selfplay def get_action(self, env, return_prob=0): sensible_moves = env.availables # the pi vector returned by MCTS as in the alphaGo Zero paper move_probs = np.zeros(board_width * board_width) if len(sensible_moves) > 0: acts, probs = self.mcts.get_move_probs(env, temperature) move_probs[list(acts)] = probs if self._is_selfplay: # add Dirichlet Noise for exploration (needed for # self-play training) move = np.random.choice( acts, p=noise_eps * probs + (1 - noise_eps) * np.random.dirichlet( dirichlet_alpha * np.ones(len(probs)))) # update the root node and reuse the search tree self.mcts.update_with_move(move) else: # with the default temp=1e-3, it is almost equivalent # to choosing the move with the highest prob move = np.random.choice(acts, p=probs) # reset the root node self.mcts.update_with_move(-1) if return_prob: return move, move_probs else: return move else: print("WARNING: the board is full")7.训练主函数训练过程包括自我对局,数据生成,模型更新和保存class TrainPipeline: def __init__(self): # params of the board and the game self.env = GomokuEnv() # training params self.data_buffer = deque(maxlen=buffer_size) self.play_batch_size = 1 self.best_win_ratio = 0.0 # start training from an initial policy-value net self.policy_value_net = PolicyValueNet(model_file=restore_model) self.mcts_player = MCTSPlayer(self.policy_value_net.policy_value_fn, is_selfplay=1) self.mcts_infer = mcts_infer self.lr_multiplier = lr_multiplier def get_equi_data(self, play_data): """augment the data set by rotation and flipping play_data: [(state, mcts_prob, winner_z), ..., ...] 
""" extend_data = [] for state, mcts_porb, winner in play_data: for i in [1, 2, 3, 4]: # rotate counterclockwise equi_state = np.array([np.rot90(s, i) for s in state]) equi_mcts_prob = np.rot90(np.flipud( mcts_porb.reshape(board_height, board_width)), i) extend_data.append((equi_state, np.flipud(equi_mcts_prob).flatten(), winner)) # flip horizontally equi_state = np.array([np.fliplr(s) for s in equi_state]) equi_mcts_prob = np.fliplr(equi_mcts_prob) extend_data.append((equi_state, np.flipud(equi_mcts_prob).flatten(), winner)) return extend_data def collect_selfplay_data(self, n_games=1): """collect self-play data for training""" for i in range(n_games): winner, play_data = self.env.start_self_play(self.mcts_player) play_data = list(play_data)[:] self.episode_len = len(play_data) # augment the data play_data = self.get_equi_data(play_data) self.data_buffer.extend(play_data) def policy_update(self): """update the policy-value net""" mini_batch = random.sample(self.data_buffer, train_batch_size) state_batch = [data[0] for data in mini_batch] mcts_probs_batch = [data[1] for data in mini_batch] winner_batch = [data[2] for data in mini_batch] old_probs, old_v = self.policy_value_net.policy_value(state_batch) for i in range(update_epochs): loss, entropy = self.policy_value_net.train_step( state_batch, mcts_probs_batch, winner_batch, learn_rate * self.lr_multiplier) new_probs, new_v = self.policy_value_net.policy_value(state_batch) kl = np.mean(np.sum(old_probs * ( np.log(old_probs + 1e-10) - np.log(new_probs + 1e-10)), axis=1) ) if kl > kl_coeff * 4: # early stopping if D_KL diverges badly break # adaptively adjust the learning rate if kl > kl_coeff * 2 and self.lr_multiplier > 0.1: self.lr_multiplier /= 1.5 elif kl < kl_coeff / 2 and self.lr_multiplier < 10: self.lr_multiplier *= 1.5 return loss, entropy def policy_evaluate(self, n_games=10): """ Evaluate the trained policy by playing against the pure MCTS player Note: this is only for monitoring the progress of training """ current_mcts_player = MCTSPlayer(self.policy_value_net.policy_value_fn) pure_mcts_player = MCTS_Pure() win_cnt = defaultdict(int) for i in range(n_games): winner = self.env.start_play(current_mcts_player, pure_mcts_player, start_player=i % 2) win_cnt[winner] += 1 win_ratio = 1.0 * (win_cnt[1] + 0.5 * win_cnt[-1]) / n_games print("num_playouts:{}, win: {}, lose: {}, tie:{}".format(self.mcts_infer, win_cnt[1], win_cnt[2], win_cnt[-1])) return win_ratio def run(self): """run the training pipeline""" win_num = 0 try: for i_step in range(game_batch_num): self.collect_selfplay_data(self.play_batch_size) print("batch i:{}, episode_len:{}".format( i_step + 1, self.episode_len)) if len(self.data_buffer) > train_batch_size: loss, entropy = self.policy_update() # check the performance of the current model, # and save the model params if (i_step + 1) % checkpoint_freq == 0: print("current self-play batch: {}".format(i_step + 1)) win_ratio = self.policy_evaluate() self.policy_value_net.save_model(os.path.join(model_path, "newest_model.pt")) if win_ratio > self.best_win_ratio: win_num += 1 # print("New best policy!!!!!!!!") self.best_win_ratio = win_ratio # update the best_policy self.policy_value_net.save_model(os.path.join(model_path, "best_model.pt")) if self.best_win_ratio == 1.0 and self.mcts_infer < 5000: self.mcts_infer += 1000 self.best_win_ratio = 0.0 except KeyboardInterrupt: print('\n\rquit') return win_num8.开始自博弈训练,并保存模型GPU训练耗时约4分钟start_t = time.time()training_pipeline = TrainPipeline()training_pipeline.run()print("time cost 
is {}".format(time.time()-start_t))batch i:1, episode_len:13batch i:2, episode_len:16batch i:3, episode_len:13batch i:4, episode_len:11batch i:5, episode_len:11batch i:6, episode_len:15batch i:7, episode_len:13batch i:8, episode_len:13batch i:9, episode_len:19batch i:10, episode_len:14batch i:11, episode_len:11batch i:12, episode_len:19batch i:13, episode_len:15batch i:14, episode_len:22batch i:15, episode_len:8batch i:16, episode_len:16batch i:17, episode_len:15batch i:18, episode_len:13batch i:19, episode_len:15batch i:20, episode_len:17current self-play batch: 20num_playouts:200, win: 2, lose: 8, tie:0batch i:21, episode_len:12batch i:22, episode_len:14batch i:23, episode_len:12batch i:24, episode_len:11batch i:25, episode_len:15batch i:26, episode_len:7batch i:27, episode_len:13batch i:28, episode_len:10batch i:29, episode_len:10batch i:30, episode_len:13batch i:31, episode_len:17batch i:32, episode_len:12batch i:33, episode_len:11batch i:34, episode_len:11batch i:35, episode_len:9batch i:36, episode_len:14batch i:37, episode_len:17batch i:38, episode_len:12batch i:39, episode_len:13batch i:40, episode_len:18current self-play batch: 40num_playouts:200, win: 3, lose: 7, tie:0time cost is 250.642775774002089.AI对战(等待第8步运行结束后再运行此步)加载模型,进行人机对战# 定义当前玩家class CurPlayer: player_id = 0# 可视化部分class Game(object): def __init__(self, board): self.board = board self.cell_size = board_width - 1 self.chess_size = 50 * self.cell_size self.whitex = [] self.whitey = [] self.blackx = [] self.blacky = [] # 棋盘背景色 self.color = "#e4ce9f" self.colors = [[self.color] * self.cell_size for _ in range(self.cell_size)] def graphic(self, board, player1, player2): """Draw the board and show game info""" plt_fig, ax = plt.subplots(facecolor=self.color) ax.set_facecolor(self.color) # 制作棋盘 # mytable = ax.table(cellColours=self.colors, loc='center') mytable = plt.table(cellColours=self.colors, colWidths=[1 / board_width] * self.cell_size, loc='center' ) ax.set_aspect('equal') # 网格大小 cell_height = 1 / board_width for pos, cell in mytable.get_celld().items(): cell.set_height(cell_height) mytable.auto_set_font_size(False) mytable.set_fontsize(self.cell_size) ax.set_xlim([1, board_width * 2 + 1]) ax.set_ylim([board_height * 2 + 1, 1]) plt.title("Gomoku") plt.axis('off') cur_player = CurPlayer() while True: # left down of mouse try: if cur_player.player_id == 1: move = player1.get_action(self.board) self.board.step(move) x, y = self.board.move_to_location(move) plt.scatter((y + 1) * 2, (x + 1) * 2, s=self.chess_size, c='white') cur_player.player_id = 0 elif cur_player.player_id == 0: move = player2.get_action(self.board) self.board.step(move) x, y = self.board.move_to_location(move) plt.scatter((y + 1) * 2, (x + 1) * 2, s=self.chess_size, c='black') cur_player.player_id = 1 end, winner = self.board.game_end() if end: if winner != -1: ax.text(x=board_width, y=(board_height + 1) * 2 + 0.1, s="Game end. Winner is player {}".format(cur_player.player_id), fontsize=10, color='red', weight='bold', horizontalalignment='center') else: ax.text(x=board_width, y=(board_height + 1) * 2 + 0.1, s="Game end. 
Tie Round".format(cur_player.player_id), fontsize=10, color='red', weight='bold', horizontalalignment='center') return winner display.display(plt.gcf()) display.clear_output(wait=True) except: pass def start_play(self, player1, player2, start_player=0): """start a game between two players""" if start_player not in (0, 1): raise Exception('start_player should be either 0 (player1 first) ' 'or 1 (player2 first)') self.board.reset() p1, p2 = self.board.players player1.set_player_ind(p1) player2.set_player_ind(p2) self.graphic(self.board, player1, player2)# 初始化棋盘board = GomokuEnv()game = Game(board)# 加载模型best_policy = PolicyValueNet(model_file="best_model.pt")# 两个AI对打mcts_player = MCTSPlayer(best_policy.policy_value_fn)#开始对打game.start_play(mcts_player, mcts_player, start_player=0至此,本案例结束,如果想要完整地训练一个五子棋AlphaZero AI,可在AI Gallery中订阅《Gomoku-训练五子棋小游戏》算法并在ModelArts中进行训练。10. 作业请你调整步骤2中的训练参数,重新训练一个模型,使它在游戏中获得更好的表现
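针对上面的作业,下面给出一个示意性的调参示例(具体取值只是假设的参考值,并非标准答案;假设第1~7步中的定义已按顺序运行,这些超参数都是模块级全局变量,在重新构造 TrainPipeline 之前覆盖即可生效):
n_playout = 400         # 示例取值:每步MCTS模拟次数(原为100),搜索更充分但更耗时
game_batch_num = 200    # 示例取值:自博弈训练步数(原为40)
train_batch_size = 256  # 示例取值:训练batch大小(原为128)
start_t = time.time()
training_pipeline = TrainPipeline()
training_pipeline.run()
print("time cost is {}".format(time.time() - start_t))
训练耗时大致随 n_playout 与 game_batch_num 成比例增加;训练结束后仍可按第9步加载 best_model.pt 进行AI对战。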
  • [技术干货] 使用A3C算法玩乒乓球游戏
使用A3C算法玩乒乓球游戏
实验目标
通过本案例的学习和课后作业的练习:
了解A3C算法的基本概念;
了解如何基于A3C训练ATARI游戏;
了解强化学习训练、推理游戏的整体流程。
你也可以将本案例相关的 ipynb 学习笔记分享到 AI Gallery Notebook 版块获得成长值,分享方法请查看此文档。
案例内容介绍
A3C算法(Asynchronous Advantage Actor-Critic)是基于Actor-Critic架构提出的一种并行强化学习算法,解决了单个智能体与环境交互数据收集速度慢、训练难以收敛的问题。本案例基于单机多进程的方法实现了对ATARI游戏PONG智能体的训练。在16个并行环境下,本案例中的智能体能在14-24min的训练后解决ATARI_PONG游戏。
整体流程:创建ATARI环境 -> 构建A3C算法 -> 训练 -> 推理 -> 可视化效果
参考材料:A3C论文、代码实现
A3C算法基本原理
A3C算法基于并行计算的思想,每个进程分别与环境进行交互学习,最后把训练结果汇总在共享内存中;每个进程定期从共享内存中获取最新的训练结果,以指导接下来与环境交互的过程。相比Actor-Critic架构,A3C改进为异步并行训练框架,优点如下:采用基于lock-free的并行梯度下降算法(Hogwild!),避免了分布式系统的通信开销,实现了高效的梯度数据同步;多个独立的并行环境有助于降低数据之间的相关性,同时增加智能体的exploration。本案例中Actor和Critic共享网络结构,损失函数包括三个部分:Policy Loss、Value Loss、Regularization with Policy Entropy。
Atari Pong环境介绍
Pong是起源于1972年美国的一款模拟两人打乒乓球的游戏,近几年常用于测试强化学习算法的性能。游戏规则:智能体控制一边的球拍(AI控制另一边),将球打到另一边;对手失球则智能体得1分,智能体失球则对手得1分,先达到21分的一方获得游戏胜利。游戏环境输出的标准observation为(210, 160, 3)的RGB图像。
(图:PONG环境介绍)
注意事项
本案例运行环境为 Pytorch-1.0.0,且需使用 GPU 运行,请查看《ModelArts JupyterLab 硬件规格使用指南》了解切换硬件规格的方法;
如果您是第一次使用 JupyterLab,请查看《ModelArts JupyterLab使用指导》了解使用方法;
如果您在使用 JupyterLab 过程中碰到报错,请参考《ModelArts JupyterLab常见问题解决办法》尝试解决问题。
实验步骤
1. 程序初始化
第1步:安装基础依赖
!pip install --upgrade pip
!pip install gym
!pip install gym[atari]
第2步:导入相关的库
import os
import math
import time
import argparse
import collections
from collections import deque
import cv2
import numpy as np
import gym
from gym.spaces.box import Box
import torch
import torch.nn as nn
import torch.optim as optim
import torch.multiprocessing as mp
import torch.nn.functional as F
import matplotlib
import matplotlib.pyplot as plt
from IPython import display
import moxing as mox
%matplotlib inline
第3步:导入预训练模型
mox.file.copy_parallel("obs://modelarts-labs-bj4-v2/course/modelarts/reinforcement_learning/pong_A3C/model/Pong_A3C_pretrained.pth", "model/Pong_A3C_pretrained.pth")
2.
定义基于共享内存的Adam优化器class SharedAdam(optim.Adam): """基于共享内存的Adam算法实现,基于torch.optim.adam源码""" def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0): super(SharedAdam, self).__init__(params, lr, betas, eps, weight_decay) for group in self.param_groups: for p in group['params']: state = self.state[p] state['step'] = torch.zeros(1) state['exp_avg'] = p.data.new().resize_as_(p.data).zero_() state['exp_avg_sq'] = p.data.new().resize_as_(p.data).zero_() def share_memory(self): for group in self.param_groups: for p in group['params']: state = self.state[p] state['step'].share_memory_() state['exp_avg'].share_memory_() state['exp_avg_sq'].share_memory_() def step(self, closure=None): """执行单一的优化步骤 参数: closure (callable, optional): 用于重新评估模型并返回损失函数的闭包 """ loss = None if closure is not None: loss = closure() for group in self.param_groups: for p in group['params']: if p.grad is None: continue grad = p.grad.data state = self.state[p] exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq'] beta1, beta2 = group['betas'] state['step'] += 1 if group['weight_decay'] != 0: grad = grad.add(group['weight_decay'], p.data) # 一二阶moment系数的指数衰减 exp_avg.mul_(beta1).add_(1 - beta1, grad) exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad) denom = exp_avg_sq.sqrt().add_(group['eps']) bias_correction1 = 1 - beta1 ** state['step'].item() bias_correction2 = 1 - beta2 ** state['step'].item() step_size = group['lr'] * math.sqrt( bias_correction2) / bias_correction1 p.data.addcdiv_(-step_size, exp_avg, denom) return loss3. 预处理ATARI环境输入# 预处理方式参考 https://github.com/openai/universe-starter-agent def create_atari_env(env_id): env = gym.make(env_id) env = AtariRescale42x42(env) env = NormalizedEnv(env) return env def _process_frame42(frame): frame = frame[34:34 + 160, :160] # 将图像输入裁剪为尺寸[42, 42] frame = cv2.resize(frame, (80, 80)) frame = cv2.resize(frame, (42, 42)) frame = frame.mean(2, keepdims=True) frame = frame.astype(np.float32) frame *= (1.0 / 255.0) frame = np.moveaxis(frame, -1, 0) return frame class AtariRescale42x42(gym.ObservationWrapper): def __init__(self, env=None): gym.ObservationWrapper.__init__(self, env) self.observation_space = Box(0.0, 1.0, [1, 42, 42]) def observation(self, observation): return _process_frame42(observation) class NormalizedEnv(gym.ObservationWrapper): def __init__(self, env=None): gym.ObservationWrapper.__init__(self, env) self.state_mean = 0 self.state_std = 0 self.alpha = 0.9999 self.num_steps = 0 def observation(self, observation): self.num_steps += 1 self.state_mean = self.state_mean * self.alpha + \ observation.mean() * (1 - self.alpha) self.state_std = self.state_std * self.alpha + \ observation.std() * (1 - self.alpha) unbiased_mean = self.state_mean / (1 - pow(self.alpha, self.num_steps)) unbiased_std = self.state_std / (1 - pow(self.alpha, self.num_steps)) return (observation - unbiased_mean) / (unbiased_std + 1e-8)4. 定义神经网络# 初始化权重张量的方差 def normalized_columns_initializer(weights, std=1.0): out = torch.randn(weights.size()) out *= std / torch.sqrt(out.pow(2).sum(1, keepdim=True)) return out # 初始化神经网络参数 def weights_init(m): classname = m.__class__.__name__ if classname.find('Conv') != -1: weight_shape = list(m.weight.data.size()) fan_in = np.prod(weight_shape[1:4]) fan_out = np.prod(weight_shape[2:4]) * weight_shape[0] w_bound = np.sqrt(6. 
/ (fan_in + fan_out)) m.weight.data.uniform_(-w_bound, w_bound) m.bias.data.fill_(0) elif classname.find('Linear') != -1: weight_shape = list(m.weight.data.size()) fan_in = weight_shape[1] fan_out = weight_shape[0] w_bound = np.sqrt(6. / (fan_in + fan_out)) m.weight.data.uniform_(-w_bound, w_bound) m.bias.data.fill_(0) class ActorCritic(torch.nn.Module): def __init__(self, num_inputs, action_space): super(ActorCritic, self).__init__() self.conv1 = nn.Conv2d(num_inputs, 32, 3, stride=2, padding=1) self.conv2 = nn.Conv2d(32, 32, 3, stride=2, padding=1) self.conv3 = nn.Conv2d(32, 32, 3, stride=2, padding=1) self.conv4 = nn.Conv2d(32, 32, 3, stride=2, padding=1) # 使用LSTM来获取图像输入的时序信息 self.lstm = nn.LSTMCell(32 * 3 * 3, 256) num_outputs = action_space.n self.critic_linear = nn.Linear(256, 1) self.actor_linear = nn.Linear(256, num_outputs) self.apply(weights_init) self.actor_linear.weight.data = normalized_columns_initializer( self.actor_linear.weight.data, 0.01) self.actor_linear.bias.data.fill_(0) self.critic_linear.weight.data = normalized_columns_initializer( self.critic_linear.weight.data, 1.0) self.critic_linear.bias.data.fill_(0) self.lstm.bias_ih.data.fill_(0) self.lstm.bias_hh.data.fill_(0) self.train() def forward(self, inputs): inputs, (hx, cx) = inputs x = F.elu(self.conv1(inputs)) x = F.elu(self.conv2(x)) x = F.elu(self.conv3(x)) x = F.elu(self.conv4(x)) # 将最后一层卷积网络的输出展开成一维向量 x = x.view(-1, 32 * 3 * 3) # LSTM层输入x, 上一时刻的hidden state和cell state, 输出新的hiden state和cell state hx, cx = self.lstm(x, (hx, cx)) x = hx return self.critic_linear(x), self.actor_linear(x), (hx, cx)5. 策略评估函数def test(rank, args, shared_model, counter): torch.manual_seed(args.seed + rank) env = create_atari_env(args.env_name) env.seed(args.seed + rank) model = ActorCritic(env.observation_space.shape[0], env.action_space) model.eval() state = env.reset() state = torch.from_numpy(state) reward_sum = 0 done = True start_time = time.time() # 防止agent陷入局部最优 actions = deque(maxlen=100) episode_length = 0 while True: episode_length += 1 # 和共享模型进行参数同步 if done: model.load_state_dict(shared_model.state_dict()) cx = torch.zeros(1, 256) hx = torch.zeros(1, 256) else: cx = cx.detach() hx = hx.detach() with torch.no_grad(): value, logit, (hx, cx) = model((state.unsqueeze(0), (hx, cx))) prob = F.softmax(logit, dim=-1) # 推理时,动作选择不采用探索 action = prob.max(1, keepdim=True)[1].numpy() state, reward, done, _ = env.step(action[0, 0]) done = done or episode_length >= args.max_episode_length reward_sum += reward # 防止agent陷入局部最优 actions.append(action[0, 0]) if actions.count(actions[0]) == actions.maxlen: done = True if done: print("Time {}, num steps {}, FPS {:.0f}, episode reward {}, episode length {}".format( time.strftime("%Hh %Mm %Ss", time.gmtime(time.time() - start_time)), counter.value, counter.value / (time.time() - start_time), reward_sum, episode_length)) reward_sum = 0 episode_length = 0 actions.clear() state = env.reset() time.sleep(1) state = torch.from_numpy(state) if counter.value > args.max_steps: break print('testing ends')6. 
模型训练函数# 将共享Adam的优化器指向train进程内的优化器实例 def ensure_shared_grads(model, shared_model): for param, shared_param in zip(model.parameters(), shared_model.parameters()): if shared_param.grad is not None: return shared_param._grad = param.grad def train(rank, args, shared_model, counter, lock, optimizer=None): torch.manual_seed(args.seed + rank) env = create_atari_env(args.env_name) env.seed(args.seed + rank) model = ActorCritic(env.observation_space.shape[0], env.action_space) if optimizer is None: optimizer = optim.Adam(shared_model.parameters(), lr=args.lr) model.train() state = env.reset() state = torch.from_numpy(state) done = True max_steps = 10000 episode_length = 0 while True: # 同步模型参数 model.load_state_dict(shared_model.state_dict()) # 回合开始前初始化lstm的状态 if done: cx = torch.zeros(1, 256) hx = torch.zeros(1, 256) else: cx = cx.detach() hx = hx.detach() values = [] log_probs = [] rewards = [] entropies = [] for step in range(args.num_steps): episode_length += 1 value, logit, (hx, cx) = model((state.unsqueeze(0), (hx, cx))) prob = F.softmax(logit, dim=-1) log_prob = F.log_softmax(logit, dim=-1) # H(p) = - sum_x p(x).log(p(x)) entropy = -(log_prob * prob).sum(1, keepdim=True) entropies.append(entropy) # 根据动作概率密度函数进行采样 action = prob.multinomial(num_samples=1).detach() log_prob = log_prob.gather(1, action) state, reward, done, _ = env.step(action.numpy()) done = done or episode_length >= args.max_episode_length # reward裁剪 reward = max(min(reward, 1), -1) with lock: counter.value += 1 if done: episode_length = 0 state = env.reset() state = torch.from_numpy(state) values.append(value) log_probs.append(log_prob) rewards.append(reward) if done: break R = torch.zeros(1, 1) if not done: value, _, _ = model((state.unsqueeze(0), (hx, cx))) R = value.detach() values.append(R) policy_loss = 0 value_loss = 0 gae = torch.zeros(1, 1) for i in reversed(range(len(rewards))): R = args.gamma * R + rewards[i] advantage = R - values[i] value_loss = value_loss + 0.5 * advantage.pow(2) # GAE实现 delta_t = rewards[i] + args.gamma * \ values[i + 1] - values[i] gae = gae * args.gamma * args.gae_lambda + delta_t policy_loss = policy_loss - \ log_probs[i] * gae.detach() - args.entropy_coef * entropies[i] optimizer.zero_grad() (policy_loss + args.value_loss_coef * value_loss).backward() # 梯度裁剪,0~40 torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm) ensure_shared_grads(model, shared_model) optimizer.step() if counter.value > args.max_steps: break print('training ends')7. 
训练超参数设置为快速验证训练代码,默认进程数 num-processes 设置为4,最大训练步数max-steps设置为20000。若需要获得较好训练效果,num-processes设置为16,max-steps设置为2000000,其他参数保持默认。# 超参数设置参考 https://github.com/pytorch/examples/tree/master/mnist_hogwild parser = argparse.ArgumentParser(description='A3C') parser.add_argument('--lr', type=float, default=0.0001, help='learning rate (default: 0.0001)') # 学习率 parser.add_argument('--gamma', type=float, default=0.99, help='discount factor for rewards (default: 0.99)') # 奖励折扣系数 parser.add_argument('--gae-lambda', type=float, default=1.00, help='lambda parameter for GAE (default: 1.00)') # GAE系数 parser.add_argument('--entropy-coef', type=float, default=0.01, help='entropy term coefficient (default: 0.01)') # 熵系数 parser.add_argument('--value-loss-coef', type=float, default=0.5, help='value loss coefficient (default: 0.5)') # 价值函数折扣系数 parser.add_argument('--max-grad-norm', type=float, default=50, help='clip the global norm of gradients (default: 50)') # 梯度裁剪 parser.add_argument('--seed', type=int, default=1, help='random seed (default: 1)') # 随机种子 parser.add_argument('--num-processes', type=int, default=4, help='how many training processes to use (default: 4)') # 并行进程数 parser.add_argument('--num-steps', type=int, default=20, help='number of forward steps in A3C (default: 20)') # 每次采样步数 parser.add_argument('--max-episode-length', type=int, default=1000000, help='maximum length of an episode (default: 1000000)') # 最大回合步数 parser.add_argument('--env-name', default='PongDeterministic-v4', help='environment to train on (default: PongDeterministic-v4)') # 环境名称 parser.add_argument('--no-shared', default=False, help='use an optimizer without shared momentum.') # 共享参数的优化器 parser.add_argument('--max-steps', type=int, default=20000, help='number of max steps for training (default: 20000)') # 最大训练步数 # 保证每个核心运行单个进程 os.environ['OMP_NUM_THREADS'] = '1' # 本案例多进程运行在CPU上 os.environ['CUDA_VISIBLE_DEVICES'] = "" args = parser.parse_args(args=[])8. 
启动多进程训练torch.manual_seed(args.seed) # 创建预处理后的环境 env = create_atari_env(args.env_name) # 创建共享参数的模型 shared_model = ActorCritic(env.observation_space.shape[0], env.action_space) # 将共享模型存储在共享内存上 shared_model.share_memory() # 创建共享参数的ADAM优化器 if args.no_shared: optimizer = None else: optimizer = SharedAdam(shared_model.parameters(), lr=args.lr) optimizer.share_memory() processes = [] counter = mp.Value('i', 0) lock = mp.Lock() p = mp.Process(target=test, args=(args.num_processes, args, shared_model, counter)) p.start() processes.append(p) for rank in range(0, args.num_processes): p = mp.Process(target=train, args=(rank, args, shared_model, counter, lock, optimizer)) p.start() processes.append(p) for p in processes: p.join()Time 00h 01m 33s, num steps 3977, FPS 43, episode reward -21.0, episode length 812 Time 00h 01m 45s, num steps 4584, FPS 43, episode reward -2.0, episode length 102 Time 00h 01m 57s, num steps 5185, FPS 44, episode reward -2.0, episode length 102 Time 00h 02m 09s, num steps 5750, FPS 44, episode reward -2.0, episode length 102 Time 00h 02m 22s, num steps 6343, FPS 45, episode reward -2.0, episode length 102 Time 00h 02m 34s, num steps 6904, FPS 45, episode reward -2.0, episode length 102 Time 00h 02m 46s, num steps 7519, FPS 45, episode reward -2.0, episode length 102 Time 00h 02m 59s, num steps 8099, FPS 45, episode reward -2.0, episode length 100 Time 00h 03m 11s, num steps 8714, FPS 45, episode reward -2.0, episode length 100 Time 00h 03m 24s, num steps 9306, FPS 46, episode reward -2.0, episode length 100 Time 00h 03m 36s, num steps 9913, FPS 46, episode reward -2.0, episode length 100 Time 00h 03m 48s, num steps 10477, FPS 46, episode reward -2.0, episode length 100 Time 00h 04m 00s, num steps 11065, FPS 46, episode reward -2.0, episode length 100 Time 00h 04m 13s, num steps 11644, FPS 46, episode reward -2.0, episode length 100 Time 00h 04m 25s, num steps 12237, FPS 46, episode reward -2.0, episode length 100 Time 00h 04m 37s, num steps 12834, FPS 46, episode reward -2.0, episode length 100 Time 00h 04m 49s, num steps 13429, FPS 46, episode reward -2.0, episode length 100 Time 00h 05m 02s, num steps 14033, FPS 46, episode reward -2.0, episode length 100 Time 00h 05m 14s, num steps 14636, FPS 46, episode reward -2.0, episode length 100 Time 00h 05m 27s, num steps 15248, FPS 47, episode reward -2.0, episode length 100 Time 00h 05m 39s, num steps 15828, FPS 47, episode reward -2.0, episode length 100 Time 00h 05m 52s, num steps 16448, FPS 47, episode reward -2.0, episode length 100 Time 00h 06m 04s, num steps 17049, FPS 47, episode reward -2.0, episode length 100 Time 00h 06m 16s, num steps 17624, FPS 47, episode reward -2.0, episode length 100 Time 00h 06m 28s, num steps 18183, FPS 47, episode reward -2.0, episode length 100 Time 00h 06m 40s, num steps 18796, FPS 47, episode reward -2.0, episode length 100 Time 00h 06m 52s, num steps 19390, FPS 47, episode reward -2.0, episode length 100 Time 00h 07m 04s, num steps 19966, FPS 47, episode reward -2.0, episode length 100 training ends testing ends training ends training ends training ends9. 
加载模型进行可视化游戏推理展现训练效果,加载在num-processes=16, max_steps=2000000参数下设置的预训练模型,14min内可以快速解决ATARI Pong。def infer(args): torch.manual_seed(args.seed) env = create_atari_env(args.env_name) env.seed(args.seed) model = ActorCritic(env.observation_space.shape[0], env.action_space) model.load_state_dict(torch.load("model/Pong_A3C_pretrained.pth")) model.eval() state = env.reset() img = plt.imshow(env.render(mode='rgb_array')) state = torch.from_numpy(state) reward_sum = 0 done = True start_time = time.time() # 防止agent陷入局部最优 actions = deque(maxlen=100) episode_length = 0 iters = 0 while iters < 1 : episode_length += 1 if done: cx = torch.zeros(1, 256) hx = torch.zeros(1, 256) else: cx = cx.detach() hx = hx.detach() with torch.no_grad(): value, logit, (hx, cx) = model((state.unsqueeze(0), (hx, cx))) prob = F.softmax(logit, dim=-1) action = prob.max(1, keepdim=True)[1].numpy() state, reward, done, _ = env.step(action[0, 0]) img.set_data(env.render(mode='rgb_array')) display.display(plt.gcf()) display.clear_output(wait=True) done = done or episode_length >= args.max_episode_length reward_sum += reward # 防止agent陷入局部最优 actions.append(action[0, 0]) if actions.count(actions[0]) == actions.maxlen: done = True if done: reward_sum = 0 episode_length = 0 actions.clear() state = env.reset() iters += 1 state = torch.from_numpy(state)完成21局对战后结束infer(args)10. 作业请你调整步骤7中的训练参数,重新训练一个模型,使它在游戏中获得更好的表现。
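补充说明(示意):结合第 6 步 train() 中的实现,A3C 的损失函数可以示意性地写成下面的形式(其中 $\gamma$ 为折扣率 gamma,$\lambda$ 为 GAE 系数 gae-lambda,$c_v$、$c_e$ 分别对应第 7 步中的 value-loss-coef 与 entropy-coef):

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad A_t^{\mathrm{GAE}} = \sum_{k \ge 0} (\gamma\lambda)^k \,\delta_{t+k}$$

$$L_{\text{policy}} = -\sum_t \Big[ \log \pi(a_t \mid s_t)\, A_t^{\mathrm{GAE}} + c_e\, H\big(\pi(\cdot \mid s_t)\big) \Big], \qquad L_{\text{value}} = \tfrac{1}{2} \sum_t \big( R_t - V(s_t) \big)^2$$

$$L_{\text{total}} = L_{\text{policy}} + c_v\, L_{\text{value}}$$

其中 $R_t$ 为自举得到的折扣回报。训练时各 worker 独立计算 $L_{\text{total}}$ 的梯度并通过 ensure_shared_grads 写入共享模型,由共享内存上的 Adam 优化器执行更新。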
  • [技术干货] 使用DPPO算法控制“倒立摆”
    使用DPPO算法控制“倒立摆”实验目标通过本案例的学习和课后作业的练习:了解DPPO基本概念了解如何基于DPPO训练一个控制类问题了解强化学习训练推理控制类问题的整体流程你也可以将本案例相关的 ipynb 学习笔记分享到 AI Gallery Notebook 版块获得成长值,分享方法请查看此文档。案例内容介绍倒立摆(Pendulum)摆动问题是控制文献中的经典问题。在我们本节用DPPO解决的Pendulum-v0问题中,钟摆从一个随机位置开始围绕一个端点摆动,目标是把钟摆向上摆动,并且是钟摆保持直立。一个随机动作的倒立摆demo如下所示:整体流程:安装基础依赖->创建倒立摆环境->构建DPPO算法->训练->推理->可视化效果Distributed Proximal Policy Optimization (DPPO) 算法的基本结构DPPO算法是在Proximal Policy Optimization(PPO)算法基础上发展而来,相关PPO算法请看 使用PPO算法玩“超级马里奥兄弟”,我们在这一教程中有详细的介绍。DPPO借鉴A3C的并行方法,使用多个workers并行地在不同的环境中收集数据,并根据采集的数据计算梯度,将梯度发送给一个全局chief,全局chief在拿到一定数量的梯度数据之后进行网络更新,更新时workers停止采集等待等下完毕,更新完毕之后workers重新使用最新的网络采集数据。下面我们使用论文中的伪代码介绍DPPO的具体算法细节。上述算法所示为全局PPO的伪代码,其中 W 是workers的数目,D 是用于更新参数的works数量阈值,M,B是给定一批数据点的具有policy网络和critic网络更新的子迭代数,θ, Φ为policy网络,critic网络的参数。上述算法所示为workers的伪代码,其中T是在计算参数更新之前收集的每个工作节点的数据点数,K是计算K步返回和通过时间截断的反向道具的时间步数(对于RNN)。 该部分算法基于PPO,首先采集数据,根据PPO算法计算梯度并将梯度发送给全局chief,等待全局chief更新完毕参数再进行数据的采集。DPPO论文代码部分参考GitHub开源项目注意事项本案例运行环境为 TensorFlow-2.0.0,且需使用 GPU 运行,请查看《ModelAtrs JupyterLab 硬件规格使用指南》了解切换硬件规格的方法;如果您是第一次使用 JupyterLab,请查看《ModelAtrs JupyterLab使用指导》了解使用方法;如果您在使用 JupyterLab 过程中碰到报错,请参考《ModelAtrs JupyterLab常见问题解决办法》尝试解决问题。实验步骤1. 程序初始化第1步:安装基础依赖!pip install tensorflow==2.0.0 !pip install tensorflow-probability==0.7.0 !pip install tensorlayer==2.1.0 --ignore-installed !pip install h5py==2.10.0 !pip install gym第2步:导入相关的库import os import time import queue import threading import gym import matplotlib.pyplot as plt import numpy as np import tensorflow as tf import tensorflow_probability as tfp import tensorlayer as tl2. 训练参数初始化本案例设置的 训练最大局数 EP_MAX = 1000,可以达到较好的训练效果,训练耗时约20分钟。你也可以调小 EP_MAX 的值,以便快速跑通代码。RANDOMSEED = 1 # 随机数种子 EP_MAX = 100 # 训练总局数 EP_LEN = 200 # 一局最长长度 GAMMA = 0.9 # 折扣率 A_LR = 0.0001 # actor学习率 C_LR = 0.0002 # critic学习率 BATCH = 32 # batchsize大小 A_UPDATE_STEPS = 10 # actor更新步数 C_UPDATE_STEPS = 10 # critic更新步数 S_DIM, A_DIM = 3, 1 # state维度, action维度 EPS = 1e-8 # epsilon值 # PPO1 和PPO2 的参数,你可以选择用PPO1 (METHOD[0]),还是PPO2 (METHOD[1]) METHOD = [ dict(name='kl_pen', kl_target=0.01, lam=0.5), # KL penalty dict(name='clip', epsilon=0.2), # Clipped surrogate objective, find this is better ][1] # choose the method for optimization N_WORKER = 4 # 并行workers数目 MIN_BATCH_SIZE = 64 # 更新PPO的minibatch大小 UPDATE_STEP = 10 # 每隔10steps更新一次3. 创建环境本环境为gym内置的Pendulum,倒立摆倒下即失败。env_name = 'Pendulum-v0' # environment name4. 
定义DPPO算法DPPO算法-PPO算法class PPO(object): ''' PPO class ''' def __init__(self): # 创建critic tfs = tl.layers.Input([None, S_DIM], tf.float32, 'state') l1 = tl.layers.Dense(100, tf.nn.relu)(tfs) v = tl.layers.Dense(1)(l1) self.critic = tl.models.Model(tfs, v) self.critic.train() # 创建actor self.actor = self._build_anet('pi', trainable=True) self.actor_old = self._build_anet('oldpi', trainable=False) self.actor_opt = tf.optimizers.Adam(A_LR) self.critic_opt = tf.optimizers.Adam(C_LR) # 更新actor def a_train(self, tfs, tfa, tfadv): ''' Update policy network :param tfs: state :param tfa: act :param tfadv: advantage :return: ''' tfs = np.array(tfs, np.float32) tfa = np.array(tfa, np.float32) tfadv = np.array(tfadv, np.float32) # td-error with tf.GradientTape() as tape: mu, sigma = self.actor(tfs) pi = tfp.distributions.Normal(mu, sigma) mu_old, sigma_old = self.actor_old(tfs) oldpi = tfp.distributions.Normal(mu_old, sigma_old) ratio = pi.prob(tfa) / (oldpi.prob(tfa) + EPS) surr = ratio * tfadv ## PPO1 if METHOD['name'] == 'kl_pen': tflam = METHOD['lam'] kl = tfp.distributions.kl_divergence(oldpi, pi) kl_mean = tf.reduce_mean(kl) aloss = -(tf.reduce_mean(surr - tflam * kl)) ## PPO2 else: aloss = -tf.reduce_mean( tf.minimum(surr, tf.clip_by_value(ratio, 1. - METHOD['epsilon'], 1. + METHOD['epsilon']) * tfadv) ) a_gard = tape.gradient(aloss, self.actor.trainable_weights) self.actor_opt.apply_gradients(zip(a_gard, self.actor.trainable_weights)) if METHOD['name'] == 'kl_pen': return kl_mean # 更新old_pi def update_old_pi(self): ''' Update old policy parameter :return: None ''' for p, oldp in zip(self.actor.trainable_weights, self.actor_old.trainable_weights): oldp.assign(p) # 更新critic def c_train(self, tfdc_r, s): ''' Update actor network :param tfdc_r: cumulative reward :param s: state :return: None ''' tfdc_r = np.array(tfdc_r, dtype=np.float32) with tf.GradientTape() as tape: advantage = tfdc_r - self.critic(s) # 计算advantage:V(s') * gamma + r - V(s) closs = tf.reduce_mean(tf.square(advantage)) grad = tape.gradient(closs, self.critic.trainable_weights) self.critic_opt.apply_gradients(zip(grad, self.critic.trainable_weights)) # 计算advantage:V(s') * gamma + r - V(s) def cal_adv(self, tfs, tfdc_r): ''' Calculate advantage :param tfs: state :param tfdc_r: cumulative reward :return: advantage ''' tfdc_r = np.array(tfdc_r, dtype=np.float32) advantage = tfdc_r - self.critic(tfs) return advantage.numpy() def update(self): ''' Update parameter with the constraint of KL divergent :return: None ''' global GLOBAL_UPDATE_COUNTER while not COORD.should_stop(): # 如果协调器没有停止 if GLOBAL_EP < EP_MAX: # EP_MAX是最大更新次数 UPDATE_EVENT.wait() # PPO进程的等待位置 self.update_old_pi() # copy pi to old pi data = [QUEUE.get() for _ in range(QUEUE.qsize())] # collect data from all workers data = np.vstack(data) s, a, r = data[:, :S_DIM].astype(np.float32), \ data[:, S_DIM: S_DIM + A_DIM].astype(np.float32), \ data[:, -1:].astype(np.float32) adv = self.cal_adv(s, r) # update actor ## PPO1 if METHOD['name'] == 'kl_pen': for _ in range(A_UPDATE_STEPS): kl = self.a_train(s, a, adv) if kl > 4 * METHOD['kl_target']: # this in in google's paper break if kl < METHOD['kl_target'] / 1.5: # adaptive lambda, this is in OpenAI's paper METHOD['lam'] /= 2 elif kl > METHOD['kl_target'] * 1.5: METHOD['lam'] *= 2 # sometimes explode, this clipping is MorvanZhou's solution METHOD['lam'] = np.clip(METHOD['lam'], 1e-4, 10) ## PPO2 else: # clipping method, find this is better (OpenAI's paper) for _ in range(A_UPDATE_STEPS): self.a_train(s, a, adv) # 更新critic for _ in 
range(C_UPDATE_STEPS): self.c_train(r, s) UPDATE_EVENT.clear() # updating finished GLOBAL_UPDATE_COUNTER = 0 # reset counter ROLLING_EVENT.set() # set roll-out available # 构建actor网络 def _build_anet(self, name, trainable): ''' Build policy network :param name: name :param trainable: trainable flag :return: policy network ''' tfs = tl.layers.Input([None, S_DIM], tf.float32, name + '_state') l1 = tl.layers.Dense(100, tf.nn.relu, name=name + '_l1')(tfs) a = tl.layers.Dense(A_DIM, tf.nn.tanh, name=name + '_a')(l1) mu = tl.layers.Lambda(lambda x: x * 2, name=name + '_lambda')(a) sigma = tl.layers.Dense(A_DIM, tf.nn.softplus, name=name + '_sigma')(l1) model = tl.models.Model(tfs, [mu, sigma], name) if trainable: model.train() else: model.eval() return model # 选择动作 def choose_action(self, s): ''' Choose action :param s: state :return: clipped act ''' s = s[np.newaxis, :].astype(np.float32) mu, sigma = self.actor(s) pi = tfp.distributions.Normal(mu, sigma) a = tf.squeeze(pi.sample(1), axis=0)[0] # choosing action return np.clip(a, -2, 2) # 计算V() def get_v(self, s): ''' Compute value :param s: state :return: value ''' s = s.astype(np.float32) if s.ndim < 2: s = s[np.newaxis, :] return self.critic(s)[0, 0] def save_ckpt(self): """ save trained weights :return: None """ if not os.path.exists('model_Pendulum'): os.makedirs('model_Pendulum') tl.files.save_weights_to_hdf5('model_Pendulum/dppo_actor.hdf5', self.actor) tl.files.save_weights_to_hdf5('model_Pendulum/dppo_actor_old.hdf5', self.actor_old) tl.files.save_weights_to_hdf5('model_Pendulum/dppo_critic.hdf5', self.critic) def load_ckpt(self): """ load trained weights :return: None """ tl.files.load_hdf5_to_weights_in_order('model_Pendulum/dppo_actor.hdf5', self.actor) tl.files.load_hdf5_to_weights_in_order('model_Pendulum/dppo_actor_old.hdf5', self.actor_old) tl.files.load_hdf5_to_weights_in_order('model_Pendulum/dppo_critic.hdf5', self.critic)workers构建class Worker(object): ''' Worker class for distributional running ''' def __init__(self, wid): self.wid = wid # 工号 self.env = gym.make(env_name).unwrapped # 创建环境 self.env.seed(wid * 100 + RANDOMSEED) # 设置不同的随机种子,因为不希望每个worker的都一致 self.ppo = GLOBAL_PPO # 算法 def work(self): ''' Define a worker :return: None ''' global GLOBAL_EP, GLOBAL_RUNNING_R, GLOBAL_UPDATE_COUNTER while not COORD.should_stop(): # 从COORD接受消息,看看是否应该should_stop s = self.env.reset() ep_r = 0 buffer_s, buffer_a, buffer_r = [], [], [] # 记录data t0 = time.time() for t in range(EP_LEN): # 看是否正在被更新。PPO进程正在工作,那么就在这里等待 if not ROLLING_EVENT.is_set(): # 查询进程是否被阻塞,如果在阻塞状态,就证明如果global PPO正在更新。否则就可以继续。 ROLLING_EVENT.wait() # worker进程的等待位置。wait until PPO is updated buffer_s, buffer_a, buffer_r = [], [], [] # clear history buffer, use new policy to collect data # 正常跑游戏,并搜集数据 a = self.ppo.choose_action(s) s_, r, done, _ = self.env.step(a) buffer_s.append(s) buffer_a.append(a) buffer_r.append((r + 8) / 8) # normalize reward, find to be useful s = s_ ep_r += r # GLOBAL_UPDATE_COUNTER是每个work的在游戏中进行一步,也就是产生一条数据就会+1. 
# 当GLOBAL_UPDATE_COUNTER大于batch(64)的时候,就可以进行更新。 GLOBAL_UPDATE_COUNTER += 1 # count to minimum batch size, no need to wait other workers if t == EP_LEN - 1 or GLOBAL_UPDATE_COUNTER >= MIN_BATCH_SIZE: # t == EP_LEN - 1 是最后一步 ## 计算每个状态对应的V(s') ## 要注意,这里的len(buffer) < GLOBAL_UPDATE_COUNTER。所以数据是每个worker各自计算的。 v_s_ = self.ppo.get_v(s_) discounted_r = [] # compute discounted reward for r in buffer_r[::-1]: v_s_ = r + GAMMA * v_s_ discounted_r.append(v_s_) discounted_r.reverse() ## 堆叠成数据,并保存到公共队列中。 bs, ba, br = np.vstack(buffer_s), np.vstack(buffer_a), np.array(discounted_r)[:, np.newaxis] buffer_s, buffer_a, buffer_r = [], [], [] QUEUE.put(np.hstack((bs, ba, br))) # put data in the queue # 如果数据足够,就开始更新 if GLOBAL_UPDATE_COUNTER >= MIN_BATCH_SIZE: ROLLING_EVENT.clear() # stop collecting data UPDATE_EVENT.set() # global PPO update if GLOBAL_EP >= EP_MAX: # stop training COORD.request_stop() # 停止更新 break # record reward changes, plot later if len(GLOBAL_RUNNING_R) == 0: GLOBAL_RUNNING_R.append(ep_r) else: GLOBAL_RUNNING_R.append(GLOBAL_RUNNING_R[-1] * 0.9 + ep_r * 0.1) GLOBAL_EP += 1 print( 'Episode: {}/{} | Worker: {} | Episode Reward: {:.4f} | Running Time: {:.4f}'.format( GLOBAL_EP, EP_MAX, self.wid, ep_r, time.time() - t0 ) )5. 模型训练np.random.seed(RANDOMSEED) tf.random.set_seed(RANDOMSEED) GLOBAL_PPO = PPO()[TL] Input state: [None, 3] [TL] Dense dense_1: 100 relu [TL] Dense dense_2: 1 No Activation [TL] Input pi_state: [None, 3] [TL] Dense pi_l1: 100 relu [TL] Dense pi_a: 1 tanh [TL] Lambda pi_lambda: func: <function PPO._build_anet.<locals>.<lambda> at 0x7fb65d633950>, len_weights: 0 [TL] Dense pi_sigma: 1 softplus [TL] Input oldpi_state: [None, 3] [TL] Dense oldpi_l1: 100 relu [TL] Dense oldpi_a: 1 tanh [TL] Lambda oldpi_lambda: func: <function PPO._build_anet.<locals>.<lambda> at 0x7fb65d633a70>, len_weights: 0 [TL] Dense oldpi_sigma: 1 softplus# 定义两组不同的事件,update 和 rolling UPDATE_EVENT, ROLLING_EVENT = threading.Event(), threading.Event() UPDATE_EVENT.clear() # not update now,相当于把标志位设置为False ROLLING_EVENT.set() # start to roll out,相当于把标志位设置为True,并通知所有处于等待阻塞状态的线程恢复运行状态。 # 创建workers workers = [Worker(wid=i) for i in range(N_WORKER)] GLOBAL_UPDATE_COUNTER, GLOBAL_EP = 0, 0 # 全局更新次数计数器,全局EP计数器 GLOBAL_RUNNING_R = [] # 记录动态的reward,看成绩 COORD = tf.train.Coordinator() # 创建tensorflow的协调器 QUEUE = queue.Queue() # workers putting data in this queue threads = [] # 为每个worker创建进程 for worker in workers: # worker threads t = threading.Thread(target=worker.work, args=()) # 创建进程 t.start() # 开始进程 threads.append(t) # 把进程放到进程列表中,方便管理 # add a PPO updating thread # 把一个全局的PPO更新放到进程列表最后。 threads.append(threading.Thread(target=GLOBAL_PPO.update, )) threads[-1].start() COORD.join(threads) # 把进程列表交给协调器管理 GLOBAL_PPO.save_ckpt() # 保存全局参数 # plot reward change and test plt.title('DPPO') plt.plot(np.arange(len(GLOBAL_RUNNING_R)), GLOBAL_RUNNING_R) plt.xlabel('Episode') plt.ylabel('Moving reward') plt.ylim(-2000, 0) plt.show()Episode: 1/100 | Worker: 1 | Episode Reward: -965.6343 | Running Time: 3.1675 Episode: 2/100 | Worker: 2 | Episode Reward: -1443.1138 | Running Time: 3.1689 Episode: 3/100 | Worker: 3 | Episode Reward: -1313.6248 | Running Time: 3.1734 Episode: 4/100 | Worker: 0 | Episode Reward: -1403.1952 | Running Time: 3.1819 Episode: 5/100 | Worker: 1 | Episode Reward: -1399.3963 | Running Time: 3.2429 Episode: 6/100 | Worker: 2 | Episode Reward: -1480.8439 | Running Time: 3.2453 Episode: 7/100 | Worker: 0 | Episode Reward: -1489.4195 | Running Time: 3.2373 Episode: 8/100 | Worker: 3 | Episode Reward: -1339.0517 | 
Running Time: 3.2583 Episode: 9/100 | Worker: 1 | Episode Reward: -1600.1292 | Running Time: 3.2478 Episode: 10/100 | Worker: 0 | Episode Reward: -1513.2170 | Running Time: 3.2584 Episode: 11/100 | Worker: 2 | Episode Reward: -1461.7279 | Running Time: 3.2697 Episode: 12/100 | Worker: 3 | Episode Reward: -1480.2685 | Running Time: 3.2598 Episode: 13/100 | Worker: 0 | Episode Reward: -1831.5374 | Running Time: 3.2423Episode: 14/100 | Worker: 1 | Episode Reward: -1524.8253 | Running Time: 3.2635 Episode: 15/100 | Worker: 2 | Episode Reward: -1383.4878 | Running Time: 3.2556 Episode: 16/100 | Worker: 3 | Episode Reward: -1288.9392 | Running Time: 3.2588 Episode: 17/100 | Worker: 1 | Episode Reward: -1657.2223 | Running Time: 3.2377 Episode: 18/100 | Worker: 0 | Episode Reward: -1472.2335 | Running Time: 3.2678 Episode: 19/100 | Worker: 2 | Episode Reward: -1475.5421 | Running Time: 3.2667 Episode: 20/100 | Worker: 3 | Episode Reward: -1532.7678 | Running Time: 3.2739 Episode: 21/100 | Worker: 1 | Episode Reward: -1575.5706 | Running Time: 3.2688 Episode: 22/100 | Worker: 2 | Episode Reward: -1238.4006 | Running Time: 3.2303 Episode: 23/100 | Worker: 0 | Episode Reward: -1630.9554 | Running Time: 3.2584 Episode: 24/100 | Worker: 3 | Episode Reward: -1610.7237 | Running Time: 3.2601 Episode: 25/100 | Worker: 1 | Episode Reward: -1516.5440 | Running Time: 3.2683 Episode: 26/100 | Worker: 0 | Episode Reward: -1547.6209 | Running Time: 3.2589 Episode: 27/100 | Worker: 2 | Episode Reward: -1328.2584 | Running Time: 3.2762 Episode: 28/100 | Worker: 3 | Episode Reward: -1191.0914 | Running Time: 3.2552 Episode: 29/100 | Worker: 1 | Episode Reward: -1415.3608 | Running Time: 3.2804 Episode: 30/100 | Worker: 0 | Episode Reward: -1765.8007 | Running Time: 3.2767 Episode: 31/100 | Worker: 2 | Episode Reward: -1756.5872 | Running Time: 3.3078 Episode: 32/100 | Worker: 3 | Episode Reward: -1428.0094 | Running Time: 3.2815 Episode: 33/100 | Worker: 1 | Episode Reward: -1605.7720 | Running Time: 3.3010 Episode: 34/100 | Worker: 0 | Episode Reward: -1247.7492 | Running Time: 3.3115Episode: 35/100 | Worker: 2 | Episode Reward: -1333.9553 | Running Time: 3.2759 Episode: 36/100 | Worker: 3 | Episode Reward: -1485.7453 | Running Time: 3.2749 Episode: 37/100 | Worker: 3 | Episode Reward: -1341.3090 | Running Time: 3.2323 Episode: 38/100 | Worker: 2 | Episode Reward: -1472.5245 | Running Time: 3.2595 Episode: 39/100 | Worker: 0 | Episode Reward: -1583.6614 | Running Time: 3.2721 Episode: 40/100 | Worker: 1 | Episode Reward: -1358.4421 | Running Time: 3.2925 Episode: 41/100 | Worker: 3 | Episode Reward: -1744.7500 | Running Time: 3.2391 Episode: 42/100 | Worker: 2 | Episode Reward: -1684.8821 | Running Time: 3.2527 Episode: 43/100 | Worker: 1 | Episode Reward: -1412.0231 | Running Time: 3.2400 Episode: 44/100 | Worker: 0 | Episode Reward: -1437.6130 | Running Time: 3.2458 Episode: 45/100 | Worker: 3 | Episode Reward: -1461.7901 | Running Time: 3.2872 Episode: 46/100 | Worker: 2 | Episode Reward: -1572.6255 | Running Time: 3.2710 Episode: 47/100 | Worker: 0 | Episode Reward: -1704.6351 | Running Time: 3.2762 Episode: 48/100 | Worker: 1 | Episode Reward: -1538.4030 | Running Time: 3.3117 Episode: 49/100 | Worker: 3 | Episode Reward: -1554.7941 | Running Time: 3.2881 Episode: 50/100 | Worker: 2 | Episode Reward: -1796.0786 | Running Time: 3.2718 Episode: 51/100 | Worker: 0 | Episode Reward: -1877.3152 | Running Time: 3.2804 Episode: 52/100 | Worker: 1 | Episode Reward: -1749.8780 | Running Time: 3.2779 Episode: 53/100 
| Worker: 3 | Episode Reward: -1486.8338 | Running Time: 3.1559 Episode: 54/100 | Worker: 2 | Episode Reward: -1540.8134 | Running Time: 3.2903 Episode: 55/100 | Worker: 0 | Episode Reward: -1596.7365 | Running Time: 3.3156 Episode: 56/100 | Worker: 1 | Episode Reward: -1644.7888 | Running Time: 3.3065 Episode: 57/100 | Worker: 3 | Episode Reward: -1514.0685 | Running Time: 3.2920 Episode: 58/100 | Worker: 2 | Episode Reward: -1411.2714 | Running Time: 3.1554 Episode: 59/100 | Worker: 0 | Episode Reward: -1602.3725 | Running Time: 3.2737 Episode: 60/100 | Worker: 1 | Episode Reward: -1579.8769 | Running Time: 3.3140 Episode: 61/100 | Worker: 3 | Episode Reward: -1360.7916 | Running Time: 3.2856 Episode: 62/100 | Worker: 2 | Episode Reward: -1490.7107 | Running Time: 3.2861 Episode: 63/100 | Worker: 0 | Episode Reward: -1775.7557 | Running Time: 3.2644 Episode: 64/100 | Worker: 1 | Episode Reward: -1491.0894 | Running Time: 3.2828 Episode: 65/100 | Worker: 0 | Episode Reward: -1428.8124 | Running Time: 3.1239 Episode: 66/100 | Worker: 2 | Episode Reward: -1493.7703 | Running Time: 3.2680 Episode: 67/100 | Worker: 3 | Episode Reward: -1658.3558 | Running Time: 3.2853 Episode: 68/100 | Worker: 1 | Episode Reward: -1605.9077 | Running Time: 3.2911 Episode: 69/100 | Worker: 2 | Episode Reward: -1374.3309 | Running Time: 3.3644 Episode: 70/100 | Worker: 0 | Episode Reward: -1283.5023 | Running Time: 3.3819 Episode: 71/100 | Worker: 3 | Episode Reward: -1346.1850 | Running Time: 3.3860 Episode: 72/100 | Worker: 1 | Episode Reward: -1222.1988 | Running Time: 3.3724 Episode: 73/100 | Worker: 2 | Episode Reward: -1199.1266 | Running Time: 3.2739 Episode: 74/100 | Worker: 0 | Episode Reward: -1207.3161 | Running Time: 3.2670 Episode: 75/100 | Worker: 3 | Episode Reward: -1302.0207 | Running Time: 3.2562 Episode: 76/100 | Worker: 1 | Episode Reward: -1233.3584 | Running Time: 3.2892 Episode: 77/100 | Worker: 2 | Episode Reward: -964.8099 | Running Time: 3.2339 Episode: 78/100 | Worker: 0 | Episode Reward: -1208.2836 | Running Time: 3.2602 Episode: 79/100 | Worker: 3 | Episode Reward: -1149.2154 | Running Time: 3.2579 Episode: 80/100 | Worker: 1 | Episode Reward: -1219.3229 | Running Time: 3.2321 Episode: 81/100 | Worker: 2 | Episode Reward: -1097.7572 | Running Time: 3.2995 Episode: 82/100 | Worker: 3 | Episode Reward: -940.7949 | Running Time: 3.2981 Episode: 83/100 | Worker: 0 | Episode Reward: -1395.6272 | Running Time: 3.3076 Episode: 84/100 | Worker: 1 | Episode Reward: -1092.5180 | Running Time: 3.2936 Episode: 85/100 | Worker: 2 | Episode Reward: -1369.8868 | Running Time: 3.2517 Episode: 86/100 | Worker: 0 | Episode Reward: -1380.5247 | Running Time: 3.2390 Episode: 87/100 | Worker: 3 | Episode Reward: -1413.2114 | Running Time: 3.2740 Episode: 88/100 | Worker: 1 | Episode Reward: -1403.9904 | Running Time: 3.2643 Episode: 89/100 | Worker: 2 | Episode Reward: -1098.8470 | Running Time: 3.3078 Episode: 90/100 | Worker: 0 | Episode Reward: -983.4387 | Running Time: 3.3224 Episode: 91/100 | Worker: 3 | Episode Reward: -1056.6701 | Running Time: 3.3059 Episode: 92/100 | Worker: 1 | Episode Reward: -1357.6828 | Running Time: 3.2980 Episode: 93/100 | Worker: 2 | Episode Reward: -1082.3377 | Running Time: 3.3248 Episode: 94/100 | Worker: 3 | Episode Reward: -1052.0146 | Running Time: 3.3291 Episode: 95/100 | Worker: 0 | Episode Reward: -1373.0590 | Running Time: 3.3660 Episode: 96/100 | Worker: 1 | Episode Reward: -1044.4578 | Running Time: 3.3311 Episode: 97/100 | Worker: 2 | Episode Reward: 
-1179.2926 | Running Time: 3.3593 Episode: 98/100 | Worker: 3 | Episode Reward: -1039.1825 | Running Time: 3.3540 Episode: 99/100 | Worker: 0 | Episode Reward: -1193.3356 | Running Time: 3.3599 Episode: 100/100 | Worker: 1 | Episode Reward: -1378.5094 | Running Time: 3.2025 Episode: 101/100 | Worker: 2 | Episode Reward: -30.6317 | Running Time: 0.1128 Episode: 102/100 | Worker: 0 | Episode Reward: -141.0568 | Running Time: 0.2976 Episode: 103/100 | Worker: 3 | Episode Reward: -166.4818 | Running Time: 0.3256 Episode: 104/100 | Worker: 1 | Episode Reward: -123.2953 | Running Time: 0.2683 [TL] [*] Saving TL weights into model_Pendulum/dppo_actor.hdf5 [TL] [*] Saved [TL] [*] Saving TL weights into model_Pendulum/dppo_actor_old.hdf5 [TL] [*] Saved [TL] [*] Saving TL weights into model_Pendulum/dppo_critic.hdf5 [TL] [*] Saved ![](https://bbs-img.huaweicloud.com/blogs/img/20221205/1670209748845266678.png)6. 模型推理Notebook暂时不支持Pendulum可视化,请将下面代码下载到本地,可查看可视化效果。from matplotlib import animation GLOBAL_PPO.load_ckpt() env = gym.make(env_name) s = env.reset() def display_frames_as_gif(frames): patch = plt.imshow(frames[0]) plt.axis('off') def animate(i): patch.set_data(frames[i]) anim = animation.FuncAnimation(plt.gcf(), animate, frames=len(frames), interval=5) anim.save('./DPPO_Pendulum.gif', writer='imagemagick', fps=30) total_reward = 0 frames = [] while True: env.render() frames.append(env.render(mode='rgb_array')) s, r, done, info = env.step(GLOBAL_PPO.choose_action(s)) total_reward += r if done: print('It is over, the window will be closed after 1 second.') time.sleep(1) break env.close() print('Total Reward : %.2f' % total_reward) display_frames_as_gif(frames)7. 模型推理效果如下视频是训练1000 Episode模型的推理效果。8. 作业请你调整步骤2中的训练参数,重新训练一个模型,使它在游戏中获得更好的表现。
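完成第8步作业、用新参数重新训练之后,可以先在无渲染的情况下统计多局平均奖励,量化对比新旧模型的表现。下面是一个简单的评估脚本草稿(仅作示意,假设 GLOBAL_PPO、env_name 等已按上文初始化,局数 10 为示例取值):

```python
import numpy as np
import gym

def evaluate(policy, env_name, episodes=10):
    """用训练好的策略跑若干局,返回平均奖励和标准差,便于对比不同超参数的训练结果。"""
    env = gym.make(env_name)
    rewards = []
    for _ in range(episodes):
        s = env.reset()
        episode_reward, done = 0.0, False
        while not done:
            s, r, done, _ = env.step(policy.choose_action(s))
            episode_reward += r
        rewards.append(episode_reward)
    env.close()
    return np.mean(rewards), np.std(rewards)

# 示例用法(假设 GLOBAL_PPO 已通过 load_ckpt() 加载权重)
# mean_r, std_r = evaluate(GLOBAL_PPO, env_name)
# print('average reward: %.2f ± %.2f' % (mean_r, std_r))
```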
  • [技术干货] 使用DDPG算法控制小车上山
    使用DDPG算法控制小车上山实验目标通过本案例的学习和课后作业的练习:了解DDPG算法的基本概念了解如何基于DDPG训练一个控制类小游戏了解强化学习训练推理游戏的整体流程你也可以将本案例相关的 ipynb 学习笔记分享到 AI Gallery Notebook 版块获得成长值,分享方法请查看此文档。案例内容介绍MountainCarContinuous是连续动作空间的控制游戏,也是OpenAi gym的经典问题。在此游戏中,我们可以向左/向右推动小车,小车若到达山顶,则游戏胜利,越早到达山顶,则分数越高;若999回合后,没有到达山顶,则游戏失败。DDPG全称为Deep Deterministic Policy Gradient,由Google Deepmind发表于ICLR 2016,主要用于解决连续动作空间的问题。 在本案例中,我们将展示如何基于DDPG算法,训练连续动作空间的小车上山问题。整体流程:安装基础依赖->创建mountain_car环境->构建DDPG算法->训练->推理->可视化效果DDPG算法介绍DDPG将深度学习神经网络融合进DPG。相对于DPG的核心改进是:采用卷积神经网络作为策略函数μ和Q函数的模拟,即策略网络和Q网络;然后使用深度学习的方法来训练上述神经网络。DDPG的算法伪代码如下图所示:MountainCarContinuous-v0环境简介MountainCarContinuous-v0,是基于gym的经典控制环境。游戏任务为玩家通过向左/向右推动小车,使之最终到达右边山坡的旗帜处。游戏机制小车初始处于两个山坡中间,而小车的引擎无法提供足够的动力使小车向右直接到达山顶旗帜处。因此,需要玩家在山谷中前后推动小车来积蓄动能,以使小车冲上目标点。小车到达目标点所用动能越少,奖励越高。注意事项本案例运行环境为 Pytorch-1.0.0,且需使用 GPU 运行,请查看《ModelAtrs JupyterLab 硬件规格使用指南》了解切换硬件规格的方法;如果您是第一次使用 JupyterLab,请查看《ModelAtrs JupyterLab使用指导》了解使用方法;如果您在使用 JupyterLab 过程中碰到报错,请参考《ModelAtrs JupyterLab常见问题解决办法》尝试解决问题。实验步骤1.程序初始化第1步:安装基础依赖要确保所有依赖都安装成功后,再执行之后的代码。如果某些模块因为网络原因导致安装失败,直接重试一次即可。!pip install gym !pip install pandas第2步:导入相关的库%matplotlib inline import sys import logging import imp import itertools import copy import moxing as mox import numpy as np np.random.seed(0) import pandas as pd import gym import matplotlib.pyplot as plt import torch torch.manual_seed(0) import torch.nn as nn import torch.optim as optim imp.reload(logging) logging.basicConfig(level=logging.DEBUG, format='%(asctime)s [%(levelname)s] %(message)s', stream=sys.stdout, datefmt='%H:%M:%S')2. 参数设置本案例通过设置 目标奖励值 来判断训练是否结束,"target_reward"为 90 时,模型已达到较好的水平。训练耗时约为10分钟。opt={ "replayer" : 100000, # 经验池大小 "gama" : 0.99, # Q值估计的折扣率 "learn_start" : 5000, # 经验数达到{}时开始学习 "a_lr" : 0.0001, # actor的学习率 "c_lr" : 0.001, # critic的学习率 "soft_lr" : 0.005, # 软更新的学习率 "target_reward" : 90, # 平均每局目标奖励 "batch_size" : 512, # 经验回放的batch_size "mu" : 0., # OU的均值 "sigma" : 0.5, # OU的波动率 "theta" : .15 # OU均值回归的速率 } 3. 游戏环境创建env = gym.make('MountainCarContinuous-v0') env.seed(0) for key in vars(env): logging.info('%s: %s', key, vars(env)[key]) 11:28:18 [INFO] env: <Continuous_MountainCarEnv<MountainCarContinuous-v0>> 11:28:18 [INFO] action_space: Box(-1.0, 1.0, (1,), float32) 11:28:18 [INFO] observation_space: Box(-1.2000000476837158, 0.6000000238418579, (2,), float32) 11:28:18 [INFO] reward_range: (-inf, inf) 11:28:18 [INFO] metadata: {'render.modes': ['human', 'rgb_array'], 'video.frames_per_second': 30} 11:28:18 [INFO] _max_episode_steps: 999 11:28:18 [INFO] _elapsed_steps: None4. 
DDPG算法构建Actor-Critic网络构建class Actor(nn.Module): def __init__(self, nb_states, nb_actions, hidden1=400, hidden2=300): super(Actor, self).__init__() self.fc1 = nn.Linear(nb_states, hidden1) self.fc2 = nn.Linear(hidden1, hidden2) self.fc3 = nn.Linear(hidden2, nb_actions) self.relu = nn.ReLU() self.tanh = nn.Tanh() def forward(self, x): out = self.fc1(x) out = self.relu(out) out = self.fc2(out) out = self.relu(out) out = self.fc3(out) out = self.tanh(out) return out class Critic(nn.Module): def __init__(self, nb_states, nb_actions, hidden1=400, hidden2=300): super(Critic, self).__init__() self.fc1 = nn.Linear(nb_states, hidden1) self.fc2 = nn.Linear(hidden1+nb_actions, hidden2) self.fc3 = nn.Linear(hidden2, 1) self.relu = nn.ReLU() def forward(self, xs): x, a = xs out = self.fc1(x) out = self.relu(out) # debug() out = self.fc2(torch.cat([out,a],1)) out = self.relu(out) out = self.fc3(out) return outOU噪声构建class OrnsteinUhlenbeckProcess: def __init__(self, x0): self.x = x0 def __call__(self, mu=opt["mu"], sigma=opt["sigma"], theta=opt["theta"], dt=0.01): n = np.random.normal(size=self.x.shape) self.x += (theta * (mu - self.x) * dt + sigma * np.sqrt(dt) * n) return self.xreplay buffer构建class Replayer: def __init__(self, capacity): self.memory = pd.DataFrame(index=range(capacity), columns=['observation', 'action', 'reward', 'next_observation', 'done']) self.i = 0 self.count = 0 self.capacity = capacity def store(self, *args): self.memory.loc[self.i] = args self.i = (self.i + 1) % self.capacity self.count = min(self.count + 1, self.capacity) def sample(self, size): indices = np.random.choice(self.count, size=size) return (np.stack(self.memory.loc[indices, field]) for field in self.memory.columns) 创建DDPG核心训练部分class DDPGAgent: def __init__(self, env): state_dim = env.observation_space.shape[0] self.action_dim = env.action_space.shape[0] self.action_low = env.action_space.low[0] self.action_high = env.action_space.high[0] self.gamma = 0.99 mox.file.copy_parallel("obs://modelarts-labs-bj4-v2/course/modelarts/reinforcement_learning/ddpg_mountaincar/model/", "model/") self.replayer = Replayer(opt["replayer"]) self.actor_evaluate_net = Actor(state_dim,self.action_dim) self.actor_evaluate_net.load_state_dict(torch.load('model/actor.pkl')) self.actor_optimizer = optim.Adam(self.actor_evaluate_net.parameters(), lr=opt["a_lr"]) self.actor_target_net = copy.deepcopy(self.actor_evaluate_net) self.critic_evaluate_net = Critic(state_dim,self.action_dim) self.critic_evaluate_net.load_state_dict(torch.load('model/critic.pkl')) self.critic_optimizer = optim.Adam(self.critic_evaluate_net.parameters(), lr=opt["c_lr"]) self.critic_loss = nn.MSELoss() self.critic_target_net = copy.deepcopy(self.critic_evaluate_net) def reset(self, mode=None): self.mode = mode if self.mode == 'train': self.trajectory = [] self.noise = OrnsteinUhlenbeckProcess(np.zeros((self.action_dim,))) def step(self, observation, reward, done): if self.mode == 'train' and self.replayer.count < opt["learn_start"]: action = np.random.randint(self.action_low, self.action_high) else: state_tensor = torch.as_tensor(observation, dtype=torch.float).reshape(1, -1) action_tensor = self.actor_evaluate_net(state_tensor) action = action_tensor.detach().numpy()[0] if self.mode == 'train': noise = self.noise(sigma=0.1) action = (action + noise).clip(self.action_low, self.action_high) self.trajectory += [observation, reward, done, action] if len(self.trajectory) >= 8: state, _, _, act, next_state, reward, done, _ = self.trajectory[-8:] self.replayer.store(state, 
act, reward, next_state, done) if self.replayer.count >= opt["learn_start"]: self.learn() return action def close(self): pass def update_net(self, target_net, evaluate_net, learning_rate=opt["soft_lr"]): for target_param, evaluate_param in zip( target_net.parameters(), evaluate_net.parameters()): target_param.data.copy_(learning_rate * evaluate_param.data + (1 - learning_rate) * target_param.data) def learn(self): # replay states, actions, rewards, next_states, dones = self.replayer.sample(opt["batch_size"]) state_tensor = torch.as_tensor(states, dtype=torch.float) action_tensor = torch.as_tensor(actions, dtype=torch.long) reward_tensor = torch.as_tensor(rewards, dtype=torch.float) dones = dones.astype(int) done_tensor = torch.as_tensor(dones, dtype=torch.float) next_state_tensor = torch.as_tensor(next_states, dtype=torch.float) # learn critic next_action_tensor = self.actor_target_net(next_state_tensor) noise_tensor = (0.2 * torch.randn_like(action_tensor, dtype=torch.float)) noisy_next_action_tensor = (next_action_tensor + noise_tensor).clamp( self.action_low, self.action_high) next_state_action_tensor = [next_state_tensor, noisy_next_action_tensor,] #print(next_state_action_tensor) next_q_tensor = self.critic_target_net(next_state_action_tensor).squeeze(1) #print(next_q_tensor) critic_target_tensor = reward_tensor + (1. - done_tensor) * self.gamma * next_q_tensor critic_target_tensor = critic_target_tensor.detach() state_action_tensor = [state_tensor.float(), action_tensor.float(),] critic_pred_tensor = self.critic_evaluate_net(state_action_tensor).squeeze(1) critic_loss_tensor = self.critic_loss(critic_pred_tensor, critic_target_tensor) self.critic_optimizer.zero_grad() critic_loss_tensor.backward() self.critic_optimizer.step() # learn actor pred_action_tensor = self.actor_evaluate_net(state_tensor) pred_action_tensor = pred_action_tensor.clamp(self.action_low, self.action_high) pred_state_action_tensor = [state_tensor, pred_action_tensor,] critic_pred_tensor = self.critic_evaluate_net(pred_state_action_tensor) actor_loss_tensor = -critic_pred_tensor.mean() self.actor_optimizer.zero_grad() actor_loss_tensor.backward() self.actor_optimizer.step() self.update_net(self.critic_target_net, self.critic_evaluate_net) self.update_net(self.actor_target_net, self.actor_evaluate_net) agent = DDPGAgent(env) 11:28:18 [DEBUG] [hostname=kg-c4f2050c-bd01-44e3-a1b7-35296d3756a4][cid=10850] Ready to call (timestamp=1635305298.8828228): obsClient.getObjectMetadata 11:28:19 [DEBUG] [hostname=kg-c4f2050c-bd01-44e3-a1b7-35296d3756a4][cid=10850] Finish calling (timestamp=1635305299.1487627) 11:28:19 [DEBUG] [hostname=kg-c4f2050c-bd01-44e3-a1b7-35296d3756a4][cid=10850] Ready to call (timestamp=1635305299.1495306): obsClient.listObjects 11:28:19 [DEBUG] [hostname=kg-c4f2050c-bd01-44e3-a1b7-35296d3756a4][cid=10850] Finish calling (timestamp=1635305299.1665623) 11:28:19 [DEBUG] Start to copy 2 files from obs://modelarts-labs-bj4-v2/course/modelarts/reinforcement_learning/ddpg_mountaincar/model to model. 
11:28:19 [DEBUG] [hostname=kg-c4f2050c-bd01-44e3-a1b7-35296d3756a4][cid=11259] Ready to call (timestamp=1635305299.187472): obsClient.getObjectMetadata 11:28:19 [DEBUG] [hostname=kg-c4f2050c-bd01-44e3-a1b7-35296d3756a4][cid=11260] Ready to call (timestamp=1635305299.192197): obsClient.getObjectMetadata 11:28:19 [DEBUG] [hostname=kg-c4f2050c-bd01-44e3-a1b7-35296d3756a4][cid=11260] Finish calling (timestamp=1635305299.2651675) 11:28:19 [DEBUG] [hostname=kg-c4f2050c-bd01-44e3-a1b7-35296d3756a4][cid=11260] Ready to call (timestamp=1635305299.2662349): obsClient.getObject 11:28:19 [DEBUG] [hostname=kg-c4f2050c-bd01-44e3-a1b7-35296d3756a4][cid=11260] Finish calling (timestamp=1635305299.2906594) 11:28:19 [DEBUG] [hostname=kg-c4f2050c-bd01-44e3-a1b7-35296d3756a4][cid=11259] Finish calling (timestamp=1635305299.4189594) 11:28:19 [DEBUG] [hostname=kg-c4f2050c-bd01-44e3-a1b7-35296d3756a4][cid=11259] Ready to call (timestamp=1635305299.4199762): obsClient.getObject 11:28:19 [DEBUG] [hostname=kg-c4f2050c-bd01-44e3-a1b7-35296d3756a4][cid=11259] Finish calling (timestamp=1635305299.430522) 11:28:19 [DEBUG] Copy Successfully. ### 5. 开始训练 训练至奖励值达到90以上大约需要10分钟。 ```python def play_episode(env, agent, max_episode_steps=None, mode=None): observation, reward, done = env.reset(), 0., False agent.reset(mode=mode) episode_reward, elapsed_steps = 0., 0 while True: action = agent.step(observation, reward, done) # 可视化 # env.render() if done: break observation, reward, done, _ = env.step(action) episode_reward += reward elapsed_steps += 1 if max_episode_steps and elapsed_steps >= max_episode_steps: break agent.close() return episode_reward, elapsed_steps logging.info('==== train ====') episode_rewards = [] for episode in itertools.count(): episode_reward, elapsed_steps = play_episode(env.unwrapped, agent, max_episode_steps=env._max_episode_steps, mode='train') episode_rewards.append(episode_reward) logging.debug('train episode %d: reward = %.2f, steps = %d', episode, episode_reward, elapsed_steps) if episode>10 and np.mean(episode_rewards[-10:]) > opt["target_reward"]: #最近的10个reward>90 break plt.plot(episode_rewards)11:28:19 [INFO] ==== train ==== /home/ma-user/anaconda3/envs/Pytorch-1.0.0/lib/python3.6/site-packages/pandas/core/internals.py:826: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray arr_value = np.array(value) 11:28:25 [DEBUG] train episode 0: reward = -43.16, steps = 999 11:28:32 [DEBUG] train episode 1: reward = -38.43, steps = 999 11:28:38 [DEBUG] train episode 2: reward = -51.41, steps = 999 11:28:44 [DEBUG] train episode 3: reward = -48.34, steps = 999 11:28:51 [DEBUG] train episode 4: reward = -50.17, steps = 999 11:28:56 [DEBUG] train episode 5: reward = 93.68, steps = 69 11:29:02 [DEBUG] train episode 6: reward = 93.60, steps = 67 11:29:08 [DEBUG] train episode 7: reward = 93.85, steps = 66 11:29:16 [DEBUG] train episode 8: reward = 92.45, steps = 85 11:29:22 [DEBUG] train episode 9: reward = 93.77, steps = 67 11:29:32 [DEBUG] train episode 10: reward = 91.62, steps = 103 11:29:38 [DEBUG] train episode 11: reward = 94.28, steps = 67 11:29:44 [DEBUG] train episode 12: reward = 93.65, steps = 68 11:29:50 [DEBUG] train episode 13: reward = 93.66, steps = 66 11:29:57 [DEBUG] train episode 14: reward = 93.73, steps = 65 [<matplotlib.lines.Line2D at 0x7fd0f48ef550>]6. 
使用模型推理由于本内核可视化依赖于OpenGL,需要窗口显示,但当前环境暂不支持弹窗,因此无法可视化,请将代码下载到本地,取消 env.render() 这行代码的注释,可查看可视化效果。logging.info('==== test ====') episode_rewards = [] for episode in range(100): episode_reward, elapsed_steps = play_episode(env, agent) episode_rewards.append(episode_reward) logging.debug('test episode %d: reward = %.2f, steps = %d', episode, episode_reward, elapsed_steps) logging.info('average episode reward = %.2f ± %.2f', np.mean(episode_rewards), np.std(episode_rewards)) env.close()11:29:57 [INFO] ==== test ==== 11:29:57 [DEBUG] test episode 0: reward = 93.60, steps = 66 11:29:57 [DEBUG] test episode 1: reward = 93.45, steps = 68 11:29:57 [DEBUG] test episode 2: reward = 93.67, steps = 65 11:29:57 [DEBUG] test episode 3: reward = 92.87, steps = 87 11:29:57 [DEBUG] test episode 4: reward = 93.67, steps = 65 11:29:57 [DEBUG] test episode 5: reward = 93.58, steps = 66 11:29:57 [DEBUG] test episode 6: reward = 92.58, steps = 86 11:29:57 [DEBUG] test episode 7: reward = 93.46, steps = 68 11:29:57 [DEBUG] test episode 8: reward = 93.60, steps = 66 11:29:57 [DEBUG] test episode 9: reward = 93.67, steps = 65 11:29:57 [DEBUG] test episode 10: reward = 93.58, steps = 66 11:29:57 [DEBUG] test episode 11: reward = 93.59, steps = 66 11:29:57 [DEBUG] test episode 12: reward = 88.82, steps = 126 11:29:57 [DEBUG] test episode 13: reward = 93.53, steps = 67 11:29:57 [DEBUG] test episode 14: reward = 93.17, steps = 72 11:29:57 [DEBUG] test episode 15: reward = 93.32, steps = 70 11:29:57 [DEBUG] test episode 16: reward = 93.40, steps = 69 11:29:57 [DEBUG] test episode 17: reward = 93.53, steps = 67 11:29:57 [DEBUG] test episode 18: reward = 93.67, steps = 65 11:29:57 [DEBUG] test episode 19: reward = 93.46, steps = 68 11:29:57 [DEBUG] test episode 20: reward = 93.47, steps = 68 11:29:57 [DEBUG] test episode 21: reward = 93.67, steps = 65 11:29:57 [DEBUG] test episode 22: reward = 93.39, steps = 69 11:29:57 [DEBUG] test episode 23: reward = 93.54, steps = 67 11:29:57 [DEBUG] test episode 24: reward = 93.18, steps = 72 11:29:57 [DEBUG] test episode 25: reward = 93.39, steps = 69 11:29:57 [DEBUG] test episode 26: reward = 93.66, steps = 65 11:29:57 [DEBUG] test episode 27: reward = 92.42, steps = 86 11:29:57 [DEBUG] test episode 28: reward = 93.25, steps = 71 11:29:57 [DEBUG] test episode 29: reward = 93.66, steps = 65 11:29:57 [DEBUG] test episode 30: reward = 93.17, steps = 72 11:29:57 [DEBUG] test episode 31: reward = 93.67, steps = 65 11:29:57 [DEBUG] test episode 32: reward = 93.67, steps = 65 11:29:57 [DEBUG] test episode 33: reward = 93.52, steps = 67 11:29:57 [DEBUG] test episode 34: reward = 93.67, steps = 65 11:29:57 [DEBUG] test episode 35: reward = 91.86, steps = 120 11:29:57 [DEBUG] test episode 36: reward = 88.23, steps = 131 11:29:57 [DEBUG] test episode 37: reward = 93.67, steps = 65 11:29:57 [DEBUG] test episode 38: reward = 93.31, steps = 70 11:29:57 [DEBUG] test episode 39: reward = 93.31, steps = 70 11:29:57 [DEBUG] test episode 40: reward = 93.31, steps = 70 11:29:57 [DEBUG] test episode 41: reward = 93.60, steps = 66 11:29:57 [DEBUG] test episode 42: reward = 93.67, steps = 65 11:29:57 [DEBUG] test episode 43: reward = 93.66, steps = 65 11:29:57 [DEBUG] test episode 44: reward = 93.20, steps = 72 11:29:57 [DEBUG] test episode 45: reward = 93.67, steps = 65 11:29:57 [DEBUG] test episode 46: reward = 88.05, steps = 133 11:29:57 [DEBUG] test episode 47: reward = 88.83, steps = 126 11:29:58 [DEBUG] test episode 48: reward = 93.61, steps = 66 11:29:58 [DEBUG] test episode 49: reward = 93.52, steps 
= 67 11:29:58 [DEBUG] test episode 50: reward = 91.99, steps = 103 11:29:58 [DEBUG] test episode 51: reward = 93.58, steps = 66 11:29:58 [DEBUG] test episode 52: reward = 93.39, steps = 69 11:29:58 [DEBUG] test episode 53: reward = 93.67, steps = 65 11:29:58 [DEBUG] test episode 54: reward = 93.59, steps = 66 11:29:58 [DEBUG] test episode 55: reward = 91.40, steps = 105 11:29:58 [DEBUG] test episode 56: reward = 93.59, steps = 66 11:29:58 [DEBUG] test episode 57: reward = 92.72, steps = 86 11:29:58 [DEBUG] test episode 58: reward = 93.38, steps = 69 11:29:58 [DEBUG] test episode 59: reward = 93.45, steps = 68 11:29:58 [DEBUG] test episode 60: reward = 93.54, steps = 67 11:29:58 [DEBUG] test episode 61: reward = 93.32, steps = 70 11:29:58 [DEBUG] test episode 62: reward = 93.66, steps = 65 11:29:58 [DEBUG] test episode 63: reward = 93.58, steps = 66 11:29:58 [DEBUG] test episode 64: reward = 93.60, steps = 66 11:29:58 [DEBUG] test episode 65: reward = 93.03, steps = 87 11:29:58 [DEBUG] test episode 66: reward = 93.58, steps = 66 11:29:58 [DEBUG] test episode 67: reward = 92.55, steps = 86 11:29:58 [DEBUG] test episode 68: reward = 93.37, steps = 69 11:29:58 [DEBUG] test episode 69: reward = 93.61, steps = 66 11:29:58 [DEBUG] test episode 70: reward = 93.61, steps = 66 11:29:58 [DEBUG] test episode 71: reward = 93.44, steps = 68 11:29:58 [DEBUG] test episode 72: reward = 93.59, steps = 66 11:29:58 [DEBUG] test episode 73: reward = 93.46, steps = 68 11:29:58 [DEBUG] test episode 74: reward = 93.54, steps = 67 11:29:58 [DEBUG] test episode 75: reward = 93.31, steps = 70 11:29:58 [DEBUG] test episode 76: reward = 89.16, steps = 124 11:29:58 [DEBUG] test episode 77: reward = 92.82, steps = 77 11:29:58 [DEBUG] test episode 78: reward = 93.37, steps = 69 11:29:58 [DEBUG] test episode 79: reward = 93.60, steps = 66 11:29:58 [DEBUG] test episode 80: reward = 93.67, steps = 65 11:29:58 [DEBUG] test episode 81: reward = 93.46, steps = 68 11:29:58 [DEBUG] test episode 82: reward = 93.68, steps = 65 11:29:58 [DEBUG] test episode 83: reward = 93.54, steps = 67 11:29:58 [DEBUG] test episode 84: reward = 93.19, steps = 72 11:29:58 [DEBUG] test episode 85: reward = 92.95, steps = 87 11:29:58 [DEBUG] test episode 86: reward = 93.33, steps = 70 11:29:58 [DEBUG] test episode 87: reward = 93.60, steps = 66 11:29:58 [DEBUG] test episode 88: reward = 93.68, steps = 65 11:29:58 [DEBUG] test episode 89: reward = 92.96, steps = 75 11:29:58 [DEBUG] test episode 90: reward = 93.38, steps = 69 11:29:58 [DEBUG] test episode 91: reward = 93.59, steps = 66 11:29:58 [DEBUG] test episode 92: reward = 93.10, steps = 73 11:29:58 [DEBUG] test episode 93: reward = 93.19, steps = 72 11:29:58 [DEBUG] test episode 94: reward = 93.66, steps = 65 11:29:58 [DEBUG] test episode 95: reward = 91.91, steps = 103 11:29:58 [DEBUG] test episode 96: reward = 93.53, steps = 67 11:29:58 [DEBUG] test episode 97: reward = 93.52, steps = 67 11:29:58 [DEBUG] test episode 98: reward = 93.60, steps = 66 11:29:58 [DEBUG] test episode 99: reward = 92.68, steps = 79 11:29:58 [INFO] average episode reward = 93.12 ± 1.127.可视化效果下面的视频为target_reward设置为90时,模型的推理效果,该动图演示了小车在能量消耗极小的情况下到达目标点。8. 作业请你调整步骤2中的训练参数,重新训练一个模型,使它在游戏中获得更好的表现。
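上文参数设置中的 mu、sigma、theta 分别对应 OU 噪声的均值、波动率和均值回归速率。下面用一段独立的小示例(更新公式与上文 OrnsteinUhlenbeckProcess 一致,步数和 dt 为示例取值)直观说明 OU 噪声与独立高斯噪声的区别:OU 噪声相邻两步高度相关,能为连续控制提供更平滑的探索动作。

```python
import numpy as np

def ou_sequence(mu=0.0, sigma=0.5, theta=0.15, dt=0.01, steps=1000):
    """按 x += theta*(mu-x)*dt + sigma*sqrt(dt)*N(0,1) 生成一段 OU 噪声序列。"""
    x, xs = 0.0, []
    for _ in range(steps):
        x += theta * (mu - x) * dt + sigma * np.sqrt(dt) * np.random.normal()
        xs.append(x)
    return np.array(xs)

np.random.seed(0)
ou = ou_sequence()
white = 0.5 * np.sqrt(0.01) * np.random.normal(size=1000)  # 同尺度的独立高斯噪声

# OU 噪声相邻两步的相关系数接近 1,独立高斯噪声接近 0
print('OU 噪声相邻步相关系数:', np.corrcoef(ou[:-1], ou[1:])[0, 1])
print('高斯白噪声相邻步相关系数:', np.corrcoef(white[:-1], white[1:])[0, 1])
```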
  • [技术干货] 使用SAC算法控制倒立摆
    使用SAC算法控制倒立摆-作业欢迎你将完成的作业分享到 AI Gallery Notebook 版块获得成长值,分享方法请查看此文档。题目描述请你调整步骤2中的训练参数,重新训练一个模型,使它在游戏中获得更好的表现提示:请在下文中搜索“# 请在此处实现代码”,注释所在之处就是你需要修改代码的地方;修改好代码之后,跑通整个案例代码,即可完成作业,请将完成的作业分享到AI Gallery,标题以“2021实战营”为开头命名;代码实现1. 程序初始化第1步:安装基础依赖!pip install gym pybullet第2步:导入相关的库import time import random import itertools import gym import numpy as np import torch import torch.nn as nn import torch.nn.functional as F from torch.optim import Adam from torch.distributions import Normal import pybullet_envs2. 训练参数初始化本案例设置的 num_steps = 30000,可以达到200分,训练耗时约5分钟。# 请在此处实现代码3. 定义SAC算法第1步:定义Q网络,Q1和Q2,结构相同,为[256,256,256]的全连接层# Initialize Policy weights def weights_init_(m): if isinstance(m, nn.Linear): torch.nn.init.xavier_uniform_(m.weight, gain=1) torch.nn.init.constant_(m.bias, 0) class QNetwork(nn.Module): def __init__(self, num_inputs, num_actions): super(QNetwork, self).__init__() # Q1 architecture self.linear1 = nn.Linear(num_inputs + num_actions, 256) self.linear2 = nn.Linear(256, 256) self.linear3 = nn.Linear(256, 1) # Q2 architecture self.linear4 = nn.Linear(num_inputs + num_actions, 256) self.linear5 = nn.Linear(256, 256) self.linear6 = nn.Linear(256, 1) self.apply(weights_init_) def forward(self, state, action): xu = torch.cat([state, action], 1) x1 = F.relu(self.linear1(xu)) x1 = F.relu(self.linear2(x1)) x1 = self.linear3(x1) x2 = F.relu(self.linear4(xu)) x2 = F.relu(self.linear5(x2)) x2 = self.linear6(x2) return x1, x2第2步:Policy网络,采用高斯分布,两层[256,256]全连接+均值+标准差class GaussianPolicy(nn.Module): def __init__(self, num_inputs, num_actions, action_space=None): super(GaussianPolicy, self).__init__() self.linear1 = nn.Linear(num_inputs, 256) self.linear2 = nn.Linear(256, 256) self.mean_linear = nn.Linear(256, num_actions) self.log_std_linear = nn.Linear(256, num_actions) self.apply(weights_init_) # action rescaling if action_space is None: self.action_scale = torch.tensor(1.) self.action_bias = torch.tensor(0.) else: self.action_scale = torch.FloatTensor( (action_space.high - action_space.low) / 2.) self.action_bias = torch.FloatTensor( (action_space.high + action_space.low) / 2.) 
def forward(self, state): x = F.relu(self.linear1(state)) x = F.relu(self.linear2(x)) mean = self.mean_linear(x) log_std = self.log_std_linear(x) log_std = torch.clamp(log_std, min=LOG_SIG_MIN, max=LOG_SIG_MAX) return mean, log_std def sample(self, state): mean, log_std = self.forward(state) std = log_std.exp() normal = Normal(mean, std) # 重参数化技巧 (mean + std * N(0,1)) x_t = normal.rsample() y_t = torch.tanh(x_t) action = y_t * self.action_scale + self.action_bias log_prob = normal.log_prob(x_t) log_prob -= torch.log(self.action_scale * (1 - y_t.pow(2)) + epsilon) log_prob = log_prob.sum(1, keepdim=True) mean = torch.tanh(mean) * self.action_scale + self.action_bias return action, log_prob, mean def to(self, device): self.action_scale = self.action_scale.to(device) self.action_bias = self.action_bias.to(device) return super(GaussianPolicy, self).to(device)第3步: 定义sac训练部分class SAC(object): def __init__(self, num_inputs, action_space): self.alpha = alpha self.auto_entropy = auto_entropy self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # critic网络 self.critic = QNetwork(num_inputs, action_space.shape[0]).to(device=self.device) self.critic_optim = Adam(self.critic.parameters(), lr=lr) # critic_target网络 self.critic_target = QNetwork(num_inputs, action_space.shape[0]).to(self.device) hard_update(self.critic_target, self.critic) # Target Entropy = −dim(A) if auto_entropy is True: self.target_entropy = -torch.prod(torch.Tensor(action_space.shape).to(self.device)).item() self.log_alpha = torch.zeros(1, requires_grad=True, device=self.device) self.alpha_optim = Adam([self.log_alpha], lr=lr) self.policy = GaussianPolicy(num_inputs, action_space.shape[0], action_space).to(self.device) self.policy_optim = Adam(self.policy.parameters(), lr=lr) def select_action(self, state): state = torch.FloatTensor(state).to(self.device).unsqueeze(0) action, _, _ = self.policy.sample(state) return action.detach().cpu().numpy()[0] def update_parameters(self, memory, batch_size, updates): # Sample a batch from memory state_batch, action_batch, reward_batch, next_state_batch, mask_batch = memory.sample(batch_size=batch_size) state_batch = torch.FloatTensor(state_batch).to(self.device) next_state_batch = torch.FloatTensor(next_state_batch).to(self.device) action_batch = torch.FloatTensor(action_batch).to(self.device) reward_batch = torch.FloatTensor(reward_batch).to(self.device).unsqueeze(1) mask_batch = torch.FloatTensor(mask_batch).to(self.device).unsqueeze(1) with torch.no_grad(): # 经过policy_network得到action next_state_action, next_state_log_pi, _ = self.policy.sample(next_state_batch) # 输入next_state,和next_action,经过target_critic_network得到Q值 qf1_next_target, qf2_next_target = self.critic_target(next_state_batch, next_state_action) min_qf_next_target = torch.min(qf1_next_target, qf2_next_target) - self.alpha * next_state_log_pi next_q_value = reward_batch + mask_batch * gamma * (min_qf_next_target) # 将当前state,action输入critic_network得到Q值 qf1, qf2 = self.critic(state_batch, action_batch) # JQ = 𝔼(st,at)~D[0.5(Q1(st,at) - r(st,at) - γ(𝔼st+1~p[V(st+1)]))^2] qf1_loss = F.mse_loss(qf1, next_q_value) # JQ = 𝔼(st,at)~D[0.5(Q1(st,at) - r(st,at) - γ(𝔼st+1~p[V(st+1)]))^2] qf2_loss = F.mse_loss(qf2, next_q_value) qf_loss = qf1_loss + qf2_loss self.critic_optim.zero_grad() qf_loss.backward() self.critic_optim.step() pi, log_pi, _ = self.policy.sample(state_batch) qf1_pi, qf2_pi = self.critic(state_batch, pi) min_qf_pi = torch.min(qf1_pi, qf2_pi) # Jπ = 𝔼st∼D,εt∼N[α * logπ(f(εt;st)|st) − Q(st,f(εt;st))] 
policy_loss = ((self.alpha * log_pi) - min_qf_pi).mean() self.policy_optim.zero_grad() policy_loss.backward() self.policy_optim.step() if self.auto_entropy: alpha_loss = -(self.log_alpha * (log_pi + self.target_entropy).detach()).mean() self.alpha_optim.zero_grad() alpha_loss.backward() self.alpha_optim.step() self.alpha = self.log_alpha.exp() else: alpha_loss = torch.tensor(0.).to(self.device) if updates % target_update_interval == 0: soft_update(self.critic_target, self.critic, tau) def soft_update(target, source, tau): for target_param, param in zip(target.parameters(), source.parameters()): target_param.data.copy_(target_param.data * (1.0 - tau) + param.data * tau) def hard_update(target, source): for target_param, param in zip(target.parameters(), source.parameters()): target_param.data.copy_(param.data)第4步:定义replay buffer,存储[s,a,r,s_,done]class ReplayMemory: def __init__(self, capacity): random.seed(seed) self.capacity = capacity self.buffer = [] self.position = 0 def push(self, state, action, reward, next_state, done): if len(self.buffer) < self.capacity: self.buffer.append(None) self.buffer[self.position] = (state, action, reward, next_state, done) self.position = (self.position + 1) % self.capacity def sample(self, batch_size): batch = random.sample(self.buffer, batch_size) state, action, reward, next_state, done = map(np.stack, zip(*batch)) return state, action, reward, next_state, done def __len__(self): return len(self.buffer)4. 训练模型初始化环境和算法# 创建环境 env = gym.make(env_name) # 设置随机数 env.seed(seed) env.action_space.seed(seed) torch.manual_seed(seed) np.random.seed(seed) # 创建agent agent = SAC(env.observation_space.shape[0], env.action_space) # replay buffer memory = ReplayMemory(replay_size) # 训练步数记录 total_numsteps = 0 updates = 0 max_reward = 0开始训练print('\ntraining...') begin_t = time.time() for i_episode in itertools.count(1): episode_reward = 0 episode_steps = 0 done = False state = env.reset() while not done: if start_steps > total_numsteps: # 随机采样过程 action = env.action_space.sample() else: # 根据策略采样 action = agent.select_action(state) if len(memory) > batch_size: # 每个step更新次数 for i in range(updates_per_step): agent.update_parameters(memory, batch_size, updates) updates += 1 # 执行该步 next_state, reward, done, _ = env.step(action) # 更新记录参数 episode_steps += 1 total_numsteps += 1 episode_reward += reward # -done mask = 1 if episode_steps == env._max_episode_steps else float(not done) # 存入buffer memory.push(state, action, reward, next_state, mask) # 更新state state = next_state # 达到终止条件后,停止 if total_numsteps > num_steps: break if episode_reward >= max_reward: max_reward = episode_reward print("current_max_reward {}".format(max_reward)) # 保存模型 torch.save(agent.policy, "model.pt") print("Episode: {}, total numsteps: {}, reward: {}".format(i_episode, total_numsteps,round(episode_reward, 2))) env.close() print("finish! time cost is {}s".format(time.time() - begin_t))5. 使用模型推理游戏由于本内核可视化依赖于OpenGL,需要窗口显示,但当前环境暂不支持,因此无法可视化,请将代码下载到本地,取消 env.render() 这行代码的注释,可查看可视化效果。# 可视化部分 model = torch.load("model.pt") model.eval() device = torch.device("cuda" if torch.cuda.is_available() else "cpu") state = env.reset() # env.render() done = False episode_reward = 0 while not done: _, _, action = model.sample(torch.FloatTensor(state).to(device).unsqueeze(0)) action = action.detach().cpu().numpy()[0] next_state, reward, done, _ = env.step(action) episode_reward += reward # env.render() state = next_state print(episode_reward)可视化效果如下:
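上文“2. 训练参数初始化”一节预留了“# 请在此处实现代码”的位置,需要自行定义训练参数。下面给出一组仅供参考的示例草稿,变量名与后文 SAC 代码保持一致;其中 env_name 和各项取值均为假设值(num_steps = 30000 为正文建议值),请按作业要求自行调整:

```python
# 示例训练参数(取值均为假设,变量名与后文代码对应)
env_name = "InvertedPendulumBulletEnv-v0"  # 假设使用的 pybullet 倒立摆环境名
seed = 123                  # 随机种子
num_steps = 30000           # 总训练步数(正文建议值)
start_steps = 1000          # 开始按策略采样前的随机探索步数
replay_size = 100000        # replay buffer 容量
batch_size = 256            # 每次更新采样的 batch 大小
updates_per_step = 1        # 每个环境步的参数更新次数
lr = 3e-4                   # 学习率
gamma = 0.99                # 奖励折扣率
tau = 0.005                 # 目标网络软更新系数
alpha = 0.2                 # 熵系数初始值
auto_entropy = True         # 是否自动调节熵系数
target_update_interval = 1  # 目标网络软更新间隔
LOG_SIG_MAX = 2             # 策略 log_std 上限
LOG_SIG_MIN = -20           # 策略 log_std 下限
epsilon = 1e-6              # 数值稳定项
```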
  • [技术干货] 使用PPO算法玩“超级马里奥兄弟”
    实验目标通过本案例的学习和课后作业的练习:了解PPO算法的基本概念了解如何基于PPO训练一个小游戏了解强化学习训练推理游戏的整体流程你也可以将本案例相关的 ipynb 学习笔记分享到 AI Gallery Notebook 版块获得成长值,分享方法请查看此文档。案例内容介绍在此教程中,我们利用PPO算法来玩“Super Mario Bros”(超级马里奥兄弟)。目前来看,对于绝大部分关卡,智能体都可以在1500个episode内学会过关,您可以在超参数栏输入您想要的游戏关卡和训练算法超参数。整体流程:创建马里奥环境->构建PPO算法->训练->推理->可视化效果PPO算法的基本结构PPO算法有两种主要形式:PPO-Penalty和PPO-Clip(PPO2)。在这里,我们讨论PPO-Clip(OpenAI使用的主要形式)。 PPO的主要特点如下:PPO属于on-policy算法PPO同时适用于离散和连续的动作空间损失函数 PPO-Clip算法最精髓的地方就是加入了一项比例用以描绘新老策略的差异,通过超参数ϵ限制策略的更新步长:更新策略:探索策略 PPO采用随机探索策略。优势函数 表示在状态s下采取动作a,相较于其他动作有多少优势,如果>0,则当前动作比平均动作好,反之,则差PPO论文超级马里奥兄弟游戏环境简介《超级马里奥兄弟》,是任天堂公司开发并于1985年出品的著名横版过关游戏,游戏的目标在于游历蘑菇王国,并从大反派酷霸王的魔掌里救回桃花公主。马力奥可以在游戏世界收集散落各处的金币,或者敲击特殊的砖块,获得其中的金币或特殊道具。这里一共有8大关(world),每大关有4小关(stage)。注意事项本案例运行环境为 Pytorch-1.0.0,且需使用 GPU 运行,请查看《ModelAtrs JupyterLab 硬件规格使用指南》了解切换硬件规格的方法;如果您是第一次使用 JupyterLab,请查看《ModelAtrs JupyterLab使用指导》了解使用方法;如果您在使用 JupyterLab 过程中碰到报错,请参考《ModelAtrs JupyterLab常见问题解决办法》尝试解决问题。实验步骤1. 程序初始化第1步:安装基础依赖!pip install -U pip!pip install gym==0.19.0!pip install tqdm==4.48.0!pip install nes-py==8.1.0!pip install gym-super-mario-bros==7.3.2import osimport shutilimport subprocess as spfrom collections import dequeimport numpy as npimport torchimport torch.nn as nnimport torch.nn.functional as Fimport torch.multiprocessing as _mpfrom torch.distributions import Categoricalimport torch.multiprocessing as mpfrom nes_py.wrappers import JoypadSpaceimport gym_super_mario_brosfrom gym.spaces import Boxfrom gym import Wrapperfrom gym_super_mario_bros.actions import SIMPLE_MOVEMENT, COMPLEX_MOVEMENT, RIGHT_ONLYimport cv2import matplotlib.pyplot as pltfrom IPython import displayimport moxing as mox2. 训练参数初始化该部分参数可以自己调整,以训练出更好的效果opt={ "world": 1, # 可选大关:1,2,3,4,5,6,7,8 "stage": 1, # 可选小关:1,2,3,4 "action_type": "simple", # 动作类别:"simple","right_only", "complex" 'lr': 1e-4, # 建议学习率:1e-3,1e-4, 1e-5,7e-5 'gamma': 0.9, # 奖励折扣 'tau': 1.0, # GAE参数 'beta': 0.01, # 熵系数 'epsilon': 0.2, # PPO的Clip系数 'batch_size': 16, # 经验回放的batch_size 'max_episode':10, # 最大训练局数 'num_epochs': 10, # 每条经验回放次数 "num_local_steps": 512, # 每局的最大步数 "num_processes": 8, # 训练进程数,一般等于训练机核心数 "save_interval": 5, # 每{}局保存一次模型 "log_path": "./log", # 日志保存路径 "saved_path": "./model", # 训练模型保存路径 "pretrain_model": True, # 是否加载预训练模型,目前只提供1-1关卡的预训练模型,其他需要从零开始训练 "episode":5}3. 创建环境结束标志:胜利:mario到达本关终点失败:mario受到敌人的伤害、坠入悬崖或者时间用完奖励函数:得分:收集金币、踩扁敌人、结束时夺旗扣分:受到敌人伤害、掉落悬崖、结束时未夺旗# 创建环境def create_train_env(world, stage, actions, output_path=None): # 创建基础环境 env = gym_super_mario_bros.make("SuperMarioBros-{}-{}-v0".format(world, stage)) env = JoypadSpace(env, actions) # 对环境自定义 env = CustomReward(env, world, stage, monitor=None) env = CustomSkipFrame(env) return env# 对原始环境进行修改,以获得更好的训练效果class CustomReward(Wrapper): def __init__(self, env=None, world=None, stage=None, monitor=None): super(CustomReward, self).__init__(env) self.observation_space = Box(low=0, high=255, shape=(1, 84, 84)) self.curr_score = 0 self.current_x = 40 self.world = world self.stage = stage if monitor: self.monitor = monitor else: self.monitor = None def step(self, action): state, reward, done, info = self.env.step(action) if self.monitor: self.monitor.record(state) state = process_frame(state) reward += (info["score"] - self.curr_score) / 40. 
self.curr_score = info["score"] if done: if info["flag_get"]: reward += 50 else: reward -= 50 if self.world == 7 and self.stage == 4: if (506 <= info["x_pos"] <= 832 and info["y_pos"] > 127) or ( 832 < info["x_pos"] <= 1064 and info["y_pos"] < 80) or ( 1113 < info["x_pos"] <= 1464 and info["y_pos"] < 191) or ( 1579 < info["x_pos"] <= 1943 and info["y_pos"] < 191) or ( 1946 < info["x_pos"] <= 1964 and info["y_pos"] >= 191) or ( 1984 < info["x_pos"] <= 2060 and (info["y_pos"] >= 191 or info["y_pos"] < 127)) or ( 2114 < info["x_pos"] < 2440 and info["y_pos"] < 191) or info["x_pos"] < self.current_x - 500: reward -= 50 done = True if self.world == 4 and self.stage == 4: if (info["x_pos"] <= 1500 and info["y_pos"] < 127) or ( 1588 <= info["x_pos"] < 2380 and info["y_pos"] >= 127): reward = -50 done = True self.current_x = info["x_pos"] return state, reward / 10., done, info def reset(self): self.curr_score = 0 self.current_x = 40 return process_frame(self.env.reset())class MultipleEnvironments: def __init__(self, world, stage, action_type, num_envs, output_path=None): self.agent_conns, self.env_conns = zip(*[mp.Pipe() for _ in range(num_envs)]) if action_type == "right_only": actions = RIGHT_ONLY elif action_type == "simple": actions = SIMPLE_MOVEMENT else: actions = COMPLEX_MOVEMENT self.envs = [create_train_env(world, stage, actions, output_path=output_path) for _ in range(num_envs)] self.num_states = self.envs[0].observation_space.shape[0] self.num_actions = len(actions) for index in range(num_envs): process = mp.Process(target=self.run, args=(index,)) process.start() self.env_conns[index].close() def run(self, index): self.agent_conns[index].close() while True: request, action = self.env_conns[index].recv() if request == "step": self.env_conns[index].send(self.envs[index].step(action.item())) elif request == "reset": self.env_conns[index].send(self.envs[index].reset()) else: raise NotImplementedErrordef process_frame(frame): if frame is not None: frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY) frame = cv2.resize(frame, (84, 84))[None, :, :] / 255. return frame else: return np.zeros((1, 84, 84)) class CustomSkipFrame(Wrapper): def __init__(self, env, skip=4): super(CustomSkipFrame, self).__init__(env) self.observation_space = Box(low=0, high=255, shape=(skip, 84, 84)) self.skip = skip self.states = np.zeros((skip, 84, 84), dtype=np.float32) def step(self, action): total_reward = 0 last_states = [] for i in range(self.skip): state, reward, done, info = self.env.step(action) total_reward += reward if i >= self.skip / 2: last_states.append(state) if done: self.reset() return self.states[None, :, :, :].astype(np.float32), total_reward, done, info max_state = np.max(np.concatenate(last_states, 0), 0) self.states[:-1] = self.states[1:] self.states[-1] = max_state return self.states[None, :, :, :].astype(np.float32), total_reward, done, info def reset(self): state = self.env.reset() self.states = np.concatenate([state for _ in range(self.skip)], 0) return self.states[None, :, :, :].astype(np.float32)4. 
定义神经网络神经网络结构包含四层卷积网络和一层全连接网络,提取的特征输入critic层和actor层,分别输出value值和动作概率分布。class Net(nn.Module): def __init__(self, num_inputs, num_actions): super(Net, self).__init__() self.conv1 = nn.Conv2d(num_inputs, 32, 3, stride=2, padding=1) self.conv2 = nn.Conv2d(32, 32, 3, stride=2, padding=1) self.conv3 = nn.Conv2d(32, 32, 3, stride=2, padding=1) self.conv4 = nn.Conv2d(32, 32, 3, stride=2, padding=1) self.linear = nn.Linear(32 * 6 * 6, 512) self.critic_linear = nn.Linear(512, 1) self.actor_linear = nn.Linear(512, num_actions) self._initialize_weights() def _initialize_weights(self): for module in self.modules(): if isinstance(module, nn.Conv2d) or isinstance(module, nn.Linear): nn.init.orthogonal_(module.weight, nn.init.calculate_gain('relu')) nn.init.constant_(module.bias, 0) def forward(self, x): x = F.relu(self.conv1(x)) x = F.relu(self.conv2(x)) x = F.relu(self.conv3(x)) x = F.relu(self.conv4(x)) x = self.linear(x.view(x.size(0), -1)) return self.actor_linear(x), self.critic_linear(x)6. 训练模型训练10 Episode,耗时约5分钟train(opt)加载预训练模型Episode: 1. Total loss: 1.1230244636535645Episode: 2. Total loss: 2.553663730621338Episode: 3. Total loss: 1.768389344215393Episode: 4. Total loss: 1.6962862014770508Episode: 5. Total loss: 1.0912611484527588Episode: 6. Total loss: 1.6626232862472534Episode: 7. Total loss: 1.9952025413513184Episode: 8. Total loss: 1.2410558462142944Episode: 9. Total loss: 1.3711413145065308Episode: 10. Total loss: 1.21552050113677987. 使用模型推理游戏定义推理函数def infer(opt): if torch.cuda.is_available(): torch.cuda.manual_seed(123) else: torch.manual_seed(123) if opt['action_type'] == "right": actions = RIGHT_ONLY elif opt['action_type'] == "simple": actions = SIMPLE_MOVEMENT else: actions = COMPLEX_MOVEMENT env = create_train_env(opt['world'], opt['stage'], actions) model = Net(env.observation_space.shape[0], len(actions)) if torch.cuda.is_available(): model.load_state_dict(torch.load("{}/ppo_super_mario_bros_{}_{}_{}".format(opt['saved_path'],opt['world'], opt['stage'],opt['episode']))) model.cuda() else: model.load_state_dict(torch.load("{}/ppo_super_mario_bros_{}_{}_{}".format(opt['saved_path'], opt['world'], opt['stage'],opt['episode']), map_location=torch.device('cpu'))) model.eval() state = torch.from_numpy(env.reset()) plt.figure(figsize=(10,10)) img = plt.imshow(env.render(mode='rgb_array')) while True: if torch.cuda.is_available(): state = state.cuda() logits, value = model(state) policy = F.softmax(logits, dim=1) action = torch.argmax(policy).item() state, reward, done, info = env.step(action) state = torch.from_numpy(state) img.set_data(env.render(mode='rgb_array')) # just update the data display.display(plt.gcf()) display.clear_output(wait=True) if info["flag_get"]: print("World {} stage {} completed".format(opt['world'], opt['stage'])) break if done and info["flag_get"] is False: print('Game Failed') breakinfer(opt)8. 作业¶请你调整步骤2中的训练参数,重新训练一个模型,使它在游戏中获得更好的表现。
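为了更直观地理解上文所说的“通过超参数 ϵ 限制策略的更新步长”,下面给出一个 PPO-Clip 裁剪目标的最小示意(PyTorch 实现,仅作理解用,并非本案例训练函数的原始代码,张量取值为随意构造的示例):

```python
import torch

def ppo_clip_loss(new_log_prob, old_log_prob, advantage, epsilon=0.2):
    """PPO-Clip 策略损失:用新旧策略的概率比 ratio 衡量更新幅度,并裁剪到 [1-ε, 1+ε]。"""
    ratio = torch.exp(new_log_prob - old_log_prob)                     # 新老策略的概率比
    surr1 = ratio * advantage                                          # 未裁剪目标
    surr2 = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantage   # 裁剪后目标
    return -torch.min(surr1, surr2).mean()                             # 取较小者再取负,作为待最小化的损失

# 示例:优势为正时,ratio 超出 1+ε 的部分不再带来额外收益,从而限制更新步长
new_lp = torch.log(torch.tensor([0.6, 0.30]))
old_lp = torch.log(torch.tensor([0.4, 0.35]))
adv = torch.tensor([1.0, -0.5])
print(ppo_clip_loss(new_lp, old_lp, adv, epsilon=0.2))
```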
  • [技术干货] 使用DQN算法玩2048游戏
    实验目标通过本案例的学习和课后作业的练习:了解DQN算法的基本概念了解如何基于DQN训练一个小游戏了解强化学习训练推理游戏的整体流程你也可以将本案例相关的 ipynb 学习笔记分享到 AI Gallery Notebook 版块获得成长值,分享方法请查看此文档。案例内容介绍《2048》是一款单人在线和移动端游戏,由19岁的意大利人Gabriele Cirulli于2014年3月开发。游戏任务是在一个网格上滑动小方块来进行组合,直到形成一个带有有数字2048的方块。该游戏可以上下左右移动方块。如果两个带有相同数字的方块在移动中碰撞,则它们会合并为一个方块,且所带数字变为两者之和。每次移动时,会有一个值为2或者4的新方块出现,所出现的数字都是2的幂。当值为2048的方块出现时,游戏即胜利,该游戏因此得名。(源自维基百科) DQN是强化学习的经典算法之一,最早由DeepMind于2013年发表的论文“Playing Atari with Deep Reinforcement Learning”中提出,属于value based的model free方法,在多种游戏环境中表现稳定且良好。 在本案例中,我们将展示如何基于simple dqn算法,训练一个2048的小游戏。整体流程:安装基础依赖->创建2048环境->构建DQN算法->训练->推理->可视化效果DQN算法的基本结构Deep Q-learning(DQN)是Q-learing和神经网络的结合,利用经验回放来进行强化学习的训练,结构如下:神经网络部分神经网络用来逼近值函数,一般采用全连接层表达特征输入,采用卷积层表达图像输入。其损失函数表达为γ经验折扣率,γ取0,表示只考虑当下,γ取1,表示只考虑未来。经验回放经验回放是指:模型与环境交互得到的(s,a,r,s')会存入一个replay buffer,然后每次从中随机采样出一批样本进行学习。采用该策略的优点如下:减少样本之间的相关性,以近似符合独立同分布的假设;同时增大样本的利用率。探索策略一般采用贪婪探索策略,即agent以ε的概率进行随机探索,其他时间则采取模型计算得到的动作。DQN的整体结构可以简单的表示为:DQN论文Nature DQN论文Nature DQN在DQN的基础上,采用两个结构一样的网络,一个当前Q网络用来选择动作,更新模型参数,另一个目标Q网络用于计算目标Q值。这样可以减少目标Q值和当前的Q值相关性。2048游戏环境简介2048环境来源于GitHub开源项目,继承于gym基本环境类。玩家可以上下左右移动方块,如果方块数字相同,则合并且所带数字变成两者之和。当值为2048的方块出现时,则获得胜利。结束标志只要出现非法移动,即移动的方向的数字无法合并,则该局结束。相比较于传统的可试验多次更加严格,难度提高。奖励函数奖励值为当前方块和累加最大合成数字最大能合成4096注意事项本案例运行环境为 Pytorch-1.0.0,支持 GPU和CPU 运行,请查看《ModelAtrs JupyterLab 硬件规格使用指南》了解切换硬件规格的方法;如果您是第一次使用 JupyterLab,请查看《ModelAtrs JupyterLab使用指导》了解使用方法;如果您在使用 JupyterLab 过程中碰到报错,请参考《ModelAtrs JupyterLab常见问题解决办法》尝试解决问题。实验步骤1. 程序初始化第1步:安装基础依赖!pip install gym第2步:导入相关的库import sysimport timeimport loggingimport argparseimport itertoolsfrom six import StringIOfrom random import sample, randintimport gymfrom gym import spacesfrom gym.utils import seedingimport numpy as npimport torchimport torch.nn as nnfrom PIL import Image, ImageDraw, ImageFontfrom IPython import displayimport matplotlibimport matplotlib.pyplot as plt2. 训练参数初始化¶本案例设置的 epochs = 3000,可以达到较好的训练效果,GPU下训练耗时约10分钟。CPU下训练较慢,建议调小 epochs 的值,如50,以便快速跑通代码。parser = argparse.ArgumentParser()parser.add_argument("--learning_rate", type=float, default=0.001) # 学习率parser.add_argument("--gamma", type=float, default=0.99) # 经验折扣率parser.add_argument("--epochs", type=int, default=50) # 迭代多少局数parser.add_argument("--buffer_size", type=int, default=10000) # replaybuffer大小parser.add_argument("--batch_size", type=int, default=128) # batchsize大小parser.add_argument("--pre_train_model", type=str, default=None) # 是否加载预训练模型parser.add_argument("--use_nature_dqn", type=bool, default=True) # 是否采用nature dqnparser.add_argument("--target_update_freq", type=int, default=250) # 如果采用nature dqn,target模型更新频率parser.add_argument("--epsilon", type=float, default=0.9) # 探索epsilon取值args, _ = parser.parse_known_args()3. 创建环境2048游戏环境继承于gym.Env,主要几个部分:init函数 定义动作空间、状态空间和游戏基本设置step函数 与环境交互,获取动作并执行,返回状态、奖励、是否结束和补充信息reset函数 一局结束后,重置环境render函数 绘图,可视化环境def pairwise(iterable): "s -> (s0,s1), (s1,s2), (s2, s3), ..." a, b = itertools.tee(iterable) next(b, None) return zip(a, b)class IllegalMove(Exception): passdef stack(flat, layers=16): """Convert an [4, 4] representation into [4, 4, layers] with one layers for each value.""" # representation is what each layer represents representation = 2 ** (np.arange(layers, dtype=int) + 1) # layered is the flat board repeated layers times layered = np.repeat(flat[:, :, np.newaxis], layers, axis=-1) # Now set the values in the board to 1 or zero depending whether they match representation. 
# Representation is broadcast across a number of axes layered = np.where(layered == representation, 1, 0) return layeredclass Game2048Env(gym.Env): metadata = {'render.modes': ['ansi', 'human', 'rgb_array']} def __init__(self): # Definitions for game. Board must be square. self.size = 4 self.w = self.size self.h = self.size self.squares = self.size * self.size # Maintain own idea of game score, separate from rewards self.score = 0 # Members for gym implementation self.action_space = spaces.Discrete(4) # Suppose that the maximum tile is as if you have powers of 2 across the board. layers = self.squares self.observation_space = spaces.Box(0, 1, (self.w, self.h, layers), dtype=np.int) self.set_illegal_move_reward(-100) self.set_max_tile(None) # Size of square for rendering self.grid_size = 70 # Initialise seed self.seed() # Reset ready for a game self.reset() def seed(self, seed=None): self.np_random, seed = seeding.np_random(seed) return [seed] def set_illegal_move_reward(self, reward): """Define the reward/penalty for performing an illegal move. Also need to update the reward range for this.""" # Guess that the maximum reward is also 2**squares though you'll probably never get that. # (assume that illegal move reward is the lowest value that can be returned self.illegal_move_reward = reward self.reward_range = (self.illegal_move_reward, float(2 ** self.squares)) def set_max_tile(self, max_tile): """Define the maximum tile that will end the game (e.g. 2048). None means no limit. This does not affect the state returned.""" assert max_tile is None or isinstance(max_tile, int) self.max_tile = max_tile # Implement gym interface def step(self, action): """Perform one step of the game. This involves moving and adding a new tile.""" logging.debug("Action {}".format(action)) score = 0 done = None info = { 'illegal_move': False, } try: score = float(self.move(action)) self.score += score assert score <= 2 ** (self.w * self.h) self.add_tile() done = self.isend() reward = float(score) except IllegalMove: logging.debug("Illegal move") info['illegal_move'] = True done = True reward = self.illegal_move_reward # print("Am I done? 
{}".format(done)) info['highest'] = self.highest() # Return observation (board state), reward, done and info dict return stack(self.Matrix), reward, done, info def reset(self): self.Matrix = np.zeros((self.h, self.w), np.int) self.score = 0 logging.debug("Adding tiles") self.add_tile() self.add_tile() return stack(self.Matrix) def render(self, mode='human'): if mode == 'rgb_array': black = (0, 0, 0) grey = (200, 200, 200) white = (255, 255, 255) tile_colour_map = { 2: (255, 255, 255), 4: (255, 248, 220), 8: (255, 222, 173), 16: (244, 164, 96), 32: (205, 92, 92), 64: (240, 255, 255), 128: (240, 255, 240), 256: (193, 255, 193), 512: (154, 255, 154), 1024: (84, 139, 84), 2048: (139, 69, 19), 4096: (178, 34, 34), } grid_size = self.grid_size # Render with Pillow pil_board = Image.new("RGB", (grid_size * 4, grid_size * 4)) draw = ImageDraw.Draw(pil_board) draw.rectangle([0, 0, 4 * grid_size, 4 * grid_size], grey) fnt = ImageFont.truetype('/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf', 30) for y in range(4): for x in range(4): o = self.get(y, x) if o: draw.rectangle([x * grid_size, y * grid_size, (x + 1) * grid_size, (y + 1) * grid_size], tile_colour_map[o]) (text_x_size, text_y_size) = draw.textsize(str(o), font=fnt) draw.text((x * grid_size + (grid_size - text_x_size) // 2, y * grid_size + (grid_size - text_y_size) // 2), str(o), font=fnt, fill=black) assert text_x_size < grid_size assert text_y_size < grid_size return np.asarray(pil_board) outfile = StringIO() if mode == 'ansi' else sys.stdout s = 'Score: {}\n'.format(self.score) s += 'Highest: {}\n'.format(self.highest()) npa = np.array(self.Matrix) grid = npa.reshape((self.size, self.size)) s += "{}\n".format(grid) outfile.write(s) return outfile # Implement 2048 game def add_tile(self): """Add a tile, probably a 2 but maybe a 4""" possible_tiles = np.array([2, 4]) tile_probabilities = np.array([0.9, 0.1]) val = self.np_random.choice(possible_tiles, 1, p=tile_probabilities)[0] empties = self.empties() assert empties.shape[0] empty_idx = self.np_random.choice(empties.shape[0]) empty = empties[empty_idx] logging.debug("Adding %s at %s", val, (empty[0], empty[1])) self.set(empty[0], empty[1], val) def get(self, x, y): """Return the value of one square.""" return self.Matrix[x, y] def set(self, x, y, val): """Set the value of one square.""" self.Matrix[x, y] = val def empties(self): """Return a 2d numpy array with the location of empty squares.""" return np.argwhere(self.Matrix == 0) def highest(self): """Report the highest tile on the board.""" return np.max(self.Matrix) def move(self, direction, trial=False): """Perform one move of the game. Shift things to one side then, combine. directions 0, 1, 2, 3 are up, right, down, left. 
Returns the score that [would have] got.""" if not trial: if direction == 0: logging.debug("Up") elif direction == 1: logging.debug("Right") elif direction == 2: logging.debug("Down") elif direction == 3: logging.debug("Left") changed = False move_score = 0 dir_div_two = int(direction / 2) dir_mod_two = int(direction % 2) shift_direction = dir_mod_two ^ dir_div_two # 0 for towards up left, 1 for towards bottom right # Construct a range for extracting row/column into a list rx = list(range(self.w)) ry = list(range(self.h)) if dir_mod_two == 0: # Up or down, split into columns for y in range(self.h): old = [self.get(x, y) for x in rx] (new, ms) = self.shift(old, shift_direction) move_score += ms if old != new: changed = True if not trial: for x in rx: self.set(x, y, new[x]) else: # Left or right, split into rows for x in range(self.w): old = [self.get(x, y) for y in ry] (new, ms) = self.shift(old, shift_direction) move_score += ms if old != new: changed = True if not trial: for y in ry: self.set(x, y, new[y]) if changed != True: raise IllegalMove return move_score def combine(self, shifted_row): """Combine same tiles when moving to one side. This function always shifts towards the left. Also count the score of combined tiles.""" move_score = 0 combined_row = [0] * self.size skip = False output_index = 0 for p in pairwise(shifted_row): if skip: skip = False continue combined_row[output_index] = p[0] if p[0] == p[1]: combined_row[output_index] += p[1] move_score += p[0] + p[1] # Skip the next thing in the list. skip = True output_index += 1 if shifted_row and not skip: combined_row[output_index] = shifted_row[-1] return (combined_row, move_score) def shift(self, row, direction): """Shift one row left (direction == 0) or right (direction == 1), combining if required.""" length = len(row) assert length == self.size assert direction == 0 or direction == 1 # Shift all non-zero digits up shifted_row = [i for i in row if i != 0] # Reverse list to handle shifting to the right if direction: shifted_row.reverse() (combined_row, move_score) = self.combine(shifted_row) # Reverse list to handle shifting to the right if direction: combined_row.reverse() assert len(combined_row) == self.size return (combined_row, move_score) def isend(self): """Has the game ended. Game ends if there is a tile equal to the limit or there are no legal moves. If there are empty spaces then there must be legal moves.""" if self.max_tile is not None and self.highest() == self.max_tile: return True for direction in range(4): try: self.move(direction, trial=True) # Not the end if we can do any move return False except IllegalMove: pass return True def get_board(self): """Retrieve the whole board, useful for testing.""" return self.Matrix def set_board(self, new_board): """Retrieve the whole board, useful for testing.""" self.Matrix = new_board4. 
定义DQN算法DQN算法分了两部分构造,拟合函数部分-神经网络算法逻辑本身Replay Buffer部分神经网络结构包含三层卷积网络和一层全连接网络,输出维度为动作空间维度。 神经网络部分可自由设计,以训练出更好的效果。class Net(nn.Module): #obs是状态空间输入,available_actions_count为动作输出维度 def __init__(self, obs, available_actions_count): super(Net, self).__init__() self.conv1 = nn.Conv2d(obs, 128, kernel_size=2, stride=1) self.conv2 = nn.Conv2d(128, 64, kernel_size=2, stride=1) self.conv3 = nn.Conv2d(64, 16, kernel_size=2, stride=1) self.fc1 = nn.Linear(16, available_actions_count) self.relu = nn.ReLU(inplace=True) def forward(self, x): x = x.permute(0, 3, 1, 2) x = self.relu(self.conv1(x)) x = self.relu(self.conv2(x)) x = self.relu(self.conv3(x)) x = self.fc1(x.view(x.shape[0], -1)) return xDQN核心逻辑部分class DQN: def __init__(self, args, obs_dim, action_dim): # 是否加载预训练模型 if args.pre_train_model: print("Loading model from: ", args.pre_train_model) self.behaviour_model = torch.load(args.pre_train_model).to(device) # 如果采用Nature DQN,则需要额外定义target_network self.target_model = torch.load(args.pre_train_model).to(device) else: self.behaviour_model = Net(obs_dim, action_dim).to(device) self.target_model = Net(obs_dim, action_dim).to(device) self.optimizer = torch.optim.Adam(self.behaviour_model.parameters(), args.learning_rate) self.criterion = nn.MSELoss() # 动作维度 self.action_dim = action_dim # 统计学习步数 self.learn_step_counter = 0 self.args = args def learn(self, buffer): # 当replaybuffer中存储的数据大于batchsize时,从中随机采样一个batch的数据学习 if buffer.size >= self.args.batch_size: # 更新target_model的参数 if self.learn_step_counter % args.target_update_freq == 0: self.target_model.load_state_dict(self.behaviour_model.state_dict()) self.learn_step_counter += 1 # 从replaybuffer中随机采样一个五元组(当前观测值,动作,下一个观测值,是否一局结束,奖励值) s1, a, s2, done, r = buffer.get_sample(self.args.batch_size) s1 = torch.FloatTensor(s1).to(device) s2 = torch.FloatTensor(s2).to(device) r = torch.FloatTensor(r).to(device) a = torch.LongTensor(a).to(device) if args.use_nature_dqn: q = self.target_model(s2).detach() else: q = self.behaviour_model(s2) # 每个动作的q值=r+gamma*(1-0或1)*q_max target_q = r + torch.FloatTensor(args.gamma * (1 - done)).to(device) * q.max(1)[0] target_q = target_q.view(args.batch_size, 1) eval_q = self.behaviour_model(s1).gather(1, torch.reshape(a, shape=(a.size()[0], -1))) # 计算损失函数 loss = self.criterion(eval_q, target_q) self.optimizer.zero_grad() loss.backward() self.optimizer.step() def get_action(self, state, explore=True): # 判断是否探索,如果探索,则采用贪婪探索策略决定行为 if explore: if np.random.uniform() >= args.epsilon: action = randint(0, self.action_dim - 1) else: # Choose the best action according to the network. 
q = self.behaviour_model(torch.FloatTensor(state).to(device)) m, index = torch.max(q, 1) action = index.data.cpu().numpy()[0] else: q = self.behaviour_model(torch.FloatTensor(state).to(device)) m, index = torch.max(q, 1) action = index.data.cpu().numpy()[0] return actionreplay buffer数据存储部分class ReplayBuffer: def __init__(self, buffer_size, obs_space): self.s1 = np.zeros(obs_space, dtype=np.float32) self.s2 = np.zeros(obs_space, dtype=np.float32) self.a = np.zeros(buffer_size, dtype=np.int32) self.r = np.zeros(buffer_size, dtype=np.float32) self.done = np.zeros(buffer_size, dtype=np.float32) # replaybuffer大小 self.buffer_size = buffer_size self.size = 0 self.pos = 0 # 不断将数据存储入buffer def add_transition(self, s1, action, s2, done, reward): self.s1[self.pos] = s1 self.a[self.pos] = action if not done: self.s2[self.pos] = s2 self.done[self.pos] = done self.r[self.pos] = reward self.pos = (self.pos + 1) % self.buffer_size self.size = min(self.size + 1, self.buffer_size) # 随机采样一个batchsize def get_sample(self, sample_size): i = sample(range(0, self.size), sample_size) return self.s1[i], self.a[i], self.s2[i], self.done[i], self.r[i]5. 训练模型初始化环境和算法# 初始化环境env = Game2048Env()device = torch.device("cuda" if torch.cuda.is_available() else "cpu")# 初始化dqndqn = DQN(args, obs_dim=env.observation_space.shape[2], action_dim=env.action_space.n)# 初始化replay buffermemory = ReplayBuffer(buffer_size=args.buffer_size, obs_space=(args.buffer_size, env.observation_space.shape[0], env.observation_space.shape[1], env.observation_space.shape[2]))开始训练print('\ntraining...')begin_t = time.time()max_reward = 0for i_episode in range(args.epochs): # 每局开始,重置环境 s = env.reset() # 累计奖励值 ep_r = 0 while True: # 计算动作 a = dqn.get_action(np.expand_dims(s, axis=0)) # 执行动作 s_, r, done, info = env.step(a) # 存储信息 memory.add_transition(s, a, s_, done, r) ep_r += r # 学习优化过程 dqn.learn(memory) if done: print('Ep: ', i_episode, '| Ep_r: ', round(ep_r, 2)) if ep_r > max_reward: max_reward = ep_r print("current_max_reward {}".format(max_reward)) # 保存模型 torch.save(dqn.behaviour_model, "2048.pt") break s = s_print("finish! time cost is {}s".format(time.time() - begin_t))6. 使用模型推理游戏¶#加载模型model = torch.load("2048.pt").to(device)model.eval()s = env.reset()img = plt.imshow(env.render(mode='rgb_array'))while True: plt.axis("off") img.set_data(env.render(mode='rgb_array')) display.display(plt.gcf()) display.clear_output(wait=True) s = torch.FloatTensor(np.expand_dims(s, axis=0)).to(device) a = torch.argmax(model(s), dim=1).cpu().numpy()[0] # take action s_, r, done, info = env.step(a) time.sleep(0.1) if done: break s = s_env.close()plt.close()7. 作业请你调整步骤2中的训练参数,重新训练一个模型,使它在游戏中获得更好的表现。
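为便于理解上文 stack 函数得到的 [4, 4, 16] 状态编码(每一层对应一个 2 的幂),这里补充一个最小示例,假设当前环境中已定义了上文的 stack 函数,棋盘数值为随意构造:

```python
import numpy as np

# 构造一个示例棋盘:只有两个位置有数字
board = np.zeros((4, 4), dtype=int)
board[0, 0] = 2    # 数字 2 对应第 0 层(2 ** 1)
board[1, 1] = 8    # 数字 8 对应第 2 层(2 ** 3)

layered = stack(board, layers=16)
print(layered.shape)                              # (4, 4, 16)
print(layered[0, 0, 0], layered[1, 1, 2])         # 1 1:对应层被置为 1
print(layered[0, 0].sum(), layered[2, 2].sum())   # 1 0:空格在所有层上均为 0
```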
  • [ModelArts昇...] 使用A2C算法控制登月器着陆
    实验目标通过本案例的学习和课后作业的练习:了解A2C算法的基本概念了解如何基于A2C训练一个控制类小游戏了解强化学习训练推理游戏的整体流程你也可以将本案例相关的 ipynb 学习笔记分享到 AI Gallery Notebook 版块获得成长值,分享方法请查看此文档。案例内容介绍LunarLander是一款控制类的小游戏,也是强化学习中常用的例子。游戏任务为控制登月器着陆,玩家通过操作登月器的主引擎和副引擎,控制登月器降落。登月器平稳着陆会得到相应的奖励积分,如果精准降落在着陆平台上会有额外的奖励积分;相反地如果登月器坠毁会扣除积分。A2C全称为Advantage Actor-Critic,在本案例中,我们将展示如何基于A2C算法,训练一个LunarLander小游戏。整体流程:基于gym创建LunarLander环境->构建A2C算法->训练->推理->可视化效果A2C算法的基本结构A2C是openAI在实现baseline过程中提出的,是一种结合了Value-based (比如 Q learning) 和 Policy-based (比如 Policy Gradients) 的强化学习算法。Actor目的是学习策略函数π(θ)以得到尽量高的回报。 Critic目的是对当前策略的值函数进行估计,来评价。Policy GradientsPolicy Gradient算法的整个过程可以看作先通过策略π(θ)让agent与环境进行互动,计算每一步所能得到的奖励,并以此得到一局游戏的奖励作为累积奖励G,然后通过调整策略π,使得G最大化。所以使用了梯度提升的方法来更新网络参数θ,利用更新后的策略再采集数据,再更新,如此循环,达到优化策略的目的。Actor Criticagent在于环境互动过程中产生的G值本身是一个随机变量,可以通过Q函数去估计G的期望值,来增加稳定性。即Actor-Critic算法在PG策略的更新过程中使用Q函数来代替了G,同时构建了Critic网络来计算Q函数,此时Actor相关参数的梯度为:而Critic的损失函数使用Q估计和Q实际值差的平方损失来表示:​A2C算法A2C在AC算法的基础上使用状态价值函数给Q值增加了基线V,使反馈可以为正或者为负,因此Actor的策略梯变为:同时Critic网络的损失函数使用实际状态价值和估计状态价值的平方损失来表示:LunarLander-v2游戏环境简介LunarLander-v2,是基于gym和box2d提供的游戏环境。游戏任务为玩家通过操作登月器的喷气主引擎和副引擎来控制登月器降落。gym:开源强化学习python库,提供了算法和环境交互的标准API,以及符合该API的标准环境集。box2d:gym提供的一种环境集合注意事项本案例运行环境为 TensorFlow-1.13.1,且需使用 GPU 运行,请查看《ModelAtrs JupyterLab 硬件规格使用指南》了解切换硬件规格的方法;如果您是第一次使用 JupyterLab,请查看《ModelAtrs JupyterLab使用指导》了解使用方法;如果您在使用 JupyterLab 过程中碰到报错,请参考《ModelAtrs JupyterLab常见问题解决办法》尝试解决问题。实验步骤1. 程序初始化第1步:安装基础依赖要确保所有依赖都安装成功后,再执行之后的代码。如果某些模块因为网络原因导致安装失败,直接重试一次即可。!pip install gym!conda install swig -y!pip install box2d-py!pip install gym[box2d]第2步:导入相关的库import osimport gymimport numpy as npimport tensorflow as tfimport pandas as pd2. 参数设置¶本案例设置的 游戏最大局数 MAX_EPISODE = 100,保存模型的局数 SAVE_EPISODES = 20,以便快速跑通代码。你也可以调大 MAX_EPISODE 和 SAVE_EPISODES 的值,如1000和100,可以达到较好的训练效果,训练耗时约20分钟。MAX_EPISODE = 100 # 游戏最大局数DISPLAY_REWARD_THRESHOLD = 100 # 开启可视化的reward阈值SAVE_REWARD_THRESHOLD = 100 # 保存模型的reward阈值MAX_EP_STEPS = 2000 # 每局最大步长TEST_EPISODE = 10 # 测试局RENDER = False # 是否启用可视化(耗时)GAMMA = 0.9 # TD error中reward衰减系数RUNNING_REWARD_DECAY=0.95 # running reward 衰减系数LR_A = 0.001 # Actor网络的学习率LR_C = 0.01 # Critic网络学习率NUM_UNITS = 20 # FC层神经元个数SEED = 1 # 种子数,减小随机性SAVE_EPISODES = 20 # 保存模型的局数model_dir = './models' # 模型保存路径3. 游戏环境创建def create_env(): env = gym.make('LunarLander-v2') # 减少随机性 env.seed(SEED) env = env.unwrapped num_features = env.observation_space.shape[0] num_actions = env.action_space.n return env, num_features, num_actions4. 
4. Build the Actor-Critic networks

class Actor:
    """
    Actor network

    Parameters
    ----------
    sess : tensorflow.Session()
    n_features : int
        feature dimension
    n_actions : int
        size of the action space
    lr : float
        learning rate
    """
    def __init__(self, sess, n_features, n_actions, lr=0.001):
        self.sess = sess
        # state
        self.s = tf.placeholder(tf.float32, [1, n_features], "state")
        # action
        self.a = tf.placeholder(tf.int32, None, "action")
        # TD error
        self.td_error = tf.placeholder(tf.float32, None, "td_error")

        # the actor network is two fully connected layers; its output is the action probabilities
        with tf.variable_scope('Actor'):
            l1 = tf.layers.dense(
                inputs=self.s,
                units=NUM_UNITS,
                activation=tf.nn.relu,
                kernel_initializer=tf.random_normal_initializer(0., .1),
                bias_initializer=tf.constant_initializer(0.1),
                name='l1'
            )
            self.acts_prob = tf.layers.dense(
                inputs=l1,
                units=n_actions,
                activation=tf.nn.softmax,
                kernel_initializer=tf.random_normal_initializer(0., .1),
                bias_initializer=tf.constant_initializer(0.1),
                name='acts_prob'
            )

        with tf.variable_scope('exp_v'):
            log_prob = tf.log(self.acts_prob[0, self.a])
            # policy-gradient objective
            self.exp_v = tf.reduce_mean(log_prob * self.td_error)

        with tf.variable_scope('train'):
            # minimize(-exp_v) = maximize(exp_v)
            self.train_op = tf.train.AdamOptimizer(lr).minimize(-self.exp_v)

    def learn(self, s, a, td):
        s = s[np.newaxis, :]
        feed_dict = {self.s: s, self.a: a, self.td_error: td}
        _, exp_v = self.sess.run([self.train_op, self.exp_v], feed_dict)
        return exp_v

    # sample an action
    def choose_action(self, s):
        s = s[np.newaxis, :]
        probs = self.sess.run(self.acts_prob, {self.s: s})
        return np.random.choice(np.arange(probs.shape[1]), p=probs.ravel())


class Critic:
    """
    Critic network

    Parameters
    ----------
    sess : tensorflow.Session()
    n_features : int
        feature dimension
    lr : float
        learning rate
    """
    def __init__(self, sess, n_features, lr=0.01):
        self.sess = sess
        # state
        self.s = tf.placeholder(tf.float32, [1, n_features], "state")
        # value of the next state
        self.v_ = tf.placeholder(tf.float32, [1, 1], "v_next")
        # reward
        self.r = tf.placeholder(tf.float32, None, 'r')

        # the critic network is two fully connected layers; its output is the state value
        with tf.variable_scope('Critic'):
            l1 = tf.layers.dense(
                inputs=self.s,
                # number of hidden units
                units=NUM_UNITS,
                activation=tf.nn.relu,
                kernel_initializer=tf.random_normal_initializer(0., .1),
                bias_initializer=tf.constant_initializer(0.1),
                name='l1'
            )
            self.v = tf.layers.dense(
                inputs=l1,
                # output units
                units=1,
                activation=None,
                kernel_initializer=tf.random_normal_initializer(0., .1),
                bias_initializer=tf.constant_initializer(0.1),
                name='V'
            )

        with tf.variable_scope('squared_TD_error'):
            # TD_error = (r + gamma * V_next) - V_eval
            self.td_error = self.r + GAMMA * self.v_ - self.v
            self.loss = tf.square(self.td_error)

        with tf.variable_scope('train'):
            self.train_op = tf.train.AdamOptimizer(lr).minimize(self.loss)

    def learn(self, s, r, s_):
        s, s_ = s[np.newaxis, :], s_[np.newaxis, :]
        v_ = self.sess.run(self.v, {self.s: s_})
        td_error, _ = self.sess.run([self.td_error, self.train_op],
                                    {self.s: s, self.v_: v_, self.r: r})
        return td_error
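Before wiring the networks into the full training loop, a minimal smoke test can confirm that the two classes build and update correctly. The sketch below is not part of the original notebook; it assumes the 8-dimensional state and 4 discrete actions of LunarLander-v2 and feeds the networks random data.

# Smoke-test sketch: one critic update and one actor update on fake data.
tf.reset_default_graph()
with tf.Session() as sess:
    actor = Actor(sess, n_features=8, n_actions=4, lr=LR_A)
    critic = Critic(sess, n_features=8, lr=LR_C)
    sess.run(tf.global_variables_initializer())

    s = np.random.randn(8).astype(np.float32)       # fake current state
    s_next = np.random.randn(8).astype(np.float32)  # fake next state
    a = actor.choose_action(s)                      # sample an action from pi(a|s)
    td = critic.learn(s, 1.0, s_next)               # one critic update, returns the TD error
    actor.learn(s, a, td)                           # one actor update weighted by the TD error
    print("sampled action:", a, "TD error:", td.ravel()[0])
tf.reset_default_graph()                            # leave a clean graph for the training cell below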
5. Create the training function

def model_train():
    env, num_features, num_actions = create_env()
    render = RENDER
    sess = tf.Session()

    actor = Actor(sess, n_features=num_features, n_actions=num_actions, lr=LR_A)
    critic = Critic(sess, n_features=num_features, lr=LR_C)

    sess.run(tf.global_variables_initializer())
    saver = tf.train.Saver()

    for i_episode in range(MAX_EPISODE + 1):
        cur_state = env.reset()
        cur_step = 0
        track_r = []
        while True:
            # rendering this game is not supported in the notebook
            # if RENDER:
            #     env.render()
            action = actor.choose_action(cur_state)
            next_state, reward, done, info = env.step(action)
            track_r.append(reward)
            # gradient = grad[reward + gamma * V(next_state) - V(cur_state)]
            td_error = critic.learn(cur_state, reward, next_state)
            # true_gradient = grad[logPi(cur_state, action) * td_error]
            actor.learn(cur_state, action, td_error)
            cur_state = next_state
            cur_step += 1
            if done or cur_step >= MAX_EP_STEPS:
                ep_rs_sum = sum(track_r)
                if 'running_reward' not in locals():
                    running_reward = ep_rs_sum
                else:
                    running_reward = running_reward * RUNNING_REWARD_DECAY + ep_rs_sum * (1 - RUNNING_REWARD_DECAY)
                # check whether the rendering threshold has been reached
                # if running_reward > DISPLAY_REWARD_THRESHOLD:
                #     render = True
                print("episode:", i_episode, " reward:", int(running_reward), " steps:", cur_step)
                break
        if i_episode > 0 and i_episode % SAVE_EPISODES == 0:
            if not os.path.exists(model_dir):
                os.mkdir(model_dir)
            ckpt_path = os.path.join(model_dir, '{}_model.ckpt'.format(i_episode))
            saver.save(sess, ckpt_path)

6. Start training
Training one episode takes about 1.2 seconds.

print('MAX_EPISODE:', MAX_EPISODE)
model_train()
# reset graph
tf.reset_default_graph()

7. Run inference with the model
The game's built-in rendering relies on OpenGL and needs a desktop window, which the current environment cannot display, so the game cannot be visualized here. You can download the code to your local machine and uncomment the env.render() line to see the visualization.

def model_test():
    env, num_features, num_actions = create_env()
    sess = tf.Session()
    actor = Actor(sess, n_features=num_features, n_actions=num_actions, lr=LR_A)
    sess.run(tf.global_variables_initializer())
    saver = tf.train.Saver()
    saver.restore(sess, tf.train.latest_checkpoint(model_dir))

    for i_episode in range(TEST_EPISODE):
        cur_state = env.reset()
        cur_step = 0
        track_r = []
        while True:
            # rendering
            # env.render()
            action = actor.choose_action(cur_state)
            next_state, reward, done, info = env.step(action)
            track_r.append(reward)
            cur_state = next_state
            cur_step += 1
            if done or cur_step >= MAX_EP_STEPS:
                ep_rs_sum = sum(track_r)
                print("episode:", i_episode, " reward:", int(ep_rs_sum), " steps:", cur_step)
                break

model_test()

episode: 0  reward: -31  steps: 196
episode: 1  reward: -99  steps: 308
episode: 2  reward: -273  steps: 533
episode: 3  reward: -5  steps: 232
episode: 4  reward: -178  steps: 353
episode: 5  reward: -174  steps: 222
episode: 6  reward: -309  steps: 377
episode: 7  reward: 24  steps: 293
episode: 8  reward: -121  steps: 423
episode: 9  reward: -194  steps: 286

8. Visualization
The video below shows the inference result of a model trained for 1000 episodes; it demonstrates the lander touching down safely on three different terrains.
cid:link_0

9. Assignment
Adjust the training parameters in step 2 and retrain the model so that it performs better in the game.
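For the assignment, one simple approach is to raise the episode budget that step 2 deliberately kept small and then re-run training and testing. The sketch below is not from the original notebook; the learning-rate values are only an example of the kind of adjustment you might try.

# Possible parameter adjustment for the assignment (values are illustrative).
MAX_EPISODE = 1000    # more episodes, as suggested in step 2 (roughly 20 minutes of training)
SAVE_EPISODES = 100   # save a checkpoint every 100 episodes
LR_A = 0.0005         # a smaller actor learning rate can make policy updates more stable
LR_C = 0.005          # keep the critic learning faster than the actor

tf.reset_default_graph()
model_train()
tf.reset_default_graph()
model_test()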
  • [Help Wanted] Uploading an image via a third-party call succeeds, but viewing the echoed image reports an "unauthorized" error
    Uploading an image through a third-party call succeeds, but when the image is displayed back for viewing, an "unauthorized" error is reported.
  • [Campus Developer Zone] [HCSD-DevCloud Training Camp Study Notes] -- Summary of deploying the Plane War game project
    1. Project introduction: this project develops the Plane War game through DevCloud and deploys it to a Kunpeng cloud server.
    2. Lab resources:
       2.1 First create cloud-service resources in the "CN North-Beijing4" region on the Huawei Cloud website;
       2.2 then create a Virtual Private Cloud (VPC) in the console;
       2.3 create a security group under access control and add rules.
    3. Cloud environment configuration: purchase an Elastic Cloud Server (ECS) from the service list.
    4. Create a project: create the project at https://devcloud.cn-north-4.huaweicloud.com/home.
    5. Upload the code: configure Git and the SSH key, then push the local Plane War code to the cloud.
    6. Build the project: execute and release the project in the cloud.
    7. Deploy the project: under the project's build & release options, create a host group, connect the host, and deploy the project; finally, verify that the deployment succeeded.
  • [Partner Solutions] How can a startup studio find the right technical co-founder (lead programmer for a game)?
    About me: I have a science/engineering background and a complete game design document, and I am looking for a lead-programmer partner to start a studio. My responsibilities: the overall design (including but not limited to logic, gameplay, formulas, numerical balancing, story, etc.) and a complete development roadmap, taken step by step. What I am looking for: good communication (where the design does not map well to code, it can be adjusted to fit the program logic), drive (enjoys digging into problems and building up experience together), and spare time (part-time at the start-up stage is fine; we can decide the next step once there are results, as long as time is allocated reasonably).
  • [Campus Developer Zone] [HCSD-DevCloud Training Camp Study Notes] Plane War project development and deployment - lab check-in
    Preface: Plane War is built on Cocos2d and developed with the Cocos Creator game engine; it runs on both PC and mobile. In this exercise we use Huawei Cloud DevCloud to take Plane War from development through to deployment.
DevCloud: built on Huawei's own successful R&D cloud practice, DevCloud provides a one-stop cloud DevOps platform delivered as a cloud service. Development teams use it on demand and carry out project management, configuration management, code checking, compilation, build, test, deployment, release and more in the cloud.
What you learn: how to develop and deploy in the cloud with Huawei Cloud DevCloud; how to use some Huawei Cloud products (Elastic Cloud Server, Virtual Private Cloud, etc.); a first look at developing with Cocos2d.
Notes:
Virtual Private Cloud: a virtual network environment built in the cloud; just like an ordinary network setup, you can request bandwidth and create subnets, security groups, and so on.
Security group: an access policy for cloud resources; a request must satisfy the security-group rules before the target resource can be accessed.
Elastic Cloud Server: pay-per-use; a cloud server can be set up anytime, anywhere, which makes it easy to practice installing and deploying in an online environment. Billing is usually by the hour, which greatly lowers the cost of learning, eases the financial burden, and avoids wasting cloud resources.
Code push: push local code to the remote repository; the remote repository must be configured with the local SSH key or HTTPS password so that identities can be matched and local-to-remote operations performed.
Build: package the target files and necessary documents into a software package for release.
Deploy: deploy the software package to the cloud host.
Git: a code repository management tool used to upload our code to DevCloud for end-to-end management.
Cocos Creator: a lightweight, easy-to-use cross-platform 2D/3D game creation tool.
Development flow: code development -> cloud server configuration -> push the code to DevCloud -> DevCloud compiles, builds, and packages the software -> deploy the application to the cloud server.
Lab flow: upload the finished local code to the DevCloud code repository; compile it into a zip package; release the zip package to the specified path; deploy on the Kunpeng ECS and configure the host. Below are some screenshots from my own run of the lab; for the detailed steps, see the lab manual.