• [User Story] Wordle: the little game that went viral after being built to amuse a girlfriend
    The legend of Wordle. Wordle is easily one of the most popular casual games of recent years. In 2021, a programmer named Josh Wardle spent a few evenings building this simple word-guessing game to amuse his girlfriend. She loved it and shared it with friends, who shared it with their friends... and within a few months the whole world was playing Wordle. Twitter was flooded with those green, yellow, and gray square screenshots; even my mom asked me what the colored grids meant. The craziest part: Josh only meant to build a small toy, yet The New York Times bought it for a seven-figure sum. A weekend project turned into a million-dollar business, which is probably every programmer's dream.
    Why did Wordle take off? I think it comes down to a few things:
    - Easy to learn: the rules take five minutes to pick up.
    - One puzzle a day: it never becomes addictive, yet it keeps you looking forward to tomorrow's challenge.
    - Built for sharing: the share-your-grid feature is brilliant, letting you show off a result without spoiling the answer.
    - Free and clean: no ads, no in-app purchases, just pure play.
    The essence of the gameplay. The rules are simple: six tries to guess a five-letter English word. After each guess, the colors tell you:
    - Green: right letter, right position.
    - Yellow: the letter is in the word, but in the wrong position.
    - Gray: the letter is not in the word.
    (A short sketch of this scoring logic follows at the end of this post.) It sounds simple, but playing well takes technique, and veterans all have their own routines.
    Opening strategy: most people start with a vowel-heavy word such as "ADIEU", "AUDIO", or "AROSE". I personally like "STARE", because S, T, and R are high-frequency letters.
    Advanced tips: don't waste letters already confirmed gray; when you have several yellow letters, pin down their positions before trying new letters; sometimes it pays to deliberately guess a word you know is wrong just to eliminate more letters.
    Psychology: Wordle's answer list follows a pattern. It avoids obscure words and plural forms, and knowing that saves you detours.
    What makes Wordless different. While building Wordless I kept asking: Wordle is fun, but why only five letters? Why only one puzzle a day? So Wordless has these features:
    - Variable length: choose anything from 3 to 8 letters. Three-letter words are easy warm-ups; eight-letter words will drive you mad, perfect for self-torture. I often play a few 3-letter rounds to build confidence, then get humbled by an 8-letter one.
    - Unlimited play: play as long as you like without waiting for tomorrow. Sometimes you crack a hard word and want to keep going; Wordless lets you.
    - Smart word bank: the same word never appears twice in a row, so every round feels fresh, and words are grouped by length so each difficulty level has enough vocabulary.
    - Shifting strategy: different lengths call for different approaches. Three letters is almost pure guessing, while eight letters needs a systematic method.
    Playing Wordless, I find my strategy adapts to the word length: for 3-4 letters, guess common words like "THE" and "AND"; for 5-6 letters, use classic Wordle tactics; for 7-8 letters, pin down the vowels first, then fill in the consonants.
    Other fun variants. After Wordle blew up, variants sprang up like mushrooms, and some are genuinely creative:
    - Absurdle: it actively works against you, always switching to whichever answer is hardest to hit. It feels like a battle of wits with an AI.
    - Worldle: guess the country from its outline, a paradise for geography fans. Small island nations stump me regularly.
    - Heardle: guess the song from its intro. A musical Wordle, though tone-deaf players like me mostly just guess.
    - Nerdle: a math Wordle where you guess an equation. Great if you're good at math; I usually give up at first glance.
    These variants all prove how powerful the core mechanic is: it can be applied to almost any domain.
    Tips from my own play. After plenty of word games, a few takeaways:
    - Don't obsess over the perfect opener: people agonize over the first word, but the difference is small; what matters is adjusting to the feedback.
    - Learn to use elimination: deliberately guessing a wrong word just to rule out letters is an advanced move.
    - Keep building vocabulary: these games really do teach new words; that's how my English vocabulary has slowly grown.
    - Enjoy the process: don't fixate on your score; the fun is in closing in on the answer step by step.
    One last thing: whether it's Wordle or Wordless, what matters most is having fun. Games are meant to be entertainment, not an exam.
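    To make the color rules above concrete, here is a minimal Python sketch (my own illustration, not code from Wordle or Wordless) that scores one guess against an answer, handling repeated letters the way the green/yellow/gray rules imply:

    # Hypothetical illustration of Wordle-style feedback scoring.
    from collections import Counter
    from typing import List

    def score_guess(guess: str, answer: str) -> List[str]:
        guess, answer = guess.upper(), answer.upper()
        result = ["gray"] * len(guess)
        remaining = Counter()

        # First pass: mark greens and count the answer letters left unmatched.
        for i, (g, a) in enumerate(zip(guess, answer)):
            if g == a:
                result[i] = "green"
            else:
                remaining[a] += 1

        # Second pass: mark yellows, consuming each unmatched answer letter at most once.
        for i, g in enumerate(guess):
            if result[i] != "green" and remaining[g] > 0:
                result[i] = "yellow"
                remaining[g] -= 1
        return result

    print(score_guess("STARE", "TRAIN"))  # ['gray', 'yellow', 'green', 'yellow', 'gray']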
  • China Mobile and Huawei jointly launch the Mobile Cloud IT Converged Application Innovation Industry Ecosystem Consortium
    On April 25, the 2023 Mobile Cloud Conference, themed "云擎未来 智信天下" ("Cloud powers the future, smart trust for the world") and hosted by China Mobile Communications Group, was held at the Suzhou Jinji Lake International Conference Center. Huawei Senior Vice President 曹既斌 attended the conference and took part in the launch ceremony of the Mobile Cloud IT Converged Application Innovation Industry Ecosystem Consortium. The launch marks China Mobile's commitment to pooling strengths and building an open, inclusive, win-win industry ecosystem. China Mobile and Huawei are entering a new cycle of cooperation; guided by its "Everything as a Service" strategy, Huawei will join forces with Mobile Cloud to keep strengthening the technology foundation, create more value for the industry, and inject momentum into the building of a digital China.
    At the conference, Huawei Cloud Computing Technologies Co., Ltd. Vice President 黄瑾 delivered a keynote titled "Everything as a Service: advancing the cloud industry and deepening digital intelligence," presenting Huawei Cloud's view of the industry, its business strategy, and the results and outlook of its cooperation with Mobile Cloud: "At this critical moment of digital transformation, Huawei Cloud uses Everything as a Service to create new value for industries, help enterprises go digital and intelligent, and build a thriving, win-win ecosystem. The strategic cooperation between China Mobile and Huawei in cloud computing has made good progress. By practicing Everything as a Service, we open up the technologies and products born of our R&D investment, our digital-transformation experience, and our partners' capabilities as cloud services, improving competitiveness while better serving customers together."
  • [Featured Event] The five major football leagues and an outlook on 2026 World Cup qualification (USA/Canada/Mexico)
    For the 2026 World Cup in the United States, Canada and Mexico, the AFC has 8.5 qualification slots. Asian qualifying began with a first round of 22 teams; the 11 winners joined the 25 highest-ranked AFC teams in the FIFA rankings to contest the second round of 36 teams. The 36 teams are drawn into 9 groups of 4, playing a home-and-away round-robin over 6 matchdays; only the top two in each group advance to the third round of 18, while the third- and fourth-placed teams are eliminated and out of the 2026 finals. China's men's team has reached the round of 36, drawn in a group with South Korea, Thailand and Singapore, and the goal is obviously a top-two finish and a place in the round of 18. On matchday 1, at 20:30 on the evening of November 16, China visit Thailand; it can fairly be called "the decider in the opener", and they must come away with at least a point. On matchday 2, at 20:00 on November 21, China host South Korea. With stars at European powerhouses such as Son Heung-min, Kim Min-jae and Lee Kang-in, South Korea should finish top of the group barring a surprise, so a single point for China in Shenzhen would already meet the target; an upset win would be a huge bonus and all but settle qualification. On matchdays 3 and 4, China face Singapore back-to-back on March 21 and 24 next year, away first and then at home. Singapore are the weakest side in the group, so the only acceptable outcome is the full 6 points; dropping even 2 would be a major loss. If after four rounds China have beaten Thailand, drawn with South Korea and done the double over Singapore, with 3 wins and 1 draw for 10 points, they could well qualify with two rounds to spare, because Thailand still have to play South Korea twice and will most likely lose both, leaving them on only 3 points from 1 win and 3 losses over four rounds — that is the ideal scenario, of course. On matchday 5, on June 6, China host Thailand; anything less than a win would put qualification in serious doubt, because on matchday 6, on June 11, China travel to South Korea, who may already have qualified by then but are unlikely to ease off.
  • [技术干货] 在华为开发者空间,基于鲲鹏服务器的打砖块小游戏部署
    “【摘要】 鲲鹏服务器是基于鲲鹏处理器的新一代数据中心服务器,适用于大数据、分布式存储、高性能计算和数据库等应用。鲲鹏服务器具有高性能、低功耗、灵活的扩展能力,适合大数据分析、软件定义存储、Web等应用场景。”一、 案例介绍鲲鹏服务器是基于鲲鹏处理器的新一代数据中心服务器,适用于大数据、分布式存储、高性能计算和数据库等应用。鲲鹏服务器具有高性能、低功耗、灵活的扩展能力,适合大数据分析、软件定义存储、Web等应用场景。本案例将指导开发者如何在鲲鹏服务器部署并运行web小游戏。二、免费领取云主机如您还没有云主机,可点击链接 ,领取专属云主机后进行操作。如您已领取云主机,可直接开始实验。三、实验流程说明:1、自动部署鲲鹏服务器;2、使用终端连接鲲鹏服务器;3、创建html文件;4、启动Web服务器;5、体验游戏。四、实验资源本次实验预计花费总计0元。资源名称规格单价(元)时长(h)云主机2vCPUs | 4GB RAM免费1五、自动部署鲲鹏服务器1、在下载的更新包目录下点击鼠标右键选择“Open Terminal Here”,打开命令终端窗口。执行自动部署命令如下:hcd deploy --password abcd1234! --time 1800命令的参数说明:password:password关键字后设置的是鲲鹏服务器的root用户密码,命令中给出的默认为abcd1234!,开发者可以替换成自定义密码(至少8个字符)。time:time关键字后面设置的为鲲鹏服务器的可用时间,单位为秒,至少600秒。当前实验预估需要20分钟,为了保证时间充足,在命令中申请的时间为30分钟,即1800秒。该命令会自动部署鲲鹏服务器。首次部署会直接执行,旧资源未到期时重复部署,会提示是否删除前面创建的资源,可以删除旧资源再次部署。记录部署远端服务器公网IP,如截图中对应的就是:113.44.86.210。六、拷贝代码新打开一个命令窗口,在命令窗口中输入命令登录远端服务器,命令如下:ssh root@远端服务器公网IP输入密码,密码为步骤五中自动部署命令行中“--password”后面的参数,命令中给出的默认为abcd1234!,如果没有修改,就使用abcd1234!进行登录,如果设置了自定义密码,直接输入自定义的密码(注意:输入过程中密码不会显示,密码输入完成按回车键结束)。输入密码也可以借助剪切板进行复制粘贴,避免直接在命令窗口输入看不到输入内容而密码错误:登录成功后创建文件夹用于存放html文件,命令如下:mkdir gamecd gamevi game.html进入到Vim编辑器,按下键盘的“i”键进入到插入模式下,复制下列代码粘贴到编辑器中。(复制文档中代码时,如果包含页眉,请删除页眉部分!也可以通过链接下载代码)<!DOCTYPE html><html lang="en"><head> <meta charset="UTF-8"> <meta name="viewport" content="width=device-width, initial-scale=1.0"> <title>Arkanoid game</title> <style> body { display: flex; justify-content: center; align-items: center; height: 100vh; margin: 0; background-color: #f0f0f0; } canvas { border: 5px solid #3498db; border-radius: 10px; } /* game start cues */ .game-start-text { position: absolute; top: 50%; left: 50%; transform: translate(-50%, -50%); font-size: 24px; color: green; background-color: rgba(255, 255, 255, 0.8); padding: 10px; border-radius: 5px; } /* score display style */ .score-display { position: absolute; top: 20px; left: 50%; transform: translateX(-50%); font-size: 18px; color: #333; font-weight: bold; background-color: rgba(255, 255, 255, 0.7); padding: 5px 10px; border-radius: 5px; } </style></head><body><canvas id="gameCanvas" width="800" height="600"></canvas><div class="score-display" id="scoreDisplay">score:0</div><div id="gameStartText" class="game-start-text">start</div><audio id="hitBrickSound" preload="auto"> <source src="hitBrick.mp3" type="audio/mpeg"> Your browser does not support the audio element.</audio><audio id="hitPaddleSound" preload="auto"> <source src="hitPaddle.mp3" type="audio/mpeg"> Your browser does not support the audio element.</audio><audio id="gameOverSound" preload="auto"> <source src="gameOver.mp3" type="audio/mpeg"> Your browser does not support the audio element.</audio><script> const canvas = document.getElementById('gameCanvas'); const ctx = canvas.getContext('2d'); const ballRadius = 10; let x = canvas.width / 2; let y = canvas.height - 30; let dx = 3; let dy = -3; const paddleHeight = 10; const paddleWidth = 100; let paddleX = (canvas.width - paddleWidth) / 2; const brickRowCount = 10; const brickColumnCount = 15; const brickWidth = 48; const brickHeight = 20; const bricks = []; let score = 0; let gameStarted = false; for (let c = 0; c < brickColumnCount; c++) { bricks[c] = []; for (let r = 0; r < brickRowCount; r++) { bricks[c][r] = { x: 0, y: 0, status: 1 }; } } document.addEventListener('mousemove', mouseMoveHandler, false); document.addEventListener('click', startGame, false); function mouseMoveHandler(e) { if (gameStarted) { const relativeX = e.clientX - canvas.offsetLeft; if (relativeX > 0 && relativeX < canvas.width) { paddleX = relativeX - paddleWidth / 2; } } } 
function startGame() { if (!gameStarted) { gameStarted = true; document.getElementById('gameStartText').style.display = 'none'; draw(); } } function drawBall() { ctx.beginPath(); ctx.arc(x, y, ballRadius, 0, Math.PI * 2); ctx.fillStyle = '#0095DD'; ctx.fill(); ctx.closePath(); } function drawPaddle() { ctx.beginPath(); ctx.rect(paddleX, canvas.height - paddleHeight, paddleWidth, paddleHeight); ctx.fillStyle = '#0095DD'; ctx.fill(); ctx.closePath(); } function drawBricks() { for (let c = 0; c < brickColumnCount; c++) { for (let r = 0; r < brickRowCount; r++) { if (bricks[c][r].status === 1) { const brickX = c * (brickWidth + 2) + 20; const brickY = r * (brickHeight + 2) + 20; bricks[c][r].x = brickX; bricks[c][r].y = brickY; ctx.beginPath(); ctx.rect(brickX, brickY, brickWidth, brickHeight); ctx.fillStyle = '#0095DD'; ctx.fill(); ctx.closePath(); } } } } function collisionDetection() { for (let c = 0; c < brickColumnCount; c++) { for (let r = 0; r < brickRowCount; r++) { const b = bricks[c][r]; if (b.status === 1) { if (x > b.x && x < b.x + brickWidth && y > b.y && y < b.y + brickHeight) { dy = -dy; b.status = 0; score++; document.getElementById('scoreDisplay').textContent = 'score:' + score; const hitBrickSound = document.getElementById('hitBrickSound'); hitBrickSound.play(); } } } } } function draw() { ctx.clearRect(0, 0, canvas.width, canvas.height); if (gameStarted) { drawBricks(); drawBall(); drawPaddle(); collisionDetection(); x += dx; y += dy; if (x + dx > canvas.width - ballRadius || x + dx < ballRadius) { dx = -dx; } if (y + dy < ballRadius) { dy = -dy; } else if (y + dy > canvas.height - ballRadius) { if (x > paddleX && x < paddleX + paddleWidth) { dy = -dy; const hitPaddleSound = document.getElementById('hitPaddleSound'); hitPaddleSound.play(); } else { const gameOverSound = document.getElementById('gameOverSound'); gameOverSound.play(); document.location.reload(); } } } requestAnimationFrame(draw); } draw();</script></body></html>按下ESC按钮退出编辑模式,输入“:wq”,退出并保存game.html文件。七、安装软件包安装Python3,命令如下:sudo yum install -y python3 安装成功后检查Python3版本确认是否安装成功。python3 --version八、浏览器访问在当前存放代码的路径下,使用Python3启动一个简单的Web服务器,命令如下:python3 -m http.server如下图所示,代表当前Web服务器已经启动。打开火狐浏览器,在地址栏输入“http://弹性云服务器IP:8000/game.html”即可体验游戏。至此实验全部完成。
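    As a small aside to step eight above: the tutorial starts the server with `python3 -m http.server`, which defaults to port 8000 in the current directory. If you prefer to pin the port and directory explicitly, an equivalent stand-alone script might look like the sketch below (the /root/game path is an assumption matching the folder created in step six; adjust it to wherever game.html actually lives, and note that the `directory` argument needs Python 3.7 or newer):

    import http.server
    import socketserver

    PORT = 8000               # the tutorial's browser URL uses port 8000
    DIRECTORY = "/root/game"  # assumed path; step six creates a "game" folder in root's home

    def handler(*args, **kwargs):
        # SimpleHTTPRequestHandler accepts a `directory` keyword since Python 3.7
        return http.server.SimpleHTTPRequestHandler(*args, directory=DIRECTORY, **kwargs)

    with socketserver.TCPServer(("0.0.0.0", PORT), handler) as httpd:
        print(f"Serving {DIRECTORY} at http://<server-ip>:{PORT}/game.html")
        httpd.serve_forever()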
  • [Regional Preliminary Q&A] When will the leaderboard be updated in the official round?
    When will the leaderboard be updated during the official round? Will it keep refreshing until the competition ends? Is the refresh frequency the same as in the practice round?
  • [Case Co-creation] Small case completed
    I thought it would be hard at first, but it turned out to be quite simple.
  • [Regional Preliminary Q&A] compile_error
    Why does zipping the .cpp files from the demos folder and submitting them directly always result in compile_error?
  • Works by 风行
  • 使用强化学习AlphaZero算法训练中国象棋AI
    使用强化学习AlphaZero算法训练中国象棋AI案例目标通过本案例的学习和课后作业的练习:了解强化学习AlphaZero算法;利用AlphaZero算法进行一次中国象棋AI训练;你也可以将本案例相关的 ipynb 学习笔记分享到 AI Gallery Notebook 版块获得成长值,分享方法请查看此文档。案例内容介绍AlphaZero是一种强化学习算法,近期利用AlphaZero训练出的AI以绝对的优势战胜了多名围棋以及国际象棋冠军。AlphaZero创新点在于,它能够在不依赖于外部先验知识(也称专家知识),仅仅了解游戏规则的情况下,在棋盘类游戏中获得超越人类的表现。本次案例将详细的介绍AlphaZero算法核心原理,包括神经网络构建、MCTS搜索、自博弈训练,以代码的形式加深对算法的理解,算法详情亦可见论文《Mastering the game of Go without human knowledge》。同时本案例提供中国象棋强化学习环境,利用AlphaZero进行一次中国象棋训练,最后可视化象棋AI自博弈对局。由于训练一个强力的中国象棋AI需要大量的训练时间和资源,本案例偏重于算法理解,在运行过程中简化了训练过程,减少了自博弈次数和搜索次数。如果想要完整地训练一个中国象棋AlphaZero AI,可在AI Gallery中订阅《CChess中国象棋》算法,并在ModelArts中进行训练。注意事项本案例运行环境为 TensorFlow-1.13.1,且建议使用 GPU 运行,请查看《ModelAtrs JupyterLab 硬件规格使用指南》了解切换硬件规格的方法;如果您是第一次使用 JupyterLab,请查看《ModelAtrs JupyterLab使用指导》了解使用方法;如果您在使用 JupyterLab 过程中碰到报错,请参考《ModelAtrs JupyterLab常见问题解决办法》尝试解决问题;请逐步运行下面的每一个代码块;实验步骤程序初始化构建神经网络实现MCTS实现自博弈过程进行训练参数配置开始自博弈训练模型更新可视化对局1. 程序初始化第1步:安装基础依赖要确保所有依赖都安装成功后,再执行之后的代码。如果某些模块因为网络原因导致安装失败,直接重试一次即可。!pip install tornado==6.1.0!pip install tflearn==0.3.2!pip install tqdm!pip install urllib3==1.22!pip install threadpool==1.3.2!pip install xmltodict==0.12.0!pip install requests!pip install pandas==0.19.2!pip install numpy==1.14.5!pip install scipy==1.1.0!pip install matplotlib==2.0.0!pip install nest_asyncio!pip install gast==0.2.2第2步: 下载依赖包import osimport moxing as moxif not os.path.exists('cchess_training'): mox.file.copy("obs://modelarts-labs-bj4/course/modelarts/reinforcement_learning/cchess_gameplay/cchess_training/cchess_training.zip", "cchess_training.zip") os.system('unzip cchess_training.zip')第3步:导入相关的库%matplotlib notebook%matplotlib autoimport osimport sysimport loggingimport subprocessimport copyimport randomimport jsonimport asyncioimport timeimport numpy as npimport tensorflow as tffrom multiprocessing import Processfrom cchess_training.cchess_zero import board_visualizerfrom cchess_training.gameplays import players, gameplayfrom cchess_training.config import conffrom cchess_training.common.board import create_uci_labelsfrom cchess_training.cchess_training_model_update import model_updatefrom cchess_training.cchess_zero.gameboard import GameBoardfrom cchess_training.cchess_zero import cbffrom cchess_training.utils import get_latest_weight_pathimport nest_asyncionest_asyncio.apply()os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)logging.basicConfig(level=logging.INFO, format="[%(asctime)s] [%(levelname)s] [%(message)s]", datefmt='%Y-%m-%d %H:%M:%S' )2.构建神经网络这里基于Resnet实现了AlphaZero中的神经网络,神经网络输入为当前象棋棋面转化得到的0-1图,大小为[10, 9, 14],[10, 9]表示象棋棋盘大小,[14]每一个plane对应一类棋子,我方7类(兵、炮、车、马、相、仕、将),敌方7类,共14个plane。经过Resnet提取特征后分为两个分支,一个是价值分支,输出当前棋面价值,另一个是策略头,输出神经网络计算得到的动作对应概率。# resnetdef res_block(inputx, name, training, block_num=2, filters=256, kernel_size=(3, 3)): net = inputx for i in range(block_num): net = tf.layers.conv2d( net, filters=filters, kernel_size=kernel_size, activation=None, name="{}_res_conv{}".format(name, i), padding='same' ) net = tf.layers.batch_normalization(net, training=training, name="{}_res_bn{}".format(name, i)) if i == block_num - 1: net = net + inputx net = tf.nn.elu(net, name="{}_res_elu{}".format(name, i)) return netdef conv_block(inputx, name, training, block_num=1, filters=2, kernel_size=(1, 1)): net = inputx for i in range(block_num): net = tf.layers.conv2d( net, filters=filters, kernel_size=kernel_size, activation=None, name="{}_convblock_conv{}".format(name, i), padding='same' ) net = tf.layers.batch_normalization(net, training=training, 
name="{}_convblock_bn{}".format(name, i)) net = tf.nn.elu(net, name="{}_convblock_elu{}".format(name, i)) # net shape [None,10,9,2] netshape = net.get_shape().as_list() net = tf.reshape(net, shape=(-1, netshape[1] * netshape[2] * netshape[3])) net = tf.layers.dense(net, 10 * 9, name="{}_dense".format(name)) net = tf.nn.elu(net, name="{}_elu".format(name)) return netdef res_net_board(inputx, name, training, filters=256, num_res_layers=4): net = inputx net = tf.layers.conv2d( net, filters=filters, kernel_size=(3, 3), activation=None, name="{}_res_convb".format(name), padding='same' ) net = tf.layers.batch_normalization(net, training=training, name="{}_res_bnb".format(name)) net = tf.nn.elu(net, name="{}_res_elub".format(name)) for i in range(num_res_layers): net = res_block(net, name="{}_layer_{}".format(name, i + 1), training=training, filters=filters) return netdef get_scatter(name): with tf.variable_scope("Test"): ph = tf.placeholder(tf.float32, name=name) op = tf.summary.scalar(name, ph) return ph, opdef average_gradients(tower_grads): """Calculate the average gradient for each shared variable across all towers. Note that this function provides a synchronization point across all towers. Args: tower_grads: List of lists of (gradient, variable) tuples. The outer list is over individual gradients. The inner list is over the gradient calculation for each tower. Returns: List of pairs of (gradient, variable) where the gradient has been averaged across all towers. """ average_grads = [] for grad_and_vars in zip(*tower_grads): # Note that each grad_and_vars looks like the following: # ((grad0_gpu0, var0_gpu0), ... , (grad0_gpuN, var0_gpuN)) grads = [] for g, _ in grad_and_vars: # Add 0 dimension to the gradients to represent the tower. expanded_g = tf.expand_dims(g, 0) # Append on a 'tower' dimension which we will average over below. grads.append(expanded_g) # Average over the 'tower' dimension. grad = tf.concat(grads, 0) grad = tf.reduce_mean(grad, 0) # Keep in mind that the Variables are redundant because they are shared # across towers. So .. we will just return the first tower's pointer to # the Variable. 
v = grad_and_vars[0][1] grad_and_var = (grad, v) average_grads.append(grad_and_var) return average_gradsdef add_grad_to_list(opt, train_param, loss, tower_grad): grads = opt.compute_gradients(loss, var_list=train_param) grads = [i[0] for i in grads] tower_grad.append(zip(grads, train_param))def get_op_mul(tower_gradients, optimizer, gs): grads = average_gradients(tower_gradients) train_op = optimizer.apply_gradients(grads, gs) return train_opdef reduce_mean(x): return tf.reduce_mean(x)def merge(x): return tf.concat(x, axis=0)def get_model_resnet( model_name, labels, gpu_core=[0], batch_size=512, num_res_layers=4, filters=256, extra=False, extrav2=False): tf.reset_default_graph() graph = tf.Graph() with graph.as_default(): x_input = tf.placeholder(tf.float32, [None, 10, 9, 14]) nextmove = tf.placeholder(tf.float32, [None, len(labels)]) score = tf.placeholder(tf.float32, [None, 1]) training = tf.placeholder(tf.bool, name='training_mode') learning_rate = tf.placeholder(tf.float32) global_step = tf.train.get_or_create_global_step() optimizer_policy = tf.train.MomentumOptimizer(learning_rate=learning_rate, momentum=0.9) optimizer_value = tf.train.MomentumOptimizer(learning_rate=learning_rate, momentum=0.9) optimizer_multitarg = tf.train.MomentumOptimizer(learning_rate=learning_rate, momentum=0.9) tower_gradients_policy, tower_gradients_value, tower_gradients_multitarg = [], [], [] net_softmax_collection = [] value_head_collection = [] multitarget_loss_collection = [] value_loss_collection = [] policy_loss_collection = [] accuracy_select_collection = [] with tf.variable_scope(tf.get_variable_scope()) as vscope: for ind, one_core in enumerate(gpu_core): if one_core is not None: devicestr = "/gpu:{}".format(one_core) if one_core is not None else "" else: devicestr = '/cpu:0' with tf.device(devicestr): body = res_net_board( x_input[ind * (batch_size // len(gpu_core)):(ind + 1) * (batch_size // len(gpu_core))], "selectnet", training=training, filters=filters, num_res_layers=num_res_layers ) with tf.variable_scope("policy_head"): policy_head = tf.layers.conv2d(body, 2, 1, padding='SAME') policy_head = tf.contrib.layers.batch_norm( policy_head, center=False, epsilon=1e-5, fused=True, is_training=training, activation_fn=tf.nn.relu ) policy_head = tf.reshape(policy_head, [-1, 9 * 10 * 2]) policy_head = tf.contrib.layers.fully_connected(policy_head, len(labels), activation_fn=None) # 价值头 with tf.variable_scope("value_head"): value_head = tf.layers.conv2d(body, 1, 1, padding='SAME') value_head = tf.contrib.layers.batch_norm( value_head, center=False, epsilon=1e-5, fused=True, is_training=training, activation_fn=tf.nn.relu ) value_head = tf.reshape(value_head, [-1, 9 * 10 * 1]) value_head = tf.contrib.layers.fully_connected(value_head, 256, activation_fn=tf.nn.relu) value_head = tf.contrib.layers.fully_connected(value_head, 1, activation_fn=tf.nn.tanh) value_head_collection.append(value_head) net_unsoftmax = policy_head with tf.variable_scope("Loss"): policy_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits( labels=nextmove[ind * (batch_size // len(gpu_core)): (ind + 1) * (batch_size // len(gpu_core))], logits=net_unsoftmax)) value_loss = tf.losses.mean_squared_error( labels=score[ind * (batch_size // len(gpu_core)):(ind + 1) * (batch_size // len(gpu_core))], predictions=value_head) value_loss = tf.reduce_mean(value_loss) regularizer = tf.contrib.layers.l2_regularizer(scale=1e-5) regular_variables = tf.trainable_variables() l2_loss = tf.contrib.layers.apply_regularization(regularizer, 
regular_variables) multitarget_loss = value_loss + policy_loss + l2_loss multitarget_loss_collection.append(multitarget_loss) value_loss_collection.append(value_loss) policy_loss_collection.append(policy_loss) net_softmax = tf.nn.softmax(net_unsoftmax) net_softmax_collection.append(net_softmax) correct_prediction = tf.equal(tf.argmax(nextmove, 1), tf.argmax(net_softmax, 1)) with tf.variable_scope("Accuracy"): accuracy_select = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) accuracy_select_collection.append(accuracy_select) tf.get_variable_scope().reuse_variables() trainable_params = tf.trainable_variables() tp_policy = [i for i in trainable_params if ('value_head' not in i.name)] tp_value = [i for i in trainable_params if ('policy_head' not in i.name)] add_grad_to_list(optimizer_policy, tp_policy, policy_loss, tower_gradients_policy) add_grad_to_list(optimizer_value, tp_value, value_loss, tower_gradients_value) add_grad_to_list(optimizer_multitarg, trainable_params, multitarget_loss, tower_gradients_multitarg) update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS) with tf.control_dependencies(update_ops): train_op_policy = get_op_mul(tower_gradients_policy, optimizer_policy, global_step) train_op_value = get_op_mul(tower_gradients_value, optimizer_value, global_step) train_op_multitarg = get_op_mul(tower_gradients_multitarg, optimizer_multitarg, global_step) net_softmax = merge(net_softmax_collection) value_head = merge(value_head_collection) multitarget_loss = reduce_mean(multitarget_loss_collection) value_loss = reduce_mean(value_loss_collection) policy_loss = reduce_mean(policy_loss_collection) accuracy_select = reduce_mean(accuracy_select_collection) with graph.as_default(): config = tf.ConfigProto() config.gpu_options.allow_growth = True config.allow_soft_placement = True sess = tf.Session(config=config) if model_name is not None: with graph.as_default(): saver = tf.train.Saver(var_list=tf.global_variables()) saver.restore(sess, model_name) else: with graph.as_default(): sess.run(tf.global_variables_initializer()) return (sess, graph), ((x_input, training), (net_softmax, value_head))3.实现MCTSAlphaZero利用MCTS来自博弈生成棋局,MCTS搜索原理简述如下:每次模拟通过选择具有最大行动价值Q的边加上取决于所存储的先验概率P和该边的访问计数N(每次访问都被增加一次)的上限置信区间U来遍历树;展开叶子节点,通过神经网络来评估局面s,向量P的值存储在叶子结点扩展的边上;更新行动价值Q等于在该行动下的子树中的所有评估值V的均值;一旦MCTS搜索完成,返回局面s下的落子概率π。def softmax(x): probs = np.exp(x - np.max(x)) probs /= np.sum(probs) return probsclass TreeNode(object): """A node in the MCTS tree. Each node keeps track of its own value Q, prior probability P, and its visit-count-adjusted prior score u. """ def __init__(self, parent, prior_p, noise=False): self._parent = parent self._children = {} # a map from action to TreeNode self._n_visits = 0 self._Q = 0 self._u = 0 self._P = prior_p self.virtual_loss = 0 self.noise = noise def expand(self, action_priors): """Expand tree by creating new children. action_priors: a list of tuples of actions and their prior probability according to the policy function. 
""" # dirichlet noise should be applied when every select action if False and self.noise is True and self._parent is None: noise_d = np.random.dirichlet([0.3] * len(action_priors)) for (action, prob), one_noise in zip(action_priors, noise_d): if action not in self._children: prob = (1 - 0.25) * prob + 0.25 * one_noise self._children[action] = TreeNode(self, prob, noise=self.noise) else: for action, prob in action_priors: if action not in self._children: self._children[action] = TreeNode(self, prob) def select(self, c_puct): """Select action among children that gives maximum action value Q plus bonus u(P). Return: A tuple of (action, next_node) """ if self.noise is False: return max(self._children.items(), key=lambda act_node: act_node[1].get_value(c_puct)) elif self.noise is True and self._parent is not None: return max(self._children.items(), key=lambda act_node: act_node[1].get_value(c_puct)) else: noise_d = np.random.dirichlet([0.3] * len(self._children)) return max(list(zip(noise_d, self._children.items())), key=lambda act_node: act_node[1][1].get_value(c_puct, noise_p=act_node[0]))[1] def update(self, leaf_value): """Update node values from leaf evaluation. leaf_value: the value of subtree evaluation from the current player's perspective. """ # Count visit. self._n_visits += 1 # Update Q, a running average of values for all visits. self._Q += 1.0 * (leaf_value - self._Q) / self._n_visits def update_recursive(self, leaf_value): """Like a call to update(), but applied recursively for all ancestors. """ # If it is not root, this node's parent should be updated first. if self._parent: self._parent.update_recursive(-leaf_value) self.update(leaf_value) def get_value(self, c_puct, noise_p=None): """Calculate and return the value for this node. It is a combination of leaf evaluations Q, and this node's prior adjusted for its visit count, u. c_puct: a number in (0, inf) controlling the relative impact of value Q, and prior probability P, on this node's score. """ if noise_p is None: self._u = (c_puct * self._P * np.sqrt(self._parent._n_visits) / (1 + self._n_visits)) return self._Q + self._u + self.virtual_loss else: self._u = (c_puct * (self._P * 0.75 + noise_p * 0.25) * np.sqrt(self._parent._n_visits) / (1 + self._n_visits)) return self._Q + self._u + self.virtual_loss def is_leaf(self): """Check if leaf node (i.e. no nodes below this have been expanded).""" return self._children == {} def is_root(self): return self._parent is Noneclass MCTS(object): """An implementation of Monte Carlo Tree Search.""" def __init__( self, policy_value_fn, c_puct=5, n_playout=10000, search_threads=32, virtual_loss=3, policy_loop_arg=False, dnoise=False, play=False ): """ policy_value_fn: a function that takes in a board state and outputs a list of (action, probability) tuples and also a score in [-1, 1] (i.e. the expected value of the end game score from the current player's perspective) for the current player. c_puct: a number in (0, inf) that controls how quickly exploration converges to the maximum-value policy. A higher value means relying on the prior more. 
""" self._root = TreeNode(None, 1.0, noise=dnoise) self._policy = policy_value_fn self._c_puct = c_puct self._n_playout = n_playout self.virtual_loss = virtual_loss self.loop = asyncio.get_event_loop() self.policy_loop_arg = policy_loop_arg self.sem = asyncio.Semaphore(search_threads) self.now_expanding = set() self.select_time = 0 self.policy_time = 0 self.update_time = 0 self.num_proceed = 0 self.dnoise = dnoise self.play = play async def _playout(self, state): """Run a single playout from the root to the leaf, getting a value at the leaf and propagating it back through its parents. State is modified in-place, so a copy must be provided. """ with await self.sem: node = self._root road = [] while 1: while node in self.now_expanding: await asyncio.sleep(1e-4) start = time.time() if node.is_leaf(): break # Greedily select next move. action, node = node.select(self._c_puct) road.append(node) node.virtual_loss -= self.virtual_loss state.do_move(action) self.select_time += (time.time() - start) # at leave node if long check or long catch then cut off the node if state.should_cutoff() and not self.play: # cut off node for one_node in road: one_node.virtual_loss += self.virtual_loss # now at this time, we do not update the entire tree branch, the accuracy loss is supposed to be small # set virtual loss to -inf so that other threads would not # visit the same node again(so the node is cut off) node.virtual_loss = - np.inf self.update_time += (time.time() - start) # however the proceed number still goes up 1 self.num_proceed += 1 return start = time.time() self.now_expanding.add(node) # Evaluate the leaf using a network which outputs a list of # (action, probability) tuples p and also a score v in [-1, 1] # for the current player if self.policy_loop_arg is False: action_probs, leaf_value = await self._policy(state) else: action_probs, leaf_value = await self._policy(state, self.loop) self.policy_time += (time.time() - start) start = time.time() # Check for end of game. end, winner = state.game_end() if not end: node.expand(action_probs) else: # for end state,return the "true" leaf_value if winner == -1: # tie leaf_value = 0.0 else: leaf_value = ( 1.0 if winner == state.get_current_player() else -1.0 ) # Update value and visit count of nodes in this traversal. for one_node in road: one_node.virtual_loss += self.virtual_loss node.update_recursive(-leaf_value) self.now_expanding.remove(node) # node.update_recursive(leaf_value) self.update_time += (time.time() - start) self.num_proceed += 1 def get_move_probs(self, state, temp=1e-3, predict_workers=[], can_apply_dnoise=False, verbose=False, infer_mode=False): """Run all playouts sequentially and return the available actions and their corresponding probabilities. 
state: the current game state temp: temperature parameter in (0, 1] controls the level of exploration """ if can_apply_dnoise is False: self._root.noise = False coroutine_list = [] for n in range(self._n_playout): state_copy = copy.deepcopy(state) coroutine_list.append(self._playout(state_copy)) coroutine_list += predict_workers self.loop.run_until_complete(asyncio.gather(*coroutine_list)) # calc the move probabilities based on visit counts at the root node act_visits = [(act, node._n_visits) for act, node in self._root._children.items()] acts, visits = zip(*act_visits) act_probs = softmax(1.0 / temp * np.log(np.array(visits) + 1e-10)) if infer_mode: info = [(act, node._n_visits, node._Q, node._P) for act, node in self._root._children.items()] if infer_mode: return acts, act_probs, info else: return acts, act_probs def update_with_move(self, last_move, allow_legacy=True): """Step forward in the tree, keeping everything we already know about the subtree. """ self.num_proceed = 0 if last_move in self._root._children and allow_legacy: self._root = self._root._children[last_move] self._root._parent = None else: self._root = TreeNode(None, 1.0, noise=self.dnoise) def __str__(self): return "MCTS"4.实现自博弈过程实现自博弈训练,基于同一个神经网络初始化对弈双方棋手,对弈过程中双方棋手每下一步前均采用MCTS搜索最优下子策略,每次自博弈一局结束后保存棋局。# Self-playclass Game(object): def __init__(self, white, black, verbose=True): self.white = white self.black = black self.verbose = verbose self.gamestate = gameplay.GameState() def play_till_end(self): winner = 'peace' moves = [] peace_round = 0 remain_piece = gameplay.countpiece(self.gamestate.statestr) while True: start_time = time.time() if self.gamestate.move_number % 2 == 0: player_name = 'w' player = self.white else: player_name = 'b' player = self.black move, score = player.make_move(self.gamestate) if move is None: winner = 'b' if player_name == 'w' else 'w' break moves.append(move) total_time = time.time() - start_time logging.info('move {} {} play {} use {:.2f}s'.format( self.gamestate.move_number, player_name, move, total_time,)) game_end, winner_p = self.gamestate.game_end() if game_end: winner = winner_p break remain_piece_round = gameplay.countpiece(self.gamestate.statestr) if remain_piece_round < remain_piece: remain_piece = remain_piece_round peace_round = 0 else: peace_round += 1 if peace_round > conf.non_cap_draw_round: winner = 'peace' break return winner, movesclass NetworkPlayGame(Game): def __init__(self, network_w, network_b, **xargs): whiteplayer = players.NetworkPlayer('w', network_w, **xargs) blackplayer = players.NetworkPlayer('b', network_b, **xargs) super(NetworkPlayGame, self).__init__(whiteplayer, blackplayer)class ContinousNetworkPlayGames(object): def __init__( self, network_w=None, network_b=None, white_name='net', black_name='net', random_switch=True, recoard_game=True, recoard_dir='data/distributed/', play_times=np.inf, distributed_dir='data/prepare_weight', **xargs ): self.network_w = network_w self.network_b = network_b self.white_name = white_name self.black_name = black_name self.random_switch = random_switch self.play_times = play_times self.recoard_game = recoard_game self.recoard_dir = recoard_dir self.xargs = xargs # self.distributed_server = distributed_server self.distributed_dir = distributed_dir def begin_of_game(self): pass def end_of_game(self, cbf_name, moves, cbfile, training_dt, epoch): pass def play(self, data_url=None, epoch=0): num = 0 while num < self.play_times: time_one_game_start = time.time() num += 1 self.begin_of_game(epoch) if self.random_switch and 
random.random() < 0.5: self.network_w, self.network_b = self.network_b, self.network_w self.white_name, self.black_name = self.black_name, self.white_name network_play_game = NetworkPlayGame(self.network_w, self.network_b, **self.xargs) winner, moves = network_play_game.play_till_end() stamp = time.strftime('%Y-%m-%d_%H-%M-%S', time.localtime(time.time())) date = time.strftime('%Y-%m-%d', time.localtime(time.time())) cbfile = cbf.CBF( black=self.black_name, red=self.white_name, date=date, site='北京', name='noname', datemodify=date, redteam=self.white_name, blackteam=self.black_name, round='第一轮' ) cbfile.receive_moves(moves) randstamp = random.randint(0, 1000) cbffilename = '{}_{}_mcts-mcts_{}-{}_{}.cbf'.format( stamp, randstamp, self.white_name, self.black_name, winner) if not os.path.exists(self.recoard_dir): os.makedirs(self.recoard_dir) cbf_name = os.path.join(self.recoard_dir, cbffilename) cbfile.dump(cbf_name) training_dt = time.time() - time_one_game_start self.end_of_game(cbffilename, moves, cbfile, training_dt, epoch)class DistributedSelfPlayGames(ContinousNetworkPlayGames): def __init__(self, gpu_num=0, auto_update=True, mode='train', **kwargs): self.gpu_num = gpu_num self.auto_update = auto_update self.model_name_in_use = None # for tracking latest weight self.mode = mode super(DistributedSelfPlayGames, self).__init__(**kwargs) def begin_of_game(self, epoch): """ when self playing, init network player using the latest weights """ if not self.auto_update: return latest_model_name = get_latest_weight_path() logging.info('------------------ restoring model {}'.format(latest_model_name)) model_path = os.path.join(self.distributed_dir, latest_model_name) if self.network_w is None or self.network_b is None: network = get_model_resnet( model_path, create_uci_labels(), gpu_core=[self.gpu_num], filters=conf.network_filters, num_res_layers=conf.network_layers ) self.network_w = network self.network_b = network self.model_name_in_use = model_path else: if model_path != self.model_name_in_use: (sess, graph), ((X, training), (net_softmax, value_head)) = self.network_w with graph.as_default(): saver = tf.train.Saver(var_list=tf.global_variables()) saver.restore(sess, model_path) self.model_name_in_use = model_path def end_of_game(self, cbf_name, moves, cbfile, training_dt, epoch): played_games = len(os.listdir(conf.distributed_datadir)) if self.mode == 'train': logging.info('------------------ epoch {}: trained {} games, this game used {}s'.format( epoch, played_games, round(training_dt, 6), )) else: logging.info('------------------ infer {} games, this game used {}s'.format( played_games, round(training_dt, 6), )) def self_play_gpu(gpu_num=0, play_times=np.inf, mode='train', n_playout=50, save_dir=conf.distributed_datadir): logging.info('------------------ self play start') cn = DistributedSelfPlayGames( gpu_num=gpu_num, n_playout=n_playout, recoard_dir=save_dir, c_puct=conf.c_puct, distributed_dir=conf.distributed_server_weight_dir, dnoise=True, is_selfplay=True, play_times=play_times, mode=mode, ) cn.play(epoch=0) logging.info('------------------ self play done') 5.进行训练参数配置配置一次训练过程中自博弈次数、训练结束后采用训练出的模型进行推理局数、训练batch_size。为简化训练过程参数均较小。config = { "n_playout": 100, # MCTS搜索次数,推荐(10-1600),数字越小程序运行越快,数字越大算法搜索准确度越高 "self_play_games": 2, # 自博弈对局数, 推荐(5-10000),注意数字较大时可能会超过资源免费体验时长 "infer_games": 1, # 推理对局数 "gpu_num": 0, # 使用的GPU卡号}6.开始自博弈训练,结束后更新模型运行过程中会打印出双方下棋动作self_play_gpu(config['gpu_num'], config['self_play_games'], mode='train', n_playout=config['n_playout'])# model 
updatemodel_update(gpu_num=config['gpu_num'])7.可视化对局(等待第六步运行结束后再运行此步)在此将加载模型进行博弈一次,可视化对局过程,最后显示对弈结束时的棋面self_play_gpu(config['gpu_num'], config['infer_games'], mode='infer', n_playout=config['n_playout'], save_dir='./infer_res')加载对局文件显示双方所有动作,动作为棋盘上起点坐标至终点坐标,具体坐标定义见后面的棋盘。%reload_ext autoreload%autoreload 2%matplotlib inlinefrom matplotlib import pyplot as pltfrom cchess_training.cchess_zero.gameboard import *from PIL import Imageimport imageiogameplay_path = './infer_res'while not os.path.exists(gameplay_path) or len(os.listdir(gameplay_path)) == 0: time.sleep(5) logging.info('第6步未运行结束,建议停止运行,重新逐步运行')gameplays = os.listdir(gameplay_path)fullpath = '{}/{}'.format(gameplay_path, random.choice(gameplays))moves = cbf.cbf2move(fullpath)fname = fullpath.split('/')[-1]print(moves)['b2e2', 'h7h5', 'b0c2', 'b7e7', 'h0g2', 'h5i5', 'h2i2', 'a9a7', 'i0i1', 'i5g5', 'c2e1', 'h9i7', 'c3c4', 'e6e5', 'a0b0', 'e7g7', 'g3g4', 'c6c5', 'i1h1', 'i7g8', 'e2e5', 'i9i8', 'g2f4', 'b9c7', 'g4g5', 'c7b9', 'f4e6', 'a7e7', 'e6g7', 'e7e5', 'b0b9', 'c5c4', 'i2e2', 'c4d4', 'e2e5', 'i8i7', 'b9c9', 'i7g7', 'h1h8', 'i6i5', 'a3a4', 'g7a7', 'i3i4', 'i5i4', 'g5g6', 'i4i3', 'h8g8', 'i3h3', 'g8f8', 'a7a8', 'f8f9', 'e9f9', 'g6f6', 'a6a5', 'a4a5', 'd4e4', 'e3e4', 'g9e7', 'g0e2', 'a8g8', 'c9d9', 'g8g1']可视化对弈过程import cv2from IPython.display import clear_output, Image, displaystate = gameplay.GameState()statestr = 'RNBAKABNR/9/1C5C1/P1P1P1P1P/9/9/p1p1p1p1p/1c5c1/9/rnbakabnr'for move in moves: clear_output(wait=True) statestr = GameBoard.sim_do_action(move, statestr) img = board_visualizer.get_board_img(statestr) img_show = cv2.cvtColor(img, cv2.COLOR_RGBA2BGR) display(Image(data=cv2.imencode('.jpg', img_show)[1])) time.sleep(0.5)显示终局棋面plt.figure(figsize=(8,8))plt.imshow(board_visualizer.get_board_img(statestr))至此,本案例结束,如果想要完整地训练一个中国象棋AlphaZero AI,可在AI Gallery中订阅《CChess中国象棋》算法,并在ModelArts中进行训练。8. 作业请你调整步骤5中的训练参数,重新训练一个模型,使它在游戏中获得更好的表现
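    For reference, the selection rule implemented by TreeNode.get_value in the MCTS code above can be restated compactly in formula form; nothing here goes beyond what that code already does, with c_puct, the prior P, the mean action value Q, the child visit count N(s,a), and the parent visit count N_parent(s) exactly as in the code:

    $$a^{*} = \arg\max_{a}\bigl(Q(s,a) + U(s,a)\bigr), \qquad U(s,a) = c_{\mathrm{puct}}\,P(s,a)\,\frac{\sqrt{N_{\mathrm{parent}}(s)}}{1 + N(s,a)}$$

    Likewise, the move probabilities returned by get_move_probs are visit-count based with temperature τ (the temp argument); the softmax over (1/τ)·log N in the code is equivalent to

    $$\pi(a \mid s) \propto N(s,a)^{1/\tau}$$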
  • [技术干货] 使用DPPO算法控制“倒立摆”
    使用DPPO算法控制“倒立摆”实验目标通过本案例的学习和课后作业的练习:了解DPPO基本概念了解如何基于DPPO训练一个控制类问题了解强化学习训练推理控制类问题的整体流程你也可以将本案例相关的 ipynb 学习笔记分享到 AI Gallery Notebook 版块获得成长值,分享方法请查看此文档。案例内容介绍倒立摆(Pendulum)摆动问题是控制文献中的经典问题。在我们本节用DPPO解决的Pendulum-v0问题中,钟摆从一个随机位置开始围绕一个端点摆动,目标是把钟摆向上摆动,并且是钟摆保持直立。一个随机动作的倒立摆demo如下所示:整体流程:安装基础依赖->创建倒立摆环境->构建DPPO算法->训练->推理->可视化效果Distributed Proximal Policy Optimization (DPPO) 算法的基本结构DPPO算法是在Proximal Policy Optimization(PPO)算法基础上发展而来,相关PPO算法请看 使用PPO算法玩“超级马里奥兄弟”,我们在这一教程中有详细的介绍。DPPO借鉴A3C的并行方法,使用多个workers并行地在不同的环境中收集数据,并根据采集的数据计算梯度,将梯度发送给一个全局chief,全局chief在拿到一定数量的梯度数据之后进行网络更新,更新时workers停止采集等待等下完毕,更新完毕之后workers重新使用最新的网络采集数据。下面我们使用论文中的伪代码介绍DPPO的具体算法细节。上述算法所示为全局PPO的伪代码,其中 W 是workers的数目,D 是用于更新参数的works数量阈值,M,B是给定一批数据点的具有policy网络和critic网络更新的子迭代数,θ, Φ为policy网络,critic网络的参数。上述算法所示为workers的伪代码,其中T是在计算参数更新之前收集的每个工作节点的数据点数,K是计算K步返回和通过时间截断的反向道具的时间步数(对于RNN)。 该部分算法基于PPO,首先采集数据,根据PPO算法计算梯度并将梯度发送给全局chief,等待全局chief更新完毕参数再进行数据的采集。DPPO论文代码部分参考GitHub开源项目注意事项本案例运行环境为 TensorFlow-2.0.0,且需使用 GPU 运行,请查看《ModelAtrs JupyterLab 硬件规格使用指南》了解切换硬件规格的方法;如果您是第一次使用 JupyterLab,请查看《ModelAtrs JupyterLab使用指导》了解使用方法;如果您在使用 JupyterLab 过程中碰到报错,请参考《ModelAtrs JupyterLab常见问题解决办法》尝试解决问题。实验步骤1. 程序初始化第1步:安装基础依赖!pip install tensorflow==2.0.0 !pip install tensorflow-probability==0.7.0 !pip install tensorlayer==2.1.0 --ignore-installed !pip install h5py==2.10.0 !pip install gym第2步:导入相关的库import os import time import queue import threading import gym import matplotlib.pyplot as plt import numpy as np import tensorflow as tf import tensorflow_probability as tfp import tensorlayer as tl2. 训练参数初始化本案例设置的 训练最大局数 EP_MAX = 1000,可以达到较好的训练效果,训练耗时约20分钟。你也可以调小 EP_MAX 的值,以便快速跑通代码。RANDOMSEED = 1 # 随机数种子 EP_MAX = 100 # 训练总局数 EP_LEN = 200 # 一局最长长度 GAMMA = 0.9 # 折扣率 A_LR = 0.0001 # actor学习率 C_LR = 0.0002 # critic学习率 BATCH = 32 # batchsize大小 A_UPDATE_STEPS = 10 # actor更新步数 C_UPDATE_STEPS = 10 # critic更新步数 S_DIM, A_DIM = 3, 1 # state维度, action维度 EPS = 1e-8 # epsilon值 # PPO1 和PPO2 的参数,你可以选择用PPO1 (METHOD[0]),还是PPO2 (METHOD[1]) METHOD = [ dict(name='kl_pen', kl_target=0.01, lam=0.5), # KL penalty dict(name='clip', epsilon=0.2), # Clipped surrogate objective, find this is better ][1] # choose the method for optimization N_WORKER = 4 # 并行workers数目 MIN_BATCH_SIZE = 64 # 更新PPO的minibatch大小 UPDATE_STEP = 10 # 每隔10steps更新一次3. 创建环境本环境为gym内置的Pendulum,倒立摆倒下即失败。env_name = 'Pendulum-v0' # environment name4. 
定义DPPO算法DPPO算法-PPO算法class PPO(object): ''' PPO class ''' def __init__(self): # 创建critic tfs = tl.layers.Input([None, S_DIM], tf.float32, 'state') l1 = tl.layers.Dense(100, tf.nn.relu)(tfs) v = tl.layers.Dense(1)(l1) self.critic = tl.models.Model(tfs, v) self.critic.train() # 创建actor self.actor = self._build_anet('pi', trainable=True) self.actor_old = self._build_anet('oldpi', trainable=False) self.actor_opt = tf.optimizers.Adam(A_LR) self.critic_opt = tf.optimizers.Adam(C_LR) # 更新actor def a_train(self, tfs, tfa, tfadv): ''' Update policy network :param tfs: state :param tfa: act :param tfadv: advantage :return: ''' tfs = np.array(tfs, np.float32) tfa = np.array(tfa, np.float32) tfadv = np.array(tfadv, np.float32) # td-error with tf.GradientTape() as tape: mu, sigma = self.actor(tfs) pi = tfp.distributions.Normal(mu, sigma) mu_old, sigma_old = self.actor_old(tfs) oldpi = tfp.distributions.Normal(mu_old, sigma_old) ratio = pi.prob(tfa) / (oldpi.prob(tfa) + EPS) surr = ratio * tfadv ## PPO1 if METHOD['name'] == 'kl_pen': tflam = METHOD['lam'] kl = tfp.distributions.kl_divergence(oldpi, pi) kl_mean = tf.reduce_mean(kl) aloss = -(tf.reduce_mean(surr - tflam * kl)) ## PPO2 else: aloss = -tf.reduce_mean( tf.minimum(surr, tf.clip_by_value(ratio, 1. - METHOD['epsilon'], 1. + METHOD['epsilon']) * tfadv) ) a_gard = tape.gradient(aloss, self.actor.trainable_weights) self.actor_opt.apply_gradients(zip(a_gard, self.actor.trainable_weights)) if METHOD['name'] == 'kl_pen': return kl_mean # 更新old_pi def update_old_pi(self): ''' Update old policy parameter :return: None ''' for p, oldp in zip(self.actor.trainable_weights, self.actor_old.trainable_weights): oldp.assign(p) # 更新critic def c_train(self, tfdc_r, s): ''' Update actor network :param tfdc_r: cumulative reward :param s: state :return: None ''' tfdc_r = np.array(tfdc_r, dtype=np.float32) with tf.GradientTape() as tape: advantage = tfdc_r - self.critic(s) # 计算advantage:V(s') * gamma + r - V(s) closs = tf.reduce_mean(tf.square(advantage)) grad = tape.gradient(closs, self.critic.trainable_weights) self.critic_opt.apply_gradients(zip(grad, self.critic.trainable_weights)) # 计算advantage:V(s') * gamma + r - V(s) def cal_adv(self, tfs, tfdc_r): ''' Calculate advantage :param tfs: state :param tfdc_r: cumulative reward :return: advantage ''' tfdc_r = np.array(tfdc_r, dtype=np.float32) advantage = tfdc_r - self.critic(tfs) return advantage.numpy() def update(self): ''' Update parameter with the constraint of KL divergent :return: None ''' global GLOBAL_UPDATE_COUNTER while not COORD.should_stop(): # 如果协调器没有停止 if GLOBAL_EP < EP_MAX: # EP_MAX是最大更新次数 UPDATE_EVENT.wait() # PPO进程的等待位置 self.update_old_pi() # copy pi to old pi data = [QUEUE.get() for _ in range(QUEUE.qsize())] # collect data from all workers data = np.vstack(data) s, a, r = data[:, :S_DIM].astype(np.float32), \ data[:, S_DIM: S_DIM + A_DIM].astype(np.float32), \ data[:, -1:].astype(np.float32) adv = self.cal_adv(s, r) # update actor ## PPO1 if METHOD['name'] == 'kl_pen': for _ in range(A_UPDATE_STEPS): kl = self.a_train(s, a, adv) if kl > 4 * METHOD['kl_target']: # this in in google's paper break if kl < METHOD['kl_target'] / 1.5: # adaptive lambda, this is in OpenAI's paper METHOD['lam'] /= 2 elif kl > METHOD['kl_target'] * 1.5: METHOD['lam'] *= 2 # sometimes explode, this clipping is MorvanZhou's solution METHOD['lam'] = np.clip(METHOD['lam'], 1e-4, 10) ## PPO2 else: # clipping method, find this is better (OpenAI's paper) for _ in range(A_UPDATE_STEPS): self.a_train(s, a, adv) # 更新critic for _ in 
range(C_UPDATE_STEPS): self.c_train(r, s) UPDATE_EVENT.clear() # updating finished GLOBAL_UPDATE_COUNTER = 0 # reset counter ROLLING_EVENT.set() # set roll-out available # 构建actor网络 def _build_anet(self, name, trainable): ''' Build policy network :param name: name :param trainable: trainable flag :return: policy network ''' tfs = tl.layers.Input([None, S_DIM], tf.float32, name + '_state') l1 = tl.layers.Dense(100, tf.nn.relu, name=name + '_l1')(tfs) a = tl.layers.Dense(A_DIM, tf.nn.tanh, name=name + '_a')(l1) mu = tl.layers.Lambda(lambda x: x * 2, name=name + '_lambda')(a) sigma = tl.layers.Dense(A_DIM, tf.nn.softplus, name=name + '_sigma')(l1) model = tl.models.Model(tfs, [mu, sigma], name) if trainable: model.train() else: model.eval() return model # 选择动作 def choose_action(self, s): ''' Choose action :param s: state :return: clipped act ''' s = s[np.newaxis, :].astype(np.float32) mu, sigma = self.actor(s) pi = tfp.distributions.Normal(mu, sigma) a = tf.squeeze(pi.sample(1), axis=0)[0] # choosing action return np.clip(a, -2, 2) # 计算V() def get_v(self, s): ''' Compute value :param s: state :return: value ''' s = s.astype(np.float32) if s.ndim < 2: s = s[np.newaxis, :] return self.critic(s)[0, 0] def save_ckpt(self): """ save trained weights :return: None """ if not os.path.exists('model_Pendulum'): os.makedirs('model_Pendulum') tl.files.save_weights_to_hdf5('model_Pendulum/dppo_actor.hdf5', self.actor) tl.files.save_weights_to_hdf5('model_Pendulum/dppo_actor_old.hdf5', self.actor_old) tl.files.save_weights_to_hdf5('model_Pendulum/dppo_critic.hdf5', self.critic) def load_ckpt(self): """ load trained weights :return: None """ tl.files.load_hdf5_to_weights_in_order('model_Pendulum/dppo_actor.hdf5', self.actor) tl.files.load_hdf5_to_weights_in_order('model_Pendulum/dppo_actor_old.hdf5', self.actor_old) tl.files.load_hdf5_to_weights_in_order('model_Pendulum/dppo_critic.hdf5', self.critic)workers构建class Worker(object): ''' Worker class for distributional running ''' def __init__(self, wid): self.wid = wid # 工号 self.env = gym.make(env_name).unwrapped # 创建环境 self.env.seed(wid * 100 + RANDOMSEED) # 设置不同的随机种子,因为不希望每个worker的都一致 self.ppo = GLOBAL_PPO # 算法 def work(self): ''' Define a worker :return: None ''' global GLOBAL_EP, GLOBAL_RUNNING_R, GLOBAL_UPDATE_COUNTER while not COORD.should_stop(): # 从COORD接受消息,看看是否应该should_stop s = self.env.reset() ep_r = 0 buffer_s, buffer_a, buffer_r = [], [], [] # 记录data t0 = time.time() for t in range(EP_LEN): # 看是否正在被更新。PPO进程正在工作,那么就在这里等待 if not ROLLING_EVENT.is_set(): # 查询进程是否被阻塞,如果在阻塞状态,就证明如果global PPO正在更新。否则就可以继续。 ROLLING_EVENT.wait() # worker进程的等待位置。wait until PPO is updated buffer_s, buffer_a, buffer_r = [], [], [] # clear history buffer, use new policy to collect data # 正常跑游戏,并搜集数据 a = self.ppo.choose_action(s) s_, r, done, _ = self.env.step(a) buffer_s.append(s) buffer_a.append(a) buffer_r.append((r + 8) / 8) # normalize reward, find to be useful s = s_ ep_r += r # GLOBAL_UPDATE_COUNTER是每个work的在游戏中进行一步,也就是产生一条数据就会+1. 
# 当GLOBAL_UPDATE_COUNTER大于batch(64)的时候,就可以进行更新。 GLOBAL_UPDATE_COUNTER += 1 # count to minimum batch size, no need to wait other workers if t == EP_LEN - 1 or GLOBAL_UPDATE_COUNTER >= MIN_BATCH_SIZE: # t == EP_LEN - 1 是最后一步 ## 计算每个状态对应的V(s') ## 要注意,这里的len(buffer) < GLOBAL_UPDATE_COUNTER。所以数据是每个worker各自计算的。 v_s_ = self.ppo.get_v(s_) discounted_r = [] # compute discounted reward for r in buffer_r[::-1]: v_s_ = r + GAMMA * v_s_ discounted_r.append(v_s_) discounted_r.reverse() ## 堆叠成数据,并保存到公共队列中。 bs, ba, br = np.vstack(buffer_s), np.vstack(buffer_a), np.array(discounted_r)[:, np.newaxis] buffer_s, buffer_a, buffer_r = [], [], [] QUEUE.put(np.hstack((bs, ba, br))) # put data in the queue # 如果数据足够,就开始更新 if GLOBAL_UPDATE_COUNTER >= MIN_BATCH_SIZE: ROLLING_EVENT.clear() # stop collecting data UPDATE_EVENT.set() # global PPO update if GLOBAL_EP >= EP_MAX: # stop training COORD.request_stop() # 停止更新 break # record reward changes, plot later if len(GLOBAL_RUNNING_R) == 0: GLOBAL_RUNNING_R.append(ep_r) else: GLOBAL_RUNNING_R.append(GLOBAL_RUNNING_R[-1] * 0.9 + ep_r * 0.1) GLOBAL_EP += 1 print( 'Episode: {}/{} | Worker: {} | Episode Reward: {:.4f} | Running Time: {:.4f}'.format( GLOBAL_EP, EP_MAX, self.wid, ep_r, time.time() - t0 ) )5. 模型训练np.random.seed(RANDOMSEED) tf.random.set_seed(RANDOMSEED) GLOBAL_PPO = PPO()[TL] Input state: [None, 3] [TL] Dense dense_1: 100 relu [TL] Dense dense_2: 1 No Activation [TL] Input pi_state: [None, 3] [TL] Dense pi_l1: 100 relu [TL] Dense pi_a: 1 tanh [TL] Lambda pi_lambda: func: <function PPO._build_anet.<locals>.<lambda> at 0x7fb65d633950>, len_weights: 0 [TL] Dense pi_sigma: 1 softplus [TL] Input oldpi_state: [None, 3] [TL] Dense oldpi_l1: 100 relu [TL] Dense oldpi_a: 1 tanh [TL] Lambda oldpi_lambda: func: <function PPO._build_anet.<locals>.<lambda> at 0x7fb65d633a70>, len_weights: 0 [TL] Dense oldpi_sigma: 1 softplus# 定义两组不同的事件,update 和 rolling UPDATE_EVENT, ROLLING_EVENT = threading.Event(), threading.Event() UPDATE_EVENT.clear() # not update now,相当于把标志位设置为False ROLLING_EVENT.set() # start to roll out,相当于把标志位设置为True,并通知所有处于等待阻塞状态的线程恢复运行状态。 # 创建workers workers = [Worker(wid=i) for i in range(N_WORKER)] GLOBAL_UPDATE_COUNTER, GLOBAL_EP = 0, 0 # 全局更新次数计数器,全局EP计数器 GLOBAL_RUNNING_R = [] # 记录动态的reward,看成绩 COORD = tf.train.Coordinator() # 创建tensorflow的协调器 QUEUE = queue.Queue() # workers putting data in this queue threads = [] # 为每个worker创建进程 for worker in workers: # worker threads t = threading.Thread(target=worker.work, args=()) # 创建进程 t.start() # 开始进程 threads.append(t) # 把进程放到进程列表中,方便管理 # add a PPO updating thread # 把一个全局的PPO更新放到进程列表最后。 threads.append(threading.Thread(target=GLOBAL_PPO.update, )) threads[-1].start() COORD.join(threads) # 把进程列表交给协调器管理 GLOBAL_PPO.save_ckpt() # 保存全局参数 # plot reward change and test plt.title('DPPO') plt.plot(np.arange(len(GLOBAL_RUNNING_R)), GLOBAL_RUNNING_R) plt.xlabel('Episode') plt.ylabel('Moving reward') plt.ylim(-2000, 0) plt.show()Episode: 1/100 | Worker: 1 | Episode Reward: -965.6343 | Running Time: 3.1675 Episode: 2/100 | Worker: 2 | Episode Reward: -1443.1138 | Running Time: 3.1689 Episode: 3/100 | Worker: 3 | Episode Reward: -1313.6248 | Running Time: 3.1734 Episode: 4/100 | Worker: 0 | Episode Reward: -1403.1952 | Running Time: 3.1819 Episode: 5/100 | Worker: 1 | Episode Reward: -1399.3963 | Running Time: 3.2429 Episode: 6/100 | Worker: 2 | Episode Reward: -1480.8439 | Running Time: 3.2453 Episode: 7/100 | Worker: 0 | Episode Reward: -1489.4195 | Running Time: 3.2373 Episode: 8/100 | Worker: 3 | Episode Reward: -1339.0517 | 
Running Time: 3.2583 Episode: 9/100 | Worker: 1 | Episode Reward: -1600.1292 | Running Time: 3.2478 Episode: 10/100 | Worker: 0 | Episode Reward: -1513.2170 | Running Time: 3.2584 Episode: 11/100 | Worker: 2 | Episode Reward: -1461.7279 | Running Time: 3.2697 Episode: 12/100 | Worker: 3 | Episode Reward: -1480.2685 | Running Time: 3.2598 Episode: 13/100 | Worker: 0 | Episode Reward: -1831.5374 | Running Time: 3.2423Episode: 14/100 | Worker: 1 | Episode Reward: -1524.8253 | Running Time: 3.2635 Episode: 15/100 | Worker: 2 | Episode Reward: -1383.4878 | Running Time: 3.2556 Episode: 16/100 | Worker: 3 | Episode Reward: -1288.9392 | Running Time: 3.2588 Episode: 17/100 | Worker: 1 | Episode Reward: -1657.2223 | Running Time: 3.2377 Episode: 18/100 | Worker: 0 | Episode Reward: -1472.2335 | Running Time: 3.2678 Episode: 19/100 | Worker: 2 | Episode Reward: -1475.5421 | Running Time: 3.2667 Episode: 20/100 | Worker: 3 | Episode Reward: -1532.7678 | Running Time: 3.2739 Episode: 21/100 | Worker: 1 | Episode Reward: -1575.5706 | Running Time: 3.2688 Episode: 22/100 | Worker: 2 | Episode Reward: -1238.4006 | Running Time: 3.2303 Episode: 23/100 | Worker: 0 | Episode Reward: -1630.9554 | Running Time: 3.2584 Episode: 24/100 | Worker: 3 | Episode Reward: -1610.7237 | Running Time: 3.2601 Episode: 25/100 | Worker: 1 | Episode Reward: -1516.5440 | Running Time: 3.2683 Episode: 26/100 | Worker: 0 | Episode Reward: -1547.6209 | Running Time: 3.2589 Episode: 27/100 | Worker: 2 | Episode Reward: -1328.2584 | Running Time: 3.2762 Episode: 28/100 | Worker: 3 | Episode Reward: -1191.0914 | Running Time: 3.2552 Episode: 29/100 | Worker: 1 | Episode Reward: -1415.3608 | Running Time: 3.2804 Episode: 30/100 | Worker: 0 | Episode Reward: -1765.8007 | Running Time: 3.2767 Episode: 31/100 | Worker: 2 | Episode Reward: -1756.5872 | Running Time: 3.3078 Episode: 32/100 | Worker: 3 | Episode Reward: -1428.0094 | Running Time: 3.2815 Episode: 33/100 | Worker: 1 | Episode Reward: -1605.7720 | Running Time: 3.3010 Episode: 34/100 | Worker: 0 | Episode Reward: -1247.7492 | Running Time: 3.3115Episode: 35/100 | Worker: 2 | Episode Reward: -1333.9553 | Running Time: 3.2759 Episode: 36/100 | Worker: 3 | Episode Reward: -1485.7453 | Running Time: 3.2749 Episode: 37/100 | Worker: 3 | Episode Reward: -1341.3090 | Running Time: 3.2323 Episode: 38/100 | Worker: 2 | Episode Reward: -1472.5245 | Running Time: 3.2595 Episode: 39/100 | Worker: 0 | Episode Reward: -1583.6614 | Running Time: 3.2721 Episode: 40/100 | Worker: 1 | Episode Reward: -1358.4421 | Running Time: 3.2925 Episode: 41/100 | Worker: 3 | Episode Reward: -1744.7500 | Running Time: 3.2391 Episode: 42/100 | Worker: 2 | Episode Reward: -1684.8821 | Running Time: 3.2527 Episode: 43/100 | Worker: 1 | Episode Reward: -1412.0231 | Running Time: 3.2400 Episode: 44/100 | Worker: 0 | Episode Reward: -1437.6130 | Running Time: 3.2458 Episode: 45/100 | Worker: 3 | Episode Reward: -1461.7901 | Running Time: 3.2872 Episode: 46/100 | Worker: 2 | Episode Reward: -1572.6255 | Running Time: 3.2710 Episode: 47/100 | Worker: 0 | Episode Reward: -1704.6351 | Running Time: 3.2762 Episode: 48/100 | Worker: 1 | Episode Reward: -1538.4030 | Running Time: 3.3117 Episode: 49/100 | Worker: 3 | Episode Reward: -1554.7941 | Running Time: 3.2881 Episode: 50/100 | Worker: 2 | Episode Reward: -1796.0786 | Running Time: 3.2718 Episode: 51/100 | Worker: 0 | Episode Reward: -1877.3152 | Running Time: 3.2804 Episode: 52/100 | Worker: 1 | Episode Reward: -1749.8780 | Running Time: 3.2779 Episode: 53/100 
| Worker: 3 | Episode Reward: -1486.8338 | Running Time: 3.1559 Episode: 54/100 | Worker: 2 | Episode Reward: -1540.8134 | Running Time: 3.2903 Episode: 55/100 | Worker: 0 | Episode Reward: -1596.7365 | Running Time: 3.3156 Episode: 56/100 | Worker: 1 | Episode Reward: -1644.7888 | Running Time: 3.3065 Episode: 57/100 | Worker: 3 | Episode Reward: -1514.0685 | Running Time: 3.2920 Episode: 58/100 | Worker: 2 | Episode Reward: -1411.2714 | Running Time: 3.1554 Episode: 59/100 | Worker: 0 | Episode Reward: -1602.3725 | Running Time: 3.2737 Episode: 60/100 | Worker: 1 | Episode Reward: -1579.8769 | Running Time: 3.3140 Episode: 61/100 | Worker: 3 | Episode Reward: -1360.7916 | Running Time: 3.2856 Episode: 62/100 | Worker: 2 | Episode Reward: -1490.7107 | Running Time: 3.2861 Episode: 63/100 | Worker: 0 | Episode Reward: -1775.7557 | Running Time: 3.2644 Episode: 64/100 | Worker: 1 | Episode Reward: -1491.0894 | Running Time: 3.2828 Episode: 65/100 | Worker: 0 | Episode Reward: -1428.8124 | Running Time: 3.1239 Episode: 66/100 | Worker: 2 | Episode Reward: -1493.7703 | Running Time: 3.2680 Episode: 67/100 | Worker: 3 | Episode Reward: -1658.3558 | Running Time: 3.2853 Episode: 68/100 | Worker: 1 | Episode Reward: -1605.9077 | Running Time: 3.2911 Episode: 69/100 | Worker: 2 | Episode Reward: -1374.3309 | Running Time: 3.3644 Episode: 70/100 | Worker: 0 | Episode Reward: -1283.5023 | Running Time: 3.3819 Episode: 71/100 | Worker: 3 | Episode Reward: -1346.1850 | Running Time: 3.3860 Episode: 72/100 | Worker: 1 | Episode Reward: -1222.1988 | Running Time: 3.3724 Episode: 73/100 | Worker: 2 | Episode Reward: -1199.1266 | Running Time: 3.2739 Episode: 74/100 | Worker: 0 | Episode Reward: -1207.3161 | Running Time: 3.2670 Episode: 75/100 | Worker: 3 | Episode Reward: -1302.0207 | Running Time: 3.2562 Episode: 76/100 | Worker: 1 | Episode Reward: -1233.3584 | Running Time: 3.2892 Episode: 77/100 | Worker: 2 | Episode Reward: -964.8099 | Running Time: 3.2339 Episode: 78/100 | Worker: 0 | Episode Reward: -1208.2836 | Running Time: 3.2602 Episode: 79/100 | Worker: 3 | Episode Reward: -1149.2154 | Running Time: 3.2579 Episode: 80/100 | Worker: 1 | Episode Reward: -1219.3229 | Running Time: 3.2321 Episode: 81/100 | Worker: 2 | Episode Reward: -1097.7572 | Running Time: 3.2995 Episode: 82/100 | Worker: 3 | Episode Reward: -940.7949 | Running Time: 3.2981 Episode: 83/100 | Worker: 0 | Episode Reward: -1395.6272 | Running Time: 3.3076 Episode: 84/100 | Worker: 1 | Episode Reward: -1092.5180 | Running Time: 3.2936 Episode: 85/100 | Worker: 2 | Episode Reward: -1369.8868 | Running Time: 3.2517 Episode: 86/100 | Worker: 0 | Episode Reward: -1380.5247 | Running Time: 3.2390 Episode: 87/100 | Worker: 3 | Episode Reward: -1413.2114 | Running Time: 3.2740 Episode: 88/100 | Worker: 1 | Episode Reward: -1403.9904 | Running Time: 3.2643 Episode: 89/100 | Worker: 2 | Episode Reward: -1098.8470 | Running Time: 3.3078 Episode: 90/100 | Worker: 0 | Episode Reward: -983.4387 | Running Time: 3.3224 Episode: 91/100 | Worker: 3 | Episode Reward: -1056.6701 | Running Time: 3.3059 Episode: 92/100 | Worker: 1 | Episode Reward: -1357.6828 | Running Time: 3.2980 Episode: 93/100 | Worker: 2 | Episode Reward: -1082.3377 | Running Time: 3.3248 Episode: 94/100 | Worker: 3 | Episode Reward: -1052.0146 | Running Time: 3.3291 Episode: 95/100 | Worker: 0 | Episode Reward: -1373.0590 | Running Time: 3.3660 Episode: 96/100 | Worker: 1 | Episode Reward: -1044.4578 | Running Time: 3.3311 Episode: 97/100 | Worker: 2 | Episode Reward: 
-1179.2926 | Running Time: 3.3593 Episode: 98/100 | Worker: 3 | Episode Reward: -1039.1825 | Running Time: 3.3540 Episode: 99/100 | Worker: 0 | Episode Reward: -1193.3356 | Running Time: 3.3599 Episode: 100/100 | Worker: 1 | Episode Reward: -1378.5094 | Running Time: 3.2025 Episode: 101/100 | Worker: 2 | Episode Reward: -30.6317 | Running Time: 0.1128 Episode: 102/100 | Worker: 0 | Episode Reward: -141.0568 | Running Time: 0.2976 Episode: 103/100 | Worker: 3 | Episode Reward: -166.4818 | Running Time: 0.3256 Episode: 104/100 | Worker: 1 | Episode Reward: -123.2953 | Running Time: 0.2683 [TL] [*] Saving TL weights into model_Pendulum/dppo_actor.hdf5 [TL] [*] Saved [TL] [*] Saving TL weights into model_Pendulum/dppo_actor_old.hdf5 [TL] [*] Saved [TL] [*] Saving TL weights into model_Pendulum/dppo_critic.hdf5 [TL] [*] Saved6. 模型推理Notebook暂时不支持Pendulum可视化,请将下面代码下载到本地,可查看可视化效果。from matplotlib import animation GLOBAL_PPO.load_ckpt() env = gym.make(env_name) s = env.reset() def display_frames_as_gif(frames): patch = plt.imshow(frames[0]) plt.axis('off') def animate(i): patch.set_data(frames[i]) anim = animation.FuncAnimation(plt.gcf(), animate, frames=len(frames), interval=5) anim.save('./DPPO_Pendulum.gif', writer='imagemagick', fps=30) total_reward = 0 frames = [] while True: env.render() frames.append(env.render(mode='rgb_array')) s, r, done, info = env.step(GLOBAL_PPO.choose_action(s)) if done: print('It is over, the window will be closed after 1 seconds.') time.sleep(1) break env.close() print('Total Reward : %.2f' % total_reward) display_frames_as_gif(frames)7. 模型推理效果如下视频是训练1000 Episode模型的推理效果8. 作业请你调整步骤2中的训练参数,重新训练一个模型,使它在游戏中获得更好的表现。
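    For reference, the two actor losses implemented in PPO.a_train above can be written out in formula form; this is only a restatement of that code, with r_t(θ) the probability ratio between the current and old policies and A_t the advantage computed by cal_adv (tfdc_r minus the critic value):

    $$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$$

    PPO1 (KL penalty, METHOD[0]), maximizing

    $$L^{\mathrm{KLPEN}}(\theta) = \mathbb{E}_t\bigl[\,r_t(\theta)\,A_t - \lambda\,\mathrm{KL}\bigl(\pi_{\theta_{\mathrm{old}}}\,\|\,\pi_\theta\bigr)\bigr]$$

    PPO2 (clipped surrogate, METHOD[1], the default in this case), maximizing

    $$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\bigl[\min\bigl(r_t(\theta)\,A_t,\ \mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,A_t\bigr)\bigr]$$

    The code minimizes the negatives of these as aloss, while the critic loss in c_train is the mean squared error between the discounted return and V(s_t).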
  • [技术干货] 使用DPPO算法控制“倒立摆”
    使用DPPO算法控制“倒立摆”实验目标通过本案例的学习和课后作业的练习:了解DPPO基本概念了解如何基于DPPO训练一个控制类问题了解强化学习训练推理控制类问题的整体流程你也可以将本案例相关的 ipynb 学习笔记分享到 AI Gallery Notebook 版块获得成长值,分享方法请查看此文档。案例内容介绍倒立摆(Pendulum)摆动问题是控制文献中的经典问题。在本节用DPPO解决的Pendulum-v0问题中,钟摆从一个随机位置开始围绕一个端点摆动,目标是把钟摆向上摆动,并且使钟摆保持直立。一个随机动作的倒立摆demo如下所示:整体流程:安装基础依赖->创建倒立摆环境->构建DPPO算法->训练->推理->可视化效果Distributed Proximal Policy Optimization (DPPO) 算法的基本结构DPPO算法是在Proximal Policy Optimization(PPO)算法基础上发展而来,相关PPO算法请看 使用PPO算法玩“超级马里奥兄弟”,我们在这一教程中有详细的介绍。DPPO借鉴A3C的并行方法,使用多个workers并行地在不同的环境中收集数据,并根据采集的数据计算梯度,将梯度发送给一个全局chief;全局chief在拿到一定数量的梯度数据之后进行网络更新,更新期间workers停止采集并等待更新完毕,更新完毕之后workers重新使用最新的网络采集数据。下面我们使用论文中的伪代码介绍DPPO的具体算法细节。上述算法所示为全局PPO的伪代码,其中 W 是workers的数目,D 是用于更新参数的workers数量阈值,M、B分别是在给定一批数据点上对policy网络和critic网络进行更新的子迭代次数,θ、Φ分别为policy网络、critic网络的参数。上述算法所示为workers的伪代码,其中T是每个工作节点在计算参数更新之前收集的数据点数,K是计算K步回报以及截断式时序反向传播(truncated BPTT,针对RNN)的时间步数。该部分算法基于PPO:首先采集数据,根据PPO算法计算梯度并将梯度发送给全局chief,然后等待全局chief更新完毕参数,再继续进行数据的采集。DPPO论文代码部分参考GitHub开源项目注意事项本案例运行环境为 TensorFlow-2.0.0,且需使用 GPU 运行,请查看《ModelArts JupyterLab 硬件规格使用指南》了解切换硬件规格的方法;如果您是第一次使用 JupyterLab,请查看《ModelArts JupyterLab使用指导》了解使用方法;如果您在使用 JupyterLab 过程中碰到报错,请参考《ModelArts JupyterLab常见问题解决办法》尝试解决问题。实验步骤1. 程序初始化第1步:安装基础依赖!pip install tensorflow==2.0.0 !pip install tensorflow-probability==0.7.0 !pip install tensorlayer==2.1.0 --ignore-installed !pip install h5py==2.10.0 !pip install gym第2步:导入相关的库import os import time import queue import threading import gym import matplotlib.pyplot as plt import numpy as np import tensorflow as tf import tensorflow_probability as tfp import tensorlayer as tl2. 训练参数初始化将训练最大局数 EP_MAX 设为 1000 可以达到较好的训练效果,训练耗时约20分钟;下方代码中默认设为 100,以便快速跑通,你也可以自行调整 EP_MAX 的值。RANDOMSEED = 1 # 随机数种子 EP_MAX = 100 # 训练总局数 EP_LEN = 200 # 一局最长长度 GAMMA = 0.9 # 折扣率 A_LR = 0.0001 # actor学习率 C_LR = 0.0002 # critic学习率 BATCH = 32 # batchsize大小 A_UPDATE_STEPS = 10 # actor更新步数 C_UPDATE_STEPS = 10 # critic更新步数 S_DIM, A_DIM = 3, 1 # state维度, action维度 EPS = 1e-8 # epsilon值 # PPO1 和PPO2 的参数,你可以选择用PPO1 (METHOD[0]),还是PPO2 (METHOD[1]) METHOD = [ dict(name='kl_pen', kl_target=0.01, lam=0.5), # KL penalty dict(name='clip', epsilon=0.2), # Clipped surrogate objective, find this is better ][1] # choose the method for optimization N_WORKER = 4 # 并行workers数目 MIN_BATCH_SIZE = 64 # 更新PPO的minibatch大小 UPDATE_STEP = 10 # 每隔10steps更新一次 3. 创建环境本环境为gym内置的Pendulum(Pendulum-v0),每局固定200步,单步奖励为非正数,摆杆越接近直立、奖励越接近0。env_name = 'Pendulum-v0' # environment name
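为了核对上面 S_DIM = 3、A_DIM = 1 以及动作裁剪范围 [-2, 2] 的设置是否与环境一致,下面补充一个最小的环境自检示例(仅作演示草稿,并非案例原有代码;假设使用与本案例相同的旧版 gym 接口):
import gym

env = gym.make('Pendulum-v0')
print(env.observation_space.shape)                   # (3,),对应 S_DIM = 3
print(env.action_space.shape)                        # (1,),对应 A_DIM = 1
print(env.action_space.low, env.action_space.high)   # [-2.] [2.],与后文 choose_action 中 np.clip(a, -2, 2) 一致

s = env.reset()
s_, r, done, _ = env.step(env.action_space.sample())
print(s_.shape, r)                                   # 单步奖励为非正数,worker 中用 (r + 8) / 8 做归一化
env.close()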
4. 定义DPPO算法DPPO算法-PPO算法class PPO(object): ''' PPO class ''' def __init__(self): # 创建critic tfs = tl.layers.Input([None, S_DIM], tf.float32, 'state') l1 = tl.layers.Dense(100, tf.nn.relu)(tfs) v = tl.layers.Dense(1)(l1) self.critic = tl.models.Model(tfs, v) self.critic.train() # 创建actor self.actor = self._build_anet('pi', trainable=True) self.actor_old = self._build_anet('oldpi', trainable=False) self.actor_opt = tf.optimizers.Adam(A_LR) self.critic_opt = tf.optimizers.Adam(C_LR) # 更新actor def a_train(self, tfs, tfa, tfadv): ''' Update policy network :param tfs: state :param tfa: act :param tfadv: advantage :return: ''' tfs = np.array(tfs, np.float32) tfa = np.array(tfa, np.float32) tfadv = np.array(tfadv, np.float32) with tf.GradientTape() as tape: mu, sigma = self.actor(tfs) pi = tfp.distributions.Normal(mu, sigma) mu_old, sigma_old = self.actor_old(tfs) oldpi = tfp.distributions.Normal(mu_old, sigma_old) ratio = pi.prob(tfa) / (oldpi.prob(tfa) + EPS) surr = ratio * tfadv ## PPO1 if METHOD['name'] == 'kl_pen': tflam = METHOD['lam'] kl = tfp.distributions.kl_divergence(oldpi, pi) kl_mean = tf.reduce_mean(kl) aloss = -(tf.reduce_mean(surr - tflam * kl)) ## PPO2 else: aloss = -tf.reduce_mean( tf.minimum(surr, tf.clip_by_value(ratio, 1. - METHOD['epsilon'], 1. + METHOD['epsilon']) * tfadv) ) a_grad = tape.gradient(aloss, self.actor.trainable_weights) self.actor_opt.apply_gradients(zip(a_grad, self.actor.trainable_weights)) if METHOD['name'] == 'kl_pen': return kl_mean # 更新old_pi def update_old_pi(self): ''' Update old policy parameter :return: None ''' for p, oldp in zip(self.actor.trainable_weights, self.actor_old.trainable_weights): oldp.assign(p) # 更新critic def c_train(self, tfdc_r, s): ''' Update critic network :param tfdc_r: cumulative reward :param s: state :return: None ''' tfdc_r = np.array(tfdc_r, dtype=np.float32) with tf.GradientTape() as tape: advantage = tfdc_r - self.critic(s) # 计算advantage:折扣回报 - V(s) closs = tf.reduce_mean(tf.square(advantage)) grad = tape.gradient(closs, self.critic.trainable_weights) self.critic_opt.apply_gradients(zip(grad, self.critic.trainable_weights)) # 计算advantage:折扣回报 - V(s) def cal_adv(self, tfs, tfdc_r): ''' Calculate advantage :param tfs: state :param tfdc_r: cumulative reward :return: advantage ''' tfdc_r = np.array(tfdc_r, dtype=np.float32) advantage = tfdc_r - self.critic(tfs) return advantage.numpy() def update(self): ''' Update parameter with the constraint of KL divergent :return: None ''' global GLOBAL_UPDATE_COUNTER while not COORD.should_stop(): # 如果协调器没有停止 if GLOBAL_EP < EP_MAX: # EP_MAX是训练总局数 UPDATE_EVENT.wait() # PPO线程的等待位置 self.update_old_pi() # copy pi to old pi data = [QUEUE.get() for _ in range(QUEUE.qsize())] # collect data from all workers data = np.vstack(data) s, a, r = data[:, :S_DIM].astype(np.float32), \ data[:, S_DIM: S_DIM + A_DIM].astype(np.float32), \ data[:, -1:].astype(np.float32) adv = self.cal_adv(s, r) # update actor ## PPO1 if METHOD['name'] == 'kl_pen': for _ in range(A_UPDATE_STEPS): kl = self.a_train(s, a, adv) if kl > 4 * METHOD['kl_target']: # this is in google's paper break if kl < METHOD['kl_target'] / 1.5: # adaptive lambda, this is in OpenAI's paper METHOD['lam'] /= 2 elif kl > METHOD['kl_target'] * 1.5: METHOD['lam'] *= 2 # sometimes explode, this clipping is MorvanZhou's solution METHOD['lam'] = np.clip(METHOD['lam'], 1e-4, 10) ## PPO2 else: # clipping method, find this is better (OpenAI's paper) for _ in range(A_UPDATE_STEPS): self.a_train(s, a, adv) # 
更新critic for _ in range(C_UPDATE_STEPS): self.c_train(r, s) UPDATE_EVENT.clear() # updating finished GLOBAL_UPDATE_COUNTER = 0 # reset counter ROLLING_EVENT.set() # set roll-out available # 构建actor网络 def _build_anet(self, name, trainable): ''' Build policy network :param name: name :param trainable: trainable flag :return: policy network ''' tfs = tl.layers.Input([None, S_DIM], tf.float32, name + '_state') l1 = tl.layers.Dense(100, tf.nn.relu, name=name + '_l1')(tfs) a = tl.layers.Dense(A_DIM, tf.nn.tanh, name=name + '_a')(l1) mu = tl.layers.Lambda(lambda x: x * 2, name=name + '_lambda')(a) sigma = tl.layers.Dense(A_DIM, tf.nn.softplus, name=name + '_sigma')(l1) model = tl.models.Model(tfs, [mu, sigma], name) if trainable: model.train() else: model.eval() return model # 选择动作 def choose_action(self, s): ''' Choose action :param s: state :return: clipped act ''' s = s[np.newaxis, :].astype(np.float32) mu, sigma = self.actor(s) pi = tfp.distributions.Normal(mu, sigma) a = tf.squeeze(pi.sample(1), axis=0)[0] # choosing action return np.clip(a, -2, 2) # 计算V() def get_v(self, s): ''' Compute value :param s: state :return: value ''' s = s.astype(np.float32) if s.ndim < 2: s = s[np.newaxis, :] return self.critic(s)[0, 0] def save_ckpt(self): """ save trained weights :return: None """ if not os.path.exists('model_Pendulum'): os.makedirs('model_Pendulum') tl.files.save_weights_to_hdf5('model_Pendulum/dppo_actor.hdf5', self.actor) tl.files.save_weights_to_hdf5('model_Pendulum/dppo_actor_old.hdf5', self.actor_old) tl.files.save_weights_to_hdf5('model_Pendulum/dppo_critic.hdf5', self.critic) def load_ckpt(self): """ load trained weights :return: None """ tl.files.load_hdf5_to_weights_in_order('model_Pendulum/dppo_actor.hdf5', self.actor) tl.files.load_hdf5_to_weights_in_order('model_Pendulum/dppo_actor_old.hdf5', self.actor_old) tl.files.load_hdf5_to_weights_in_order('model_Pendulum/dppo_critic.hdf5', self.critic) workers构建class Worker(object): ''' Worker class for distributional running ''' def __init__(self, wid): self.wid = wid # 工号 self.env = gym.make(env_name).unwrapped # 创建环境 self.env.seed(wid * 100 + RANDOMSEED) # 设置不同的随机种子,避免各个worker的采样完全一致 self.ppo = GLOBAL_PPO # 算法 def work(self): ''' Define a worker :return: None ''' global GLOBAL_EP, GLOBAL_RUNNING_R, GLOBAL_UPDATE_COUNTER while not COORD.should_stop(): # 从COORD接收消息,看看是否应该停止 s = self.env.reset() ep_r = 0 buffer_s, buffer_a, buffer_r = [], [], [] # 记录data t0 = time.time() for t in range(EP_LEN): # 查看全局网络是否正在被更新:如果PPO线程正在更新,就在这里等待 if not ROLLING_EVENT.is_set(): # 查询线程是否被阻塞,如果处于阻塞状态,说明global PPO正在更新;否则就可以继续采集 ROLLING_EVENT.wait() # worker线程的等待位置。wait until PPO is updated buffer_s, buffer_a, buffer_r = [], [], [] # clear history buffer, use new policy to collect data # 正常跑游戏,并搜集数据 a = self.ppo.choose_action(s) s_, r, done, _ = self.env.step(a) buffer_s.append(s) buffer_a.append(a) buffer_r.append((r + 8) / 8) # normalize reward, find to be useful s = s_ ep_r += r # GLOBAL_UPDATE_COUNTER:每个worker在游戏中每走一步(产生一条数据)就会+1 
# 当GLOBAL_UPDATE_COUNTER大于MIN_BATCH_SIZE(64)的时候,就可以进行更新。 GLOBAL_UPDATE_COUNTER += 1 # count to minimum batch size, no need to wait other workers if t == EP_LEN - 1 or GLOBAL_UPDATE_COUNTER >= MIN_BATCH_SIZE: # t == EP_LEN - 1 是最后一步 ## 计算每个状态对应的V(s') ## 要注意,这里的len(buffer) < GLOBAL_UPDATE_COUNTER,所以数据是每个worker各自计算的。 v_s_ = self.ppo.get_v(s_) discounted_r = [] # compute discounted reward for r in buffer_r[::-1]: v_s_ = r + GAMMA * v_s_ discounted_r.append(v_s_) discounted_r.reverse() ## 堆叠成数据,并保存到公共队列中。 bs, ba, br = np.vstack(buffer_s), np.vstack(buffer_a), np.array(discounted_r)[:, np.newaxis] buffer_s, buffer_a, buffer_r = [], [], [] QUEUE.put(np.hstack((bs, ba, br))) # put data in the queue # 如果数据足够,就开始更新 if GLOBAL_UPDATE_COUNTER >= MIN_BATCH_SIZE: ROLLING_EVENT.clear() # stop collecting data UPDATE_EVENT.set() # global PPO update if GLOBAL_EP >= EP_MAX: # stop training COORD.request_stop() # 停止更新 break # record reward changes, plot later if len(GLOBAL_RUNNING_R) == 0: GLOBAL_RUNNING_R.append(ep_r) else: GLOBAL_RUNNING_R.append(GLOBAL_RUNNING_R[-1] * 0.9 + ep_r * 0.1) GLOBAL_EP += 1 print( 'Episode: {}/{} | Worker: {} | Episode Reward: {:.4f} | Running Time: {:.4f}'.format( GLOBAL_EP, EP_MAX, self.wid, ep_r, time.time() - t0 ) ) 5. 模型训练np.random.seed(RANDOMSEED) tf.random.set_seed(RANDOMSEED) GLOBAL_PPO = PPO() [TL] Input state: [None, 3] [TL] Dense dense_1: 100 relu [TL] Dense dense_2: 1 No Activation [TL] Input pi_state: [None, 3] [TL] Dense pi_l1: 100 relu [TL] Dense pi_a: 1 tanh [TL] Lambda pi_lambda: func: <function ppo._build_anet.<locals>.<lambda> at 0x7fb65d633950>, len_weights: 0 [TL] Dense pi_sigma: 1 softplus [TL] Input oldpi_state: [None, 3] [TL] Dense oldpi_l1: 100 relu [TL] Dense oldpi_a: 1 tanh [TL] Lambda oldpi_lambda: func: <function ppo._build_anet.<locals>.<lambda> at 0x7fb65d633a70>, len_weights: 0 [TL] Dense oldpi_sigma: 1 softplus# 定义两组不同的事件,update 和 rolling UPDATE_EVENT, ROLLING_EVENT = threading.Event(), threading.Event() UPDATE_EVENT.clear() # not update now,相当于把标志位设置为False ROLLING_EVENT.set() # start to roll out,相当于把标志位设置为True,并通知所有处于等待阻塞状态的线程恢复运行状态。 # 创建workers workers = [Worker(wid=i) for i in range(N_WORKER)] GLOBAL_UPDATE_COUNTER, GLOBAL_EP = 0, 0 # 全局更新次数计数器,全局EP计数器 GLOBAL_RUNNING_R = [] # 记录动态的reward,看成绩 COORD = tf.train.Coordinator() # 创建tensorflow的协调器 QUEUE = queue.Queue() # workers putting data in this queue threads = [] # 为每个worker创建线程 for worker in workers: # worker threads t = threading.Thread(target=worker.work, args=()) # 创建线程 t.start() # 启动线程 threads.append(t) # 把线程放到线程列表中,方便管理 # add a PPO updating thread # 把一个全局的PPO更新线程放到线程列表最后。 threads.append(threading.Thread(target=GLOBAL_PPO.update, )) threads[-1].start() COORD.join(threads) # 把线程列表交给协调器管理 GLOBAL_PPO.save_ckpt() # 保存全局参数 # plot reward change and test plt.title('DPPO') plt.plot(np.arange(len(GLOBAL_RUNNING_R)), GLOBAL_RUNNING_R) plt.xlabel('Episode') plt.ylabel('Moving reward') plt.ylim(-2000, 0) plt.show() Episode: 1/100 | Worker: 1 | Episode Reward: -965.6343 | Running Time: 3.1675 Episode: 2/100 | Worker: 2 | Episode Reward: -1443.1138 | Running Time: 3.1689 Episode: 3/100 | Worker: 3 | Episode Reward: -1313.6248 | Running Time: 3.1734 Episode: 4/100 | Worker: 0 | Episode Reward: -1403.1952 | Running Time: 3.1819 Episode: 5/100 | Worker: 1 | Episode Reward: -1399.3963 | Running Time: 3.2429 Episode: 6/100 | Worker: 2 | Episode Reward: -1480.8439 | Running Time: 3.2453 Episode: 7/100 | Worker: 0 | Episode Reward: -1489.4195 | Running Time: 3.2373 Episode: 8/100 | Worker: 3 | Episode 
Reward: -1339.0517 | Running Time: 3.2583 Episode: 9/100 | Worker: 1 | Episode Reward: -1600.1292 | Running Time: 3.2478 Episode: 10/100 | Worker: 0 | Episode Reward: -1513.2170 | Running Time: 3.2584 Episode: 11/100 | Worker: 2 | Episode Reward: -1461.7279 | Running Time: 3.2697 Episode: 12/100 | Worker: 3 | Episode Reward: -1480.2685 | Running Time: 3.2598 Episode: 13/100 | Worker: 0 | Episode Reward: -1831.5374 | Running Time: 3.2423Episode: 14/100 | Worker: 1 | Episode Reward: -1524.8253 | Running Time: 3.2635 Episode: 15/100 | Worker: 2 | Episode Reward: -1383.4878 | Running Time: 3.2556 Episode: 16/100 | Worker: 3 | Episode Reward: -1288.9392 | Running Time: 3.2588 Episode: 17/100 | Worker: 1 | Episode Reward: -1657.2223 | Running Time: 3.2377 Episode: 18/100 | Worker: 0 | Episode Reward: -1472.2335 | Running Time: 3.2678 Episode: 19/100 | Worker: 2 | Episode Reward: -1475.5421 | Running Time: 3.2667 Episode: 20/100 | Worker: 3 | Episode Reward: -1532.7678 | Running Time: 3.2739 Episode: 21/100 | Worker: 1 | Episode Reward: -1575.5706 | Running Time: 3.2688 Episode: 22/100 | Worker: 2 | Episode Reward: -1238.4006 | Running Time: 3.2303 Episode: 23/100 | Worker: 0 | Episode Reward: -1630.9554 | Running Time: 3.2584 Episode: 24/100 | Worker: 3 | Episode Reward: -1610.7237 | Running Time: 3.2601 Episode: 25/100 | Worker: 1 | Episode Reward: -1516.5440 | Running Time: 3.2683 Episode: 26/100 | Worker: 0 | Episode Reward: -1547.6209 | Running Time: 3.2589 Episode: 27/100 | Worker: 2 | Episode Reward: -1328.2584 | Running Time: 3.2762 Episode: 28/100 | Worker: 3 | Episode Reward: -1191.0914 | Running Time: 3.2552 Episode: 29/100 | Worker: 1 | Episode Reward: -1415.3608 | Running Time: 3.2804 Episode: 30/100 | Worker: 0 | Episode Reward: -1765.8007 | Running Time: 3.2767 Episode: 31/100 | Worker: 2 | Episode Reward: -1756.5872 | Running Time: 3.3078 Episode: 32/100 | Worker: 3 | Episode Reward: -1428.0094 | Running Time: 3.2815 Episode: 33/100 | Worker: 1 | Episode Reward: -1605.7720 | Running Time: 3.3010 Episode: 34/100 | Worker: 0 | Episode Reward: -1247.7492 | Running Time: 3.3115Episode: 35/100 | Worker: 2 | Episode Reward: -1333.9553 | Running Time: 3.2759 Episode: 36/100 | Worker: 3 | Episode Reward: -1485.7453 | Running Time: 3.2749 Episode: 37/100 | Worker: 3 | Episode Reward: -1341.3090 | Running Time: 3.2323 Episode: 38/100 | Worker: 2 | Episode Reward: -1472.5245 | Running Time: 3.2595 Episode: 39/100 | Worker: 0 | Episode Reward: -1583.6614 | Running Time: 3.2721 Episode: 40/100 | Worker: 1 | Episode Reward: -1358.4421 | Running Time: 3.2925 Episode: 41/100 | Worker: 3 | Episode Reward: -1744.7500 | Running Time: 3.2391 Episode: 42/100 | Worker: 2 | Episode Reward: -1684.8821 | Running Time: 3.2527 Episode: 43/100 | Worker: 1 | Episode Reward: -1412.0231 | Running Time: 3.2400 Episode: 44/100 | Worker: 0 | Episode Reward: -1437.6130 | Running Time: 3.2458 Episode: 45/100 | Worker: 3 | Episode Reward: -1461.7901 | Running Time: 3.2872 Episode: 46/100 | Worker: 2 | Episode Reward: -1572.6255 | Running Time: 3.2710 Episode: 47/100 | Worker: 0 | Episode Reward: -1704.6351 | Running Time: 3.2762 Episode: 48/100 | Worker: 1 | Episode Reward: -1538.4030 | Running Time: 3.3117 Episode: 49/100 | Worker: 3 | Episode Reward: -1554.7941 | Running Time: 3.2881 Episode: 50/100 | Worker: 2 | Episode Reward: -1796.0786 | Running Time: 3.2718 Episode: 51/100 | Worker: 0 | Episode Reward: -1877.3152 | Running Time: 3.2804 Episode: 52/100 | Worker: 1 | Episode Reward: -1749.8780 | Running Time: 
3.2779 Episode: 53/100 | Worker: 3 | Episode Reward: -1486.8338 | Running Time: 3.1559 Episode: 54/100 | Worker: 2 | Episode Reward: -1540.8134 | Running Time: 3.2903 Episode: 55/100 | Worker: 0 | Episode Reward: -1596.7365 | Running Time: 3.3156 Episode: 56/100 | Worker: 1 | Episode Reward: -1644.7888 | Running Time: 3.3065 Episode: 57/100 | Worker: 3 | Episode Reward: -1514.0685 | Running Time: 3.2920 Episode: 58/100 | Worker: 2 | Episode Reward: -1411.2714 | Running Time: 3.1554 Episode: 59/100 | Worker: 0 | Episode Reward: -1602.3725 | Running Time: 3.2737 Episode: 60/100 | Worker: 1 | Episode Reward: -1579.8769 | Running Time: 3.3140 Episode: 61/100 | Worker: 3 | Episode Reward: -1360.7916 | Running Time: 3.2856 Episode: 62/100 | Worker: 2 | Episode Reward: -1490.7107 | Running Time: 3.2861 Episode: 63/100 | Worker: 0 | Episode Reward: -1775.7557 | Running Time: 3.2644 Episode: 64/100 | Worker: 1 | Episode Reward: -1491.0894 | Running Time: 3.2828 Episode: 65/100 | Worker: 0 | Episode Reward: -1428.8124 | Running Time: 3.1239 Episode: 66/100 | Worker: 2 | Episode Reward: -1493.7703 | Running Time: 3.2680 Episode: 67/100 | Worker: 3 | Episode Reward: -1658.3558 | Running Time: 3.2853 Episode: 68/100 | Worker: 1 | Episode Reward: -1605.9077 | Running Time: 3.2911 Episode: 69/100 | Worker: 2 | Episode Reward: -1374.3309 | Running Time: 3.3644 Episode: 70/100 | Worker: 0 | Episode Reward: -1283.5023 | Running Time: 3.3819 Episode: 71/100 | Worker: 3 | Episode Reward: -1346.1850 | Running Time: 3.3860 Episode: 72/100 | Worker: 1 | Episode Reward: -1222.1988 | Running Time: 3.3724 Episode: 73/100 | Worker: 2 | Episode Reward: -1199.1266 | Running Time: 3.2739 Episode: 74/100 | Worker: 0 | Episode Reward: -1207.3161 | Running Time: 3.2670 Episode: 75/100 | Worker: 3 | Episode Reward: -1302.0207 | Running Time: 3.2562 Episode: 76/100 | Worker: 1 | Episode Reward: -1233.3584 | Running Time: 3.2892 Episode: 77/100 | Worker: 2 | Episode Reward: -964.8099 | Running Time: 3.2339 Episode: 78/100 | Worker: 0 | Episode Reward: -1208.2836 | Running Time: 3.2602 Episode: 79/100 | Worker: 3 | Episode Reward: -1149.2154 | Running Time: 3.2579 Episode: 80/100 | Worker: 1 | Episode Reward: -1219.3229 | Running Time: 3.2321 Episode: 81/100 | Worker: 2 | Episode Reward: -1097.7572 | Running Time: 3.2995 Episode: 82/100 | Worker: 3 | Episode Reward: -940.7949 | Running Time: 3.2981 Episode: 83/100 | Worker: 0 | Episode Reward: -1395.6272 | Running Time: 3.3076 Episode: 84/100 | Worker: 1 | Episode Reward: -1092.5180 | Running Time: 3.2936 Episode: 85/100 | Worker: 2 | Episode Reward: -1369.8868 | Running Time: 3.2517 Episode: 86/100 | Worker: 0 | Episode Reward: -1380.5247 | Running Time: 3.2390 Episode: 87/100 | Worker: 3 | Episode Reward: -1413.2114 | Running Time: 3.2740 Episode: 88/100 | Worker: 1 | Episode Reward: -1403.9904 | Running Time: 3.2643 Episode: 89/100 | Worker: 2 | Episode Reward: -1098.8470 | Running Time: 3.3078 Episode: 90/100 | Worker: 0 | Episode Reward: -983.4387 | Running Time: 3.3224 Episode: 91/100 | Worker: 3 | Episode Reward: -1056.6701 | Running Time: 3.3059 Episode: 92/100 | Worker: 1 | Episode Reward: -1357.6828 | Running Time: 3.2980 Episode: 93/100 | Worker: 2 | Episode Reward: -1082.3377 | Running Time: 3.3248 Episode: 94/100 | Worker: 3 | Episode Reward: -1052.0146 | Running Time: 3.3291 Episode: 95/100 | Worker: 0 | Episode Reward: -1373.0590 | Running Time: 3.3660 Episode: 96/100 | Worker: 1 | Episode Reward: -1044.4578 | Running Time: 3.3311 Episode: 97/100 | Worker: 2 | 
Episode Reward: -1179.2926 | Running Time: 3.3593 Episode: 98/100 | Worker: 3 | Episode Reward: -1039.1825 | Running Time: 3.3540 Episode: 99/100 | Worker: 0 | Episode Reward: -1193.3356 | Running Time: 3.3599 Episode: 100/100 | Worker: 1 | Episode Reward: -1378.5094 | Running Time: 3.2025 Episode: 101/100 | Worker: 2 | Episode Reward: -30.6317 | Running Time: 0.1128 Episode: 102/100 | Worker: 0 | Episode Reward: -141.0568 | Running Time: 0.2976 Episode: 103/100 | Worker: 3 | Episode Reward: -166.4818 | Running Time: 0.3256 Episode: 104/100 | Worker: 1 | Episode Reward: -123.2953 | Running Time: 0.2683 [TL] [*] Saving TL weights into model_Pendulum/dppo_actor.hdf5 [TL] [*] Saved [TL] [*] Saving TL weights into model_Pendulum/dppo_actor_old.hdf5 [TL] [*] Saved [TL] [*] Saving TL weights into model_Pendulum/dppo_critic.hdf5 [TL] [*] Saved 训练结束后绘制的 Moving reward 曲线如下:![](https://fileserver.developer.huaweicloud.com/FileServer/getFile/cmtybbs/694/403/82b/d6219e782969440382b6e08950db3e8d.20251010101155.81803368360051049380546059544335:50561216115931:2400:2D51DB97DA597D972F71668BC34B13416B7CC19843125CD2E2EF495AE834C72C.png)6. 模型推理Notebook暂时不支持Pendulum可视化,请将下面代码下载到本地,可查看可视化效果。from matplotlib import animation GLOBAL_PPO.load_ckpt() env = gym.make(env_name) s = env.reset() def display_frames_as_gif(frames): patch = plt.imshow(frames[0]) plt.axis('off') def animate(i): patch.set_data(frames[i]) anim = animation.FuncAnimation(plt.gcf(), animate, frames=len(frames), interval=5) anim.save('./DPPO_Pendulum.gif', writer='imagemagick', fps=30) total_reward = 0 frames = [] while True: env.render() frames.append(env.render(mode='rgb_array')) s, r, done, info = env.step(GLOBAL_PPO.choose_action(s)) total_reward += r # 累加单步奖励,否则最后打印的总奖励恒为0 if done: print('It is over, the window will be closed after 1 second.') time.sleep(1) break env.close() print('Total Reward : %.2f' % total_reward) display_frames_as_gif(frames) 7. 模型推理效果如下视频是训练1000 Episode模型的推理效果。8. 作业请你调整步骤2中的训练参数,重新训练一个模型,使它在游戏中获得更好的表现。
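作为对第4步中 a_train 里 PPO2 截断目标(METHOD 选择 'clip')的补充说明,下面用一个独立的 numpy 小例子演示该目标的计算方式(仅为示意草稿,并非案例原有代码;epsilon 取 0.2,与 METHOD 中的设置一致):
import numpy as np

eps = 0.2                                          # 对应 METHOD['epsilon']
ratio = np.array([0.5, 0.9, 1.0, 1.3, 2.0])        # 新旧策略概率比 pi.prob(a) / oldpi.prob(a)
adv = np.array([1.0, -1.0, 0.5, 2.0, -0.5])        # 优势估计,对应 cal_adv 的输出

surr = ratio * adv                                 # 未截断的代理目标
clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv   # 将 ratio 限制在 [1-eps, 1+eps] 后的目标
aloss = -np.mean(np.minimum(surr, clipped))        # 两者取较小值再取负,作为需要最小化的损失
print(surr, clipped, aloss)
可以看到,当 ratio 偏离 1 超过 epsilon 且会带来额外收益时,这部分目标会被截断,从而限制每次更新中新策略偏离旧策略的幅度;这也是代码注释里提到 PPO2 截断法比 PPO1 的 KL 惩罚更简单好用的原因。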
  • [技术干货] 使用SAC算法控制倒立摆
    使用SAC算法控制倒立摆-作业欢迎你将完成的作业分享到 AI Gallery Notebook 版块获得成长值,分享方法请查看此文档。题目描述请你调整步骤2中的训练参数,重新训练一个模型,使它在游戏中获得更好的表现提示:请在下文中搜索“# 请在此处实现代码”,注释所在之处就是你需要修改代码的地方;修改好代码之后,跑通整个案例代码,即可完成作业,请将完成的作业分享到AI Gallery,标题以“2021实战营”为开头命名;代码实现1. 程序初始化第1步:安装基础依赖!pip install gym pybullet第2步:导入相关的库import time import random import itertools import gym import numpy as np import torch import torch.nn as nn import torch.nn.functional as F from torch.optim import Adam from torch.distributions import Normal import pybullet_envs2. 训练参数初始化本案例设置的 num_steps = 30000,可以达到200分,训练耗时约5分钟。# 请在此处实现代码3. 定义SAC算法第1步:定义Q网络,Q1和Q2,结构相同,为[256,256,256]的全连接层# Initialize Policy weights def weights_init_(m): if isinstance(m, nn.Linear): torch.nn.init.xavier_uniform_(m.weight, gain=1) torch.nn.init.constant_(m.bias, 0) class QNetwork(nn.Module): def __init__(self, num_inputs, num_actions): super(QNetwork, self).__init__() # Q1 architecture self.linear1 = nn.Linear(num_inputs + num_actions, 256) self.linear2 = nn.Linear(256, 256) self.linear3 = nn.Linear(256, 1) # Q2 architecture self.linear4 = nn.Linear(num_inputs + num_actions, 256) self.linear5 = nn.Linear(256, 256) self.linear6 = nn.Linear(256, 1) self.apply(weights_init_) def forward(self, state, action): xu = torch.cat([state, action], 1) x1 = F.relu(self.linear1(xu)) x1 = F.relu(self.linear2(x1)) x1 = self.linear3(x1) x2 = F.relu(self.linear4(xu)) x2 = F.relu(self.linear5(x2)) x2 = self.linear6(x2) return x1, x2第2步:Policy网络,采用高斯分布,两层[256,256]全连接+均值+标准差class GaussianPolicy(nn.Module): def __init__(self, num_inputs, num_actions, action_space=None): super(GaussianPolicy, self).__init__() self.linear1 = nn.Linear(num_inputs, 256) self.linear2 = nn.Linear(256, 256) self.mean_linear = nn.Linear(256, num_actions) self.log_std_linear = nn.Linear(256, num_actions) self.apply(weights_init_) # action rescaling if action_space is None: self.action_scale = torch.tensor(1.) self.action_bias = torch.tensor(0.) else: self.action_scale = torch.FloatTensor( (action_space.high - action_space.low) / 2.) self.action_bias = torch.FloatTensor( (action_space.high + action_space.low) / 2.) 
def forward(self, state): x = F.relu(self.linear1(state)) x = F.relu(self.linear2(x)) mean = self.mean_linear(x) log_std = self.log_std_linear(x) log_std = torch.clamp(log_std, min=LOG_SIG_MIN, max=LOG_SIG_MAX) return mean, log_std def sample(self, state): mean, log_std = self.forward(state) std = log_std.exp() normal = Normal(mean, std) # 重参数化技巧 (mean + std * N(0,1)) x_t = normal.rsample() y_t = torch.tanh(x_t) action = y_t * self.action_scale + self.action_bias log_prob = normal.log_prob(x_t) log_prob -= torch.log(self.action_scale * (1 - y_t.pow(2)) + epsilon) log_prob = log_prob.sum(1, keepdim=True) mean = torch.tanh(mean) * self.action_scale + self.action_bias return action, log_prob, mean def to(self, device): self.action_scale = self.action_scale.to(device) self.action_bias = self.action_bias.to(device) return super(GaussianPolicy, self).to(device)第3步: 定义sac训练部分class SAC(object): def __init__(self, num_inputs, action_space): self.alpha = alpha self.auto_entropy = auto_entropy self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # critic网络 self.critic = QNetwork(num_inputs, action_space.shape[0]).to(device=self.device) self.critic_optim = Adam(self.critic.parameters(), lr=lr) # critic_target网络 self.critic_target = QNetwork(num_inputs, action_space.shape[0]).to(self.device) hard_update(self.critic_target, self.critic) # Target Entropy = −dim(A) if auto_entropy is True: self.target_entropy = -torch.prod(torch.Tensor(action_space.shape).to(self.device)).item() self.log_alpha = torch.zeros(1, requires_grad=True, device=self.device) self.alpha_optim = Adam([self.log_alpha], lr=lr) self.policy = GaussianPolicy(num_inputs, action_space.shape[0], action_space).to(self.device) self.policy_optim = Adam(self.policy.parameters(), lr=lr) def select_action(self, state): state = torch.FloatTensor(state).to(self.device).unsqueeze(0) action, _, _ = self.policy.sample(state) return action.detach().cpu().numpy()[0] def update_parameters(self, memory, batch_size, updates): # Sample a batch from memory state_batch, action_batch, reward_batch, next_state_batch, mask_batch = memory.sample(batch_size=batch_size) state_batch = torch.FloatTensor(state_batch).to(self.device) next_state_batch = torch.FloatTensor(next_state_batch).to(self.device) action_batch = torch.FloatTensor(action_batch).to(self.device) reward_batch = torch.FloatTensor(reward_batch).to(self.device).unsqueeze(1) mask_batch = torch.FloatTensor(mask_batch).to(self.device).unsqueeze(1) with torch.no_grad(): # 经过policy_network得到action next_state_action, next_state_log_pi, _ = self.policy.sample(next_state_batch) # 输入next_state,和next_action,经过target_critic_network得到Q值 qf1_next_target, qf2_next_target = self.critic_target(next_state_batch, next_state_action) min_qf_next_target = torch.min(qf1_next_target, qf2_next_target) - self.alpha * next_state_log_pi next_q_value = reward_batch + mask_batch * gamma * (min_qf_next_target) # 将当前state,action输入critic_network得到Q值 qf1, qf2 = self.critic(state_batch, action_batch) # JQ = 𝔼(st,at)~D[0.5(Q1(st,at) - r(st,at) - γ(𝔼st+1~p[V(st+1)]))^2] qf1_loss = F.mse_loss(qf1, next_q_value) # JQ = 𝔼(st,at)~D[0.5(Q1(st,at) - r(st,at) - γ(𝔼st+1~p[V(st+1)]))^2] qf2_loss = F.mse_loss(qf2, next_q_value) qf_loss = qf1_loss + qf2_loss self.critic_optim.zero_grad() qf_loss.backward() self.critic_optim.step() pi, log_pi, _ = self.policy.sample(state_batch) qf1_pi, qf2_pi = self.critic(state_batch, pi) min_qf_pi = torch.min(qf1_pi, qf2_pi) # Jπ = 𝔼st∼D,εt∼N[α * logπ(f(εt;st)|st) − Q(st,f(εt;st))] 
policy_loss = ((self.alpha * log_pi) - min_qf_pi).mean() self.policy_optim.zero_grad() policy_loss.backward() self.policy_optim.step() if self.auto_entropy: alpha_loss = -(self.log_alpha * (log_pi + self.target_entropy).detach()).mean() self.alpha_optim.zero_grad() alpha_loss.backward() self.alpha_optim.step() self.alpha = self.log_alpha.exp() else: alpha_loss = torch.tensor(0.).to(self.device) if updates % target_update_interval == 0: soft_update(self.critic_target, self.critic, tau) def soft_update(target, source, tau): for target_param, param in zip(target.parameters(), source.parameters()): target_param.data.copy_(target_param.data * (1.0 - tau) + param.data * tau) def hard_update(target, source): for target_param, param in zip(target.parameters(), source.parameters()): target_param.data.copy_(param.data)第4步:定义replay buffer,存储[s,a,r,s_,done]class ReplayMemory: def __init__(self, capacity): random.seed(seed) self.capacity = capacity self.buffer = [] self.position = 0 def push(self, state, action, reward, next_state, done): if len(self.buffer) < self.capacity: self.buffer.append(None) self.buffer[self.position] = (state, action, reward, next_state, done) self.position = (self.position + 1) % self.capacity def sample(self, batch_size): batch = random.sample(self.buffer, batch_size) state, action, reward, next_state, done = map(np.stack, zip(*batch)) return state, action, reward, next_state, done def __len__(self): return len(self.buffer)4. 训练模型初始化环境和算法# 创建环境 env = gym.make(env_name) # 设置随机数 env.seed(seed) env.action_space.seed(seed) torch.manual_seed(seed) np.random.seed(seed) # 创建agent agent = SAC(env.observation_space.shape[0], env.action_space) # replay buffer memory = ReplayMemory(replay_size) # 训练步数记录 total_numsteps = 0 updates = 0 max_reward = 0开始训练print('\ntraining...') begin_t = time.time() for i_episode in itertools.count(1): episode_reward = 0 episode_steps = 0 done = False state = env.reset() while not done: if start_steps > total_numsteps: # 随机采样过程 action = env.action_space.sample() else: # 根据策略采样 action = agent.select_action(state) if len(memory) > batch_size: # 每个step更新次数 for i in range(updates_per_step): agent.update_parameters(memory, batch_size, updates) updates += 1 # 执行该步 next_state, reward, done, _ = env.step(action) # 更新记录参数 episode_steps += 1 total_numsteps += 1 episode_reward += reward # -done mask = 1 if episode_steps == env._max_episode_steps else float(not done) # 存入buffer memory.push(state, action, reward, next_state, mask) # 更新state state = next_state # 达到终止条件后,停止 if total_numsteps > num_steps: break if episode_reward >= max_reward: max_reward = episode_reward print("current_max_reward {}".format(max_reward)) # 保存模型 torch.save(agent.policy, "model.pt") print("Episode: {}, total numsteps: {}, reward: {}".format(i_episode, total_numsteps,round(episode_reward, 2))) env.close() print("finish! time cost is {}s".format(time.time() - begin_t))5. 使用模型推理游戏由于本内核可视化依赖于OpenGL,需要窗口显示,但当前环境暂不支持,因此无法可视化,请将代码下载到本地,取消 env.render() 这行代码的注释,可查看可视化效果。# 可视化部分 model = torch.load("model.pt") model.eval() device = torch.device("cuda" if torch.cuda.is_available() else "cpu") state = env.reset() # env.render() done = False episode_reward = 0 while not done: _, _, action = model.sample(torch.FloatTensor(state).to(device).unsqueeze(0)) action = action.detach().cpu().numpy()[0] next_state, reward, done, _ = env.step(action) episode_reward += reward # env.render() state = next_state print(episode_reward)可视化效果如下:
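针对上面“2. 训练参数初始化”中“# 请在此处实现代码”的部分,下面给出一组示意性的超参数定义(仅为常见取值的参考草稿,并非本作业的标准答案;变量名按照后文代码中引用到的全局变量整理,其中 env_name 为假设值,请以原案例使用的环境名为准):
env_name = 'InvertedPendulumBulletEnv-v0'  # 假设:pybullet 提供的倒立摆环境,请按原案例替换
seed = 123456                 # 随机数种子
gamma = 0.99                  # 折扣率
tau = 0.005                   # critic_target 软更新系数
lr = 0.0003                   # Adam 学习率
alpha = 0.2                   # 熵系数(temperature)初始值
auto_entropy = True           # 是否自动调节熵系数
target_update_interval = 1    # 每隔多少次参数更新做一次软更新
replay_size = 1000000         # ReplayMemory 容量
batch_size = 256              # 每次更新采样的样本数
updates_per_step = 1          # 每个环境 step 的参数更新次数
start_steps = 10000           # 前若干步使用随机动作填充 replay buffer
num_steps = 30000             # 训练总步数,与上文“num_steps = 30000”一致
LOG_SIG_MAX = 2               # 高斯策略 log_std 的上限
LOG_SIG_MIN = -20             # 高斯策略 log_std 的下限
epsilon = 1e-6                # 计算 log_prob 时防止 log(0) 的数值稳定项
这些取值只是为了让整段代码可以独立跑通;按作业要求,你可以在此基础上调整 num_steps、start_steps、batch_size 等参数,观察能否获得更好的成绩。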
  • [问题求助] 在第三方调用上传图片成功,然后回显查看报错未授权
    在第三方调用上传图片成功,然后回显查看报错未授权
  • [高校开发者专区] 【HCSD-DevCloud训练营学习笔记】--飞机大战游戏项目部署学习总结
    1.项目介绍本项目是通过DevCloud对飞机大战游戏进行开发,并且部署到鲲鹏云服务器上。2.实验资源2.1首先需要在华为云官网创建‘华北-北京四’区域的云服务资源;2.2然后在控制台中创建虚拟私有云VPC;2.3在访问控制中创建安全组并且添加规则;3.云端环境配置此处需要在服务列表中购买弹性云服务器ECS4 创建项目在https://devcloud.cn-north-4.huaweicloud.com/home中创建项目5.上传代码配置好GIT与SSH密钥,将本地飞机大战代码推送到云端6.项目构建在云端将项目执行与发布7.部署项目在项目的构建与发布选项下,创建主机组,连通主机后对项目进行部署最后验证项目是否部署成功
  • [伙伴解决方案] 初创工作室如何找到合适的技术合伙人,游戏类主程
    介绍:理工背景,手上有完善的设计案,游戏类,想寻找合作伙伴主程成立工作室个人负责:整体方案(包括不限于逻辑,玩法,公式,数值,剧情等),完整发展规划,一步一步来需求:能良性沟通(设计可能不符合程序语言,都可以按程序逻辑进行修改),上进心(喜好钻研,共通积累经验),有空闲时间(可以兼职初创,出成绩再考虑下一步如何进行,能合理分配时间)