-
I. Environment Provisioning

1. Server resource provisioning flow (see figure)
2. Stages and tasks of the provisioning flow. Preparation tasks:
   1) Apply to enable the resource flavor
   2) Increase resource quotas
   3) Enable basic permissions
   4) Configure the ModelArts agency authorization
   5) Purchase the Server resource pool in the ModelArts console

Step 1: Apply to enable the resource flavor
Contact your account manager or interface person to confirm the Server resource plan and apply for the flavor to be enabled (if you have no account manager, submit a service ticket).

Step 2: Increase resource quotas
The resources required by Server may exceed the default quotas (for example ECS, EIP, SFS, memory size, and CPU cores), so the quotas must be increased first.
1) Log in to the cloud management console.
2) In the top navigation bar, choose "Resources > My Quotas" to open the service quota page.
3) Click "Increase Quota" in the upper right corner, fill in the application, and submit the ticket.
Note: The quota must be larger than the resources to be provisioned, and the increase must be completed before purchase; otherwise resource provisioning will fail.
----End

Step 3: Enable basic permissions (skip this step for the root account or an existing IAM sub-account)
Log in with an administrator account and grant the sub-account the basic permissions required by the Server feature (ModelArts FullAccess, BMS FullAccess, ECS FullAccess, VPC FullAccess, VPC Administrator, VPCEndpoint Administrator), that is, allow the sub-account to use all of these cloud services.
1) Log in to the Identity and Access Management (IAM) console.
2) Click "User Groups" in the left navigation, then click "Create User Group" in the upper right corner.
3) Enter a user group name and click "OK".
4) In the Operation column, click "Manage User Group" and add the users that need the permissions to the group.
5) Click the user group name to open its details page.
6) On the Permissions tab, click "Authorize".
7) Enter "ModelArts FullAccess" in the search box and select it. (Figure 2-3: ModelArts FullAccess) In the same way, add BMS FullAccess, ECS FullAccess, VPC FullAccess, VPC Administrator, and VPCEndpoint Administrator in turn. (Server Administrator and DNS Administrator are dependent policies and are selected automatically.)
8) Click "Next" and set the authorization scope to "All resources".
9) Click "OK" to complete enabling the basic permissions.
----End

Step 4: Create the agency authorization on ModelArts
ModelArts Lite Server needs to access your other cloud services while executing tasks; a typical case is pulling images from SWR when working with containers. In this process ModelArts accesses other cloud services "on behalf of" the user. For security, ModelArts must obtain the user's authorization before accessing any cloud service on their behalf, and granting that authorization is the "agency" process: the user authorizes ModelArts to access specific cloud services on their behalf so it can complete the AI computing tasks run on the ModelArts platform.
- Create an agency: the first time you use ModelArts you must create the agency authorization, which allows ModelArts to access other cloud services on your behalf. Open the "Permission Management" page of the ModelArts console, click "Add Authorization", and follow the prompts.
- Update an agency: if you have created an agency for ModelArts before, you can update it here.
  a. Open "Resource Management > AI Dedicated Resource Pools > Elastic Node Server" in the ModelArts console and check whether a missing-authorization prompt is shown. (Figure 2-4: missing-permission prompt for Elastic Node Server)
  b. If authorization is missing, click "here" in the prompt to update the agency, choose "Append to existing authorization", and click "OK"; the system reports that the permissions were updated successfully. (Figure 2-5: appending authorization)

Step 5: Purchase Elastic Node Server resources
Purchasing Elastic Node Server resources is the resource creation process.
1) Log in to the ModelArts management console.
2) In the left navigation pane, choose "Resource Management > AI Dedicated Resource Pools > Elastic Node Server" to open the node list.
3) Click "Buy AI Dedicated Node" in the upper right of the node list and fill in the parameters on the purchase page (based on the region you applied for: Guiyang, Ulanqab, or East China).

II. Software Installation

1. Configure the Lite Server software environment on the NPU server

Install the driver:

```bash
wget "https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Ascend HDK/Ascend HDK 23.0.3/Ascend-hdk-910b-npu-driver_23.0.3_linux-aarch64.run"
sudo sh Ascend-hdk-910b-npu-driver_23.0.3_linux-aarch64.run --full --install-for-all
```

After installation, restart the server (run the `reboot` command).

Verify:

```bash
npu-smi info
```

If npu-smi prints the complete device table shown in the figure above, the driver is working normally.

Install the firmware:

```bash
wget "https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Ascend HDK/Ascend HDK 23.0.3/Ascend-hdk-910b-npu-firmware_7.1.0.5.220.run"
sudo sh Ascend-hdk-910b-npu-firmware_7.1.0.5.220.run --full
```

Check the installation result (Figure: checking firmware and driver versions):

```bash
npu-smi info -t board -i 1 | egrep -i "software|firmware"
```

Install CANN:

```bash
# Install the CANN Toolkit
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Milan-ASL/Milan-ASL%20V100R001C17SPC701/Ascend-cann-toolkit_8.0.RC1.alpha001_linux-"$(uname -i)".run
bash Ascend-cann-toolkit_8.0.RC1.alpha001_linux-"$(uname -i)".run --install

# Install the CANN Kernels
wget cid:link_1
bash Ascend-cann-kernels-910b_8.0.RC1.alpha001_linux.run --install

# Note the installation order: install the Toolkit first, then the Kernels.

# Load the environment variables required by the Ascend development toolkit
source /usr/local/Ascend/ascend-toolkit/set_env.sh
```

Install Docker:

```bash
yum install -y docker-engine.aarch64 docker-engine-selinux.noarch docker-runc.aarch64
```

Configure Ascend-docker-runtime:

```bash
# Download Ascend-docker-runtime
wget cid:link_0
chmod 700 *.run
./Ascend-docker-runtime_5.0.RC2_linux-aarch64.run --install

# Check the installation; the output should list the ascend runtime
docker info | grep Runtime
```

Configure Docker to use the newly mounted disk as the container data path by writing the following to /etc/docker/daemon.json:

```json
{
    "data-root": "/home/docker",
    "default-runtime": "ascend",
    "default-shm-size": "8G",
    "insecure-registries": ["ascendhub.huawei.com"],
    "registry-mirrors": ["https://90c2db5e37a147c6afd970e45c68e365.mirror.swr.myhuaweicloud.com"],
    "runtimes": {
        "ascend": {
            "path": "/usr/local/Ascend/Ascend-Docker-Runtime/ascend-docker-runtime",
            "runtimeArgs": []
        }
    }
}
```

After saving the file, restart Docker so the configuration takes effect:

```bash
systemctl daemon-reload && systemctl restart docker
```
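With Docker restarted on the new configuration, a quick container smoke test can confirm that the NPUs are visible through the Ascend runtime. This is a hedged sketch: the image name is a placeholder, and ASCEND_VISIBLE_DEVICES is assumed to be the device-selection variable honored by Ascend Docker Runtime.

```bash
# Hedged smoke test: <your-npu-image> is a placeholder for any image that contains npu-smi.
# ASCEND_VISIBLE_DEVICES selects which NPUs the ascend runtime maps into the container.
docker run --rm -e ASCEND_VISIBLE_DEVICES=0 <your-npu-image> npu-smi info
```

If the same device table appears inside the container as on the host, the runtime is wired up correctly.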
III. Model Deployment

1. Deploy the Qwen3-Embedding-8B model

Download the weights: fetch the weight files from Hugging Face or from the ModelScope community.
ModelScope: https://modelscope.cn/models/Qwen/Qwen3-Embedding-8B

Install the dependencies:

```bash
# 1. torch and torch-npu
pip install torch==2.5.1
# torch-npu dependencies
pip install pyyaml
pip install setuptools
# torch-npu itself
pip install torch-npu==2.5.1

# 2. vllm and vllm-ascend
# (1) System dependencies
yum update -y
yum install -y gcc g++ cmake wget git
# (2) pip packages
pip install vllm==0.9.0
pip install vllm-ascend==0.9.0rc2
# (3) Check that the vllm serve command works; if the help text shown in the
#     figure is printed, the installation is complete.
vllm serve --help
```

Load the Huawei Ascend environment variables:

```bash
# CANN environment; installed under /usr/local by default
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# Acceleration library environment
source /usr/local/Ascend/nnal/atb/set_env.sh
```

Deploy the API endpoint through vLLM and start the service:

```bash
nohup python3 -m vllm.entrypoints.openai.api_server \
  --model /home/liboran/Qwen3/Qwen3-Embedding-8B \
  --served-model-name Qwen3-Embedding-8B \
  --host 0.0.0.0 \
  --port 8083 \
  --dtype bfloat16 \
  --tensor-parallel-size 8 \
  --max-num-seqs 128 \
  --max-model-len 32768 \
  --max-num-batched-tokens 32768 \
  --block-size 128 \
  --gpu-memory-utilization 0.9 \
  --enable-prefix-caching \
  --trust-remote-code \
  > /home/Qwen3/logs/embedding_api.log 2>&1 &
```

After running the launch command, check the log; output like the figure above means the service is running.

Validate the service with an example request:

```bash
curl http://127.0.0.1:8083/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "Your text string goes here", "model": "Qwen3-Embedding-8B"}'
```

The service runs normally; the interface returns the response shown in the figure.

2. Deploy the Qwen3-Reranker-8B model

Download the weights: fetch the weight files from Hugging Face or from the ModelScope community.
ModelScope: https://modelscope.cn/models/Qwen/Qwen3-Reranker-8B

Install the dependencies:

```bash
# 1. torch and torch-npu
pip install torch==2.5.1
# torch-npu dependencies
pip install pyyaml
pip install setuptools
# torch-npu itself
pip install torch-npu==2.5.1
# transformers
pip install transformers==4.51.0

# 2. Serving dependencies
pip install fastapi
pip install pydantic
pip install typing
```

Load the Huawei Ascend environment variables:

```bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
```

Start the model: deploy it behind FastAPI by creating a new launch file, e.g. rerank.py (a hedged sketch of such a file follows at the end of this section). Then start the server with uvicorn:

```bash
nohup uvicorn rerank:app --reload --host 0.0.0.0 --port 8084 \
  > /home/liboran/Qwen3/logs/rerank_server.log 2>&1 &
```

Validate the service with an example request:

```bash
curl -s -X POST "http://127.0.0.1:8084/v1/rerank" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-Reranker-8B",
    "query": "中国的首都是什么?",
    "documents": ["中国的首都是北京.", "重力是一种使两个物体相互吸引的力", "法国首都是巴黎"]
  }'
```

The service runs normally; the interface returns the response shown in the figure.
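As referenced above, here is a minimal sketch of the rerank.py launch file, assuming the published Qwen3-Reranker scoring recipe (comparing the logits of "yes" vs. "no" at the last token position). The model path, device string, prompt template details, and score format are assumptions to adapt to your environment; this is an illustration, not the original author's file.

```python
# rerank.py - hedged sketch of a FastAPI wrapper around Qwen3-Reranker-8B.
# Assumptions: weights at MODEL_PATH, an Ascend NPU visible as "npu:0",
# and the yes/no scoring template used in Qwen's published examples.
import torch
import torch_npu  # registers the "npu" device type with PyTorch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_PATH = "/home/liboran/Qwen3/Qwen3-Reranker-8B"  # assumed weight location
DEVICE = "npu:0"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, torch_dtype=torch.bfloat16).to(DEVICE).eval()
token_true = tokenizer.convert_tokens_to_ids("yes")
token_false = tokenizer.convert_tokens_to_ids("no")

PREFIX = ('<|im_start|>system\nJudge whether the Document meets the requirements based on '
          'the Query and the Instruct provided. Note that the answer can only be "yes" or "no".'
          '<|im_end|>\n<|im_start|>user\n')
SUFFIX = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"

app = FastAPI()

class RerankRequest(BaseModel):
    model: str
    query: str
    documents: list[str]

@app.post("/v1/rerank")
def rerank(req: RerankRequest):
    # Build one yes/no judgment prompt per (query, document) pair
    pairs = [f"{PREFIX}<Query>: {req.query}\n<Document>: {doc}{SUFFIX}"
             for doc in req.documents]
    inputs = tokenizer(pairs, padding=True, truncation=True,
                       max_length=8192, return_tensors="pt").to(DEVICE)
    with torch.no_grad():
        logits = model(**inputs).logits[:, -1, :]  # logits at the final position
        pair = torch.stack([logits[:, token_false], logits[:, token_true]], dim=1)
        scores = torch.nn.functional.log_softmax(pair, dim=1)[:, 1].exp()  # P("yes")
    results = sorted(
        [{"index": i, "document": d, "relevance_score": float(s)}
         for i, (d, s) in enumerate(zip(req.documents, scores))],
        key=lambda r: r["relevance_score"], reverse=True)
    return {"model": req.model, "results": results}
```

With a file like this in place, the `nohup uvicorn rerank:app ...` command above serves POST /v1/rerank, which the curl example exercises.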
-
Connecting from a Mac laptop to a Huawei Cloud server (huaweicloud-910b) fails. The error log is below. What is the cause, and how can it be fixed?

```
2025-09-07 20:35:24.435 [error] Error installing server: Failed to connect to the remote SSH host. Please check the logs for more details.
2025-09-07 20:35:24.435 [info] Retrying connection in 5 seconds...
2025-09-07 20:35:29.440 [info] Deleting local script /var/folders/y_/x2m2sfln3c5gw00qjt64x23r0000gn/T/cursor_remote_install_4c0c5ce5-07db-4e98-b5e2-810e5d8ab100.sh
2025-09-07 20:35:29.441 [info] Using askpass script: /Users/sunxiao/.cursor/extensions/anysphere.remote-ssh-1.0.27/dist/scripts/launchSSHAskpass.sh with javascript file /Users/sunxiao/.cursor/extensions/anysphere.remote-ssh-1.0.27/dist/scripts/sshAskClient.js. Askpass handle: /var/folders/y_/x2m2sfln3c5gw00qjt64x23r0000gn/T/cursor-ssh-dt0408/socket.sock
2025-09-07 20:35:29.446 [info] Launching SSH server via shell with command: cat "/var/folders/y_/x2m2sfln3c5gw00qjt64x23r0000gn/T/cursor_remote_install_abc21a0b-9caa-422a-b082-0537e7193f1c.sh" | ssh -T -D 54168 huaweicloud-910b bash --login -c bash
2025-09-07 20:35:29.446 [info] Establishing SSH connection: cat "/var/folders/y_/x2m2sfln3c5gw00qjt64x23r0000gn/T/cursor_remote_install_abc21a0b-9caa-422a-b082-0537e7193f1c.sh" | ssh -T -D 54168 huaweicloud-910b bash --login -c bash
2025-09-07 20:35:29.446 [info] Started installation script. Waiting for it to finish...
2025-09-07 20:35:29.446 [info] Waiting for server to install...
2025-09-07 20:35:29.555 [info] (ssh_tunnel) stderr: ssh: connect to host dev-modelarts.cn-east-4.huaweicloud.com port 31628: Connection refused
```
-
When running inference with LLaMA-Factory, the error below occurs. What is the cause, and how can it be fixed? (The traceback lines are interleaved because multiple processes wrote concurrently.)

```
AssertionError: Torch not compiled with CUDA enabled
    return func(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/LLaMA-Factory-sunxiao/lib/python3.10/site-packages/transformers/modeling_utils.py", line 842, in _load_state_dict_into_meta_model
    param = param[...]
  File "/home/ma-user/anaconda3/envs/LLaMA-Factory-sunxiao/lib/python3.10/site-packages/torch/cuda/__init__.py", line 289, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
_error_msgs, disk_offload_index, cpu_offload_index = load_shard_file(args)
AssertionError
  File "/home/ma-user/anaconda3/envs/LLaMA-Factory-sunxiao/lib/python3.10/site-packages/transformers/modeling_utils.py", line 974, in load_shard_file
: Torch not compiled with CUDA enabled
    param = param[...]
  File "/home/ma-user/anaconda3/envs/LLaMA-Factory-sunxiao/lib/python3.10/site-packages/torch/cuda/__init__.py", line 289, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
```
-
OBS data was lost; I want to transfer the data directly from another node into the ModelArts (MA) environment.
-
I want to use SFS Turbo storage in ModelArts. Is purchasing a dedicated resource pool required in order to connect to SFS Turbo? If I have no resources available to purchase right now, is there another way to use SFS Turbo?
-
NO.2: Knowledge Base / RAG Capabilities in Depth

<ModelArts Versatile AI-native Application Engine trial entry> Huawei Developer Space -- Development Platform -- Versatile Agent (please open on a PC)

What is an AI Agent

An AI Agent is an intelligent program or system that can autonomously perceive its environment, set goals, plan actions, execute tasks, and learn continuously. For example, tell an AI Agent to order takeout and it can call the delivery app to choose the meal and then call the payment program to place and pay for the order, with no human specifying each step. In essence, the large model acts as the brain that issues commands: its reasoning ability caps its decision-making, and its tool modules cap its ability to act, the breadth of tasks it can complete, and the domains in which it can operate. The memory module, as the information store, saves and shares short- and long-term information across interactions; it determines the accuracy and speed of the agent's planning and decisions and enables more personalized responses.

· AI Agent: ModelArts Versatile

ModelArts Versatile, the AI-native application engine, is a one-stop, enterprise-grade, full-lifecycle agent platform: a toolchain for building applications on enterprise-specific large models. It provides flexible canvas-style AI Agent development so agents can accurately solve complex business problems, and covers the full lifecycle from agent development through use, operations, and maintenance.

A quick look at AI Agents: knowledge base / RAG

· The knowledge base: a "reservoir" of information

A knowledge base is a system for organizing, storing, and managing knowledge. It covers the classification and organization of documents, images, videos, and other information, helping users manage large volumes of information efficiently. Adding a knowledge base to an agent lets it interact with the user's domain knowledge, markedly improving the agent's accuracy and professionalism. A knowledge base can draw its data either from directly connected source data or from a selected knowledge dataset.

A domain knowledge base is no longer a simple pile of data; it requires intelligent, efficient knowledge management and use. Concretely, the knowledge base stores past interactions, knowledge snippets, and task state (often in a vector database). The knowledge base itself is static storage; for its knowledge, especially massive unstructured knowledge, to be understood and used efficiently by a large model or RAG system, smarter retrieval technology is needed, and that is exactly where vector databases shine.

- Structured knowledge base: usually a relational database (such as MySQL or PostgreSQL). Data is organized as tables with strict schemas and relation definitions, well suited to highly regular data such as transactions, records, and user information.
- Unstructured knowledge base: stores documents, PDFs, PPTs, web pages, images, audio, and video in their raw form. File systems, document management systems, and object storage all belong to this category. Capacity is huge, but the information is loosely organized and hard to use directly.

· Knowledge datasets

Knowledge datasets are the building blocks of a knowledge base. Their core purpose is to process scattered knowledge (documents, images, videos, etc.) into structured knowledge units: data is ingested by local upload or from OBS, preprocessed, split into smaller knowledge chunks, and indexed so the agent can quickly locate what it needs. After creating a knowledge dataset, associate it with a knowledge base to form a knowledge system the agent can call.

· RAG: injecting precise information into the model

RAG (Retrieval-Augmented Generation) combines information retrieval with text generation. It dynamically brings in an external knowledge base to improve the output quality of a large language model, addressing stale knowledge, missing domain knowledge, and factual errors by retrieving relevant external sources before generating an answer, which makes answers more accurate and reliable. When the AI receives a question, RAG dynamically looks up relevant information in the external knowledge base and adds it to the prompt handed to the model, improving answer quality. In essence, RAG is an engineering solution to the difficulty of updating an LLM's knowledge: its core mechanism is a knowledge database (usually a vector database) attached to the LLM that stores new and domain-specific data absent from the training set. RAG typically splits knowledge-grounded question answering into three stages: indexing, knowledge retrieval, and content-grounded answering (a minimal code sketch of these three stages appears at the end of this article). Its advantages: it mitigates the hallucination problem and improves accuracy in specialized domains; it lets private knowledge bases work together with general-purpose models; and it is more cost-effective and flexible than fine-tuning.

RAG types: VectorRAG (vector RAG) and GraphRAG (knowledge-graph RAG)

- VectorRAG (vector retrieval-augmented generation) combines vectorization with large language models. VectorRAG maps unstructured data into a structured vector space and uses a vector store for efficient information retrieval.
  Principle: a vector database converts text into high-dimensional vectors and retrieves relevant chunks by semantic similarity.
  Strengths: efficient on unstructured text (document QA, FAQ systems); simple to implement with modest compute requirements.

- GraphRAG (graph-based retrieval-augmented generation) combines knowledge graphs with large language models. GraphRAG can process documents of all kinds, extracting entities (concrete objects or concepts in a document), relations, and text content to build a knowledge graph (a structured knowledge representation), strengthening the model's understanding of and reasoning over complex information.
  Principle: build a knowledge graph of entity relations and run multi-hop retrieval over a graph database (such as Neo4j).
  Strengths: good at structured and semi-structured data (knowledge-graph QA); supports complex logical reasoning and global semantic understanding.

· RAG and the knowledge base: synergy through complementary roles

Complementary functions: the knowledge base, as a static information store, mainly provides precise retrieval over structured data (for example, enterprise document lookup), while RAG dynamically retrieves knowledge-base content and uses the large model to generate natural-language answers. The combination keeps the knowledge base's retrieval efficiency while curbing the model's hallucinations.

Technical dependency: RAG usually builds on a knowledge base. Documents are chunked and stored as vectors; a user query triggers semantic-similarity retrieval; and the retrieved results are fed to the model as context for the final answer. In this architecture the knowledge base is RAG's "data infrastructure".

In an agent architecture, the knowledge base is the static knowledge carrier and RAG is the dynamic mechanism for invoking it. Together they close the "store - retrieve - generate" loop, extending the model's capability while keeping outputs accurate and traceable. As multimodal knowledge bases mature, RAG's application scenarios will widen further.

Competitive strengths of knowledge base / RAG in the ModelArts Versatile AI-native application engine

Versatile Agent accumulates industry know-how and uses knowledge engineering to lift large-model performance and support industry application innovation: extracting industry and enterprise experience, continuously accumulating scenario experience templates, and iterating a dynamic experience knowledge base.

01 High efficiency: enterprise knowledge bases learn continuously with day-level iteration; scenario learning iterates weekly; domain learning iterates quarterly.
02 RAG: a domain knowledge base plus retrieval engine makes model outputs more reliable. In practice, RAG retrieval augmentation raised answer accuracy from 50% to 83%; output accuracy improves markedly, industry knowledge stays continuously updated, and retrieval plus generation gets faster.
03 Multiple real-time knowledge sources: the engine connects to many types of knowledge bases, pulling in the latest enterprise knowledge for more timely and accurate generated answers, with flexible, configurable fused retrieval strategies.
04 Industry knowledge for large models: the engine uses RAG to infuse industry knowledge into the model, improving industry understanding and business accuracy. Embedded in enterprise business scenarios, experience is extracted from them and "experience templates" are iteratively refined into an enterprise experience template library (a dedicated knowledge base).

Related features:
- Knowledge dataset management: connect source data, process data, basic dataset operations.
- Knowledge base management: connect source data or select knowledge datasets; basic knowledge-base operations; index field configuration; retrieval modes; capability exposure.
- Third-party knowledge base access: third-party knowledge base API access, with adapters for third-party graph databases.

Creating a knowledge base:
- From a knowledge dataset: in the "Select knowledge dataset" panel, select the target dataset, its version, and the index configuration. When the knowledge base's RAG type is "VectorRAG", retrieval can run in hybrid, semantic, or full-text mode; when the type is "GraphRAG", retrieval defaults to "semantic".
- From a data source: choose one of two ingestion methods. 01 Local upload: the data files are on the local machine and are selected and uploaded from there. 02 OBS ingestion: the data files are stored in a Huawei Cloud OBS bucket and are ingested from the bucket.

Knowledge flywheel

High efficiency: the knowledge flywheel lets enterprises create a knowledge base at week level and iterate it at day level.
- Data accumulation and backflow: collect incremental business data and user feedback data, distill knowledge from them, and feed it to the large model to keep strengthening its capabilities.
- Multi-round iteration: three nested loops (day-level continuous learning, automatic; week-level scenario learning; quarter-level domain learning) keep improving model capability and application results.

For refining and learning from feedback data, the ModelArts Versatile AI-native application engine builds the knowledge flywheel from pre-integrated tools, mainstream model integration, business-application information-extraction plugins, intelligent application-feedback plugins, and automated scheduling. The incremental data and user feedback accumulated by business staff in daily work are distilled into enterprise knowledge and continuously fed to the large model for multi-round learning with day-level iteration, steadily improving the experience and effect of intelligent applications.

What problems does Versatile's knowledge base / RAG solve (the knowledge center)

Structured knowledge storage:
- Domain knowledge bases (e.g. medical terminology, legal clauses) provide the professional cognitive foundation.
- Enterprise knowledge bases (product manuals, business processes) support scenario-specific services.
- User-profile knowledge bases (preference records, behavior patterns) enable personalized interaction.

Dynamic knowledge evolution:
- Short-term memory settles into long-term knowledge (e.g. dialogue summaries become user feature vectors).
- The knowledge graph auto-updates entity relations (e.g. a new "user - pet - supplies" association).

Breaking capability boundaries:
- Goes beyond the LLM's native knowledge (answering internal process questions from a private enterprise knowledge base).
- Supports multimodal knowledge fusion (text-image knowledge bases plus voice-interaction memory).

A base for continuous learning:
- Supplies high-quality training data for model fine-tuning (cleaned user interaction records).
- Builds an explainable knowledge-provenance chain (decisions traceable to specific knowledge entries).

The knowledge center built into the ModelArts Versatile AI-native application engine uses a built-in knowledge base plus RAG to break through the knowledge limits and hallucination problem of large models. The core idea is not to make the model memorize all knowledge but to give it the ability to "look things up on demand". This greatly improves performance and reliability in specialized domains and fact-critical scenarios, cleverly combining the model's generative power with the richness of structured and unstructured external knowledge sources for a "1 + 1 > 2" effect. It also improves the AI's interaction capability and opens the door to complex scenarios: conversational continuity, complex task handling, long-running project management, and enhanced cognition.

Conclusion

The ModelArts Versatile AI-native application engine combines knowledge bases with RAG to serve as the "second store" of the agent brain: it builds knowledge bases fused with business scenarios, absorbs accumulated industry experience, enables fast retrieval and access to information, and strengthens agents' autonomous execution and the reliability of their results. Knowledge engineering lifts large-model performance and further supports industry application innovation. At the same time, Versatile helps developers simplify agent development into a basic skill, quickly building intelligent applications across domains, reaching thousands of industries, and transforming intelligent productivity.

Click to visit the Huawei Cloud ModelArts Versatile AI-native Application Engine official site >>

Recommended in this series: AI Agent series | ModelArts Versatile plugins: MCP / tool capabilities in depth
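As promised above, here is a minimal, framework-agnostic sketch of the three RAG stages (indexing, retrieval, generation). The embed() function is a placeholder (a random projection keyed by a hash), so retrieval here only demonstrates the mechanics, not real semantics; in practice it would be a real embedding model, and the assembled prompt would be sent to an LLM. This is an illustration, not the Versatile implementation.

```python
# Minimal VectorRAG sketch: index -> retrieve -> generate.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedder: deterministic pseudo-random unit vector per text."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(128)
    return v / np.linalg.norm(v)

# 1) Indexing: split documents into chunks and store (chunk, vector) pairs.
docs = ["ModelArts Versatile is an agent platform.",
        "RAG retrieves external knowledge before generation."]
index = [(chunk, embed(chunk)) for chunk in docs]

# 2) Retrieval: rank chunks by cosine similarity to the query vector.
def retrieve(query: str, k: int = 2):
    q = embed(query)
    return sorted(index, key=lambda item: -float(item[1] @ q))[:k]

# 3) Generation: prepend the retrieved chunks to the prompt for the LLM.
def generate(query: str) -> str:
    context = "\n".join(chunk for chunk, _ in retrieve(query))
    prompt = f"Answer using the context.\nContext:\n{context}\n\nQuestion: {query}"
    return prompt  # in a real system, this prompt is sent to the LLM

print(generate("What does RAG do?"))
```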
-
"当前AI技术存在明显的'数字鸿沟':大公司掌握尖端模型,而普通人只能使用有限功能。这是否会加剧社会不平等?我们是否需要像水电一样建立公共AI基础设施?"
-
Few-shot classification with metric learning

1. Prototypical Networks

Compute a prototype vector (class centroid) for each class, and classify a sample by its distance to the prototypes.

```python
import torch
import torch.nn as nn

def compute_prototypes(embeddings, labels):
    # One prototype per class: the mean embedding of that class's support samples
    classes = torch.unique(labels)
    return torch.stack([embeddings[labels == c].mean(dim=0) for c in classes])

def compute_distances(queries, prototypes):
    # Squared Euclidean distance from each query embedding to each prototype
    return torch.cdist(queries, prototypes) ** 2

class PrototypicalNetwork(nn.Module):
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder

    def forward(self, support_images, support_labels, query_images):
        # Encode the support and query sets
        support_embeddings = self.encoder(support_images)
        query_embeddings = self.encoder(query_images)
        # Compute the class prototypes
        prototypes = compute_prototypes(support_embeddings, support_labels)
        # Compute distances and classify: larger logits mean closer prototypes
        distances = compute_distances(query_embeddings, prototypes)
        return -distances
```

2. Siamese Networks

Learn a similarity metric between sample pairs, and classify a query by comparing its similarity to the support samples.

Core idea: a Siamese network uses a twin structure and judges the similarity of two inputs by comparing their feature representations:
- Network structure: two identical sub-networks with shared weights
- Input: one pair of samples at a time (a positive or a negative pair)
- Objective: learn a distance metric function

A hedged sketch follows below. For best practices, see the Notebook example in AI Gallery:
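A minimal PyTorch sketch of the Siamese idea. The encoder, the contrastive-loss margin, and the nearest-support classification rule are illustrative assumptions, not the AI Gallery notebook:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseNetwork(nn.Module):
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder  # both branches share this encoder's weights

    def forward(self, x1, x2):
        # Embed both inputs with the same encoder and return their distance
        e1, e2 = self.encoder(x1), self.encoder(x2)
        return F.pairwise_distance(e1, e2)

def contrastive_loss(distance, same_class, margin=1.0):
    # Pull same-class pairs together; push different-class pairs at least `margin` apart
    pos = same_class * distance.pow(2)
    neg = (1 - same_class) * F.relu(margin - distance).pow(2)
    return (pos + neg).mean()

def classify(model, query, support_images, support_labels):
    # Assign the query the label of its most similar (closest) support sample
    d = model(query.expand(support_images.size(0), *query.shape[1:]), support_images)
    return support_labels[d.argmin()]
```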
-
Multi-node training with llama_factory: using the same launch code and launch script, the job starts and runs on 910b1 but errors out on 910b3. With a single node (8 cards of 910b3) it runs fine. The tests on 910b1 and 910b3 used different images, each configured in its own notebook and each verified to run single-node multi-card. I did not use the same image because I had already tried that (an image configured on 910b1) and it was even worse: with the exact same image and code, unchanged, it ran on 910b1 (both single- and multi-node) but not on 910b3 (neither single- nor multi-node). After configuring a separate image on 910b3, at least single-node training runs there (both in the notebook and in training jobs), but as soon as I use multiple nodes it errors again. The failure is highly reproducible: any multi-node run fails, whether with 2, 3, or 4 nodes.

Some image and environment information:

```
- `llamafactory` version: 0.9.4.dev0
- Platform: Linux-4.19.90-vhulk2211.3.0.h1543.eulerosv2r10.aarch64-aarch64-with-glibc2.28
- Python version: 3.10.6
- PyTorch version: 2.5.1 (NPU)
- Transformers version: 4.55.0
- Datasets version: 3.6.0
- Accelerate version: 1.7.0
- PEFT version: 0.15.2
- TRL version: 0.9.6
- NPU type: Ascend910B2
- CANN version: 8.0.RC3
- DeepSpeed version: 0.16.9
```

My launch script (the training code is LLaMA-Factory) is below; the same script has been verified to run correctly on multi-node 910b1 and on single-node 910b3:

```bash
# multi_full_sft.sh
#!/usr/bin/env bash
set -eux

# cd to the parent directory of the script's location
cd "$(dirname "$0")"/..

# --------- environment variables injected by ModelArts ----------
export MASTER_ADDR=$(echo $VC_WORKER_HOSTS | cut -d',' -f1)
export MASTER_PORT=29500                 # pick another port if this one is taken
export NPROC_PER_NODE=${MA_NUM_GPUS}
export NNODES=${MA_NUM_HOSTS}
export RANK=${VC_TASK_INDEX}
export HCCL_CONNECT_TIMEOUT=7200         # recommended value for Ascend
export TORCH_DISTRIBUTED_BACKEND=hccl
# ---------------------------------------------

torchrun \
  --nproc_per_node $NPROC_PER_NODE \
  --nnodes $NNODES \
  --node_rank $RANK \
  --master_addr $MASTER_ADDR \
  --master_port $MASTER_PORT \
  ./LLaMA-Factory/src/train.py \
  ./Configs/multi_full_sft.yaml
```

The startup command set in the training job is below; likewise, it has been verified on multi-node 910b1 and single-node 910b3:

```bash
pwd
ls -lha ~
source ~/.bashrc
CACHE_DIR="/cache"
MEDIA_FILES_NAME="eureka_dataset"
echo "[1/3] Copying images to the high-speed disk..."
cp "${MY_MEDIA_DIR}/${MEDIA_FILES_NAME}.tar.zst" "${CACHE_DIR}/"
echo "[2/3] Extracting locally on the high-speed disk..."
cd "${CACHE_DIR}"
tar -I 'zstd -T0 -q' -xf "${MEDIA_FILES_NAME}.tar.zst" -C ./ --checkpoint=5000 --checkpoint-action=dot
echo "[3/3] Launching training..."
conda activate llama_factory
cd /home/ma-user/modelarts/user-job-dir
bash ./llama_factory-910b3/Scripts/multi_full_sft.sh
```

The error information follows; it is produced after the model has finished loading:

```
Loading checkpoint shards: 100%|██████████| 5/5 [00:15<00:00, 2.71s/it]
Loading checkpoint shards: 100%|██████████| 5/5 [00:15<00:00, 3.03s/it]
[INFO|modeling_utils.py:5606] 2025-08-27 20:50:24,783 >> All model checkpoint weights were used when initializing Qwen2_5_VLForConditionalGeneration.
[INFO|modeling_utils.py:5614] 2025-08-27 20:50:24,783 >> All the weights of Qwen2_5_VLForConditionalGeneration were initialized from the model checkpoint at /home/ma-user/modelarts/inputs/MY_MODEL_PATH_0.
If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2_5_VLForConditionalGeneration for predictions without further training.
[INFO|configuration_utils.py:1051] 2025-08-27 20:50:24,795 >> loading configuration file /home/ma-user/modelarts/inputs/MY_MODEL_PATH_0/generation_config.json
[INFO|configuration_utils.py:1098] 2025-08-27 20:50:24,796 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "do_sample": true,
  "eos_token_id": [
    151645,
    151643
  ],
  "pad_token_id": 151643,
  "repetition_penalty": 1.05,
  "temperature": 1e-06
}
[INFO|2025-08-27 20:50:24] llamafactory.model.model_utils.checkpointing:143 >> Gradient checkpointing enabled.
[INFO|2025-08-27 20:50:24] llamafactory.model.model_utils.attention:143 >> Using torch SDPA for faster training and inference.
[INFO|2025-08-27 20:50:24] llamafactory.model.adapter:143 >> Upcasting trainable params to float32.
[INFO|2025-08-27 20:50:24] llamafactory.model.adapter:143 >> Fine-tuning method: Full
[INFO|2025-08-27 20:50:24] llamafactory.model.model_utils.visual:143 >> Set vision model not trainable: ['visual.patch_embed', 'visual.blocks'].
[INFO|2025-08-27 20:50:24] llamafactory.model.model_utils.visual:143 >> Set multi model projector not trainable: visual.merger.
[INFO|2025-08-27 20:50:24] llamafactory.model.loader:143 >> trainable params: 7,615,616,512 || all params: 8,292,166,656 || trainable%: 91.8411
Detected kernel version 4.19.90, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[INFO|trainer.py:757] 2025-08-27 20:50:24,959 >> Using auto half precision backend
[2025-08-27 20:50:26,041] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed info: version=0.16.9, git-hash=unknown, git-branch=unknown
[2025-08-27 20:50:26,042] [INFO] [config.py:735:__init__] Config mesh_device None world_size = 24
time="2025-08-27T20:50:31+08:00" level=info msg="event(name: ReceivedTermStateEvent, msg: ) is being handled, len: 0" file="controller.go:100" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Service
time="2025-08-27T20:50:31+08:00" level=info msg="received terminated-state from task worker-1, user process exit code: 1" file="handler.go:246" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Service
time="2025-08-27T20:50:31+08:00" level=info msg="decide to recover via [ascend_log_diag_when_fail] due to [the task worker-1 failed with exit code 1] on task [worker-1] " file="handler.go:460" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Service
time="2025-08-27T20:50:31+08:00" level=info msg="event listener(name: instructTasksToHalt) sync=false started" file="controller.go:143" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Service
time="2025-08-27T20:50:31+08:00" level=info msg="HaltInfo: is_leader_rescheduled: false, is_task_restart: false" file="listener.go:655" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Service
time="2025-08-27T20:50:31+08:00" level=info msg="will halt user process of all other tasks" file="listener.go:709" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Service
time="2025-08-27T20:50:31+08:00" level=info msg="done handling event(name: ReceivedTermStateEvent, msg: ), started listeners: [instructTasksToHalt startTimerForWaiting]" file="controller.go:114" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Service
time="2025-08-27T20:50:31+08:00" level=info msg="event listener(name: startTimerForWaiting) sync=false started" file="controller.go:143" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Service
time="2025-08-27T20:50:31+08:00" level=info msg="timer for waiting recover started, will count for 1800 secs" file="listener.go:513" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Service
time="2025-08-27T20:50:31+08:00" level=info msg="instruct tasks [worker-1] to halt successfully" file="listener.go:713" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Service
time="2025-08-27T20:50:31+08:00" level=info msg="user processes termination grace period set to 0"
file="utils.go:967" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Servicetime="2025-08-27T20:50:31+08:00" level=info msg="will halt user process in current task" file="listener.go:699" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Servicetime="2025-08-27T20:50:31+08:00" level=info msg="get and terminate user processes, current retry: 1." file="process.go:275" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Servicetime="2025-08-27T20:50:31+08:00" level=info msg="get all user processes successfully" file="process.go:293" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Servicetime="2025-08-27T20:50:31+08:00" level=info msg="signal SIGTERM to user process group (pgid: 413)" file="process.go:386" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Servicetime="2025-08-27T20:50:31+08:00" level=info msg="user process in current task halt successfully" file="listener.go:704" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Servicetime="2025-08-27T20:50:31+08:00" level=info msg="will halt user process of all other tasks" file="listener.go:709" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Servicetime="2025-08-27T20:50:31+08:00" level=info msg="command is exit with -1" file="process.go:196" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Servicetime="2025-08-27T20:50:31+08:00" level=info msg="NPU training process exits with exit code -1, and the environment will be retained for 0s." file="process.go:258"time="2025-08-27T20:50:31+08:00" level=info msg="the environment has been retained for 0s." 
file="process.go:260"time="2025-08-27T20:50:31+08:00" level=info msg="the task will finished after terminating the user processes" file="process.go:396" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Servicetime="2025-08-27T20:50:31+08:00" level=info msg="process terminated by instruct halt, will ignore original exit code of it, nor the detectors" file="process.go:204" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Servicetime="2025-08-27T20:50:31+08:00" level=info msg="device log export result is false" file="listener.go:321" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Servicetime="2025-08-27T20:50:31+08:00" level=info msg="start parse ascend log, current time is :2025-08-27 20:50:31" file="ascend_log_handler.go:220" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Servicetime="2025-08-27T20:50:31+08:00" level=info msg="zombie process cleaner is exiting" file="cleaner_unix.go:53" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Servicetime="2025-08-27T20:50:31+08:00" level=info msg="ascend log parse config is { /home/ma-user/modelarts/log }" file="executor.go:65" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-ServiceW0827 20:50:31.423690 499 site-packages/torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death signal, shutting down workersW0827 20:50:31.425579 499 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 591 closing signal SIGTERMW0827 20:50:31.427622 499 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 592 closing signal SIGTERMW0827 20:50:31.429263 499 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 593 closing signal SIGTERMtime="2025-08-27T20:50:31+08:00" level=info msg="instruct tasks [worker-2] to halt successfully" file="listener.go:713" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Servicetime="2025-08-27T20:50:31+08:00" level=info msg="event listener(name: instructTasksToHalt) sync=false exited" file="controller.go:140" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-ServiceW0827 20:50:31.430600 499 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 594 closing signal SIGTERMW0827 20:50:31.433036 499 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 595 closing signal SIGTERMW0827 20:50:31.435105 499 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 596 closing signal SIGTERMW0827 20:50:31.435939 499 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 597 closing signal SIGTERMW0827 20:50:31.437269 499 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 598 closing signal SIGTERMThe parse job starts. Please wait. 
Job id: [20250827205031706380_1a2db3d6-f056-48ae-adfc-b4b7a5945e2e], run log file is [ascend_faultdiag_5133.log].time="2025-08-27T20:50:32+08:00" level=info msg="event(name: ReceivedTermStateEvent, msg: ) is being handled, len: 0" file="controller.go:100" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Servicetime="2025-08-27T20:50:32+08:00" level=info msg="received terminated-state from task worker-2, user process exit code: -2" file="handler.go:246" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Servicetime="2025-08-27T20:50:32+08:00" level=info msg="the number of event listener is zero, registration is not required" file="controller.go:127" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Servicetime="2025-08-27T20:50:32+08:00" level=info msg="done handling event(name: ReceivedTermStateEvent, msg: ), started listeners: []" file="controller.go:114" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-ServiceThese job ['KNOWLEDGE_GRAPH', 'ROOT_CLUSTER'] succeeded.The parse job is complete.time="2025-08-27T20:50:32+08:00" level=info msg="parse ascend logs by ascend-fd successful, outputPath is /home/ma-user/modelarts/log/modelarts-job-00811b2e-32bc-4067-822e-de216e90441a/ascend/2025-08-27-20-48-06/fault_diag_data/worker-0" file="executor.go:70"time="2025-08-27T20:50:32+08:00" level=info msg="parse ascend log successfully, cost time is :1.573635361s" file="ascend_log_handler.go:231" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Servicetime="2025-08-27T20:50:33+08:00" level=info msg="local dir = /home/ma-user/modelarts/log/modelarts-job-00811b2e-32bc-4067-822e-de216e90441a/ascend/2025-08-27-20-48-06/fault_diag_data/" file="upload.go:270" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Service Task=parsedAscendLogUploadtime="2025-08-27T20:50:33+08:00" level=info msg="obs dir = /noteobs-250512/work_logs/work_log_default/modelarts-job-00811b2e-32bc-4067-822e-de216e90441a/ascend/2025-08-27-20-48-06/auto_diagnose" file="upload.go:273" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Service Task=parsedAscendLogUploadtime="2025-08-27T20:50:33+08:00" level=info msg="num of workers = 8" file="upload.go:278" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Service Task=parsedAscendLogUploadtime="2025-08-27T20:50:33+08:00" level=info msg="MA_OUTPUT_PRELOAD_SUFFIX: ." file="preload.go:27" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Service Task=parsedAscendLogUploadtime="2025-08-27T20:50:33+08:00" level=info msg="modelarts output channel preload to memarts switch is: false." 
file="preload.go:40" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Service Task=parsedAscendLogUploadtime="2025-08-27T20:50:33+08:00" level=info msg="upload parsed ascend log successfully" file="ascend_log_handler.go:237" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Servicetime="2025-08-27T20:50:33+08:00" level=info msg="auxiliary routine(name: TaskHangedDetect) exited" file="listener.go:1119" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Servicetime="2025-08-27T20:50:33+08:00" level=info msg="terminate all system auxiliary routines" file="common.go:25" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Servicetime="2025-08-27T20:50:33+08:00" level=info msg="event listener(name: startAndWait) sync=false exited" file="controller.go:140" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Servicetime="2025-08-27T20:50:33+08:00" level=info msg="npu-watch-routine is exiting without startup observed" file="watch_routine.go:49" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Servicetime="2025-08-27T20:50:33+08:00" level=info msg="event(name: UserProcessTerminatedEvent, msg: ) is being handled, len: 0" file="controller.go:100" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Servicetime="2025-08-27T20:50:33+08:00" level=info msg="received terminated-state from task worker-0, user process exit code: -2" file="handler.go:246" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Servicetime="2025-08-27T20:50:33+08:00" level=info msg="start diagnose ascend log" file="ascend_log_handler.go:295" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Servicetime="2025-08-27T20:50:33+08:00" level=info msg="obs download task in progress, local dir: /home/ma-user/fault_diag_data/, obs dir: /noteobs-250512/work_logs/work_log_default/modelarts-job-00811b2e-32bc-4067-822e-de216e90441a/ascend/2025-08-27-20-48-06/auto_diagnose, num workers: 8" file="download.go:161" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Service Task=ascend_log_parsing_resulttime="2025-08-27T20:50:33+08:00" level=info msg="auxiliary routine(name: TaskCoreDump) exited" file="listener.go:1119" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Servicetime="2025-08-27T20:50:33+08:00" level=info msg="event listener(name: SystemAuxiliaryRoutines) sync=false exited" file="controller.go:140" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-ServiceException in thread Thread-1:Traceback (most recent call last): File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/threading.py", line 1016, in _bootstrap_inner self.run() File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/utils/multiprocess_util.py", line 91, in run key, func, args, kwargs = self.task_q.get(timeout=TIMEOUT) File "<string>", line 2, in get File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/managers.py", line 818, in _callmethod kind, result = conn.recv() File 
"/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/connection.py", line 255, in recv buf = self._recv_bytes() File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/connection.py", line 419, in _recv_bytes buf = self._recv(4) File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/connection.py", line 388, in _recv raise EOFErrorEOFErrorException in thread Thread-1:Traceback (most recent call last): File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/threading.py", line 1016, in _bootstrap_innerException in thread Thread-1:Traceback (most recent call last): File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/threading.py", line 1016, in _bootstrap_inner self.run() File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/utils/multiprocess_util.py", line 91, in run key, func, args, kwargs = self.task_q.get(timeout=TIMEOUT) File "<string>", line 2, in get self.run() File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/utils/multiprocess_util.py", line 91, in run key, func, args, kwargs = self.task_q.get(timeout=TIMEOUT) File "<string>", line 2, in get File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/managers.py", line 818, in _callmethod File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/managers.py", line 818, in _callmethod kind, result = conn.recv() File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/connection.py", line 255, in recv buf = self._recv_bytes() File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/connection.py", line 419, in _recv_bytes kind, result = conn.recv() File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/connection.py", line 255, in recv buf = self._recv(4) File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/connection.py", line 388, in _recv buf = self._recv_bytes() File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/connection.py", line 419, in _recv_bytes raise EOFErrorEOFError buf = self._recv(4) File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/connection.py", line 388, in _recv raise EOFErrorEOFErrorException in thread Thread-1:Traceback (most recent call last): File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/threading.py", line 1016, in _bootstrap_inner self.run() File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/utils/multiprocess_util.py", line 91, in run key, func, args, kwargs = self.task_q.get(timeout=TIMEOUT) File "<string>", line 2, in getException in thread Thread-1:Traceback (most recent call last): File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/threading.py", line 1016, in _bootstrap_inner File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/managers.py", line 818, in _callmethod kind, result = conn.recv() File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/connection.py", line 255, in recv self.run() File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/utils/multiprocess_util.py", line 91, in run buf = self._recv_bytes() File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/connection.py", line 419, in _recv_bytes key, func, args, kwargs = 
self.task_q.get(timeout=TIMEOUT) File "<string>", line 2, in get buf = self._recv(4) File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/connection.py", line 388, in _recv raise EOFErrorEOFError File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/managers.py", line 818, in _callmethod kind, result = conn.recv() File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/connection.py", line 255, in recv buf = self._recv_bytes() File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/connection.py", line 419, in _recv_bytes buf = self._recv(4) File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/connection.py", line 388, in _recv raise EOFErrorEOFErrortime="2025-08-27T20:50:33+08:00" level=info msg="obs download complete, successful: 36, failed: 0, skipped: 0, total downloaded bytes: 127713, total elapsed time: 0s" file="download.go:198" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Service Task=ascend_log_parsing_resultException in thread Thread-1:Traceback (most recent call last): File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/threading.py", line 1016, in _bootstrap_innerException in thread self.run() File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/utils/multiprocess_util.py", line 91, in runThread-1:Traceback (most recent call last): File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/threading.py", line 1016, in _bootstrap_inner key, func, args, kwargs = self.task_q.get(timeout=TIMEOUT) File "<string>", line 2, in get File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/managers.py", line 818, in _callmethod self.run() File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/utils/multiprocess_util.py", line 91, in run key, func, args, kwargs = self.task_q.get(timeout=TIMEOUT) kind, result = conn.recv() File "<string>", line 2, in get File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/connection.py", line 255, in recv buf = self._recv_bytes() File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/connection.py", line 419, in _recv_bytes buf = self._recv(4) File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/connection.py", line 388, in _recv File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/managers.py", line 818, in _callmethod raise EOFErrorEOFError kind, result = conn.recv() File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/connection.py", line 255, in recv buf = self._recv_bytes() File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/connection.py", line 419, in _recv_bytes buf = self._recv(4) File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/connection.py", line 388, in _recv raise EOFErrorEOFErrorException in thread Thread-1:Traceback (most recent call last): File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/threading.py", line 1016, in _bootstrap_inner self.run() File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/utils/multiprocess_util.py", line 91, in run key, func, args, kwargs = self.task_q.get(timeout=TIMEOUT) File "<string>", line 2, in get File 
"/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/managers.py", line 818, in _callmethod kind, result = conn.recv() File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/connection.py", line 255, in recv buf = self._recv_bytes() File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/connection.py", line 419, in _recv_bytes buf = self._recv(4) File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/connection.py", line 388, in _recv raise EOFErrorEOFErrorThe diag job starts. Please wait. Job id: [20250827205033703256_cc8c531c-ad86-41ca-93a5-de5afa691399], run log file is [ascend_faultdiag_5543.log].+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+| Ascend Fault-Diag Report |+--------------+------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+| 版本信息 | 类型 | 版本 |+--------------+------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+| | Fault-Diag | 7.1.RC1 |+--------------+------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+| 根因节点分析 | 类型 | 描述 |+--------------+------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+| | 说明 | 根因节点分析检测出了多个的疑似故障根因节点,将优先排查这几个节点 |+--------------+------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+| | 根因节点 | ['worker-2 device-0', 'worker-1 device-0'] || | 现象描述 | 所有节点的Plog都没有记录超时类错误日志。日志中有报错的节点为疑似根因节点,请排查。 || | 首错节点 | worker-2 device-0: 2025-08-27-20:49:34.625503 || | 尾错节点 | worker-1 device-0: 2025-08-27-20:49:34.633599 |+--------------+------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+| 故障事件分析 | 类型 | 描述 |+--------------+------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+| 疑似根因故障 | 状态码 | AISW_CANN_HCCL_029 || | 故障分类 | 类别:Software/Hardware 组件:CANN 模块:HCCL || | 故障设备 | ['worker-1 device-0', 'worker-2 device-0'] || | 故障名称 | RDMA通信重传超次 || | 故障描述 | HCCL算子执行失败:在通信算子执行时,会使用RDMA进行跨节点通信。可能有以下原因: || | | a. 对端HCCP进程退出,无法回复响应; || | | b. NPU发送丢包; || | | c. NPU接收丢包; || | | d. NPU接收错包(乱序、FEC错误、截断等); || | | e. NPU网卡配置问题(MTU、PFC、QoS、Headroom、TC buf等配置和交换机不匹配); || | | f. 光模块闪断或断路; || | | g. 光纤闪断或断路。 || | 建议方案 | 1. 检查本端和远端是都出现linkdown 或者闪断,如出现则排查相关光模块和光纤是否故障; || | | 2. 检查本端和远端的NPU网口报文统计,是否有大量的PFC统计,如果存在则需要检查网络是否出现拥塞; || | | 3. 
检查NPU侧的MTU、PFC、QoS、Headroom、TC buf等配置和交换机侧是否匹配; || | 关键日志 | [ERROR] HCCL(587,python3.1):2025-08-27-20:49:35.673.591 [heartbeat.cc:855][4083][Heartbeat]cqe error status[21], time:[2025-08-27 20:49:34.633612],localIP[29.65.150.252 || | | ], remoteIP:[29.65.18.59] |+--------------+------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+The diag job is complete.time="2025-08-27T20:50:33+08:00" level=info msg="diagnose successfully, cost time is: 606.637196ms" file="ascend_log_handler.go:302" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Servicetime="2025-08-27T20:50:33+08:00" level=info msg="diag report file /home/ma-user/modelarts/log/modelarts-job-00811b2e-32bc-4067-822e-de216e90441a/ascend/fault_diag_result/diag_report.json not exist" file="handler.go:656"time="2025-08-27T20:50:33+08:00" level=warning msg="no root cause nodes found in ascend diag report" file="handler.go:1268" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Servicetime="2025-08-27T20:50:33+08:00" level=info msg="done handling event(name: UserProcessTerminatedEvent, msg: ), started listeners: [instructTasksToExitWithProcessOriginalExitCode]" file="controller.go:114" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Servicetime="2025-08-27T20:50:33+08:00" level=info msg="event listener(name: instructTasksToExitWithProcessOriginalExitCode) sync=false started" file="controller.go:143" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Servicetime="2025-08-27T20:50:33+08:00" level=info msg="instruct task(worker-1) exit with 1" file="listener.go:1157" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Servicetime="2025-08-27T20:50:33+08:00" level=info msg="event listener(name: instructTasksToExitWithProcessOriginalExitCode) sync=false exited" file="controller.go:140" Command=bootstrap/run Component=ma-training-toolkit Container=modelarts-training Platform=ModelArts-Service/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 30 leaked semaphore objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d '/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 30 leaked semaphore objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d '/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 30 leaked semaphore objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d '/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 30 leaked semaphore objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d '/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 30 leaked semaphore objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d 
'/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 30 leaked semaphore objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d '/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 30 leaked semaphore objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d '/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 30 leaked semaphore objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d 'Traceback (most recent call last): File "/home/ma-user/anaconda3/envs/llama_factory/bin/torchrun", line 8, in <module> sys.exit(main()) File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper return f(*args, **kwargs) File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main run(args) File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run elastic_launch( File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 260, in launch_agent result = agent.run() File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper result = f(*args, **kwargs) File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 696, in run result = self._invoke_run(role) File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 855, in _invoke_run time.sleep(monitor_interval) File "/home/ma-user/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 84, in _terminate_process_handler raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)torch.distributed.elastic.multiprocessing.api.SignalException: Process 499 got signal: 15[ERROR] 2025-08-27-20:50:40 (PID:499, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception```
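The diagnosis above points at RDMA retransmission on the inter-node NPU network rather than at the training code, and its suggestions (check for linkdown/flapping, PFC counts, and MTU/PFC/QoS settings) can be triaged from each suspected node. A hedged sketch using the Ascend hccn_tool follows; the flag names reflect common hccn_tool usage and should be confirmed against your CANN version with `hccn_tool -h`:

```bash
# Hedged triage sketch: run on the suspected nodes (here worker-1/worker-2, device 0
# was flagged) to inspect each NPU NIC.
for i in 0 1 2 3 4 5 6 7; do
  hccn_tool -i $i -link -g        # link status (look for linkdown/flapping)
  hccn_tool -i $i -net_health -g  # overall NIC health
  hccn_tool -i $i -stat -g        # packet statistics (look for PFC/drop counters)
done
```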
-
RK3588 AI Application Development (ResNet50V2 Keypoint Detection)

I. Model Training and Conversion

ResNet50V2 is an improved deep convolutional neural network derived from the ResNet architecture. It adopts pre-activation (moving BN and ReLU before the convolution) and identity mappings, which improve information propagation and training behavior. As a 50-layer network, ResNet50V2 is widely used for image classification, object detection, and similar tasks; it supports transfer learning, adapts quickly to new datasets, and offers good generalization and high accuracy.

The training and conversion tutorial, including training data, training code, and the model conversion script, is available in AI Gallery. Train in a ModelArts notebook, then convert the model to the target platform's format: the ONNX format can be used on Windows devices, while RK-series devices require conversion to the RKNN format (a hedged conversion sketch appears at the end of this article).

II. Application Development

1. Build the Gradio UI

```python
import cv2
import json
import base64
import requests
import numpy as np
import gradio as gr

def test_image(image_path):
    try:
        # Read the image and send it to the inference service as base64 JSON
        image_bgr = cv2.imread(image_path)
        image_string = cv2.imencode('.jpg', image_bgr)[1].tobytes()
        image_base64 = base64.b64encode(image_string).decode('utf-8')
        params = {"image_base64": image_base64}
        response = requests.post(f'http://{ip}:{port}{url}', data=json.dumps(params),
                                 headers={"Content-Type": "application/json"})
        if response.status_code == 200:
            # Decode the annotated image returned by the service
            image_base64 = response.json().get("image_base64")
            image_binary = base64.b64decode(image_base64)
            image_array = np.frombuffer(image_binary, dtype=np.uint8)
            image_rgb = cv2.imdecode(image_array, cv2.IMREAD_COLOR)
        else:
            image_rgb = None
    except Exception as e:
        return None
    else:
        return image_rgb

if __name__ == "__main__":
    port = 8000
    ip = "127.0.0.1"
    url = "/v1/ResNet50V2"
    demo = gr.Interface(fn=test_image, inputs=gr.Image(type="filepath"),
                        outputs=["image"], title="ResNet50V2 cat-face keypoint detection")
    demo.launch(share=False, server_port=3000)
```

```
/home/orangepi/miniconda3/envs/python-3.10.10/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
* Running on local URL: http://127.0.0.1:3000
* To create a public link, set `share=True` in `launch()`.
```

2. Write the inference code

```python
import cv2
import numpy as np
from rknnlite.api import RKNNLite

class ResNet50V2:
    def __init__(self, model_path):
        self.rknn_lite = RKNNLite()
        self.rknn_lite.load_rknn(model_path)
        self.rknn_lite.init_runtime(core_mask=RKNNLite.NPU_CORE_0_1_2)

    def preprocess(self, image):
        image = image[:, :, ::-1]  # BGR -> RGB
        image = cv2.resize(image, (224, 224))
        return np.expand_dims(image, axis=0)

    def rknn_infer(self, data):
        outputs = self.rknn_lite.inference(inputs=[data])
        return outputs[0]

    def post_process(self, pred):
        # Reshape the flat output into (num_keypoints, 2) normalized coordinates
        feat = pred.squeeze().reshape(-1, 2)
        return feat

    def predict(self, image):
        # Image preprocessing
        data = self.preprocess(image)
        # Model inference
        pred = self.rknn_infer(data)
        # Model postprocessing
        keypoints = self.post_process(pred)
        # Draw the keypoint detection results
        h, w, _ = image.shape
        for x, y in keypoints:
            cv2.circle(image, (int(x * w), int(y * h)), 5, (0, 255, 0), -1)
        return image[..., ::-1]

    def release(self):
        self.rknn_lite.release()
```

3. Batch image prediction

```python
import os
import cv2
import numpy as np
import matplotlib.pyplot as plt
from rknnlite.api import RKNNLite

model = ResNet50V2('model/ResNet50V2.rknn')
for image in os.listdir("image"):
    image = cv2.imread(os.path.join("image", image))
    image = model.predict(image)
    plt.imshow(image)
    plt.axis('off')
    plt.show()
model.release()
```

4. Create the Flask service

```python
import cv2
import base64
import numpy as np
from rknnlite.api import RKNNLite
from flask import Flask, request, jsonify
from flask_cors import CORS

app = Flask(__name__)
CORS(app)

@app.route('/v1/ResNet50V2', methods=['POST'])
def inference():
    # Decode the base64 image, run prediction, and return the annotated image
    data = request.get_json()
    image_base64 = data.get("image_base64")
    image_binary = base64.b64decode(image_base64)
    image_array = np.frombuffer(image_binary, dtype=np.uint8)
    image_bgr = cv2.imdecode(image_array, cv2.IMREAD_COLOR)
    image_rgb = model.predict(image_bgr)
    image_string = cv2.imencode('.jpg', image_rgb)[1].tobytes()
    image_base64 = base64.b64encode(image_string).decode('utf-8')
    return jsonify({"image_base64": image_base64}), 200

if __name__ == '__main__':
    model = ResNet50V2('model/ResNet50V2.rknn')
    app.run(host='0.0.0.0', port=8000)
    model.release()
```

```
W rknn-toolkit-lite2 version: 2.3.2
 * Serving Flask app '__main__'
 * Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:8000
 * Running on http://192.168.3.50:8000
Press CTRL+C to quit
127.0.0.1 - - [02/May/2025 02:13:40] "POST /v1/ResNet50V2 HTTP/1.1" 200 -
127.0.0.1 - - [02/May/2025 02:13:46] "POST /v1/ResNet50V2 HTTP/1.1" 200 -
```

5. Upload an image for prediction

III. Summary

This chapter covered the full development flow of a ResNet50V2 keypoint-detection application on RK3588: model training and conversion, Gradio UI design, inference code, batch prediction, and Flask service deployment, taking the project from model to end-to-end application.

---- Reposted from: https://bbs.huaweicloud.com/blogs/451999
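As mentioned in the training-and-conversion section above, the actual conversion script ships with the AI Gallery tutorial. For orientation, here is a hedged sketch of what an ONNX-to-RKNN conversion typically looks like with rknn-toolkit2 (run on a development PC, not on the board); the file names and normalization values are assumptions:

```python
# Hedged ONNX -> RKNN conversion sketch with rknn-toolkit2.
from rknn.api import RKNN

rknn = RKNN()
# Preprocessing is baked into the RKNN model; match your training normalization
rknn.config(mean_values=[[0, 0, 0]], std_values=[[255, 255, 255]],
            target_platform="rk3588")
rknn.load_onnx(model="ResNet50V2.onnx")       # assumed exported model name
rknn.build(do_quantization=False)             # or True with a calibration dataset
rknn.export_rknn("ResNet50V2.rknn")
rknn.release()
```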
-
RK3588 AI Application Development (FCN Semantic Segmentation)

I. Model Training and Conversion

FCN (Fully Convolutional Networks) is a deep learning architecture for semantic segmentation. It introduced the skip architecture, which fuses shallow and deep feature maps to retain more detail and improve segmentation accuracy. FCN also aggregates multi-scale context, capturing features at different levels and strengthening recognition of targets of different sizes. FCN's success propelled the field of semantic segmentation and became the foundation for many later state-of-the-art models.

The training and conversion tutorial, including training data, training code, and the model conversion script, is available in AI Gallery. Train in a ModelArts notebook, then convert the model to the target platform's format: ONNX for Windows devices, RKNN for RK-series devices.

II. Application Development

1. Build the Gradio UI

```python
import cv2
import json
import base64
import requests
import numpy as np
import gradio as gr

def test_image(image_path):
    try:
        # Read the image and send it to the inference service as base64 JSON
        image_bgr = cv2.imread(image_path)
        image_string = cv2.imencode('.jpg', image_bgr)[1].tobytes()
        image_base64 = base64.b64encode(image_string).decode('utf-8')
        params = {"image_base64": image_base64}
        response = requests.post(f'http://{ip}:{port}{url}', data=json.dumps(params),
                                 headers={"Content-Type": "application/json"})
        if response.status_code == 200:
            # Decode the segmented image returned by the service
            image_base64 = response.json().get("image_base64")
            image_binary = base64.b64decode(image_base64)
            image_array = np.frombuffer(image_binary, dtype=np.uint8)
            image_rgb = cv2.imdecode(image_array, cv2.IMREAD_COLOR)
        else:
            image_rgb = None
    except Exception as e:
        return None
    else:
        return image_rgb

if __name__ == "__main__":
    port = 8000
    ip = "127.0.0.1"
    url = "/v1/FCN"
    demo = gr.Interface(fn=test_image, inputs=gr.Image(type="filepath"),
                        outputs=["image"], title="FCN fruit-and-vegetable pest and disease segmentation")
    demo.launch(share=False, server_port=3000)
```

```
/home/orangepi/miniconda3/envs/python-3.10.10/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
* Running on local URL: http://127.0.0.1:3000
* To create a public link, set `share=True` in `launch()`.
```

2. Write the inference code

```python
import cv2
import numpy as np
from rknnlite.api import RKNNLite

class FCN:
    def __init__(self, model_path):
        self.num_classes = 117
        self.rknn_lite = RKNNLite()
        self.rknn_lite.load_rknn(model_path)
        self.rknn_lite.init_runtime(core_mask=RKNNLite.NPU_CORE_0_1_2)
        # One random display color per class
        self.color_list = np.random.randint(0, 255, size=(self.num_classes, 3),
                                            dtype=np.uint8).tolist()

    def preprocess(self, image):
        image = image[:, :, ::-1]  # BGR -> RGB
        image = cv2.resize(image, (224, 224))
        return np.expand_dims(image, axis=0)

    def rknn_infer(self, data):
        outputs = self.rknn_lite.inference(inputs=[data])
        return outputs[0]

    def post_process(self, pred):
        # Per-pixel argmax over class scores yields the class-index mask
        feat = pred.squeeze()
        return np.argmax(feat, axis=-1).astype(np.uint8)

    def predict(self, image):
        # Image preprocessing
        data = self.preprocess(image)
        # Model inference
        pred = self.rknn_infer(data)
        # Model postprocessing
        feat = self.post_process(pred)
        # Render the segmentation result as a colored overlay
        canv = np.zeros_like(image)
        mask = cv2.resize(feat, image.shape[:2][::-1], interpolation=cv2.INTER_NEAREST)
        for i in range(1, self.num_classes):
            canv[mask == i] = self.color_list[i]
        return cv2.addWeighted(image[..., ::-1], 0.5, canv, 0.5, 0)

    def release(self):
        self.rknn_lite.release()
```

3. Batch image prediction

```python
import os
import cv2
import numpy as np
import matplotlib.pyplot as plt
from rknnlite.api import RKNNLite

model = FCN('model/FCN.rknn')
for image in os.listdir("image"):
    image = cv2.imread(os.path.join("image", image))
    image = model.predict(image)
    plt.imshow(image)
    plt.axis('off')
    plt.show()
model.release()
```

4. Create the Flask service

```python
import cv2
import base64
import numpy as np
from rknnlite.api import RKNNLite
from flask import Flask, request, jsonify
from flask_cors import CORS

app = Flask(__name__)
CORS(app)

@app.route('/v1/FCN', methods=['POST'])
def inference():
    # Decode the base64 image, run segmentation, and return the overlaid result
    data = request.get_json()
    image_base64 = data.get("image_base64")
    image_binary = base64.b64decode(image_base64)
    image_array = np.frombuffer(image_binary, dtype=np.uint8)
    image_bgr = cv2.imdecode(image_array, cv2.IMREAD_COLOR)
    image_rgb = model.predict(image_bgr)
    image_string = cv2.imencode('.jpg', image_rgb)[1].tobytes()
    image_base64 = base64.b64encode(image_string).decode('utf-8')
    return jsonify({"image_base64": image_base64}), 200

if __name__ == '__main__':
    model = FCN('model/FCN.rknn')
    app.run(host='0.0.0.0', port=8000)
    model.release()
```

```
W rknn-toolkit-lite2 version: 2.3.2
I RKNN: [00:06:51.738] RKNN Runtime Information: librknnrt version: 1.4.0 (a10f100eb@2022-09-09T09:07:14)
I RKNN: [00:06:51.738] RKNN Driver Information: version: 0.9.6
I RKNN: [00:06:51.739] RKNN Model Information: version: 1, toolkit version: 1.4.0-22dcfef4(compiler version: 1.4.0 (3b4520e4f@2022-09-05T12:50:09)), target: RKNPU v2, target platform: rk3588, framework name: TFLite, framework layout: NHWC
 * Serving Flask app '__main__'
 * Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:8000
 * Running on http://192.168.3.50:8000
Press CTRL+C to quit
127.0.0.1 - - [02/May/2025 00:07:17] "POST /v1/FCN HTTP/1.1" 200 -
127.0.0.1 - - [02/May/2025 00:07:24] "POST /v1/FCN HTTP/1.1" 200 -
127.0.0.1 - - [02/May/2025 00:07:31] "POST /v1/FCN HTTP/1.1" 200 -
127.0.0.1 - - [02/May/2025 00:07:39] "POST /v1/FCN HTTP/1.1" 200 -
```

5. Upload an image for prediction

III. Summary

This chapter covered the full AI application development flow for semantic segmentation with FCN on the RK3588 platform: model training and conversion, Gradio UI development, inference code, batch prediction, and Flask service deployment. With this flow, developers can run efficient image-segmentation tasks and serve predictions locally or in the cloud.

---- Reposted from: https://bbs.huaweicloud.com/blogs/451998
-
RK3588 AI 应用开发 (YOLOX-目标检测)

一、模型训练和转换

YOLOX是YOLO系列的优化版本,引入了解耦头、强数据增强、无锚点(Anchor-Free)以及先进的标签分配策略等目标检测领域的优秀进展,拥有较好的精度表现,同时对工程部署友好。

模型的训练与转换教程已经开放在AI Gallery中,其中包含训练数据、训练代码、模型转换脚本。在ModelArts的Notebook环境中训练后,再转换成对应平台的模型格式:onnx格式可以用在Windows设备上,RK系列设备上需要转换为rknn格式。

二、应用开发

1. 开发 Gradio 界面

import cv2
import json
import base64
import requests
import numpy as np
import gradio as gr

def test_image(image_path):
    try:
        # 读取图片并编码为 base64,发送给推理服务
        image_bgr = cv2.imread(image_path)
        image_string = cv2.imencode('.jpg', image_bgr)[1].tobytes()
        image_base64 = base64.b64encode(image_string).decode('utf-8')
        params = {"image_base64": image_base64}
        response = requests.post(f'http://{ip}:{port}{url}',
                                 data=json.dumps(params),
                                 headers={"Content-Type": "application/json"})
        if response.status_code == 200:
            # 解码服务返回的检测结果图
            image_base64 = response.json().get("image_base64")
            image_binary = base64.b64decode(image_base64)
            image_array = np.frombuffer(image_binary, dtype=np.uint8)
            image_rgb = cv2.imdecode(image_array, cv2.IMREAD_COLOR)
        else:
            image_rgb = None
    except Exception:
        return None
    else:
        return image_rgb

if __name__ == "__main__":
    port = 8000
    ip = "127.0.0.1"
    url = "/v1/fish_det"
    demo = gr.Interface(fn=test_image,
                        inputs=gr.Image(type="filepath"),
                        outputs=["image"],
                        title="YOLOX 深海鱼类检测")
    demo.launch(share=False, server_port=3000)

运行输出:
* Running on local URL: http://127.0.0.3000
* To create a public link, set `share=True` in `launch()`.

2. 编写推理代码

先把数据集的类别定义覆盖为单一的 fish 类:

%%writefile YOLOX/yolox/data/datasets/voc_classes.py
#!/usr/bin/env python3
# -*- coding:utf-8 -*-
# Copyright (c) Megvii, Inc. and its affiliates.

# VOC_CLASSES = ( '__background__',  # always index 0
VOC_CLASSES = (
    "fish",
)

运行输出:
Overwriting YOLOX/yolox/data/datasets/voc_classes.py

import sys
sys.path.append("YOLOX")

from yolox.utils import demo_postprocess, multiclass_nms, vis
from yolox.data.data_augment import preproc as preprocess
from yolox.data.datasets.voc_classes import VOC_CLASSES

import cv2
import numpy as np
import ipywidgets as widgets
from rknnlite.api import RKNNLite
from IPython.display import display

class YOLOX:
    def __init__(self, model_path):
        self.ratio = None
        self.rknn_lite = RKNNLite()
        self.rknn_lite.load_rknn(model_path)
        self.rknn_lite.init_runtime(core_mask=RKNNLite.NPU_CORE_0_1_2)

    def preprocess(self, image):
        # 等比缩放并填充到 320x320,记录缩放比例用于还原坐标
        start_img, self.ratio = preprocess(image, (320, 320), swap=(0, 1, 2))
        return np.expand_dims(start_img, axis=0)

    def rknn_infer(self, data):
        outputs = self.rknn_lite.inference(inputs=[data])
        return outputs[0]

    def post_process(self, pred):
        # 将网格输出解码为 (cx, cy, w, h) 预测框
        predictions = demo_postprocess(pred.squeeze(), (320, 320))
        boxes = predictions[:, :4]
        scores = predictions[:, 4:5] * predictions[:, 5:]
        # (cx, cy, w, h) 转 (x1, y1, x2, y2)
        boxes_xyxy = np.ones_like(boxes)
        boxes_xyxy[:, 0] = boxes[:, 0] - boxes[:, 2] / 2.
        boxes_xyxy[:, 1] = boxes[:, 1] - boxes[:, 3] / 2.
        boxes_xyxy[:, 2] = boxes[:, 0] + boxes[:, 2] / 2.
        boxes_xyxy[:, 3] = boxes[:, 1] + boxes[:, 3] / 2.
        # 按预处理时的缩放比例还原到原图坐标,再做 NMS
        boxes_xyxy /= self.ratio
        dets = multiclass_nms(boxes_xyxy, scores, nms_thr=0.45, score_thr=0.25)
        return dets

    def predict(self, image):
        # 图像预处理
        data = self.preprocess(image)
        # 模型推理
        pred = self.rknn_infer(data)
        # 模型后处理
        dets = self.post_process(pred)
        # 绘制目标检测结果
        if dets is not None:
            final_boxes = dets[:, :4]
            final_scores, final_cls_inds = dets[:, 4], dets[:, 5]
            image = vis(image, final_boxes, final_scores, final_cls_inds,
                        conf=0.25, class_names=VOC_CLASSES)
        return image[..., ::-1]

    def img2bytes(self, image):
        """将图片转换为字节码"""
        return bytes(cv2.imencode('.jpg', image)[1])

    def infer_video(self, video_path):
        """视频推理(在 Notebook 中实时显示;落盘保存的示例见本章末尾)"""
        image_widget = widgets.Image(format='jpeg', width=800, height=600)
        display(image_widget)
        cap = cv2.VideoCapture(video_path)
        while True:
            ret, img_frame = cap.read()
            if not ret:
                break
            image_pred = self.predict(img_frame)
            image_widget.value = self.img2bytes(image_pred)
        cap.release()

    def release(self):
        """释放资源"""
        self.rknn_lite.release()

3. 图像预测

4. 视频推理

5. 创建 Flask 服务

import cv2
import base64
import numpy as np
from rknnlite.api import RKNNLite
from flask import Flask, request, jsonify
from flask_cors import CORS

app = Flask(__name__)
CORS(app)

@app.route('/v1/fish_det', methods=['POST'])
def inference():
    # 解析请求中的 base64 图像
    data = request.get_json()
    image_base64 = data.get("image_base64")
    image_binary = base64.b64decode(image_base64)
    image_array = np.frombuffer(image_binary, dtype=np.uint8)
    image_bgr = cv2.imdecode(image_array, cv2.IMREAD_COLOR)
    # 推理并把检测结果编码回 base64 返回
    image_rgb = model.predict(image_bgr)
    image_string = cv2.imencode('.jpg', image_rgb)[1].tobytes()
    image_base64 = base64.b64encode(image_string).decode('utf-8')
    return jsonify({"image_base64": image_base64}), 200

if __name__ == '__main__':
    model = YOLOX('model/yolox_fish.rknn')
    app.run(host='0.0.0.0', port=8000)
    model.release()

6. 上传图片预测

三、小结

本章介绍了基于RK3588平台使用YOLOX进行目标检测的全流程,包括模型训练与转换、Gradio界面开发、推理代码编写、图像和视频预测实现,以及Flask服务部署,整体实现了高效的鱼类检测应用,适用于嵌入式设备部署与实际场景应用。

----转自博客:https://bbs.huaweicloud.com/blogs/452001
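补充:上面的 infer_video 只在 Notebook 中实时显示结果。若需要把检测结果保存为视频文件,可参考下面的示意(输出路径与 mp4v 编码器为示例假设,模型对象沿用上文定义的 YOLOX 类):

import cv2
import numpy as np

def infer_video_to_file(model, video_path, out_path="result.mp4"):
    """逐帧推理并写入视频文件(输出路径与编码器为示例假设)"""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25
    size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
            int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*'mp4v'), fps, size)
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        pred = model.predict(frame)                           # predict 返回 RGB 图像
        writer.write(np.ascontiguousarray(pred[..., ::-1]))   # 写入前转回 BGR
    cap.release()
    writer.release()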
-
RK3588 AI 应用开发 (InceptionV3-图像分类)

一、模型训练与转换

Inception V3是GoogLeNet的改进版本,采用Inception Module和全局平均池化层。V3最重要的改进之一是分解(Factorization):将7x7卷积分解成两个一维卷积(1x7、7x1),3x3卷积同理分解为(1x3、3x1)。这样做既可以加速计算(节省的计算量可以用来加深网络),又把1个卷积层拆成2个,使网络深度进一步增加,也增强了网络的非线性表达能力。

模型的训练与转换教程已经开放在AI Gallery中,其中包含训练数据、训练代码、模型转换脚本。在ModelArts的Notebook环境中训练后,再转换成对应平台的模型格式:onnx格式可以用在Windows设备上,RK系列设备上需要转换为rknn格式。

二、应用开发

1. 开发 Gradio 界面

import cv2
import json
import base64
import requests
import numpy as np
import gradio as gr

def test_image(image_path):
    try:
        # 读取图片并编码为 base64,发送给推理服务
        image_bgr = cv2.imread(image_path)
        image_string = cv2.imencode('.jpg', image_bgr)[1].tobytes()
        image_base64 = base64.b64encode(image_string).decode('utf-8')
        params = {"image_base64": image_base64}
        response = requests.post(f'http://{ip}:{port}{url}',
                                 data=json.dumps(params),
                                 headers={"Content-Type": "application/json"})
        if response.status_code == 200:
            # 解码服务返回的识别结果图
            image_base64 = response.json().get("image_base64")
            image_binary = base64.b64decode(image_base64)
            image_array = np.frombuffer(image_binary, dtype=np.uint8)
            image_rgb = cv2.imdecode(image_array, cv2.IMREAD_COLOR)
        else:
            image_rgb = None
    except Exception:
        return None
    else:
        return image_rgb

if __name__ == "__main__":
    port = 8000
    ip = "127.0.0.1"
    url = "/v1/InceptionV3"
    demo = gr.Interface(fn=test_image,
                        inputs=gr.Image(type="filepath"),
                        outputs=["image"],
                        title="InceptionV3 动物分类")
    demo.launch(share=False, server_port=3000)

运行输出:
* Running on local URL: http://127.0.0.1:3000
* To create a public link, set `share=True` in `launch()`.
2. 编写推理代码

class InceptionV3:
    def __init__(self, model_path):
        self.rknn_lite = RKNNLite()
        self.rknn_lite.load_rknn(model_path)
        self.rknn_lite.init_runtime(core_mask=RKNNLite.NPU_CORE_0_1_2)
        # 90 类动物标签
        self.label = ['antelope', 'badger', 'bat', 'bear', 'bee', 'beetle', 'bison',
                      'boar', 'butterfly', 'cat', 'caterpillar', 'chimpanzee',
                      'cockroach', 'cow', 'coyote', 'crab', 'crow', 'deer', 'dog',
                      'dolphin', 'donkey', 'dragonfly', 'duck', 'eagle', 'elephant',
                      'flamingo', 'fly', 'fox', 'goat', 'goldfish', 'goose', 'gorilla',
                      'grasshopper', 'hamster', 'hare', 'hedgehog', 'hippopotamus',
                      'hornbill', 'horse', 'hummingbird', 'hyena', 'jellyfish',
                      'kangaroo', 'koala', 'ladybugs', 'leopard', 'lion', 'lizard',
                      'lobster', 'mosquito', 'moth', 'mouse', 'octopus', 'okapi',
                      'orangutan', 'otter', 'owl', 'ox', 'oyster', 'panda', 'parrot',
                      'pelecaniformes', 'penguin', 'pig', 'pigeon', 'porcupine',
                      'possum', 'raccoon', 'rat', 'reindeer', 'rhinoceros', 'sandpiper',
                      'seahorse', 'seal', 'shark', 'sheep', 'snake', 'sparrow', 'squid',
                      'squirrel', 'starfish', 'swan', 'tiger', 'turkey', 'turtle',
                      'whale', 'wolf', 'wombat', 'woodpecker', 'zebra']

    def preprocess(self, image):
        # BGR 转 RGB,缩放到模型输入尺寸 224x224
        image = image[:, :, ::-1]
        image = cv2.resize(image, (224, 224))
        return np.expand_dims(image, axis=0)

    def rknn_infer(self, data):
        outputs = self.rknn_lite.inference(inputs=[data])
        return outputs[0]

    def post_process(self, pred):
        # 取得分最高的类别及其置信度
        cls_id = np.argmax(pred, axis=-1)
        score = pred[0][cls_id[0]].item()
        return self.label[cls_id[0]], round(score * 100, 2)

    def predict(self, image):
        # 图像预处理
        data = self.preprocess(image)
        # 模型推理
        pred = self.rknn_infer(data)
        # 模型后处理
        label, score = self.post_process(pred)
        # 绘制识别结果
        print(f'{label}:{score}%')
        image = cv2.putText(image, f'{label}:{score}%', (0, 100),
                            cv2.FONT_HERSHEY_TRIPLEX, 4, (0, 255, 0), 8)
        return image[..., ::-1]

    def release(self):
        self.rknn_lite.release()

3. 图片批量预测

import os
import cv2
import numpy as np
import matplotlib.pyplot as plt
from rknnlite.api import RKNNLite

model = InceptionV3('model/InceptionV3.rknn')
for name in os.listdir("image"):
    image = cv2.imread(os.path.join("image", name))
    image = model.predict(image)
    plt.imshow(image)
    plt.axis('off')
    plt.show()
model.release()

4. 创建 Flask 服务

import cv2
import base64
import numpy as np
from rknnlite.api import RKNNLite
from flask import Flask, request, jsonify
from flask_cors import CORS

app = Flask(__name__)
CORS(app)

@app.route('/v1/InceptionV3', methods=['POST'])
def inference():
    # 解析请求中的 base64 图像
    data = request.get_json()
    image_base64 = data.get("image_base64")
    image_binary = base64.b64decode(image_base64)
    image_array = np.frombuffer(image_binary, dtype=np.uint8)
    image_bgr = cv2.imdecode(image_array, cv2.IMREAD_COLOR)
    # 推理并把识别结果编码回 base64 返回
    image_rgb = model.predict(image_bgr)
    image_string = cv2.imencode('.jpg', image_rgb)[1].tobytes()
    image_base64 = base64.b64encode(image_string).decode('utf-8')
    return jsonify({"image_base64": image_base64}), 200

if __name__ == '__main__':
    model = InceptionV3('model/InceptionV3.rknn')
    app.run(host='0.0.0.0', port=8000)
    model.release()

运行输出(节选):
W rknn-toolkit-lite2 version: 2.3.2
* Serving Flask app '__main__'
* Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
* Running on all addresses (0.0.0.0)
* Running on http://127.0.0.1:8000
* Running on http://192.168.3.50:8000
Press CTRL+C to quit
127.0.0.1 - - [01/May/2025 20:37:00] "POST /v1/InceptionV3 HTTP/1.1" 200 -
pig:99.95%
127.0.0.1 - - [01/May/2025 20:37:09] "POST /v1/InceptionV3 HTTP/1.1" 200 -
swan:99.95%
127.0.0.1 - - [01/May/2025 20:37:20] "POST /v1/InceptionV3 HTTP/1.1" 200 -
cat:97.02%

5. 上传图片预测

三、小结

本章介绍了基于RK3588平台的InceptionV3图像分类应用开发全流程,包括模型训练与格式转换、Gradio界面设计、推理代码实现、批量预测处理及Flask服务部署,实现了从本地到Web端的高效AI推理应用。

----转自博客:https://bbs.huaweicloud.com/blogs/451978
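补充:上文的 post_process 直接把模型输出当作概率读取置信度,这要求导出的 rknn 模型末尾已包含 softmax。若导出的模型输出为未归一化的 logits(是否如此取决于转换配置,此处仅为假设),可先做 softmax 再取置信度,示意如下:

import numpy as np

def post_process_with_softmax(pred, labels):
    """post_process 的变体:模型输出为 logits 时先做 softmax 再取置信度"""
    logits = pred[0].astype(np.float32)
    probs = np.exp(logits - logits.max())   # 数值稳定的 softmax
    probs /= probs.sum()
    cls_id = int(np.argmax(probs))
    return labels[cls_id], round(float(probs[cls_id]) * 100, 2)

用法与原 post_process 相同,例如 post_process_with_softmax(pred, model.label);是否需要这一步请以实际导出配置为准。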
-
一、ModelArts Lite Server 使用前必读

1. Lite Server 使用流程总览

ModelArts Lite Server提供多样化的xPU裸金属服务器,赋予用户以root账号自主安装和部署AI框架、应用程序等第三方软件的能力,为用户打造专属的云上物理服务器环境。用户只需选择服务器的规格、镜像、网络配置及密钥等基本信息,即可迅速创建弹性裸金属服务器,获取所需的云上物理资源,充分满足算法工程师在日常训练和推理工作中的需求。

图:Lite Server 使用流程图

2. Lite Server 高危操作一览表

ModelArts Lite Server在日常操作与维护过程中涉及的高危操作,需要严格按照操作指导进行,否则可能会影响业务的正常运行。高危操作风险等级说明:
高:可能直接导致业务失败、数据丢失、系统不能维护、系统资源耗尽的高危操作。
中:可能导致安全风险及可靠性降低的高危操作。
低:高、中风险等级外的其他高危操作。

操作对象:操作系统
操作名称:升级/修改操作系统内核或者驱动
风险描述:如果升级/修改操作系统内核或者驱动,很可能导致驱动和内核版本不兼容,从而导致OS无法启动,或者基本功能不可用。相关高危命令如:ap…
风险等级:高
应对措施:如果需要升级/修改,请联系技术支持。

操作对象:操作系统
操作名称:切换或者重置操作系统
风险描述:服务器在进行过“切换或者重置操作系统”操作后,EVS系统盘ID发生变化,和下单时订单中的EVS ID已经不一致,因此EVS系统盘将不支持扩容,并显示信息:“当前订单已到期,无法进行扩容操作,请续订”。
风险等级:中
应对措施:切换或者重置操作系统后,建议通过挂载数据盘EVS或挂载SFS盘等方式进行存储扩容。

二、ModelArts Lite Server 资源开通

图 2-1 Server 资源开通流程图

表 2-1 Server 资源开通流程
阶段:准备工作——任务:1、申请开通资源规格 2、资源配额提升 3、基础权限开通 4、配置ModelArts委托授权
阶段:购买Server——任务:5、在ModelArts控制台购买资源池

步骤 1:申请开通资源规格

请联系客户经理/接口人确认Server资源方案、申请要开通资源的规格(如果无客户经理可提交工单)。

步骤 2:资源配额提升

由于Server所需资源可能会超出默认提供的资源配额(如ECS、EIP、SFS、内存大小、CPU核数),因此需要提升资源配额。
1. 登录云管理控制台。
2. 在顶部导航栏单击“资源 > 我的配额”,进入服务配额页面。
3. 单击右上角“申请扩大配额”,填写申请材料后提交工单。
说明:配额需大于需要开通的资源,且在购买开通前完成提升,否则会导致资源开通失败。

步骤 3:基础权限开通(若使用主账号或已有IAM子账号可忽略)

基础权限开通需要登录管理员账号,为子用户账号开通Server功能所需的基础权限(ModelArts FullAccess/BMS FullAccess/ECS FullAccess/VPC FullAccess/VPC Administrator/VPCEndpoint Administrator),即允许子用户账号同时可以使用这些云服务。
步骤1 登录统一身份认证服务管理控制台。
步骤2 单击目录左侧“用户组”,然后在页面右上角单击“创建用户组”。
步骤3 填写“用户组名称”并单击“确定”。
步骤4 在操作列单击“用户组管理”,将需要配置权限的用户加入用户组中。
步骤5 单击用户组名称,进入用户组详情页。
步骤6 在权限管理页签下,单击“授权”。
图 2-2 “配置权限”
步骤7 在搜索栏输入“ModelArts FullAccess”,并勾选“ModelArts FullAccess”。
图 2-3 ModelArts FullAccess
以相同的方式,依次添加:BMS FullAccess、ECS FullAccess、VPC FullAccess、VPC Administrator、VPCEndpoint Administrator(Server Administrator、DNS Administrator为依赖策略,会自动被勾选)。
步骤8 单击“下一步”,授权范围方案选择“所有资源”。
步骤9 单击“确认”,完成基础权限开通。
----结束

步骤 4:在 ModelArts 上创建委托授权

ModelArts Lite Server在任务执行过程中需要访问用户的其他服务,典型的就是容器使用过程中需要到SWR服务拉取镜像。在这个过程中,就出现了ModelArts“代表”用户去访问其他云服务的情形。从安全角度出发,ModelArts代表用户访问任何云服务之前,均需要先获得用户的授权,而这个动作就是一个“委托”的过程。用户授权ModelArts代表自己访问特定的云服务,以完成其在ModelArts平台上执行的AI计算任务。

● 新建委托
第一次使用ModelArts时需要创建委托授权,授权允许ModelArts代表用户去访问其他云服务。进入到ModelArts控制台的“权限管理”页面,单击“添加授权”,根据提示进行操作。
● 更新委托
如果之前给ModelArts创建过委托授权,此处可以更新授权。
a. 进入到ModelArts控制台的“资源管理 > AI专属资源池 > 弹性节点 Server”页面,查看是否存在授权缺失的提示。
图 2-4 弹性节点 Server 权限缺失提示
b. 如果有授权缺失,根据提示,单击“此处”更新委托。根据提示选择“追加至已有授权”,单击“确定”,系统会提示权限更新成功。
图 2-5 追加授权
步骤 5:购买弹性节点 Server 资源

购买弹性节点Server资源的过程即创建资源的过程。
1. 登录ModelArts管理控制台。
2. 在左侧导航栏中,选择“资源管理 > AI专属资源池 > 弹性节点 Server”,进入“节点”列表。
图 2-6 购买弹性节点 Server 入口
3. 单击节点列表页右上角的“购买AI专属节点”,进入“购买AI专属节点”页面,在该页面填写相关参数信息(区域按申请开通时确认的区域选择:贵阳、乌兰察布、华东)。

三、ModelArts Lite Server 软件环境

NPU 服务器上配置 Lite Server 资源软件环境:

1.1 安装驱动(安装完毕后需重启服务器,可输入指令 reboot 重启)

wget "https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Ascend HDK/Ascend HDK 23.0.3/Ascend-hdk-910b-npu-driver_23.0.3_linux-aarch64.run"
sudo sh Ascend-hdk-910b-npu-driver_23.0.3_linux-aarch64.run --full --install-for-all

1.2 验证驱动(npu-smi 完整输出各NPU卡状态即为正常)

npu-smi info

1.3 安装固件

wget "https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Ascend HDK/Ascend HDK 23.0.3/Ascend-hdk-910b-npu-firmware_7.1.0.5.220.run"
sudo sh Ascend-hdk-910b-npu-firmware_7.1.0.5.220.run --full

1.4 查看安装结果,如图3-2所示

npu-smi info -t board -i 1 | egrep -i "software|firmware"

图 3-2 查看固件和驱动版本

1.5 安装CANN

# install CANN Toolkit
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Milan-ASL/Milan-ASL%20V100R001C17SPC701/Ascend-cann-toolkit_8.0.RC1.alpha001_linux-"$(uname -i)".run
bash Ascend-cann-toolkit_8.0.RC1.alpha001_linux-"$(uname -i)".run --install

# install CANN Kernels
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Milan-ASL/Milan-ASL%20V100R001C17SPC701/Ascend-cann-kernels-910b_8.0.RC1.alpha001_linux.run
bash Ascend-cann-kernels-910b_8.0.RC1.alpha001_linux.run --install

# 注意安装顺序:先安装Toolkit,再安装Kernels
# 配置昇腾开发工具包运行所需的环境变量
source /usr/local/Ascend/ascend-toolkit/set_env.sh

1.6 安装docker

yum install -y docker-engine.aarch64 docker-engine-selinux.noarch docker-runc.aarch64

# 配置Ascend-docker-runtime
# 下载Ascend-docker-runtime
wget https://mindx.obs.cn-south-1.myhuaweicloud.com/OpenSource/MindX/MindX%205.0.RC2/MindX%20DL%205.0.RC2/Ascend-docker-runtime_5.0.RC2_linux-aarch64.run
chmod 700 *.run
./Ascend-docker-runtime_5.0.RC2_linux-aarch64.run --install

1.7 检查Ascend-docker-runtime安装结果

docker info | grep Runtime

图 3-3 Ascend-docker-runtime 查询

1.8 配置docker

# 将新挂载的盘设置为docker容器使用路径,docker配置内容如下:
{
    "data-root": "/home/docker",
    "default-runtime": "ascend",
    "default-shm-size": "8G",
    "insecure-registries": [
        "ascendhub.huawei.com"
    ],
    "registry-mirrors": [
        "https://90c2db5e37a147c6afd970e45c68e365.mirror.swr.myhuaweicloud.com"
    ],
    "runtimes": {
        "ascend": {
            "path": "/usr/local/Ascend/Ascend-Docker-Runtime/ascend-docker-runtime",
            "runtimeArgs": []
        }
    }
}

# 保存后,执行如下命令重启docker使配置生效
systemctl daemon-reload && systemctl restart docker

四、模型部署

4.1 下载模型镜像

前往昇腾社区/开发资源(https://gitee.com/link?target=https%3A%2F%2Fwww.hiascend.com%2Fdeveloper%2Fascendhub%2Fdetail%2Faf85b724a7e5469ebd7ea13c3439d48f)下载适配的镜像。下载镜像前需要申请权限,耐心等待权限申请通过后,根据指南下载对应镜像文件。完成之后,使用 docker images 命令确认具体镜像名称与标签。

4.2 权重文件下载

Qwen2.5-72B-Instruct(下载地址:https://www.modelscope.cn/models/Qwen/Qwen2.5-72B-Instruct/files)

4.3 启动模型容器

docker run -itd --privileged --name=Qwen2.5_72b --net=host --shm-size=500g \
    --device=/dev/davinci0 --device=/dev/davinci1 --device=/dev/davinci2 --device=/dev/davinci3 \
    --device=/dev/davinci4 --device=/dev/davinci5 --device=/dev/davinci6 --device=/dev/davinci7 \
    --device=/dev/davinci_manager --device=/dev/devmm_svm --device=/dev/hisi_hdc \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ \
    -v /usr/local/sbin/:/usr/local/sbin/ \
    -v /var/log/npu/slog/:/var/log/npu/slog \
    -v /var/log/npu/profiling/:/var/log/npu/profiling \
    -v /var/log/npu/dump/:/var/log/npu/dump \
    -v /var/log/npu/:/usr/slog \
    -v /etc/hccn.conf:/etc/hccn.conf \
    -v /Qwen2.5/Qwen2.5-72B-Instruct:/Qwen2.5/Qwen2.5-72B-Instruct \
    2acbb0b1003d /bin/bash

# 这里下载的权重文件目录在 /Qwen2.5/Qwen2.5-72B-Instruct,启动容器时需要挂载该目录
4.4 纯模型推理

Docker容器跑起来后需要进入容器:

docker exec -it 容器id /bin/bash

对话测试:

# 前置准备
# 通信优化
export HCCL_DETERMINISTIC=false    # 关闭昇腾集群通信的确定性模式(提升性能,允许并行计算优化)
export LCCL_DETERMINISTIC=0        # 关闭集合通信的严格同步(减少通信延迟,提升分布式效率)
export HCCL_BUFFSIZE=120           # 设置通信缓冲区大小(默认32,增大可提升吞吐量,但占用更多内存)
# 内存优化
export ATB_WORKSPACE_MEM_ALLOC_GLOBAL=1    # 全局预分配昇腾算子的内存空间(避免反复申请,加速计算)

# 测试开始
# 进入modeltest路径
cd /usr/local/Ascend/atb-models

# Step1. 清理残余进程
pkill -9 -f 'mindie|python'

# Step2. 执行命令
bash examples/models/qwen/run_pa.sh -m ${weight_path} --trust_remote_code true

${weight_path}需要改成自己的模型权重文件路径,这里是 /Qwen2.5/Qwen2.5-72B-Instruct。

4.5 服务化推理

前置准备

【使用场景】对标真实客户上线场景,使用不同并发、不同发送频率、不同输入长度和输出长度分布,测试服务化性能。

# 配置服务化环境变量
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

# 修改服务化参数
cd /usr/local/Ascend/mindie/latest/mindie-service/
vim conf/config.json

# 修改以下参数:
"httpsEnabled" : false,    # 如果网络环境不安全,不开启HTTPS通信(即"httpsEnabled"="false")会存在较高的网络安全风险
...
# 若不需要安全认证,则将以下两个参数设为false
"interCommTLSEnabled" : false,
"interNodeTLSEnabled" : false,
...
"npuDeviceIds" : [[0,1,2,3,4,5,6,7]],
...
"modelName" : "qwen",    # 不影响服务化拉起
"modelWeightPath" : "权重路径",
"worldSize" : 8,

参考配置如下:

{
    "Version": "1.0.0",
    "LogConfig": {
        "logLevel": "Info",                         // 日志级别:信息
        "logFileSize": 20,                          // 单个日志文件大小:20MB
        "logFileNum": 20,                           // 保留的日志文件数量:20个
        "logPath": "logs/mindie-server.log"         // 日志文件路径
    },
    "ServerConfig": {
        "ipAddress": "127.0.0.1",                   // IP地址:本地回环地址
        "managementIpAddress": "127.0.0.2",         // 管理IP地址:本地回环地址
        "port": 1025,                               // 端口号:1025
        "managementPort": 1026,                     // 管理端口:1026
        "metricsPort": 1027,                        // 指标端口:1027
        "allowAllZeroIpListening": false,           // 是否允许监听所有零IP:否
        "maxLinkNum": 1000,                         // 最大连接数:1000
        "httpsEnabled": true,                       // HTTPS是否启用:是
        "fullTextEnabled": false,                   // 全文搜索是否启用:否
        "tlsCaPath": "/usr/local/Ascend/mindie/latest/mindie-service/security/ca/",       // TLS CA证书路径
        "tlsCaFile": ["ca.pem"],                    // TLS CA证书文件列表
        "tlsCert": "/usr/local/Ascend/mindie/latest/mindie-service/security/certs/server.pem",    // TLS证书文件路径
        "tlsPk": "/usr/local/Ascend/mindie/latest/mindie-service/security/keys/server.key.pem",   // TLS私钥文件路径
        "tlsPkPwd": "/usr/local/Ascend/mindie/latest/mindie-service/security/pass/key_pwd.txt",   // TLS私钥密码文件路径
        "tlsCrlPath": "/usr/local/Ascend/mindie/latest/mindie-service/security/certs/",   // TLS CRL路径
        "tlsCrlFiles": ["server_crl.pem"],          // TLS CRL文件列表
        "managementTlsCaFile": ["management_ca.pem"],                    // 管理TLS CA证书文件列表
        "managementTlsCert": "/security/certs/management/server.pem",    // 管理TLS证书文件路径
        "managementTlsPk": "/security/keys/management/server.key.pem",   // 管理TLS私钥文件路径
        "managementTlsPkPwd": "/security/pass/management/key_pwd.txt",   // 管理TLS私钥密码文件路径
        "managementTlsCrlPath": "/security/management/certs/",           // 管理TLS CRL路径
        "managementTlsCrlFiles": ["server_crl.pem"],                     // 管理TLS CRL文件列表
        "kmcKsfMaster": "tools/pmt/master/ksfa",    // KMC密钥主文件路径
        "kmcKsfStandby": "tools/pmt/standby/ksfb",  // KMC密钥备用文件路径
        "inferMode": "standard",                    // 推理模式:标准
        "interCommTLSEnabled": false,               // 内部通信TLS是否启用:否
        "interCommPort": 1121,                      // 内部通信端口:1121
        "interCommTlsCaPath": "security/grpc/ca/",  // 内部通信TLS CA证书路径
        "interCommTlsCaFiles": ["ca.pem"],          // 内部通信TLS CA证书文件列表
        "interCommTlsCert": "security/grpc/certs/server.pem",     // 内部通信TLS证书文件路径
        "interCommPk": "security/grpc/keys/server.key.pem",       // 内部通信TLS私钥文件路径
        "interCommPkPwd": "security/grpc/pass/key_pwd.txt",       // 内部通信TLS私钥密码文件路径
        "interCommTlsCrlPath": "security/grpc/certs/",            // 内部通信TLS CRL路径
        "interCommTlsCrlFiles": ["server_crl.pem"],               // 内部通信TLS CRL文件列表
        "openAiSupport": "vllm"                     // OpenAI接口兼容模式:vllm
    },
    "BackendConfig": {
        "backendName": "mindieservice_llm_engine",  // 后端名称:mindieservice_llm_engine
        "modelInstanceNumber": 1,                   // 模型实例数量:1
        "npuDeviceIds": [[0, 1, 2, 3, 4, 5, 6, 7]],  // NPU设备ID列表
        "tokenizerProcessNumber": 8,                 // 分词器进程数量:8
        "multiNodesInferEnabled": false,             // 多节点推理是否启用:否
        "multiNodesInferPort": 1120,                 // 多节点推理端口:1120
        "interNodeTLSEnabled": false,                // 节点间通信TLS是否启用:否
        "interNodeTlsCaPath": "security/grpc/ca/",   // 节点间通信TLS CA证书路径
        "interNodeTlsCaFiles": ["ca.pem"],           // 节点间通信TLS CA证书文件列表
        "interNodeTlsCert": "security/grpc/certs/server.pem",     // 节点间通信TLS证书文件路径
        "interNodeTlsPk": "security/grpc/keys/server.key.pem",    // 节点间通信TLS私钥文件路径
        "interNodeTlsPkPwd": "security/grpc/pass/mindie_server_key_pwd.txt",   // 节点间通信TLS私钥密码文件路径
        "interNodeTlsCrlPath": "security/grpc/certs/",            // 节点间通信TLS CRL路径
        "interNodeTlsCrlFiles": ["server_crl.pem"],               // 节点间通信TLS CRL文件列表
        "interNodeKmcKsfMaster": "tools/pmt/master/ksfa",         // 节点间通信KMC密钥主文件路径
        "interNodeKmcKsfStandby": "tools/pmt/standby/ksfb",       // 节点间通信KMC密钥备用文件路径
        "ModelDeployConfig": {
            "maxSeqLen": 2560,                       // 最大序列长度:2560
            "maxInputTokenLen": 2048,                // 最大输入令牌长度:2048
            "truncation": false,                     // 是否截断:否
            "ModelConfig": [
                {
                    "modelInstanceType": "Standard", // 模型实例类型:标准
                    "modelName": "Qwen2.5-72b",      // 模型名称:Qwen2.5-72b
                    "modelWeightPath": "/Qwen2.5/Qwen2.5-72B-Instruct",   // 模型权重文件路径
                    "worldSize": 8,                  // 并行推理的NPU卡数:8
                    "cpuMemSize": 5,                 // CPU内存大小:5GB
                    "npuMemSize": -1,                // NPU内存大小:自动分配
                    "backendType": "atb",            // 后端类型:atb
                    "trustRemoteCode": false         // 是否信任远程代码:否
                }
            ]
        },
        "ScheduleConfig": {
            "templateType": "Standard",              // 模板类型:标准
            "templateName": "Standard_LLM",          // 模板名称:Standard_LLM
            "cacheBlockSize": 128,                   // 缓存块大小:128
            "maxPrefillBatchSize": 50,               // 最大预填充批处理大小:50
            "maxPrefillTokens": 8192,                // 最大预填充令牌数:8192
            "prefillTimeMsPerReq": 150,              // 每请求预填充时间:150毫秒
            "prefillPolicyType": 0,                  // 预填充策略类型:0
            "decodeTimeMsPerReq": 50,                // 每请求解码时间:50毫秒
            "decodePolicyType": 0,                   // 解码策略类型:0
            "maxBatchSize": 200,                     // 最大批处理大小:200
            "maxIterTimes": 512,                     // 最大迭代次数:512
            "maxPreemptCount": 0,                    // 最大抢占计数:0
            "supportSelectBatch": false,             // 是否支持选择批处理:否
            "maxQueueDelayMicroseconds": 5000        // 最大队列延迟:5000微秒
        }
    }
}

拉起服务化:

# 以下命令需在所有机器上同时执行
# 解决权重加载过慢问题
export OMP_NUM_THREADS=1
# 设置显存比
export NPU_MEMORY_FRACTION=0.95
# 拉起服务化
cd /usr/local/Ascend/mindie/latest/mindie-service/
./bin/mindieservice_daemon

执行命令后,首先会打印本次启动所用的所有参数;当出现如下输出时,则认为服务成功启动:

Daemon start success!
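服务启动后可以发一条请求做冒烟验证。下面是一个基于 Python requests 的示意:端点与模型名取自上文示例配置(ipAddress=127.0.0.1、port=1025、modelName="qwen"),并假设已按上文把 httpsEnabled 改为 false;接口路径为 "openAiSupport": "vllm" 模式下常见的 OpenAI 兼容约定,实际请以自己的 config.json 和 MindIE 文档为准。

import requests

# 端口、模型名为上文示例配置中的取值,若修改过 config.json 请相应调整
resp = requests.post(
    "http://127.0.0.1:1025/v1/chat/completions",
    json={
        "model": "qwen",
        "messages": [{"role": "user", "content": "你好,请介绍一下你自己"}],
        "max_tokens": 128,
    },
    timeout=60,
)
print(resp.json())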
-
无人机巡检数据集:空中语义分割

由计算机图形与视觉研究所(ICG)开发的语义无人机数据集,旨在推动城市场景的语义理解研究,提升自主无人机飞行与着陆的安全性。以下从数量、类别、分布及分辨率四个核心维度展开说明。

一、数据数量与划分

该数据集包含600张高分辨率图像,按用途分为训练集和测试集两部分:
训练集:400张图像,公开可获取,包含完整的标注信息,支持模型训练与算法验证。
测试集:200张图像,为私有数据,主要用于评估模型的泛化能力,确保研究结果的客观性。

此外,数据集还提供了丰富的辅助数据,包括1Hz采集的高分辨率图像序列、5Hz的鱼眼立体图像(同步配有IMU测量数据)、1Hz的热成像图像,以及3栋房屋的地面控制点和全站仪获取的3D真值数据,进一步扩展了其应用场景。

二、语义类别与标注

数据集针对语义分割任务定义了20个核心类别,覆盖城市场景中的典型元素,具体分类如下:
自然元素:树(tree)、草(grass)、其他植被(other vegetation)、泥土(dirt)、砾石(gravel)、岩石(rocks)、水(water);
人工构造:铺装区域(paved area)、泳池(pool)、屋顶(roof)、墙(wall)、栅栏(fence)、栅栏柱(fence-pole)、窗户(window)、门(door)、障碍物(obstacle);
动态目标:人(person)、狗(dog)、汽车(car)、自行车(bicycle)。

标注精度达到像素级,确保语义分割任务的准确性;同时,针对人物检测任务,提供了训练集和测试集的边界框(bounding box)标注,支持多任务研究。

三、数据分布特点

数据集的图像均通过无人机从天底视角(鸟瞰视角)采集,覆盖超过20栋房屋的城市区域,拍摄高度在地面以上5至30米之间,确保场景的真实性与多样性。从分布来看:
场景覆盖:包含居民区内的建筑、植被、道路、休闲区域(如泳池)等,兼顾自然与人工环境的混合场景;
目标密度:图像中包含不同数量的动态目标(人、动物、车辆)和静态结构,适合测试算法在复杂目标交互场景下的表现;
辅助数据分布:热成像、立体图像等辅助数据与主图像时空同步,可用于多模态融合研究,提升模型对环境的感知能力。

四、分辨率与数据格式

数据集的核心图像采用高分辨率相机采集,单张图像尺寸为6000×4000像素(2400万像素),确保细节信息的完整性,满足精细语义分割的需求。训练集提供多种格式的标注文件,包括:
Python pickle格式的边界框数据;
可选的XML格式边界框标注;
可选的掩码图像(mask images),便于不同算法框架的直接使用。

五、数据集下载地址

# 数据集地址
https://developer.huaweicloud.com/develop/aigallery/dataset/detail?id=0efe9613-5248-4a43-925e-4d6d377b6996

综上,该数据集凭借大尺寸、多类别、高精度的特点,为无人机视觉、语义分割、目标检测等领域的研究提供了高质量的基准数据,其丰富的辅助信息也为多模态感知与3D重建等任务奠定了基础。
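拿到训练集后,常见的第一步是统计各类别的像素占比,以检查类别不均衡情况。下面是一个示意脚本,其中目录结构与掩码格式均为假设(假设掩码为单通道 PNG、像素值即类别ID),请按实际解压后的路径和格式调整:

import os
import cv2
import numpy as np

# 统计掩码图中各类别的像素占比(目录名为示例假设)
mask_dir = "semantic_drone_dataset/label_images"
counts = np.zeros(256, dtype=np.int64)
for name in os.listdir(mask_dir):
    mask = cv2.imread(os.path.join(mask_dir, name), cv2.IMREAD_GRAYSCALE)
    if mask is None:
        continue
    ids, num = np.unique(mask, return_counts=True)
    counts[ids] += num
total = counts.sum()
for cls_id in np.nonzero(counts)[0]:
    print(f"class {cls_id}: {counts[cls_id] / total:.2%}")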