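• [Help Request] Can the GPU Accelerated Cloud Server (GACS) be used to train large models?
     Can the GPU Accelerated Cloud Server (GACS) be used to train large models?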
  • [Help Request] Do goods still disappear after 1000 frames once they have reached the port?
    This information seems quite important.
  • [Discussion] Large server discounts available
    Large server discounts available, with discounts stacked on discounts.
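  • [FAQ] Does the number of workbenches change between frames, or is it fixed from initialization?
    Does the number of workbenches change between frames, or is it fixed from initialization?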
  • [Technical Notes] [AMD GPU] Training AI models on an AMD card (reposted)
     Rant: ROCm has gone through this many releases and there is still no Windows version.
     RX580 users, take note: ROCm dropped RX580 support after version 4.0. Garbage AMD.

     Setup used
         Linux: Ubuntu20.04.1
         CPU: R9-5900hx
         GPU: RX6800M 12G
         Python: 3.10.6

     2022-10-24 23:21:50: one-click deployment tool released. Run order: 1-8-2-3-4-5-7-6
     Add this source: deb https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy main
     Download link: https://www.123pan.com/s/xW39-dVMmH (extraction code: 2333)

     Install the GPU driver
     If the GPU driver is already installed and working, skip this step.
     If you previously installed another version and the driver did not work, uninstall it first:
         sudo amdgpu-install --uninstall
     Download amdgpu-install_xxxxxx.xxxxxx_all.deb from the AMD website, change into the directory that holds the package, and run the following (amdgpu-install_xxxxxxx-xxxxxx_all.deb stands for the amdgpu version you downloaded):
         sudo apt install ./amdgpu-install_xxxxxxx-xxxxxx_all.deb
     Then refresh and upgrade:
         sudo apt update
         sudo apt upgrade -y
     Install the driver:
         sudo amdgpu-install --no-dkms
         sudo apt install rocm-dev
         # reboot once installation finishes
         sudo reboot

     Configure the environment
         ls -l /dev/dri/render*
         sudo usermod -a -G render $LOGNAME
         sudo usermod -a -G video $LOGNAME
         sudo reboot

     Test
         # show the GPU performance monitor
         rocm-smi
         # two commands that print GPU information (type them directly in a terminal)
         /opt/rocm/bin/rocminfo
         /opt/rocm/opencl/bin/clinfo
         # if either one reports an error, the installation probably did not succeed

     Add to PATH
         echo 'export PATH=$PATH:/opt/rocm/bin:/opt/rocm/profiler/bin:/opt/rocm/opencl/bin/x86_64' | sudo tee -a /etc/profile.d/rocm.sh

     Install MIOpen
         # install the HIP build
         sudo apt-get install miopen-hip
         # download miopenkernels; this package targets gfx1030 AMD cards, but you can try it on other cards too
         # link: https://www.123pan.com/s/xW39-oyMmH
         sudo dpkg -i miopenkernels-gfx1030-36kdb_1.1.0.50200-65_amd64.deb

     Install PyTorch on RDNA2 cards
         pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/rocm5.1.1
     RX580 (gfx803) users install this instead
         pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/rocm3.7

     Run stable-diffusion-webui
         sudo apt install git
         git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git
         cd stable-diffusion-webui
         # pip is usually reported as too old, so upgrade it
         python -m pip install --upgrade pip wheel
         pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
         HSA_OVERRIDE_GFX_VERSION=10.3.0 python launch.py --precision full --no-half
         # HSA_OVERRIDE_GFX_VERSION spoofs the GPU version; 9.0.0 or 8.0.3 can also be entered (untested)
         # it will usually complain that there is no model; if you have one, drop it into ./models/Stable-diffusion
         # this article does not provide a model, search for one yourself

     Fix for the CUDA error "torch is not able to use gpu"
         # open launch.py and find this line
         commandline_args = os.environ.get('COMMANDLINE_ARGS', "")
         # change it to
         commandline_args = os.environ.get('COMMANDLINE_ARGS', "--skip-torch-cuda-test")

     Troubleshooting

     rocm-gdb depends on libpython3.8
     Open Software & Updates > Other Software and add the following source:
         deb https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy main
     Refresh the package sources:
         sudo apt update
         sudo apt upgrade
     Install libpython3.8 and rerun amdgpu-install:
         sudo apt install libpython3.8
         sudo apt install rocm-dev

     rocm-llvm depends on python but python cannot be installed
     Work in any scratch directory:
         apt download rocm-llvm
         ar x rocm-llvm_xxxx.xxxxx_amd64.deb
         tar xf control.tar.xz
         # edit the control file; if vim is missing, install it first with: sudo apt install vim
         vim control
         # find this line:
         Depends: python, libc6, libstdc++6|libstdc++8, libstdc++-5-dev|libstdc++-7-dev, libgcc-5-dev|libgcc-7-dev, rocm-core
         # change it to:
         Depends: python3, libc6, libstdc++6|libstdc++8, libstdc++-5-dev|libstdc++-7-dev|libstdc++-10-dev, libgcc-5-dev|libgcc-7-dev|libgcc-10-dev, rocm-core
         # repack
         tar c postinst prerm control | xz -c > control.tar.xz
         ar rcs rocm-llvm.deb debian-binary control.tar.xz data.tar.xz
         # install the dependencies before installing the package
         sudo apt install libstdc++-10-dev libgcc-10-dev rocm-core
         # install
         sudo dpkg -i rocm-llvm.deb
         # reinstall the driver
         sudo amdgpu-install --no-dkms

     Syntax error when running launch.py / switching the Python version
     Most likely the default python on your Ubuntu is not the right version.
         sudo HSA_OVERRIDE_GFX_VERSION=10.3.0 python launch.py --precision full --no-half
         # first check which python versions are installed locally
         ls /usr/bin/python*
         # the output normally looks like this:
         # /usr/bin/python      /usr/bin/python3.10-config  /usr/bin/python3-futurize
         # /usr/bin/python3     /usr/bin/python3.8          /usr/bin/python3-pasteurize
         # /usr/bin/python3.10  /usr/bin/python3-config
         # we want python3.10, so:
         sudo rm /usr/bin/python                          # remove the old link
         sudo ln -s /usr/bin/python3.10 /usr/bin/python   # create the new link
         python --version                                 # check

     "Can't run without a checkpoint. Find and place a .ckpt file into any of those locations. The program will exit."
     You have no model. Put one (a .ckpt file) into /models/Stable-diffusion.

     Black screen on reboot after installing the driver
     At boot, choose the second entry (recovery mode), then choose the first option to continue into the system. Once you are in, uninstall the driver.

     Plugin downloads time out after launch
     Plugin download speed depends on how smoothly you can reach GitHub. If it is very slow, edit launch.py. For example:
         gfpgan_package = os.environ.get('GFPGAN_PACKAGE', "git+https://github.com/TencentARC/GFPGAN.git@8d2447a2d918f8eba5a4a01463fd48e45126a379")
     becomes
         gfpgan_package = os.environ.get('GFPGAN_PACKAGE', "git+https://ghproxy.com/https://github.com/TencentARC/GFPGAN.git@8d2447a2d918f8eba5a4a01463fd48e45126a379")

     GPU sits idle (the GPU does no work)
     Try running webui as root (untested):
         su
         # enter the password; if none is set, set one with: sudo passwd root
         HSA_OVERRIDE_GFX_VERSION=10.3.0 python launch.py --precision full --no-half
         # HSA_OVERRIDE_GFX_VERSION spoofs the GPU version; 9.0.0 or 8.0.3 can also be entered (untested)

     Have fun
     From the webui directory, run:
         HSA_OVERRIDE_GFX_VERSION=10.3.0 python launch.py --precision full --no-half
     If a HIP error appears at runtime saying gfx1030 (or another version) cannot be found, ignore it and wait a moment. Later generations will not show it; it happens on the first run after every launch.

     GPU monitoring (optional)
         sudo apt install radeontop
         radeontop

     Copyright notice: this is an original article by CSDN blogger 「晓舟 XiaozhouTAT」, released under the CC 4.0 BY-SA license. Reposts must include the original source link and this notice.
     Original link: https://blog.csdn.net/qq_44948500/article/details/127346390
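
The guide above edits launch.py to change the default of `COMMANDLINE_ARGS`. Because that line reads the value from the environment, setting the variable before launch should give the same result from `os.environ.get` without touching the file. The following is only a sketch of that idea; the helper name `build_launch` and its parameters are hypothetical, and it only prepares the command and environment rather than running anything.

```python
import os
import sys


def build_launch(gfx_version="10.3.0", skip_cuda_test=False, base_env=None):
    """Return (argv, env) for starting launch.py with the guide's settings.

    Hypothetical helper: nothing is executed here.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Same override the guide puts in front of the python command.
    env["HSA_OVERRIDE_GFX_VERSION"] = gfx_version
    if skip_cuda_test:
        # launch.py reads COMMANDLINE_ARGS via os.environ.get(...), so a value
        # set here takes the place of the edited default.
        extra = env.get("COMMANDLINE_ARGS", "")
        if "--skip-torch-cuda-test" not in extra.split():
            env["COMMANDLINE_ARGS"] = (extra + " --skip-torch-cuda-test").strip()
    # Flags taken from the guide's launch command.
    argv = [sys.executable, "launch.py", "--precision", "full", "--no-half"]
    return argv, env


# Example use, from inside the stable-diffusion-webui directory:
# import subprocess
# argv, env = build_launch(skip_cuda_test=True)
# subprocess.run(argv, env=env, check=True)
```

Using `sys.executable` means launch.py runs under whichever interpreter runs this script, so the script itself should be started with the Python 3.10 interpreter the guide asks for.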
  • [Atlas200] Atlas 200 card creation fails
    The failure is shown in the screenshot. I am using an Atlas 200 with a third-party carrier board.
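  • [ManageOne二...] Does the ManageOne API support GPU lifecycle management?
    Does ManageOne or Huawei Stack support GPU lifecycle management through an API, and are the interfaces consistent with the ECS API?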
  • One-click AI painting: everyone is an artist
    ffasdsdasdasdasd
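  • [Help Request] Can the FTP delete-file control in RPA take *.xlsx as the file name, to delete all files of one type at once?
    Setting the remote target file to /home/target/*.xlsx does not work. How should it be written?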
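
Whether the RPA FTP control expands wildcards is not stated in this thread. If it does not, one generic workaround is to list the remote directory, filter the names on the client side, and delete the matches one by one. The sketch below illustrates that approach with Python's standard `ftplib`; it is not the RPA control's API, and the function name, host, and credentials are hypothetical.

```python
import fnmatch
import posixpath


def delete_matching(ftp, directory, pattern):
    """Delete every file in `directory` whose name matches `pattern`.

    `ftp` is expected to behave like ftplib.FTP (nlst and delete methods).
    Returns the list of deleted paths.
    """
    deleted = []
    for entry in ftp.nlst(directory):
        # Some servers return bare names, others full paths; keep only the name.
        name = posixpath.basename(entry)
        if fnmatch.fnmatchcase(name, pattern):
            path = posixpath.join(directory, name)
            ftp.delete(path)
            deleted.append(path)
    return deleted


# Example with a hypothetical host and credentials:
# from ftplib import FTP
# with FTP("ftp.example.com") as ftp:
#     ftp.login("user", "password")
#     print(delete_matching(ftp, "/home/target", "*.xlsx"))
```

Matching is done with `fnmatchcase`, so `*.xlsx` will not match `REPORT.XLSX`; that choice is an assumption and can be relaxed if the server's file names vary in case.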
  • [Hot Events] RiDeYinXiang
    Impressions of the day: dusk, sunrise
  • [Help Request] Unable to create an environment with conda
    Creating an environment with conda fails with the following output:
        RequestsDependencyWarning: urllib3 (1.26.12) or chardet (3.0.4) doesn't match a supported version!
        RequestsDependencyWarning)
        Collecting package metadata (current_repodata.json): failed
        ProxyError: Conda cannot proceed due to an error in your proxy configuration.
        Check for typos and other configuration errors in any '.netrc' file in your home directory, any environment variables ending in '_PROXY', and any other system-wide proxy configuration settings.
    How can I fix this? Thanks.
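
The error message itself names two places to look: a `.netrc` file in the home directory and environment variables ending in `_PROXY`. A small sketch for collecting both in one go follows; the function names are hypothetical, and it only reports what it finds rather than changing any setting.

```python
import os


def find_proxy_settings(environ=None):
    """Return the environment variables whose names end in _PROXY (any case)."""
    environ = os.environ if environ is None else environ
    return {k: v for k, v in environ.items() if k.upper().endswith("_PROXY")}


def netrc_exists(home=None):
    """Report whether a .netrc file is present in the home directory."""
    home = os.path.expanduser("~") if home is None else home
    return os.path.isfile(os.path.join(home, ".netrc"))


if __name__ == "__main__":
    for name, value in sorted(find_proxy_settings().items()):
        print(f"{name}={value}")
    print(".netrc present:", netrc_exists())
```

Any variable it prints is a candidate for the typo or stale address the message warns about; unsetting it in the current shell and retrying is one way to narrow the cause down.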
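  • [Installation] Installation error
    A question for everyone: how should this error be handled, and what are the commands for checking the Ascend and CANN versions? The platform is aarch64. I created a virtual environment with Anaconda, installed MindSpore inside it, and exported the environment variables. Running import mindspore then fails with the following error: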
  • [Help Request] Guangxi AI: pytorch-lightning-unet error
    I configured everything as documented. Running train.py on ModelArts fails with the following error:
        TypeError Traceback (most recent call last)
        ~/work/agrivis/code/pytorch-lightning-unet/train.py in <module>
              139
              140 if __name__ == "__main__":
          --> 141 main()
        ~/work/agrivis/code/pytorch-lightning-unet/train.py in main()
              136
          --> 137 trainer.fit(model, train_loader, val_loader)
              138
              139
        ~/anaconda3/envs/PyTorch-1.8/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
              695 self.strategy.model = model
              696 self._call_and_handle_interrupt(
          --> 697 self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
              698 )
              699
        ~/anaconda3/envs/PyTorch-1.8/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py in _call_and_handle_interrupt(self, trainer_fn, *args, **kwargs)
              648 return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
              649 else:
          --> 650 return trainer_fn(*args, **kwargs)
              651 # TODO(awaelchli): Unify both exceptions below, where `KeyboardError` doesn't re-raise
              652 except KeyboardInterrupt as exception:
        ~/anaconda3/envs/PyTorch-1.8/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py in _fit_impl(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
              735 ckpt_path, model_provided=True, model_connected=self.lightning_module is not None
              736 )
          --> 737 results = self._run(model, ckpt_path=self.ckpt_path)
              738
              739 assert self.state.stopped
        ~/anaconda3/envs/PyTorch-1.8/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py in _run(self, model, ckpt_path)
              1166 self._checkpoint_connector.resume_end()
              1167
          --> 1168 results = self._run_stage()
              1169
              1170 log.detail(f"{self.__class__.__name__}: trainer tearing down")
        ~/anaconda3/envs/PyTorch-1.8/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py in _run_stage(self)
              1252 if self.predicting:
              1253 return self._run_predict()
          --> 1254 return self._run_train()
              1255
              1256 def _pre_training_routine(self):
        ~/anaconda3/envs/PyTorch-1.8/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py in _run_train(self)
              1283
              1284 with torch.autograd.set_detect_anomaly(self._detect_anomaly):
          --> 1285 self.fit_loop.run()
              1286
              1287 def _run_evaluate(self) -> _EVALUATE_OUTPUT:
        ~/anaconda3/envs/PyTorch-1.8/lib/python3.7/site-packages/pytorch_lightning/loops/loop.py in run(self, *args, **kwargs)
              198 try:
              199 self.on_advance_start(*args, **kwargs)
          --> 200 self.advance(*args, **kwargs)
              201 self.on_advance_end()
              202 self._restarting = False
        ~/anaconda3/envs/PyTorch-1.8/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py in advance(self)
              268 )
              269 with self.trainer.profiler.profile("run_training_epoch"):
          --> 270 self._outputs = self.epoch_loop.run(self._data_fetcher)
              271
              272 def on_advance_end(self) -> None:
        ~/anaconda3/envs/PyTorch-1.8/lib/python3.7/site-packages/pytorch_lightning/loops/loop.py in run(self, *args, **kwargs)
              198 try:
              199 self.on_advance_start(*args, **kwargs)
          --> 200 self.advance(*args, **kwargs)
              201 self.on_advance_end()
              202 self._restarting = False
        ~/anaconda3/envs/PyTorch-1.8/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py in advance(self, data_fetcher)
              191
              192 # hook
          --> 193 self.trainer._call_callback_hooks("on_train_batch_start", batch, batch_idx)
              194 response = self.trainer._call_lightning_module_hook("on_train_batch_start", batch, batch_idx)
              195 self.trainer._call_strategy_hook("on_train_batch_start", batch, batch_idx)
        ~/anaconda3/envs/PyTorch-1.8/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py in _call_callback_hooks(self, hook_name, *args, **kwargs)
              1597 if callable(fn):
              1598 with self.profiler.profile(f"[Callback]{callback.state_key}.{hook_name}"):
          --> 1599 fn(self, self.lightning_module, *args, **kwargs)
              1600
              1601 if pl_module:
        TypeError: on_train_batch_start() missing 1 required positional argument: 'dataloader_idx'
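
The traceback shows the trainer calling the hook as `fn(trainer, pl_module, batch, batch_idx)`, while the failing `on_train_batch_start` expects an additional `dataloader_idx`. Assuming the project contains a callback that declares `dataloader_idx` as a required parameter (the callback itself is not shown in the post), giving that parameter a default value is one way to accept both call shapes. The sketch below uses a plain class with a hypothetical name so that it runs without Lightning installed; a real callback would subclass `pytorch_lightning.Callback`.

```python
class BatchStartLogger:
    """Stand-in for a Lightning callback (hypothetical name)."""

    def __init__(self):
        self.seen = []

    # dataloader_idx has a default, so the hook works whether or not
    # the caller passes it.
    def on_train_batch_start(self, trainer, pl_module, batch, batch_idx, dataloader_idx=0):
        self.seen.append((batch_idx, dataloader_idx))


def call_hook_like_traceback(callback, trainer, pl_module, batch, batch_idx):
    """Mirror the call shape visible in the traceback: four arguments only."""
    fn = getattr(callback, "on_train_batch_start")
    fn(trainer, pl_module, batch, batch_idx)


if __name__ == "__main__":
    cb = BatchStartLogger()
    call_hook_like_traceback(cb, None, None, [1, 2], 3)
    print(cb.seen)
```

Installing the Lightning version the project was written against may be another option, if the project documents one.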
  • [Help Request] With the BiSheng compiler specified in CMake, make fails with F90-S-0038-Symbol, iargc, has not been explicitly declared
    F90-S-0038-Symbol, iargc, has not been explicitly declared
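  • [Atlas200] Flashing files to eMMC fails
    I am trying to flash the files over the network port, but I get the error "cannot find driver run package in current path". I am not sure whether a file is missing or something else is wrong.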