• [Technical Deep Dive] Deploying a CNN-LSTM Driving-Behavior Recognition Model on RK3588
Deploying a CNN-LSTM Driving-Behavior Recognition Model on RK3588

A CNN (convolutional neural network) is good at extracting spatial features from images, while an LSTM (long short-term memory network) is good at modeling the temporal structure of sequences. We first use the CNN to extract features from each video frame, then feed the resulting feature sequence into the LSTM to capture the spatio-temporal information and classify the sequence, recognizing five driving behaviors: normal driving, eyes closed, yawning, phone call, and looking around.

1. Model training

We train the model in a ModelArts Notebook using the flavor GPU: 1*Pnt1(16GB) | CPU: 8 cores, 64GB and the image tensorflow_2.1.0-cuda_10.1-py_3.7-ubuntu_18.04. First, download the dataset:

import os
import moxing as mox

if not os.path.exists('fatigue_driving'):
    mox.file.copy_parallel('obs://modelbox-course/fatigue_driving', 'fatigue_driving')
if not os.path.exists('rknn_toolkit2-2.3.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl'):
    mox.file.copy_parallel('obs://modelbox-course/rknn_toolkit2-2.3.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl', 'rknn_toolkit2-2.3.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl')

The dataset contains 1525 video clips covering 5 classes: 0: normal driving, 1: eyes closed, 2: yawning, 3: phone call, 4: looking around. We crop the driver-side half of each frame and resize it to the input size of the feature-extraction network:

def crop_driving_square(frame):
    h, w = frame.shape[:2]
    start_x = w // 2
    end_x = w
    start_y = 0
    end_y = h
    return frame[start_y:end_y, start_x:end_x]

Using MobileNetV2 pre-trained on ImageNet as the convolutional base, we create and save the image feature extractor:

def get_feature_extractor():
    feature_extractor = keras.applications.mobilenet_v2.MobileNetV2(
        weights='imagenet',
        include_top=False,
        pooling='avg',
        input_shape=(IMG_SIZE, IMG_SIZE, 3)
    )
    preprocess_input = keras.applications.mobilenet_v2.preprocess_input
    inputs = keras.Input((IMG_SIZE, IMG_SIZE, 3))
    preprocessed = preprocess_input(inputs)
    outputs = feature_extractor(preprocessed)
    model = keras.Model(inputs, outputs, name='feature_extractor')
    return model

feature_extractor = get_feature_extractor()
feature_extractor.save('feature_extractor')
feature_extractor.summary()

Model: "feature_extractor"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_2 (InputLayer)         [(None, 256, 256, 3)]     0
_________________________________________________________________
tf_op_layer_truediv (TensorF [(None, 256, 256, 3)]     0
_________________________________________________________________
tf_op_layer_sub (TensorFlowO [(None, 256, 256, 3)]     0
_________________________________________________________________
mobilenetv2_1.00_224 (Model) (None, 1280)              2257984
=================================================================
Total params: 2,257,984
Trainable params: 2,223,872
Non-trainable params: 34,112

The network input size is 256x256. We sample one frame out of every 6 and extract its image features; each feature vector has 1280 dimensions. This yields one feature sequence per video, with a maximum length of 40, zero-padded when a video is shorter:

def load_video(file_name):
    cap = cv2.VideoCapture(file_name)
    frame_interval = 6
    frames = []
    count = 0
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        if count % frame_interval == 0:
            frame = crop_driving_square(frame)
            frame = cv2.resize(frame, (IMG_SIZE, IMG_SIZE))
            frame = frame[:, :, [2, 1, 0]]
            frames.append(frame)
        count += 1
    return np.array(frames)

def load_data(videos, labels):
    video_features = []
    for video in tqdm(videos):
        frames = load_video(video)
        counts = len(frames)
        # if the number of frames is less than MAX_SEQUENCE_LENGTH
        if counts < MAX_SEQUENCE_LENGTH:
            # pad with zeros
            diff = MAX_SEQUENCE_LENGTH - counts
            # create an all-zero numpy array
            padding = np.zeros((diff, IMG_SIZE, IMG_SIZE, 3))
            # concatenate the arrays
            frames = np.concatenate((frames, padding))
        # keep the first MAX_SEQUENCE_LENGTH frames
        frames = frames[:MAX_SEQUENCE_LENGTH, :]
        # extract the image features in one batch
        video_feature = feature_extractor.predict(frames)
        video_features.append(video_feature)
    return np.array(video_features), np.array(labels)

video_features, classes = load_data(videos, labels)
video_features.shape, classes.shape
((1525, 40, 1280), (1525,))

In total we extracted feature sequences for 1525 videos.
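To make the frame budget concrete, here is a quick back-of-the-envelope check (my own illustrative numbers; the clip frame rate is an assumption, not stated in the article): sampling every 6th frame of a 30 fps clip keeps 5 frames per second, so a 40-frame sequence covers roughly the first 8 seconds of a video; longer clips are truncated and shorter ones are zero-padded.

import math

FPS = 30                  # assumed camera frame rate (illustrative)
FRAME_INTERVAL = 6        # sample one frame out of every 6
MAX_SEQUENCE_LENGTH = 40

def sampled_frames(duration_s: float) -> int:
    # number of frames kept by the count % FRAME_INTERVAL == 0 rule
    total = int(duration_s * FPS)
    return math.ceil(total / FRAME_INTERVAL)

for duration in (3, 8, 20):
    n = sampled_frames(duration)
    kept = min(n, MAX_SEQUENCE_LENGTH)
    pad = max(MAX_SEQUENCE_LENGTH - n, 0)
    print(f"{duration:>2}s clip -> {n} sampled frames, {kept} kept, {pad} zero-padded")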
We split them into training and test sets at an 8:2 ratio, with the batch size set to 16:

batch_size = 16
dataset = tf.data.Dataset.from_tensor_slices((video_features, classes))
dataset = dataset.shuffle(len(videos))
test_count = int(len(videos) * 0.2)
train_count = len(videos) - test_count
dataset_train = dataset.skip(test_count).cache().repeat()
dataset_test = dataset.take(test_count).cache().repeat()
train_dataset = dataset_train.shuffle(train_count).batch(batch_size)
test_dataset = dataset_test.shuffle(test_count).batch(batch_size)
train_dataset, train_count, test_dataset, test_count
(<BatchDataset shapes: ((None, 40, 1280), (None,)), types: (tf.float32, tf.int64)>, 1220,
 <BatchDataset shapes: ((None, 40, 1280), (None,)), types: (tf.float32, tf.int64)>, 305)

Next we create an LSTM that extracts the temporal information from the video feature sequence and feeds it into a Dense classifier. The model is defined as follows:

def video_cls_model(class_vocab):
    # number of classes
    classes_num = len(class_vocab)
    # define the model
    model = keras.Sequential([
        layers.Input(shape=(MAX_SEQUENCE_LENGTH, NUM_FEATURES)),
        layers.LSTM(64, return_sequences=True),
        layers.Flatten(),
        layers.Dense(classes_num, activation='softmax')
    ])
    # compile the model
    model.compile(optimizer=keras.optimizers.Adam(1e-5),
                  loss=keras.losses.SparseCategoricalCrossentropy(from_logits=False),
                  metrics=['accuracy'])
    return model

# instantiate the model
model = video_cls_model(np.unique(labels))
# save checkpoints
checkpoint = keras.callbacks.ModelCheckpoint(filepath='best.h5', monitor='val_loss', save_weights_only=True, save_best_only=True, verbose=1, mode='min')
# model structure
model.summary()

The network takes an input of shape (N, 40, 1280) and uses a softmax activation to output the probabilities of the 5 classes:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
lstm (LSTM)                  (None, 40, 64)            344320
_________________________________________________________________
flatten (Flatten)            (None, 2560)              0
_________________________________________________________________
dense (Dense)                (None, 5)                 12805
=================================================================
Total params: 357,125
Trainable params: 357,125
Non-trainable params: 0
_________________________________________________________________

Experiments show the model has essentially converged after 300 epochs of training:

history = model.fit(train_dataset,
                    epochs=300,
                    steps_per_epoch=train_count // batch_size,
                    validation_steps=test_count // batch_size,
                    validation_data=test_dataset,
                    callbacks=[checkpoint])

plt.plot(history.epoch, history.history['loss'], 'r', label='loss')
plt.plot(history.epoch, history.history['val_loss'], 'g--', label='val_loss')
plt.title('LSTM')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()

plt.plot(history.epoch, history.history['accuracy'], 'r', label='acc')
plt.plot(history.epoch, history.history['val_accuracy'], 'g--', label='val_acc')
plt.title('LSTM')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()

Loading the best weights, the model reaches a classification accuracy of 95.8%; we save it in saved_model format:

model.load_weights('best.h5')
model.evaluate(dataset.batch(batch_size))
model.save('saved_model')

96/96 [==============================] - 0s 5ms/step - loss: 0.2169 - accuracy: 0.9580
[0.21687692414949802, 0.9580328]

2. Model conversion

First convert the image feature extractor feature_extractor to tflite format, with post-training quantization enabled:

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model('feature_extractor')
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS, tf.lite.OpsSet.SELECT_TF_OPS]
converter.post_training_quantize = True  # quantize the model
tflite_model = converter.convert()
with open('mbv2.tflite', 'wb') as f:
    f.write(tflite_model)

Then convert the video-sequence classification model to onnx format; since the LSTM has few parameters, quantization is not needed:

python -m tf2onnx.convert --saved-model saved_model --output lstm.onnx --opset 12
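Before moving on to the RKNN export, it can be worth sanity-checking lstm.onnx against the original Keras model. The following is a minimal sketch of my own (not part of the original workflow), assuming onnxruntime is installed and that the exported graph has a single input and output; the actual input name is read from the session rather than hard-coded.

import numpy as np
import onnxruntime as ort
from tensorflow import keras

# random feature sequence with the shape the LSTM expects: (1, 40, 1280)
x = np.random.rand(1, 40, 1280).astype(np.float32)

# reference prediction from the saved Keras model
keras_model = keras.models.load_model('saved_model')
ref = keras_model.predict(x)

# prediction from the exported ONNX graph
sess = ort.InferenceSession('lstm.onnx')
input_name = sess.get_inputs()[0].name
out = sess.run(None, {input_name: x})[0]

# the two outputs should agree to within float tolerance
print('max abs diff:', np.abs(ref - out).max())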
Finally, export the models in RKNN format; target_platform can be set to rk3568/rk3588 as needed:

from rknn.api import RKNN

rknn = RKNN(verbose=False)
rknn.config(target_platform="rk3588")
rknn.load_tflite(model="mbv2.tflite")
rknn.build(do_quantization=False)
rknn.export_rknn('mbv2.rknn')
rknn.release()

rknn = RKNN(verbose=False)
rknn.config(target_platform="rk3588")
rknn.load_onnx(
    model="lstm.onnx",
    inputs=['input_3'],              # input node name
    input_size_list=[[1, 40, 1280]]  # fixed input size
)
rknn.build(do_quantization=False)
rknn.export_rknn('lstm.rknn')
rknn.release()

3. Model deployment

We deploy the MobileNetV2 and LSTM models on the RK3588. The on-device inference code is as follows:

import os
import cv2
import glob
import shutil
import imageio
import numpy as np
from IPython.display import Image
from rknnlite.api import RKNNLite

MAX_SEQUENCE_LENGTH = 40
IMG_SIZE = 256
NUM_FEATURES = 1280

def crop_driving_square(img):
    h, w = img.shape[:2]
    start_x = w // 2
    end_x = w
    start_y = 0
    end_y = h
    result = img[start_y:end_y, start_x:end_x]
    return result

def load_video(file_name):
    cap = cv2.VideoCapture(file_name)
    # sample one frame out of every frame_interval frames
    frame_interval = 6
    frames = []
    count = 0
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        # keep one frame every frame_interval frames
        if count % frame_interval == 0:
            # crop the driver-side half
            frame = crop_driving_square(frame)
            # resize
            frame = cv2.resize(frame, (IMG_SIZE, IMG_SIZE))
            # BGR -> RGB  [0,1,2] -> [2,1,0]
            frame = frame[:, :, [2, 1, 0]]
            frames.append(frame)
        count += 1
    cap.release()
    return np.array(frames).astype(np.uint8)

# extract the feature sequence of a video
def getVideoFeat(frames):
    frames_count = len(frames)
    # if the number of frames is less than MAX_SEQUENCE_LENGTH
    if frames_count < MAX_SEQUENCE_LENGTH:
        # pad with zeros
        diff = MAX_SEQUENCE_LENGTH - frames_count
        # create an all-zero numpy array
        padding = np.zeros((diff, IMG_SIZE, IMG_SIZE, 3))
        # concatenate the arrays
        frames = np.concatenate((frames, padding))
    # keep the first MAX_SEQUENCE_LENGTH frames
    frames = frames[:MAX_SEQUENCE_LENGTH, :]
    frames = frames.astype(np.float32)
    # extract features frame by frame
    feats = []
    for frame in frames:
        frame = np.expand_dims(frame, axis=0)
        result = rknn_lite_mbv2.inference(inputs=[frame])
        feats.append(result[0])
    return feats

rknn_lite_mbv2 = RKNNLite()
rknn_lite_lstm = RKNNLite()
rknn_lite_mbv2.load_rknn('model/mbv2.rknn')
rknn_lite_lstm.load_rknn('model/lstm.rknn')
rknn_lite_mbv2.init_runtime(core_mask=RKNNLite.NPU_CORE_0_1_2)
rknn_lite_lstm.init_runtime(core_mask=RKNNLite.NPU_CORE_0_1_2)

files = glob.glob("video/*.mp4")
for video_path in files:
    label_to_name = {0: 'normal driving', 1: 'eyes closed', 2: 'yawning', 3: 'phone call', 4: 'looking around'}
    frames = load_video(video_path)
    frames = frames[:MAX_SEQUENCE_LENGTH]
    imageio.mimsave('test.gif', frames, durations=10, loop=0)
    display(Image(open('test.gif', 'rb').read()))
    feats = getVideoFeat(frames)
    feats = np.concatenate(feats, axis=0)
    feats = np.expand_dims(feats, axis=0)
    preds = rknn_lite_lstm.inference(inputs=[feats])[0][0]
    for i in np.argsort(preds)[::-1][:5]:
        print('{}: {}%'.format(label_to_name[i], round(preds[i] * 100, 2)))

rknn_lite_mbv2.release()
rknn_lite_lstm.release()

The GIFs displayed by the loop above show the final video recognition results.

4. Summary

This article walked through the full pipeline of a CNN-LSTM driving-behavior recognition model on the RK3588 platform: MobileNetV2 extracts the spatial features of each frame and an LSTM models the temporal features of the video, enabling accurate recognition of 5 driving behaviors (normal driving, eyes closed, yawning, phone call, and looking around). The model trained on ModelArts reaches 95.8% classification accuracy, and mbv2.tflite and lstm.onnx are each converted to RKNN format for efficient on-device inference.
• [Technical Deep Dive] Understanding Parallel Image Processing in CUDA
In CUDA programming, a CUDA kernel is executed by a large number of threads, and these threads are organized into one or more blocks. Within a block, thread IDs are numbered consecutively starting from 0 and can be obtained through the built-in variable threadIdx:

// compute this thread's global index: blockIdx is the block index,
// blockDim is the block size, threadIdx is the thread index within the block
int tid = blockIdx.x * blockDim.x + threadIdx.x;

Take image normalization as an example: each of the three channel values of every pixel must be divided by 255. Instead of computing this serially on the CPU, we can launch a CUDA kernel with enough threads and blocks to exploit the GPU's parallel processing power:

// total number of threads needed (height x width): 640*640 = 409600
int jobs = dst_height * dst_width;
// 256 threads per block
int threads = 256;
// number of blocks (rounded up)
int blocks = ceil(jobs / (float)threads);
// launch the kernel
preprocess_kernel<<<blocks, threads>>>(
    img_buffer_device, dst, dst_width, dst_height, jobs); // kernel arguments

Here each block has 256 threads and the number of blocks is ceil(jobs / (float)threads), so the total number of threads is greater than or equal to the number of pixels in the image. When the kernel is launched, every thread on the GPU executes the same code, which gives us efficient parallel processing. The kernel is implemented as follows:

// one thread processes one pixel
__global__ void preprocess_kernel(
    uint8_t *src, float *dst, int dst_width, int dst_height, int edge)
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid >= edge) return;

    int dx = tid % dst_width; // x coordinate of the target pixel handled by this thread
    int dy = tid / dst_width; // y coordinate of the target pixel handled by this thread

    // normalization (normalize the 3 channel values of the pixel at (x, y) in the source image)
    float c0 = src[dy * dst_width * 3 + dx * 3 + 0] / 255.0f;
    float c1 = src[dy * dst_width * 3 + dx * 3 + 1] / 255.0f;
    float c2 = src[dy * dst_width * 3 + dx * 3 + 2] / 255.0f;

    // bgr to rgb
    float t = c2;
    c2 = c0;
    c0 = t;

    // rgbrgbrgb to rrrgggbbb
    // NHWC to NCHW
    int area = dst_width * dst_height;
    float *pdst_c0 = dst + dy * dst_width + dx;
    float *pdst_c1 = pdst_c0 + area;
    float *pdst_c2 = pdst_c1 + area;
    *pdst_c0 = c0;
    *pdst_c1 = c1;
    *pdst_c2 = c2;
}

Here tid is the thread's global index, dst_width and dst_height are the image width and height, and edge is the number of pixels; each thread processes one pixel. Because the block count is rounded up, the grid may contain more threads than there are pixels, so we must make sure tid does not exceed the pixel count edge:

int tid = blockDim.x * blockIdx.x + threadIdx.x;
if (tid >= edge) return;

The image data is stored contiguously in memory in row-major order, with 3 bytes per pixel (BGR). To find the memory offset of the pixel a thread is responsible for, we first compute that pixel's x and y coordinates, dx and dy:

int dx = tid % dst_width; // x coordinate of the target pixel handled by this thread
int dy = tid / dst_width; // y coordinate of the target pixel handled by this thread

The pixel's starting offset in memory is then dy * dst_width * 3 + dx * 3; the factor of 3 accounts for the 3 channel values per pixel, laid out in memory as BGRBGRBGR... Finally we divide by 255 to normalize the 3 channel values of the pixel at (x, y):

// normalization
float c0 = src[dy * dst_width * 3 + dx * 3 + 0] / 255.0f;
float c1 = src[dy * dst_width * 3 + dx * 3 + 1] / 255.0f;
float c2 = src[dy * dst_width * 3 + dx * 3 + 2] / 255.0f;

- dy * dst_width * 3: locates the start of row dy
- dx * 3: locates the start of pixel dx within that row
- + 0, + 1, + 2: access the B, G and R channel values respectively, each divided by 255

Swapping variables converts the channel order from BGR to RGB:

// bgr to rgb
float t = c2;
c2 = c0;
c0 = t;

In the destination (RGB) image the pixel values are laid out in memory as RRR...GGG...BBB. The R value of the current pixel is written to address (dst + dy * dst_width + dx), the G value to (dst + dy * dst_width + dx) + area, i.e. one channel plane (area) further on, and so forth, which completes the channel reordering:

// NHWC to NCHW
// rgbrgbrgb to rrrgggbbb
int area = dst_width * dst_height;
float *pdst_c0 = dst + dy * dst_width + dx;
float *pdst_c1 = pdst_c0 + area;
float *pdst_c2 = pdst_c1 + area;
*pdst_c0 = c0;
*pdst_c1 = c1;
*pdst_c2 = c2;
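As a quick way to check the kernel's output layout from the host side, the following is a minimal NumPy sketch of my own (not part of the original project) that performs the same normalization, BGR-to-RGB swap and HWC-to-CHW reordering; for the same input, its flattened result should match the dst buffer written by preprocess_kernel.

import numpy as np

def preprocess_reference(src_bgr_hwc: np.ndarray) -> np.ndarray:
    """CPU reference of the CUDA kernel: uint8 BGR HWC -> float32 RGB CHW in [0, 1]."""
    img = src_bgr_hwc.astype(np.float32) / 255.0          # normalization
    img = img[:, :, ::-1]                                 # BGR -> RGB
    return np.ascontiguousarray(img.transpose(2, 0, 1))   # HWC -> CHW (RRR...GGG...BBB)

# example: a random 640x640 BGR image, matching the launch-configuration example above
src = np.random.randint(0, 256, size=(640, 640, 3), dtype=np.uint8)
dst = preprocess_reference(src)
print(dst.shape, dst.dtype)  # (3, 640, 640) float32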
  • [技术干货] Ascend 310P部署Qwen2.5-VL-7B实现吸烟动作识别
    Ascend310部署Qwen-VL-7B实现吸烟动作识别OrangePi AI Studio Pro是基于2个昇腾310P处理器的新一代高性能推理解析卡,提供基础通用算力+超强AI算力,整合了训练和推理的全部底层软件栈,实现训推一体。其中AI半精度FP16算力约为176TFLOPS,整数Int8精度可达352TOPS,本文将带领大家在Ascend 310P上部署Qwen2.5-VL-7B多模态理解大模型实现吸烟动作的识别。一、环境配置我们在OrangePi AI Stuido上使用Docker容器部署MindIE:docker pull swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:2.1.RC1-300I-Duo-py311-openeuler24.03-ltsroot@orangepi:~# docker images REPOSITORY TAG IMAGE ID CREATED SIZE swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie 2.1.RC1-300I-Duo-py311-openeuler24.03-lts 0574b8d4403f 3 months ago 20.4GB langgenius/dify-web 1.0.1 b2b7363571c2 8 months ago 475MB langgenius/dify-api 1.0.1 3dd892f50a2d 8 months ago 2.14GB langgenius/dify-plugin-daemon 0.0.4-local 3f180f39bfbe 8 months ago 1.35GB ubuntu/squid latest dae40da440fe 8 months ago 243MB postgres 15-alpine afbf3abf6aeb 8 months ago 273MB nginx latest b52e0b094bc0 9 months ago 192MB swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie 1.0.0-300I-Duo-py311-openeuler24.03-lts 74a5b9615370 10 months ago 17.5GB redis 6-alpine 6dd588768b9b 10 months ago 30.2MB langgenius/dify-sandbox 0.2.10 4328059557e8 13 months ago 567MB semitechnologies/weaviate 1.19.0 8ec9f084ab23 2 years ago 52.5MB之后创建一个名为start-docker.sh的启动脚本,内容如下:NAME=$1 if [ $# -ne 1 ]; then echo "warning: need input container name.Use default: mindie" NAME=mindie fi docker run --name ${NAME} -it -d --net=host --shm-size=500g \ --privileged=true \ -w /usr/local/Ascend/atb-models \ --device=/dev/davinci_manager \ --device=/dev/hisi_hdc \ --device=/dev/devmm_svm \ --entrypoint=bash \ -v /models:/models \ -v /data:/data \ -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \ -v /usr/local/dcmi:/usr/local/dcmi \ -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \ -v /usr/local/sbin:/usr/local/sbin \ -v /home:/home \ -v /tmp:/tmp \ -v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime \ -e http_proxy=$http_proxy \ -e https_proxy=$https_proxy \ -e "PATH=/usr/local/python3.11.6/bin:$PATH" \ swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:2.1.RC1-300I-Duo-py311-openeuler24.03-ltsbash start-docker.sh启动容器后,我们需要替换几个文件并安装Ascend-cann-nnal软件包:root@orangepi:~# docker exec -it mindie bash Welcome to 5.15.0-126-generic System information as of time: Sat Nov 15 22:06:48 CST 2025 System load: 1.87 Memory used: 6.3% Swap used: 0.0% Usage On: 33% Users online: 0 [root@orangepi atb-models]# cd /usr/local/Ascend/ascend-toolkit/8.2.RC1/lib64/ [root@orangepi lib64]# ls /data/fix_openeuler_docker/fixhccl/8.2hccl/ libhccl.so libhccl_alg.so libhccl_heterog.so libhccl_plf.so [root@orangepi lib64]# cp /data/fix_openeuler_docker/fixhccl/8.2hccl/* ./ cp: overwrite './libhccl.so'? cp: overwrite './libhccl_alg.so'? cp: overwrite './libhccl_heterog.so'? cp: overwrite './libhccl_plf.so'? [root@orangepi lib64]# source /usr/local/Ascend/ascend-toolkit/set_env.sh [root@orangepi lib64]# chmod +x /data/fix_openeuler_docker/Ascend-cann-nnal/Ascend-cann-nnal_8.3.RC1_linux-x86_64.run [root@orangepi lib64]# /data/fix_openeuler_docker/Ascend-cann-nnal/Ascend-cann-nnal_8.3.RC1_linux-x86_64.run --install --quiet [NNAL] [20251115-22:41:45] [INFO] LogFile:/var/log/ascend_seclog/ascend_nnal_install.log [NNAL] [20251115-22:41:45] [INFO] Ascend-cann-atb_8.3.RC1_linux-x86_64.run --install --install-path=/usr/local/Ascend/nnal --install-for-all --quiet --nox11 start WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. 
It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv [NNAL] [20251115-22:41:58] [INFO] Ascend-cann-atb_8.3.RC1_linux-x86_64.run --install --install-path=/usr/local/Ascend/nnal --install-for-all --quiet --nox11 install success [NNAL] [20251115-22:41:58] [INFO] Ascend-cann-SIP_8.3.RC1_linux-x86_64.run --install --install-path=/usr/local/Ascend/nnal --install-for-all --quiet --nox11 start [NNAL] [20251115-22:41:59] [INFO] Ascend-cann-SIP_8.3.RC1_linux-x86_64.run --install --install-path=/usr/local/Ascend/nnal --install-for-all --quiet --nox11 install success [NNAL] [20251115-22:41:59] [INFO] Ascend-cann-nnal_8.3.RC1_linux-x86_64.run install success Warning!!! If the environment variables of atb and asdsip are set at the same time, unexpected consequences will occur. Import the corresponding environment variables based on the usage scenarios: atb for large model scenarios, asdsip for embedded scenarios. Please make sure that the environment variables have been configured. If you want to use atb module: - To take effect for current user, you can exec command below: source /usr/local/Ascend/nnal/atb/set_env.sh or add "source /usr/local/Ascend/nnal/atb/set_env.sh" to ~/.bashrc. If you want to use asdsip module: - To take effect for current user, you can exec command below: source /usr/local/Ascend/nnal/asdsip/set_env.sh or add "source /usr/local/Ascend/nnal/asdsip/set_env.sh" to ~/.bashrc. [root@orangepi lib64]# cat /usr/local/Ascend/nnal/atb/latest/version.info Ascend-cann-atb : 8.3.RC1 Ascend-cann-atb Version : 8.3.RC1.B106 Platform : x86_64 branch : 8.3.rc1-0702 commit id : 16004f23040e0dcdd3cf0c64ecf36622487038ba修改推理使用的逻辑NPU核心为0,1,测试多模态理解大模型:Qwen2.5-VL-7B-Instruct:运行结果表明,Qwen2.5-VL-7B-Instruct在2 x Ascned 310P上推理平均每秒可以输出20个tokens,同时准确理解画面中的人物信息和行为动作。[root@orangepi atb-models]# bash examples/models/qwen2_vl/run_pa.sh --model_path /models/Qwen2.5-VL-7B-Instruct/ --input_image /root/pic/test.jpg [2025-11-15 22:12:49,663] torch.distributed.run: [WARNING] [2025-11-15 22:12:49,663] torch.distributed.run: [WARNING] ***************************************** [2025-11-15 22:12:49,663] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. [2025-11-15 22:12:49,663] torch.distributed.run: [WARNING] ***************************************** /usr/local/lib64/python3.11/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libc10_cuda.so: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source? warn( /usr/local/lib64/python3.11/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libc10_cuda.so: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source? 
warn( 2025-11-15 22:12:53.250 7934 LLM log default format: [yyyy-mm-dd hh:mm:ss.uuuuuu] [processid] [threadid] [llmmodels] [loglevel] [file:line] [status code] msg 2025-11-15 22:12:53.250 7933 LLM log default format: [yyyy-mm-dd hh:mm:ss.uuuuuu] [processid] [threadid] [llmmodels] [loglevel] [file:line] [status code] msg [2025-11-15 22:12:53.250] [7934] [139886327420160] [llmmodels] [WARN] [model_factory.cpp:28] deepseekV2_DecoderModel model already exists, but the duplication doesn't matter. [2025-11-15 22:12:53.250] [7933] [139649439929600] [llmmodels] [WARN] [model_factory.cpp:28] deepseekV2_DecoderModel model already exists, but the duplication doesn't matter. [2025-11-15 22:12:53.250] [7934] [139886327420160] [llmmodels] [WARN] [model_factory.cpp:28] deepseekV2_DecoderModel model already exists, but the duplication doesn't matter. [2025-11-15 22:12:53.250] [7933] [139649439929600] [llmmodels] [WARN] [model_factory.cpp:28] deepseekV2_DecoderModel model already exists, but the duplication doesn't matter. [2025-11-15 22:12:53.250] [7934] [139886327420160] [llmmodels] [WARN] [model_factory.cpp:28] llama_LlamaDecoderModel model already exists, but the duplication doesn't matter. [2025-11-15 22:12:53.250] [7933] [139649439929600] [llmmodels] [WARN] [model_factory.cpp:28] llama_LlamaDecoderModel model already exists, but the duplication doesn't matter. [2025-11-15 22:12:55,335] [7934] [139886327420160] [llmmodels] [INFO] [cpu_binding.py-254] : rank_id: 1, device_id: 1, numa_id: 0, shard_devices: [0, 1], cpus: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15] [2025-11-15 22:12:55,336] [7934] [139886327420160] [llmmodels] [INFO] [cpu_binding.py-280] : process 7934, new_affinity is [8, 9, 10, 11, 12, 13, 14, 15], cpu count 8 [2025-11-15 22:12:55,356] [7933] [139649439929600] [llmmodels] [INFO] [cpu_binding.py-254] : rank_id: 0, device_id: 0, numa_id: 0, shard_devices: [0, 1], cpus: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15] [2025-11-15 22:12:55,357] [7933] [139649439929600] [llmmodels] [INFO] [cpu_binding.py-280] : process 7933, new_affinity is [0, 1, 2, 3, 4, 5, 6, 7], cpu count 8 [2025-11-15 22:12:56,032] [7933] [139649439929600] [llmmodels] [INFO] [model_runner.py-156] : model_runner.quantize: None, model_runner.kv_quant_type: None, model_runner.fa_quant_type: None, model_runner.dtype: torch.float16 [2025-11-15 22:13:01,826] [7933] [139649439929600] [llmmodels] [INFO] [dist.py-81] : initialize_distributed has been Set [2025-11-15 22:13:01,827] [7933] [139649439929600] [llmmodels] [INFO] [model_runner.py-187] : init tokenizer done Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. [2025-11-15 22:13:02,070] [7934] [139886327420160] [llmmodels] [INFO] [dist.py-81] : initialize_distributed has been Set Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. [W InferFormat.cpp:62] Warning: Cannot create tensor with NZ format while dim < 2, tensor will be created with ND format. 
(function operator()) [W InferFormat.cpp:62] Warning: Cannot create tensor with NZ format while dim < 2, tensor will be created with ND format. (function operator()) [2025-11-15 22:13:08,435] [7933] [139649439929600] [llmmodels] [INFO] [flash_causal_qwen2.py-153] : >>>> qwen_QwenDecoderModel is called. [2025-11-15 22:13:08,526] [7934] [139886327420160] [llmmodels] [INFO] [flash_causal_qwen2.py-153] : >>>> qwen_QwenDecoderModel is called. [2025-11-15 22:13:16.666] [7933] [139649439929600] [llmmodels] [WARN] [operation_factory.cpp:42] OperationName: TransdataOperation not find in operation factory map [2025-11-15 22:13:16.698] [7934] [139886327420160] [llmmodels] [WARN] [operation_factory.cpp:42] OperationName: TransdataOperation not find in operation factory map [2025-11-15 22:13:22,379] [7933] [139649439929600] [llmmodels] [INFO] [model_runner.py-282] : model: FlashQwen2vlForCausalLM( (rotary_embedding): PositionRotaryEmbedding() (attn_mask): AttentionMask() (vision_tower): Qwen25VisionTransformerPretrainedModelATB( (encoder): Qwen25VLVisionEncoderATB( (layers): ModuleList( (0-31): 32 x Qwen25VLVisionLayerATB( (attn): VisionAttention( (qkv): TensorParallelColumnLinear( (linear): FastLinear() ) (proj): TensorParallelRowLinear( (linear): FastLinear() ) ) (mlp): VisionMlp( (gate_up_proj): TensorParallelColumnLinear( (linear): FastLinear() ) (down_proj): TensorParallelRowLinear( (linear): FastLinear() ) ) (norm1): BaseRMSNorm() (norm2): BaseRMSNorm() ) ) (patch_embed): FastPatchEmbed( (proj): TensorReplicatedLinear( (linear): FastLinear() ) ) (patch_merger): PatchMerger( (patch_merger_mlp_0): TensorParallelColumnLinear( (linear): FastLinear() ) (patch_merger_mlp_2): TensorParallelRowLinear( (linear): FastLinear() ) (patch_merger_ln_q): BaseRMSNorm() ) ) (rotary_pos_emb): VisionRotaryEmbedding() ) (language_model): FlashQwen2UsingMROPEForCausalLM( (rotary_embedding): PositionRotaryEmbedding() (attn_mask): AttentionMask() (transformer): FlashQwenModel( (wte): TensorEmbeddingWithoutChecking() (h): ModuleList( (0-27): 28 x FlashQwenLayer( (attn): FlashQwenAttention( (rotary_emb): PositionRotaryEmbedding() (c_attn): TensorParallelColumnLinear( (linear): FastLinear() ) (c_proj): TensorParallelRowLinear( (linear): FastLinear() ) ) (mlp): QwenMLP( (act): SiLU() (w2_w1): TensorParallelColumnLinear( (linear): FastLinear() ) (c_proj): TensorParallelRowLinear( (linear): FastLinear() ) ) (ln_1): QwenRMSNorm() (ln_2): QwenRMSNorm() ) ) (ln_f): QwenRMSNorm() ) (lm_head): TensorParallelHead( (linear): FastLinear() ) ) ) [2025-11-15 22:13:24,268] [7933] [139649439929600] [llmmodels] [INFO] [run_pa.py-134] : hbm_capacity(GB): 87.5078125, init_memory(GB): 11.376015624962747 [2025-11-15 22:13:24,789] [7933] [139649439929600] [llmmodels] [INFO] [run_pa.py-342] : pa_runner: PARunner(model_path=/models/Qwen2.5-VL-7B-Instruct/, input_text=请用超过500个字详细说明图片的内容,并仔细判断画面中的人物是否有吸烟动作。, max_position_embeddings=None, max_input_length=16384, max_output_length=1024, max_prefill_tokens=-1, load_tokenizer=True, enable_atb_torch=False, max_prefill_batch_size=None, max_batch_size=1, dtype=torch.float16, block_size=128, model_config=ModelConfig(num_heads=14, num_kv_heads=2, num_kv_heads_origin=4, head_size=128, k_head_size=128, v_head_size=128, num_layers=28, device=npu:0, dtype=torch.float16, soc_info=NPUSocInfo(soc_name='', soc_version=200, need_nz=True, matmul_nd_nz=False), kv_quant_type=None, fa_quant_type=None, mapping=Mapping(world_size=2, rank=0, num_nodes=1,pp_rank=0, pp_groups=[[0], [1]], micro_batch_size=1, 
attn_dp_groups=[[0], [1]], attn_tp_groups=[[0, 1]], attn_inner_sp_groups=[[0], [1]], attn_cp_groups=[[0], [1]], attn_o_proj_tp_groups=[[0], [1]], mlp_tp_groups=[[0, 1]], moe_ep_groups=[[0], [1]], moe_tp_groups=[[0, 1]]), cla_share_factor=1, model_type=qwen2_5_vl, enable_nz=False), max_memory=93960798208, [2025-11-15 22:13:24,794] [7933] [139649439929600] [llmmodels] [INFO] [run_pa.py-122] : ---------------Begin warm_up--------------- [2025-11-15 22:13:24,794] [7933] [139649439929600] [llmmodels] [INFO] [cache.py-154] : kv cache will allocate 0.46484375GB memory [2025-11-15 22:13:24,821] [7934] [139886327420160] [llmmodels] [INFO] [cache.py-154] : kv cache will allocate 0.46484375GB memory [2025-11-15 22:13:24,827] [7933] [139649439929600] [llmmodels] [INFO] [generate.py-1139] : ------total req num: 1, infer start-------- [2025-11-15 22:13:26,002] [7934] [139886327420160] [llmmodels] [INFO] [flash_causal_qwen2.py-680] : <<<<<<<after transdata k_caches[0].shape=torch.Size([136, 16, 128, 16]) [2025-11-15 22:13:26,023] [7933] [139649439929600] [llmmodels] [INFO] [flash_causal_qwen2.py-676] : <<<<<<< ori k_caches[0].shape=torch.Size([136, 16, 128, 16]) [2025-11-15 22:13:26,023] [7933] [139649439929600] [llmmodels] [INFO] [flash_causal_qwen2.py-680] : <<<<<<<after transdata k_caches[0].shape=torch.Size([136, 16, 128, 16]) [2025-11-15 22:13:26,024] [7933] [139649439929600] [llmmodels] [INFO] [flash_causal_qwen2.py-705] : >>>>>>id of kcache is 139645634198608 id of vcache is 139645634198320 [2025-11-15 22:13:34,363] [7933] [139649439929600] [llmmodels] [INFO] [generate.py-1294] : Prefill time: 9476.590633392334ms, Prefill average time: 9476.590633392334ms, Decode token time: 54.94809150695801ms, E2E time: 9531.538724899292ms [2025-11-15 22:13:34,363] [7934] [139886327420160] [llmmodels] [INFO] [generate.py-1294] : Prefill time: 9452.020645141602ms, Prefill average time: 9452.020645141602ms, Decode token time: 54.654598236083984ms, E2E time: 9506.675243377686ms [2025-11-15 22:13:34,366] [7933] [139649439929600] [llmmodels] [INFO] [generate.py-1326] : -------------------performance dumped------------------------ [2025-11-15 22:13:34,371] [7933] [139649439929600] [llmmodels] [INFO] [generate.py-1329] : | batch_size | input_seq_len | output_seq_len | e2e_time(ms) | prefill_time(ms) | decoder_token_time(ms) | prefill_count | prefill_average_time(ms) | |-------------:|----------------:|-----------------:|---------------:|-------------------:|-------------------------:|----------------:|---------------------------:| | 1 | 16384 | 2 | 9531.54 | 9476.59 | 54.95 | 1 | 9476.59 | /usr/local/lib64/python3.11/site-packages/torchvision/transforms/functional.py:1603: UserWarning: The default value of the antialias parameter of all the resizing transforms (Resize(), RandomResizedCrop(), etc.) will change from None to True in v0.17, in order to be consistent across the PIL and Tensor backends. To suppress this warning, directly pass antialias=True (recommended, future default), antialias=None (current default, which means False for Tensors and True for PIL), or antialias=False (only works on Tensors - PIL will still use antialiasing). This also applies if you are using the inference transforms from the models weights: update the call to weights.transforms(antialias=True). 
warnings.warn( [2025-11-15 22:13:35,307] [7933] [139649439929600] [llmmodels] [INFO] [run_pa.py-148] : warmup_memory(GB): 15.75 [2025-11-15 22:13:35,307] [7933] [139649439929600] [llmmodels] [INFO] [run_pa.py-153] : ---------------End warm_up--------------- /usr/local/lib64/python3.11/site-packages/torchvision/transforms/functional.py:1603: UserWarning: The default value of the antialias parameter of all the resizing transforms (Resize(), RandomResizedCrop(), etc.) will change from None to True in v0.17, in order to be consistent across the PIL and Tensor backends. To suppress this warning, directly pass antialias=True (recommended, future default), antialias=None (current default, which means False for Tensors and True for PIL), or antialias=False (only works on Tensors - PIL will still use antialiasing). This also applies if you are using the inference transforms from the models weights: update the call to weights.transforms(antialias=True). warnings.warn( [2025-11-15 22:13:35,363] [7933] [139649439929600] [llmmodels] [INFO] [generate.py-1139] : ------total req num: 1, infer start-------- [2025-11-15 22:13:50,021] [7933] [139649439929600] [llmmodels] [INFO] [generate.py-1294] : Prefill time: 1004.0028095245361ms, Prefill average time: 1004.0028095245361ms, Decode token time: 13.301290491575836ms, E2E time: 14611.222982406616ms [2025-11-15 22:13:50,021] [7934] [139886327420160] [llmmodels] [INFO] [generate.py-1294] : Prefill time: 1067.9974555969238ms, Prefill average time: 1067.9974555969238ms, Decode token time: 13.300292536193908ms, E2E time: 14674.196720123291ms [2025-11-15 22:13:50,025] [7933] [139649439929600] [llmmodels] [INFO] [generate.py-1326] : -------------------performance dumped------------------------ [2025-11-15 22:13:50,028] [7933] [139649439929600] [llmmodels] [INFO] [generate.py-1329] : | batch_size | input_seq_len | output_seq_len | e2e_time(ms) | prefill_time(ms) | decoder_token_time(ms) | prefill_count | prefill_average_time(ms) | |-------------:|----------------:|-----------------:|---------------:|-------------------:|-------------------------:|----------------:|---------------------------:| | 1 | 1675 | 1024 | 14611.2 | 1004 | 13.3 | 1 | 1004 | [2025-11-15 22:13:50,035] [7933] [139649439929600] [llmmodels] [INFO] [run_pa.py-385] : Question[0]: [{'image': '/root/pic/test.jpg'}, {'text': '请用超过500个字详细说明图片的内容,并仔细判断画面中的人物是否有吸烟动作。'}] [2025-11-15 22:13:50,035] [7933] [139649439929600] [llmmodels] [INFO] [run_pa.py-386] : Answer[0]: 这张图片展示了一个无人机航拍的场景,画面中可以看到两名工人站在一个雪地或冰面上。他们穿着橙色的安全背心和红色的安全帽,显得非常醒目。背景中可以看到一些雪地和一些金属结构,可能是桥梁或工业设施的一部分。 从图片的细节来看,画面右侧的工人右手放在嘴边,似乎在吸烟。他的姿势和动作与吸烟者的典型姿势相符。然而,由于图片的分辨率和角度限制,无法完全确定这个动作是否真实发生。如果要准确判断,可能需要更多的视频片段或更清晰的图像。 从无人机航拍的角度来看,这个场景可能是在进行某种工业或建筑项目的检查或监控。两名工人可能正在进行现场检查或讨论工作事宜。雪地和金属结构表明这可能是一个寒冷的冬季,或者是一个寒冷的气候区域。 无人机航拍技术在工业和建筑领域中非常常见,因为它可以提供高空视角,帮助工程师和管理人员更好地了解现场情况。这种技术不仅可以节省时间和成本,还可以提高工作效率和安全性。在进行航拍时,确保遵守当地的法律法规和安全规定是非常重要的。 总的来说,这张图片展示了一个无人机航拍的场景,画面中两名工人站在雪地上,其中一人似乎在吸烟。虽然无法完全确定这个动作是否真实发生,但根据他们的姿势和动作,可以合理推测这个动作的存在。 [2025-11-15 22:13:50,035] [7933] [139649439929600] [llmmodels] [INFO] [run_pa.py-387] : Generate[0] token num: 282 [2025-11-15 22:13:50,035] [7933] [139649439929600] [llmmodels] [INFO] [run_pa.py-389] : Latency(s): 14.721353530883789 [2025-11-15 22:13:50,035] [7933] [139649439929600] [llmmodels] [INFO] [run_pa.py-390] : Throughput(tokens/s): 19.15584728050956 本文详细介绍了在OrangePi AI Studio上使用Docker容器部署MindIE环境并运行Qwen2.5-VL-7B-Instruct多模态大模型实现吸烟动作识别的完整过程,验证了在Ascned 310p设备上运行多模态理解大模型的可靠性。
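根据上面 run_pa.py 日志中给出的数字,可以用一小段脚本粗略核对吞吐量的计算方式(以下仅为示意,数值直接取自上文日志):

# 依据上文日志中的数值粗略核对吞吐量(tokens/s)
gen_tokens = 282                   # Generate[0] token num
latency_s = 14.721353530883789     # Latency(s)
decode_ms_per_token = 13.3         # 日志中的 Decode token time(每个解码 token 的平均耗时,ms)

throughput = gen_tokens / latency_s
print(f"端到端吞吐量约 {throughput:.2f} tokens/s")                     # ≈ 19.16,与日志一致
print(f"仅解码阶段的理论上限约 {1000 / decode_ms_per_token:.1f} tokens/s")  # 不含 prefill 耗时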
  • [技术干货] 如何在Jetson上将YOLOv5实时检测速度提升至120+FPS
    如何在Jetson上将YOLOv5实时检测速度提升至120+FPS这个项目提供了基于 Pybind11 的 TensorRT YOLOv5 插件 Python 绑定,实现了令人难以置信的实时目标检测性能!⚡ 超100FPS性能: 在 Jetson Orin Nano 上轻松实现超过 120 帧/秒的检测速度🎯 高精度检测: 基于成熟的 YOLOv5 架构,准确识别COCO数据集上的80类目标🔌 即插即用: 简单的 Python 接口,无需复杂的配置🛠️ 工业级优化: 采用 TensorRT 进行模型优化和加速1. Building the plugin首先安装必要的库克隆仓库构建项目,注意JetPack 5.x版本才能正常运行:sudo apt update sudo apt install ffmpeg sudo apt install pybind11-dev git clone https://github.com/HouYanSong/yolov5_trt_pybind11.git cd yolov5_trt_pybind11 pip install pybind11 rm -fr build cmake -S . -B build cmake --build build2. Model quantization生成量化图片对YOLOv5s模型进行Int8量化,保存量化后的模型:./media/gen_calib.sh ./build/build weights/yolov5s.onnx 1 ./media/ ./media/filelist.txt weights/yolov5s.engine[11/06/2025-11:57:36] [I] [TRT] [MemUsageChange] Init CUDA: CPU +221, GPU +0, now: CPU 249, GPU 4229 (MiB) [11/06/2025-11:57:39] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +302, GPU +277, now: CPU 574, GPU 4529 (MiB) [11/06/2025-11:57:39] [I] [TRT] ---------------------------------------------------------------- [11/06/2025-11:57:39] [I] [TRT] Input filename: weights/yolov5s.onnx [11/06/2025-11:57:39] [I] [TRT] ONNX IR version: 0.0.7 [11/06/2025-11:57:39] [I] [TRT] Opset version: 12 [11/06/2025-11:57:39] [I] [TRT] Producer name: [11/06/2025-11:57:39] [I] [TRT] Producer version: [11/06/2025-11:57:39] [I] [TRT] Domain: [11/06/2025-11:57:39] [I] [TRT] Model version: 0 [11/06/2025-11:57:39] [I] [TRT] Doc string: [11/06/2025-11:57:39] [I] [TRT] ---------------------------------------------------------------- [11/06/2025-11:57:39] [I] [TRT] No importer registered for op: YoloLayer_TRT. Attempting to import as plugin. [11/06/2025-11:57:39] [I] [TRT] Searching for plugin: YoloLayer_TRT, plugin_version: 1, plugin_namespace: [11/06/2025-11:57:39] [I] [TRT] Successfully created plugin: YoloLayer_TRT [11/06/2025-11:57:39] [I] sample0001.png [11/06/2025-11:57:39] [I] sample0002.png [11/06/2025-11:57:39] [I] sample0003.png [11/06/2025-11:57:39] [I] sample0004.png [11/06/2025-11:57:39] [I] sample0005.png [11/06/2025-11:57:39] [I] sample0006.png [11/06/2025-11:57:39] [I] sample0007.png [11/06/2025-11:57:39] [I] sample0008.png [11/06/2025-11:57:39] [I] sample0009.png [11/06/2025-11:57:39] [I] sample0010.png [11/06/2025-11:57:39] [I] sample0011.png [11/06/2025-11:57:39] [I] sample0012.png [11/06/2025-11:57:39] [I] sample0013.png [11/06/2025-11:57:39] [I] sample0014.png [11/06/2025-11:57:39] [I] sample0015.png [11/06/2025-11:57:39] [I] sample0016.png [11/06/2025-11:57:39] [I] sample0017.png [11/06/2025-11:57:39] [I] sample0018.png [11/06/2025-11:57:39] [I] sample0019.png [11/06/2025-11:57:39] [I] sample0020.png [11/06/2025-11:57:39] [I] sample0021.png [11/06/2025-11:57:39] [I] sample0022.png [11/06/2025-11:57:39] [I] sample0023.png [11/06/2025-11:57:39] [I] sample0024.png [11/06/2025-11:57:39] [I] sample0025.png [11/06/2025-11:57:39] [I] sample0026.png [11/06/2025-11:57:39] [I] sample0027.png [11/06/2025-11:57:39] [I] sample0028.png [11/06/2025-11:57:39] [I] sample0029.png [11/06/2025-11:57:39] [I] sample0030.png [11/06/2025-11:57:39] [I] sample0031.png [11/06/2025-11:57:39] [I] sample0032.png [11/06/2025-11:57:39] [I] sample0033.png [11/06/2025-11:57:39] [I] sample0034.png [11/06/2025-11:57:39] [I] sample0035.png [11/06/2025-11:57:39] [I] sample0036.png [11/06/2025-11:57:39] [I] sample0037.png [11/06/2025-11:57:39] [I] sample0038.png [11/06/2025-11:57:39] [I] sample0039.png [11/06/2025-11:57:39] [I] sample0040.png [11/06/2025-11:57:39] [I] sample0041.png [11/06/2025-11:57:39] [I] 
sample0042.png [11/06/2025-11:57:39] [I] sample0043.png [11/06/2025-11:57:39] [I] sample0044.png [11/06/2025-11:57:39] [I] sample0045.png [11/06/2025-11:57:39] [I] sample0046.png [11/06/2025-11:57:39] [I] sample0047.png [11/06/2025-11:57:39] [I] sample0048.png [11/06/2025-11:57:39] [I] sample0049.png [11/06/2025-11:57:39] [I] sample0050.png [11/06/2025-11:57:39] [I] sample0051.png [11/06/2025-11:57:39] [I] sample0052.png [11/06/2025-11:57:39] [I] sample0053.png [11/06/2025-11:57:39] [I] sample0054.png [11/06/2025-11:57:39] [I] sample0055.png [11/06/2025-11:57:39] [I] sample0056.png [11/06/2025-11:57:39] [I] sample0057.png [11/06/2025-11:57:39] [I] sample0058.png [11/06/2025-11:57:39] [I] sample0059.png [11/06/2025-11:57:39] [I] sample0060.png [11/06/2025-11:57:39] [I] sample0061.png [11/06/2025-11:57:39] [I] sample0062.png [11/06/2025-11:57:39] [I] sample0063.png [11/06/2025-11:57:39] [I] sample0064.png [11/06/2025-11:57:39] [I] sample0065.png [11/06/2025-11:57:39] [I] sample0066.png [11/06/2025-11:57:39] [I] sample0067.png [11/06/2025-11:57:39] [I] sample0068.png [11/06/2025-11:57:39] [I] sample0069.png [11/06/2025-11:57:39] [I] sample0070.png [11/06/2025-11:57:39] [I] sample0071.png [11/06/2025-11:57:39] [I] sample0072.png [11/06/2025-11:57:39] [I] sample0073.png [11/06/2025-11:57:39] [I] sample0074.png [11/06/2025-11:57:39] [I] sample0075.png [11/06/2025-11:57:39] [I] sample0076.png [11/06/2025-11:57:39] [I] sample0077.png [11/06/2025-11:57:39] [I] sample0078.png [11/06/2025-11:57:39] [I] sample0079.png [11/06/2025-11:57:39] [I] sample0080.png [11/06/2025-11:57:39] [I] sample0081.png [11/06/2025-11:57:39] [I] sample0082.png [11/06/2025-11:57:39] [I] sample0083.png [11/06/2025-11:57:39] [I] sample0084.png [11/06/2025-11:57:39] [I] sample0085.png [11/06/2025-11:57:39] [I] sample0086.png [11/06/2025-11:57:39] [I] sample0087.png [11/06/2025-11:57:39] [I] sample0088.png [11/06/2025-11:57:39] [I] sample0089.png [11/06/2025-11:57:39] [I] sample0090.png [11/06/2025-11:57:39] [I] sample0091.png [11/06/2025-11:57:39] [I] sample0092.png [11/06/2025-11:57:39] [I] sample0093.png [11/06/2025-11:57:39] [I] sample0094.png [11/06/2025-11:57:39] [I] sample0095.png [11/06/2025-11:57:39] [I] sample0096.png [11/06/2025-11:57:39] [I] sample0097.png [11/06/2025-11:57:39] [I] sample0098.png [11/06/2025-11:57:39] [I] sample0099.png [11/06/2025-11:57:39] [I] sample0100.png [11/06/2025-11:57:39] [I] sample0101.png [11/06/2025-11:57:39] [I] sample0102.png [11/06/2025-11:57:39] [I] sample0103.png [11/06/2025-11:57:39] [I] sample0104.png [11/06/2025-11:57:39] [I] sample0105.png [11/06/2025-11:57:39] [I] sample0106.png [11/06/2025-11:57:39] [I] sample0107.png [11/06/2025-11:57:39] [I] sample0108.png [11/06/2025-11:57:39] [I] sample0109.png [11/06/2025-11:57:39] [I] sample0110.png [11/06/2025-11:57:39] [I] sample0111.png [11/06/2025-11:57:39] [I] sample0112.png [11/06/2025-11:57:39] [I] sample0113.png [11/06/2025-11:57:39] [I] sample0114.png [11/06/2025-11:57:39] [I] sample0115.png [11/06/2025-11:57:39] [I] sample0116.png [11/06/2025-11:57:39] [I] sample0117.png [11/06/2025-11:57:39] [I] sample0118.png [11/06/2025-11:57:39] [I] sample0119.png [11/06/2025-11:57:39] [I] sample0120.png [11/06/2025-11:57:39] [I] sample0121.png [11/06/2025-11:57:39] [I] sample0122.png [11/06/2025-11:57:39] [I] sample0123.png [11/06/2025-11:57:39] [I] sample0124.png [11/06/2025-11:57:39] [I] sample0125.png [11/06/2025-11:57:39] [I] sample0126.png [11/06/2025-11:57:39] [I] sample0127.png [11/06/2025-11:57:39] [I] sample0128.png 
[11/06/2025-11:57:39] [I] sample0129.png [11/06/2025-11:57:39] [I] sample0130.png [11/06/2025-11:57:39] [I] sample0131.png [11/06/2025-11:57:39] [I] sample0132.png [11/06/2025-11:57:39] [I] sample0133.png [11/06/2025-11:57:39] [I] sample0134.png [11/06/2025-11:57:39] [I] sample0135.png [11/06/2025-11:57:39] [I] sample0136.png [11/06/2025-11:57:39] [I] sample0137.png [11/06/2025-11:57:39] [I] sample0138.png [11/06/2025-11:57:39] [I] sample0139.png [11/06/2025-11:57:39] [I] sample0140.png [11/06/2025-11:57:39] [I] sample0141.png [11/06/2025-11:57:39] [I] sample0142.png [11/06/2025-11:57:39] [I] sample0143.png [11/06/2025-11:57:39] [I] sample0144.png [11/06/2025-11:57:39] [I] sample0145.png CalibrationDataReader: 145 images, 145 batches. [11/06/2025-11:57:39] [I] [TRT] Reading Calibration Cache for calibrator: MinMaxCalibration [11/06/2025-11:57:39] [I] [TRT] Generated calibration scales using calibration cache. Make sure that calibration cache has latest scales. [11/06/2025-11:57:39] [I] [TRT] To regenerate calibration cache, please delete the existing one. TensorRT will generate a new calibration cache. [11/06/2025-11:57:39] [W] [TRT] Missing scale and zero-point for tensor DecodeNumDetection, expect fall back to non-int8 implementation for any layer consuming or producing given tensor [11/06/2025-11:57:39] [W] [TRT] Missing scale and zero-point for tensor DecodeDetectionClasses, expect fall back to non-int8 implementation for any layer consuming or producing given tensor [11/06/2025-11:57:39] [I] [TRT] ---------- Layers Running on DLA ---------- [11/06/2025-11:57:39] [I] [TRT] ---------- Layers Running on GPU ---------- [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.0/conv/Conv + PWN(PWN(/model.0/act/Sigmoid), /model.0/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.1/conv/Conv + PWN(PWN(/model.1/act/Sigmoid), /model.1/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.2/cv1/conv/Conv + PWN(PWN(/model.2/cv1/act/Sigmoid), /model.2/cv1/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.2/cv2/conv/Conv + PWN(PWN(/model.2/cv2/act/Sigmoid), /model.2/cv2/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.2/m/m.0/cv1/conv/Conv + PWN(PWN(/model.2/m/m.0/cv1/act/Sigmoid), /model.2/m/m.0/cv1/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.2/m/m.0/cv2/conv/Conv [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] POINTWISE: PWN(PWN(PWN(/model.2/m/m.0/cv2/act/Sigmoid), /model.2/m/m.0/cv2/act/Mul), /model.2/m/m.0/Add) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.2/cv3/conv/Conv + PWN(PWN(/model.2/cv3/act/Sigmoid), /model.2/cv3/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.3/conv/Conv + PWN(PWN(/model.3/act/Sigmoid), /model.3/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.4/cv1/conv/Conv + PWN(PWN(/model.4/cv1/act/Sigmoid), /model.4/cv1/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.4/cv2/conv/Conv + PWN(PWN(/model.4/cv2/act/Sigmoid), /model.4/cv2/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.4/m/m.0/cv1/conv/Conv + PWN(PWN(/model.4/m/m.0/cv1/act/Sigmoid), /model.4/m/m.0/cv1/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.4/m/m.0/cv2/conv/Conv [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] POINTWISE: PWN(PWN(PWN(/model.4/m/m.0/cv2/act/Sigmoid), /model.4/m/m.0/cv2/act/Mul), /model.4/m/m.0/Add) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: 
/model.4/m/m.1/cv1/conv/Conv + PWN(PWN(/model.4/m/m.1/cv1/act/Sigmoid), /model.4/m/m.1/cv1/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.4/m/m.1/cv2/conv/Conv [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] POINTWISE: PWN(PWN(PWN(/model.4/m/m.1/cv2/act/Sigmoid), /model.4/m/m.1/cv2/act/Mul), /model.4/m/m.1/Add) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.4/cv3/conv/Conv + PWN(PWN(/model.4/cv3/act/Sigmoid), /model.4/cv3/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.5/conv/Conv + PWN(PWN(/model.5/act/Sigmoid), /model.5/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.6/cv1/conv/Conv + PWN(PWN(/model.6/cv1/act/Sigmoid), /model.6/cv1/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.6/cv2/conv/Conv + PWN(PWN(/model.6/cv2/act/Sigmoid), /model.6/cv2/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.6/m/m.0/cv1/conv/Conv + PWN(PWN(/model.6/m/m.0/cv1/act/Sigmoid), /model.6/m/m.0/cv1/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.6/m/m.0/cv2/conv/Conv [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] POINTWISE: PWN(PWN(PWN(/model.6/m/m.0/cv2/act/Sigmoid), /model.6/m/m.0/cv2/act/Mul), /model.6/m/m.0/Add) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.6/m/m.1/cv1/conv/Conv + PWN(PWN(/model.6/m/m.1/cv1/act/Sigmoid), /model.6/m/m.1/cv1/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.6/m/m.1/cv2/conv/Conv [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] POINTWISE: PWN(PWN(PWN(/model.6/m/m.1/cv2/act/Sigmoid), /model.6/m/m.1/cv2/act/Mul), /model.6/m/m.1/Add) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.6/m/m.2/cv1/conv/Conv + PWN(PWN(/model.6/m/m.2/cv1/act/Sigmoid), /model.6/m/m.2/cv1/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.6/m/m.2/cv2/conv/Conv [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] POINTWISE: PWN(PWN(PWN(/model.6/m/m.2/cv2/act/Sigmoid), /model.6/m/m.2/cv2/act/Mul), /model.6/m/m.2/Add) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.6/cv3/conv/Conv + PWN(PWN(/model.6/cv3/act/Sigmoid), /model.6/cv3/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.7/conv/Conv + PWN(PWN(/model.7/act/Sigmoid), /model.7/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.8/cv1/conv/Conv + PWN(PWN(/model.8/cv1/act/Sigmoid), /model.8/cv1/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.8/cv2/conv/Conv + PWN(PWN(/model.8/cv2/act/Sigmoid), /model.8/cv2/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.8/m/m.0/cv1/conv/Conv + PWN(PWN(/model.8/m/m.0/cv1/act/Sigmoid), /model.8/m/m.0/cv1/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.8/m/m.0/cv2/conv/Conv [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] POINTWISE: PWN(PWN(PWN(/model.8/m/m.0/cv2/act/Sigmoid), /model.8/m/m.0/cv2/act/Mul), /model.8/m/m.0/Add) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.8/cv3/conv/Conv + PWN(PWN(/model.8/cv3/act/Sigmoid), /model.8/cv3/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.9/cv1/conv/Conv + PWN(PWN(/model.9/cv1/act/Sigmoid), /model.9/cv1/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] POOLING: /model.9/m/MaxPool [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] POOLING: /model.9/m_1/MaxPool [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] POOLING: /model.9/m_2/MaxPool [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] COPY: /model.9/cv1/act/Mul_output_0 copy 
[11/06/2025-11:57:39] [I] [TRT] [GpuLayer] COPY: /model.9/m/MaxPool_output_0 copy [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] COPY: /model.9/m_1/MaxPool_output_0 copy [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.9/cv2/conv/Conv + PWN(PWN(/model.9/cv2/act/Sigmoid), /model.9/cv2/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.10/conv/Conv + PWN(PWN(/model.10/act/Sigmoid), /model.10/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] RESIZE: /model.11/Resize [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] COPY: /model.11/Resize_output_0 copy [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.13/cv1/conv/Conv + PWN(PWN(/model.13/cv1/act/Sigmoid), /model.13/cv1/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.13/cv2/conv/Conv + PWN(PWN(/model.13/cv2/act/Sigmoid), /model.13/cv2/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.13/m/m.0/cv1/conv/Conv + PWN(PWN(/model.13/m/m.0/cv1/act/Sigmoid), /model.13/m/m.0/cv1/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.13/m/m.0/cv2/conv/Conv + PWN(PWN(/model.13/m/m.0/cv2/act/Sigmoid), /model.13/m/m.0/cv2/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.13/cv3/conv/Conv + PWN(PWN(/model.13/cv3/act/Sigmoid), /model.13/cv3/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.14/conv/Conv + PWN(PWN(/model.14/act/Sigmoid), /model.14/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] RESIZE: /model.15/Resize [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] COPY: /model.15/Resize_output_0 copy [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] COPY: /model.4/cv3/act/Mul_output_0 copy [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.17/cv1/conv/Conv + PWN(PWN(/model.17/cv1/act/Sigmoid), /model.17/cv1/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.17/cv2/conv/Conv + PWN(PWN(/model.17/cv2/act/Sigmoid), /model.17/cv2/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.17/m/m.0/cv1/conv/Conv + PWN(PWN(/model.17/m/m.0/cv1/act/Sigmoid), /model.17/m/m.0/cv1/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.17/m/m.0/cv2/conv/Conv + PWN(PWN(/model.17/m/m.0/cv2/act/Sigmoid), /model.17/m/m.0/cv2/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.17/cv3/conv/Conv + PWN(PWN(/model.17/cv3/act/Sigmoid), /model.17/cv3/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.18/conv/Conv + PWN(PWN(/model.18/act/Sigmoid), /model.18/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.24/m.0/Conv + PWN(/model.24/Sigmoid) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] COPY: /model.14/act/Mul_output_0 copy [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.20/cv1/conv/Conv + PWN(PWN(/model.20/cv1/act/Sigmoid), /model.20/cv1/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.20/cv2/conv/Conv + PWN(PWN(/model.20/cv2/act/Sigmoid), /model.20/cv2/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.20/m/m.0/cv1/conv/Conv + PWN(PWN(/model.20/m/m.0/cv1/act/Sigmoid), /model.20/m/m.0/cv1/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.20/m/m.0/cv2/conv/Conv + PWN(PWN(/model.20/m/m.0/cv2/act/Sigmoid), /model.20/m/m.0/cv2/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.20/cv3/conv/Conv + PWN(PWN(/model.20/cv3/act/Sigmoid), /model.20/cv3/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.21/conv/Conv + 
PWN(PWN(/model.21/act/Sigmoid), /model.21/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.24/m.1/Conv + PWN(/model.24/Sigmoid_1) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] COPY: /model.10/act/Mul_output_0 copy [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.23/cv1/conv/Conv + PWN(PWN(/model.23/cv1/act/Sigmoid), /model.23/cv1/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.23/cv2/conv/Conv + PWN(PWN(/model.23/cv2/act/Sigmoid), /model.23/cv2/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.23/m/m.0/cv1/conv/Conv + PWN(PWN(/model.23/m/m.0/cv1/act/Sigmoid), /model.23/m/m.0/cv1/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.23/m/m.0/cv2/conv/Conv + PWN(PWN(/model.23/m/m.0/cv2/act/Sigmoid), /model.23/m/m.0/cv2/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.23/cv3/conv/Conv + PWN(PWN(/model.23/cv3/act/Sigmoid), /model.23/cv3/act/Mul) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] CONVOLUTION: /model.24/m.2/Conv + PWN(/model.24/Sigmoid_2) [11/06/2025-11:57:39] [I] [TRT] [GpuLayer] PLUGIN_V2: YoloLayer [11/06/2025-11:57:40] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +534, GPU +689, now: CPU 1137, GPU 5200 (MiB) [11/06/2025-11:57:41] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +83, GPU +132, now: CPU 1220, GPU 5332 (MiB) [11/06/2025-11:57:41] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored. [11/06/2025-12:00:45] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size will enable more tactics, please check verbose output for requested sizes. [11/06/2025-12:01:03] [I] [TRT] Total Activation Memory: 1115794944 [11/06/2025-12:01:03] [I] [TRT] Detected 1 inputs and 4 output network tensors. [11/06/2025-12:01:03] [I] [TRT] Total Host Persistent Memory: 175984 [11/06/2025-12:01:03] [I] [TRT] Total Device Persistent Memory: 614912 [11/06/2025-12:01:03] [I] [TRT] Total Scratch Memory: 0 [11/06/2025-12:01:03] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 7 MiB, GPU 553 MiB [11/06/2025-12:01:03] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 67 steps to complete. [11/06/2025-12:01:03] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 2.77161ms to assign 6 blocks to 67 nodes requiring 10925056 bytes. [11/06/2025-12:01:03] [I] [TRT] Total Activation Memory: 10925056 [11/06/2025-12:01:04] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1557, GPU 5945 (MiB) [11/06/2025-12:01:04] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 1557, GPU 5945 (MiB) [11/06/2025-12:01:04] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +7, GPU +8, now: CPU 7, GPU 8 (MiB) Engine build success! 
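量化构建时通过 ./media/filelist.txt 告诉构建程序校准图片的文件名列表(即上面日志中列出的 sample0001.png ~ sample0145.png)。如果需要自行重新生成这份列表,可以参考下面的小脚本(仅为示意,假设 filelist.txt 的格式是每行一个文件名):

import glob
import os

# 收集 media 目录下的校准图片文件名,写入 filelist.txt(假设每行一个文件名)
media_dir = "./media"
images = sorted(glob.glob(os.path.join(media_dir, "sample*.png")))

with open(os.path.join(media_dir, "filelist.txt"), "w") as f:
    for path in images:
        f.write(os.path.basename(path) + "\n")

print(f"共写入 {len(images)} 张校准图片")  # 上文日志显示为 145 张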
Python call example以下是一个简单Python示例调用C++生成的动态链接库,仅需指定模型文件的路径和视频输入的大小,就能返回视频每一帧的检测结果,并且在视频推理过程中可以动态调整置信度和交并比等参数的阈值。import cv2 import time import ctypes ctypes.CDLL("./build/libyolo_plugin.so", mode=ctypes.RTLD_GLOBAL) ctypes.CDLL("./build/libyolo_utils.so", mode=ctypes.RTLD_GLOBAL) from build import yolov5_trt def draw_detections(image, detections, fps): for detection in detections: class_id = detection['class_id'] x1, y1, x2, y2 = detection['bbox'] confidence = detection['confidence'] cv2.rectangle(image, (x1, y1), (x2, y2), (0x27, 0xC1, 0x36), 2) cv2.putText(image, f"{class_id}:{confidence:.2f}", (x1, y1 - 10), cv2.FONT_HERSHEY_PLAIN, 1.2, (0x27, 0xC1, 0x36), 2) cv2.putText(image, f"FPS: {fps:.2f}", (10, 30), cv2.FONT_HERSHEY_PLAIN, 1.5, (0, 0, 255), 2) return image def main(input_path, output_path): cap = cv2.VideoCapture(input_path) fps = int(cap.get(cv2.CAP_PROP_FPS)) width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)) height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)) detector = yolov5_trt.YOLOv5Detector("./weights/yolov5s.engine", width, height) writer = cv2.VideoWriter(output_path, cv2.VideoWriter_fourcc(*'MJPG'), fps, (width, height)) fps_list = [] frame_count = 0 total_time = 0.0 while cap.isOpened(): ret, frame = cap.read() if not ret: break start_time = time.time() detections = detector.detect(input_image=frame, input_w=640, input_h=640, conf_thresh=0.45, nms_thresh=0.55) process_time = time.time() - start_time current_fps = 1.0 / process_time if process_time > 0 else 0 frame_count += 1 total_time += process_time fps_list.append(current_fps) image = draw_detections(frame, detections, current_fps) writer.write(image) cap.release() writer.release() if frame_count > 0: avg_fps = frame_count / total_time if total_time > 0 else 0 print(f"Processed {frame_count} frames") print(f"Average FPS: {avg_fps:.2f}") print(f"Min FPS: {min(fps_list):.2f}") print(f"Max FPS: {max(fps_list):.2f}") if __name__ == "__main__": input_video = "./media/sample_720p.mp4" output_video = "./result.avi" main(input_video, output_video) 对应的C++推理代码如下:#include "NvInfer.h" #include "logger.h" #include "common.h" #include "buffers.h" #include "utils/preprocess.h" #include "utils/postprocess.h" #include "utils/types.h" #include "utils/utils.h" #include <pybind11/pybind11.h> #include <pybind11/numpy.h> #include <pybind11/stl.h> #include <memory> #include <mutex> namespace py = pybind11; // 将numpy数组转换为cv::Mat cv::Mat numpy_to_mat(py::array_t<unsigned char>& input) { py::buffer_info buf_info = input.request(); if (buf_info.ndim == 3) { // 彩色图像 int height = buf_info.shape[0]; int width = buf_info.shape[1]; int channels = buf_info.shape[2]; cv::Mat mat(height, width, CV_8UC3, (unsigned char*)buf_info.ptr); return mat.clone(); } else if (buf_info.ndim == 2) { // 灰度图像 int height = buf_info.shape[0]; int width = buf_info.shape[1]; cv::Mat mat(height, width, CV_8UC1, (unsigned char*)buf_info.ptr); return mat.clone(); } throw std::runtime_error("Unsupported array dimensions"); } // 将cv::Mat转换为numpy数组 py::array_t<unsigned char> mat_to_numpy(cv::Mat& mat) { if (mat.empty()) { return py::array_t<unsigned char>(); } if (mat.channels() == 1) { // 灰度图像 auto result = py::array_t<unsigned char>({mat.rows, mat.cols}); auto buf = result.request(); memcpy(buf.ptr, mat.data, sizeof(unsigned char) * mat.total()); return result; } else { // 彩色图像 auto result = py::array_t<unsigned char>({mat.rows, mat.cols, mat.channels()}); auto buf = result.request(); memcpy(buf.ptr, mat.data, sizeof(unsigned char) * mat.total() * mat.channels()); return result; 
} } // 加载模型文件 std::vector<unsigned char> load_engine_file(const std::string &file_name) { std::vector<unsigned char> engine_data; std::ifstream engine_file(file_name, std::ios::binary); assert(engine_file.is_open() && "Unable to load engine file."); engine_file.seekg(0, engine_file.end); int length = engine_file.tellg(); engine_data.resize(length); engine_file.seekg(0, engine_file.beg); engine_file.read(reinterpret_cast<char *>(engine_data.data()), length); return engine_data; } // YOLOv5推理器类 class YOLOv5Detector { private: std::unique_ptr<nvinfer1::IRuntime> runtime; std::shared_ptr<nvinfer1::ICudaEngine> engine; std::unique_ptr<nvinfer1::IExecutionContext> context; std::unique_ptr<samplesCommon::BufferManager> buffers; bool initialized = false; public: YOLOv5Detector(const std::string& engine_file, int frame_width, int frame_height) { initialize(engine_file); int img_size = frame_width * frame_height; cuda_preprocess_init(img_size); // 申请cuda内存 } void initialize(const std::string& engine_file) { // ========== 1. 创建推理运行时runtime ========== runtime = std::unique_ptr<nvinfer1::IRuntime>(nvinfer1::createInferRuntime(sample::gLogger.getTRTLogger())); if (!runtime) { throw std::runtime_error("Failed to create TensorRT runtime"); } // ========== 2. 反序列化生成engine ========== auto plan = load_engine_file(engine_file); engine = std::shared_ptr<nvinfer1::ICudaEngine>(runtime->deserializeCudaEngine(plan.data(), plan.size())); if (!engine) { throw std::runtime_error("Failed to deserialize engine"); } // ========== 3. 创建执行上下文context ========== context = std::unique_ptr<nvinfer1::IExecutionContext>(engine->createExecutionContext()); if (!context) { throw std::runtime_error("Failed to create execution context"); } // ========== 4. 创建输入输出缓冲区 ========== buffers = std::make_unique<samplesCommon::BufferManager>(engine); initialized = true; } py::list detect(py::array_t<unsigned char>& input_image, int input_w=kInputW, int input_h=kInputH, float conf_thresh=kConfThresh, float nms_thresh=kNmsThresh) { if (!initialized) { throw std::runtime_error("Detector not initialized"); } // 将numpy数组转换为cv::Mat cv::Mat frame = numpy_to_mat(input_image); if (frame.empty()) { throw std::runtime_error("Invalid input image"); } // CUDA预处理 process_input_gpu(frame, (float *)buffers->getDeviceBuffer(kInputTensorName), input_w, input_h); // ========== 5. 
执行推理 ========== context->executeV2(buffers->getDeviceBindings().data()); // 拷贝回host buffers->copyOutputToHost(); // 从buffer manager中获取模型输出 int32_t *num_det = (int32_t *)buffers->getHostBuffer(kOutNumDet); int32_t *cls = (int32_t *)buffers->getHostBuffer(kOutDetCls); float *conf = (float *)buffers->getHostBuffer(kOutDetScores); float *bbox = (float *)buffers->getHostBuffer(kOutDetBBoxes); // 执行nms(非极大值抑制) std::vector<Detection> bboxs; yolo_nms(bboxs, num_det, cls, conf, bbox, conf_thresh, nms_thresh); // 返回检测结果 py::list result_list; for (size_t j = 0; j < bboxs.size(); j++) { cv::Rect r = get_rect(frame, bboxs[j].bbox, input_w, input_h); py::dict detection; detection["class_id"] = (int)bboxs[j].class_id; detection["confidence"] = (float)bboxs[j].conf; detection["bbox"] = py::cast(std::vector<int>{r.x, r.y, r.x + r.width, r.y + r.height}); result_list.append(detection); } return result_list; } }; // Python绑定代码 PYBIND11_MODULE(yolov5_trt, m) { m.doc() = "YOLOv5 TensorRT Python bindings"; py::class_<YOLOv5Detector>(m, "YOLOv5Detector") .def(py::init<const std::string&, int, int>(), "Initialize detector with engine file", py::arg("engine_file"), py::arg("frame_width"), py::arg("frame_height")) .def("detect", &YOLOv5Detector::detect, "Perform detection on input image", py::arg("input_image"), py::arg("input_w") = kInputW, py::arg("input_h") = kInputH, py::arg("conf_thresh") = kConfThresh, py::arg("nms_thresh") = kNmsThresh); } 实际在Jetson Orin Nano (8GB)上对720P输入大小的视频进行目标检测,平均帧率稳定在120+ FPS,满足工业场景下对实时性的要求。python yolov5_infer.py[11/06/2025-15:23:26] [I] [TRT] Loaded engine size: 7 MiB Deserialize yoloLayer plugin: YoloLayer [11/06/2025-15:23:28] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +536, GPU +955, now: CPU 830, GPU 4470 (MiB) [11/06/2025-15:23:28] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +83, GPU +149, now: CPU 913, GPU 4619 (MiB) [11/06/2025-15:23:28] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +7, now: CPU 0, GPU 7 (MiB) [11/06/2025-15:23:28] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 913, GPU 4620 (MiB) [11/06/2025-15:23:28] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +3, now: CPU 913, GPU 4623 (MiB) [11/06/2025-15:23:28] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +11, now: CPU 0, GPU 18 (MiB) Processed 1442 frames Average FPS: 127.51 Min FPS: 75.75 Max FPS: 134.67 Conclusion Remarks最后我们还提供了ByteTrack跟踪算法的Python绑定,基于Pybind11实现,并在原有算法基础上提供了跟踪目标的类别信息,Jetson Orin Nano在此基础上也能实现高达83 FPS的实时目标检测和跟踪性能:ByteTrack-Pybind11: 高性能实时目标跟踪解决方案 🚀
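    上文提到在视频推理过程中可以动态调整置信度和交并比阈值,下面给出一个最小的单张图片调用示意(测试图片路径为假设值,接口与参数沿用上文的yolov5_trt绑定),对比两组阈值下的检测框数量,仅作参考:
import cv2
import ctypes

# 与上文一致:先以 RTLD_GLOBAL 方式加载依赖的插件库,再导入绑定模块
ctypes.CDLL("./build/libyolo_plugin.so", mode=ctypes.RTLD_GLOBAL)
ctypes.CDLL("./build/libyolo_utils.so", mode=ctypes.RTLD_GLOBAL)
from build import yolov5_trt

image = cv2.imread("./media/sample.jpg")  # 假设的测试图片路径
h, w = image.shape[:2]
detector = yolov5_trt.YOLOv5Detector("./weights/yolov5s.engine", w, h)

# 对同一帧图像分别使用两组阈值,体现阈值可以在推理过程中动态调整
for conf_thresh, nms_thresh in [(0.25, 0.45), (0.45, 0.55)]:
    detections = detector.detect(input_image=image, input_w=640, input_h=640,
                                 conf_thresh=conf_thresh, nms_thresh=nms_thresh)
    print(f"conf={conf_thresh}, nms={nms_thresh}, 检测框数量={len(detections)}")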
  • [技术干货] 视觉多模态模型切分检测和边缘推理
    视觉多模态模型切分检测和边缘推理一、NanoOWL + SAHINanoOWL(边缘实时开放词汇目标检测模型)基于Vision Transformer架构,结合CLIP的图文对齐能力,可通过文本查询在图像中检测任意类别目标。我们可以结合SAHI框架使用“文本提示+图像切分”在Jetson Orin等嵌入式设备上实现低空目标的多模态检测和TensorRT推理。二、代码实现首先我们拉取官方代码仓库https://github.com/dusty-nv/jetson-containers并运行安装命令install.sh:git clone https://github.com/dusty-nv/jetson-containers bash jetson-containers/install.sh之后使用jetson-containers run和autotag命令自动提取并构建兼容的容器:jetson-containers run --workdir /opt/nanoowl $(autotag nanoowl) 我们在终端中查看容器的ID并将容器中nanoowl拷贝到自定义目录下/home/vsuav/workspace/vit:sudo docker ps -asudo docker cp af063d738879:/opt/nanoowl /home/vsuav/workspace/vit将/home/vsuav/workspace/vit/nanoowl目录及其内部所有文件和子目录的所有者和所属组都设置为当前登录用户,从而确保我们可以正常访问和修改这些文件。sudo chown -R $(whoami):$(whoami) /home/vsuav/workspace/vit/nanoowl运行jetson-containers run命令并指定--workdir参数,将nanoowl目录挂载到容器中,设置容器的名称为NanoOWL:jetson-containers run -v /home/vsuav/workspace/vit/nanoowl:/opt/nanoowl --name NanoOWL --workdir /opt/nanoowl $(autotag nanoowl) 运行docker start NanoOWL启动容器并进入容器的终端:sudo docker start NanoOWL sudo docker exec -it NanoOWL bash 我们在容器内部使用pip3安装python库sahi并创建main.pypip3 install sahi -i https://pypi.tuna.tsinghua.edu.cn/simple我们在examples/owl_predict.py基础上添加SAHI切分检测的逻辑,将无人机拍摄的高清大图切分为640x640和1280x1280的子图并叠加原图和类别名称的编码信息送入模型进行预测,最后把推理结果映射到原图上使用GreedyNMM进行合并后处理,完整代码如下:import os import cv2 import PIL.Image import numpy as np from sahi.slicing import get_slice_bboxes from nanoowl.owl_predictor import OwlPredictor from sahi.postprocess.utils import ObjectPrediction from sahi.postprocess.combine import GreedyNMMPostprocess class OWL_SAHI: def __init__(self, model_path, label_list, OBJ_THRESH, NMS_THRESH, overlap_ratio, slice, slice_scales): self.model_path = model_path if label_list != []: self.label_list = label_list else: self.label_list = [""] self.OBJ_THRESH = OBJ_THRESH self.NMS_THRESH = NMS_THRESH self.overlap_ratio = overlap_ratio self.slice = slice self.slice_scales = slice_scales self.predictor = OwlPredictor( "google/owlvit-base-patch32", image_encoder_engine = self.model_path ) self.text_encodings = self.predictor.encode_text(self.label_list) self.postprocess = GreedyNMMPostprocess( match_threshold = self.NMS_THRESH, match_metric = "IOS", class_agnostic = False, ) def getImageSlices(self, image): img_height, img_width = image.shape[:2] slice_bboxes = [] if self.slice: for slice_scale in self.slice_scales: slice_bboxe = get_slice_bboxes( image_height = img_height, image_width = img_width, auto_slice_resolution = True, slice_height = slice_scale[1], slice_width = slice_scale[0], overlap_height_ratio = self.overlap_ratio, overlap_width_ratio = self.overlap_ratio, ) slice_bboxes.extend(slice_bboxe) slice_bboxes.append([0, 0, img_width, img_height]) else: slice_bboxes = [[0, 0, img_width, img_height]] img_batch = [] for bbox in slice_bboxes: l, t, r, b = bbox img_batch.append(image[t:b, l:r]) return img_batch, slice_bboxes def predict(self, image_path): image = cv2.imread(image_path) image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB) img_batch, slice_bboxes = self.getImageSlices(image) all_boxes = [] for img, slice_bbox in zip(img_batch, slice_bboxes): img_pil = PIL.Image.fromarray(img) output = self.predictor.predict( image = img_pil, text = self.label_list, text_encodings = self.text_encodings, threshold = self.OBJ_THRESH, pad_square = True ) boxes = output.boxes.cpu().numpy() if boxes.shape[0] > 0: boxes[:, 0] = boxes[:, 0] + slice_bbox[0] boxes[:, 1] = boxes[:, 1] + slice_bbox[1] boxes[:, 2] = boxes[:, 2] + slice_bbox[0] boxes[:, 3] = boxes[:, 3] + slice_bbox[1] boxes = 
boxes.astype(np.int32).tolist() for i in range(len(boxes)): obj_item = ObjectPrediction( bbox = boxes[i], score = float(output.scores[i]), category_id = int(output.labels[i]) ) all_boxes.append(obj_item) if len(all_boxes) > 0: all_boxes = self.postprocess(all_boxes) return all_boxes if __name__ == "__main__": model_path = "./data/owl_image_encoder_patch32.engine" label_list = ["car", "tower"] OBJ_THRESH = 0.1 NMS_THRESH = 0.5 overlap_ratio = 0.25 slice = True slice_scales = [[640, 640], [1280, 1280]] owl_sahi = OWL_SAHI(model_path, label_list, OBJ_THRESH, NMS_THRESH, overlap_ratio, slice, slice_scales) images = os.listdir("images") for image_file in images: print(image_file) image_path = os.path.join("images", image_file) all_boxes_processed = owl_sahi.predict(image_path) image = cv2.imread(image_path) for box in all_boxes_processed: xmin, ymin, xmax, ymax = box.bbox.to_xyxy() score = box.score.value clsse = box.category.id cv2.rectangle(image, (xmin, ymin), (xmax, ymax), (0, 255, 0), 4) cv2.putText(image, '{0} {1:.2f}'.format(label_list[clsse], score), (xmin, ymin - 10), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 255), 4, cv2.LINE_AA) cv2.imwrite(f"output/{image_file}", image) 三、小结本文介绍了基于NanoOWL和SAHI框架的视觉多模态模型切分检测方案,通过结合文本提示与图像切分技术,在Jetson Orin等边缘设备上实现开放词汇目标检测。该方案利用Vision Transformer架构和CLIP图文对齐能力,支持任意类别目标检测,并通过SAHI进行图像切片处理与后处理合并,提升检测精度与效率,适用于低空无人机目标检测等边缘推理场景。
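    在调整切分参数时,可以先单独验证切分配置。下面是一个最小示意(4000x3000的图像尺寸为假设值,参数与上文getImageSlices中的调用保持一致),用于估算每张大图实际需要推理的子图数量:
# 查看一张大图在两种切分尺度下产生的子图数量(图像尺寸为示例假设)
from sahi.slicing import get_slice_bboxes

img_width, img_height = 4000, 3000
total = 0
for slice_w, slice_h in [(640, 640), (1280, 1280)]:
    bboxes = get_slice_bboxes(
        image_height=img_height,
        image_width=img_width,
        auto_slice_resolution=True,
        slice_height=slice_h,
        slice_width=slice_w,
        overlap_height_ratio=0.25,
        overlap_width_ratio=0.25,
    )
    total += len(bboxes)
    print(f"{slice_w}x{slice_h}: {len(bboxes)} 个子图")
print(f"加上原图,单张大图共需推理 {total + 1} 次")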
  • [技术干货] 【朝推夜训】松材线虫病高清图片切分检测
    【朝推夜训】松材线虫病高清图片切分检测我们以 Jetson Orin Nano 为例,介绍如何使用Python在资源受限的嵌入式设备上实现高清大图切分检测。一、模型导出1. 安装 Cmake,创建 24G swap 空间模型导出时依赖更高版本的Cmake,这里我们直接编译安装:sudo apt update sudo apt install libssl-dev git clone -b v3.25.1 https://github.com/Kitware/CMake.git cd CMake ./bootstrap && make && sudo make install cmake --version交换空间是操作系统用来拓展可用内存的一种机制,可以在内存不足的情况下继续运行,避免程序崩溃或者系统卡死,但是交换空间的访问速度远低于物理内存!禁用Jetson设备上的ZRAM交换配置:ZRAM会将内存页面压缩并存储在内存中,以减少对磁盘的依赖。sudo systemctl disable nvzramconfig使用fallocate创建一个24GB大小的文件,位于/var/24GB.swap路径。sudo fallocate -l 24G /var/24GB.swap设置交换空间格式sudo mkswap /var/24GB.swap启用交换空间sudo swapon /var/24GB.swap永久自启交换空间echo "/var/24GB.swap none swap sw 0 0" | sudo tee -a /etc/fstab重启系统后,系统交换空间增加至24GB:sudo reboot 2. 安装 ultralytics,创建 TensorRT 软连接pip install ultralytics pip install tqdm pandas pip install onnx==1.12.0 onnxslim==0.1.65 protobuf==3.20.1 pip install onnx-simplifier==0.3.10 pip install /home/vsuav/Downloads/onnxruntime_gpu-1.12.1-cp38-cp38-linux_aarch64.whl导出TensorRT模型时依赖其Python安装包,一般在系统Python目录下,以Jetson Orin Nano为例,我们可以建立软连接指向TensorRT安装路径:sudo ln -s /usr/lib/python3.8/dist-packages/tensorrt* /home/vsuav/miniconda3/envs/py38/lib/python3.8/site-packages3. 导出 TensorRT FP16 精度的模型from ultralytics import YOLO model = YOLO("yolov8-1_640x640_amd64_fp32.pt") model.export( format="engine", workspace=4, imgsz=640, half=True, device=0, batch=1 ) 二、切分检测1. 安装 SAHI 库pip install sahi==0.11.18这里我们使用0.11.18版本,修改/home/vsuav/miniconda3/envs/py38/lib/python3.8/site-packages/sahi/models/yolov8.py文件,注释掉第33行代码使其能够加载导出的Engine模型。class Yolov8DetectionModel(DetectionModel): def check_dependencies(self) -> None: check_requirements(["ultralytics"]) def load_model(self): """ Detection model is initialized and set to self.model. """ from ultralytics import YOLO try: model = YOLO(self.model_path) # model.to(self.device) self.set_model(model) except Exception as e: raise TypeError("model_path is not a valid yolov8 model path: ", e) 2. 运行检测代码加载模型import cv2 import numpy as np import matplotlib.pyplot as plt from sahi import AutoDetectionModel from sahi.predict import get_sliced_prediction detection_model = AutoDetectionModel.from_pretrained( model_type='yolov8', model_path="yolov8-1_640x640_amd64_fp16.engine", confidence_threshold=0.45 ) WARNING ⚠️ Unable to automatically guess model task, assuming 'task=detect'. Explicitly define task for your model, i.e. 'task=detect', 'segment', 'classify','pose' or 'obb'. Loading yolov8-1_640x640_amd64_fp16.engine for TensorRT inference... [09/04/2025-11:27:15] [TRT] [I] Loaded engine size: 51 MiB [09/04/2025-11:27:17] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +616, GPU +757, now: CPU 1052, GPU 5537 (MiB) [09/04/2025-11:27:17] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +49, now: CPU 0, GPU 49 (MiB) [09/04/2025-11:27:17] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +29, now: CPU 1001, GPU 5519 (MiB) [09/04/2025-11:27:18] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +27, now: CPU 0, GPU 76 (MiB) 图片切分检测image_path = "28c56c0b-3ff7-4997-88b3-5a8330f7ea88.jpeg" result = get_sliced_prediction( image_path, detection_model, slice_height = 640, slice_width = 640, overlap_height_ratio = 0.2, overlap_width_ratio = 0.2, perform_standard_pred = True, postprocess_class_agnostic = True, postprocess_match_threshold = 0.55, ) Performing prediction on 6 slices. Loading yolov8-1_640x640_amd64_fp16.engine for TensorRT inference... 
[09/04/2025-11:27:20] [TRT] [I] The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value. [09/04/2025-11:27:20] [TRT] [I] Loaded engine size: 51 MiB [09/04/2025-11:27:20] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +32, now: CPU 1581, GPU 6577 (MiB) [09/04/2025-11:27:20] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +50, now: CPU 0, GPU 126 (MiB) [09/04/2025-11:27:20] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +6, now: CPU 1530, GPU 6537 (MiB) [09/04/2025-11:27:20] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +1, GPU +27, now: CPU 1, GPU 153 (MiB) 导出检测结果result.export_visuals(export_dir="output/", file_name="sliced_result") result_img_split = cv2.imread("output/sliced_result.png") plt.imshow(result_img_split[:, :, ::-1]) plt.axis('off') plt.show() 三、小结该方案成功在Jetson Orin Nano设备上运行,能够有效处理高清大图的松材线虫病检测任务,在保证检测精度的同时充分利用了嵌入式设备的硬件资源,为林业病虫害防治提供了实用的技术方案。
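    除了使用export_visuals导出可视化结果,也可以直接遍历sahi返回的object_prediction_list自行绘制。下面是一个续接上文result变量的最小示意(边框属性的访问方式与前文NanoOWL示例一致,颜色、线宽和输出路径为假设值):
import cv2

# 续接上文:image_path 与 result 来自前面的切分检测代码
image = cv2.imread(image_path)
for pred in result.object_prediction_list:
    x1, y1, x2, y2 = map(int, pred.bbox.to_xyxy())   # 预测框坐标
    score = pred.score.value                         # 置信度
    class_id = pred.category.id                      # 类别编号
    cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)
    cv2.putText(image, f"{class_id} {score:.2f}", (x1, y1 - 10),
                cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 0, 255), 2)
cv2.imwrite("output/custom_result.jpg", image)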
  • [技术干货] C++ TensorRT YOLOv8-SAHI 高性能部署指南
    C++ TensorRT YOLOv8-SAHI 高性能部署指南项目介绍本项目将介绍如何在Jetson等嵌入式设备上实现YOLOv8-SAHI的高性能部署,特别是使用Int8引擎的优化方案。在Jetson Orin Nano (8GB)设备上,图像切片和批量推理的测试时间消耗小于0.05秒,对1080p视频进行切分检测和bytetrack跟踪性能接近15FPS。# 代码仓库: https://github.com/HouYanSong/tensorrtx-yolov8-sahi导出 YOLOv8 Int8 量化模型我们固定输入图像尺寸1440x1080进行切分,其中每张切分子图的大小为640x640,重叠度>20%,加上原始图像一次推理对8张图像进行检测,导出Int8量化后BatchSize=8的模型。从yolov8.pt生成yolov8s.wts权重文件pip install ultralytics python gen_wts.py从yolov8s.wts导出yolov8s.engine引擎文件,BatchSize大小为8sudo apt install libeigen3-dev rm -fr build cmake -S . -B build cmake --build build cd build ./yolov8_sahi -s ../weights/yolov8s.wts ../weights/yolov8s.engine s模型的参数配置模型的配置文件为include/config.h,这里我们使用yolov8s官方预训练模型,模型的输入大小为640x640,总共有80个类别,并且设置模型的kBatchSize = 8,一次最多可以推理8张图像,指定量化图片的路径导出Int8量化后的模型。#ifndef CONFIG_H #define CONFIG_H // #define USE_FP16 #define USE_INT8 #include <string> #include <vector> const static char *kInputTensorName = "images"; const static char *kOutputTensorName = "output"; const static int kNumClass = 80; const static int kBatchSize = 8; const static int kGpuId = 0; const static int kInputH = 640; const static int kInputW = 640; const static float kNmsThresh = 0.55f; const static float kConfThresh = 0.45f; const static int kMaxInputImageSize = 3000 * 3000; const static int kMaxNumOutputBbox = 1000; const std::string trtFile = "../weights/yolov8s.engine"; const std::string cacheFile = "./int8calib.table"; const std::string calibrationDataPath = "../images/"; // 存放用于 int8 量化校准的图像 const std::vector<std::string> vClassNames { "person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat", "traffic light", "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat", "dog", "horse", "sheep", "cow", "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella", "handbag", "tie", "suitcase", "frisbee", "skis", "snowboard", "sports ball", "kite", "baseball bat", "baseball glove", "skateboard", "surfboard", "tennis racket", "bottle", "wine glass", "cup", "fork", "knife", "spoon", "bowl", "banana", "apple", "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut", "cake", "chair", "couch", "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse", "remote", "keyboard", "cell phone", "microwave", "oven", "toaster", "sink", "refrigerator", "book", "clock", "vase", "scissors", "teddy bear", "hair drier", "toothbrush" }; #endif // CONFIG_H YOLOv8-SAHI 切分检测为了验证量化后模型精度以及Batch推理的性能,这里我们使用Int8量化后的模型直接对量化图片进行切分检测,推理命令如下:cd build ./yolov8_sahi -d ../weights/yolov8s.engine ../images/在Jetson Orin Nano (8GB)上使用Int8引擎的YOLOv8-SAHI性能表现如下:sample0102.png YOLOv8-SAHI: 1775ms sample0206.png YOLOv8-SAHI: 46ms sample0121.png YOLOv8-SAHI: 44ms sample0058.png YOLOv8-SAHI: 44ms sample0070.png YOLOv8-SAHI: 44ms sample0324.png YOLOv8-SAHI: 43ms sample0122.png YOLOv8-SAHI: 44ms sample0086.png YOLOv8-SAHI: 45ms sample0124.png YOLOv8-SAHI: 45ms sample0230.png YOLOv8-SAHI: 45ms ...可以看到除首张图片因包含引擎预热而耗时较长外,模型对单张图片的切分检测耗时约为45毫秒(小于0.05秒),可以达到实时检测的要求。YOLOv8-SAHI-ByteTrack 视频跟踪我们可以结合ByteTrack跟踪算法对视频文件进行实时的切分检测和跟踪,在build目录下执行:cd build ./yolov8_sahi_track ../media/c3_1080.mp4 在Jetson Orin Nano (8GB)上YOLOv8-SAHI-ByteTrack性能表现如下:Total frames: 341 Init ByteTrack! 
Processing frame 20 (8 fps) Processing frame 40 (11 fps) Processing frame 60 (12 fps) Processing frame 80 (12 fps) Processing frame 100 (13 fps) Processing frame 120 (13 fps) Processing frame 140 (13 fps) Processing frame 160 (14 fps) Processing frame 180 (14 fps) Processing frame 200 (14 fps) Processing frame 220 (14 fps) Processing frame 240 (14 fps) Processing frame 260 (14 fps) Processing frame 280 (14 fps) Processing frame 300 (14 fps) Processing frame 320 (14 fps) Processing frame 340 (15 fps) FPS: 15 可以看到模型在1080p的视频上切分检测的帧率接近15FPS,并且ByteTrack的跟踪效果非常优秀。小结通过本项目,开发者可以在资源受限的嵌入式设备上实现高效的YOLOv8切分检测和跟踪,特别适用于需要实时处理的边缘计算场景。
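    Int8量化依赖config.h中calibrationDataPath指定的校准图片目录(../images/),下面是一个从数据集中随机抽取校准图片的最小示意(源目录与抽取数量均为假设值),校准图片应尽量覆盖实际部署场景:
import os
import random
import shutil

src_dir = "/path/to/dataset/images"   # 假设:训练/验证集图片所在目录
dst_dir = "../images"                 # 与config.h中calibrationDataPath保持一致
os.makedirs(dst_dir, exist_ok=True)

files = [f for f in os.listdir(src_dir) if f.lower().endswith((".jpg", ".jpeg", ".png"))]
num = min(500, len(files))            # 假设抽取500张左右即可覆盖典型场景
for name in random.sample(files, num):
    shutil.copy(os.path.join(src_dir, name), os.path.join(dst_dir, name))
print(f"已拷贝 {num} 张校准图片到 {dst_dir}")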
  • [技术干货] 如何在虚拟环境中调用TensorRT
    如何在虚拟环境中调用TensorRTPython版本的TensorRT是跟随Jetpack已经安装好的,但只适配了Jetpack自带的Python版本,因此我们在使用conda创建虚拟环境时,Python版本尽量和系统版本保持一致。我们首先激活虚拟环境,我创建的名称是py38:conda activate py38输入pip list查看已经安装好的Python依赖包:Package Version ------------------ ----------------------- certifi 2025.8.3 charset-normalizer 3.4.3 filelock 3.16.1 fsspec 2025.3.0 idna 3.10 Jinja2 3.1.6 MarkupSafe 2.1.5 mpmath 1.3.0 networkx 3.1 numpy 1.23.5 Pillow 9.5.0 pip 24.2 requests 2.32.4 setuptools 75.1.0 sympy 1.13.3 torch 2.1.0a0+41361538.nv23.6 torchvision 0.16.1 typing_extensions 4.13.2 urllib3 2.2.3 wheel 0.44.0可以看到没有TensorRT,其安装路径一般在系统Python目录下,以Jetson Orin Nano为例,我们可以建立软连接指向TensorRT安装路径:sudo ln -s /usr/lib/python3.8/dist-packages/tensorrt* /home/houyansong/miniconda3/envs/py38/lib/python3.8/site-packages之后再次输入pip list可以看到已经包含TensorRT依赖包:Package Version ------------------ ----------------------- certifi 2025.8.3 charset-normalizer 3.4.3 filelock 3.16.1 fsspec 2025.3.0 idna 3.10 Jinja2 3.1.6 MarkupSafe 2.1.5 mpmath 1.3.0 networkx 3.1 numpy 1.23.5 Pillow 9.5.0 pip 24.2 requests 2.32.4 setuptools 75.1.0 sympy 1.13.3 tensorrt 8.5.2.2 torch 2.1.0a0+41361538.nv23.6 torchvision 0.16.1 typing_extensions 4.13.2 urllib3 2.2.3 wheel 0.44.0我们可以验证一下,在命令行中输入:python -c "import tensorrt; print(tensorrt.__version__)" 输出结果为8.5.2.2,说明安装成功。
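    除了打印版本号,还可以进一步验证TensorRT能否在虚拟环境中正常初始化,下面是一个简单的验证脚本示意:
# 验证 TensorRT 在虚拟环境中不仅能 import,还能正常创建 Builder / Runtime
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)    # 能创建 Builder 说明底层动态库加载正常
runtime = trt.Runtime(logger)    # Runtime 用于后续反序列化 engine
print("TensorRT 版本:", trt.__version__)
print("Builder/Runtime 创建成功:", builder is not None and runtime is not None)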
  • [技术干货] 【朝推夜训】如何在边缘设备上搭建深度学习开发环境
    【朝推夜训】如何在边缘设备上搭建深度学习开发环境如何在边缘设备上搭建深度学习的开发环境,我们以Jetson Orin Nano为例,介绍如何在开发板上安装Miniconda并配置conda源,以及如何安装Pytorch和Torchvision。1. 安装 Miniconda首先下载Miniconda最新安装包:wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-aarch64.sh运行Miniconda3-latest-Linux-aarch64.sh安装脚本进行安装:bash ~/Miniconda3-latest-Linux-aarch64.sh关闭并重新打开终端窗口以使安装完全生效,或者使用以下命令刷新终端:source ~/.bashrc2. conda 换源首先编辑.condarc文件:vi ~/.condarc我们使用清华源,将文件修改为如下内容,即可添加Anaconda Python免费仓库。channels: - defaults show_channel_urls: true default_channels: - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2 custom_channels: conda-forge: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud pytorch: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud使用下列命令清除索引缓存,并创建Python-3.8开发环境。conda clean -i conda create -n py38 python=3.8 3. pip 换源在用户目录创建.pip目录,并编辑pip.conf文件:cd ~ mkdir .pip cd .pip vi pip.confpip.conf写入以下内容:[global] index-url = https://pypi.tuna.tsinghua.edu.cn/simple/ [install] trusted-host = pypi.tuna.tsinghua.edu.cn保存pip.conf4. 安装 Pytorch 和 Torchvision首先打开PyTorch for Jetson官网根据自己的JetPack版本选择合适安装包进行下载:https://forums.developer.nvidia.com/t/pytorch-for-jetson/72048 cd ~/Downloads wget https://developer.download.nvidia.cn/compute/redist/jp/v512/pytorch/torch-2.1.0a0+41361538.nv23.06-cp38-cp38-linux_aarch64.whl我下载版本的是PyTorch v2.1.0,之后激活我们刚刚创建好的conda环境py38并安装Pytorch:conda activate py38 sudo apt-get install python3-pip libopenblas-base libopenmpi-dev libomp-dev pip install ~/Downloads/torch-2.1.0a0+41361538.nv23.06-cp38-cp38-linux_aarch64.whl安装完成后,我们可以验证一下,在命令行中输入:python -c "import torch; print(torch.cuda.is_available())" 结果为True,则说明安装成功。之后安装Torchvision,Pytorch v2.1.0对应torchvision v0.16.1:PyTorch v1.0 - torchvision v0.2.2PyTorch v1.1 - torchvision v0.3.0PyTorch v1.2 - torchvision v0.4.0PyTorch v1.3 - torchvision v0.4.2PyTorch v1.4 - torchvision v0.5.0PyTorch v1.5 - torchvision v0.6.0PyTorch v1.6 - torchvision v0.7.0PyTorch v1.7 - torchvision v0.8.1PyTorch v1.8 - torchvision v0.9.0PyTorch v1.9 - torchvision v0.10.0PyTorch v1.10 - torchvision v0.11.1PyTorch v1.11 - torchvision v0.12.0PyTorch v1.12 - torchvision v0.13.0PyTorch v1.13 - torchvision v0.13.0PyTorch v1.14 - torchvision v0.14.1PyTorch v2.0 - torchvision v0.15.1PyTorch v2.1 - torchvision v0.16.1PyTorch v2.2 - torchvision v0.17.1PyTorch v2.3 - torchvision v0.18.0Torchvision安装命令如下:sudo apt-get install libjpeg-dev zlib1g-dev libpython3-dev libopenblas-dev libavcodec-dev libavformat-dev libswscale-dev git clone --branch v0.16.1 https://github.com/pytorch/vision torchvision cd torchvision conda activate py38 export BUILD_VERSION=0.16.1 pip install numpy==1.23.5 Pillow==9.5.0 requests==2.32.4 python setup.py install --user等待Trochvision安装完成后,我们可以验证一下,在命令行中输入:python -c "import torchvision; print(torchvision.__version__)" 成功打印版本号说明安装成功!
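    安装完成后,可以用下面的脚本在py38环境中对Pytorch和Torchvision做一次联合验证(包含一次简单的GPU张量运算):
# 联合验证 Pytorch 与 Torchvision 是否安装成功并能使用 GPU
import torch
import torchvision

print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("CUDA 可用:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("设备:", torch.cuda.get_device_name(0))
    x = torch.randn(2, 3, 224, 224, device="cuda")
    y = torch.nn.functional.avg_pool2d(x, 2)   # 在 GPU 上执行一次简单算子
    print("GPU 张量运算输出形状:", tuple(y.shape))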
  • [技术干货] 松材线虫病检测
    松材线虫病检测1. 数据切分无人机广角拍摄的影像分辨率较高(4000x3000),首先对人工标注好的松材线虫病数据集进行切分,将大图切分成小图并设置不同的切分尺寸(例如:1000x1000、1500x1500、2000x2000)和重叠比例(例如:0%、10%、20%、30%)送入模型进行训练。2. 模型训练YOLOv8自2023年推出后,经过多次优化迭代,其架构设计(如C2F模块、动态标签分配)与训练流程已趋于成熟。例如嵌入式设备依赖v8的轻量化特性;在医疗检测领域,v8的高召回率也已被临床验证。YOLO12等虽在理论上超越YOLOv8,但是v8的推理速度仍具不可替代性,目前在工业界广泛采用该版本进行部署。我们使用YOLOv8对等比例缩放后的原始图像和切分后的松材线虫病检测数据集进行训练,提高模型对不同大小目标的泛化能力,每次迭代训练s和m两种尺寸的模型,分别用于视频直播检测和图像的自动标注。目前我们的模型已经适配国产昇腾和英伟达的算力卡,可以实现模型的自动化训练作业,并针对不同算力芯片进行模型的自动转换和量化。3. 云上标注我们的模型可以对无人机回传的图片和视频进行切分检测和自动标注,针对不同大小的目标和类别可以设置不同的切分尺寸和重叠比例,实现无人机影像的细粒度检测。4. 直播推理我们的AI直播推理业务Pipeline并发运行,使用Python结合C++进行开发,功能模块化,业务运行更高效,可以在RK3588、Jetson系列开发板上进行部署。目前针对松材线虫病检测的场景,已经支持对9种疫木的实时识别。----转自博客:https://bbs.huaweicloud.com/blogs/458003
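    下面给出数据切分坐标计算的一个最小示意(4000x3000的图像尺寸为示例假设,仅演示按切分尺寸和重叠比例生成裁剪窗口的思路,实际制作训练数据时还需同步裁剪图像并换算标注框坐标):
# 按切分尺寸和重叠比例计算一张大图的裁剪窗口数量(坐标计算示意)
def tile_origins(length, tile, overlap_ratio):
    stride = max(1, int(tile * (1 - overlap_ratio)))
    origins = list(range(0, max(length - tile, 0) + 1, stride))
    if origins[-1] + tile < length:        # 末尾窗口贴边,保证覆盖完整图像
        origins.append(length - tile)
    return origins

W, H = 4000, 3000
for tile in (1000, 1500, 2000):
    for overlap in (0.0, 0.1, 0.2, 0.3):
        xs = tile_origins(W, tile, overlap)
        ys = tile_origins(H, tile, overlap)
        print(f"切分尺寸 {tile}x{tile}, 重叠 {int(overlap*100)}%: {len(xs)*len(ys)} 张小图")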
  • [技术干货] 如何使用 Python 开发 AI 图编排应用
    如何使用 Python 开发 AI 图编排应用本文将介绍使用Python开发一个简单的AI图编排应用,我们的目标是实现AI应用在RK3588上灵活编排和高效部署。首先我们定义的图是由边和节点组成的有向无环图,边代表任务队列,表示数据在节点之间的流动关系,每个节点都是一个计算单元,用于处理特定的任务。之后我们可以定义一组的处理特定任务的函数节点也称为计算单元,例如:read_frame、model_infer、kf_tracker、draw_boxes、push_frame、redis_push,分别用于读取视频、模型检测、目标跟踪、图像绘制、视频输出以及结果推送。每个节点可以有一个输入和多个输出,数据在节点之间是单向流动的,节点之间通过边进行连接,每个节点通过队列消费和传递数据。代码地址:https://github.com/HouYanSong/modelbox-rk3588一. 计算节点的实现我们在Json文件中定义每一个节点的的数据结构并使用Python进行代码实现:读流计算单元有4个参数:pull_video_url、height、width、fps,分别代表视频地址、视频高度和宽度以及读取帧率,它仅作为生产者,产生的数据可以输出到多个队列。"read_frame": { "config": { "pull_video_url": { "type": "str", "required": true, "default": null, "desc": "pull video url", "source": "mp4|flv|rtmp|rtsp" }, "height": { "type": "int", "required": true, "default": null, "max": 1440, "min": 720, "desc": "video height" }, "width": { "type": "int", "required": true, "default": null, "max": 1920, "min": 960, "desc": "video width" }, "fps": { "type": "int", "required": true, "default": null, "max": 15, "min": 5, "desc": "frame rate" } }, "multi_output": [] } 函数代码的实现如下,我们可以对视频文件或者视频流使用ffmpeg进行硬件解码,并将解码后的帧数据写入到队列中,用于后续任务节点的计算。def read_frame(share_dict, flowunit_data, queue_dict, data): pull_video_url = flowunit_data["config"]["pull_video_url"] height = flowunit_data["config"]["height"] width = flowunit_data["config"]["width"] fps = flowunit_data["config"]["fps"] ffmpeg_cmd = [ 'ffmpeg', '-c:v', 'h264_rkmpp', '-i', pull_video_url, '-r', f'{fps}', '-loglevel', 'info', '-s', f'{width}x{height}', '-an', '-f', 'rawvideo', '-pix_fmt', 'bgr24', 'pipe:' ] ffmpeg_process = sp.Popen(ffmpeg_cmd, stdout=sp.PIPE, stderr=sp.DEVNULL, bufsize=10**7) index = 0 while True: index += 1 raw_frame = ffmpeg_process.stdout.read(width * height * 3) if not raw_frame: break else: frame = np.frombuffer(raw_frame, dtype=np.uint8).reshape((height, width, -1)) data["frame"] = frame for queue_name in flowunit_data["multi_output"]: queue_dict[queue_name].put(data) # 读取结束,图片数据置为None data["frame"] = None for queue_name in flowunit_data["multi_output"]: queue_dict[queue_name].put(data) ffmpeg_process.stdout.close() ffmpeg_process.terminate() 推理计算单元的函数定义如下,它有一个输入和多个输出,我们可以指定模型和配置文件路径以及单次图像推理的批次大小等参数。"model_infer": { "config": { "model_file": { "type": "str", "required": true, "default": null, "desc": "model file path, rk3588 mostly ends with .rknn" }, "model_info": { "type": "str", "required": true, "default": null, "desc": "model info file path, mostly use json file" }, "batch_size": { "type": "int", "required": true, "default": null, "max": 8, "min": 1, "desc": "batch size" } }, "single_input": null, "multi_output": [] } 对应的函数实现如下,这里我们通过创建线程池的方式对图像进行批量推理,BatchSize的大小代表创建线程池的数量,将一个批次的推理结果写入到输出队列中,输出队列不唯一,可以为空或有多个输出队列。def model_infer(share_dict, flowunit_data, queue_dict, data): model_file = flowunit_data["config"]["model_file"] model_info = flowunit_data["config"]["model_info"] batch_size = flowunit_data["config"]["batch_size"] rknn_lite_list = [] for i in range(batch_size): rknn_lite = RKNNLite() rknn_lite.load_rknn(model_file) rknn_lite.init_runtime(core_mask=RKNNLite.NPU_CORE_0_1_2) rknn_lite_list.append(rknn_lite) with open(model_info, "r") as f: model_info = json.load(f) labels = [] for label in list(model_info["model_classes"].values()): labels.append(label) IMG_SIZE = model_info["input_shape"][0][-2:] OBJ_THRESH = model_info["conf_threshold"] NMS_THRESH = model_info["nms_threshold"] exist = False index = 0 while True: index += 1 image_batch = [] if flowunit_data["single_input"] is not None: for i in range(batch_size): data = 
queue_dict[flowunit_data["single_input"]].get() # 图片数据为None就退出循环 if data["frame"] is None: exist = True break image_batch.append(data) else: break with ThreadPoolExecutor(max_workers=batch_size) as executor: results = list(executor.map(infer_single_image, [(data["frame"], rknn_lite_list[i % batch_size], IMG_SIZE, OBJ_THRESH, NMS_THRESH) for i, data in enumerate(image_batch)])) for i, (boxes, classes, scores) in enumerate(results): classes = [labels[class_id] for class_id in classes] data = image_batch[i] if data.get("boxes") is None: data["boxes"] = boxes data["classes"] = classes data["scores"] = scores else: data["boxes"].extend(boxes) data["classes"].extend(classes) data["scores"].extend(scores) for queue_name in flowunit_data["multi_output"]: queue_dict[queue_name].put(data) if exist: break # 读取结束,图片数据置为None data["frame"] = None for queue_name in flowunit_data["multi_output"]: queue_dict[queue_name].put(data) for rknn_lite in rknn_lite_list: rknn_lite.release() 跟踪功能单元的可以对推理结果添加跟踪ID,如果没有推理结果,则直接返回原始数据,其定义如下:"kf_tracker": { "config": {}, "single_input": null, "multi_output": [] } 对应的函数代码实现如下:def kf_tracker(share_dict, flowunit_data, queue_dict, data): tracker = CentroidKF_Tracker(max_lost=30) index = 0 while True: index += 1 if flowunit_data["single_input"] is not None: data = queue_dict[flowunit_data["single_input"]].get() else: break # 图片数据为None就退出循环 if data["frame"] is None: break boxes, classes, scores = data.get("boxes"), data.get("classes"), data.get("scores") boxes = np.array(boxes) classes = np.array(classes) scores = np.array(scores) boxes[:, 2] = boxes[:, 2] - boxes[:, 0] boxes[:, 3] = boxes[:, 3] - boxes[:, 1] results = tracker.update(boxes, scores, classes) boxes = [] classes = [] scores = [] tracks = [] for result in results: frame_num, id, bb_left, bb_top, bb_width, bb_height, confidence, x, y, z, class_id = result boxes.append([bb_left, bb_top, bb_left + bb_width, bb_top + bb_height]) classes.append(class_id) scores.append(confidence) tracks.append(id) data["boxes"] = boxes data["classes"] = classes data["scores"] = scores data["tracks"] = tracks for queue_name in flowunit_data["multi_output"]: queue_dict[queue_name].put(data) # 读取结束,图片数据置为None data["frame"] = None for queue_name in flowunit_data["multi_output"]: queue_dict[queue_name].put(data) 绘制功能单元可以对检测和跟踪结果进行绘制,如果检测结果或跟踪结果为空,则直接返回原始数据,其定义如下:"draw_boxes": { "single_input": null, "config": {}, "multi_output": [] } 代码逻辑如下:def draw_boxes(share_dict, flowunit_data, queue_dict, data): index = 0 while True: index += 1 if flowunit_data["single_input"] is not None: data = queue_dict[flowunit_data["single_input"]].get() else: break # 图片数据为None就退出循环 if data["frame"] is None: break boxes, classes, scores = data.get("boxes"), data.get("classes"), data.get("scores") if boxes is not None: tracks = data.get("tracks") if tracks is not None: for boxe, clss, track in zip(boxes, classes, tracks): cv2.rectangle(data["frame"], (boxe[0], boxe[1]), (boxe[2], boxe[3]), (0, 255, 0), 2) cv2.putText(data["frame"], f"{clss} {track}", (boxe[0], boxe[1] - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2) else: for boxe, clss, conf in zip(boxes, classes, scores): cv2.rectangle(data["frame"], (boxe[0], boxe[1]), (boxe[2], boxe[3]), (0, 255, 0), 2) cv2.putText(data["frame"], f"{clss} {conf * 100:.2f}%", (boxe[0], boxe[1] - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2) for queue_name in flowunit_data["multi_output"]: queue_dict[queue_name].put(data) # 读取结束,图片数据置为None data["frame"] = None for queue_name in flowunit_data["multi_output"]: 
queue_dict[queue_name].put(data) 输出功能单元可以将视频帧编码成视频输到到视频文件或者推流到RTMP服务器,其参数定义如下:"push_frame": { "config": { "push_video_url": { "type": "str", "required": true, "default": null, "desc": "push video url", "source": "rtmp|flv|mp4" }, "format": { "type": "str", "required": true, "default": null, "desc": "vodeo format", "source": "flv|mp4" }, "height": { "type": "int", "required": true, "default": null, "max": 1920, "min": 720, "desc": "video height" }, "width": { "type": "int", "required": true, "default": null, "max": 1920, "min": 960, "desc": "video width" }, "fps": { "type": "int", "required": true, "default": null, "max": 15, "min": 5, "desc": "frame rate" } }, "single_input": null } push_video_url参数是推流地址,也可以输出到本地视频文件。format参数指定视频格式,支持flv和mp4。height和width为视频分辨率,fps是输出帧率。它仅作为消费者,具体函数代码实现如下:def push_frame(share_dict, flowunit_data, queue_dict, data): push_video_url = flowunit_data["config"]["push_video_url"] format = flowunit_data["config"]["format"] height = flowunit_data["config"]["height"] width = flowunit_data["config"]["width"] fps = flowunit_data["config"]["fps"] process_stdin = ( ffmpeg .input('pipe:', format='rawvideo', pix_fmt='bgr24', s="{}x{}".format(width, height), framerate=fps) .filter('fps', fps=fps, round='up') .output( push_video_url, vcodec='h264_rkmpp', bitrate='2500k', f=format, g=fps, an=None, timeout='0' ) .overwrite_output() .run_async(cmd=["ffmpeg", "-re"], pipe_stdin=True) ) index = 0 while True: index += 1 if flowunit_data["single_input"] is not None: data = queue_dict[flowunit_data["single_input"]].get() else: break # 图片数据为None就退出循环 if data["frame"] is None: break frame = data["frame"] frame = cv2.resize(frame, (width, height)) process_stdin.stdin.write(frame.tobytes()) process_stdin.stdin.close() process_stdin.terminate() 消息功能单元可以将检测或跟踪结果发送到Redis服务器,具体可以根据实际情况进行调整。"redis_push": { "config": { "task_id": { "type": "str", "required": true, "default": null, "desc": "task id" }, "host": { "type": "str", "required": true, "default": null, "desc": "redis host" }, "port": { "type": "int", "required": true, "default": null, "desc": "redis port" }, "username": { "type": "str", "required": true, "default": null, "desc": "redis username" }, "password": { "type": "str", "required": true, "default": null, "desc": "redis password" }, "db": { "type": "int", "required": true, "default": null, "desc": "redis db" } }, "single_input": null } 同样,它也仅作为消费者,只有一个输入,具体函数代码如下:def redis_push(share_dict, flowunit_data, queue_dict, data): task_id = flowunit_data["config"]["task_id"] host = flowunit_data["config"]["host"] port = flowunit_data["config"]["port"] username = flowunit_data["config"]["username"] password = flowunit_data["config"]["password"] db = flowunit_data["config"]["db"] r = redis.Redis( host = host, port = port, username = username, password = password, db = db, decode_responses = True ) index = 0 while True: index += 1 if flowunit_data["single_input"] is not None: data = queue_dict[flowunit_data["single_input"]].get() else: break # 图片数据为None就退出循环 if data["frame"] is None: break track_objs = [] height, width = data["frame"].shape[:2] boxes, classes, scores, tracks = data.get("boxes"), data.get("classes"), data.get("scores"), data.get("tracks") if boxes is not None: for boxe, clss, conf, track in zip(boxes, classes, scores, tracks): x1 = float(boxe[0] / width) y1 = float(boxe[1] / height) x2 = float(boxe[2] / width) y2 = float(boxe[3] / height) track_obj = { "bbox": [x1, y1, x2, y2], "track_id": int(track), "class_id": 0, "class_name": str(clss) } track_objs.append(track_obj) key 
= 'vision:track:' + str(task_id) + ':frame:' + str(index) value = json.dumps({"track_result": track_objs}) r.set(key, value) r.expire(key, 2) print(track_objs) r.close() 二、流程图编排定义好节点,我们就可以定义管道也就是“边”将“节点”的输入和输出连接起来,这里我们定义6条边也就是实例化6个队列,在配置文件中声明每条管道的名称以及队列的最大容量。"queue_size": 16, "queue_list": [ "frame_queue", "infer_queue_1", "infer_queue_2", "track_queue", "draw_queue_1", "draw_queue_2" ] 之后就是对每一个节点的参数进行配置,并定义功能单元的输入和输出。"graph_edge": { "读流功能单元": { "read_frame": { "config": { "pull_video_url": "/home/orangepi/workspace/modelbox/data/car.mp4", "height": 720, "width": 1280, "fps": 20 }, "multi_output": [ "frame_queue" ] } }, "推理功能单元": { "model_infer": { "config": { "model_file": "/home/orangepi/workspace/modelbox/model/yolov8n_800x800_int8.rknn", "model_info": "/home/orangepi/workspace/modelbox/model/yolov8n_800x800_int8.json", "batch_size": 8 }, "single_input": "frame_queue", "multi_output": [ "infer_queue_1", "infer_queue_2" ] } }, "跟踪功能单元_2": { "kf_tracker": { "config": {}, "single_input": "infer_queue_2", "multi_output": [ "track_queue" ] } }, "绘图功能单元_1": { "draw_boxes": { "config": {}, "single_input": "infer_queue_1", "multi_output": [ "draw_queue_1" ] } }, "绘图功能单元_2": { "draw_boxes": { "single_input": "track_queue", "config": {}, "multi_output": [ "draw_queue_2" ] } }, "推流功能单元_1": { "push_frame": { "config": { "push_video_url": "/home/orangepi/workspace/modelbox/output/det_result.mp4", "format": "mp4", "height": 720, "width": 1280, "fps": 20 }, "single_input": "draw_queue_1" } }, "推流功能单元_2": { "push_frame": { "config": { "push_video_url": "/home/orangepi/workspace/modelbox/output/track_result.mp4", "format": "mp4", "height": 720, "width": 1280, "fps": 20 }, "single_input": "draw_queue_2" } } } 每个功能单元需要起一个节点名称用于功能单元的创建,每个节点名称保证全局唯一,正如字典中的键值不能重复。之后根据这份图文件编排启动AI应用,Python代码如下:import os import sys import json import argparse sys.path.append(os.path.join(os.path.dirname(__file__), '..')) from etc.flowunit import * from multiprocessing import Process, Queue, Manager if __name__ == '__main__': parser = argparse.ArgumentParser() parser.add_argument('graph_path', type=str, nargs='?', default='/home/orangepi/workspace/modelbox/graph/person_car.json') args = parser.parse_args() # 初始化数据 data = {"frame": None} config = {} # 读取流程图 with open(args.graph_path) as f: graph = json.load(f) # 创建队列 queue_dict = {} queue_size = graph["queue_size"] for queue_name in graph["queue_list"]: queue_dict[queue_name] = Queue(maxsize=queue_size) with Manager() as manager: # 创建共享字典 share_dict = manager.dict() # 创建进程 process_list = [] graph_edge = graph["graph_edge"] for process in graph_edge.keys(): p = Process(target=eval(list(graph_edge[process].keys())[0]), args=(share_dict, list(graph_edge[process].values())[0], queue_dict, data,)) process_list.append(p) print("=============Start Process...=============") # 启动进程 for p in process_list: p.start() # 等待进程结束 for p in process_list: p.join() print("==========All Process Finished.===========") 这里我们读取一段测试视频分别将检测结果和跟踪结果保存为两个视频文件输出到output目录下:(python-3.9.15) orangepi@orangepi5plus:~$ python /home/orangepi/workspace/modelbox/graph/graph.py /home/orangepi/workspace/modelbox/graph/person_car.json =============Start Process...============= ffmpeg version 04f5eaa Copyright (c) 2000-2023 the FFmpeg developers built with gcc 11 (Ubuntu 11.4.0-1ubuntu1~22.04) configuration: --prefix=/usr --enable-gpl --enable-version3 --enable-libdrm --enable-rkmpp --enable-rkrga libavutil 58. 29.100 / 58. 29.100 libavcodec 60. 31.102 / 60. 31.102 libavformat 60. 16.100 / 60. 16.100 libavdevice 60. 
3.100 / 60. 3.100 libavfilter 9. 12.100 / 9. 12.100 libswscale 7. 5.100 / 7. 5.100 libswresample 4. 12.100 / 4. 12.100 libpostproc 57. 3.100 / 57. 3.100 ffmpeg version 04f5eaa Copyright (c) 2000-2023 the FFmpeg developers built with gcc 11 (Ubuntu 11.4.0-1ubuntu1~22.04) configuration: --prefix=/usr --enable-gpl --enable-version3 --enable-libdrm --enable-rkmpp --enable-rkrga libavutil 58. 29.100 / 58. 29.100 libavcodec 60. 31.102 / 60. 31.102 libavformat 60. 16.100 / 60. 16.100 libavdevice 60. 3.100 / 60. 3.100 libavfilter 9. 12.100 / 9. 12.100 libswscale 7. 5.100 / 7. 5.100 libswresample 4. 12.100 / 4. 12.100 libpostproc 57. 3.100 / 57. 3.100 W rknn-toolkit-lite2 version: 2.3.2 I RKNN: [13:10:47.190] RKNN Runtime Information, librknnrt version: 2.3.2 (429f97ae6b@2025-04-09T09:09:27) I RKNN: [13:10:47.191] RKNN Driver Information, version: 0.9.6 I RKNN: [13:10:47.192] RKNN Model Information, version: 2, toolkit version: 1.4.0-22dcfef4(compiler version: 1.4.0 (3b4520e4f@2022-09-05T12:50:09)), target: RKNPU v2, target platform: rk3588, framework name: ONNX, framework layout: NCHW, model inference type: static_shape W RKNN: [13:10:47.248] query RKNN_QUERY_INPUT_DYNAMIC_RANGE error, rknn model is static shape type, please export rknn with dynamic_shapes W Query dynamic range failed. Ret code: RKNN_ERR_MODEL_INVALID. (If it is a static shape RKNN model, please ignore the above warning message.) W rknn-toolkit-lite2 version: 2.3.2 I RKNN: [13:10:47.338] RKNN Runtime Information, librknnrt version: 2.3.2 (429f97ae6b@2025-04-09T09:09:27) I RKNN: [13:10:47.338] RKNN Driver Information, version: 0.9.6 I RKNN: [13:10:47.339] RKNN Model Information, version: 2, toolkit version: 1.4.0-22dcfef4(compiler version: 1.4.0 (3b4520e4f@2022-09-05T12:50:09)), target: RKNPU v2, target platform: rk3588, framework name: ONNX, framework layout: NCHW, model inference type: static_shape W RKNN: [13:10:47.384] query RKNN_QUERY_INPUT_DYNAMIC_RANGE error, rknn model is static shape type, please export rknn with dynamic_shapes W Query dynamic range failed. Ret code: RKNN_ERR_MODEL_INVALID. (If it is a static shape RKNN model, please ignore the above warning message.) W rknn-toolkit-lite2 version: 2.3.2 I RKNN: [13:10:47.459] RKNN Runtime Information, librknnrt version: 2.3.2 (429f97ae6b@2025-04-09T09:09:27) I RKNN: [13:10:47.459] RKNN Driver Information, version: 0.9.6 I RKNN: [13:10:47.460] RKNN Model Information, version: 2, toolkit version: 1.4.0-22dcfef4(compiler version: 1.4.0 (3b4520e4f@2022-09-05T12:50:09)), target: RKNPU v2, target platform: rk3588, framework name: ONNX, framework layout: NCHW, model inference type: static_shape W RKNN: [13:10:47.504] query RKNN_QUERY_INPUT_DYNAMIC_RANGE error, rknn model is static shape type, please export rknn with dynamic_shapes W Query dynamic range failed. Ret code: RKNN_ERR_MODEL_INVALID. (If it is a static shape RKNN model, please ignore the above warning message.) 
W rknn-toolkit-lite2 version: 2.3.2 I RKNN: [13:10:47.606] RKNN Runtime Information, librknnrt version: 2.3.2 (429f97ae6b@2025-04-09T09:09:27) I RKNN: [13:10:47.606] RKNN Driver Information, version: 0.9.6 I RKNN: [13:10:47.608] RKNN Model Information, version: 2, toolkit version: 1.4.0-22dcfef4(compiler version: 1.4.0 (3b4520e4f@2022-09-05T12:50:09)), target: RKNPU v2, target platform: rk3588, framework name: ONNX, framework layout: NCHW, model inference type: static_shape W RKNN: [13:10:47.658] query RKNN_QUERY_INPUT_DYNAMIC_RANGE error, rknn model is static shape type, please export rknn with dynamic_shapes W Query dynamic range failed. Ret code: RKNN_ERR_MODEL_INVALID. (If it is a static shape RKNN model, please ignore the above warning message.) W rknn-toolkit-lite2 version: 2.3.2 I RKNN: [13:10:47.761] RKNN Runtime Information, librknnrt version: 2.3.2 (429f97ae6b@2025-04-09T09:09:27) I RKNN: [13:10:47.761] RKNN Driver Information, version: 0.9.6 I RKNN: [13:10:47.762] RKNN Model Information, version: 2, toolkit version: 1.4.0-22dcfef4(compiler version: 1.4.0 (3b4520e4f@2022-09-05T12:50:09)), target: RKNPU v2, target platform: rk3588, framework name: ONNX, framework layout: NCHW, model inference type: static_shape W RKNN: [13:10:47.814] query RKNN_QUERY_INPUT_DYNAMIC_RANGE error, rknn model is static shape type, please export rknn with dynamic_shapes W Query dynamic range failed. Ret code: RKNN_ERR_MODEL_INVALID. (If it is a static shape RKNN model, please ignore the above warning message.) W rknn-toolkit-lite2 version: 2.3.2 I RKNN: [13:10:47.910] RKNN Runtime Information, librknnrt version: 2.3.2 (429f97ae6b@2025-04-09T09:09:27) I RKNN: [13:10:47.910] RKNN Driver Information, version: 0.9.6 I RKNN: [13:10:47.912] RKNN Model Information, version: 2, toolkit version: 1.4.0-22dcfef4(compiler version: 1.4.0 (3b4520e4f@2022-09-05T12:50:09)), target: RKNPU v2, target platform: rk3588, framework name: ONNX, framework layout: NCHW, model inference type: static_shape W RKNN: [13:10:47.962] query RKNN_QUERY_INPUT_DYNAMIC_RANGE error, rknn model is static shape type, please export rknn with dynamic_shapes W Query dynamic range failed. Ret code: RKNN_ERR_MODEL_INVALID. (If it is a static shape RKNN model, please ignore the above warning message.) W rknn-toolkit-lite2 version: 2.3.2 I RKNN: [13:10:48.069] RKNN Runtime Information, librknnrt version: 2.3.2 (429f97ae6b@2025-04-09T09:09:27) I RKNN: [13:10:48.070] RKNN Driver Information, version: 0.9.6 I RKNN: [13:10:48.071] RKNN Model Information, version: 2, toolkit version: 1.4.0-22dcfef4(compiler version: 1.4.0 (3b4520e4f@2022-09-05T12:50:09)), target: RKNPU v2, target platform: rk3588, framework name: ONNX, framework layout: NCHW, model inference type: static_shape W RKNN: [13:10:48.122] query RKNN_QUERY_INPUT_DYNAMIC_RANGE error, rknn model is static shape type, please export rknn with dynamic_shapes W Query dynamic range failed. Ret code: RKNN_ERR_MODEL_INVALID. (If it is a static shape RKNN model, please ignore the above warning message.) 
W rknn-toolkit-lite2 version: 2.3.2 I RKNN: [13:10:48.228] RKNN Runtime Information, librknnrt version: 2.3.2 (429f97ae6b@2025-04-09T09:09:27) I RKNN: [13:10:48.228] RKNN Driver Information, version: 0.9.6 I RKNN: [13:10:48.229] RKNN Model Information, version: 2, toolkit version: 1.4.0-22dcfef4(compiler version: 1.4.0 (3b4520e4f@2022-09-05T12:50:09)), target: RKNPU v2, target platform: rk3588, framework name: ONNX, framework layout: NCHW, model inference type: static_shape W RKNN: [13:10:48.280] query RKNN_QUERY_INPUT_DYNAMIC_RANGE error, rknn model is static shape type, please export rknn with dynamic_shapes W Query dynamic range failed. Ret code: RKNN_ERR_MODEL_INVALID. (If it is a static shape RKNN model, please ignore the above warning message.) Input #0, rawvideo, from 'pipe:': Duration: N/A, start: 0.000000, bitrate: 442368 kb/s Stream #0:0: Video: rawvideo (BGR[24] / 0x18524742), bgr24, 1280x720, 442368 kb/s, 20 tbr, 20 tbn Stream mapping: Stream #0:0 (rawvideo) -> fps:default fps:default -> Stream #0:0 (h264_rkmpp) Output #0, mp4, to '/home/orangepi/workspace/modelbox/output/det_result.mp4': Metadata: encoder : Lavf60.16.100 Stream #0:0: Video: h264 (High) (avc1 / 0x31637661), bgr24(progressive), 1280x720, q=2-31, 2000 kb/s, 20 fps, 10240 tbn Metadata: encoder : Lavc60.31.102 h264_rkmpp Input #0, rawvideo, from 'pipe:': 0kB time=N/A bitrate=N/A speed=N/A Duration: N/A, start: 0.000000, bitrate: 442368 kb/s Stream #0:0: Video: rawvideo (BGR[24] / 0x18524742), bgr24, 1280x720, 442368 kb/s, 20 tbr, 20 tbn Stream mapping: Stream #0:0 (rawvideo) -> fps:default fps:default -> Stream #0:0 (h264_rkmpp) Output #0, mp4, to '/home/orangepi/workspace/modelbox/output/track_result.mp4': Metadata: encoder : Lavf60.16.100 Stream #0:0: Video: h264 (High) (avc1 / 0x31637661), bgr24(progressive), 1280x720, q=2-31, 2000 kb/s, 20 fps, 10240 tbn Metadata: encoder : Lavc60.31.102 h264_rkmpp [out#0/mp4 @ 0x558f2625e0] video:2330kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.062495% frame= 132 fps= 19 q=-0.0 Lsize= 2331kB time=00:00:06.55 bitrate=2915.8kbits/s speed=0.924x Exiting normally, received signal 15. [out#0/mp4 @ 0x557e4875e0] video:1990kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.072192% frame= 131 fps= 18 q=-0.0 Lsize= 1991kB time=00:00:06.50 bitrate=2509.6kbits/s speed= 0.9x ==========All Process Finished.=========== Exiting normally, received signal 15.应用推理的帧率取决于视频读取的帧率以及耗时最久的功能单元,实测FPS约为20左右,满足AI实时检测的场景。
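    为了更直观地说明"边=队列、节点=进程"以及以data["frame"]为None作为结束信号的约定,下面给出一个剥离了业务逻辑的最小示意(生产者-消费者模型,帧数据用字符串代替真实图像):
# 图编排中"边=队列、节点=进程"的最小示意,退出约定与上文各功能单元一致
from multiprocessing import Process, Queue

def producer(out_queue):
    for i in range(5):
        out_queue.put({"frame": f"frame-{i}"})   # 此处用字符串代替真实图像帧
    out_queue.put({"frame": None})               # 读取结束,发送结束信号

def consumer(in_queue):
    while True:
        data = in_queue.get()
        if data["frame"] is None:                # 收到结束信号后退出
            break
        print("处理", data["frame"])

if __name__ == "__main__":
    q = Queue(maxsize=16)
    p1 = Process(target=producer, args=(q,))
    p2 = Process(target=consumer, args=(q,))
    p1.start(); p2.start()
    p1.join(); p2.join()
    实际的功能单元只是把producer和consumer的循环体替换成ffmpeg解码、RKNN推理、跟踪和绘制等具体逻辑,节点之间的连接关系则由图文件中的队列名称决定。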
  • 如何在边缘设备上搭建深度学习开发环境?
    最近入手了国产的香橙派和英伟达的Jetson开发板,如何在板子上搭建深度学习开发环境进行AI应用开发?
  • [交流吐槽] 山顶西瓜240一个?物联网技术解决景区服务天价难题
    山顶泡面18块一桶?西瓜240一个?”最近,江西武功山景区的“天价”标签冲上热搜,刺痛游客神经。后经网警证实,事件纯属虚构,系为博流量恶意摆拍!舆论哗然间,景区物价的“孤岛困境”被无情揭示——山高路远,物资输送成本高昂,让消费者无奈承担了这沉重溢价。当商业逻辑冲破道德底线,物联网技术恰似一剂清醒剂,以万物互联为核心的技术革命,正在重塑景区服务的尺度。通过部署传感器实时监控库存与消耗,景区得以告别盲目囤货与紧急调运的被动局面。更关键的,是借助智能调度系统统筹规划无人运输车、缆车、甚至无人机等运力资源,构建起一张多维度的动态物流网络。当盈电智控物联网精准驱动,运送效率极大提升,传统“人挑马驮”的低效模式便自然淘汰,运输成本这压在价格上的巨石也随之松动。物联网让“透明”与“智慧”成为价格管理的双翼。电子价签可动态关联后台系统,根据运输成本、需求热度等因素灵活调整,确保价格浮动有据可循。盈电智控区块链溯源技术则能让每一份山巅商品清晰记录从源头到终端的旅程,成本构成一目了然,游客消费得安心、明白。技术应用的本质是服务于人。故宫博物院早已通过物联网建立起的智慧供应链,让昔日深宫内的纪念品与餐饮服务实现高效配送与价格亲民,游客络绎不绝而赞誉有加。盈电智控物联网非但不会增加景区负担,反而通过优化流程显著降低长期运营成本,最终惠泽游客——山顶一碗面,价格自然不必再“高耸入云”。物联网照亮的不仅是景区的每个角落,更是商业文明的新境界。当智能货架闪烁的价格数字与山间明月交相辉映,当扫码支付的提示音与松涛合奏成曲,技术终将证明:真正的"山顶经济",应是科技温暖与人性光辉的共同抵达。这或许才是旅游业高质量发展的应有之义。
  • 如何通过改进采样策略来降低扩散模型的推理时间成本
    通过改进采样策略,扩散模型可以在保持生成质量的同时显著减少推理时间。以下是核心方法及其数学依据的详细解析:​​一、传统扩散模型的采样瓶颈​​扩散模型的生成过程需要逐步去噪(通常需数千步),每一步均需运行噪声预测网络(如UNet)。例如,DDPM生成512×512图像需1000步,耗时约10秒。其核心瓶颈在于:​​马尔可夫链的线性依赖​​:每一步仅依赖前一步的状态,无法跳步。​​局部线性近似​​:传统方法(如DDPM)假设反向过程是局部线性的,导致收敛速度慢。​​二、加速采样策略的核心方法​​​​1. DDIM(Denoising Diffusion Implicit Models)​​​​核心思想​​:将扩散过程参数化为非马尔可夫过程,允许跳步生成。​​数学依据​​:​​重新参数化反向过程​​:传统DDPM定义反向过程为 x_{t-1} = f(x_t, t),而DDIM将其扩展为:其中 \lambda 为跳步比例,允许直接从 x_t 生成 x_{t-\lambda}。​​确定性生成​​:通过固定随机种子,DDIM可一步生成完整图像(类似GAN)。​​效果​​:在ImageNet上,仅需50步即可达到DDPM 1000步的FID(25.6 vs 25.8)。​​2. PLMS(Pseudo Linear Multi-Step Sampling)​​​​核心思想​​:用线性插值估计多步后的状态,减少迭代次数。​​数学依据​​:假设多步噪声预测可近似为线性组合:权重 w_i 通过最小化MSE优化。​​效果​​:在50步时FID为26.1,接近DDPM 1000步效果。​​3. Stable Consistency Models(SCM)​​​​核心思想​​:直接建模多步一致性,避免迭代。​​数学依据​​:定义一致性损失函数:其中 \text{Iterate} 表示从 x_t 经过 T-t 步生成 x_0 的过程。​​效果​​:仅需10步即可生成高质量图像,速度提升100倍。​​4. 动态步长调整(Dynamic Step Selection)​​​​核心思想​​:根据生成中间结果的置信度自适应调整步数。​​数学依据​​:使用强化学习策略(如PPO)选择步数:其中状态 s 为当前去噪图像,动作 a 为选择步数。​​效果​​:平均步数从1000降至300,速度提升3倍。​​三、数学核心:扩散过程的重新参数化​​所有加速方法均基于对扩散过程的重新参数化,其理论基础可归纳为:​​非马尔可夫性​​:允许反向过程跨越多步,打破马尔可夫链的线性依赖。​​噪声预测的泛化性​​:假设噪声预测网络 \epsilon_\theta 能够隐式建模多步分布:​​重参数化技巧​​:通过引入虚拟变量(如DDIM的 \lambda),将多步过程映射到单步空间。​​四、实际效果与优化组合​​​​DiT-XL/2 + DDIM​​:在ImageNet 256×256生成任务中,仅需50步即可达到FID 29.7(接近1000步的38.5)。​​SCM + 潜在扩散模型​​:在3D生成中,10步生成质量与1000步相当,显存占用减少90%。​​混合策略​​:结合动态步长(前100步)与SCM(后900步),总步数减少至200步,速度提升5倍。​​五、未来方向​​​​神经微分方程求解​​:将扩散过程建模为ODE,用自适应求解器(如DPM-Solver)动态调整步数。​​硬件感知优化​​:针对GPU/NPU特性设计并行化采样算法(如CUDA核融合)。​​多模态联合训练​​:共享噪声预测网络,提升跨任务采样效率。​​总结​​改进采样策略的核心在于​​打破扩散过程的线性依赖​​和​​增强噪声预测的泛化能力​​。通过数学上的重新参数化与非马尔可夫建模,DDIM、SCM等方法可将推理时间从小时级缩短至秒级,同时保持生成质量。未来方向是结合硬件特性与多模态架构,进一步突破效率瓶颈。
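    下面给出一个DDIM确定性跳步采样(eta=0)的示意性实现:假设已有训练好的噪声预测网络(此处用恒零输出的占位函数代替),仅用于说明"在完整的1000步噪声表上均匀抽取少量时间步"这一加速思路,并非上文各方法的完整实现:
# DDIM 跳步采样示意(eta=0 的确定性更新);eps_theta 为占位的噪声预测网络
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                 # 线性 beta 噪声表
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def eps_theta(x, t):                                  # 占位:实际应为训练好的 UNet
    return torch.zeros_like(x)

@torch.no_grad()
def ddim_sample(shape, num_steps=50):
    timesteps = torch.linspace(T - 1, 0, num_steps).long()   # 从1000步中均匀抽取50步
    x = torch.randn(shape)
    for i, t in enumerate(timesteps):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[timesteps[i + 1]] if i + 1 < num_steps else torch.tensor(1.0)
        eps = eps_theta(x, t)
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()       # 由噪声预测反推 x0
        x = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps   # 确定性跳到前一个抽样步
    return x

sample = ddim_sample((1, 3, 64, 64))
print(sample.shape)
    加速的关键在于eps_theta只被调用num_steps次而不是完整的1000次,生成质量则取决于噪声预测网络对跳步后分布的泛化能力。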
  • [设备专区] AR502H容器版:制作编译环境基础镜像报错
    1. 执行 build_sdk.sh 报如下图错误(Docker Hub 网站内未能找到 huawei-ec-iot/sdk:base 镜像);2. 镜像服务应该是正常的,可以正常下载 nginx 镜像;3. docker 镜像拉取采用了国内代理。