• [技术干货] 华为算法精英实战营第5期 - 高维向量数据的近似检索 参赛经验分享
    华为算法精英实战营第5期 - 高维向量数据的近似检索 参赛经验分享赛题核心点介绍赛题目标是在保证精度的前提下追求极致性能优化,所有需要做到两点:通过预处理及数据压缩快速筛选数据尽可能的使用硬件加速向量化处理计算逻辑SIMD向量化经过测试,线上环境支持avx2和avx512指令集本题中为了确保精度需要求欧式距离$\sum_{i=1}^n(x_i-y_i)^2=\sum_{i=1}^nx_i^2+\sum_{i=1}^ny_i^2-2\sum_{i=1}^nx_iy_i$其中$\sum_{i=1}^{n}x_{i}^2$ 部分可以预处理时求得,$\sum_{i=1}^{n}y_{i}^2$在单次查询时不需要计算,只需要求$\sum_{i=1}^nx_iy_i$核心代码示例如下:__attribute__((target("avx512bw", "avx512f"), optimize("O3", "unroll-loops"))) float prod_avx512(const float *p, const float *q, size_t size) { float __attribute__((aligned(64))) tmp[16]; size_t align = size / 16 * 16; __m512 v1, v2; __m512 sum = _mm512_set1_ps(0); for (size_t i = 0; i < align; i += 16) { v1 = _mm512_loadu_ps(&p[i]); v2 = _mm512_loadu_ps(&q[i]); sum = _mm512_add_ps(sum, _mm512_mul_ps(v1, v2)); } _mm512_store_ps(tmp, sum); float res = tmp[0] + tmp[1] + tmp[2] + tmp[3] + tmp[4] + tmp[5] + tmp[6] + tmp[7] + tmp[8] + tmp[9] + tmp[10] + tmp[11] + tmp[12] + tmp[13] + tmp[14] + tmp[15]; for (size_t i = align; i < size; ++i) { res += (p[i] - q[i]) * (p[i] - q[i]); } return res; }汉明距离在数据预处理时,将32bit的float数据压缩到1bit,将1bit数据按列凑成uint64_t类型,这样就可以使用汉明距离做快速筛选明距离的计算相对于欧式距离来说更加高效,因为它只涉及到按位的异或(XOR)操作和按位计数(POPCOUNT)操作,这些操作在现代CPU上可以非常快速地执行。核心代码示例如下(这里测试avx2性能比avx512更高):__attribute__((target("avx2"), optimize("O3", "unroll-loops"))) uint64_t xor_uint64_avx2(const uint64_t *p, const uint64_t *q, const uint64_t size) { uint16_t res = 0; size_t size_a = size / 4 * 4; for (size_t i = 0; i < size_a; i += 4) { __m256i avx2_p = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(&p[i])); __m256i avx2_q = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(&q[i])); __m256i result = _mm256_xor_si256(avx2_p, avx2_q); res += _mm_popcnt_u64(_mm256_extract_epi64(result, 0)) + _mm_popcnt_u64(_mm256_extract_epi64(result, 1)) + _mm_popcnt_u64(_mm256_extract_epi64(result, 2)) + _mm_popcnt_u64(_mm256_extract_epi64(result, 3)); } for (uint64_t i = size_a; i < size; ++i) { uint64_t r = (*(p + i)) ^ (*(q + i)); res += _mm_popcnt_u64(r); } return res; }压缩代码如下(这里耗时较少,未做并行化处理):void construct(float *d, uint64_t *h) { for (int c = 0; c < D / 64; c++) { uint64_t v = 0; for (int n = 0; n < 64; n++) { int j = c * 64 + n; // mid_col is the median of the column v = (v << 1) + (d[j] > mid_col[j] ? 1 : 0); } h[c] = v; } if (D % 64 > 0) { uint64_t v = 0; for (int n = 0; n < D % 64; n++) { int j = (D / 64) * 64 + n; v = (v << 1) + (d[j] > mid_col[j] ? 1 : 0); } h[D / 64] = v; } }为了确保精度,数据快速筛选后保留10%的数据继续通过欧式距离求得TOPK
  • [技术干货] 华为算法精英实战营第4期 - 云集群成本优化 参赛经验分享
    赛题核心点介绍这道题的核心赛点在于设计算法将结点分配给适当类型的服务器来最小化费用。整体思路介绍1. 对每一个CREATE操作,把同类(CPU和MEM相同)的查询规为一类,先看看之前有没有空位可填,剩下的做一个最优化dp。对于任意一种类型,令dp[n]表示给n个结点分配服务器的最小代价。则有两种转移,第一种把所有结点塞到一个服务器里,第二种是dp[n] = min{dp[i] + dp[n - i]}。2. 注意每个服务器只存放同类的结点就行了。这样做的原因是同类结点持续时间一般相同,更有可能同时删除,利用率会更高。更具体一点,就比如说在调用new分配内存的时候,同样大小的块是会被放在一起的,因为他们可能会被同时访问,这样基于空间局部性性能就会更高。放到这题里就是:同类结点的持续时间更一致,放到一组里利用率会更高。因为给一个结点分配了服务器之后就不能改变了,而且当服务器空闲的时候会自动释放。3. 一个服务器只能放一个类型结点,如果有结点结束了,之后往里面添加的,也必须同类型。也就是说服务器对应的类型在创建的时候就确定了。核心策略详解当着前面的只是基本思路,要像得到更高的分数需要更多的优化:1. 维护实时的服务器占用率,对每种类型的服务器维护一个占用率的期望值,然后在dp分配结点的时候用 价格/利用率 作为这种服务器的真实价格。对于单台服务器,其占用率的定义为:所有在这个服务器上执行过的结点的持续时长的和 / (服务器存在的时长 * 能存放的结点数量)。对于一个类型的服务器,其占用率的期望定义为到目前为止所有该类型服务器占用率的平均值。2. 基于上面的策略,假设对于某个类型x的结点,其最优的服务器类型是y,那么对于一次CREATE操作,我们的dp算法要做的就是分配足够数量的y服务器,此时剩下的少量结点会分配非y类型的服务器。举例来说:假设y能放10个x的结点,但是如果CREATE了11个,那么我们只会创建1一个y,剩下单出来的x有一个容量更小的服务器运行即可。这个算法看起来很合理,但是如果下一次CREATE操作很快就会到来的话,不如直接一次创建2个y,虽然会有一个y的占用率很低,但是这个低占用率的持续时间是很短的。对此我们要做的事情也就很明确了:根据之前的CREATE操作维护一些信息来预测下一次CREATE到来的期望时间,然后根据这个信息来决定最后一次分配是选择y还是更小型的服务器。
  • [技术干货] 华为算法精英实战营第5期 - 高维向量数据的近似检索 参赛经验分享
    赛题核心点介绍这套题的核心赛点在于使用近似算法快速计算高维向量的距离。整体思路介绍手写SIMD,让计算性能达到极致。在预处理阶段算出来数据的均值和方差,将数据进行标准化处理,基于此将float压缩到一个byte里。这样做的原因是计算的瓶颈在于从内存中读取向量(缓存已经放不下了),通过缩小数据规模来加速计算。但是byte的精度是不够的,所以在计算完成之后再选取距离前5%近的向量,再用float算一遍就行了。这道题的排序第一个关键字是正确性,第二关键字才是性能,所以我在优化的时候还是很保守的,为了防止终测暴毙我就没有提交更激进的优化版本,不过在这里还是想简单介绍一下。假设查询向量为y,当前计算的向量为x,则有: 其中x^2和y^2的和都可以预处理出来,问题转化为计算向量x与y的内积。更进一步的,我们将待查询的向量x作为行向量拼成矩阵X,那么我们要做的就是计算Xy(Xy的第i个元素就是第i个向量与向量y的内积)。 题目说明了数据来源于真实数据,真实数据中的嵌入向量基本符合正态分布,也就是说向量值在高密度分布在0附近,同时我们从上面公式也可以观察到绝对值较小的数对距离的贡献期望也较小,所以对于较小的yi我们可以选择直接剪枝掉!这个优化在我自己的测试数据能显著提升执行效率,但是考虑到这个比赛有终测,本着稳一手精度(差不多得了)的想法,最后没有提交这个优化的版本。
  • [技术干货] 第五期 - 高维向量数据的近似检索 RANK2 团队解题思路分享
    高维向量数据的近似检索-huaweifirst_worldlatter团队赛题核心点介绍1.采用c++编写可执行程序单纯python+numpy的版本分数在1600.0374左右,上限较低,改用纯c++实现可以达到1600.0474分左右2.采用循环展开由于无法开启编译器simd指令优化,因此采用了循环展开(采用4/8/16)展开。对于M个doc向量在与query计算时可以使用循环展开,每次计算N(4/8/16)个doc的相似度,编译器会尝试生成向量化指令,从而加速整体的运算速度3.采用简化数据计算根据(x-y)^2=x^2+y^2-2xy可知,如果将向量doc提前归一化后可以将x^2和y^2的计算舍弃,进而只有一次乘法近似计算4.采用cache对于已经计算过的输入则不会重复计算,没有计算过的输入会保存其输出到cache中方便之后相同输出差表。整体思路介绍核心策略详解1.提前分配内存避免后面动态分配内存造成额外耗时由于已知doc的数量和每次只查询一个query,因此占用的内存是固定,可以提前开辟一块连续的大块内存,避免后面算法在动态分配内存造成内存碎片。2.数据读取阶段加速,关闭io相关的同步,加快数据读取由于c++的流机制存在的同步,因此关闭同步并且解绑输入输出流可以加快io的速度。3.通过使用doc归一化手段简化欧式距离计算,只用一次乘法即可获取结果简化了欧式距离的流程,由原来长度为L的向量需要计算两次减法一次乘法一次跟号共4L次运算,变换为只需要计算一次乘法L次运算。4.通过循环展开由于无法使用simd,因此采用循环展开,对于M个doc进行分块,按照4/8/16分块后每次计算4/8/16个doc与query的欧式距离,最后将剩余的块单独计算。经过编译器优化后可以产生一次加载4/8/16元素的向量化访存和计算的指令,从而加快整个doc的计算。5.通过优先队列计算topk由于topk的计算不需要全部doc的分数进行排序,只需要排序好前topk个即可,因此采用了堆排序的优先队列,可以避免全量doc的分数排序。6.使用cache缓存由于推理时可能有重复的query请求,为了避免重复query请求浪费计算力,会在计算时缓存旧的结果,如果遇到已经缓存的结果则直接取缓存值。
  • [大赛资讯] 【第六期拓扑感知的虚拟机放置问题】赛题补充介绍以及相关的一些算法思路建议。
    附件为出题专家组撰写的赛题介绍,以及关于第六期赛题的一些算法思路建议,对于各位选手的解题相信会有很大的启发,大家赶紧下载查看转发起来吧!以下为正文:Cloud computing has emerged as a new paradigm for on-demand access to a huge pool of computing resources. During the last decade it has formed a large-scale ecosystem of technologies, products and services which enabled the rapid innovation and development of new applications with millions of users. At the same time, cloud computing has posed new challenging problems to cloud providers that should operate the huge resource pools in a cost efficient way while providing a high quality of service to their users.To win the competition, major cloud service providers are focusing on improving the utilization of cloud resources, as this will allow to reduce the cost of cloud services. It is estimated that even a 1% increase in cloud service resource utilization can save millions of dollars in costs. The optimization of resource utilization, in turn, requires the development of advanced algorithms and technologies for cloud resource management. The virtual machine (VM) placement algorithm considered in this contest is one of these critical algorithms. This algorithm works online by finding a suitable server for running each requested user VM. The algorithm should compute its decision fast while providing a dense packing of VMs and avoiding rejections due to the lack of free space. The increase of the scale and complexity of cloud services and deployed applications has led to new user requirements. In particular, instead of a single VM, users often need to deploy a set of VMs with specific requirements on the quality of their placement. For example, to ensure high availability each VM must be placed in a separate fault domain, a set of servers that share a single point of failure, such as rack. In other situation, like neural network training, it may be preferable to place VMs "close" to each other in terms of cloud network topology to reduce the communication latency and speed up the computations.These new requirements bring new challenges to VM placement problem – how to quickly and efficiently place large sets of VMs while satisfying numerous constraints. Basic algorithms that place each VM independently cannot adapt to such complex scenarios. So the context of this Challenge is development of topology-aware VM placement algorithm that can both satisfy new constraints and ensure efficient resource utilization.The algorithm operates a large pool of servers by processing incoming user requests on creation and deletion of virtual machines. The resource pool is organized in a hierarchical topology with multiple levels. Servers live in racks with form separate fault domains, because the failure of the rack network switch or power supply leads to the failure of all servers from the rack. Racks are grouped into pods and network domains. The network communication inside the network domain is much faster than between different network domains.Cloud users create VMs from the predefined set of VM types. A user can request creation of multiple VMs of the same type at once. A user can request the deletion of any previously created VM at any time. Each VM is associated with some placement group (PG). The main role of PG is to define the requirements on placement of VMs belonging to this group. These requirements are specified as affinity and anti-affinity constraints. For example, each VM in group should be placed in a separate rack is an anti-affinity constraint. It is also possible to combine multiple constraints in a PG.The algorithm processes requests one by one, in online fashion. That means that the algorithm must output its decision regarding the current request without knowledge about the next requests. This is exactly how such algorithm runs in a real cloud. Your main goal will be to maximize the number of successfully placed VMs. As the algorithm runs on a resource pool of fixed size the more requests are allocated, the less resources are wasted and the more resource efficient is the algorithm.The first version of this problem was published as the ICPC 2022 Online Challenge powered by Huawei which attracted many algorithm enthusiasts from all over the world. For the new competition, the problem has been reworked to introduce more complex topology and placement constraints, new test cases and solution scoring. Below we describe the main differences.To model new placement constraints arising in AI training scenarios, a new topology level (pod) has been introduced which corresponds to a set of racks. Training of large AI models requires a massive amount of computations and typically runs in parallel across a large number of servers. The communication speed between these servers is crucial, therefore such workloads require allocation of one or several whole pods in a single network domain. Such constraints has been added to this challenge. We have also added more complex real-world constraints which combine affinity and anti-affinity on different topology levels.57 completely new test cases have been prepared for this challenge. While the previous challenge featured only synthetic test cases, this challenge also features test cases based on real cloud requests. The generator of synthetics cases has also been improved.Finally, the solution scoring is also reworked. While in ICPC Challenge the achieved results were compared with simple baseline algorithm, now we compare them with an upper bound on the number of successfully placed VMs which is a hard theoretical limit.The problem being solved shares many similarities with the classic bin packing and knapsack problems. Indeed, we need to pack items represented by VMs into bins represented by servers. Therefore algorithms for solving these well-known problems can be used to efficiently pack VMs into servers without leaving unused resources. However, our problem is remarkably harder than these classic problems because of the following features. First, we consider online scheduling in which future requests are unknown in advance, while most of classic algorithms consider offline setting where all packed items are given as input. Development of online algorithm requires dealing with uncertainty. One approach is to minimize the risks by preparing to challenging requests. For example, we can be prepared for upcoming requests for large VM by reserving enough free space to accommodate such VMs. Another approach is to try to anticipate future requests by using some predictive model based on already processed requests. For example, the distribution of requested VM types, or what kind or requests arrived recently, may provide some hints about the upcoming requests.Second, we consider the dynamic packing problem where the items may depart from the bins, which also quite different from the classic bin packing. While item departures are not as challenging as dynamic item arrivals and can even simplify the packing, they can also be leveraged to improve the packing efficiency. For example, by packing together VMs with similar lifetime, we can improve the availability of large free space on servers for packing large VMs. This requires predicting the VM lifetime, which can be also done using the already observed information. Note also that VMs from the same PG are usually deleted in batches, so their lifetime is similar.Third, the most complexity comes from the placement constraints. Consider a PG with affinity constraint on rack level. When placing the first request from this group, we need to choose a rack with enough free space to accommodate the VMs from this request and the VMs from upcoming requests of this group. This means that it is no longer safe to pack VMs one by one, without considering whether the whole request will fit a rack. It also means that a care should be taken to provide some room for possible future growth of the group inside the chosen rack. One possible approach is to choose a rack with the most available free space. Note also that groups cannot grow infinitely, there is some limit on their size. Anti-affinity constraints are usually less challenging but they also require the strategical management of available space to success. For example, for anti-affinity on rack level with partitions it is better to choose the racks where the request partition is already present to avoid occupying too much racks for some partition, since it will reduce possible racks for accommodating other partitions. Think about what kind of similar strategies can be applied for other constraints. Finally, when choosing a topology domain for placing VMs it is convenient to select domains in a top-down fashion using some scoring metrics or rules for each domain on some level.As a last recommendation and a word of caution, try to avoid overfitting your solution parameters and even placement strategies to individual tests. This will both complicate your algorithm and make it inapplicable in a real cloud setting. Strive for simpler and general algorithms with a clear intuition behind them and a small number of parameters, and also add comments in the code to explaining them. That is what we expect from a perfect solution to our challenge.
  • [自动学习] 深度学习算法中的过拟合问题如何解决?
    在深度学习算法中,过拟合是一个常见的问题,即模型在训练数据上表现良好,但在测试数据或实际应用中性能下降。请提出几种解决过拟合问题的有效方法,并解释其原理和应用场景。例如,可以采用正则化技术、增加数据集多样性、使用dropout等方法来防止过拟合。
  • [第5期高维向量数据] total time limit: 50 seconds是指每个测试集50秒还是一共50秒
    total time limit: 50 seconds是指每个测试集50秒还是一共50秒
  • [第5期高维向量数据] 得分为0,反馈信息也没有,
    但是本地测试一些简单的样例后,结果确实是正确的。baseline有分数,但是使用一些近似检索算法后就一直得分为0。。图1是baseline,图2 是改进代码,没有调用额外的库。
  • [常见FAQ] python程序在上传之后显示”初始化超时“
    在本地运行没有任何问题,本次初始化5ms都不到的初始化程序,在提交到系统之后一直显示”初始化超时“,求指导,该怎么办?
  • [应用开发] SegFormer-B0 OM模型在MDC300F上推理时间为1000ms
    在将segformer-b0算法模型转OM后,在MDC300F MINI上推理耗时1000ms。ONNX中MatMul算子的输入数据的shape带batch维度,将算子的type类型更改为 BatchMatMul后,转OM时,发现MatMul算子前后各出现一个trans_TransData算子。通过profiling计算算子耗时,发现TransposeD、ArgMaxD、TransData、SoftmaxV2、BatchMatMul占用大量的推理时间。请问如何避免带batch维度的MatMul产生TransData算子?针对现在算子耗时,有什么优化的方法吗?
  • [技术干货] 深度优先搜索(DFS)与广度优先搜索(BFS)及其应用场景
    在算法领域中,深度优先搜索(DFS)和广度优先搜索(BFS)是两种常见且重要的图遍历算法。它们各自具有独特的特点和适用场景。本文将详细介绍这两种算法,并探讨它们的最佳应用场景。一、深度优先搜索(DFS)深度优先搜索是一种用于遍历或搜索树或图的算法。这个算法会尽可能深地搜索图的分支。当节点v的所在边都己被探寻过,搜索将回溯到发现节点v的那条边的起始节点。这一过程一直进行到已发现从源节点可达的所有节点为止。如果还存在未被发现的节点,则选择其中一个作为源节点并重复以上过程,整个进程反复进行直到所有节点都被访问为止。DFS的应用场景:图的连通性检查:给定一个无向图,判断是否存在从节点A到节点B的路径。 生成树的构造:在计算机网络中,DFS可以用来发现网络的拓扑结构,并生成一棵生成树。 解决迷宫问题:在迷宫问题中,DFS可以用来找到从入口到出口的路径。 树的遍历:对于二叉树、多叉树等数据结构,DFS是一种非常有效的遍历方式。二、广度优先搜索(BFS)广度优先搜索是另一种用于遍历或搜索树或图的算法。它从根(或任意节点)开始并探索最近的节点,然后进行下一层的邻居节点,这样一层一层地进行。BFS使用队列数据结构来保存信息。BFS的应用场景:最短路径问题:在图中找到从一个节点到另一个节点的最短路径。例如,在地图中找到从起点到终点的最短路线。 层次遍历:对于二叉树、多叉树等数据结构,BFS可以用来进行层次遍历。 图的连通性检查:与DFS相似,BFS也可以用来检查图的连通性,但它通常用于找到从起点到所有其他可达节点的最短路径。 网络爬虫:在网页搜索中,BFS被用来遍历网页之间的链接,从而找到与给定网页相关的所有网页。总结DFS和BFS各有优缺点,选择哪种算法取决于具体的问题和应用场景。DFS通常用于深度探索,而BFS则更适用于广度探索。在实际应用中,我们需要根据问题的特性来选择最合适的算法。希望本文能帮助您更好地理解深度优先搜索和广度优先搜索,以及它们在不同场景下的应用。
  • [大赛资讯] 关于A题
    节点被自动删除后,新生成的节点编号还会复用此前的吗?比如样例中3号节点在4号pod被删除后就删除了,那再新生成一个节点编号是3还是4?
  • [技术干货] 开源项目分享之Hello算法(hello-algo)
    动画图解、一键运行的数据结构与算法教程项目地址: GitHub在线阅读                                   下载PDF《Hello 算法》:动画图解、一键运行的数据结构与算法教程,支持 Java, C++, Python, Go, JS, TS, C#, Swift, Rust, Dart, Zig 等语言。​关于本书本项目旨在打造一本开源免费、新手友好的数据结构与算法入门教程。全书采用动画图解,内容清晰易懂、学习曲线平滑,引导初学者探索数据结构与算法的知识地图。源代码可一键运行,帮助读者在练习中提升编程技能,了解算法工作原理和数据结构底层实现。鼓励读者互助学习,提问与评论通常可在两日内得到回复。推荐语
  • [问题求助] 雪花算法生成的ID为什么能保证全局唯一?
    雪花算法生成的ID为什么能保证全局唯一?
  • [其他] 浅谈AdaBoost算法
     Adaboost是一种迭代算法,其核心思想是针对同一个训练集训练不同的分类器(弱分类器),然后把这些弱分类器集合起来,构成一个更强的最终分类器(强分类器)。其算法本身是通过改变数据分布来实现的,它根据每次训练集之中每个样本的分类是否正确,以及上次的总体分类的准确率,来确定每个样本的权值。将修改过权值的新数据集送给下层分类器进行训练,最后将每次训练得到的分类器最后融合起来,作为最后的决策分类器。使用adaboost分类器可以排除一些不必要的训练数据特征,并放在关键的训练数据上面。 AdaBoost用于短决策树。创建第一棵树之后,使用树在每个训练实例上的性能来得到一个权重,决定下一棵树对每个训练实例的注意力。 难以预测的训练数据被赋予更多权重,而易于预测的实例被赋予更少的权重。模型是一个接一个地顺序创建的,每个模型更新训练实例上的权重,这些权重影响序列中下一个树所执行的学习。构建完所有树之后,将对新数据进行预测。 因为着重于修正算法的错误,所以重要的是提前清洗好数据,去掉异常值。 Adaboost是一种迭代算法,其核心思想是针对同一个训练集训练不同的分类器(弱分类器),然后把这些弱分类器集合起来,构成一个更强的最终分类器(强分类器)。  算法应用  对adaBoost算法的研究以及应用大多集中于分类问题,同时也出现了一些在回归问题上的应用。就其应用adaBoost系列主要解决了: 两类问题、多类单标签问题、多类多标签问题、大类单标签问题、回归问题。它用全部的训练样本进行学习。 应用示例  adaboost 是 bosting 的方法之一,bosting就是把若干个分类效果并不好的分类器综合起来考虑,会得到一个效果比较好的分类器。 下图,左右两个决策树,单个看是效果不怎么好的,但是把同样的数据投入进去,把两个结果加起来考虑,就会增加可信度。  adaboost 的栗子,手写识别中,在画板上可以抓取到很多 features,例如 始点的方向,始点和终点的距离等等。  training 的时候,会得到每个 feature 的 weight,例如 2 和 3 的开头部分很像,这个 feature 对分类起到的作用很小,它的权重也就会较小。  而这个 alpha 角 就具有很强的识别性,这个 feature 的权重就会较大,最后的预测结果是综合考虑这些 feature 的结果。 
总条数:44 到第
上滑加载中