




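  1. If the compiler already knows the best way to optimize (true in the vast majority of cases), don't complicate the code.
  2. The compiler's strength is being clever, but yours is knowing more about the program, so give the compiler hints rather than hand-writing assembly or SIMD.
  3. In 99% of cases, don't use SIMD. If you cannot get the compiler to act on your hints and performance here _really_ matters, SIMD is an option, but watch out for portability, and measure that your code actually beats the compiler at -O3 (because of pipelining and other CPU effects, the hand-written version may be no faster).
  4. Don't use assembly unless you have found a problem in the SIMD intrinsics library ( https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html)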























































































-fschedule-insns2 -fsched-interblock: schedule (reorder) instructions to improve execution efficiency.






















Count the number of 1 bits in the binary representation of a number
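As a rough illustration of bit counting, here is a minimal sketch with two variants. The function names are made up for this example. The first is a portable loop; the second relies on the GCC/Clang builtin `__builtin_popcountll`, so it assumes one of those compilers.

```cpp
#include <cstdint>

// Portable reference: clear the lowest set bit until none remain.
inline int popcount_loop(std::uint64_t x)
{
    int n = 0;
    while (x != 0)
    {
        x &= x - 1; // drops the lowest set bit
        ++n;
    }
    return n;
}

// GCC/Clang builtin. When the target supports it (for example with -mpopcnt),
// this typically compiles down to a single popcnt instruction.
inline int popcount_builtin(std::uint64_t x)
{
    return __builtin_popcountll(x);
}
```

Both return the same count; the builtin simply tells the compiler directly what is being computed instead of making it recognize the loop.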



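// Without __restrict__, a or b might alias c, so *c is reloaded after the first store.
void f(int *a, int *b, int *c) {
  *a += *c;
  *b += *c;
}

// f(int*, int*, int*):
// 	movl	(%rdx), %eax
// 	addl	%eax, (%rdi)
// 	movl	(%rdx), %eax
// 	addl	%eax, (%rsi)
// 	ret

// With __restrict__ (compiled as a separate snippet) the compiler may assume
// the pointers never alias, so *c is loaded only once.
void f(int * __restrict__ a, int* __restrict__ b, int* __restrict__ c) {
  *a += *c;
  *b += *c;
}

//  f(int*, int*, int*):
// 	movl	(%rdx), %eax
// 	addl	%eax, (%rdi)
// 	addl	%eax, (%rsi)
// 	ret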


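Besides using __restrict to let the compiler apply SIMD optimizations safely, you can also use the OpenMP directive #pragma omp simd to make the compiler disregard possible pointer aliasing and enable SIMD optimization. This requires passing the -fopenmp option to the compiler.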
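A minimal sketch of the OpenMP approach might look like this. The function name `add_scaled` is hypothetical. The pragma is only honored when OpenMP is enabled (`-fopenmp`, or `-fopenmp-simd` on GCC and Clang); without it the pragma is ignored and the loop still computes the same result.

```cpp
#include <cstddef>

// Without the pragma the compiler must assume that a and b may overlap.
// "#pragma omp simd" asserts that the iterations are safe to execute in SIMD lanes,
// so the caller is responsible for passing non-overlapping arrays.
void add_scaled(float * a, const float * b, float k, std::size_t n)
{
#pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
        a[i] += k * b[i];
}
```

Note that the pragma is a promise, not a check: if the arrays do overlap, the vectorized result may differ from the scalar one.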




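Unrolling a small loop body can pay off, but it is best not to unroll a large loop body: the extra code puts pressure on the instruction cache and can make the loop slower.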
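For a small body, an unroll hint can be sketched as below. The function name is made up for this example, and the pragma shown is the GCC spelling (available since GCC 8); Clang offers `#pragma unroll N` for the same purpose. A compiler that does not know the pragma ignores it, possibly with a warning.

```cpp
#include <cstddef>

// The loop body is a single addition, so asking for 4x unrolling costs little code size.
float sum_small_body(const float * x, std::size_t n)
{
    float s = 0.0f;
#pragma GCC unroll 4
    for (std::size_t i = 0; i < n; ++i)
        s += x[i];
    return s;
}
```

Whether the hint helps still has to be measured; at -O3 the compiler often makes a reasonable unrolling decision on its own.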



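For example, with AVX-512, data aligned to 64 bytes can be loaded directly into a zmm register with _mm512_load_pd, processed with SIMD instructions, and stored back with _mm512_stream_pd. Without a substantial amount of vectorized computation, the alignment only wastes memory.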
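The load, compute, stream-store pattern could be sketched roughly as follows. The function name `add_one_aligned` is hypothetical, and the sketch assumes the buffer is 64-byte aligned and its length is a multiple of 8. When the build does not target AVX-512F, it falls back to a plain loop so the example still compiles everywhere.

```cpp
#include <cstddef>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

// Adds 1.0 to every element.
// Preconditions (assumed, not checked): data is 64-byte aligned, n is a multiple of 8.
void add_one_aligned(double * data, std::size_t n)
{
#if defined(__AVX512F__)
    const __m512d one = _mm512_set1_pd(1.0);
    for (std::size_t i = 0; i < n; i += 8)
    {
        __m512d v = _mm512_load_pd(data + i);               // aligned load into a zmm register
        _mm512_stream_pd(data + i, _mm512_add_pd(v, one));  // aligned non-temporal store
    }
    _mm_sfence(); // order the streaming stores before later memory accesses
#else
    for (std::size_t i = 0; i < n; ++i)
        data[i] += 1.0;
#endif
}
```

A buffer declared as `alignas(64) double buf[16];` satisfies the alignment precondition.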







  • Masking
  • Permute
  • Selective load / store
  • Compress / expand
  • Selective gather / scatter



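Permute: for each lane, copy the value of the input vector at the offset given by the index vector into the destination vector. Before AVX-512, a database management system had to write the data from the SIMD register to memory and then read it back into a SIMD register. AVX-512 introduced new PERMUTE operations that rearrange elements entirely within SIMD registers, which improves performance considerably.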
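To make the lane semantics concrete, here is a scalar model of a 16-lane permute. It is only a reference sketch with a made-up name, not the intrinsic itself; the behavior it mimics is that of `_mm512_permutexvar_epi32`, which uses only the low 4 bits of each index.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar model: lane i of the result takes the input lane named by idx[i].
// Only the low 4 bits of each index are used, so indices wrap around 16.
std::array<std::int32_t, 16> permute16(const std::array<std::int32_t, 16> & idx,
                                       const std::array<std::int32_t, 16> & input)
{
    std::array<std::int32_t, 16> out{};
    for (std::size_t i = 0; i < 16; ++i)
        out[i] = input[static_cast<std::size_t>(idx[i]) & 15];
    return out;
}
```

With `idx[i] = 15 - i`, for instance, the result is the input reversed.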


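In SIMD (Single Instruction, Multiple Data) programming, a blend is an operation that combines two vectors according to a given rule: for each lane, a mask decides whether the result element is taken from the first vector or from the second, producing a new vector.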
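A scalar model of a mask-driven blend, again only as a reference sketch with a hypothetical name. The selection rule follows `_mm512_mask_blend_epi32`: a set mask bit picks the element from the second vector.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar model: if bit i of mask is set, lane i comes from b, otherwise from a.
std::array<std::int32_t, 16> blend16(std::uint16_t mask,
                                     const std::array<std::int32_t, 16> & a,
                                     const std::array<std::int32_t, 16> & b)
{
    std::array<std::int32_t, 16> out{};
    for (std::size_t i = 0; i < 16; ++i)
        out[i] = ((mask >> i) & 1u) ? b[i] : a[i];
    return out;
}
```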

Selective load / store


Compress / expand




Selective gather / scatter





Make the most out of your SIMD investments: counter control flow divergence in compiled query pipelines



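#ifdef __SSE2__
#include <emmintrin.h>
#endif

// The guard condition below is an assumption; the header provides the AVX/AVX-512 intrinsics on x86.
#if defined(__AVX512F__) || defined(__AVX2__)
#include <immintrin.h>
#endif

#if defined(__aarch64__) && defined(__ARM_NEON)
#    include <arm_neon.h>
#      pragma clang diagnostic ignored "-Wreserved-identifier"
#endif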





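  1. With -ftree-loop-distribute-patterns, the compiler may recognize the internal copy loops and replace them with a call to memcpy, which lands back in this custom memcpy and recurses infinitely, so the option must be disabled when compiling it.
  2. Wider vectors bring two problems: AVX-512 can cause frequency throttling, and using AVX in code that is mostly SSE incurs a switching penalty.
  3. The author then lists several factors that affect performance:
    1. Prefetch instructions: the right prefetch offset is non-obvious, and prefetching helps on some Intel CPUs but can slow things down on AMD, so no prefetching is used here.
    2. Alignment: unaligned loads and aligned stores are used here.
    3. Loop unroll count: fixed at 8 here.
  4. __attribute__((no_sanitize("coverage"))) disables coverage instrumentation for the function.
  5. Finally, the author notes that a memcpy with a small size known at compile time may be replaced by plain instructions without a call to memcpy; this is controlled by **-fbuiltin-memcpy**.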
#include <stddef.h>

#include <emmintrin.h>

/** Custom memcpy implementation for ClickHouse.
  * It has the following benefits over using glibc's implementation:
  * 1. Avoiding dependency on specific version of glibc's symbol, like memcpy@@GLIBC_2.14 for portability.
  * 2. Avoiding indirect call via PLT due to shared linking, that can be less efficient.
  * 3. It's possible to include this header and call inline_memcpy directly for better inlining or interprocedural analysis.
  * 4. Better results on our performance tests on current CPUs: up to 25% on some queries and up to 0.7%..1% in average across all queries.
  * Writing our own memcpy is extremely difficult for the following reasons:
  * 1. The optimal variant depends on the specific CPU model.
  * 2. The optimal variant depends on the distribution of size arguments.
  * 3. It depends on the number of threads copying data concurrently.
  * 4. It also depends on how the calling code is using the copied data and how the different memcpy calls are related to each other.
  * Due to vast range of scenarios it makes proper testing especially difficult.
  * When writing our own memcpy there is a risk to overoptimize it
  * on non-representative microbenchmarks while making real-world use cases actually worse.
  * Most of the benchmarks for memcpy on the internet are wrong.
  * Let's look at the details:
  * For small size, the order of branches in code is important.
  * There are variants with specific order of branches (like here or in glibc)
  * or with jump table (in asm code see example from Cosmopolitan libc:
  * https://github.com/jart/cosmopolitan/blob/de09bec215675e9b0beb722df89c6f794da74f3f/libc/nexgen32e/memcpy.S#L61)
  * or with Duff device in C (see https://github.com/skywind3000/FastMemcpy/)
  * It's also important how to copy uneven sizes.
  * Almost every implementation, including this, is using two overlapping movs.
  * It is important to disable -ftree-loop-distribute-patterns when compiling memcpy implementation,
  * otherwise the compiler can replace internal loops to a call to memcpy that will lead to infinite recursion.
  * For larger sizes it's important to choose the instructions used:
  * - SSE or AVX or AVX-512;
  * - rep movsb;
  * Performance will depend on the size threshold, on the CPU model, on the "erms" flag
  * ("Enhansed Rep MovS" - it indicates that performance of "rep movsb" is decent for large sizes)
  * https://stackoverflow.com/questions/43343231/enhanced-rep-movsb-for-memcpy
  * Using AVX-512 can be bad due to throttling.
  * Using AVX can be bad if most code is using SSE due to switching penalty
  * (it also depends on the usage of "vzeroupper" instruction).
  * But in some cases AVX gives a win.
  * It also depends on how many times the loop will be unrolled.
  * We are unrolling the loop 8 times (by the number of available registers), but it not always the best.
  * It also depends on the usage of aligned or unaligned loads/stores.
  * We are using unaligned loads and aligned stores.
  * It also depends on the usage of prefetch instructions. It makes sense on some Intel CPUs but can slow down performance on AMD.
  * Setting up correct offset for prefetching is non-obvious.
  * Non-temporary (cache bypassing) stores can be used for very large sizes (more than a half of L3 cache).
  * But the exact threshold is unclear - when doing memcpy from multiple threads the optimal threshold can be lower,
  * because L3 cache is shared (and L2 cache is partially shared).
  * Very large size of memcpy typically indicates suboptimal (not cache friendly) algorithms in code or unrealistic scenarios,
  * so we don't pay attention to using non-temporary stores.
  * On recent Intel CPUs, the presence of "erms" makes "rep movsb" the most beneficial,
  * even comparing to non-temporary aligned unrolled stores even with the most wide registers.
  * memcpy can be written in asm, C or C++. The latter can also use inline asm.
  * The asm implementation can be better to make sure that compiler won't make the code worse,
  * to ensure the order of branches, the code layout, the usage of all required registers.
  * But if it is located in separate translation unit, inlining will not be possible
  * (inline asm can be used to overcome this limitation).
  * Sometimes C or C++ code can be further optimized by compiler.
  * For example, clang is capable replacing SSE intrinsics to AVX code if -mavx is used.
  * Please note that compiler can replace plain code to memcpy and vice versa.
  * - memcpy with compile-time known small size is replaced to simple instructions without a call to memcpy;
  *   it is controlled by -fbuiltin-memcpy and can be manually ensured by calling __builtin_memcpy.
  *   This is often used to implement unaligned load/store without undefined behaviour in C++.
  * - a loop with copying bytes can be recognized and replaced by a call to memcpy;
  *   it is controlled by -ftree-loop-distribute-patterns.
  * - also note that a loop with copying bytes can be unrolled, peeled and vectorized that will give you
  *   inline code somewhat similar to a decent implementation of memcpy.
  * This description is up to date as of Mar 2021.
  * How to test the memcpy implementation for performance:
  * 1. Test on real production workload.
  * 2. For synthetic test, see utils/memcpy-bench, but make sure you will do the best to exhaust the wide range of scenarios.
  * TODO: Add self-tuning memcpy with bayesian bandits algorithm for large sizes.
  * See https://habr.com/en/company/yandex/blog/457612/
  */

static inline void * inline_memcpy(void * __restrict dst_, const void * __restrict src_, size_t size)
{
    /// We will use pointer arithmetic, so char pointer will be used.
    /// Note that __restrict makes sense (otherwise compiler will reload data from memory
    /// instead of using the value of registers due to possible aliasing).
    char * __restrict dst = reinterpret_cast<char * __restrict>(dst_);
    const char * __restrict src = reinterpret_cast<const char * __restrict>(src_);

    /// Standard memcpy returns the original value of dst. It is rarely used but we have to do it.
    /// If you use memcpy with small but non-constant sizes, you can call inline_memcpy directly
    /// for inlining and removing this single instruction.
    void * ret = dst;

tail:
    /// Small sizes and tails after the loop for large sizes.
    /// The order of branches is important but in fact the optimal order depends on the distribution of sizes in your application.
    /// This order of branches is from the disassembly of glibc's code.
    /// We copy chunks of possibly uneven size with two overlapping movs.
    /// Example: to copy 5 bytes [0, 1, 2, 3, 4] we will copy tail [1, 2, 3, 4] first and then head [0, 1, 2, 3].
    // Unaligned access: two overlapping moves.
    if (size <= 16)
    {
        if (size >= 8)
        {
            /// Chunks of 8..16 bytes.
            __builtin_memcpy(dst + size - 8, src + size - 8, 8);
            __builtin_memcpy(dst, src, 8);
        }
        else if (size >= 4)
        {
            /// Chunks of 4..7 bytes.
            __builtin_memcpy(dst + size - 4, src + size - 4, 4);
            __builtin_memcpy(dst, src, 4);
        }
        else if (size >= 2)
        {
            /// Chunks of 2..3 bytes.
            __builtin_memcpy(dst + size - 2, src + size - 2, 2);
            __builtin_memcpy(dst, src, 2);
        }
        else if (size >= 1)
        {
            /// A single byte.
            *dst = *src;
        }
        /// No bytes remaining.
    }
    else
    {
        // src and dst generally cannot both be 16-byte (128-bit) aligned,
        // hence the unaligned loads and aligned stores below.
        /// Medium and large sizes.
        if (size <= 128)
        {
            /// Medium size, not enough for full loop unrolling.

            /// We will copy the last 16 bytes.
            _mm_storeu_si128(reinterpret_cast<__m128i *>(dst + size - 16), _mm_loadu_si128(reinterpret_cast<const __m128i *>(src + size - 16)));

            /// Then we will copy every 16 bytes from the beginning in a loop.
            /// The last loop iteration will possibly overwrite some part of already copied last 16 bytes.
            /// This is Ok, similar to the code for small sizes above.
            while (size > 16)
            {
                _mm_storeu_si128(reinterpret_cast<__m128i *>(dst), _mm_loadu_si128(reinterpret_cast<const __m128i *>(src)));
                dst += 16;
                src += 16;
                size -= 16;
            }
        }
        else
        {
            /// Large size with fully unrolled loop.

            /// Align destination to 16 bytes boundary.
            size_t padding = (16 - (reinterpret_cast<size_t>(dst) & 15)) & 15;

            /// If not aligned - we will copy first 16 bytes with unaligned stores.
            if (padding > 0)
            {
                __m128i head = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
                _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), head);
                dst += padding;
                src += padding;
                size -= padding;
            }

            /// Aligned unrolled copy. We will use half of available SSE registers.
            /// It's not possible to have both src and dst aligned.
            /// So, we will use aligned stores and unaligned loads.
            __m128i c0, c1, c2, c3, c4, c5, c6, c7;

            while (size >= 128)
            {
                c0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src) + 0);
                c1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src) + 1);
                c2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src) + 2);
                c3 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src) + 3);
                c4 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src) + 4);
                c5 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src) + 5);
                c6 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src) + 6);
                c7 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src) + 7);
                src += 128;
                _mm_store_si128((reinterpret_cast<__m128i*>(dst) + 0), c0);
                _mm_store_si128((reinterpret_cast<__m128i*>(dst) + 1), c1);
                _mm_store_si128((reinterpret_cast<__m128i*>(dst) + 2), c2);
                _mm_store_si128((reinterpret_cast<__m128i*>(dst) + 3), c3);
                _mm_store_si128((reinterpret_cast<__m128i*>(dst) + 4), c4);
                _mm_store_si128((reinterpret_cast<__m128i*>(dst) + 5), c5);
                _mm_store_si128((reinterpret_cast<__m128i*>(dst) + 6), c6);
                _mm_store_si128((reinterpret_cast<__m128i*>(dst) + 7), c7);
                dst += 128;

                size -= 128;
            }

            /// The latest remaining 0..127 bytes will be processed as usual.
            goto tail;
        }
    }

    return ret;
}
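
The "two overlapping moves" trick used for small sizes can be isolated into a tiny helper. This is only an illustrative sketch with a made-up function name, not part of ClickHouse; it handles the 8..16 byte case with standard `std::memcpy` calls of constant size, which compilers normally turn into plain 8-byte loads and stores.

```cpp
#include <cstddef>
#include <cstring>

// Copies 8..16 bytes with two fixed-size 8-byte moves that may overlap in the middle.
// Preconditions (assumed, not checked): 8 <= size <= 16, and src and dst do not overlap.
inline void copy_8_to_16(char * dst, const char * src, std::size_t size)
{
    std::memcpy(dst + size - 8, src + size - 8, 8); // tail: the last 8 bytes
    std::memcpy(dst, src, 8);                       // head: the first 8 bytes
}
```

For size 11, the tail move writes bytes 3..10 and the head move writes bytes 0..7. Bytes 3..7 are written twice with the same values, which is harmless and avoids a byte-by-byte loop or extra branches.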


Some implementations also use prefetch instructions, for example FastMemcpy by skywind3000; ClickHouse also includes the complete code as a benchmark: https://github.com/skywind3000/FastMemcpy/blob/master/FastMemcpy.h



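This code comes from the ClickHouse project. It defines a method named optimize, which optimizes the order and manner in which data is read from a ClickHouse table. Specifically, when reading a MergeTree table, it pre-filters the data to reduce the amount that has to be read and thereby improve query performance. https://github.com/ClickHouse/ClickHouse/blob/4279dd2bf11841d8f68bdea78f3d8668a2c4289b/src/Storages/MergeTree/MergeTreeRangeReader.cpp#L495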

First, it combines current_filter with the existing final_filter (if there is one) to create a new filter, named filter, which is applied at the start of each data block.
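One plausible way to combine two byte filters is sketched below. This is an assumption-laden illustration, not the ClickHouse implementation: the names `ByteFilter` and `combine_filters` are hypothetical, `std::vector` stands in for `PaddedPODArray<UInt8>`, and the sketch assumes the second filter has exactly one entry per row that the first filter kept.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using ByteFilter = std::vector<std::uint8_t>; // stand-in for PaddedPODArray<UInt8>

// first:  one byte per row of the block (non-zero means the row is kept so far).
// second: one byte per row that `first` kept, in order (assumption of this sketch).
// Result: one byte per row of the block, 1 only if both filters keep the row.
ByteFilter combine_filters(const ByteFilter & first, const ByteFilter & second)
{
    ByteFilter out(first.size(), 0);
    std::size_t j = 0; // position in `second`
    for (std::size_t i = 0; i < first.size(); ++i)
        if (first[i] != 0)
            out[i] = second[j++] != 0;
    return out;
}
```

For example, combining `{1, 0, 1, 1, 0}` with `{1, 0, 1}` keeps only rows 0 and 3 of the block.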


    using Filter = PaddedPODArray<UInt8>;










/// Transform 64-byte mask to 64-bit mask
inline UInt64 bytes64MaskToBits64Mask(const UInt8 * bytes64)
{
    /// Every branch first builds a mask of the bytes equal to zero; it is inverted at the end,
    /// so bit i of the result is set when bytes64[i] is non-zero.
#if defined(__AVX512F__) && defined(__AVX512BW__)
    const __m512i vbytes = _mm512_loadu_si512(reinterpret_cast<const void *>(bytes64));
    UInt64 res = _mm512_testn_epi8_mask(vbytes, vbytes);
#elif defined(__AVX__) && defined(__AVX2__)
    const __m256i zero32 = _mm256_setzero_si256();
    UInt64 res =
        (static_cast<UInt64>(_mm256_movemask_epi8(_mm256_cmpeq_epi8(
        _mm256_loadu_si256(reinterpret_cast<const __m256i *>(bytes64)), zero32))) & 0xffffffff)
        | (static_cast<UInt64>(_mm256_movemask_epi8(_mm256_cmpeq_epi8(
        _mm256_loadu_si256(reinterpret_cast<const __m256i *>(bytes64+32)), zero32))) << 32);
#elif defined(__SSE2__)
    const __m128i zero16 = _mm_setzero_si128();
    UInt64 res =
        (static_cast<UInt64>(_mm_movemask_epi8(_mm_cmpeq_epi8(
        _mm_loadu_si128(reinterpret_cast<const __m128i *>(bytes64)), zero16))) & 0xffff)
        | ((static_cast<UInt64>(_mm_movemask_epi8(_mm_cmpeq_epi8(
        _mm_loadu_si128(reinterpret_cast<const __m128i *>(bytes64 + 16)), zero16))) << 16) & 0xffff0000)
        | ((static_cast<UInt64>(_mm_movemask_epi8(_mm_cmpeq_epi8(
        _mm_loadu_si128(reinterpret_cast<const __m128i *>(bytes64 + 32)), zero16))) << 32) & 0xffff00000000)
        | ((static_cast<UInt64>(_mm_movemask_epi8(_mm_cmpeq_epi8(
        _mm_loadu_si128(reinterpret_cast<const __m128i *>(bytes64 + 48)), zero16))) << 48) & 0xffff000000000000);
#elif defined(__aarch64__) && defined(__ARM_NEON)
    const uint8x16_t bitmask = {0x01, 0x02, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80, 0x01, 0x02, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80};
    const auto * src = reinterpret_cast<const unsigned char *>(bytes64);
    const uint8x16_t p0 = vceqzq_u8(vld1q_u8(src));
    const uint8x16_t p1 = vceqzq_u8(vld1q_u8(src + 16));
    const uint8x16_t p2 = vceqzq_u8(vld1q_u8(src + 32));
    const uint8x16_t p3 = vceqzq_u8(vld1q_u8(src + 48));
    uint8x16_t t0 = vandq_u8(p0, bitmask);
    uint8x16_t t1 = vandq_u8(p1, bitmask);
    uint8x16_t t2 = vandq_u8(p2, bitmask);
    uint8x16_t t3 = vandq_u8(p3, bitmask);
    uint8x16_t sum0 = vpaddq_u8(t0, t1);
    uint8x16_t sum1 = vpaddq_u8(t2, t3);
    sum0 = vpaddq_u8(sum0, sum1);
    sum0 = vpaddq_u8(sum0, sum0);
    UInt64 res = vgetq_lane_u64(vreinterpretq_u64_u8(sum0), 0);
#else
    UInt64 res = 0;
    for (size_t i = 0; i < 64; ++i)
        res |= static_cast<UInt64>(0 == bytes64[i]) << i;
#endif
    return ~res;
}
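
To show what such a 64-bit mask is good for, here is a hedged sketch of walking the selected rows. Both function names are made up for this example: `bytes64_to_bits64` is a portable stand-in for the function above, and `selected_rows` uses the GCC/Clang builtin `__builtin_ctzll`, so it assumes one of those compilers.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Portable stand-in for the function above: bit i is set when bytes64[i] != 0.
inline std::uint64_t bytes64_to_bits64(const std::uint8_t * bytes64)
{
    std::uint64_t res = 0;
    for (std::size_t i = 0; i < 64; ++i)
        res |= static_cast<std::uint64_t>(bytes64[i] != 0) << i;
    return res;
}

// Collects the indices of the selected rows by repeatedly taking the lowest set bit.
inline std::vector<std::size_t> selected_rows(std::uint64_t mask)
{
    std::vector<std::size_t> rows;
    while (mask != 0)
    {
        rows.push_back(static_cast<std::size_t>(__builtin_ctzll(mask))); // index of the lowest set bit
        mask &= mask - 1;                                                 // clear that bit
    }
    return rows;
}
```

With the mask in hand, a block of 64 rows can be skipped in one comparison when the mask is 0, or copied wholesale when all 64 bits are set; only mixed blocks need the per-bit walk.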






