SIMD数据级并行处理

Posted 2023-09-08 16:30 +0800 by ZhangJie ‐ 3 min read

Q: 那，接下来需要看下下面几个问题：什么是SIMD呢？SIMD最初是用来解决什么问题的呢？SIMD在JSON编码、解析中可以用来做什么呢？simdjson正确使用SIMD还需要注意些什么呢？

What’s SIMD

SIMD(Single Instruction Multiple Data) 是一种并行计算技术,可以同时对多个数据执行相同的操作。使用 SIMD 的主要目的是为了提升计算性能。

目前在大多数现代主流ARM、x64处理器上都支持SIMD，最初Pentium支持SIMD是为了更好地对多媒体（声音）进行处理，现代处理器增加了位宽更大的寄存器（128-bit、256-bit、512-bit），也增加了一些高效的指令。

老的x64（Intel、AMD）平台可以用SSE2…SSE4.2（128-bit），主流的x64（Intel、AMD）可以用AVX、AVX2（256-bit），最新的x64（Intel）可以用AVX-512（512-bit），其他平台可以自行检索下。

ps: 并行处理按照发生的粒度，可以划分为：任务并行（多核），指令并行（超标量流水线），数据并行（simd、vector、gpu）。

SIMD适用场景

适合使用 SIMD 的情况包括:

需要对大批量数据执行相同的数学运算或逻辑运算,如向量、矩阵运算、图像处理等。
需要对多媒体数据如音频、视频等进行处理,如编码、解码、滤波、变换等。
在数据库、科学计算、金融分析等需要处理大量数值计算的场景。
游戏开发中的物理模拟、人工智能等也可以使用 SIMD。

使用 SIMD 的好处有:

提高计算并行度,单次指令处理更多数据。
减少指令数,降低指令调度开销。
更高效利用处理器内部执行单元。
数据级并行,更易映射到多核架构。

一些常见使用 SIMD 的例子:

图像处理:模糊、锐化、色彩空间转换等算法可以用SIMD加速。
信号处理:FFT、FIR/IIR 滤波等用SIMD实现。
科学计算:向量矩阵运算都可以用 SIMD 优化。
数据压缩/解压:如音频视频编解码中的 SIMD 优化。
数据库操作:聚集函数、关系运算可用 SIMD 实现。
机器学习:神经网络中矩阵乘法、激活函数计算等使用 SIMD。

总之,SIMD 非常适合数据并行的场景,使用它可以显著提升计算性能。编译器和开发者都可以通过自动向量化和手动优化,利用 SIMD 使程序运行更快。

SIMD新手入门

这里以对两个数组进行求和为例，如果使用C来进行编码的话，基本逻辑应该是这样：

int nums1[LENGTH] = {1, 2, 3, 4, 5, 6, 7, 8};
int nums2[LENGTH] = {1, 1, 1, 1, 1, 1, 1, 1};
int result[LENGTH] = {0};

for (int i=0; i<LENGTH; i++) {
    result[i] = nums1[i] + nums2[i];
}

现在，我们考虑使用SSE、AVX2分别对其进行处理。

./add_2arrays_sse128.c:用SSE添加两个数组：

#include <xmmintrin.h> // Need this in order to be able to use the SSE "intrinsics" (which provide access to instructions without writing assembly)
#include <stdio.h>

int main(int argc, char **argv) {
    float a[4], b[4], result[4]; // a and b: input, result: output
    __m128 va, vb, vresult; // these vars will "point" to SIMD registers

    // initialize arrays (just {0,1,2,3})
    for (int i = 0; i < 4; i++) {
        a[i] = (float)i;
        b[i] = (float)i;
    }
    
    // load arrays into SIMD registers
    va = _mm_loadu_ps(a); // https://software.intel.com/en-us/node/524260
    vb = _mm_loadu_ps(b); // same

    // add them together
    vresult = _mm_add_ps(va, vb);

    // store contents of SIMD register into memory
    _mm_storeu_ps(result, vresult); // https://software.intel.com/en-us/node/524262

    // print out result
    for (int i = 0; i < 4; i++) {
        printf("%f\n", result[i]);
    }
}

./add_2arrays_avx256.c:用AVX2添加两个数组:

在使用AVX2指令之前,我们必须对齐数组,否则会发生’段错误'。而且必须提供编译选项'-mavx2'。 AVX2支持在更先进和更新的CPU上,所以AVX2在gcc中默认是不启用的。

#include <immintrin.h> // Need this in order to be able to use the AVX "intrinsics" (which provide access to instructions without writing assembly)
#include <stdio.h>

int main(int argc, char **argv) {
    float a[8] __attribute__ ((aligned (32))); // Intel documentation states that we need 32-byte alignment to use _mm256_load_ps/_mm256_store_ps
    float b[8]  __attribute__ ((aligned (32))); // GCC's syntax makes this look harder than it is: https://gcc.gnu.org/onlinedocs/gcc-6.4.0/gcc/Common-Variable-Attributes.html#Common-Variable-Attributes
    float result[8]  __attribute__ ((aligned (32)));
    __m256 va, vb, vresult; // __m256 is a 256-bit datatype, so it can hold 8 32-bit floats

    // initialize arrays (just {0,1,2,3,4,5,6,7})
    for (int i = 0; i < 8; i++) {
        a[i] = (float)i;
        b[i] = (float)i;
    }

    // load arrays into SIMD registers
    va = _mm256_load_ps(a); // https://software.intel.com/en-us/node/694474
    vb = _mm256_load_ps(b); // same

    // add them together
    vresult = _mm256_add_ps(va, vb); // https://software.intel.com/en-us/node/523406

    // store contents of SIMD register into memory
    _mm256_store_ps(result, vresult); // https://software.intel.com/en-us/node/694665

    // print out result
    for (int i = 0; i < 8; i++) {
        printf("%f\n", result[i]);
    }
    
    return 0;
}

你可以gcc编译后尝试运行一下，但是为了更好地比较出这几种方式的性能差异，我们还是需要多执行几次，我们来写个benchmark测试。

./bench_add_2arrays.c，（不使用SIMD）将两个数组计算向量和1000w次：

#include <stdio.h>

#define TIMES 10000000
#define LENGTH 8

int main(int argc, char *argv[])
{
    for (int k=0; k<TIMES; k++) {
        int nums1[LENGTH] = {1, 2, 3, 4, 5, 6, 7, 8};
        int nums2[LENGTH] = {1, 1, 1, 1, 1, 1, 1, 1};
        int result[LENGTH] = {0};
    
        for (int i=0; i<LENGTH; i++) {
            result[i] = nums1[i] + nums2[i];
        }
    }

    return 0;
}

./bench_add_2arrays_avx2.c，使用AVX2将两个数组计算向量和1000w次：

#include <stdio.h>
#include <immintrin.h>

#define TIMES 10000000
#define LENGTH 8

#define i256 __m256i
#define avx256_set32 _mm256_set_epi32
#define avx256_add _mm256_add_epi32

int main(int argc, char *argv[])
{
    for (int k=0; k<TIMES; k++) {
        i256 first = avx256_set32(1, 2, 3, 4, 5, 6, 7, 8);
        i256 second = avx256_set32(1, 1, 1, 1, 1, 1, 1, 1);
        i256 result = avx256_add(first ,second);
        
        /*
        int *value = (int*)&result;
        for (int i=0; i<LENGTH; i++) {
            printf("%d ", value[i]);
        }
        printf("\n");
        */
    }
    return 0;
}

实际测试结果表明，AVX2的运行速度是普通方法的2倍：

$ time ./bench_add_2arrays

real    0m0.092s
user    0m0.092s
sys     0m0.000s
$ 
$ time ./bench_add_2arrays_avx2 

real    0m0.036s
user    0m0.036s
sys     0m0.000s

如果应用程序中类似操作很多，通过SIMD进行加速会获得明显的性能提升。

参考文献

Practical SIMD Programming, http://www.cs.uu.nl/docs/vakken/magr/2017-2018/files/SIMD%20Tutorial.pdf
C SIMD AVX2 Example, https://github.com/jean553/c-simd-avx2-example
Intel® Intrinsics Guide, https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#expand=91,555&techs=AVX2
Crunching Numbers with AVX and AVX2, https://www.codeproject.com/Articles/874396/Crunching-Numbers-with-AVX-and-AVX
This article describes the datatypes and naming conventions, and how different intrinsics functions works.
http://ftp.cvut.cz/kernel/people/geoff/cell/ps3-linux-docs/CellProgrammingTutorial/BasicsOfSIMDProgramming.html
This article is also very helpful.

Edit this page on GitHub

SIMD数据级并行处理

What’s SIMD ##

SIMD适用场景 ##

SIMD新手入门 ##

阅读更多 ##

参考文献 ##

What’s SIMD

SIMD适用场景

SIMD新手入门

阅读更多

参考文献