SIMD Basic Using AVX and AVX2

What Is SIMD

SIMD代表单指令多数据(Single Instruction, Multiple Data)。它是一种计算机指令集架构,旨在同时处理多个数据元素。SIMD指令允许在单个指令中对多个数据元素执行相同的操作,从而实现并行处理和向量化计算。

在SIMD架构中,数据被组织为向量(也称为寄存器),其中每个向量元素都可以是整数、浮点数或其他数据类型。SIMD指令可以同时对向量中的多个元素执行相同的操作,例如加法、乘法、逻辑运算等。这种并行处理方式可以显著加速计算过程,尤其是在处理大规模数据集时。

现代处理器和图形处理器(GPU)通常支持SIMD指令集,如Intel的SSE(Streaming SIMD Extensions)和AVX(Advanced Vector Extensions),以及ARM的NEON(Advanced SIMD)。这些指令集提供了丰富的SIMD操作,使开发者能够利用硬件并行性来优化程序性能。

Fundamentals of AVX Programming

SIMD Data Type

Data Type Description
__m128 vector containing 4 floats
__m128d vector containing 2 doubles
__m128i vector containing integers
__m256 vector containing 8 floats
__m256d vector containing 4 doubles
__m256i vector containing integers, int8/32/64, uint8/32/64

Function Naming Conventions

_mm<bit_width>_<name>_<data_type>

  1. bit_width,函数返回的vector的宽度. 空代表128-bit vectors, 256代表256-bit vectors.
  2. name,intrinsic操作如add, set, shuffle
  3. data_type,函数主要参数的数据类型,详细如下
    • ps - float vectors (ps stands for packed single-precision)
    • pd - double vectors (pd stands for packed double-precision)
    • epi8/epi16/epi32/epi64 - signed integer vectors
    • epu8/epu16/epu32/epu64 - unsigned integer vectors
    • si128/si256 - unspecified 128-bit vector or 256-bit vector
    • m128/m128i/m128d/m256/m256i/m256d - identifies input vector types when they’re different than the type of the returned vector

一个简单例子

set_ps.cpp

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
#include <immintrin.h>
#include <iostream>
#include <iomanip>

template<typename T, typename E>
void print(T result) {
/* Display the elements of the result vector */
size_t cnt = sizeof(T) / sizeof(E);
E* ptr = reinterpret_cast<E*>(&result);
for (size_t i = 0; i < cnt; ++i) {
std::cout << std::fixed << std::setprecision(1) << " " << ptr[i];
}
std::cout << std::endl;
}

int main() {
__m256 floatNums = _mm256_set_ps(2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0);
std::cout << "Print float nums\n";
print<__m256, float>(floatNums);

__m256d doubleNums= _mm256_set_pd(2.0, 4.0, 6.0, 8.0);
std::cout << "Print double nums\n";
print<__m256d, double>(doubleNums);

__m256i intNums = _mm256_set_epi32(1, 3, 5, 7, 2, 4, 6, 8);
std::cout << "Print int32 nums\n";
print<__m256i, int32_t>(intNums);

__m256i bigIntNums = _mm256_set_epi64x(1, 3, 5, 7);
std::cout << "Print int64 nums\n";
print<__m256i, int64_t>(bigIntNums);

__m256i uBigIntNums = _mm256_set_epi64x(1, 3, 5, 7);
std::cout << "Print uint64 nums\n";
print<__m256i, uint64_t>(uBigIntNums);

return 0;
}

运行结果

1
2
3
4
5
6
7
8
9
10
11
12
# g++ -mavx2 set_ps.cpp 
# ./a.out
Print float nums
16.0 14.0 12.0 10.0 8.0 6.0 4.0 2.0
Print double nums
8.0 6.0 4.0 2.0
Print int32 nums
8 6 4 2 7 5 3 1
Print int64 nums
7 5 3 1
Print uint64 nums
7 5 3 1

Initialization Intrinsics

在对AVX向量进行操作之前,需要填充向量数据,有两种方法可以实现这一点:使用标量值初始化向量和使用从内存加载的数据初始化向量。

Initialization with Scalar Values

AVX提供了将一个或多个值组合成256位向量的内嵌函数,还有类似的内嵌函数可以初始化128位向量,但这些函数是由SSE提供的,而不是AVX。这些函数的名称唯一的区别是将_mm256_替换为_mm_。

Function Description
_mm256_setzero_ps/pd Returns a floating-point vector filled with zeros
_mm256_setzero_si256 Returns an integer vector whose bytes are set to zero
_mm256_set1_ps/pd Fill a vector with a floating-point value
_mm256_set1_epi8/epi16/epi32/epi64 Fill a vector with an integer
_mm256_set_ps/pd Initialize a vector with eight floats (ps) or four doubles (pd)
_mm256_set_epi8/epi16/epi32/epi64 Initialize a vector with integers
_mm256_set_m128/m128d/m128i Initialize a 256-bit vector with two 128-bit vectors
_mm256_setr_ps/pd Initialize a vector with eight floats (ps) or four doubles (pd) in reverse order
_mm256_setr_epi8/epi16/epi32/epi64 Initialize a vector with i

Example

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
#include <immintrin.h>
#include <stdio.h>

int main() {
int int_array[8] = {100, 200, 300, 400, 500, 600, 700, 800};

/* Initialize the mask vector */
__m256i mask = _mm256_setr_epi32(-20, -72, -48, -9, -100, 3, 5, 8);

/* Selectively load data into the vector */
__m256i result = _mm256_maskload_epi32(int_array, mask);

/* Display the elements of the result vector */
int* res = (int*)&result;
printf("%d %d %d %d %d %d %d %d\n",
res[0], res[1], res[2], res[3], res[4], res[5], res[6], res[7]);

return 0;
}
1
2
3
$ gcc -O3 -mavx2 test.cpp 
$ ./a.out
100 200 300 400 500 0 0 0

这里用_mm256_setr_epi32而不是_mm256_set_epi32主要是为了更自然的展示set的效果(从0到7),负数最高位的bit是1,所以_mm256_maskload_epi32选择性到将前5个数据load到result中了。

Loading Data from Memory

AVX/AVX2的常见用法是将数据从内存加载到向量中,对向量进行处理,然后将结果存回内存。

Data Type Description
_mm256_load_ps/pd Loads a floating-point vector from an aligned memory address
_mm256_load_si256 Loads an integer vector from an aligned memory address
_mm256_loadu_ps/pd Loads a floating-point vector from an unaligned memory address
_mm256_loadu_si256 Loads an integer vector from an unaligned memory address
_mm_maskload_ps/pd / _mm256_maskload_ps/pd Load portions of a 128-bit/256-bit floating-point vector according to a mask
_mm_maskload_epi32/64 / _mm256_maskload_epi32/64 Load portions of a 128-bit/256-bit integer vector according to a mask

Arithmetic Intrinsics

Addition and Subtraction

Data Type Description
_mm256_add_ps/pd Add two floating-point vectors
_mm256_sub_ps/pd Subtract two floating-point vectors
_mm256_add_epi8/16/32/64 Add two integer vectors
_mm236_sub_epi8/16/32/64 Subtract two integer vectors
_mm256_adds_epi8/16/_epu8/16 Add two integer vectors with saturation
_mm256_subs_epi8/16/epu8/16 Subtract two integer vectors with saturation
_mm256_hadd_ps/pd Add two floating-point vectors horizontally
_mm256_hsub_ps/pd Subtract two floating-point vectors horizontally
_mm256_hadd_epi16/32 Add two integer vectors horizontally
_mm256_hsub_epi16/32 Subtract two integer vectors horizontally
_mm256_hadds_epi16 Add two vectors containing shorts horizontally with saturation
_mm256_hsubs_epi16 Subtract two vectors containing shorts horizontally with saturation
_mm256_addsub_ps/pd Add and subtract two floating-point vectors

Multiplication and Division

Data Type Description
_mm256_mul_ps/pd Multiply two floating-point vectors
_mm256_mul_epi32/epu32 Multiply the lowest four elements of vectors containing 32-bit integers
_mm256_mullo_epi16/32 Multiply integers and store low halves
_mm256_mulhi_epi16/epu16 Multiply integers and store high halves
_mm256_mulhrs_epi16 Multiply 16-bit elements to form 32-bit elements
_mm256_div_ps/pd Divide two floating-point vectors

Fused Multiply and Add (FMA)

两个N位数相乘的结果可以占据2N位。因此,当您将两个浮点数a和b相乘时,实际上得到的结果是round(a * b),其中round(x)返回最接近x的浮点数值。随着进一步的操作,精度损失会增加。AVX2提供了将乘法和加法融合在一起的指令。也就是说,它们返回的结果是round(a * b + c),而不是round(round(a * b) + c)。因此,这些指令比分别执行乘法和加法提供了更高的速度和精度。

Data Type Description
_mm_fmadd_ps/pd / _mm256_fmadd_ps/pd Multiply two vectors and add the product to a third (res = a * b + c)
_mm_fmsub_ps/pd / _mm256_fmsub_ps/pd Multiply two vectors and subtract a vector from the product (res = a * b - c)
_mm_fmadd_ss/sd Multiply and add the lowest element in the vectors (res[0] = a[0] * b[0] + c[0])
_mm_fmsub_ss/sd Multiply and subtract the lowest element in the vectors (res[0] = a[0] * b[0] - c[0])
_mm_fnmadd_ps/pd / _mm256_fnmadd_ps/pd Multiply two vectors and add the negated product to a third (res = -(a * b) + c)
_mm_fnmsub_ps/pd / _mm256_fnmsub_ps/pd Multiply two vectors and add the negated product to a third (res = -(a * b) - c)
_mm_fnmadd_ss/sd Multiply the two lowest elements and add the negated product to the lowest element of the third vector (res[0] = -(a[0] * b[0]) + c[0])
_mm_fnmsub_ss/sd Multiply the lowest elements and subtract the lowest element of the third vector from the negated product (res[0] = -(a[0] * b[0]) - c[0])
_mm_fmaddsub_ps/pd / _mm256_fmaddsub_ps/pd Multiply two vectors and alternately add and subtract from the product (res = a * b - c)
_mm_fmsubadd_ps/pd / _mmf256_fmsubadd_ps/pd Multiply two vectors and alternately subtract and add from the product (res = a * b - c)

Example

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
#include <immintrin.h>
#include <stdio.h>

int main() {
__m256d veca = _mm256_setr_pd(6.0, 6.0, 6.0, 6.0);
__m256d vecb = _mm256_setr_pd(2.0, 2.0, 2.0, 2.0);
__m256d vecc = _mm256_setr_pd(7.0, 7.0, 7.0, 7.0);

/* Alternately subtract and add the third vector
from the product of the first and second vectors */
__m256d result = _mm256_fmaddsub_pd(veca, vecb, vecc);

/* Display the elements of the result vector */
double* res = (double*)&result;
printf("%lf %lf %lf %lf\n", res[0], res[1], res[2], res[3]);

return 0;
}
1
2
3
$ g++ -mavx2 -mfma test.cpp 
$ ./a.out
5.000000 19.000000 5.000000 19.000000

Permuting and Shuffling

许多应用程序需要重新排列向量元素,以确保操作正确执行。AVX/AVX2提供了许多内嵌函数来实现这个目的,其中两个主要类别是_permute_函数和_shuffle_函数。

Permuting

Data Type Description
_mm_permute_ps/pd / _mm256_permute_ps/pd Select elements from the input vector based on an 8-bitcontrol value
_mm256_permute4x64_pd / _mm256_permute4x64_epi64 Select 64-bit elements from the input vector based on an 8-bit control value
_mm256_permute2f128_ps/pd Select 128-bit chunks from two input vectors based on an 8-bit control value
_mm256_permute2f128_si256 Select 128-bit chunks from two input vectors based on an 8-bit control value

_mm_permutevar_ps/pd
_mm256_permutevar_ps/pd
Select elements from the input vector based on bits in an integer vector

_mm256_permutevar8x32_ps
_mm256_permutevar8x32_epi32
Select 32-bit elements (floats and ints) using indices in an integer vector

_permute_内嵌函数接受两个参数:一个输入向量和一个8位控制值。控制值的位确定了要插入输出的输入向量元素。对于_mm256_permute_ps,每对控制位通过选择输入向量中的上半部分或下半部分元素之一,确定一个上半部分和下半部分的输出元素,如下图所示:

permute

如图所示,输入向量的值可能在输出中重复多次。其他输入值可能根本不被选中。在_mm256_permute_pd中,控制值的低四位在相邻的双精度数对之间进行选择。_mm256_permute4x4_pd类似,但使用所有的控制位来选择放置在输出中的64位元素。在_permute2f128_内嵌函数中,控制值选择两个输入向量的128位块,而不是从一个输入向量中选择元素。_permutevar_内嵌函数执行与_permute_内嵌函数相同的操作。但是,它们不使用8位控制值来选择元素,而是依赖于与输入向量大小相同的整数向量。例如,_mm256_permute_ps的输入向量是_mm256,因此整数向量是_mm256i。整数向量的高位以与_permute_内嵌函数的8位控制值的位相同的方式进行选择。

Shuffling

与_permute_内嵌函数类似,_shuffle_内嵌函数从一个或两个输入向量中选择元素,并将它们放置在输出向量中。

Data Type Description
_mm256_shuffle_ps/pd Select floating-point elements according to an 8-bit value
_mm256_shuffle_epi8 / _mm256_shuffle_epi32 Select integer elements according to an 8-bit value
_mm256_shufflelo_epi16 / _mm256_shufflehi_epi16 Select 128-bit chunks from two input vectors based on an 8-bit control value

所有_shuffle_内嵌函数都操作256位向量。在每种情况下,最后一个参数是一个8位值,确定应该将哪些输入元素放置在输出向量中。对于_mm256_shuffle_ps,只使用控制值的高四位。如果输入向量包含整数或浮点数,所有控制位都会被使用。对于_mm256_shuffle_ps,前两对位选择第一个向量的元素,后两对位选择第二个向量的元素。

permute

为了对16位值进行洗牌,AVX2提供了_mm256_shufflelo_epi16和_mm256_shufflehi_epi16。与_mm256_shuffle_ps类似,控制值被分为4对bits,从8个元素中进行选择。但对于_mm256_shufflelo_epi16,这8个元素来自8个低16位值。对于_mm256_shufflehi_epi16,这8个元素来自8个高16位值。

Example

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
#include <immintrin.h>
#include <stdio.h>

int main() {

__m256d vec1 = _mm256_setr_pd(4.0, 5.0, 13.0, 6.0);
__m256d vec2 = _mm256_setr_pd(9.0, 3.0, 6.0, 7.0);
__m256d neg = _mm256_setr_pd(1.0, -1.0, 1.0, -1.0);

/* Step 1: Multiply vec1 and vec2 */
__m256d vec3 = _mm256_mul_pd(vec1, vec2);

/* Step 2: Switch the real and imaginary elements of vec2 */
vec2 = _mm256_permute_pd(vec2, 0x5);
/* Step 3: Negate the imaginary elements of vec2 */
vec2 = _mm256_mul_pd(vec2, neg);
/* Step 4: Multiply vec1 and the modified vec2 */
__m256d vec4 = _mm256_mul_pd(vec1, vec2);
/* Horizontally subtract the elements in vec3 and vec4 */
vec1 = _mm256_hsub_pd(vec3, vec4);
/* Display the elements of the result vector */
double* res = (double*)&vec1;
printf("%lf %lf %lf %lf\n", res[0], res[1], res[2], res[3]);

return 0;
}
1
2
3
4
$ g++ -mavx2 -mfma test.cpp 
[root@VM-76-184-tencentos~/native/test]
$ ./a.out
21.000000 57.000000 36.000000 127.000000

References