SIMD Basic Using AVX and AVX2

duanmeng

2021-08-22

technology

C++, SIMD

What Is SIMD

SIMD代表单指令多数据（Single Instruction, Multiple Data）。它是一种计算机指令集架构，旨在同时处理多个数据元素。SIMD指令允许在单个指令中对多个数据元素执行相同的操作，从而实现并行处理和向量化计算。

在SIMD架构中，数据被组织为向量（也称为寄存器），其中每个向量元素都可以是整数、浮点数或其他数据类型。SIMD指令可以同时对向量中的多个元素执行相同的操作，例如加法、乘法、逻辑运算等。这种并行处理方式可以显著加速计算过程，尤其是在处理大规模数据集时。

现代处理器和图形处理器（GPU）通常支持SIMD指令集，如Intel的SSE（Streaming SIMD Extensions）和AVX（Advanced Vector Extensions），以及ARM的NEON（Advanced SIMD）。这些指令集提供了丰富的SIMD操作，使开发者能够利用硬件并行性来优化程序性能。

Fundamentals of AVX Programming

SIMD Data Type

Data Type	Description
__m128	vector containing 4 floats
__m128d	vector containing 2 doubles
__m128i	vector containing integers
__m256	vector containing 8 floats
__m256d	vector containing 4 doubles
__m256i	vector containing integers, int8/32/64, uint8/32/64

Function Naming Conventions

_mm<bit_width>_<name>_<data_type>

bit_width，函数返回的vector的宽度. 空代表128-bit vectors, 256代表256-bit vectors.
name，intrinsic操作如add, set, shuffle等
data_type，函数主要参数的数据类型，详细如下
- ps - float vectors (ps stands for packed single-precision)
- pd - double vectors (pd stands for packed double-precision)
- epi8/epi16/epi32/epi64 - signed integer vectors
- epu8/epu16/epu32/epu64 - unsigned integer vectors
- si128/si256 - unspecified 128-bit vector or 256-bit vector
- m128/m128i/m128d/m256/m256i/m256d - identifies input vector types when they’re different than the type of the returned vector

一个简单例子

set_ps.cpp

#include <immintrin.h>
#include <iostream>
#include <iomanip>

template<typename T, typename E>
void print(T result) {
  /* Display the elements of the result vector */
  size_t cnt = sizeof(T) / sizeof(E);
  E* ptr = reinterpret_cast<E*>(&result);
  for (size_t i = 0; i < cnt; ++i) {
      std::cout << std::fixed << std::setprecision(1) << " " << ptr[i];
  }
  std::cout << std::endl;
}

int main() {
  __m256 floatNums = _mm256_set_ps(2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0);
  std::cout << "Print float nums\n";
  print<__m256, float>(floatNums);

  __m256d doubleNums= _mm256_set_pd(2.0, 4.0, 6.0, 8.0);
  std::cout << "Print double nums\n";
  print<__m256d, double>(doubleNums);

  __m256i intNums = _mm256_set_epi32(1, 3, 5, 7, 2, 4, 6, 8);
  std::cout << "Print int32 nums\n";
  print<__m256i, int32_t>(intNums);

  __m256i bigIntNums = _mm256_set_epi64x(1, 3, 5, 7);
  std::cout << "Print int64 nums\n";
  print<__m256i, int64_t>(bigIntNums);
  
  __m256i uBigIntNums = _mm256_set_epi64x(1, 3, 5, 7);
  std::cout << "Print uint64 nums\n";
  print<__m256i, uint64_t>(uBigIntNums);

  return 0;
}

运行结果

# g++ -mavx2 set_ps.cpp 
# ./a.out        
Print float nums
 16.0 14.0 12.0 10.0 8.0 6.0 4.0 2.0
Print double nums
 8.0 6.0 4.0 2.0
Print int32 nums
 8 6 4 2 7 5 3 1
Print int64 nums
 7 5 3 1
Print uint64 nums
 7 5 3 1

Initialization Intrinsics

在对AVX向量进行操作之前，需要填充向量数据，有两种方法可以实现这一点：使用标量值初始化向量和使用从内存加载的数据初始化向量。

Initialization with Scalar Values

AVX提供了将一个或多个值组合成256位向量的内嵌函数，还有类似的内嵌函数可以初始化128位向量，但这些函数是由SSE提供的，而不是AVX。这些函数的名称唯一的区别是将_mm256_替换为_mm_。

Function	Description
_mm256_setzero_ps/pd	Returns a floating-point vector filled with zeros
_mm256_setzero_si256	Returns an integer vector whose bytes are set to zero
_mm256_set1_ps/pd	Fill a vector with a floating-point value
_mm256_set1_epi8/epi16/epi32/epi64	Fill a vector with an integer
_mm256_set_ps/pd	Initialize a vector with eight floats (ps) or four doubles (pd)
_mm256_set_epi8/epi16/epi32/epi64	Initialize a vector with integers
_mm256_set_m128/m128d/m128i	Initialize a 256-bit vector with two 128-bit vectors
_mm256_setr_ps/pd	Initialize a vector with eight floats (ps) or four doubles (pd) in reverse order
_mm256_setr_epi8/epi16/epi32/epi64	Initialize a vector with i

Example

#include <immintrin.h>
#include <stdio.h>

int main() {
  int int_array[8] = {100, 200, 300, 400, 500, 600, 700, 800};
  
  /* Initialize the mask vector */
  __m256i mask = _mm256_setr_epi32(-20, -72, -48, -9, -100, 3, 5, 8);

  /* Selectively load data into the vector */
  __m256i result = _mm256_maskload_epi32(int_array, mask);
  
  /* Display the elements of the result vector */
  int* res = (int*)&result;
  printf("%d %d %d %d %d %d %d %d\n",
  	res[0], res[1], res[2], res[3], res[4], res[5], res[6], res[7]);
  
  return 0;
}

1
2
3

$ gcc -O3 -mavx2 test.cpp 
$ ./a.out 
100 200 300 400 500 0 0 0

这里用_mm256_setr_epi32而不是_mm256_set_epi32主要是为了更自然的展示set的效果（从0到7），负数最高位的bit是1，所以_mm256_maskload_epi32选择性到将前5个数据load到result中了。

Loading Data from Memory

AVX/AVX2的常见用法是将数据从内存加载到向量中，对向量进行处理，然后将结果存回内存。

Data Type	Description
_mm256_load_ps/pd	Loads a floating-point vector from an aligned memory address
_mm256_load_si256	Loads an integer vector from an aligned memory address
_mm256_loadu_ps/pd	Loads a floating-point vector from an unaligned memory address
_mm256_loadu_si256	Loads an integer vector from an unaligned memory address
_mm_maskload_ps/pd / _mm256_maskload_ps/pd	Load portions of a 128-bit/256-bit floating-point vector according to a mask
_mm_maskload_epi32/64 / _mm256_maskload_epi32/64	Load portions of a 128-bit/256-bit integer vector according to a mask

Arithmetic Intrinsics

Addition and Subtraction

Data Type	Description
_mm256_add_ps/pd	Add two floating-point vectors
_mm256_sub_ps/pd	Subtract two floating-point vectors
_mm256_add_epi8/16/32/64	Add two integer vectors
_mm236_sub_epi8/16/32/64	Subtract two integer vectors
_mm256_adds_epi8/16/_epu8/16 Add two integer vectors with saturation
_mm256_subs_epi8/16/epu8/16 Subtract two integer vectors with saturation
_mm256_hadd_ps/pd	Add two floating-point vectors horizontally
_mm256_hsub_ps/pd	Subtract two floating-point vectors horizontally
_mm256_hadd_epi16/32	Add two integer vectors horizontally
_mm256_hsub_epi16/32	Subtract two integer vectors horizontally
_mm256_hadds_epi16	Add two vectors containing shorts horizontally with saturation
_mm256_hsubs_epi16	Subtract two vectors containing shorts horizontally with saturation
_mm256_addsub_ps/pd	Add and subtract two floating-point vectors

Multiplication and Division

Data Type	Description
_mm256_mul_ps/pd	Multiply two floating-point vectors
_mm256_mul_epi32/epu32	Multiply the lowest four elements of vectors containing 32-bit integers
_mm256_mullo_epi16/32	Multiply integers and store low halves
_mm256_mulhi_epi16/epu16	Multiply integers and store high halves
_mm256_mulhrs_epi16	Multiply 16-bit elements to form 32-bit elements
_mm256_div_ps/pd	Divide two floating-point vectors

Fused Multiply and Add (FMA)

两个N位数相乘的结果可以占据2N位。因此，当您将两个浮点数a和b相乘时，实际上得到的结果是round(a * b)，其中round(x)返回最接近x的浮点数值。随着进一步的操作，精度损失会增加。AVX2提供了将乘法和加法融合在一起的指令。也就是说，它们返回的结果是round(a * b + c)，而不是round(round(a * b) + c)。因此，这些指令比分别执行乘法和加法提供了更高的速度和精度。

Data Type	Description
_mm_fmadd_ps/pd / _mm256_fmadd_ps/pd	Multiply two vectors and add the product to a third (res = a * b + c)
_mm_fmsub_ps/pd / _mm256_fmsub_ps/pd	Multiply two vectors and subtract a vector from the product (res = a * b - c)
_mm_fmadd_ss/sd	Multiply and add the lowest element in the vectors (res[0] = a[0] * b[0] + c[0])
_mm_fmsub_ss/sd	Multiply and subtract the lowest element in the vectors (res[0] = a[0] * b[0] - c[0])
_mm_fnmadd_ps/pd / _mm256_fnmadd_ps/pd	Multiply two vectors and add the negated product to a third (res = -(a * b) + c)
_mm_fnmsub_ps/pd / _mm256_fnmsub_ps/pd	Multiply two vectors and add the negated product to a third (res = -(a * b) - c)
_mm_fnmadd_ss/sd	Multiply the two lowest elements and add the negated product to the lowest element of the third vector (res[0] = -(a[0] * b[0]) + c[0])
_mm_fnmsub_ss/sd	Multiply the lowest elements and subtract the lowest element of the third vector from the negated product (res[0] = -(a[0] * b[0]) - c[0])
_mm_fmaddsub_ps/pd / _mm256_fmaddsub_ps/pd	Multiply two vectors and alternately add and subtract from the product (res = a * b - c)
_mm_fmsubadd_ps/pd / _mmf256_fmsubadd_ps/pd	Multiply two vectors and alternately subtract and add from the product (res = a * b - c)

Example

#include <immintrin.h>
#include <stdio.h>

int main() {
  __m256d veca = _mm256_setr_pd(6.0, 6.0, 6.0, 6.0);
  __m256d vecb = _mm256_setr_pd(2.0, 2.0, 2.0, 2.0);
  __m256d vecc = _mm256_setr_pd(7.0, 7.0, 7.0, 7.0);
  
  /* Alternately subtract and add the third vector
     from the product of the first and second vectors */
  __m256d result = _mm256_fmaddsub_pd(veca, vecb, vecc);
  
  /* Display the elements of the result vector */
  double* res = (double*)&result;
  printf("%lf %lf %lf %lf\n", res[0], res[1], res[2], res[3]);
  
  return 0;
}

1
2
3

$ g++ -mavx2 -mfma test.cpp 
$ ./a.out 
5.000000 19.000000 5.000000 19.000000

Permuting and Shuffling

许多应用程序需要重新排列向量元素，以确保操作正确执行。AVX/AVX2提供了许多内嵌函数来实现这个目的，其中两个主要类别是_permute_函数和_shuffle_函数。

Permuting

Data Type	Description
_mm_permute_ps/pd / _mm256_permute_ps/pd	Select elements from the input vector based on an 8-bitcontrol value
_mm256_permute4x64_pd / _mm256_permute4x64_epi64	Select 64-bit elements from the input vector based on an 8-bit control value
_mm256_permute2f128_ps/pd	Select 128-bit chunks from two input vectors based on an 8-bit control value
_mm256_permute2f128_si256	Select 128-bit chunks from two input vectors based on an 8-bit control value

_mm_permutevar_ps/pd _mm256_permutevar_ps/pd	Select elements from the input vector based on bits in an integer vector

_mm256_permutevar8x32_ps _mm256_permutevar8x32_epi32	Select 32-bit elements (floats and ints) using indices in an integer vector

_permute_内嵌函数接受两个参数：一个输入向量和一个8位控制值。控制值的位确定了要插入输出的输入向量元素。对于_mm256_permute_ps，每对控制位通过选择输入向量中的上半部分或下半部分元素之一，确定一个上半部分和下半部分的输出元素，如下图所示：

如图所示，输入向量的值可能在输出中重复多次。其他输入值可能根本不被选中。在_mm256_permute_pd中，控制值的低四位在相邻的双精度数对之间进行选择。_mm256_permute4x4_pd类似，但使用所有的控制位来选择放置在输出中的64位元素。在_permute2f128_内嵌函数中，控制值选择两个输入向量的128位块，而不是从一个输入向量中选择元素。_permutevar_内嵌函数执行与_permute_内嵌函数相同的操作。但是，它们不使用8位控制值来选择元素，而是依赖于与输入向量大小相同的整数向量。例如，_mm256_permute_ps的输入向量是_mm256，因此整数向量是_mm256i。整数向量的高位以与_permute_内嵌函数的8位控制值的位相同的方式进行选择。

Shuffling

与_permute_内嵌函数类似，_shuffle_内嵌函数从一个或两个输入向量中选择元素，并将它们放置在输出向量中。

Data Type	Description
_mm256_shuffle_ps/pd	Select floating-point elements according to an 8-bit value
_mm256_shuffle_epi8 / _mm256_shuffle_epi32	Select integer elements according to an 8-bit value
_mm256_shufflelo_epi16 / _mm256_shufflehi_epi16	Select 128-bit chunks from two input vectors based on an 8-bit control value

所有_shuffle_内嵌函数都操作256位向量。在每种情况下，最后一个参数是一个8位值，确定应该将哪些输入元素放置在输出向量中。对于_mm256_shuffle_ps，只使用控制值的高四位。如果输入向量包含整数或浮点数，所有控制位都会被使用。对于_mm256_shuffle_ps，前两对位选择第一个向量的元素，后两对位选择第二个向量的元素。

为了对16位值进行洗牌，AVX2提供了_mm256_shufflelo_epi16和_mm256_shufflehi_epi16。与_mm256_shuffle_ps类似，控制值被分为4对bits，从8个元素中进行选择。但对于_mm256_shufflelo_epi16，这8个元素来自8个低16位值。对于_mm256_shufflehi_epi16，这8个元素来自8个高16位值。

Example

#include <immintrin.h>
#include <stdio.h>

int main() {

  __m256d vec1 = _mm256_setr_pd(4.0, 5.0, 13.0, 6.0);
  __m256d vec2 = _mm256_setr_pd(9.0, 3.0, 6.0, 7.0);
  __m256d neg = _mm256_setr_pd(1.0, -1.0, 1.0, -1.0);
  
  /* Step 1: Multiply vec1 and vec2 */
  __m256d vec3 = _mm256_mul_pd(vec1, vec2);

  /* Step 2: Switch the real and imaginary elements of vec2 */
  vec2 = _mm256_permute_pd(vec2, 0x5);
  /* Step 3: Negate the imaginary elements of vec2 */
  vec2 = _mm256_mul_pd(vec2, neg);  
  /* Step 4: Multiply vec1 and the modified vec2 */
  __m256d vec4 = _mm256_mul_pd(vec1, vec2);
  /* Horizontally subtract the elements in vec3 and vec4 */
  vec1 = _mm256_hsub_pd(vec3, vec4);
  /* Display the elements of the result vector */
  double* res = (double*)&vec1;
  printf("%lf %lf %lf %lf\n", res[0], res[1], res[2], res[3]);
  
  return 0;
}

$ g++ -mavx2 -mfma test.cpp 
[root@VM-76-184-tencentos~/native/test]
$ ./a.out 
21.000000 57.000000 36.000000 127.000000

References

https://www.codeproject.com/Articles/874396/Crunching-Numbers-with-AVX-and-AVX