- Single Instruction multiple data, stores.
- SIMD allows a single instruction to be executed across the multiple data elements of the vector.
SIMD[dtype, size]dtype = Dtype.int/ Dtype.float64 ....- size = The size of the SIMD vector to be positive and a power of 2.
- For example,
Int64is aSIMDvector of size 1.
The SIMD type has Methods and operators.
Examples:
print(
4 * SIMD[DType.int8,4](1,2,3,4)
)[4, 8, 12, 16]
print(
SIMD[DType.int32,4](1,1,2,2).reduce_add()
)6
print(
SIMD[DType.bool,4](True,False,True,False).reduce_and()
)false
The multiply operator (*) works in an unifying manner, with both SIMD vectors of size 1 and size 32.
It has the
__add__dunder,__init__and many moremethods.
SIMD on the Stack
var x = SIMD[DType.float64,2](1.5, 2.5)
var y = x.reduce_add()
print(y)y is a Float64.
SIMD on the Heap
Let’s take for example a pointer to 10 * Int64. Instead of iterating each elements to add them together, It is also possible to do a fast addition with SIMD !
Preparing a DTypePointer
It is like a pointer, but is more specialized for SIMD.
We’ll use alloc, we’ll have to free.
def main():
alias amount_of_bytes = 256
var mem = DTypePointer[DType.uint8].alloc(amount_of_bytes)
for i in range(amount_of_bytes):
mem[i] = i #slower but good first step !Second step: SIMD vector
Let’s load the first 8 elements
var bunch_of_bytes = SIMD[type=DType.uint8, size=8].load(mem)
print(bunch_of_bytes)[0, 1, 2, 3, 4, 5, 6, 7]
The data is now in a SIMD vector.
Stride and width
width is the size of the SIMD vector, stride can be used with offset.
0◄─────┐
1 │
2◄─────┤
3 │ Stride: 2
4◄─────┤ Width: 4
5 │
6◄─────┤
7 │
│
▼
[0,2,4,6] SIMD[Width:4] A. The concept
var stride_like = 2
for i in range(0,8,stride_like):
print(i) 0, 2, 4, 6
B. The SIMD stride
var separated_by_2 = mem.simd_strided_load[width = 8](
stride = 2
)
print(separated_by_2)[0, 2, 4, 6, 8, 10, 12, 14]
gather
It gathers the values stored at various positions into a SIMD vector.
for i in range(16):
mem[i] = i*i
print(
mem.gather(
SIMD[DType.int64,4](1, 2, 5, 6)
)
)[1, 4, 25, 36]
Here is the gather method of DTypePointer in a visual form:
Memory: 0 10 20 30 40 50
│ │ │ │
└─────┬─┴──┴──┘
Gather 0 │ 3 4 5
▼
[0,30,40,50] scatter
It assign new values to various positions. The positions(int64) and values are provided in a SIMD vector.
mem.scatter(
offset = SIMD[DType.int64, 2](1,10),
val = SIMD[DType.uint8, 2](0, 0)
)
print(mem[1])
print(mem[10])0 0
Here is the scatter method of DTypePointer in a visual form:
Memory: 0 10 20 30 40 50
▲ ▲
│ │
│ │
┌──┴─────┘
│ 0 2 Indexes
│ 100 200 Values
scatter
> Memory: 100, 10, 200, 30, 40, 50 free
alloc gave us some RAM for the program, free gives it back:
mem.free()Very easy to use:
┌──────────────────────┐
│ RAM │
├──┐ │
└┼─┴───────────────────┘
│
│
▼
alloc
┌──┐
└──┘
our program has to give the small amount of ram back
because another program might need it !