MatMul


CPU

Physical cores: 20
Physical + Logical cores: 40

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          40
On-line CPU(s) list:             0-39
Thread(s) per core:              2
Core(s) per socket:              10
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz
Stepping:                        7
CPU MHz:                         2400.000
CPU max MHz:                     3200.0000
CPU min MHz:                     1000.0000
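The physical and logical core counts quoted above follow from three fields of the lscpu output: sockets x cores per socket gives physical cores, and multiplying by threads per core gives logical CPUs. Below is a minimal sketch of that arithmetic; `core_counts` and `SAMPLE` are hypothetical names introduced for illustration, not part of the benchmark code.

```python
def core_counts(lscpu_text):
    """Return (physical_cores, logical_cpus) from lscpu-style 'Key: value' lines."""
    fields = {}
    for line in lscpu_text.splitlines():
        # Split on the first colon only, so values such as "32-bit, 64-bit" survive.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()

    sockets = int(fields["Socket(s)"])
    cores_per_socket = int(fields["Core(s) per socket"])
    threads_per_core = int(fields["Thread(s) per core"])

    physical = sockets * cores_per_socket
    logical = physical * threads_per_core
    return physical, logical


# The three relevant lines copied from the lscpu output above.
SAMPLE = """\
Thread(s) per core:              2
Core(s) per socket:              10
Socket(s):                       2
"""

if __name__ == "__main__":
    physical, logical = core_counts(SAMPLE)
    print("physical cores:", physical)
    print("logical CPUs:", logical)
```

For this machine the function returns 20 physical cores and 40 logical CPUs, matching the summary at the top of the section.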

Mojo uses only the 20 physical cores, whereas NumPy uses all 40 logical cores.

MatMul Benchmark Results

Metric	Time (s), 512x256 @ 256x512	Speedup over Python	Speedup over NumPy	Time (s), 2048x1024 @ 1024x2048	Speedup over NumPy
Python (Lists)	77.0472	N/A	N/A	N/A	N/A
NumPy	0.0010	74844.5881	N/A	0.0195	N/A
Mojo (For Loops)	0.6402	120.2806	0.0016	40.9093	0.0005
Mojo (SIMD)	0.0123	6243.6854	0.0834	1.0428	0.0187
Mojo (SIMD + Parallel)	0.0017	46433.6542	0.6204	0.1260	0.1547
Mojo (SIMD + Parallel + DTypePointer)	0.0006	121557.0351	1.6241	0.0547	0.3566

Python (Using Lists) 😀

512x256 @ 256x512

Matmul : 512x256 @ 256x512
Iterations :  2
Total GFLOPS :  0.134217728
Total time in sec :  77.04722635447979
GFLOP/sec :  0.001742018945399663
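For reference, a list-based matmul of this kind can be sketched as a plain triple loop. This is an illustrative version, not necessarily the exact code that produced the numbers above; `matmul_lists`, `matmul_gflop`, and `benchmark` are hypothetical names. The operation count assumes the usual 2 x M x K x N convention (one multiply and one add per inner step), which reproduces the 0.134217728 figure in the log.

```python
import random
import time


def matmul_lists(a, b):
    """Multiply an M x K matrix by a K x N matrix, both stored as lists of lists."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            c[i][j] = acc
    return c


def matmul_gflop(m, k, n):
    """Floating-point operations for one matmul, in units of 1e9 operations."""
    return 2 * m * k * n / 1e9


def benchmark(m, k, n):
    """Time a single matmul on random inputs; return (seconds, GFLOP/sec)."""
    a = [[random.random() for _ in range(k)] for _ in range(m)]
    b = [[random.random() for _ in range(n)] for _ in range(k)]
    start = time.perf_counter()
    matmul_lists(a, b)
    elapsed = time.perf_counter() - start
    return elapsed, matmul_gflop(m, k, n) / elapsed


if __name__ == "__main__":
    # A small size so the sketch finishes quickly; the notes use 512x256 @ 256x512.
    elapsed, rate = benchmark(64, 32, 64)
    print("time in sec:", elapsed)
    print("GFLOP/sec:", rate)
```

Every inner-loop step goes through the Python interpreter, which is why this baseline is orders of magnitude slower than the compiled alternatives measured below.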

2048x1024 @ 1024x2048: not measured. My system got tired! The Mojo reports below therefore show "speedup over python: 0.0" for this size.

All operations from here on use FP32.

NumPy

512x256 @ 256x512

Matmul : 512x256 @ 256x512
Iterations : 10000
Total GFLOPS :  0.134217728
Total time in sec :  0.0010287985601928084
GFLOP/sec :  130.46064914286632
speedup over python: 74890

2048x1024 @ 1024x2048

Matmul : 2048x1024 @ 1024x2048
Iterations : 10000
Total GFLOPS :  8.589934592
Total time in sec :  0.019497849141806363
GFLOP/sec :  440.55805999554434

Matmul : 2048x1024 @ 1024x2048
Iterations :  200
Total GFLOPS :  8.589934592
Total time in sec :  0.018091672335285695
GFLOP/sec :  474.80047354419173

Mojo

Just for loops

512x256 @ 256x512

---------------------
Benchmark Report (s)
---------------------
Mean: 0.64016959076470581
Total: 10.882883043
Iters: 17
Warmup Mean: 0.69197315560000006
Warmup Total: 3.4598657780000002
Warmup Iters: 5
Fastest Mean: 0.64016959076470592
Slowest Mean: 0.64016959076470592

Matmul :  512x256 @ 256x512
Total GFLOPS:  0.13421772800000001
GFLOP/sec:  0.20965964321996622
speedup over python:  120.35439899987139
speedup over numpy:  0.0016070718994381961
speedup in numpy:  622.24969545518297
worst speedup in numpy:  622.24969545518309
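The derived fields in these reports can be reproduced from the raw timings. The sketch below assumes that Mean is Total / Iters, that GFLOP/sec is the operation count divided by the mean, that "speedup over X" is X's time divided by the Mojo mean, and that "speedup in numpy" is the inverse ratio (how many times faster NumPy is). Under those assumptions the numbers in the report above come out as logged. `derive_metrics` is a hypothetical helper name, not part of the benchmark code.

```python
def derive_metrics(total_s, iters, gflop, python_s, numpy_s):
    """Recompute the derived report fields from raw timings (all times in seconds)."""
    mean_s = total_s / iters
    return {
        "mean_s": mean_s,
        "gflop_per_sec": gflop / mean_s,
        # Values above 1 mean Mojo is faster than the baseline.
        "speedup_over_python": python_s / mean_s,
        "speedup_over_numpy": numpy_s / mean_s,
        # Inverse ratio: how many times faster NumPy is than this Mojo version.
        "speedup_in_numpy": mean_s / numpy_s,
    }


# Inputs copied from the logs above for 512x256 @ 256x512 with plain for loops.
report = derive_metrics(
    total_s=10.882883043,
    iters=17,
    gflop=0.134217728,
    python_s=77.04722635447979,
    numpy_s=0.0010287985601928084,
)

if __name__ == "__main__":
    for name, value in report.items():
        print(name, ":", value)
```

Read this way, a "speedup over numpy" below 1 means the Mojo version is slower than NumPy, and "speedup in numpy" states by what factor.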

2048x1024 @ 1024x2048

---------------------
Benchmark Report (s)
---------------------
Mean: 40.9093318165
Total: 245.455990899
Iters: 6
Warmup Mean: 40.8972278066
Warmup Total: 204.486139033
Warmup Iters: 5
Fastest Mean: 40.9093318165
Slowest Mean: 40.9093318165

Matmul :  2048x1024 @ 1024x2048

GFLOP/sec:  0.2099749424050826
speedup over python:  0.0
speedup over numpy:  0.00047661128344174706
speedup in numpy:  2098.1458785002164
worst speedup in numpy:  2098.1458785002164

SIMD Vectorization

512x256 @ 256x512

---------------------
Benchmark Report (s)
---------------------
Mean: 0.012332463752118645
Total: 11.641845782000001
Iters: 944
Warmup Mean: 0.0146965052
Warmup Total: 0.073482526000000006
Warmup Iters: 5
Fastest Mean: 0.012332463752118644
Slowest Mean: 0.012332463752118644

Matmul :  512x256 @ 256x512
Total GFLOPS:  0.13421772800000001
GFLOP/sec:  10.883285829803651
speedup over python:  6247.5128979190004
speedup over numpy:  0.083421981274104037
speedup in numpy:  11.987248261513317
worst speedup in numpy:  11.987248261513315

2048x1024 @ 1024x2048

---------------------
Benchmark Report (s)
---------------------
Mean: 1.04279860254
Total: 52.139930127
Iters: 50
Warmup Mean: 1.1388049650000001
Warmup Total: 5.6940248249999996
Warmup Iters: 5
Fastest Mean: 1.04279860254
Slowest Mean: 1.04279860254

Matmul :  2048x1024 @ 1024x2048
Total GFLOPS:  8.5899345920000005
GFLOP/sec:  8.2373859833308565
speedup over python:  0.0
speedup over numpy:  0.018697617252568632
speedup in numpy:  53.482750582169636
worst speedup in numpy:  53.482750582169636

SIMD Vectorization and Parallelization

512x256 @ 256x512

---------------------
Benchmark Report (s)
---------------------
Mean: 0.0016582828104565537
Total: 11.259740282999999
Iters: 6790
Warmup Mean: 0.0045278230000000003
Warmup Total: 0.022639115000000001
Warmup Iters: 5
Fastest Mean: 0.0016582828104565537
Slowest Mean: 0.0016582828104565537

Matmul :  512x256 @ 256x512
Total GFLOPS:  0.13421772800000001
GFLOP/sec:  80.937779221776751
speedup over python:  46462.054523297731
speedup over numpy:  0.62039994246190278
speedup in numpy:  1.6118634634809101
worst speedup in numpy:  1.6118634634809101

2048x1024 @ 1024x2048

---------------------
Benchmark Report (s)
---------------------
Mean: 0.12600896040860216
Total: 11.718833318
Iters: 93
Warmup Mean: 0.1245288464
Warmup Total: 0.62264423199999996
Warmup Iters: 5
Fastest Mean: 0.12600896040860216
Slowest Mean: 0.12600896040860216

Matmul :  2048x1024 @ 1024x2048
Total GFLOPS:  8.5899345920000005
GFLOP/sec:  68.169236252294311
speedup over python:  0.0
speedup over numpy:  0.15473383066237342
speedup in numpy:  6.4627108093897254
worst speedup in numpy:  6.4627108093897254
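One way to read the parallel results is to compare them with the SIMD-only means and ask how much of the available core count the speedup represents. The sketch below does that arithmetic; it assumes the parallel version ran on 20 workers (the physical core count noted earlier), which the logs themselves do not state. `parallel_scaling`, `WORKERS`, and `RUNS` are hypothetical names.

```python
def parallel_scaling(serial_s, parallel_s, workers):
    """Return (speedup, efficiency) of a parallel run relative to a serial run."""
    speedup = serial_s / parallel_s
    efficiency = speedup / workers  # 1.0 would be perfect linear scaling
    return speedup, efficiency


WORKERS = 20  # assumption: one worker per physical core

# (SIMD-only mean, SIMD + parallel mean) in seconds, copied from the reports above.
RUNS = {
    "512x256 @ 256x512": (0.012332463752118645, 0.0016582828104565537),
    "2048x1024 @ 1024x2048": (1.04279860254, 0.12600896040860216),
}

if __name__ == "__main__":
    for size, (serial_s, parallel_s) in RUNS.items():
        speedup, efficiency = parallel_scaling(serial_s, parallel_s, WORKERS)
        print(size, "speedup:", round(speedup, 2), "efficiency:", round(efficiency, 2))
```

Both sizes gain well under 20x, so the parallel version is not scaling linearly with core count under this assumption.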

SIMD Vectorize, Parallel using DTypePointer

The earlier Mojo versions used Tensor; this one uses DTypePointer.

512x256 @ 256x512

---------------------
Benchmark Report (s)
---------------------
Mean: 0.00063344754129086812
Total: 13.484831259
Iters: 21288
Warmup Mean: 0.0057730554000000002
Warmup Total: 0.028865277000000002
Warmup Iters: 5
Fastest Mean: 0.00063344754129086812
Slowest Mean: 0.00063344754129086812

Matmul :  512x256 @ 256x512
Total GFLOPS:  0.13421772800000001
GFLOP/sec:  211.8845196343884
speedup over python:  121631.58167362913
speedup over numpy:  1.6241259033009681
speedup in numpy:  0.61571581240564033
worst speedup in numpy:  0.61571581240564033

2048x1024 @ 1024x2048

---------------------
Benchmark Report (s)
---------------------
Mean: 0.054670777668181819
Total: 12.027571087
Iters: 220
Warmup Mean: 0.057961355399999998
Warmup Total: 0.28980677700000002
Warmup Iters: 5
Fastest Mean: 0.054670777668181819
Slowest Mean: 0.054670777668181819

Matmul :  2048x1024 @ 1024x2048
Total GFLOPS:  8.5899345920000005
GFLOP/sec:  157.12113414840465
speedup over python:  0.0
speedup over numpy:  0.35664115224675202
speedup in numpy:  2.8039388996481325
worst speedup in numpy:  2.8039388996481325

MVPavan's Notes
