## CPU
- Physical cores: 20
- Physical + Logical cores: 40
```
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              40
On-line CPU(s) list: 0-39
Thread(s) per core:  2
Core(s) per socket:  10
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz
Stepping:            7
CPU MHz:             2400.000
CPU max MHz:         3200.0000
CPU min MHz:         1000.0000
```
Note: Mojo uses only the 20 physical cores, whereas NumPy utilizes all 40 logical cores.
## MatMul Benchmark Results
Times are mean seconds per iteration; speedups are ratios of mean times.

| Implementation | 512x256 @ 256x512 (s) | Speedup over Python | Speedup over NumPy | 2048x1024 @ 1024x2048 (s) | Speedup over NumPy |
|---|---|---|---|---|---|
| Python (Lists) | 77 | N/A | N/A | N/A | N/A |
| NumPy | 0.0010 | 74844.5881 | N/A | 0.0195 | N/A |
| Mojo (For Loops) | 0.6402 | 120.2806 | 0.0016 | 40.9093 | 0.0005 |
| Mojo (SIMD) | 0.0123 | 6243.6854 | 0.0834 | 1.0428 | 0.0187 |
| Mojo (SIMD + Parallel) | 0.0017 | 46433.6542 | 0.6204 | 0.1260 | 0.1547 |
| Mojo (SIMD + Parallel + DTypePointer) | 0.0006 | 121557.0351 | 1.6241 | 0.0547 | 0.3566 |
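The GFLOPS totals in the reports below follow from the standard matmul operation count: an MxK @ KxN multiply performs 2·M·K·N floating-point operations (one multiply and one add per accumulation step). A quick sanity check of the two totals reported here:

```python
def matmul_gflop(m: int, k: int, n: int) -> float:
    """Total floating-point operations of an MxK @ KxN matmul, in GFLOP."""
    return 2 * m * k * n / 1e9

# The two shapes benchmarked here:
print(matmul_gflop(512, 256, 512))     # 0.134217728
print(matmul_gflop(2048, 1024, 2048))  # 8.589934592
```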
## Python (Using Lists) 😀
### 512x256 @ 256x512

```
Matmul : 512x256 @ 256x512
Iterations : 2
Total GFLOPS : 0.134217728
Total time in sec : 77.04722635447979
GFLOP/sec : 0.001742018945399663
```
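The list-based baseline was presumably a naive triple loop along these lines (a sketch, not the exact benchmark code; all names here are illustrative):

```python
import random
import time

def matmul_lists(a, b):
    """Naive triple-loop matmul over Python lists of lists."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        row_c = c[i]
        for p in range(k):
            aip = a[i][p]
            row_b = b[p]
            for j in range(n):
                row_c[j] += aip * row_b[j]
    return c

# Time a small case; the 512x256 @ 256x512 run above took ~77 s on the test machine.
a = [[random.random() for _ in range(64)] for _ in range(64)]
b = [[random.random() for _ in range(64)] for _ in range(64)]
t0 = time.perf_counter()
c = matmul_lists(a, b)
print(f"64x64 @ 64x64 took {time.perf_counter() - t0:.4f} s")
```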
### 2048x1024 @ 1024x2048

Not benchmarked: my system got tired!
All operations from here on are in FP32.
## NumPy
### 512x256 @ 256x512

```
Matmul : 512x256 @ 256x512
Iterations : 10000
Total GFLOPS : 0.134217728
Total time in sec : 0.0010287985601928084
GFLOP/sec : 130.46064914286632
speedup over python: 74890
```
### 2048x1024 @ 1024x2048

```
Matmul : 2048x1024 @ 1024x2048
Iterations : 10000
Total GFLOPS : 8.589934592
Total time in sec : 0.019497849141806363
GFLOP/sec : 440.55805999554434

Matmul : 2048x1024 @ 1024x2048
Iterations : 200
Total GFLOPS : 8.589934592
Total time in sec : 0.018091672335285695
GFLOP/sec : 474.80047354419173
```
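The NumPy numbers above likely came from a timing loop along these lines (a sketch under the FP32 assumption noted above; the actual harness may differ):

```python
import time

import numpy as np

def bench_numpy(m: int, k: int, n: int, iters: int = 100) -> tuple[float, float]:
    """Return (seconds per matmul, GFLOP/sec) for an MxK @ KxN multiply in FP32."""
    a = np.random.rand(m, k).astype(np.float32)
    b = np.random.rand(k, n).astype(np.float32)
    a @ b  # warmup, so one-time setup cost is excluded
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    secs = (time.perf_counter() - t0) / iters
    return secs, 2 * m * k * n / 1e9 / secs

secs, rate = bench_numpy(512, 256, 512)
print(f"{secs:.6f} s/iter, {rate:.1f} GFLOP/sec")
```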
## Mojo
### Just for loops
#### 512x256 @ 256x512

```
---------------------
Benchmark Report (s)
---------------------
Mean: 0.64016959076470581
Total: 10.882883043
Iters: 17
Warmup Mean: 0.69197315560000006
Warmup Total: 3.4598657780000002
Warmup Iters: 5
Fastest Mean: 0.64016959076470592
Slowest Mean: 0.64016959076470592

Matmul : 512x256 @ 256x512
Total GFLOPS: 0.13421772800000001
GFLOP/sec: 0.20965964321996622
speedup over python: 120.35439899987139
speedup over numpy: 0.0016070718994381961
speedup in numpy: 622.24969545518297
worst speedup in numpy: 622.24969545518309
```
#### 2048x1024 @ 1024x2048

```
---------------------
Benchmark Report (s)
---------------------
Mean: 40.9093318165
Total: 245.455990899
Iters: 6
Warmup Mean: 40.8972278066
Warmup Total: 204.486139033
Warmup Iters: 5
Fastest Mean: 40.9093318165
Slowest Mean: 40.9093318165

Matmul : 2048x1024 @ 1024x2048
GFLOP/sec: 0.2099749424050826
speedup over python: 0.0
speedup over numpy: 0.00047661128344174706
speedup in numpy: 2098.1458785002164
worst speedup in numpy: 2098.1458785002164
```
### SIMD Vectorization
#### 512x256 @ 256x512

```
---------------------
Benchmark Report (s)
---------------------
Mean: 0.012332463752118645
Total: 11.641845782000001
Iters: 944
Warmup Mean: 0.0146965052
Warmup Total: 0.073482526000000006
Warmup Iters: 5
Fastest Mean: 0.012332463752118644
Slowest Mean: 0.012332463752118644

Matmul : 512x256 @ 256x512
Total GFLOPS: 0.13421772800000001
GFLOP/sec: 10.883285829803651
speedup over python: 6247.5128979190004
speedup over numpy: 0.083421981274104037
speedup in numpy: 11.987248261513317
worst speedup in numpy: 11.987248261513315
```
#### 2048x1024 @ 1024x2048

```
---------------------
Benchmark Report (s)
---------------------
Mean: 1.04279860254
Total: 52.139930127
Iters: 50
Warmup Mean: 1.1388049650000001
Warmup Total: 5.6940248249999996
Warmup Iters: 5
Fastest Mean: 1.04279860254
Slowest Mean: 1.04279860254

Matmul : 2048x1024 @ 1024x2048
Total GFLOPS: 8.5899345920000005
GFLOP/sec: 8.2373859833308565
speedup over python: 0.0
speedup over numpy: 0.018697617252568632
speedup in numpy: 53.482750582169636
worst speedup in numpy: 53.482750582169636
```
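Mojo's `vectorize` replaces the innermost `j` loop with SIMD operations over the hardware vector width. The same restructuring can be sketched in NumPy terms, where the inner loop becomes one whole-row vector update (an illustrative analogue only; NumPy dispatches to compiled loops rather than emitting explicit SIMD the way Mojo does):

```python
import numpy as np

def matmul_rowvec(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Matmul with the innermost loop replaced by a vectorized row update,
    mirroring what Mojo's vectorize does with SIMD lanes."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    c = np.zeros((m, n), dtype=a.dtype)
    for i in range(m):
        for p in range(k):
            # One vector operation updates all n entries of row i at once.
            c[i] += a[i, p] * b[p]
    return c

a = np.random.rand(8, 4).astype(np.float32)
b = np.random.rand(4, 8).astype(np.float32)
print(np.allclose(matmul_rowvec(a, b), a @ b))  # True
```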
### SIMD Vectorization and Parallelization
#### 512x256 @ 256x512

```
---------------------
Benchmark Report (s)
---------------------
Mean: 0.0016582828104565537
Total: 11.259740282999999
Iters: 6790
Warmup Mean: 0.0045278230000000003
Warmup Total: 0.022639115000000001
Warmup Iters: 5
Fastest Mean: 0.0016582828104565537
Slowest Mean: 0.0016582828104565537

Matmul : 512x256 @ 256x512
Total GFLOPS: 0.13421772800000001
GFLOP/sec: 80.937779221776751
speedup over python: 46462.054523297731
speedup over numpy: 0.62039994246190278
speedup in numpy: 1.6118634634809101
worst speedup in numpy: 1.6118634634809101
```
#### 2048x1024 @ 1024x2048

```
---------------------
Benchmark Report (s)
---------------------
Mean: 0.12600896040860216
Total: 11.718833318
Iters: 93
Warmup Mean: 0.1245288464
Warmup Total: 0.62264423199999996
Warmup Iters: 5
Fastest Mean: 0.12600896040860216
Slowest Mean: 0.12600896040860216

Matmul : 2048x1024 @ 1024x2048
Total GFLOPS: 8.5899345920000005
GFLOP/sec: 68.169236252294311
speedup over python: 0.0
speedup over numpy: 0.15473383066237342
speedup in numpy: 6.4627108093897254
worst speedup in numpy: 6.4627108093897254
```
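On top of vectorization, Mojo's `parallelize` splits the outer row loop across cores. The decomposition can be sketched in Python with a thread pool over row blocks (an analogue of the structure only; pure-Python threads give no comparable speedup, and block names here are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def matmul_parallel_rows(a: np.ndarray, b: np.ndarray, workers: int = 4) -> np.ndarray:
    """Compute a @ b with the outer row loop split across a thread pool,
    mirroring the row-wise split Mojo's parallelize applies across cores."""
    m = a.shape[0]
    c = np.empty((m, b.shape[1]), dtype=a.dtype)
    step = (m + workers - 1) // workers  # rows per worker, rounded up

    def rows(lo: int) -> None:
        # Each task fills an independent block of output rows.
        hi = min(lo + step, m)
        c[lo:hi] = a[lo:hi] @ b

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(rows, range(0, m, step)))  # consume to surface any errors
    return c

a = np.random.rand(64, 32).astype(np.float32)
b = np.random.rand(32, 64).astype(np.float32)
print(np.allclose(matmul_parallel_rows(a, b), a @ b))  # True
```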
### SIMD Vectorization and Parallelization, using DTypePointer

The earlier Mojo versions stored data in a `Tensor`; this one uses a raw `DTypePointer` buffer instead.
#### 512x256 @ 256x512

```
---------------------
Benchmark Report (s)
---------------------
Mean: 0.00063344754129086812
Total: 13.484831259
Iters: 21288
Warmup Mean: 0.0057730554000000002
Warmup Total: 0.028865277000000002
Warmup Iters: 5
Fastest Mean: 0.00063344754129086812
Slowest Mean: 0.00063344754129086812

Matmul : 512x256 @ 256x512
Total GFLOPS: 0.13421772800000001
GFLOP/sec: 211.8845196343884
speedup over python: 121631.58167362913
speedup over numpy: 1.6241259033009681
speedup in numpy: 0.61571581240564033
worst speedup in numpy: 0.61571581240564033
```
#### 2048x1024 @ 1024x2048

```
---------------------
Benchmark Report (s)
---------------------
Mean: 0.054670777668181819
Total: 12.027571087
Iters: 220
Warmup Mean: 0.057961355399999998
Warmup Total: 0.28980677700000002
Warmup Iters: 5
Fastest Mean: 0.054670777668181819
Slowest Mean: 0.054670777668181819

Matmul : 2048x1024 @ 1024x2048
Total GFLOPS: 8.5899345920000005
GFLOP/sec: 157.12113414840465
speedup over python: 0.0
speedup over numpy: 0.35664115224675202
speedup in numpy: 2.8039388996481325
worst speedup in numpy: 2.8039388996481325
```
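The `speedup over numpy` figures are simply ratios of mean times per iteration. Taking the 512x256 @ 256x512 case, dividing NumPy's mean by the DTypePointer version's mean reproduces the reported 1.6241:

```python
numpy_mean = 0.0010287985601928084  # NumPy mean time (s), 512x256 @ 256x512
mojo_mean = 0.00063344754129086812  # Mojo SIMD + Parallel + DTypePointer mean (s)

speedup = numpy_mean / mojo_mean
print(f"speedup over numpy: {speedup:.4f}")  # speedup over numpy: 1.6241
```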