System configuration (`lscpu` output):

```
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              40
On-line CPU(s) list: 0-39
Thread(s) per core:  2
Core(s) per socket:  10
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz
Stepping:            7
CPU MHz:             2400.000
CPU max MHz:         3200.0000
CPU min MHz:         1000.0000
```
Note: Mojo uses only the physical cores (20), whereas NumPy utilizes all 40 logical cores.
All times are mean runtimes in seconds.

| Metric | 512x256 @ 256x512 (s) | Speedup over Python | Speedup over NumPy | 2048x1024 @ 1024x2048 (s) | Speedup over NumPy |
| --- | --- | --- | --- | --- | --- |
| Python (Lists) | 77 | N/A | N/A | N/A | N/A |
| NumPy | 0.0010 | 74844.5881 | N/A | 0.0195 | N/A |
| Mojo (For Loops) | 0.6402 | 120.2806 | 0.0016 | 40.9093 | 0.0005 |
| Mojo (SIMD) | 0.0123 | 6243.6854 | 0.0834 | 1.0428 | 0.0187 |
| Mojo (SIMD + Parallel) | 0.0017 | 46433.6542 | 0.6204 | 0.1260 | 0.1547 |
| Mojo (SIMD + Parallel + DTypePointer) | 0.0006 | 121557.0351 | 1.6241 | 0.0547 | 0.3566 |
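The speedup columns are plain ratios of mean runtimes on the same workload. For example, the 1.6241x "over NumPy" figure for the 512x256 case falls out of the exact mean timings reported in the sections below:

```python
# Speedup over NumPy = mean NumPy time / mean Mojo time for the same matmul.
numpy_mean_s = 0.0010287985601928084   # NumPy, 512x256 @ 256x512
mojo_mean_s = 0.00063344754129086812   # Mojo SIMD + Parallel + DTypePointer

speedup = numpy_mean_s / mojo_mean_s
print(f"{speedup:.4f}")  # ~1.6241
```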
## Python (Lists)
- `512x256 @ 256x512`

```
Matmul : 512x256 @ 256x512
Iterations : 2
Total GFLOPS : 0.134217728
Total time in sec : 77.04722635447979
GFLOP/sec : 0.001742018945399663
```

- `2048x1024 @ 1024x2048`
  - My system got tired! All the operations from here on are in FP32.
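For reference, the "Python (Lists)" baseline is a naive triple-loop multiply over nested lists. A minimal sketch (the function name is mine, not the benchmark's actual code), along with the FLOP count behind the "Total GFLOPS" figure:

```python
def matmul_lists(A, B):
    """Naive triple-loop matrix multiply over nested Python lists."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for m in range(M):
        for k in range(K):
            a = A[m][k]
            for n in range(N):
                C[m][n] += a * B[k][n]
    return C

# An MxK @ KxN matmul does 2*M*K*N floating-point ops
# (one multiply + one add per accumulated term).
gflops = 2 * 512 * 256 * 512 / 1e9
print(gflops)  # 0.134217728, matching "Total GFLOPS" above
```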
## NumPy
- `512x256 @ 256x512`

```
Matmul : 512x256 @ 256x512
Iterations : 10000
Total GFLOPS : 0.134217728
Total time in sec : 0.0010287985601928084
GFLOP/sec : 130.46064914286632
speedup over python : 74890
```

- `2048x1024 @ 1024x2048`

```
Matmul : 2048x1024 @ 1024x2048
Iterations : 10000
Total GFLOPS : 8.589934592
Total time in sec : 0.019497849141806363
GFLOP/sec : 440.55805999554434
```

A second run with fewer iterations:

```
Matmul : 2048x1024 @ 1024x2048
Iterations : 200
Total GFLOPS : 8.589934592
Total time in sec : 0.018091672335285695
GFLOP/sec : 474.80047354419173
```
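The NumPy numbers come from timing `A @ B` over many iterations and dividing by the iteration count. A rough sketch of such a harness, assuming FP32 inputs per the note above (not the exact benchmark code):

```python
import time

import numpy as np

def bench_numpy(M, K, N, iters=100):
    """Time an MxK @ KxN float32 matmul; return (mean seconds, GFLOP/sec)."""
    A = np.random.rand(M, K).astype(np.float32)
    B = np.random.rand(K, N).astype(np.float32)
    start = time.perf_counter()
    for _ in range(iters):
        C = A @ B
    mean_s = (time.perf_counter() - start) / iters
    gflops = 2 * M * K * N / 1e9
    return mean_s, gflops / mean_s

mean_s, gflop_per_s = bench_numpy(512, 256, 512)
print(f"mean: {mean_s:.6f} s, {gflop_per_s:.2f} GFLOP/s")
```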
## Mojo
### Just for loops
- `512x256 @ 256x512`
```
Benchmark Report (s)
———————
Mean: 0.64016959076470581
Total: 10.882883043
Iters: 17
Warmup Mean: 0.69197315560000006
Warmup Total: 3.4598657780000002
Warmup Iters: 5
Fastest Mean: 0.64016959076470592
Slowest Mean: 0.64016959076470592

Matmul : 512x256 @ 256x512
Total GFLOPS: 0.13421772800000001
GFLOP/sec: 0.20965964321996622
speedup over python: 120.35439899987139
speedup over numpy: 0.0016070718994381961
speedup in numpy: 622.24969545518297
worst speedup in numpy: 622.24969545518309
```
- `2048x1024 @ 1024x2048`
```
Benchmark Report (s)
———————
Mean: 40.9093318165
Total: 245.455990899
Iters: 6
Warmup Mean: 40.8972278066
Warmup Total: 204.486139033
Warmup Iters: 5
Fastest Mean: 40.9093318165
Slowest Mean: 40.9093318165

Matmul : 2048x1024 @ 1024x2048
GFLOP/sec: 0.2099749424050826
speedup over python: 0.0
speedup over numpy: 0.00047661128344174706
speedup in numpy: 2098.1458785002164
worst speedup in numpy: 2098.1458785002164
```
### SIMD Vectorization
- `512x256 @ 256x512`
```
Benchmark Report (s)
———————
Mean: 0.012332463752118645
Total: 11.641845782000001
Iters: 944
Warmup Mean: 0.0146965052
Warmup Total: 0.073482526000000006
Warmup Iters: 5
Fastest Mean: 0.012332463752118644
Slowest Mean: 0.012332463752118644

Matmul : 512x256 @ 256x512
Total GFLOPS: 0.13421772800000001
GFLOP/sec: 10.883285829803651
speedup over python: 6247.5128979190004
speedup over numpy: 0.083421981274104037
speedup in numpy: 11.987248261513317
worst speedup in numpy: 11.987248261513315
```
- `2048x1024 @ 1024x2048`
```
Benchmark Report (s)
———————
Mean: 1.04279860254
Total: 52.139930127
Iters: 50
Warmup Mean: 1.1388049650000001
Warmup Total: 5.6940248249999996
Warmup Iters: 5
Fastest Mean: 1.04279860254
Slowest Mean: 1.04279860254

Matmul : 2048x1024 @ 1024x2048
Total GFLOPS: 8.5899345920000005
GFLOP/sec: 8.2373859833308565
speedup over python: 0.0
speedup over numpy: 0.018697617252568632
speedup in numpy: 53.482750582169636
worst speedup in numpy: 53.482750582169636
```
### SIMD Vectorization and Parallelization
- `512x256 @ 256x512`
```
Benchmark Report (s)
———————
Mean: 0.0016582828104565537
Total: 11.259740282999999
Iters: 6790
Warmup Mean: 0.0045278230000000003
Warmup Total: 0.022639115000000001
Warmup Iters: 5
Fastest Mean: 0.0016582828104565537
Slowest Mean: 0.0016582828104565537

Matmul : 512x256 @ 256x512
Total GFLOPS: 0.13421772800000001
GFLOP/sec: 80.937779221776751
speedup over python: 46462.054523297731
speedup over numpy: 0.62039994246190278
speedup in numpy: 1.6118634634809101
worst speedup in numpy: 1.6118634634809101
```
- `2048x1024 @ 1024x2048`
```
Benchmark Report (s)
———————
Mean: 0.12600896040860216
Total: 11.718833318
Iters: 93
Warmup Mean: 0.1245288464
Warmup Total: 0.62264423199999996
Warmup Iters: 5
Fastest Mean: 0.12600896040860216
Slowest Mean: 0.12600896040860216

Matmul : 2048x1024 @ 1024x2048
Total GFLOPS: 8.5899345920000005
GFLOP/sec: 68.169236252294311
speedup over python: 0.0
speedup over numpy: 0.15473383066237342
speedup in numpy: 6.4627108093897254
worst speedup in numpy: 6.4627108093897254
```
### SIMD Vectorization and Parallelization using DTypePointer
The earlier versions used `Tensor`; this one accesses raw memory through `DTypePointer`.
- `512x256 @ 256x512`
```
Benchmark Report (s)
———————
Mean: 0.00063344754129086812
Total: 13.484831259
Iters: 21288
Warmup Mean: 0.0057730554000000002
Warmup Total: 0.028865277000000002
Warmup Iters: 5
Fastest Mean: 0.00063344754129086812
Slowest Mean: 0.00063344754129086812

Matmul : 512x256 @ 256x512
Total GFLOPS: 0.13421772800000001
GFLOP/sec: 211.8845196343884
speedup over python: 121631.58167362913
speedup over numpy: 1.6241259033009681
speedup in numpy: 0.61571581240564033
worst speedup in numpy: 0.61571581240564033
```
- `2048x1024 @ 1024x2048`
```
Benchmark Report (s)
———————
Mean: 0.054670777668181819
Total: 12.027571087
Iters: 220
Warmup Mean: 0.057961355399999998
Warmup Total: 0.28980677700000002
Warmup Iters: 5
Fastest Mean: 0.054670777668181819
Slowest Mean: 0.054670777668181819

Matmul : 2048x1024 @ 1024x2048
Total GFLOPS: 8.5899345920000005
GFLOP/sec: 157.12113414840465
speedup over python: 0.0
speedup over numpy: 0.35664115224675202
speedup in numpy: 2.8039388996481325
worst speedup in numpy: 2.8039388996481325
```
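Switching from `Tensor` to `DTypePointer` trades structured accessors for flat memory plus manual index arithmetic: element (i, j) of a row-major RxC matrix lives at offset `i*C + j`. The indexing scheme, sketched with a flat Python list (illustrative only, not the Mojo kernel):

```python
def matmul_flat(a, b, M, K, N):
    """a and b are flat row-major buffers of length M*K and K*N;
    all indexing is the i*cols + j arithmetic a pointer kernel does by hand."""
    c = [0.0] * (M * N)
    for m in range(M):
        for k in range(K):
            av = a[m * K + k]
            for n in range(N):
                c[m * N + n] += av * b[k * N + n]
    return c

# 2x2 example: [[1,2],[3,4]] @ [[5,6],[7,8]] = [[19,22],[43,50]]
print(matmul_flat([1, 2, 3, 4], [5, 6, 7, 8], 2, 2, 2))  # [19.0, 22.0, 43.0, 50.0]
```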