We use cookies on this site to enhance your user experience

By clicking the Accept button, you agree to us doing so. More info on our cookie policy

title: 
style: nestedList # TOC style (nestedList|nestedOrderedList|inlineFirstLevel)
minLevel: 0 # Include headings from the specified level
maxLevel: 0 # Include headings up to the specified level
includeLinks: true # Make headings clickable
debugInConsole: false # Print debug info in Obsidian console

CPU

  • Physical cores: 20
  • Physical + Logical cores: 40
    Architecture:                    x86_64
    CPU op-mode(s):                  32-bit, 64-bit
    Byte Order:                      Little Endian
    Address sizes:                   46 bits physical, 48 bits virtual
    CPU(s):                          40
    On-line CPU(s) list:             0-39
    Thread(s) per core:              2
    Core(s) per socket:              10
    Socket(s):                       2
    NUMA node(s):                    2
    Vendor ID:                       GenuineIntel
    CPU family:                      6
    Model:                           85
    Model name:                      Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz
    Stepping:                        7
    CPU MHz:                         2400.000
    CPU max MHz:                     3200.0000
    CPU min MHz:                     1000.0000
    

Mojo is using only physical cores where as numpy utilizing all 40 cores

MatMul Benchmark Results

Metric 512x256 @ 256x512 Speed over Python Speedup over NumPy 2048x1024 @ 1024x2048 Speedup over NumPy
Python (Lists) 77 N/A N/A N/A N/A
NumPy 0.0010 74844.5881 N/A 0.0195 N/A
Mojo (For Loops) 0.6402 120.2806 0.0016 40.9093 0.0005
Mojo (SIMD) 0.0123 6243.6854 0.0834 1.0428 0.0187
Mojo (SIMD + Parallel) 0.0017 46433.6542 0.6204 0.1260 0.1547
Mojo (SIMD+ Parallel+ DTypePointer) 0.0006 121557.0351 1.6241 0.0547 0.3566

Python (Using Lists) 😀

  • 512x256 @ 256x512
    Matmul : 512x256 @ 256x512
    Iterations :  2
    Total GFLOPS :  0.134217728
    Total time in sec :  77.04722635447979
    GFLOP/sec :  0.001742018945399663
    
  • 2048x1024 @ 1024x2048 - My system got tired!

All the operations from here are in FP32

Numpy

  • 512x256 @ 256x512
    Matmul : 512x256 @ 256x512
    Iterations : 10000
    Total GFLOPS :  0.134217728
    Total time in sec :  0.0010287985601928084
    GFLOP/sec :  130.46064914286632
    speedup over python: 74890
    
  • 2048x1024 @ 1024x2048 ``` Matmul : 2048x1024 @ 1024x2048 Iterations : 10000 Total GFLOPS : 8.589934592 Total time in sec : 0.019497849141806363 GFLOP/sec : 440.55805999554434

Matmul : 2048x1024 @ 1024x2048 Iterations : 200 Total GFLOPS : 8.589934592 Total time in sec : 0.018091672335285695 GFLOP/sec : 474.80047354419173



## Mojo

### Just for loops

- `512x256 @ 256x512`

Benchmark Report (s) ——————— Mean: 0.64016959076470581 Total: 10.882883043 Iters: 17 Warmup Mean: 0.69197315560000006 Warmup Total: 3.4598657780000002 Warmup Iters: 5 Fastest Mean: 0.64016959076470592 Slowest Mean: 0.64016959076470592

Matmul : 512x256 @ 256x512 Total GFLOPS: 0.13421772800000001 GFLOP/sec: 0.20965964321996622 speedup over python: 120.35439899987139 speedup over numpy: 0.0016070718994381961 speedup in numpy: 622.24969545518297 worst speedup in numpy: 622.24969545518309

- `2048x1024 @ 1024x2048`

Benchmark Report (s) ——————— Mean: 40.9093318165 Total: 245.455990899 Iters: 6 Warmup Mean: 40.8972278066 Warmup Total: 204.486139033 Warmup Iters: 5 Fastest Mean: 40.9093318165 Slowest Mean: 40.9093318165

Matmul : 2048x1024 @ 1024x2048

GFLOP/sec: 0.2099749424050826 speedup over python: 0.0 speedup over numpy: 0.00047661128344174706 speedup in numpy: 2098.1458785002164 worst speedup in numpy: 2098.1458785002164




### SIMD Vectorization

- `512x256 @ 256x512`

Benchmark Report (s) ——————— Mean: 0.012332463752118645 Total: 11.641845782000001 Iters: 944 Warmup Mean: 0.0146965052 Warmup Total: 0.073482526000000006 Warmup Iters: 5 Fastest Mean: 0.012332463752118644 Slowest Mean: 0.012332463752118644

Matmul : 512x256 @ 256x512 Total GFLOPS: 0.13421772800000001 GFLOP/sec: 10.883285829803651 speedup over python: 6247.5128979190004 speedup over numpy: 0.083421981274104037 speedup in numpy: 11.987248261513317 worst speedup in numpy: 11.987248261513315

- `2048x1024 @ 1024x2048`

Benchmark Report (s) ——————— Mean: 1.04279860254 Total: 52.139930127 Iters: 50 Warmup Mean: 1.1388049650000001 Warmup Total: 5.6940248249999996 Warmup Iters: 5 Fastest Mean: 1.04279860254 Slowest Mean: 1.04279860254

Matmul : 2048x1024 @ 1024x2048 Total GFLOPS: 8.5899345920000005 GFLOP/sec: 8.2373859833308565 speedup over python: 0.0 speedup over numpy: 0.018697617252568632 speedup in numpy: 53.482750582169636 worst speedup in numpy: 53.482750582169636



### SIMD Vectorization and Parallelization

- `512x256 @ 256x512`

Benchmark Report (s) ——————— Mean: 0.0016582828104565537 Total: 11.259740282999999 Iters: 6790 Warmup Mean: 0.0045278230000000003 Warmup Total: 0.022639115000000001 Warmup Iters: 5 Fastest Mean: 0.0016582828104565537 Slowest Mean: 0.0016582828104565537

Matmul : 512x256 @ 256x512 Total GFLOPS: 0.13421772800000001 GFLOP/sec: 80.937779221776751 speedup over python: 46462.054523297731 speedup over numpy: 0.62039994246190278 speedup in numpy: 1.6118634634809101 worst speedup in numpy: 1.6118634634809101

- `2048x1024 @ 1024x2048`

Benchmark Report (s) ——————— Mean: 0.12600896040860216 Total: 11.718833318 Iters: 93 Warmup Mean: 0.1245288464 Warmup Total: 0.62264423199999996 Warmup Iters: 5 Fastest Mean: 0.12600896040860216 Slowest Mean: 0.12600896040860216

Matmul : 2048x1024 @ 1024x2048 Total GFLOPS: 8.5899345920000005 GFLOP/sec: 68.169236252294311 speedup over python: 0.0 speedup over numpy: 0.15473383066237342 speedup in numpy: 6.4627108093897254 worst speedup in numpy: 6.4627108093897254



### SIMD Vectorize, Parallel using DTypePointer
Earlier used Tensor

- `512x256 @ 256x512`

Benchmark Report (s) ——————— Mean: 0.00063344754129086812 Total: 13.484831259 Iters: 21288 Warmup Mean: 0.0057730554000000002 Warmup Total: 0.028865277000000002 Warmup Iters: 5 Fastest Mean: 0.00063344754129086812 Slowest Mean: 0.00063344754129086812

Matmul : 512x256 @ 256x512 Total GFLOPS: 0.13421772800000001 GFLOP/sec: 211.8845196343884 speedup over python: 121631.58167362913 speedup over numpy: 1.6241259033009681 speedup in numpy: 0.61571581240564033 worst speedup in numpy: 0.61571581240564033

- `2048x1024 @ 1024x2048`

Benchmark Report (s) ——————— Mean: 0.054670777668181819 Total: 12.027571087 Iters: 220 Warmup Mean: 0.057961355399999998 Warmup Total: 0.28980677700000002 Warmup Iters: 5 Fastest Mean: 0.054670777668181819 Slowest Mean: 0.054670777668181819

Matmul : 2048x1024 @ 1024x2048 Total GFLOPS: 8.5899345920000005 GFLOP/sec: 157.12113414840465 speedup over python: 0.0 speedup over numpy: 0.35664115224675202 speedup in numpy: 2.8039388996481325 worst speedup in numpy: 2.8039388996481325 ```

Latest Posts