MPI Benchmarks
The following uses nodecomm to measure the bandwidth between two nodes, with 1 to 32 processes on each node, each communicating with its partner process on the other node.
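The pairing described above can be sketched in a few lines. This is only an illustration and not nodecomm's implementation: the block rank placement (the first half of the ranks on one node, the second half on the other) and the names `partner_rank` and `bandwidth_bytes_per_sec` are assumptions made for the example.

```python
def partner_rank(rank, procs_per_node):
    """Partner of `rank` on the other node.

    Assumes block placement: ranks [0, ppn) on node 0 and
    ranks [ppn, 2 * ppn) on node 1 (an assumption, not nodecomm's layout).
    """
    if not 0 <= rank < 2 * procs_per_node:
        raise ValueError("rank out of range for two nodes")
    return (rank + procs_per_node) % (2 * procs_per_node)


def bandwidth_bytes_per_sec(message_bytes, num_messages, elapsed_seconds):
    """Bandwidth for one pair: total bytes moved divided by elapsed time."""
    return message_bytes * num_messages / elapsed_seconds


# Show which ranks would be paired for a few processes-per-node counts.
for ppn in (1, 2, 4):
    pairs = [(r, partner_rank(r, ppn)) for r in range(ppn)]
    print(ppn, pairs)
```

Under this layout, the aggregate node-to-node bandwidth would be the sum of the per-pair values, which is why the number of processes per node is varied.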
TensorFlow Benchmarks
The figure above shows the effect of increasing batch size on training rate. Running ResNet-50 v1.5 with batch size 256 using fp32 for calculations results in an out-of-memory (OOM) error, which is why that result is omitted.
The figure above shows the effect of increasing the number of GPUs on training rate at a fixed batch size, for both real (blue) and synthetic (orange) data.
A broader set of results can be seen in the following table:
model | num_gpus | batch_size | use_fp16 | use_synth | img_per_sec |
---|---|---|---|---|---|
resnet50_v1.5 | 1 | 32 | False | False | 305.1581998762749 |
resnet50_v1.5 | 1 | 32 | False | True | 305.0756935566205 |
resnet50_v1.5 | 1 | 32 | True | False | 557.608344009375 |
resnet50_v1.5 | 1 | 32 | True | True | 551.7856108469189 |
resnet50_v1.5 | 1 | 64 | False | False | 345.86744731677936 |
resnet50_v1.5 | 1 | 64 | False | True | 343.33576731867544 |
resnet50_v1.5 | 1 | 64 | True | False | 687.2315721842947 |
resnet50_v1.5 | 1 | 64 | True | True | 676.3865560680504 |
resnet50_v1.5 | 1 | 128 | False | False | 365.42804577993536 |
resnet50_v1.5 | 1 | 128 | False | True | 361.9197194570446 |
resnet50_v1.5 | 1 | 128 | True | False | 780.5401907002691 |
resnet50_v1.5 | 1 | 128 | True | True | 787.9725710218387 |
resnet50_v1.5 | 1 | 256 | True | False | 827.7779526071831 |
resnet50_v1.5 | 1 | 256 | True | True | 838.8718495151428 |
resnet50_v1.5 | 2 | 32 | False | False | 606.6988150458972 |
resnet50_v1.5 | 2 | 32 | False | True | 611.4476119199953 |
resnet50_v1.5 | 2 | 32 | True | False | 1093.3154950403268 |
resnet50_v1.5 | 2 | 32 | True | True | 1099.2094967673627 |
resnet50_v1.5 | 2 | 64 | False | False | 684.8824289499631 |
resnet50_v1.5 | 2 | 64 | False | True | 682.3657230029216 |
resnet50_v1.5 | 2 | 64 | True | False | 1337.7101641206896 |
resnet50_v1.5 | 2 | 64 | True | True | 1356.9406586722887 |
resnet50_v1.5 | 2 | 128 | False | False | 727.4366020878481 |
resnet50_v1.5 | 2 | 128 | False | True | 714.9118280128031 |
resnet50_v1.5 | 2 | 128 | True | False | 1520.3306305011697 |
resnet50_v1.5 | 2 | 128 | True | True | 1567.667541866167 |
resnet50_v1.5 | 2 | 256 | True | False | 1663.5347160145286 |
resnet50_v1.5 | 2 | 256 | True | True | 1685.7265358502561 |
resnet50_v1.5 | 4 | 32 | False | False | 1099.8515970079466 |
resnet50_v1.5 | 4 | 32 | False | True | 1158.8182004104326 |
resnet50_v1.5 | 4 | 32 | True | False | 1806.7010872696358 |
resnet50_v1.5 | 4 | 32 | True | True | 2140.346448732058 |
resnet50_v1.5 | 4 | 64 | False | False | 1263.7396753889122 |
resnet50_v1.5 | 4 | 64 | False | True | 1323.4508475425275 |
resnet50_v1.5 | 4 | 64 | True | False | 2132.559146072129 |
resnet50_v1.5 | 4 | 64 | True | True | 2630.0377706917334 |
resnet50_v1.5 | 4 | 128 | False | False | 1382.0710902623055 |
resnet50_v1.5 | 4 | 128 | False | True | 1414.0828353841684 |
resnet50_v1.5 | 4 | 128 | True | False | 2557.1778582125 |
resnet50_v1.5 | 4 | 128 | True | True | 3065.898543897684 |
resnet50_v1.5 | 4 | 256 | True | False | 2864.1530570235277 |
resnet50_v1.5 | 4 | 256 | True | True | 3308.184721599039 |
inception3 | 1 | 32 | False | False | 231.6304999196131 |
inception3 | 1 | 32 | False | True | 231.62821340833554 |
inception3 | 1 | 32 | True | False | 397.8483574516062 |
inception3 | 1 | 32 | True | True | 397.61870298268843 |
inception3 | 1 | 64 | False | False | 255.79774049318286 |
inception3 | 1 | 64 | False | True | 254.96217612362278 |
inception3 | 1 | 64 | True | False | 464.3570055911474 |
inception3 | 1 | 64 | True | True | 466.5224032998034 |
inception3 | 1 | 128 | False | False | 265.4893643297776 |
inception3 | 1 | 128 | False | True | 265.9010072543052 |
inception3 | 1 | 128 | True | False | 523.6175139095815 |
inception3 | 1 | 128 | True | True | 529.3233120661808 |
inception3 | 1 | 256 | True | False | 548.3831188132465 |
inception3 | 1 | 256 | True | True | 558.6182587782852 |
inception3 | 2 | 32 | False | False | 461.21245811536164 |
inception3 | 2 | 32 | False | True | 461.7955449394074 |
inception3 | 2 | 32 | True | False | 776.9118181766349 |
inception3 | 2 | 32 | True | True | 803.1606840744608 |
inception3 | 2 | 64 | False | False | 505.62156764514185 |
inception3 | 2 | 64 | False | True | 510.4880571128736 |
inception3 | 2 | 64 | True | False | 910.3834390431858 |
inception3 | 2 | 64 | True | True | 941.0765499887631 |
inception3 | 2 | 128 | False | False | 530.073420084333 |
inception3 | 2 | 128 | False | True | 532.0966824977393 |
inception3 | 2 | 128 | True | False | 1031.042745566893 |
inception3 | 2 | 128 | True | True | 1059.298960623237 |
inception3 | 2 | 256 | True | False | 1097.3179582016833 |
inception3 | 2 | 256 | True | True | 1112.944630855092 |
inception3 | 4 | 32 | False | False | 846.1137665298596 |
inception3 | 4 | 32 | False | True | 895.5092904990373 |
inception3 | 4 | 32 | True | False | 1271.1921617960843 |
inception3 | 4 | 32 | True | True | 1562.0603997601588 |
inception3 | 4 | 64 | False | False | 967.953751912867 |
inception3 | 4 | 64 | False | True | 994.9078067162436 |
inception3 | 4 | 64 | True | False | 1513.182913280423 |
inception3 | 4 | 64 | True | True | 1842.119923117435 |
inception3 | 4 | 128 | False | False | 992.0283404721883 |
inception3 | 4 | 128 | False | True | 1027.2404515646604 |
inception3 | 4 | 128 | True | False | 1751.5253071397437 |
inception3 | 4 | 128 | True | True | 2089.504434176232 |
inception3 | 4 | 256 | True | False | 1992.6604934734398 |
inception3 | 4 | 256 | True | True | 2174.329529373497 |
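One way to read the multi-GPU rows above is as scaling efficiency: measured throughput divided by the single-GPU throughput times the number of GPUs. The sketch below uses the resnet50_v1.5 rows with batch size 128 and fp16 enabled, copied from the table. The names `throughput` and `scaling_efficiency` are illustrative and are not part of the benchmark scripts.

```python
# img_per_sec values copied from the table above:
# resnet50_v1.5, batch_size=128, use_fp16=True.
throughput = {
    # (num_gpus, use_synth): img_per_sec
    (1, True): 787.9725710218387,
    (2, True): 1567.667541866167,
    (4, True): 3065.898543897684,
    (1, False): 780.5401907002691,
    (2, False): 1520.3306305011697,
    (4, False): 2557.1778582125,
}


def scaling_efficiency(num_gpus, use_synth):
    """Measured throughput relative to ideal linear scaling from 1 GPU."""
    baseline = throughput[(1, use_synth)]
    return throughput[(num_gpus, use_synth)] / (num_gpus * baseline)


for use_synth in (True, False):
    for n in (2, 4):
        eff = scaling_efficiency(n, use_synth)
        print(f"use_synth={use_synth} num_gpus={n} efficiency={eff:.3f}")
```

For these rows, synthetic data stays close to linear scaling at 4 GPUs (about 0.97), while real data drops to about 0.82, so the gap between the two grows with the number of GPUs.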
An updated table follows, run with full compute resources and without the filesystem bottleneck. Each row reports the mean and standard deviation of img_per_sec over 5 separate trials:
model | num_gpus | batch_size | use_fp16 | use_synth | mean of img_per_sec | std deviation of img_per_sec |
---|---|---|---|---|---|---|
resnet50_v1.5 | 1 | 32 | False | False | 300.7115095987747 | 0.8334252017088293 |
resnet50_v1.5 | 1 | 32 | False | True | 303.2748894498352 | 0.518768277442615 |
resnet50_v1.5 | 1 | 32 | True | False | 552.8543259023911 | 3.7419280205833876 |
resnet50_v1.5 | 1 | 32 | True | True | 551.3152982879091 | 1.9256151492887408 |
resnet50_v1.5 | 1 | 64 | False | False | 340.3455581202521 | 0.4352822822221454 |
resnet50_v1.5 | 1 | 64 | False | True | 341.4178425579963 | 1.399820974317093 |
resnet50_v1.5 | 1 | 64 | True | False | 679.092793370331 | 1.811493519714515 |
resnet50_v1.5 | 1 | 64 | True | True | 679.4220076333565 | 0.4846137767752205 |
resnet50_v1.5 | 1 | 128 | False | False | 365.97794971142156 | 0.6660901988191761 |
resnet50_v1.5 | 1 | 128 | False | True | 365.1002182169415 | 0.290659074503093 |
resnet50_v1.5 | 1 | 128 | True | False | 780.7795427908641 | 2.6289190299425735 |
resnet50_v1.5 | 1 | 128 | True | True | 782.0787833811031 | 1.4035397167521415 |
resnet50_v1.5 | 1 | 256 | True | False | 841.0209098895596 | 0.9327257479352289 |
resnet50_v1.5 | 1 | 256 | True | True | 841.6281134422322 | 0.6754027617726148 |
resnet50_v1.5 | 2 | 32 | False | False | 597.7123015353376 | 2.5358048324228486 |
resnet50_v1.5 | 2 | 32 | False | True | 602.1851277710804 | 1.4932197457537746 |
resnet50_v1.5 | 2 | 32 | True | True | 1101.2976316254073 | 3.1791943633783832 |
resnet50_v1.5 | 2 | 64 | False | False | 680.4331810776715 | 1.1986421991311502 |
resnet50_v1.5 | 2 | 64 | False | True | 680.168662439279 | 1.628589727697411 |
resnet50_v1.5 | 2 | 64 | True | False | 1350.6915781119133 | 4.771936223937959 |
resnet50_v1.5 | 2 | 64 | True | True | 1337.9212057737357 | 4.3318244554829635 |
resnet50_v1.5 | 2 | 128 | False | False | 722.1292515998915 | 1.7079082542095976 |
resnet50_v1.5 | 2 | 128 | False | True | 724.5907650506392 | 1.3016121917621204 |
resnet50_v1.5 | 2 | 128 | True | False | 1560.6063143166916 | 5.574390740438405 |
resnet50_v1.5 | 2 | 128 | True | True | 1558.8925541664316 | 3.7645037153069234 |
resnet50_v1.5 | 2 | 256 | True | False | 1681.9367020855304 | 2.135490739835972 |
resnet50_v1.5 | 2 | 256 | True | True | 1679.1043258972677 | 1.9428881159794065 |
resnet50_v1.5 | 4 | 32 | False | False | 1137.8151032757787 | 3.6070602577235196 |
resnet50_v1.5 | 4 | 32 | False | True | 1147.7653259816507 | 10.086122200344256 |
resnet50_v1.5 | 4 | 32 | True | False | 2090.181533002181 | 27.753203322924346 |
resnet50_v1.5 | 4 | 32 | True | True | 2118.951749390374 | 9.671318775529995 |
resnet50_v1.5 | 4 | 64 | False | False | 1323.4940363814712 | 2.4288696630804347 |
resnet50_v1.5 | 4 | 64 | False | True | 1336.5457321023537 | 2.8028756633168586 |
resnet50_v1.5 | 4 | 64 | True | False | 2627.208652367316 | 30.590239282719534 |
resnet50_v1.5 | 4 | 64 | True | True | 2605.2372571940177 | 16.77388699379653 |
resnet50_v1.5 | 4 | 128 | False | False | 1428.9004479146274 | 5.19872681737769 |
resnet50_v1.5 | 4 | 128 | False | True | 1431.0649693247092 | 2.1140696182827456 |
resnet50_v1.5 | 4 | 128 | True | False | 3038.123973486709 | 32.14983993002834 |
resnet50_v1.5 | 4 | 128 | True | True | 3050.6240630834754 | 13.087954923831944 |
resnet50_v1.5 | 4 | 256 | True | False | 3269.060930519893 | 18.13964231626883 |
resnet50_v1.5 | 4 | 256 | True | True | 3288.0027951477073 | 10.833033249948164 |
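The mean and standard deviation columns above could be produced from per-trial throughputs along the following lines. This is a minimal sketch: the trial values are made-up placeholders rather than measurements, the helper name `summarize` is hypothetical, and the use of the sample (n-1) standard deviation is an assumption, since the page does not say which formula was used.

```python
import statistics


def summarize(img_per_sec_trials):
    """Return (mean, sample standard deviation) for one configuration."""
    mean = statistics.mean(img_per_sec_trials)
    # statistics.stdev is the sample (n-1) estimator; statistics.pstdev
    # would be the choice if the population formula was used instead.
    std = statistics.stdev(img_per_sec_trials)
    return mean, std


# Hypothetical placeholder values for five trials, NOT measured data.
example_trials = [100.0, 102.0, 98.0, 101.0, 99.0]
mean, std = summarize(example_trials)
print(f"mean={mean:.2f} std={std:.2f} relative={std / mean:.2%}")
```

Dividing the standard deviation by the mean gives a relative spread that is easier to compare across configurations. For example, the resnet50_v1.5 row with 4 GPUs, batch size 32, fp16 and real data has a standard deviation of about 1.3% of its mean.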