This code has a short running time (wall time), so build it with optimization turned off (-O0) to make the routines appear in the profile report.  Defining TUNED (-DTUNED) forces the code to call functions for the hot loops instead of using the versions inlined in main().  This is not the best way to build the code for performance, but it makes the profiling demonstration clearer.
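
As a rough illustration of why a separate function helps the profiler, the sketch below puts a Triad-style loop in its own routine so that there is a symbol to attribute time to.  The array names, the size N, and both function names are assumptions made for this example; it is not a copy of stream.c.

```c
#include <stddef.h>

/* Illustrative size only; stream.c defines its own array size. */
#define N 1000

static double a[N], b[N], c[N];

/* Hot loop in its own function: built with -pg at -O0, a routine
 * like this gets its own line in the gprof flat profile. */
void tuned_triad_sketch(double scalar)
{
    long j;
#pragma omp parallel for
    for (j = 0; j < N; j++)
        a[j] = b[j] + scalar * c[j];
}

/* Hypothetical driver: fill the inputs, run the kernel ntimes, and
 * return the last element so the result can be checked. */
double run_triad_sketch(int ntimes, double scalar)
{
    for (size_t j = 0; j < N; j++) {
        b[j] = 2.0;
        c[j] = 1.0;
    }
    for (int k = 0; k < ntimes; k++)
        tuned_triad_sketch(scalar);
    return a[N - 1];
}
```

Calling run_triad_sketch(10, 3.0) from main() leaves every element of a at 2.0 + 3.0 * 1.0.  When the same loop is written directly inside main(), the profiler can only charge its time to main().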

gcc -g -fopenmp -pg -O0 -DTUNED stream.c
galen@galen-SVE14A27CXH:~/stream$ OMP_NUM_THREADS=1 ./a.out
...
Triad:           8806.7     0.057832     0.054504     0.076255
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

galen@galen-SVE14A27CXH:~/stream$ gprof
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 99.55     22.38    22.38      100   223.79   223.79  tuned_STREAM_Triad
  0.62     22.52     0.14        1   140.24   140.24  checkSTREAMresults
  0.00     22.52     0.00     1399     0.00     0.00  mysecond
  0.00     22.52     0.00      100     0.00     0.00  tuned_STREAM_Add
  0.00     22.52     0.00      100     0.00     0.00  tuned_STREAM_Copy
  0.00     22.52     0.00      100     0.00     0.00  tuned_STREAM_Scale
  0.00     22.52     0.00        1     0.00     0.00  checktick

galen@galen-SVE14A27CXH:~/stream$ OMP_NUM_THREADS=4 ./a.out
...
Triad:          15582.3     0.032002     0.030804     0.039081
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
galen@galen-SVE14A27CXH:~/stream$ gprof
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 99.68     28.28    28.28      100   282.79   282.79  tuned_STREAM_Triad
  0.49     28.42     0.14        1   140.24   140.24  checkSTREAMresults
  0.00     28.42     0.00     1429     0.00     0.00  mysecond
  0.00     28.42     0.00      100     0.00     0.00  tuned_STREAM_Add
  0.00     28.42     0.00      100     0.00     0.00  tuned_STREAM_Copy
  0.00     28.42     0.00      100     0.00     0.00  tuned_STREAM_Scale
  0.00     28.42     0.00        1     0.00     0.00  checktick


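The profiles above charge more time to tuned_STREAM_Triad with four threads (28.28 seconds) than with one thread (22.38 seconds), even though the Triad rate printed by the benchmark rises from 8806.7 to 15582.3.  With more optimization (-O3) there is a small improvement with threads: the elapsed time drops from 12.20 to 10.48 seconds.  Notice how the profile information is lost for this short wall time benchmark: at -O3 gprof attributes most of the time to frame_dummy and reports no call counts.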
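
To put numbers on the comparison of the two -O0 runs, a small sketch can compute the ratios directly.  The values are copied from the output above; the helper names ratio and report_O0_runs are only for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Ratio of a 4-thread measurement to the matching 1-thread measurement. */
double ratio(double four_threads, double one_thread)
{
    assert(one_thread > 0.0);
    return four_threads / one_thread;
}

/* Print both ratios for the -O0 runs. */
void report_O0_runs(void)
{
    /* gprof self seconds for tuned_STREAM_Triad */
    printf("profiled time ratio: %.2f\n", ratio(28.28, 22.38));
    /* Triad rate as printed by the benchmark */
    printf("Triad rate ratio:    %.2f\n", ratio(15582.3, 8806.7));
}
```

Calling report_O0_runs() from main() gives roughly 1.26 for the profiled time and 1.77 for the Triad rate, so the two measurements move in opposite directions for the same pair of runs.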

gcc -g -fopenmp -pg -O3 -DTUNED stream.c
galen@galen-SVE14A27CXH:~/stream$ OMP_NUM_THREADS=1 time ./a.out | grep Triad:
Triad:          13140.6     0.036906     0.036528     0.042073
12.04user 0.16system 0:12.20elapsed 100%CPU (0avgtext+0avgdata 470832maxresident)k
0inputs+24outputs (0major+117281minor)pagefaults 0swaps
galen@galen-SVE14A27CXH:~/stream$ OMP_NUM_THREADS=4 time ./a.out | grep Triad:
Triad:          15570.3     0.031303     0.030828     0.039431
40.65user 0.27system 0:10.48elapsed 390%CPU (0avgtext+0avgdata 470896maxresident)k
0inputs+8outputs (0major+117289minor)pagefaults 0swaps
galen@galen-SVE14A27CXH:~/stream$ gprof | head -20
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  Ts/call  Ts/call  name    
 91.72     19.42    19.42                             frame_dummy
  0.33     19.49     0.07                             checkSTREAMresults
...
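
The output of time for the two -O3 runs can be read the same way.  The sketch below is one way to turn the user, system, and elapsed fields into a speedup, an average number of busy cores, and a parallel efficiency; the function names are assumptions for this example.

```c
#include <assert.h>

/* Wall-clock speedup: single-thread elapsed time over threaded elapsed time. */
double speedup(double elapsed_one, double elapsed_many)
{
    assert(elapsed_many > 0.0);
    return elapsed_one / elapsed_many;
}

/* Average number of busy cores: CPU time (user + system) per second of
 * wall time.  Multiplied by 100, this corresponds to the %CPU field. */
double busy_cores(double user, double sys, double elapsed)
{
    assert(elapsed > 0.0);
    return (user + sys) / elapsed;
}

/* Fraction of the ideal speedup actually achieved with the given threads. */
double efficiency(double achieved_speedup, int threads)
{
    assert(threads > 0);
    return achieved_speedup / threads;
}
```

With the numbers above, speedup(12.20, 10.48) is about 1.16, busy_cores(40.65, 0.27, 10.48) is about 3.90 (matching the 390%CPU field), and efficiency(1.16, 4) is about 0.29.  Four threads keep almost four cores busy, yet the wall time improves by only about 16 percent.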

references

  • man gprof
  • gprof --help