This benchmark has a short running time (wall time), so build it with low optimization (-O0) to make the individual routines appear in the profile report. Defining TUNED (-DTUNED) makes the code call separate functions for the hot loops instead of running the inlined versions in main(). This is not the best way to build the code for performance, but it makes for a clearer profiling demonstration here.
gcc -g -fopenmp -pg -O0 -DTUNED stream.c
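To picture what -DTUNED changes, here is a minimal sketch of the pattern. It is not the actual stream.c source: a preprocessor macro selects between a loop written inline in the caller and a call to a separate function that gprof can list by name. The names run_triad, tuned_triad, init_arrays, demo, and N are invented for this illustration.

```c
#include <assert.h>

#define N 1000  /* illustrative array length, far smaller than STREAM uses */

static double a[N], b[N], c[N];

/* Fill the arrays with known values so the result is easy to check. */
void init_arrays(void)
{
    int j;
    for (j = 0; j < N; j++) {
        a[j] = 0.0;
        b[j] = 2.0;
        c[j] = 1.0;
    }
}

/* Separate function: can appear under its own name in a flat profile. */
void tuned_triad(double scalar)
{
    int j;
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (j = 0; j < N; j++)
        a[j] = b[j] + scalar * c[j];
}

/* Caller: the TUNED macro decides which version of the hot loop runs. */
void run_triad(double scalar)
{
#ifdef TUNED
    tuned_triad(scalar);        /* time is charged to tuned_triad */
#else
    int j;
    for (j = 0; j < N; j++)     /* time is charged to the caller */
        a[j] = b[j] + scalar * c[j];
#endif
}

/* Small usage example: fill the arrays, run the kernel, verify one value. */
void demo(void)
{
    init_arrays();
    run_triad(3.0);
    assert(a[0] == 5.0);        /* 2.0 + 3.0 * 1.0 */
}
```

Built once with -DTUNED and once without, and profiled at -O0, the first build is expected to list tuned_triad as its own row, while the second charges the loop to run_triad.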
galen@galen-SVE14A27CXH:~/stream$ OMP_NUM_THREADS=1 ./a.out
...
Triad:           8806.7     0.057832     0.054504     0.076255
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
galen@galen-SVE14A27CXH:~/stream$ gprof
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 99.55     22.38    22.38      100   223.79   223.79  tuned_STREAM_Triad
  0.62     22.52     0.14        1   140.24   140.24  checkSTREAMresults
  0.00     22.52     0.00     1399     0.00     0.00  mysecond
  0.00     22.52     0.00      100     0.00     0.00  tuned_STREAM_Add
  0.00     22.52     0.00      100     0.00     0.00  tuned_STREAM_Copy
  0.00     22.52     0.00      100     0.00     0.00  tuned_STREAM_Scale
  0.00     22.52     0.00        1     0.00     0.00  checktick

galen@galen-SVE14A27CXH:~/stream$ OMP_NUM_THREADS=4 ./a.out
...
Triad:          15582.3     0.032002     0.030804     0.039081
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
galen@galen-SVE14A27CXH:~/stream$ gprof
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 99.68     28.28    28.28      100   282.79   282.79  tuned_STREAM_Triad
  0.49     28.42     0.14        1   140.24   140.24  checkSTREAMresults
  0.00     28.42     0.00     1429     0.00     0.00  mysecond
  0.00     28.42     0.00      100     0.00     0.00  tuned_STREAM_Add
  0.00     28.42     0.00      100     0.00     0.00  tuned_STREAM_Copy
  0.00     28.42     0.00      100     0.00     0.00  tuned_STREAM_Scale
  0.00     28.42     0.00        1     0.00     0.00  checktick
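The profiles above charge more time to tuned_STREAM_Triad with four threads (28.28 s) than with one (22.38 s), even though the Triad rate reported by STREAM itself rises from 8806.7 to 15582.3 MB/s. gprof samples CPU time rather than wall time, so the larger figure most likely reflects CPU time accumulated across threads, not a slower run. With more optimization (-O3) the improvement from threads is small. Notice also how the per-routine profile information is lost for this short-running benchmark once it is optimized: most samples are attributed to frame_dummy and no call counts are recorded.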
gcc -g -fopenmp -pg -O3 -DTUNED stream.c
galen@galen-SVE14A27CXH:~/stream$ OMP_NUM_THREADS=1 time ./a.out | grep Triad:
Triad:          13140.6     0.036906     0.036528     0.042073
12.04user 0.16system 0:12.20elapsed 100%CPU (0avgtext+0avgdata 470832maxresident)k
0inputs+24outputs (0major+117281minor)pagefaults 0swaps
galen@galen-SVE14A27CXH:~/stream$ OMP_NUM_THREADS=4 time ./a.out | grep Triad:
Triad:          15570.3     0.031303     0.030828     0.039431
40.65user 0.27system 0:10.48elapsed 390%CPU (0avgtext+0avgdata 470896maxresident)k
0inputs+8outputs (0major+117289minor)pagefaults 0swaps
galen@galen-SVE14A27CXH:~/stream$ gprof | head -20
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 91.72     19.42    19.42                             frame_dummy
  0.33     19.49     0.07                             checkSTREAMresults
...
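The two time lines above can be read with a little arithmetic. The sketch below uses helper names invented here (cpu_percent, wall_speedup, report). It computes CPU utilization as (user + system) / elapsed, which matches the %CPU field in both runs, and the wall-clock speedup of the four-thread run over the single-thread run.

```c
#include <assert.h>
#include <stdio.h>

/* CPU utilization as a percentage: (user + system) / elapsed * 100. */
double cpu_percent(double user_s, double sys_s, double elapsed_s)
{
    assert(elapsed_s > 0.0);
    return 100.0 * (user_s + sys_s) / elapsed_s;
}

/* Wall-clock speedup of a run relative to a baseline run. */
double wall_speedup(double baseline_elapsed_s, double elapsed_s)
{
    assert(elapsed_s > 0.0);
    return baseline_elapsed_s / elapsed_s;
}

/* Print the figures for the two -O3 runs shown above. */
void report(void)
{
    printf("1 thread:  %.1f%% CPU\n", cpu_percent(12.04, 0.16, 12.20));
    printf("4 threads: %.1f%% CPU\n", cpu_percent(40.65, 0.27, 10.48));
    printf("wall-clock speedup: %.2f\n", wall_speedup(12.20, 10.48));
}
```

With the numbers above this gives about 100% and 390% CPU and a wall-clock speedup of roughly 1.16. The four-thread run spends over three times the CPU time (40.92 s versus 12.20 s) for a modest reduction in elapsed time.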
References
- man gprof
- gprof --help