Vendor compilers typically provide the most information about their optimizations of your code.  Some may also provide an inline source listing (Cray and Intel below) where the optimization comments and labels appear next to the code.

Cray compiler (craycc: defaults -O3 and OpenMP enabled)
arnoldg@h2ologin4:~/stream> cc -c -hmsgs -hlist=ai -DTUNED stream.c
arnoldg@h2ologin4:~/stream> cat stream.lst
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
                          S u m m a r y   R e p o r t
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Compilation
-----------
File     : /mnt/a/u/staff/arnoldg/stream/stream.c
Compiled : 2021-03-05  09:16:31
Compiler : Version 8.7.7
Ftnlx    : Version 8503 (libcif 85008)
Target   : x86-64
Command  : driver.cc -h cpu=interlagos -h static -D __CRAYXE
           -D __CRAY_INTERLAGOS -D __CRAYXT_COMPUTE_LINUX_TARGET
           -h network=gemini -c -h msgs -h list=ai -D TUNED stream.c
           -isystem /opt/cray/cce/8.7.7/cce/x86_64/include/craylibs
           -isystem /opt/cray/cce/8.7.7/cce/x86_64/include/basic
           -isystem /opt/gcc/6.1.0/snos/lib/gcc/x86_64-suse-linux/6.1.0/include
           -isystem /opt/gcc/6.1.0/snos/lib/gcc/x86_64-suse-linux/6.1.0/include-
           fixed -isystem /opt/gcc/6.1.0/snos/include -isystem /usr/include
           -I /opt/cray/mpt/7.7.4/gni/mpich-cray/8.6/include
           -I /opt/cray/libsci/18.12.1/CRAY/8.6/x86_64/include
           -I /opt/cray/rca/1.0.0-2.0502.60530.1.63.gem/include
           -I /opt/cray/alps/5.2.4-2.0502.9774.31.12.gem/include
           -I /opt/cray/xpmem/0.1-2.0502.64982.5.3.gem/include
           -I /opt/cray/gni-headers/4.0-1.0502.10859.7.8.gem/include
           -I /opt/cray/dmapp/7.0.1-1.0502.11080.8.74.gem/include
           -I /opt/cray/pmi/5.0.14/include
           -I /opt/cray/ugni/6.0-1.0502.10863.8.28.gem/include
           -I /opt/cray/udreg/2.3.2-1.0502.10518.2.17.gem/include
           -I /usr/local/include
           -I /opt/cray/wlm_detect/1.0-1.0502.64649.2.2.gem/include
           -I /opt/cray/krca/1.0.0-2.0502.63139.4.30.gem/include
           -I /opt/cray-hss-devel/7.2.0/include

clx report
------------
Source   : /mnt/a/u/staff/arnoldg/stream/stream.c
Date     : 03/05/2021  09:16:32


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
                          O p t i o n s   R e p o r t
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Options :  -h cache2,scalar2,thread2,vector2,mpi0,ipa3,noaggress
           -h autoprefetch,noautothread,fusion2,msgs,nonegmsgs
           -h nooverindex,pattern,unroll2,nozeroinc
           -h noadd_paren,noupc,dwarf,fma,nofp_trap,nofunc_trace
           -h noomp_analyze,noomp_trace,nopat_trace
           -h omp,noacc
           -h c99,noexceptions,noconform,noinfinitevl
           -h notolerant,gnu
           -h safe_addr,thread_do_concurrent,fp2=approx,flex_mp=default
           -h alias=default:standard_restrict
           -h static (or -static)
           -h cpu=x86-64,interlagos
           -h network=gemini
           -K trap=none


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
                          S o u r c e   L i s t i n g
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


     %%%    L o o p m a r k   L e g e n d    %%%

     Primary Loop Type        Modifiers
     ------- ---- ----        ---------
     A - Pattern matched      a - atomic memory operation
                              b - blocked
     C - Collapsed            c - conditional and/or computed
     D - Deleted
     E - Cloned
     F - Flat - No calls      f - fused
     G - Accelerated          g - partitioned
     I - Inlined              i - interchanged
     M - Multithreaded        m - partitioned
                              n - non-blocking remote transfer
                              p - partial
     R - Rerolling            r - unrolled
                              s - shortloop
     V - Vectorized           w - unwound

     + - More messages listed at end of listing
     ------------------------------------------


    1.                      /*-----------------------------------------------------------------------*/
    2.                      /* Program: Stream                                                       */
    3.                      /* Revision: $Id: stream.c,v 5.9 2009/04/11 16:35:00 mccalpin Exp $ */
    4.                      /* Original code developed by John D. McCalpin                           */
    5.                      /* Programmers: John D. McCalpin                                         */
    6.                      /*              Joe R. Zagar                                             */
    7.                      /*                                                                       */
    8.                      /* This program measures memory transfer rates in MB/s for simple        */
    9.                      /* computational kernels coded in C.                                     */
   10.                      /*-----------------------------------------------------------------------*/
   11.                      /* Copyright 1991-2005: John D. McCalpin                                 */
   12.                      /*-----------------------------------------------------------------------*/
   13.                      /* License:                                                              */
   14.                      /*  1. You are free to use this program and/or to redistribute           */
   15.                      /*     this program.                                                     */
   16.                      /*  2. You are free to modify this program for your own use,             */
   17.                      /*     including commercial use, subject to the publication              */
   18.                      /*     restrictions in item 3.                                           */
   19.                      /*  3. You are free to publish results obtained from running this        */
   20.                      /*     program, or from works that you derive from this program,         */
   21.                      /*     with the following limitations:                                   */
   22.                      /*     3a. In order to be referred to as "STREAM benchmark results",     */
   23.                      /*         published results must be in conformance to the STREAM        */
   24.                      /*         Run Rules, (briefly reviewed below) published at              */
   25.                      /*         http://www.cs.virginia.edu/stream/ref.html                    */
   26.                      /*         and incorporated herein by reference.                         */
   27.                      /*         As the copyright holder, John McCalpin retains the            */
   28.                      /*         right to determine conformity with the Run Rules.             */
   29.                      /*     3b. Results based on modified source code or on runs not in       */
   30.                      /*         accordance with the STREAM Run Rules must be clearly          */
   31.                      /*         labelled whenever they are published.  Examples of            */
   32.                      /*         proper labelling include:                                     */
   33.                      /*         "tuned STREAM benchmark results"                              */
   34.                      /*         "based on a variant of the STREAM benchmark code"             */
   35.                      /*         Other comparable, clear and reasonable labelling is           */
   36.                      /*         acceptable.                                                   */
   37.                      /*     3c. Submission of results to the STREAM benchmark web site        */
   38.                      /*         is encouraged, but not required.                              */
   39.                      /*  4. Use of this program or creation of derived works based on this    */
   40.                      /*     program constitutes acceptance of these licensing restrictions.   */
   41.                      /*  5. Absolutely no warranty is expressed or implied.                   */
   42.                      /*-----------------------------------------------------------------------*/
   43.                      # include <stdio.h>
   44.                      # include <math.h>
   45.                      # include <float.h>
   46.                      # include <limits.h>
   47.                      # include <sys/time.h>
   48.
   49.                      /* INSTRUCTIONS:
   50.                       *
   51.                       *  1) Stream requires a good bit of memory to run.  Adjust the
   52.                       *          value of 'N' (below) to give a 'timing calibration' of
   53.                       *          at least 20 clock-ticks.  This will provide rate estimates
   54.                       *          that should be good to about 5% precision.
   55.                       */
   56.
   57.                      #ifndef N
   58.                      #   define N        40000000
   59.                      #endif
   60.                      #ifndef NTIMES
   61.                      #   define NTIMES   10
   62.                      #endif
   63.                      #ifndef OFFSET
   64.                      #   define OFFSET   0
   65.                      #endif
   66.
   67.                      /*
   68.                       *  3) Compile the code with full optimization.  Many compilers
   69.                       *     generate unreasonably bad code before the optimizer tightens
   70.                       *     things up.  If the results are unreasonably good, on the
   71.                       *     other hand, the optimizer might be too smart for me!
   72.                       *
   73.                       *         Try compiling with:
   74.                       *               cc -O stream_omp.c -o stream_omp
   75.                       *
   76.                       *         This is known to work on Cray, SGI, IBM, and Sun machines.
   77.                       *
   78.                       *
   79.                       *  4) Mail the results to mccalpin@cs.virginia.edu
   80.                       *     Be sure to include:
   81.                       *          a) computer hardware model number and software revision
   82.                       *          b) the compiler flags
   83.                       *          c) all of the output from the test case.
   84.                       * Thanks!
   85.                       *
   86.                       */
   87.
   88.                      # define HLINE "-------------------------------------------------------------\n"
   89.
   90.                      # ifndef MIN
   91.                      # define MIN(x,y) ((x)<(y)?(x):(y))
   92.                      # endif
   93.                      # ifndef MAX
   94.                      # define MAX(x,y) ((x)>(y)?(x):(y))
   95.                      # endif
   96.
   97.                      static double       a[N+OFFSET],
   98.                                  b[N+OFFSET],
   99.                                  c[N+OFFSET];
  100.
  101.                      static double       avgtime[4] = {0}, maxtime[4] = {0},
  102.                                  mintime[4] = {FLT_MAX,FLT_MAX,FLT_MAX,FLT_MAX};
  103.
  104.                      static char *label[4] = {"Copy:      ", "Scale:     ",
  105.                          "Add:       ", "Triad:     "};
  106.
  107.                      static double       bytes[4] = {
  108.                          2 * sizeof(double) * N,
  109.                          2 * sizeof(double) * N,
  110.                          3 * sizeof(double) * N,
  111.                          3 * sizeof(double) * N
  112.                          };
  113.
  114.                      extern double mysecond();
  115.                      extern void checkSTREAMresults();
  116.                      #ifdef TUNED
  117.                      extern void tuned_STREAM_Copy();
  118.                      extern void tuned_STREAM_Scale(double scalar);
  119.                      extern void tuned_STREAM_Add();
  120.                      extern void tuned_STREAM_Triad(double scalar);
  121.                      #endif
  122.                      #ifdef _OPENMP
  123.                      extern int omp_get_num_threads();
  124.                      #endif
  125.                      int
  126.                      main()
  127.                          {
  128.                          int                     quantum, checktick();
  129.                          int                     BytesPerWord;
  130.                          register int    j, k;
  131.                          double          scalar, t, times[4][NTIMES];
  132.
  133.                          /* --- SETUP --- determine precision and check timing --- */
  134.
  135.  +                       printf(HLINE);
                                ^
CC-3021 CC: IPA main, File = stream.c, Line = 135, Column = 5
  "printf" (called from "main") was not inlined because the compiler was unable to locate the routine.

  136.  +                       printf("STREAM version $Revision: 5.9 $\n");
                                ^
CC-3021 CC: IPA main, File = stream.c, Line = 136, Column = 5
  "printf" (called from "main") was not inlined because the compiler was unable to locate the routine.

  137.  +                       printf(HLINE);
                                ^
CC-3021 CC: IPA main, File = stream.c, Line = 137, Column = 5
  "printf" (called from "main") was not inlined because the compiler was unable to locate the routine.

  138.                          BytesPerWord = sizeof(double);
  139.  +                       printf("This system uses %d bytes per DOUBLE PRECISION word.\n",
                                ^
CC-3021 CC: IPA main, File = stream.c, Line = 139, Column = 5
  "printf" (called from "main") was not inlined because the compiler was unable to locate the routine.

  140.                          BytesPerWord);
  141.
  142.  +                       printf(HLINE);
                                ^
CC-3021 CC: IPA main, File = stream.c, Line = 142, Column = 5
  "printf" (called from "main") was not inlined because the compiler was unable to locate the routine.

  143.                      #ifdef NO_LONG_LONG
  144.                          printf("Array size = %d, Offset = %d\n" , N, OFFSET);
  145.                      #else
  146.  +                       printf("Array size = %llu, Offset = %d\n", (unsigned long long) N, OFFSET);
                                ^
CC-3021 CC: IPA main, File = stream.c, Line = 146, Column = 5
  "printf" (called from "main") was not inlined because the compiler was unable to locate the routine.

  147.                      #endif
  148.
  149.  +                       printf("Total memory required = %.1f MB.\n",
                                ^
CC-3021 CC: IPA main, File = stream.c, Line = 149, Column = 5
  "printf" (called from "main") was not inlined because the compiler was unable to locate the routine.

  150.                          (3.0 * BytesPerWord) * ( (double) N / 1048576.0));
  151.  +                       printf("Each test is run %d times, but only\n", NTIMES);
                                ^
CC-3021 CC: IPA main, File = stream.c, Line = 151, Column = 5
  "printf" (called from "main") was not inlined because the compiler was unable to locate the routine.

  152.  +                       printf("the *best* time for each is used.\n");
                                ^
CC-3021 CC: IPA main, File = stream.c, Line = 152, Column = 5
  "printf" (called from "main") was not inlined because the compiler was unable to locate the routine.

  153.
  154.                      #ifdef _OPENMP
  155.  +                       printf(HLINE);
                                ^
CC-3021 CC: IPA main, File = stream.c, Line = 155, Column = 5
  "printf" (called from "main") was not inlined because the compiler was unable to locate the routine.

  156.                      #pragma omp parallel
  157.       M------------<     {
CC-6823 CC: THREAD main, File = stream.c, Line = 157
  A region starting at line 157 and ending at line 163 was multi-threaded.

  158.       M              #pragma omp master
  159.       M                  {
  160.  +    M                      k = omp_get_num_threads();
                                    ^
CC-3021 CC: IPA main, File = stream.c, Line = 160, Column = 6
  "omp_get_num_threads" (called from "main") was not inlined because the compiler was unable to locate the routine.

  161.  +    M                      printf ("Number of Threads requested = %i\n",k);
                                    ^
CC-3021 CC: IPA main, File = stream.c, Line = 161, Column = 6
  "printf" (called from "main") was not inlined because the compiler was unable to locate the routine.

  162.       M                      }
  163.       M------------>     }
  164.                      #endif
  165.
  166.  +                       printf(HLINE);
                                ^
CC-3021 CC: IPA main, File = stream.c, Line = 166, Column = 5
  "printf" (called from "main") was not inlined because the compiler was unable to locate the routine.

  167.                      #pragma omp parallel
  168.  + M-<                   {
CC-6831 CC: THREAD main, File = stream.c, Line = 168
  An expanded multi-threaded region was created starting near line 168 and ending near line 178.

CC-6824 CC: THREAD main, File = stream.c, Line = 168
  A region starting at line 168 and ending at line 170 was multi-threaded and merged with an expanded multi-thread region.

  169.  + M                     printf ("Printing one line per active thread....\n");
                                ^
CC-3021 CC: IPA main, File = stream.c, Line = 169, Column = 5
  "printf" (called from "main") was not inlined because the compiler was unable to locate the routine.

  170.    M                     }
  171.    M
  172.    M                     /* Get initial value for system clock. */
  173.    M                 #pragma omp parallel for
  174.  + M  mA-----------<     for (j=0; j<N; j++) {
CC-6230 CC: VECTOR main, File = stream.c, Line = 174
  A loop was replaced with multiple library calls.

CC-6824 CC: THREAD main, File = stream.c, Line = 174
  A region starting at line 174 and ending at line 178 was multi-threaded and merged with an expanded multi-thread region.

CC-6817 CC: THREAD main, File = stream.c, Line = 174
  A loop was partitioned.

  175.    M  mA                 a[j] = 1.0;
  176.    M  mA                 b[j] = 2.0;
  177.    M  mA                 c[j] = 0.0;
  178.    M->mA----------->     }
  179.
  180.  +                       printf(HLINE);
                                ^
CC-3021 CC: IPA main, File = stream.c, Line = 180, Column = 5
  "printf" (called from "main") was not inlined because the compiler was unable to locate the routine.

  181.
  182.  +                       if  ( (quantum = checktick()) >= 1)
                                ^
CC-3118 CC: IPA main, File = stream.c, Line = 182, Column = 5
  "checktick" (called from "main") was not inlined because the call site will not flatten.  "gettimeofday" is missing.

  183.  +                       printf("Your clock granularity/precision appears to be "
                                ^
CC-3021 CC: IPA main, File = stream.c, Line = 183, Column = 2
  "printf" (called from "main") was not inlined because the compiler was unable to locate the routine.

  184.                              "%d microseconds.\n", quantum);
  185.                          else {
  186.  +                       printf("Your clock granularity appears to be "
                                ^
CC-3021 CC: IPA main, File = stream.c, Line = 186, Column = 2
  "printf" (called from "main") was not inlined because the compiler was unable to locate the routine.

  187.                              "less than one microsecond.\n");
  188.                          quantum = 1;
  189.                          }
  190.
  191.  +                       t = mysecond();
                                ^
CC-3118 CC: IPA main, File = stream.c, Line = 191, Column = 5
  "mysecond" (called from "main") was not inlined because the call site will not flatten.  "gettimeofday" is missing.

  192.                      #pragma omp parallel for
  193.       MmVr4--------<     for (j = 0; j < N; j++)
CC-6005 CC: SCALAR main, File = stream.c, Line = 193
  A loop was unrolled 4 times.

CC-6823 CC: THREAD main, File = stream.c, Line = 193
  A region starting at line 193 and ending at line 194 was multi-threaded.

CC-6204 CC: VECTOR main, File = stream.c, Line = 193
  A loop was vectorized.

CC-6817 CC: THREAD main, File = stream.c, Line = 193
  A loop was partitioned.

  194.       MmVr4-------->     a[j] = 2.0E0 * a[j];
  195.  +                       t = 1.0E6 * (mysecond() - t);
                                ^
CC-3118 CC: IPA main, File = stream.c, Line = 195, Column = 5
  "mysecond" (called from "main") was not inlined because the call site will not flatten.  "gettimeofday" is missing.

  196.
  197.  +                       printf("Each test below will take on the order"
                                ^
CC-3021 CC: IPA main, File = stream.c, Line = 197, Column = 5
  "printf" (called from "main") was not inlined because the compiler was unable to locate the routine.

  198.                          " of %d microseconds.\n", (int) t  );
  199.  +                       printf("   (= %d clock ticks)\n", (int) (t/quantum) );
                                ^
CC-3021 CC: IPA main, File = stream.c, Line = 199, Column = 5
  "printf" (called from "main") was not inlined because the compiler was unable to locate the routine.

  200.  +                       printf("Increase the size of the arrays if this shows that\n");
                                ^
CC-3021 CC: IPA main, File = stream.c, Line = 200, Column = 5
  "printf" (called from "main") was not inlined because the compiler was unable to locate the routine.

  201.  +                       printf("you are not getting at least 20 clock ticks per test.\n");
                                ^
CC-3021 CC: IPA main, File = stream.c, Line = 201, Column = 5
  "printf" (called from "main") was not inlined because the compiler was unable to locate the routine.

  202.
  203.  +                       printf(HLINE);
                                ^
CC-3021 CC: IPA main, File = stream.c, Line = 203, Column = 5
  "printf" (called from "main") was not inlined because the compiler was unable to locate the routine.

  204.
  205.  +                       printf("WARNING -- The above is only a rough guideline.\n");
                                ^
CC-3021 CC: IPA main, File = stream.c, Line = 205, Column = 5
  "printf" (called from "main") was not inlined because the compiler was unable to locate the routine.

  206.  +                       printf("For best results, please be sure you know the\n");
                                ^
CC-3021 CC: IPA main, File = stream.c, Line = 206, Column = 5
  "printf" (called from "main") was not inlined because the compiler was unable to locate the routine.

  207.  +                       printf("precision of your system timer.\n");
                                ^
CC-3021 CC: IPA main, File = stream.c, Line = 207, Column = 5
  "printf" (called from "main") was not inlined because the compiler was unable to locate the routine.

  208.  +                       printf(HLINE);
                                ^
CC-3021 CC: IPA main, File = stream.c, Line = 208, Column = 5
  "printf" (called from "main") was not inlined because the compiler was unable to locate the routine.

  209.
  210.                          /*      --- MAIN LOOP --- repeat test cases NTIMES times --- */
  211.
  212.                          scalar = 3.0;
  213.  +    1------------<     for (k=0; k<NTIMES; k++)
CC-6287 CC: VECTOR main, File = stream.c, Line = 213
  A loop was not vectorized because it contains a call to function "mysecond" on line 215.

  214.       1                  {
  215.  +    1                  times[0][k] = mysecond();
                                ^
CC-3118 CC: IPA main, File = stream.c, Line = 215, Column = 2
  "mysecond" (called from "main") was not inlined because the call site will not flatten.  "gettimeofday" is missing.

  216.       1              #ifdef TUNED
  217.       1 MmA I-----<>         tuned_STREAM_Copy();
CC-6202 CC: VECTOR main, File = stream.c, Line = 217
  A loop was replaced by a library call.

CC-6823 CC: THREAD main, File = stream.c, Line = 217
  A region starting at line 217 and ending at line 217 was multi-threaded.

CC-6817 CC: THREAD main, File = stream.c, Line = 217
  A loop was partitioned.

                                    ^
CC-3001 CC: IPA main, File = stream.c, Line = 217, Column = 9
  The call to tiny leaf routine "tuned_STREAM_Copy" was textually inlined.

  218.       1              #else
  219.       1              #pragma omp parallel for
  220.       1                  for (j=0; j<N; j++)
  221.       1                      c[j] = a[j];
  222.       1              #endif
  223.  +    1                  times[0][k] = mysecond() - times[0][k];
                                ^
CC-3118 CC: IPA main, File = stream.c, Line = 223, Column = 2
  "mysecond" (called from "main") was not inlined because the call site will not flatten.  "gettimeofday" is missing.

  224.       1
  225.  +    1                  times[1][k] = mysecond();
                                ^
CC-3118 CC: IPA main, File = stream.c, Line = 225, Column = 2
  "mysecond" (called from "main") was not inlined because the call site will not flatten.  "gettimeofday" is missing.

  226.       1              #ifdef TUNED
  227.       1 MmVr4 I---<>         tuned_STREAM_Scale(scalar);
CC-6005 CC: SCALAR main, File = stream.c, Line = 227
  A loop was unrolled 4 times.

CC-6823 CC: THREAD main, File = stream.c, Line = 227
  A region starting at line 227 and ending at line 227 was multi-threaded.

CC-6204 CC: VECTOR main, File = stream.c, Line = 227
  A loop was vectorized.

CC-6817 CC: THREAD main, File = stream.c, Line = 227
  A loop was partitioned.

                                    ^
CC-3001 CC: IPA main, File = stream.c, Line = 227, Column = 9
  The call to tiny leaf routine "tuned_STREAM_Scale" was textually inlined.

  228.       1              #else
  229.       1              #pragma omp parallel for
  230.       1                  for (j=0; j<N; j++)
  231.       1                      b[j] = scalar*c[j];
  232.       1              #endif
  233.  +    1                  times[1][k] = mysecond() - times[1][k];
                                ^
CC-3118 CC: IPA main, File = stream.c, Line = 233, Column = 2
  "mysecond" (called from "main") was not inlined because the call site will not flatten.  "gettimeofday" is missing.

  234.       1
  235.  +    1                  times[2][k] = mysecond();
                                ^
CC-3118 CC: IPA main, File = stream.c, Line = 235, Column = 2
  "mysecond" (called from "main") was not inlined because the call site will not flatten.  "gettimeofday" is missing.

  236.       1              #ifdef TUNED
  237.       1 MmVr4 I---<>         tuned_STREAM_Add();
CC-6005 CC: SCALAR main, File = stream.c, Line = 237
  A loop was unrolled 4 times.

CC-6823 CC: THREAD main, File = stream.c, Line = 237
  A region starting at line 237 and ending at line 237 was multi-threaded.

CC-6204 CC: VECTOR main, File = stream.c, Line = 237
  A loop was vectorized.

CC-6817 CC: THREAD main, File = stream.c, Line = 237
  A loop was partitioned.

                                    ^
CC-3001 CC: IPA main, File = stream.c, Line = 237, Column = 9
  The call to tiny leaf routine "tuned_STREAM_Add" was textually inlined.

  238.       1              #else
  239.       1              #pragma omp parallel for
  240.       1                  for (j=0; j<N; j++)
  241.       1                      c[j] = a[j]+b[j];
  242.       1              #endif
  243.  +    1                  times[2][k] = mysecond() - times[2][k];
                                ^
CC-3118 CC: IPA main, File = stream.c, Line = 243, Column = 2
  "mysecond" (called from "main") was not inlined because the call site will not flatten.  "gettimeofday" is missing.

  244.       1
  245.  +    1                  times[3][k] = mysecond();
                                ^
CC-3118 CC: IPA main, File = stream.c, Line = 245, Column = 2
  "mysecond" (called from "main") was not inlined because the call site will not flatten.  "gettimeofday" is missing.

  246.       1              #ifdef TUNED
  247.       1 MmVr4 I---<>         tuned_STREAM_Triad(scalar);
CC-6005 CC: SCALAR main, File = stream.c, Line = 247
  A loop was unrolled 4 times.

CC-6823 CC: THREAD main, File = stream.c, Line = 247
  A region starting at line 247 and ending at line 247 was multi-threaded.

CC-6204 CC: VECTOR main, File = stream.c, Line = 247
  A loop was vectorized.

CC-6817 CC: THREAD main, File = stream.c, Line = 247
  A loop was partitioned.

                                    ^
CC-3001 CC: IPA main, File = stream.c, Line = 247, Column = 9
  The call to tiny leaf routine "tuned_STREAM_Triad" was textually inlined.

  248.       1              #else
  249.       1              #pragma omp parallel for
  250.       1                  for (j=0; j<N; j++)
  251.       1                      a[j] = b[j]+scalar*c[j];
  252.       1              #endif
  253.  +    1                  times[3][k] = mysecond() - times[3][k];
                                ^
CC-3118 CC: IPA main, File = stream.c, Line = 253, Column = 2
  "mysecond" (called from "main") was not inlined because the call site will not flatten.  "gettimeofday" is missing.

  254.       1------------>     }
  255.
  256.                          /*      --- SUMMARY --- */
  257.
  258.  +    iVw----------<     for (k=1; k<NTIMES; k++) /* note -- skip first iteration */
CC-6007 CC: SCALAR main, File = stream.c, Line = 258
  A loop was interchanged with the loop starting at line 260.

CC-6373 CC: VECTOR main, File = stream.c, Line = 258
  A loop with a trip count of 9 was unwound into 2 vector iterations.

CC-6382 CC: VECTOR main, File = stream.c, Line = 258
  A loop was partially vector pipelined.

CC-6204 CC: VECTOR main, File = stream.c, Line = 258
  A loop was vectorized.

  259.       iVw                {
  260.  +    iVw i--------<     for (j=0; j<4; j++)
CC-6294 CC: VECTOR main, File = stream.c, Line = 260
  A loop was not vectorized because a better candidate was found at line 258.

  261.       iVw i                  {
  262.       iVw i                  avgtime[j] = avgtime[j] + times[j][k];
  263.       iVw i                  mintime[j] = MIN(mintime[j], times[j][k]);
  264.       iVw i                  maxtime[j] = MAX(maxtime[j], times[j][k]);
  265.       iVw i-------->         }
  266.       iVw---------->     }
  267.
  268.  +                       printf("Function      Rate (MB/s)   Avg time     Min time     Max time\n");
                                ^
CC-3021 CC: IPA main, File = stream.c, Line = 268, Column = 5
  "printf" (called from "main") was not inlined because the compiler was unable to locate the routine.

  269.  +    1------------<     for (j=0; j<4; j++) {
CC-6287 CC: VECTOR main, File = stream.c, Line = 269
  A loop was not vectorized because it contains a call to function "printf" on line 272.

  270.       1                  avgtime[j] = avgtime[j]/(double)(NTIMES-1);
  271.       1
  272.  +    1                  printf("%s%11.4f  %11.4f  %11.4f  %11.4f\n", label[j],
                                ^
CC-3021 CC: IPA main, File = stream.c, Line = 272, Column = 2
  "printf" (called from "main") was not inlined because the compiler was unable to locate the routine.

  273.       1                         1.0E-06 * bytes[j]/mintime[j],
  274.       1                         avgtime[j],
  275.       1                         mintime[j],
  276.       1                         maxtime[j]);
  277.       1------------>     }
  278.  +                       printf(HLINE);
                                ^
CC-3021 CC: IPA main, File = stream.c, Line = 278, Column = 5
  "printf" (called from "main") was not inlined because the compiler was unable to locate the routine.

  279.
  280.                          /* --- Check Results --- */
  281.  +                       checkSTREAMresults();
                                ^
CC-3118 CC: IPA main, File = stream.c, Line = 281, Column = 5
  "checkSTREAMresults" (called from "main") was not inlined because the call site will not flatten.  "printf" is missing.

  282.  +                       printf(HLINE);
                                ^
CC-3021 CC: IPA main, File = stream.c, Line = 282, Column = 5
  "printf" (called from "main") was not inlined because the compiler was unable to locate the routine.

  283.
  284.                          return 0;
  285.                      }
  286.
  287.                      # define    M       20
  288.
  289.                      int
  290.                      checktick()
  291.                          {
  292.                          int             i, minDelta, Delta;
  293.                          double  t1, t2, timesfound[M];
  294.
  295.                      /*  Collect a sequence of M unique time values from the system. */
  296.
  297.  +    1------------<     for (i = 0; i < M; i++) {
CC-6287 CC: VECTOR checktick, File = stream.c, Line = 297
  A loop was not vectorized because it contains a call to function "mysecond" on line 298.

  298.  +    1                  t1 = mysecond();
                                ^
CC-3118 CC: IPA checktick, File = stream.c, Line = 298, Column = 2
  "mysecond" (called from "checktick") was not inlined because the call site will not flatten.  "gettimeofday" is missing.

  299.  +    1 2----------<     while( ((t2=mysecond()) - t1) < 1.0E-6 )
CC-6287 CC: VECTOR checktick, File = stream.c, Line = 299
  A loop was not vectorized because it contains a call to function "mysecond" on line 299.

                                ^
CC-3118 CC: IPA checktick, File = stream.c, Line = 299, Column = 2
  "mysecond" (called from "checktick") was not inlined because the call site will not flatten.  "gettimeofday" is missing.

                                ^
CC-3118 CC: IPA checktick, File = stream.c, Line = 299, Column = 2
  "mysecond" (called from "checktick") was not inlined because the call site will not flatten.  "gettimeofday" is missing.

  300.       1 2---------->         ;
  301.       1                  timesfound[i] = t1 = t2;
  302.       1------------>     }
  303.
  304.                      /*
  305.                       * Determine the minimum difference between these M values.
  306.                       * This result will be our estimate (in microseconds) for the
  307.                       * clock granularity.
  308.                       */
  309.
  310.                          minDelta = 1000000;
  311.  +    Vw-----------<     for (i = 1; i < M; i++) {
CC-6373 CC: VECTOR checktick, File = stream.c, Line = 311
  A loop with a trip count of 19 was unwound into 4 vector iterations.

CC-6382 CC: VECTOR checktick, File = stream.c, Line = 311
  A loop was partially vector pipelined.

CC-6204 CC: VECTOR checktick, File = stream.c, Line = 311
  A loop was vectorized.

  312.       Vw                 Delta = (int)( 1.0E6 * (timesfound[i]-timesfound[i-1]));
  313.       Vw                 minDelta = MIN(minDelta, MAX(Delta,0));
  314.       Vw----------->     }
  315.
  316.                         return(minDelta);
  317.                          }
  318.
  319.
  320.
  321.                      /* A gettimeofday routine to give access to the wall
  322.                         clock timer on most UNIX-like systems.  */
  323.
  324.                      #include <sys/time.h>
  325.
  326.                      double mysecond()
  327.                      {
  328.                              struct timeval tp;
  329.                              struct timezone tzp;
  330.                              int i;
  331.
  332.  +                           i = gettimeofday(&tp,&tzp);
                                    ^
CC-3021 CC: IPA mysecond, File = stream.c, Line = 332, Column = 9
  "gettimeofday" (called from "mysecond") was not inlined because the compiler was unable to locate the routine.

  333.                              return ( (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6 );
  334.                      }
  335.
  336.                      void checkSTREAMresults ()
  337.                      {
  338.                          double aj,bj,cj,scalar;
  339.                          double asum,bsum,csum;
  340.                          double epsilon;
  341.                          int     j,k;
  342.
  343.                          /* reproduce initialization */
  344.                          aj = 1.0;
  345.                          bj = 2.0;
  346.                          cj = 0.0;
  347.                          /* a[] is modified during timing check */
  348.                          aj = 2.0E0 * aj;
  349.                          /* now execute timing loop */
  350.                          scalar = 3.0;
  351.       V------------<     for (k=0; k<NTIMES; k++)
CC-6204 CC: VECTOR checkSTREAMresults, File = stream.c, Line = 351
  A loop was vectorized.

  352.       V                      {
  353.       V                          cj = aj;
  354.       V                          bj = scalar*cj;
  355.       V                          cj = aj+bj;
  356.       V                          aj = bj+scalar*cj;
  357.       V------------>         }
  358.                          aj = aj * (double) (N);
  359.                          bj = bj * (double) (N);
  360.                          cj = cj * (double) (N);
  361.
  362.                          asum = 0.0;
  363.                          bsum = 0.0;
  364.                          csum = 0.0;
  365.       Vr4----------<     for (j=0; j<N; j++) {
CC-6005 CC: SCALAR checkSTREAMresults, File = stream.c, Line = 365
  A loop was unrolled 4 times.

CC-6204 CC: VECTOR checkSTREAMresults, File = stream.c, Line = 365
  A loop was vectorized.

  366.       Vr4                        asum += a[j];
  367.       Vr4                        bsum += b[j];
  368.       Vr4                        csum += c[j];
  369.       Vr4---------->     }
  370.                      #ifdef VERBOSE
  371.                          printf ("Results Comparison: \n");
  372.                          printf ("        Expected  : %f %f %f \n",aj,bj,cj);
  373.                          printf ("        Observed  : %f %f %f \n",asum,bsum,csum);
  374.                      #endif
  375.
  376.                      #ifndef abs
  377.                      #define abs(a) ((a) >= 0 ? (a) : -(a))
  378.                      #endif
  379.                          epsilon = 1.e-8;
  380.
  381.                          if (abs(aj-asum)/asum > epsilon) {
  382.  +                               printf ("Failed Validation on array a[]\n");
                                        ^
CC-3021 CC: IPA checkSTREAMresults, File = stream.c, Line = 382, Column = 3
  "printf" (called from "checkSTREAMresults") was not inlined because the compiler was unable to locate the routine.

  383.  +                               printf ("        Expected  : %f \n",aj);
                                        ^
CC-3021 CC: IPA checkSTREAMresults, File = stream.c, Line = 383, Column = 3
  "printf" (called from "checkSTREAMresults") was not inlined because the compiler was unable to locate the routine.

  384.  +                               printf ("        Observed  : %f \n",asum);
                                        ^
CC-3021 CC: IPA checkSTREAMresults, File = stream.c, Line = 384, Column = 3
  "printf" (called from "checkSTREAMresults") was not inlined because the compiler was unable to locate the routine.

  385.                          }
  386.                          else if (abs(bj-bsum)/bsum > epsilon) {
  387.  +                               printf ("Failed Validation on array b[]\n");
                                        ^
CC-3021 CC: IPA checkSTREAMresults, File = stream.c, Line = 387, Column = 3
  "printf" (called from "checkSTREAMresults") was not inlined because the compiler was unable to locate the routine.

  388.  +                               printf ("        Expected  : %f \n",bj);
                                        ^
CC-3021 CC: IPA checkSTREAMresults, File = stream.c, Line = 388, Column = 3
  "printf" (called from "checkSTREAMresults") was not inlined because the compiler was unable to locate the routine.

  389.  +                               printf ("        Observed  : %f \n",bsum);
                                        ^
CC-3021 CC: IPA checkSTREAMresults, File = stream.c, Line = 389, Column = 3
  "printf" (called from "checkSTREAMresults") was not inlined because the compiler was unable to locate the routine.

  390.                          }
  391.                          else if (abs(cj-csum)/csum > epsilon) {
  392.  +                               printf ("Failed Validation on array c[]\n");
                                        ^
CC-3021 CC: IPA checkSTREAMresults, File = stream.c, Line = 392, Column = 3
  "printf" (called from "checkSTREAMresults") was not inlined because the compiler was unable to locate the routine.

  393.  +                               printf ("        Expected  : %f \n",cj);
                                        ^
CC-3021 CC: IPA checkSTREAMresults, File = stream.c, Line = 393, Column = 3
  "printf" (called from "checkSTREAMresults") was not inlined because the compiler was unable to locate the routine.

  394.  +                               printf ("        Observed  : %f \n",csum);
                                        ^
CC-3021 CC: IPA checkSTREAMresults, File = stream.c, Line = 394, Column = 3
  "printf" (called from "checkSTREAMresults") was not inlined because the compiler was unable to locate the routine.

  395.                          }
  396.                          else {
  397.  +                               printf ("Solution Validates\n");
                                        ^
CC-3021 CC: IPA checkSTREAMresults, File = stream.c, Line = 397, Column = 3
  "printf" (called from "checkSTREAMresults") was not inlined because the compiler was unable to locate the routine.

  398.                          }
  399.                      }
  400.
  401.                      void tuned_STREAM_Copy()
  402.                      {
  403.                          int j;
  404.                      #pragma omp parallel for
  405.       MmA----------<         for (j=0; j<N; j++)
CC-6202 CC: VECTOR tuned_STREAM_Copy, File = stream.c, Line = 405
  A loop was replaced by a library call.

CC-6823 CC: THREAD tuned_STREAM_Copy, File = stream.c, Line = 405
  A region starting at line 405 and ending at line 406 was multi-threaded.

CC-6817 CC: THREAD tuned_STREAM_Copy, File = stream.c, Line = 405
  A loop was partitioned.

  406.       MmA---------->             c[j] = a[j];
  407.                      }
  408.
  409.                      void tuned_STREAM_Scale(double scalar)
  410.                      {
  411.                          int j;
  412.                      #pragma omp parallel for
  413.       MmVr4--------<     for (j=0; j<N; j++)
CC-6005 CC: SCALAR tuned_STREAM_Scale, File = stream.c, Line = 413
  A loop was unrolled 4 times.

CC-6823 CC: THREAD tuned_STREAM_Scale, File = stream.c, Line = 413
  A region starting at line 413 and ending at line 414 was multi-threaded.

CC-6204 CC: VECTOR tuned_STREAM_Scale, File = stream.c, Line = 413
  A loop was vectorized.

CC-6817 CC: THREAD tuned_STREAM_Scale, File = stream.c, Line = 413
  A loop was partitioned.

  414.       MmVr4-------->         b[j] = scalar*c[j];
  415.                      }
  416.
  417.                      void tuned_STREAM_Add()
  418.                      {
  419.                          int j;
  420.                      #pragma omp parallel for
  421.       MmVr4--------<     for (j=0; j<N; j++)
CC-6005 CC: SCALAR tuned_STREAM_Add, File = stream.c, Line = 421
  A loop was unrolled 4 times.

CC-6823 CC: THREAD tuned_STREAM_Add, File = stream.c, Line = 421
  A region starting at line 421 and ending at line 422 was multi-threaded.

CC-6204 CC: VECTOR tuned_STREAM_Add, File = stream.c, Line = 421
  A loop was vectorized.

CC-6817 CC: THREAD tuned_STREAM_Add, File = stream.c, Line = 421
  A loop was partitioned.

  422.       MmVr4-------->         c[j] = a[j]+b[j];
  423.                      }
  424.
  425.                      void tuned_STREAM_Triad(double scalar)
  426.                      {
  427.                          int j;
  428.                      #pragma omp parallel for
  429.       MmVr4--------<     for (j=0; j<N; j++)
CC-6005 CC: SCALAR tuned_STREAM_Triad, File = stream.c, Line = 429
  A loop was unrolled 4 times.

CC-6823 CC: THREAD tuned_STREAM_Triad, File = stream.c, Line = 429
  A region starting at line 429 and ending at line 430 was multi-threaded.

CC-6204 CC: VECTOR tuned_STREAM_Triad, File = stream.c, Line = 429
  A loop was vectorized.

CC-6817 CC: THREAD tuned_STREAM_Triad, File = stream.c, Line = 429
  A loop was partitioned.

  430.       MmVr4-------->         a[j] = b[j]+scalar*c[j];
  431.                      }
  432.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
arnoldg@h2ologin4:~/stream>
PGI compiler (pgcc)
arnoldg@h2ologin4:~/stream> cc -O3 -Minfo=all -mp -c -DTUNED stream.c
main:
    157, Parallel region activated
    158, Begin master region
    163, End master region
    166, Parallel region terminated
    168, Parallel region activated
    173, Parallel region terminated
    174, Parallel region activated
         Parallel loop activated with static block schedule
         Generated an alternate version of the loop
         Generated vector simd code for the loop
    180, Barrier
         Parallel region terminated
    182, checktick inlined, size=24 (inline) file stream.c (291)
         297, FMA (fused multiply-add) instruction(s) generated
         298, mysecond inlined, size=4 (inline) file stream.c (327)
         299, Loop not vectorized/parallelized: contains call
              FMA (fused multiply-add) instruction(s) generated
    191, mysecond inlined, size=4 (inline) file stream.c (327)
    191, FMA (fused multiply-add) instruction(s) generated
    193, Parallel region activated
         Parallel loop activated with static block schedule
         Generated vector simd code for the loop
         Generated a prefetch instruction for the loop
    195, mysecond inlined, size=4 (inline) file stream.c (327)
    195, Barrier
         Parallel region terminated
         FMA (fused multiply-add) instruction(s) generated
    213, Loop not vectorized/parallelized: contains call
         FMA (fused multiply-add) instruction(s) generated
    215, mysecond inlined, size=4 (inline) file stream.c (327)
    223, mysecond inlined, size=4 (inline) file stream.c (327)
    225, mysecond inlined, size=4 (inline) file stream.c (327)
    233, mysecond inlined, size=4 (inline) file stream.c (327)
    235, mysecond inlined, size=4 (inline) file stream.c (327)
    243, mysecond inlined, size=4 (inline) file stream.c (327)
    245, mysecond inlined, size=4 (inline) file stream.c (327)
    253, mysecond inlined, size=4 (inline) file stream.c (327)
    258, Loop not vectorized: data dependency
    260, Loop unrolled 4 times (completely unrolled)
    269, Loop not vectorized/parallelized: contains call
    281, checkSTREAMresults inlined, size=34 (inline) file stream.c (337)
         351, Loop unrolled 4 times
              FMA (fused multiply-add) instruction(s) generated
         365, Generated vector simd code for the loop containing reductions
              Generated 3 prefetch instructions for the loop
checktick:
    297, FMA (fused multiply-add) instruction(s) generated
    298, mysecond inlined, size=4 (inline) file stream.c (327)
    299, mysecond inlined, size=4 (inline) file stream.c (327)
    299, Loop not vectorized/parallelized: contains call
         FMA (fused multiply-add) instruction(s) generated
mysecond:
    332, FMA (fused multiply-add) instruction(s) generated
checkSTREAMresults:
    351, Loop unrolled 4 times
         FMA (fused multiply-add) instruction(s) generated
    365, Generated vector simd code for the loop containing reductions
         Generated 3 prefetch instructions for the loop
tuned_STREAM_Copy:
    405, Parallel region activated
         Parallel loop activated with static block schedule
         Memory copy idiom, loop replaced by call to __c_mcopy8
    407, Barrier
         Parallel region terminated
tuned_STREAM_Scale:
    413, Parallel region activated
         Parallel loop activated with static block schedule
         Generated an alternate version of the loop
         Generated vector simd code for the loop
         Generated a prefetch instruction for the loop
         Generated vector simd code for the loop
         Generated a prefetch instruction for the loop
    415, Barrier
         Parallel region terminated
tuned_STREAM_Add:
    421, Parallel region activated
         Parallel loop activated with static block schedule
         Generated an alternate version of the loop
         Generated vector simd code for the loop
         Generated 2 prefetch instructions for the loop
         Generated vector simd code for the loop
         Generated 2 prefetch instructions for the loop
    423, Barrier
         Parallel region terminated
tuned_STREAM_Triad:
    429, Parallel region activated
         Parallel loop activated with static block schedule
         Generated an alternate version of the loop
         Generated vector simd code for the loop
         Generated 2 prefetch instructions for the loop
         Generated vector simd code for the loop
         Generated 2 prefetch instructions for the loop
         FMA (fused multiply-add) instruction(s) generated
    431, Barrier
         Parallel region terminated
arnoldg@h2ologin4:~/stream>
Intel compiler (icc)
arnoldg@h2ologin4:~/stream> cc -c -qopt-report-annotate -DTUNED -O3 -qopenmp stream.c
arnoldg@h2ologin4:~/stream> cat stream.c.annot
//
// ------- Annotated listing with optimization reports for "/mnt/a/u/staff/arnoldg/stream/stream.c" -------
//
//INLINING OPTION VALUES:
//  -inline-factor: 100
//  -inline-min-size: 30
//  -inline-max-size: 230
//  -inline-max-total-size: 2000
//  -inline-max-per-routine: 10000
//  -inline-max-per-compile: 500000
//
1       /*-----------------------------------------------------------------------*/
2       /* Program: Stream                                                       */
3       /* Revision: $Id: stream.c,v 5.9 2009/04/11 16:35:00 mccalpin Exp $ */
4       /* Original code developed by John D. McCalpin                           */
5       /* Programmers: John D. McCalpin                                         */
6       /*              Joe R. Zagar                                             */
7       /*                                                                       */
8       /* This program measures memory transfer rates in MB/s for simple        */
9       /* computational kernels coded in C.                                     */
10      /*-----------------------------------------------------------------------*/
11      /* Copyright 1991-2005: John D. McCalpin                                 */
12      /*-----------------------------------------------------------------------*/
13      /* License:                                                              */
14      /*  1. You are free to use this program and/or to redistribute           */
15      /*     this program.                                                     */
16      /*  2. You are free to modify this program for your own use,             */
17      /*     including commercial use, subject to the publication              */
18      /*     restrictions in item 3.                                           */
19      /*  3. You are free to publish results obtained from running this        */
20      /*     program, or from works that you derive from this program,         */
21      /*     with the following limitations:                                   */
22      /*     3a. In order to be referred to as "STREAM benchmark results",     */
23      /*         published results must be in conformance to the STREAM        */
24      /*         Run Rules, (briefly reviewed below) published at              */
25      /*         http://www.cs.virginia.edu/stream/ref.html                    */
26      /*         and incorporated herein by reference.                         */
27      /*         As the copyright holder, John McCalpin retains the            */
28      /*         right to determine conformity with the Run Rules.             */
29      /*     3b. Results based on modified source code or on runs not in       */
30      /*         accordance with the STREAM Run Rules must be clearly          */
31      /*         labelled whenever they are published.  Examples of            */
32      /*         proper labelling include:                                     */
33      /*         "tuned STREAM benchmark results"                              */
34      /*         "based on a variant of the STREAM benchmark code"             */
35      /*         Other comparable, clear and reasonable labelling is           */
36      /*         acceptable.                                                   */
37      /*     3c. Submission of results to the STREAM benchmark web site        */
38      /*         is encouraged, but not required.                              */
39      /*  4. Use of this program or creation of derived works based on this    */
40      /*     program constitutes acceptance of these licensing restrictions.   */
41      /*  5. Absolutely no warranty is expressed or implied.                   */
42      /*-----------------------------------------------------------------------*/
43      # include <stdio.h>
44      # include <math.h>
45      # include <float.h>
46      # include <limits.h>
47      # include <sys/time.h>
48
49      /* INSTRUCTIONS:
50       *
51       *      1) Stream requires a good bit of memory to run.  Adjust the
52       *          value of 'N' (below) to give a 'timing calibration' of
53       *          at least 20 clock-ticks.  This will provide rate estimates
54       *          that should be good to about 5% precision.
55       */
56
57      #ifndef N
58      #   define N    40000000
59      #endif
60      #ifndef NTIMES
61      #   define NTIMES       10
62      #endif
63      #ifndef OFFSET
64      #   define OFFSET       0
65      #endif
66
67      /*
68       *      3) Compile the code with full optimization.  Many compilers
69       *         generate unreasonably bad code before the optimizer tightens
70       *         things up.  If the results are unreasonably good, on the
71       *         other hand, the optimizer might be too smart for me!
72       *
73       *         Try compiling with:
74       *               cc -O stream_omp.c -o stream_omp
75       *
76       *         This is known to work on Cray, SGI, IBM, and Sun machines.
77       *
78       *
79       *      4) Mail the results to mccalpin@cs.virginia.edu
80       *         Be sure to include:
81       *              a) computer hardware model number and software revision
82       *              b) the compiler flags
83       *              c) all of the output from the test case.
84       * Thanks!
85       *
86       */
87
88      # define HLINE "-------------------------------------------------------------\n"
89
90      # ifndef MIN
91      # define MIN(x,y) ((x)<(y)?(x):(y))
92      # endif
93      # ifndef MAX
94      # define MAX(x,y) ((x)>(y)?(x):(y))
95      # endif
96
97      static double   a[N+OFFSET],
98                      b[N+OFFSET],
99                      c[N+OFFSET];
100
101     static double   avgtime[4] = {0}, maxtime[4] = {0},
102                     mintime[4] = {FLT_MAX,FLT_MAX,FLT_MAX,FLT_MAX};
103
104     static char     *label[4] = {"Copy:      ", "Scale:     ",
105         "Add:       ", "Triad:     "};
106
107     static double   bytes[4] = {
108         2 * sizeof(double) * N,
109         2 * sizeof(double) * N,
110         3 * sizeof(double) * N,
111         3 * sizeof(double) * N
112         };
113
114     extern double mysecond();
115     extern void checkSTREAMresults();
116     #ifdef TUNED
117     extern void tuned_STREAM_Copy();
118     extern void tuned_STREAM_Scale(double scalar);
119     extern void tuned_STREAM_Add();
120     extern void tuned_STREAM_Triad(double scalar);
121     #endif
122     #ifdef _OPENMP
123     extern int omp_get_num_threads();
124     #endif
125     int
126     main()
127         {
//INLINE REPORT: (main()) [1] /mnt/a/u/staff/arnoldg/stream/stream.c(127,5)
//  -> INLINE: (182,22) checktick()
//    -> INLINE: (298,7) mysecond()
//    -> INLINE: (299,14) mysecond()
//    -> INLINE: (299,14) mysecond()
//  -> INLINE: (191,9) mysecond()
//  -> INLINE: (195,18) mysecond()
//  -> INLINE: (215,16) mysecond()
//  -> INLINE: (217,9) tuned_STREAM_Copy()
//  -> INLINE: (223,16) mysecond()
//  -> INLINE: (225,16) mysecond()
//  -> INLINE: (227,9) tuned_STREAM_Scale(double)
//  -> INLINE: (233,16) mysecond()
//  -> INLINE: (235,16) mysecond()
//  -> INLINE: (237,9) tuned_STREAM_Add()
//  -> INLINE: (243,16) mysecond()
//  -> INLINE: (245,16) mysecond()
//  -> INLINE: (247,9) tuned_STREAM_Triad(double)
//  -> INLINE: (253,16) mysecond()
//  -> INLINE: (281,5) checkSTREAMresults()
//
///mnt/a/u/staff/arnoldg/stream/stream.c(127,5):remark #34051: REGISTER ALLOCATION : [main] /mnt/a/u/staff/arnoldg/stream/stream.c:127
//
//    Hardware registers
//        Reserved     :    2[ rsp rip]
//        Available    :   39[ rax rdx rcx rbx rbp rsi rdi r8-r15 mm0-mm7 zmm0-zmm15]
//        Callee-save  :    6[ rbx rbp r12-r15]
//        Assigned     :   30[ rax rdx rcx rbx rsi rdi r8-r15 zmm0-zmm15]
//
//    Routine temporaries
//        Total         :    1219
//            Global    :     168
//            Local     :    1051
//        Regenerable   :     448
//        Spilled       :      11
//
//    Routine stack
//        Variables     :     916 bytes*
//            Reads     :     124 [3.24e+02 ~ 0.0%]
//            Writes    :      41 [1.40e+01 ~ 0.0%]
//        Spills        :     128 bytes*
//            Reads     :      63 [0.00e+00 ~ 0.0%]
//            Writes    :      57 [1.00e+01 ~ 0.0%]
//
//    Notes
//
//        *Non-overlapping variables and spills may share stack space,
//         so the total stack size might be less than this.
//
//
128         int                 quantum, checktick();
129         int                 BytesPerWord;
130         register int        j, k;
131         double              scalar, t, times[4][NTIMES];
132
133         /* --- SETUP --- determine precision and check timing --- */
134
135         printf(HLINE);
136         printf("STREAM version $Revision: 5.9 $\n");
137         printf(HLINE);
138         BytesPerWord = sizeof(double);
139         printf("This system uses %d bytes per DOUBLE PRECISION word.\n",
140             BytesPerWord);
141
142         printf(HLINE);
143     #ifdef NO_LONG_LONG
144         printf("Array size = %d, Offset = %d\n" , N, OFFSET);
145     #else
146         printf("Array size = %llu, Offset = %d\n", (unsigned long long) N, OFFSET);
147     #endif
148
149         printf("Total memory required = %.1f MB.\n",
150             (3.0 * BytesPerWord) * ( (double) N / 1048576.0));
151         printf("Each test is run %d times, but only\n", NTIMES);
152         printf("the *best* time for each is used.\n");
153
154     #ifdef _OPENMP
155         printf(HLINE);
156     #pragma omp parallel
//OpenMP Construct at /mnt/a/u/staff/arnoldg/stream/stream.c(156,1)
//remark #16201: OpenMP DEFINED REGION WAS PARALLELIZED
157         {
158     #pragma omp master
//OpenMP Construct at /mnt/a/u/staff/arnoldg/stream/stream.c(158,1)
//remark #16205: OpenMP multithreaded code generation for MASTER was successful
159             {
160                 k = omp_get_num_threads();
161                 printf ("Number of Threads requested = %i\n",k);
162             }
163         }
164     #endif
165
166         printf(HLINE);
167     #pragma omp parallel
//OpenMP Construct at /mnt/a/u/staff/arnoldg/stream/stream.c(167,1)
//remark #16201: OpenMP DEFINED REGION WAS PARALLELIZED
168         {
169         printf ("Printing one line per active thread....\n");
170         }
171
172         /* Get initial value for system clock. */
173     #pragma omp parallel for
//OpenMP Construct at /mnt/a/u/staff/arnoldg/stream/stream.c(173,1)
//remark #16200: OpenMP DEFINED LOOP WAS PARALLELIZED
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(173,1)
//<Peeled loop for vectorization>
//LOOP END
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(173,1)
//   remark #15300: LOOP WAS VECTORIZED
//LOOP END
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(173,1)
//<Remainder loop for vectorization>
//LOOP END
174         for (j=0; j<N; j++) {
175             a[j] = 1.0;
176             b[j] = 2.0;
177             c[j] = 0.0;
178             }
179
180         printf(HLINE);
181
182         if  ( (quantum = checktick()) >= 1)
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(297,5) inlined into /mnt/a/u/staff/arnoldg/stream/stream.c(182,22)
//   remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
//   remark #15346: vector dependence: assumed OUTPUT dependence between call:gettimeofday(struct timeval *__restrict__, __timezone_ptr_t (332:13) and call:gettimeofday(struct timeval *__restrict__, __timezone_ptr_t (332:13)
//
//   LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(299,2) inlined into /mnt/a/u/staff/arnoldg/stream/stream.c(182,22)
//      remark #15521: loop was not vectorized: loop control variable was not identified. Explicitly compute the iteration count before executing the loop or try using canonical loop form from OpenMP specification
//   LOOP END
//LOOP END
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(311,5) inlined into /mnt/a/u/staff/arnoldg/stream/stream.c(182,22)
//   remark #15300: LOOP WAS VECTORIZED
//LOOP END
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(311,5) inlined into /mnt/a/u/staff/arnoldg/stream/stream.c(182,22)
//<Remainder loop for vectorization>
//   remark #25436: completely unrolled by 3
//LOOP END
183             printf("Your clock granularity/precision appears to be "
184                 "%d microseconds.\n", quantum);
185         else {
186             printf("Your clock granularity appears to be "
187                 "less than one microsecond.\n");
188             quantum = 1;
189         }
190
191         t = mysecond();
192     #pragma omp parallel for
//OpenMP Construct at /mnt/a/u/staff/arnoldg/stream/stream.c(192,1)
//remark #16200: OpenMP DEFINED LOOP WAS PARALLELIZED
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(192,1)
//<Peeled loop for vectorization>
//LOOP END
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(192,1)
//   remark #15300: LOOP WAS VECTORIZED
//LOOP END
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(192,1)
//<Remainder loop for vectorization>
//LOOP END
193         for (j = 0; j < N; j++)
194             a[j] = 2.0E0 * a[j];
195         t = 1.0E6 * (mysecond() - t);
196
197         printf("Each test below will take on the order"
198             " of %d microseconds.\n", (int) t  );
199         printf("   (= %d clock ticks)\n", (int) (t/quantum) );
200         printf("Increase the size of the arrays if this shows that\n");
201         printf("you are not getting at least 20 clock ticks per test.\n");
202
203         printf(HLINE);
204
205         printf("WARNING -- The above is only a rough guideline.\n");
206         printf("For best results, please be sure you know the\n");
207         printf("precision of your system timer.\n");
208         printf(HLINE);
209
210         /*  --- MAIN LOOP --- repeat test cases NTIMES times --- */
211
212         scalar = 3.0;
213         for (k=0; k<NTIMES; k++)
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(213,5)
//   remark #15521: loop was not vectorized: loop control variable was not identified. Explicitly compute the iteration count before executing the loop or try using canonical loop form from OpenMP specification
//LOOP END
214             {
215             times[0][k] = mysecond();
216     #ifdef TUNED
217             tuned_STREAM_Copy();
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(404,1) inlined into /mnt/a/u/staff/arnoldg/stream/stream.c(217,9)
//   remark #25399: memcopy generated
//   remark #15398: loop was not vectorized: loop was transformed to memset or memcpy
//
//   LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(404,1) inlined into /mnt/a/u/staff/arnoldg/stream/stream.c(217,9)
//      remark #15335: loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override
//      remark #25439: unrolled with remainder by 2
//   LOOP END
//
//   LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(404,1) inlined into /mnt/a/u/staff/arnoldg/stream/stream.c(217,9)
//   <Remainder>
//   LOOP END
//LOOP END
218     #else
219     #pragma omp parallel for
220             for (j=0; j<N; j++)
221                 c[j] = a[j];
222     #endif
223             times[0][k] = mysecond() - times[0][k];
224
225             times[1][k] = mysecond();
226     #ifdef TUNED
227             tuned_STREAM_Scale(scalar);
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(412,1) inlined into /mnt/a/u/staff/arnoldg/stream/stream.c(227,9)
//<Peeled loop for vectorization>
//LOOP END
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(412,1) inlined into /mnt/a/u/staff/arnoldg/stream/stream.c(227,9)
//   remark #15300: LOOP WAS VECTORIZED
//LOOP END
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(412,1) inlined into /mnt/a/u/staff/arnoldg/stream/stream.c(227,9)
//<Remainder loop for vectorization>
//LOOP END
228     #else
229     #pragma omp parallel for
230             for (j=0; j<N; j++)
231                 b[j] = scalar*c[j];
232     #endif
233             times[1][k] = mysecond() - times[1][k];
234
235             times[2][k] = mysecond();
236     #ifdef TUNED
237             tuned_STREAM_Add();
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(420,1) inlined into /mnt/a/u/staff/arnoldg/stream/stream.c(237,9)
//<Peeled loop for vectorization>
//LOOP END
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(420,1) inlined into /mnt/a/u/staff/arnoldg/stream/stream.c(237,9)
//   remark #15300: LOOP WAS VECTORIZED
//LOOP END
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(420,1) inlined into /mnt/a/u/staff/arnoldg/stream/stream.c(237,9)
//<Remainder loop for vectorization>
//LOOP END
238     #else
239     #pragma omp parallel for
240             for (j=0; j<N; j++)
241                 c[j] = a[j]+b[j];
242     #endif
243             times[2][k] = mysecond() - times[2][k];
244
245             times[3][k] = mysecond();
246     #ifdef TUNED
247             tuned_STREAM_Triad(scalar);
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(428,1) inlined into /mnt/a/u/staff/arnoldg/stream/stream.c(247,9)
//<Peeled loop for vectorization>
//LOOP END
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(428,1) inlined into /mnt/a/u/staff/arnoldg/stream/stream.c(247,9)
//   remark #15300: LOOP WAS VECTORIZED
//LOOP END
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(428,1) inlined into /mnt/a/u/staff/arnoldg/stream/stream.c(247,9)
//<Remainder loop for vectorization>
//LOOP END
248     #else
249     #pragma omp parallel for
250             for (j=0; j<N; j++)
251                 a[j] = b[j]+scalar*c[j];
252     #endif
253             times[3][k] = mysecond() - times[3][k];
254             }
255
256         /*  --- SUMMARY --- */
257
258         for (k=1; k<NTIMES; k++) /* note -- skip first iteration */
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(258,5)
//   remark #25461: Imperfect Loop Unroll-Jammed by 4   (pre-vector)
//   remark #25045: Fused Loops: ( 258 258 )
//
//   remark #25084: Preprocess Loopnests: Moving Out Store    [ /mnt/a/u/staff/arnoldg/stream/stream.c(258,25) ]
//   remark #15335: loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override
//   remark #25436: completely unrolled by 9
//LOOP END
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(258,5)
//<Distributed chunk2>
//   remark #25046: Loop lost in Fusion
//LOOP END
259             {
260             for (j=0; j<4; j++)
261                 {
262                 avgtime[j] = avgtime[j] + times[j][k];
263                 mintime[j] = MIN(mintime[j], times[j][k]);
264                 maxtime[j] = MAX(maxtime[j], times[j][k]);
265                 }
266             }
267
268         printf("Function      Rate (MB/s)   Avg time     Min time     Max time\n");
269         for (j=0; j<4; j++) {
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(269,5)
//   remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
//   remark #25436: completely unrolled by 4
//LOOP END
270             avgtime[j] = avgtime[j]/(double)(NTIMES-1);
271
272             printf("%s%11.4f  %11.4f  %11.4f  %11.4f\n", label[j],
273                    1.0E-06 * bytes[j]/mintime[j],
274                    avgtime[j],
275                    mintime[j],
276                    maxtime[j]);
277         }
278         printf(HLINE);
279
280         /* --- Check Results --- */
281         checkSTREAMresults();
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(351,2) inlined into /mnt/a/u/staff/arnoldg/stream/stream.c(281,5)
//   remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
//   remark #15346: vector dependence: assumed ANTI dependence between aj (354:13) and aj (356:13)
//   remark #25436: completely unrolled by 10
//LOOP END
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(365,2) inlined into /mnt/a/u/staff/arnoldg/stream/stream.c(281,5)
//   remark #15300: LOOP WAS VECTORIZED
//LOOP END
282         printf(HLINE);
283
284         return 0;
285     }
286
287     # define        M       20
288
289     int
290     checktick()
291         {
//INLINE REPORT: (checktick()) [2] /mnt/a/u/staff/arnoldg/stream/stream.c(291,5)
//  -> INLINE: (298,7) mysecond()
//  -> INLINE: (299,14) mysecond()
//  -> INLINE: (299,14) mysecond()
//
///mnt/a/u/staff/arnoldg/stream/stream.c(291,5):remark #34051: REGISTER ALLOCATION : [checktick] /mnt/a/u/staff/arnoldg/stream/stream.c:291
//
//    Hardware registers
//        Reserved     :    2[ rsp rip]
//        Available    :   39[ rax rdx rcx rbx rbp rsi rdi r8-r15 mm0-mm7 zmm0-zmm15]
//        Callee-save  :    6[ rbx rbp r12-r15]
//        Assigned     :   22[ rax rdx rcx rbp rsi rdi zmm0-zmm15]
//
//    Routine temporaries
//        Total         :     144
//            Global    :      16
//            Local     :     128
//        Regenerable   :      18
//        Spilled       :       2
//
//    Routine stack
//        Variables     :     232 bytes*
//            Reads     :      26 [3.00e+02 ~ 12.0%]
//            Writes    :       1 [2.00e+01 ~ 0.8%]
//        Spills        :       8 bytes*
//            Reads     :       2 [1.20e+02 ~ 4.8%]
//            Writes    :       1 [2.00e+01 ~ 0.8%]
//
//    Notes
//
//        *Non-overlapping variables and spills may share stack space,
//         so the total stack size might be less than this.
//
//
292         int         i, minDelta, Delta;
293         double      t1, t2, timesfound[M];
294
295     /*  Collect a sequence of M unique time values from the system. */
296
297         for (i = 0; i < M; i++) {
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(297,5)
//   remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
//   remark #15346: vector dependence: assumed OUTPUT dependence between call:gettimeofday(struct timeval *__restrict__, __timezone_ptr_t (332:13) and call:gettimeofday(struct timeval *__restrict__, __timezone_ptr_t (332:13)
//
//   LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(299,2)
//      remark #15521: loop was not vectorized: loop control variable was not identified. Explicitly compute the iteration count before executing the loop or try using canonical loop form from OpenMP specification
//   LOOP END
//LOOP END
298             t1 = mysecond();
299             while( ((t2=mysecond()) - t1) < 1.0E-6 )
300                 ;
301             timesfound[i] = t1 = t2;
302             }
303
304     /*
305      * Determine the minimum difference between these M values.
306      * This result will be our estimate (in microseconds) for the
307      * clock granularity.
308      */
309
310         minDelta = 1000000;
311         for (i = 1; i < M; i++) {
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(311,5)
//   remark #15300: LOOP WAS VECTORIZED
//LOOP END
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(311,5)
//<Remainder loop for vectorization>
//   remark #25436: completely unrolled by 3
//LOOP END
312             Delta = (int)( 1.0E6 * (timesfound[i]-timesfound[i-1]));
///mnt/a/u/staff/arnoldg/stream/stream.c(312,26):remark #34055: adjacent dense (unit-strided stencil) loads are not optimized. Details: stride { 8 }, step { 8 }, types { F64-V128, F64-V128 }, number of elements { 2 }, select mask { 0x000000003 }.
///mnt/a/u/staff/arnoldg/stream/stream.c(312,26):remark #34055: adjacent dense (unit-strided stencil) loads are not optimized. Details: stride { 8 }, step { 8 }, types { F64-V128, F64-V128 }, number of elements { 2 }, select mask { 0x000000003 }.
///mnt/a/u/staff/arnoldg/stream/stream.c(312,26):remark #34055: adjacent dense (unit-strided stencil) loads are not optimized. Details: stride { 8 }, step { 8 }, types { F64-V128, F64-V128 }, number of elements { 2 }, select mask { 0x000000003 }.
///mnt/a/u/staff/arnoldg/stream/stream.c(312,26):remark #34055: adjacent dense (unit-strided stencil) loads are not optimized. Details: stride { 8 }, step { 8 }, types { F64-V128, F64-V128 }, number of elements { 2 }, select mask { 0x000000003 }.
///mnt/a/u/staff/arnoldg/stream/stream.c(312,26):remark #34055: adjacent dense (unit-strided stencil) loads are not optimized. Details: stride { 8 }, step { 8 }, types { F64-V128, F64-V128 }, number of elements { 2 }, select mask { 0x000000003 }.
///mnt/a/u/staff/arnoldg/stream/stream.c(312,26):remark #34055: adjacent dense (unit-strided stencil) loads are not optimized. Details: stride { 8 }, step { 8 }, types { F64-V128, F64-V128 }, number of elements { 2 }, select mask { 0x000000003 }.
///mnt/a/u/staff/arnoldg/stream/stream.c(312,26):remark #34055: adjacent dense (unit-strided stencil) loads are not optimized. Details: stride { 8 }, step { 8 }, types { F64-V128, F64-V128 }, number of elements { 2 }, select mask { 0x000000003 }.
///mnt/a/u/staff/arnoldg/stream/stream.c(312,26):remark #34055: adjacent dense (unit-strided stencil) loads are not optimized. Details: stride { 8 }, step { 8 }, types { F64-V128, F64-V128 }, number of elements { 2 }, select mask { 0x000000003 }.
///mnt/a/u/staff/arnoldg/stream/stream.c(312,26):remark #34055: adjacent dense (unit-strided stencil) loads are not optimized. Details: stride { 8 }, step { 8 }, types { F64-V128, F64-V128 }, number of elements { 2 }, select mask { 0x000000003 }.
///mnt/a/u/staff/arnoldg/stream/stream.c(312,26):remark #34055: adjacent dense (unit-strided stencil) loads are not optimized. Details: stride { 8 }, step { 8 }, types { F64-V128, F64-V128 }, number of elements { 2 }, select mask { 0x000000003 }.
///mnt/a/u/staff/arnoldg/stream/stream.c(312,26):remark #34055: adjacent dense (unit-strided stencil) loads are not optimized. Details: stride { 8 }, step { 8 }, types { F64-V128, F64-V128 }, number of elements { 2 }, select mask { 0x000000003 }.
///mnt/a/u/staff/arnoldg/stream/stream.c(312,26):remark #34055: adjacent dense (unit-strided stencil) loads are not optimized. Details: stride { 8 }, step { 8 }, types { F64-V128, F64-V128 }, number of elements { 2 }, select mask { 0x000000003 }.
///mnt/a/u/staff/arnoldg/stream/stream.c(312,26):remark #34055: adjacent dense (unit-strided stencil) loads are not optimized. Details: stride { 8 }, step { 8 }, types { F64-V128, F64-V128 }, number of elements { 2 }, select mask { 0x000000003 }.
///mnt/a/u/staff/arnoldg/stream/stream.c(312,26):remark #34055: adjacent dense (unit-strided stencil) loads are not optimized. Details: stride { 8 }, step { 8 }, types { F64-V128, F64-V128 }, number of elements { 2 }, select mask { 0x000000003 }.
///mnt/a/u/staff/arnoldg/stream/stream.c(312,26):remark #34055: adjacent dense (unit-strided stencil) loads are not optimized. Details: stride { 8 }, step { 8 }, types { F64-V128, F64-V128 }, number of elements { 2 }, select mask { 0x000000003 }.
///mnt/a/u/staff/arnoldg/stream/stream.c(312,26):remark #34055: adjacent dense (unit-strided stencil) loads are not optimized. Details: stride { 8 }, step { 8 }, types { F64-V128, F64-V128 }, number of elements { 2 }, select mask { 0x000000003 }.
313             minDelta = MIN(minDelta, MAX(Delta,0));
314             }
315
316        return(minDelta);
317         }
318
319
320
321     /* A gettimeofday routine to give access to the wall
322        clock timer on most UNIX-like systems.  */
323
324     #include <sys/time.h>
325
326     double mysecond()
327     {
//INLINE REPORT: (mysecond()) [3] /mnt/a/u/staff/arnoldg/stream/stream.c(327,1)
//
///mnt/a/u/staff/arnoldg/stream/stream.c(327,1):remark #34051: REGISTER ALLOCATION : [mysecond] /mnt/a/u/staff/arnoldg/stream/stream.c:327
//
//    Hardware registers
//        Reserved     :    2[ rsp rip]
//        Available    :   39[ rax rdx rcx rbx rbp rsi rdi r8-r15 mm0-mm7 zmm0-zmm15]
//        Callee-save  :    6[ rbx rbp r12-r15]
//        Assigned     :    4[ rsi rdi zmm0-zmm1]
//
//    Routine temporaries
//        Total         :      15
//            Global    :       6
//            Local     :       9
//        Regenerable   :       4
//        Spilled       :       0
//
//    Routine stack
//        Variables     :      24 bytes*
//            Reads     :       2 [2.00e+00 ~ 9.1%]
//            Writes    :       0 [0.00e+00 ~ 0.0%]
//        Spills        :       0 bytes*
//            Reads     :       0 [0.00e+00 ~ 0.0%]
//            Writes    :       0 [0.00e+00 ~ 0.0%]
//
//    Notes
//
//        *Non-overlapping variables and spills may share stack space,
//         so the total stack size might be less than this.
//
//
328             struct timeval tp;
329             struct timezone tzp;
330             int i;
331
332             i = gettimeofday(&tp,&tzp);
333             return ( (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6 );
334     }
335
336     void checkSTREAMresults ()
337     {
//INLINE REPORT: (checkSTREAMresults()) [4] /mnt/a/u/staff/arnoldg/stream/stream.c(337,1)
//
///mnt/a/u/staff/arnoldg/stream/stream.c(337,1):remark #34051: REGISTER ALLOCATION : [checkSTREAMresults] /mnt/a/u/staff/arnoldg/stream/stream.c:337
//
//    Hardware registers
//        Reserved     :    2[ rsp rip]
//        Available    :   39[ rax rdx rcx rbx rbp rsi rdi r8-r15 mm0-mm7 zmm0-zmm15]
//        Callee-save  :    6[ rbx rbp r12-r15]
//        Assigned     :    9[ rax rdi zmm0-zmm6]
//
//    Routine temporaries
//        Total         :      73
//            Global    :      17
//            Local     :      56
//        Regenerable   :      33
//        Spilled       :       3
//
//    Routine stack
//        Variables     :       0 bytes*
//            Reads     :       0 [0.00e+00 ~ 0.0%]
//            Writes    :       0 [0.00e+00 ~ 0.0%]
//        Spills        :      24 bytes*
//            Reads     :       4 [0.00e+00 ~ 0.0%]
//            Writes    :       3 [0.00e+00 ~ 0.0%]
//
//    Notes
//
//        *Non-overlapping variables and spills may share stack space,
//         so the total stack size might be less than this.
//
//
338             double aj,bj,cj,scalar;
339             double asum,bsum,csum;
340             double epsilon;
341             int     j,k;
342
343         /* reproduce initialization */
344             aj = 1.0;
345             bj = 2.0;
346             cj = 0.0;
347         /* a[] is modified during timing check */
348             aj = 2.0E0 * aj;
349         /* now execute timing loop */
350             scalar = 3.0;
351             for (k=0; k<NTIMES; k++)
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(351,2)
//   remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
//   remark #15346: vector dependence: assumed ANTI dependence between aj (354:13) and aj (356:13)
//   remark #25436: completely unrolled by 10
//LOOP END
352             {
353                 cj = aj;
354                 bj = scalar*cj;
355                 cj = aj+bj;
356                 aj = bj+scalar*cj;
357             }
358             aj = aj * (double) (N);
359             bj = bj * (double) (N);
360             cj = cj * (double) (N);
361
362             asum = 0.0;
363             bsum = 0.0;
364             csum = 0.0;
365             for (j=0; j<N; j++) {
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(365,2)
//   remark #15300: LOOP WAS VECTORIZED
//LOOP END
366                     asum += a[j];
367                     bsum += b[j];
368                     csum += c[j];
369             }
370     #ifdef VERBOSE
371             printf ("Results Comparison: \n");
372             printf ("        Expected  : %f %f %f \n",aj,bj,cj);
373             printf ("        Observed  : %f %f %f \n",asum,bsum,csum);
374     #endif
375
376     #ifndef abs
377     #define abs(a) ((a) >= 0 ? (a) : -(a))
378     #endif
379             epsilon = 1.e-8;
380
381             if (abs(aj-asum)/asum > epsilon) {
382                     printf ("Failed Validation on array a[]\n");
383                     printf ("        Expected  : %f \n",aj);
384                     printf ("        Observed  : %f \n",asum);
385             }
386             else if (abs(bj-bsum)/bsum > epsilon) {
387                     printf ("Failed Validation on array b[]\n");
388                     printf ("        Expected  : %f \n",bj);
389                     printf ("        Observed  : %f \n",bsum);
390             }
391             else if (abs(cj-csum)/csum > epsilon) {
392                     printf ("Failed Validation on array c[]\n");
393                     printf ("        Expected  : %f \n",cj);
394                     printf ("        Observed  : %f \n",csum);
395             }
396             else {
397                     printf ("Solution Validates\n");
398             }
399     }
400
401     void tuned_STREAM_Copy()
402     {
//INLINE REPORT: (tuned_STREAM_Copy()) [5] /mnt/a/u/staff/arnoldg/stream/stream.c(402,1)
//
///mnt/a/u/staff/arnoldg/stream/stream.c(402,1):remark #34051: REGISTER ALLOCATION : [tuned_STREAM_Copy] /mnt/a/u/staff/arnoldg/stream/stream.c:402
//
//    Hardware registers
//        Reserved     :    2[ rsp rip]
//        Available    :   39[ rax rdx rcx rbx rbp rsi rdi r8-r15 mm0-mm7 zmm0-zmm15]
//        Callee-save  :    6[ rbx rbp r12-r15]
//        Assigned     :   10[ rax rdx rcx rbx rbp rsi rdi r8-r10]
//
//    Routine temporaries
//        Total         :      94
//            Global    :      18
//            Local     :      76
//        Regenerable   :      32
//        Spilled       :       0
//
//    Routine stack
//        Variables     :      20 bytes*
//            Reads     :       4 [0.00e+00 ~ 0.0%]
//            Writes    :       5 [5.00e+00 ~ 0.0%]
//        Spills        :      48 bytes*
//            Reads     :      12 [0.00e+00 ~ 0.0%]
//            Writes    :      12 [1.20e+01 ~ 0.0%]
//
//    Notes
//
//        *Non-overlapping variables and spills may share stack space,
//         so the total stack size might be less than this.
//
//
403             int j;
404     #pragma omp parallel for
//OpenMP Construct at /mnt/a/u/staff/arnoldg/stream/stream.c(404,1) inlined into /mnt/a/u/staff/arnoldg/stream/stream.c(217,9)
//remark #16200: OpenMP DEFINED LOOP WAS PARALLELIZED
///mnt/a/u/staff/arnoldg/stream/stream.c(404,1):remark #34026: call to memcpy implemented as a call to optimized library version
//OpenMP Construct at /mnt/a/u/staff/arnoldg/stream/stream.c(404,1)
//remark #16200: OpenMP DEFINED LOOP WAS PARALLELIZED
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(404,1)
//   remark #25399: memcopy generated
//   remark #15398: loop was not vectorized: loop was transformed to memset or memcpy
//
//   LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(404,1)
//      remark #15335: loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override
//      remark #25439: unrolled with remainder by 2
//   LOOP END
//
//   LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(404,1)
//   <Remainder>
//   LOOP END
//LOOP END
///mnt/a/u/staff/arnoldg/stream/stream.c(404,1):remark #34026: call to memcpy implemented as a call to optimized library version
405             for (j=0; j<N; j++)
406                 c[j] = a[j];
407     }
408
409     void tuned_STREAM_Scale(double scalar)
410     {
//INLINE REPORT: (tuned_STREAM_Scale(double)) [6] /mnt/a/u/staff/arnoldg/stream/stream.c(410,1)
//
///mnt/a/u/staff/arnoldg/stream/stream.c(410,1):remark #34051: REGISTER ALLOCATION : [tuned_STREAM_Scale] /mnt/a/u/staff/arnoldg/stream/stream.c:410
//
//    Hardware registers
//        Reserved     :    2[ rsp rip]
//        Available    :   39[ rax rdx rcx rbx rbp rsi rdi r8-r15 mm0-mm7 zmm0-zmm15]
//        Callee-save  :    6[ rbx rbp r12-r15]
//        Assigned     :   14[ rax rdx rcx rbx rbp rsi rdi r8-r11 zmm0-zmm2]
//
//    Routine temporaries
//        Total         :     106
//            Global    :      19
//            Local     :      87
//        Regenerable   :      37
//        Spilled       :       1
//
//    Routine stack
//        Variables     :      28 bytes*
//            Reads     :       4 [0.00e+00 ~ 0.0%]
//            Writes    :       6 [6.00e+00 ~ 0.0%]
//        Spills        :      56 bytes*
//            Reads     :      13 [1.00e+00 ~ 0.0%]
//            Writes    :      13 [1.30e+01 ~ 0.0%]
//
//    Notes
//
//        *Non-overlapping variables and spills may share stack space,
//         so the total stack size might be less than this.
//
//
411             int j;
412     #pragma omp parallel for
//OpenMP Construct at /mnt/a/u/staff/arnoldg/stream/stream.c(412,1) inlined into /mnt/a/u/staff/arnoldg/stream/stream.c(227,9)
//remark #16200: OpenMP DEFINED LOOP WAS PARALLELIZED
//OpenMP Construct at /mnt/a/u/staff/arnoldg/stream/stream.c(412,1)
//remark #16200: OpenMP DEFINED LOOP WAS PARALLELIZED
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(412,1)
//<Peeled loop for vectorization>
//LOOP END
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(412,1)
//   remark #15300: LOOP WAS VECTORIZED
//LOOP END
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(412,1)
//<Remainder loop for vectorization>
//LOOP END
413             for (j=0; j<N; j++)
414                 b[j] = scalar*c[j];
415     }
416
417     void tuned_STREAM_Add()
418     {
//INLINE REPORT: (tuned_STREAM_Add()) [7] /mnt/a/u/staff/arnoldg/stream/stream.c(418,1)
//
///mnt/a/u/staff/arnoldg/stream/stream.c(418,1):remark #34051: REGISTER ALLOCATION : [tuned_STREAM_Add] /mnt/a/u/staff/arnoldg/stream/stream.c:418
//
//    Hardware registers
//        Reserved     :    2[ rsp rip]
//        Available    :   39[ rax rdx rcx rbx rbp rsi rdi r8-r15 mm0-mm7 zmm0-zmm15]
//        Callee-save  :    6[ rbx rbp r12-r15]
//        Assigned     :   12[ rax rdx rcx rbx rbp rsi rdi r8-r11 zmm0]
//
//    Routine temporaries
//        Total         :      94
//            Global    :      17
//            Local     :      77
//        Regenerable   :      33
//        Spilled       :       0
//
//    Routine stack
//        Variables     :      20 bytes*
//            Reads     :       4 [0.00e+00 ~ 0.0%]
//            Writes    :       5 [5.00e+00 ~ 0.0%]
//        Spills        :      48 bytes*
//            Reads     :      12 [0.00e+00 ~ 0.0%]
//            Writes    :      12 [1.20e+01 ~ 0.0%]
//
//    Notes
//
//        *Non-overlapping variables and spills may share stack space,
//         so the total stack size might be less than this.
//
//
419             int j;
420     #pragma omp parallel for
//OpenMP Construct at /mnt/a/u/staff/arnoldg/stream/stream.c(420,1) inlined into /mnt/a/u/staff/arnoldg/stream/stream.c(237,9)
//remark #16200: OpenMP DEFINED LOOP WAS PARALLELIZED
//OpenMP Construct at /mnt/a/u/staff/arnoldg/stream/stream.c(420,1)
//remark #16200: OpenMP DEFINED LOOP WAS PARALLELIZED
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(420,1)
//<Peeled loop for vectorization>
//LOOP END
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(420,1)
//   remark #15300: LOOP WAS VECTORIZED
//LOOP END
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(420,1)
//<Remainder loop for vectorization>
//LOOP END
421             for (j=0; j<N; j++)
422                 c[j] = a[j]+b[j];
423     }
424
425     void tuned_STREAM_Triad(double scalar)
426     {
//INLINE REPORT: (tuned_STREAM_Triad(double)) [8] /mnt/a/u/staff/arnoldg/stream/stream.c(426,1)
//
///mnt/a/u/staff/arnoldg/stream/stream.c(426,1):remark #34051: REGISTER ALLOCATION : [tuned_STREAM_Triad] /mnt/a/u/staff/arnoldg/stream/stream.c:426
//
//    Hardware registers
//        Reserved     :    2[ rsp rip]
//        Available    :   39[ rax rdx rcx rbx rbp rsi rdi r8-r15 mm0-mm7 zmm0-zmm15]
//        Callee-save  :    6[ rbx rbp r12-r15]
//        Assigned     :   14[ rax rdx rcx rbx rbp rsi rdi r8-r11 zmm0-zmm2]
//
//    Routine temporaries
//        Total         :     108
//            Global    :      19
//            Local     :      89
//        Regenerable   :      37
//        Spilled       :       1
//
//    Routine stack
//        Variables     :      28 bytes*
//            Reads     :       4 [0.00e+00 ~ 0.0%]
//            Writes    :       6 [6.00e+00 ~ 0.0%]
//        Spills        :      56 bytes*
//            Reads     :      13 [1.00e+00 ~ 0.0%]
//            Writes    :      13 [1.30e+01 ~ 0.0%]
//
//    Notes
//
//        *Non-overlapping variables and spills may share stack space,
//         so the total stack size might be less than this.
//
//
427             int j;
428     #pragma omp parallel for
//OpenMP Construct at /mnt/a/u/staff/arnoldg/stream/stream.c(428,1) inlined into /mnt/a/u/staff/arnoldg/stream/stream.c(247,9)
//remark #16200: OpenMP DEFINED LOOP WAS PARALLELIZED
//OpenMP Construct at /mnt/a/u/staff/arnoldg/stream/stream.c(428,1)
//remark #16200: OpenMP DEFINED LOOP WAS PARALLELIZED
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(428,1)
//<Peeled loop for vectorization>
//LOOP END
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(428,1)
//   remark #15300: LOOP WAS VECTORIZED
//LOOP END
//
//LOOP BEGIN at /mnt/a/u/staff/arnoldg/stream/stream.c(428,1)
//<Remainder loop for vectorization>
//LOOP END
429             for (j=0; j<N; j++)
430                 a[j] = b[j]+scalar*c[j];
431     }
432
arnoldg@h2ologin4:~/stream>
GNU (gcc)
arnoldg@h2ologin4:~/stream> gcc -c -fopenmp -O3 -fopt-info -DTUNED stream.c
stream.c:194:18: note: loop vectorized
stream.c:194:18: note: loop peeled for vectorization to enhance alignment
stream.c:192:9: note: loop turned into non-loop; it never loops
stream.c:192:9: note: loop turned into non-loop; it never loops.
stream.c:192:9: note: loop with 3 iterations completely unrolled
stream.c:175:7: note: Loop 1 distributed: split to 1 loops and 1 library calls.
stream.c:175:7: note: loop vectorized
stream.c:175:7: note: loop peeled for vectorization to enhance alignment
stream.c:173:9: note: loop turned into non-loop; it never loops
stream.c:173:9: note: loop turned into non-loop; it never loops.
stream.c:173:9: note: loop with 8 iterations completely unrolled
stream.c:406:21: note: Loop 1 distributed: split to 0 loops and 1 library calls.
stream.c:414:21: note: loop vectorized
stream.c:414:21: note: loop peeled for vectorization to enhance alignment
stream.c:412:9: note: loop turned into non-loop; it never loops
stream.c:412:9: note: loop turned into non-loop; it never loops.
stream.c:412:9: note: loop with 4 iterations completely unrolled
stream.c:422:14: note: loop vectorized
stream.c:422:14: note: loop peeled for vectorization to enhance alignment
stream.c:420:9: note: loop turned into non-loop; it never loops
stream.c:420:9: note: loop turned into non-loop; it never loops.
stream.c:420:9: note: loop with 3 iterations completely unrolled
stream.c:430:14: note: loop vectorized
stream.c:430:14: note: loop peeled for vectorization to enhance alignment
stream.c:428:9: note: loop turned into non-loop; it never loops
stream.c:428:9: note: loop turned into non-loop; it never loops.
stream.c:428:9: note: loop with 3 iterations completely unrolled
stream.c:311:5: note: loop vectorized
stream.c:311:5: note: loop turned into non-loop; it never loops.
stream.c:311:5: note: loop with 3 iterations completely unrolled
stream.c:290:1: note: loop turned into non-loop; it never loops
stream.c:290:1: note: loop turned into non-loop; it never loops.
stream.c:290:1: note: loop with 4 iterations completely unrolled
stream.c:351:2: note: loop turned into non-loop; it never loops.
stream.c:351:2: note: loop with 10 iterations completely unrolled
stream.c:260:2: note: loop turned into non-loop; it never loops.
stream.c:260:2: note: loop with 5 iterations completely unrolled
stream.c:258:5: note: loop turned into non-loop; it never loops.
stream.c:258:5: note: loop with 9 iterations completely unrolled
arnoldg@h2ologin4:~/stream>

To confirm vectorization, compile to assembly code (gcc -S or similar ) and look for vector instructions.  This may change with optimization levels.

counting vector instructions
arnoldg@h2ologin4:~/stream> gcc -c -fopenmp -O1 -fopt-info -DTUNED -S stream.c
arnoldg@h2ologin4:~/stream> grep xmm stream.s | wc -l
138
arnoldg@h2ologin4:~/stream> gcc -c -fopenmp -O3 -fopt-info -DTUNED -S stream.c
stream.c:194:18: note: loop vectorized
stream.c:194:18: note: loop peeled for vectorization to enhance alignment
stream.c:192:9: note: loop turned into non-loop; it never loops
stream.c:192:9: note: loop turned into non-loop; it never loops.
stream.c:192:9: note: loop with 3 iterations completely unrolled
stream.c:175:7: note: Loop 1 distributed: split to 1 loops and 1 library calls.
stream.c:175:7: note: loop vectorized
stream.c:175:7: note: loop peeled for vectorization to enhance alignment
stream.c:173:9: note: loop turned into non-loop; it never loops
stream.c:173:9: note: loop turned into non-loop; it never loops.
stream.c:173:9: note: loop with 8 iterations completely unrolled
stream.c:406:21: note: Loop 1 distributed: split to 0 loops and 1 library calls.
stream.c:414:21: note: loop vectorized
stream.c:414:21: note: loop peeled for vectorization to enhance alignment
stream.c:412:9: note: loop turned into non-loop; it never loops
stream.c:412:9: note: loop turned into non-loop; it never loops.
stream.c:412:9: note: loop with 4 iterations completely unrolled
stream.c:422:14: note: loop vectorized
stream.c:422:14: note: loop peeled for vectorization to enhance alignment
stream.c:420:9: note: loop turned into non-loop; it never loops
stream.c:420:9: note: loop turned into non-loop; it never loops.
stream.c:420:9: note: loop with 3 iterations completely unrolled
stream.c:430:14: note: loop vectorized
stream.c:430:14: note: loop peeled for vectorization to enhance alignment
stream.c:428:9: note: loop turned into non-loop; it never loops
stream.c:428:9: note: loop turned into non-loop; it never loops.
stream.c:428:9: note: loop with 3 iterations completely unrolled
stream.c:311:5: note: loop vectorized
stream.c:311:5: note: loop turned into non-loop; it never loops.
stream.c:311:5: note: loop with 3 iterations completely unrolled
stream.c:290:1: note: loop turned into non-loop; it never loops
stream.c:290:1: note: loop turned into non-loop; it never loops.
stream.c:290:1: note: loop with 4 iterations completely unrolled
stream.c:351:2: note: loop turned into non-loop; it never loops.
stream.c:351:2: note: loop with 10 iterations completely unrolled
stream.c:260:2: note: loop turned into non-loop; it never loops.
stream.c:260:2: note: loop with 5 iterations completely unrolled
stream.c:258:5: note: loop turned into non-loop; it never loops.
stream.c:258:5: note: loop with 9 iterations completely unrolled
arnoldg@h2ologin4:~/stream> grep xmm stream.s | wc -l
524
arnoldg@h2ologin4:~/stream>
  • No labels