Intel® Advisor Help

Examine Bottlenecks on GPU Roofline Chart

Accuracy Level: Any

Enabled Analyses: Survey with GPU profiling + FLOP (Characterization) with GPU profiling

Note

Other analyses and properties control the CPU Roofline part of the report, which shows metrics for loops executed on the CPU. You can add the CPU Roofline panes to the main view using the button on the top pane. For details about CPU Roofline data, see CPU / Memory Roofline Insights Perspective.

Result Interpretation

The farther a dot is from the topmost roofs, the more room for improvement there is. In accordance with Amdahl's Law, optimizing the loops that take the largest portion of the program's total run time will lead to greater speedups than optimizing the loops that take a smaller portion of the run time.
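As background, the position of a dot relative to the roofs follows the general roofline model: for a given arithmetic intensity (AI, operations per byte of traffic at a memory level), attainable performance is capped by the lower of the compute roof and that level's memory roof. A minimal statement of this general model (standard roofline formula, not an Intel Advisor-specific quantity):

  % Roofline bound: P_peak is the compute peak (OP/s), BW is the peak
  % bandwidth (bytes/s) of a memory level, AI is the arithmetic intensity.
  P_{\text{attainable}} = \min\bigl(P_{\text{peak}},\ \mathrm{AI} \cdot \mathrm{BW}\bigr)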

Example of a GPU Roofline chart

To read the GPU Roofline chart, use the guidelines in the sections below.

Memory-Level GPU Roofline

By default, the GPU Roofline chart reports data for all memory levels, allowing you to examine each loop at different cache levels and arithmetic intensities and get precise insight into which cache level causes the performance bottleneck.

To configure the Memory-Level GPU Roofline chart:

  1. Expand the filter pane in the GPU Roofline chart toolbar.
  2. In the Memory Level section, select the memory levels you want to see metrics for.

    Select memory levels for a GPU Roofline chart

  3. Click Apply.
  4. In the GPU Roofline chart, double-click a loop to examine the relationships between the displayed memory levels and roofs. Labeled dots are displayed, one per memory level, showing the arithmetic intensity of the selected loop/function at that level; lines connect the dots to indicate that they belong to the same loop/function.

Memory-Level GPU Roofline Data

Review the changes in traffic from one memory level to another and compare each level's traffic with its respective roof to identify the memory hierarchy bottleneck for the kernel and determine optimization steps based on this information.

Example of a GPU Roofline chart for all memory levels
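One simplified way to apply this comparison is sketched below in plain C++ with hypothetical traffic and bandwidth numbers (assumed for illustration, not values produced by Intel Advisor): estimate how long each memory level would need to move the kernel's traffic at its peak bandwidth; the level that needs the most time is the likely memory hierarchy bottleneck.

  #include <cstdio>

  int main() {
      // Hypothetical per-level traffic (GB) and peak bandwidths (GB/s)
      // for one kernel; all numbers are assumed for illustration only.
      const char*  level[]   = { "CARM", "L3",   "SLM",  "GTI"  };
      const double traffic[] = { 2.0,    6.0,    0.5,    4.0    };
      const double peak_bw[] = { 800.0,  440.0,  600.0,  120.0  };

      // The level that needs the most time at its peak bandwidth is the
      // likely memory hierarchy bottleneck for this kernel.
      int worst = 0;
      for (int i = 0; i < 4; ++i) {
          const double t = traffic[i] / peak_bw[i];
          std::printf("%-4s: %.1f GB / %.0f GB/s = %.4f s\n",
                      level[i], traffic[i], peak_bw[i], t);
          if (t > traffic[worst] / peak_bw[worst]) worst = i;
      }
      std::printf("Likely bottleneck level: %s\n", level[worst]);
      return 0;
  }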

  • When you double-click a loop, it is expanded to several dots and/or X marks representing different memory levels:
    • CARM: Memory traffic generated by all execution units (EUs). Includes traffic between EUs and corresponding GPU cache or direct traffic to main memory. For each retired instruction with memory arguments, the size of each memory operand in bytes is added to this metric.
    • L3: Data transferred directly between execution units and L3 cache.
    • SLM: Memory access to/from Shared Local Memory (SLM), a dedicated structure within the L3 cache.
    • GTI: Represents GTI traffic/GPU memory read bandwidth, the accesses between the GPU, chip uncore (LLC), and main memory. Use this to get a sense of external memory traffic.
  • The vertical distance between a memory dot and its respective roof shows how much you are limited by that memory subsystem. If a dot is close to its roof, the kernel is limited by the performance of this memory level.
  • The horizontal distance between memory dots indicates how efficiently the loop/function uses cache. For example, if the L3 and DRAM dots are very close on the horizontal axis for a single loop, the loop/function generates similar traffic at L3 and DRAM, which means it does not reuse data in the L3 cache effectively. Improve data reuse in the code to improve application performance.
  • Arithmetic intensity on the x axis determines the order in which dots are plotted, which can provide some insight into your code's performance. For example, the CARM dot is typically far to the right of the L3 dot because read/write accesses at the L3 level are performed at cache-line granularity, while CARM traffic counts only the bytes actually used in operations (see the sketch after this list).
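To make the horizontal separation concrete, here is a minimal sketch in plain C++ with assumed values (not Intel Advisor output): for a strided read, the EUs consume only 4 bytes per element (counted in CARM traffic), while each access pulls a full 64-byte cache line through L3, so the L3 arithmetic intensity is lower and its dot lands to the left of the CARM dot.

  #include <cstdio>

  int main() {
      // Hypothetical kernel: one operation per element, strided float reads.
      const double n          = 1e6;   // elements processed (assumed)
      const double ops        = 1.0 * n;
      const double cache_line = 64.0;  // bytes per cache line (typical)

      // CARM traffic: bytes the EUs actually request (4 bytes per float).
      const double carm_bytes = 4.0 * n;
      // L3 traffic when the stride exceeds a cache line: a full 64-byte
      // line moves for every 4 useful bytes.
      const double l3_bytes   = cache_line * n;

      std::printf("CARM AI: %.3f OP/byte\n", ops / carm_bytes); // 0.250
      std::printf("L3   AI: %.3f OP/byte\n", ops / l3_bytes);   // 0.016
      return 0;
  }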

Kernel Details

Select a dot on the chart and switch to the Details tab in the right-side pane to examine the code analytics for a specific kernel in more detail.

In the Roofline pane, examine the Roofline chart for the selected kernel.

In the OP/S and Bandwidth pane, review how well the kernel uses the compute and memory resources of the hardware. It reports the total number of floating-point and integer operations per second compared to the maximum hardware capability, and the amount of data transferred at each memory level (in gigabytes per second) compared to that memory level's bandwidth. The red bar highlights the dominant operation type used in the kernel.
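As a simple illustration of how to read these comparisons (all numbers below are assumed, not taken from a real collection), the sketch computes the utilization percentages that this pane presents graphically:

  #include <cstdio>

  int main() {
      // Hypothetical achieved vs. peak values for one kernel (assumed).
      const double achieved_gflops = 120.0, peak_gflops  = 480.0;
      const double l3_gbps         = 66.0,  l3_peak_gbps = 220.0;

      std::printf("Compute: %.0f%% of peak OP/S\n",
                  100.0 * achieved_gflops / peak_gflops);  // 25%
      std::printf("L3 bandwidth: %.0f%% of peak\n",
                  100.0 * l3_gbps / l3_peak_gbps);         // 30%
      return 0;
  }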

In the Memory Metrics pane, review the metrics reported for each memory level.

Note

Data in the Memory Metrics pane is based on the dominant type of operations in the code (FLOAT or INT). The dominant type is indicated with a red bar in the OP/S and Bandwidth pane.

In the Instruction Mix pane, examine the types of instructions that the kernel executes. Intel Advisor automatically determines the data type used in operations and groups the instructions collected during the Characterization analysis into the following categories:

Category: Compute (FLOP and INTOP)
  • BASIC COMPUTE: add, addc, mul, rndu, rndd, rnde, rndz, subb, avg, frc, lzd, fbh, fbl, cbit
  • BIT: and, not, or, xor, asr, shr, shl, bfrev, bfe, bfi1, bfi2, ror, rol
  • FMA: mac, mach, mad, madm

    Note

    Intel Advisor counts the mac, mach, mad, and madm instructions in this class as 2 operations (see the worked example after this table).
  • DIV: INT_DIV_BOTH, INT_DIV_QUOTIENT, INT_DIV_REMAINDER, and FDIV types of the extended math function
  • POW extended math function
  • MATH: other function types performed by the math instruction

Category: Memory
  • LOAD, STORE, SLM_LOAD, SLM_STORE types of the send, sendc, sends, sendsc instructions, depending on the argument

Category: Other
  • MOVE: mov, sel, movi, smov, csel
  • CONTROL FLOW: if, else, endif, while, break, cont, call, calla, ret, goto, jmpi, brd, brc, join, halt
  • SYNC: wait, sync
  • OTHER: cmp, cmpn, nop, f32to16, f16to32, dim
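For example, consider counting operations for a simple triad-style loop, y[i] = a * x[i] + y[i]. The sketch below uses plain C++ and assumed values (it is not Intel Advisor code): if the multiply-add compiles to a single FMA instruction, it is counted as 2 operations per element, and dividing by the bytes moved per element gives the CARM-level arithmetic intensity.

  #include <cstddef>
  #include <cstdio>

  int main() {
      // Hypothetical kernel body: y[i] = a * x[i] + y[i] over n floats.
      const std::size_t n = 1u << 20;                   // 1M elements (assumed)
      const double flop_per_elem  = 2.0;                // one FMA counted as 2 ops
      const double bytes_per_elem = 3 * sizeof(float);  // load x, load y, store y

      const double total_flop  = flop_per_elem * n;
      const double total_bytes = bytes_per_elem * n;

      // CARM-level arithmetic intensity: operations per byte the EUs touch.
      std::printf("FLOP: %.0f, bytes: %.0f, CARM AI: %.3f FLOP/byte\n",
                  total_flop, total_bytes, total_flop / total_bytes); // ~0.167
      return 0;
  }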

In the Performance Characteristics pane, review how effectively the kernel uses the GPU resources: the activity of all execution units, the percentage of time when both FPUs are used, and the percentage of cycles with a thread scheduled. Ideally, these values should be high: the higher the percentage of active execution units and the other effectiveness metrics, the more GPU resources the kernel uses.

See Also