To generate the GPU Roofline chart, run the Survey with GPU profiling and the FLOP (Characterization) with GPU profiling analyses. For details about CPU Roofline data, see CPU / Memory Roofline Insights Perspective.
The farther a dot is from the topmost roofs, the more room for improvement there is. In accordance with Amdahl's Law, optimizing the loops that take the largest portion of the program's total run time will lead to greater speedups than optimizing the loops that take a smaller portion of the run time.
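For reference, a worked form of that argument (the percentages below are made-up examples, not Advisor measurements): if a kernel accounts for a fraction $p$ of total run time and you speed it up by a factor $s$, the overall program speedup is

$$S = \frac{1}{(1 - p) + \dfrac{p}{s}}$$

For example, a 2x speedup of a kernel that takes 80% of the run time yields roughly a 1.67x overall speedup, while the same 2x speedup of a kernel that takes only 20% of the run time yields about 1.11x.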

To read the GPU Roofline chart:
By default, the GPU Roofline chart reports data for all memory levels. This lets you examine each loop at different cache levels and arithmetic intensities and gives precise insight into which cache level causes the performance bottleneck. You can configure the chart to display only the memory levels you are interested in.

Memory-Level GPU Roofline Data
Review the changes in the traffic from one memory level to another and compare it to the respective roofs to identify the memory hierarchy bottleneck for the kernel and determine optimization steps based on this information.
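The reasoning behind comparing traffic at different memory levels can be sketched numerically. The snippet below is an illustrative calculation only; the traffic, bandwidth, and peak values are hypothetical, not Intel Advisor output. It derives an arithmetic intensity for each memory level and the corresponding roofline bound, showing which level caps performance.

```python
# Illustrative roofline arithmetic with made-up numbers; not Intel Advisor output.
flop = 4.0e9  # total floating-point operations executed by the kernel

# Hypothetical bytes moved at each memory level
traffic_bytes = {"L3": 8.0e9, "SLM": 2.0e9, "GTI (DRAM)": 6.0e9}

# Hypothetical peak bandwidths, bytes/s
bandwidth = {"L3": 440e9, "SLM": 380e9, "GTI (DRAM)": 70e9}

peak_compute = 1.0e12  # hypothetical compute peak, FLOP/s

for level, bytes_moved in traffic_bytes.items():
    ai = flop / bytes_moved                           # arithmetic intensity at this level
    bound = min(peak_compute, ai * bandwidth[level])  # roofline bound at this level
    print(f"{level}: AI = {ai:.2f} FLOP/byte, bound = {bound / 1e9:.1f} GFLOP/s")

# The memory level with the lowest bound is the memory hierarchy bottleneck.
```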

Select a dot on the chart and switch to the Details tab in the right-side pane to examine code analytics for the selected kernel in more detail.
In the Roofline, examine the Roofline chart for the selected kernel with the following data:
Amount of data transferred for each cache memory level
Pointer arrow that shows the exact roof that limits the kernel performance. The arrow points to what you should optimize the kernel for and shows the potential speedup after the optimization in the callout.
If the arrow points to a diagonal line, the kernel is mostly memory bound. If the arrow points to a horizontal line, the kernel is mostly compute bound.
For example, in the screenshot below, the kernel is bound by the Int32 Vector Add peak. If you optimize the kernel for integer operations, you can get up to a 1.4x speedup.
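The speedup in the callout can be read as the ratio of the limiting roof to the kernel's measured performance at its arithmetic intensity; the numbers below are made up to match the 1.4x example and are not taken from a real report:

$$\text{potential speedup} \approx \frac{\text{limiting roof value at the kernel's arithmetic intensity}}{\text{measured performance}}$$

For instance, a kernel achieving 500 GINTOP/s under an Int32 Vector Add roof of 700 GINTOP/s at that arithmetic intensity has roughly 700 / 500 = 1.4x of headroom.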

In the OP/S and Bandwidth, review how well the kernel uses the compute and memory resources of the hardware. This section reports the total number of floating-point and integer operations executed per second compared to the maximum hardware capability, and the amount of data transferred through each cache memory level (in gigabytes per second) compared to that memory level's bandwidth. The red bar highlights the dominant operation type used in the kernel.
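The percentages behind these bars are simple ratios of measured rates to hardware peaks. A minimal sketch with hypothetical values (none of these numbers come from an actual report):

```python
# Hypothetical measured values and hardware peaks; illustrative only.
elapsed_s = 0.050                        # kernel elapsed time, seconds

flop, intop = 3.0e9, 1.0e9               # operations executed during the kernel
peak_flops, peak_intops = 900e9, 600e9   # hypothetical compute peaks, op/s

gflops = flop / elapsed_s
gintops = intop / elapsed_s
print(f"FLOP/s: {gflops / 1e9:.1f} G ({100 * gflops / peak_flops:.0f}% of peak)")
print(f"INTOP/s: {gintops / 1e9:.1f} G ({100 * gintops / peak_intops:.0f}% of peak)")

# Per-level bandwidth compared to the hypothetical peak of that level
traffic_bytes = {"L3": 8.0e9, "GTI (DRAM)": 2.5e9}   # bytes moved
peak_bw = {"L3": 440e9, "GTI (DRAM)": 70e9}          # bytes/s
for level, moved in traffic_bytes.items():
    rate = moved / elapsed_s
    print(f"{level}: {rate / 1e9:.1f} GB/s ({100 * rate / peak_bw[level]:.0f}% of peak)")
```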
In the Memory Metrics:
Review the percentage of the total time that the kernel spends processing requests for each memory level, reported in the Impacts histogram.
A large value indicates the memory level that bounds the selected kernel. Examine the difference between the two largest bars to see how much throughput you can gain if you reduce the impact of your main bottleneck. It also suggests a longer-term plan for reducing memory-bound limitations: once you resolve the problems caused by the widest bar, your next bottleneck will come from the second-largest bar, and so on (see the sketch after this list).
Ideally, you should see the L3 as the most impactful memory.
Review the amount of data that passes through each memory level, reported in the Shares histogram.
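One way to reason about the Impacts histogram is to estimate how long each memory level would need to serve the kernel's traffic at its peak bandwidth and compare those times. This is a simplified model with hypothetical numbers, not the formula Intel Advisor uses:

```python
# Simplified, hypothetical model of per-level impact; not Advisor's actual formula.
traffic_bytes = {"L3": 8.0e9, "SLM": 1.0e9, "GTI (DRAM)": 6.0e9}  # bytes moved
peak_bw = {"L3": 440e9, "SLM": 380e9, "GTI (DRAM)": 70e9}          # bytes/s

# Time each level would need to serve its traffic at peak bandwidth
times = {level: moved / peak_bw[level] for level, moved in traffic_bytes.items()}
total = sum(times.values())

for level, t in sorted(times.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{level}: {1000 * t:.2f} ms ({100 * t / total:.0f}% of modeled memory time)")

# The widest bar is the main bottleneck; the gap to the second-largest bar shows
# how much headroom remains before the next level becomes the limiter.
```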
In the Instruction Mix, examine the types of instructions that the kernel executes. Intel Advisor automatically determines the data type used in operations and groups the instructions collected during the Characterization analysis into the following categories (a simplified grouping sketch follows the table):
| Category | Instruction Types |
|---|---|
| Compute (FLOP and INTOP) | |
| Memory | LOAD, STORE, SLM_LOAD, SLM_STORE types depending on the argument: send, sendc, sends, sendsc |
| Other | |
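As a rough illustration of this grouping, the sketch below buckets a hypothetical instruction histogram into the three categories; the instruction names, counts, and mapping are simplified assumptions, not Intel Advisor's actual classification:

```python
# Hypothetical instruction histogram and a simplified category mapping;
# the real grouping performed by Intel Advisor is more detailed.
instruction_counts = {"mad": 1200, "add": 800, "send": 500, "mov": 300, "cmp": 150}

categories = {
    "mad": "Compute (FLOP and INTOP)",
    "add": "Compute (FLOP and INTOP)",
    "send": "Memory",  # send-family instructions carry the LOAD/STORE/SLM traffic
}

mix = {}
for instr, count in instruction_counts.items():
    category = categories.get(instr, "Other")  # anything unmapped falls into Other
    mix[category] = mix.get(category, 0) + count

total = sum(instruction_counts.values())
for category, count in mix.items():
    print(f"{category}: {count} instructions ({100 * count / total:.0f}%)")
```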
In the Performance Characteristics, review how effectively the kernel uses the GPU resources: the activity of all execution units, the percentage of time when both FPUs are used, and the percentage of cycles with a thread scheduled. Ideally, these percentages should be high, which indicates that the kernel makes good use of the available GPU resources.