1. Introduction

As developers and system administrators, we constantly look for ways to make our programs faster and more efficient. The perf tool in Linux helps here because it exposes detailed information about how software behaves on the underlying hardware.

In this tutorial, we’ll analyze cache misses using the perf tool. We’ll focus on monitoring and analyzing these events to drive optimal program execution.

2. Understanding Cache Misses

Before diving into the practical aspects of perf, let’s understand cache misses and their role in system performance.

At the heart of modern computer architecture lies the cache – a high-speed memory that stores frequently accessed data for rapid retrieval. However, even the cache isn’t immune to occasional misses, which occur when the required data isn’t found in the cache and must be fetched from the slower main memory.

A single cache miss may seem inconsequential, but in aggregate, cache misses can cause substantial performance degradation. When our program misses the cache frequently, the CPU stalls while data is fetched from main memory, which delays execution. Therefore, monitoring and reducing cache misses is important.
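To make the cost concrete, here is a minimal sketch of the common average-access-time estimate: hit time plus miss rate times miss penalty. The function name and the latency figures are illustrative assumptions, not measurements from any particular CPU.

```python
def average_access_time_ns(hit_time_ns, miss_rate, miss_penalty_ns):
    """Estimate the average cost of one memory access in nanoseconds.

    Every access pays the hit time; a fraction `miss_rate` of accesses
    additionally pays the miss penalty.
    """
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_time_ns + miss_rate * miss_penalty_ns


# Assumed latencies, chosen only for illustration.
HIT_NS = 1.0
PENALTY_NS = 100.0

for rate in (0.01, 0.20):
    cost = average_access_time_ns(HIT_NS, rate, PENALTY_NS)
    print(f"miss rate {rate:.0%}: about {cost:.1f} ns per access")
```

Under these assumed numbers, raising the miss rate from 1% to 20% makes the average access roughly ten times slower, which is why a seemingly small miss rate deserves attention.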

Fortunately, this is where hardware counters come into play. These specialized components within our processors keep track of various events, including cache misses, allowing us to gain invaluable insights into how our programs interact with memory. By using the power of hardware counters, we can pinpoint cache misses and other performance bottlenecks.

In short, measuring cache misses is the first step toward reducing them.

3. The Basics of the perf Tool

Let’s first get acquainted with perf’s core capabilities for performance optimization.

At its core, perf is a performance analysis and profiling tool that provides us crucial insights into program behavior. Whether we’re seeking to fine-tune code execution or identify system bottlenecks, perf is a reliable tool to achieve optimal performance.

As developers, we can benefit from perf’s ability to monitor various system events, including software and hardware events, as well as CPU and memory metrics.

Collecting these event statistics gives us a comprehensive overview of how our program interacts with the underlying hardware. This, in turn, enables us to make informed decisions for code optimization.

Furthermore, as system administrators, we can use perf to diagnose and resolve performance issues at a broader system level. This extends beyond individual programs and applications.

Also, we can pinpoint resource-hungry processes, bottlenecks, and inefficiencies affecting the overall system’s responsiveness.

4. Requesting Cache Events With perf stat

Let’s take a look at cache events to better understand cache misses. To do this, we’ll start with perf stat.

We’ll further tailor its output to suit our specific analysis needs by explicitly requesting cache-related events.

4.1. Sample Cache Analysis

First, let’s consider the following command:

$ perf stat -B -e cache-references,cache-misses,cycles,instructions,branches,faults,migrations sleep 5
 Performance counter stats for 'sleep 5':

       2,315 cache-references                                              (50.01%)
         513 cache-misses              #    22.15% of all cache refs      (50.00%)
     1,004,927 cycles                    #    22.20 GHz                    (50.01%)
     1,002,550 instructions              #     0.99  insns per cycle        (66.67%)
       193,719 branches                  #    14.736 M/sec                  (66.67%)
         1,523 faults                                                      (66.66%)
             2 migrations                                                  (66.67%)

       0.100918620 seconds time elapsed

Here, we use the -e flag to specify the set of events we wish to monitor, while the -B flag prints large numbers with thousands separators.
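When the counts need to be consumed by a script rather than read by eye, perf stat can also emit separator-delimited output through its -x option, for example perf stat -x, -e cache-references,cache-misses followed by the command, with the results written to standard error. The sketch below parses such output, assuming the field order documented for perf stat (counter value, unit, event name, then further fields). The function name and the sample lines are hypothetical and only shaped like real output.

```python
import csv
import io


def parse_perf_stat_csv(text):
    """Map event names to integer counts from `perf stat -x,` output.

    Assumes each row starts with: counter value, unit, event name.
    Rows whose value is not a plain integer (such as "<not counted>"
    or "<not supported>") are skipped.
    """
    counts = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 3:
            continue  # blank or malformed line
        value, event = row[0], row[2]
        if not value.isdigit():
            continue
        counts[event] = int(value)
    return counts


# Hypothetical sample, not captured from a real run.
SAMPLE = """\
2315,,cache-references,1000000,50.01,,
513,,cache-misses,1000000,50.00,22.15,of all cache refs
<not supported>,,hypothetical-event,0,100.00,,
"""

for event, count in parse_perf_stat_csv(SAMPLE).items():
    print(f"{event}: {count}")
```

Keeping only integer counts is a deliberate simplification: time-based events such as task-clock report fractional values and would need separate handling.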

By including cache-references and cache-misses alongside other metrics like cycles, instructions, and branches, we create a comprehensive snapshot of our program’s execution, including cache behavior. Also, the percentages in parentheses show the share of the run during which each counter was actually being measured.

Note that these percentages don’t need to sum to 100%, because each one describes a single counter on its own. The CPU has a limited number of hardware counters, so when we request more events than it can track at once, perf multiplexes them: it rotates the events onto the available counters and scales each count to estimate the value for the whole run. A figure of 50.01% therefore means the event was counted for about half of the run, and the reported count is an estimate.
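The relationship between a percentage in parentheses and a reported count can be sketched as follows. When a counter runs for only part of the time it was enabled, perf extrapolates the raw count by the ratio of enabled time to running time. The function names and the numbers below are assumptions for illustration.

```python
def running_share_percent(time_enabled_ns, time_running_ns):
    """Percentage of the enabled time during which the counter was counting."""
    if time_enabled_ns == 0:
        raise ValueError("counter was never enabled")
    return 100.0 * time_running_ns / time_enabled_ns


def scaled_count(raw_count, time_enabled_ns, time_running_ns):
    """Extrapolate a raw count to the full enabled period."""
    if time_running_ns == 0:
        raise ValueError("counter never ran; no estimate is possible")
    return raw_count * time_enabled_ns / time_running_ns


# Hypothetical counter that ran for half of a 2 ms measurement window.
raw, enabled, running = 1_000, 2_000_000, 1_000_000
print(f"running share: {running_share_percent(enabled, running):.2f}%")
print(f"estimated count: {scaled_count(raw, enabled, running):.0f}")
```

Because the result is an extrapolation, counts with a low running share are less reliable. Requesting fewer events at a time reduces the need for multiplexing.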

4.2. Output Explanation

From the total number of cache references to cache misses, each metric contributes to a richer understanding of how our program interacts with the cache hierarchy.

Let’s better understand our output:

  • cache-references – counts the total number of cache references, which are memory accesses to the cache. In this sample, we have 2,315 cache references.
  • cache-misses – counts the number of cache misses, which are memory accesses that can’t be served from the cache and must fetch data from a slower level of the memory hierarchy. This event usually refers to last-level cache misses. There were 513 cache misses in this sample, accounting for 22.15% of all cache references.
  • cycles – counts the total number of CPU cycles consumed. Here, we have 1,004,927 cycles, indicating how long the CPU was busy.
  • instructions – counts the total number of instructions executed. Here, we have 1,002,550 instructions, with an average of 0.99 per cycle, which indicates how efficiently the instructions execute.
  • branches – counts the number of branch instructions executed. Here, we have 193,719 branches, indicating the number of points at which the program could change its execution path.
  • faults – counts the number of page faults, which occur when a process tries to access a page of memory that is not currently mapped in its address space. Here, we have 1,523 page faults.
  • migrations – counts the number of times a task is migrated between different CPU cores, with 2 migrations in this example.

Also, the time elapsed line shows the wall-clock time taken by the measured command, approximately 0.1009 seconds in this listing. For an actual run of sleep 5, we’d expect this value to be slightly above five seconds.
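The derived figures in the comment column can be recomputed from the raw counts. The short sketch below does this for the miss ratio and the instructions per cycle, using the counts from the sample listing. The helper name is our own, and the results may differ in the last digit from the listing because of rounding.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


# Raw counts taken from the sample listing.
cache_references = 2_315
cache_misses = 513
cycles = 1_004_927
instructions = 1_002_550

miss_ratio_percent = 100.0 * safe_ratio(cache_misses, cache_references)
instructions_per_cycle = safe_ratio(instructions, cycles)

print(f"cache miss ratio: {miss_ratio_percent:.2f}% of all cache refs")
print(f"instructions per cycle: {instructions_per_cycle:.2f}")
```

Tracking these two ratios across runs is often more informative than the raw counts, since they stay comparable when the amount of work changes.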

5. Using perf list, perf record and perf report for Cache Analysis

Now let’s look at an alternative approach that offers a different perspective on tracking cache-miss data.

We’ll examine perf record and perf report – a combination that records detailed performance data and analyzes it comprehensively.

5.1. Recording Cache-Miss Data Using perf record

Let’s start with recording cache-miss data using perf record. This method captures a detailed trace of events during program execution, which we can later dissect for deeper analysis.

However, a crucial step before recording is identifying the specific event of interest. We can list available events using the perf list command:

$ perf list
List of pre-defined events (to be used in -e):
  cycles:u                                  [Hardware event]
  instructions:u                            [Hardware event]
  cache-references:u                        [Hardware event]
  cache-misses:u                            [Hardware event]
  branch-instructions:u                     [Hardware event]
  branch-misses:u                           [Hardware event]
  cpu-clock:u                               [Software event]
  task-clock:u                              [Software event]
  page-faults:u                             [Software event]
  context-switches:u                        [Software event]
  minor-faults:u                            [Software event]
  major-faults:u                            [Software event]

As we can see, this command provides a catalog of the events we can monitor, including the cache-related ones. The list covers both hardware and software events that help us analyze the performance characteristics of a program or system.

Then, once we’ve identified the desired event, such as cache-misses, we can record data with perf record:

$ perf record -e cache-misses ./test_program
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.001 MB perf.data (~48 samples) ]

With this, perf record diligently records event data as our program runs, capturing insights into cache misses.

We can see that it captured and wrote approximately 0.001 megabytes of performance data to the perf.data file. Also, it mentions the collection of around 48 samples (events).

5.2. Analyzing Recorded Data With perf report

With data in hand, we can now analyze it with perf report. This command reads the recorded perf.data file and turns it into a report that attributes cache-miss events to the code that caused them.

Now, let’s use perf report to analyze the data we just recorded:

$ perf report -v

# Samples: 48K of event 'cache-misses'
# Event count (approx.): 23918294
#
# Overhead  Command      Shared Object             Symbol
# ........  .......  ...................  .....................
#
    100.00%  test_program  [.] main
          |
          |--85.42%-- function_A
          |          |
          |          |--60.22%-- sub_function_X
          |          |          sub_function_Y
          |          |
          |          |--25.20%-- sub_function_Z
          |
          |--14.58%-- function_B
                     |
                     |--12.03%-- sub_function_P
                     |          sub_function_Q
                     |
                     |--02.55%-- sub_function_R

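As we can see, the command organizes the report as a hierarchical call graph, showing the distribution of cache-miss events across different functions and sub-functions of our program. Call-graph output of this kind requires call stacks in the recorded data, which we can collect by passing -g to perf record.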

The percentages indicate the proportion of cache-miss events attributed to each function or sub-function, helping us identify hotspots where cache misses occur more frequently:

  • test_program – the profiled command, shown with main() as its top-level symbol
  • function_A – a function called from main()
  • sub_function_X – sub-function called from function_A()
  • sub_function_Y – another sub-function called from function_A()
  • sub_function_Z – yet another sub-function called from function_A()
  • function_B – another function called from main()
  • sub_function_P – a sub-function called from function_B()
  • sub_function_Q – another sub-function called from function_B()
  • sub_function_R – yet another sub-function called from function_B()

Notably, in real analysis, we would see several actual functions and symbol names from our program along with corresponding cache-miss event percentages.
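To turn the percentages into absolute numbers, we can multiply each share by the approximate event count from the report header. The sketch below does this for the two top-level functions in the sample report and lists them from hottest to coolest. The helper name is our own, and the results are estimates because the header count is itself approximate.

```python
def estimated_events(total_events, share_percent):
    """Estimate how many events fall under an entry with the given share."""
    return total_events * share_percent / 100.0


# Values taken from the sample report above.
EVENT_COUNT = 23_918_294
SHARES = {"function_A": 85.42, "function_B": 14.58}

# Sort by share so the hottest entry comes first.
ranked = sorted(SHARES.items(), key=lambda item: item[1], reverse=True)
for symbol, share in ranked:
    events = estimated_events(EVENT_COUNT, share)
    print(f"{symbol}: {share:.2f}% ~ {events:,.0f} cache-miss events")
```

Ranking entries this way helps us decide where to optimize first: a function responsible for most of the misses usually offers the largest potential gain.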

6. Conclusion

In this article, we learned about cache misses, hardware counters, and performance metrics using the perf tool in Linux. We covered how cache misses affect performance, how to count them with perf stat, and how to attribute them to specific functions with perf record and perf report.

Furthermore, by explicitly requesting cache events, we customized our perf analysis, allowing us to dissect cache-miss behavior alongside other critical metrics.

Using the perf tool effectively gives us a holistic view of program execution, enabling us to make targeted optimizations.
