Two different approachs for exposing NVIDIA’s performance counters in Nouveau

Hello,

I’ll talk again about the interface between the Linux kernel and the userspace (mesa). After few weeks of work, I now have a full implementation which exposes NVIDIA’s performance counters in Nouveau. I actually have two versions with different approachs. The first one is almost “all-userspace” which means that the configuration and the logic of performance counters are stored in the userspace, while the second one is almost “all-kernelspace” and only exposes what events can be monitored from the userspace. These two approachs use a set of software methods and the perfmon engine of Nouveau, initially written by Ben Skeggs, in order to set up performance counters.

This post will only focus on global counters, please refer to my latest article about MP counters on nv50/Tesla if you are interested. Before we continue, let me recall what is a performance counter for NVIDIA.

PCOUNTER: The performance counters engine

A hardware performance counter is a set of special registers which are used to store the counts of hardware-related activities. Hardware counters are oftenly used by developers to identify bottlenecks in their applications.

PCOUNTER is the card unit which contains most of the performance counters. PCOUNTER is divided in 8 domains (or sets) on nv50/Tesla. Each domain has a different source clock and has 255+ input signals that can themselves be the output of one multiplexer. PCOUNTER uses global counters. Counters do not sample one 8-bits signal, they sample a macro signal. A macro signal is the aggregation of 4 signals which have been combined using a function. An overview of this logic is represented by the figure below.

pcounter

Now, let me talk a bit about graphics counter exposed by NVIDIA on nv50/Tesla family.

Graphics counter for 3D applications

Graphics counter can be used to give detailled information for OpenGL/Direct3D applications. These performance counters are only exposed by NVIDIA PerfKit, an advanced software suite for profiling OpenCL and Direct3D/OpenGL applications on Windows (only). Last year, I reverse engineered most of these graphics counter. You can take a quick look at the documentation for nva3 (for example), this will introduce the notion of complex hardware events.

Overview of complex hardware events

A complex hardware event is composed by one or two macro signals which have been combined with a counter mode. Some of them are sometimes multiplexed and thus a multiplexer (a tuple address and value) needs to be configured in the engine which generates the signal. Hardware events are so the aggregation of multiple 8-bits signals and they are harder to monitor than a simple signal. Some events are also too complex to be monitored at one time and thus need multiple passes. As perfkit polls counters after each frame, an event that requires multiple passes will need the same amount of frame to be monitored. For instance, for frame x, the counters are set for the pass #0 while they are set up for pass #1 at frame x+1. The results of the two passes are then combined to create the result of the event. Multi-passes events are thus less accurate because they need more frames to be monitored

The main goal of the interface between the kernel and mesa is to expose these complex hardware events to the userspace.

The first interface (“all-userspace” approach)

The main idea of this interface is to store the configuration of complex hardware events inside mesa. In this approach, the kernel only knows the list of 8-bits signals and exposes them with a unique string identifier, for example, the signal 0xcb on nva3 is associated to ‘gr_idle’ on the set 1. Then, the userspace can build complex events and send the configuration to the kernel through an ioctl call which allocates a NOUVEAU_PERFCTR_CLASS object. A NOUVEAU_PERFCTR_CLASS object is used to init, poll and read performance counters.

This interface is based on a set of softwared methods used to control performance counters. Basically, we first allocate a NOUVEAU_PERFCTR_CLASS object with the configuration (8-bits signal/function/mode …) of the counter. Then, before a frame is rendering (using the begin_query() hook of gallium) we send the handle of this object with a software method to start monitoring. At this time, the configuration is written to PCOUNTER and the counter starts to count hardware related activities. After the frame, we send a sequence number with an another software method to read out values using a notify buffer object which is allocated along the current channel. If you are interested, a previous post gives more details about that interface.

With this “all-userspace” approach, the kernel is not able to monitor complex hardware events because the configuration and the logic is stored in the userspace. Actually, the configuration is shared between the kernel and mesa. The kernel only knows 8-bits signals while the userspace knows the configuration of hardware events.

Perf also called perf_events, is a kernel-based interface for profiling Linux which is able to monitor performance counters like the number of instructions executed. Thus, if the configuration of hardware events is stored in the userspace, this will be a problem for exposing them in perf because we don’t want to duplicate the configuration. I also talked with Daniel Vetter, the maintener of the i965 driver and the responsible of the major part of DRM, and he seems to be agree with the idea that it could be good to expose hardware events in perf.

We also have an another problem related to muxs because the userspace knows the configuration while the kernel does not. So, the kernel has to check address of muxs in order to avoid security issues.

The last problem is that the interface is closely based on the perfmon engine, so if perfmon changes in the future, this will require to add a new interface. But, we don’t want to add another driver private ioctl or design a new interface in case of perfmon must be evolved in the future. However, with the “all-kernelspace” approach we don’t have this problem since the kernel knows the logic and only exposes a list of monitorable events.

However, the “all-userspace” approach has the advantages to reduce the amount of code in the kernel and to facilitate the configuration of counters since all the logic is located in the userspace.

If you are interested you can take a look at the code :

mesa source code: https://github.com/hakzsam/mesa-latest/commits/nv50_pcounter_pm

libdrm source code: https://github.com/hakzsam/drm/commits/expose_perfctr_class

nouveau source code: https://github.com/hakzsam/nouveau/commits/expose_perfctr_class

The second interface (“all-kernelspace” approach)

This interface is kernel-based like Perf. The configuration and the logic (except multi-pass events which need two frames) are stored in the kernel only. The kernel exposes a list of monitorable events. Thus, the userspace just has to allocate a NOUVEAU_PERFEVENT_CLASS used to init, read and poll complex hardware events.

Like the previous interface, this is one is also based on a set of software methods used to control performance counters. The behaviour is almost the same than before except that we allocate a NOUVEAU_PERFEVENT_CLASS object which represents a complex hardware event instead of a NOUVEAU_PERFCTR_CLASS.

With this approach it’s easy to monitor complex hardware events inside Nouveau and to expose them to Perf in the future. Also, there is no security issues because muxs are configured from and by the kernel, we don’t have to check their address.

Since, the kernel only exposes a list of events and stores the configuration, pefmon can change without any impacts to the interface between the kernel and the userspace in the future. Basically, the userspace only knows the name of events, and some flags used to do scheduling. However, it’s hard to expose to the userspace what events are monitorable simultaneously or not.

On nv50/Tesla, we have 8 domains (or sets) and 4 counters per domain. Thus, if all complex events only use one counter per domain, we can monitor 32 events simultaneously. Good! But actually not… Because some events use 2 counters per domain. To handle this case, the userspace can retrieve the number of available domains and the number of counters per domain through an ioctl call. Then, we expose the domain ID and the number of counters needed by an event. With this information, we can schedule events from the userspace. But we still have one problem, how to handle the case where two events on the same domain share a mux?

Some events are multiplexed but two or more events can use the same mux with a different value. To handle this special case, we expose conflicts to the userspace using some 64 bits flags. Thus, the userspace just has to do a AND comparison to check if two events can be monitored simultaneously.

The source code of this “all-kernelspace” version is available below :

mesa source code: https://github.com/hakzsam/mesa-latest/commits/nv50_kernelspace_version

libdrm source code: https://github.com/hakzsam/drm/commits/expose_perfevent_class

nouveau source code: https://github.com/hakzsam/nouveau/commits/nv50_kernelspace_version

What is the best approach ? pros & cons

“all-userspace” approach

 Pros:
  • reduce the amount of code in the kernel
  • easy to apply logic of performance counters
 Cons:
  • not possible to monitor complex hardware events inside Nouveau and perf (Linux)
  • configuration of counters is shared between the userspace (complex events) and the kernelspace (8-bits signals)
  • possible security issues (the kernel must know address of muxs to check queries)
  • the interface (and the userspace) must be changed if perfmon changes in the future

“all-kernelspace” approach

 Pros:
  • possible to monitor complex hardware events inside Nouveau and perf (Linux)
  • configuration and logic (except multi-pass events) are stored in the kernel only
  • no security issues (muxs are configured by the kernel)
  • perfmon can evolve without any impacts regarding the interface since it only exposes a list of events
 Cons:
  • add more code in the kernel
  • hard to expose to the userspace what events are monitorable simultaneously or not

These two interfaces have different pros and cons, but in my opinion, I think the “all-kernelspace” is more elegant and more future-proof since we can monitor complex hardware events inside Nouveau and expose them to perf (Linux) .

To sum up, we still have to choose one version of the interface between the kernel and mesa. I’ll talk about this with Ben Skeggs, the maintener of Nouveau to get his opinion. We hope to get the code upstream in september or october, and before Linux 3.19.

Have a good day!

Advertisements