A first attempt at exposing NVIDIA’s performance counters in Nouveau

Hi folks,

Follow up on this year’s GSoC, it’s time to talk about the interface between the kernel and the userspace (mesa). Basically, the idea is to tell the kernel to monitor signal X and read back results from mesa. At the end of this project, almost-all the graphics counters for GeForce 8, 9 and 2XX (nv50/Tesla) will be exposed and this interface should be almost compatible with Fermi and Kepler. Some MP counters which still have to be reverse engineered will be added later.

To implement this interface between the Linux kernel and mesa, we can use ioctl calls or software methods. Let me first talk a bit about them.

ioctl calls vs software methods

An ioctl (Input/Output control) is the most common hardware-controlling operation, it’s a sort of system call, available in most driver categories. A software method is a special command added to the command stream of the GPU. Basically, the card is processing the command stream (FIFO) and encounter an unimplemented method. Then PFIFO waits until PGRAPH is idle and sends a specific IRQ called INVALID_METHOD to the kernel. At this time, the kernel is inside an interrupt context, the driver then will determine method and object that caused the interrupt and implements the method. The main difference between these two approaches is that software methods can be easily synchronized with the CPU through the command stream and are context-dependent, while ioctls are unsynchronized with the command stream. With SW methods, we can make sure it is called right after the commands we want and the following commands won’t get executed until the sw method is handled by the CPU, this is not possible with an ioctl

Currently, I have a first prototype of that interface using a set of software methods to get the advantage of the synchronization along the command stream, but also because ioctl calls are harder to implement and to maintain in the future. However, since a software method is invoked within an interrupt context we have to limit as much as possible the number of instructions needed to complete the task processed by it and it’s absolutely forbidden to do a sleep call for example.

A first prototype using software methods

Basically that interface, like the NVPerfKit’s, must be able to export a list of available hardware events, add or remove a counter, sample a counter, expose its value to the userspace and synchronize the different queries which will send by the userspace to the kernel. All of these operations are sent through a set of software methods.

Configure a counter

To configure a counter we will use a software method which is still not currently defined, but since we can send 32 bits of data along with it, it’s sufficient to identify a counter. For this, we can send the global ID of the counter or to allocate an object which represents a counter from the userspace and send its handle with that sw method. Then, the kernel pushes that counter in a staging area waiting for the next batch of counters or for the sample command. This command can be invoked successively to add several counters. Once all counters added by the user are known by the kernel it’s the time to send the sample command. It’s also possible to synchronization the configuration with the beginning and the end of a frame using software methods.

Sample a counter

This command also uses a software method which just tells the kernel to start monitoring. At this time, the kernel is configuring counters (ie. write values to a set of special registers), reading and storing their values, including the number of cycles processed which may be used by the userspace to compute a ratio.

Expose counter’s data to the userspace

Currently, we can configure and sample a counter but the result of this counting period is not yet exposed to the userspace. Basically, to be able to send results from the kernel to mesa we use a notifier buffer object which is dedicated to the communication from the kernelspace to the userspace. A notifier BO is allocated and mapped along a channel, so it can be accessible both by the kernel and the userspace. When mesa creates a channel, this special BO is automatically allocated by the kernel, then we just have to map it. At this time, the kernel can write results to this BO, and the userspace can read back from it. The result of a counting period is copied by the kernel to this notifier BO from an other software method which is also used in order to synchronize queries.

Synchronize queries with a sequence number

To synchronize queries we use a different sequence ID (like a fence) for each query we send to the kernel space. When the user wants to read out result it sends a query ID through a software method. Then this method does the read out, copies the counter’s value to the notifier BO and the sequence number at the offset 0. Also, we use a ringbuffer in the notify BO to store the list of counter ID, cycles and the counter’s value. This ringbuffer is a nice way to avoid stalling the command submission and is a good fit for the gallium HUD which queues up to 8 frames before having to read back the counters. As for the HUD, this ringbuffer stores the result of the N previous readouts. Since the offset 0 stores the latest sequence ID, we can easily check if the result is available in the ringbuffer. To check the result, we can do a busy waiting until the query we want to get it’s available in the ringbuffer or we can check if the result of that query has not been overwrittne by a newer one.

This buffer looks like this :



To sum up, almost all of these software methods use the perfmon engine initially written by Ben Skeggs. However, to support complex hardware events like special counter modes and multiple passes I still had to improve it.

Currently, the connection between these software methods and perfmon is in a work in progress state. I will try to complete this task as soon as possible to provide a full implementation.

I already have a set of patches in a Request For Comments state for perfmon and the software methods interface on my github account, you can take a look at them here. I also have an example out-of-mesa, initially written by Martin Peres, which shows how to use that first protoype (link). Two days ago, Ben Skeggs made good suggestions that I am currently investigating. Will get back to you on them when I’m done experimenting with them.

Design and implement a kernel interface with an elegant way takes a while…

See you soon for the full implementation!

A deeper look into NVPerfKit

NVIDIA NVPerfKit is a suite of performance tools to help developpers in identifying the performance bottleneck of OpenGL and Direct3D applications. It allows you to monitor hardware performance counters which are used to store the counts of hardware-related activities from the GPU itself. These performance counters (called “graphics counters” by NVIDIA) are usually used by developers to identify bottlenecks in their applications, like “how the gpu is busy?” or “how many triangles have been drawn in the current frame?” and so on. But, NVPerfKit is only available on Windows.

This year, my Google Summer of Code project is to expose NVIDIA’s graphics counter to help Linux/Nouveau developpers in improving their OpenGL applications. At the end of this summer, this project aims to offer a Linux version of NVPerfkit for NVIDIA’s graphics cards (only GeForce 8, 9 and 2XX in a first time) .  To expose these hardware events to the userspace, we have to write an interface between the Linux kernel and mesa. Basically, the idea is to tell to the kernel to monitor signal X and read back results from the userspace (i.e. mesa). However, before writing that interface we have to study the behaviours of NVPerfKit on Windows.

In a first time, let me explain (again) what is really a hardware performance counter. A hardware performance counter is a set of special registers used to count hardware-relatd activities. There are two type of counters, global counters from PCOUNTER and (local) MP counters. PCOUNTER is the card unit which contains most of the performance counters. PCOUNTER is divided in 8 domains (or sets) on nv50/Tesla. Each domain has a different source clock and has 255+ input signals that can themselves be the output of one multiplexer. PCOUNTER uses global counters whereas MP counters are per-app and context switched. Actually, these two types of counters are not really independent and may share some configuration parts, for example, the output of a signal multiplexer. On Tesla/nv50, it is possible to monitor 4 macro signals concurrently per domain. A macro signal is the aggregation of 4 signals which have been combined with a function. In this post, we are only focusing on global counters. Now, the question is how NVPerfKit monitors these global performance counters ?

Case #1 : How NVPerfKit handles multiple apps being monitored concurrently ?

NVIDIA does not handle this case at all, and the behaviour is thus undefined when more than one application is monitoring performance counters at the same time. Then, because of the issue of shared configuration of global counters (PCOUNTER) and local counters (MP counters), I think it’s a bad idea to allow monitoring multiple applications concurrently. To solve this problem, I suggest, at first, to use a global lock for allowing only one application at a time and for simplifying the implementation.

Case #2 : How NVPerfKit handles only one counter per domain ?

This is the simplest case, and there are no particular requirements.

Case #3 : How NVPerfKit handles multiple counters per domain ?

NVPerfKit uses a round robin mode, then it still monitors only one counter per domain and it switches the current counter after each frame.

Case #4 : How NVPerfKit handles multiple counters on different domains ?

No problem here, NVPerfKit is able to monitor multiple counters on different domains (each domain having up to one event to monitor).

To sum up, NVPerfKit always uses a round robin mode when it has to monitor more than one hw event on the same domain.

Concerning the sampling part, NVIDIA say (NVPerfKit User Guide – page 11 – Appendix B. Counters reference):

All of the software/driver counters represent a per frame accounting. These counters are accumulated and updated in the driver per frame, so even if you sample at a sub-frame rate frequency, the software counters will hold the same data (from the previous frame) until the end of the current frame.

This article should have been published the last month, but during this time I worked on the prototype’s definition and its implementation. Currently, I have a first prototype which works quite well, I’ll submit it the next week.

See you the next week!