NVIDIA performance counters in Nouveau with Linux 4.2


This weekend, my work on implementing NVIDIA global performance counters was merged into Nouveau.

With Linux 4.2, Nouveau will allow the userspace to monitor both compute and graphics (global) counters on Tesla, but only compute counters on Fermi. I need to go back to Windows to reverse engineer the graphics counters with NVIDIA PerfKit. As for Kepler, I still have to figure out how to deal with clock gating, but this is not going to be hard, so I’ll probably submit a series which adds compute counters this month.

All of these performance counters will be exposed through the Gallium HUD and GL_AMD_performance_monitor once I have finished writing the code in mesa.

But don’t get too excited just yet, because we still need to implement the new nvif interface exposed by Nouveau in libdrm.

My plan is to complete all of this work before XDC 2015.


Reverse engineering Windows or Linux PCI drivers with Intel VT-d and QEMU – Part 1

Today, I will describe a new way to reverse engineer PCI drivers by setting up PCI passthrough to a QEMU virtual machine. In this article, I will show you how to use Intel VT-d in order to trace the memory-mapped input/output (MMIO) accesses of a QEMU VM. As I am a member of the Nouveau community, this howto focuses only on NVIDIA’s proprietary driver, but it should be pretty similar for all PCI drivers.


Reverse engineering NVIDIA’s proprietary driver is not an easy task, especially on Windows, because there we have neither mmiotrace, a toolbox for tracing memory-mapped I/O accesses within the Linux kernel, nor valgrind-mmt, which traces application accesses to mmapped memory.

When I started to reverse engineer NVIDIA PerfKit on Windows (for graphics performance counters) between Google Summer of Code 2013 and 2014, I wrote some tools for dumping the configuration of these performance counters, but it was very painful to find multiplexers because I couldn’t really trace MMIO accesses. I would have liked to use Intel VT-d, but my old computer didn’t support it. I recently got a new computer, and my life has changed. 😉

But what is VT-d, and how do you use it with QEMU?

An input/output memory management unit (IOMMU) allows guest virtual machines to directly use peripheral devices, such as Ethernet adapters and accelerated graphics cards, through DMA and interrupt remapping. Intel calls this VT-d and AMD calls it AMD-Vi.

QEMU lets you use this technology through the VFIO driver, an IOMMU/device-agnostic framework for exposing direct device access to userspace in a secure, IOMMU-protected environment. In other words, it allows safe, non-privileged userspace drivers. Initially developed by Cisco, VFIO is now maintained by Alex Williamson at Red Hat.

In this howto, I will use Fedora as the guest OS, but it should work with any Linux or Windows guest. Let’s get started.

Tested hardware

Motherboard: ASUS B85 PRO GAMER

CPU: Intel Core i5-4460 3.20GHz

GPU: NVIDIA GeForce 210 (host) and NVIDIA GeForce 9500 GT (guest)

OS: Arch Linux (host) and Fedora 21 (guest)


Your CPU needs to support both virtualization and IOMMU (Intel VT-d technology, Core i5 at least). You will also need two NVIDIA GPUs and two monitors, or one monitor with two different inputs (one plugged into your host GPU, one into your guest GPU). I would also recommend having a separate keyboard and mouse for the guest OS.

Step 1: Hardware setup

Check if your CPU supports virtualization.

egrep -i '^flags.*(svm|vmx)' /proc/cpuinfo

If so, enable CPU virtualization support and Intel VT-d from the BIOS.

Step 2: Kernel config

1) Modify kernel config
Device Drivers --->
    [*] IOMMU Hardware Support  --->
        [*]   Support for Intel IOMMU using DMA Remapping Devices
        [*]   Support for Interrupt Remapping
Device Drivers --->
    [*] VFIO Non-Privileged userspace driver framework  --->
        [*]   VFIO PCI support for VGA devices
Bus options (PCI etc.) --->
    [*] PCI Stub driver
2) Build kernel
3) Reboot, and check if your system has support for both IOMMU and DMA remapping
dmesg | grep -e IOMMU -e DMAR
[    0.000000] ACPI: DMAR 0x00000000BD9373C0 000080 (v01 INTEL  HSW      00000001 INTL 00000001)
[    0.019360] dmar: IOMMU 0: reg_base_addr fed90000 ver 1:0 cap d2008c20660462 ecap f010da
[    0.019362] IOAPIC id 8 under DRHD base  0xfed90000 IOMMU 0
[    0.292166] DMAR: No ATSR found
[    0.292235] IOMMU: dmar0 using Queued invalidation
[    0.292237] IOMMU: Setting RMRR:
[    0.292246] IOMMU: Setting identity map for device 0000:00:14.0 [0xbd8a6000 - 0xbd8b2fff]
[    0.292269] IOMMU: Setting identity map for device 0000:00:1a.0 [0xbd8a6000 - 0xbd8b2fff]
[    0.292288] IOMMU: Setting identity map for device 0000:00:1d.0 [0xbd8a6000 - 0xbd8b2fff]
[    0.292301] IOMMU: Prepare 0-16MiB unity mapping for LPC
[    0.292307] IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 - 0xffffff]

!!! If you have no output, you have to fix this before continuing !!!

Step 3: Build QEMU

git clone git://git.qemu-project.org/qemu.git --depth 1
cd qemu
./configure --python=/usr/bin/python2 # Python 3 is not yet supported
make && make install

You can also install QEMU from your favorite package manager, but I would recommend building it from source if you want to enable VFIO tracing support.

Step 4: Unbind the GPU with pci-stub

Since I have two NVIDIA GPUs, blacklisting the Nouveau kernel module is not a good option, because the host GPU still needs it. Instead, I will use pci-stub to claim the GPU which will be assigned to the guest OS, so that Nouveau does not bind to it.

NOTE: If pci-stub was built as a module, you’ll need to modify /etc/mkinitcpio.conf, add pci-stub in the MODULES section, and update your initramfs.

01:00.0 VGA compatible controller: NVIDIA Corporation GT218 [GeForce 210] (rev a2)
05:00.0 VGA compatible controller: NVIDIA Corporation G96 [GeForce 9500 GT] (rev a1)
lspci -n
01:00.0 0300: 10de:0a65 (rev a2) # GT218
05:00.0 0300: 10de:0640 (rev a1) # G96

Now add the following kernel parameter to your bootloader, using the vendor:device ID of the guest GPU (the G96 here).

pci-stub.ids=10de:0640

Reboot, and check.

dmesg | grep pci-stub
[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-nouveau root=UUID=5f64607c-5c72-4f65-9960-d5c7a981059e rw quiet pci-stub.ids=10de:0640
[    0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-nouveau root=UUID=5f64607c-5c72-4f65-9960-d5c7a981059e rw quiet pci-stub.ids=10de:0640
[    0.295763] pci-stub: add 10DE:0640 sub=FFFFFFFF:FFFFFFFF cls=00000000/00000000
[    0.295768] pci-stub 0000:05:00.0: claimed by stub

Step 5: Bind the GPU with VFIO

Now, it’s time to bind the GPU (the G96 card in this example) with VFIO in order to pass through it to the VM. You can use this script to make life easier:


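#!/bin/bash
# vfio-bind.sh: bind the given PCI devices (e.g. 0000:05:00.0) to vfio-pci.

modprobe vfio-pci

for dev in "$@"; do
        vendor=$(cat "/sys/bus/pci/devices/$dev/vendor")
        device=$(cat "/sys/bus/pci/devices/$dev/device")
        # Detach the device from its current driver, if any.
        if [ -e "/sys/bus/pci/devices/$dev/driver" ]; then
                echo "$dev" > "/sys/bus/pci/devices/$dev/driver/unbind"
        fi
        echo "$vendor" "$device" > /sys/bus/pci/drivers/vfio-pci/new_id
done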

Bind the GPU:

./vfio-bind.sh 0000:05:00.0 # G96
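To double-check that the bind worked, you can ask sysfs which driver currently owns the device. The helper below is only a sketch: pci_driver is a hypothetical name, and its optional second argument (an alternative sysfs root) exists purely so the function can be tried against a fake directory tree.

```shell
#!/bin/sh
# pci_driver DEV [SYSFS_ROOT]
# Print the name of the driver bound to PCI device DEV, or "none".
# SYSFS_ROOT defaults to /sys (hypothetical parameter, for testing only).
pci_driver() {
    link="${2:-/sys}/bus/pci/devices/$1/driver"
    if [ -L "$link" ]; then
        # The "driver" entry is a symlink into /sys/bus/pci/drivers/<name>.
        basename "$(readlink "$link")"
    else
        echo none
    fi
}
```

With the setup described here, pci_driver 0000:05:00.0 should report pci-stub after a reboot and vfio-pci once vfio-bind.sh has run.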

Step 6: Testing KVM VGA-Passthrough

Let’s test if it works, as root:

qemu-system-x86_64 \
    -enable-kvm \
    -M q35 \
    -m 2G \
    -cpu host,kvm=off \
    -device vfio-pci,host=05:00.0,multifunction=on,x-vga=on

If it works fine, you should see a black QEMU window with the message “Guest has not initialized the display (yet)”. You will need to pass -vga none, otherwise it won’t work. I’ll show you all the options I use a bit later.

NOTE: kvm=off is required for some recent NVIDIA proprietary drivers, because they refuse to load if they detect KVM.

Step 7: Add USB support

At this step, we have assigned the GPU to the virtual machine, but it would be a good idea to be able to use that guest OS with a keyboard, for example. To do this, we need to add USB support to the VM. The preferred way is to pass through an entire USB controller like we already did for the GPU.

lspci | grep USB
00:14.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB xHCI (rev 05)
00:1a.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #2 (rev 05)
00:1d.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #1 (rev 05)

Add the following option to the QEMU command line (example for 00:14.0):

-device vfio-pci,host=00:14.0,bus=pcie.0

Before trying USB support inside the VM, you need to assign that USB controller to VFIO, but you will lose your keyboard and your mouse from the host in case they are connected to that controller.

./vfio-bind.sh 0000:00:14.0

To re-enable USB support on the host, unbind the controller from vfio-pci and bind it back to xhci_hcd.

echo 0000:00:14.0 > /sys/bus/pci/drivers/vfio-pci/unbind
echo 0000:00:14.0 > /sys/bus/pci/drivers/xhci_hcd/bind

If you get an error with USB support, you might simply try a different controller, or try to assign USB devices by ID.
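Assigning a single USB device by ID can be sketched as follows. This assumes QEMU’s usb-host device with its vendorid and productid properties; usb_host_opts is a hypothetical helper, and 046d:c52b is only a placeholder for whatever ID lsusb reports for your own keyboard or mouse.

```shell
#!/bin/sh
# usb_host_opts VENDOR:PRODUCT
# Print QEMU options that pass one host USB device to the guest, given its
# ID in the form printed by lsusb (placeholder example: 046d:c52b).
usb_host_opts() {
    echo "-usb -device usb-host,vendorid=0x${1%%:*},productid=0x${1##*:}"
}

usb_host_opts 046d:c52b
# prints: -usb -device usb-host,vendorid=0x046d,productid=0xc52b
```

The printed options would replace the vfio-pci line for the USB controller, so the controller itself stays bound to the host.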

Step 8: Install guest OS

Now, it’s time to install the guest OS. I installed Fedora 21 because I could not run Arch Linux inside QEMU due to a bug in syslinux… Anyway, install your favorite Linux distribution and go ahead. I would also recommend installing envytools (a collection of tools developed by members of the Nouveau community) in order to easily test the tracing support.

You can use the script below to launch a VM with VGA and USB passthrough, and all the stuff we need.


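#!/bin/bash

# Bind one PCI device (e.g. 0000:05:00.0) to vfio-pci.
vfio_bind()
{
        dev="$1"
        vendor=$(cat "/sys/bus/pci/devices/$dev/vendor")
        device=$(cat "/sys/bus/pci/devices/$dev/device")
        # Detach the device from its current driver, if any.
        if [ -e "/sys/bus/pci/devices/$dev/driver" ]; then
                echo "$dev" > "/sys/bus/pci/devices/$dev/driver/unbind"
        fi
        echo "$vendor" "$device" > /sys/bus/pci/drivers/vfio-pci/new_id
}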

# Bind devices.
modprobe vfio-pci
vfio_bind 0000:05:00.0  # GPU (NVIDIA G96)
vfio_bind 0000:00:14.0  # USB controller

qemu-system-x86_64 \
    -enable-kvm \
    -M q35 \
    -m 2G \
    -hda fedora.img \
    -boot d \
    -cpu host,kvm=off \
    -vga none \
    -device vfio-pci,host=05:00.0,multifunction=on,x-vga=on \
    -device vfio-pci,host=00:14.0,bus=pcie.0

# Restore USB controller
echo 0000:00:14.0 > /sys/bus/pci/drivers/vfio-pci/unbind
echo 0000:00:14.0 > /sys/bus/pci/drivers/xhci_hcd/bind

Step 9: Enable VFIO tracing support for QEMU

1) Configure QEMU to enable tracing

Enable the stderr trace backend. Please refer to docs/tracing.txt if you want to change the backend.

./configure --python=/usr/bin/python2 --enable-trace-backends=stderr
2) Disable MMAP support

Disabling mmap support forces QEMU to use the slower read/write access path to MMIO space, and that path can be traced. To do this, open the file include/hw/vfio/vfio-common.h, and change #define VFIO_ALLOW_MMAP from 1 to 0.

 /* Extra debugging, trap acceleration paths for more logging */
-#define VFIO_ALLOW_MMAP 1
+#define VFIO_ALLOW_MMAP 0

Re-build QEMU.

3) Add the trace points you want to observe

Create an events.txt file and add the vfio_region_write trace point, which dumps MMIO write accesses to the GPU.

echo "vfio_region_write" > events.txt

VFIO tracing support is now enabled and configured, really easy, huh?

Thanks to Alex Williamson for these hints.

Step 10: Trace MMIO write accesses

Let’s now test VFIO tracing support. Enable event tracing by adding the following option to the script which launches the VM.

-trace events=events.txt

Launch the VM. You should see a lot of traces on the standard error output, which is good news.

Open a terminal in the VM, go to the directory where envytools has been built, and run (as root) the following command.

./nvahammer 0xa404 0xdeadbeef

This command writes a 32-bit value (0xdeadbeef) to the MMIO register at 0xa404 and repeats the write in an infinite loop. It needs to be manually aborted.

Go back to the host, and you should see the following traces if it works fine.

12347@1424299207.289770:vfio_region_write  (0000:05:00.0:region0+0xa404, 0xdeadbeef, 4)
12347@1424299207.289774:vfio_region_write  (0000:05:00.0:region0+0xa404, 0xdeadbeef, 4)
12347@1424299207.289778:vfio_region_write  (0000:05:00.0:region0+0xa404, 0xdeadbeef, 4)

In this example, we have only traced MMIO write accesses, but of course, if you want to trace read accesses, you just have to change vfio_region_write to vfio_region_read.
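Until QEMU can filter by device, the trace can be narrowed down after the fact. The sketch below assumes the line format shown above (device:regionN+offset as the first field inside the parentheses); filter_trace is a hypothetical helper name, not something provided by QEMU or envytools.

```shell
#!/bin/sh
# filter_trace DEVICE LOW HIGH
# Read trace lines on stdin and keep only the vfio_region_read/write events
# of one PCI device whose region offset lies in [LOW, HIGH].
filter_trace() {
    awk -v dev="$1" -v lo="$(($2))" -v hi="$(($3))" '
        # Convert a hex string such as "0xa404" to a number (portable awk).
        function hex(s,    i, n, c) {
            s = tolower(s); sub(/^0x/, "", s); n = 0
            for (i = 1; i <= length(s); i++) {
                c = index("0123456789abcdef", substr(s, i, 1))
                if (c == 0) return -1
                n = n * 16 + c - 1
            }
            return n
        }
        /vfio_region_(read|write)/ {
            # Payload looks like: (0000:05:00.0:region0+0xa404, 0xdeadbeef, 4)
            start = index($0, "(" dev ":region")
            if (start == 0) next
            rest = substr($0, start)
            plus = index(rest, "+"); comma = index(rest, ",")
            if (plus == 0 || comma < plus) next
            off = hex(substr(rest, plus + 1, comma - plus - 1))
            if (off >= lo && off <= hi) print
        }'
}
```

Since the stderr backend is used, the launch script’s error output can be piped through it, for example with 2>&1 | filter_trace 0000:05:00.0 0xa000 0xafff to keep only the GPU accesses around the register written by nvahammer.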


In this article, I showed you how to trace MMIO accesses using PCI passthrough with QEMU, Intel VT-d and VFIO. However, accesses to all passed-through PCI devices are currently traced, including the USB controller, which is not ideal; mmiotrace, by contrast, only dumps accesses for one peripheral. It would also be a good idea to use the same format as mmiotrace, in order to reuse the decoding tools we already have for it in envytools.

Future work

– do not trace all PCI accesses (device and subrange address filtering)

– convert VFIO traces to the mmiotrace format

– compare performance when tracing support is enabled or not

Related resources

KVM VGA-Passthrough on ArchLinux

VGA-Passthrough on Debian

VFIO documentation

QEMU VFIO tracing documentation

Two different approaches for exposing NVIDIA’s performance counters in Nouveau


I’ll talk again about the interface between the Linux kernel and the userspace (mesa). After a few weeks of work, I now have a full implementation which exposes NVIDIA’s performance counters in Nouveau. I actually have two versions with different approaches. The first one is almost “all-userspace”, which means that the configuration and the logic of performance counters are stored in the userspace, while the second one is almost “all-kernelspace” and only exposes which events can be monitored from the userspace. Both approaches use a set of software methods and the perfmon engine of Nouveau, initially written by Ben Skeggs, in order to set up performance counters.

This post only focuses on global counters; please refer to my latest article about MP counters on nv50/Tesla if you are interested. Before we continue, let me recall what a performance counter is on NVIDIA hardware.

PCOUNTER: The performance counters engine

A hardware performance counter is a set of special registers which are used to store the counts of hardware-related activities. Hardware counters are often used by developers to identify bottlenecks in their applications.

PCOUNTER is the card unit which contains most of the performance counters. PCOUNTER is divided into 8 domains (or sets) on nv50/Tesla. Each domain has a different source clock and has 255+ input signals, which can themselves be the output of a multiplexer. PCOUNTER uses global counters. A counter does not sample a single 8-bit signal; it samples a macro signal, which is the aggregation of 4 signals combined using a function. An overview of this logic is represented by the figure below.
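As a rough illustration of how 4 signals can be combined into one macro signal, here is a minimal sketch. It assumes that the function is encoded as a 16-bit truth table indexed by the 4 input bits, in the spirit of the logic_op value used by Nouveau’s perfmon code; macro_signal is a hypothetical helper, and the real hardware evaluates this on every clock of the domain.

```shell
#!/bin/sh
# macro_signal FUNC S0 S1 S2 S3
# ASSUMPTION: FUNC is a 16-bit truth table. Bit number
# (S3<<3 | S2<<2 | S1<<1 | S0) of FUNC is the output for these inputs.
macro_signal() {
    idx=$(( ($5 << 3) | ($4 << 2) | ($3 << 1) | $2 ))
    echo $(( ($1 >> idx) & 1 ))
}

macro_signal 0xaaaa 1 0 0 0   # 0xaaaa follows S0 only        -> 1
macro_signal 0x8888 1 0 0 0   # 0x8888 is S0 AND S1           -> 0
macro_signal 0xeeee 0 1 0 0   # 0xeeee is S0 OR S1            -> 1
```

With this encoding, any boolean combination of the 4 signals fits in a single 16-bit value, which is why one configuration word per counter is enough.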


Now, let me talk a bit about the graphics counters exposed by NVIDIA on the nv50/Tesla family.

Graphics counters for 3D applications

Graphics counters can be used to give detailed information about OpenGL/Direct3D applications. These performance counters are only exposed by NVIDIA PerfKit, an advanced software suite for profiling OpenCL and Direct3D/OpenGL applications on Windows (only). Last year, I reverse engineered most of these graphics counters. You can take a quick look at the documentation for nva3 (for example), which will introduce the notion of complex hardware events.

Overview of complex hardware events

A complex hardware event is composed of one or two macro signals which have been combined with a counter mode. Some of these signals are multiplexed, and thus a multiplexer (an address/value tuple) needs to be configured in the engine which generates the signal. Hardware events are therefore the aggregation of multiple 8-bit signals, and they are harder to monitor than a simple signal. Some events are also too complex to be monitored in one go and thus need multiple passes. As PerfKit polls counters after each frame, an event that requires multiple passes needs the same number of frames to be monitored. For instance, at frame x the counters are set up for pass #0, while at frame x+1 they are set up for pass #1. The results of the two passes are then combined to create the result of the event. Multi-pass events are thus less accurate, because they need more frames to be monitored.

The main goal of the interface between the kernel and mesa is to expose these complex hardware events to the userspace.

The first interface (“all-userspace” approach)

The main idea of this interface is to store the configuration of complex hardware events inside mesa. In this approach, the kernel only knows the list of 8-bit signals and exposes them with a unique string identifier; for example, the signal 0xcb on nva3 is associated with ‘gr_idle’ on set 1. Then, the userspace can build complex events and send the configuration to the kernel through an ioctl call which allocates a NOUVEAU_PERFCTR_CLASS object. A NOUVEAU_PERFCTR_CLASS object is used to init, poll and read performance counters.

This interface is based on a set of software methods used to control performance counters. Basically, we first allocate a NOUVEAU_PERFCTR_CLASS object with the configuration (8-bit signals, function, mode…) of the counter. Then, before a frame is rendered (using the begin_query() hook of gallium), we send the handle of this object with a software method to start monitoring. At this time, the configuration is written to PCOUNTER and the counter starts to count hardware-related activities. After the frame, we send a sequence number with another software method to read out values using a notifier buffer object which is allocated along with the current channel. If you are interested, a previous post gives more details about that interface.

With this “all-userspace” approach, the kernel is not able to monitor complex hardware events because the configuration and the logic are stored in the userspace. Actually, the configuration is shared between the kernel and mesa. The kernel only knows 8-bit signals while the userspace knows the configuration of hardware events.

Perf, also called perf_events, is a kernel-based interface for profiling Linux which is able to monitor performance counters such as the number of instructions executed. If the configuration of hardware events is stored in the userspace, exposing them in perf becomes a problem, because we don’t want to duplicate the configuration in the kernel. I also talked with Daniel Vetter, the maintainer of the i915 kernel driver and responsible for a major part of DRM, and he seems to agree that it could be good to expose hardware events in perf.

We also have another problem related to muxes: the userspace knows their configuration while the kernel does not. So the kernel has to check the addresses of muxes in order to avoid security issues.

The last problem is that the interface is closely tied to the perfmon engine, so if perfmon changes in the future, a new interface will be required. But we don’t want to add another driver-private ioctl or design a new interface whenever perfmon has to evolve. With the “all-kernelspace” approach, we don’t have this problem, since the kernel knows the logic and only exposes a list of monitorable events.

However, the “all-userspace” approach has the advantage of reducing the amount of code in the kernel and of making counters easier to configure, since all the logic is located in the userspace.

If you are interested, you can take a look at the code:

mesa source code: https://github.com/hakzsam/mesa-latest/commits/nv50_pcounter_pm

libdrm source code: https://github.com/hakzsam/drm/commits/expose_perfctr_class

nouveau source code: https://github.com/hakzsam/nouveau/commits/expose_perfctr_class

The second interface (“all-kernelspace” approach)

This interface is kernel-based like Perf. The configuration and the logic (except multi-pass events which need two frames) are stored in the kernel only. The kernel exposes a list of monitorable events. Thus, the userspace just has to allocate a NOUVEAU_PERFEVENT_CLASS used to init, read and poll complex hardware events.

Like the previous one, this interface is also based on a set of software methods used to control performance counters. The behaviour is almost the same as before, except that we allocate a NOUVEAU_PERFEVENT_CLASS object, which represents a complex hardware event, instead of a NOUVEAU_PERFCTR_CLASS.

With this approach, it’s easy to monitor complex hardware events inside Nouveau and to expose them to perf in the future. Also, there are no security issues: because muxes are configured from and by the kernel, we don’t have to check their addresses.

Since the kernel only exposes a list of events and stores the configuration, perfmon can change in the future without any impact on the interface between the kernel and the userspace. Basically, the userspace only knows the names of events and some flags used for scheduling. However, it’s hard to expose to the userspace which events can be monitored simultaneously.

On nv50/Tesla, we have 8 domains (or sets) and 4 counters per domain. Thus, if all complex events only use one counter per domain, we can monitor 32 events simultaneously. Good! But actually not… Because some events use 2 counters per domain. To handle this case, the userspace can retrieve the number of available domains and the number of counters per domain through an ioctl call. Then, we expose the domain ID and the number of counters needed by an event. With this information, we can schedule events from the userspace. But we still have one problem, how to handle the case where two events on the same domain share a mux?

Some events are multiplexed, and two or more events can use the same mux with different values. To handle this special case, we expose conflicts to the userspace using 64-bit flags. Thus, the userspace just has to do a bitwise AND to check whether two events can be monitored simultaneously.
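The two scheduling constraints (counter budget per domain, and mux conflicts) can be sketched as a simple greedy pass. Everything here is illustrative: the event table format, the schedule function and the evt_* names are hypothetical, and the sketch assumes that two events conflict whenever their flags share a bit.

```shell
#!/bin/sh
COUNTERS_PER_DOMAIN=4   # nv50/Tesla: 4 counters in each domain

# schedule: read "name domain counters conflict_flags" lines on stdin and
# print the events that fit in one pass (greedy, first come first served).
# The body runs in a subshell so per-domain state never leaks to the caller.
schedule() (
    busy=0
    while read -r name dom need mask; do
        eval "used=\${used_$dom:-0}"
        # Skip the event if its domain has too few free counters left...
        [ $((used + need)) -le "$COUNTERS_PER_DOMAIN" ] || continue
        # ...or if it shares a mux with an event that is already scheduled.
        [ $((busy & $mask)) -eq 0 ] || continue
        eval "used_$dom=$((used + need))"
        busy=$((busy | $mask))
        echo "$name"
    done
)

schedule <<'EOF'
gr_idle 1 1 0x0
evt_a   1 2 0x1
evt_b   1 2 0x2
evt_c   2 1 0x1
evt_d   2 2 0x4
EOF
```

Here gr_idle, evt_a and evt_d are printed: evt_b would need a fifth counter on domain 1, and evt_c shares a conflict bit with evt_a. The rejected events would have to wait for a later pass.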

The source code of this “all-kernelspace” version is available below:

mesa source code: https://github.com/hakzsam/mesa-latest/commits/nv50_kernelspace_version

libdrm source code: https://github.com/hakzsam/drm/commits/expose_perfevent_class

nouveau source code: https://github.com/hakzsam/nouveau/commits/nv50_kernelspace_version

What is the best approach? Pros & cons

“all-userspace” approach

  Pros:
  • reduces the amount of code in the kernel
  • easy to apply the logic of performance counters

  Cons:
  • not possible to monitor complex hardware events inside Nouveau and perf (Linux)
  • configuration of counters is shared between the userspace (complex events) and the kernelspace (8-bit signals)
  • possible security issues (the kernel must know the addresses of muxes to check queries)
  • the interface (and the userspace) must be changed if perfmon changes in the future

“all-kernelspace” approach

  Pros:
  • possible to monitor complex hardware events inside Nouveau and perf (Linux)
  • configuration and logic (except multi-pass events) are stored in the kernel only
  • no security issues (muxes are configured by the kernel)
  • perfmon can evolve without any impact on the interface, since it only exposes a list of events

  Cons:
  • adds more code in the kernel
  • hard to expose to the userspace which events can be monitored simultaneously

These two interfaces have different pros and cons, but in my opinion the “all-kernelspace” approach is more elegant and more future-proof, since we can monitor complex hardware events inside Nouveau and expose them to perf (Linux).

To sum up, we still have to choose one version of the interface between the kernel and mesa. I’ll talk about this with Ben Skeggs, the maintainer of Nouveau, to get his opinion. We hope to get the code upstream in September or October, before Linux 3.19.

Have a good day!

Implement MP counters for nv50 (compute only)


As part of my Google Summer of Code project, I implemented MP counters (for compute only) on nv50/Tesla. This work follows the implementation of MP counters for nvc0/Fermi I did last year.

Compute counters are used by OpenCL, while graphics counters are used to count hardware-related activities of OpenGL applications. The distinction NVIDIA makes between these two types of counters is arbitrary and won’t be present in my implementation. That’s why compute counters can also be used to give detailed information about OpenGL applications, like the number of instructions processed per frame or the number of launched warps.

MP performance counters are local and per-context, while performance counters programmed through the PCOUNTER engine are global. An MP counter is more accurate than a global counter because it counts hardware-related activities for each context separately, while a global counter reports activities regardless of the context that generates them.

All of these MP counters have been reverse engineered using CUPTI, the NVIDIA CUDA profiling tools interface which only exposes compute counters. On nv50/tesla, CUPTI exposes 13 performance counters like instructions or warp_serialize. The nv50 family has 4 MP counters per TPC (Texture Processing Cluster).

Currently, this prototype implements an interface between the kernel and mesa which exposes these MP performance counters to the user through the Gallium HUD. Basically, this interface can configure and poll a counter using the push buffer and a set of software methods.

To configure an MP counter, we use the command stream like the blob does. We have two methods: the first one configures the counter (mode, signal, unit and logic operation) and the second one just reinitializes the counter. Then, to select the group of the MP counter, we have added a software method. To poll counters, we use a notifier buffer object which is allocated along with a channel. This notifier lets the kernel and mesa communicate. This approach has already been explained in my latest article.

To sum up, this prototype adds support for 13 performance counters on nv50/tesla. All of the code is available on my github account. If you are interested, you can take a look at the mesa and the nouveau code.

Have a good day.

A first attempt at exposing NVIDIA’s performance counters in Nouveau

Hi folks,

Following up on this year’s GSoC, it’s time to talk about the interface between the kernel and the userspace (mesa). Basically, the idea is to tell the kernel to monitor signal X and read back results from mesa. At the end of this project, almost all the graphics counters for GeForce 8, 9 and 2XX (nv50/Tesla) will be exposed, and this interface should be almost compatible with Fermi and Kepler. Some MP counters which still have to be reverse engineered will be added later.

To implement this interface between the Linux kernel and mesa, we can use ioctl calls or software methods. Let me first talk a bit about them.

ioctl calls vs software methods

An ioctl (input/output control) is the most common hardware-controlling operation; it’s a sort of system call, available in most driver categories. A software method is a special command added to the command stream of the GPU. Basically, the card processes the command stream (FIFO) and encounters an unimplemented method. PFIFO then waits until PGRAPH is idle and sends a specific IRQ called INVALID_METHOD to the kernel. At this time, the kernel is inside an interrupt context; the driver then determines the method and object that caused the interrupt and implements the method. The main difference between these two approaches is that software methods can be easily synchronized with the CPU through the command stream and are context-dependent, while ioctls are unsynchronized with the command stream. With software methods, we can make sure the method is called right after the commands we want, and the following commands won’t get executed until the software method has been handled by the CPU. This is not possible with an ioctl.

Currently, I have a first prototype of that interface using a set of software methods to get the advantage of the synchronization along the command stream, but also because ioctl calls are harder to implement and to maintain in the future. However, since a software method is invoked within an interrupt context we have to limit as much as possible the number of instructions needed to complete the task processed by it and it’s absolutely forbidden to do a sleep call for example.

A first prototype using software methods

Basically, that interface, like NVPerfKit’s, must be able to export a list of available hardware events, add or remove a counter, sample a counter, expose its value to the userspace and synchronize the different queries which will be sent by the userspace to the kernel. All of these operations are sent through a set of software methods.

Configure a counter

To configure a counter, we will use a software method which is not defined yet, but since we can send 32 bits of data along with it, that is sufficient to identify a counter. For this, we can either send the global ID of the counter, or allocate an object which represents a counter from the userspace and send its handle with that software method. Then, the kernel pushes that counter into a staging area, waiting for the next batch of counters or for the sample command. This command can be invoked successively to add several counters. Once all counters added by the user are known by the kernel, it’s time to send the sample command. It’s also possible to synchronize the configuration with the beginning and the end of a frame using software methods.

Sample a counter

This command also uses a software method, which just tells the kernel to start monitoring. At this time, the kernel configures the counters (i.e. writes values to a set of special registers), then reads and stores their values, including the number of cycles processed, which may be used by the userspace to compute a ratio.
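The ratio mentioned above is plain arithmetic. As a minimal sketch, assuming the userspace wants an integer percentage of the cycles during which the event was counted (counter_percent is a hypothetical helper):

```shell
#!/bin/sh
# counter_percent COUNT CYCLES
# Print COUNT as an integer percentage of CYCLES (0 if no cycle elapsed).
counter_percent() {
    if [ "$2" -eq 0 ]; then
        echo 0
    else
        echo $(( 100 * $1 / $2 ))
    fi
}

counter_percent 250 1000   # prints 25
```

For a signal such as gr_idle, this would give the share of the counting period during which the graphics engine was idle.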

Expose counter’s data to the userspace

Currently, we can configure and sample a counter, but the result of this counting period is not yet exposed to the userspace. To send results from the kernel to mesa, we use a notifier buffer object (BO) which is dedicated to communication from the kernelspace to the userspace. A notifier BO is allocated and mapped along with a channel, so it is accessible by both the kernel and the userspace. When mesa creates a channel, this special BO is automatically allocated by the kernel; we then just have to map it. At this point, the kernel can write results to this BO, and the userspace can read them back. The result of a counting period is copied by the kernel to this notifier BO from another software method, which is also used to synchronize queries.

Synchronize queries with a sequence number

To synchronize queries, we use a different sequence ID (like a fence) for each query we send to the kernel. When the user wants to read out a result, it sends a query ID through a software method. This method does the readout, then copies the counter’s value to the notifier BO and the sequence number to offset 0. We also use a ring buffer in the notifier BO to store the list of counter IDs, cycles and counter values. This ring buffer is a nice way to avoid stalling the command submission and is a good fit for the Gallium HUD, which queues up to 8 frames before having to read back the counters. As for the HUD, this ring buffer stores the results of the N previous readouts. Since offset 0 stores the latest sequence ID, we can easily check if the result is available in the ring buffer. To get the result, we can busy-wait until the query we want is available in the ring buffer, or we can check whether the result of that query has been overwritten by a newer one.
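The availability check can be sketched with simple arithmetic on sequence IDs. This is only an illustration: query_state is a hypothetical helper, the slot index is assumed to be the sequence ID modulo the ring size, and wrap-around of the sequence counter is ignored.

```shell
#!/bin/sh
# query_state SEQ LATEST SLOTS
#   SEQ    sequence ID of the query we are waiting for
#   LATEST sequence ID read at offset 0 of the notifier BO
#   SLOTS  number of results the ring buffer can hold
# Prints "pending", "overwritten" or "ready <slot>".
query_state() {
    if [ "$1" -gt "$2" ]; then
        echo pending                    # not read out yet: keep waiting
    elif [ $(( $2 - $1 )) -ge "$3" ]; then
        echo overwritten                # a newer result reused the slot
    else
        echo "ready $(( $1 % $3 ))"     # assumed slot: sequence ID modulo size
    fi
}

query_state 12 10 8   # prints: pending
query_state 10 10 8   # prints: ready 2
query_state 2 10 8    # prints: overwritten
```

With 8 slots, a reader that lags at most 7 readouts behind the kernel never has to stall, which matches the way the HUD queues several frames before reading back.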

This buffer looks like this:



To sum up, almost all of these software methods use the perfmon engine initially written by Ben Skeggs. However, to support complex hardware events like special counter modes and multiple passes I still had to improve it.

Currently, the connection between these software methods and perfmon is still a work in progress. I will try to complete this task as soon as possible to provide a full implementation.

I already have a set of patches in a Request For Comments state for perfmon and the software methods interface on my GitHub account; you can take a look at them here. I also have an out-of-mesa example, initially written by Martin Peres, which shows how to use that first prototype (link). Two days ago, Ben Skeggs made good suggestions that I am currently investigating. I will get back to you on them when I’m done experimenting.

Designing and implementing a kernel interface in an elegant way takes a while…

See you soon for the full implementation!

A deeper look into NVPerfKit

NVIDIA NVPerfKit is a suite of performance tools that helps developers identify the performance bottlenecks of OpenGL and Direct3D applications. It allows you to monitor hardware performance counters, which store the counts of hardware-related activities from the GPU itself. These performance counters (called “graphics counters” by NVIDIA) are usually used by developers to answer questions like “how busy is the GPU?” or “how many triangles have been drawn in the current frame?”. However, NVPerfKit is only available on Windows.

This year, my Google Summer of Code project is to expose NVIDIA’s graphics counters to help Linux/Nouveau developers improve their OpenGL applications. By the end of this summer, this project aims to offer a Linux version of NVPerfKit for NVIDIA’s graphics cards (only GeForce 8, 9 and 2XX at first). To expose these hardware events to userspace, we have to write an interface between the Linux kernel and mesa. Basically, the idea is to tell the kernel to monitor signal X and to read back the results from userspace (i.e. mesa). However, before writing that interface, we have to study the behaviour of NVPerfKit on Windows.

First, let me explain (again) what a hardware performance counter really is. A hardware performance counter is a set of special registers used to count hardware-related activities. There are two types of counters: global counters from PCOUNTER and (local) MP counters. PCOUNTER is the card unit which contains most of the performance counters. It is divided into 8 domains (or sets) on nv50/Tesla. Each domain has a different source clock and has 255+ input signals, which can themselves be the output of a multiplexer. PCOUNTER uses global counters, whereas MP counters are per-app and context switched. Actually, these two types of counters are not really independent and may share some configuration parts, for example the output of a signal multiplexer. On Tesla/nv50, it is possible to monitor 4 macro signals concurrently per domain. A macro signal is the aggregation of 4 signals which have been combined with a function. In this post, we focus only on global counters. Now, the question is: how does NVPerfKit monitor these global performance counters?
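To make the notion of a macro signal more concrete, here is a small Python model. It rests on an assumption that is not stated in this post: that the combining function is expressed as a 16-entry truth table (one bit per combination of the 4 input signals). The helper names `macro_signal` and `count_cycles` are hypothetical and only serve the illustration.

```python
from typing import Iterable, Tuple

Sample = Tuple[int, int, int, int]  # the 4 input signals at one clock tick


def macro_signal(func: int, s0: int, s1: int, s2: int, s3: int) -> int:
    """Evaluate a 4-input boolean function stored as a 16-bit truth table.

    Assumption: bit i of `func` is the output for the input combination
    whose binary encoding (s3 s2 s1 s0) equals i.
    """
    index = s0 | (s1 << 1) | (s2 << 2) | (s3 << 3)
    return (func >> index) & 1


def count_cycles(func: int, samples: Iterable[Sample]) -> int:
    """Count the ticks during which the macro signal is asserted."""
    return sum(macro_signal(func, *sample) for sample in samples)


# A few truth tables under the encoding above:
ONLY_S0 = 0xAAAA    # follows signal 0, ignores the others
S0_AND_S1 = 0x8888  # asserted when signals 0 and 1 are both high
S0_OR_S1 = 0xEEEE   # asserted when signal 0 or signal 1 is high
ALWAYS = 0xFFFF     # asserted on every tick, i.e. counts elapsed cycles
```

With such an encoding, counting with `ALWAYS` next to the event of interest would give the number of elapsed cycles, which is what makes a ratio like "busy percentage" computable.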

Case #1: How does NVPerfKit handle multiple apps being monitored concurrently?

NVIDIA does not handle this case at all, and the behaviour is thus undefined when more than one application is monitoring performance counters at the same time. Moreover, because global counters (PCOUNTER) and local counters (MP counters) share some configuration, I think it’s a bad idea to allow monitoring multiple applications concurrently. To solve this problem, I suggest, at first, using a global lock that allows only one application at a time, which also simplifies the implementation.
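The global lock idea can be sketched as follows. This is a userspace Python illustration only, not the kernel implementation; `MonitorSession` and its methods are hypothetical names, and refusing the second application immediately (instead of making it wait) is a design choice made for this example.

```python
import threading


class MonitorSession:
    """One monitoring session per application, guarded by a global lock."""

    _global_lock = threading.Lock()  # shared by every session

    def __init__(self, app_name: str) -> None:
        self.app_name = app_name
        self.active = False

    def start(self) -> bool:
        """Try to become the only monitoring application; never blocks."""
        if self.active:
            return True  # this session already owns the counters
        self.active = MonitorSession._global_lock.acquire(blocking=False)
        return self.active

    def stop(self) -> None:
        """Release the counters so that another application can use them."""
        if self.active:
            self.active = False
            MonitorSession._global_lock.release()
```

A second application calling `start()` while the first one is still monitoring simply gets `False` back, so the shared configuration is never touched by two applications at once.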

Case #2: How does NVPerfKit handle only one counter per domain?

This is the simplest case, and there are no particular requirements.

Case #3: How does NVPerfKit handle multiple counters per domain?

NVPerfKit uses a round-robin mode: it still monitors only one counter per domain at a time, and it switches the current counter after each frame.

Case #4: How does NVPerfKit handle multiple counters on different domains?

No problem here, NVPerfKit is able to monitor multiple counters on different domains (each domain having up to one event to monitor).

To sum up, NVPerfKit always uses a round robin mode when it has to monitor more than one hw event on the same domain.
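The behaviour observed in cases #2 to #4 can be summarized with a short scheduling sketch. It only models what is described above (one counter per domain per frame, switching after each frame); the function name `round_robin_plan` and the counter names other than `shader_busy` are hypothetical, as are the domain numbers.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def round_robin_plan(
    counters: List[Tuple[str, int]], frames: int
) -> List[Dict[int, str]]:
    """For each frame, return which counter is monitored on each domain.

    `counters` is a list of (counter name, domain index) pairs.
    """
    by_domain: Dict[int, List[str]] = defaultdict(list)
    for name, domain in counters:
        by_domain[domain].append(name)

    plan = []
    for frame in range(frames):
        # One counter per domain; rotate to the next one after each frame.
        plan.append(
            {domain: names[frame % len(names)]
             for domain, names in by_domain.items()}
        )
    return plan


requested = [
    ("shader_busy", 0),       # two counters compete for domain 0
    ("hypothetical_a", 0),
    ("hypothetical_b", 1),    # alone on domain 1, monitored every frame
]
plan = round_robin_plan(requested, frames=3)
```

One consequence visible in the plan is that a counter sharing its domain with k - 1 others is only sampled once every k frames.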

Concerning the sampling part, NVIDIA says (NVPerfKit User Guide – page 11 – Appendix B. Counters reference):

All of the software/driver counters represent a per frame accounting. These counters are accumulated and updated in the driver per frame, so even if you sample at a sub-frame rate frequency, the software counters will hold the same data (from the previous frame) until the end of the current frame.

This article should have been published last month, but in the meantime I worked on the prototype’s definition and its implementation. I now have a first prototype which works quite well; I’ll submit it next week.

See you next week!

GSoC 2014 – The clock is again ticking!


The Google Summer of Code 2014 coding period starts tomorrow. This year, my project is to expose NVIDIA’s GPU graphics counters to userspace through mesa. This idea follows my previous Google Summer of Code, which was mainly focused on reverse engineering NVIDIA’s performance counters.

The main goal of this project is to help Linux developers identify the performance bottlenecks of OpenGL applications. By the end of this GSoC, (almost all of) NVIDIA’s GPU graphics counters for GeForce 8, 9 and 2XX (nv50/Tesla) will be exposed for Nouveau. Some counters won’t be available until compute support (i.e. the ability to launch kernels) for nv50 is implemented.

Over the past weeks, I have continued reverse engineering NVIDIA’s graphics counters for nv50. The documentation is now almost complete (except for aa, ac and af, because I don’t have them), and I recently started the same process for nvc0 cards. At the moment, this documentation hasn’t been pushed to envytools and is only available in my personal repository.

For checking the reverse engineered configuration of the performance counters, I developed a modified version of OGLPerfHarness (the OpenGL sample code of NVPerfKit). This OpenGL sample automatically monitors and exports values of performance counters by using NVPerfSDK on Windows. The figure below shows an example.


This tool is called (using a bash script) for all available counters and it produces the following output (for the shader_busy signal in this example):


All stats produced by the OpenGL sample are available in my repo. However, I didn’t publish the code because I don’t have the right to redistribute it, but I can send a patch if anyone is interested.

To check the configuration of these performance counters on Nouveau, I ported my tool to Linux. I was then able to compare the values exported from Windows with those obtained on Nouveau, using nv_perfmon to monitor the counters.

Now, the plan for the next weeks is to work on the kernel ioctl interface.

See you later!