Implement MP counters for nv50 (compute only)

Hello,

As part of my Google Summer of Code project, I implemented MP counters (for compute only) on nv50/Tesla. This work follows the implementation of MP counters for nvc0/Fermi I did last year.

Compute counters are used by OpenCL, while graphics counters are used to count hardware-related activities of OpenGL applications. The distinction between these two types of counters made by NVIDIA is arbitrary and won’t be present in my implementation. That’s why compute counters can also be used to give detailed information about OpenGL applications, like the number of instructions processed per frame or the number of launched warps.

MP performance counters are local and per-context, while performance counters programmed through the PCOUNTER engine are global. An MP counter is more accurate than a global counter because it counts hardware-related activities for each context separately, while a global counter reports activity regardless of the context that generated it.

All of these MP counters have been reverse engineered using CUPTI, the NVIDIA CUDA profiling tools interface, which only exposes compute counters. On nv50/Tesla, CUPTI exposes 13 performance counters such as instructions or warp_serialize. The nv50 family has 4 MP counters per TPC (Texture Processing Cluster).

Currently, this prototype implements an interface between the kernel and mesa which exposes these MP performance counters to the user through the Gallium HUD. Basically, this interface can configure and poll a counter using the push buffer and a set of software methods.

To configure an MP counter we use the command stream, like the blob does. We have two methods: the first one configures the counter (mode, signal, unit and logic operation) and the second one just reinitializes it. Then, to select the group of the MP counter, we added a software method. To poll counters we use a notifier buffer object which is allocated along with a channel. This notifier allows the kernel and mesa to communicate. This approach has already been explained in my previous article.
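
To give an idea of what this looks like on the mesa side, here is a small sketch in the style of the nv50 Gallium driver, assuming the usual nouveau pushbuf helpers (BEGIN_NV04/PUSH_DATA); the HYP_* names, the subchannel index and the method numbers are placeholders I made up for illustration, not the actual methods used in the patches:

/* Sketch only: HYP_* names, the subchannel index and the method numbers are
 * placeholders, not the real nv50 compute/software methods from the patches. */
#define HYP_SUBC_COMPUTE(m)     2, (m)             /* hypothetical subchannel   */
#define HYP_MP_PM_CONFIG(i)     (0x0600 + (i) * 4) /* mode/signal/unit/logic op */
#define HYP_MP_PM_RESET         0x0610             /* reinitialize a counter    */
#define HYP_SW_MP_PM_SET_GROUP  0x0614             /* SW method: select group   */

static void
hyp_nv50_mp_counter_setup(struct nouveau_pushbuf *push,
                          int slot, uint32_t config, uint32_t group)
{
   /* Select the signal group through the software method: the kernel traps
    * it and writes the corresponding register for this context. */
   BEGIN_NV04(push, HYP_SUBC_COMPUTE(HYP_SW_MP_PM_SET_GROUP), 1);
   PUSH_DATA (push, group);

   /* Program mode, signal, unit and logic operation for one of the 4
    * MP counters of each TPC. */
   BEGIN_NV04(push, HYP_SUBC_COMPUTE(HYP_MP_PM_CONFIG(slot)), 1);
   PUSH_DATA (push, config);

   /* Reinitialize the counter before the workload we want to measure. */
   BEGIN_NV04(push, HYP_SUBC_COMPUTE(HYP_MP_PM_RESET), 1);
   PUSH_DATA (push, 1 << slot);
}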

To sum up, this prototype adds support for 13 performance counters on nv50/Tesla. All of the code is available on my GitHub account. If you are interested, you can take a look at the mesa and nouveau code.

Have a good day.

A first attempt at exposing NVIDIA’s performance counters in Nouveau

Hi folks,

Following up on this year’s GSoC, it’s time to talk about the interface between the kernel and the userspace (mesa). Basically, the idea is to tell the kernel to monitor signal X and to read back the results from mesa. At the end of this project, almost all the graphics counters for GeForce 8, 9 and 2XX (nv50/Tesla) will be exposed, and this interface should be almost compatible with Fermi and Kepler. Some MP counters which still have to be reverse engineered will be added later.

To implement this interface between the Linux kernel and mesa, we can use ioctl calls or software methods. Let me first talk a bit about them.

ioctl calls vs software methods

An ioctl (Input/Output control) is the most common hardware-controlling operation; it’s a sort of system call, available in most driver categories. A software method is a special command added to the command stream of the GPU. Basically, the card processes the command stream (FIFO) and encounters an unimplemented method. PFIFO then waits until PGRAPH is idle and sends a specific IRQ called INVALID_METHOD to the kernel. At this point, the kernel is inside an interrupt context; the driver determines the method and the object that caused the interrupt and executes the method. The main difference between these two approaches is that software methods can easily be synchronized with the CPU through the command stream and are context-dependent, while ioctls are not synchronized with the command stream. With software methods, we can make sure they are executed right after the commands we want, and the following commands won’t get executed until the software method has been handled by the CPU; this is not possible with an ioctl.

Currently, I have a first prototype of that interface using a set of software methods, both to get the advantage of synchronization along the command stream and because ioctl calls are harder to implement and to maintain in the future. However, since a software method is invoked within an interrupt context, we have to limit as much as possible the number of instructions needed to complete its task, and it is absolutely forbidden to sleep there, for example.
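
To make this flow a bit more concrete, here is a heavily simplified, purely illustrative sketch of the kernel side; every name, type and method number below is invented for the sketch and does not match the real nouveau code:

#include <linux/types.h>

struct hyp_chan;	/* per-channel (per-context) state, invented for the sketch */

int hyp_perfctr_add(struct hyp_chan *chan, u32 data);
int hyp_perfctr_sample(struct hyp_chan *chan, u32 data);
int hyp_perfctr_readout(struct hyp_chan *chan, u32 data);

/* Called from the INVALID_METHOD interrupt path: PFIFO has stalled the
 * channel, so the commands following the software method are not executed
 * until this handler returns. We run in interrupt context, so the handlers
 * must be short and must never sleep. */
static bool
hyp_handle_sw_method(struct hyp_chan *chan, u32 mthd, u32 data)
{
	switch (mthd) {
	case 0x0600: return hyp_perfctr_add(chan, data) == 0;
	case 0x0604: return hyp_perfctr_sample(chan, data) == 0;
	case 0x0608: return hyp_perfctr_readout(chan, data) == 0;
	default:     return false;	/* genuinely invalid method */
	}
}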

A first prototype using software methods

Basically, that interface, like NVPerfKit’s, must be able to export a list of available hardware events, add or remove a counter, sample a counter, expose its value to the userspace and synchronize the different queries which will be sent by the userspace to the kernel. All of these operations are performed through a set of software methods.

Configure a counter

To configure a counter we will use a software method which is not yet defined, but since we can send 32 bits of data along with it, that is sufficient to identify a counter. For this, we can either send the global ID of the counter, or allocate from the userspace an object which represents a counter and send its handle along with that software method. The kernel then pushes that counter into a staging area, waiting for the next batch of counters or for the sample command. This command can be invoked repeatedly to add several counters. Once all the counters added by the user are known by the kernel, it’s time to send the sample command. It is also possible to synchronize the configuration with the beginning and the end of a frame using software methods.

Sample a counter

This command also uses a software method, which just tells the kernel to start monitoring. At this point, the kernel configures the counters (i.e. writes values to a set of special registers), then reads and stores their values, including the number of cycles processed, which may be used by the userspace to compute a ratio.
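
Continuing the same illustrative sketch (names still invented, not the real code), the configure and sample methods could boil down to something like this:

#include <linux/errno.h>
#include <linux/types.h>

#define HYP_MAX_COUNTERS 8

struct hyp_perfctr_state {
	u32 staged[HYP_MAX_COUNTERS];	/* global counter IDs sent by userspace */
	int nr_staged;
	u32 value[HYP_MAX_COUNTERS];	/* latched counter values               */
	u64 cycles;			/* cycles processed, used for ratios    */
};

/* Hardware accessors assumed by the sketch (not the real nouveau API). */
void hyp_pcounter_program(u32 id);
u32  hyp_pcounter_read(u32 id);
u64  hyp_pcounter_read_cycles(void);

/* 'Configure' software method: stage the counter named by the 32-bit payload. */
static int hyp_stage_counter(struct hyp_perfctr_state *st, u32 id)
{
	if (st->nr_staged >= HYP_MAX_COUNTERS)
		return -ENOSPC;
	st->staged[st->nr_staged++] = id;
	return 0;
}

/* 'Sample' software method: program every staged counter and latch its value,
 * together with the number of cycles processed. */
static int hyp_sample_counters(struct hyp_perfctr_state *st)
{
	int i;

	for (i = 0; i < st->nr_staged; i++) {
		hyp_pcounter_program(st->staged[i]);
		st->value[i] = hyp_pcounter_read(st->staged[i]);
	}
	st->cycles = hyp_pcounter_read_cycles();
	return 0;
}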

Expose counter’s data to the userspace

Currently, we can configure and sample a counter, but the result of this counting period is not yet exposed to the userspace. Basically, to be able to send results from the kernel to mesa, we use a notifier buffer object which is dedicated to communication from the kernelspace to the userspace. A notifier BO is allocated and mapped along with a channel, so it is accessible both by the kernel and by the userspace. When mesa creates a channel, this special BO is automatically allocated by the kernel; we then just have to map it. At this point, the kernel can write results to this BO and the userspace can read them back. The result of a counting period is copied by the kernel to this notifier BO from another software method, which is also used to synchronize queries.

Synchronize queries with a sequence number

To synchronize queries we use a different sequence ID (like a fence) for each query we send to the kernel space. When the user wants to read out a result, it sends a query ID through a software method. This method then does the readout, copies the counter’s value to the notifier BO and writes the sequence number at offset 0. We also use a ringbuffer in the notifier BO to store the list of counter IDs, cycles and counter values. This ringbuffer is a nice way to avoid stalling command submission and is a good fit for the Gallium HUD, which queues up to 8 frames before having to read back the counters. As for the HUD, this ringbuffer stores the results of the N previous readouts. Since offset 0 stores the latest sequence ID, we can easily check whether a result is available in the ringbuffer: we can either busy-wait until the query we want is available, or check that its result has not been overwritten by a newer one.

This buffer looks like this:

[Figure: schema of the notifier BO: the latest sequence number at offset 0, followed by the ringbuffer of readouts]
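
As a rough sketch of that layout and of the availability check (field sizes and names are made up; the real patches may pack things differently):

#include <stdbool.h>
#include <stdint.h>

#define HYP_RING_ENTRIES 8	/* the Gallium HUD queues up to 8 frames */

struct hyp_readout {
	uint32_t sequence;	/* query ID this result belongs to */
	uint32_t counter_id;
	uint32_t cycles;
	uint32_t value;
};

struct hyp_notifier {
	uint32_t latest_sequence;		   /* offset 0: last readout done */
	struct hyp_readout ring[HYP_RING_ENTRIES]; /* the N previous readouts     */
};

/* Userspace side: is the result for 'sequence' available in the ring and
 * not yet overwritten by a newer readout? */
static bool
hyp_query_result(const volatile struct hyp_notifier *notif,
                 uint32_t sequence, struct hyp_readout *out)
{
	const volatile struct hyp_readout *slot;

	if ((int32_t)(notif->latest_sequence - sequence) < 0)
		return false;	/* not processed yet: retry (or busy-wait) */

	slot = &notif->ring[sequence % HYP_RING_ENTRIES];
	if (slot->sequence != sequence)
		return false;	/* overwritten by a newer query */

	out->sequence   = slot->sequence;
	out->counter_id = slot->counter_id;
	out->cycles     = slot->cycles;
	out->value      = slot->value;
	return true;
}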

To sum up, almost all of these software methods use the perfmon engine initially written by Ben Skeggs. However, to support complex hardware events, like special counter modes and multiple passes, I still had to improve it.

Currently, the connection between these software methods and perfmon is a work in progress. I will try to complete this task as soon as possible to provide a full implementation.

I already have a set of patches in a Request For Comments state for perfmon and the software methods interface on my GitHub account; you can take a look at them here. I also have an example outside of mesa, initially written by Martin Peres, which shows how to use that first prototype (link). Two days ago, Ben Skeggs made good suggestions that I am currently investigating; I will get back to you when I’m done experimenting with them.

Designing and implementing a kernel interface in an elegant way takes a while…

See you soon for the full implementation!

A deeper look into NVPerfKit

NVIDIA NVPerfKit is a suite of performance tools that helps developers identify the performance bottlenecks of OpenGL and Direct3D applications. It allows you to monitor hardware performance counters, which store the counts of hardware-related activities from the GPU itself. These performance counters (called “graphics counters” by NVIDIA) are usually used by developers to answer questions like “how busy is the GPU?” or “how many triangles have been drawn in the current frame?” and so on. But NVPerfKit is only available on Windows.

This year, my Google Summer of Code project is to expose NVIDIA’s graphics counters to help Linux/Nouveau developers improve their OpenGL applications. By the end of this summer, this project aims to offer a Linux version of NVPerfKit for NVIDIA’s graphics cards (only GeForce 8, 9 and 2XX at first). To expose these hardware events to the userspace, we have to write an interface between the Linux kernel and mesa. Basically, the idea is to tell the kernel to monitor signal X and to read back the results from the userspace (i.e. mesa). However, before writing that interface, we have to study the behaviour of NVPerfKit on Windows.

First, let me explain (again) what a hardware performance counter really is. A hardware performance counter is a set of special registers used to count hardware-related activities. There are two types of counters: global counters from PCOUNTER and (local) MP counters. PCOUNTER is the card unit which contains most of the performance counters. PCOUNTER is divided into 8 domains (or sets) on nv50/Tesla. Each domain has a different source clock and has 255+ input signals that can themselves be the output of a multiplexer. PCOUNTER uses global counters, whereas MP counters are per-app and context switched. Actually, these two types of counters are not really independent and may share some configuration parts, for example the output of a signal multiplexer. On nv50/Tesla, it is possible to monitor 4 macro signals concurrently per domain. A macro signal is the aggregation of 4 signals which have been combined with a function. In this post, we are only focusing on global counters. Now, the question is: how does NVPerfKit monitor these global performance counters?
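
As a purely conceptual illustration of the “macro signal” idea (this models the behaviour only, not the actual PCOUNTER register layout), the combining function can be seen as a truth table over the 4 input signals: each cycle, the 4 signal bits form an index, and the counter increments whenever the corresponding table bit is set:

#include <stdbool.h>
#include <stdint.h>

/* Conceptual model of one macro signal: 4 input signals combined by a
 * 16-entry truth table (the "function"), feeding one counter. */
struct macro_signal {
	uint16_t truth_table;	/* bit i set => count when the inputs equal i */
	uint32_t count;
};

static void
macro_signal_tick(struct macro_signal *ms, bool s0, bool s1, bool s2, bool s3)
{
	unsigned idx = s0 | (s1 << 1) | (s2 << 2) | (s3 << 3);

	if (ms->truth_table & (1u << idx))
		ms->count++;
}

/* Example: count every cycle where signal 0 is active, whatever the other
 * three signals do: the table selects all input combinations with bit 0 set.
 *   struct macro_signal ms = { .truth_table = 0xaaaa };                     */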

Case #1: How does NVPerfKit handle multiple apps being monitored concurrently?

NVIDIA does not handle this case at all, and the behaviour is thus undefined when more than one application monitors performance counters at the same time. Because global counters (PCOUNTER) and local counters (MP counters) share parts of their configuration, I think it’s a bad idea to allow monitoring multiple applications concurrently. To solve this problem, I suggest, at first, using a global lock to allow only one application at a time, which also simplifies the implementation.

Case #2: How does NVPerfKit handle only one counter per domain?

This is the simplest case, and there are no particular requirements.

Case #3: How does NVPerfKit handle multiple counters per domain?

NVPerfKit uses a round-robin mode: it still monitors only one counter per domain at a time, and it switches to the next counter after each frame.

Case #4: How does NVPerfKit handle multiple counters on different domains?

No problem here: NVPerfKit is able to monitor multiple counters on different domains (with at most one event per domain).

To sum up, NVPerfKit always uses a round-robin mode when it has to monitor more than one hardware event on the same domain.
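
Here is a tiny sketch of what such a round-robin scheme amounts to (a conceptual model, not NVIDIA’s actual code): one active event per domain, rotated at every frame boundary.

#define HYP_NUM_DOMAINS 8	/* 8 PCOUNTER domains on nv50/Tesla */

struct hyp_domain {
	int nr_events;	/* events requested on this domain      */
	int current;	/* index of the event currently counted */
};

/* Called at every frame boundary: each domain still monitors a single
 * event at a time, but the active event changes from frame to frame. */
static void
hyp_end_of_frame(struct hyp_domain domains[HYP_NUM_DOMAINS])
{
	for (int i = 0; i < HYP_NUM_DOMAINS; i++) {
		struct hyp_domain *d = &domains[i];

		if (d->nr_events > 1)
			d->current = (d->current + 1) % d->nr_events;
	}
}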

Concerning the sampling part, NVIDIA says (NVPerfKit User Guide – page 11 – Appendix B. Counters reference):

All of the software/driver counters represent a per frame accounting. These counters are accumulated and updated in the driver per frame, so even if you sample at a sub-frame rate frequency, the software counters will hold the same data (from the previous frame) until the end of the current frame.

This article should have been published last month, but during this time I worked on the prototype’s definition and its implementation. Currently, I have a first prototype which works quite well; I’ll submit it next week.

See you next week!

GSoC 2014 – The clock is again ticking!

Hello,

The Google Summer of Code 2014 coding period starts tomorrow. This year, my project is to expose NVIDIA’s GPU graphics counters to the userspace through mesa. This idea follows my previous Google Summer of Code project, which was mainly focused on reverse engineering NVIDIA’s performance counters.

The main goal of this project is to help Linux developers identify the performance bottlenecks of OpenGL applications. By the end of this GSoC, NVIDIA’s GPU graphics counters for GeForce 8, 9 and 2XX (nv50/Tesla) will (almost all) be exposed in Nouveau. Some counters won’t be available until compute support (i.e. the ability to launch kernels) for nv50 is implemented.

During the past weeks, I continued reverse engineering NVIDIA’s graphics counters for nv50. Currently, the documentation is almost complete (except for aa, ac and af, because I don’t have those chipsets), and recently I started this process for nvc0 cards. At the moment this documentation hasn’t been pushed to envytools and is only available in my personal repository.

To check the reverse engineered configuration of the performance counters, I developed a modified version of OGLPerfHarness (the OpenGL sample code of NVPerfKit). This OpenGL sample automatically monitors and exports values of performance counters using NVPerfSDK on Windows. The figure below shows an example.

[Screenshot: the modified OGLPerfHarness sample monitoring performance counters through NVPerfSDK]

This tool is called (from a bash script) for all available counters and produces the following output (for the shader_busy signal in this example):

OPTIONS:
model=bunny
model-count=27
render-mode=vbo
texture=small
num-frames=100
fullscreen=no
STATS:
fps=9.53
mean=98.5%
min=98.5%
max=98.6%

All the stats produced by the OpenGL sample are available in my repository. However, I didn’t publish the code because I don’t have the right to redistribute it, but I can send a patch if anyone is interested.

To check the configuration of these performance counters on Nouveau, I ported my tool to Linux. Then, using nv_perfmon to monitor the counters, I was able to compare the values with those exported from Windows.

Now, the plan for the next weeks is to work on the kernel ioctl interface.

See you later!

Google Summer of Code 2014 – Proposal for X.Org Foundation

Title

Expose NVIDIA’s GPU graphics counters to the userspace.

Short description

This project aims to expose NVIDIA’s GPU graphics counters to the userspace through mesa. This idea follows my previous Google Summer of Code project, which was mainly focused on reverse engineering NVIDIA’s performance counters. The main goal of this project is to help Linux developers identify the performance bottlenecks of OpenGL applications.

Personal information

I’m a student in my final year of an MSc degree at the University of Bordeaux,
France. I already participated in the Google Summer of Code last year [1] and
my project was to reverse engineer NVIDIA’s performance counters.

Context

Performance counters

A hardware performance counter is a set of special registers which are used
to store the counts of hardware-related activities. Hardware counters are
often used by developers to identify bottlenecks in their applications.

In this proposal, we are only focusing on NVIDIA’s performance counters.

There are two types of counters offered by NVIDIA which provide data directly
from various points of the GPU. Compute counters are used for OpenCL, while
graphics counters give detailed information for OpenGL/Direct3D.

On Windows, compute and graphics counters are both exposed by PerfKit[2], an
advanced software suite (except when it crashes my computer for no particular
reason :-)), which can be used by advanced users for profiling OpenCL and
Direct3D/OpenGL applications.

On Linux, the proprietary driver *only* exposes compute counters through the
CUDA compute profiler[3] (CUPTI); graphics counters are only available through
PerfKit, which is Windows-only.

On Nouveau/Linux, some counters are already exposed. Compute counters for
nvc0/Fermi and nve0/Kepler are available in mesa, which manages counter
allocation and monitoring through some software methods provided by the kernel.

The compute and graphics counters distinction made by NVIDIA is arbitrary and
won’t be present in our re-implementation.

Google Summer of Code 2013 review

I took part in GSoC 2013 and my project was to reverse engineer NVIDIA’s
performance counters and to expose them via nv_perfmon.

Let me now sum up the important tasks I have done during this project.

The first thing I did was take a look at CUPTI to understand how GPU
compute counters are implemented on Fermi. After playing a bit with that
profiler, I wrote a tool named cupti_trace[4] to make the reverse engineering
process as automatic as possible. At this stage, I was able to start the
implementation of MP counters on nvc0/Fermi in mesa, based on the previous work
of Christoph Bumiller (aka calim), who had already implemented that support for
nve0/Kepler. To complete this task, I had to implement parts of the compute
runtime for nvc0 (i.e. the ability to launch kernels).

MP compute counter support for Fermi:
http://lists.freedesktop.org/archives/mesa-commit/2013-July/044444.html
http://lists.freedesktop.org/archives/mesa-commit/2013-August/044573.html
http://lists.freedesktop.org/archives/mesa-commit/2013-August/044574.html
http://lists.freedesktop.org/archives/mesa-commit/2013-August/044576.html

The second part of my project was to start reverse engineering graphics
counters on nv50/Tesla through PerfKit and gDEBugger[5], an advanced OpenGL and
OpenCL debugger, profiler and memory analyzer. Since PerfKit is only available
on Windows, I was initially unable to use envytools[6], a tool suite for
reverse engineering the NVIDIA proprietary driver, because it depends on
libpciaccess, which was not available on Windows. To complete this task, I
ported this library using WinIO in order to use the tools provided by
envytools, such as nvapeek and nvawatch.

libpciaccess support on Windows/Cygwin:
http://hakzsam.wordpress.com/2014/01/28/libpciaccess-has-now-official-support-for-windowscygwin/
http://www.phoronix.com/scan.php?page=news_item&px=MTU4NTU
http://cgit.freedesktop.org/xorg/lib/libpciaccess/commit/?id=6bfccc7ec4f0705595385f6684b6849663f781b4

At the end of this Google Summer of Code, some graphics counters had already been
reverse engineered on nv98/Tesla.

This project was successfully completed except for the implementation of
graphics counters in nv_perfmon and the reverse engineering of MP counters on
Tesla (with respect to the schedule). It has been a very interesting experience
for me, even if it was very hard at the beginning. I’m now able to say that
low-level GPU hardware programming is not a trivial task. :-)

After GSoC 2013 until now

From October to January, I didn’t work on Nouveau at all because I was
completely busy with university work.

In February, I returned to the reverse engineering of these graphics
counters, and I mostly completed the documentation of the nv50/Tesla chipsets[7].

Project

Benefits to the community

Help Linux developers identify the performance bottlenecks of OpenGL
applications.

Description

Compute counters for nvc0+ are already exposed by Nouveau, but there are still
many performance counters exposed by NVIDIA that are left to be exposed in
Nouveau. Last year, I added support for the compute counters used by OpenCL and
CUDA on nvc0/Fermi.

Graphics counters are currently only available on Windows, but I reverse
engineered them and the documentation is mostly complete. At the moment, nv50,
84, 86, 92, 98, a0, a3 and a5 are documented. In a few days, I should be able
to complete this list by adding the 94, 96 and a8 chipsets. In this GSoC
project, I would like to expose them in Nouveau, but there are some problems
between PCOUNTER[8] and MP counters.

PCOUNTER is the card unit which contains most of the performance counters.
PCOUNTER is divided into 8 domains (or sets) on nv50/Tesla. Each domain has a
different source clock and has 255+ input signals that can themselves be the
output of a multiplexer. PCOUNTER uses global counters whereas MP counters
are per-app and context switched, like the compute counters used on nvc0+.

Actually, these two types of counters are not really independent and may share
some configuration parts, for example, the output of a signal multiplexer.

Because of the issue of shared configuration between global counters (PCOUNTER)
and local counters (MP counters), I think it’s a bad idea to allow monitoring
multiple applications concurrently. To solve this problem, I suggest, at first,
using a global lock to allow only one application at a time, which also
simplifies the implementation.

NVIDIA does not handle this case at all, and the behaviour is thus undefined when more
than one application is monitoring performance counters at the same time.

Implementation

kernel interface and ioctls

Some performance counters are global and have to be programmed through MMIO.
They have to be managed by the Linux kernel using an ioctl interface that is
yet to be defined.
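
As a rough sketch of what such an interface could look like (everything below
is hypothetical; the actual ioctls are still to be defined):

#include <linux/ioctl.h>
#include <linux/types.h>

/* Hypothetical sketch only: these ioctls are not defined anywhere yet. */
#define HYP_PERFCTR_IOC_MAGIC	'P'

struct hyp_perfctr_config {
	__u32 signal;	/* global signal ID to monitor         */
	__u32 domain;	/* PCOUNTER domain the signal lives in */
};

struct hyp_perfctr_result {
	__u32 signal;
	__u32 value;
	__u64 cycles;	/* for computing ratios in userspace */
};

#define HYP_PERFCTR_IOC_ADD    _IOW (HYP_PERFCTR_IOC_MAGIC, 1, struct hyp_perfctr_config)
#define HYP_PERFCTR_IOC_SAMPLE _IO  (HYP_PERFCTR_IOC_MAGIC, 2)
#define HYP_PERFCTR_IOC_READ   _IOWR(HYP_PERFCTR_IOC_MAGIC, 3, struct hyp_perfctr_result)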

mesa

Only mesa should use performance counters directly, because it has all the
information needed to expose them. Mesa is able to allocate and manage MP
counters (per-app) and can also call the kernel in order to program global
counters via the ioctl interface that will be implemented. At this stage, mesa
will be able to expose them through GL_AMD_performance_monitor and nouveau-perfkit.

GL_AMD_performance_monitor

GL_AMD_performance_monitor[9] is an OpenGL extension which can be used to
capture and report performance counters. This is a great extension for Linux
developers, who currently cannot get any performance counters from NVIDIA’s
GPUs. Once the core implementation is in mesa, this task should not be too
hard since I already have a branch[7] of mesa with core support for
GL_AMD_performance_monitor. Thanks to Kenneth Graunke and Christoph Bumiller.
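
For reference, here is a minimal sketch of how an application could use the
extension once it is exposed (error handling omitted; GLEW is only assumed
here to resolve the extension entry points):

/* Sketch: sampling one counter through GL_AMD_performance_monitor. */
#include <GL/glew.h>

static void sample_first_counter(void)
{
    GLuint group, counter, monitor;
    GLint ngroups = 0, ncounters = 0, maxactive = 0;

    /* Pick the first group and its first counter (driver-specific IDs). */
    glGetPerfMonitorGroupsAMD(&ngroups, 1, &group);
    glGetPerfMonitorCountersAMD(group, &ncounters, &maxactive, 1, &counter);

    /* Create a monitor and enable the selected counter. */
    glGenPerfMonitorsAMD(1, &monitor);
    glSelectPerfMonitorCountersAMD(monitor, GL_TRUE, group, 1, &counter);

    /* Bracket the workload to be measured. */
    glBeginPerfMonitorAMD(monitor);
    /* ... draw calls ... */
    glEndPerfMonitorAMD(monitor);

    /* Read the result back once it is available. */
    GLuint available = 0;
    glGetPerfMonitorCounterDataAMD(monitor, GL_PERFMON_RESULT_AVAILABLE_AMD,
                                   sizeof(available), &available, NULL);
    if (available) {
        GLuint data[16];
        GLint written = 0;
        glGetPerfMonitorCounterDataAMD(monitor, GL_PERFMON_RESULT_AMD,
                                       sizeof(data), data, &written);
        /* data[] holds (group ID, counter ID, value) blocks; the size of each
         * value depends on the counter type. */
    }
    glDeletePerfMonitorsAMD(1, &monitor);
}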

nouveau-perfkit

Nouveau-perfkit will be a Linux/Nouveau version of NVPerfKit. This tool will be
based on mesa’s implementation. nouveau-perfkit will export both GPU graphics
counters (only nv50/Tesla at first) and compute counters (nvc0+). To maintain
interoperability with NVIDIA, I am thinking about re-using the interface of
NVIDIA’s NVPerfKit. This tool will be for Nouveau only.

GSoC work

Required tasks:
– core implementation (kernel interface + ioctls + mesa)
– expose graphics counters through GL_AMD_performance_monitor
– add nouveau-perfkit, a Linux version of NVPerfKit

Optional tasks (if I have the time):
– reverse engineer NVIDIA’s GPU graphics counters for Fermi and Kepler
– any other work that can be done around performance counters

Approximate schedule

(now until 19 May)
– complete the documentation of signals on nv50/tesla
– write OpenGL samples code to test these graphics counters
– test the reverse engineering on Nouveau (mostly done) and write piglit tests
– think more about the core implementation

(19 May until 18 July)
– core implementation of GPU graphics counters
(kernel interface + ioctls + mesa)

(18 July to 28 July)
– expose graphics counters through GL_AMD_performance_monitor

(28 July to 18 August)
– implement nouveau-perfkit based on mesa, following the NVPerfKit interface

(after GSoC)
– As last year, I’ll continue to work on Nouveau after the end of this
Google Summer of Code 2014, because I like this work; it’s fun. :-)

Thank you for reading. Have a good day.

References

[1] http://hakzsam.wordpress.com/2013/05/27/google-summer-of-code-2013-proposal-for-x-org-foundation/
[2] https://developer.nvidia.com/nvidia-perfkit
[3] http://docs.nvidia.com/cuda/cupti/index.html
[4] https://github.com/hakzsam/re-pcounter-tools/tree/master/src
[5] http://www.gremedy.com/
[6] https://github.com/envytools/envytools
[7] https://github.com/hakzsam/re-pcounter-tools/tree/master/hwdocs/pcounter
[8] https://github.com/envytools/envytools/blob/master/hwdocs/pcounter/intro.rst
[9] https://www.opengl.org/registry/specs/AMD/performance_monitor.txt

libpciaccess has now official support for Windows/Cygwin

Hey,

During my Google Summer of Code 2013, part of my project was to reverse engineer GPU graphics counters on NVIDIA Tesla. However, these counters are only exposed on Windows, through the NVIDIA NVPerfKit performance tools.

Usually the Nouveau community uses envytools, a collection of tools to help developers understand how NVIDIA GPUs work. Envytools depends on libpciaccess, which was only available on POSIX platforms. That’s why I decided to port libpciaccess to Windows/Cygwin, to be able to use these tools.

This port depends on WinIo, which allows direct I/O port and physical memory access under Windows NT/2000/XP/2003/Vista/7 and 2008.

This port has been accepted into libpciaccess/master and merged today. It has only been tested on 32-bit Windows 7, and still has to be checked and fixed for 64-bit.

To use it, please follow the instructions found in README.cygwin.

This support helped me understand how GPU graphics counters work on NVIDIA Tesla. I started writing documentation for these counters here.

See you later!

NV50 graphics counters are now almost fully documented

Hello everyone,

The second part of my GSoC project was to understand how NVIDIA graphics counters work on the Tesla family. As mentioned in my previous post, I used my own port of libpciaccess on Windows 7 in order to read the PCOUNTER configuration of these signals through NVPerfKit and gDEBugger.

After some weeks of hard work, I have succeeded in documenting most of these signals. However, some of them (like vertex_shader_busy, for example) are still not fully understood, but I’ll try to finish this task as soon as possible.

The results of my research are available on my GitHub.

The next step is to complete the documentation and, after that, it could be interesting to provide an implementation like NVPerfSDK for Linux.

Have a good day. ;)