Google Summer of Code 2014 – Proposal for X.Org Foundation

Title

Expose NVIDIA’s GPU graphics counters to userspace.

Short description

This project aims to expose NVIDIA’s GPU graphics counters to userspace through mesa. It follows on from my previous Google Summer of Code, which was mainly focused on reverse engineering NVIDIA’s performance counters. The main goal of this project is to help Linux developers identify the performance bottlenecks of OpenGL applications.

Personal information

I’m a student in the final year of an MSc degree at the University of Bordeaux,
France. I already participated in the Google Summer of Code last year [1] and
my project was to reverse engineer NVIDIA’s performance counters.

Context

Performance counters

A hardware performance counter is a set of special registers which are used
to store the counts of hardware-related activities. Hardware counters are
often used by developers to identify bottlenecks in their applications.

In this proposal, we are only focusing on NVIDIA’s performance counters.

There are two types of counters offered by NVIDIA which provide data directly
from various points of the GPU. Compute counters are used for OpenCL, while
graphics counters give detailed information for OpenGL/Direct3D.

On Windows, compute and graphics counters are both exposed by PerfKit[2], an
advanced software suite (except when it crashes my computer for no particular
reason :-)), which can be used by advanced users for profiling OpenCL and
Direct3D/OpenGL applications.

On Linux, the proprietary driver *only* exposes compute counters, through the
CUDA Profiling Tools Interface[3] (CUPTI). Graphics counters are not exposed,
because PerfKit is only available on Windows.

On Nouveau/Linux, some counters are already exposed. Compute counters for
nvc0/Fermi and nve0/Kepler are available in mesa which manages counters’
allocation and monitoring through some software methods provided by the kernel.

The compute and graphics counters distinction made by NVIDIA is arbitrary and
won’t be present in our re-implementation.

Google Summer of Code 2013 review

I took part in GSoC 2013 and my project was to reverse engineer NVIDIA’s
performance counters and to expose them via nv_perfmon.

Let me now sum up the important tasks I have done during this project.

The first part of my work was to take a look at CUPTI to understand how GPU
compute counters are implemented on Fermi. After playing a bit with that
profiler, I wrote a tool named cupti_trace[4] to make the reverse engineering
process as automatic as possible. At this stage, I was able to start the
implementation of MP counters on nvc0/Fermi in mesa, based on the previous work
of Christoph Bumiller (aka calim) who had already implemented that support for
nve0/Kepler. To complete this task, I had to implement parts of the compute
runtime for nvc0 (i.e. the ability to launch kernels).

MP compute counter support for Fermi:
http://lists.freedesktop.org/archives/mesa-commit/2013-July/044444.html
http://lists.freedesktop.org/archives/mesa-commit/2013-August/044573.html
http://lists.freedesktop.org/archives/mesa-commit/2013-August/044574.html
http://lists.freedesktop.org/archives/mesa-commit/2013-August/044576.html

The second part of my project was to start reverse engineering graphics
counters on nv50/Tesla through PerfKit and gDEBugger[5], an advanced OpenGL and
OpenCL debugger, profiler and memory analyzer. Because PerfKit was only
available on Windows, I was unable to use envytools[6], a tool suite for
reverse engineering the NVIDIA proprietary driver, since it depends on
libpciaccess, which was not available on Windows. To complete this task, I
ported this library using WinIo so that I could use tools provided by
envytools such as nvapeek and nvawatch.

libpciaccess support on Windows/Cygwin:
https://hakzsam.wordpress.com/2014/01/28/libpciaccess-has-now-official-support-for-windowscygwin/
http://www.phoronix.com/scan.php?page=news_item&px=MTU4NTU
http://cgit.freedesktop.org/xorg/lib/libpciaccess/commit/?id=6bfccc7ec4f0705595385f6684b6849663f781b4

At the end of this Google Summer of Code, some graphics counters had already been
reverse engineered on nv98/Tesla.

This project has been successfully completed, except for the implementation of
graphics counters in nv_perfmon and the reverse engineering of MP counters on
Tesla, which were both part of the schedule. It has been a very interesting
experience for me, even if it was very hard at the beginning. I’m now able to
say that low-level hardware programming on GPUs is not a trivial task :-).

After GSoC 2013 until now

From October to January, I didn’t work on Nouveau at all because I was
completely busy with university work.

In February, I returned to work on the reverse engineering of these graphics
counters, and I completed most of the documentation for nv50/Tesla chipsets[7].

Project

Benefits to the community

Help Linux developers identify the performance bottlenecks of OpenGL
applications.

Description

Compute counters for nvc0+ are already exposed by Nouveau, but there are still
many performance counters exposed by NVIDIA that are left to be exposed in
Nouveau. Last year, I added compute counters support used by OpenCL and CUDA
for nvc0/Fermi.

Graphics counters are currently only available on Windows, but I reverse
engineered them and the documentation is mostly complete. At this time, nv50,
84, 86, 92, 98, a0, a3 and a5 are documented. In a few days, I should be able to
complete this list by adding the 94, 96 and a8 chipsets. In this GSoC project, I
would like to expose them in Nouveau, but there are some problems between
PCOUNTER[8] and MP counters.

PCOUNTER is the card unit which contains most of the performance counters.
PCOUNTER is divided into 8 domains (or sets) on nv50/Tesla. Each domain has a
different source clock and has 255+ input signals that can themselves be the
output of a multiplexer. PCOUNTER uses global counters, whereas MP counters
are per-app and context switched, like the compute counters used for nvc0+.

Actually, these two types of counters are not really independent and may share
some configuration parts, for example, the output of a signal multiplexer.

Because global counters (PCOUNTER) and local counters (MP counters) share
configuration, I think it’s a bad idea to allow monitoring multiple
applications concurrently. To solve this problem, I suggest, at first, using
a global lock that allows only one application at a time to monitor counters,
which also simplifies the implementation.

NVIDIA does not handle this case at all, and the behaviour is thus undefined when more
than one application is monitoring performance counters at the same time.
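
As a rough illustration of the global-lock idea, here is a minimal userspace
sketch in Python. The class and method names (CounterMonitor, begin, end) are
hypothetical and only model the policy "one monitoring application at a time";
the real lock would live behind the kernel interface, which is not defined yet.

```python
import threading


class CounterMonitor:
    """Hypothetical gatekeeper: at most one application may monitor counters."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owner = None  # name of the application currently monitoring

    def begin(self, app):
        # Non-blocking acquire: a second application is refused, not queued.
        if not self._lock.acquire(blocking=False):
            return False
        self._owner = app
        return True

    def end(self, app):
        # Only the current owner may release the lock.
        if self._owner != app:
            raise RuntimeError(f"{app} does not hold the monitoring lock")
        self._owner = None
        self._lock.release()


monitor = CounterMonitor()
first = monitor.begin("glxgears")    # granted
second = monitor.begin("other-app")  # refused while glxgears is monitoring
monitor.end("glxgears")
third = monitor.begin("other-app")   # granted once the lock is free
```

Refusing the second application, rather than blocking it, keeps the shared
multiplexer configuration owned by exactly one client at any time.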

Implementation

kernel interface and ioctls

Some performance counters are global and have to be programmed through MMIO.
They have to be managed by the Linux kernel through an ioctl interface that is
yet to be defined.

mesa

Only mesa should use performance counters directly because it has all the
information needed to expose them. Mesa is able to allocate and manage MP
counters (per-app) and can also call the kernel in order to program global
counters via the ioctl interface that will be implemented. At this stage, mesa
will be able to expose them in GL_AMD_performance_monitor and nouveau-perfkit.

GL_AMD_performance_monitor

GL_AMD_performance_monitor[9] is an OpenGL extension which can be used to
capture and report performance counters. This is a great extension for Linux
developers, but it currently does not report any performance counters from
NVIDIA’s GPUs. Once the core implementation is in mesa, this task should
not be too hard since I already have a branch[7] of mesa with core support for
GL_AMD_performance_monitor. Thanks to Kenneth Graunke and Christoph Bumiller.

nouveau-perfkit

nouveau-perfkit will be a Linux/Nouveau version of NVPerfKit. This tool will be
based on mesa’s implementation. nouveau-perfkit will export both GPU graphics
counters (only nv50/Tesla at first) and compute counters (nvc0+). To
maintain interoperability with NVIDIA, I am thinking about re-using the
interface of NVIDIA’s NVPerfKit. This tool will be for Nouveau only.

GSoC work

Required tasks:
– core implementation (kernel interface + ioctls + mesa)
– expose graphics counters through GL_AMD_performance_monitor
– add nouveau-perfkit, a Linux version of NVPerfKit

Optional tasks (if I have the time):
– reverse engineer NVIDIA’s GPU graphics counters for Fermi and Kepler
– all the work which can be done around performance counters

Approximate schedule

(now until 19 May)
– complete the documentation of signals on nv50/tesla
– write OpenGL sample code to test these graphics counters
– test the reverse engineering on Nouveau (mostly done) and write piglit tests
– think more about the core implementation

(19 May until 18 July)
– core implementation of GPU graphics counters
(kernel interface + ioctls + mesa)

(18 July to 28 July)
– expose graphics counters through GL_AMD_performance_monitor

(28 July to 18 August)
– implement nouveau-perfkit based on mesa, following the NVPerfKit interface

(after GSoC)
– As last year, I’ll continue to work on Nouveau after the end of this
Google Summer of Code 2014 because I like this job, it’s fun :-).

Thank you for reading. Have a good day.

References

[1] https://hakzsam.wordpress.com/2013/05/27/google-summer-of-code-2013-proposal-for-x-org-foundation/
[2] https://developer.nvidia.com/nvidia-perfkit
[3] http://docs.nvidia.com/cuda/cupti/index.html
[4] https://github.com/hakzsam/re-pcounter-tools/tree/master/src
[5] http://www.gremedy.com/
[6] https://github.com/envytools/envytools
[7] https://github.com/hakzsam/re-pcounter-tools/tree/master/hwdocs/pcounter
[8] https://github.com/envytools/envytools/blob/master/hwdocs/pcounter/intro.rst
[9] https://www.opengl.org/registry/specs/AMD/performance_monitor.txt

libpciaccess has now official support for Windows/Cygwin

Hey,

During my Google Summer of Code 2013, a part of my project was to reverse engineer GPU graphics counters on NVIDIA Tesla. However, these counters are only exposed on Windows through the NVIDIA NVPerfKit performance tools.

Usually the Nouveau community uses envytools, a collection of tools to help developers understand how NVIDIA GPUs work. Envytools depends on libpciaccess which is only available on POSIX platforms. That’s why I decided to port libpciaccess to Windows/Cygwin to be able to use these tools.

This port depends on WinIo which allows direct I/O port and physical memory access under Windows NT/2000/XP/2003/Vista/7 and 2008.

This port has been accepted in libpciaccess/master and merged today. It has only been tested on 32-bit Windows 7, and still has to be checked and fixed on 64-bit.

To use it, please follow the instructions found in README.cygwin.

This support helped me to understand how GPU graphics counters work on NVIDIA Tesla. I have started documenting these counters here.

See you later!

NV50 graphics counters are now almost fully documented

Hello everyone,

The second part of my GSoC project was to understand how NVIDIA graphics counters work on the Tesla family. As described in my previous post, I used my own implementation of libpciaccess on Windows 7 in order to read the PCOUNTER configuration of these signals through NVPerfKit and gDEBugger.

After some weeks of hard work, I have succeeded in documenting most of these signals. However, I still don’t understand some of them (vertex_shader_busy, for example), but I’ll try to work them out as soon as possible.

The results of my research are available on my GitHub.

The next step is to complete the documentation. After that, it could be interesting to provide an implementation like the NVPerfSDK for Linux.

Have a good day. 😉

libpciaccess has now Windows support through WinIo and Cygwin

libpciaccess is the best-known generic library for accessing PCI devices under Linux and BSD systems.

As you may know, libpciaccess is not supported under Windows, for various reasons that I don’t really know. The most important one is probably that almost all developers of open source drivers use Linux only.

However, as part of my Google Summer of Code, I need to use Windows in order to reverse engineer GPU graphics counters, which are only available through NVPerfKit.

These counters are programmed using PCOUNTER, the hardware unit that contains performance monitoring counters, and they are exposed by MMIO. So, I need full access to PCI devices in order to map the physical memory used by the blob into the virtual address space.

So, I added Windows support to libpciaccess, which now allows me to use the NVA tools (nvapeek, nvapoke…). That support is for Cygwin only, mainly because I didn’t test my implementation under MinGW, but I believe it should be really easy to port.

See you soon!

nvc0 compute support is now fixed

If you try to monitor MP performance counters through the HUD on nvc0, you should get the following error message:

gallium_hud: all queries are busy after 8 frames, can’t add another query.

This message occurs when the compute kernel is not synchronized, i.e. when it doesn’t run correctly.

Now, if you take a quick look at the Linux kernel error messages, you should see the following detail:

DATA_ERROR [INVALID_VALUE] ch 4 [0x000027f839 glxgears[11550]] subc 1 class 0xa0c0 mthd 0x02e8 data 0x0040cccc

Actually, data must be aligned to 0x8000 on nvc0 according to rnndb.
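
To make the alignment constraint concrete, here is a small Python sketch of the arithmetic. It only assumes that the 0x8000 alignment applies to the data word reported in the error line; the helper names are hypothetical, and this is not the actual patch.

```python
ALIGNMENT = 0x8000  # alignment quoted in the text above


def is_aligned(value, alignment=ALIGNMENT):
    # For a power-of-two alignment, the low bits must all be zero.
    return (value & (alignment - 1)) == 0


def align_up(value, alignment=ALIGNMENT):
    # Round up to the next multiple of a power-of-two alignment.
    return (value + alignment - 1) & ~(alignment - 1)


rejected = 0x0040cccc  # data word from the DATA_ERROR line
print(is_aligned(rejected))     # False
print(hex(align_up(rejected)))  # 0x410000
```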

A three-line patch fixes the compute support on nvc0.

How to decode the pushbuffer using valgrind-mmt and dedma?

In some cases, information is not exposed through MMIO registers and the blob uses FIFO methods instead. In fact, the blob uses FIFO methods for enabling MP counters. Let me explain how to decode them.

 
In this example, I use the NVC1 chipset, and I want to decode the pushbuffer used by the NVC0_COMPUTE class (0x000090c0).

First, you have to trace a signal using cupti_trace:

$ cupti_trace --trace NVC1 --event active_cycles

Now, you have to grep for the FIFO object class id 0x000090c0.

$ grep 0x000090c0 active_cycles.trace
--6903-- out2 0x00000004 0x00000002 0x00000003 0x0000003d 0x0000003e 0x0000003f 0x00000040 0x00009197 0x000090b8 0x00000073 0x00005080 0x00009072 0x00009074 0x0000844c 0x000090dd 0x000090b2 0x000090b1 0x00008570 0x0000857a 0x0000857b 0x0000857c 0x0000857d 0x0000857e 0x0000007d 0x00009068 0x0000907f 0x0000906f 0x0000902d 0x00009097 0x000090c0 0x00009039 0x000090e0 0x000090e6 0x000090e2 0x000090e3 0x000050a0 0x00009096 0x000090e1 0x000090b3 0x000090b5 0x0000208a 0x000085b6 0x00009067 0x000090f1 0x0000503b 0x0000503c 0x00000075 
--6903-- out2 0x00000004 0x00000002 0x00000003 0x0000003d 0x0000003e 0x0000003f 0x00000040 0x00009197 0x000090b8 0x00000073 0x00005080 0x00009072 0x00009074 0x0000844c 0x000090dd 0x000090b2 0x000090b1 0x00008570 0x0000857a 0x0000857b 0x0000857c 0x0000857d 0x0000857e 0x0000007d 0x00009068 0x0000907f 0x0000906f 0x0000902d 0x00009097 0x000090c0 0x00009039 0x000090e0 0x000090e6 0x000090e2 0x000090e3 0x000050a0 0x00009096 0x000090e1 0x000090b3 0x000090b5 0x0000208a 0x000085b6 0x00009067 0x000090f1 0x0000503b 0x0000503c 0x00000075 
--6903-- pre_ioctl: fd:3, id:0x2b (full:0xc020462b), data: 0xc1d00511 0x5c0000c9 0x5c0000ca 0x000090c0 0x00000000 0x00000000 0x00000000 0x00000000 
--6903-- post_ioctl: fd:3, id:0x2b (full:0xc020462b), data: 0xc1d00511 0x5c0000c9 0x5c0000ca 0x000090c0 0x00000000 0x00000000 0x00000000 0x00000000 
--6903-- out 0x5c0000ca 0x000090c0 0x000090c0 0x00000001 
--6903-- w 2:0x2004, 0x000090c0 
--6903-- w 11:0x24300, 0x000090c3,0x000090c2,0x000090c1,0x000090c0 
--6903-- w 9:0x24300, 0x000090c3,0x000090c2,0x000090c1,0x000090c0 
--6903-- r 10:0x12180, 0x000090c6,0x000090c4,0x000090c2,0x000090c0 
--6903-- pre_ioctl: fd:3, id:0x2b (full:0xc020462b), data: 0xc1d00511 0x5c0000ec 0x5c0000ed 0x000090c0 0x00000000 0x00000000 0x00000000 0x00000000 
--6903-- post_ioctl: fd:3, id:0x2b (full:0xc020462b), data: 0xc1d00511 0x5c0000ec 0x5c0000ed 0x000090c0 0x00000000 0x00000000 0x00000000 0x00000000 
--6903-- out 0x5c0000ed 0x000090c0 0x000090c0 0x00000001 
--6903-- w 15:0x2004, 0x000090c0

The following line contains the map id, which is 2 in this example:

--6903-- w 2:0x2004, 0x000090c0
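
If you prefer not to pick the map id out by eye, the lookup can be sketched in Python. This is only an illustration: the regular expression is inferred from the trace lines shown above, and find_map_ids is a hypothetical helper, not part of valgrind-mmt or envytools.

```python
import re

# Matches write lines such as "--6903-- w 2:0x2004, 0x000090c0".
WRITE_LINE = re.compile(r"^--\d+-- w (\d+):0x([0-9a-fA-F]+), (.+)$")


def find_map_ids(lines, class_id):
    """Return the map ids of single-word writes whose value is class_id."""
    ids = []
    for line in lines:
        m = WRITE_LINE.match(line.strip())
        if not m:
            continue
        values = [int(v, 16) for v in m.group(3).split(",")]
        if values == [class_id]:
            ids.append(int(m.group(1)))
    return ids


sample = [
    "--6903-- out 0x5c0000ca 0x000090c0 0x000090c0 0x00000001 ",
    "--6903-- w 2:0x2004, 0x000090c0 ",
    "--6903-- w 11:0x24300, 0x000090c3,0x000090c2,0x000090c1,0x000090c0 ",
    "--6903-- w 15:0x2004, 0x000090c0",
]
map_ids = find_map_ids(sample, 0x90c0)
print(map_ids)  # [2, 15]
```

On the grep output above, this reports both 2 and 15. This post uses the first one, 2.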

Now, you have to use dedma, which decodes the pushbuffer using rnndb (the output is truncated here).

$ dedma -m c0 -v 2 active_cycles.trace > active_cycles.dedma
20014000  size 1, subchannel 2 (0x0), offset 0x0000, increment
000090c0    NVC0_COMPUTE mapped to subchannel 2
20014040  size 1, subchannel 2 (0x90c0), offset 0x0100, increment
00000000    NVC0_COMPUTE.GRAPH.NOP = 0
200141d6  size 1, subchannel 2 (0x90c0), offset 0x0758, increment
00000002    NVC0_COMPUTE.MP_LIMIT = 0x2
200141e4  size 1, subchannel 2 (0x90c0), offset 0x0790, increment
00000000    NVC0_COMPUTE.TEMP_ADDRESS_HIGH = 0
200141e5  size 1, subchannel 2 (0x90c0), offset 0x0794, increment
10000000    NVC0_COMPUTE.TEMP_ADDRESS_LOW = 0x10000000
200141e6  size 1, subchannel 2 (0x90c0), offset 0x0798, increment
00000000    NVC0_COMPUTE.TEMP_SIZE_HIGH = 0
200141e7  size 1, subchannel 2 (0x90c0), offset 0x079c, increment
00700000    NVC0_COMPUTE.TEMP_SIZE_LOW = 0x700000
200141e8  size 1, subchannel 2 (0x90c0), offset 0x07a0, increment
00012600    NVC0_COMPUTE.WARP_TEMP_ALLOC = 0x12600
200141df  size 1, subchannel 2 (0x90c0), offset 0x077c, increment
03000000    NVC0_COMPUTE.LOCAL_BASE = 0x3000000
20014081  size 1, subchannel 2 (0x90c0), offset 0x0204, increment
000000f0    NVC0_COMPUTE.LOCAL_POS_ALLOC = 0xf0
20014082  size 1, subchannel 2 (0x90c0), offset 0x0208, increment
000007c0    NVC0_COMPUTE.LOCAL_NEG_ALLOC = 0x7c0
20014083  size 1, subchannel 2 (0x90c0), offset 0x020c, increment
00001000    NVC0_COMPUTE.WARP_CSTACK_SIZE = 0x1000
20014359  size 1, subchannel 2 (0x90c0), offset 0x0d64, increment
0000000f    NVC0_COMPUTE.CALL_LIMIT_LOG = 0xf
200140c2  size 1, subchannel 2 (0x90c0), offset 0x0308, increment
00000003    NVC0_COMPUTE.CACHE_SPLIT = 48K_SHARED_16K_L1
20014085  size 1, subchannel 2 (0x90c0), offset 0x0214, increment
01000000    NVC0_COMPUTE.SHARED_BASE = 0x1000000
20014093  size 1, subchannel 2 (0x90c0), offset 0x024c, increment
00000000    NVC0_COMPUTE.SHARED_SIZE = 0
200140a8  size 1, subchannel 2 (0x90c0), offset 0x02a0, increment
00008000    NVC0_COMPUTE.UNK02A0 = 0x8000
2001408e  size 1, subchannel 2 (0x90c0), offset 0x0238, increment
00010001    NVC0_COMPUTE.GRIDDIM_YX = { X = 1 | Y = 1 }
2001408f  size 1, subchannel 2 (0x90c0), offset 0x023c, increment
00000001    NVC0_COMPUTE.GRIDDIM_Z = 1
200140eb  size 1, subchannel 2 (0x90c0), offset 0x03ac, increment
00010001    NVC0_COMPUTE.BLOCKDIM_YX = { X = 1 | Y = 1 }
200140ec  size 1, subchannel 2 (0x90c0), offset 0x03b0, increment
00000001    NVC0_COMPUTE.BLOCKDIM_Z = 1
200140b1  size 1, subchannel 2 (0x90c0), offset 0x02c4, increment
00000000    NVC0_COMPUTE.UNK02C4 = FALSE
...

However, dedma fails to parse the pushbuffer when the blob uses method data from a different buffer, so you have to do that part by hand, but it’s pretty easy. You just have to find the data after the 0x20014cef address. In this example, I find 0xaaaa0 which is the value of MP_PM_OP https://github.com/pathscale/envytools/blob/master/rnndb/nvc0_compute.xml#L252.
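
When decoding by hand, it helps to split a header word into its fields. The Python sketch below assumes a bit layout inferred from the dedma output above (mode in bits 29-31, size in bits 16-28, subchannel in bits 13-15, method offset divided by 4 in bits 0-12). It reproduces the lines shown, but treat the layout as an assumption rather than a reference. decode_header is a hypothetical helper.

```python
def decode_header(word):
    """Split a pushbuffer header word into its fields (assumed layout)."""
    return {
        "mode": (word >> 29) & 0x7,      # 1 is shown as "increment" by dedma
        "size": (word >> 16) & 0x1fff,   # number of data words that follow
        "subchannel": (word >> 13) & 0x7,
        "offset": (word & 0x1fff) << 2,  # method offset in bytes
    }


for word in (0x20014040, 0x200141d6, 0x20014359):
    h = decode_header(word)
    print(f"{word:08x}  size {h['size']}, subchannel {h['subchannel']}, "
          f"offset 0x{h['offset']:04x}")
```

Under the same assumption, 0x20014cef decodes to size 1 on subchannel 2, so a single data word follows that header.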

See you! 😉

MP performance counters are now implemented on nvc0:nvc8

After two weeks of hard work, I managed to add support for MP performance counters on nvc0:nvc8. I have only tested my implementation on nvc1, but it should work on the other chipsets. The exception is nvc8, which I’ll add in the next few weeks. In order to add this support, I had to implement compute support for nvc0, that is, the ability to launch a kernel. My work is based on the compute support implementation of Christoph Bumiller (alias calim) http://people.freedesktop.org/~chrisbmr/90c0.c .

http://lists.freedesktop.org/archives/mesa-dev/2013-July/041448.html

http://lists.freedesktop.org/archives/mesa-dev/2013-July/041449.html