Google Summer of Code 2013 – Proposal for X.Org Foundation

Title
Reverse engineering NVIDIA’s performance counters and exposing them via nv_perfmon.

Short description
The goal of this project is to reverse engineer NVIDIA’s performance counters, which are exposed through the CUDA compute profiler, itself built on CUPTI, a high-level API. That profiler lets users gather timing information about kernel execution and memory transfer operations. It can be used to identify performance bottlenecks in multi-kernel applications or to quantify the benefit of optimizing a single kernel. The main goal of this proposal is to implement the same kind of profiler for the nouveau open source driver and then extend it by adding non-compute-related signals.

Name and Contact Information
Name: Samuel Pitoiset
E-mail: samuel.pitoiset at gmail.com
Nickname: hakzsam
IRC: hakzsam at irc.freenode.org

Biographical Information
I’m a master’s student at the University of Bordeaux, France. I already have some experience in open source: I participated in the Google Summer of Code 2010 on FreedroidRPG [1] (with Arthur Huillet as a mentor) and again in 2012, this time on libav. Last summer, my project [2] was to implement the variants of the RTMP protocol natively in libav. Thanks to these experiences, I have solid skills in the C programming language and I know how an open source project works (workflow, IRC, mailing lists, bug trackers…). I also use Git and Vim for programming, and I have been an Arch Linux user for 3 years.

Synopsis
NVIDIA’s performance counters allow a GPU application developer to trace an application and identify performance bottlenecks. The logic behind the performance counters has mostly been reverse engineered by Marcin Kościelnicki (mwk), but the signals they monitor are still mostly unknown.

Some signals have been reverse engineered by mwk and Martin Peres (mupuf) on the nv40-c0 family. The result is visible in nvacounter. However, most signals, such as the number of cache hits or misses, are very difficult to find.

Using NVIDIA’s CUPTI (the CUDA compute profiler [3]), Christoph Bumiller (calim) and Ben Skeggs (darktama) have been able to use NVIDIA’s documented profiler to increase the number of known signals.

The main goal of this project is to follow calim and darktama’s path and continue documenting signals by exploiting CUPTI.

These signals depend heavily on the chipset. Consequently, I’ll write an automatic tool to help us reverse engineer them. I will test this tool on every card I have access to (around 20), and the tool could be added to envytools.

I have already written a first version: http://paste.awesom.eu/nM6 .

Benefits to the Community
This project mainly benefits developers, who will be able to identify the performance bottlenecks of their applications.

Investigation
Even though I was not active on #nouveau before last week, I talked with mupuf a few months ago about NVIDIA’s hardware and about OpenCL. We discussed the possibility of participating in an EVoC, but I had a lot of work during my school year, so it was not possible before the summer. Now I’m very motivated and I can work full-time.

For one week, I investigated the performance counters provided by NVIDIA. I tried to understand what a performance counter is by reading some documentation in the envytools repository [4]. At first, I tried to reverse engineer some signals using envytools (nvapeek, nvapoke and lookup) and a Python script [5] written by mupuf, which compares two traces and displays the differences.

I’ll describe the method I used (on my nv86) with some signals such as branch, instructions or tex_cache_hit:
1. write 0 to registers a100 to b000
2. dump the first trace
3. enable the profiler and launch a CUDA sample
4. dump the second trace
5. compare these two traces
(I also wrote a little C program for that [6])
However, because the proprietary driver uses the performance counters to implement DVFS (Dynamic Voltage/Frequency Scaling), with this method the values keep being modified by the kernel, which makes the comparison unreliable.
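As a rough illustration of step 5, here is a minimal Python sketch of comparing two register dumps. It is not the script [5] or the C program [6]: the "address: value" dump format, the function names and the sample values are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple


def parse_dump(text: str) -> Dict[int, int]:
    """Parse lines of the ASSUMED form '<hex address>: <hex value>'."""
    regs: Dict[int, int] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        addr, value = line.split(":", 1)
        regs[int(addr, 16)] = int(value, 16)
    return regs


def diff_dumps(
    before: Dict[int, int], after: Dict[int, int]
) -> List[Tuple[int, Optional[int], Optional[int]]]:
    """Return (address, old, new) for every register whose value differs.

    A register missing from one dump is reported with None on that side.
    """
    changes = []
    for addr in sorted(before.keys() | after.keys()):
        old = before.get(addr)
        new = after.get(addr)
        if old != new:
            changes.append((addr, old, new))
    return changes


def fmt(value: Optional[int]) -> str:
    """Format a register value, or a placeholder when it is absent."""
    return "--------" if value is None else f"{value:08x}"


# Made-up values for illustration only, not real measurements.
first = parse_dump("0000a100: 00000000\n0000a104: 00000000\n")
second = parse_dump("0000a100: 00000000\n0000a104: 00000318\n")
for addr, old, new in diff_dumps(first, second):
    print(f"{addr:08x}: {fmt(old)} -> {fmt(new)}")
```

Because of the DVFS issue described above, a real comparison would probably need several dumps to filter out registers that change on their own.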

After that, I contacted calim, who had implemented [7] MP performance counter monitoring on nvc0. He told me that the CUPTI interface makes ioctl calls to the kernel driver to set up the counters. Luckily, the interface is very simple: the blob simply requests a read or a write of a value on a register, with or without a mask. I thus started to use valgrind-mmt to trace the ioctl calls made by the blob’s userspace and get a trace of the registers modified by CUPTI to monitor the wanted signals.
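To make the "read or write a register, with or without a mask" idea concrete, here is a small Python model of such a request. This is only a sketch: the real ioctl layout is not described here, so the RegRequest fields, the convention that a mask of 0 means "no mask", and the read-modify-write semantics of a masked write are all my assumptions.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RegRequest:
    """Hypothetical model of one register request (not the real ioctl layout)."""
    write: bool      # True for a write, False for a read
    register: int    # register address
    value: int = 0   # value to write (ignored for reads)
    mask: int = 0    # ASSUMPTION: 0 means "no mask", i.e. write all 32 bits


def apply_request(regs: Dict[int, int], req: RegRequest) -> int:
    """Apply a request to a simulated register file and return the register value."""
    old = regs.get(req.register, 0)
    if not req.write:
        return old
    if req.mask == 0:
        new = req.value
    else:
        # ASSUMED semantics: only the bits selected by the mask are replaced.
        new = (old & ~req.mask) | (req.value & req.mask)
    regs[req.register] = new & 0xFFFFFFFF
    return regs[req.register]


# Usage on a simulated register file; the address is arbitrary.
regs = {0x1000: 0xFFFF0000}
apply_request(regs, RegRequest(write=True, register=0x1000, value=0x12, mask=0xFF))
print(hex(apply_request(regs, RegRequest(write=False, register=0x1000))))
```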

I applied a patch [8] from calim to valgrind-mmt in order to display more useful information about these ioctl calls. With this modified version of valgrind, I was able to monitor the changes made by NVIDIA’s driver and, thanks to the already reverse-engineered logic, to reverse engineer some signals.

I’ll give you an example of a trace that I obtained using the first version of my automatic tool. In this example, I monitored the signal ‘warps_launched’; you can see the trace here http://pastebin.com/raw.php?i=m3BxVcch and the log of the CUDA compute profiler here http://paste.awesom.eu/dAv . The most interesting line is:

(r) register: 504674, value: 00000318, mask: 00000000 ==> PGRAPH.GPC[0].TP[0].MP.PM_COUNTER[0] => 0x318

Indeed, 0x318 (792 in decimal) is the number of launched warps returned by the CUDA compute profiler (warps_launched=[ 792 ]). Even though this example is very simple, it shows that the method I use makes it possible to reverse engineer these signals.
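This check can be automated. The Python sketch below parses the trace line quoted above and compares the counter value with the number reported by the profiler. The regular expression is written only from that single line, so the existence of a "(w)" marker for writes and the function name are assumptions; the register field is kept as a string because the trace does not state its base.

```python
import re
from typing import Optional

# Pattern derived from the one trace line shown above; "(w)" is an assumption.
TRACE_LINE = re.compile(
    r"\((?P<op>[rw])\) register: (?P<reg>[0-9a-fA-F]+), "
    r"value: (?P<value>[0-9a-fA-F]+), mask: (?P<mask>[0-9a-fA-F]+)"
    r"(?: ==> (?P<name>\S+))?"
)


def parse_trace_line(line: str) -> Optional[dict]:
    """Return the fields of a trace line, or None if it does not match."""
    m = TRACE_LINE.search(line)
    if m is None:
        return None
    return {
        "op": m.group("op"),
        "register": m.group("reg"),          # kept as text, base not assumed
        "value": int(m.group("value"), 16),  # shown as hex in the trace
        "mask": int(m.group("mask"), 16),
        "name": m.group("name"),
    }


line = ("(r) register: 504674, value: 00000318, mask: 00000000 "
        "==> PGRAPH.GPC[0].TP[0].MP.PM_COUNTER[0] => 0x318")
entry = parse_trace_line(line)
warps_launched = 792  # value reported by the CUDA compute profiler
print(entry["name"], entry["value"], entry["value"] == warps_launched)
```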

GSoC work
Priority tasks:
- continue to reverse engineer some performance counters on Kepler/Fermi cards (nvcf/c1)
- try to reverse engineer MP counters and their signals on Tesla cards (nv86/a3)
- write shaders/CL kernels in order to better understand these counters and reverse engineer non-compute-related signals
- add support in nv_perfmon for everything we find
- try to make the reverse engineering process as automatic as possible

Future work:
- expose the API in userspace (kernel + libdrm)
- expose these counters through AMD performance counters
- try to expose them in apitrace
- write a HUD in Mesa

After GSoC work
Currently, I cannot say how much time the previously mentioned tasks will take. Therefore, some of these tasks (mainly the future work tasks) could be developed after the Google Summer of Code.

Schedule
This schedule is approximate.
(May 14 to June 17)
- continue to study documentation and to reverse engineer some performance counters on Kepler/Fermi cards
(June 17 to July 1)
- try to make the reverse engineering process as automatic as possible
(July 1 to July 29)
- try to reverse engineer MP counters and their signals on Tesla cards
(July 29 to end)
- write shaders/CL kernels in order to better understand these counters and reverse engineer non-compute-related signals

Support for the reverse-engineered signals will be added to nv_perfmon progressively.

Regarding my school work at the university, I expect the next school year to start on September 2nd. This is not a big problem in my opinion: I’m already on holiday, so I will be available full-time for four months (May to August).

References
[1] http://code.google.com/p/google-summer-of-code-2010-freedroidrpg/source/browse/#svn%2Ftrunk
[2] http://www.google-melange.com/gsoc/project/google/gsoc2012/hakzsam66/63002
[3] http://www.cs.cmu.edu/afs/cs/academic/class/15668-s11/www/cuda-doc/Compute_Profiler.txt
[4] https://github.com/pathscale/envytools/blob/master/hwdocs/pcounter.txt
[5] http://paste.awesom.eu/1vy [peek_diff.py]
[6] http://paste.awesom.eu/1bj
[7] http://cgit.freedesktop.org/mesa/mesa/commit/?id=ee624ced364bfd2f896809874ef3a808a11c5ecf
[8] http://people.freedesktop.org/~chrisbmr/vmmt-ioc208.diff
