CUPTI: Understanding the event collection modes

The event collection mode determines the period over which the events within the enabled event groups will be collected. There are two main modes:

  • Continuous mode: Events are collected for the entire duration between the cuptiEventGroupEnable and cuptiEventGroupDisable calls. This is the default mode.
  • Kernel mode: Events are collected only for the durations of the kernel executions that occur between the cuptiEventGroupEnable and cuptiEventGroupDisable calls. Event collection begins when a kernel execution begins and stops when it completes. If multiple kernel executions occur between the cuptiEventGroupEnable and cuptiEventGroupDisable calls, the event values must be read after each kernel launch if they need to be associated with a specific launch.
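
To make the two modes concrete, here is a minimal sketch of the Event API workflow (untested, error checking omitted; the context, device and kernel launch are assumed to be set up elsewhere):

#include <stdio.h>
#include <stdint.h>
#include <cuda.h>
#include <cupti.h>

/* Sketch: select the collection mode, then enable a group, run some
 * work, and read the counter. */
void count_warps_launched(CUcontext ctx, CUdevice dev)
{
    CUpti_EventGroup group;
    CUpti_EventID event;
    uint64_t value;
    size_t size = sizeof(value);

    /* Continuous mode is the default; kernel mode would use
     * CUPTI_EVENT_COLLECTION_MODE_KERNEL instead. */
    cuptiSetEventCollectionMode(ctx, CUPTI_EVENT_COLLECTION_MODE_CONTINUOUS);

    cuptiEventGetIdFromName(dev, "warps_launched", &event);
    cuptiEventGroupCreate(ctx, &group, 0);
    cuptiEventGroupAddEvent(group, event);

    cuptiEventGroupEnable(group);
    /* ... launch one or more kernels here ... */
    cuptiEventGroupReadEvent(group, CUPTI_EVENT_READ_FLAG_NONE, event,
                             &size, &value);
    cuptiEventGroupDisable(group);
    cuptiEventGroupDestroy(group);

    printf("warps_launched = %llu\n", (unsigned long long)value);
}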

1. Program the continuous mode (default mode)
Before configuring the source selection, the blob initializes the following registers according to the number of sources.

One source:

(w) register: 504660, value: 0000aaaa, mask: ffffffff  { 0 = 0xaaaa | 1 = 0 }
(w) register: 504664, value: 00000000, mask: ffffffff  { 0 = 0 | 1 = 0 }
(w) register: 504668, value: 00000000, mask: ffffffff  { 0 = 0 | 1 = 0 }
(w) register: 50466c, value: 00000000, mask: ffffffff  { 0 = 0 | 1 = 0 }

Two sources:

(w) register: 504660, value: aaaaaaaa, mask: ffffffff  { 0 = 0xaaaa | 1 = 0xaaaa }
(w) register: 504664, value: 00000000, mask: ffffffff  { 0 = 0 | 1 = 0 }
(w) register: 504668, value: 00000000, mask: ffffffff  { 0 = 0 | 1 = 0 }
(w) register: 50466c, value: 00000000, mask: ffffffff  { 0 = 0 | 1 = 0 }

Three sources:

(w) register: 504660, value: aaaaaaaa, mask: ffffffff  { 0 = 0xaaaa | 1 = 0xaaaa }
(w) register: 504664, value: 0000aaaa, mask: ffffffff  { 0 = 0xaaaa | 1 = 0 }
(w) register: 504668, value: 00000000, mask: ffffffff  { 0 = 0 | 1 = 0 }
(w) register: 50466c, value: 00000000, mask: ffffffff  { 0 = 0 | 1 = 0 }

And so on…
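
Reading these traces, the pattern seems to be one 16-bit field per source, initialized to 0xaaaa and packed two sources per register, starting at 0x504660. A small sketch of that reading (my interpretation of the traces, not actual blob code):

#include <stdint.h>

/* Compute the four init values written to 0x504660/4/8/c for a given
 * number of sources (assumption inferred from the traces above). */
void continuous_mode_init(uint32_t regs[4], unsigned nr_sources)
{
    unsigned i;

    for (i = 0; i < 4; i++) {
        regs[i] = 0;
        if (2 * i < nr_sources)
            regs[i] |= 0x0000aaaa;   /* field 0 of register i */
        if (2 * i + 1 < nr_sources)
            regs[i] |= 0xaaaa0000;   /* field 1 of register i */
    }
}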

In the trace below, there is only one source.

(Configure signals selection)
(Configure mode)
(w) register: 504660, value: 0000aaaa, mask: ffffffff  { 0 = 0xaaaa | 1 = 0 }
(w) register: 504664, value: 00000000, mask: ffffffff  { 0 = 0 | 1 = 0 }
(w) register: 504668, value: 00000000, mask: ffffffff  { 0 = 0 | 1 = 0 }
(w) register: 50466c, value: 00000000, mask: ffffffff  { 0 = 0 | 1 = 0 }
(Configure sources selection)
(Read counters)
(w) register: 504660, value: 00000000, mask: ffffffff <== PGRAPH.GPC[0].TP[0].MP.PM_FUNC[0]   => { 0 = 0 | 1 = 0 }
(w) register: 504664, value: 00000000, mask: ffffffff <== PGRAPH.GPC[0].TP[0].MP.PM_FUNC[0x1] => { 0 = 0 | 1 = 0 }
(w) register: 504668, value: 00000000, mask: ffffffff <== PGRAPH.GPC[0].TP[0].MP.PM_FUNC[0x2] => { 0 = 0 | 1 = 0 }
(w) register: 50466c, value: 00000000, mask: ffffffff <== PGRAPH.GPC[0].TP[0].MP.PM_FUNC[0x3] => { 0 = 0 | 1 = 0 }
(w) register: 504660, value: 00000000, mask: ffffffff <== PGRAPH.GPC[0].TP[0].MP.PM_FUNC[0]   => { 0 = 0 | 1 = 0 }
(w) register: 504664, value: 00000000, mask: ffffffff <== PGRAPH.GPC[0].TP[0].MP.PM_FUNC[0x1] => { 0 = 0 | 1 = 0 }
(w) register: 504668, value: 00000000, mask: ffffffff <== PGRAPH.GPC[0].TP[0].MP.PM_FUNC[0x2] => { 0 = 0 | 1 = 0 }
(w) register: 50466c, value: 00000000, mask: ffffffff <== PGRAPH.GPC[0].TP[0].MP.PM_FUNC[0x3] => { 0 = 0 | 1 = 0 }

After reading the counters, the blob re-initializes these registers to 0, apparently twice (see the trace above).

2. Program the kernel mode
Before configuring the source selection, the blob initializes the following registers to 0.

(Configure signals selection)
(Configure mode)
(w) register: 504660, value: 00000000, mask: ffffffff <== PGRAPH.GPC[0].TP[0].MP.PM_FUNC[0]   => { 0 = 0 | 1 = 0 }
(w) register: 504664, value: 00000000, mask: ffffffff <== PGRAPH.GPC[0].TP[0].MP.PM_FUNC[0x1] => { 0 = 0 | 1 = 0 }
(w) register: 504668, value: 00000000, mask: ffffffff <== PGRAPH.GPC[0].TP[0].MP.PM_FUNC[0x2] => { 0 = 0 | 1 = 0 }
(w) register: 50466c, value: 00000000, mask: ffffffff <== PGRAPH.GPC[0].TP[0].MP.PM_FUNC[0x3] => { 0 = 0 | 1 = 0 }
(Configure sources selection)
(Read counters)

This was tested with the following signals (domains c and d) on an NVC1 chipset:

active_cycles
active_warps
atom_count
branch
divergent_branch
gld_inst_128bit
gld_inst_16bit
gld_inst_32bit
gld_inst_64bit
gld_inst_8bit
gld_request
gred_count
gst_inst_128bit
gst_inst_16bit
gst_inst_32bit
gst_inst_64bit
gst_inst_8bit
gst_request
inst_executed
inst_issued1_0
inst_issued1_1
inst_issued2_0
inst_issued2_1
local_load
local_store
prof_trigger_00
prof_trigger_01
prof_trigger_02
prof_trigger_03
prof_trigger_04
prof_trigger_05
prof_trigger_06
prof_trigger_07
shared_load
shared_store
thread_inst_executed_0
thread_inst_executed_1
thread_inst_executed_2
thread_inst_executed_3
threads_launched
warps_launched


How To: Reverse engineering a performance counter

In this example, we will study the warps_launched event, which is quite simple.

Please make sure you have the CUDA toolkit installed on your system and a CUDA sample compiled before continuing.

Step 1: Enable and configure the profiler

Enable the profiler:

export COMPUTE_PROFILE=1
export COMPUTE_PROFILE_CONFIG=perf_conf.txt

Configure the profiler:

# perf_conf.txt
warps_launched

Step 2: Take a trace with a modified version of valgrind-mmt

valgrind --tool=mmt --mmt-trace-file=/dev/nvidia0 --mmt-trace-nvidia-ioctls ./vectorAddDrv &> valgrind_mmt_trace.log

You can also take a look at the profiling output:

$ cat cuda_profile_0.log 
# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 GeForce GT 430
# CUDA_CONTEXT 1
# TIMESTAMPFACTOR fffff68311f26108
method,gputime,cputime,occupancy,warps_launched
method=[ memcpyHtoD ] gputime=[ 116.064 ] cputime=[ 69128.000 ] 
method=[ memcpyHtoD ] gputime=[ 116.032 ] cputime=[ 51292.000 ] 
method=[ VecAdd_kernel ] gputime=[ 67.008 ] cputime=[ 27084.000 ] occupancy=[ 1.000 ] warps_launched=[ 792 ] 
method=[ memcpyDtoH ] gputime=[ 189.120 ] cputime=[ 6512.000 ]

Step 3: Extract the post-ioctl calls from the trace and make the output more user-friendly

grep RETURND valgrind_mmt_trace.log | cut -d ' ' -f2-

Now, the output looks like this:

RETURND: DIR=1 MMIO=504600 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=1 MMIO=504e00 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=0 MMIO=504600 VALUE=00000000 MASK=00000000 UNK=00000000,00000000,00000000,00000000
RETURND: DIR=1 MMIO=504600 VALUE=80000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=101 MMIO=504604 VALUE=00000026 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=101 MMIO=504608 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=101 MMIO=50465c VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=101 MMIO=504660 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=101 MMIO=504664 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=101 MMIO=504668 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=101 MMIO=50466c VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=101 MMIO=504730 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=100 MMIO=504674 VALUE=00000318 MASK=00000000 UNK=00000000,00000000,00000000,00000000
RETURND: DIR=100 MMIO=504670 VALUE=00000000 MASK=00000000 UNK=00000000,00000000,00000000,00000000
RETURND: DIR=101 MMIO=504674 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=101 MMIO=504678 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=101 MMIO=50467c VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=101 MMIO=504680 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=101 MMIO=504684 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=101 MMIO=504688 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=101 MMIO=50468c VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=101 MMIO=504690 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=0 MMIO=504600 VALUE=80000000 MASK=00000000 UNK=00000000,00000000,00000000,00000000
RETURND: DIR=1 MMIO=504600 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000

Step 4: Use lookup (envytools) to print register names

$ lookup -a NVC1 504604 26
PGRAPH.GPC[0].TP[0].MP.PM_SIGSEL[0] => { 0 = 0x26 | 1 = 0 | 2 = 0 | 3 = 0 }

$ lookup -a NVC1 504674 318
PGRAPH.GPC[0].TP[0].MP.PM_COUNTER[0] => 0x318

Step 5: Results
We can see that PCOUNTER selects the signal 0x26 and that the result is in the register 0x504674 (0x318 = 792). 🙂

To conclude, this method seems to work fine. However, it’s a bit annoying to do these steps for each event. So, I wrote a tool to make the reverse engineering process as automatic as possible.

Trace NVidia’s ioctl calls with valgrind-mmt

Valgrind-mmt is a Valgrind modification which allows tracing application accesses to mmapped memory (which is how the userspace parts of graphics drivers communicate with the hardware). It was created by Dave Airlie and then extended/fixed by others.

In order to trace ioctl calls made by the blob’s userspace, I used a modified version of valgrind-mmt to get a trace of the registers modified by CUPTI to monitor the wanted signals. I applied the following patch from Christoph Bumiller (calim):

diff --git a/mmt/mmt_nv_ioctl.c b/mmt/mmt_nv_ioctl.c
index 23682e7..11890b0 100644
--- a/mmt/mmt_nv_ioctl.c
+++ b/mmt/mmt_nv_ioctl.c
@@ -386,6 +386,24 @@ void mmt_nv_ioctl_pre(UWord *args)
 				UInt *addr2 = (*(UInt **) (&data[4]));
 				dumpmem("in2 ", addr2[2], 0x3c);
 			}
+         else if (data[2] == 0x20800122)
+         {
+            UInt k;
+            UInt *in = (UInt *)mmt_2x4to8(data[5], data[4]);
+            UInt cnt = in[5];
+            UInt *tx = (UInt *)mmt_2x4to8(in[7], in[6]);
+            VG_(message) (Vg_DebugMsg, "<==(%u at %p)\n", cnt, tx);
+            for (k = 0; k < cnt; ++k)
+               VG_(message) (Vg_DebugMsg, "REQUEST: DIR=%x MMIO=%x VALUE=%08x MASK=%08x UNK=%08x,%08x,%08x,%08x\n",
+                             tx[k * 8 + 0],
+                             tx[k * 8 + 3],
+                             tx[k * 8 + 5],
+                             tx[k * 8 + 7],
+                             tx[k * 8 + 1],
+                             tx[k * 8 + 2],
+                             tx[k * 8 + 4],
+                             tx[k * 8 + 6]);
+         }
 			break;

 		case 0xc040464d:
@@ -565,6 +583,23 @@ void mmt_nv_ioctl_post(UWord *args)
 				UInt *addr2 = (*(UInt **) (&data[4]));
 				dumpmem("out2 ", addr2[2], 0x3c);
 			}
+         else if (data[2] == 0x20800122)
+         {
+            UInt *out = (UInt *)mmt_2x4to8(data[5], data[4]);
+            UInt cnt = out[5];
+            UInt *tx = (UInt *)mmt_2x4to8(out[7], out[6]);
+            UInt k;
+            for (k = 0; k < cnt; ++k)
+               VG_(message) (Vg_DebugMsg, "RETURND: DIR=%x MMIO=%x VALUE=%08x MASK=%08x UNK=%08x,%08x,%08x,%08x\n",
+                             tx[k * 8 + 0],
+                             tx[k * 8 + 3],
+                             tx[k * 8 + 5],
+                             tx[k * 8 + 7],
+                             tx[k * 8 + 1],
+                             tx[k * 8 + 2],
+                             tx[k * 8 + 4],
+                             tx[k * 8 + 6]);
+         }
 			break;
 			// 0x37 read configuration parameter
 		case 0xc0204638:
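
From the indices printed by the patch, each entry of the ioctl transaction buffer appears to be 8 dwords. A hypothetical C view of one entry (the field names are mine; the unk* words are the ones printed as UNK=...):

#include <stdint.h>

struct nv_pm_txn {
    uint32_t dir;    /* tx[0]: direction (0/1/100/101 in the traces) */
    uint32_t unk1;   /* tx[1] */
    uint32_t unk2;   /* tx[2] */
    uint32_t mmio;   /* tx[3]: MMIO register offset */
    uint32_t unk4;   /* tx[4] */
    uint32_t value;  /* tx[5]: value written, or value read back */
    uint32_t unk6;   /* tx[6] */
    uint32_t mask;   /* tx[7]: write mask (ffffffff for plain writes) */
};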

That patch displays the MMIO register accesses of the pre/post ioctl calls made by the blob. In order to trace these calls, you have to invoke valgrind-mmt this way:

valgrind --tool=mmt --mmt-trace-file=/dev/nvidia0 --mmt-trace-nvidia-ioctls

For example, if I want to see the post-ioctl calls of the vectorAddDrv CUDA sample when I trace the inst_executed event, I’ll use:

valgrind --tool=mmt --mmt-trace-file=/dev/nvidia0 --mmt-trace-nvidia-ioctls ./vectorAddDrv 2>&1 | grep RETURND

And the trace looks like this:

--4803-- RETURND: DIR=1 MMIO=504600 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=1 MMIO=504e00 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=0 MMIO=504600 VALUE=00000000 MASK=00000000 UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=1 MMIO=504600 VALUE=80000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=101 MMIO=504604 VALUE=002d2d2d MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=101 MMIO=504608 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=101 MMIO=50465c VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=101 MMIO=504660 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=101 MMIO=504664 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=101 MMIO=504668 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=101 MMIO=50466c VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=101 MMIO=504730 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=101 MMIO=504734 VALUE=00000011 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=101 MMIO=504738 VALUE=00000022 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=100 MMIO=504674 VALUE=0000137c MASK=00000000 UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=100 MMIO=504678 VALUE=00001208 MASK=00000000 UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=100 MMIO=50467c VALUE=000003e7 MASK=00000000 UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=100 MMIO=504670 VALUE=00000000 MASK=00000000 UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=101 MMIO=504674 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=101 MMIO=504678 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=101 MMIO=50467c VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=101 MMIO=504680 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=101 MMIO=504684 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=101 MMIO=504688 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=101 MMIO=50468c VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=101 MMIO=504690 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=0 MMIO=504600 VALUE=80000000 MASK=00000000 UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=1 MMIO=504600 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000

The CUDA Profiling Tools Interface (CUPTI)

The CUDA Profiling Tools Interface (CUPTI) enables the creation of profiling and tracing tools that target CUDA applications. CUPTI provides four APIs: the Activity API, the Callback API, the Event API, and the Metric API. Using these APIs, you can develop profiling tools that give insight into the CPU and GPU behavior of CUDA applications. CUPTI is delivered as a dynamic library on all platforms supported by CUDA.

  • The CUPTI Activity API allows you to asynchronously collect a trace of an application’s CPU and GPU CUDA activity.
  • The CUPTI Callback API allows you to register a callback into your own code. Your callback will be invoked when the application being profiled calls a CUDA runtime or driver function, or when certain events occur in the CUDA driver.
  • The CUPTI Event API allows you to query, configure, start, stop, and read the event counters on a CUDA-enabled device.
  • The CUPTI Metric API allows you to collect application metrics calculated from one or more event values.

The CUPTI Event API is the most interesting part regarding the goal of my GSoC project. That API can determine the available events on a device. An event is a countable activity, such as the number of instructions executed or the number of threads launched on a device. An event also has an ID, a short/long description, a category (memory, instructions…) and a domain. For example, on my NVC1, I have 85 events available.

A device exposes one or more event domains. Each event domain represents a group of related events available on that device. A device may have multiple instances of a domain, indicating that the device can simultaneously record multiple instances of each event within that domain.
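
As an illustration, here is a minimal sketch (error checking omitted) of how a tool could enumerate those domains and events with the Event API:

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cupti.h>

/* Print the name of every event exposed by every domain of a device. */
void list_events(CUdevice dev)
{
    uint32_t nr_domains, nr_events, i, j;
    size_t size;

    cuptiDeviceGetNumEventDomains(dev, &nr_domains);
    CUpti_EventDomainID *domains = malloc(nr_domains * sizeof(*domains));
    size = nr_domains * sizeof(*domains);
    cuptiDeviceEnumEventDomains(dev, &size, domains);

    for (i = 0; i < nr_domains; i++) {
        cuptiEventDomainGetNumEvents(domains[i], &nr_events);
        CUpti_EventID *events = malloc(nr_events * sizeof(*events));
        size = nr_events * sizeof(*events);
        cuptiEventDomainEnumEvents(domains[i], &size, events);

        for (j = 0; j < nr_events; j++) {
            char name[64];
            size = sizeof(name);
            cuptiEventGetAttribute(events[j], CUPTI_EVENT_ATTR_NAME,
                                   &size, name);
            printf("domain %u: %s\n", (unsigned)domains[i], name);
        }
        free(events);
    }
    free(domains);
}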

In order to retrieve the profiling information, CUPTI goes through the blob, which uses ioctl calls to read and write registers.

NVidia’s performance counters

Currently, the Nouveau project has RE’d some performance counters through envytools, a toolbox for people envious of nvidia’s blob driver.

The card unit which contains the performance monitoring counters is named PCOUNTER; you can find more information about it in the envytools documentation (hwdocs/pcounter.txt).

PCOUNTER is used for monitoring various activity signals from all over the card.

The CUDA Compute Profiler

NVidia provides a simple profiler for CUDA and OpenCL programs in order to allow users to identify performance bottlenecks in multi-kernel applications or to quantify the benefit of optimizing a single kernel.

That profiler can be easily used through some environment variables, and the latest version of the documentation can be found in the CUDA toolkit.

The profiler supports a lot of options (i.e. events), like the number of instructions executed (inst_executed) or the number of texture cache hits (tex_cache_hit)…

For example, if you want to trace the number of instructions executed, you have to set the following environment variables:

# Enable the profiler
export COMPUTE_PROFILE=1

# Specify a config file for enabling performance counters in the GPU
export COMPUTE_PROFILE_CONFIG=perf_config.txt

# Set to the desired file path for profiling output (cuda_profile_0.log by default) (optional)
export COMPUTE_PROFILE_LOG=perf_log.txt

# Set to either 1 (set) or 0 (unset) to enable or disable a comma-separated version of the log output. (optional)
export COMPUTE_PROFILE_CSV=1

Then, you have to specify which performance counters you want to trace in perf_config.txt. In this example, perf_config.txt looks like this:

inst_executed

Now, you can run your CUDA sample and see the output report in cuda_profile_0.log (depending on COMPUTE_PROFILE_LOG).

In my example, I ran the vectorAddDrv CUDA sample, and the output looks like this:

# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 GeForce GT 430
# CUDA_CONTEXT 1
# TIMESTAMPFACTOR fffff68322358270
method,gputime,cputime,occupancy,inst_executed
method=[ memcpyHtoD ] gputime=[ 115.328 ] cputime=[ 81.000 ]
method=[ memcpyHtoD ] gputime=[ 115.328 ] cputime=[ 65.000 ]
method=[ VecAdd_kernel ] gputime=[ 65.984 ] cputime=[ 90.000 ] occupancy=[ 1.000 ] inst_executed=[ 17848 ]
method=[ memcpyDtoH ] gputime=[ 188.288 ] cputime=[ 397.000 ]

Google Summer of Code 2013 – Proposal for X.Org Foundation

Title
Reverse engineering NVidia’s performance counters and exposing them via nv_perfmon.

Short description
The goal of this project is to reverse engineer NVidia’s performance counters, which are exposed through the CUDA compute profiler, which in turn uses CUPTI, a high-level API. That profiler allows users to gather timing information about kernel execution and memory transfer operations. The profiler can be used to identify performance bottlenecks in multi-kernel applications or to quantify the benefit of optimizing a single kernel. The main goal of this proposal is to implement the same kind of profiler for the nouveau open source driver and then extend it by adding non-compute-related signals.

Name and Contact Information
Name: Samuel Pitoiset
E-mail: samuel.pitoiset at gmail.com
Nickname: hakzsam
IRC: hakzsam at irc.freenode.org

Biographical Information
I’m a master’s student at the University of Bordeaux, France. I already have some experience in open source. I participated in the Google Summer of Code 2010 on FreedroidRPG [1] (with Arthur Huillet as a mentor) and again in 2012, this time on libav. Last summer, my project [2] was to natively implement the variants of the RTMP protocol in libav. Thanks to this experience, I have solid skills in the C programming language and I know how an open source project works (workflow, IRC, ML, bug trackers…). Otherwise, I use Git and Vim for programming and I have been an Arch Linux user for 3 years.

Synopsis
NVidia’s performance counters allow a GPU application developer to trace an application and identify its performance bottlenecks. The logic behind the performance counters has mostly been reverse engineered by Marcin Kościelnicki (mwk), but the signals it monitors are still mostly unknown.

Some signals have been reverse engineered by mwk and Martin Peres (mupuf) on the nv40-c0 family. The result is visible in nvacounter. However, most signals are very difficult to find, such as the number of cache hits or misses.

Using NVidia’s CUPTI (the CUDA compute profiler [3]), Christoph Bumiller (calim) and Ben Skeggs (darktama) have been able to use NVidia’s documented profiler to increase the number of known signals.

The main goal of this project is to follow calim and darktama’s path and continue documenting signals by exploiting CUPTI.

These signals are very dependent on the chipset. Consequently, I’ll write an automatic tool to help us reverse them. I will test this tool on every card I will have access to (around 20), and that tool could be added to envytools.

By the way, I already wrote a first version: http://paste.awesom.eu/nM6

Benefits to the Community
The benefits of this project mainly go to developers, who will be able to identify the performance bottlenecks of their applications.

Investigation
Even if I was not active on #nouveau before last week, I already talked with mupuf a few months ago about NVidia’s hardware and about OpenCL. We talked about the possibility of participating in an EVoC, but I had a lot of work during my school year, so it was not possible before the summer. Now, I’m very motivated and I can work on this full-time.

For one week, I investigated the performance counters provided by NVidia. I tried to understand what a performance counter is by reading documentation in the envytools repository [4]. At first, I tried to reverse some signals using envytools (nvapeek, nvapoke and lookup), along with a Python script [5] written by mupuf which compares two traces and displays the differences.

I’ll describe the method I used (on my nv86) with some signals such as branch, instructions or tex_cache_hit:
1. write 0 to registers a100 to b000
2. dump the first trace
3. enable the profiler and launch a cuda sample
4. dump the second trace
5. compare these two traces
(I also wrote a little C program for that [6])
However, as the proprietary driver makes use of the performance counters to implement DVFS (Dynamic Voltage/Frequency Scaling), this method is unreliable: the values keep being modified by the kernel.
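
For reference, here is a hypothetical sketch of the compare-two-dumps idea from steps 2 to 5 (not the actual program from [6]):

#include <stdio.h>
#include <stdint.h>

/* Given two dumps of a register range (e.g. a100..b000), print the
 * registers whose value changed between the two dumps. */
void diff_dumps(const uint32_t *before, const uint32_t *after,
                uint32_t base, unsigned nr_regs)
{
    unsigned i;

    for (i = 0; i < nr_regs; i++)
        if (before[i] != after[i])
            printf("%06x: %08x -> %08x\n",
                   base + 4 * i, before[i], after[i]);
}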

After that, I contacted calim, who had implemented [7] MP performance counter monitoring on nvc0. He told me that the CUPTI interface makes ioctl calls to the kernel driver to set up the counters. Luckily, the interface is very simple, as the blob simply requests a read or a write of a register value, with or without a mask. I thus started to use valgrind-mmt to trace the ioctl calls made by the blob’s userspace and get a trace of the registers modified by CUPTI to monitor the wanted signals.

I applied a patch [8] from calim to valgrind-mmt in order to display more useful information related to these ioctl calls. With this modified version of Valgrind, I was now able to monitor the changes done by NVidia and, thanks to the already reverse engineered logic, I was able to reverse some signals.

I’ll give you an example of a trace that I obtained using the first version of my automatic tool. In this example, I monitored the signal ‘warps_launched’; you can see the trace here http://pastebin.com/raw.php?i=m3BxVcch and the log of the CUDA compute profiler here http://paste.awesom.eu/dAv . The most interesting line is:

(r) register: 504674, value: 00000318, mask: 00000000 ==> PGRAPH.GPC[0].TP[0].MP.PM_COUNTER[0] => 0x318

Indeed, 0x318 is the number of launched warps returned by the CUDA compute profiler (warps_launched=[ 792 ]). Even if this example is very simple, it proves that the method I use makes it possible to RE these signals.

GSoC work
Priority tasks:
– continue to reverse some performance counters on kepler/fermi cards (nvcf/c1)
– try to reverse MP counters on tesla cards and their signals (nv86/a3)
– write shaders/cl kernels in order to better understand these counters and reverse non-compute related signals
– add support in nv_perfmon for everything we find
– try to make the reverse engineering process as automatic as possible

Future work:
– expose the API in userspace (kernel + libdrm)
– expose these counters through AMD performance counters
– try to expose them in APItrace
– write a HUD in mesa

After GSoC work
Currently, I cannot say how much time the previously mentioned tasks will take. Therefore, some of these tasks (mainly the future work ones) could be developed after the Google Summer of Code.

Schedule
This schedule is approximate.
(May 14 to June 17)
– continue to study documentation and to reverse some performance counters on kepler/fermi cards
(June 17 to July 1)
– try to make the reverse engineering process as automatic as possible
(July 1 to July 29)
– try to reverse MP counters on tesla cards and their signals
(July 29 to end)
– write shaders/cl kernels in order to better understand these counters and reverse non-compute-related signals

Support for the reversed signals will be added to nv_perfmon progressively.

Regarding my school work at the university, I think the next school year will start around September 2nd. But this is not a big problem in my opinion, since I’m already on holidays, so I will be available full-time for four months (May to August).

References
[1] http://code.google.com/p/google-summer-of-code-2010-freedroidrpg/source/browse/#svn%2Ftrunk
[2] http://www.google-melange.com/gsoc/project/google/gsoc2012/hakzsam66/63002
[3] http://www.cs.cmu.edu/afs/cs/academic/class/15668-s11/www/cuda-doc/Compute_Profiler.txt
[4] https://github.com/pathscale/envytools/blob/master/hwdocs/pcounter.txt
[5] http://paste.awesom.eu/1vy [peek_diff.py]
[6] http://paste.awesom.eu/1bj
[7] http://cgit.freedesktop.org/mesa/mesa/commit/?id=ee624ced364bfd2f896809874ef3a808a11c5ecf
[8] http://people.freedesktop.org/~chrisbmr/vmmt-ioc208.diff