libpciaccess now has Windows support through WinIo and Cygwin

libpciaccess is the best-known generic library for accessing PCI devices under Linux and BSD systems.

As you may know, libpciaccess has never supported Windows. I don’t know all the reasons, but the main one is probably that almost all open source driver developers work on Linux only.

However, for my Google Summer of Code project I need to use Windows in order to reverse engineer the GPU graphics counters that are only available through NVPerfKit.

These counters are programmed through PCOUNTER, the hardware unit that contains the performance monitoring counters, and they are exposed via MMIO. So I need full access to PCI devices in order to map the card’s physical memory into the virtual address space.

So, I added Windows support to libpciaccess, which now allows me to use the NVA tools (nvapeek, nvapoke…). That support is for Cygwin only, mainly because I didn’t test my implementation under MinGW, but I believe it should be really easy to port.
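
Under the hood, the Cygwin backend has to go through the WinIo library to reach physical memory, instead of mmap()ing a sysfs resource file as on Linux. Here is a minimal sketch of the idea, assuming the WinIo 2.x API (InitializeWinIo, GetPhysLong, ShutdownWinIo) and a purely hypothetical BAR address:

#include <windows.h>
#include <stdio.h>
#include <winio.h> /* WinIo 2.x API, assumed here */

int main(void)
{
	DWORD value;
	PBYTE phys = (PBYTE)0xf0000000; /* hypothetical BAR0 physical base */

	/* Load the WinIo kernel driver (requires administrator rights). */
	if (!InitializeWinIo()) {
		fprintf(stderr, "cannot load the WinIo driver\n");
		return 1;
	}

	/* 32-bit read from physical memory, e.g. an MMIO register. */
	if (GetPhysLong(phys, &value))
		printf("MMIO[0x0] = 0x%08lx\n", (unsigned long)value);

	ShutdownWinIo();
	return 0;
}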

See you soon!

nvc0 compute support is now fixed

If you tried to monitor MP performance counters through the HUD on nvc0, you would get the following error message:

gallium_hud: all queries are busy after 8 frames, can't add another query.

This message occurs when the kernel is not synchronized, i.e. when it doesn’t run correctly.

Now, if you take a quick look at the kernel error messages, you find the following detail:

DATA_ERROR [INVALID_VALUE] ch 4 [0x000027f839 glxgears[11550]] subc 1 class 0xa0c0 mthd 0x02e8 data 0x0040cccc

Actually, the data must be aligned to 0x8000 on nvc0, according to rnndb, and 0x0040cccc is not.

A three-line patch fixes compute support on nvc0.
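
Conceptually, the fix boils down to aligning the offending value before it is sent to the method; something like this (a sketch, not the actual Mesa patch):

#include <stdint.h>

/* Round x up to the next multiple of `a` (a must be a power of two). */
#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((a) - 1))

/* 0x0040cccc, the offending value from the error above, is not
 * 0x8000-aligned; rounding it up gives 0x00410000. */
uint32_t data = ALIGN_UP(0x0040cccc, 0x8000);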

How to decode the pushbuffer using valgrind-mmt and dedma?

In some cases, the information is not exposed through MMIO registers and the blob uses FIFO methods instead; in particular, the blob uses FIFO methods for enabling MP counters. Let me explain how to decode them.

In this example, I use the NVC1 chipset, and I want to decode the pushbuffer used by the NVC0_COMPUTE class (0x000090c0).

First, you have to trace a signal using cupti_trace:

$ cupti_trace --trace NVC1 --event active_cycles

Now, you have to grep for the FIFO object class ID 0x000090c0:

$ grep 0x000090c0 active_cycles.trace
--6903-- out2 0x00000004 0x00000002 0x00000003 0x0000003d 0x0000003e 0x0000003f 0x00000040 0x00009197 0x000090b8 0x00000073 0x00005080 0x00009072 0x00009074 0x0000844c 0x000090dd 0x000090b2 0x000090b1 0x00008570 0x0000857a 0x0000857b 0x0000857c 0x0000857d 0x0000857e 0x0000007d 0x00009068 0x0000907f 0x0000906f 0x0000902d 0x00009097 0x000090c0 0x00009039 0x000090e0 0x000090e6 0x000090e2 0x000090e3 0x000050a0 0x00009096 0x000090e1 0x000090b3 0x000090b5 0x0000208a 0x000085b6 0x00009067 0x000090f1 0x0000503b 0x0000503c 0x00000075 
--6903-- out2 0x00000004 0x00000002 0x00000003 0x0000003d 0x0000003e 0x0000003f 0x00000040 0x00009197 0x000090b8 0x00000073 0x00005080 0x00009072 0x00009074 0x0000844c 0x000090dd 0x000090b2 0x000090b1 0x00008570 0x0000857a 0x0000857b 0x0000857c 0x0000857d 0x0000857e 0x0000007d 0x00009068 0x0000907f 0x0000906f 0x0000902d 0x00009097 0x000090c0 0x00009039 0x000090e0 0x000090e6 0x000090e2 0x000090e3 0x000050a0 0x00009096 0x000090e1 0x000090b3 0x000090b5 0x0000208a 0x000085b6 0x00009067 0x000090f1 0x0000503b 0x0000503c 0x00000075 
--6903-- pre_ioctl: fd:3, id:0x2b (full:0xc020462b), data: 0xc1d00511 0x5c0000c9 0x5c0000ca 0x000090c0 0x00000000 0x00000000 0x00000000 0x00000000 
--6903-- post_ioctl: fd:3, id:0x2b (full:0xc020462b), data: 0xc1d00511 0x5c0000c9 0x5c0000ca 0x000090c0 0x00000000 0x00000000 0x00000000 0x00000000 
--6903-- out 0x5c0000ca 0x000090c0 0x000090c0 0x00000001 
--6903-- w 2:0x2004, 0x000090c0 
--6903-- w 11:0x24300, 0x000090c3,0x000090c2,0x000090c1,0x000090c0 
--6903-- w 9:0x24300, 0x000090c3,0x000090c2,0x000090c1,0x000090c0 
--6903-- r 10:0x12180, 0x000090c6,0x000090c4,0x000090c2,0x000090c0 
--6903-- pre_ioctl: fd:3, id:0x2b (full:0xc020462b), data: 0xc1d00511 0x5c0000ec 0x5c0000ed 0x000090c0 0x00000000 0x00000000 0x00000000 0x00000000 
--6903-- post_ioctl: fd:3, id:0x2b (full:0xc020462b), data: 0xc1d00511 0x5c0000ec 0x5c0000ed 0x000090c0 0x00000000 0x00000000 0x00000000 0x00000000 
--6903-- out 0x5c0000ed 0x000090c0 0x000090c0 0x00000001 
--6903-- w 15:0x2004, 0x000090c0

The following line contains the map id, which is 2 in this example:

--6903-- w 2:0x2004, 0x000090c0

Now, you have to use dedma, which decodes the pushbuffer using rnndb (the output is truncated here).

$ dedma -m c0 -v 2 active_cycles.trace > active_cycles.dedma
20014000  size 1, subchannel 2 (0x0), offset 0x0000, increment
000090c0    NVC0_COMPUTE mapped to subchannel 2
20014040  size 1, subchannel 2 (0x90c0), offset 0x0100, increment
00000000    NVC0_COMPUTE.GRAPH.NOP = 0
200141d6  size 1, subchannel 2 (0x90c0), offset 0x0758, increment
00000002    NVC0_COMPUTE.MP_LIMIT = 0x2
200141e4  size 1, subchannel 2 (0x90c0), offset 0x0790, increment
00000000    NVC0_COMPUTE.TEMP_ADDRESS_HIGH = 0
200141e5  size 1, subchannel 2 (0x90c0), offset 0x0794, increment
10000000    NVC0_COMPUTE.TEMP_ADDRESS_LOW = 0x10000000
200141e6  size 1, subchannel 2 (0x90c0), offset 0x0798, increment
00000000    NVC0_COMPUTE.TEMP_SIZE_HIGH = 0
200141e7  size 1, subchannel 2 (0x90c0), offset 0x079c, increment
00700000    NVC0_COMPUTE.TEMP_SIZE_LOW = 0x700000
200141e8  size 1, subchannel 2 (0x90c0), offset 0x07a0, increment
00012600    NVC0_COMPUTE.WARP_TEMP_ALLOC = 0x12600
200141df  size 1, subchannel 2 (0x90c0), offset 0x077c, increment
03000000    NVC0_COMPUTE.LOCAL_BASE = 0x3000000
20014081  size 1, subchannel 2 (0x90c0), offset 0x0204, increment
000000f0    NVC0_COMPUTE.LOCAL_POS_ALLOC = 0xf0
20014082  size 1, subchannel 2 (0x90c0), offset 0x0208, increment
000007c0    NVC0_COMPUTE.LOCAL_NEG_ALLOC = 0x7c0
20014083  size 1, subchannel 2 (0x90c0), offset 0x020c, increment
00001000    NVC0_COMPUTE.WARP_CSTACK_SIZE = 0x1000
20014359  size 1, subchannel 2 (0x90c0), offset 0x0d64, increment
0000000f    NVC0_COMPUTE.CALL_LIMIT_LOG = 0xf
200140c2  size 1, subchannel 2 (0x90c0), offset 0x0308, increment
00000003    NVC0_COMPUTE.CACHE_SPLIT = 48K_SHARED_16K_L1
20014085  size 1, subchannel 2 (0x90c0), offset 0x0214, increment
01000000    NVC0_COMPUTE.SHARED_BASE = 0x1000000
20014093  size 1, subchannel 2 (0x90c0), offset 0x024c, increment
00000000    NVC0_COMPUTE.SHARED_SIZE = 0
200140a8  size 1, subchannel 2 (0x90c0), offset 0x02a0, increment
00008000    NVC0_COMPUTE.UNK02A0 = 0x8000
2001408e  size 1, subchannel 2 (0x90c0), offset 0x0238, increment
00010001    NVC0_COMPUTE.GRIDDIM_YX = { X = 1 | Y = 1 }
2001408f  size 1, subchannel 2 (0x90c0), offset 0x023c, increment
00000001    NVC0_COMPUTE.GRIDDIM_Z = 1
200140eb  size 1, subchannel 2 (0x90c0), offset 0x03ac, increment
00010001    NVC0_COMPUTE.BLOCKDIM_YX = { X = 1 | Y = 1 }
200140ec  size 1, subchannel 2 (0x90c0), offset 0x03b0, increment
00000001    NVC0_COMPUTE.BLOCKDIM_Z = 1
200140b1  size 1, subchannel 2 (0x90c0), offset 0x02c4, increment
00000000    NVC0_COMPUTE.UNK02C4 = FALSE
...

However, dedma fails to parse methods whose data comes from a different buffer, so you have to decode those by hand, but it’s pretty easy: you just have to find the data word that follows the 0x20014cef header. In this example, I find 0xaaaa0, which is the value of MP_PM_OP https://github.com/pathscale/envytools/blob/master/rnndb/nvc0_compute.xml#L252.
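
In practice, a plain grep on the raw trace is enough to locate it, e.g.:

$ grep 20014cef active_cycles.trace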

See you! 😉

MP performance counters are now implemented on nvc0:nvc8

After two weeks of hard work, I managed to add support for MP performance counters on nvc0:nvc8. I only tested my implementation on nvc1, but it should work on the other chipsets in that range; nvc8 is not covered yet, and I’ll add it in the next few weeks. In order to add this support, I had to implement compute support for nvc0, which is the ability to launch a kernel. My work is based on the compute support implementation of Christoph Bumiller (alias calim): http://people.freedesktop.org/~chrisbmr/90c0.c

http://lists.freedesktop.org/archives/mesa-dev/2013-July/041448.html

http://lists.freedesktop.org/archives/mesa-dev/2013-July/041449.html

Read a single NVidia performance counter through nv_perfmon

nv_perfmon is a tool developed by Ben Skeggs which allows users to read some of NVidia’s performance counters. Currently, only the NVE0 chipset is supported. nv_perfmon provides an ncurses interface to be more user-friendly, and it displays the performance counters continuously. In order to read a counter only once, I wrote a little tool based on Ben Skeggs’ original code. That tool takes a single command line argument: the name of the signal.

/* Standard headers needed by this standalone tool. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <errno.h>

#include <core/device.h>
#include <core/class.h>

static struct nouveau_object *client;
static struct nouveau_object *device;
static char **signals;
static int nr_signals;

static void
trace_event(char *name, u32 *value)
{
	struct nv_perfctr_sample sample;
	struct nouveau_object *object;
	struct nv_perfctr_read read;
	int ret;

	/* Create a perfctr object that monitors the requested signal. */
	ret = nouveau_object_new(client, 0x00000000, 0xc0000000,
				 NV_PERFCTR_CLASS,
				 &(struct nv_perfctr_class) {
					.logic_op = 0xaaaa,
					.signal[0].name = name,
					.signal[0].size = strlen(name)
				 }, sizeof(struct nv_perfctr_class),
				 &object);
	assert(ret == 0);

	do {
		/* Latch the current counter value... */
		ret = nv_exec(object, NV_PERFCTR_SAMPLE, &sample, sizeof(sample));
		assert(ret == 0);

		/* ...and read it back; -EAGAIN means no sample is ready yet. */
		ret = nv_exec(object, NV_PERFCTR_READ, &read, sizeof(read));
		assert(ret == 0 || ret == -EAGAIN);
		if (ret == 0)
			*value = read.ctr;
	} while (ret == -EAGAIN);

	nouveau_object_del(client, 0x00000000, 0xc0000000);
}

int
main(int argc, char **argv)
{
	struct nv_perfctr_query args = {};
	struct nouveau_object *object;
	char *signal_name;
	u32 value;
	int ret;

	if (argc < 2) {
		fprintf(stderr, "Usage: %s <signal_name>\n", argv[0]);
		return 1;
	}
	signal_name = argv[1];

	ret = os_client_new(NULL, "error", argc, argv, &client);
	if (ret)
		return ret;

	/* Open the device with only the TIMER and PERFMON subsystems enabled. */
	ret = nouveau_object_new(client, 0xffffffff, 0x00000000,
				 NV_DEVICE_CLASS, &(struct nv_device_class) {
					.device = ~0ULL,
					.disable = ~(NV_DEVICE_DISABLE_MMIO |
						     NV_DEVICE_DISABLE_VBIOS |
						     NV_DEVICE_DISABLE_CORE |
						     NV_DEVICE_DISABLE_IDENTIFY),
					.debug0 = ~((1ULL << NVDEV_SUBDEV_TIMER) |
						    (1ULL << NVDEV_ENGINE_PERFMON)),
				 }, sizeof(struct nv_device_class), &device);
	if (ret)
		return ret;

	/* Enumerate the available signals. */
	ret = nouveau_object_new(client, 0x00000000, 0xdeadbeef,
				 NV_PERFCTR_CLASS, &(struct nv_perfctr_class) {
				 }, sizeof(struct nv_perfctr_class), &object);
	assert(ret == 0);
	do {
		u32 prev_iter = args.iter;

		/* The first call gets the size of the signal name... */
		args.name = NULL;
		ret = nv_exec(object, NV_PERFCTR_QUERY, &args, sizeof(args));
		assert(ret == 0);

		if (prev_iter) {
			nr_signals++;
			signals = realloc(signals, nr_signals * sizeof(char *));
			signals[nr_signals - 1] = malloc(args.size);

			/* ...the second call retrieves the name itself. */
			args.iter = prev_iter;
			args.name = signals[nr_signals - 1];

			ret = nv_exec(object, NV_PERFCTR_QUERY,
				      &args, sizeof(args));
			assert(ret == 0);
		}
	} while (args.iter != 0xffffffff);

	nouveau_object_del(client, 0x00000000, 0xdeadbeef);

	trace_event(signal_name, &value);
	printf("Name  = %s\n", signal_name);
	printf("Value = %10u\n", value);

	while (nr_signals--)
		free(signals[nr_signals]);
	free(signals);
	return 0;
}
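
Assuming the tool is built as, say, nv_perfmon_read (a hypothetical name), using it looks like this (the value shown is purely illustrative):

$ ./nv_perfmon_read active_cycles
Name  = active_cycles
Value =  123456789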

CUPTI: Understanding the event collection modes

The event collection mode determines the period over which the events within the enabled event groups are collected. There are two main modes:

  • Continuous mode: Events are collected for the entire duration between the cuptiEventGroupEnable and cuptiEventGroupDisable calls. This is the default mode.
  • Kernel mode: Events are collected only for the durations of kernel executions that occur between the cuptiEventGroupEnable and cuptiEventGroupDisable calls. Event collection begins when a kernel execution begins, and stops when kernel execution completes. If multiple kernel executions occur between the cuptiEventGroupEnable and cuptiEventGroupDisable calls, then the event values must be read after each kernel launch if those events need to be associated with that specific launch.
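
For completeness, selecting between these two modes through the CUPTI Event API is a single call; here is a minimal sketch (assuming a valid CUcontext from the driver API, error checking omitted):

#include <cuda.h>
#include <cupti_events.h>

/* Pick the collection period for the event groups of `ctx`.
 * `ctx` is assumed to be a valid CUcontext. */
static void set_mode(CUcontext ctx, int kernel_mode)
{
	/* Continuous: count between group enable/disable (the default).
	 * Kernel: count only while kernels are executing. */
	cuptiSetEventCollectionMode(ctx, kernel_mode
				    ? CUPTI_EVENT_COLLECTION_MODE_KERNEL
				    : CUPTI_EVENT_COLLECTION_MODE_CONTINUOUS);
}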

1. Program the continuous mode (default mode)
Before configuring the source selection, the blob initializes the following registers according to the number of sources.

One source:

(w) register: 504660, value: 0000aaaa, mask: ffffffff  { 0 = 0xaaaa | 1 = 0 }
(w) register: 504664, value: 00000000, mask: ffffffff  { 0 = 0 | 1 = 0 }
(w) register: 504668, value: 00000000, mask: ffffffff  { 0 = 0 | 1 = 0 }
(w) register: 50466c, value: 00000000, mask: ffffffff  { 0 = 0 | 1 = 0 }

Two sources:

(w) register: 504660, value: aaaaaaaa, mask: ffffffff  { 0 = 0xaaaa | 1 = 0xaaaa }
(w) register: 504664, value: 00000000, mask: ffffffff  { 0 = 0 | 1 = 0 }
(w) register: 504668, value: 00000000, mask: ffffffff  { 0 = 0 | 1 = 0 }
(w) register: 50466c, value: 00000000, mask: ffffffff  { 0 = 0 | 1 = 0 }

Three sources:

(w) register: 504660, value: aaaaaaaa, mask: ffffffff  { 0 = 0xaaaa | 1 = 0xaaaa }
(w) register: 504664, value: 0000aaaa, mask: ffffffff  { 0 = 0xaaaa | 1 = 0 }
(w) register: 504668, value: 00000000, mask: ffffffff  { 0 = 0 | 1 = 0 }
(w) register: 50466c, value: 00000000, mask: ffffffff  { 0 = 0 | 1 = 0 }

And so on…

In the trace below, there is only one source.

(Configure signals selection)
(Configure mode)
(w) register: 504660, value: 0000aaaa, mask: ffffffff  { 0 = 0xaaaa | 1 = 0 }
(w) register: 504664, value: 00000000, mask: ffffffff  { 0 = 0 | 1 = 0 }
(w) register: 504668, value: 00000000, mask: ffffffff  { 0 = 0 | 1 = 0 }
(w) register: 50466c, value: 00000000, mask: ffffffff  { 0 = 0 | 1 = 0 }
(Configure sources selection)
(Read counters)
(w) register: 504660, value: 00000000, mask: ffffffff <== PGRAPH.GPC[0].TP[0].MP.PM_FUNC[0]   => { 0 = 0 | 1 = 0 }
(w) register: 504664, value: 00000000, mask: ffffffff <== PGRAPH.GPC[0].TP[0].MP.PM_FUNC[0x1] => { 0 = 0 | 1 = 0 }
(w) register: 504668, value: 00000000, mask: ffffffff <== PGRAPH.GPC[0].TP[0].MP.PM_FUNC[0x2] => { 0 = 0 | 1 = 0 }
(w) register: 50466c, value: 00000000, mask: ffffffff <== PGRAPH.GPC[0].TP[0].MP.PM_FUNC[0x3] => { 0 = 0 | 1 = 0 }
(w) register: 504660, value: 00000000, mask: ffffffff <== PGRAPH.GPC[0].TP[0].MP.PM_FUNC[0]   => { 0 = 0 | 1 = 0 }
(w) register: 504664, value: 00000000, mask: ffffffff <== PGRAPH.GPC[0].TP[0].MP.PM_FUNC[0x1] => { 0 = 0 | 1 = 0 }
(w) register: 504668, value: 00000000, mask: ffffffff <== PGRAPH.GPC[0].TP[0].MP.PM_FUNC[0x2] => { 0 = 0 | 1 = 0 }
(w) register: 50466c, value: 00000000, mask: ffffffff <== PGRAPH.GPC[0].TP[0].MP.PM_FUNC[0x3] => { 0 = 0 | 1 = 0 }

After reading the counters, the blob re-initializes these registers to 0, apparently twice (see the trace above).

2. Program the kernel mode
Before configuring the source selection, the blob initializes the following registers to 0.

(Configure signals selection)
(Configure mode)
(w) register: 504660, value: 00000000, mask: ffffffff <== PGRAPH.GPC[0].TP[0].MP.PM_FUNC[0]   => { 0 = 0 | 1 = 0 }
(w) register: 504664, value: 00000000, mask: ffffffff <== PGRAPH.GPC[0].TP[0].MP.PM_FUNC[0x1] => { 0 = 0 | 1 = 0 }
(w) register: 504668, value: 00000000, mask: ffffffff <== PGRAPH.GPC[0].TP[0].MP.PM_FUNC[0x2] => { 0 = 0 | 1 = 0 }
(w) register: 50466c, value: 00000000, mask: ffffffff <== PGRAPH.GPC[0].TP[0].MP.PM_FUNC[0x3] => { 0 = 0 | 1 = 0 }
(Configure sources selection)
(Read counters)

Tested for the following signals (domain c and domain d) (chipset NVC1):

active_cycles
active_warps
atom_count
branch
divergent_branch
gld_inst_128bit
gld_inst_16bit
gld_inst_32bit
gld_inst_64bit
gld_inst_8bit
gld_request
gred_count
gst_inst_128bit
gst_inst_16bit
gst_inst_32bit
gst_inst_64bit
gst_inst_8bit
gst_request
inst_executed
inst_issued1_0
inst_issued1_1
inst_issued2_0
inst_issued2_1
local_load
local_store
prof_trigger_00
prof_trigger_01
prof_trigger_02
prof_trigger_03
prof_trigger_04
prof_trigger_05
prof_trigger_06
prof_trigger_07
shared_load
shared_store
thread_inst_executed_0
thread_inst_executed_1
thread_inst_executed_2
thread_inst_executed_3
threads_launched
warps_launched

Source

How To: Reverse engineering a performance counter

In this example, we will study the warps_launched event, which is quite simple.

Please make sure you have the CUDA toolkit installed on your system and a CUDA sample compiled before continuing.

Step 1: Enable and configure the profiler

Enable the profiler:

export COMPUTE_PROFILE=1
export COMPUTE_PROFILE_CONFIG=perf_conf.txt

Configure the profiler:

# perf_conf.txt
warps_launched

Step 2: Take a trace with a modified version of valgrind-mmt

valgrind --tool=mmt --mmt-trace-file=/dev/nvidia0 --mmt-trace-nvidia-ioctls ./vectorAddDrv &> valgrind_mmt_trace.log

You can also take a look at the profiling output:

$ cat cuda_profile_0.log 
# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 GeForce GT 430
# CUDA_CONTEXT 1
# TIMESTAMPFACTOR fffff68311f26108
method,gputime,cputime,occupancy,warps_launched
method=[ memcpyHtoD ] gputime=[ 116.064 ] cputime=[ 69128.000 ] 
method=[ memcpyHtoD ] gputime=[ 116.032 ] cputime=[ 51292.000 ] 
method=[ VecAdd_kernel ] gputime=[ 67.008 ] cputime=[ 27084.000 ] occupancy=[ 1.000 ] warps_launched=[ 792 ] 
method=[ memcpyDtoH ] gputime=[ 189.120 ] cputime=[ 6512.000 ]

Step 3: Extract the post ioctl calls from the trace and make them more readable

grep RETURND valgrind_mmt_trace.log | cut -d ' ' -f2-

Now, the output looks like this:

RETURND: DIR=1 MMIO=504600 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=1 MMIO=504e00 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=0 MMIO=504600 VALUE=00000000 MASK=00000000 UNK=00000000,00000000,00000000,00000000
RETURND: DIR=1 MMIO=504600 VALUE=80000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=101 MMIO=504604 VALUE=00000026 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=101 MMIO=504608 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=101 MMIO=50465c VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=101 MMIO=504660 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=101 MMIO=504664 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=101 MMIO=504668 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=101 MMIO=50466c VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=101 MMIO=504730 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=100 MMIO=504674 VALUE=00000318 MASK=00000000 UNK=00000000,00000000,00000000,00000000
RETURND: DIR=100 MMIO=504670 VALUE=00000000 MASK=00000000 UNK=00000000,00000000,00000000,00000000
RETURND: DIR=101 MMIO=504674 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=101 MMIO=504678 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=101 MMIO=50467c VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=101 MMIO=504680 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=101 MMIO=504684 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=101 MMIO=504688 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=101 MMIO=50468c VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=101 MMIO=504690 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
RETURND: DIR=0 MMIO=504600 VALUE=80000000 MASK=00000000 UNK=00000000,00000000,00000000,00000000
RETURND: DIR=1 MMIO=504600 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000

Step 4: Use lookup (envytools) to print register names

$ lookup -a NVC1 504604 26
PGRAPH.GPC[0].TP[0].MP.PM_SIGSEL[0] => { 0 = 0x26 | 1 = 0 | 2 = 0 | 3 = 0 }

$ lookup -a NVC1 504674 318
PGRAPH.GPC[0].TP[0].MP.PM_COUNTER[0] => 0x318

Step 5: Results
We can see that PCOUNTER selects signal 0x26 and that the result ends up in register 0x504674 (0x318 = 792, which matches the warps_launched value reported by the profiler). 🙂

To conclude, this method seems to work fine. However, it’s a bit annoying to repeat these steps for each event, so I wrote a tool to make the reverse engineering process as automatic as possible.

Trace NVidia’s ioctl calls with valgrind-mmt

Valgrind-mmt is a Valgrind modification which allows tracing application accesses to mmapped memory (which is how the userspace parts of graphics drivers communicate with the hardware). It was created by Dave Airlie and then extended/fixed by others.

In order to trace ioctl calls made by the blob’s userspace, I used a modified version of valgrind-mmt to get a trace of the registers modified by CUPTI to monitor the wanted signals. I applied the following patch by Christoph Bumiller (calim):

diff --git a/mmt/mmt_nv_ioctl.c b/mmt/mmt_nv_ioctl.c
index 23682e7..11890b0 100644
--- a/mmt/mmt_nv_ioctl.c
+++ b/mmt/mmt_nv_ioctl.c
@@ -386,6 +386,24 @@ void mmt_nv_ioctl_pre(UWord *args)
 				UInt *addr2 = (*(UInt **) (&data[4]));
 				dumpmem("in2 ", addr2[2], 0x3c);
 			}
+         else if (data[2] == 0x20800122)
+         {
+            UInt k;
+            UInt *in = (UInt *)mmt_2x4to8(data[5], data[4]);
+            UInt cnt = in[5];
+            UInt *tx = (UInt *)mmt_2x4to8(in[7], in[6]);
+            VG_(message) (Vg_DebugMsg, "<==(%u at %p)\n", cnt, tx);
+            for (k = 0; k < cnt; ++k)
+               VG_(message) (Vg_DebugMsg, "REQUEST: DIR=%x MMIO=%x VALUE=%08x MASK=%08x UNK=%08x,%08x,%08x,%08x\n",
+                             tx[k * 8 + 0],
+                             tx[k * 8 + 3],
+                             tx[k * 8 + 5],
+                             tx[k * 8 + 7],
+                             tx[k * 8 + 1],
+                             tx[k * 8 + 2],
+                             tx[k * 8 + 4],
+                             tx[k * 8 + 6]);
+         }
 			break;

 		case 0xc040464d:
@@ -565,6 +583,23 @@ void mmt_nv_ioctl_post(UWord *args)
 				UInt *addr2 = (*(UInt **) (&data[4]));
 				dumpmem("out2 ", addr2[2], 0x3c);
 			}
+         else if (data[2] == 0x20800122)
+         {
+            UInt *out = (UInt *)mmt_2x4to8(data[5], data[4]);
+            UInt cnt = out[5];
+            UInt *tx = (UInt *)mmt_2x4to8(out[7], out[6]);
+            UInt k;
+            for (k = 0; k < cnt; ++k)
+               VG_(message) (Vg_DebugMsg, "RETURND: DIR=%x MMIO=%x VALUE=%08x MASK=%08x UNK=%08x,%08x,%08x,%08x\n",
+                             tx[k * 8 + 0],
+                             tx[k * 8 + 3],
+                             tx[k * 8 + 5],
+                             tx[k * 8 + 7],
+                             tx[k * 8 + 1],
+                             tx[k * 8 + 2],
+                             tx[k * 8 + 4],
+                             tx[k * 8 + 6]);
+         }
 			break;
 			// 0x37 read configuration parameter
 		case 0xc0204638:

That patch displays the MMIO register accesses carried by the pre/post ioctl calls made by the blob. In order to trace these calls, you have to invoke valgrind-mmt this way:

valgrind --tool=mmt --mmt-trace-file=/dev/nvidia0 --mmt-trace-nvidia-ioctls

For example, if I want to see the post ioctl calls of the vectorAddDrv CUDA sample when I trace the inst_executed event, I’ll use:

valgrind --tool=mmt --mmt-trace-file=/dev/nvidia0 --mmt-trace-nvidia-ioctls ./vectorAddDrv 2>&1 | grep RETURND

And the trace looks like this:

--4803-- RETURND: DIR=1 MMIO=504600 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=1 MMIO=504e00 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=0 MMIO=504600 VALUE=00000000 MASK=00000000 UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=1 MMIO=504600 VALUE=80000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=101 MMIO=504604 VALUE=002d2d2d MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=101 MMIO=504608 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=101 MMIO=50465c VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=101 MMIO=504660 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=101 MMIO=504664 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=101 MMIO=504668 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=101 MMIO=50466c VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=101 MMIO=504730 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=101 MMIO=504734 VALUE=00000011 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=101 MMIO=504738 VALUE=00000022 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=100 MMIO=504674 VALUE=0000137c MASK=00000000 UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=100 MMIO=504678 VALUE=00001208 MASK=00000000 UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=100 MMIO=50467c VALUE=000003e7 MASK=00000000 UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=100 MMIO=504670 VALUE=00000000 MASK=00000000 UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=101 MMIO=504674 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=101 MMIO=504678 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=101 MMIO=50467c VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=101 MMIO=504680 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=101 MMIO=504684 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=101 MMIO=504688 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=101 MMIO=50468c VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=101 MMIO=504690 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=0 MMIO=504600 VALUE=80000000 MASK=00000000 UNK=00000000,00000000,00000000,00000000
--4803-- RETURND: DIR=1 MMIO=504600 VALUE=00000000 MASK=ffffffff UNK=00000000,00000000,00000000,00000000

The CUDA Profiling Tools Interface (CUPTI)

The CUDA Profiling Tools Interface (CUPTI) enables the creation of profiling and tracing tools that target CUDA applications. CUPTI provides four APIs: the Activity API, the Callback API, the Event API, and the Metric API. Using these APIs, you can develop profiling tools that give insight into the CPU and GPU behavior of CUDA applications. CUPTI is delivered as a dynamic library on all platforms supported by CUDA.

  • The CUPTI Activity API allows you to asynchronously collect a trace of an application’s CPU and GPU CUDA activity.
  • The CUPTI Callback API allows you to register a callback into your own code. Your callback will be invoked when the application being profiled calls a CUDA runtime or driver function, or when certain events occur in the CUDA driver.
  • The CUPTI Event API allows you to query, configure, start, stop, and read the event counters on a CUDA-enabled device.
  • The CUPTI Metric API allows you to collect application metrics calculated from one or more event values.

The CUPTI Event API is the most interesting part regarding the goal of my GSoC project. That API can determine the available events on a device. An event is simply a countable activity, like the number of instructions executed, the number of threads launched on a device, and so on… An event also has an ID, a short/long description, a category (memory, instructions…) and a domain. For example, on my NVC1, I have 85 events available.

A device exposes one or more event domains. Each event domain represents a group of related events available on that device. A device may have multiple instances of a domain, indicating that the device can simultaneously record multiple instances of each event within that domain.
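
As an illustration, here is a minimal sketch of how one might enumerate those domains and events with the Event API (the fixed-size arrays and missing error checking are simplifications of mine):

#include <stdio.h>
#include <stdint.h>
#include <cuda.h>
#include <cupti_events.h>

/* List every event exposed by device 0, domain by domain. */
int main(void)
{
	CUpti_EventDomainID domains[32];
	CUpti_EventID events[256];
	uint32_t num_domains, num_events, i, j;
	size_t size;
	CUdevice dev;

	cuInit(0);
	cuDeviceGet(&dev, 0);

	cuptiDeviceGetNumEventDomains(dev, &num_domains);
	size = sizeof(domains);
	cuptiDeviceEnumEventDomains(dev, &size, domains);

	for (i = 0; i < num_domains; i++) {
		cuptiEventDomainGetNumEvents(domains[i], &num_events);
		size = sizeof(events);
		cuptiEventDomainEnumEvents(domains[i], &size, events);

		for (j = 0; j < num_events; j++) {
			char name[64];
			size_t len = sizeof(name);
			cuptiEventGetAttribute(events[j],
					       CUPTI_EVENT_ATTR_NAME,
					       &len, name);
			printf("domain %u: %s\n", (unsigned)domains[i], name);
		}
	}
	return 0;
}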

In order to retrieve the profiling information, CUPTI, through the blob, uses ioctl calls for reading/writing registers.

NVidia’s performance counters

Currently, the Nouveau project has RE’d some performance counters through envytools, a toolbox for people envious of nvidia’s blob driver.

The card unit which contains the performance monitoring counters is named PCOUNTER; you can find more information about it here.

PCOUNTER is used for monitoring various activity signals from all over the card.
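
Since PCOUNTER lives in MMIO space, the nva tools mentioned earlier are enough to poke at it directly; for example, reading the MP counter register seen in the traces above (NVC1 address, with an illustrative value):

$ nvapeek 504674
00504674: 00000318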